reading dna sequences: 18-th century mathematics for 21-st century
TRANSCRIPT
![Page 1: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/1.jpg)
Reading DNA Sequences:
18-th Century Mathematics for
21-st Century Technology
Michael Waterman University of Southern California
Tsinghua University
![Page 2: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/2.jpg)
![Page 3: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/3.jpg)
DNA
• Genetic information of an organism
• Double helix, complementary base pairs
• A pairs T; G pairs C
• E. coli, a bacterium, has 5 million base pairs
• Humans have 3 (6) billion base pairs per nucleus;
equal to about 1 yard of DNA molecule
![Page 4: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/4.jpg)
Outline
I. 20-th Century DNA Sequencing History
II. DNA Sequence Assembly
Shotgun Assembly
III. New Generation SequencingTechnologies
Technologies
Shotgun Assembly
![Page 5: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/5.jpg)
I. DNA SEQUENCING HISTORY
• Sanger receives a 1958 Nobel Prize for sequencing insulin, a protein
• Sanger and Gilbert receive the 1980 Nobel prize for DNA sequencing methods
![Page 6: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/6.jpg)
...history repeats itself:
sequencing insulin
Fred Sanger1958 (!) Nobel prize for sequencing insulin by Edman degradation
Average read length = 5 aa!
![Page 7: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/7.jpg)
![Page 8: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/8.jpg)
• Clone and amplify the target DNA
to obtain a large number of identical molecules
• Attach primer to one end of single stranded DNA
• Polymerase extends from labeled primer
• Four separate reactions, for each letter A,T,G,C
• Extension proceeds as normal until a chain terminating dideoxynucleotide is incorporated
![Page 9: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/9.jpg)
![Page 10: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/10.jpg)
![Page 11: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/11.jpg)
Gearing Up
• Automated sequencing machines
• Label became attached fluorescent dyes, one color for each terminating base
• Reaction could be run in the same chamber and only use one column of the gel
• Detection of bases became image processing
• Caltech origins: Lee Hood, Lloyd Smith, Hunkapilar brothers
![Page 12: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/12.jpg)
![Page 13: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/13.jpg)
![Page 14: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/14.jpg)
![Page 15: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/15.jpg)
II. DNA SEQUENCE ASSEMBLY
When rapid DNA sequencing technology came from Sanger’s laboratory at Cambridge in the MRC in 1976,
Sanger also recruited Roger Staden who created the first assembly program.
![Page 16: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/16.jpg)
Fragment Assembly
In shotgun sequencing, whole genomes are sequenced by making clones, breaking them into small pieces, and trying to put the pieces together again based on overlaps.
Note that the fragments are randomly sampled, and thus no positional information is available.
!"#$%
![Page 17: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/17.jpg)
Whole Genome Assembly: problem description
• The goal is to reconstruct an unknown source sequence (the genome) on {A, C, G, T} given many random short segments from the sequence, the shotgun reads.
• A read is a sequence of nucleotides of length 30-800, taken from a random place in the genome.
• Reads contain two kinds of errors: base substitutions and indels. Base substitutions occur with a frequency from 0.5 – 2%. Indels occur roughly 10 times less frequently.
• Strand orientation is unknown.
![Page 18: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/18.jpg)
Assembly is challenging
![Page 19: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/19.jpg)
Shortest Superstring
• Merge two strings with largest overlap, continue (greedy algorithm)
• A nice computer science problem
• Conjecture: GREEDY IS AT WORST 2 TIMES OPTIMAL
• The best known constant is 2.5
• Analysis using word periodicity--not simple
![Page 20: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/20.jpg)
Even worse…
• We must deal with significant errors in the sequence reads.
• The orientation of each read is unknown.
• Genomes have many repeats (approximate copies of the same sequence), which are very hard to identify and reconstruct.
• Gaps due to low coverage
• The size of the problem is very large. Celera’s Human Genome sequencing project contained roughly 26.4 million reads, each about 550 bases long.
![Page 21: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/21.jpg)
Repeats
• Repeats: A major problem for fragment assembly
• > 50% of human genome is repeats:
- over 1 million Alu repeats (about 300 bp)
- about 200,000 LINE repeats (1000 bp and longer)
![Page 22: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/22.jpg)
Overlap-Layout-Consensus Assembler programs: ARACHNE, PHRAP, CAP, TIGR, CELERACommon Approach:
![Page 23: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/23.jpg)
Overlap-Layout-Consensus Assembler programs: ARACHNE, PHRAP, CAP, TIGR, CELERA
Overlap: find potentially overlapping reads
Common Approach:
![Page 24: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/24.jpg)
Overlap-Layout-Consensus Assembler programs: ARACHNE, PHRAP, CAP, TIGR, CELERA
Overlap: find potentially overlapping reads
Layout: merge reads into contigs and contigs into supercontigs
Common Approach:
![Page 25: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/25.jpg)
Overlap-Layout-Consensus Assembler programs: ARACHNE, PHRAP, CAP, TIGR, CELERA
Overlap: find potentially overlapping reads
Layout: merge reads into contigs and contigs into supercontigs
Consensus: derive the DNA sequence and correct read errors &&'()'**'(''*'))**&&
Common Approach:
![Page 26: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/26.jpg)
!"#$%&'()*+',-&-.*/(./(+*$#(0#-&.%
!"#$%&'1(./)%,0#(+.1+&-)2#0(%#--#$1(&1(3#%%(&1(+.11./4(*/#1(5./0#%167
8&)2('&.$(*9($#&01(2&1(-2#('*11.:.%.-;(-2#;(&$#(<$#&0=(-2#(1&+#(0.$#)-.*/(*$(/*-7
>&$#9,%()*+'&$.1*/1(-&?#()*+',-./4(-.+#('$*'*$-.*/&%(-*(5%#/4-25&6@%#/4-25:66
(((
![Page 27: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/27.jpg)
Human Genome example
• The Celera project
• 26.5*106 reads of length 550
• Pairs of reads = 7*1014
• One pair takes 3*105 units
• Total resources = 2*1020
![Page 28: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/28.jpg)
![Page 29: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/29.jpg)
SUMMARY
• Computer assembly of genomes is challenging
• The computer programs used in the HGP were sophisticated upgrades of Staden’s approach
• Much work and ingenuity went into these computational projects
![Page 30: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/30.jpg)
IV. NEW
• Novel 21-st Century technologies, highly parallel
• The era of $1000 genomes is coming!
![Page 31: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/31.jpg)
New Generation Sequencing
![Page 32: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/32.jpg)
• > 20 million bases• 100 bp reads• 200,000+ clonal reads • Single 5-hour run
15 September 2005
Volume 437 Number 7057 pp376-380
![Page 33: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/33.jpg)
PicoTiterPlates™
• Multiple optical fibers are fused to form an optical array
• Selective removal of core material leaves wells that serve as ‘test tubes’
• Reactions occurring in the ‘test tubes’ can be monitored optically, through the remaining fiber
• Well diameter: 44!
• Current plate contains 1.6M wells
![Page 34: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/34.jpg)
Process Overview
1) Prepare Adapter Ligated ssDNA Library
2) Clonal Amplification on 28 ! beads
4) Perform Sequencing by synthesison the 454 Instrument
3) Load beads and enzymes in PicoTiter Plate™
![Page 35: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/35.jpg)
![Page 36: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/36.jpg)
454 Technology - Sequencing
Instrument
![Page 37: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/37.jpg)
Typical Run Results
• 30MB per run
• ~300,000 reads
• ~100 bp per read
0%
15%
30%
45%
60%
0 1 2 3 4 5 6 7 8 9 10+*
% o
f re
ad
s
# of errors per read
0
3750
7500
11250
15000
41 51 61 71 81 91 101 111 121 131
# o
f re
ad
s
*10+ includes partial and unmapped reads
• 1-2% avg. error
• ~50% error-free
![Page 38: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/38.jpg)
![Page 39: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/39.jpg)
![Page 40: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/40.jpg)
![Page 41: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/41.jpg)
![Page 42: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/42.jpg)
![Page 43: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/43.jpg)
Overlap--Layout-Consensus
• Even the overlap step requires an impossible number of pairwise comparisons
For example (109)*(109) = 1018,
for a single Illumina machine in a single day
![Page 44: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/44.jpg)
Overlap-Layout-Consensus with short reads
• Finding consensus is NP hard
• Number of short reads is large, and tiling of each overlap is very small => tremendous runtime.
![Page 45: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/45.jpg)
IV. EULER’S GRAPHS
We begin with Euler’s original 1736 insight.
![Page 46: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/46.jpg)
The Bridges of Konigsberg
![Page 47: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/47.jpg)
The Konigsberg Bridges Graph
![Page 48: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/48.jpg)
The Konigsberg Bridges
A
B
C
D
EF
G
![Page 49: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/49.jpg)
AB
C
DE
F
G
H$.04#1(&$#(I*./#0(:;(#04#1(.9(-2#;()&/(:#($#&)2#0(:;(3&%?./4(*/(%&/07
![Page 50: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/50.jpg)
If the nodes are bridges, it is a Hamilton tour problem to visit each vertex exactly once and is NP hard.
Instead Euler in 1736 turned the problem inside out!
He made bridges into edges and his problem is to visit all edges (once and only once) returning to the starting location.
![Page 51: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/51.jpg)
• An Eulerian path is one which visits each edge once and only once.
• An Eulerian circuit is an Eulerian path which starts and ends at the same vertex. This gets Euler back home.
![Page 52: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/52.jpg)
The Konigsberg Bridges
A
B
C
D
EF
G
![Page 53: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/53.jpg)
>
H
JK
C
G
A
B
DE
F
L2#(G(:$.04#1(&$#(-2#(#04#1M(-2#(%&/0(:*0.#1(&$#(-2#("#$-.).#17
![Page 54: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/54.jpg)
• There is at least one Eulerian circuit if and only if the graph is connected and, for every vertex v, the degree is even (and positive).
• For directed graphs, in(v) = out(v).
• There is beautiful combinatorics to give the
number of Eulerian paths (and for uniqueness).
![Page 55: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/55.jpg)
>
H
JK
C
G
A
B
DE
F
8"#$;("#$-#N(.1(*9(*00(0#4$##O/*(8,%#$.&/().$),.-17
![Page 56: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/56.jpg)
Algorithms to produce an Eulerian circuit are quite elementary:
• Start anywhere
• Follow any edge, marking edges as you go
• If you return to a vertex with no remaining unmarked edges, just go to any unexplored edge . If none exist you are finished.
![Page 57: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/57.jpg)
IVb. De Bruijn Sequences
Our alphabet will be {A, G, C, T}
A De Bruijn sequence is a cyclic sequence where all 4k k-words appear exactly once.
They exist in the shortest possible length sequence a la Euler as follows.
![Page 58: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/58.jpg)
• Verticies are all sequences of length k-1
• Directed edges exist where the k-2 suffix of one vertex is the k-2 prefix on the other vertex
• Edges correspond to k-words
• Every vertex has 4 incoming edges and 4 outgoing edges, in(v)=4=out(v)
![Page 59: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/59.jpg)
An Example
ACCTGA CCTGAT
• k=7, k-1=6, and k-2=5
• The k-word or edge is ACCTGAT
• Many Eulerian circuits exist
![Page 60: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/60.jpg)
De Bruijn sequences go back to Sainte-Marie 1894
• Each vertex is a (k-1)-word
• Each vertex has 4 edges entering and 4 leaving
• The number of DeBruijn sequences is
4! 4^ (k ! 1)/4k
(for k=3, this is 2*1020)
![Page 61: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/61.jpg)
Sequence Graphs
For a specific sequence build the graph as above.Example: DNA = ATGTGCCGCA, k=3
Determining the (or a) sequence from the graph is equivalent to finding an Eulerian path.
![Page 62: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/62.jpg)
Sequence Graphs
For a specific sequence build the graph as above.Example: DNA = ATGTGCCGCA, k=3
Determining the (or a) sequence from the graph is equivalent to finding an Eulerian path.
JL LP
![Page 63: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/63.jpg)
Sequence Graphs
For a specific sequence build the graph as above.Example: DNA = ATGTGCCGCA, k=3
Determining the (or a) sequence from the graph is equivalent to finding an Eulerian path.
JL LP
PL
![Page 64: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/64.jpg)
Sequence Graphs
For a specific sequence build the graph as above.Example: DNA = ATGTGCCGCA, k=3
Determining the (or a) sequence from the graph is equivalent to finding an Eulerian path.
JL LP
PL
P>
>P>>
>J
![Page 65: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/65.jpg)
JL LP
PL
P>
>P>>
>J
![Page 66: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/66.jpg)
JL LP
PL
P>
>P>>
>J
JLP
![Page 67: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/67.jpg)
JL LP
PL
P>
>P>>
>J
JLP
![Page 68: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/68.jpg)
JL LP
PL
P>
>P>>
>J
JLPQ>
![Page 69: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/69.jpg)
JL LP
PL
P>
>P>>
>J
JLPQ>
![Page 70: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/70.jpg)
JL LP
PL
P>
>P>>
>J
JLPQ>QJ
![Page 71: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/71.jpg)
JL LP
PL
P>
>P>>
>J
JLPQ>>J
![Page 72: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/72.jpg)
JL LP
PL
P>
>P>>
>J
JLPQ>>PJ
![Page 73: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/73.jpg)
JL LP
PL
P>
>P>>
>J
JLPQ>>P>J
![Page 74: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/74.jpg)
JL LP
PL
P>
>P>>
>J
JLPL>>P>J
![Page 75: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/75.jpg)
JL LP
PL
P>
>P>>
>J
JLPLP>>P>J
![Page 76: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/76.jpg)
JL LP
PL
P>
>P>>
>J
JLPLP>>P>J
![Page 77: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/77.jpg)
IVc. SEQUENCING BY HYBRIDIZATION
• This method was proposed in 1988 and 1989.
• While never achieving its initial goal,
the arrays which were created are in widespread usage today.
![Page 78: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/78.jpg)
Sequencing By Hybridization (SBH)
DNA array -- all possible oligonucleotides of length k (k-tuples)
target sequence -- labeled single stranded DNA fragment
hybridization -- target DNA fragment hybridizes with oligonucleotides complementary to k-tuples of the target fragment
sequence reconstruction -- reconstruct the sequence of the target DNA from its spectrum (k-tuple content)
![Page 79: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/79.jpg)
Example: DNA=ATGTGCCGCA, k=3
![Page 80: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/80.jpg)
Hamiltonian path approach
Lysov et al.(1988), Drmanac et al.(1989):
Transform spectrum S into a graph H
Vertices in H -- k-tuples in the spectrum
S Edges in H -- overlapping k-tuples.(
![Page 81: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/81.jpg)
Sequence reconstruction (Hamiltonian)
Represent each k-tuple by a vertex and draw a directed edge from vertex a1…al to vertex b1…bl iff a2…al = b1…bl-1.Example: DNA = ATGTGCCGCA, l=3
Determining the sequence is equivalent to finding a Hamiltonian tour (NP-hard).
![Page 82: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/82.jpg)
Sequence reconstruction (Eulerian path)
Represent l-tuples by edges and (l-1)-tuples by vertices such that an edge a1…al is directed from the vertex a1…al-1 to the vertex a2…al.Example: DNA = ATGTGCCGCA, l=3
Determining a sequence consistent with the data is equivalent to finding an Eulerian circuit. (Pevzner)
![Page 83: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/83.jpg)
![Page 84: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/84.jpg)
IVd. EULERIAN ASSEMBLY
• Take the reads
• Break up reads into overlapping k-words
• (k=25, say)
• Merge identical k-words
• Find Eulerian paths
![Page 85: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/85.jpg)
Eulerian path approach
Sequence reconstruction is the search for Eulerian paths in graphs from the k-word data.
Idury and Waterman (1995)
Edge in graph G is a k-word in one or more reads
![Page 86: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/86.jpg)
Euler Approach to Assembly
Vertices: (k-1)-words in each read
Edges: k-words in each read
Euler graph
![Page 87: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/87.jpg)
Euler approach: advantages
• Linear time
• No pairwise alignment
• No layout
• No consensus or multiple alignment
• One letter indels are no harder to handle than mismatches
![Page 88: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/88.jpg)
• Orientation: the usual difficulty of deciding a consistent orientation for each fragment is handled by putting in the reverse complement for each fragment
• Computation time is double that if one knew exactly the orientation of each read
(A big advantage over an exponential # of possibilities!)
![Page 89: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/89.jpg)
Euler approach: difficulties
• Erroneous edges
• Tangled graph
• Storage requirements (huge)
![Page 90: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/90.jpg)
Erroneous edges
X
error
Erroneous edges
![Page 91: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/91.jpg)
![Page 92: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/92.jpg)
Detecting Chimerical Reads
Multiplicity
k-word position
![Page 93: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/93.jpg)
Tangled graph
How can we simplify the tangled Eulerian graph?
![Page 94: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/94.jpg)
Eulerian Superpath Problem
Given an Eulerian graph and a collection of paths in this graph, find an Eulerian path in this graph that contains all these paths as subpaths.
![Page 95: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/95.jpg)
Solving Eulerian Superpath Problem:
Equivalent Transformations
!Simplify the system of path with the goal of transforming the Eulerian Superpath Problem into the Eulerian Path Problem.
!Equivalent transformation of the repeat graph: there exists a one-to-one correspondence between Eulerian superpaths before and after the transformation
!Make a series of equivalent transformations
that lead to a system of paths with every path being a single edge in the repeat graph
![Page 96: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/96.jpg)
Repeats
• Repeats: A major problem for fragment assembly
• > 50% of human genome is repeats:
- over 1 million Alu repeats (about 300 bp)
- about 200,000 LINE repeats (1000 bp and longer)
![Page 97: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/97.jpg)
![Page 98: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/98.jpg)
TECHNOLOGY CONTINUES TO EVOLVE
• Every new technology can make extinct a set of computational methods.
• Every new technology is likely to create a new suite of problems.
![Page 99: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/99.jpg)
>*%%&:*$&-*$1R
S&+&/&(T0,$;(5AUUE6
V&"#%(V#"W/#$(&/0(X&.N,(L&/4(5BYYA6
Z.&*+&/([.(5BYYC6
\,(]2&/4(5BYYDM(BYYE6
![Page 100: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/100.jpg)
![Page 101: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/101.jpg)
![Page 102: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century](https://reader031.vdocuments.net/reader031/viewer/2022020703/61fb31cf2e268c58cd5b46dc/html5/thumbnails/102.jpg)
THANKS FOR
LISTENING!