Post on 19-Dec-2015
CS262 Lecture 9, Win07, Batzoglou
Rapid Global Alignments
How to align genomic sequences in (more or less) linear time
Saving cells in DP
1. Find local alignments
2. Chain: O(N log N) via L.I.S.
3. Restricted DP
Methods to CHAIN Local Alignments
Sparse Dynamic Programming: O(N log N)
The Problem: Find a Chain of Local Alignments
Chaining (x, y) with (x’, y’) requires x < x’ and y < y’
Each local alignment has a weight
FIND the chain with highest total weight
Sparse Dynamic Programming
[Figure: sparse match matrix of x = 15 3 24 16 20 4 24 3 11 18 (columns) against y = 4 20 24 3 11 15 11 4 18 20 (rows)]
• Imagine a situation where the number of hits is much smaller than O(nm) – maybe O(n) instead
Sparse Dynamic Programming – L.I.S.
• Longest Increasing Subsequence
• Given a sequence over an ordered alphabet
x = x1, …, xm
• Find a longest subsequence
s = s1, …, sk
such that s1 < s2 < … < sk
Sparse Dynamic Programming – L.I.S.
Let input be w: w1, …, wn
INITIALIZATION:
  L: last LIS elt. array; L[0] = -inf, L[1] = w1, L[2…n] = +inf
  B: array holding the indices of the LIS elts; B[0] = 0, B[1] = 1
  P: array of backpointers
  // L[j]: smallest last element wi of any j-long increasing subsequence seen so far
ALGORITHM
for i = 2 to n {
  Find j such that L[j – 1] < w[i] ≤ L[j]
  L[j] ← w[i]
  B[j] ← i
  P[i] ← B[j – 1]
}
That’s it!!!
• Running time?
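The loop above can be sketched in Python as follows. This is a minimal sketch, not the lecture's own code: the function name `longest_increasing_subsequence` and the 0-based lists `tail_vals`, `tail_idx`, and `prev` (standing in for L, B, and P) are assumptions made for illustration. The search for j is done with binary search from the standard `bisect` module, which is one way to reach O(n log n) overall.

```python
from bisect import bisect_left

def longest_increasing_subsequence(w):
    """Return one longest strictly increasing subsequence of w."""
    tail_vals = []          # tail_vals[j]: smallest last element of a run of length j+1 (array L)
    tail_idx = []           # index in w of that element (array B)
    prev = [None] * len(w)  # back-pointers (array P)
    for i, value in enumerate(w):
        # First position j with tail_vals[j] >= value, i.e. L[j-1] < w[i] <= L[j].
        j = bisect_left(tail_vals, value)
        prev[i] = tail_idx[j - 1] if j > 0 else None
        if j == len(tail_vals):
            tail_vals.append(value)
            tail_idx.append(i)
        else:
            tail_vals[j] = value
            tail_idx[j] = i
    # Walk the back-pointers from the end of the longest run.
    out = []
    i = tail_idx[-1] if tail_idx else None
    while i is not None:
        out.append(w[i])
        i = prev[i]
    return out[::-1]
```

Each element costs one binary search plus constant-time updates, so the whole pass takes O(n log n) time.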
Sparse LCS expressed as LIS
Create a sequence w
• Every matching point (i, j) is inserted into w as follows:
  – For each column j = 1…m, insert in w the points (i, j), in decreasing row i order
• The 11 example points are inserted in the order given
• a = (y, x), b = (y’, x’) can be chained iff a is before b in w, and y < y’
[Figure: match matrix of x = 15 3 24 16 20 4 24 3 11 18 (columns) against y = 4 20 24 3 11 15 11 4 18 20 (rows); the 11 matching points are numbered 1 to 11 in their order of insertion into w]
Sparse LCS expressed as LIS
Create a sequence w
w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10)
Consider now w’s elements as ordered lexicographically, where
• (y, x) < (y’, x’) if y < y’
Claim: An increasing subsequence of w is a common subsequence of x and y
Sparse Dynamic Programming for LIS
Example:
w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10)
L = [L1] [L2] [L3] [L4] [L5] …
 1. (4,2)
 2. (3,3)
 3. (3,3) (10,5)
 4. (2,5) (10,5)
 5. (2,5) (8,6)
 6. (1,6) (8,6)
 7. (1,6) (3,7)
 8. (1,6) (3,7) (4,8)
 9. (1,6) (3,7) (4,8) (7,9)
10. (1,6) (3,7) (4,8) (5,9)
11. (1,6) (3,7) (4,8) (5,9) (9,10)
Longest common subsequence: s = 4, 24, 3, 11, 18
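The reduction from sparse LCS to LIS can be sketched in Python as below. This is an illustrative sketch under assumptions: the names `match_points` and `lcs_via_lis` are hypothetical, points are stored as 1-based (row, column) pairs, and the match points are found by brute force in O(nm) time purely for clarity (a real sparse method would get them from an index). Matching x and y directly also yields the point (6,1) for the shared value 15, which is not among the 11 points listed above; it does not change the result.

```python
from bisect import bisect_left

def match_points(x, y):
    """Matching points (row, col), 1-based: column by column, rows in decreasing order."""
    w = []
    for col, x_val in enumerate(x, start=1):
        for row in range(len(y), 0, -1):
            if y[row - 1] == x_val:
                w.append((row, col))
    return w

def lcs_via_lis(x, y):
    """Longest common subsequence of x and y via an increasing run of rows in w."""
    w = match_points(x, y)
    tail_rows = []          # tail_rows[k]: smallest row ending a chain of length k+1
    tail_idx = []           # index in w of that point
    prev = [None] * len(w)  # back-pointers
    for i, (row, _) in enumerate(w):
        k = bisect_left(tail_rows, row)
        prev[i] = tail_idx[k - 1] if k > 0 else None
        if k == len(tail_rows):
            tail_rows.append(row)
            tail_idx.append(i)
        else:
            tail_rows[k] = row
            tail_idx[k] = i
    out = []
    i = tail_idx[-1] if tail_idx else None
    while i is not None:
        out.append(y[w[i][0] - 1])   # value at the point's row
        i = prev[i]
    return out[::-1]

X = [15, 3, 24, 16, 20, 4, 24, 3, 11, 18]
Y = [4, 20, 24, 3, 11, 15, 11, 4, 18, 20]
print(lcs_via_lis(X, Y))
```

Because points within one column are inserted in decreasing row order, a strictly increasing run of rows can never use two points from the same column, so it is also increasing in the column coordinate.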
Sparse DP for rectangle chaining
• 1,…, N: rectangles
• (hj, lj): y-coordinates of rectangle j
• w(j): weight of rectangle j
• V(j): optimal score of chain ending in j
• L: list of triplets (lj, V(j), j)
L is sorted by lj: smallest (North) to largest (South) value
L is implemented as a balanced binary tree
[Figure: rectangle j spans the y-coordinates from hj (top) to lj (bottom)]
Sparse DP for rectangle chaining
Main idea:
• Sweep through x-coordinates
• To the right of b, anything chainable to a is chainable to b
• Therefore, if V(b) > V(a), rectangle a is “useless” for subsequent chaining
• In L, keep rectangles j sorted with increasing lj-coordinates ⇒ sorted with increasing V(j) score
[Figure: rectangles a and b with scores V(a) and V(b)]
Sparse DP for rectangle chaining
Go through rectangle x-coordinates, from lowest to highest:
1. When on the leftmost end of rectangle i:
a. j: rectangle in L, with largest lj < hi
b. V(i) = w(i) + V(j)
2. When on the rightmost end of i:
a. k: rectangle in L, with largest lk ≤ li
b. If V(i) > V(k):
i. INSERT (li, V(i), i) in L
ii. REMOVE all (lj, V(j), j) with V(j) ≤ V(i) & lj ≥ li
[Figure: rectangles i, j, and k]
Is k ever removed?
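The sweep can be sketched in Python as follows. This is a sketch under stated assumptions, not the lecture's implementation: the function name `chain_rectangles`, the tuple layout `(x_left, x_right, h, l, weight)`, and the example rectangles are all hypothetical. A sorted Python list with `bisect` stands in for the balanced binary tree, so the code shows the logic of the sweep but deletions from the list are not O(log N). A missing predecessor j is treated as V(j) = 0, and left ends are processed before right ends at equal x so that chaining stays strict.

```python
from bisect import bisect_left, bisect_right

def chain_rectangles(rects):
    """rects: list of (x_left, x_right, h, l, weight), with h <= l (y grows downward).
    Returns (best_score, chain), where chain lists rectangle indices in order."""
    events = []
    for i, (x_left, x_right, _, _, _) in enumerate(rects):
        events.append((x_left, 0, i))    # 0: left end, sorted before right ends at equal x
        events.append((x_right, 1, i))   # 1: right end
    events.sort()

    ls, vs, ids = [], [], []             # the list L, increasing in l and in V
    V = [0] * len(rects)
    prev = [None] * len(rects)
    for _, kind, i in events:
        _, _, h, l, weight = rects[i]
        if kind == 0:
            p = bisect_left(ls, h)       # entries before p have l_j < h_i
            V[i] = weight + (vs[p - 1] if p > 0 else 0)
            prev[i] = ids[p - 1] if p > 0 else None
        else:
            p = bisect_right(ls, l)      # entry p-1 is k: largest l_k <= l_i
            if p == 0 or V[i] > vs[p - 1]:
                lo = bisect_left(ls, l)  # entries from lo on have l_j >= l_i
                hi = lo
                while hi < len(ls) and vs[hi] <= V[i]:
                    hi += 1              # dominated: V(j) <= V(i) and l_j >= l_i
                # Remove the dominated entries and insert i in one step.
                ls[lo:hi] = [l]
                vs[lo:hi] = [V[i]]
                ids[lo:hi] = [i]
    if not rects:
        return 0, []
    end = max(range(len(rects)), key=lambda i: V[i])
    chain, i = [], end
    while i is not None:
        chain.append(i)
        i = prev[i]
    return V[end], chain[::-1]

# Hypothetical rectangles: (x_left, x_right, h, l, weight)
RECTS = [(0, 2, 0, 2, 5), (3, 5, 3, 5, 6), (1, 4, 6, 8, 3), (6, 8, 1, 2, 4), (6, 9, 6, 9, 2)]
print(chain_rectangles(RECTS))
```

In this made-up input, rectangles 0, 1, and 4 lie strictly down and to the right of one another, so the best chain has weight 5 + 6 + 2 = 13.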
Example
[Figure: five rectangles in the x-y plane with weights a: 5, b: 6, c: 3, d: 4, e: 2; y-axis ticks at 2, 5, 6, 9, 10, 11, 12, 14, 15, 16]
Scores V: a = 5, b = 11, c = 8, d = 12, e = 13
Triplets (li, V(i), i) that appear in L during the sweep: (5, 5, a), (11, 8, c), (9, 11, b), (15, 12, d), (16, 13, e)
Time Analysis
1. Sorting the x-coords takes O(N log N)
2. Going through x-coords: N steps
3. Each of N steps requires O(log N) time:
• Searching L takes log N
• Inserting to L takes log N
• All deletions are consecutive, so log N per deletion
• Each element is deleted at most once: N log N for all deletions
• Recall that INSERT, DELETE, SUCCESSOR take O(log N) time in a balanced binary search tree
Examples
[Figure: Human Genome Browser screenshots, panels A, B, C]
Gene Recognition
Gene structure
[Figure: a gene with exon1, intron1, exon2, intron2, exon3, and the steps transcription, splicing, translation]
exon = protein-coding
intron = non-coding
Codon: a triplet of nucleotides that is converted to one amino acid
Where are the genes?
Needles in a Haystack
Gene Finding
• Classes of gene predictors
  – Ab initio: only look at the genomic DNA of target genome
  – De novo: target genome + aligned informant genome(s)
  – EST/cDNA-based & combined approaches: use aligned ESTs or cDNAs + any other kind of evidence
EXON EXON EXON EXON EXON
Human     tttcttagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta
Macaque   tttcttagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta
Mouse     ttgcttagACTTTAAAGTTGTCAAGCCGCGTTCTTGATAAAATAAGTATTGGACAACTTGTTAGTCTTCTTTCCAACAACCTGAACAAATTTGATGAAgtatgta-cca
Rat       ttgcttagACTTTAAAGTTGTCAAGCCGTGTTCTTGATAAAATAAGTATTGGACAACTTATTAGTCTTCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccca
Rabbit    t--attagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGGCAACTTATTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta
Dog       t-cattagACTTTAAAGCTGTCAAGCCGTGTTCTGGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTCGATGAAgtatgtaccta
Cow       t-cattagACTTTGAAGCTATCAAGCCGTGTTCTGGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgta-cta
Armadillo gca--tagACCTTAAAACTGTCAAGCCGTGTTTTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtgccta
Elephant  gct-ttagACTTTAAAACTGTCCAGCCGTGTTCTTGATAAAATAAGTATTGGACAACTTGTCAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtatcta
Tenrec    tc-cttagACTTTAAAACTTTCGAGCCGGGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtatcta
Opossum   ---tttagACCTTAAAACTGTCAAGCCGTGTTCTAGATAAAATAAGCACTGGACAGCTTATCAGTCTCCTTTCCAACAATCTGAACAAGTTTGATGAAgtatgtagctg
Chicken   ----ttagACCTTAAAACTGTCAAGCAAAGTTCTAGATAAAATAAGTACTGGACAATTGGTCAGCCTTCTTTCCAACAATCTGAACAAATTCGATGAGgtatgtt--tg
Signals for Gene Finding
1. Regular gene structure
2. Exon/intron lengths
3. Codon composition
4. Motifs at the boundaries of exons, introns, etc.: start codon, stop codon, splice sites
5. Patterns of conservation
6. Sequenced mRNAs
7. (PCR for verification)
[Figure: exon reading frames, with panels "Next Exon: Frame 0" and "Next Exon: Frame 1"]
Exon and Intron Lengths
Nucleotide Composition
• Base composition in exons is characteristic due to the genetic code
Amino Acid      SLC   DNA Codons
Isoleucine      I     ATT, ATC, ATA
Leucine         L     CTT, CTC, CTA, CTG, TTA, TTG
Valine          V     GTT, GTC, GTA, GTG
Phenylalanine   F     TTT, TTC
Methionine      M     ATG
Cysteine        C     TGT, TGC
Alanine         A     GCT, GCC, GCA, GCG
Glycine         G     GGT, GGC, GGA, GGG
Proline         P     CCT, CCC, CCA, CCG
Threonine       T     ACT, ACC, ACA, ACG
Serine          S     TCT, TCC, TCA, TCG, AGT, AGC
Tyrosine        Y     TAT, TAC
Tryptophan      W     TGG
Glutamine       Q     CAA, CAG
Asparagine      N     AAT, AAC
Histidine       H     CAT, CAC
Glutamic acid   E     GAA, GAG
Aspartic acid   D     GAT, GAC
Lysine          K     AAA, AAG
Arginine        R     CGT, CGC, CGA, CGG, AGA, AGG
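The codon table maps directly onto a lookup dictionary, which makes the link between nucleotide triplets and amino acids concrete. The sketch below is illustrative only: the names `CODONS`, `STOP_CODONS`, `CODON_TO_AA`, and `translate` are assumptions, and the three stop codons (TAA, TAG, TGA) are not listed in the table above but are added here from the standard genetic code.

```python
# Amino acid (single-letter code) -> DNA codons, as listed in the table.
CODONS = {
    "I": "ATT ATC ATA",
    "L": "CTT CTC CTA CTG TTA TTG",
    "V": "GTT GTC GTA GTG",
    "F": "TTT TTC",
    "M": "ATG",
    "C": "TGT TGC",
    "A": "GCT GCC GCA GCG",
    "G": "GGT GGC GGA GGG",
    "P": "CCT CCC CCA CCG",
    "T": "ACT ACC ACA ACG",
    "S": "TCT TCC TCA TCG AGT AGC",
    "Y": "TAT TAC",
    "W": "TGG",
    "Q": "CAA CAG",
    "N": "AAT AAC",
    "H": "CAT CAC",
    "E": "GAA GAG",
    "D": "GAT GAC",
    "K": "AAA AAG",
    "R": "CGT CGC CGA CGG AGA AGG",
}
# Assumption: stop codons of the standard genetic code (not in the table).
STOP_CODONS = {"TAA", "TAG", "TGA"}

# Invert the table: codon -> amino acid.
CODON_TO_AA = {codon: aa for aa, codons in CODONS.items() for codon in codons.split()}

def translate(dna):
    """Translate a coding DNA string in frame 0, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        codon = dna[i:i + 3].upper()
        if codon in STOP_CODONS:
            break
        protein.append(CODON_TO_AA[codon])
    return "".join(protein)

print(translate("ATGGCCTGGTAA"))
```

The 20 amino acids account for 61 codons; together with the 3 stop codons that covers all 64 triplets, and this redundancy is what makes base composition in exons characteristic.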
[Figure: sequence motifs marked along a gene: atg, tga, ggtgag, caggtg, cagatg, cagttg, caggcc]
Splice Sites
(http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html)
HMMs for Gene Recognition
GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA
[Figure: the sequence above labeled with intergene, exon, and intron segments, each emitted by the corresponding HMM state (Intergene State, First Exon State, Intron State)]
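Labeling a sequence with HMM states amounts to finding the most probable state path, which the Viterbi algorithm does. The sketch below is a toy, not a gene finder: it uses only two hypothetical states (E for exon, I for intron), and every probability in `START`, `TRANS`, and `EMIT` is made up for illustration (exons GC-rich, introns AT-rich). The function and variable names are assumptions as well. It expects a non-empty sequence.

```python
import math

STATES = ["E", "I"]                       # toy states: exon, intron
START = {"E": 0.5, "I": 0.5}              # made-up start probabilities
TRANS = {"E": {"E": 0.9, "I": 0.1},       # made-up transition probabilities
         "I": {"E": 0.1, "I": 0.9}}
EMIT = {"E": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},   # made-up emissions
        "I": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable state path for obs, computed in log space."""
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = []
    for symbol in obs[1:]:
        col, ptr = {}, {}
        for s in states:
            # Best predecessor r for landing in state s at this position.
            best = max(states, key=lambda r: scores[-1][r] + math.log(trans_p[r][s]))
            col[s] = scores[-1][best] + math.log(trans_p[best][s]) + math.log(emit_p[s][symbol])
            ptr[s] = best
        scores.append(col)
        back.append(ptr)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

print("".join(viterbi("GCGCGCATATAT", STATES, START, TRANS, EMIT)))
```

A real gene-finding HMM has many more states (intergene, first exon, internal exons in each frame, introns, and so on), but the decoding step has the same shape.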
Duration HMMs for Gene Recognition
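[Figure: a DNA sequence segmented into Exon1, Exon2, Exon3; each exon has a duration d, and position j+2 is marked]
Intron: $\prod_i P_{\text{INTRON}}(x_i \mid x_{i-1} \ldots x_{i-w})$
Exon: $P_{\text{EXON\_DUR}}(d) \prod_i P_{\text{EXON}((i - j + 2)\,\%\,3)}(x_i \mid x_{i-1} \ldots x_{i-w})$
5' splice site: $P_{5'\text{SS}}(x_{i-3} \ldots x_{i+4})$
Stop codon: $P_{\text{STOP}}(x_{i-4} \ldots x_{i+3})$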
Genscan
• Burge, 1997
• First competitive HMM-based gene finder, huge accuracy jump
• Only gene finder at the time to predict partial genes and genes on both strands
Features
– Duration HMM
– Four different parameter sets
  • Very low, low, med, high GC-content
Using Comparative Information
• Hox cluster is an example where everything is conserved
Patterns of Conservation
              Genes    Intergenic   Separation
Mutations     30%      58%          2-fold
Gaps          1.3%     14%          10-fold
Frameshifts   0.14%    10.2%        75-fold
Comparison-based Gene Finders
• Rosetta, 2000
• CEM, 2000
  – First methods to apply comparative genomics (human-mouse) to improve gene prediction
• Twinscan, 2001
  – First HMM for comparative gene prediction in two genomes
• SLAM, 2002
  – Generalized pair-HMM for simultaneous alignment and gene prediction in two genomes
• NSCAN, 2006
  – Best method to date, based on a phylo-HMM for multiple genome gene prediction
Twinscan
1. Align the two sequences (e.g. from human and mouse)
2. Mark each human base as gap ( - ), mismatch ( : ), match ( | )
New “alphabet”: 4 x 3 = 12 letters = { A-, A:, A|, C-, C:, C|, G-, G:, G|, U-, U:, U| }
3. Run Viterbi using emissions ek(b) where b ∈ { A-, A:, A|, …, U| }
Emission distributions ek(b) estimated from real genes from human/mouse
eI(x|) < eE(x|): matches favored in exons
eI(x-) > eE(x-): gaps (and mismatches) favored in introns
Example
Human:     ACGGCGACGUGCACGU
Mouse:     ACUGUGACGUGCACUU
Alignment: ||:|:|||||||||:|
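Step 2 can be sketched in a few lines of Python. This is a minimal illustration under assumptions: the function name `twinscan_symbols` is hypothetical, the two inputs are taken to be already aligned and of equal length, a gap is written as `-`, and columns where the target itself has a gap are skipped because only target bases are labeled.

```python
def twinscan_symbols(target, informant):
    """Return one two-character symbol (base + conservation mark) per target base."""
    if len(target) != len(informant):
        raise ValueError("aligned sequences must have equal length")
    symbols = []
    for t, q in zip(target, informant):
        if t == "-":
            continue            # no target base in this column
        if q == "-":
            mark = "-"          # gap in the informant
        elif t == q:
            mark = "|"          # match
        else:
            mark = ":"          # mismatch
        symbols.append(t + mark)
    return symbols

HUMAN = "ACGGCGACGUGCACGU"
MOUSE = "ACUGUGACGUGCACUU"
symbols = twinscan_symbols(HUMAN, MOUSE)
print("".join(s[1] for s in symbols))   # the conservation string
print(symbols[:3])                      # first letters of the 12-letter alphabet
```

The resulting list is the observation sequence over the 12-letter alphabet on which Viterbi is run in step 3.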
SLAM – Generalized Pair HMM
Exon GPHMM
1. Choose exon lengths (d, e).
2. Generate alignment of length d+e.
[Figure: a pair of aligned exons of lengths d and e]
NSCAN—Multiple Species Gene Prediction
• GENSCAN
    Target        GGTGAGGTGACCAAGAACGTGTTGACAGTA
• TWINSCAN
    Target        GGTGAGGTGACCAAGAACGTGTTGACAGTA
    Conservation  |||:||:||:|||||:||||||||......
    sequence
• N-SCAN
    Target        GGTGAGGTGACCAAGAACGTGTTGACAGTA
    Informant1    GGTCAGC___CCAAGAACGTGTAG......
    Informant2    GATCAGC___CCAAGAACGTGTAG......
    Informant3    GGTGAGCTGACCAAGATCGTGTTGACACAA
    ...
Target sequence: $P(T_i \mid T_{i-1}, \ldots, T_{i-o})$
Informant sequences (vector): $P(\mathbf{I}_i \mid T_i, \ldots, T_{i-o}, \mathbf{I}_{i-1}, \ldots, \mathbf{I}_{i-o})$
Joint prediction (use phylo-HMM): $P(T_i, \mathbf{I}_i \mid T_{i-1}, \ldots, T_{i-o}, \mathbf{I}_{i-1}, \ldots, \mathbf{I}_{i-o})$
NSCAN—Multiple Species Gene Prediction
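[Figure: phylogenetic tree rooted at X, with edges X→C, X→Y, Y→H, Y→Z, Z→M, Z→R]
$P(H, C, M, R, X, Y, Z) = P(X)\, P(C \mid X)\, P(Y \mid X)\, P(H \mid Y)\, P(Z \mid Y)\, P(M \mid Z)\, P(R \mid Z)$
[Figure: the same tree re-rooted at H, with edges H→Y, Y→X, Y→Z, X→C, Z→M, Z→R]
$P(H, C, M, R, X, Y, Z) = P(H)\, P(Y \mid H)\, P(X \mid Y)\, P(Z \mid Y)\, P(C \mid X)\, P(M \mid Z)\, P(R \mid Z)$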
Performance Comparison
GENSCAN: Generalized HMM; models human sequence
TWINSCAN: Generalized HMM; models human/mouse alignments
N-SCAN: Phylo-HMM; models multiple sequence evolution
[Figure: accuracy comparison; N-SCAN human/mouse > human/multiple informants]
CONTRAST
• 2-level architecture
• No Phylo-HMM that models alignments
[Figure: the multiple alignment of Human, Macaque, Mouse, Rat, Rabbit, Dog, Cow, Armadillo, Elephant, Tenrec, Opossum, and Chicken feeding SVM classifiers; panel labels X, Y, a, b]
CONTRAST - Features
• $\log P(y \mid x) \sim w^T F(x, y)$
• $F(x, y) = \sum_i f(y_{i-1}, y_i, i, x)$
• $f(y_{i-1}, y_i, i, x)$:
  $1\{y_{i-1} = \text{INTRON},\ y_i = \text{EXON\_FRAME\_1}\}$
  $1\{y_{i-1} = \text{EXON\_FRAME\_1},\ x_{\text{human},i-2}, \ldots, x_{\text{human},i+3} = \text{ACCGGT}\}$
  $1\{y_{i-1} = \text{EXON\_FRAME\_1},\ x_{\text{human},i-1}, \ldots, x_{\text{dog},i+1} = \text{ACC}, \text{AGC}\}$
  $(1-c)\, 1\{a < \text{SVM\_DONOR}(i) < b\}$
  (optional) $1\{\text{EXON\_FRAME\_1},\ \text{EST\_EVIDENCE}\}$
CONTRAST – SVM accuracies
• Accuracy increases as we add informants
• Diminishing returns after ~5 informants
[Figure: SVM accuracy plots, panels SN and SP]
CONTRAST - Decoding
Viterbi Decoding:
maximize P(y | x)
Maximum Expected Boundary Accuracy Decoding:
maximize $\sum_{i,B} 1\{y_{i-1}, y_i \text{ is exon boundary } B\}\ \text{Accuracy}(y_{i-1}, y_i, B \mid x)$
$\text{Accuracy}(y_{i-1}, y_i, B \mid x) = P(y_{i-1}, y_i \text{ is } B \mid x) - (1 - P(y_{i-1}, y_i \text{ is } B \mid x))$
CONTRAST - Training
Maximum Conditional Likelihood Training:
maximize $L(w) = P_w(y \mid x)$
Maximum Expected Boundary Accuracy Training:
$\text{ExpectedBoundaryAccuracy}(w) = \sum_i \text{Accuracy}_i$
$\text{Accuracy}_i = \sum_B 1\{y_{i-1}, y_i \text{ is exon boundary } B\} \left( P_w(y_{i-1}, y_i \text{ is } B \mid x) - \sum_{B' \neq B} P_w(y_{i-1}, y_i \text{ is exon boundary } B' \mid x) \right)$
Performance Comparison
[Figure: De Novo and EST-assisted performance panels, using the genomes Human, Macaque, Mouse, Rat, Rabbit, Dog, Cow, Armadillo, Elephant, Tenrec, Opossum, and Chicken]