![Page 1: Exact String Search Lecture 7: September 22, 2005 Algorithms in Biosequence Analysis Nathan Edwards - Fall, 2005](https://reader035.vdocuments.net/reader035/viewer/2022062515/56649d135503460f949e6e38/html5/thumbnails/1.jpg)
Exact String Search
Lecture 7: September 22, 2005
Algorithms in Biosequence Analysis
Nathan Edwards - Fall, 2005
Boyer-Moore
• Method of choice for exact string search, for a single pattern
• Typically, examines fewer than m characters of the text (sublinear time)
• Linear worst case running time
• Conceptually very similar to K-M-P, but with a more complicated running time proof
• Empirically, better for English text than DNA sequence
Boyer-Moore
• Three key ideas
  • Right to left scan
  • Bad character rule
  • (Strong) good suffix rule
• The combination of these ideas can produce large pattern shifts.
• Provable O(n+m) running time when pattern is not in the text
  • Need an extension for the case when the pattern is in the text to achieve linear running time.
Right to left scan / bad character rule
       0        1
       12345678901234567
    T: xpbctbxabpqxctbpq
    P:   tpabxab
           *^^^^
Right to left scan / bad character rule
       0        1
       12345678901234567
    T: xpbctbxabpqxctbpq
    P:   tpabxab
           *^^^^
    P:     tpabxab
                 *
Right to left scan / bad character rule
       0        1
       123456789012345678
    T: xpbctbxabpqxctbpqz
    P:   tpabxab
           *^^^^
    P:     tpabxab
                 *
    P:            tpabxab
Bad character rule
Comparing right-to-left, mismatch at position i of P, position k of T:
If T(k) is absent from P: shift left end of P to position k+1 of T
If right-most T(k) in P is to the left of i: shift pattern to align the T(k) characters
Otherwise: shift pattern 1 position
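The three cases above can be sketched directly in Python. This is a minimal illustration, not the lecture's code: the function name `bad_character_shift` is an assumption, positions are 1-based as on the slides, and the right-most occurrence is found by scanning P rather than by a precomputed table.

```python
def bad_character_shift(P, i, c):
    """Shift (in positions) after a mismatch at 1-based position i of P
    against the text character c = T(k). Hypothetical helper name."""
    r = P.rfind(c) + 1      # 1-based right-most position of c in P, 0 if absent
    if r == 0:
        return i            # c absent from P: left end of P moves to k+1
    if r < i:
        return i - r        # align the right-most c in P with T(k)
    return 1                # right-most c lies to the right of i


P = "tpabxab"
print(bad_character_shift(P, 3, "t"))  # mismatch P(3) vs T(5) = 't' in the first example
print(bad_character_shift(P, 7, "q"))  # mismatch P(7) vs T(11) = 'q'
print(bad_character_shift(P, 5, "a"))  # mismatch P(5) vs 'a': the "otherwise" case
```

The first two calls reproduce the shifts in the worked example (2 positions, then 7); the third is the case where the plain rule can only shift by 1.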
Right to left scan / bad character rule
       0        1
       12345678901234567
    T: xpbctbaabpqxctbpq
    P:   tpabxab
             *^^
Extended bad character rule
Comparing right-to-left, mismatch at position i of P, position k of T:
If T(k) is absent from P[1…i-1]: shift left end of P to position k+1 of T
For right-most T(k) in P to the left of i: shift pattern to align the T(k) characters
Otherwise: shift pattern 1 position
Right to left scan / extended bad character rule
       0        1
       12345678901234567
    T: xpbctbaabpqxctbpq
    P:   tpabxab
             *^^
Right to left scan / extended bad character rule
       0        1
       12345678901234567
    T: xpbctbaabpqxctbpq
    P:     tpabxab
(Extended) bad character rule
• For all x in Σ, R(x) is the position of the right-most occurrence of x in P. R(x) is zero if x is absent from P.
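• Comparing right-to-left, mismatch at position i of P, position k of T: shift P right max[1, i-R(T(k))] positions
• For the extended bad character rule, need to look up R(x,i), the position of the right-most occurrence of x in P to the left of i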
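A small Python sketch of both lookups follows. The function names are assumptions, and storing a sorted position list per character with a binary search is one possible way to answer R(x,i); the slides do not prescribe a data structure.

```python
from bisect import bisect_left


def rightmost_table(P):
    """R(x): 1-based position of the right-most occurrence of x in P.
    Characters absent from P are simply missing (treated as 0 below)."""
    return {x: pos for pos, x in enumerate(P, start=1)}  # later positions overwrite


def bad_char_shift(R, i, c):
    """Plain rule: shift by max[1, i - R(T(k))]."""
    return max(1, i - R.get(c, 0))


def position_lists(P):
    """For each character, the increasing list of its 1-based positions in P."""
    occ = {}
    for pos, x in enumerate(P, start=1):
        occ.setdefault(x, []).append(pos)
    return occ


def extended_bad_char_shift(occ, i, c):
    """Extended rule: shift by i - R(c, i), where R(c, i) is the right-most
    position of c in P[1..i-1] (0 if there is none)."""
    positions = occ.get(c, [])
    idx = bisect_left(positions, i)          # number of positions strictly below i
    r = positions[idx - 1] if idx else 0
    return i - r


P = "tpabxab"
R, occ = rightmost_table(P), position_lists(P)
print(bad_char_shift(R, 5, "a"), extended_bad_char_shift(occ, 5, "a"))
```

For the mismatch at P(5) against 'a', the plain rule shifts by 1 while the extended rule shifts by 2, as in the extended bad character example.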
(Strong) good suffix rule
       0        1
       123456789012345678
    T: prstabstubabvqxrst
    P:   qcabdabdab
                *
(Strong) good suffix rule
       0        1
       123456789012345678
    T: prstabstubabvqxrst
    P:   qcabdabdab
                *^^
    P:         qcabdabdab
(Strong) good suffix rule
       0        1
       123456789012345678
    T: prstabstudabvqxrst
    P:     abdubdab
               *^^^
(Strong) good suffix rule
       0        1
       123456789012345678
    T: prstabstudabvqxrst
    P:     abdubdab
               *^^^
    P:           abdabdab
(Strong) good suffix rule
Substring t of T matches a suffix of P:
• Find the right-most copy t’ of t in P such that t’ is not a suffix of P and the character to the left of t’ in P ≠ the character to the left of the matched suffix in P; shift P to align t’ in P with t in T
• If no such t’ exists, shift the left end of P past the left end of t by the least amount such that a prefix of the shifted P matches a suffix of t in T
(Strong) good suffix rule
Definitions:
L(i) – max j < n such that P[i…n] matches a suffix of P[1…j], 0 if no such j.
L’(i) – max j < n such that P[i…n] matches a suffix of P[1…j] and the character before that suffix ≠ P(i-1), 0 if no such j.
Weak (L) and strong (L’) shifts for the first part of the good suffix rule: after a mismatch at position i-1 of P, shift P by n – L(i) or n – L’(i) positions.
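To make the two definitions concrete, here is a brute-force Python sketch that evaluates them literally. The function name is an assumption, the running time is far from linear, and a copy that starts at position 1 (so it has no preceding character) is treated as satisfying the strong condition, which agrees with the N_j-based characterization of L’(i).

```python
def good_suffix_tables_bruteforce(P):
    """Return (L, Lp) as 1-based lists (index 0 unused), straight from the
    definitions of L(i) and L'(i). Quadratic or worse; for checking only."""
    n = len(P)
    L = [0] * (n + 1)
    Lp = [0] * (n + 1)
    for i in range(2, n + 1):
        length = n - i + 1
        suffix = P[i - 1:]                        # P[i..n]
        for j in range(n - 1, length - 1, -1):    # candidate end positions, largest first
            if P[j - length:j] != suffix:         # P[j-length+1..j] must equal P[i..n]
                continue
            if L[i] == 0:
                L[i] = j                          # first hit is the largest j (weak rule)
            start = j - length + 1                # 1-based start of the copy
            if start == 1 or P[start - 2] != P[i - 2]:
                Lp[i] = j                         # preceding character differs from P(i-1)
                break
    return L, Lp


L, Lp = good_suffix_tables_bruteforce("qcabdabdab")
print(L[9], Lp[9])
```

For P = qcabdabdab this gives L(9) = 7 and L’(9) = 4, so after the mismatch at position 8 in the example the strong rule shifts by n – L’(9) = 6 positions, while the weak rule would shift by only 3.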
Computing L’(i)
Definition:
N_j(P) is the length of the longest suffix of P[1…j] that is also a suffix of P.
compare with:
Z_i(S) is the length of the longest prefix of S[i…|S|] that is also a prefix of S.
Compute N_j(P) as Z_{n-j+1}(reverse(P)).
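A sketch of this reduction in Python, assuming the standard linear-time Z algorithm. The function names are illustrative, the Z array is 0-based internally, and the returned N list is 1-based with index 0 unused.

```python
def z_values(S):
    """Z[i] (0-based, i >= 1): length of the longest prefix of S[i:] that is
    also a prefix of S. Z[0] is left at 0."""
    n = len(S)
    Z = [0] * n
    l = r = 0                                  # current Z-box is S[l:r]
    for i in range(1, n):
        if i < r:
            Z[i] = min(r - i, Z[i - l])        # reuse information inside the Z-box
        while i + Z[i] < n and S[Z[i]] == S[i + Z[i]]:
            Z[i] += 1                          # extend by explicit comparison
        if i + Z[i] > r:
            l, r = i, i + Z[i]
    return Z


def n_values(P):
    """N[j] for j = 1..n: length of the longest suffix of P[1..j] that is
    also a suffix of P, computed from the Z values of the reversed pattern."""
    n = len(P)
    Z = z_values(P[::-1])
    N = [0] * (n + 1)
    for j in range(1, n):
        N[j] = Z[n - j]                        # 1-based Z_{n-j+1} is 0-based index n-j
    N[n] = n                                   # all of P is trivially a suffix of P
    return N


print(n_values("qcabdabdab"))
```

For P = qcabdabdab the only non-zero entries below n are N_4 = 2 ("ab") and N_7 = 5 ("abdab").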
Computing L’(i)
• L’(i) – max j < n s.t. N_j(P) = |P[i…n]| = n – i + 1
(Strong) good suffix rule
Definition:
l’(i) – length of the longest prefix of P that is also a suffix of P[i…n], 0 if no such prefix exists.
l’(i) – max j ≤ |P[i…n]| = (n – i + 1) s.t. N_j(P) = j
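Both tables can be filled from the N values in one pass each. The Python sketch below is illustrative only: the names are assumptions, and N is computed here by direct comparison to keep the block self-contained (the Z-based computation would make the preprocessing linear).

```python
def n_values_bruteforce(P):
    """N[j], 1-based: longest suffix of P[1..j] that is also a suffix of P.
    Direct comparison, quadratic in the worst case."""
    n = len(P)
    N = [0] * (n + 1)
    for j in range(1, n + 1):
        k = 0
        while k < j and P[j - 1 - k] == P[n - 1 - k]:
            k += 1
        N[j] = k
    return N


def strong_good_suffix_tables(P):
    """Return (Lp, lp), 1-based with entries up to index n+1, where
    Lp[i] = L'(i) and lp[i] = l'(i)."""
    n = len(P)
    N = n_values_bruteforce(P)
    Lp = [0] * (n + 2)
    for j in range(1, n):                 # increasing j, so the largest j wins
        if N[j] > 0:
            Lp[n - N[j] + 1] = j          # N_j = n - i + 1  <=>  i = n - N_j + 1
    lp = [0] * (n + 2)
    longest = 0
    for i in range(n, 0, -1):             # j = n - i + 1 grows as i decreases
        j = n - i + 1
        if N[j] == j:                     # prefix P[1..j] is also a suffix of P
            longest = j
        lp[i] = longest
    return Lp, lp


Lp, lp = strong_good_suffix_tables("abdabdab")
print(Lp[6], lp[6])
```

For P = abdabdab and a mismatch at position 5, L’(6) = 0 and l’(6) = 2, so the pattern shifts by n – l’(6) = 6 positions.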
Boyer-Moore pseudocode
    Compute L’(i), l’(i), and R(x) for x in Σ.
    k = n
    while k ≤ m
        i = n, h = k
        while i > 0 and P(i) = T(h)
            i--; h--
        if i = 0
            occurrence of P in T ending at position k
            k = k + n – l’(2)
        else
            if i = n, λ = n – 1   (mismatch on the first comparison: good suffix shift is 1)
            else if L’(i+1) > 0, λ = L’(i+1)
            else λ = l’(i+1)
            k = k + max{ 1, i – R(T(h)), n – λ }
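The pseudocode can be turned into a runnable Python sketch as follows. This is one possible rendering under stated assumptions: function names are illustrative, occurrences are reported as 1-based start positions, a first-comparison mismatch (i = n) uses a good suffix shift of 1, and the plain (not extended) bad character rule is used.

```python
def z_values(S):
    """Standard Z algorithm (0-based); Z[0] is left at 0."""
    n = len(S)
    Z = [0] * n
    l = r = 0
    for i in range(1, n):
        if i < r:
            Z[i] = min(r - i, Z[i - l])
        while i + Z[i] < n and S[Z[i]] == S[i + Z[i]]:
            Z[i] += 1
        if i + Z[i] > r:
            l, r = i, i + Z[i]
    return Z


def preprocess(P):
    """Return (Lp, lp, R): L'(i), l'(i) as 1-based lists and R(x) as a dict."""
    n = len(P)
    Z = z_values(P[::-1])
    N = [0] * (n + 1)
    for j in range(1, n):
        N[j] = Z[n - j]                   # N_j(P) = Z_{n-j+1}(reverse(P))
    N[n] = n
    Lp = [0] * (n + 2)
    for j in range(1, n):
        if N[j] > 0:
            Lp[n - N[j] + 1] = j
    lp = [0] * (n + 2)
    longest = 0
    for i in range(n, 0, -1):
        if N[n - i + 1] == n - i + 1:
            longest = n - i + 1
        lp[i] = longest
    R = {x: pos for pos, x in enumerate(P, start=1)}
    return Lp, lp, R


def boyer_moore(P, T):
    """All 1-based start positions of P in T."""
    n, m = len(P), len(T)
    if n == 0 or n > m:
        return []
    Lp, lp, R = preprocess(P)
    occurrences = []
    k = n                                 # text position aligned with P(n)
    while k <= m:
        i, h = n, k
        while i > 0 and P[i - 1] == T[h - 1]:
            i -= 1
            h -= 1
        if i == 0:
            occurrences.append(k - n + 1)
            k += n - lp[2]
        else:
            if i == n:
                good = 1                  # nothing matched yet
            elif Lp[i + 1] > 0:
                good = n - Lp[i + 1]
            else:
                good = n - lp[i + 1]
            k += max(1, i - R.get(T[h - 1], 0), good)
    return occurrences


print(boyer_moore("ab", "prstabstudabvqxrst"))
```

The call at the end finds "ab" at positions 5 and 11 of the example text used in the good suffix slides.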
Running time analysis
• Notice that unlike K-M-P, we might re-compare text characters that matched in a previous iteration.
• Worst instance does Θ(nm) total comparisons, but only if P is in T
• If P is not in T, O(n+m) running time
  • complicated proof!
• What goes wrong when P is in T?
Worst case instance, P in T
       0        1
       12345678901234567
    T: aaaaaaaaaaaaaaaaa
    P: aaaaaaa
       ^^^^^^^
    P:  aaaaaaa
        ^^^^^^^
Galil’s Extension
• Comparing right-to-left, with position n of P aligned to position k of T and matched down to character s of T: if the shift moves position 1 of P past s, then a prefix of the shifted P is already known to match T up to position k.
  • skip these comparisons
• Sufficient for linear time bound, whether or not P is in T.
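The effect on the worst case instance can be checked with a small counting experiment in Python. This sketch is deliberately simplified and is not the full extension: it applies the skipping idea only to the shift that follows a complete occurrence (where the known prefix has length l’(2)), and it shifts by a single position after a mismatch instead of using the Boyer-Moore shift rules. All names are assumptions.

```python
def longest_proper_border(P):
    """l'(2): length of the longest proper prefix of P that is also a suffix."""
    for length in range(len(P) - 1, 0, -1):
        if P[:length] == P[-length:]:
            return length
    return 0


def count_comparisons(P, T, use_galil):
    """Return (1-based occurrence starts, number of character comparisons)."""
    n, m = len(P), len(T)
    border = longest_proper_border(P)
    occurrences, comparisons = [], 0
    k = n            # text position aligned with P(n)
    known = 0        # length of the prefix of P already known to match T
    while k <= m:
        i, h = n, k
        while i > known and P[i - 1] == T[h - 1]:
            comparisons += 1
            i -= 1
            h -= 1
        if i == known:                    # the rest of P is known to match
            occurrences.append(k - n + 1)
            k += n - border
            known = border if use_galil else 0
        else:
            comparisons += 1              # the mismatching comparison
            k += 1                        # simplified shift (assumption)
            known = 0
    return occurrences, comparisons


T, P = "a" * 17, "a" * 7
print(count_comparisons(P, T, use_galil=False)[1])
print(count_comparisons(P, T, use_galil=True)[1])
```

On this instance the count drops from 77 comparisons (7 per alignment) to 17 (7 for the first alignment, then 1 new character per alignment), and both variants report the same 11 occurrences.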
Worst case instance, P in T
       0        1
       12345678901234567
    T: aaaaaaaaaaaaaaaaa
    P: aaaaaaa
       ^^^^^^^
    P:  aaaaaaa
              ^
Galil’s Extension
       0        1
       123456789012345678
    T: prstabstudabvqxrst
    P:     abdubdab
               *^^^
    P:           abdabdab
Lessons From B-M
• Sub-linear time is possible
  • But we still need to read T from disk!
• Bad cases require periodicity in P or T
  • Matching random P with T is easy!
• Large alphabets mean large shifts
• Small alphabets make complicated shift data-structures possible
• B-M better for “English” and amino-acids than for DNA.