Advanced Algorithms / T. Shibuya
Advanced Algorithms:
Text Algorithms
Tetsuo Shibuya
Human Genome Center, Institute of Medical Science
(Adjunct at Department of Computer Science)
University of Tokyo
http://www.hgc.jp/~tshibuya
Self Introduction
Affiliation: Laboratory of Sequence Analysis, Human Genome Center,
Institute of Medical Science
Research interest: Bioinformatics algorithms
Our lab is located on the 4th floor
The topics of this lecture
Today (July 2nd)
Text searching algorithms: Knuth-Morris-Pratt / Boyer-Moore / etc.
Next week (July 9th)
Text indexing algorithms: Suffix arrays and their applications
The final week (July 16th)
Text compression algorithms: LZ77 / LZ78 / LZW / Arithmetic coding / Block sorting / etc.
Reports
Please submit a report for the homework that I will give on the last day
or
Please submit scribe notes for one of my three lectures
In TeX format as for the previous lectures
One volunteer (if any) for one lecture
The submitted notes will be put on the web page
Textbooks
D. Gusfield, Algorithms on Strings, Trees, and Sequences, Cambridge University Press, 1997.
The most famous book on text processing algorithms, but many parts are out of date.
W. Sung, Algorithms in Bioinformatics, CRC Press, 2009.
Good introduction to bioinformatics algorithms (mainly on text processing).
D. Salomon, Data Compression, 3rd Edition, Springer, 2004.
Related to the topic of the last day. (Very heavy book!)
Today's topic
Text processing algorithms
Brute-force algorithm
Knuth-Morris-Pratt algorithm
Colussi algorithm
Aho-Corasick algorithm
Boyer-Moore algorithm
Horspool algorithm
Turbo-BM algorithm
Rabin-Karp algorithm
Shift-Or method
etc.
Text matching
Problem
Given: a text string T and a pattern (query) P
Output: all substrings of T that are exactly the same as P, if any
Exact matching: no insertion / deletion / modification (mutation)
Two approaches
Preprocess only the query pattern (today)
Preprocess the text beforehand (next week)
Text:
GGTGAGAAGTTATGATACAGGGTAGTTG
TGTCCTTAAGGTGTATAACGATGACATC
ACAGGCAGCTCTAATCTCTTGCTATGAG
TGATGTAAGATTTATAAGTACGCAAATT
Pattern (query): TATAA
Two types of text matching algorithms
Skipping positions unnecessary to compare
Check from left: Knuth-Morris-Pratt, Aho-Corasick (for multiple queries)
Check from right: Boyer-Moore, Horspool, Turbo-BM
Brute-force: Naive algorithm
Fingerprinting (hash-based) algorithm: Rabin-Karp
Bitwise computation-based algorithm: Shift-Or (Shift-And)
Naive algorithm
Just check one by one at each position
O(nm) in the worst case, but...
Linear time on average!
Not so bad for cases when you have no time to implement :-)
But it is still much slower in practice than the more sophisticated algorithms.
Text: GGGACCAAGTTCCGCACATGCCGGATAGAAT
Pattern: CCGTATG
[Figure: the pattern is checked one by one at each text position; a lowercase letter marks the character where the comparison fails, e.g. "c", "CCg", "CCGt"]
Average length to check: 1 + 1/4 + (1/4)^2 + ... = 4/3 (constant!)
(for a random DNA sequence)
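The brute-force check can be sketched in a few lines of Python. This is only an illustration of the idea on this slide; the function name `naive_search` is an assumption, not something from the lecture.

```python
def naive_search(text, pattern):
    """Return every start index at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    hits = []
    for start in range(n - m + 1):
        k = 0
        # Compare left to right and stop at the first mismatch.
        while k < m and text[start + k] == pattern[k]:
            k += 1
        if k == m:
            hits.append(start)
    return hits
```

On random DNA most alignments fail after one or two comparisons, which is why the inner loop runs about 4/3 times per position on average even though the worst case is O(nm).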
Knuth-Morris-Pratt(1)
Improvement of the brute-force algorithm
The brute-force algorithm sometimes checks the same position more than once, which could be a waste of time
→ Knuth-Morris-Pratt Algorithm
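Pattern: TAGTAGC
Text: AATACTAGTAGGCATGCCGGAT
[Figure: the pattern is checked from the left at successive text positions. After "TAGTAG" matches and the final C fails, the following alignments are marked "skip".]
We already know that the text there is "TAGTAG", so we know before any comparison that the pattern cannot match at these positions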
Knuth-Morris-Pratt (2)
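If P[0..i] matches the text but P[i+1] does not, then
FailureLink[i+1] = max j such that P[0..j] = P[i-j..i], P[j+1] ≠ P[i+1], and j < i
If no such j exists, let FailureLink[i+1] = -2 if P[i+1] = P[0], otherwise let FailureLink[i+1] = -1
FailureLink[] can be computed before searching the text!
We can skip i - 1 - FailureLink[i+1] positions (i.e., shift the pattern by i - FailureLink[i+1])
[Figure: the longest suffix of the matched part that is also a prefix of the pattern (the failure link) is aligned with the text after the skip. The characters that follow the two copies should be different (Knuth's rule). You don't have to check the skipped positions again!]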
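For reference, here is a minimal Python sketch of KMP searching. Note the assumption: it uses the common textbook prefix function (longest proper border, without Knuth's "should be different" refinement and without the -1 / -2 encoding used on this slide), so the table values differ from FailureLink[] above while the set of reported matches is the same. The names `prefix_function` and `kmp_search` are illustrative.

```python
def prefix_function(pattern):
    """pi[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it (the classic weak failure function)."""
    pi = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = pi[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        pi[i] = k
    return pi


def kmp_search(text, pattern):
    """Return all start indices of pattern in text in O(n + m) time."""
    if not pattern:
        return []
    pi = prefix_function(pattern)
    hits = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        # On a mismatch, fall back along the failure links; the text
        # pointer i never moves backwards.
        while k > 0 and ch != pattern[k]:
            k = pi[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - len(pattern) + 1)
            k = pi[k - 1]
    return hits
```

Because the text index only moves forward, each text character is read once and the total work is bounded by the number of increments and decrements of `k`.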
Knuth-Morris-Pratt (3)
Text: CTACTGATCTGATCGCTAGATGC
Pattern: CTGATCTGC
[Figure: successive alignments of the pattern under the text, with the following annotations]
Skip 1 position
Failed at the first position, so just proceed
Overlap of "CTG"
No overlap
MP skips only 4 positions, KMP skips 5 positions
Knuth-Morris-Pratt (4)
Preprocessing
A naive algorithm requires O(m^2) or even O(m^3) time
Linear time algorithm exists
Use the KMP itself
Z algorithm [Gusfield 97]
Not faster than the KMP, but easier to understand
Z Algorithm (1)
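Z_i
Compute it for all i
Longest common prefix length of S[0..n-1] and S[i..n-1]
right_i
Max value of x + Z_x - 1 (x < i)
left_i
x that takes the maximum value of x + Z_x - 1 (x < i)
Initialization
Z_1 = right_1 = left_1 = 0
[Figure: the Z box, i.e. the substring from left_i to right_i, which is the same as the prefix of length Z_{left_i} starting at position 0]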
Z Algorithm (2)
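Computation of Z_{i+1}
In case i+1 ≤ right_i
We have already compared the text up to position right_i
Let i' = i - left_i, so that position i'+1 in the prefix corresponds to position i+1 in the Z box
In case Z_{i'+1} < right_i - i, we can copy the answer in O(1): Z_{i+1} = Z_{i'+1}
Otherwise compare naively after the position right_i (①)
In case i+1 > right_i
Compare naively (②)
①+② can be done in linear time in total!
The Z algorithm itself is also a text matching algorithm
Compute Z_i against P$T
P: pattern, T: text, $: some character that is in neither P nor T
[Figure: the Z box from left_i to right_i and the matching prefix of length Z_{left_i} = right_i - left_i + 1; positions i'+1 and i+1 correspond to each other, so Z_{i+1} = Z_{i'+1}]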
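The two cases above can be sketched in Python as follows. Assumptions to note: the Z box is kept as a half-open interval `[left, right)` rather than the inclusive right_i of the slide, `z[0]` is set to 0 by convention, and the names `z_array` and `z_search` are illustrative.

```python
def z_array(s):
    """z[i] = length of the longest common prefix of s and s[i:] (z[0] = 0)."""
    n = len(s)
    z = [0] * n
    left = right = 0  # current Z box is s[left:right], equal to s[0:right-left]
    for i in range(1, n):
        if i < right:
            # Inside the box: reuse the value computed for the prefix copy.
            z[i] = min(right - i, z[i - left])
        # Extend by explicit comparison (a no-op when the copy was final).
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def z_search(text, pattern, sep="$"):
    """Find pattern in text by computing Z on pattern + sep + text."""
    if not pattern:
        return []
    if sep in text or sep in pattern:
        raise ValueError("separator must not occur in text or pattern")
    m = len(pattern)
    z = z_array(pattern + sep + text)
    return [i - m - 1 for i in range(m + 1, len(z)) if z[i] == m]
```

Every successful comparison in the `while` loop moves `right` forward, so the total number of explicit comparisons is linear in the length of the string.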
Z Algorithm (3)
Example
Text: ATGCGCATAATGCGCTGAATGGCCATAATCTGAA
Z_i:  0000002016000000013000002012000011
[Figure: a Z box (left to right) whose text is the same as the prefix. We have done the computation up to one position and want to compute Z_i for the next position inside the box.]
Just copy the numbers if the numbers are smaller than 3
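Z_i & Failure Link (F[])
Failure links can be obtained by just scanning the Z_i table
Initialize FailureLink[] with -1
Scanning i from left to right: if (FailureLink[i+Z_i] = -1) FailureLink[i+Z_i] = Z_i - 1
Knuth's rule (post processing): for k ≥ 1, if FailureLink[k] is still -1 and P[k] = P[0], set FailureLink[k] = -2
i      0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
P      G  T  A  G  G  C  A  T  G  T  A  G  C  G  T  A  G  G
Z_i    0  0  0  1  1  0  0  0  4  0  0  1  0  5  0  0  1  1
Flink -1 -1 -1 -2  0  0 -1 -1 -2 -1 -1 -2  3 -2 -1 -1 -2  0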
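A hedged Python sketch of this scan, using the -1 / -2 convention from the definition of FailureLink. Assumptions: the scan runs over i from left to right so that the first value written to a slot is the largest j, only positions inside the pattern are filled, and the post-processing step marks -2 where the mismatching character equals P[0]. The function names are illustrative, and `z_array` is repeated so that the block runs on its own.

```python
def z_array(s):
    """z[i] = longest common prefix length of s and s[i:], with z[0] = 0."""
    n = len(s)
    z = [0] * n
    left = right = 0
    for i in range(1, n):
        if i < right:
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def failure_links_from_z(pattern):
    """Strong failure links with the -1 / -2 encoding, derived from Z."""
    m = len(pattern)
    z = z_array(pattern)
    fl = [-1] * m
    for i in range(1, m):
        k = i + z[i]  # first position after the match that starts at i
        # pattern[i:k] equals a prefix and pattern[k] != pattern[z[i]],
        # so Knuth's "should be different" condition holds automatically.
        if z[i] > 0 and k < m and fl[k] == -1:
            fl[k] = z[i] - 1
    for k in range(1, m):
        # Post-processing: even P[0] would mismatch at this text position.
        if fl[k] == -1 and pattern[k] == pattern[0]:
            fl[k] = -2
    return fl
```

Both passes are single linear scans, so together with the Z computation the preprocessing stays O(m).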
Computational Time Complexity of KMP
O(m+n)
n: text length, m: pattern length
Worst-case time complexity
#comparison < 2n-1
In practice, it is usually not faster than the Boyer-Moore or Shift-Or algorithms,
though these algorithms do not achieve worst-case linear time
Colussi Algorithm (A Variation of the KMP)
#comparison < 3n/2
Check the positions with the KMP strong rule later
Skip lengths are different from KMP
Preprocessing is also in linear time
Practically not much faster, though
cf. the Galil-Giancarlo algorithm achieves #comparison < 4n/3
[Figure: an example pattern annotated with FailureLink[i]+1 at each position, and the order in which the positions are checked in Step 1 and Step 2 (strong rule)]
KMP and an automaton
KMP can be described by an automaton
[Figure: automaton for the pattern ATATTG with its failure links]
Aho-Corasick (1)
The automaton can be extended to deal with multiple queries!
Linear time construction!
Linear time searching!
[Figure: keyword tree over the characters A, C, G, T with failure links; a node links to the root if not specified]
Aho-Corasick (2)
Construction of the keyword tree
O(M) time
M: Sum of query string lengths
Can be used for dictionary searching
[Figure: the keyword tree over the characters A, C, G, T]
Aho-Corasick (3)
Breadth-first searching
Start from the root
No failure link at the root
FailureLink(v)
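Traverse the FailureLinks starting from v's parent to find a node that has a child w with the same label as v, and let the nearest such w be FailureLink(v)
If no such node exists, let FailureLink(v) = root
[Figure: a keyword tree with edges labeled a, b, c, showing a node v and its failure-link target w]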
Aho-Corasick (4)
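Why is it linear time?
[Figure: for one pattern of length m, the failure links to be made lead to suffixes of the pattern that exist as paths from the root in the tree. Each failure link followed leads to a suffix that is at least 1 shorter, so the construction traverses at most O(m) nodes for this pattern.]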
Aho-Corasick (5)
OutLink(v)
Pointer from v to the node of another pattern that must also be output when v is reached
Computation of OutLink()
Traverse the failure links to find a node where a pattern ends, if any
If there's no such node, there's no need to set the outlink
Also in linear time
Patterns: 1 together, 2 ether, 3 get, 4 her, 5 he
[Figure: keyword tree for the five patterns, with the end node of each pattern numbered 1 to 5]
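A minimal Python sketch of the keyword tree, the breadth-first failure-link construction, and the search. The names `build_automaton` and `ac_search` are illustrative. As a simplification (an assumption, not the construction described above), each node stores a merged output list instead of an OutLink pointer, which is shorter to write but can cost more memory; patterns are assumed to be non-empty.

```python
from collections import deque


def build_automaton(patterns):
    goto = [{}]   # goto[node][char] -> child node; node 0 is the root
    fail = [0]    # failure link of each node
    out = [[]]    # indices of the patterns to report at each node
    for idx, pat in enumerate(patterns):
        node = 0
        for ch in pat:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(idx)
    queue = deque(goto[0].values())  # depth-1 nodes keep fail = root
    while queue:
        v = queue.popleft()
        for ch, w in goto[v].items():
            # Walk the failure links of the parent v until some node
            # has a child labeled ch (or the root is reached).
            f = fail[v]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[w] = goto[f].get(ch, 0)
            # fail[w] is shallower than w, so its list is already final.
            out[w] = out[w] + out[fail[w]]
            queue.append(w)
    return goto, fail, out


def ac_search(text, patterns):
    """Return (start index, pattern) for every occurrence in text."""
    goto, fail, out = build_automaton(patterns)
    hits = []
    node = 0
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for idx in out[node]:
            hits.append((i - len(patterns[idx]) + 1, patterns[idx]))
    return hits
```

For example, `ac_search("altogether", ["together", "ether", "get", "her", "he"])` reports all five patterns in a single left-to-right pass over the text.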
Regular expression search based on automata (1)
Regular expression
Concatenation A, B → AB
Or A, B → A+B
Repeat A → A*
Extension of Aho-Corasick
AB(A+B)(AB+CD)*B
ABABABBBABAABBABACDBABBABBABBCDBABAABABBABAABCDB...
Regular expression search based on automata (2)
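Construct the automaton for a regular expression
[Figure: building blocks for AB, A+B, and A*, each connected to the "Next" part, using ε transitions where needed]
[Figure: the resulting automaton for (A*B+AC)D, from Start to End, with transitions labeled A, B, C, D, and ε]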
Regular expression search based on automata (3)
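[Figure: the automaton for (A*B+AC)D from Start to End, with its states numbered 0 to 8]
Text: CDAABCAAABDDACDAAC
DP: for each text position, keep the set of reachable nodes (not including ε states)
You can start anywhere, so state 0 is reachable at every position
Found! when the End state becomes reachable
O(nm)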
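The state-set DP can be sketched in Python as follows. The transition table `DELTA` is a hand-built, ε-free automaton for (A*B+AC)D with its own state numbering; this is an assumption made for illustration and is not the numbering 0 to 8 used in the figure. The function name `nfa_search` is also illustrative.

```python
# Hand-built automaton for (A*B+AC)D (hypothetical state numbering):
# 0 = start, 1 = inside A*, 2 = after A*B, 3 = after the A of AC,
# 4 = after AC, 5 = accepting state after the final D.
DELTA = {
    (0, "A"): {1, 3},
    (1, "A"): {1},
    (0, "B"): {2},
    (1, "B"): {2},
    (3, "C"): {4},
    (2, "D"): {5},
    (4, "D"): {5},
}


def nfa_search(text, delta=DELTA, start=0, accept=5):
    """Return the end index of every match, tracking reachable states."""
    ends = []
    current = set()
    for i, ch in enumerate(text):
        # A match may start anywhere, so the start state is always live.
        current.add(start)
        current = {t for s in current for t in delta.get((s, ch), ())}
        if accept in current:
            ends.append(i)
    return ends
```

Each text character touches every live state at most once, which gives the O(nm) bound for an automaton with O(m) states and a bounded number of transitions per state.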
Boyer-Moore (1)
Idea
Almost the same as KMP, but check from right!
Practically faster than KMP
Good average-case time complexity
Bad worst-case time complexity
Text: AATTGTTCCGGCCATGCCGGAT
Pattern: GTTCGTT
[Figure: the pattern is compared from its right end. T, TT, and GTT match, then C fails; the next skip is based on the information of "GTT". The following alignments fail earlier, and a later skip is based on the information of "G".]
Boyer-Moore (2)
Two rules
Bad character rule
If the text character at the failed position is x, we can shift the pattern so that its last x is aligned with that position
The algorithm that uses only this rule is called the Horspool algorithm
(Strong) good suffix rule
Shift the pattern so that another occurrence of the matched suffix is aligned with the matched text
Strong: the character before that occurrence must be different from the one that failed
This constraint was not used in the original BM algorithm
cf. Knuth's rule in KMP
Take the larger of the two shifts
[Figure: the failed position and the successfully matched suffix, before and after the shift; "Different = strong"]
Boyer-Moore (3)
Bad character rule example
Pattern: TTCCAAGTCGCC
Text: CCCTGTCCATGCCGTCAGCCC
Do not consider the last character
[Figure: the comparison fails at a text character T, and the pattern is shifted so that its last T is aligned with that position]
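A minimal Python sketch of the Horspool algorithm, which searches with a bad-character table only. One assumption to note: as in the usual formulation of Horspool, the shift is looked up for the text character under the last pattern position (not the character at the failed position), and the last pattern character is left out when the table is built. The name `horspool_search` is illustrative.

```python
def horspool_search(text, pattern):
    """Return all start indices of pattern in text (Horspool variant)."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # Distance from the last occurrence of each character to the end of
    # the pattern; the final character itself is not considered.
    shift = {}
    for k in range(m - 1):
        shift[pattern[k]] = m - 1 - k
    hits = []
    pos = 0
    while pos <= n - m:
        j = m - 1
        # Compare from the right end of the pattern.
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1
        if j < 0:
            hits.append(pos)
        # Characters that do not occur in the table allow a full shift of m.
        pos += shift.get(text[pos + m - 1], m)
    return hits
```

The table needs one pass over the pattern, which is why preprocessing for the bad character rule is considered easy compared with the good suffix rule.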
Boyer-Moore (4)
(Strong) Good suffix rule example
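Pattern: CGTATATCCAATATC
Text: AGTCCCTCGGTCCGATATCGACCCTCCCG
[Figure: the suffix ATATC matches the text and the comparison fails at the character before it. The pattern is shifted so that its other occurrence of ATATC, which is preceded by a different character, is aligned with the matched text.]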
Boyer-Moore (5)
Preprocessing
Bad character rule: very easy
Good suffix rule
Linear time by running the Z algorithm backward (i.e., on the reversed pattern)
Boyer-Moore (6)
Computational time complexity
Average-case: O(n / min(m, alphabet size))
i.e., the average-case skip length is O(min(m, alphabet size))
The Horspool algorithm has the same time complexity
Worst-case: O(nm)
Bad for cases:
Many repeats » KMP is faster
Small alphabet size » Shift-Or is faster
Linear time for finding only 1 occurrence
Good for grep in editors
Worst-case O(n) algorithms based on BM
Turbo-BM (Crochemore et al. '92), Galil (1979), Smyth (2000), Apostolico-Giancarlo (1986), etc.
Turbo-BM
Turbo-shift
Additional rule that can be applied for a new shift after the strong good suffix rule-based shift
A = B, but ① ≠ ②, so B cannot overlap with
+ Consider the bad character rule too.
[Figure: the previous, current, and next alignments of the pattern against the text, showing the previous shift, the substrings A and B, the differing characters ① (z) and ② (not z), the failed positions, the shift given by the strong good suffix rule, and the turbo-shift to the next position]
Rabin-Karp (1)
Based on fingerprinting (i.e., hashing)
e.g., hash(x[0..n-1]) = (x[0] d^{n-1} + x[1] d^{n-2} + x[2] d^{n-3} + … + x[n-1]) mod q
q: some prime number
Pattern P → hash(P)
Text → hash(T[0..|P|-1]), hash(T[1..|P|]), hash(T[2..|P|+1]), ...
O(1) computation for each
Compare with hash(P) at first
Rabin-Karp (2)
Pattern: 10111, hash = (16+4+2+1) mod 5 = 3
Text: 11001101110100101...
Window 11001: (16+8+1) mod 5 = 0
Window 10011: ((0-1·16)·2+1) mod 5 = 4, O(1)
Window 00110: ((4-1·16)·2+0) mod 5 = 1, O(1)
Window 01101: ((1-0·16)·2+1) mod 5 = 3, O(1), check → NO
Window 11011: ((3-0·16)·2+1) mod 5 = 2, O(1)
Window 10111: ((2-1·16)·2+1) mod 5 = 3, O(1), check → YES!
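The rolling update can be sketched in Python as below. The defaults `d=256` and `q=1_000_003` and the `value` parameter (which maps a character to a number) are assumptions for illustration; calling the function with `d=2, q=5, value=int` reproduces the arithmetic of the binary example.

```python
def rabin_karp(text, pattern, d=256, q=1_000_003, value=ord):
    """Return all start indices of pattern in text using a rolling hash."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(d, m - 1, q)  # weight d^(m-1) of the leading character, mod q
    hp = ht = 0
    for k in range(m):
        hp = (hp * d + value(pattern[k])) % q
        ht = (ht * d + value(text[k])) % q
    hits = []
    for i in range(n - m + 1):
        # Equal fingerprints can be a false positive, so verify explicitly.
        if hp == ht and text[i:i + m] == pattern:
            hits.append(i)
        if i + m < n:
            # Drop text[i], shift by one digit, append text[i + m]: O(1).
            ht = ((ht - value(text[i]) * high) * d + value(text[i + m])) % q
    return hits
```

Python's `%` always returns a non-negative result for a positive modulus, so the subtraction needs no extra correction; in languages where the remainder can be negative, `q` would have to be added back before the next step.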
Shift-And Method
Bit-parallel (32 or 64) computation!
Efficient when the alphabet is small
Bit representation of each pattern character (columns A C G T)
  ACGT
T 0001
T 0001
A 1000
T 0001
T 0001
G 0010
C 0100
G 0010
Update for each text character: X ← (shift(X, 1 bit) or 1) and B_A (B_A: the bit representation for the text character, here A)
Matching table: one row per pattern character, one column per text position; 1 if matched
Start from 0
Text
  TTTACGTATTATTACGTCC..
Pattern
T 01110001011011000100..
T 00110000001001000000..
A 00001000000100100000..
T 00000000000010000000..
T 00000000000001000000..
G 00000000000000100000..
C 00000000000000010000..
G 00000000000000001000..
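A hedged Python sketch of the Shift-And update. Assumptions: bit j of the state stands for "the first j+1 pattern characters match, ending at the current text position", Python integers stand in for a machine word (so there is no 32- or 64-character limit here), and the name `shift_and_search` is illustrative.

```python
def shift_and_search(text, pattern):
    """Return all start indices of pattern in text (Shift-And)."""
    m = len(pattern)
    if m == 0:
        return []
    # masks[c] has bit j set exactly when pattern[j] == c.
    masks = {}
    for j, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << j)
    accept = 1 << (m - 1)
    state = 0  # start from 0: nothing matched yet
    hits = []
    for i, ch in enumerate(text):
        # Extend every partial match by one character, start a new match
        # at bit 0, then keep only the positions that agree with ch.
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            hits.append(i - m + 1)
    return hits
```

Each text character costs one shift, one OR, and one AND regardless of the pattern length, as long as the pattern fits in a machine word.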
Shift-Or Method
Just reverse the bits!
((001001 << 1) OR 000001) AND 010010
vs.
(110110 << 1) OR 101101
1.5 times faster?!
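The complemented version can be sketched in Python as follows. Here a 0 bit means "matched", so the shift brings in the new 0 by itself and only one OR is needed per character. The final `& full` is an assumption needed only because Python integers are unbounded; with a fixed-width machine word it would not be required. The name `shift_or_search` is illustrative.

```python
def shift_or_search(text, pattern):
    """Return all start indices of pattern in text (Shift-Or)."""
    m = len(pattern)
    if m == 0:
        return []
    full = (1 << m) - 1
    # masks[c] has bit j cleared exactly when pattern[j] == c.
    masks = {}
    for j, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, full) & ~(1 << j)
    accept = 1 << (m - 1)
    state = full  # all ones: nothing matched yet
    hits = []
    for i, ch in enumerate(text):
        # The shift introduces a 0 at bit 0, which starts a new match.
        state = ((state << 1) | masks.get(ch, full)) & full
        if (state & accept) == 0:
            hits.append(i - m + 1)
    return hits
```

Compared with Shift-And, the inner update drops one operation (shift and OR instead of shift, OR, and AND), which is the source of the speed-up suggested above.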
Summary
String searching algorithms
Brute-force: Naive, Rabin-Karp, Shift-Or
From left: KMP, AC
From right: BM, Horspool, Turbo-BM
Next week
Suffix arrays