Post on 17-Oct-2020
Advanced Algorithms / T. Shibuya
Advanced Algorithms:
Text Algorithms
Tetsuo Shibuya
Human Genome Center, Institute of Medical Science
(Adjunct at Department of Computer Science)
University of Tokyo
http://www.hgc.jp/~tshibuya
Self Introduction
Affiliation: Laboratory of Sequence Analysis, Human Genome Center, Institute of Medical Science
Research interest: Bioinformatics algorithms
Our lab is located on the 4th floor
The topics of this lecture
Today (July 2nd)
Text searching algorithms: Knuth-Morris-Pratt / Boyer-Moore / etc.
Next week (July 9th)
Text indexing algorithms: Suffix arrays and their applications
The final week (July 16th)
Text compression algorithms: LZ77 / LZ78 / LZW / Arithmetic coding / Block sorting / etc.
Reports
Please submit a report for the homework that I will give on the last day
or
Please submit scribe notes for one of my three lectures
In TeX format as for the previous lectures
One volunteer (if any) for one lecture
The submitted notes will be put on the web page
Textbooks
D. Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge University Press, 1997.
The most famous book on text processing algorithms, but many parts are out of date.
W. Sung, Algorithms in Bioinformatics, CRC Press, 2009.
A good introduction to bioinformatics algorithms (mainly on text processing).
D. Salomon, Data Compression, 3rd Edition, Springer, 2004.
Related to the topic of the last day. (A very heavy book!)
Today's topic
Text processing algorithms
Brute-force algorithm
Knuth-Morris-Pratt algorithm
Colussi algorithm
Aho-Corasick algorithm
Boyer-Moore algorithm
Horspool algorithm
Turbo-BM algorithm
Rabin-Karp algorithm
Shift-Or method
etc.
Text matching
Problem
Given: a text string T and a pattern (query) P
Output: the substrings of T that are exactly the same as P, if any
Exact matching: no insertion / deletion / modification (mutation)
Two approaches
Preprocess only the query pattern (today)
Preprocess the text beforehand (next week)
Text:
GGTGAGAAGTTATGATACAGGGTAGTTG
TGTCCTTAAGGTGTATAACGATGACATC
ACAGGCAGCTCTAATCTCTTGCTATGAG
TGATGTAAGATTTATAAGTACGCAAATT
Pattern (Query): TATAA
Two types of text matching algorithms
Skipping positions unnecessary to compare
Check from left
Knuth-Morris-Pratt
Aho-Corasick (for multiple queries)
Check from right
Boyer-Moore, Horspool, Turbo-BM
Brute-force
Naive algorithm
Fingerprinting (hash-based) algorithm: Rabin-Karp
Bitwise computation-based algorithm: Shift-Or (Shift-And)
Naive algorithm
Just check one by one at each position
O(nm) in the worst case, but...
Linear time on average!
Not so bad for cases when you have no time to implement :-)
But in practice it is still much slower than the more sophisticated algorithms.
Text: GGGACCAAGTTCCGCACATGCCGGATAGAAT
Pattern: CCGTATG
[Figure: the pattern is checked one by one at each position of the text; most positions fail at the first character]
Average length to check: 1 + 1/4 + (1/4)^2 + ... = 4/3 (constant!)
(for a random DNA sequence)
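A minimal Python sketch of the naive algorithm described above. The function name naive_search is hypothetical (it is not from the lecture), and the sketch simply reports every start position of the pattern.

```python
def naive_search(text, pattern):
    """Return every start index where pattern occurs in text (brute force)."""
    n, m = len(text), len(pattern)
    hits = []
    for start in range(n - m + 1):
        k = 0
        # Compare from the left and stop at the first mismatch.
        while k < m and text[start + k] == pattern[k]:
            k += 1
        if k == m:
            hits.append(start)
    return hits
```

On the text of this slide, naive_search("GGGACCAAGTTCCGCACATGCCGGATAGAAT", "CCG") returns [11, 20], while the full pattern CCGTATG does not occur.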
Knuth-Morris-Pratt(1)
Improvement of the brute-force algorithm
The brute-force algorithm sometimes checks the same position more than once, which could be a waste of time
→ Knuth-Morris-Pratt Algorithm
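Pattern: TAGTAGC
Text: AATACTAGTAGGCATGCCGGAT
[Figure: the pattern is checked from the left at each position of the text. After "TAGTAG" matches and the last character fails, the next two positions are skipped.]
We already know that the text there is "TAGTAG", so before any comparison we know that the pattern cannot match at these positions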
Knuth-Morris-Pratt (2)
Suppose P[0..i] matches the text but P[i+1] does not. Then
FailureLink[i+1] = max j s.t. P[0..j] = P[i-j..i], P[j+1] ≠ P[i+1], and j < i
If no such j exists, let FailureLink[i+1] = -2 if P[i+1] = P[0], otherwise let FailureLink[i+1] = -1
FailureLink[i] can be computed before searching the text!
We can shift the pattern by i-FailureLink[i+1] positions (i.e., skip the i-1-FailureLink[i+1] positions in between)
You don't have to check these positions again!
[Figure: the matching failed at position i+1. The failure link points to the longest match with the prefix, and the characters that follow the two copies should be different (Knuth's rule).]
Knuth-Morris-Pratt (3)
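Text: CTACTGATCTGATCGCTAGATGC
Pattern: CTGATCTGC
[Figure: four successive positions of the pattern against the text, annotated as follows]
1st position: fails at the third character, no overlap, so skip 1 position
2nd position: failed at the first position, so just proceed
3rd position: fails at the last character, overlap of "CTG"
4th position: no overlap under Knuth's rule. MP skips only 4 positions, KMP skips 5 positions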
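A minimal Python sketch of KMP searching with the failure link defined on the previous slide. As an assumption of this sketch, the links are computed naively and directly from that definition (not in linear time), position 0 is treated as a special case, and the names failure_links and kmp_search are hypothetical.

```python
def failure_links(p):
    """FailureLink[k] for a mismatch at pattern position k, straight from the definition."""
    m = len(p)
    fl = [-1] * m                      # entry 0 is unused by kmp_search below
    for k in range(1, m):              # k = i + 1 in the slide's notation
        i = k - 1
        best = None
        for j in range(i - 1, -1, -1): # j < i, largest j first
            if p[:j + 1] == p[i - j:i + 1] and p[j + 1] != p[k]:
                best = j
                break
        if best is None:
            best = -2 if p[k] == p[0] else -1
        fl[k] = best
    return fl


def kmp_search(text, pattern):
    """Return all start indices of pattern in text, scanning the text once."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    fl = failure_links(pattern)
    # Longest proper prefix that is also a suffix, used after a full match.
    border = next(j for j in range(m - 1, -1, -1) if pattern[:j] == pattern[m - j:])
    hits = []
    t = k = 0                          # t: text index, k: pattern index
    while t < n:
        if text[t] == pattern[k]:
            t += 1
            k += 1
            if k == m:
                hits.append(t - m)
                k = border
        elif k == 0:
            t += 1                     # mismatch at the first position: just proceed
        else:
            k = fl[k] + 1              # next pattern position to compare with text[t]
            if k < 0:                  # link -2: text[t] cannot start a match either
                k = 0
                t += 1
    return hits
```

For the pattern CTGATCTGC used above, failure_links gives 2 at position 8 (the overlap of "CTG") and -1 at position 6 (no overlap under Knuth's rule).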
Knuth-Morris-Pratt (4)
Preprocessing
A naive algorithm requires O(m^2) or even O(m^3) time
Linear time algorithm exists
Use the KMP itself
Z algorithm [Gusfield 97]
Not faster than the KMP, but easier to understand
Z Algorithm (1)
Z_i
Compute it for all i
Longest common prefix length of S[0..n-1] and S[i..n-1]
right_i
Max value of x+Z_x-1 (x<i)
left_i
The x that attains the maximum value of x+Z_x-1 (x<i)
Initialization
Z_1 = right_1 = left_1 = 0
[Figure: the Z box S[left_i..right_i], whose length is Z_{left_i}, is a copy of the prefix of S that starts at position 0]
Z Algorithm (2)
Computation of Z_{i+1}
In case i+1 ≤ right_i
We have already computed everything up to the position right_i
Let i' = i-left_i. In case Z_{i'+1} < right_i-i, we can copy the answer in O(1): Z_{i+1} = Z_{i'+1}
Otherwise compare naively after the position right_i (①)
In case i+1 > right_i
Compare naively (②)
①+② can be done in linear time in total!
The Z algorithm itself is also a text matching algorithm
Compute Z_i against P$T
P: pattern, T: text, $: some character that appears in neither P nor T
[Figure: position i+1 lies inside the Z box S[left_i..right_i], which is a copy of the prefix S[0..right_i-left_i]; the corresponding position in the prefix is i'+1, so Z_{i+1} = Z_{i'+1}]
Z Algorithm (3)
Example
Text: ATGCGCATAATGCGCTGAATGGCCATAATCTGAA
Z_i:  0000002016000000013000002012000011
[Figure: Z_i has been computed up to a position inside the current Z box (from left to right), which is the same text as the prefix. To compute Z_i for the next position, just copy the numbers from the prefix if they are smaller than 3.]
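A minimal Python sketch of the Z algorithm and of matching via P$T. It is one common formulation, not necessarily the lecture's exact indexing: as assumptions of this sketch, the Z box is kept as a half-open interval [left, right), Z_0 is left at 0 as in the table above, and the separator defaults to the NUL character, which is assumed to occur in neither P nor T. The names z_array and z_search are hypothetical.

```python
def z_array(s):
    """z[i] = length of the longest common prefix of s and s[i:], with z[0] left at 0."""
    n = len(s)
    z = [0] * n
    left = right = 0                   # [left, right) is the Z box reaching furthest right
    for i in range(1, n):
        if i < right:
            # Inside the Z box: copy the value from the prefix, capped at the box end.
            z[i] = min(right - i, z[i - left])
        # Compare naively beyond what is already known.
        while i + z[i] < n and s[i + z[i]] == s[z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def z_search(text, pattern, sep="\0"):
    """Find pattern in text by computing Z on pattern + sep + text."""
    m = len(pattern)
    z = z_array(pattern + sep + text)
    return [i - m - 1 for i in range(m + 1, len(z)) if z[i] == m]
```

Running z_array on the example text above reproduces the Z_i row digit by digit.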
Z_i & Failure Link (F[])
Failure links can be obtained by just scanning the Z_i table
Initialize FailureLink[] with -1
For each i, in increasing order (compute in this order):
if (FailureLink[i+Z_i] = -1) FailureLink[i+Z_i] = Z_i-1
Knuth's rule (post processing) then gives the -2 entries
pattern GTAGGCATGTAGCGTAGG
i       0123456789........
Z_i     000110004001030011
Flink   AAAB00AABAAB3BAA20   (A: -1, B: -2)
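A minimal Python sketch of the scan described above. As assumptions of this sketch, entries are only written when Z_i > 0 and i+Z_i is inside the pattern, position 0 keeps the value -1, and Knuth's rule is applied as a final pass that turns the remaining -1 entries with P[k] = P[0] into -2. The helper z_array and the name failure_links_from_z are hypothetical.

```python
def z_array(s):
    """z[i] = length of the longest common prefix of s and s[i:], with z[0] left at 0."""
    n = len(s)
    z = [0] * n
    left = right = 0
    for i in range(1, n):
        if i < right:
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[i + z[i]] == s[z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def failure_links_from_z(p):
    """Failure links obtained by scanning the Z table of the pattern once."""
    m = len(p)
    z = z_array(p)
    fl = [-1] * m
    for i in range(1, m):              # increasing i, so the longest overlap is set first
        k = i + z[i]                   # position where the match starting at i breaks
        if z[i] > 0 and k < m and fl[k] == -1:
            fl[k] = z[i] - 1
    for k in range(1, m):              # Knuth's rule as post processing
        if fl[k] == -1 and p[k] == p[0]:
            fl[k] = -2
    return fl
```

On the first 13 characters of the pattern above (GTAGGCATGTAGC) this yields -1 -1 -1 -2 0 0 -1 -1 -2 -1 -1 -2 3, the same values as the start of the Flink row.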
Computational Time Complexity of KMP
O(m+n)
n: text length, m: pattern length
Worst-case time complexity: #comparison < 2n-1
In practice, it is usually not faster than the Boyer-Moore or Shift-Or algorithms,
though these algorithms do not achieve worst-case linear time
Colussi Algorithm (A Variation of the KMP)
#comparison < 3n/2
Check the positions with the KMP strong rule later
Skip lengths are different from KMP
Preprocessing is also in linear time
Practically not much faster, though
cf. The Galil-Giancarlo algorithm achieves #comparison < 4n/3
[Figure: the pattern G a t G c t c a t G A T G t c c G A T G C c G t, annotated with the values of FailureLink[i]+1 and with the positions affected by the strong rule. The positions are checked in two steps (Step 1, Step 2), in the order shown.]
KMP and an automaton
KMP can be described by an automaton
[Figure: automaton for the pattern ATATTG, with one forward edge per pattern character and the failure links]
Aho-Corasick (1)
The automaton can be extended to deal with multiple queries!
Linear time construction!
Linear time searching!
[Figure: keyword tree over the characters A, C, G, T with failure links. A failure link goes to the root if not specified.]
Aho-Corasick (2)
Construction of the keyword tree
O(M) time
M: Sum of query string lengths
Can be used for dictionary searching
[Figure: the same keyword tree, without the failure links]
Aho-Corasick (3)
Breadth-first searching
Start from the root
No failure link at the root
FailureLink(v)
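Traverse the failure links from v's parent to find a node that has a child w with the same label as v, and let (the nearest) w be FailureLink(v)
If no such node exists, let FailureLink(v) = root
[Figure: a keyword tree with edge labels a, b, c, showing a node v and the node w that becomes FailureLink(v)]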
Aho-Corasick (4)
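Why is it linear time?
[Figure: the failure links to be made along one pattern of length m. Each failure link leads to a suffix that is at least 1 shorter; among all the suffixes of the pattern, only the existing paths from the root in the tree are followed, so we traverse at most O(m) nodes]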
Aho-Corasick (5)
OutLink(v)
Pointers to the nodes whose patterns must be output at v
Computation of OutLink()
Traverse the failure links to find a leaf if any
If there's no such leaf, there's no need to set the outlink
Also in linear time
Patterns: 1 together, 2 ether, 3 get, 4 her, 5 he
[Figure: keyword tree for these five patterns, with the pattern numbers 1 to 5 attached to the nodes where the patterns end]
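A minimal Python sketch of the Aho-Corasick construction and search described on these slides. As assumptions of this sketch, the keyword tree is stored as a list of dictionaries, node 0 is the root, and instead of explicit OutLink pointers each node keeps a merged list of the patterns to output there. The names build_automaton and ac_search are hypothetical.

```python
from collections import deque


def build_automaton(patterns):
    """Keyword tree plus failure links (breadth-first) and merged output lists."""
    goto, fail, out = [{}], [0], [[]]
    for p in patterns:
        v = 0
        for ch in p:
            if ch not in goto[v]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[v][ch] = len(goto) - 1
            v = goto[v][ch]
        out[v].append(p)
    queue = deque(goto[0].values())    # children of the root keep the root as failure link
    while queue:
        v = queue.popleft()
        for ch, w in goto[v].items():
            f = fail[v]
            # Follow failure links from the parent until a node has a child labeled ch.
            while f and ch not in goto[f]:
                f = fail[f]
            fail[w] = goto[f].get(ch, 0)
            out[w] = out[w] + out[fail[w]]   # patterns ending at w via shorter suffixes
            queue.append(w)
    return goto, fail, out


def ac_search(text, patterns):
    """Return (start index, pattern) for every occurrence of every pattern."""
    goto, fail, out = build_automaton(patterns)
    hits = []
    v = 0
    for i, ch in enumerate(text):
        while v and ch not in goto[v]:
            v = fail[v]
        v = goto[v].get(ch, 0)
        for p in out[v]:
            hits.append((i - len(p) + 1, p))
    return hits
```

With the five patterns of the slide, searching the text "altogether" reports together at 2, get at 4, ether at 5, and both he and her at 7.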
Regular expression search based on automata (1)
Regular expression
Concatenation: A, B → AB
Or: A, B → A+B
Repeat: A → A*
Extension of Aho-Corasick
Example: AB(A+B)(AB+CD)*B
ABABABBBABAABBABACDBABBABBABBCDBABAABABBABAABCDB...
Regular expression search based on automata (2)
Construct the automaton for a regular expression
Example: (A*B+AC)D
[Figure: automata with ε transitions for the three building blocks AB, A+B and A*, and the resulting automaton for (A*B+AC)D from Start to End]
Regular expression search based on automata (3)
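O(nm)
DP over the reachable nodes (not including ε states)
You can start anywhere
[Figure: the automaton for (A*B+AC)D from Start to End, with states numbered 0 to 8]
Text: CDAABCAAABDDACDAAC
[Figure: below each text character, the set of nodes reachable after reading it. State 8 is reached at two positions: Found!]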
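A minimal Python sketch of the DP over reachable states for the example (A*B+AC)D. As assumptions of this sketch, the automaton is built by hand without ε transitions and its state numbers 0 to 4 are the sketch's own (they are not the numbers 0 to 8 of the figure). The names TRANSITIONS and regex_match_ends are hypothetical.

```python
# Hand-built, epsilon-free automaton for (A*B+AC)D.
TRANSITIONS = {
    (0, "A"): {1, 2},   # start: enter the A* loop (1) or the branch AC (2)
    (0, "B"): {3},      # A* may be empty, so B alone finishes the first branch
    (1, "A"): {1},      # A* loop
    (1, "B"): {3},
    (2, "C"): {3},
    (3, "D"): {4},
}
START, ACCEPT = 0, 4


def regex_match_ends(text):
    """Return the end positions of all substrings of text that match (A*B+AC)D."""
    active = set()
    ends = []
    for pos, ch in enumerate(text):
        active.add(START)              # a match can start anywhere
        reachable = set()
        for state in active:
            reachable |= TRANSITIONS.get((state, ch), set())
        active = reachable
        if ACCEPT in active:
            ends.append(pos)
    return ends
```

Each text character costs time proportional to the number of states, which gives the O(nm) bound. On the text CDAABCAAABDDACDAAC the sketch reports matches ending at positions 10 and 14.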
Boyer-Moore (1)
Idea
Almost the same as KMP, but check from right!
Practically faster than KMP
Good average-case time complexity
Bad worst-case time complexity
Text: AATTGTTCCGGCCATGCCGGAT
Pattern: GTTCGTT
[Figure: the pattern is checked from the right. The first position matches "GTT" and then fails, so the pattern is skipped based on the information of "GTT"; a later position is skipped based on the information of "G".]
Boyer-Moore (2)
Two rules
Bad character rule
If the text character at the failed position is x, we can shift the pattern so that the last x in the pattern moves to that position
The algorithm that uses only this rule is called the Horspool algorithm
(Strong) Good suffix rule
Strong: the character before the same substring must be different
This constraint was not used in the original BM algorithm
cf. Knuth's rule in KMP
Take the larger of the two shifts above
[Figure: a suffix of the pattern matched (Success) and the next character failed (Failed). The pattern is shifted to another occurrence of the same substring; under the strong rule the character before it must be different.]
Boyer-Moore (3)
Bad character rule example
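Pattern: TTCCAAGTCGCC
Do not consider the last character
Text: CCCTGTCCATGCCGTCAGCCC
[Figure: two successive positions of the pattern. After the comparison failed, the last T in the pattern is moved to the text position of the bad character T.]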
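A minimal Python sketch of searching with the bad character rule only. As an assumption of this sketch, it follows the usual Horspool formulation, where the shift is looked up for the text character under the last pattern position (rather than at the failed position), and the last pattern character is not considered when building the table. The name horspool_search is hypothetical.

```python
def horspool_search(text, pattern):
    """Return all start indices of pattern in text, comparing from the right."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # Shift for each character: distance from its last occurrence in
    # pattern[:-1] to the end of the pattern. Unknown characters shift by m.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    hits = []
    pos = 0
    while pos <= n - m:
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1
        if j < 0:
            hits.append(pos)
        pos += shift.get(text[pos + m - 1], m)
    return hits
```

For example, horspool_search("TGTCCTTAAGGTGTATAACGATGACATC", "TATAA") returns [13].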
Boyer-Moore (4)
(Strong) Good suffix rule example
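Pattern: CGTATATCCAATATC
Text: AGTCCCTCGGTCCGATATCGACCCTCCCG
[Figure: two successive positions of the pattern. After the comparison failed, the pattern is shifted by the (strong) good suffix rule.]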
Boyer-Moore (5)
Preprocess
Bad character rule
Very easy
Good suffix rule
Linear time by running the Z algorithm backward (from the end of the pattern)
Boyer-Moore (6)
Computational time complexity
Average-case O(n / min(m, alphabet size))
i.e., the average-case skip length is O(min(m, alphabet size))
The Horspool algorithm has the same time complexity
Worst-case O(nm)
Bad for cases:
Many repeats
» KMP is faster
Small alphabet size
» Shift-Or is faster
Linear time for finding only 1 occurrence
Good for grep in editors
Worst-case O(n) algorithms based on BM
Turbo-BM (Crochemore et al. '92), Galil (1979), Smyth (2000), Apostolico-Giancarlo (1986), etc.
Turbo-BM
Turbo-shift
Additional rule that can be applied to a new shift after a shift based on the strong good suffix rule
A = B, but ① ≠ ②, so B cannot overlap with
+ Consider the bad character rule too.
[Figure: the text and the pattern at the previous, current and next positions. The shifts are labeled "strong good suffix rule" and "turbo-shift". A and B mark equal substrings, and ① (z) and ② (not z) mark the differing characters at the failed positions.]
Rabin-Karp (1)
Based on fingerprinting (i.e., hashing)
e.g., hash(x[0..n-1]) = (x[0]·d^{n-1} + x[1]·d^{n-2} + x[2]·d^{n-3} + … + x[n-1]) mod q
q: some prime number
Pattern p → hash(p)
Text → hash(T[0..|P|-1]), hash(T[1..|P|]), hash(T[2..|P|+1]), ...
O(1) computation for each
Compare with hash(p) first
Rabin-Karp (2)
Pattern: 10111, hash = (16+4+2+1) mod 5 = 3
Text: 11001101110100101...
hash(T[0..4]) = (16+8+1) mod 5 = 0
hash(T[1..5]) = ((0-1·16)·2+1) mod 5 = 4
hash(T[2..6]) = ((4-1·16)·2+0) mod 5 = 1
hash(T[3..7]) = ((1-0·16)·2+1) mod 5 = 3 → check → NO
hash(T[4..8]) = ((3-0·16)·2+1) mod 5 = 2
hash(T[5..9]) = ((2-1·16)·2+1) mod 5 = 3 → check → YES!
Each update is O(1)
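A minimal Python sketch of Rabin-Karp with the rolling hash shown above. As assumptions of this sketch, the default base d = 256 and the prime q = 1000003 are illustrative choices, the parameter code (which maps a character to a number) is hypothetical, and every hash hit is verified by a direct comparison.

```python
def rabin_karp(text, pattern, d=256, q=1_000_003, code=ord):
    """Return all start indices of pattern in text using a rolling hash mod q."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(d, m - 1, q)            # weight d^(m-1) of the leftmost character
    hp = ht = 0
    for i in range(m):
        hp = (hp * d + code(pattern[i])) % q
        ht = (ht * d + code(text[i])) % q
    hits = []
    for s in range(n - m + 1):
        # Compare the hashes first, then check the candidate character by character.
        if hp == ht and text[s:s + m] == pattern:
            hits.append(s)
        if s < n - m:
            # O(1) update: drop text[s], shift by d, append text[s + m].
            ht = ((ht - code(text[s]) * high) * d + code(text[s + m])) % q
    return hits
```

With d=2, q=5 and code=int, the call rabin_karp("11001101110100101", "10111", d=2, q=5, code=int) goes through the same hash values as the example above and returns [5].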
Shift-And Method
Bit-parallel (32 or 64) computation!
Efficient for the small alphabet size case
Bit representation of the pattern TTATTGCG (1 if matched)
    ACGT
T   0001
T   0001
A   1000
T   0001
T   0001
G   0010
C   0100
G   0010
For each text character A: X ← (shift(X, 1 bit) or 1) and B_A (B_A: the bit vector of the character A)
Start from 0
Text (columns) and pattern (rows):
Text TTTACGTATTATTACGTCC..
T   01110001011011000100..
T   00110000001001000000..
A   00001000000100100000..
T   00000000000010000000..
T   00000000000001000000..
G   00000000000000100000..
C   00000000000000010000..
G   00000000000000001000..
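A minimal Python sketch of the Shift-And method. As assumptions of this sketch, bit i of the state corresponds to the pattern prefix of length i+1, Python's unbounded integers stand in for a 32-bit or 64-bit machine word, and the name shift_and_search is hypothetical.

```python
def shift_and_search(text, pattern):
    """Return all start indices of pattern in text using bit-parallel Shift-And."""
    m = len(pattern)
    if m == 0:
        return []
    # masks[c] has bit i set if pattern[i] == c.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    accept = 1 << (m - 1)
    state = 0                          # start from 0
    hits = []
    for t, ch in enumerate(text):
        # Extend every partial match by one character and start a new one.
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            hits.append(t - m + 1)
    return hits
```

For example, shift_and_search("ACTTATTGCGA", "TTATTGCG") returns [2].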
Shift-Or Method
Just reverse the bits!
Shift-And: ((001001 << 1) OR 000001) AND 010010
vs.
Shift-Or: (110110 << 1) OR 101101
1.5 times faster?!
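A minimal Python sketch of the Shift-Or method, i.e., Shift-And with all bits reversed so that 0 means matched. As assumptions of this sketch, the state is masked to m bits because Python integers are unbounded, and the name shift_or_search is hypothetical.

```python
def shift_or_search(text, pattern):
    """Return all start indices of pattern in text using bit-parallel Shift-Or."""
    m = len(pattern)
    if m == 0:
        return []
    full = (1 << m) - 1
    # masks[c] has bit i cleared if pattern[i] == c (0 means matched).
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, full) & ~(1 << i)
    accept = 1 << (m - 1)
    state = full                       # all ones: nothing matched yet
    hits = []
    for t, ch in enumerate(text):
        # The shift brings in a 0 at the lowest bit, so no extra OR with 1 is needed.
        state = ((state << 1) | masks.get(ch, full)) & full
        if (state & accept) == 0:
            hits.append(t - m + 1)
    return hits
```

The update needs one shift and one OR per text character, compared with a shift, an OR and an AND in Shift-And.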
Summary
String searching algorithms
Brute-force: Naive, Rabin-Karp, Shift-Or
From left: KMP, AC
From right: BM, Horspool, Turbo-BM
Next week
Suffix arrays