Advanced Algorithms / T. Shibuya

Advanced Algorithms:

Text Algorithms

Tetsuo Shibuya

Human Genome Center, Institute of Medical Science

(Adjunct at Department of Computer Science)

University of Tokyo

http://www.hgc.jp/~tshibuya

Self Introduction

Affiliation: Laboratory of Sequence Analysis, Human Genome Center, Institute of Medical Science

Research interest: Bioinformatics algorithms

Our lab is located on the 4th floor

The topics of this lecture

Today (July 2nd)

Text searching algorithms: Knuth-Morris-Pratt, Boyer-Moore, etc.

Next week (July 9th)

Text indexing algorithms: suffix arrays and their applications

The final week (July 16th)

Text compression algorithms: LZ77, LZ78, LZW, arithmetic coding, block sorting, etc.

Reports

Please submit a report for the homework that I will give on the last day

Please submit scribe notes for one of my three lectures

In TeX format as for the previous lectures

One volunteer (if any) for one lecture

The submitted notes will be put on the web page

Textbooks

D. Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge University Press, 1997.

The most famous book on text processing algorithms, but many parts are out of date.

W. Sung, Algorithms in Bioinformatics, CRC Press, 2009.

Good introduction to bioinformatics algorithms (mainly on text processing).

D. Salomon, Data Compression, 3rd Edition, Springer, 2004.

Related to the topic of the last day. (Very heavy book!)

Today's topic

Text processing algorithms

Brute-force algorithm

Knuth-Morris-Pratt algorithm

Colussi algorithm

Aho-Corasick algorithm

Boyer-Moore algorithm

Horspool algorithm

Turbo-BM algorithm

Rabin-Karp algorithm

Shift-Or method

Text matching

Problem

Given: a text string T and a pattern (query) P

Output: all substrings of T that are exactly the same as P, if any

Exact matching: no insertion, deletion, or modification (mutation)

Two approaches

Preprocess only the query pattern (today)

Preprocess the text beforehand (next week)

GGTGAGAAGTTATGATACAGGGTAGTTG

TGTCCTTAAGGTGTATAACGATGACATC

ACAGGCAGCTCTAATCTCTTGCTATGAG

TGATGTAAGATTTATAAGTACGCAAATT

Pattern (Query)

Two types of text matching algorithms

Skipping positions unnecessary to compare

Check from left: Knuth-Morris-Pratt, Aho-Corasick (for multiple queries)

Check from right: Boyer-Moore, Horspool, Turbo-BM

Brute-force: naive algorithm

Fingerprinting (hash-based) algorithm: Rabin-Karp

Bitwise computation-based algorithm: Shift-Or (Shift-And)

Naive algorithm

Just check one by one at each position: O(nm) in the worst case, but...

Linear time on average! Not so bad when you have no time to implement anything else :-)

But it is still much slower in practice than the more sophisticated algorithms.

Text: GGGACCAAGTTCCGCACATGCCGGATAGAAT

Pattern: CCGTATG

Average number of characters checked per position: 1 + 1/4 + (1/4)^2 + ... = 4/3 (constant!)

(for a random DNA sequence)

Check one by one
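A minimal sketch of the naive algorithm described above. The function name `naive_search` is an illustrative choice, not something from the lecture.

```python
def naive_search(text: str, pattern: str) -> list[int]:
    """Return every start index where pattern occurs in text (brute force)."""
    n, m = len(text), len(pattern)
    hits = []
    for start in range(n - m + 1):
        k = 0
        # Compare left to right and stop at the first mismatch.
        while k < m and text[start + k] == pattern[k]:
            k += 1
        if k == m:
            hits.append(start)
    return hits


print(naive_search("GGGACCAAGTTCCGCACATGCCGGATAGAAT", "CCG"))  # [11, 20]
```

On random DNA the inner loop usually stops after one or two comparisons, which is where the 4/3 average comes from.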

Knuth-Morris-Pratt(1)

Improvement of the brute-force algorithm

The brute-force algorithm sometimes checks the same text position more than once, which is a waste of time

→ Knuth-Morris-Pratt algorithm

Pattern: TAGTAGC

Check from left

Text: AATACTAGTAGGCATGCCGGAT

TAGTAGc (the last character fails)

Before any further comparison, we already know that this part of the text is "TAGTAG", so the pattern cannot match at these positions

Knuth-Morris-Pratt (2)

If P[0..i] matches the text but P[i+1] does not, then

FailureLink[i+1] = max j such that P[0..j] = P[i-j..i], P[j+1] ≠ P[i+1], and j < i

If no such j exists, let FailureLink[i+1] = -2 if P[i+1] = P[0]; otherwise let FailureLink[i+1] = -1.

FailureLink[] can be computed before searching the text!

We can shift the pattern by i - FailureLink[i+1] positions (the matched length i+1 minus the overlap length FailureLink[i+1]+1)

P[j+1] and P[i+1] should be different (← Knuth)

Longest match with the prefix

Failed matching HERE

Failure link

You don't have to check these positions again!

CTACTGATCTGATCGCTAGATGC

CTGATCTGC

CTGATCGC

MP skips only 4 positions

KMP skips 5 positions

Pattern

Skip 1 position

Failed at the first position, so just proceed

Overlap of "CTG"

No overlap
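The following sketch shows one common way to code the KMP search with Knuth's rule. It is an assumption-laden illustration rather than the lecture's own code: the table `nxt` stores the pattern index at which to resume, which under the definition above corresponds to FailureLink[k] + 1 (so -1 here plays the role of -2 on the slide). The names `kmp_next_table` and `kmp_search` are hypothetical.

```python
def kmp_next_table(p: str) -> list[int]:
    """nxt[k]: pattern index to resume at after a mismatch at p[k].

    -1 means "advance the text pointer and restart from p[0]".
    nxt[len(p)] is the resume index after a full match.
    """
    m = len(p)
    nxt = [0] * (m + 1)
    nxt[0] = -1
    i, j = 0, -1
    while i < m:
        while j > -1 and p[i] != p[j]:
            j = nxt[j]
        i += 1
        j += 1
        # Knuth's rule: if the resumed character equals the failed one,
        # it would fail again, so reuse the link of position j.
        if i < m and p[i] == p[j]:
            nxt[i] = nxt[j]
        else:
            nxt[i] = j
    return nxt


def kmp_search(text: str, pattern: str) -> list[int]:
    """Return all start indices of pattern in text in O(n + m) time."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    nxt = kmp_next_table(pattern)
    m = len(pattern)
    hits = []
    i = 0  # current position in the pattern
    for j, ch in enumerate(text):
        while i > -1 and pattern[i] != ch:
            i = nxt[i]
        i += 1
        if i == m:
            hits.append(j - m + 1)
            i = nxt[m]
    return hits


print(kmp_next_table("CTGATCTGC"))  # [-1, 0, 0, 0, 0, -1, 0, 0, 3, 1]
print(kmp_search("CTACTGATCTGATCGCTAGATGC", "CTGATCGC"))  # [8]
```

The text index `j` never moves backwards, which is what gives the linear bound.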

Preprocessing

A naive algorithm requires O(m^2) or even O(m^3) time

Linear time algorithm exists

Use the KMP itself

Z algorithm [Gusfield 97]

Not faster than the KMP, but easier to understand

Z Algorithm (1)

Z_i

Longest common prefix length of S[0..n-1] and S[i..n-1]

Compute it for all i

right_i

Max value of x + Z_x - 1 (x < i)

left_i

The x that attains the maximum value of x + Z_x - 1 (x < i)

Initialization

Z_1 = right_1 = left_1 = 0

(Figure: S[left_i..right_i] matches the prefix of S of length Z_{left_i}.)

Z Algorithm (2)

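Computation of Z_{i+1}

In case i+1 ≤ right_i

We have already compared the string up to position right_i

Let i'+1 = i+1-left_i be the corresponding position inside the prefix

In case Z_{i'+1} < right_i - i, we can copy the answer in O(1): Z_{i+1} = Z_{i'+1}

Otherwise, compare naively from the position after right_i (①)

In case i+1 > right_i

Compare naively (②)

① + ② can be done in linear time in total, because every successful naive comparison moves right_i forward!

The Z algorithm itself is also a text matching algorithm

Compute Z_i for P$T

P: pattern, T: text, $: some character that appears in neither P nor T

(Figure: position i+1 inside S[left_i..right_i] corresponds to position i'+1 inside the prefix S[0..right_i-left_i].)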
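The two cases above can be sketched as follows. This is a minimal illustration under a few assumptions: `right` is kept as an exclusive bound (one past the slide's right_i), Z_0 is set to 0, and the names `z_array` and `z_search` are hypothetical.

```python
def z_array(s: str) -> list[int]:
    """z[i] = length of the longest common prefix of s and s[i:] (z[0] = 0)."""
    n = len(s)
    z = [0] * n
    left = right = 0  # s[left:right] is the rightmost segment matching a prefix
    for i in range(1, n):
        if i < right:
            # Inside the known segment: reuse the value from the prefix copy,
            # capped at the part that has actually been compared.
            z[i] = min(right - i, z[i - left])
        # Naive comparison beyond what is already known.
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def z_search(text: str, pattern: str, sep: str = "$") -> list[int]:
    """Find pattern in text by computing the Z values of pattern + sep + text."""
    if sep in pattern or sep in text:
        raise ValueError("separator must not occur in pattern or text")
    m = len(pattern)
    z = z_array(pattern + sep + text)
    return [i - m - 1 for i in range(m + 1, len(z)) if z[i] == m]


print(z_array("aabxaab"))  # [0, 1, 0, 0, 3, 1, 0]
print(z_search("CTACTGATCTGATCGCTAGATGC", "CTGATCGC"))  # [8]
```

The `min(...)` followed by the `while` loop covers both the "copy" case and the "compare beyond right_i" case in one code path.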

Z Algorithm (3)

Example

S:   ATGCGCATAATGCGCTGAATGGCCATAATCTGAA
Z_i: 0000002016000000013000002012000011

Computed up to this position

Let's compute Z_i for this position!

left

right

Same text

Just copy the numbers if the numbers are smaller than 3

Zi & Failure Link (F[])

Failure links can be obtained by just scanning the Z_i table

Initialize FailureLink[] with -1

if FailureLink[i+Z_i] = -1 then FailureLink[i+Z_i] ← Z_i - 1

Compute in this order

pattern  GTAGGCATGTAGCGTAGG
i        0123456789........
Z_i      000110004001030011
Flink    AAAB00AABAAB3BAA20   (A: -1, B: -2)

Knuth's rule (post processing)
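A sketch of this scan, assuming the failure-link convention used earlier (-1: no overlap, -2: no overlap and the failed character equals P[0]). Entry 0 is simply left at -1, and an extra entry m is kept for use after a full match; both choices, and the function names, are assumptions of this sketch.

```python
def z_array(s: str) -> list[int]:
    """z[i] = longest common prefix length of s and s[i:] (z[0] = 0)."""
    n = len(s)
    z = [0] * n
    left = right = 0
    for i in range(1, n):
        if i < right:
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def failure_links_from_z(p: str) -> list[int]:
    """Build failure links (with Knuth's rule) from the Z table of p."""
    m = len(p)
    z = z_array(p)
    fl = [-1] * (m + 1)
    # Scan left to right: a smaller i means a longer overlap, so the first
    # value written to a slot is the maximum and must not be overwritten.
    for i in range(1, m):
        k = i + z[i]
        if z[i] > 0 and fl[k] == -1:
            fl[k] = z[i] - 1
    # Knuth's rule (post-processing): with no usable overlap, a failed
    # character equal to P[0] cannot start a match either.
    for k in range(1, m):
        if fl[k] == -1 and p[k] == p[0]:
            fl[k] = -2
    return fl


print(failure_links_from_z("CTGATCTGC"))
# [-1, -1, -1, -1, -1, -2, -1, -1, 2, 0]
```

Because Z_i is maximal, P[Z_i] and P[i+Z_i] always differ, so the "should be different" condition holds automatically for every link written in the first loop.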

Computational Time Complexity of KMP

O(m+n)

n: text length, m: pattern length

Worst-case time complexity: #comparisons < 2n-1

In practice, it is usually not faster than the Boyer-Moore or Shift-Or algorithms,

although those algorithms do not achieve worst-case linear time

Colussi Algorithm (A Variation of the KMP)

#comparisons < 3n/2

Check the positions with the KMP strong rule later

Skip lengths are different from those of KMP

Preprocessing is also in linear time

Not much faster in practice, though

cf. The Galil-Giancarlo algorithm achieves #comparisons < 4n/3

Step 1

FailureLink[i]+1

G a t G c t c a t G A T G t c c G A T G C c G t

0 0 1 0 0 0 0 0 0 0 4 0 0 0 0 0 0 5 1-1 -1 -1 -1

Step 2

G a t G c t c a t G A T G t c c G A T G C c G t

Check in this order

Strong rule

KMP and an automaton

A T A T T G

Failure Link

KMP can be described by an automaton

Aho-Corasick (1)

The automaton can be extended to deal with multiple queries!

Linear time construction!

Linear time searching!

Failure link

Failure links go to the root if not specified

Aho-Corasick (2)

Construction of the keyword tree: O(M) time

M: Sum of query string lengths

Can be used for dictionary searching

Aho-Corasick (3)

Breadth-first searching

Start from the root

No failure link at the root

FailureLink(v)

Traverse the failure links starting from v's parent to find a node that has a child w with the same label as v, and let the nearest such w be FailureLink(v)

If no such node exists, let FailureLink(v) = root

Aho-Corasick (4)

Why is it linear time?

failure links to be made

1 shorter suffix

All the suffixes of some pattern

Existing paths from the root in the tree

traverse at most O(m) nodes

Aho-Corasick (5)

OutLink(v)

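Pointers to the nodes whose patterns must also be output when v is reached

Computation of OutLink()

Traverse the failure links to find a node where a pattern ends (a leaf, or an internal node such as "he" below), if any

If there's no such node, there's no need to set the outlink

Also in linear time

Patterns: 1 together, 2 ether, 3 get, 4 her, 5 he

(Figure: keyword tree for these patterns; "ether" is a suffix of "together", so the nodes along "t-o-g-e-t-h-e-r" link to the nodes along "e-t-h-e-r".)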
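The keyword tree, the breadth-first failure links, and the out links can be sketched as below. The data layout (parallel lists indexed by node number, node 0 as the root, -1 for "no out link") and the names `build_automaton` and `ac_search` are assumptions of this sketch, and the patterns are assumed to be distinct and non-empty.

```python
from collections import deque


def build_automaton(patterns):
    goto = [{}]      # goto[v][ch] -> child node
    fail = [0]       # failure link of each node (root links to itself)
    word = [None]    # index of the pattern ending at the node, if any
    for pid, pat in enumerate(patterns):
        v = 0
        for ch in pat:
            if ch not in goto[v]:
                goto.append({})
                fail.append(0)
                word.append(None)
                goto[v][ch] = len(goto) - 1
            v = goto[v][ch]
        word[v] = pid
    outlink = [-1] * len(goto)
    queue = deque(goto[0].values())  # depth-1 nodes fail to the root
    while queue:
        u = queue.popleft()
        for ch, v in goto[u].items():
            # Walk the failure links of the parent until a node has a ch-child.
            f = fail[u]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[v] = goto[f].get(ch, 0)
            # Out link: nearest pattern-end node on the failure chain.
            f = fail[v]
            outlink[v] = f if word[f] is not None else outlink[f]
            queue.append(v)
    return goto, fail, word, outlink


def ac_search(text, patterns):
    """Return (start index, pattern) for every occurrence of every pattern."""
    goto, fail, word, outlink = build_automaton(patterns)
    hits = []
    v = 0
    for pos, ch in enumerate(text):
        while v and ch not in goto[v]:
            v = fail[v]
        v = goto[v].get(ch, 0)
        u = v if word[v] is not None else outlink[v]
        while u != -1:
            pat = patterns[word[u]]
            hits.append((pos - len(pat) + 1, pat))
            u = outlink[u]
    return hits


print(sorted(ac_search("altogether", ["together", "ether", "get", "her", "he"])))
# [(2, 'together'), (4, 'get'), (5, 'ether'), (7, 'he'), (7, 'her')]
```

Processing nodes in breadth-first order guarantees that the failure and out links of all shallower nodes are already final when a node is handled.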

Regular expression search based on automata (1)

Regular expression

Concatenation A, B → AB

Or A, B → A+B

Repeat A → A*

Extension of Aho-Corasick

AB(A+B)(AB+CD)*B

ABABABBBABAABBABACDBABBABBABBCDBABAABABBABAABCDB...

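Construct the automaton for a regular expression

Example: (A*B+AC)D

(Figure: automaton building blocks for AB, A+B, and A*, joined by ε transitions; "Next" marks the following state.)

Text: CDAABCAAABDDACDAAC

Reachable nodes at each text position (not including ε states):

000000000000000000

113 11137 1 11

55 555 567556

You can start anywhere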

Found!
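As a rough illustration of the "reachable nodes" idea, the sketch below simulates a small hand-built automaton for (A*B+AC)D. The automaton is ε-free and its state numbers are chosen for this sketch only; they are not the node numbers used in the figure. A general implementation would construct the automaton from the expression instead of hard-coding it.

```python
# Hand-built NFA for (A*B+AC)D (hypothetical state numbering):
#   0: start, 1: inside A*, 2: after the A of "AC",
#   3: after "A*B" or "AC" (D expected), 4: accept
START, ACCEPT = 0, 4
DELTA = {
    (0, "A"): {1, 2},
    (0, "B"): {3},
    (1, "A"): {1},
    (1, "B"): {3},
    (2, "C"): {3},
    (3, "D"): {4},
}


def regex_match_ends(text: str) -> list[int]:
    """Return the text positions at which some match of (A*B+AC)D ends."""
    active = set()
    ends = []
    for pos, ch in enumerate(text):
        active.add(START)  # a match may start anywhere
        reached = set()
        for state in active:
            reached |= DELTA.get((state, ch), set())
        active = reached
        if ACCEPT in active:
            ends.append(pos)
    return ends


print(regex_match_ends("CDAABCAAABDDACDAAC"))  # [10, 14]
```

Each text character is processed once against the current set of reachable states, so the search costs O(n) times the automaton size.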

Boyer-Moore (1)

Almost the same as KMP, but check from right!

Practically faster than KMP

Good average-case time complexity

Bad worst-case time complexity

AATTGTTCCGGCCATGCCGGAT

......T

.....TT

....GTT

...cGTT failed

gtt...t failed

....g.t failed

GTTCGTT

Skip based on the information of "GTT"

Skip based on the information of "G"

Pattern

Boyer-Moore (2)

Two rules

Bad character rule

If the text character at the failed position is x, we can shift the pattern so that the last x in the pattern is aligned with that position

The algorithm that uses only this rule is called the Horspool algorithm

(Strong) good suffix rule

Strong: the character before the matching substring must be different

This constraint was not used in the original BM algorithm

cf. Knuth's rule in KMP

Take the larger of the two shifts

(Figure labels: Failed / Success; Different = strong)

Boyer-Moore (3)

Bad character rule example

Pattern: TTCCAAGTCGCC

Do not consider the last character

CCCTGTCCATGCCGTCAGCCC

TTCCAAGTCGCC

Failed

Last T
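A minimal sketch of a search that uses only a bad-character table. It follows Horspool's usual formulation, which looks up the text character aligned with the last pattern position (rather than the character at the failed position) and, as noted above, ignores the last pattern character when building the table. The name `horspool_search` is illustrative.

```python
def horspool_search(text: str, pattern: str) -> list[int]:
    """Return all start indices of pattern in text using Horspool's shift."""
    n, m = len(text), len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    # shift[c]: distance from the last occurrence of c in pattern[:-1]
    # to the end of the pattern; characters not in the table shift by m.
    shift = {c: m - 1 - k for k, c in enumerate(pattern[:-1])}
    hits = []
    pos = 0
    while pos <= n - m:
        k = m - 1
        # Compare from right to left.
        while k >= 0 and text[pos + k] == pattern[k]:
            k -= 1
        if k < 0:
            hits.append(pos)
        pos += shift.get(text[pos + m - 1], m)
    return hits


print(horspool_search("GCATCGCAGAGAGTATACAGTACG", "GCAGAGAG"))  # [5]
```

Leaving the last pattern character out of the table keeps every shift at least 1.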

Boyer-Moore (4)

(Strong) Good suffix rule example

Pattern: CGTATATCCAATATC

Text: AGTCCCTCGGTCCGATATCGACCCTCCCG

CGTATATCCAATATC

Failed

Boyer-Moore (5)

Preprocess

Bad character rule: very easy

Good suffix rule

Linear time by running the Z algorithm backward (i.e., on the reversed pattern)

Boyer-Moore (6)

Computational time complexity

Average case: O(n / min(m, alphabet size))

i.e., the average skip length is O(min(m, alphabet size))

The Horspool algorithm has the same time complexity

Worst case: O(nm)

Bad for the following cases:

Many repeats

» KMP is faster

Small alphabet size

» Shift-Or is faster

Linear time if only the first occurrence is needed

Good for grep in editors

Worst-case O(n) algorithms based on BM

Turbo-BM (Crochemore et al. '92), Galil (1979), Smyth (2000), Apostolico-Giancarlo (1986), etc.

Turbo-BM

Turbo-shift

Additional rule that can be applied for a new shift after the strong good suffix rule-based shift

A = B, but ① ≠ ②, so B cannot overlap with

w a b c z a b c w a b c a b c a b c

a b ca b c a b ca b c

xy(≠x)

w(≠z)

y w z x w z a b ca b c a b ca b c

strong good suffix rule

turbo-shift

Pattern

Previous shift

② ¬z

Next position

In addition, consider the bad character rule too.

Failed

Previous Current Next

Rabin-Karp (1)

Based on fingerprinting (i.e., hashing)

e.g., hash(x[0..n-1]) = (x[0] d^(n-1) + x[1] d^(n-2) + x[2] d^(n-3) + … + x[n-1]) mod q

q: some prime number

Pattern p → hash(p)

hash(T[0..|P|-1])

hash(T[1..|P|])

hash(T[2..|P|+1])

Compare with hash(p) first

O(1) computation for each window

Rabin-Karp (2)

Pattern: 10111, hash = (16+4+2+1) mod 5 = 3

Text: 11001101110100101...

11001: (16+8+1) mod 5 = 0

10011: ((0-1·16)·2+1) mod 5 = 4

00110: ((4-1·16)·2+0) mod 5 = 1

01101: ((1-0·16)·2+1) mod 5 = 3, check → NO

11011: ((3-0·16)·2+1) mod 5 = 2

10111: ((2-1·16)·2+1) mod 5 = 3, check → YES!
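The rolling update used in this example can be sketched as follows. The default base `d`, modulus `q`, and the `digit` parameter (how a character is turned into a number) are assumptions of this sketch; the call at the end reproduces the binary example with d = 2 and q = 5.

```python
def rabin_karp(text, pattern, d=256, q=1_000_003, digit=ord):
    """Return all start indices of pattern in text using a rolling hash."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(d, m - 1, q)  # weight of the leading character, mod q
    hp = ht = 0
    for k in range(m):
        hp = (hp * d + digit(pattern[k])) % q
        ht = (ht * d + digit(text[k])) % q
    hits = []
    for s in range(n - m + 1):
        # Equal hashes are only a hint; verify to rule out collisions.
        if hp == ht and text[s:s + m] == pattern:
            hits.append(s)
        if s + m < n:
            # Drop text[s], shift by one digit, append text[s + m].
            ht = ((ht - digit(text[s]) * high) * d + digit(text[s + m])) % q
    return hits


print(rabin_karp("11001101110100101", "10111", d=2, q=5, digit=int))  # [5]
```

With a modulus as small as 5, false hash hits are frequent (the window 01101 above is one), which is why the explicit comparison is needed; a large prime makes them rare.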

Shift-And Method

Bit-parallel (32- or 64-bit) computation!

Efficient when the alphabet is small

T 0001

A 1000

T 0001

G 0010

C 0100

G 0010

Bit representation

1 if matched

X ← (shift(X, 1 bit) or 1) and B_A

TTTACGTATTATTACGTCC..

T 01110001011011000100..

T 00110000001001000000..

A 00001000000100100000..

T 00000000000010000000..

T 00000000000001000000..

G 00000000000000100000..

C 00000000000000010000..

G 00000000000000001000..

Pattern

Start from 0
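A minimal sketch of the Shift-And update. Python integers are unbounded, so this version is not limited to 32 or 64 pattern characters; in C the state would be a single machine word. The mask table `mask` and the function name are illustrative.

```python
def shift_and(text: str, pattern: str) -> list[int]:
    """Return all start indices of pattern in text (bit-parallel Shift-And)."""
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    # mask[c] has bit k set iff pattern[k] == c.
    mask = {}
    for k, c in enumerate(pattern):
        mask[c] = mask.get(c, 0) | (1 << k)
    accept = 1 << (m - 1)
    state = 0  # bit k set iff pattern[:k+1] ends at the current text position
    hits = []
    for pos, c in enumerate(text):
        state = ((state << 1) | 1) & mask.get(c, 0)
        if state & accept:
            hits.append(pos - m + 1)
    return hits


print(shift_and("TTTACGTATTATTACGTCC", "TTATT"))  # [8]
```

Each column of the table above corresponds to one value of `state`, with the top row stored in the lowest bit.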

Shift-Or Method

Just reverse the bits!

Shift-And: ((001001 << 1) OR 000001) AND 010010

vs.

Shift-Or: (110110 << 1) OR 101101

1.5 times faster?!
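A sketch of the complemented version, where a 0 bit means "matched". The shift then brings in the required 0 for free, so the inner update needs only a shift and an OR. The final `& full` merely truncates Python's unbounded integers to m bits; with a fixed-width machine word that step would not be needed. Names are illustrative.

```python
def shift_or(text: str, pattern: str) -> list[int]:
    """Return all start indices of pattern in text (bit-parallel Shift-Or)."""
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    full = (1 << m) - 1
    # mask[c] has bit k CLEARED iff pattern[k] == c.
    mask = {}
    for k, c in enumerate(pattern):
        mask[c] = mask.get(c, full) & ~(1 << k)
    accept = 1 << (m - 1)
    state = full  # all ones: nothing matched yet
    hits = []
    for pos, c in enumerate(text):
        state = ((state << 1) | mask.get(c, full)) & full
        if (state & accept) == 0:
            hits.append(pos - m + 1)
    return hits


print(shift_or("TTTACGTATTATTACGTCC", "TTATT"))  # [8]
```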

Summary

String searching algorithms

Brute-force: Naive, Rabin-Karp, Shift-Or

From left: KMP, AC

From right: BM, Horspool, Turbo-BM

Next week

Suffix arrays
