advanced algorithms: text algorithmssommer/aa10/aa11.pdf · advanced algorithms / t. shibuya...

Post on 17-Oct-2020

Advanced Algorithms / T. Shibuya

Advanced Algorithms:

Text Algorithms

Tetsuo Shibuya

Human Genome Center, Institute of Medical Science

(Adjunct at Department of Computer Science)

University of Tokyo

http://www.hgc.jp/~tshibuya

Self Introduction

Affiliation: Laboratory of Sequence Analysis, Human Genome Center, Institute of Medical Science

Research Interest: Bioinformatics algorithms

Our lab is located on the 4th floor

The topics of this lecture

Today (July 2nd)

Text searching algorithms: Knuth-Morris-Pratt / Boyer-Moore / etc.

Next week (July 9th)

Text indexing algorithms: Suffix arrays and their applications

The final week (July 16th)

Text compression algorithms: LZ77 / LZ78 / LZW / Arithmetic coding / Block sorting / etc.

Reports

Please submit a report for the homework that I will give on the last day

or

Please submit scribe notes for one of my three lectures

In TeX format as for the previous lectures

One volunteer (if any) for one lecture

The submitted notes will be put on the web page

Textbooks

D. Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge University Press, 1997.

The most famous book on text processing algorithms, but many parts are out of date.

W. Sung, Algorithms in Bioinformatics, CRC Press, 2009.

A good introduction to bioinformatics algorithms (mainly on text processing).

D. Salomon, Data Compression, 3rd Edition, Springer, 2004.

Related to the topic of the last day. (A very heavy book!)

Today's topic

Text processing algorithms

Brute-force algorithm

Knuth-Morris-Pratt algorithm

Colussi algorithm

Aho-Corasick algorithm

Boyer-Moore algorithm

Horspool algorithm

Turbo-BM algorithm

Rabin-Karp algorithm

Shift-Or method

etc.

Text matching

Problem

Given: a text string T and a pattern (query) P

Output: the substrings of T that are exactly the same as P, if any

Exact matching: no insertion / deletion / modification (mutation)

Two approaches

Preprocess only the query pattern (today)

Preprocess the text beforehand (next week)

Text:

GGTGAGAAGTTATGATACAGGGTAGTTG
TGTCCTTAAGGTGTATAACGATGACATC
ACAGGCAGCTCTAATCTCTTGCTATGAG
TGATGTAAGATTTATAAGTACGCAAATT

Pattern (Query): TATAA

Two types of text matching algorithms

Skipping positions unnecessary to compare

Check from left: Knuth-Morris-Pratt, Aho-Corasick (for multiple queries)

Check from right: Boyer-Moore, Horspool, Turbo-BM

Brute-force

Naive algorithm

Fingerprinting (hash-based) algorithm: Rabin-Karp

Bitwise computation-based algorithm: Shift-Or (Shift-And)

Naive algorithm

Just check the pattern character by character at each text position

O(nm) time in the worst case, but...

Linear time on average! Not so bad when you have no time to implement anything else :-)

Still, it is much slower in practice than the more sophisticated algorithms.

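Text: GGGACCAAGTTCCGCACATGCCGGATAGAAT

Pattern: CCGTATG

[Figure: the pattern is checked one by one at each text position. Most alignments fail at the first character ("c"); a few match a short prefix before failing ("CCg", "CCGt").]

Average length to check (for a random DNA sequence): 1 + 1/4 + (1/4)^2 + ... = 4/3 (constant!)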
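
The check-one-by-one idea can be written in a few lines. This is only a minimal Python sketch; the function name naive_search is chosen here for illustration and does not come from the lecture.

```python
def naive_search(text, pattern):
    """Return every start position where pattern occurs in text."""
    n, m = len(text), len(pattern)
    hits = []
    for pos in range(n - m + 1):
        j = 0
        # Compare left to right and stop at the first mismatch.
        while j < m and text[pos + j] == pattern[j]:
            j += 1
        if j == m:
            hits.append(pos)
    return hits


# One line of the example text from the "Text matching" slide.
print(naive_search("TGTCCTTAAGGTGTATAACGATGACATC", "TATAA"))
```

The inner loop usually stops after one or two comparisons on random DNA, which is where the 4/3 average comes from. The worst case (for example a text of all A's and a pattern of all A's) still costs O(nm).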

Knuth-Morris-Pratt (1)

Improvement of the brute-force algorithm

The brute-force algorithm sometimes checks the same position more than once, which could be a waste of time

→ Knuth-Morris-Pratt Algorithm

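Pattern: TAGTAGC

Text: AATACTAGTAGGCATGCCGGAT

[Figure: the pattern is checked from the left at each text position. After "TAGTAG" matches and the last character "c" fails, the following alignments marked "skip" are not checked.]

We already know that the text here is "TAGTAG", so we know before any comparison that the pattern cannot match at these positions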

Knuth-Morris-Pratt (2)

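If P[0..i] matches the text but P[i+1] does not, then

FailureLink[i+1] = max j s.t. P[0..j] ≡ P[i-j..i], P[j+1] ≠ P[i+1], and j < i

If no such j exists, let FailureLink[i+1] = -2 if P[i+1] = P[0]; otherwise let FailureLink[i+1] = -1.

The condition P[j+1] ≠ P[i+1] (the next characters should be different) is due to Knuth.

FailureLink[] can be computed before searching the text!

We can shift the pattern by i - FailureLink[i+1] positions

[Figure: matching failed at P[i+1]. The failure link gives the longest suffix of the matched part that is also a prefix of the pattern; the pattern is shifted so that this prefix is aligned with that suffix. You don't have to check the skipped positions again!]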
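
The following Python sketch shows one common way to implement the search with the strong (Knuth) rule. It is not taken from the slides: it uses the usual textbook table nxt[k], which under the definition above corresponds to FailureLink[k] + 1 (so -1 here plays the role of -2, and 0 the role of -1). The names kmp_next and kmp_search are illustrative.

```python
def kmp_next(pattern):
    """nxt[k]: pattern index to resume at after a mismatch at index k
    (-1 means: move to the next text character and restart at index 0)."""
    m = len(pattern)
    nxt = [0] * (m + 1)
    nxt[0] = -1
    i, j = 0, -1
    while i < m:
        while j > -1 and pattern[i] != pattern[j]:
            j = nxt[j]
        i += 1
        j += 1
        # Strong rule: if the next characters are equal, the same mismatch
        # would repeat, so reuse the link of j instead.
        if i < m and pattern[i] == pattern[j]:
            nxt[i] = nxt[j]
        else:
            nxt[i] = j
    return nxt


def kmp_search(text, pattern):
    """Return every start position where pattern occurs in text."""
    m = len(pattern)
    if m == 0:
        return []
    nxt = kmp_next(pattern)
    hits = []
    i = 0  # index in pattern
    for j, ch in enumerate(text):
        while i > -1 and pattern[i] != ch:
            i = nxt[i]
        i += 1
        if i == m:
            hits.append(j - m + 1)
            i = nxt[i]
    return hits


print(kmp_next("CTGATCTGC"))
print(kmp_search("AATACTAGTAGGCATGCCGGAT", "TAGTAG"))
```

The text index j never moves backwards, which is why the total number of comparisons stays linear in the text length.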

Knuth-Morris-Pratt (3)

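Text: CTACTGATCTGATCGCTAGATGC

Pattern: CTGATCTGC

[Figure: the pattern is aligned successively at text offsets 0, 2, 3, 8 and 14.]

Offset 0: "CT" matches, then the comparison fails. Skip 1 position.

Offset 2: Failed at the first position, so just proceed.

Offset 3: "CTGATCTG" matches, then the comparison fails. Overlap of "CTG", so the pattern moves to offset 8.

Offset 8: "CTGATC" matches, then the comparison fails. No overlap: MP skips only 4 positions, KMP skips 5 positions.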

Knuth-Morris-Pratt (4)

Preprocessing

A naive algorithm requires O(m^2) or even O(m^3) time

Linear-time algorithms exist

Use the KMP itself

Z algorithm [Gusfield 97]

Not faster than the KMP, but easier to understand

Z Algorithm (1)

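Z_i

Longest common prefix length of S[0..n-1] and S[i..n-1]

Compute it for all i

right_i

Max value of x + Z_x - 1 (x < i)

left_i

The x that attains the maximum value of x + Z_x - 1 (x < i)

Initialization

Z_1 = right_1 = left_1 = 0

[Figure: the Z box. The substring S[left_i..right_i], whose length is Z_{left_i}, is equal to the prefix of S of the same length.]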

Z Algorithm (2)

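Computation of Z_{i+1}

In case i+1 ≤ right_i

We have already compared the string up to position right_i

Let i' = i - left_i, so that position i'+1 in the prefix corresponds to position i+1 in the Z box

In case Z_{i'+1} < right_i - i, we can copy the answer in O(1): Z_{i+1} = Z_{i'+1}

Otherwise, compare naively after position right_i (case ①)

In case i+1 > right_i

Compare naively (case ②)

① + ② can be done in linear time in total!

The Z algorithm itself is also a text matching algorithm

Compute Z_i for the string P$T

P: pattern, T: text, $: some character that appears in neither P nor T

[Figure: position i+1 lies inside the Z box S[left_i..right_i], which is a copy of the prefix S[0..right_i-left_i]; position i'+1 is the corresponding position in the prefix.]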
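
The two cases above can be sketched in Python as follows. This is a common formulation of the Z algorithm rather than the lecturer's own code; it keeps the current Z box as the pair (left, right) and follows the slides' convention that the value at position 0 is 0. The names z_array and z_search, and the default separator "$", are illustrative.

```python
def z_array(s):
    """z[i] = length of the longest common prefix of s and s[i:], with z[0] = 0."""
    n = len(s)
    z = [0] * n
    left = right = 0  # current Z box is s[left..right]
    for i in range(1, n):
        if i <= right:
            # Inside the box: copy from the corresponding prefix position,
            # but never trust more than what the box has verified.
            z[i] = min(z[i - left], right - i + 1)
        # Extend naively (does nothing when the copied value was final).
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] - 1 > right:
            left, right = i, i + z[i] - 1
    return z


def z_search(text, pattern, sep="$"):
    """Find pattern in text by computing Z values of pattern + sep + text."""
    assert sep not in text and sep not in pattern
    m = len(pattern)
    z = z_array(pattern + sep + text)
    return [i - m - 1 for i in range(m + 1, len(z)) if z[i] == m]


print(z_array("ATGCGCATAATGCGCTGAATGGCCATAATCTGAA"))
print(z_search("ABABABA", "ABA"))
```

Each naive comparison that succeeds moves right forward, and right never decreases, so the total work is linear in the length of the string.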

Z Algorithm (3)

Example

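Text: ATGCGCATAATGCGCTGAATGGCCATAATCTGAA

Z_i:  0000002016000000013000002012000011

[Figure: the Z values have been computed up to a position inside the current Z box (from left to right), whose text is the same as the prefix. To compute Z_i for the next position, just copy the numbers from the prefix if they are smaller than 3.]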

Z_i & Failure Link (F[])

Failure links can be obtained by just scanning the Z_i table

Initialize FailureLink[] with -1

For each i (compute in increasing order of i):

if (FailureLink[i+Z_i] = -1) FailureLink[i+Z_i] = Z_i - 1

pattern GTAGGCATGTAGCGTAGG
i       0123456789........
Z_i     000110004001030011
Flink   AAAB00AABAAB3BAA20   (A: -1, B: -2)

Knuth's rule (post processing)
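
A possible Python rendering of this scan is sketched below. It assumes the FailureLink convention from the KMP (2) slide (the value is the last matched prefix index, -1 for "no overlap", -2 for "no overlap and the first character cannot match either"), and it assumes that the post-processing step is the marking of the -2 entries. The function names and the extra table entry at index m (used after a full match) are choices made for this sketch.

```python
def z_array(s):
    """Z values with the convention z[0] = 0."""
    n = len(s)
    z = [0] * n
    left = right = 0
    for i in range(1, n):
        if i <= right:
            z[i] = min(z[i - left], right - i + 1)
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] - 1 > right:
            left, right = i, i + z[i] - 1
    return z


def failure_links_from_z(pattern):
    """link[k]: failure link used when the mismatch happens at pattern index k."""
    m = len(pattern)
    z = z_array(pattern)
    link = [-1] * (m + 1)
    # Increasing i: the first writer of an entry has the longest overlap.
    for i in range(1, m):
        if z[i] > 0 and link[i + z[i]] == -1:
            link[i + z[i]] = z[i] - 1
    # Assumed post-processing: with no overlap, even pattern[0] cannot
    # match the text character if it equals the character that just failed.
    for k in range(m):
        if link[k] == -1 and pattern[k] == pattern[0]:
            link[k] = -2
    return link


print(failure_links_from_z("CTGATCTGC"))
```

The strong condition (next characters differ) comes for free here, because a Z value ends exactly where the first disagreement with the prefix occurs.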

Computational Time Complexity of KMP

O(m+n)

n: text length, m: pattern length

Worst-case time complexity

#comparisons < 2n-1

In practice, it is not faster than the Boyer-Moore or Shift-Or algorithms on ordinary inputs,

though these algorithms do not achieve worst-case linear time

Colussi Algorithm (A Variation of the KMP)

#comparisons < 3n/2

Check the positions with the KMP strong rule later

Skip lengths are different from KMP

Preprocessing is also in linear time

Practically not much faster, though

cf. The Galil-Giancarlo algorithm achieves #comparisons < 4n/3

Step 1

FailureLink[i]+1( )

G a t G c t c a t G A T G t c c G A T G C c G t

0 0 1 0 0 0 0 0 0 0 4 0 0 0 0 0 0 5 1-1 -1 -1 -1

Step 2

G a t G c t c a t G A T G t c c G A T G C c G t

Check in this order

Strong rule

KMP and an automaton

KMP can be described by an automaton

[Figure: the automaton for the pattern A T A T T G, with its failure links.]

Aho-Corasick (1)

The automaton can be extended to deal with multiple queries!

Linear time construction!

Linear time searching!

[Figure: a keyword tree with edges labeled A, C, G, T and its failure links. A node links to the root if no failure link is specified.]

Aho-Corasick (2)

Construction of the keyword tree

O(M) time

M: Sum of query string lengths

Can be used for dictionary searching

[Figure: the keyword tree, with edges labeled A, C, G, T.]

Aho-Corasick (3)

Breadth-first search

Start from the root

No failure link at the root

FailureLink(v)

Traverse the failure links from v's parent to find a node that has a child w with the same label as v, and let the nearest such w be FailureLink(v)

If no such node exists, let FailureLink(v) = root

[Figure: a keyword tree with edges labeled a, b, c, showing a node v and the node w that becomes FailureLink(v).]

Aho-Corasick (4)

Why is it linear time?

[Figure: the failure links to be made along one pattern. Labels: 1 shorter suffix; root; all the suffixes of some pattern; existing paths from the root in the tree.]

Traverse at most O(m) nodes

Aho-Corasick (5)

OutLink(v)

Pointers to the nodes of the patterns that v must output

Computation of OutLink()

Traverse the failure links to find a leaf, if any

If there is no such leaf, there is no need to set the outlink

Also in linear time

Patterns: 1 together, 2 ether, 3 get, 4 her, 5 he

[Figure: the keyword tree for these five patterns, with the pattern number written at the node where each pattern ends.]
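
The construction described on the Aho-Corasick slides (keyword tree, failure links by breadth-first search, outputs collected through the failure links) can be sketched in Python as below. The data layout is an assumption made for brevity: nodes are integers, children are stored in dictionaries, and instead of explicit OutLink pointers each node keeps a merged list of the patterns to report. The names build_automaton and ac_search are illustrative.

```python
from collections import deque


def build_automaton(patterns):
    goto = [{}]   # goto[node][char] -> child node; node 0 is the root
    fail = [0]    # failure link of each node
    out = [[]]    # indices of the patterns to report at each node
    for idx, p in enumerate(patterns):
        node = 0
        for ch in p:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(idx)
    # Breadth-first search; the children of the root keep fail = root.
    queue = deque(goto[0].values())
    while queue:
        v = queue.popleft()
        for ch, w in goto[v].items():
            f = fail[v]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[w] = goto[f].get(ch, 0)
            # Merge the outputs reachable through the failure link.
            out[w] = out[w] + out[fail[w]]
            queue.append(w)
    return goto, fail, out


def ac_search(text, patterns):
    """Return (start position, pattern) for every occurrence of every pattern."""
    goto, fail, out = build_automaton(patterns)
    node = 0
    hits = []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for idx in out[node]:
            hits.append((i - len(patterns[idx]) + 1, patterns[idx]))
    return hits


print(sorted(ac_search("altogether", ["together", "ether", "get", "her", "he"])))
```

Merging the output lists keeps the sketch short but can copy more than a linear amount of data; the pointer-based OutLink from the slide avoids that.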

Regular expression search based on automata (1)

Regular expression

Concatenation A, B → AB

Or A, B → A+B

Repeat A → A*

Extension of Aho-Corasick

AB(A+B)(AB+CD)*B

ABABABBBABAABBABACDBABBABBABBCDBABAABABBABAABCDB...

Regular expression search based on automata (2)

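Construct the automaton for a regular expression

[Figure: automaton fragments for AB, A+B and A*, built from labeled transitions and ε transitions that lead to the next fragment.]

Example: (A*B+AC)D

[Figure: the automaton for (A*B+AC)D, from Start to End, with transitions labeled A, B, C, D and ε.]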

Regular expression search based on automata (3)

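[Figure: the automaton for (A*B+AC)D from Start to End, with its nodes numbered 0 to 8.]

O(nm)

DP over the reachable nodes (not including ε states)

You can start anywhere

Text: CDAABCAAABDDACDAAC

Reachable nodes (the column alignment of the original table is not preserved):

000000000000000000
113 11137 1 11
55 555 567556
8 8

Found!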
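
The DP over reachable nodes can be illustrated with a small Python sketch. To keep it short, the automaton for (A*B+AC)D is written by hand without ε transitions, so its state numbers (0 to 4) are an assumption of this sketch and differ from the numbering 0 to 8 in the slide. The names TRANSITIONS and regex_match_ends are illustrative.

```python
# Hand-built automaton for (A*B+AC)D without epsilon transitions.
# State meanings (illustrative numbering, not the slide's):
#   0: start, 1: inside A*, 2: after the A of "AC",
#   3: after "A*B" or "AC", 4: accepting (after D)
TRANSITIONS = {
    (0, "A"): {1, 2},
    (0, "B"): {3},
    (1, "A"): {1},
    (1, "B"): {3},
    (2, "C"): {3},
    (3, "D"): {4},
}
START, ACCEPT = 0, 4


def regex_match_ends(text):
    """Return the text positions at which some match of (A*B+AC)D ends."""
    ends = []
    active = set()
    for i, ch in enumerate(text):
        active.add(START)  # a match may start anywhere
        reachable = set()
        for state in active:
            reachable |= TRANSITIONS.get((state, ch), set())
        active = reachable
        if ACCEPT in active:
            ends.append(i)
    return ends


print(regex_match_ends("CDAABCAAABDDACDAAC"))
```

Each text character updates a set of at most m states, which gives the O(nm) bound stated on the slide.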

Boyer-Moore (1)

Idea

Almost the same as KMP, but check from the right!

Practically faster than KMP

Good average-case time complexity

Bad worst-case time complexity

Text: AATTGTTCCGGCCATGCCGGAT

Pattern: GTTCGTT

[Figure: the pattern is compared with the text from right to left. "T", "TT" and "GTT" match, then "c" fails, and the pattern is shifted based on the information of "GTT". The following alignments also fail, and one of the shifts is based on the information of "G".]

Boyer-Moore (2)

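Two rules

Bad character rule

If the text character at the failed position is x, we can shift the pattern so that its last x is aligned with that position

The algorithm that uses only this rule is called the Horspool algorithm

(Strong) Good suffix rule

Strong: the character before the same substring must be different

This constraint was not used in the original BM algorithm

cf. Knuth's rule in KMP

Do the larger shift of the above two

[Figure: after a failed comparison, the pattern is shifted so that another occurrence of the successfully matched suffix is aligned with the matched part of the text. Under the strong rule, the character before that occurrence must be different from the character that failed.]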

Boyer-Moore (3)

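Bad character rule example

Pattern: TTCCAAGTCGCC

Text: CCCTGTCCATGCCGTCAGCCC

[Figure: two alignments of the pattern against the text. After the comparison fails, the last T of the pattern is aligned with the text.]

Do not consider the last character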
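
As an illustration of searching with the bad character rule only, here is a Python sketch of the Horspool algorithm in its usual textbook form. Note the assumption: in this form the shift is always looked up from the text character aligned with the last pattern position, and the table ignores the last pattern character ("do not consider the last character"). The name horspool_search is illustrative, and the good suffix rule is not implemented here.

```python
def horspool_search(text, pattern):
    """Return every start position where pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # Distance from the rightmost occurrence of each character to the end
    # of the pattern, ignoring the last pattern character.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    hits = []
    pos = 0
    while pos <= n - m:
        j = m - 1
        # Compare from right to left.
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1
        if j < 0:
            hits.append(pos)
        # Characters that do not occur in the table allow a full shift of m.
        pos += shift.get(text[pos + m - 1], m)
    return hits


print(horspool_search("TGTCCTTAAGGTGTATAACGATGACATC", "TATAA"))
```

With a four-letter alphabet such as DNA the shifts are short, which matches the remark that Boyer-Moore-type algorithms are less attractive for small alphabets.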

Boyer-Moore (4)

(Strong) Good suffix rule example

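Pattern: CGTATATCCAATATC

Text: AGTCCCTCGGTCCGATATCGACCCTCCCG

[Figure: two alignments of the pattern against the text, before and after the shift by the (strong) good suffix rule, with the failed position marked.]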

Boyer-Moore (5)

Preprocess

Bad character rule

Very easy

Good suffix rule

Linear time by using the Z algorithm from backward

Boyer-Moore (6)

Computational time complexity

Average-case O(n / min(m, alphabet size))

i.e., the average-case skip length is O(min(m, alphabet size))

The Horspool algorithm has the same time complexity

Worst-case O(nm)

Bad for cases:

Many repeats: KMP is faster

Small alphabet size: Shift-Or is faster

Linear time for finding only 1 occurrence

Good for grep in editors

Worst-case O(n) algorithms based on BM

Turbo-BM (Crochemore et al. '92), Galil (1979), Smyth (2000), Apostolico-Giancarlo (1986), etc.

Turbo-BM

Turbo-shift

Additional rule that can be applied for a new shift after the strong good suffix rule-based shift

A = B, but ① ≠ ②, so B cannot overlap with

+ Consider the bad character rule too.

[Figure: the previous, current and next positions of the pattern against the text. Labels: previous shift (strong good suffix rule), failed positions, substrings A and B, characters ① (z) and ② (¬z), turbo-shift, next position.]

Rabin-Karp (1)

Based on fingerprinting (i.e., hashing)

e.g., hash(x[0..n-1]) = (x[0]·d^(n-1) + x[1]·d^(n-2) + x[2]·d^(n-3) + … + x[n-1]) mod q

q: some prime number

Pattern p → hash(p)

Text → hash(T[0..|P|-1]), hash(T[1..|P|]), hash(T[2..|P|+1]), ...

O(1) computation for each

Compare with hash(p) at first

Rabin-Karp (2)

Pattern: 10111

hash: (16+4+2+1) mod 5 = 3

Text: 11001101110100101...

Window 11001: (16+8+1) mod 5 = 0
Window 10011: ((0-1·16)·2+1) mod 5 = 4   O(1)
Window 00110: ((4-1·16)·2+0) mod 5 = 1   O(1)
Window 01101: ((1-0·16)·2+1) mod 5 = 3   O(1)   check → NO
Window 11011: ((3-0·16)·2+1) mod 5 = 2   O(1)
Window 10111: ((2-1·16)·2+1) mod 5 = 3   O(1)   check → YES!
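
The rolling update used in this example can be sketched in Python as follows. The sketch hashes character codes (ord) with default parameters d = 256 and q = 1000003, which are choices made here and not values from the lecture; passing d=2 and q=5 gives a hash of the same shape as the slide's example, although the numbers differ because character codes are used instead of bit values. The name rabin_karp is illustrative.

```python
def rabin_karp(text, pattern, d=256, q=1_000_003):
    """Return every start position where pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(d, m - 1, q)  # weight of the leading character, d^(m-1) mod q
    hp = ht = 0
    for i in range(m):
        hp = (hp * d + ord(pattern[i])) % q
        ht = (ht * d + ord(text[i])) % q
    hits = []
    for s in range(n - m + 1):
        # Equal hashes only suggest a match; verify to rule out collisions.
        if hp == ht and text[s:s + m] == pattern:
            hits.append(s)
        if s < n - m:
            # Drop text[s], shift by one digit, append text[s + m]: O(1).
            ht = ((ht - ord(text[s]) * high) * d + ord(text[s + m])) % q
    return hits


print(rabin_karp("11001101110100101", "10111"))
print(rabin_karp("11001101110100101", "10111", d=2, q=5))
```

A tiny modulus such as q = 5 causes many hash collisions, which is why the explicit check after a hash match is needed, as in the "check → NO" row of the example.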

Shift-And Method

Bit-parallel (32 or 64 bits) computation!

Efficient for the case of a small alphabet size

Bit representation of the pattern (one bit per alphabet character):

    ACGT
T   0001
T   0001
A   1000
T   0001
T   0001
G   0010
C   0100
G   0010

X ← (shift(X, 1 bit) or 1) and B_A

Text (columns) against pattern (rows); an entry is 1 if matched. Start from 0 (the first column):

     TTTACGTATTATTACGTCC..
T   01110001011011000100..
T   00110000001001000000..
A   00001000000100100000..
T   00000000000010000000..
T   00000000000001000000..
G   00000000000000100000..
C   00000000000000010000..
G   00000000000000001000..
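
The update rule can be sketched in Python with ordinary integers as bit vectors. In this sketch bit i of the state is 1 when the first i+1 pattern characters match the text ending at the current position; the per-character masks are built from the pattern, and the names shift_and_search and masks are illustrative. Python integers are unbounded, so the 32 or 64 bit limit does not show up here.

```python
def shift_and_search(text, pattern):
    """Return every start position where pattern occurs in text."""
    m = len(pattern)
    if m == 0:
        return []
    # masks[ch] has bit i set when pattern[i] == ch.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    accept = 1 << (m - 1)
    state = 0  # start from 0: nothing matched yet
    hits = []
    for pos, ch in enumerate(text):
        # Extend every partial match by one character, start a new one (or 1),
        # and keep only those that agree with the current character.
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            hits.append(pos - m + 1)
    return hits


print(shift_and_search("TGTCCTTAAGGTGTATAACGATGACATC", "TATAA"))
```

Every text character costs a shift, an OR and an AND, independent of the pattern length as long as the pattern fits in one machine word.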

Shift-Or Method

Just reverse the bits!

Shift-And: ((001001 << 1) OR 000001) AND 010010

vs.

Shift-Or: (110110 << 1) OR 101101

1.5 times faster?!
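
A Python sketch of the Shift-Or variant follows, assuming the usual formulation in which every bit is complemented: 0 means "matched", the shift brings in a 0 for free, and each step needs only a shift and an OR. Because Python integers are unbounded, the state is masked to m bits explicitly; in C with a fixed word size that extra AND would not be needed. The name shift_or_search is illustrative.

```python
def shift_or_search(text, pattern):
    """Return every start position where pattern occurs in text."""
    m = len(pattern)
    if m == 0:
        return []
    full = (1 << m) - 1
    # masks[ch] has bit i cleared when pattern[i] == ch (complemented masks).
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, full) & ~(1 << i)
    accept = 1 << (m - 1)
    state = full  # all ones: nothing matched yet
    hits = []
    for pos, ch in enumerate(text):
        # The shift inserts a 0 at bit 0, which plays the role of "or 1"
        # in Shift-And; the final AND only emulates a fixed word width.
        state = ((state << 1) | masks.get(ch, full)) & full
        if not state & accept:
            hits.append(pos - m + 1)
    return hits


print(shift_or_search("ABABABA", "ABA"))
```

Saving one operation out of three per character is the source of the "1.5 times faster?!" estimate.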

Summary

String searching algorithms

Brute-force: Naive, Rabin-Karp, Shift-Or

From left: KMP, AC

From right: BM, Horspool, Turbo-BM

Next week

Suffix arrays
