4 report format


Upload: ashikapokiya12345

Post on 15-Apr-2017


String Matching Algorithms

ABSTRACT

String matching addresses the problem of finding all occurrences of a pattern string in a text string. Pattern matching algorithms have many practical applications. Computational molecular biology and the World Wide Web provide settings in which efficient pattern matching algorithms are essential. New problems are constantly being defined. Original data structures are introduced and existing data structures are enhanced to provide more efficient solutions to pattern matching problems. In this survey, we review pattern matching algorithms in one and two dimensions. We focus on several specific problems, among them small space pattern matching, parameterized matching, and dictionary matching. We also discuss the indexing problem, for which the classical data structure is the suffix tree, or alternatively, the suffix array.

_____________________________________________________________________________12IT083 1


ACKNOWLEDGEMENT

It gives me immense pleasure to present this seminar on “STRING MATCHING ALGORITHM”. I would like to acknowledge the contribution of certain distinguished people, without whose support and guidance this seminar could not have been completed.

First of all, I thank God, the Almighty, for blessing me in making my seminar a successful one. I wish to express my gratitude to our principal, Dr. Niraj D Shah, for providing us with all facilities. I also express my sincere gratitude to Prof. Parth Shah, Head of the Department of Information Technology; his guidance and support gave shape to this seminar in a systematic way.

I am also greatly indebted to Ms. Nehal Patel for her valuable suggestions in the preparation of the seminar. In addition, I would like to thank all staff members of the I.T. department and all my friends for their suggestions and constructive criticism.

Ashika Pokiya


CHAPTER 1

INTRODUCTION

1.1 Introduction

Text-editing programs frequently need to find all occurrences of a pattern in the text. Typically, the text is a document being edited, and the pattern searched for is a particular word supplied by the user. Efficient algorithms for this problem, called “string matching”, can greatly aid the responsiveness of the text-editing program. Among their many other applications, string-matching algorithms search for particular patterns in DNA sequences. Internet search engines also use them to find Web pages relevant to queries [5].

We formalize the string-matching problem as follows. We assume that the text is an array T[1..n] of length n and that the pattern is an array P[1..m] of length m <= n. We further assume that the elements of P and T are characters drawn from a finite alphabet ∑. For example, we may have ∑ = {0, 1} or ∑ = {a, b, ..., z}. The character arrays P and T are often called strings of characters [5].

Text T:     A B A C C A C C B A
                  | | | | |
Pattern P:        C C A C C        s = 3

Figure 1 An example of the string-matching problem, where we want to find all occurrences of the pattern P = CCACC in the text T = ABACCACCBA. The pattern occurs only once in the text, at shift s = 3, which we call a valid shift. A vertical line connects each character of the pattern to its matching character in the text [5].

Referring to Figure 1, we say that pattern P occurs with shift s in text T (or, equivalently, that pattern P occurs beginning at position s+1 in text T) if 0 <= s <= n-m and T[s+1..s+m] = P[1..m]. If P occurs with shift s in T, then we call s a valid shift; otherwise we call s an invalid shift. The string-matching problem is the problem of finding all valid shifts with which a given pattern P occurs in a given text T [5].
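The definition of a valid shift can be checked directly. The following is a minimal Python sketch and is not taken from the source: the function name is_valid_shift is an illustrative assumption, and because Python strings are 0-indexed, the 1-based range T[s+1..s+m] becomes the slice T[s:s+m].

```python
def is_valid_shift(T: str, P: str, s: int) -> bool:
    """Return True if pattern P occurs in text T with shift s."""
    n, m = len(T), len(P)
    # A shift is only defined for 0 <= s <= n - m.
    if not 0 <= s <= n - m:
        return False
    # 1-based T[s+1..s+m] corresponds to the 0-based slice T[s:s+m].
    return T[s:s + m] == P


T = "ABACCACCBA"
P = "CCACC"
valid_shifts = [s for s in range(len(T) - len(P) + 1) if is_valid_shift(T, P, s)]
print(valid_shifts)  # [3]
```

For the text and pattern of Figure 1, the only shift that passes the check is s = 3.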

1.2 Types of String Matching Algorithms [5]

There are many string-matching algorithms, including:

(1) The naive string-matching algorithm

(2) The Rabin-Karp algorithm

(3) String matching with finite automata

(4) The Knuth-Morris-Pratt algorithm

This report discusses two of them:

(1) The naive string-matching algorithm

(2) The Rabin-Karp algorithm

Algorithm     Preprocessing time   Matching time
Naive         0                    O((n-m+1)m)
Rabin-Karp    Θ(m)                 O((n-m+1)m)

Figure 2 The string-matching algorithms in this report and their preprocessing and matching times.

Except for the naive brute-force algorithm, which we review in Chapter 2, each string-matching algorithm performs some preprocessing based on the pattern and then finds all valid shifts; we call this latter phase “matching”. Figure 2 shows the preprocessing and matching times for each of the algorithms in this report. The total running time of each algorithm is the sum of the preprocessing and matching times.

Chapter 3 presents an interesting string-matching algorithm, due to Rabin and Karp. Although the Θ((n-m+1)m) worst-case running time of this algorithm is no better than that of the naive method, it works much better on average and in practice. It also generalizes nicely to other pattern-matching problems.


CHAPTER 2

THE NAIVE STRING MATCHING ALGORITHM

2.1 What is Naive String Matching?

The naive string-matching algorithm slides the pattern over the text one position at a time. After each slide, it checks the characters at the current shift one by one and, if all of them match, reports the match [1].

2.2 Algorithm for Naive String Matching [5]

The naive algorithm finds all valid shifts using a loop that checks the condition P[1..m] = T[s+1..s+m] for each of the n - m + 1 possible values of s.

NAIVE-STRING-MATCHER (T, P)

(1) n = T.length

(2) m = P.length

(3) for s = 0 to n - m

(4)     if P[1..m] == T[s+1..s+m]

(5)         print “Pattern occurs with shift” s

2.3 How It Works [5]

Figure 3 portrays the naive string-matching procedure as sliding a “template” containing the pattern over the text, noting for which shifts all of the characters on the template equal the corresponding characters in the text. The for loop of lines 3-5 considers each possible shift explicitly. The test on line 4 determines whether the current shift is valid; this test involves an implicit loop that checks corresponding character positions until all positions match successfully or a mismatch is found. Line 5 prints out each valid shift s.
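The procedure can be sketched in Python with the implicit loop of line 4 written out explicitly. This is an illustrative sketch rather than the source's own code: the function name is an assumption, indices are 0-based, and valid shifts are collected in a list instead of being printed.

```python
def naive_string_matcher(T: str, P: str) -> list:
    """Return every valid shift of P in T, trying each shift in turn."""
    n, m = len(T), len(P)
    shifts = []
    for s in range(n - m + 1):          # pseudocode line 3: s = 0 .. n - m
        j = 0
        # The "implicit loop" of line 4: compare until a mismatch or the end.
        while j < m and P[j] == T[s + j]:
            j += 1
        if j == m:                      # all m positions matched
            shifts.append(s)            # line 5 prints; here we collect
    return shifts


print(naive_string_matcher("PRQPPQR", "PPQ"))  # [3]
```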

Procedure NAIVE-STRING-MATCHER takes time O((n - m + 1)m), and this bound is tight in the worst case. For example, consider the text string a^n (a string of n a's) and the pattern a^m. For each of the n - m + 1 possible values of the shift s, the implicit loop on line 4 that compares corresponding characters must execute m times to validate the shift. The worst-case running time is thus Θ((n - m + 1)m), which is Θ(n^2) if m = ⌊n/2⌋. Because it requires no preprocessing, the running time of NAIVE-STRING-MATCHER equals its matching time.


2.4 Example [3]

(a) s = 0
    T:  P R Q P P Q R
        | x
    P:  P P Q

(b) s = 1
    T:  P R Q P P Q R
          x
    P:    P P Q

(c) s = 2
    T:  P R Q P P Q R
            x
    P:      P P Q

(d) s = 3
    T:  P R Q P P Q R
              | | |
    P:        P P Q

Figure 3 The operation of the naive string matcher for the pattern P = PPQ and the text T = PRQPPQR. We can imagine the pattern P as a “template” that we slide next to the text. (a)-(d) The first four alignments tried by the naive string matcher. In each part, a vertical bar marks a character found to match, and an x marks the first mismatched character found, if any. One occurrence of the pattern is found, at shift s = 3.

As we shall see, NAIVE-STRING-MATCHER is not an optimal procedure for this problem. Indeed, algorithms exist, such as the Knuth-Morris-Pratt algorithm listed in Section 1.2, with a worst-case preprocessing time of Θ(m) and a worst-case matching time of Θ(n). The naive string matcher is inefficient because information gained about the text for one value of s is entirely ignored in considering other values of s. Such information can be very valuable, however. For example, if P = aaab and we find that s = 0 is valid, then none of the shifts 1, 2, or 3 are valid, since T[4] = b. More refined algorithms make effective use of this sort of information.

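Time Complexity Analysis [6]

Worst case:

– Outer loop: n - m + 1 iterations

– Inner loop: up to m comparisons per iteration

– Total: (n - m + 1)m = O(nm)

Best case:

– n - m + 1 comparisons, when the first character mismatches at every shift

Completely random text and pattern:

– O(n - m + 1) expected comparisons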
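The worst-case and best-case counts above can be observed by instrumenting the naive matcher. The sketch below is illustrative only: the helper name count_comparisons and the sizes n = 12, m = 4 are assumptions chosen for the demonstration.

```python
def count_comparisons(T: str, P: str) -> int:
    """Count character comparisons made by the naive matcher on (T, P)."""
    n, m = len(T), len(P)
    comparisons = 0
    for s in range(n - m + 1):
        for j in range(m):
            comparisons += 1
            if P[j] != T[s + j]:
                break                   # stop at the first mismatch
    return comparisons


n, m = 12, 4
worst = count_comparisons("a" * n, "a" * m)   # every shift is valid
best = count_comparisons("a" * n, "b" * m)    # first character always mismatches
print(worst, best)  # 36 9
```

With these sizes the worst case performs (n - m + 1)m = 36 comparisons, while the best case performs only n - m + 1 = 9.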


CHAPTER 3

THE RABIN-KARP ALGORITHM

3.1 What is the Rabin-Karp Algorithm?

Like the naive algorithm, the Rabin-Karp algorithm slides the pattern over the text one position at a time. Unlike the naive algorithm, however, it first compares the hash value of the pattern with the hash value of the current substring of the text, and only if the hash values match does it compare the individual characters [2].

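3.2 Algorithm for Rabin-Karp [5]

Equal hash values do not guarantee a match: a shift s at which the hash of the text window equals the hash of the pattern may be a spurious hit, so it must be tested further by explicitly checking the condition P[1..m] = T[s+1..s+m]. If q is large enough, then we can hope that spurious hits occur infrequently enough that the cost of the extra checking is low.

The following procedure makes these ideas precise. The inputs to the procedure are the text T, the pattern P, the radix d to use (which is typically taken to be |∑|), and the prime q to use.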

RABIN-KARP-MATCHER (T, P, d, q)

1) n = T.length

2) m = P.length

3) h = d^(m-1) mod q

4) p = 0

5) t_0 = 0

6) for i = 1 to m // Preprocessing

7)     p = (d p + P[i]) mod q

8)     t_0 = (d t_0 + T[i]) mod q

9) for s = 0 to n - m // Matching

10)     if p == t_s

11)         if P[1..m] == T[s+1..s+m]

12)             print “Pattern occurs with shift” s

13)     if s < n - m

14)         t_(s+1) = (d(t_s - T[s+1] h) + T[s+m+1]) mod q


3.3 How It Works

The procedure RABIN-KARP-MATCHER works as follows. All characters are interpreted as radix-d digits. The subscripts on t are provided only for clarity; the program works correctly if all the subscripts are dropped. Line 3 initializes h to the value of the high-order digit position of an m-digit window. Lines 4-8 compute p as the value of P[1..m] mod q and t_0 as the value of T[1..m] mod q. The for loop of lines 9-14 iterates through all possible shifts s, maintaining the following invariant [5]:

Whenever line 10 is executed, t_s = T[s+1..s+m] mod q.
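The invariant can be checked numerically. The following is a minimal sketch, assuming decimal digits (d = 10), the modulus q = 11, and a window length m = 2, all chosen purely for illustration; variable names follow the pseudocode, and indices are 0-based.

```python
T = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]   # text as radix-10 digits
d, q, m = 10, 11, 2                      # illustrative parameters
n = len(T)
h = pow(d, m - 1, q)                     # d^(m-1) mod q, as in line 3

# t_0: hash of the first window, built digit by digit (lines 6-8).
t = 0
for i in range(m):
    t = (d * t + T[i]) % q

for s in range(n - m + 1):
    # Invariant: t equals the value of the window T[s+1..s+m] modulo q.
    window_value = int("".join(map(str, T[s:s + m])))
    assert t == window_value % q
    if s < n - m:
        # Line 14: drop the high-order digit, shift, append the next digit.
        t = (d * (t - T[s] * h) + T[s + m]) % q
print("invariant holds for every shift")
```

Python's % operator returns a non-negative result for a positive modulus, so the subtraction in the update needs no extra correction here; in languages where the remainder can be negative, q would have to be added back.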

RABIN-KARP-MATCHER takes Θ(m) preprocessing time, and its matching time is Θ((n - m + 1)m) in the worst case, since (like the naive string-matching algorithm) the Rabin-Karp algorithm explicitly verifies every valid shift. If P = a^m and T = a^n, then the verifications take time Θ((n - m + 1)m), since each of the n - m + 1 possible shifts is valid [6].
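Putting the pieces together, the procedure might be written in Python as follows. This is a sketch under stated assumptions, not the source's implementation: characters are mapped to digits with ord, the defaults d = 256 and q = 101 are illustrative choices, shifts are returned in a list rather than printed, and an empty or over-long pattern simply yields no shifts.

```python
def rabin_karp_matcher(T: str, P: str, d: int = 256, q: int = 101) -> list:
    """Return all valid shifts of P in T using a rolling hash modulo q."""
    n, m = len(T), len(P)
    if m == 0 or m > n:                  # edge cases: assumed to yield no shifts
        return []
    h = pow(d, m - 1, q)                 # weight of the high-order digit
    p = t = 0
    for i in range(m):                   # preprocessing: Theta(m)
        p = (d * p + ord(P[i])) % q
        t = (d * t + ord(T[i])) % q
    shifts = []
    for s in range(n - m + 1):           # matching
        # Equal hashes are only a hint; verify to rule out a spurious hit.
        if p == t and T[s:s + m] == P:
            shifts.append(s)
        if s < n - m:
            t = (d * (t - ord(T[s]) * h) + ord(T[s + m])) % q
    return shifts


print(rabin_karp_matcher("ABACCACCBA", "CCACC"))  # [3]
print(rabin_karp_matcher("aaaa", "aa"))           # [0, 1, 2]
```

The second call illustrates the worst case described above: every shift is valid, so every window is verified character by character.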

3.4 Example [4]

With pattern P = 26, how many spurious hits does the Rabin-Karp matcher encounter in the text T = 3 1 4 1 5 9 2 6 5 3 5?

T = 3 1 4 1 5 9 2 6 5 3 5

P = 2 6

Here the modulus is taken as q = 11 (equal to T.length in this example), so P mod q = 26 mod 11 = 4.

For each shift s, compare the two-digit window T[s+1..s+2] mod q with P mod q:

s    Window   Window mod 11   Result
0    31       9               not equal to 4
1    14       3               not equal to 4
2    41       8               not equal to 4
3    15       4               equal to 4, SPURIOUS HIT
4    59       4               equal to 4, SPURIOUS HIT
5    92       4               equal to 4, SPURIOUS HIT
6    26       4               equal to 4, EXACT MATCH
7    65       10              not equal to 4
8    53       9               not equal to 4
9    35       2               not equal to 4

NOTE: A spurious hit occurs when the hash values match but the window is not an actual occurrence of the pattern. When this happens, further testing (a character-by-character comparison) is done.

The matcher encounters 3 spurious hits (at shifts 3, 4, and 5), and the pattern occurs with shift 6.
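The example can be reproduced with a few lines of Python. This sketch recomputes each window's hash directly instead of rolling it, which is enough to classify the hits; the variable names are illustrative assumptions.

```python
T = "31415926535"
P = "26"
q = 11
m = len(P)

p_hash = int(P) % q                      # 26 mod 11 = 4
spurious, matches = [], []
for s in range(len(T) - m + 1):
    window = T[s:s + m]
    if int(window) % q == p_hash:        # hash hit: needs verification
        (matches if window == P else spurious).append(s)

print(p_hash, spurious, matches)  # 4 [3, 4, 5] [6]
```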

Time Complexity Analysis

The running time of the Rabin-Karp matcher is Θ((n-m+1)m) in the worst case, since the Rabin-Karp algorithm explicitly verifies each valid shift. If P = a^m and T = a^n, then the verifications take time Θ((n-m+1)m), since each of the n-m+1 possible shifts is valid [1]. In many applications, we expect only a few valid shifts (perhaps O(1) of them), and so the expected running time of the algorithm is O(n+m) plus the time required to process spurious hits [1].

CHAPTER 4

COMPARISON OF ALGORITHMS

4.1 Between Naive algorithm and Rabin-Karp algorithm

The naive string-matching algorithm slides the pattern over the text one position at a time. After each slide, it checks the characters at the current shift one by one and, if all of them match, reports the match [1].

Like the naive algorithm, the Rabin-Karp algorithm also slides the pattern one position at a time. Unlike the naive algorithm, however, it compares the hash value of the pattern with the hash value of the current substring of the text, and only if the hash values match does it compare the individual characters [2].

The Rabin-Karp algorithm gives a better expected running time, O(n+m), than the naive brute-force string-matching algorithm, O((n-m+1)m). However, the Rabin-Karp algorithm can be very slow if the text contains many false matches: substrings that hash to the same number as the pattern cause expensive string comparisons to be performed. Several tricks shown in the paper make it faster, and implementations of the Rabin-Karp algorithm take these tricks into account. The first step is to choose a hashing function. The next is to compute h = d^(m-1) mod q, which is used later to determine what amount to subtract from the hash value as characters are “shifted off” to the left. The next step is to hash, in full, the pattern and the substring of length m at shift zero of the text. This is the only time a substring of the text is hashed directly with the hashing function; all subsequent hashes are computed by the rolling method described in the paper. Before working our way down the string, however, we must check whether shift zero itself is a match. If it is, and only the first occurrence is wanted, we are finished. Otherwise, we loop through the rest of the string, computing each hash from the previous one [6].

CHAPTER 5

ADVANTAGES OF STRING MATCHING

5.1 ADVANTAGES

– Text-editing programs frequently need to find all occurrences of a pattern in the text. Typically, the text is a document being edited, and the pattern searched for is a particular word supplied by the user [5].

– Efficient string-matching algorithms can greatly aid the responsiveness of the text-editing program [5].

– Among their many other applications, string-matching algorithms search for particular patterns in DNA sequences. Internet search engines also use them to find Web pages relevant to queries [5].

CHAPTER 6

CONCLUSION

The Rabin-Karp algorithm gives a better expected running time, O(n+m), than the naive brute-force string-matching algorithm, O((n-m+1)m). However, the Rabin-Karp algorithm can be very slow if the text contains many false matches: substrings that hash to the same number as the pattern cause expensive string comparisons to be performed. Several tricks shown in the paper make it faster, and implementations of the Rabin-Karp algorithm take these tricks into account. The first step is to choose a hashing function. The next is to compute h = d^(m-1) mod q, which is used later to determine what amount to subtract from the hash value as characters are “shifted off” to the left. The next step is to hash, in full, the pattern and the substring of length m at shift zero of the text. This is the only time a substring of the text is hashed directly with the hashing function; all subsequent hashes are computed by the rolling method described in the paper. Before working our way down the string, however, we must check whether shift zero itself is a match. If it is, and only the first occurrence is wanted, we are finished. Otherwise, we loop through the rest of the string, computing each hash from the previous one.

CHAPTER 7

REFERENCES

1. http://www.cs.utexas.edu/~eberlein/cs337/patternMatching.pdf

2. http://www.geeksforgeeks.org/searching-for-patterns-set-3-rabin-karp-algorithm/

3. http://www.youtube.com/watch?v=uhMFMBpKih4

4. http://www.youtube.com/watch?v=M_XpGQyyqIQ

5. Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest and Clifford Stein, Introduction to Algorithms

6. http://compalg.inf.elte.hu/~tony/Oktatas/Parhuzamos-algoritmusok/List%20ranking/Almaangolul.pdf
