rabin karp algorithm of pattern matching(goutam padhy)
Post on 28-Oct-2015
“RABIN KARP ALGORITHM OF PATTERN MATCHING”
USING C++ PROGRAMMING
Group no: 20
Project ID-15
Submitted By
GOUTAM PADHY Roll # ECE201018162
JUNE-2012
Under the guidance of
Mr. Asesh Tripathy
NATIONAL INSTITUTE OF SCIENCE & TECHNOLOGY
Palur Hills, Berhampur– 761008, Odisha, India
ACKNOWLEDGEMENT
It is our proud privilege to express our deepest gratitude and indebtedness to our guide, Mr. Asesh Tripathy, for his valuable guidance, time and support, and for his keen and sustained interest, intuitive ideas and persistent endeavor. His inspiring assistance, prompt feedback and affectionate care enabled us to work smoothly and successfully. We also extend our thanks to all our friends who helped us carry out our project and offered worthy remarks while we prepared this report.
We acknowledge with immense pleasure the sustained interest, encouraging attitude and constant inspiration rendered by Prof. Sangram Mudali, Director, N.I.S.T. His continued drive for better quality in everything that happens at N.I.S.T. and his selfless inspiration have always helped us to move ahead.
GOUTAM PADHY
INTRODUCTION
Text processors deal with data. A sequence of like items of data, such as bits, characters, or words, is a
string. A problem common to text-editing programs is to find all occurrences of a pattern in a text
or a record. Patterns have a context in which they apply. Most modern word (text) processors include
a ‘Find’ option. The text is the document being edited, and the pattern
searched for is a particular word supplied by the user [1]. String searching and matching algorithms are widely
used in text-editing programs.
THE GENERAL PROBLEM:
The string-searching problem can be defined in simple terms as follows: given a pattern P of length
m and a text string T of length n, determine whether or not P appears as a substring anywhere in T. If so,
return the offset of the start of each occurrence of P in T. If not, indicate that no such occurrence
exists. The elements of P and T are characters drawn from a finite alphabet, for example all valid
digits and letters. As an example, let P = “ring” and T = “String Searching using Rabin
Karp algorithm”. The string “ring” does appear in T, starting at index position 2 (using zero-based
indexing). We formalize the string-matching problem as follows. We assume that the text is an array
T[1..n] of length n and that the pattern is an array P[1..m] of length m. The character arrays P and T are often
called strings of characters. We say that the pattern P occurs with shift s in text T if 0 ≤ s ≤ n-m and
T[s+1..s+m] = P[1..m]. If P occurs with shift s in T, we call s a valid shift; otherwise we call s an
invalid shift. The string-matching problem is the problem of finding all valid shifts with which a given
pattern P occurs in a given text T.
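The definition of a valid shift can be illustrated with a short sketch. The helper name valid_shifts is hypothetical (it is not part of this report's code), and it simply leans on the standard std::string::find member to list every zero-based shift; it is meant only to make the definition concrete, not to show the Rabin-Karp method.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not from the report): returns every zero-based
// shift s at which `pattern` occurs in `text`.
std::vector<std::size_t> valid_shifts(const std::string& text,
                                      const std::string& pattern) {
    assert(!pattern.empty());  // an empty pattern would match at every shift
    std::vector<std::size_t> shifts;
    // find() returns std::string::npos once no further occurrence exists.
    for (std::size_t s = text.find(pattern); s != std::string::npos;
         s = text.find(pattern, s + 1)) {
        shifts.push_back(s);
    }
    return shifts;
}
```

For the example above, valid_shifts("String Searching using Rabin Karp algorithm", "ring") yields the single shift 2.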
MOTIVATION FOR THE PROJECT:
We were looking for a better pattern matching algorithm to search for a particular pattern
in D.N.A. sequences, whose length runs into lakhs of crores of characters. With a slow method it
would take years to find a particular pattern in such a sequence. While searching for a pattern matching
algorithm to solve this kind of problem, we found the Rabin-Karp algorithm to be a strong candidate, and this
motivated the project.
THEORY BEHIND THE CONCEPT
1. Requirements
2. Coding Main
3. Working principle
4. Hashing
5. Comparison of the two strings
6. Shifting of variable
7. The equivalent condition
8. Applications
REQUIREMENTS: -
Prior knowledge of the C language. Any C++ compiler, e.g. the Dev-C++ IDE.
CODING MAIN: -
The first thing a program should do is include its header files and define main. main acts as the entry point of the program and is the first function called; the value it returns is used by the operating system. The only headers we need at present are iostream and string. Including unnecessary header files increases the size of the code and the compile time. Next we create a parameterless main returning int, with no code in it at present, in the source file.
WORKING PRINCIPLE:-
The Rabin-Karp algorithm performs well in practice and also generalizes to algorithms for related problems, such as two-dimensional pattern matching. Its worst-case running time is O((n-m+1)*m). Let us assume that the characters used in string T come from the set S = { 0,1,2,...,9 }. We can then view a string of k consecutive characters as a k-digit decimal number. For example, the string "31415" corresponds to the number 31415.
Given a pattern P[1..m], let p denote its corresponding decimal value. Given a text T[1..n], we let t(s) denote the decimal value of length-m substring T[s+1..s+m] for s = 0,1,...,n-m. Obviously, t(s) = p if and only if T[s+1..s+m] = P[1..m].
We can compute p in O(m) time using Horner's Rule
p = P[m] + 10(P[m-1] + 10(P[m-2] + ... + 10 (P[2] + 10 P[1])...)).
The pseudo code for the above computation is
result = 0;
for i = 1 to m
    result = result * 10;
    result = result + P[i]
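The Horner's Rule pseudo code above can be sketched in C++ as follows. The function name horner_decimal is hypothetical, and the sketch assumes the string holds only the characters '0' to '9' and has at most 18 digits, so that the value fits in a 64-bit integer.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper: decimal value of a digit string via Horner's Rule.
// Assumes only '0'..'9' and at most 18 digits (fits in long long).
long long horner_decimal(const std::string& digits) {
    assert(digits.size() <= 18);
    long long result = 0;
    for (char c : digits) {
        assert(std::isdigit(static_cast<unsigned char>(c)));
        result = result * 10 + (c - '0');  // shift left one digit, then add
    }
    return result;
}
```

Calling horner_decimal("31415") gives 31415, using one multiplication and one addition per character.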
We can compute all values t(s), for s = 0,1,...,n-m, in a total of O(n) time. The value of t(0) can be computed from T[1..m] in O(m) time, in the same way as p. To compute the remaining values t(1), t(2), ..., t(n-m), observe that t(s+1) can be computed from t(s) in constant time:
t(s+1) = 10*(t(s) - 10^(m-1) * T[s+1]) + T[s + m + 1].
For example, T = "123456" and m = 3
t(0) = 123
t(1) = 10*(123 - 100*1) + 4 = 234
Step by step explanation
First: remove the first digit: 123 - 100*1 = 23
Second: multiply by 10 to shift it: 23 * 10 = 230
Third: add the last digit: 230 + 4 = 234
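The three steps above (remove the first digit, shift, add the last digit) can be sketched as a loop that produces every t(s). The name window_values is hypothetical, and the sketch again assumes digit characters and a window of at most 18 digits so that no modulus is needed yet.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns t(0), t(1), ..., t(n-m) for a digit string.
// Assumes only '0'..'9' and m <= 18 so every value fits in long long.
std::vector<long long> window_values(const std::string& text, std::size_t m) {
    assert(m >= 1 && m <= text.size() && m <= 18);
    long long high = 1;  // 10^(m-1), the weight of the leading digit
    for (std::size_t i = 1; i < m; ++i) high *= 10;

    long long t = 0;     // t(0), computed with Horner's Rule
    for (std::size_t i = 0; i < m; ++i) t = t * 10 + (text[i] - '0');

    std::vector<long long> values{t};
    for (std::size_t s = 0; s + m < text.size(); ++s) {
        // Remove the leading digit, shift left, append the next digit.
        t = 10 * (t - high * (text[s] - '0')) + (text[s + m] - '0');
        values.push_back(t);
    }
    return values;
}
```

For T = "123456" and m = 3 this returns 123, 234, 345 and 456, each obtained from the previous value in constant time.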
The algorithm runs by comparing, t(s) with p. When t(s) = p, then we have found the substring P in T, starting from position s.
Generalization
The only problem with the above explanation is that t(s) and p may be too large to fit in any built-in data type. The solution is to perform all computations of p and t(s) modulo a suitable number q.
In general, for a d-ary alphabet { 0,1,...,d-1 } (for example, d = 26 for the alphabet { a, b, ..., z }), we choose q so that d*q fits within a computer word. We can work out p modulo q by using this pseudo code
result = 0;
for i = 1 to m
    result = (d*result + P[i]) mod q
and the computation of t(s+1) becomes
t(s+1) = (d*(t(s) - T[s+1]*h) + T[s+m+1]) mod q, where h = d^(m-1) mod q
The weakness of this method is that t(s) = p (mod q) does not imply that t(s) = p. On the other hand, if t(s) != p (mod q), we definitely know that t(s) != p. So we can use t(s) != p (mod q) as a fast heuristic test to rule out invalid shifts s. But if t(s) = p (mod q), we still have to check whether T[s+1..s+m] = P[1..m].
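The difference between a hash match and a real match can be made visible with a small sketch. The names SearchResult and rabin_karp_digits are hypothetical; the sketch assumes digit characters, zero-based shifts, and values of d and q small enough that d*q fits comfortably in a 64-bit integer. It counts how many shifts pass the modulo test but fail the character comparison.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical result type for this sketch.
struct SearchResult {
    std::vector<std::size_t> valid_shifts;  // zero-based shifts where P occurs
    std::size_t spurious_hits = 0;          // hash equal, strings different
};

// Hypothetical helper: Rabin-Karp over digit strings with radix d, modulus q.
SearchResult rabin_karp_digits(const std::string& text,
                               const std::string& pattern,
                               long long d, long long q) {
    const std::size_t n = text.size(), m = pattern.size();
    assert(m >= 1 && d >= 2 && q >= 2);
    SearchResult result;
    if (m > n) return result;

    long long h = 1;  // h = d^(m-1) mod q
    for (std::size_t i = 1; i < m; ++i) h = (h * d) % q;

    long long p = 0, t = 0;  // p mod q and t(0) mod q
    for (std::size_t i = 0; i < m; ++i) {
        p = (d * p + (pattern[i] - '0')) % q;
        t = (d * t + (text[i] - '0')) % q;
    }
    for (std::size_t s = 0; s + m <= n; ++s) {
        if (p == t) {  // fast test passed: verify character by character
            if (text.compare(s, m, pattern) == 0) result.valid_shifts.push_back(s);
            else ++result.spurious_hits;
        }
        if (s + m < n) {
            t = (d * (t - (text[s] - '0') * h) + (text[s + m] - '0')) % q;
            if (t < 0) t += q;  // % can yield a negative remainder in C++
        }
    }
    return result;
}
```

As an illustration chosen for this sketch, searching for "31415" in "2359023141526739921" with d = 10 and q = 13 reports one valid shift (6) and one spurious hit: the window "67399" also reduces to 7 modulo 13 but is rejected by the comparison.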
HASHING:-
The key to Rabin–Karp performance is the efficient computation of hash values of the successive substrings of the text. One popular and effective rolling hash function treats every substring as a number in some base, the base being usually a large prime. For example, if the substring is "hi" and the base is 101, the hash value would be 104 × 101^1 + 105 × 101^0 = 10609 (ASCII of 'h' is 104 and of 'i' is 105).
Technically, this representation is only similar to a true number in a non-decimal system, since, for example, the "base" could be smaller than one of the "digits". The essential benefit of such a representation is that it is possible to compute the hash value of the next substring from the previous one by doing only a constant number of operations, independent of the substrings' lengths.
For example, if we have the text "abracadabra" and we are searching for a pattern of length 3, we can compute the hash of "bra" from the hash of "abr" (the previous substring) by subtracting the number added for the first 'a' of "abr", i.e. 97 × 101^2 (97 is ASCII for 'a' and 101 is the base we are using), multiplying by the base and adding the term for the last 'a' of "bra", i.e. 97 × 101^0 = 97. If the substrings in question are long, this algorithm achieves great savings compared with many other hashing schemes.
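The "hi" and "abracadabra" examples can be checked with a short sketch. The names kBase, poly_hash and roll are hypothetical, and the sketch deliberately omits the modulus, so it is only safe for very short strings (at most 8 characters here); a practical version would reduce every step modulo q.

```cpp
#include <cassert>
#include <string>

const unsigned long long kBase = 101;  // the base used in the text's example

// Hypothetical helper: polynomial hash without a modulus.
// Only safe for short strings, hence the length check.
unsigned long long poly_hash(const std::string& s) {
    assert(s.size() <= 8);
    unsigned long long h = 0;
    for (unsigned char c : s) h = h * kBase + c;
    return h;
}

// Hypothetical helper: slides the window one character to the right.
// high_power must be kBase^(m-1) for a window of length m.
unsigned long long roll(unsigned long long hash, unsigned char outgoing,
                        unsigned char incoming, unsigned long long high_power) {
    return (hash - outgoing * high_power) * kBase + incoming;
}
```

Here poly_hash("hi") is 10609, and roll(poly_hash("abr"), 'a', 'a', 101 * 101) equals poly_hash("bra"), which is 1011309, without rehashing the whole window.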
Theoretically, there exist other algorithms that could provide convenient recomputation, e.g. multiplying together the ASCII values of all characters, so that shifting the substring would only entail dividing by the first character and multiplying by the last. The limitation, however, is the limited size of the integer data type and the necessity of using modular arithmetic to scale down the hash results. Meanwhile, naive hash functions that do not produce large numbers quickly, such as simply adding ASCII values, are likely to cause many hash collisions and hence slow down the algorithm. Hence the described hash function is typically the preferred one in Rabin–Karp.
USE OF HASHING FOR SHIFTING SUBSTRING SEARCH
Rather than pursuing more sophisticated skipping, the Rabin–Karp algorithm seeks to speed up the testing of equality of the pattern to the substrings in the text by using a hash function. A hash function is a function which converts every string into a numeric value, called its hash value; for example, we might have hash("hello")=5. Rabin–Karp exploits the fact that if two strings are equal, their hash values are also equal. Thus, it would seem all we have to do is compute the hash value of the substring we're searching for, and then look for a substring with the same hash value.
However, there are two problems with this. First, because there are so many different strings, to keep the hash values small we have to assign some strings the same number. This means that if the hash values match, the strings might not match; we have to verify that they do, which can take a long time for long substrings. Luckily, a good hash function promises us that on most reasonable inputs, this won't happen too often, which keeps the average search time good. Second, hashing every length-m substring of the text from scratch costs Θ(m) per substring, so the hash must be one that can be updated cheaply from one substring to the next.
IMPLEMENTATION DETAILS
The algorithm is as shown:
1 function RabinKarp(string s[1..n], string sub[1..m])
2     hsub := hash(sub[1..m]); hs := hash(s[1..m])
3     for i from 1 to n-m+1
4         if hs = hsub
5             if s[i..i+m-1] = sub
6                 return i
7         hs := hash(s[i+1..i+m])
8     return not found
Lines 2, 5, and 7 each require Θ(m) time. However, line 2 is only executed once, and line 5 is only executed if the hash values match, which is unlikely to happen more than a few times. Line 4 is executed n times, but only requires constant time. So the only problem is line 7.
If we naively recompute the hash value for the substring s[i+1..i+m], this would require Θ(m) time, and since this is done on each loop, the algorithm would require Ω(mn) time, the same as the most naive algorithms. The trick to solving this is to note that the variable hs already contains the hash value of s[i..i+m-1]. If we can use this to compute the next hash value in constant time, then our problem will be solved.
We do this using what is called a rolling hash. A rolling hash is a hash function specially designed to enable this operation. One simple example is adding up the values of each character in the substring. Then, we can use this formula to compute the next hash value in constant time:
hash(s[i+1..i+m]) = hash(s[i..i+m-1]) - s[i] + s[i+m]
This simple function works, but will result in statement 5 being executed more often than with more sophisticated rolling hash functions, such as the base-101 hash described in the Hashing section.
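The weakness of the additive rolling hash can be seen in a small sketch. The helper name additive_hash_hits is hypothetical; it counts how many windows have the same character sum as the pattern, which is the number of times the expensive comparison on line 5 would run.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: counts windows of `text` whose additive hash
// (sum of character codes) equals that of `pattern`.
std::size_t additive_hash_hits(const std::string& text,
                               const std::string& pattern) {
    const std::size_t n = text.size(), m = pattern.size();
    assert(m >= 1);
    if (m > n) return 0;

    unsigned long hsub = 0, hs = 0;
    for (std::size_t i = 0; i < m; ++i) {
        hsub += static_cast<unsigned char>(pattern[i]);
        hs += static_cast<unsigned char>(text[i]);
    }
    std::size_t hits = 0;
    for (std::size_t i = 0; i + m <= n; ++i) {
        if (hs == hsub) ++hits;  // each hit would trigger a full comparison
        if (i + m < n) {
            // Constant-time update: drop s[i], add s[i+m].
            hs = hs - static_cast<unsigned char>(text[i])
                    + static_cast<unsigned char>(text[i + m]);
        }
    }
    return hits;
}
```

Because a sum ignores character order, every anagram of the pattern collides with it: additive_hash_hits("abcabc", "cab") reports 4 hits even though "cab" occurs only once.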
Notice that if we're very unlucky, or have a very bad hash function such as a constant function, line 5 might very well be executed n times, on every iteration of the loop. Because it requires Θ(m) time, the whole algorithm then takes a worst-case Θ(mn) time.
Shifting substrings search and competing algorithms
A brute-force substring search algorithm checks all possible positions:
1 function NaiveSearch(string s[1..n], string sub[1..m])
2     for i from 1 to n-m+1
3         for j from 1 to m
4             if s[i+j-1] ≠ sub[j]
5                 jump to next iteration of outer loop
6         return i
7     return not found
This algorithm works well in many practical cases, but can exhibit relatively long running times on certain examples, such as searching for a string of 10,000 "a"s followed by a "b" in a string of 10 million "a"s, in which case it exhibits its worst-case Θ(mn) time.
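For comparison, the brute-force search can be sketched in C++ as below. The function name naive_search is hypothetical, and unlike the one-based pseudo code it returns a zero-based index, or std::string::npos when the pattern is absent.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: brute-force search, zero-based result.
// Returns std::string::npos if `sub` does not occur in `s`.
std::size_t naive_search(const std::string& s, const std::string& sub) {
    const std::size_t n = s.size(), m = sub.size();
    assert(m >= 1);
    for (std::size_t i = 0; i + m <= n; ++i) {
        std::size_t j = 0;
        // Compare character by character; stop at the first mismatch.
        while (j < m && s[i + j] == sub[j]) ++j;
        if (j == m) return i;  // all m characters matched
    }
    return std::string::npos;
}
```

On inputs such as a long run of "a"s followed by "b", the inner loop runs almost m steps at nearly every position, which is the Θ(mn) behaviour described above.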
The Knuth–Morris–Pratt algorithm reduces this to Θ(n) time using precomputation to examine each text character only once; the Boyer–Moore algorithm skips forward not by 1 character, but by as many as possible for the search to succeed, effectively decreasing the number of times we iterate through the outer loop, so that the number of characters examined can be as small as n/m in the best case. The Rabin–Karp algorithm focuses instead on speeding up lines 3-6.
APPLICATIONS
Parsers
Spam filters
Digital libraries
Screen scrapers
Word processors
Web search engines
Natural language processing
Computational molecular biology
Feature detection in digitized images
RESULT
We got the output after the execution of the programme.
FUTURE SCOPE
This algorithm can be used not only for string matching but also in graphics design, wherever pattern matching is required. Owing to limited time we have not shown this design in our project, but it can be implemented in the future wherever it is required.
CONCLUSION
The Rabin-Karp algorithm gives a better expected running time, O(N+M), than the naïve brute-force string matching algorithm, O((N-M+1)M). The Rabin-Karp algorithm can be very slow if the text contains a lot of false matches: substrings that hash to the same number as the pattern cause expensive string comparisons to be performed. A number of tricks, as shown in this report, make it faster, and the implementation of the Rabin-Karp algorithm takes these tricks into account. The first step is to come up with a hashing function. The next action is to compute h = d^(M-1) mod q, which will be used later to figure out what amount to subtract from the hash value as characters are "shifted off" to the left. The next action is to laboriously hash the pattern and the substring of length M at shift zero of the text. This is the only time we hash a substring of the text using the full hashing function; all subsequent hashes are computed from the previous one by the rolling method described above. Before we begin working our way down the string, however, we must check whether shift zero itself is a match. If it is, then we are finished. Otherwise, we loop through the rest of the string, computing each hash based on the previous one.
APPENDIX-1:CODE
#include <iostream>
#include <string>
using namespace std;

// Prints every shift at which pattern P occurs in text T.
// d is the radix and q the modulus; characters are assumed to be digits '0'..'9'.
void rabin_karp( const string& T, const string& P, int d, int q )
{
    int n = T.size();
    int m = P.size();
    long long h = 1;   // d^(m-1) mod q
    long long p = 0;   // hash of the pattern
    long long t = 0;   // hash of the current text window
    bool found = false;

    // Guard against an empty pattern or a pattern longer than the text.
    if( m == 0 || m > n ) {
        cout << "The Pattern is not found" << endl;
        return;
    }
    // Build h by repeated multiplication mod q; pow() loses precision for large m.
    for( int i = 0; i < m-1; i++ ) {
        h = ( h * d ) % q;
    }
    // Hash the pattern and the first window of the text with Horner's Rule.
    for( int i = 0; i < m; i++ ) {
        p = ( d * p + ( P[i] - '0' ) ) % q;
        t = ( d * t + ( T[i] - '0' ) ) % q;
    }
    for( int s = 0; s < n-m+1; s++ ) {
        // Equal hashes only suggest a match; confirm by comparing characters.
        if( p == t && T.compare( s, m, P ) == 0 ) {
            found = true;
            cout << "Pattern occurs with shift " << s << endl;
        }
        if( s < n-m ) {
            // Roll the hash: drop T[s], shift, and append T[s+m].
            t = ( d * ( t - ( T[s] - '0' ) * h ) + ( T[s+m] - '0' ) ) % q;
            if( t < 0 ) {
                t = q + t;
            }
        }
    }
    if( !found ) {
        cout << "The Pattern is not found" << endl;
    }
}

int main()
{
    int d;
    int q;
    string T;
    string P;
    cout << "Enter the test string: ";
    cin >> T;
    cout << "Enter the pattern: ";
    cin >> P;
    cout << "Enter the radix: ";
    cin >> d;
    cout << "Enter the number to be used as modulus: ";
    cin >> q;
    rabin_karp( T, P, d, q );
    return 0;
}
ALL RIGHTS ARE BEING RESERVED TO GOUTAM PADHY
COPY RIGHT PROTECTED@GOUTAM PADHY
FOR MORE DETAILS CONTACT:
Phone number:+918093611841(self)
+918093541377
EMAIL:goutampadhy162@gmail.com
gmrules162@gmail.com