

Extending Boyer-Moore Algorithm to an Abstract String Matching Problem

Liwei Ren

Data Center Research

Trend Micro

Cupertino, USA

e-mail: [email protected]

Abstract: The bad character shift rule of the Boyer-Moore string search algorithm is studied in this paper for the purpose of extending it to more general string match problems. An abstract string match problem is defined. An optimized string match algorithm based on the bad character heuristic is proposed to solve the abstract match problem efficiently.

Keywords: pattern; string; sequence; search; match; bad character; Boyer-Moore

I. INTRODUCTION

String searching is a classic problem in many text processing applications. Among the many string searching algorithms, the Boyer-Moore algorithm [1] is a particularly efficient one for single pattern string matching. It uses both the good suffix shift and the bad character heuristic to accelerate the match. Two shift tables are built to determine how far to shift the pattern after a match fails, and the algorithm shifts the pattern by the larger of the two values.

The Horspool algorithm [2] is the best known variant of the Boyer-Moore algorithm. It uses only the bad character heuristic to build its shift table. Other variants include the algorithms given by Raita [3] and Sunday [4].

In summary, the essence of all Boyer-Moore style algorithms is to skip as many unnecessary character comparisons as possible.

If we introduce the concept of a match window as a substring of the reference string, the naïve string searching algorithm is basically a sliding window algorithm with N-M+1 match windows, where N and M are the lengths of the reference string and the pattern respectively. In practice, the Boyer-Moore algorithm examines only a few candidate match windows that could possibly contain the target string. It does so by ruling out the many windows that definitely contain no target substring.

The bad character shift of the Boyer-Moore algorithm can take a weaker form, which we call character identity verification: it verifies only whether a given character in the reference string belongs to the alphabet of the search pattern.
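As a rough illustration of the difference, the sketch below builds a Horspool-style bad character shift table next to the weaker membership test. The function names `horspool_shift_table` and `membership_table` and the sample pattern are assumptions made for this example; they do not come from the paper.

```python
def horspool_shift_table(pattern: str) -> dict:
    """Horspool-style bad character shifts: for each character, the distance
    from its rightmost occurrence (last position excluded) to the pattern end."""
    last = len(pattern) - 1
    # Later occurrences overwrite earlier ones, so the rightmost one wins.
    return {ch: last - i for i, ch in enumerate(pattern[:-1])}


def membership_table(pattern: str) -> frozenset:
    """Weaker form: only records which characters occur in the pattern."""
    return frozenset(pattern)


pattern = "abcab"
shifts = horspool_shift_table(pattern)
alphabet = membership_table(pattern)

for ch in "abcx":
    # Characters missing from the shift table get the full pattern length.
    print(ch, shifts.get(ch, len(pattern)), ch in alphabet)
```

The shift table tells the search how far to move, while the membership set only answers yes or no. That yes/no answer is all that remains available once the pattern is a regular expression rather than a fixed string.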

We can extend the concepts of match window and character identity verification to other string match problems, for instance the regular expression based pattern match problem, which has many applications in practice.

This paper proposes an abstract string match problem that includes two classic string matching problems, single pattern string search and regular expression pattern match, as special cases.

An efficient algorithm is constructed to solve the abstract problem based on the concepts of match window and character identity verification.

II. A GENERAL PROBLEM OF STRING MATCH

In this section, we use an abstract model to present string match problems in more general terms. This model covers many practical problems beyond single pattern string searching and regular expression based string matching.

Before we define the problem, let us make the following observations about classic string match problems:

1. The target string has a small alphabet S compared to the whole character space. In the single pattern string search problem, S consists of all unique characters of the pattern string. In regular expression matching, most entities defined by regular expression patterns in practical applications have small alphabets as well. Examples of these entities include IP addresses, dates, credit card numbers, bank account numbers and ID numbers.

2. The target strings have well-defined minimum and maximum lengths. This is obvious for the single pattern search problem. For regular expression matching, it is not uncommon that these two numbers can be pre-defined. For example, to match a MasterCard credit card number in a text, the minimum length is 16, while the maximum length can be defined as 19 if one also includes the format dddd-dddd-dddd-dddd.


Pattern Match Function: For any given reference string R and match window R[s,e], a pattern match function F extracts a target string from the window R[s,e], based on well-defined matching rules, if there is one; otherwise it returns NIL. The function is denoted as F(R,s,e). The match mechanism is defined inside F itself.

Abstract Problem of String Match: The string match problem is to retrieve all target substrings from a given reference string R[1,…,N] with a pattern match function F(R,s,e), where F defines what the target substrings should be, subject to the following conditions:

1. All target substrings consist of characters from a small alphabet S.

2. The length of each target substring falls in the interval [m,M], where m is the minimum length and M the maximum.

Both single pattern string search and regular expression pattern search are special cases of this abstract match problem.

Yet another example is the problem of regular expression pattern match with checksum validation, which requires that every target substring also be validated by a checksum. This example is useful in data discovery systems for minimizing false positives.
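To make the role of F more concrete, here is a minimal sketch of a pattern match function built from a regular expression with an optional checksum callback. The factory name `make_regex_f`, the `validate` parameter, and the toy checksum (digit sum divisible by 5) are all hypothetical choices for this example. The code uses Python's 0-based, half-open indices rather than the 1-based inclusive R[s,e] of the text, and returns `None` in place of NIL.

```python
import re
from typing import Callable, Optional, Tuple

Span = Tuple[int, int]  # half-open [start, end) positions in R


def make_regex_f(pattern: str,
                 validate: Optional[Callable[[str], bool]] = None):
    """Build F(R, s, e) from a regular expression and an optional checksum."""
    rx = re.compile(pattern)

    def F(R: str, s: int, e: int) -> Optional[Span]:
        # Only look inside the match window R[s:e].
        for hit in rx.finditer(R, s, e):
            if validate is None or validate(hit.group()):
                return hit.span()
        return None  # plays the role of NIL

    return F


def toy_checksum(digits: str) -> bool:
    """Hypothetical checksum: the digit sum must be a multiple of 5."""
    return sum(int(d) for d in digits) % 5 == 0


R = "ab1234cd1235"
F_plain = make_regex_f(r"\d{4}")
F_checked = make_regex_f(r"\d{4}", toy_checksum)

print(F_plain(R, 6, 12))    # regex alone accepts "1235"
print(F_checked(R, 6, 12))  # checksum rejects "1235" (digit sum 11)
print(F_checked(R, 0, 12))  # "1234" has digit sum 10 and passes
```

Because the matching rules live entirely inside F, the search loop that calls it needs to know only m, M and S.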

III. OPTIMIZED STRING MATCH ALGORITHM

A naïve algorithm to solve the abstract string match problem can be easily constructed. It is based on the mechanism of sliding match windows.

Naïve String Match Algorithm: Start from the first match window R[1,M] and call the match function F. If a match exists, save the target substring and move to the match window that begins immediately after it; otherwise, slide the match window one step further. Repeat until the reference string R is exhausted.

With the naïve string match, one goes through N-M+1 match windows if there is no target string at all. That is not efficient.
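The naïve procedure can be sketched as follows. This is an illustrative reading of the description above, not the author's code: `naive_match` and the three-digit matcher `F` are hypothetical names, indices are 0-based and half-open, and the call counter is added only to show how often F is invoked.

```python
import re
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # half-open [start, end)

_THREE_DIGITS = re.compile(r"\d{3}")


def F(R: str, s: int, e: int) -> Optional[Span]:
    """Example pattern match function: three consecutive digits in R[s:e]."""
    hit = _THREE_DIGITS.search(R, s, e)
    return hit.span() if hit else None


def naive_match(R: str, M: int, F) -> Tuple[List[Span], int]:
    """Slide a window of at most M characters and call F on every window."""
    matches, calls = [], 0
    s, N = 0, len(R)
    while s < N:
        e = min(s + M, N)
        calls += 1
        hit = F(R, s, e)
        if hit is None:
            s += 1                    # slide one step
        else:
            matches.append(hit)
            s = max(hit[1], s + 1)    # resume right after the match
    return matches, calls


matches, calls = naive_match("ab123cd456", 3, F)
print(matches, calls)
```

On this ten-character input the sketch calls F six times to find two matches, and every call pays the full cost of the matcher even on windows that contain letters.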

We can reduce the number of match windows if we are able to determine quickly that a match window cannot contain a target string. That can be done with character identity verification. Let us construct the optimized algorithm as follows.

Optimized String Match Algorithm:

Input: minimum length m, maximum length M, target string alphabet S, pattern match function F, reference string R[1,…,N]

Matching Procedure:

Step 1: Set s = 1.

Step 2: Let r = MIN(s + M - 1, N).

Step 3: If r - s < m - 1, RETURN.

Step 4: Set the match window W = R[s,…,r].

Step 5: Set the sub-window w = R[s,…,s + m - 1]. Scanning w from right to left, find the rightmost character R[s + p] (0 ≤ p ≤ m - 1) that does not belong to S. If such a character exists, set s = s + p + 1 and go to Step 2.

Step 6: Otherwise, all characters of the sub-window w pass identity verification. Call the match function F(R,s,r):

a. If the result is NIL, let s = s + 1.

b. If a target substring is matched as R[t,e], save it and let s = e + 1.

Step 7: Go to Step 2.

Output: Matches
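One possible rendering of the matching procedure in Python is sketched below. It assumes 0-based, half-open indices, so the window is `R[s:r]` and the sub-window is `R[s:s+m]`. It also assumes that a failed identity check moves the window start to the position just after the offending character. The names `optimized_match` and `F`, and the call counter, are illustrative and not part of the paper.

```python
import re
from typing import FrozenSet, List, Optional, Tuple

Span = Tuple[int, int]  # half-open [start, end)

_THREE_DIGITS = re.compile(r"\d{3}")


def F(R: str, s: int, e: int) -> Optional[Span]:
    """Example pattern match function: three consecutive digits in R[s:e]."""
    hit = _THREE_DIGITS.search(R, s, e)
    return hit.span() if hit else None


def optimized_match(R: str, m: int, M: int, S: FrozenSet[str],
                    F) -> Tuple[List[Span], int]:
    matches, calls = [], 0
    s, N = 0, len(R)                       # Step 1
    while True:
        r = min(s + M, N)                  # Step 2: window is R[s:r]
        if r - s < m:                      # Step 3: window too short
            return matches, calls
        # Step 5: scan the sub-window R[s:s+m] from right to left.
        bad = next((i for i in range(s + m - 1, s - 1, -1)
                    if R[i] not in S), None)
        if bad is not None:
            s = bad + 1                    # skip past the bad character
            continue
        calls += 1                         # Step 6: run the real matcher
        hit = F(R, s, r)
        if hit is None:
            s += 1
        else:
            matches.append(hit)
            s = max(hit[1], s + 1)


digits = frozenset("0123456789")
matches, calls = optimized_match("ab123cd456", 3, 3, digits, F)
print(matches, calls)
```

Skipping past the bad character is safe because any target starting at or before that position has length at least m and would therefore have to contain it.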

IV. ANALYSIS OF THE ALGORITHM

The algorithm starts with the first match window defined by Step 1. The key step for optimization is Step 5, which performs identity verification on the characters of the sub-window w. The verification proceeds character by character from the right end of the sub-window. When a character fails the verification, we slide the match window ahead by multiple steps instead of one. This step is somewhat like Raita's [3] multiple point checking. It may cost extra time when a target substring does exist in the window; in most cases, however, it reduces the number of match windows by shifting multiple steps. The best case is a shift of m steps, which happens, for instance, when no character in w belongs to S. Step 6 performs the pattern match. If the match fails, unlike the Boyer-Moore or Horspool algorithms, there is no shift table that advises shifting more than one step.

The optimized algorithm is not designed to outperform the Boyer-Moore algorithm or its variants for single pattern string matching. Instead, its purpose is to extend the bad character shift rule to a more general case. This extension has immediate applications in two special pattern match problems:

1. Regular expression pattern match.

2. Regular expression pattern match with checksum validation.

Example 1: One needs to find all social security numbers (SSN) in a text, with the regular expression pattern defined as \d{9}|\d{3}-\d{2}-\d{4}. The alphabet S={0,1,2,3,4,5,6,7,8,9,-} has 11 characters. The minimum and maximum lengths of an SSN are 9 and 11 respectively. In the best case, we do not need to apply the regular expression pattern match at all, namely when the text contains no digits or hyphens.
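The effect of identity verification in the SSN setting can be sketched with a small helper that lists the start positions whose m-character sub-window lies entirely in S. Only those positions would ever reach the regular expression. The helper name `surviving_starts` and the sample texts are assumptions for illustration, and the pattern is written in the 3-2-4 digit form that matches the stated maximum length of 11.

```python
import re

SSN = re.compile(r"\d{9}|\d{3}-\d{2}-\d{4}")
S = frozenset("0123456789-")
m, M = 9, 11  # minimum and maximum SSN lengths


def surviving_starts(R: str, m: int, S: frozenset) -> list:
    """Start positions whose m-character sub-window passes identity verification."""
    return [s for s in range(len(R) - m + 1)
            if all(ch in S for ch in R[s:s + m])]


plain = "no numbers here"
with_ssn = "id 123-45-6789."

print(surviving_starts(plain, m, S))     # best case: nothing to match
print(surviving_starts(with_ssn, m, S))  # only a few candidate positions
print(SSN.search(with_ssn).group())
```

For the first text no position survives, so the regular expression engine would never be invoked.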

Example 2: One needs to find MasterCard or Visa credit card numbers (CCN) in a text, with the regular expression pattern defined as \d{16}|\d{4}-\d{4}-\d{4}-\d{4}. The alphabet S={0,1,2,3,4,5,6,7,8,9,-} has 11 characters. The minimum and maximum lengths of a CCN are 16 and 19 respectively. The checksum applies the Luhn algorithm [5] to validate the CCN.
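A pattern match function for this example might combine the regular expression with a Luhn check, as in the sketch below. The names `luhn_valid` and `F` are hypothetical, indices are 0-based and half-open, and the sample number is the widely used test value 4111 1111 1111 1111 rather than a real card.

```python
import re
from typing import Optional, Tuple

CCN = re.compile(r"\d{16}|\d{4}-\d{4}-\d{4}-\d{4}")


def luhn_valid(number: str) -> bool:
    """Luhn check: double every second digit from the right, sum, test mod 10."""
    digits = [int(c) for c in number if c.isdigit()]  # ignore hyphens
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return bool(digits) and total % 10 == 0


def F(R: str, s: int, e: int) -> Optional[Tuple[int, int]]:
    """Return the span of the first Luhn-valid CCN in R[s:e], or None."""
    for hit in CCN.finditer(R, s, e):
        if luhn_valid(hit.group()):
            return hit.span()
    return None


good = "card 4111-1111-1111-1111 ok"
bad = "card 4111-1111-1111-1112 ok"
print(F(good, 0, len(good)))
print(F(bad, 0, len(bad)))
```

The second text matches the regular expression but fails the checksum, which is the kind of false positive the validation step is meant to remove.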

V. PROBLEM OF MATCHING SEQUENCE OF OBJECTS

This paper has focused on the problem of string search. Because both the problem and the solution were discussed in general terms, the abstract string match problem can be extended to a more general one. This is the problem of sequence match, where a sequence is a sequence of objects and a subsequence is a consecutive run of objects within it. We achieve this by generalizing two basic concepts, character and string: we use object instead of character and sequence instead of string. The pattern match function, the abstract problem of sequence match and the optimized algorithm can then be introduced accordingly. It is not yet clear whether this further abstraction has any practical implication. However, it deserves theoretical consideration.
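As a speculative sketch of this generalization, the loop below works on any Python sequence and replaces the alphabet S with a membership predicate. The names `match_sequence`, `member` and `even_run`, and the toy task of finding runs of even integers, are all assumptions made for illustration; indices are 0-based and half-open.

```python
from typing import Callable, List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")
Span = Tuple[int, int]  # half-open [start, end)


def match_sequence(R: Sequence[T], m: int, M: int,
                   member: Callable[[T], bool],
                   F: Callable[[Sequence[T], int, int], Optional[Span]]
                   ) -> List[Span]:
    """Object-level analogue of the optimized string match procedure."""
    out: List[Span] = []
    s = 0
    while min(s + M, len(R)) - s >= m:
        r = min(s + M, len(R))
        # Identity verification on the first m objects, right to left.
        bad = next((i for i in reversed(range(s, s + m))
                    if not member(R[i])), None)
        if bad is not None:
            s = bad + 1
            continue
        hit = F(R, s, r)
        if hit is None:
            s += 1
        else:
            out.append(hit)
            s = max(hit[1], s + 1)
    return out


def even_run(R: Sequence[int], s: int, e: int) -> Optional[Span]:
    """Toy match function: a run of at least two even numbers starting at s."""
    k = s
    while k < e and R[k] % 2 == 0:
        k += 1
    return (s, k) if k - s >= 2 else None


data = [7, 2, 4, 9, 6, 8, 10, 3]
runs = match_sequence(data, 2, 3, lambda x: x % 2 == 0, even_run)
print(runs)
```

Nothing in the loop depends on the objects being characters, which is the point of the abstraction.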

VI. CONCLUSION

We presented a general string match problem and an optimized algorithm for it, inspired by the bad character shift rule of the Boyer-Moore string search algorithm. The abstract nature of the problem allows us to include both single pattern string search and regular expression pattern match as special cases.

While the optimized algorithm is not better than Boyer-Moore type string search algorithms, it can be used for match optimization in other pattern problems, such as regular expression pattern match or regular expression pattern match with checksum validation. One can even use it for pattern match problems beyond strings of characters, such as sequences of objects, where the concept of object can be very general.

ACKNOWLEDGMENT

Special thanks to Joe Lin, the engineering site director at Trend Micro, for his support. Without his sponsorship, this research work would not have been possible.

REFERENCES

[1] R. Boyer, J. Moore, "A fast string searching algorithm", Comm. ACM, vol. 20, pp. 762–772, 1977

[2] R. Horspool, "Practical fast searching in strings", Software - Practice & Experience, vol. 10(6), pp. 501–506, 1980

[3] T. Raita, "Tuning the Boyer–Moore–Horspool String Searching Algorithm", Software - Practice & Experience, vol. 22(10), pp. 879–884, 1992

[4] D. Sunday, "Very Fast Substring Search Algorithm", Comm. ACM, vol. 33(8), pp. 132–142, 1990

[5] http://en.wikipedia.org/wiki/Luhn_algorithm