
Succinct Backward-DAWG-Matching

KIMMO FREDRIKSSON

University of Kuopio

We consider single and multiple string matching in small space and optimal average time. Our algorithm is based on combining compressed self-indexes with the Backward-DAWG-Matching (BDM) algorithm. We consider several implementation techniques having different space/time and implementation complexity trade-offs. The experimental results show that our approach has much smaller space requirements than BDM, while being much faster and easier to implement. We show that some of our techniques can boost the search speed of compressed self-indexes as well, using a small amount of additional space.

Categories and Subject Descriptors: E.1 [Data structures]; E.2 [Data storage representations]; E.4 [Coding and information theory]: Data compaction and compression; F.2.2 [Analysis of algorithms and problem complexity]: Nonnumerical Algorithms and Problems—Pattern matching, Computations on discrete structures; H.3.3 [Information storage and retrieval]: Information Search and Retrieval—Search process

General Terms: Algorithms, Performance

Additional Key Words and Phrases: String matching, multiple string matching, text indexing, compression

1. INTRODUCTION

Background and previous work. We address the well-known exact string matching problem. The problem is to find the occurrences of the pattern P[1…m] in the text T[1…n], where the symbols of P and T are taken from some finite alphabet Σ of size σ. Numerous efficient algorithms solving the problem have been obtained. The first linear time algorithm (KMP) was given in [Knuth et al. 1977], and the first sublinear average time algorithm (BM) in [Boyer and Moore 1977]. Many practical variants of the BM family have been suggested, see e.g. [Horspool 1980; Sunday 1990]. An average-optimal O(n log_σ(m)/m) time algorithm (BDM, see also Sec. 4) is obtained e.g. in [Crochemore and Rytter 1994]; it also generalizes to searching r patterns simultaneously in average time O(n log_σ(rm)/m), which is again optimal [Navarro and Fredriksson 2004]. The Aho-Corasick algorithm [Aho and Corasick 1975]

Supported in part by the Academy of Finland. A preliminary version of this paper appeared as Technical Report A-2007-1, University of Joensuu, Department of Computer Science and Statistics. Kimmo Fredriksson, University of Kuopio, Department of Computer Science, P.O. Box 1627, FI-70211 Kuopio, Finland. Email: [email protected].


solves the same problem in O(n) worst case search time. Recently, bit-parallelism has been shown to lead to the most efficient algorithms in practice for relatively short patterns. The first algorithm in this class was Shift-Or [Baeza-Yates and Gonnet 1992; Wu and Manber 1992], which runs in O(n⌈m/w⌉) time, where w is the number of bits in the computer word (typically 32 or 64). For m = O(w) this is a linear time algorithm. Among the best algorithms is BNDM [Navarro and Raffinot 2000], which is a bit-parallel version of BDM. This achieves O(n log_σ(m)/m) average time for short patterns (m = O(w)), which is optimal. For long patterns the complexity becomes O(n⌈m/w⌉ log_σ(m)/m) = O(n log_σ(m)/w) by simulating long computer words, or O(n log_σ(w)/w) by using only the first w symbols of the pattern as a filter. This works well if O(nm/σ^w) = O(n log_σ(w)/w), i.e. the verification time is not dominating. These algorithms can also be generalized for multiple patterns, the best result being O(rn log_σ(w)/w) average time [Hyyrö et al. 2006]. This is not optimal, but in practice performs well for small r and m.

All the above algorithms need to preprocess the pattern and require extra space for data structures. Usually the extra space is O(rm) or O(rmσ) words, depending on the implementation details, or O(σ) words for the bit-parallel algorithms. Eliminating the σ factor usually costs an additional O(log min(m, σ)) factor in the search time. Another line of work is constant additional space algorithms [Galil and Seiferas 1981; Crochemore and Perrin 1991; Crochemore 1992; Crochemore and Rytter 1995]. These obtain O(n) worst case time for single patterns while using only O(1) words of additional space for the data structures, but they also need the original pattern, i.e. at least ⌈m log_2(σ)⌉ bits.

For more references on exact string matching algorithms see e.g. [Navarro and Raffinot 2002; Crochemore and Rytter 1994; 2002].

Another approach to string matching is indexing (also called off-line searching). In this case one is allowed to preprocess the text, so that given a pattern all its occurrences can be counted in close to O(m) time, and located in close to O(m + occ) time, where occ is the number of occurrences. One such (classical) data structure is the suffix tree [Weiner 1973], which informally is a path-compressed trie of all the text suffixes. This needs O(n) (or O(nσ)) words of space (i.e. at least O(n log_2(n)) bits), and the hidden constant can be very large. The text itself needs to be stored as well, taking another ⌈n log_2(σ)⌉ bits. The suffix array [Manber and Myers 1993] is, in its simplest form, basically a sorted array of (pointers/indexes to) the text suffixes, so that a pattern query becomes a binary search in that array. The constant factor is much smaller than for suffix trees, but for huge text collections it can still be too large. A recent trend in text indexing is succinct or compressed indexes (also called self-indexes if the original text is not needed) [Ferragina and Manzini 2000; Navarro and Mäkinen 2007]. These methods achieve space close to the information-theoretic lower bound. For more details see [Navarro and Mäkinen 2007] and Sec. 3.

Our contributions. Indexing methods are obviously more attractive than on-line searching. However, an index is not always available, and in some cases it is not even plausible to build one, e.g. the text might be inherently on-line, as in intrusion detection applications. Still, the number of patterns to be searched can be huge, as e.g. in anti-virus scanners. In this paper we propose a method that combines


BDM (for single or multiple patterns) with compressed self-indexes, resulting in an on-line string matching algorithm that has optimal average case search time and can operate in small space. The small space complexity is important on modern computers that have high cache miss costs. It has been experimentally shown that e.g. the AC algorithm, having O(n) worst case time for searching r patterns, has superlinear running time of the form c·n·f(r) in practice [Salmela et al. 2006], where c is a positive constant. This is attributed to the high memory requirements. Our algorithms have space complexity close to the information-theoretic lower bound of the pattern set. This is better than the complexity of the "constant space" algorithms, which also need the original pattern. Moreover, our algorithms work for multiple patterns as well. One of the problems of the original BDM algorithm is that the preprocessing (i.e. constructing a factor automaton) is hard to implement. Our approach has certain space/time trade-offs, depending on the implementation, and these also result in trade-offs in the implementation complexity. Some of the variants are very easy to implement, yet they still give better performance than the BDM algorithm in practice.

Organization. The paper is organized as follows. Sec. 2 gives the basic definitions. Sec. 3 reviews the indexing technique we are going to use in Sec. 4, which describes BDM and how it can use the compressed indexes. Sec. 5 describes practical implementation issues. Sec. 6 gives experimental results. Conclusions are given in Sec. 7. A short preliminary version of this paper appeared as a technical report [Fredriksson 2007].

2. PRELIMINARIES

Let the text T[1…n] and the pattern P[1…m] be strings over a finite ordered alphabet Σ = {0, …, σ−1}. The exact string matching problem is to find all occurrences of P in T. The pattern P occurs at position i of T if P[j] = T[i+j−1] for 1 ≤ j ≤ m.

String S[1…i] is a prefix of S, string S[i…n] is a suffix of S, and S[i…j] is a substring (factor) of S. Any of these can also be the empty string.

The function rank_c(S, i) for a sequence (string) S[1…n] gives the number of occurrences of character c ∈ Σ in the prefix S[1…i]. A special case arises when σ = 2 (binary alphabet). In this case rank_0(S, i) = i − rank_1(S, i).
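This definition translates directly into C. The following naive version (our illustration, not from the paper; the name rank_naive is ours) uses 0-based indexing, so rank_naive(S, i, c) equals the paper's rank_c(S, i); the later sketches reuse it where a real implementation would use the succinct structures of Sec. 5.

    #include <stddef.h>

    /* Number of occurrences of symbol c in the prefix S[0..i-1], i.e. the
       paper's rank_c(S, i) with 1-based positions.  For a binary alphabet
       rank_0(S, i) = i - rank_1(S, i), as noted in the text. */
    static size_t rank_naive(const unsigned char *S, size_t i, unsigned char c)
    {
        size_t count = 0;
        for (size_t k = 0; k < i; k++)
            if (S[k] == c)
                count++;
        return count;
    }

This O(i)-time version is used in the following sketches only for clarity; Sec. 5 discusses the constant-time, small-space implementations that the actual algorithms require.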

The zeroth-order empirical entropy of the string S is defined to be

    H_0(S) = −∑_{s∈Σ} (f(s)/n) log_2(f(s)/n),    (1)

where f(s) denotes the number of times s appears in S. The k-th order empirical entropy is

    H_k(S) = ∑_{X∈Σ^k} (f(X)/n) H_0(S_X),    (2)

where X is a substring of S, f(X) denotes the number of occurrences of X in S, and S_X is the concatenation of the symbols occurring in S just after the string X. The string X is called the context of the following symbol. It holds that H_k(S) ≤ H_0(S) ≤ log_2(σ). See also [Ferragina and Manzini 2005, Appendix A].
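As a concrete reading of Eq. (1), the following C sketch (ours, assuming a byte alphabet) computes H_0 of a string:

    #include <math.h>
    #include <stddef.h>

    /* Zeroth-order empirical entropy H_0(S) of S[0..n-1] per Eq. (1). */
    static double h0(const unsigned char *S, size_t n)
    {
        size_t f[256] = {0};
        double h = 0.0;
        for (size_t i = 0; i < n; i++)
            f[S[i]]++;
        for (int s = 0; s < 256; s++) {
            if (f[s] == 0)
                continue;                 /* the term 0*log(0) is taken as 0 */
            double p = (double)f[s] / (double)n;
            h -= p * log2(p);
        }
        return h;
    }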


P# = mississippi#  =⇒

     1  12  #mississipp i
     2  11  i#mississip p
     3   8  ippi#missis s
     4   5  issippi#mis s
     5   2  ississippi# m
     6   1  mississippi #
     7  10  pi#mississi p
     8   9  ppi#mississ i
     9   7  sippi#missi s
    10   4  sissippi#mi s
    11   6  ssippi#miss i
    12   3  ssissippi#m i

=⇒  P^bwt = ipssm#pissii

Fig. 1. Burrows-Wheeler transforming the string P. Left: the original string; middle: the matrix M (the last column of each row is set off for emphasis), with the left column showing the corresponding suffix array; right: P^bwt (the last column of M).

3. COMPRESSED SELF-INDEXING

A full-text index is a data structure that can be used to find all occurrences of a given pattern P in the (indexed) text T efficiently, i.e. without having to scan the text T itself. Classical indexes are e.g. the suffix tree [Weiner 1973] and the suffix array [Manber and Myers 1993]. Both of these data structures require O(n log_2 n) bits of space, and especially for the suffix tree the constant factor can be very large. The large space requirements make these impractical for large text collections. However, in this work we are more interested in good cache performance for relatively small inputs. One solution to the space problem was proposed by Ferragina and Manzini [Ferragina and Manzini 2000]. Their index (called the FM-index by its authors) and its many variations [Navarro and Mäkinen 2007] have three main traits: (i) the space complexity of the index is O(nH_k(T)) bits (proportional to the k-th order empirical entropy of the text), plus some low-order terms depending on the variant; (ii) the index can be used to retrieve any substring of the original text, i.e. the index can totally replace the text, hence the FM-index is often called a self-index; (iii) the usual search operations can still be performed in close to optimal time.

We now briefly review the FM-index, covering mainly the aspects that are needed in this work. We also present the method as indexing the pattern string P (not the text), as this is how we will use it. The pattern index is based on the Burrows-Wheeler transformation [Burrows and Wheeler 1994] of the original pattern, denoted P^bwt = BWT(P). Let # denote a special symbol that is lexicographically smaller than any other symbol in the alphabet. Then P^bwt is obtained as follows (see also Fig. 1):

(1) Generate all cyclic shifts of the string P#.
(2) Sort the generated strings into lexicographical order.
(3) Assume that the sorted strings form the rows of a matrix M.
(4) P^bwt is the last column of M.

Note that the cyclic shifts and the matrix M need not be explicitly generated; they are used only for the presentation.
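The construction can nevertheless be realized almost literally. The following C sketch (ours) sorts the start positions of the cyclic shifts with the standard library qsort(), in the same spirit as the implementation described in Sec. 6; a production version would use a linear-time suffix sorter.

    #include <stdlib.h>
    #include <stddef.h>

    static const unsigned char *g_str;  /* string being transformed */
    static size_t g_len;

    /* Lexicographic comparison of two cyclic shifts, identified by their
       starting positions; qsort() provides no context argument, hence the
       file-scope globals in this sketch. */
    static int cmp_shift(const void *a, const void *b)
    {
        size_t i = *(const size_t *)a, j = *(const size_t *)b;
        for (size_t k = 0; k < g_len; k++) {
            unsigned char x = g_str[(i + k) % g_len];
            unsigned char y = g_str[(j + k) % g_len];
            if (x != y)
                return (int)x - (int)y;
        }
        return 0;
    }

    /* P must already end with the sentinel '#'; sa[] receives the sorted
       shift start positions (the suffix array of Fig. 1), bwt[] the last
       column of M. */
    static void bwt_transform(const unsigned char *P, size_t len,
                              size_t *sa, unsigned char *bwt)
    {
        g_str = P;
        g_len = len;
        for (size_t i = 0; i < len; i++)
            sa[i] = i;
        qsort(sa, len, sizeof *sa, cmp_shift);
        for (size_t i = 0; i < len; i++)
            bwt[i] = P[(sa[i] + len - 1) % len];
    }

Since # is unique and lexicographically smallest, sorting the cyclic shifts of P# is equivalent to sorting the suffixes of P#, which is why the middle column of Fig. 1 is exactly the suffix array.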

Observe that the matrix M is effectively the suffix array for the pattern P, i.e. the rows of M contain all suffixes of P in lexicographical order, and hence any substring S can be searched for in P by binary search over the rows of M.


Alg. 1 Count(S, h).

    i ← h; s ← 1; e ← |P|
    while s ≤ e and i > 0 do
        c ← S[i]
        s ← C(c) + rank_c(P^bwt, s − 1) + 1
        e ← C(c) + rank_c(P^bwt, e)
        i ← i − 1
    return e − s + 1

More precisely, one can use two binary searches to find the interval [s, e] such that the strings in rows M_{s…e} contain S as a prefix. The most remarkable aspect of the BWT is that it is reversible [Burrows and Wheeler 1994], and hence the matrix M can be obtained from P^bwt alone. The novelty of the FM-index is to use only P^bwt (in compressed form) to simulate the binary search in M without explicitly constructing it. The key is the so-called LF-mapping (Last-to-First): given a position i in the last column (L, L = P^bwt) of M, LF(i) gives the position of the corresponding character in the first column (F) of M (note that each column of M is a permutation of the string P#). To describe the mapping we need the following definitions:

—C(c) = C(P, c) gives the total number of characters in P that are lexicographically smaller than c.

—Occ(c, i) = rank_c(P^bwt, i) gives the number of occurrences of character c in P^bwt[1…i].

It can then be shown that LF(i) = C(P^bwt[i]) + Occ(P^bwt[i], i). In particular, the LF-mapping can be used to scan the original string P backwards, using only the transformed string P^bwt: if P[j] = P^bwt[i], then P[j−1] = P^bwt[LF(i)]. For more details and correctness, see [Ferragina and Manzini 2000; Navarro and Mäkinen 2007] and the example in Sec. 4.1.1.
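To make the mapping concrete, the following C sketch (our illustration, with Occ computed by a naive scan) reconstructs the original string backwards from L = P^bwt alone; row 0 of M is the one starting with #, so L[0] is the last symbol of P:

    #include <stddef.h>

    /* Recover P (without the sentinel) from L = BWT(P#) of length n, via
       LF(i) = C(L[i]) + Occ(L[i], i), written here with 0-based rows. */
    static void bwt_invert(const unsigned char *L, size_t n, unsigned char *P)
    {
        size_t C[257] = {0};
        for (size_t i = 0; i < n; i++)
            C[L[i] + 1]++;
        for (int c = 1; c <= 256; c++)
            C[c] += C[c - 1];            /* C[c] = #symbols smaller than c */

        size_t i = 0;                    /* row starting with the sentinel */
        for (size_t k = n - 1; k > 0; k--) {
            P[k - 1] = L[i];
            size_t occ = 0;              /* Occ(L[i], i) by naive scan */
            for (size_t j = 0; j <= i; j++)
                if (L[j] == L[i])
                    occ++;
            i = C[L[i]] + occ - 1;       /* 0-based LF-mapping */
        }
    }

For L = ipssm#pissii of Fig. 1 this recovers mississippi.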

Using the LF-mapping, Alg. 1 can be used to count the number of occurrences a string S[1…h] has in the string P. After each step of the algorithm, the strings in the (conceptual) rows M_{s…e} have S[i…h] as a prefix, and hence e − s + 1 is the number of occurrences of S[i…h] [Ferragina and Manzini 2000]. Note that the string S must be searched backwards due to the nature of the LF-mapping. Other types of queries are of interest in many applications, such as locating the position of each occurrence, but we only need the counting query in this paper.

Alg. 1 will be the basic building block of our subsequent algorithms. Its running time depends basically on the efficiency of rank(). If rank takes constant time, then Alg. 1 takes O(|S|) worst case time. Note that in principle we do not need P^bwt

in any explicit form; it is enough that rank() can be computed efficiently, using as little space as possible. This has been the central research issue in FM-indexing. See Sec. 5 for a discussion of efficient rank implementation.
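For reference, a direct C rendering of Alg. 1 (a sketch under our conventions: C_tab[c] is the array form of C(c), plen = |P^bwt|, and rank_naive is the naive rank from Sec. 2, standing in for an efficient rank structure):

    #include <stddef.h>

    /* Count the occurrences of S[0..h-1] in P by backward search over
       Pbwt = BWT(P#).  The interval [s, e] is 1-based as in Alg. 1. */
    static size_t count_occurrences(const unsigned char *S, size_t h,
                                    const unsigned char *Pbwt, size_t plen,
                                    const size_t C_tab[256])
    {
        size_t s = 1, e = plen, i = h;
        while (s <= e && i > 0) {
            unsigned char c = S[i - 1];        /* S is scanned backwards */
            s = C_tab[c] + rank_naive(Pbwt, s - 1, c) + 1;
            e = C_tab[c] + rank_naive(Pbwt, e, c);
            i--;
        }
        return (s <= e) ? e - s + 1 : 0;       /* empty range: no occurrences */
    }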


4. COMPRESSED SELF-INDEX BASED BDM AUTOMATON

The Backward DAWG (Directed Acyclic Word Graph) Matching algorithm (BDM for short) [Crochemore et al. 1994] is an average-optimal on-line string matching algorithm. The algorithm needs a method to recognize all factors of the pattern. More precisely, suffixes of the reverse pattern, i.e. prefixes of the original pattern, are enough. To describe the basics of the algorithm we just assume that we have a finite state automaton that recognizes the suffixes (and factors) of the input pattern. We note that such a suffix automaton can be built in O(m) time; for details see [Crochemore and Rytter 1994].

More precisely, we take the pattern in reverse, i.e. P^r = p_m p_{m−1} … p_1, and build an automaton that recognizes every suffix (including the empty string) of P^r. We now show how this can be used for efficient search. Assume that we are scanning the text window T[i…i+m−1] backwards. The invariant is that all occurrences that start before position i have already been reported. The substring read in this window is matched against the automaton. This continues as long as the substring can be extended, or until we reach the beginning of the window. If the whole window can be matched against the automaton, then the pattern occurs in that window. Whether the pattern matches or not, some of the occurrences may still overlap with the current window. However, in this case one of the suffixes stored in the automaton must match, since the reverse suffixes are also prefixes of the original pattern. The algorithm remembers the longest such suffix, other than the whole pattern, found in the window. The window is then shifted so that its starting position becomes aligned with the last symbol of that suffix. This is the position of the next possible pattern occurrence. If the length of that longest suffix was ℓ, the next window to be searched is T[i+m−ℓ … i+m−1+m−ℓ]. The shifting technique is exactly the same whether or not the pattern occurs in the current window. This process is repeated until the whole text is scanned. Fig. 2 illustrates.

The algorithm runs in O(n log_σ(m)/m) average time, which is optimal [Yao 1979]. The worst case time, however, is O(nm), although it is possible to improve it to O(n).

4.1 Using a self-index

We now propose several variants of BDM using self-indexing methods to implement the factor recognition.

The simplest approach is to blindly mimic the workings of the BDM algorithm. Again assume that we are working with the reverse pattern P^r. The idea is to build a self-index for P^r and then use Alg. 1 to recognize the factors. The problem with this approach is that the text window should now be read forwards, because the pattern was stored backwards. But we cannot afford to scan the text window from the beginning, as this would not allow computing a BDM-type shift value. Instead, we can take the following approach. Assume that the current window is T[i…i+m−1]. On average, the longest matching factor in the window is of length ℓ = Θ(log_σ(m)) (assuming that each symbol occurs independently with probability 1/σ). Therefore the algorithm can read the text substring t = T[i+m−1−ℓ … i+m−1] forwards, using an algorithm similar to Alg. 1. If no matches are found, then t is not a factor of P, and the window can be safely shifted to T[i+m−ℓ … i+m−ℓ+m−1].


[Figure: schematic of the pattern P aligned against the current text window of T, marking the longest recognized pattern factor (v), the longest recognized pattern prefix (u), the mismatching symbol x, and the corresponding safe factor-based and prefix-based shifts.]

Fig. 2. Pattern aligned against a text window of length m. The text window is matched against the pattern factors until a mismatch ('x'). The longest recognized factor is v, but xv is not a factor, and hence P can be shifted past x. The longest suffix of v that is also a pattern prefix is u, so P can be shifted to align u with the pattern prefix.

Alg. 2 SBDM(T, n, P, m).

    P^bwt ← BWT(P)
    p ← k such that P^bwt[k] = #
    i ← 1
    while i ≤ n − m + 1 do
        j ← m; shift ← m; s ← 1; e ← m + 1
        while s ≤ e and j > 0 do
            c ← T[i + j − 1]
            s ← C(P, c) + rank_c(P^bwt, s − 1) + 1
            e ← C(P, c) + rank_c(P^bwt, e)
            j ← j − 1
            if s ≤ p ≤ e then
                if j > 0 then shift ← j else report match
        i ← i + shift

Otherwise, the window is verified, and then shifted by only one character. It is easy to show that the average shift remains Θ(m), and hence this algorithm is still average-optimal, i.e. the average running time is O(n log_σ(m)/m), but the constant factor is somewhat larger than in BDM, since the shift is based on factors, and not on (reverse) suffixes.

A better approach is to use P instead of P^r, since this allows scanning the text backwards, as in the original BDM. First note that building the self-index for P means that the index contains the suffixes of P, while BDM is based on the fact that the index contains the suffixes of P^r, i.e. the prefixes of P; see Fig. 3. We now show that the algorithm can still work, even with this change. The main observation is that if we want to recognize only the factors of P, it does not matter whether the index is based on the suffixes of P or of P^r. In other words, we can use Alg. 1 to scan the text window backwards, precisely as in plain BDM. To see this, first observe that the whole pattern is obviously recognized, as it is one of the rows of M, and the text window is read backwards. But if the current range M_{s…e} is not empty, the current substring t of the current window must match a prefix of one of the suffixes of P, and hence it is a factor of P.


P# = acga#  =⇒

    1  5  #acg a
    2  4  a#ac g
    3  1  acga #
    4  2  cga# a
    5  3  ga#a c

=⇒  P^bwt = ag#ac

Simulation of Alg. 2 on T = gaacaacga (each step reads one window symbol backwards; s…e is the surviving row range of M):

    window gaac:  read c =⇒ rows 4…4;  read a =⇒ rows 3…3, so ac is a prefix of P;
                  read a =⇒ range empty =⇒ shift by 2
    window acaa:  read a =⇒ rows 2…3, so a is a prefix of P;
                  read a =⇒ range empty =⇒ shift by 3
    window acga:  read a =⇒ rows 2…3, so a is a prefix of P;  read g =⇒ rows 5…5;
                  read c =⇒ rows 4…4;  read a =⇒ rows 3…3 =⇒ match =⇒ shift by 3 …

Fig. 3. Top: Burrows-Wheeler transforming the (DNA) string P = acga. Left: the original string; middle: the matrix M, left column shows the corresponding suffix array; right: P^bwt (the last column of M). The index of # in P^bwt is p = 3 (see Alg. 2). Plain BDM would generate the suffixes of P^r, i.e. the strings agca, gca, ca, a, and ε (the empty string). Bottom: simulation of Alg. 2 on the input text string T = gaacaacga.

That is, the scanning continues until (i) we reach the beginning of the window, in which case we have found an occurrence of the pattern; or (ii) the range s…e becomes empty. In case (i) we simply shift the window by one character position. In case (ii), we shift the window past the position that caused the range to become empty. This algorithm is similar to our first attempt, and is still factor-based, since it does not recognize the reverse pattern suffixes. The only difference is that we do not have to use fixed-length text substrings, and hence it should be somewhat better in practice.

Finally, the self-index gives an easy method to recognize whether any of the recognized factors of P is also a prefix of P, corresponding to a reverse suffix of P^r, and hence the original BDM can be simulated exactly. The key observation is that if, for the current range s…e, any of the characters P^bwt[s…e] is the special symbol #, i.e. the last symbol of P#, then the range includes (a prefix of) P. We therefore store the index p of the symbol # in P^bwt (see Fig. 3), and after each step of the backward scanning we check if the range s…e is non-empty and if p is in that range. If so, the current substring of the text window matches a prefix of the pattern. This text position is recorded, and is used to shift the window precisely as in the original BDM algorithm. Alg. 2 shows the complete pseudo code. As the algorithm exactly simulates each step of BDM in O(1) time (assuming rank() takes


O(1) time), its average case running time is the optimal O(n log_σ(m)/m).
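The whole search is correspondingly short in C. The sketch below (ours, not the paper's implementation) follows Alg. 2 step by step; p is the 1-based index of # in Pbwt, C_tab[c] counts the symbols of P# smaller than c, and rank_naive again stands in for a real rank structure.

    #include <stddef.h>

    /* SBDM over a 0-based text T[0..n-1].  The window starts i and offsets j
       are 1-based to match Alg. 2, so T[i + j - 1] becomes T[i + j - 2].
       Matches are reported as 1-based window start positions. */
    static void sbdm(const unsigned char *T, size_t n,
                     const unsigned char *Pbwt, size_t m,
                     const size_t C_tab[256], size_t p,
                     void (*report)(size_t pos))
    {
        if (n < m)
            return;
        size_t i = 1;
        while (i <= n - m + 1) {
            size_t j = m, shift = m, s = 1, e = m + 1;
            while (s <= e && j > 0) {
                unsigned char c = T[i + j - 2];
                s = C_tab[c] + rank_naive(Pbwt, s - 1, c) + 1;
                e = C_tab[c] + rank_naive(Pbwt, e, c);
                j--;
                if (s <= e && s <= p && p <= e) { /* window suffix = prefix of P */
                    if (j > 0)
                        shift = j;
                    else
                        report(i);                /* the whole window matches */
                }
            }
            i += shift;
        }
    }

Replacing rank_naive by one of the constant-time rank implementations of Sec. 5 yields the claimed average-optimal running time.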

4.1.1 Example. Consider the pattern P = acga, depicted in Fig. 3, and assume we have the text T = gaacaacga. The BDM algorithm works with the suffixes of the reverse pattern P^r, i.e. the algorithm builds an automaton recognizing the strings agca, gca, ca, a, and ε (the empty string).

The first window of T to be considered contains the substring gaac, which is read backwards. BDM first reads c, which matches a prefix of the suffix ca. Then a is read, and ca is recognized to match the suffix ca, so the algorithm remembers this text position. Then a is read again, but now caa does not match any suffix, so the window is shifted so as to align the first symbol of P with the remembered text position, i.e. to align the pattern with the longest recognized suffix ca (which is a prefix of P in reverse). Hence the next text window contains the substring acaa. The backward scanning is started again, and this time the longest recognized suffix is a, so the window is shifted 3 positions, to align the pattern with the text substring acga, which is then found to match P.

Alg. 2 makes precisely the same steps as BDM above; the only difference is how the factors and suffixes of P^r are recognized. Alg. 2 works with P^bwt = ag#ac, and the algorithm is initialized with s ← 1, e ← m + 1, i.e. every row of M_{s…e}

matches the empty string. Again, the first text symbol read is c, so s ← C(P, c) + rank_c(ag#ac, 0) + 1 = 3 + 0 + 1 = 4, and e ← C(P, c) + rank_c(ag#ac, 5) = 3 + 1 = 4, so rows 4…4 of M contain c as a prefix. Fig. 3 illustrates. The next step is to read the symbol a, and we obtain s ← C(P, a) + rank_a(ag#ac, 3) + 1 = 1 + 1 + 1 = 3, and e ← C(P, a) + rank_a(ag#ac, 4) = 1 + 2 = 3. Now the algorithm recognizes that the range 3…3 of rows contains the special symbol # in the last column, so ac must be a prefix of P. This is used to shift the pattern after the mismatch caused by the next symbol read. The algorithm proceeds as illustrated in Fig. 3.

Now consider again the first step. Note that the first column of M contains the symbols of P in lexicographical order. Hence C(P, c) + 1 gives the index of the first row that has c in the first column. This is the value of s after the first step. The value of e is obtained by also noticing (using rank) that c occurs only once in the pattern, hence s…e = 4…4.

The second step tries to find all the rows that contain ac as a prefix. Assume now that we make a cyclic shift of the rows s…e to the right. For example, shifting row 4 (cga#a) of M gives acga#. This is the only shift that contains c as a second symbol. Hence, the only rows that can contain ac as a prefix are those rows that contain the cyclic shifts and begin with the symbol a. This can be computed in a similar way as in the first step. The correctness follows from the fact that the lexicographical order is preserved in cyclic shifts to the right, and from the definition of the Last-to-First mapping; for more details refer to [Ferragina and Manzini 2000; Navarro and Mäkinen 2007].

4.2 Multiple patterns

Alg. 2 can be easily generalized to matching a set P = {P_1, P_2, …, P_r} of r patterns simultaneously. Basically the generalization is as in the case of the plain BDM algorithm. For simplicity and w.l.o.g., assume that all patterns are of the same length. A simple solution is then to just concatenate all the patterns, appending a special symbol after each pattern, then Burrows-Wheeler transform the resulting string, and use it as input for Alg. 2.


acgt#tata#acac#  =⇒

     1  10  #acac#acgt#tat a
     2  15  #acgt#tata#aca c
     3   5  #tata#acac#acg t
     4   9  a#acac#acgt#ta t
     5  13  ac#acgt#tata#a c
     6  11  acac#acgt#tata #
     7   1  acgt#tata#acac #
     8   7  ata#acac#acgt# t
     9  14  c#acgt#tata#ac a
    10  12  cac#acgt#tata# a
    11   2  cgt#tata#acac# a
    12   3  gt#tata#acac#a c
    13   4  t#tata#acac#ac g
    14   8  ta#acac#acgt#t a
    15   6  tata#acac#acgt #

=⇒  acttc##taaacga#

Fig. 4. Burrows-Wheeler transforming the (DNA) pattern set P = {acgt, tata, acac}. Left: the original concatenated pattern set string; middle: the matrix M, left column shows the corresponding suffix array (note that for our algorithm the order of the first three rows is insignificant); right: the last column of M.

We denote the resulting concatenation again by P. Fig. 4 illustrates. This special symbol can be the same (#) for all patterns, or different for each pattern if we want to distinguish them. Either way, the special symbol(s) must again be lexicographically smaller than any other symbol. This ensures that the function C(P, ·) works correctly, i.e. each pattern (through its special symbol) adds one to the returned value for each symbol in the real alphabet. The main loop of the algorithm remains almost the same as for a single pattern. The window length is still obviously just m characters, corresponding to the length of the original patterns. It should be clear that this approach works correctly.

However, there is one non-trivial problem that we must solve: the simple method we used to detect whether the range includes a pattern prefix is now more complicated. A straightforward generalization of the method would need O(r) time per text character read, which is not acceptable. For each text character read, the algorithm needs to detect whether the range s…e includes any of the r pattern prefixes. Assume that we have a bit-vector B of length r(m+1), such that B[k] = 1 iff P^bwt[k] is a special symbol, where P^bwt is the Burrows-Wheeler transformed string of the concatenated patterns. Our problem can then be stated as a one-dimensional range query on B; if B[s…e] includes at least one 1-bit, then the range s…e contains a prefix of at least one of the patterns. This query can be answered in constant time if we have rank data structures built on B: iff

rank_1(B, e) − rank_1(B, s − 1) > 0

then there is at least one pattern prefix in the range. Note that B is in fact not needed, as we can use P^bwt directly, i.e. the above query can be stated as

rank_#(P^bwt, e) − rank_#(P^bwt, s − 1) > 0,

which works if the special symbol # is the same for all patterns. However, an easy "solution" is to just ignore whether or not some pattern prefix is in the range. This


Alg. 3 MSBDM(T, n, P, r, m).

    P ← P_1 · # · P_2 · # · … · P_r · #
    P^bwt ← BWT(P)
    i ← 1
    while i ≤ n − m + 1 do
        j ← m; shift ← m; s ← 1; e ← r(m + 1)
        while s ≤ e and j > 0 do
            c ← T[i + j − 1]
            s ← C(P, c) + rank_c(P^bwt, s − 1) + 1
            e ← C(P, c) + rank_c(P^bwt, e)
            j ← j − 1
            if rank_#(P^bwt, e) − rank_#(P^bwt, s − 1) > 0 then
                if j > 0 then shift ← j else report match
        i ← i + shift

turns the method factor-based, and the shifts become somewhat shorter (but not asymptotically), but in practice the simpler algorithm may be faster. The average running time becomes O(n log_σ(rm)/m), as with the generalized BDM [Crochemore and Rytter 1994]. Alg. 3 shows the pseudo code.
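In C, the only changes with respect to the Alg. 2 sketch of Sec. 4.1 are the initialization e = r(m+1) and the prefix test, which becomes the above range query. A sketch of the test (ours, assuming the separators are stored as the byte '#'):

    #include <stddef.h>

    /* Does P^bwt[s..e] (1-based) contain the separator '#', i.e. does the
       current row range include a prefix of some pattern?  rank_naive is
       the naive rank of the earlier sketches; a real implementation uses a
       constant-time rank structure here. */
    static int range_has_prefix(const unsigned char *Pbwt, size_t s, size_t e)
    {
        return rank_naive(Pbwt, e, '#') > rank_naive(Pbwt, s - 1, '#');
    }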

5. RANK IMPLEMENTATION

The main motivation for using self-indexing methods to implement BDM is to make it memory-efficient, and in particular cache-friendly, which can have a great speed impact in practice on modern hardware.

The function C() can be easily implemented in O(σ) words of space using a simple array. The main problem then is to implement rank_c() (a.k.a. Occ()) so that it is both fast and uses little space. The simple array-based solution would need O(σm) (or O(σrm)) words of space, which is the same as for plain BDM if implemented naively. Much work, both theoretical and practical, has been done on efficient representation of rank structures, see e.g. [Jacobson 1989; Pagh 1999; Grossi et al. 2003; Kim et al. 2005; González et al. 2005; Golynski et al. 2006; Ferragina et al. 2007; Okanohara and Sadakane 2007].

We do not go into the details of the various rank solutions; we just note some of the basic results. For a binary sequence S[1…n], rank can be solved in O(1) time using n + o(n) bits of space, i.e. the sequence itself plus o(n) bits for additional directories and look-up tables. More complicated solutions exist, and one can achieve nH_0(S) + o(n) total space as well. For larger alphabets the simplest solution would be to use σ bit-vectors B_0…B_{σ−1}, where the i-th bit of B_c is 1 iff S[i] = c, and use the binary rank solutions. The wavelet tree [Grossi et al. 2003] is a more elegant solution that uses only nH_0(S) + o(n log_2(σ)) bits of space, but rank_c takes O(log(σ)) time. The query time can be improved to O(log(σ)/log log(n)) [Ferragina et al. 2007] while keeping the same space complexity. Note that an efficient rank implementation for binary alphabets is relatively easy, but for general alphabets it is still difficult.

However, in our case we do not necessarily need the most succinct possible solution, since the data structures surely fit into main memory. We would, however, like the rank structures to fit into the CPU cache. Still, the query time


is very important in our case. One should also keep in mind that the O(1) time solutions can have large constants in practice, and for large n the cache effects can play an important role [González et al. 2005].

5.1 Practical implementation of SBDM

The practical performance of Alg. 2 and Alg. 3 depends on the implementation of the inner loop, and in particular on the rank implementation. Since rank is hard to implement so that it is both fast and uses little space, we try to avoid it as much as possible.

One possible approach is to precompute the steps taken by the algorithm for the first b characters read in a text window, and at search time use a look-up table to perform these steps in O(1) time, then continue the algorithm normally. This improves the performance in practice, and can improve the overall complexity as well for large enough b.

More precisely, let us have a table G[0…σ^b − 1], indexed by the concatenation of b (consecutive) symbols (of T), i.e. a text substring of length b. Each G[u] stores a triplet (s, e, shift), which is the result of running the Count algorithm (the inner loop of Alg. 2 and Alg. 3) on the substring u. Note that by storing the shift value, the algorithm still makes exactly the same shifts as before, even though it is possible that more symbols are read per text window; i.e. the basic algorithm may stop scanning the text window before b symbols are read.
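A C sketch of the precomputation (ours; it assumes a remapped alphabet of size sigma, b < m, and the C_tab, p and rank_naive conventions of the earlier sketches):

    #include <stddef.h>

    typedef struct { size_t s, e, shift; } bucket;

    /* G[u] = (s, e, shift) after running the first b backward steps of
       Alg. 2 on the length-b window suffix encoded by u in base sigma
       (first symbol most significant).  At search time the inner loop
       starts from G[u] with j = m - b, replacing b rank queries by one
       table look-up. */
    static void build_buckets(bucket *G, size_t sigma, size_t b,
                              const unsigned char *Pbwt, size_t m,
                              const size_t C_tab[256], size_t p)
    {
        size_t entries = 1;
        for (size_t k = 0; k < b; k++)
            entries *= sigma;                       /* sigma^b entries */

        for (size_t u = 0; u < entries; u++) {
            unsigned char t[64];                    /* decoded substring, b <= 64 */
            size_t v = u;
            for (size_t k = b; k-- > 0; ) {
                t[k] = (unsigned char)(v % sigma);
                v /= sigma;
            }
            size_t s = 1, e = m + 1, shift = m, j = m;
            for (size_t k = b; k-- > 0 && s <= e; ) {   /* read t backwards */
                unsigned char c = t[k];
                s = C_tab[c] + rank_naive(Pbwt, s - 1, c) + 1;
                e = C_tab[c] + rank_naive(Pbwt, e, c);
                j--;
                if (s <= e && s <= p && p <= e && j > 0)
                    shift = j;                          /* prefix of P seen */
            }
            G[u] = (bucket){ s, e, shift };
        }
    }

At search time the inner loop of Alg. 2 is entered with (s, e, shift) loaded from G and j = m − b; if the stored range is already empty, the window is shifted immediately by the stored shift.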

Let us analyze a simplified algorithm that performs worse than the real one, so that the complexity of the simplified algorithm upper-bounds that of the real one. To this end, assume that we always read exactly b text symbols in each window. We have two cases. (i) If the text string matches inside P, the window is verified against all r patterns, and shifted by one position. Assuming that the probability of two randomly picked characters matching is 1/σ, the probability of verification is at most rm × 1/σ^b, and the verification cost is at most O(rm). Hence the complexity of this case is at most O(n(rm)^2/σ^b). We require that this is at most O(n/m), which gives

b ≥ log_σ(r^2 m^3) = O(log_σ(rm)).

(However, the analysis is pessimistic, and a slightly smaller b may suffice.) (ii) Otherwise the window is shifted by m − b positions, giving a complexity of O(nb/(m−b)), which is O(n log_σ(rm)/m) given the value of b from case (i). This matches the lower bound [Yao 1979], and is thus optimal. Note that in the RAM model of computation we can read a substring of length O(log_σ(rm)) in O(1) time, which gives O(n/m) total average time. This breaks the lower bound, which is based on comparing single characters. However, the method is not based on comparing single characters, and it effectively avoids the log_σ(m) term by "comparing" b symbols at a time. On the other hand, it is easy to see that increasing b beyond O(log_σ(rm)) does not improve the algorithm. The problem is that the space is O(σ^b) = O(r^2 m^3), which is too much.

In practice we can choose e.g.

    b = α log_σ(rm)


for some constant α < 1, to use only

O((rm)^α) = o(rm)

words of additional space. The time improves only by a constant factor, hence the average case complexity remains O(n log_σ(rm)/m). Note that this works for plain BDM just as well. However, for e.g. ASCII alphabets (σ = 256) we have practically two choices, b = 1 or b = 2, since otherwise the space becomes too large. We call this technique bucketing, as a similar method with that name was used in [Manber and Myers 1993] to speed up searching in suffix arrays.

One possibility to sidestep the space problem is to remap the text characters read. In the case of ASCII alphabets σ = 256, but usually only a small fraction of the alphabet appears in P. Assume that P (or P) contains only σ_p distinct characters. The characters that appear in P are mapped to the interval 0…σ_p − 1, and the rest of the characters to the value σ_p. The space complexity is then reduced to only O((σ_p + 1)^b) words, which in practice should be a large improvement. The drawback of this method is that a substring cannot be accessed in O(1) time anymore, because of the character mapping; it takes O(b) time instead (but this is still optimal in the comparison model).
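The remapping is a small preprocessing step; a C sketch (ours):

    #include <stddef.h>

    /* Map the sigma_p distinct symbols of P to 0..sigma_p-1 and every other
       byte to the shared code sigma_p; returns sigma_p.  The text is then
       read through map[] on the fly. */
    static unsigned remap_alphabet(const unsigned char *P, size_t len,
                                   unsigned char map[256])
    {
        int seen[256] = {0};
        for (size_t i = 0; i < len; i++)
            seen[P[i]] = 1;
        unsigned sigma_p = 0;
        for (int c = 0; c < 256; c++)
            if (seen[c])
                map[c] = (unsigned char)sigma_p++;
        for (int c = 0; c < 256; c++)
            if (!seen[c])
                map[c] = (unsigned char)sigma_p;   /* "not in P" */
        return sigma_p;
    }

The bucket index of the previous sketch is then computed in base σ_p + 1 over the mapped symbols.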

The above idea can still be improved. As we already remap the characters, we can as well use entropy-bounded prefix codes, such as Huffman codes [Huffman 1952]. That is, we can compute Huffman codes for the characters of P, add one code for the characters that do not appear in P, and use these codes for the remapping and for indexing G. Note that we do not have to Huffman code the pattern at all; the concatenated codes are used just for indexing G. The average code length depends on the distribution of the text symbols, which may or may not be the same as for the pattern. Making the assumption that the distributions are the same, the average code length is at most H = H_0(P) + 1 = H_0(T) + 1 bits. As the codes have variable length, we will not read a fixed number of symbols in a text window; rather, we use b as the maximum length of the concatenated codes in bits. We therefore concatenate as many code words as fit into b bits. Note that by using Huffman coding we are making the implicit assumption that the character distribution is not uniform, as otherwise H_0(P) = log_2(σ). This also means that the average number of characters inspected is more than log_σ(rm); that is, assuming that the probability of two characters matching is p, where 1/p < σ, the algorithm reads about log_{1/p}(rm) characters per text window, and the encoded length of this string is about b = H log_{1/p}(rm) bits. But the space needed for the tables becomes too large in this case, i.e. we have 2^b = Ω(rm) words. Again, we can adjust the space-time trade-off by using a smaller b, i.e. b = αH log_{1/p}(rm) for some α < 1/(H log_{1/p}(2)), and obtain only o(rm) words of additional space.

Observe that using Huffman codes is still quite simple and efficient in practice. The only problem is that if the symbol distribution is very biased, the longest codeword can have Θ(σ) bits. However, it is possible to use length-limited coding (which in practice has only a small impact on the compression performance) [Witten et al. 1999, pp. 396-405], or any other entropy-bounded prefix coding that does not have this problem (e.g. [Fredriksson and Nikitin 2007]). Still another possibility is to apply the bucketing technique only for short enough codewords, and revert to plain


rank for the (extremely) infrequent symbols. In particular, if the maximum code length is limited to O(log_2(σ)) bits, obtaining the codeword for a single symbol costs only O(1) time using a look-up table of size O(σ). This should still be much faster in practice than computing the corresponding rank queries. Another problem is that reading b bits in a text window may cover more than O(log(rm)) symbols, if the symbols are very frequent and thus their codes very short. We do not expect this to be a problem in practice, but nevertheless it is easy to avoid: if at some point we have read more than c log_{1/p}(rm) symbols in a window (for some constant c > 1), we can abort the scanning and revert to the basic algorithm (for that window).

5.2 Off-line searching

The techniques of Sec. 5.1 can be used to improve Alg. 1 for its original purpose as well, i.e. counting queries for text indexing. The analysis remains the same if we replace rm with n, where n is the length of the indexed text. As shown in the experimental results, using a small amount of extra space speeds up the counting queries substantially.

6. EXPERIMENTAL RESULTS

We have implemented the algorithms in C, and compiled them with icc 9.1. The experiments were run on a Pentium 4 at 2.4GHz with 512MB of RAM and a 512KB cache, running the GNU/Linux 2.6.17 operating system.

We experimented with four rank implementations, having different time/space trade-offs. These were:

GiganticRank (GR): A two-dimensional array of r(m+1) × σ integers; rank_c(S, i) can be computed with a single table look-up. The additional space is r(m+1)σ integers, which is the same order as (but in practice less than) for the plain BDM algorithm implemented with tables.

HugeRank (HR): Two two-dimensional arrays. Conceptually, the text is divided into blocks of 256 symbols, and the exact rank value within its block is stored for each symbol and position, taking r(m+1) × σ bytes of space in total over all the blocks. The second table is r(m+1)/256 × σ integers, storing the exact rank value over the whole text, but only for text positions corresponding to the beginnings of the blocks. Hence rank_c(S, i) can be computed with two table look-ups. In practice this takes about one fourth of the space used by GiganticRank.

BitRank (BR): An array of σ bit-vectors, each of length r(m+1) bits, i.e. the bit-vector for symbol c has its i-th bit set iff S[i] = c. Rank queries are solved using rank_1 queries on the bit-vectors (a sketch of such a rank_1 structure is given after this list). The space is r(m+1)σ(1 + o(1)) bits.

Wavelet (W): A variant of the wavelet tree [Grossi et al. 2004], taking r(m+1)(H_0+1)(1 + o(1)) bits, where H_0 is the zeroth-order entropy of the pattern set.

Note that the sizes are for the worst case, i.e. rank is built only for the symbols that actually appear in the pattern sets. The query time for the wavelet tree is O(log_2(σ)), and O(1) for the others, but the constants have a large impact in practice.
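For concreteness, a minimal C sketch (ours) of a rank_1 structure in the spirit of BitRank: the bit-vector is stored in 64-bit words, one cumulative count is kept per 512-bit block, and a query adds at most eight population counts to the block count, giving O(1) time in n(1 + o(1)) bits per bit-vector.

    #include <stdint.h>
    #include <stddef.h>

    typedef struct {
        const uint64_t *bits;  /* bit i is bit (i % 64) of word i / 64 */
        uint32_t *blocks;      /* rank at the start of each 512-bit block;
                                  allocate nwords/8 + 1 entries */
        size_t nbits;
    } bitrank;

    static unsigned popcount64(uint64_t x)
    {
        unsigned n = 0;        /* portable; real code uses a popcount insn */
        while (x) { x &= x - 1; n++; }
        return n;
    }

    static void bitrank_build(bitrank *r)
    {
        size_t nwords = (r->nbits + 63) / 64;
        uint32_t acc = 0;
        for (size_t w = 0; w <= nwords; w++) {
            if (w % 8 == 0)
                r->blocks[w / 8] = acc;
            if (w < nwords)
                acc += popcount64(r->bits[w]);
        }
    }

    /* Number of 1-bits in positions [0, i), for i <= nbits. */
    static size_t bitrank1(const bitrank *r, size_t i)
    {
        size_t w = i / 64;
        size_t res = r->blocks[w / 8];
        for (size_t k = (w / 8) * 8; k < w; k++)
            res += popcount64(r->bits[k]);
        if (i % 64)
            res += popcount64(r->bits[w] & ((UINT64_C(1) << (i % 64)) - 1));
        return res;
    }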


We used the DNA (σ = 16, including some special symbols), protein (σ = 25) and English (σ = 215) texts available from http://pizzachili.dcc.uchile.cl/, truncated to 100MB. The patterns were randomly picked from the texts.

We implemented the Burrows-Wheeler transform using the C standard library function qsort(). This is very slow compared to the state-of-the-art suffix sorting algorithms, which achieve O(n) (that is, O(r(m+1))) time, but the difference in the total time is negligible for pattern sets that are reasonably small compared to the text size. Moreover, using qsort() makes the implementation extremely simple, and together with one of the simpler rank solutions the resulting algorithm is very simple compared to implementing the traditional BDM algorithm.

For comparison, we used the Backward Set Oracle Matching (BSOM) algorithm [Allauzen and Raffinot 1999; Navarro and Raffinot 2002] (implemented by its authors). This is a simplified version of the multiple BDM algorithm, but it has been experimentally shown that BSOM is always faster than BDM [Navarro and Raffinot 2002]. The implementation of BSOM uses tables to implement the automaton states, i.e. the space complexity is O(rmσ) words¹. It also uses the Aho-Corasick (AC) automaton [Aho and Corasick 1975] for verification. This is efficient for small pattern sets, but for large rm the large space (and therefore also preprocessing time) requirement makes it impractical. We used the Aho-Corasick (AC) automaton [Aho and Corasick 1975] as another control point. AC runs in O(n) worst (and best) case time. The implementation uses a full automaton, and the space complexity is the same as for BSOM, but in practice the memory used is only 50% of that of the BSOM implementation (which uses AC for verifications).

We also compared against the algorithms given in [Salmela et al. 2006] (using their original implementations). They give several q-gram based filtering algorithms. Of these, the HG and BG algorithms were generally the most efficient, BG being usually slightly faster. In brief, BG is a BNDM variant that uses "superimposing" to search the whole pattern set in parallel, and q-grams to effectively increase the alphabet size to σ^q. The use of q-grams makes the superimposing trick work well for large pattern sets, at the cost of using more space. The average filtering time is O(n log_{1/p}(m)/m), where p depends on r and q [Salmela et al. 2006]. The verification uses an improved version [Muth and Manber 1996] of the Karp-Rabin algorithm [Karp and Rabin 1987]. This is a simple method, but in practice very effective. The implementations in general support only pattern lengths of 8, 16 and 32 (with special code for each), except that for DNA they have special code that supports m = 32 only. Therefore we give the experimental results for these algorithms separately.

Figs. 5, 6 and 7 show the performance of the algorithms in megabytes per second (the number of text characters, measured in megabytes, scanned per second by each algorithm) for r ∈ {10, 100, 1000, 10000} and m ∈ {8, 16, …, 64}. The pattern lengths were the same within each experiment. Note that if the pattern lengths were different, the search performance would be limited by the shortest pattern in the set, as it defines the maximum shift. HR, GR, BR and W denote our algorithm with the different rank implementations, and the suffix 'B' denotes that 'bucketing' is

¹In their implementation, the σ factor is in fact σ + 6, which may show in practice for small alphabets.


used as well, as detailed in Sec. 5.1 (the bucketing method still reads the text one character at a time). In all cases we remap the alphabet to 1…σ_p, where σ_p is the number of distinct symbols in P. We used the maximum bucket size such that the space is smaller than the rank size. Hence the total space used by the algorithms is always less than 2 × the rank size. Plain AC and BSOM do not use alphabet mapping. This means that even GR uses much less space than AC or BSOM in practice, because the effective alphabet size is reduced. That is, e.g. for DNA the effective alphabet size is σ = 16 (a, c, g, t and some special symbols), but using the 7-bit ASCII alphabet for DNA the actual alphabet size is 128. For fairness, we modified BSOM and AC to use the alphabet mapping trick as well. This turns the σ factor in the space complexities into σ_p. These versions are denoted by AC AM and BSOM AM.

AC is competitive only for very small r, m and σ, and BSOM only for small r and m and large σ. Alphabet mapping improves both AC and BSOM, BSOM AM becoming the fastest algorithm for r = 10 on the protein and English texts. An illustrative example of the effect can be seen on the English text with r = 10000, where BSOM degrades sharply when m > 32 but BSOM AM still works well, although neither is competitive there. For small pattern sets HR and GR (with buckets) are generally very fast (note that these are also the easiest to implement), but BR becomes more and more competitive as the size of the pattern set increases. For large sets the cache effects come into play, and this slows down the algorithms using large amounts of memory. W is usually the slowest algorithm, but even this beats AC, AC AM and BSOM for large r, m and σ, though not BSOM AM.

Figs. 8 and 9 show the performance of BG against BSOM AM and the best of our algorithms for the protein and English texts. These algorithms are very efficient for r = 10 and r = 100, being clearly the fastest. This is due to the simplicity of the (bit-parallel) BG algorithm, and to the fact that using only q = 2 is sufficient. For larger pattern sets we were forced either to use q = 3 (the maximum value supported, and for m = 8 only), which helps to reduce the number of verifications at the cost of a more complicated algorithm (it uses hashing) and more space, or to use the (automated) partitioning into smaller subsets. This agrees with the original results: the performance is good as long as 2-grams are sufficient; after that, using 3-grams and partitioning into subsets helps, but mainly compared to the results with 2-grams. For large r our method outperforms BG. Note that in general a larger q would in principle help the filter, but the memory requirements grow quickly too high, and/or would result in poor cache performance. On the other hand, using hashing to reduce the memory requirements makes accessing the data structures relatively slow. Note that the performance of BG depends on a large enough q; too small a q makes the algorithm effectively degenerate into verification. Note also that for large alphabet sizes finding a suitable q is hard in practice, as the value must be an integer. There is a fundamental difference to our bucketing method, which merely reduces the number of rank queries; our algorithms do not depend on it and perform well even without it.

The situation is quite different for DNA. They have special code tuned for DNA, but this supports only m = 32 and σ = 4. In particular, this does not support the special symbols of our DNA file (which has σ = 16), and hence we used a 4 MB E.coli DNA sequence.


[Figure: four panels (r = 10, 100, 1000, 10000), each plotting MB/second against m for HR, GR, BR, W, their bucketed variants (HR, B; GR, B; BR, B; W, B), BSOM, BSOM AM, AC and AC AM.]

Fig. 5. MB/s for DNA. The plots are for r ∈ {10, 100, 1000, 10000}, each showing the performance of our algorithms against BSOM and AC (with and without alphabet mapping) for m = 8…64.


[Figure: four panels (r = 10, 100, 1000, 10000), each plotting MB/second against m for HR, GR, BR, W, their bucketed variants (HR, B; GR, B; BR, B; W, B), BSOM, BSOM AM, AC and AC AM.]

Fig. 6. MB/s for proteins. The plots are for r ∈ {10, 100, 1000, 10000}, each showing the performance of our algorithms against BSOM and AC (with and without alphabet mapping) for m = 8…64.

ACM Journal Name, Vol. V, No. N, Month 20YY.

[Plot omitted: four panels (r = 10, 100, 1000, 10000), x-axis m = 8–64, y-axis MB/second; curves for HR, GR, BR and W (plain and with buckets, "B"), BSOM, BSOM AM, AC and AC AM.]

Fig. 7. MB/s for English. The plots are for r ∈ {10, 100, 1000, 10000}, each showing the performance of our algorithms against BSOM and AC (with and without alphabet mapping) for m = 8 . . . 64.

[Plot omitted: four panels (r = 10, 100, 1000, 10000), x-axis m = 8–32, y-axis MB/second; curves for the best of our algorithms (GR,B / HR,B / BR,B depending on r), BSOM AM and BG.]

Fig. 8. MB/s for proteins. The plots are for r ∈ {10, 100, 1000, 10000}, each showing the performance of the best of our algorithms against BG for m ∈ {8, 16, 32}.

[Plot omitted: four panels (r = 10, 100, 1000, 10000), x-axis m = 8–32, y-axis MB/second; curves for the best of our algorithms (GR,B / HR,B / BR,B depending on r), BSOM AM and BG.]

Fig. 9. MB/s for English. The plots are for r ∈ {10, 100, 1000, 10000}, each showing the performance of the best of our algorithms against BG for m ∈ {8, 16, 32}.

For this E. coli sequence the performance of BG is as follows (using 8-grams): 1106 MB/s (r = 10), 885 MB/s (r = 100), 402 MB/s (r = 1000) and 84 MB/s (r = 10000). This beats all our algorithms by a wide margin. We also tried even larger pattern sets (up to r = 100000), but the gap remained similar. Using the generic code, which supports larger alphabets (but limits q to 3), BG beats our algorithms only for r = 10 (588 MB/s).


Table I. Bucket sizes for r = 10000. The format is m_i–m_j : b, where m_i and m_j denote the pattern length range, and b is the number of symbols in the bucket.

              GR          HR          BR                     W
DNA           8–64 : 7    8–64 : 6    8–48 : 5, 56–64 : 6    8–16 : 4, 24–64 : 5
proteins      8–64 : 4    8–64 : 4    8–48 : 3, 56–64 : 4    8 : 2, 16–64 : 3
English       8–64 : 3    8–64 : 3    8–64 : 2               8 : 1, 16–64 : 2

Table II. Size ratios for r = 10000. For our algorithms the values are computed as (rank size + bucket size)/|P|; the reported values are the minimum and maximum values corresponding to Table I. For BSOM and AC the values are computed as (automaton size)/|P|, and are about 1072 (DNA and proteins) and 2096 (English) for BSOM, and 536 and 1048 for AC, respectively. For BSOM AM the reported values are the minimum and maximum over pattern lengths 8–64. For AC AM the values are about 50% of those for BSOM AM.

              GR           HR           BR           W            BSOM AM
DNA           26.1–57.8    6.37–15.0    1.10–2.39    0.63–1.01    106–136
proteins      97.9–105     28.3–45.5    4.13–8.20    0.88–1.38    248–256
English       523–684      147–206      18.7–27.3    0.93–1.55    920–1272

The space usage of the algorithms using the bucketing technique is reported in Tables I and II. Table I shows the number of symbols used by the buckets for different rank structures and pattern lengths for r = 10000. Table II shows the ratio of the space of the rank structures and the bucket tables to the size of the pattern set, corresponding to Table I. GR and HR are far from being "succinct", but even these take much less memory than BSOM AM. BR is very attractive for DNA and proteins, taking the search performance into account. W is the only one taking less space than P. Note that without the bucketing the numbers would be smaller by a factor of 1 to 2. For BG the space does not depend directly on the size of the pattern set, but is O(σ^q). In practice this means that the space is about 256 kB for DNA; 64 kB, 128 kB or 256 kB (for pattern lengths 8, 16 and 32) for proteins and English texts when q = 2; and about 16 MB for proteins and English when q = 3. Hence BG can use relatively small or large amounts of space, depending on the number of patterns.
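To put Table II in perspective (an illustrative back-of-the-envelope calculation of our own; the pattern set size is hypothetical, only the ratios come from the table): for r = 10000 DNA patterns of length m = 16 the pattern set has |P| = 160,000 symbols, so the ratios translate to roughly

    BR:    1.10 × 160,000 bytes ≈ 0.18 MB
    BSOM:  1072 × 160,000 bytes ≈ 172 MB,

a three-orders-of-magnitude gap that also foreshadows the cache results below.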

We also measured the number of cache misses using the Cachegrind tool, which is part of the larger tool Valgrind (http://valgrind.org). Cache effects play an important role in the practical performance of algorithms. Modern computers have two levels of cache, for both instructions and data. In our case we are only interested in the data cache; the size of the first level (L1) cache is only 8,192 bytes, and the second level (L2) cache is 524,288 bytes. The line sizes are 64 bytes for both, and the associativities are 4 and 8 for L1 and L2, respectively. The cost of an L1 miss is about 10 clock cycles, and the cost of an L2 miss can be up to 200 clock cycles, so even an L1 miss is relatively expensive. In addition, the Translation Lookaside Buffer (TLB) is a small associative memory that translates virtual memory addresses to physical addresses. (Some CPUs support this in hardware, but even so, typical operating system implementations use a hybrid software/hardware approach for greater flexibility.) A TLB miss typically costs 10–30 cycles, and this depends linearly on the amount of allocated memory. However, TLB misses are not visible at the instruction level, and hence Cachegrind cannot detect them. Consequently, we do not include TLB misses in our experiments.
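For reproducibility, the measurement itself is straightforward; the sketch below (our own illustration with made-up counter values, not the actual measurement script) shows a typical Cachegrind invocation and how percentages of the kind reported here are derived from its counters (Dr = data reads, D1mr = L1 data read misses, DLmr or D2mr = last-level data read misses, depending on the Valgrind version):

    /* Typical run, matching the cache parameters above (flag names vary by
     * Valgrind version; older releases use --L2= instead of --LL=):
     *
     *   valgrind --tool=cachegrind --D1=8192,4,64 --LL=524288,8,64 ./search ...
     */
    #include <stdio.h>

    static double miss_rate(unsigned long misses, unsigned long reads)
    {
        /* (cache read misses / all read memory accesses) * 100% */
        return reads ? 100.0 * (double)misses / (double)reads : 0.0;
    }

    int main(void)
    {
        /* hypothetical counters copied from a cachegrind.out summary */
        unsigned long Dr = 1000000000UL, D1mr = 16000000UL, DLmr = 500000UL;
        printf("L1 read miss rate: %.1f%%\n", miss_rate(D1mr, Dr));
        printf("L2 read miss rate: %.1f%%\n", miss_rate(DLmr, Dr));
        return 0;
    }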

We measured the cache miss rates for data read accesses for the L1 and L2 caches. The miss rates are computed as (cache read misses / all read memory accesses) × 100%. The reported values are for the whole runs of the algorithms, i.e. they also include building the data structures and all text accesses. The results for r ∈ {100, 10000} and m = 16 are reported in Table III. The very small L1 cache shows in the results for BSOM; as the automaton states (the outgoing edges) are implemented as arrays of size O(σ), only a few states fit in the L1 cache, and the cache performance is poor even for relatively small pattern sets. Our algorithms perform much better, even GR and HR, thanks to the alphabet mapping. Alphabet mapping clearly reduces the cache misses for BSOM as well, and this shows in the practical performance. We did not measure the values for AC, as it was not competitive in any of the timings. In general, the smaller the data structures we use, the fewer the cache misses, as expected. However, there are several interesting things to note: (i) the smaller the alphabet (or the larger the probability of two symbols matching), the more characters are read in each window, and in some cases this also shows in the number of cache misses, because the data structures are accessed more; (ii) the bucketing technique reduces the number of cache misses for large pattern sets, even though it makes the data structures larger; the buckets use less memory than the rank data structures, and if the rank computations can be avoided, the locality of reference improves; (iii) in some cases GR incurs slightly fewer cache misses than HR, even though GR consumes more memory; this may be because GR uses only one array for rank while HR uses two; (iv) BG has relatively large cache miss rates in all cases, in particular for r = 10, even though it is very efficient; the simplicity of the bit-parallel algorithm seems to make up for the relatively bad cache behaviour. Finally, we note that with wavelet trees the L2 cache miss ratio is 0.0% in all cases.

We also experimented with the bucketing technique for indexing the texts. In this case we used only the wavelet tree to implement rank, as for indexing the text the space complexity is more important. The results are shown in Table IV. Note that we implemented only the counting query; other queries (such as display and locate) would need some additional space. "Ratio" shows the rank size divided by the original text size. The α values show the extra space used for bucketing, i.e. the total size is |T| × ratio × (1 + α), and the b values are the corresponding bucket sizes (in symbols). "P/s" denotes the number of randomly picked patterns (m = 8) queried per second. Note that every pattern appears in the text at least once (which affects the speed). The main observation is that very little additional space can give a significant performance boost: for example, less than 4% of additional space doubles the speed in the case of proteins. The effect also depends on the pattern lengths and on the length of the longest pattern suffix appearing in the text. Moreover, the α values of course depend on the text size as well; in principle, for larger texts we can obtain larger speed-ups using the same or smaller α values. In other words, the attractiveness of this method increases with the text size, counting both space and time.
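As a worked example of the space formula (the text size here is hypothetical; only the ratio and α come from Table IV): for |T| = 50 MB of protein text,

    |T| × ratio × (1 + α) = 50 MB × 0.914 × 1.0381 ≈ 47.4 MB,

of which the buckets account for only about 1.7 MB, while Table IV shows the query rate roughly doubling, from 41,500 to 84,800 patterns per second.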


Table III. Cache miss rates in percentages. The numbers are computed as (cache read misses / all read memory accesses) × 100%. The first value is the ratio for L1 data cache misses, and the second for L2 data cache misses. BG uses q = 8 for DNA, and q = 2 for proteins and English. Top: without bucketing; bottom: with the bucketing technique (no difference for BSOM or BG).

r = 100, m = 16, without bucketing
              GR          HR          BR          W           BSOM         BSOM AM      BG
DNA           7.6, 0.0    1.0, 0.0    0.0, 0.0    0.0, 0.0    42.3, 0.7    18.9, 0.3    14.4, 4.3
proteins      8.4, 0.0    4.0, 0.0    0.8, 0.0    0.4, 0.0    28.6, 1.9    25.0, 0.9     9.7, 3.4
English       9.2, 0.0    4.5, 0.0    1.6, 0.0    1.0, 0.0    37.3, 1.7    27.6, 0.8    12.7, 2.5

r = 100, m = 16, with bucketing
              GR          HR          BR          W           BSOM         BSOM AM      BG
DNA           5.4, 0.0    1.7, 0.0    0.0, 0.0    0.0, 0.0    42.3, 0.7    18.9, 0.3    14.4, 4.3
proteins      6.9, 0.2    4.4, 0.1    2.1, 0.0    0.3, 0.0    28.6, 1.9    25.0, 0.9     9.7, 3.4
English       7.1, 0.1    4.6, 0.0    1.5, 0.0    0.8, 0.0    37.3, 1.7    27.6, 0.8    12.7, 2.5

r = 10000, m = 16, without bucketing
              GR           HR           BR          W           BSOM          BSOM AM       BG
DNA           14.9, 5.0    14.5, 0.6    9.2, 0.0    6.2, 0.0    59.1, 37.9    31.7, 17.0     9.8, 0.8
proteins      12.7, 4.8    13.4, 2.1    9.7, 0.1    8.8, 0.0    48.7, 19.1    37.0, 14.9    14.9, 0.0
English       14.4, 6.0    14.7, 2.0    9.8, 1.1    8.6, 0.0    56.2, 31.4    41.3, 22.5    14.3, 0.0

r = 10000, m = 16, with bucketing
              GR           HR           BR          W           BSOM          BSOM AM       BG
DNA            9.2, 5.6     8.3, 1.0    6.1, 0.0    4.8, 0.0    59.1, 37.9    31.7, 17.0     9.8, 0.8
proteins       5.7, 2.9     6.5, 2.1    6.6, 0.3    7.3, 0.0    48.7, 19.1    37.0, 14.9    14.9, 0.0
English       10.7, 5.7    10.4, 2.1    8.2, 0.1    7.6, 0.0    56.2, 31.4    41.3, 22.5    14.3, 0.0

Table IV. Indexing performance using buckets consuming (rank size) × α bytes of extra space. "Ratio" is the compression ratio (rank size per text size), b is the bucket size in symbols, and p/s is the number of randomly picked patterns queried per second (m = 8); the b = 0 column is the plain rank structure without buckets.

DNA (ratio 55.1%)
  b        0        2           3           4          5
  α        0        0.00004     0.00068     0.0116     0.197
  p/s      83300    94300       98000       107000     132500

proteins (ratio 91.4%)
  b        0        2           3           4          5
  α        0        0.000056    0.00147     0.0381     0.992
  p/s      41500    49000       62900       84800      118300

English (ratio 96.8%)
  b        0        1           2           3
  α        0        0.000017    0.00368     0.794
  p/s      35500    35600       40300       49000

7. CONCLUSIONS

We have obtained an average-optimal on-line string matching algorithm for single and multiple patterns that uses space close to the information-theoretic lower bound. The algorithm has two main traits: (1) for large pattern sets the small memory requirements give good performance in practice, as the data structures fit better into the CPU cache; (2) it is much easier to implement than traditional BDM, depending on the rank structures used. Finally, we showed that using a small amount of extra space can speed up the indexing performance as well.

Acknowledgments

We wish to thank Leena Salmela for generously sharing the source code of BG. The suggestions by the anonymous reviewers helped to improve this paper significantly.

REFERENCES

Aho, A. V. and Corasick, M. J. 1975. Efficient string matching: an aid to bibliographic search. Commun. ACM 18, 6, 333–340.

Allauzen, C. and Raffinot, M. 1999. Factor oracle of a set of words. Technical Report 99-11, Institut Gaspard-Monge, Université de Marne-la-Vallée.

Baeza-Yates, R. A. and Gonnet, G. H. 1992. A new approach to text searching. Commun. ACM 35, 10, 74–82.

Boyer, R. S. and Moore, J. S. 1977. A fast string searching algorithm. Commun. ACM 20, 10, 762–772.

Burrows, M. and Wheeler, D. 1994. A block sorting lossless data compression algorithm. Tech. Rep. 124, Digital Equipment Corporation.

Crochemore, M. 1992. String-matching on ordered alphabets. Theor. Comput. Sci. 92, 1, 33–47.

Crochemore, M., Czumaj, A., Gąsieniec, L., Jarominek, S., Lecroq, T., Plandowski, W., and Rytter, W. 1994. Speeding up two string matching algorithms. Algorithmica 12, 4/5, 247–267.

Crochemore, M. and Perrin, D. 1991. Two-way string-matching. J. Assoc. Comput. Mach. 38, 3, 651–675.

Crochemore, M. and Rytter, W. 1994. Text Algorithms. Oxford University Press.

Crochemore, M. and Rytter, W. 1995. Squares, cubes and time-space efficient string-searching. Algorithmica 13, 5, 405–425.

Crochemore, M. and Rytter, W. 2002. Jewels of Stringology. World Scientific. ISBN 981-02-4782-6. 310 pages.

Ferragina, P. and Manzini, G. 2000. Opportunistic data structures with applications. In Proceedings of the 41st Annual Symposium on Foundations of Computer Science (FOCS 2000). IEEE Computer Society, Washington, DC, USA, 390–398.

Ferragina, P. and Manzini, G. 2005. Indexing compressed text. J. ACM 52, 4, 552–581.

Ferragina, P., Manzini, G., Mäkinen, V., and Navarro, G. 2007. Compressed representations of sequences and full-text indexes. ACM Transactions on Algorithms (TALG) 3, 2, article 20.

Fredriksson, K. 2007. Succinct pattern matching automata. Report A-2007-1, Department of Computer Science, University of Joensuu.

Fredriksson, K. and Nikitin, F. 2007. Simple compression code supporting random access and fast string matching. In Proc. 6th Workshop on Efficient and Experimental Algorithms (WEA'07). LNCS 4525. Springer-Verlag, 203–216.

Galil, Z. and Seiferas, J. 1981. Linear-time string matching using only a fixed number of local storage locations. Theor. Comput. Sci. 13, 3, 331–336.

Golynski, A., Munro, I., and Rao, S. S. 2006. Rank/select operations on large alphabets: a tool for text indexing. In Proceedings of SODA'06. ACM Press, 368–373.

González, R., Grabowski, S., Mäkinen, V., and Navarro, G. 2005. Practical implementation of rank and select queries. In Poster Proceedings Volume of 4th Workshop on Efficient and Experimental Algorithms (WEA'05). CTI Press and Ellinika Grammata, Greece, 27–38.

Grossi, R., Gupta, A., and Vitter, J. 2003. High-order entropy-compressed text indexes. In Proc. 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'03). ACM, 841–850.

Grossi, R., Gupta, A., and Vitter, J. 2004. When indexing equals compression: Experiments with compressing suffix arrays and applications. In Proc. 15th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'04). ACM, 636–645.

Horspool, R. N. 1980. Practical fast searching in strings. Softw. Pract. Exp. 10, 6, 501–506.

Huffman, D. A. 1952. A method for the construction of minimum-redundancy codes. Proc. I.R.E. 40, 1098–1101.

Hyyrö, H., Fredriksson, K., and Navarro, G. 2006. Increased bit-parallelism for approximate and multiple string matching. ACM Journal of Experimental Algorithmics (JEA) 10, 2.6, 1–27.

Jacobson, G. 1989. Succinct static data structures. Ph.D. thesis, Carnegie Mellon University.

Karp, R. M. and Rabin, M. O. 1987. Efficient randomized pattern-matching algorithms. IBM J. Res. Dev. 31, 2, 249–260.

Kim, D. K., Na, J. C., Kim, J. E., and Park, K. 2005. Efficient implementation of rank and select functions for succinct representation. In Proc. of 4th Workshop on Efficient and Experimental Algorithms (WEA'05). Springer, 315–327.

Knuth, D. E., Morris, Jr., J. H., and Pratt, V. R. 1977. Fast pattern matching in strings. SIAM J. Comput. 6, 2, 323–350.

Manber, U. and Myers, G. 1993. Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22, 5, 935–948.

Muth, R. and Manber, U. 1996. Approximate multiple string search. In Proceedings of the 7th Annual Symposium on Combinatorial Pattern Matching, D. S. Hirschberg and E. W. Myers, Eds. Number 1075 in Lecture Notes in Computer Science. Springer-Verlag, Berlin, Laguna Beach, CA, 75–86.

Navarro, G. and Fredriksson, K. 2004. Average complexity of exact and approximate multiple string matching. Theoretical Computer Science A 321, 2–3, 283–290.

Navarro, G. and Mäkinen, V. 2007. Compressed full-text indexes. ACM Computing Surveys 39, 1, article 2.

Navarro, G. and Raffinot, M. 2000. Fast and flexible string matching by combining bit-parallelism and suffix automata. ACM Journal of Experimental Algorithmics (JEA) 5, 4.

Navarro, G. and Raffinot, M. 2002. Flexible Pattern Matching in Strings – Practical on-line search algorithms for texts and biological sequences. Cambridge University Press. ISBN 0-521-81307-7. 280 pages.

Okanohara, D. and Sadakane, K. 2007. Practical entropy-compressed rank/select dictionary. In Proceedings of ALENEX'07. ACM Press.

Pagh, R. 1999. Low redundancy in static dictionaries with O(1) worst case lookup time. In Proceedings of ICALP'99. Springer-Verlag, 595–604.

Salmela, L., Tarhio, J., and Kytöjoki, J. 2006. Multipattern string matching with q-grams. J. Exp. Algorithmics 11, 1.1.

Sunday, D. M. 1990. A very fast substring search algorithm. Commun. ACM 33, 8, 132–142.

Weiner, P. 1973. Linear pattern matching algorithms. In Proceedings of the 14th Annual IEEE Symposium on Switching and Automata Theory. Washington, DC, 1–11.

Witten, I. H., Moffat, A., and Bell, T. C. 1999. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann Publishers, San Francisco, CA.

Wu, S. and Manber, U. 1992. Fast text searching allowing errors. Commun. ACM 35, 10, 83–91.

Yao, A. C. 1979. The complexity of pattern matching for a random string. SIAM J. Comput. 8, 3, 368–387.
