nondeterministic finite automata in hardware - the case …tjt7a/docs/levenshtein_automata.pdf ·...

Nondeterministic Finite Automata in Hardware - the Caseof the Levenshtein Automaton

Tommy Tracy II, Mircea Stan, Nathan Brunelle, Jack Wadden, Ke Wang, Kevin

Skadron, Gabriel Robins

University of Virginia

Charlottesville, VA

[tjt7a, mrs8n, njb2b, jpw8bd, kw5na, ks7h, gr3e]@virginia.edu

ABSTRACTThe Levenshtein Nondeterministic Finite state Automaton(NFA) recognizes input strings within a set edit distance ofa configured pattern in linear time. This automaton canbe pipelined to recognize all substrings of an input text inlinear time with additional use of nondeterminism. In gen-eral, von Neumann hardware cannot directly execute NFAswithout significant time or space overhead. A von Neumannsimulation of the Levenshtein automaton incurs exponentialrun time overhead in the general case. A common techniqueto avoid the simulation overhead is to convert the pipelinedNFA to a DFA, but at the expense of heavy pre-computationand high space overhead.

In this paper, we introduce a novel technique for execut-ing a pipelined Levenshtein NFA using Microns AutomataProcessor (AP), avoiding the run time and space overheadsassociated with CPU and GPU implementations. We showthat run time remains linear with the input while the spacerequirement of the automaton becomes linear in the productof the configured pattern length and edit distance. Theseproperties allow the AP to execute large instances of theLevenshtein NFA or many small instances in parallel thusmaking the automaton a viable building block for futureapproximate string applications on the AP.

KeywordsAutomata Processor, Levenshtein Automaton, Nondetermin-istic Finite Automata, Approximate String Matching

1. INTRODUCTIONThe pipelined Levenshtein NFA A

L

(P, d) recognizes stringsthat approximately match a search string pattern P . Thecloseness of the match is measured by the edit distance dbetween the input string and P . This edit distance, alsoknown as the Levenshtein distance[2], is determined by theminimum number of primitive character operations requiredto convert the input string to P . The primitive character

operations considered are:

Insertions : waoo ! wahooDeletions : wahoo ! whoo

Substitutions : wahoo ! yahoo(1)

Approximate string matching is used in an array of appli-cation domains including bioinformatics, e.g. motif searchand alignment, and text retrieval, e.g. spell-checking andsearch engine applications. Utilizing a Levenshtein automa-ton is a method of determining approximate string matcheswithout explicitly calculating the edit distance between theinput string and the search string pattern.

A

L

(P, d) accepts inputs within edit distance d of P in lineartime with the input. A

L

(P, d) represents incurred errors andmatches with state transitions. When there is an error, thetype of error is ambiguous; choosing which error results inthe minimum edit distance is made easier with nondetermin-ism. Each accept state represents a dierent edit distancebetween the input and P . The state representing the mini-mal edit distance nondeterministicaly reachable on the inputgives its edit distance from the pattern.

A

L

(P, d) as an NFA is fairly straightforward to pipeline.Other approximate string matching algorithms have to firstsplit the input into k-mers before calculating the edit dis-tance between the k-mers and the search pattern. A

L

(P, d)can check all substrings of the input for a near-match inlinear time without the need for hashing or indexing!

Microns Automata Processor (AP) is an NFA engine, whichallows software developers to execute NFAs in hardware.This hardware can simulate NFAs as wide as can fit on thehardware, and doesnt have the same scalability limitationsas von Neumann simulation techniques like bit wise paral-lelism or the associated space explosion incurred by a pow-erset construction [5]. The ability to execute wide nondeter-minsm without the associated computation and space over-head makes the AP a promising platform for the pipelinedLevenshtein NFA.

In this paper we present our novel technique for executingthe pipelined Levenshtein NFA on Microns Automata Pro-cessor (AP). The AP can execute many Levenshtein NFAswith diering search string patterns concurrently processingthe same input. The rest of the paper will introduce theLevenshtein NFA, the Automata Processor, the technique

5.0

5.1

*

5.2

*

0.0 1.0w

0.1

*

1.1

* *

2.0a

2.1

*

w

0.2

*

1.2

* *

a

2.2

*

w a

*

3.0h

3.1

*

*

h

3.2

*

h

*

4.0o

4.1

*

*

o

4.2

*

o

o

**

o

**

o

Figure 1: A

L

(wahoo, 2)

we developed for executing the Levenshtein automaton onthe AP eciently, results, and finally application-dependentmodifications that can be made to the design.

2. THE LEVENSHTEIN AUTOMATONThe Levenshtein automaton [7] is a NFA configured for asearch string pattern P and max edit distance d , that rec-ognizes the set of strings that are within edit distance d ofP , meaning the input string can be transformed into P byat most d single-character operations: insertions, deletions,and substitutions.

Figure 1 is of a Levenshtein NFA for the search string pat-tern P=wahoo and edit distance d=2. The automatonencodes the running total of single-character errors in therow index. If the current input symbol fails to match on thecurrent search string pattern symbol, the automaton shiftsto a state that is one row upward, indicating a single sym-bol error. As more errors occur, the active automaton stateswill shift further upward until there are no more rows. TheAutomaton encodes the current match index of the searchstring pattern with the column index. For example, thefirst match transitions from the 1st to the 2nd column ofFigure 1 represent a match with the first character of thematch string, 0w0, the second set of transitions for the sec-ond character 0a0.

Each of the states in Figure 1 has a name that correspondsto the states match index value and the number of errorsincurred so far in the automaton0s reading of the input. Atstart state 0.0, the automaton has made no progress towardsthe acceptance states. By matching on the first character0w

0, the state 1.0 is activated. If the input string startswith a non-matching character, it will transition up one row,indicating an error on the first character. Finally, once allsymbols of the the search string pattern have been accountedfor with matches or edits, the last state of each row is anacceptance state. Activating an acceptance state indicatesthat the input is within the maximum allowed d of P andalso provides the edit distance between input and P basedon the row the acceptance state resides. The Automatonhandles all state transitions with the following rules:

1. All match transitions, are represented as a transition inthe rightward direction. Given the example automaton, theinput string wahoo would take the following path to an ac-cept state by only traversing matching transitions:

0.0(start) ! 1.0 ! 2.0 ! 3.0 ! 4.0 ! 5.0(accept)

2. Any character insertions, are represented as automatontransitions in the upward direction. As an example, the fol-lowing path represents the acceptance of the string wahoeo,where there is a single insertion after the first o in wahoo:

0.0(start) ! 1.0 ! 2.0 ! 3.0 ! 4.0 ! 4.1(insert) ! 5.1(ac-cept)

3. Any substitutions, where a character in P is replacedby another symbol in the input, are represented as a di-agonal * transition. Our example Levenshtein automatonwould accept the string waeoo, where e replaces h usingthe following path:

0.0(start) ! 1.0 ! 2.0 ! 3.1(substitution) ! 4.1 ! 5.1(ac-cept)

4. Finally, any deletions are represented in the Levenshteinautomaton with "-transitions. These "-transitions are usedin NFAs to represent free transitions, where a transitionis made without the need to consume an input symbol. Asan example, the Levenshtein automaton accepts the stringwah, because it is 2 deletions from wahoo. The automa-ton would use the following path to accept this string:

0.0(start)! 1.0! 2.0! 3.0! 4.1(delete)! 5.2(delete)(accept)

3. THE AUTOMATA PROCESSORMicrons Automata Processor (AP) is a re-configurable non-von Neumann MISD processor that can run multiple Non-deterministic Finite Automata (NFA) concurrently on thesame input stream. The automata are designed in a graph-ical environment called the AP Workbench that generatesan XML design file in the Automata Network Markup Lan-guage (ANML). This file is then compiled into a bitstreamthat can be processed by the AP hardware[1]. Alternatively,an AP design can be generated in ANML by a script; wechose this approach with our solution.

Automata on the AP are composed of a connected, directednetwork of State Transition Elements (STEs). These STEsare activated when the STEs are enabled by a neighboringSTE and the input matches the STEs assigned 8-bit symbolclass. STEs that represent a starting state are called StartSTEs; those that represent accept states are called Report-ing STEs. In order to simplify automata designs, the APsdesign tools include the macro construct. Developers candesign small, parameterizable automata as macros [8]. Weused this construct to create a set of macros and simplifythe design.

Microns current generation AP, called the D480, can runsat an input symbol rate of 133 MHz, with each chip support-ing two half-cores. Each half-core contains 96 blocks of 256STEs each. In total, each D480 chip contains 49,152 STEs,and each AP PCIe board can contain up to 48 AP chips. [8]

4. LEVENSHTEIN ON THE APThe Mealy-type NFA representation of the Levenshtein au-tomaton cannot be executed on the AP in its current form.The architecture requires that the automaton be a Moore-

Figure 2: Moore-type Levenshtein Implimentation

without "-transitions

type machine, where transitions are triggered on state val-ues. In addition, the AP cannot recognize "-transitions(those which consume no input). These limitations requiredus to develop a mapping technique for converting the Mealy-type Levenshtein automaton with "-transitions to an NFAthat the AP can execute. In this section we present our tech-nique that we have generated a script for and the overheadthat we incur to fulfill it. We also introduce a set of macrosthat our script uses to simplify the design of a Levenshteinautomaton.

4.1 Mealy vs. MooreAMoore-type state machine [4] is a finite-state machine withtriggers on the machines0s states. A transition is taken fromstate A to state B, if state A is active and if state B isdesignated to trigger on the input symbol. A Mealy-typestate machine [3] is a finite-state machine with triggers onthe transitions. Instead of having a separate state for eachpossible input symbol, a single Mealy-type state could havemultiple transitions into it with diering input triggers. Thisallows a Mealy-type Machine to be more compact than aMoore-type Machine. Figure 1 is a Mealy-type machine di-agram.

The Automata Processor0s architecture can only realize Moore-type Machine designs. We converted the Levenshtein au-tomaton to a Moore-type Machine by creating super stateswhich we realize as macros as shown in Table 1. Each STEin a super state macro represents both possible input sym-bols to that state: a search string symbol match, or an error,represented with *. The * character represents all sym-bols in the APs alphabet, so these transitions are also takenin the case of a match. This is not a problem because the re-sulting active set explosion isnt a complication for the APsarchitecture. Each super state macro has two input and twooutput ports connected to their respective STEs. In addi-tion, all outgoing transitions from the original Mealy-typestate must be duplicated for each STE in the super state;therefore each output port will have identical transitions.

Figure 2 shows the resulting super states after convertingthe Mealy-type machine to a Moore-type machine. The

bottom-most row and left-most column do not contain su-per states, because they only have one transition into each ofthese states. All other states are represented with two STEswith identical outgoing transitions. This figure only containsmatch, insertion, and substitution transitions to simplify thediagram. "-transitions cannot be represented as STEs with"-match symbols; we will address "-transitions next.

4.2 Epsilon-TransitionsAnother limitation of the AP is the lack of support for "-transitions. Many Mealy-type NFAs use the "-transition toreduce the complexity of a design. We will present a seriesof macros to solve this problem.

For internal transitions, we can handle "-transitions by...Roy and Aluru[6] handle "-transitions by connecting all statesthat have outgoing "-transitions to the states that have theassociated incoming "-transition. This method works for in-ternal states, but does not account for "-transitions fromstarting states or "-transitions to accept states. We brokethe Leavenshtein automatons "-transitions into three cate-gories: starting state "-transitions, accept state "-transitions,and internal "-transitions.

We classified all states that could be reached with "-transitionsfrom the starting state 0.0 as starting super states. Thosestates that were reachable and had a match transition werecalled late-start match super states to indicate that theyserve as starting states that match on P , but after one ormore deletions. We created a macro called the Late StartMatch Block that represents both states in the super state,where the match STE is a starting STE. In the case of start-ing with an error other than a deletion, we introduced aStarting Error Block macro. In this macro we have the er-ror state serve as the starting STE. Because this error incursa row penalty, the first Starting Error Block macro is on thesecond row.

To account for deletions from the end of the search stringpattern, we created Simple Error Report Blocks. Theseblocks report on a symbol or error match, and representthe acceptance of the input string. In the case of Figure 1,only the last states in each row served as accept states. Thiswas fine for a machine with "-transitions, because deletionswere accounted for by these transitions from lower rows. Tohave the same functionality without "-transitions we hadto collapse the "-transitions to form report diagonals. Re-port diagonals represent all possible deletions from the endof the Search String Pattern. For this reason, our resultingAP executable Levenshtein automaton had several more re-port elements to account for those diagonals; one for eachstate that was within "-transitions from an accept state inthe original design.

Finally, the last case of "-transitions that we needed to han-dle was internal "-transitions that originated from statesthat were neither starting states, nor accept states. To dothis, we devised an iterative algorithm to account for alldeletions possible from each given state. For example, toaccount for all deletions from the bottom-left most state inFigure 2 we used transitions that skipped one column andmatched on the second row to represent a deletion and amatch. We then created a transition from that same state

to one column over and an error match on the 3rd row toaccount for a deletion and a mismatch. The algorithm thencontinues for larger automata to include multiple deletionsfollowed by a match or error. It should be noted that wedid not account for deletions followed by insertions, becausethat is equivalent to a single substitution.

4.3 Construction AlgorithmIn this section we present the technique we used to con-struct Figure 3, a Moore-type Levenshtein NFA without "-transitions to execute on the AP given an arbitrary searchstring pattern P and an edit distance d . The one limitationthat our algorithm imposes is that the d be less than thelength of P . Intuitively, this seams a reasonable stipulationbecause if d were the same as the length of P , the emptystring would be an acceptable input to the Automaton, ef-fectively producing a useless machine. We break this sectioninto steps to clarify each component of the algorithm.

1. In this step, we construct the bottom row of the Lev-enshtein automaton. The first block in the bottom row isthe Starting Match Block, a starting macro that matches onthe first character of P . This is then connected to a chainof Simple Match Blocks until d blocks from the end of therow. All blocks after this index serve as Reporting MatchBlocks because they represent early Reports due to poten-tial deletions at the end of an otherwise perfect matchinginput pattern. As an example, the string wah would beaccepted by the Automaton because it is two deletions fromwahoo. wahoand wahoowould also be accepted on thebottom-most row of the automaton.

2. These Reporting Match Blocks represent the first blockon the deletion diagonals (reachable by "-transitions). Asthe row index increases, representing more cumulative er-rors, fewer deletions can be accounted for at the end of thesearch string pattern to be within d, therefore reducing thenumber of Reporting Match Blocks until the top-most rowonly has a single Reporting Match Block. The right-mostReporting Match Block on the bottom row represents anedit distance of 0, the diagonal originating at the 2nd tolast STE from the end represents an edit distance of 1, andso on. We can use this information to assign edit distancevalues to each of the Reporting Match Blocks.

3. The Late Start Match Blocks are placed in a diagonalfrom the Starting Match Block. These blocks represent earlydeletions, where the first 1 to d characters of the MatchString are deleted. As expected, as the number of earlydeletions increases, so does the row index, for errors, andthe column count, for the matching P index.

4. The Simple Starting Error Block is placed one column be-fore and one row up from the Starting Match Block. If thereare errant insertions before P , it is necessary to account forthese with this block. This block is then connected to a ver-tical chain of Error Match STEs that go up to the last row.This column represents all possible insertions allowed beforethe Match String.

5. Next, the Starting Error Block is placed on the nextrow above the Start Match Block. This block representsthe first mismatching character (replacement of W) in the

input string. To account for the possibility of deletions andthen mismatches, this block also has a diagonal of StartingError Blocks to account for these deletions in the upper-rightdirection.

6. Finally, the rest of the blocks in the design are SimpleError Blocks. These blocks neither serve as starting nodesnor reporting nodes.

The blocks are connected in the same way as the Mealy-type design for all but "-transitions; we used the previouslydiscussed start STEs, early reporting STEs, and iterativedeletion transitions to account for these transitions. To au-tomate this algorithm, we developed a script for generatingan AP-compatible Levenshtein NFA given any input stringmatch pattern and max edit distance. We made all startingSTEs in the design start on all STEs. This meant thatevery starting STE would be active for all symbols in thisinput; this resulted in a pipelined Levenshtein NFA that de-termines approximate string matches with all substrings ofthe input, but still in linear time!

5. SCALABILITYBecause the Automata processor can run multiple NFAs con-currently on the same input, scalability is important whenconsidering the performance of the AP. It is important thatthe number of STEs required is minimized so that the APhardware can execute as many automata concurrently aspossible. In this section we will determine the number ofSTEs required to realize a Levenshtein automaton, and thenumber of Levenshtein automata we can execute at one timeon the AP.

d represents the maximum edit distance.L represents the length of the match string pattern P .

STE

total

= d+ L+ (2 L d)STE

reporting

= (d+ 1)2

These equations indicate that the number of STEs requiredto construct a Levenshtein NFA scales with the product ofthe configured pattern length L and the maximum edit dis-tance d, and that the number of reporting STEs is O(d2).Considering that for most bioinformatics applications theedit distance is kept relatively low at 5 or less, these re-sults indicate a scalable design. We determined that a singleAP core can hold an Automaton configured with a patternlength of size 2730 and edit distance of 4. When consideringa relatively modest pattern size of 100 with an edit distanceof 4, a single AP core has the capacity for 27 parallel au-tomata.

In the case of an application that requires executing theautomaton on many dierent search string patterns, morethan the capacity of the available AP hardware, it is possibleto use soft re-configurations (reload match symbols) to swapon new search string patterns. As long as d and the lengthof P do not change, the same AP design can be used.

Table 1: Levenshtein Macros

Macro Name Levenshtein Macros Description

Starting Match Block

- Starting STE- Matches on first character of search stringpattern.

Simple Starting Error Block

- Starting STE- Matches on error values.- Accounts for errors in the beginning of aninput string.

Simple Match Block

- A single STE matches on a character inthe matching string.- This macro is used on the bottom row,where no errors occur.

Reporting Match Block

- This block is the same as the simplematch block, except it reports on a match.

Late Start Match Block

- Starting STE- This macro accounts for inputs with be-ginning deletions, where the input stringstarts late.- This macro has two states: one that trig-gers on a matching character, and anotheron an error.

Starting Error Block

- Starting STE- Matches on all characters.- Used to accept input strings with startingerrors.

Simple Error Block

- This block has two STEs: one to matchon the match string character, the other tomatch on an error.- This block neither starts no accepts.

Simple Error Report Block

- This block has two accepting STEs: onefor a match, the other for an error.

Figure 3: AP-Executable A

L

(wahoo, 2)

Figure 4: A reduced size Levenshtein Automaton

6. FUNCTIONAL MODIFICATIONSOur illustrated design determines if any subset of the inputstring is within edit distance d of P , and reports if so, as wellas reporting the indicated edit distance of the matching in-put string. This design can be extended to include additionalfunctionality like variable scoring, where dierent primitiveoperations have diering costs associated with them. Thisis particularly useful for bioinformatics applications whereedit operations occur with diering probabilities. Anotherenhancement we can make is reducing the size of the Lev-enshtein NFA by removing early insertion and late deletionstates. This allows us to reduce the number of STEs re-quired to instantiate a single Levenshtein automaton, butat the expense of determining the inputs Levenshtein score.

6.1 ScoringVariable edit scoring has potential applications in bioinfor-matics. It allows a programmer to assign dierent costs tothe edit operations accounted for by the NFA. To explainthis enhancement, we will consider insertions. Our designassigns a cost of one error to insertions, deletions, and sub-stitutions. If we wanted to double the cost of an insertion,we could adjust our upward direction transitions from in-stead of connecting to the next immediate row, connect tothe row two rows up. This would account for an insertioncost of 2. This same strategy could be used for any of theother edit operators. The limitation here is that the costwould need to be an integer value.

6.2 Levenshtein ReductionThe Levenshtein NFA recognizes an input string by trigger-ing a reporting STE. This STE has an edit distance asso-ciated with it, depending on which diagonal it is on. Thebottom-right-most row has an accept state with edit dis-tance 0, because the entire pattern has been matched withno errors; the second row has an edit distance 1 assignedto it, because one error has been discovered in the traversalof the input to the accept state. If the edit distance metricis not of importance it is possible to reduce the size of theLevenshtein NFA to accept on only max edit distance acceptstates. This works because if a string X is a perfect matchof P , X with one deletion is within edit distance 1 of P .In this way, it is possible to account for all possible stringsby only accepting the worst-case edit distance and ignoringprefixes and suxes.

Figure 4 shows the reduced version of AL

(wahoo, 2). It isthe same automaton but with removed early insertion statesand late insertion states. What is clear is that only d+1 re-porting states are required, because only worst-case reports

are necessary. For this reason, the last reporting elementdiagonal serves as the last state in each of the rows. Thismodification to the Levenshtein automaton has a small im-pact on the over-all size of the automaton, but if many areused in parallel, this may have a non-negligible impact onthe capacity of the hardware.

7. CONCLUSIONS AND FUTURE WORKIn this paper, we presented a technique for executing apipelined Levenshtein NFA on Microns Automata Proces-sor. We found that preserving the automatas nondeter-minism resulted in a scalable design on the AP. We alsopresented several modifications that could be made to thedesign to account for variable edit costs as well as a modifi-cation for reducing the size of the automaton at the expenseof determining edit distance.

The Levenshtein automaton has potential in the field ofbioinformatics, and with the introduction of the AP, thatpotential can finally be unlocked. Future work includes com-paring the execution of A

L

(P, d) on the AP versus the DFAand simulated versions on a CPU and GPU. Finally, we willimplement a short read aligner from A

L

(P, d) and compareour results to the state of the art CPU and GPU aligners.

8. ACKNOWLEDGMENTSWe would like to thank Micron for their cooperation withthis and many other AP projects. We would also like toacknowledge the help that we got from the Center for Au-tomata Processing (CAP) at the University of Virginia.

9. REFERENCES[1] P. Dlugosch, D. Brown, P. Glendenning, M. Leventhal,

and H. Noyes. An ecient and scalable semiconductorarchitecture for parallel automata processing. Paralleland Distributed Systems, IEEE Transactions on,25(12):30883098, Dec 2014.

[2] V. I. Levenshtein. Binary codes capable of correctingdeletions, insertions, and reversals. Doklady AkademiiNauk SSSR, 10(8):845aAS848, Feb 1966.

[3] G. H. Mealy. A method for synthesizing sequentialcircuits. Bell System Technical Journal, The,34(5):10451079, Sept 1955.

[4] E. F. Moore. Gedanken-experiments on sequentialmachines. Automata studies, 34:129153, 1956.

[5] G. Navarro. A guided tour to approximate stringmatching. ACM Comput. Surv., 33(1):3188, Mar.2001.

[6] I. Roy and S. Aluru. Finding motifs in biologicalsequences using the micron automata processor. InParallel and Distributed Processing Symposium, 2014

IEEE 28th International, pages 415424, May 2014.[7] K. U. Schulz and S. Mihov. Fast string correction with

levenshtein automata. International Journal onDocument Analysis and Recognition, 5(1):6785, 2002.

[8] K. Wang, M. Stan, and K. Skadron. Association rulemining with the micron automata processor. InProceedings of the 29th IEEE International Parallel and

Distributed Processing Symposium (IPDPS), 2015.

nondeterministic finite automata in hardware - the case …tjt7a/docs/levenshtein_automata.pdf ·...

Documents