
Kubilay Atasu – IBM Research Zurich

© 2013 IBM Corporation

Hardware-accelerated regular expression matching for high-throughput text analytics

Kubilay Atasu, Raphael Polig, Christoph Hagleitner, Frederick R. Reiss

IBM Research – Zurich & IBM Research – Almaden


Outline

Text analytics systems

Advanced regex features

Network of state machines

Implementation & experiments

Conclusions & future work

K. Atasu et al.


SystemT: an algebraic approach to declarative information extraction

distill structured data from unstructured and semi-structured text

exploit the extracted data in your applications

For years, Microsoft Corporation CEO Bill Gates was against open source. But today he appears to have changed his mind. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying…

Annotations:

Name              Title    Organization
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  Founder  Free Soft..

(from Cohen’s IE tutorial, 2003)



A typical SystemT information extraction query

Find the names (regex) that are at most 20 chars after a title (dict.)

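[Figure: in the text "Founder..............Bill Gates", the dictionary match ("Founder") and the regex match ("Bill Gates") each carry a start offset and an end offset and lie at most 20 chars apart; the result is reported with its own start offset and end offset.]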
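The query on this slide can be modeled in software as a join on match offsets. The following is a minimal Python sketch under stated assumptions, not SystemT's query language or its implementation: the title dictionary `TITLES`, the name pattern `NAME_RE`, and both function names are hypothetical.

```python
import re

TITLES = ["CEO", "VP", "Founder"]                  # hypothetical title dictionary
NAME_RE = re.compile(r"[A-Z][a-z]+ [A-Z][a-z]+")   # hypothetical name pattern


def dict_matches(text, entries):
    """Return sorted (start, end) offsets of every dictionary entry occurrence."""
    spans = []
    for entry in entries:
        for m in re.finditer(re.escape(entry), text):
            spans.append((m.start(), m.end()))
    return sorted(spans)


def names_after_titles(text, max_gap=20):
    """Pair each title with the names that begin at most max_gap chars after it."""
    name_spans = [(m.start(), m.end()) for m in NAME_RE.finditer(text)]
    results = []
    for t_start, t_end in dict_matches(text, TITLES):
        for n_start, n_end in name_spans:
            # The gap is measured from the title's end offset to the name's start offset.
            if 0 <= n_start - t_end <= max_gap:
                results.append((text[t_start:t_end], text[n_start:n_end], n_start, n_end))
    return results
```

Both operands of the join need a start offset and an end offset, which is why the regex matcher has to report where each match begins and not only where it ends.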



Advanced regex features


Regex matching: background

Consider the regex .*a*b[^a]*ca*b

Can be transformed into NFA/DFA

Hybrid solutions are also possible

[Figures: NFA; DFA; NFA with a single nondeterministic state]
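To make the NFA view concrete, here is a minimal Python sketch that simulates an NFA for .*a*b[^a]*ca*b by tracking the set of active states. The automaton is hand-built for illustration and its state numbering (0 to 6, with 6 accepting) is an assumption; it is not taken from the slide's figure.

```python
def nfa_step(state, ch):
    """Successor set of one NFA state for .*a*b[^a]*ca*b (hand-numbered states)."""
    if state == 0:                        # .* loop; may also start a* or b
        return {0} | ({1} if ch == "a" else set()) | ({2} if ch == "b" else set())
    if state == 1:                        # inside the first a*
        return {"a": {1}, "b": {2}}.get(ch, set())
    if state in (2, 3):                   # after b, inside [^a]*
        if ch == "a":
            return set()
        return {3, 4} if ch == "c" else {3}   # 'c' may extend [^a]* or be the literal c
    if state in (4, 5):                   # after c, inside the second a*
        return {"a": {5}, "b": {6}}.get(ch, set())
    return set()                          # state 6 (accepting) has no successors


def match_end_offsets(text):
    """Return every end offset at which the accepting state is active."""
    active, ends = {0}, []
    for i, ch in enumerate(text):
        active = set().union(*(nfa_step(s, ch) for s in active))
        if 6 in active:
            ends.append(i + 1)
    return ends
```

A plain set of active states is enough to detect where a match ends, but it carries no information about where that match started.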




Start Offset Reporting

Consider the regex .*a*b[^a]*ca*b

Consider the input string abcabcab

There are four distinct regex matches

Only 2 and 4 are leftmost matches

Each one has a different start offset

[Figure: the input a b c a b c a b with the four matches marked as match 1, match 2, match 3, and match 4]

NFA needs to remember multiple start offsets!

NFA must know which one(s) to report, when!

No existing HW architecture addresses this problem:

Reconfigurable NFAs (Baker FCCM 2001, Bispo FPT 2006, Yang ANCS 2008, …)

Programmable DFAs (Smith SIGCOMM 2008, Van Lunteren INFOCOM 2012, …)
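The four matches can be checked by brute force. The sketch below uses Python's backtracking `re` module as a reference, trying every substring against the regex without its leading .*; it is only a way to enumerate the (start, end) pairs, not a model of the hardware. It takes "leftmost" to mean the smallest start offset among matches that end at the same position, which is an assumption about the slide's figure, and the function names are illustrative.

```python
import re

PATTERN = re.compile(r"a*b[^a]*ca*b")   # the slide's regex without the leading .*


def all_matches(text):
    """Enumerate every (start, end) pair whose substring matches PATTERN exactly."""
    return [
        (start, end)
        for end in range(1, len(text) + 1)
        for start in range(end)
        if PATTERN.fullmatch(text, start, end)
    ]


def leftmost_matches(text):
    """For each end offset keep only the match with the smallest start offset."""
    best = {}
    for start, end in all_matches(text):
        best[end] = min(start, best.get(end, start))
    return [(start, end) for end, start in sorted(best.items())]
```

On the input abcabcab this yields two matches ending at offset 5 and two ending at offset 8, so a matcher that only tracks the current state cannot tell which start offset to report.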



Capturing groups

Consider the regex .*a*(?b[^a]*c)a*b

A capturing group is marked by (? )

Assume regexs of the type .*R1(?R2)R3

Build the DFAs of R1, R2, and R3

– interconnected by epsilon transitions

epsilon removal creates nondeterminism

– but, only at the terminal states

– all other states are deterministic

– at most two possible next states
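For reference, this is what a capturing group reports in a software engine. The sketch uses Python's `re` module, where a capturing group is written with plain parentheses instead of the slide's (? ) notation; with R1 = a*, R2 = b[^a]*c, and R3 = a*b, the group covers only the R2 part. The helper name `capture_spans` is hypothetical, and the backtracking engine behind it is unrelated to the DFA construction described above.

```python
import re

# R1 = a*, R2 = b[^a]*c (captured), R3 = a*b
PATTERN = re.compile(r"a*(b[^a]*c)a*b")


def capture_spans(text):
    """Return (match span, group span, group text) of the first match, or None."""
    m = PATTERN.search(text)
    if m is None:
        return None
    return m.span(0), m.span(1), m.group(1)
```

A hardware matcher that supports capturing groups has to report the start and end offsets of the R2 part in addition to those of the whole match.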



Network of state machines


Our solution: network of state machines

Use a network of state machines and enable each state machine to remember its start offset




Shutdown Logic:

– deactivate configurations storing the same state value

– only the one with the smallest start offset remains active

Routing Logic:

– route branch configurations to inactive state machines

– based on active flags (af) and branch flags (bf)



Computing the size of the network

The size of the network can be statically computed:

– the largest number of NFA states mapped to a single DFA state

– typically much smaller than the number of NFA states

While supporting leftmost match semantics

– if multiple state machines reach the same state, only one will remain alive

– shutdown logic will pick the configuration with the smallest start offset

[Figure: NFA and DFA]
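The static size computation can be illustrated with a subset construction. The Python sketch below assumes a hand-derived NFA for .*a*b[^a]*ca*b with seven states in which only state 0 is nondeterministic (the numbering is hypothetical), builds every reachable DFA state as a set of NFA states, and takes the largest set as the network size.

```python
ALPHABET = "abcx"   # 'x' stands in for any character other than a, b, c


def nfa_step(state, ch):
    """Successor set of one NFA state; only state 0 can have two successors."""
    if state == 0:
        return {0} | {{"a": 1, "b": 2}.get(ch, 0)}
    if state == 1:
        table = {"a": 1, "b": 2}
    elif state in (2, 5):
        table = {"b": 2, "c": 3, "x": 2}
    elif state == 3:
        table = {"a": 4, "b": 5, "c": 3, "x": 2}
    elif state == 4:
        table = {"a": 4, "b": 6}
    else:
        table = {}
    return {table[ch]} if ch in table else set()


def subset_construction():
    """Return all reachable DFA states, each a frozenset of NFA states."""
    start = frozenset({0})
    seen, todo = {start}, [start]
    while todo:
        subset = todo.pop()
        for ch in ALPHABET:
            nxt = frozenset().union(*(nfa_step(s, ch) for s in subset))
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return seen


def network_size():
    """Largest number of NFA states mapped to a single DFA state."""
    return max(len(subset) for subset in subset_construction())
```

For this automaton the construction reaches seven DFA states and the largest one holds three NFA states, so three state machines suffice even though the NFA has seven states.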



What kind of network? (architecture 1)

Only state 0 can assert its branch flag

Only a single branch configuration is routed

A priority encoder is sufficient for arbitration


Partially expanded NFA can be minimized

DFA compression techniques can be used

Also decreases the size of the network



What kind of network? (architectures 2 & 3)

Pack and unpack operations (left)

– log(N) delay, N log(N) area each

– 2 log(N) delay and 2N log(N) total area

A single wide pack operation (below)

– log(2N) = log(N) + 1 delay (lower)

– 2N log(2N) = 2N log(N) + 2N area (higher)

– simpler shutdown logic

– this is our Architecture 3


– single branch flag

– N+1→N pack instead of a 2N→N pack

– this is our Architecture 2
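The delay and area figures quoted for the pack networks can be compared numerically. This Python sketch assumes the logarithms are base 2 and that N is a power of two; the function names are illustrative, and the results are abstract stage and cell counts, not measured hardware numbers.

```python
import math


def pack_unpack_cost(n):
    """(delay, area) of a pack followed by an unpack over N machines."""
    stages = math.log2(n)
    return 2 * stages, 2 * n * stages          # 2 log(N) delay, 2N log(N) area


def wide_pack_cost(n):
    """(delay, area) of a single 2N-input pack network."""
    stages = math.log2(2 * n)                  # log(2N) = log(N) + 1
    return stages, 2 * n * stages              # 2N log(2N) = 2N log(N) + 2N area
```

For N = 8 the pack and unpack pair costs 6 stages and 48 cells, while the single wide pack costs 4 stages and 64 cells, which matches the lower-delay, higher-area trade-off stated on the slide.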



Implementation & experiments



Design: two-stage, dual-threaded pipeline

– cycle 1: next configuration computation

– cycle 2: interconnection network

Hardwired next state computation

– DFA compression techniques

Benchmark       # regexs  Comb. LUTs  Registers  Frequency
Text Analytics  25        4.4%        4.3%       ~250 MHz
L7 Filter       101       29%         20%        ~150 MHz

Device: Altera Stratix IV GX530KH40C2

– using Altera Quartus II, V 11

– target frequency: 250 MHz

Single pipeline throughput rate: 2 Gb/s

– single stream throughput rate: 1 Gb/s

Up to 16 Gb/s aggregate throughput rate for Text Analytics regexs

– using 8 hardware pipelines (16 document streams)
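The throughput figures are consistent with simple cycle arithmetic. The Python sketch below assumes that a pipeline consumes one 8-bit character per clock cycle and that the two threads of a pipeline share that rate equally; the slides state the rates but not these details, and the function name is illustrative.

```python
def throughput_gbps(clock_mhz, bits_per_cycle=8, pipelines=1):
    """Throughput in Gb/s for the given clock, input width, and pipeline count."""
    return clock_mhz * 1e6 * bits_per_cycle * pipelines / 1e9


single_pipeline = throughput_gbps(250)               # one pipeline at 250 MHz
single_stream = single_pipeline / 2                  # dual-threaded: two streams per pipeline
aggregate = throughput_gbps(250, pipelines=8)        # 8 pipelines, 16 document streams
```

Under these assumptions the three values come out as 2, 1, and 16 Gb/s, matching the rates listed above.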

320-fold faster than software with the same functionality

– using 16 software threads running on an Intel™ Xeon E5530

All L7 filter regexs have been made unanchored (more complex)

– it’s trivial to compute the start offset for anchored regexs



Text analytics regular expressions: resource usage

[Figure: histogram of the resource usage (ALUTs) of the 25 regexs; x-axis: regex index, 1 to 25; y-axis: ALUTs, 0 to 8000; series: Architecture 1, Architecture 2, Architecture 3, Capturing Gr.]



Text analytics regular expressions: clock frequency

[Figure: histogram of the clock frequency (MHz) of the 25 regexs; x-axis: regex index, 1 to 25; y-axis: MHz, 0 to 700; series: Architecture 1, Architecture 2, Architecture 3, Capturing Gr.]



Conclusions & future work



Novel regex matching architecture that supports advanced features

– start offset reporting, capturing groups, leftmost match semantics

– based on a network of state machines and an optimized interconnection network

– strictly forward processing, without introducing any back-pressure

Its reconfigurable hardware implementation and experiments that show

– up to 16 Gb/s aggregate throughput rate using 8 hardware pipelines

– up to 320x speed-up over 16 software threads running on a server

– including an evaluation of various implementation choices

The current and future work includes

– reducing the resource consumption to improve the scalability

– making the next-configuration-computation logic programmable

– supporting additional regex features, such as back-references

