A MEMORY-EFFICIENT ε-REMOVAL ALGORITHM FOR WEIGHTED FINITE-STATE AUTOMATA

Thomas Hanneforth, Universität Potsdam

Overview

ε-transitions in finite-state based NLP

Removing ε-transitions in weighted finite-state automata: an algorithm by M. Mohri

Some formal definitions

An improved algorithm demonstrated in the case of acyclic automata

Experiments

ε-Transitions in finite-state based NLP

Many NLP applications based on (weighted) finite-state automata (WFSM) create a lot of ε-transitions during processing

Examples:
Applying bracketing rules (NE recognition, local grammars)
Corpus processing

These ε-transitions have to be removed for reasons of speed and efficiency.

In many cases, the finite-state automata containing the ε-transitions are acyclic.

Example: N-gram counting in corpora

A corpus is a disjunction of sentences.

The corpus of which the N-grams are to be counted is represented as an acyclic WFSM over the real semiring.

That means: the weights along a path in the corpus WFSM are multiplied to compute the absolute frequency of a given sentence.

The N-gram counter is represented as a special cyclic weighted finite-state transducer.

Example: N-gram counting in corpora

A corpus C as a WFSM

For example, the absolute frequency of the sentence bbcd is 4 · 0.25 · 1 · 1 · 1 = 1
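The product can be checked directly. A minimal sketch, assuming only the five path weights quoted on the slide (the automaton figure itself is not preserved in this transcript):

```python
from math import prod

# Weights along the path for the sentence "bbcd", as listed on the slide.
path_weights = [4, 0.25, 1, 1, 1]

# In the real semiring, the weights along a path are multiplied.
frequency = prod(path_weights)
print(frequency)
```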

Example: N-gram counting in corpora

A corpus as a WFSM

Counting is basically composition of the corpus with the counting transducer and taking the lower tape of the result: π₂(C ∘ T)

A bigram counting transducer T

Example: N-gram counting in corpora

π₂(C ∘ T)

ε-removal in WFSMs: Mohri's algorithm

1. For each state p, compute the ε-distance to every other reachable state q.

2. For each ε-path with distance w from p to q and a single transition from q to r labeled with a and weight w', add a transition from p to r with label a and weight w ⊗ w' to the FSA. If q is a final state, p will also become a final state. If p already was a final state, the final weights of q and p are additively combined.

3. Remove all ε-transitions, non-reachable states and non-contributing transitions.

ε-removal in WFSMs: Mohri's algorithm

General ε-removal pattern:

The states for which the pattern is applied can be visited in any order

The ε-distance between p and q is w

Weights w and w' are combined by ⊗-multiplication

ε-removal in WFSMs: Mohri's algorithm

If the ε-subgraph of the WFSM is acyclic, it is possible to process the states in reverse topological order.

Example:

Reverse topological order

Two transitions attached to non-reachable states are superfluous and have to be removed in step 3

Nevertheless, they preserve the weights associated with ε-transitions earlier in the reversed topological order.
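The effect described here can be reproduced on a toy automaton. The following is a sketch of the reverse topological order strategy only, not of Mohri's general algorithm. It assumes the real semiring, states numbered in topological order, and a hypothetical layout in which `trans` maps each state to a list of `(label, weight, target)` triples with `None` standing for ε. All names and the example automaton are illustrative.

```python
EPS = None  # label used for ε-transitions in this sketch (an assumption)

def remove_eps_reverse_topological(trans, finals):
    """Single pass over the states in descending (reverse topological) order."""
    trans = {s: list(ts) for s, ts in trans.items()}
    finals = dict(finals)
    for p in sorted(trans, reverse=True):
        added = []
        for label, w, q in trans[p]:
            if label is EPS:
                # q was processed earlier, so it is already ε-free and its
                # transitions carry the weights of all ε-paths behind it.
                for label2, w2, r in trans[q]:
                    added.append((label2, w * w2, r))
                if q in finals:
                    finals[p] = finals.get(p, 0.0) + w * finals[q]
        trans[p] = [t for t in trans[p] if t[0] is not EPS] + added
    return trans, finals

# Hypothetical chain: 0 -ε/0.5-> 1 -ε/0.5-> 2 -a/1-> 3 (final)
trans = {
    0: [(EPS, 0.5, 1)],
    1: [(EPS, 0.5, 2)],
    2: [("a", 1.0, 3)],
    3: [],
}
result, final_weights = remove_eps_reverse_topological(trans, {3: 1.0})
total_transitions = sum(len(ts) for ts in result.values())
print(result, total_transitions)
```

In this toy case, states 1 and 2 are no longer reachable from state 0 after the pass, yet both still carry a transition until the final clean-up step. The transition on state 1 is needed on the way, because state 0 copies it.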

An improved algorithm: Idea

The attachment of newly created transitions to inaccessible states must somehow be avoided

But, when applying the reverse topological order strategy, these transitions are necessary even if they are deleted in step 3 of the algorithm

Thus, the reverse topological order strategy can no longer be used

Simple idea: keep track of reachable states

I will focus on the special case of acyclic WFSMs

Some formal definitions

Semiring

A structure <K, ⊕, ⊗, 0, 1> is a semiring if it fulfils the following conditions:

1) <K, ⊕, 0> is a commutative monoid with 0 as the identity element for ⊕

2) <K, ⊗, 1> is a monoid with 1 as the identity element for ⊗

3) ⊗ distributes over ⊕

4) 0 is an annihilator for ⊗: ∀w ∈ K, w ⊗ 0 = 0 ⊗ w = 0

Common semirings are the real semiring <R, +, ·, 0, 1> and the tropical semiring <R ∪ {∞}, min, +, ∞, 0>.
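To make the conditions concrete, here is a small sketch that spot-checks them on sample values for the real and the tropical semiring. The `Semiring` class and `satisfies_axioms` are hypothetical names introduced for illustration. Associativity is not tested, and the check covers only the given samples, so it is no proof.

```python
import math
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Semiring:
    plus: Callable[[float, float], float]   # the ⊕ operation
    times: Callable[[float, float], float]  # the ⊗ operation
    zero: float                             # identity for ⊕, annihilator for ⊗
    one: float                              # identity for ⊗

REAL = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0.0, 1.0)
TROPICAL = Semiring(min, lambda a, b: a + b, math.inf, 0.0)

def satisfies_axioms(sr, samples):
    """Spot-check identities, annihilator, commutativity of ⊕ and distributivity."""
    for a in samples:
        if sr.plus(a, sr.zero) != a or sr.plus(sr.zero, a) != a:
            return False
        if sr.times(a, sr.one) != a or sr.times(sr.one, a) != a:
            return False
        if sr.times(a, sr.zero) != sr.zero or sr.times(sr.zero, a) != sr.zero:
            return False
        for b in samples:
            if sr.plus(a, b) != sr.plus(b, a):
                return False
            for c in samples:
                left = sr.times(a, sr.plus(b, c))
                right = sr.plus(sr.times(a, b), sr.times(a, c))
                if left != right:
                    return False
    return True

# Sample values are chosen to be exactly representable as floats.
real_ok = satisfies_axioms(REAL, [0.0, 0.5, 1.0, 2.0, 4.0])
tropical_ok = satisfies_axioms(TROPICAL, [0.0, 1.0, 2.5, math.inf])
print(real_ok, tropical_ok)
```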

Some formal definitions

ε-distance between two states p and q:

ε-dist(p, q) = ⊕_{π ∈ Π(p, ε, q)} w(π)

Π(p, ε, q) is the set of all paths between p and q labeled with ε

All ε-path weights are abstractly added (⊕)

A path π = t1 t2 … tk has the weight w(π) = w[t1] ⊗ w[t2] ⊗ ... ⊗ w[tk]
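The definition can be evaluated literally on a small acyclic example by enumerating all ε-paths. This is a sketch under assumptions: the real semiring (⊕ is +, ⊗ is ·), a hypothetical ε-subgraph `EPS_GRAPH`, and illustrative function names. Enumerating paths is exponential in general and is shown only to mirror the formula.

```python
# Hypothetical acyclic ε-subgraph: state -> list of (target, weight) ε-transitions.
EPS_GRAPH = {
    0: [(1, 0.5), (2, 0.25)],
    1: [(3, 0.5)],
    2: [(3, 1.0)],
    3: [],
}

def eps_paths(p, q):
    """Yield the weight sequence of every ε-path from p to q (acyclic graphs only)."""
    if p == q:
        yield []  # the empty path
        return
    for target, weight in EPS_GRAPH[p]:
        for rest in eps_paths(target, q):
            yield [weight] + rest

def eps_dist(p, q):
    """Sum (⊕) over all ε-paths of the product (⊗) of their transition weights."""
    total = 0.0
    for weights in eps_paths(p, q):
        path_weight = 1.0
        for w in weights:
            path_weight *= w
        total += path_weight
    return total

print(eps_dist(0, 3))
```

Here two ε-paths lead from 0 to 3, with weights 0.5 · 0.5 and 0.25 · 1.0, so their sum is 0.5.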

An improved algorithm: example

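[Figure: example WFSA with its states in topological order. Surviving annotations: the reachable-state sets {0,4} and {0}, and ε-distance(0) = { ⟨1, 0.1⟩, ⟨2, 0.3⟩, ⟨3, 0.6⟩ }; the value of ε-distance(4) is not preserved.]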

An improved algorithm

Input: An acyclic WFSA A = ⟨Σ, Q, q0, F, E, ρ⟩
Output: An equivalent ε-free WFSA A'

R ← reachable({q0})
for all p ∈ Q in ascending order do
    if p ∈ R then
        D ← compute-shortest-ε-distances(A, p)
        R' ← ∅
        for all ⟨q, w⟩ ∈ D do
            for all t ∈ E[q] do
                E ← E ∪ { ⟨p, l[t], w ⊗ w[t], n[t]⟩ }
                R' ← R' ∪ {n[t]}
            end for
            adjust-final-state(A, p, q)
        end for
        R ← R ∪ reachable(R')
    end if
end for
delete-ε-transitions(A)
delete-states(Q − R)
connect(A)
return A

Improved algorithm: ε-distances

ε-distances are usually computed with a generalized shortest-distance algorithm

For cyclic WFSMs, this algorithm may be optimized by letting it operate on the strongly connected components of the WFSM

For acyclic WFSMs, relaxation in topological order is the most efficient algorithm

Improved algorithm: Computing ε-distances

There are at least 4 approaches to compute acyclic ε-distances:

1. Topologically sort the input WFSM and use this order for computing ε-distances

2. Construct an embedded topological order for every ε-subautomaton (two-pass strategy)

3. As 2., but cache already computed distances

4. Topologically sort the input WFSM and make use of a priority queue which is ordered by state number

Improved algorithm: Computing ε-distances in an acyclic WFSM

Example:

The global topological order is 0 1 2 3 4 5 6

There are two ε-subgraphs rooted at states 1 and 2, respectively. The topological orders are:

1 3 4 5
2 4 5

In a topologically ordered WFSM, whenever you have a transition p → q, the state number of q is strictly greater than the state number of p.
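This property is easy to test mechanically. A minimal sketch, assuming transitions are given as `(source, target)` pairs; the edge list below is hypothetical and merely consistent with the orders listed above, since the figure is not preserved:

```python
def is_topologically_numbered(transitions):
    """True if every transition p -> q satisfies p < q."""
    return all(p < q for p, q in transitions)

# Hypothetical edges compatible with the global order 0 1 2 3 4 5 6.
edges = [(0, 1), (0, 2), (1, 3), (2, 4), (3, 4), (4, 5), (5, 6)]
print(is_topologically_numbered(edges))
```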

Improved algorithm: ε-distances with a priority queue

Input:
Output:

S ← ∅
PQ ← ∅
enqueue(PQ, p)
while PQ ≠ ∅ do
    q ← pop(PQ)
    if q ∉ S then
        S ← S ∪ {q}
        if q = p then dq ← 1
        else dq ← d[q]
        end if
        for all t ∈ E[q] do
            d[n[t]] ← d[n[t]] ⊕ (dq ⊗ w[t])
            enqueue(PQ, n[t])
        end for
    end if
end while
return d
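The same procedure in Python, as a sketch using the standard `heapq` module as the priority queue. The assumptions are: the real semiring, states numbered in topological order, and a hypothetical mapping `eps_trans` that holds only ε-transitions as `(target, weight)` pairs. The example weights are invented. Only the state order 1 3 4 5 follows the earlier example.

```python
import heapq

def eps_distances_pq(eps_trans, p):
    """ε-distances from p; the heap always yields the smallest state number first."""
    d = {}
    visited = set()
    pq = [p]
    while pq:
        q = heapq.heappop(pq)
        if q in visited:
            continue  # q may have been enqueued several times
        visited.add(q)
        # All predecessors of q have smaller numbers, so d[q] is complete here.
        dq = 1.0 if q == p else d[q]
        for target, w in eps_trans.get(q, []):
            d[target] = d.get(target, 0.0) + dq * w
            heapq.heappush(pq, target)
    return d

# Hypothetical ε-subgraph rooted at state 1.
eps_trans = {
    1: [(3, 0.5), (4, 0.25)],
    3: [(4, 0.5)],
    4: [(5, 1.0)],
}
distances = eps_distances_pq(eps_trans, 1)
print(distances)
```

State 4 is reached along two ε-paths (directly, and via state 3), and both contributions are added before state 4 itself is popped.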

Improved algorithm: Complexity

Of course, in the worst case the algorithm presented here has the same complexity as Mohri's algorithm

So, the complexity is:
In the acyclic case: O(|Q||E| + |Q|²)
In the cyclic case: O(|Q||E| + |Q|² log |Q|)

The memory complexity is in O(|Q|)

As the experiments will show, there is a clear improvement in practical cases

Experiments: Input data

Input data: 50,000 sentences of the German TiGer corpus

Compiled into an optimised WFSM over the real semiring with 681,689 states and 730,175 transitions, with |Σ| = 89,418

To that, a trigram counter was applied

This resulted in a WFSM with 2,724,212 states and 3,615,890 transitions (1,429,530 ε-transitions)

The maximum out-degree, that is, the maximum number of outgoing transitions for a state, was 14,044

Experiments

| Algorithm | Total time (s) | Max. memory usage (MB) | # transitions (before connect) |
|---|---|---|---|
| Mohri's algorithm with reverse topological order strategy | 3.48 | 409 | 13,306,056 |
| Algorithm with reachability enforcement: processing the ε-subautomata in topological order | 8.46 | 116 | 2,912,740 |
| Algorithm with reachability enforcement: using a priority queue | 8.21 | 106 | 2,912,740 |

The experiments were run on an Intel quad-core CPU at 2.5 GHz (one core used)

Transition labels and weights each use 4 bytes

Experiments: Conclusions

Mohri's original algorithm is very fast, since in the acyclic case it only requires a single traversal of the state sequence. But 83.5 % of the added transitions were useless

Its memory usage depends crucially on the out-degree of the input WFSM, which in turn depends on the size of the alphabet

That is, for bigger corpora with alphabet sizes of several hundred thousand symbols, the non-optimized approach may become infeasible

The revised algorithm is slower in both of its variants, since they compute ε-distances

But their memory requirements are much lower

Appendix

adjust-final-state(A, p, q)
if q ∈ F then
    if p ∈ F then
        ρ(p) ← ρ(p) ⊕ (w ⊗ ρ(q))
    else
        F ← F ∪ {p}
        ρ(p) ← w ⊗ ρ(q)
    end if
end if