kuromoji fst

Post on 14-Aug-2015

Kuromoji FST

2015/06/25

Yoshinari Fujinuma

Overview

• Motivation

• Building FST

• How freezing works

• How equivalent detection works

• Compiled FST and Virtual Machine

Motivation

• Efficient key-value store for dictionary lookup during tokenization

• String -> integers

• int -> token info

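Why FST and not Trie?

• Finite State Transducer (FST) = Finite State Automaton + Output

• A trie shares only common prefixes; an FST can merge common suffixes as well

• e.g. “can”, “cats”, “dogs”: “can” and “cats” share the prefix “ca”, and “cats” and “dogs” share the suffix “s”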
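
The prefix/suffix point can be checked on the three example words. The following is a minimal Python sketch, not code from the Kuromoji FST project: it builds a plain trie, then counts how many states remain once nodes with identical "futures" (same finality, same outgoing arcs to equivalent nodes) are merged. Outputs are ignored here, so this is an automaton rather than a transducer, and the helper names are hypothetical.

```python
def build_trie(words):
    """Build a trie as nested dicts: {"final": bool, "next": {char: node}}."""
    root = {"final": False, "next": {}}
    for word in words:
        node = root
        for ch in word:
            node = node["next"].setdefault(ch, {"final": False, "next": {}})
        node["final"] = True
    return root


def count_trie_nodes(node):
    """Number of nodes in the trie (prefix sharing only)."""
    return 1 + sum(count_trie_nodes(child) for child in node["next"].values())


def collect_signatures(node, seen):
    """Collect one signature per distinct state after suffix merging.

    Two nodes get the same signature exactly when they have the same
    finality and the same labelled arcs leading to equivalent nodes.
    """
    arcs = tuple(sorted((ch, collect_signatures(child, seen))
                        for ch, child in node["next"].items()))
    signature = (node["final"], arcs)
    seen.add(signature)
    return signature


words = ["can", "cats", "dogs"]
trie = build_trie(words)
merged = set()
collect_signatures(trie, merged)
print(count_trie_nodes(trie), len(merged))
```

For these words the trie has 10 nodes, while merging leaves 7 states: the three word-final nodes collapse into one, and the nodes reached after “cat” and “dog” collapse because both only continue with “s”.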

Overview of how the build works

List of sorted words, list of integers -> FST Builder -> Object-based FST -> FST Compiler -> Compiled FST

How Building / Compiling works

• Two variables are key:

• previous word (prev)

• current word (current)

1. Skip the common prefix of prev and current

2. Make arcs to temp states for the rest of current

3. Freeze (finalize) the states on the suffix of prev that differs from current
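
The three steps can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the Kuromoji builder: it ignores outputs (so it builds a minimal automaton, not a transducer), it freezes the differing suffix of prev just before adding the new arcs for current, and all names (`State`, `build`, `accepts`) are hypothetical. Input words must be sorted.

```python
class State:
    def __init__(self):
        self.arcs = {}      # transition character -> State
        self.final = False

    def signature(self):
        # Only called once all children are frozen, so child identity is canonical.
        return (self.final,
                tuple(sorted((ch, id(s)) for ch, s in self.arcs.items())))


def common_prefix_len(a, b):
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def build(sorted_words):
    register = {}        # signature -> frozen State
    temp = [State()]     # temp[i] = temp state reached after i chars of prev
    prev = ""

    def freeze(state):
        # Reuse an equivalent frozen state if one exists.
        return register.setdefault(state.signature(), state)

    def freeze_suffix(keep):
        # Freeze temp states deeper than `keep`, deepest first.
        for i in range(len(prev), keep, -1):
            temp[i - 1].arcs[prev[i - 1]] = freeze(temp[i])
        del temp[keep + 1:]

    for current in sorted_words:
        p = common_prefix_len(prev, current)   # step 1: skip common prefix
        freeze_suffix(p)                       # step 3 for prev's differing suffix
        for ch in current[p:]:                 # step 2: arcs to new temp states
            state = State()
            temp[-1].arcs[ch] = state
            temp.append(state)
        temp[-1].final = True
        prev = current

    freeze_suffix(0)                           # current word = "": freeze the rest
    return freeze(temp[0]), len(register)


def accepts(root, word):
    state = root
    for ch in word:
        state = state.arcs.get(ch)
        if state is None:
            return False
    return state.final


root, num_states = build(["cat", "cats", "catx"])
print(num_states, accepts(root, "cats"), accepts(root, "ca"))
```

On the toy words the arcs for “s” and “x” end up pointing at the same frozen state, which is the merge the later slides illustrate.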

Toy example

• cat -> 0

• cats -> 1

• catx -> 2

Initializing states

• Initialization

Frozen states

Temp states

Freezing states

• prev word = “”, current word = cat

Frozen states

Temp states: c/0, a/0, t/0

Add Arc to suffix

• prev word = cat, current word = cats

Frozen states

Temp states: c/0, a/0, t/0, s/1

Freeze differing suffix

• prev word = cats, current word = catx

Frozen states

Temp states: c/0, a/0, t/0, s/1, x/2

HashCode 1

Merge Equivalent states

• prev word = catx, current word = “”

Frozen states

Temp states: c/0, a/0, t/0, s/1, x/2

Freezing states

• prev word = catx, current word = “”

Frozen states: c/0, a/0, t/0, s/1, x/2

Temp states

Equivalent state detection

• We want to merge equivalent states!

• Key-value store using HashMap

• Key: State.hashCode()

• Value: State Object

• Collisions are resolved by chaining

Arc Equivalence

Two arcs (e.g. two arcs labeled c/0) are equivalent when they have:

• Same transition character

• Same destination state

• Same output

State Equivalence

Two states are equivalent when:

• Their sets of outgoing arcs are equivalent

• Both states are of the same type
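
The equivalence rules and the hash-keyed store can be sketched together. This Python sketch is an illustration under assumptions, not the project's Java code: "type of state" is modelled as a `final` flag, arcs are assumed to be stored in a consistent order, destination states are compared by identity (they are already frozen), and `Register`, `state_hash` and the other names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class State:
    final: bool = False
    arcs: list = field(default_factory=list)


@dataclass
class Arc:
    label: str      # transition character
    output: int
    target: State   # frozen destination state


def arcs_equivalent(a, b):
    # Same transition character, same output, same destination state.
    return a.label == b.label and a.output == b.output and a.target is b.target


def states_equivalent(s, t):
    # Same type of state and pairwise-equivalent outgoing arcs.
    return (s.final == t.final
            and len(s.arcs) == len(t.arcs)
            and all(arcs_equivalent(a, b) for a, b in zip(s.arcs, t.arcs)))


def state_hash(s):
    # Equivalent states must produce the same hash code.
    return hash((s.final,
                 tuple((a.label, a.output, id(a.target)) for a in s.arcs)))


class Register:
    """Hash code -> chain of frozen states; collisions resolved by chaining."""

    def __init__(self):
        self.buckets = {}

    def freeze(self, state):
        chain = self.buckets.setdefault(state_hash(state), [])
        for frozen in chain:
            if states_equivalent(frozen, state):
                return frozen          # reuse the equivalent frozen state
        chain.append(state)
        return state


register = Register()
end_a = register.freeze(State(final=True))
end_b = register.freeze(State(final=True))
print(end_a is end_b)
```

The hash code only narrows the search to one chain; the explicit equivalence check is what decides whether two states are merged.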

How Compiled FST works

• Generates a “Program”

• Running a Program = look up a word in a dictionary

• Program runs on a Virtual Machine which we implemented

Word (e.g. “cat”) + Compiled FST (= “Program”) -> Virtual Machine -> the word's integer if it exists in the dictionary, or -1 if not

Program

• List of Instructions, 11 bytes each

• Operation code (Op code)

• Match or Accept, Match, Fail

Field            Size
Op code          1 byte
Transition char  2 bytes
Output           4 bytes
Target address   4 bytes
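
The 11-byte layout (1 + 2 + 4 + 4) can be sketched with Python's `struct` module. The field order follows the slide; the byte order (big-endian), the signedness of the fields, and the numeric op code values are assumptions made for this illustration, not details taken from the actual implementation.

```python
import struct

# ">" = big-endian with no padding; B = 1 byte, H = 2 bytes, i = 4 bytes.
INSTRUCTION = struct.Struct(">BHii")

# Hypothetical op code numbering, chosen only for this sketch.
OP_FAIL, OP_MATCH, OP_MATCH_OR_ACCEPT = 0, 1, 2


def encode(op, ch, output, target):
    """Pack one instruction; `ch` must fit in a 2-byte code unit."""
    return INSTRUCTION.pack(op, ord(ch), output, target)


def decode(raw):
    op, code, output, target = INSTRUCTION.unpack(raw)
    return op, chr(code), output, target


raw = encode(OP_MATCH, "t", 0, 2)
print(len(raw), decode(raw))
```

Fixed-size instructions mean that address n can be found at byte offset n × 11 without any index structure.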

Match

• Transition to a given address

• Accumulator += output

Addr  Op code  Char  Output  Target   (M / A = Match or Accept)
0     Fail     None  None    None
1     M / A    x     2       0
2     M / A    s     1       0
3     Fail
4     Match    t     0       2
….

Fail

• Stop running the Program and return -1

• e.g. “tss”

Match or Accept

• If the current character is the final character of the input, stop running the Program and return the accumulator

• Otherwise, behave like Match

Instructions vs. Arcs

• What instructions represent

Addr  Op code  Char  Output  Target  Arc
0     Fail     None  None    None
1     M / A    x     2       0       x/2
2     M / A    s     1       0       s/1
3     Fail
4     Match    t     0       2       t/0
….

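Virtual Machine running backwards

Addr  Op code  Char  Output  Target   (M / A = Match or Accept)
0     Fail     None  None    None
1     M / A    x     2       0
2     M / A    s     1       0
3     Fail
4     Match    t     0       2
5     Fail
6     Match    a     0       4
7     Fail
8     Match    c     0       6

• Because states are frozen from the suffixes first, the instructions for the start of a word end up at the highest addresses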
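
A small interpreter makes the three op codes concrete. The Python sketch below runs the nine-instruction table from the slide; the control flow is an assumption reconstructed from the slides (start at the highest address, step to the next lower address when the character does not match, add the output before accepting), not the actual Kuromoji virtual machine, and the names are hypothetical.

```python
from typing import NamedTuple, Optional

FAIL, MATCH, MATCH_OR_ACCEPT = "Fail", "Match", "M/A"


class Instr(NamedTuple):
    op: str
    char: Optional[str] = None
    output: int = 0
    target: int = 0


# The instruction table from the slide, indexed by address.
PROGRAM = [
    Instr(FAIL),                        # 0
    Instr(MATCH_OR_ACCEPT, "x", 2, 0),  # 1
    Instr(MATCH_OR_ACCEPT, "s", 1, 0),  # 2
    Instr(FAIL),                        # 3
    Instr(MATCH, "t", 0, 2),            # 4
    Instr(FAIL),                        # 5
    Instr(MATCH, "a", 0, 4),            # 6
    Instr(FAIL),                        # 7
    Instr(MATCH, "c", 0, 6),            # 8
]


def run(program, word):
    pc = len(program) - 1        # assumed entry point: the highest address
    accumulator = 0
    for i, ch in enumerate(word):
        while True:
            instr = program[pc]
            if instr.op == FAIL:
                return -1
            if instr.char == ch:
                accumulator += instr.output
                if instr.op == MATCH_OR_ACCEPT and i == len(word) - 1:
                    return accumulator
                pc = instr.target  # Match: transition to the given address
                break
            pc -= 1                # try the next arc of the same state
    return -1                      # input ended without an accepting instruction


print(run(PROGRAM, "cats"), run(PROGRAM, "catx"), run(PROGRAM, "tss"))
```

With the table exactly as shown, “cats” yields 1, “catx” yields 2, and “tss” fails at address 7. The table has a plain Match for “t”, so under these assumed semantics the sketch does not accept “cat” on its own.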

Use of Cache

• The lookup for the next state is done by linear search

• The number of outgoing arcs from the start state is large

• Therefore, we cache those outgoing arcs
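
One way to picture the cache is a table built once from the start state's instructions, so the first character of every lookup is resolved without a scan. The sketch below is a hypothetical illustration: the tuple layout `(op, char, output, target)`, the toy program, and the function name are assumptions, and the real implementation may organize its cache differently.

```python
def build_start_cache(program, start_address):
    """Scan the start state's arcs once and index them by character.

    Instructions are (op, char, output, target) tuples; the arcs of one
    state are assumed to sit at consecutive addresses ending in a Fail.
    """
    cache = {}
    pc = start_address
    while program[pc][0] != "Fail":
        op, ch, output, target = program[pc]
        cache.setdefault(ch, (op, output, target))
        pc -= 1
    return cache


# Toy start state with four outgoing arcs (addresses 4 down to 1).
toy_program = [
    ("Fail", None, 0, 0),   # 0
    ("Match", "d", 5, 0),   # 1
    ("Match", "c", 0, 0),   # 2
    ("Match", "b", 3, 0),   # 3
    ("Match", "a", 1, 0),   # 4
]

start_cache = build_start_cache(toy_program, start_address=4)
print(start_cache["c"])
```

Without the cache, finding the arc for “d” here takes four comparisons on every lookup; with it, the first transition is a single dictionary access.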

Summary

• In theory, an FST is more compact than a trie

• Implemented FST Builder which builds

• Object-based FST

• Compiled FST, compact form

• Uses a Virtual Machine to run the compiled program (= look up a word)

References

• Direct Construction of Minimal Acyclic Subsequential Transducers, http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.24.3698

• Smaller representation of finite-state automata, http://www.sciencedirect.com/science/article/pii/S0304397512003787

• Blog post by Ikawa-san, http://qiita.com/ikawaha/items/be95304a803020e1b2d1

• This code is available at https://github.com/atilika/fst
