Kuromoji FST 2015/06/25 Yoshinari Fujinuma

Upload: yoshinari-fujinuma

Post on 14-Aug-2015


Page 1: Kuromoji FST

Kuromoji FST

2015/06/25

Yoshinari Fujinuma

Page 2: Kuromoji FST

Overview

• Motivation

• Building FST

• How freezing works

• How equivalent state detection works

• Compiled FST and Virtual Machine

Page 3: Kuromoji FST

Motivation

• Efficient key-value store for dictionary lookup during tokenization

• String -> integers

• int -> token info

Page 4: Kuromoji FST

Why FST and not Trie?

• Finite State Transducer (FST) = Finite State Automaton + Output

• Able to share common suffixes as well as common prefixes (a trie shares only prefixes)

• e.g. “can”, “cats”, “dogs”
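The effect of suffix sharing on the example words can be estimated with a small sketch. It builds a plain trie, then counts how many states remain once nodes that accept exactly the same set of suffixes are merged. Outputs are ignored here, so this measures an automaton rather than a transducer, and all function names are illustrative rather than taken from the Kuromoji code.

```python
def build_trie(words):
    """Return a nested-dict trie; the key None marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = True
    return root


def count_trie_states(node):
    """Count every trie node: one state per distinct prefix."""
    return 1 + sum(count_trie_states(child)
                   for label, child in node.items() if label is not None)


def count_minimal_states(root):
    """Count states after merging nodes that accept the same suffixes."""
    seen = set()

    def signature(node):
        # A signature is the end-of-word flag plus, for each outgoing
        # character, the signature of the node that character leads to.
        arcs = tuple(sorted((label, signature(child))
                            for label, child in node.items()
                            if label is not None))
        sig = (None in node, arcs)
        seen.add(sig)
        return sig

    signature(root)
    return len(seen)


trie = build_trie(["can", "cats", "dogs"])
print(count_trie_states(trie))     # 10
print(count_minimal_states(trie))  # 7
```

The three end states collapse into one, and the states reached after "cat" and "dog" collapse because both only continue with "s".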

Page 5: Kuromoji FST

Overview of how the build works

List of sorted words, list of integers -> FST Builder -> Object-based FST -> FST Compiler -> Compiled FST

Page 6: Kuromoji FST

How Building / Compiling works

• Two variables are key:

• previous word (prev)

• current word (current)

1. Skip the common prefix of prev and current

2. Make arcs to temp states for the remaining suffix of current

3. Freeze (finalize) the states on the suffix of prev that differs from current
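As a rough illustration of the three steps, the sketch below splits a (prev, current) pair into the shared prefix, the suffix of prev whose states can be frozen, and the suffix of current that needs new arcs. The helper names `common_prefix_len` and `plan_step` are hypothetical and only describe the bookkeeping, not the actual builder.

```python
def common_prefix_len(prev, current):
    """Number of leading characters shared by prev and current."""
    n = 0
    for a, b in zip(prev, current):
        if a != b:
            break
        n += 1
    return n


def plan_step(prev, current):
    """Return (shared prefix, suffix of prev to freeze, suffix of current to add)."""
    n = common_prefix_len(prev, current)
    return current[:n], prev[n:], current[n:]


print(plan_step("", "cat"))       # ('', '', 'cat')
print(plan_step("cat", "cats"))   # ('cat', '', 's')
print(plan_step("cats", "catx"))  # ('cat', 's', 'x')
```

Because the input is sorted, once current diverges from prev nothing can ever be appended below the diverging part of prev, which is why those states are safe to freeze.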

Page 7: Kuromoji FST

Toy example

• cat -> 0

• cats -> 1

• catx -> 2

Page 8: Kuromoji FST

Initializing states

• Initialization

Frozen states

Temp states

Page 9: Kuromoji FST

Freezing states

• prev word = “”, current word = cat

Frozen states

Temp states: c/0, a/0, t/0

Page 10: Kuromoji FST

Add Arc to suffix

• prev word = cat, current word = cats

Frozen states

Temp states: c/0, a/0, t/0, s/1

Page 11: Kuromoji FST

Freeze differing suffix

• prev word = cats, current word = catx

Frozen states

Temp states: c/0, a/0, t/0, s/1, x/2

Page 12: Kuromoji FST

Freeze differing suffix

• prev word = cats, current word = catx

Frozen states

Temp states: c/0, a/0, t/0, s/1, x/2

HashCode 1

Page 13: Kuromoji FST

Freeze differing suffix

• prev word = cats, current word = catx

Frozen states

Temp states: c/0, a/0, t/0, s/1, x/2

HashCode 1

Page 14: Kuromoji FST

Merge Equivalent states

• prev word = catx, current word = “”

Frozen states

Temp states: c/0, a/0, t/0, s/1, x/2

Page 15: Kuromoji FST

Freezing states

• prev word = catx, current word = “”

Frozen states: c/0, a/0, t/0, s/1, x/2

Temp states
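The walkthrough above can be reproduced with a minimal sketch of an object-based builder. It is a simplification under stated assumptions, not the Kuromoji implementation: each word's remaining output is placed on the first arc after the common prefix (which is correct for additive integer outputs but may merge fewer states than a builder that pushes outputs around), equivalent states are found through a plain Python dict, and the names `State`, `build`, `lookup` and `count_states` are all hypothetical.

```python
class State:
    def __init__(self):
        self.arcs = {}      # transition char -> (output, target State)
        self.final = False  # True if a word ends in this state

    def key(self):
        # States with equal keys are treated as equivalent. Targets are
        # already frozen (canonical) objects, so comparing identity is enough.
        arcs = tuple(sorted((ch, out, id(dst))
                            for ch, (out, dst) in self.arcs.items()))
        return (self.final, arcs)


def build(entries):
    """entries: (word, output) pairs, words in strictly increasing order."""
    frozen = {}       # key -> canonical frozen State
    path = [State()]  # temp states along the previous word; path[0] is the start
    prev = ""

    def freeze_below(depth):
        # Freeze temp states deeper than `depth`, deepest first, and point
        # the parent's arc at the canonical (possibly shared) frozen state.
        while len(path) - 1 > depth:
            state = path.pop()
            canonical = frozen.setdefault(state.key(), state)
            label = prev[len(path) - 1]
            out, _ = path[-1].arcs[label]
            path[-1].arcs[label] = (out, canonical)

    for word, output in entries:
        assert word > prev, "input must be sorted and free of duplicates"
        n = 0  # 1. skip the common prefix
        while n < min(len(prev), len(word)) and prev[n] == word[n]:
            n += 1
        freeze_below(n)  # 3. freeze the differing suffix of prev
        carried = sum(path[i].arcs[word[i]][0] for i in range(n))
        for i in range(n, len(word)):  # 2. add arcs to new temp states
            nxt = State()
            path[-1].arcs[word[i]] = (output - carried if i == n else 0, nxt)
            path.append(nxt)
        path[-1].final = True
        prev = word

    freeze_below(0)  # final step: current word = ""
    return path[0]


def lookup(start, word):
    state, total = start, 0
    for ch in word:
        if ch not in state.arcs:
            return -1
        out, state = state.arcs[ch]
        total += out
    return total if state.final else -1


def count_states(start):
    seen, stack = set(), [start]
    while stack:
        state = stack.pop()
        if id(state) not in seen:
            seen.add(id(state))
            stack.extend(dst for _, dst in state.arcs.values())
    return len(seen)


fst = build([("cat", 0), ("cats", 1), ("catx", 2)])
print([lookup(fst, w) for w in ("cat", "cats", "catx", "ca")])  # [0, 1, 2, -1]
print(count_states(fst))  # 5
```

The five states are the start state, the states after "c", "ca" and "cat", and one shared end state reached by both s/1 and x/2.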

Page 16: Kuromoji FST

Equivalent state detection

• We want to merge equivalent states!

• Key-value store using HashMap

• Key: State.hashCode()

• Value: State Object

• Collisions are resolved by chaining
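A minimal sketch of such a registry, assuming the hash function and the equivalence test are supplied by the caller. The class name `StateRegistry` and its method are hypothetical; the point is only that states with the same hash code share a bucket and are told apart by an explicit equivalence check.

```python
class StateRegistry:
    """Hash table of frozen states; colliding entries share a bucket (chaining)."""

    def __init__(self, hash_fn, equivalent):
        self.hash_fn = hash_fn
        self.equivalent = equivalent
        self.buckets = {}  # hash code -> list of frozen states

    def find_or_add(self, state):
        chain = self.buckets.setdefault(self.hash_fn(state), [])
        for candidate in chain:
            if self.equivalent(candidate, state):
                return candidate  # reuse the existing equivalent state
        chain.append(state)       # nothing equivalent yet: keep this one
        return state


# States are modelled here as (is_final, arcs) tuples purely for illustration.
def weak_hash(state):
    return len(state[1])  # deliberately collision-prone


def same(a, b):
    return a == b


registry = StateRegistry(weak_hash, same)
end_a, end_b, other = (True, ()), (True, ()), (False, ())
print(registry.find_or_add(end_a) is end_a)  # True: first of its kind
print(registry.find_or_add(end_b) is end_a)  # True: merged with end_a
print(registry.find_or_add(other) is other)  # True: same bucket, not equivalent
```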

Page 17: Kuromoji FST

Arc Equivalence

c/0

• Same transition character

• Same destination state

• Same output

c/0

Page 18: Kuromoji FST

State Equivalence

• The sets of outgoing arcs are equivalent

• Both states are of the same type

c/0

c/0
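The arc and state conditions can be written down as two small predicates. This is a sketch under assumptions: "same type of state" is read as "both final or both non-final", destination states are compared by identity because they are assumed to be frozen and merged already, and each state has at most one arc per character. The `Arc` and `State` classes are illustrative only.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Arc:
    char: str
    output: int
    target: "State"


@dataclass(eq=False)
class State:
    final: bool = False
    arcs: list = field(default_factory=list)


def arcs_equivalent(a, b):
    # Same transition character, same output, same destination state.
    return a.char == b.char and a.output == b.output and a.target is b.target


def states_equivalent(s, t):
    # Same type (assumed: final vs. non-final) and equivalent outgoing arcs.
    if s.final != t.final or len(s.arcs) != len(t.arcs):
        return False
    by_char = {arc.char: arc for arc in t.arcs}
    return all(arc.char in by_char and arcs_equivalent(arc, by_char[arc.char])
               for arc in s.arcs)


end = State(final=True)
p = State(arcs=[Arc("c", 0, end)])
q = State(arcs=[Arc("c", 0, end)])
r = State(arcs=[Arc("c", 1, end)])
print(states_equivalent(p, q))  # True
print(states_equivalent(p, r))  # False, the outputs differ
```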

Page 19: Kuromoji FST

How Compiled FST works

• Generates a “Program”

• Running a Program = look up a word in a dictionary

• Program runs on a Virtual Machine which we implemented

Word (e.g. “cat”) -> Virtual Machine running the Compiled FST (= “Program”) -> the integer if the word exists in the dictionary, or -1 if not

Page 20: Kuromoji FST

Program

• List of Instructions, 11 bytes each

• Operation code (Op code)

• Match or Accept, Match, Fail

Instruction layout:

Op code: 1 byte

transition char: 2 bytes

output: 4 bytes

target address: 4 bytes
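The 11-byte layout can be illustrated with Python's `struct` module. The field widths come from the slide; the byte order, the signedness of the output field, and the numeric op code values are assumptions made for this sketch and may differ from the real compiled format.

```python
import struct

# Assumed layout: big-endian, no padding.
# B = op code (1 byte), H = transition char as a 16-bit code unit (2 bytes),
# i = output (4 bytes, signed), I = target address (4 bytes, unsigned).
INSTRUCTION = struct.Struct(">BHiI")

# Hypothetical op code numbers.
FAIL, MATCH, MATCH_OR_ACCEPT = 0, 1, 2


def encode(opcode, char, output, target):
    return INSTRUCTION.pack(opcode, ord(char), output, target)


def decode(data, address):
    """Decode the instruction stored at the given instruction index."""
    opcode, code_unit, output, target = INSTRUCTION.unpack_from(
        data, address * INSTRUCTION.size)
    return opcode, chr(code_unit), output, target


program = encode(FAIL, "\0", 0, 0) + encode(MATCH_OR_ACCEPT, "x", 2, 0)
print(INSTRUCTION.size)    # 11
print(decode(program, 1))  # (2, 'x', 2, 0)
```

With fixed-size instructions, a target address can be an instruction index and the byte offset is simply the index times 11.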

Page 21: Kuromoji FST

Match

• Transition to a given address

• Accumulator += output

Address  Op code  Char  Output  Target
0        Fail     None  None    None
1        M / A    x     2       0
2        M / A    s     1       0
3        Fail
4        Match    t     0       2

….

Page 22: Kuromoji FST

Fail

• Stop running the Program and return -1

• e.g. “tss”

Address  Op code  Char  Output  Target
0        Fail     None  None    None
1        M / A    x     2       0
2        M / A    s     1       0
3        Fail
4        Match    t     0       2

….

Page 23: Kuromoji FST

Match or Accept

• If the current character is the final character of the input, stop running the program and return the accumulator

• Otherwise, behave like Match

Page 24: Kuromoji FST

Instructions vs. Arcs

• What instructions represent

Address  Op code  Char  Output  Target
0        Fail     None  None    None
1        M / A    x     2       0
2        M / A    s     1       0
3        Fail
4        Match    t     0       2

….

Arcs: t/0, s/1, x/2

Page 25: Kuromoji FST

Virtual Machine running backwards

Address  Op code  Char  Output  Target
0        Fail     None  None    None
1        M / A    x     2       0
2        M / A    s     1       0
3        Fail
4        Match    t     0       2
5        Fail
6        Match    a     0       4
7        Fail
8        Match    c     0       6

• Because states are frozen starting from the suffixes
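A small interpreter for the listing above might look like the following. The control flow is an assumption pieced together from the slides: execution starts at the highest address, a non-matching instruction falls through to the next lower address (the linear search over a state's arcs), and a Fail instruction terminates that search. The function name `run` and the tuple encoding are hypothetical. The program is transcribed as it appears on the slide, so under these assumed semantics "cat" is rejected because its last instruction is a plain Match; accepting it would require a Match or Accept instruction for "t".

```python
FAIL, MATCH, MATCH_OR_ACCEPT = "Fail", "Match", "M / A"

# (op code, transition char, output, target address), indexed by address.
PROGRAM = [
    (FAIL, None, None, None),      # 0
    (MATCH_OR_ACCEPT, "x", 2, 0),  # 1
    (MATCH_OR_ACCEPT, "s", 1, 0),  # 2
    (FAIL, None, None, None),      # 3
    (MATCH, "t", 0, 2),            # 4
    (FAIL, None, None, None),      # 5
    (MATCH, "a", 0, 4),            # 6
    (FAIL, None, None, None),      # 7
    (MATCH, "c", 0, 6),            # 8
]


def run(program, word):
    pc = len(program) - 1  # start at the highest address
    accumulator = 0
    i = 0
    while i < len(word):
        opcode, char, output, target = program[pc]
        if opcode == FAIL:
            return -1
        if char != word[i]:
            pc -= 1  # linear search: try the next arc of this state
            continue
        accumulator += output
        if i == len(word) - 1:
            # Last input character: only Match or Accept may accept.
            return accumulator if opcode == MATCH_OR_ACCEPT else -1
        pc = target
        i += 1
    return -1  # empty input


print(run(PROGRAM, "cats"))  # 1
print(run(PROGRAM, "catx"))  # 2
print(run(PROGRAM, "tss"))   # -1
```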

Page 26: Kuromoji FST

Use of Cache

• The lookup for the next state is done by linear search

• The number of outgoing arcs from the start state is large

• Therefore, we cache those outgoing arcs
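One way to picture the cache is a table built once from the start state's arcs, so the first character of every lookup costs a single probe instead of a scan. The sketch below uses a Python dict and made-up arcs; how the actual implementation stores its cache is not stated in the slides, and the names `linear_search` and `CachedStart` are hypothetical.

```python
def linear_search(arcs, ch):
    """Scan a state's outgoing arcs one by one, as in the uncached lookup."""
    for label, output, target in arcs:
        if label == ch:
            return output, target
    return None


class CachedStart:
    """Index the start state's arcs once; each later lookup is one probe."""

    def __init__(self, start_arcs):
        self.table = {label: (output, target)
                      for label, output, target in start_arcs}

    def find(self, ch):
        return self.table.get(ch)


# Hypothetical start state with one arc per code point in U+3041..U+3096.
start_arcs = [(chr(code), 0, code) for code in range(0x3041, 0x3097)]
cache = CachedStart(start_arcs)

ch = chr(0x3042)
print(cache.find(ch) == linear_search(start_arcs, ch))  # True
print(cache.find("x"))                                  # None
```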

Page 27: Kuromoji FST

Summary

• FST is theoretically more compact than tries

• Implemented FST Builder which builds

• Object-based FST

• Compiled FST, compact form

• Uses a Virtual Machine to run the compiled program (= look up a word)

Page 28: Kuromoji FST

References

• Direct Construction of Minimal Acyclic Subsequential Transducers, http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.24.3698

• Smaller representation of finite-state automata, http://www.sciencedirect.com/science/article/pii/S0304397512003787

• Blog post by Ikawa-san, http://qiita.com/ikawaha/items/be95304a803020e1b2d1

• This code is available at https://github.com/atilika/fst