finite-state methods in natural-language processing:...

42
1 Motivation — Finite-State Methods in Natural-Language Processing: 1—Motivation Ronald M. Kaplan and Martin Kay

Upload: phungtu

Post on 11-Jul-2018

224 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Finite-State Methods in Natural-Language Processing: …web.stanford.edu/class/linguist139p/MotivationSlides.pdf · one finite-state automaton ... Motivation — 21 A word recognizer

1Motivation —

Finite-State Methods inNatural-Language Processing:

1—Motivation

Ronald M. Kaplan

and

Martin Kay

Page 2: Finite-State Methods in Natural-Language Processing: …web.stanford.edu/class/linguist139p/MotivationSlides.pdf · one finite-state automaton ... Motivation — 21 A word recognizer

2Motivation —

Finite-State Methods in LanguageProcessing

The Application of a branch of mathematics

—The regular branch of automata theory

to a branch of computational linguistics in which whatis crucial is (or can be reduced to)

—Properties of string sets and string relations with

—A notion of bounded dependency

Page 3: Finite-State Methods in Natural-Language Processing: …web.stanford.edu/class/linguist139p/MotivationSlides.pdf · one finite-state automaton ... Motivation — 21 A word recognizer

3Motivation —

Applications

• Finite Languges—Dictionaries

—Compression

• Phenomena involvingbounded dependency—Morpholgy

• Spelling

• Hyphenation

• Tokenization

• Morphological Analysis

—Phonology

• Approximations tophenomena involvingmostly bounded dependency—Syntax

• Phenomena that can betranslated into the realm ofstrings with boundeddependency—Syntax

Page 4: Finite-State Methods in Natural-Language Processing: …web.stanford.edu/class/linguist139p/MotivationSlides.pdf · one finite-state automaton ... Motivation — 21 A word recognizer

4Motivation —

Correspondences

Computational DeviceFinite-State Automaton

Descriptive Notation:Regular Expression

Set of Objects:Regular Language

Page 5: Finite-State Methods in Natural-Language Processing: …web.stanford.edu/class/linguist139p/MotivationSlides.pdf · one finite-state automaton ... Motivation — 21 A word recognizer

5Motivation —

The Basic Idea

• At any given moment,an automaton is in one of afinite number of states

• A transition from one state to another is possiblewhen the automaton contains a correspondingtransition.

• The process can stop only when the automaton is inone of a subset of the states, called final.

• Transitions are labeled with symbols so that asequence of transitions corresponds to a sequenceof symbols.

Page 6: Finite-State Methods in Natural-Language Processing: …web.stanford.edu/class/linguist139p/MotivationSlides.pdf · one finite-state automaton ... Motivation — 21 A word recognizer

6Motivation —

Bounded Dependency

The choice between γ1 and γ2 depends on a boundednumber of preceding symbols.

γ1

γ2

?

Page 7: Finite-State Methods in Natural-Language Processing: …web.stanford.edu/class/linguist139p/MotivationSlides.pdf · one finite-state automaton ... Motivation — 21 A word recognizer

7Motivation —

Bounded Dependency

The choice between γ1 and γ2 depends on a boundednumber of preceding symbols.

γ1

γ2

?si si

irrelevant

Page 8: Finite-State Methods in Natural-Language Processing: …web.stanford.edu/class/linguist139p/MotivationSlides.pdf · one finite-state automaton ... Motivation — 21 A word recognizer

8Motivation —

Closure Properties and Operations

• By definition—Union

—Concatenation

—Iteration

• By deduction—Intersection

—Complementation

—Substitution

—Reversal

— ...

Page 9: Finite-State Methods in Natural-Language Processing: …web.stanford.edu/class/linguist139p/MotivationSlides.pdf · one finite-state automaton ... Motivation — 21 A word recognizer

9Motivation —

Operations on Languages andAutomata

For the set-theoretic operations on languages thereare corresponding operations on automata.

M(L) is a machine that characterizes the language L.

We will use the same symbols for corresponding operationsWe will use the same symbols for corresponding operations

M(L1⊗ L2) = M(L1)⊕M(L2)

Page 10: Finite-State Methods in Natural-Language Processing: …web.stanford.edu/class/linguist139p/MotivationSlides.pdf · one finite-state automaton ... Motivation — 21 A word recognizer

10Motivation —

Automata-based Calculus

• Closure gives:—Complementation → Universal quantification

—Intersection → Combinations of constraints

• Machines give:—Finite representations for (potentially) infinite sets

—Practical implementation

• Combination gives:—Coherence

—Robustness

—Reasonable machine transformations

Page 11: Finite-State Methods in Natural-Language Processing: …web.stanford.edu/class/linguist139p/MotivationSlides.pdf · one finite-state automaton ... Motivation — 21 A word recognizer

11Motivation —

Quantification

There is an x followed by a y in the string

There is no xy sequence in the string

There is a y preceded by something that is not an x

Σ*xyΣ*

Σ*xyΣ*

Every y is preceded by an x.

Σ*xyΣ*

Σ*xyΣ*

∃y.∃x.precedes(x, y)

Page 12: Finite-State Methods in Natural-Language Processing: …web.stanford.edu/class/linguist139p/MotivationSlides.pdf · one finite-state automaton ... Motivation — 21 A word recognizer

12Motivation —

Universal Quantification—i before e except after c

Σ*ceiΣ*After e: no i

After c: anything

Not aft

er c o

r e:

anyth

ing bu

t ei

Page 13: Finite-State Methods in Natural-Language Processing: …web.stanford.edu/class/linguist139p/MotivationSlides.pdf · one finite-state automaton ... Motivation — 21 A word recognizer

13Motivation —

Only e i after c

Σ*ceiΣ* ∩Σ*cieΣ*

i,other

4

e

othe

r

e

c

c

ce, otherc

i,other

i

Page 14: Finite-State Methods in Natural-Language Processing: …web.stanford.edu/class/linguist139p/MotivationSlides.pdf · one finite-state automaton ... Motivation — 21 A word recognizer

14Motivation —

Only e i after c

After e: no i

After c: not ie

Not after c or e:

anything but ei

After ci: no e

Page 15: Finite-State Methods in Natural-Language Processing: …web.stanford.edu/class/linguist139p/MotivationSlides.pdf · one finite-state automaton ... Motivation — 21 A word recognizer

15Motivation —

Alternative Notations

Closure ⇒ Recursive Formalisms ⇒

Higher-level Constructs

Choose notation for theoretical significance andpractical convenience.

L1← L2 ≡ L1L2

Σ*ceiΣ* ≡ Σ*c← eiΣ*

Page 16: Finite-State Methods in Natural-Language Processing: …web.stanford.edu/class/linguist139p/MotivationSlides.pdf · one finite-state automaton ... Motivation — 21 A word recognizer

16Motivation —

What is a Finite-State Automaton?

• An alphabet of symbols,

• A finite set of states,

• A transition function from states and symbols tostates,

• A distinguished member of the set of states calledthe start state, and

• A distinguished subset of the set of states calledfinal states.

Pace terminology, same definition as fordirected graphs with labeled edges, plus

initial and final states.

Pace terminology, same definition as fordirected graphs with labeled edges, plus

initial and final states.

Page 17: Finite-State Methods in Natural-Language Processing: …web.stanford.edu/class/linguist139p/MotivationSlides.pdf · one finite-state automaton ... Motivation — 21 A word recognizer

17Motivation —

i to xUnless otherwise

marked, the start state isusually the leftmost in

the diagram

We draw final stateswith a double circle

Page 18: Finite-State Methods in Natural-Language Processing: …web.stanford.edu/class/linguist139p/MotivationSlides.pdf · one finite-state automaton ... Motivation — 21 A word recognizer

18Motivation —

Regular Languages

• Languages — sets of strings

• Regular languages — a subset of languages

• Closed under concatenation, union, and iteration

• Every regular language is chracterized by (at least)one finite-state automaton

• Languages may contain infinitely many strings butautomata are finite

Page 19: Finite-State Methods in Natural-Language Processing: …web.stanford.edu/class/linguist139p/MotivationSlides.pdf · one finite-state automaton ... Motivation — 21 A word recognizer

19Motivation —

Regular Expressions

• Formulae with operators that denote—union

—concatenation

—iteration

a* [b | c]a* [b | c]

Any number of a’s followed by either b or c.

Page 20: Finite-State Methods in Natural-Language Processing: …web.stanford.edu/class/linguist139p/MotivationSlides.pdf · one finite-state automaton ... Motivation — 21 A word recognizer

20Motivation —

Some Motivations

• Word Recognition

• Dictionary Lookup

• Spelling Conventions

Page 21: Finite-State Methods in Natural-Language Processing: …web.stanford.edu/class/linguist139p/MotivationSlides.pdf · one finite-state automaton ... Motivation — 21 A word recognizer

21Motivation —

A word recognizer takes a string of characters asinput and returns “yes” or “no” according as theword is or is not in a given set.

Solves the membership problem.

Word Recognition

e.g. Spell Checking, Scrabble

Page 22: Finite-State Methods in Natural-Language Processing: …web.stanford.edu/class/linguist139p/MotivationSlides.pdf · one finite-state automaton ... Motivation — 21 A word recognizer

22Motivation —

• Has right set of letters (any order).

• Has right sounds (Soundex).

• Random (suprimposed) coding (Unix Spell)

Approximate methods

Wordhash1hash2

hashk

BitTable

Page 23: Finite-State Methods in Natural-Language Processing: …web.stanford.edu/class/linguist139p/MotivationSlides.pdf · one finite-state automaton ... Motivation — 21 A word recognizer

23Motivation —

Exact Methods

• Hashing

• Search (linear, binary ...)

• Digital search (“Tries”)

aa

r dv a

rk

b a c k

sh e

d

si n g

ze

d

i n gs

b u z

a c k

Folds together common prefixes

Page 24: Finite-State Methods in Natural-Language Processing: …web.stanford.edu/class/linguist139p/MotivationSlides.pdf · one finite-state automaton ... Motivation — 21 A word recognizer

24Motivation —

Exact Methods (continued)

• Finite-state automata

Folds together common prefixes and suffixes

Page 25: Finite-State Methods in Natural-Language Processing: …web.stanford.edu/class/linguist139p/MotivationSlides.pdf · one finite-state automaton ... Motivation — 21 A word recognizer

25Motivation —

Enumeration vs. Description

• Enumeration—Representation includes an item for each object.

Size = f(Items)

• Description—Representation provides a characterization of the

set of all items.Size = g(Common properties, Exceptions)

—Adding item can decrease size.

Page 26: Finite-State Methods in Natural-Language Processing: …web.stanford.edu/class/linguist139p/MotivationSlides.pdf · one finite-state automaton ... Motivation — 21 A word recognizer

26Motivation —

Classification

Exact Approximate

Enumeration Hash table SoundexBinary search

Description Trie Unix SpellFSM Right letter

Page 27: Finite-State Methods in Natural-Language Processing: …web.stanford.edu/class/linguist139p/MotivationSlides.pdf · one finite-state automaton ... Motivation — 21 A word recognizer

27Motivation —

FSM Extends to Infinite SetsProductive compounding

Kindergartensgeselschaft

Page 28: Finite-State Methods in Natural-Language Processing: …web.stanford.edu/class/linguist139p/MotivationSlides.pdf · one finite-state automaton ... Motivation — 21 A word recognizer

28Motivation —

Statistics

English PortugueseVocabulary

Words 81,142 206,786KBytes 858 2,389PKPAK 313 683PKZIP 253 602

FSMStates 29,317 17,267Transitions 67,709 45,838KBytes 203 124

From Lucchese and Kowaltowski (1993)

Page 29: Finite-State Methods in Natural-Language Processing: …web.stanford.edu/class/linguist139p/MotivationSlides.pdf · one finite-state automaton ... Motivation — 21 A word recognizer

29Motivation —

Dictionary lookup takes a string of characters asinput and returns “yes” or “no” according as theword is or is not in a given set and returnsinformation about the word.

Dictionary Lookup

Page 30: Finite-State Methods in Natural-Language Processing: …web.stanford.edu/class/linguist139p/MotivationSlides.pdf · one finite-state automaton ... Motivation — 21 A word recognizer

30Motivation —

Lookup MethodsApproximate — guess the information

If it ends in “ed”, it’s a past-tense verb.

Exact — store the information for finitely many wordsTable Lookup

• Hash

• Search

• Trie —store at word-endings.

FSM

• Store at final states?

No suffix collapse — reverts to Trie.

Page 31: Finite-State Methods in Natural-Language Processing: …web.stanford.edu/class/linguist139p/MotivationSlides.pdf · one finite-state automaton ... Motivation — 21 A word recognizer

31Motivation —

Word Identifiers

Associate a unique, useful, identifier with each of nwords, e.g. an integer from 1 to n. This can be usedto index a vector of dictionary information.

n

word → ii Information

Page 32: Finite-State Methods in Natural-Language Processing: …web.stanford.edu/class/linguist139p/MotivationSlides.pdf · one finite-state automaton ... Motivation — 21 A word recognizer

32Motivation —

Pre-order Walk

A pre-order walk of an n-word FSM, counting finalstates, assigns such integers, even if suffixes arecollapsed ⇒ Linear Search.

drip → 1drips → 2drop → 3drops → 4

Page 33: Finite-State Methods in Natural-Language Processing: …web.stanford.edu/class/linguist139p/MotivationSlides.pdf · one finite-state automaton ... Motivation — 21 A word recognizer

33Motivation —

Suffix Counts

• Store with each state the size of its suffix set

• Skip irrelevant transitions, incrementing count bydestination suffix sizes.

drip → 1drips → 2drop → 3drops → 4

4 4 4 2 1 0

Page 34: Finite-State Methods in Natural-Language Processing: …web.stanford.edu/class/linguist139p/MotivationSlides.pdf · one finite-state automaton ... Motivation — 21 A word recognizer

34Motivation —

• Minimal Perfect Hash (Lucchesi and Kowaltowski)

• Word-number mapping (Kaplan and Kay, 1985)

Page 35: Finite-State Methods in Natural-Language Processing: …web.stanford.edu/class/linguist139p/MotivationSlides.pdf · one finite-state automaton ... Motivation — 21 A word recognizer

35Motivation —

Spelling Conventions

iN+tractable → intractable

iN+practical → impractical

iN is the common negative prefix— im before labial

— in otherwise

c.f. input → input

Page 36: Finite-State Methods in Natural-Language Processing: …web.stanford.edu/class/linguist139p/MotivationSlides.pdf · one finite-state automaton ... Motivation — 21 A word recognizer

36Motivation —

An in/im Transducer No exit from thisstate except over

a labial.

No labials from this state

Page 37: Finite-State Methods in Natural-Language Processing: …web.stanford.edu/class/linguist139p/MotivationSlides.pdf · one finite-state automaton ... Motivation — 21 A word recognizer

37Motivation —

Generation — “intractable”

0 0

i

i

N

m

N

n

2

t

t

0

r

r

0

a

a

1

Page 38: Finite-State Methods in Natural-Language Processing: …web.stanford.edu/class/linguist139p/MotivationSlides.pdf · one finite-state automaton ... Motivation — 21 A word recognizer

38Motivation —

Generation — “impractical”

0 0

i

i

N

m

N

n

2

1

p

p

0

r

r

0

a

a

Page 39: Finite-State Methods in Natural-Language Processing: …web.stanford.edu/class/linguist139p/MotivationSlides.pdf · one finite-state automaton ... Motivation — 21 A word recognizer

39Motivation —

Recognition — “intractable”

0 0

i

i

n

n

N

n

2

t

t

0

r

r

0

a

a

0

t

t

0

r

r

0

a

a

Page 40: Finite-State Methods in Natural-Language Processing: …web.stanford.edu/class/linguist139p/MotivationSlides.pdf · one finite-state automaton ... Motivation — 21 A word recognizer

40Motivation —

Generation — “input”

0 0

i

i

N

m

N

n

2

1

p

p

0

u

u

0

t

t

0

Page 41: Finite-State Methods in Natural-Language Processing: …web.stanford.edu/class/linguist139p/MotivationSlides.pdf · one finite-state automaton ... Motivation — 21 A word recognizer

41Motivation —

A Word Transducer

Base Forms

Morphology

Spelling Rules

Text Forms

Finite-StateTransducers

Finite-stateMachine

Finite-state machine

Page 42: Finite-State Methods in Natural-Language Processing: …web.stanford.edu/class/linguist139p/MotivationSlides.pdf · one finite-state automaton ... Motivation — 21 A word recognizer

42Motivation —

BibliographyKaplan, Ronald M. and Martin Kay. “Regular

models of phololgical rule systems.” ComputationalLinguistics, 23:3, September 1994.

Hopcroft, John E. and Jeffrey D. Ullman.Introduction to Automata Theory, Languages andComputation. Addison Wesley, 1979. Chapters 2and 3.

Partee, Barbara H., Alice ter Meulen and Robert E.Wall. Mathematical Methods in Linguistics. Kluwer,1990. Chapters 16 and 17.

Lucchesi, Cláudio L. and Tomasz Kowaltowski.“Applications of Finite Automata RepresentingLarge Vocabularies”. Software Practice andEsperience. 23:1, January 1993.