fst morphology

33
FST Morphology Miriam Butt October 2003 Based on Beesley and Karttunen 2003

Upload: clare-ware

Post on 31-Dec-2015

70 views

Category:

Documents


4 download

DESCRIPTION

FST Morphology. Miriam Butt October 2003. Based on Beesley and Karttunen 2003. Recap. Last Time: Finite State Automata can model most anything that involes a finite amount of states. We modeled a Coke Machine and saw that it could also be thought of as defining a language. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: FST Morphology

FST Morphology

Miriam Butt

October 2003

Based on Beesley and Karttunen 2003

Page 2: FST Morphology

Recap

Last Time: Finite State Automata can model most anything that involes a finite amount of states.

We modeled a Coke Machine and saw that it could also be thought of as defining a language.

We will now look at the extension to natural language more closely.

Page 3: FST Morphology

A One-Word Language

c a n t o

Page 4: FST Morphology

A Three-Word Language

c a n t o

ti g r e

me s a

Page 5: FST Morphology

Analysis: A Successful Match

c a n t o

ti g r e

me s a

m e s aInput:

Page 6: FST Morphology

Rejects

The analysis of libro, tigra, cant, mesas will fail.

Why?

Use: for example for a spell checker

Page 7: FST Morphology

Transducers: Beyond Accept and Reject

c a n t o

ti g r e

me s a

ti g r e

c a n t om

e s a

Page 8: FST Morphology

Transducers: Beyond Accept and Reject

Analysis Process:• Start at the Start State

• Match the input symbols of string against the lower-side symbol on the arcs, consuming the input symbols and finding a path to a final state.

• If successful, return the string of upper-side symbols on the path as the result.

• If unsucessful, return nothing.

Page 9: FST Morphology

A Two-Level Transducer

c a n t o

ti g r e

me s a

ti g r e

c a n t om

e s a

m e s aOutput:m e s aInput:

Page 10: FST Morphology

A Lexical Transducer

c a n t a

Output:

c a n tInput:

r+PresInd

+1P+Sg

c a n t 0 0 0 0 0o

o

c a n t

+Verb

a r +PresInd +1P +Sg+Verb

One Possible Path through the Network

Page 11: FST Morphology

A Lexical Transducer

c a n t o

Output:

c a n tInput:

+Masc +Sg

c a n t o 0 00

o

c a n t

+Noun

o +Masc +Sg+Noun

Another Possible Path through the Network(found through backtracking)

Page 12: FST Morphology

Why Transducer?

General Definition: Device that converts energy from one form to another.

In this Context: Device that converts one string of symbols into another another string of symbols.

Page 13: FST Morphology

The Tags

Tags or Symbols like +Noun or +Verb are arbitrary: the naming convention is determined by the (computational) linguist and depends on the larger picture (type of theory/type of application).

One very successful tagging/naming convention is the Penn Treebank Tag Set

Page 14: FST Morphology

The Tags

What kind of Tags might be useful?

Page 15: FST Morphology

Generation vs. Analysis

The same finite state transducers we have been using for the analysis of a given surface string can also be used in reverse: for generation.

The XRCE people think of analysis as lookup, of generation of lookdown.

Page 16: FST Morphology

Generation --- Lookdown

c a n t a

Output:

c a n tInput:

r+PresInd

+1P+Sg

c a n t 0 0 0 0 0o

oc a n t

+Verb

a r +PresInd +1P +Sg+Verb

Page 17: FST Morphology

Generation --- Lookdown

Analysis Process:

• Start at the Start State and the beginning of the input string

• Match the input symbols of string against the upper-side symbols on the arcs, consuming the input symbols and finding a path to a final state.

• If successful, return the string of lower-side symbols on the path as the result.

• If generation is unsuccessful, return nothing.

Page 18: FST Morphology

Tokenization

General String Conversion: Tokenizer

Task: Divide up running text into individual tokens

• couldn’t -> could not• to and from -> to-and-fro• ,,, -> , • The farmer walked -> The^TBfarmer^TBwalked

Page 19: FST Morphology

Sharing Structure and Sets

Networks can be compressed quite cleverly (p. 16-17).

FST Networks are based on formal language theory. This includes basics of set theory:

membership, union, intersection,subtraction, complementation

Sets:

Page 20: FST Morphology

Some Basic Concepts/Representations

Empty Language

Empty-String Language

Page 21: FST Morphology

Some Basic Concepts/Representations

A Network for the Universal Language

A Network for an Infinite Language

a

?

Page 22: FST Morphology

Relations

Ordered Set: members are ordered.

Ordered Pair: <A,B> vs. <B,A>

Relation: set whose members are ordered pairs • Family Trees (p. 21)• { <“cantar+Verb+PresInd+1P+Sg”, “canto”>, <“canto+Noun+Masc+Sg”, “canto”>,

... }

Page 23: FST Morphology

Relations

An Infinite Relation:.

Identity Relation: <“canto”, “canto”>

aA

bB

cC...

zZ

Page 24: FST Morphology

Basic Set Operations

• union (p. 24)• intersection (p. 25)• subtraction (p. 25)

Page 25: FST Morphology

Concatenation

One can also concatenate two existing languages (finite state networks with one another to build up new words productively/dynamically).

This works nicely, but one has to write extra rules to avoid things like: *trys, *tryed, though trying is okay.

Page 26: FST Morphology

Concatenationw o r k

i n gs

e d

Network for the Language {“work”}

Network for the Language {“s”, “ed”, “ing”}

Page 27: FST Morphology

Concatenation

w o r k i n gs

e dConcatenation of the two networks

What strings/language does this result in?

Page 28: FST Morphology

Composition

Composition is an operation on two relations.

Composition of the two relations <x,y> and <y,z> yields <x, z>

Example: <“cat”, “chat”> with <“chat”, “Katze”> gives <“cat”, “Katze”>

Page 29: FST Morphology

Composition

K a t z e

c h a t

c a t

c h a t

Page 30: FST Morphology

Composition

K a t z e

c a t

c h a t

Merging the two networks

Page 31: FST Morphology

Composition

K a t z e

c a t

The Composition of the Networks

What is this reminiscent of?

Page 32: FST Morphology

Composition + Rule Application

Composition can be used to encode phonology-style rules.

E.g., Lexical Phonology assumes several iterations/levels at which different kinds of rules apply.

This can be modeled in terms of cascades of FST transducers (p. 35).

These can then be composed together into one single transducer.

Page 33: FST Morphology

Next Time

• Begin working with xfst

• Now: Exercises 1.10.1, 1.10.2, 1.10.4 Optional: 1.10.3