CS60057 Speech & Natural Language Processing
Lecture 1, 7/21/2005 Natural Language Processing 1
CS60057 Speech & Natural Language Processing
Autumn 2007
Lecture 4
1 August 2007
MORPHOLOGY
Finite State Machines
FSAs are equivalent to regular languages.
FSTs are equivalent to regular relations (over pairs of regular languages).
FSTs are like FSAs but with complex labels.
We can use FSTs to transduce between surface and lexical levels.
Simple Rules
Adding in the Words
Derivational Rules
Parsing/Generation vs. Recognition
Recognition is usually not quite what we need.
Usually, if we find some string in the language, we need to find the structure in it (parsing).
Or we have some structure and we want to produce a surface form (production/generation).
Example: from “cats” to “cat +N +PL” and back.
Morphological Parsing
Given the input cats, we’d like to output cat +N +Pl, telling us that cats is the plural of the noun cat.
Given the Spanish input bebo, we’d like to output beber +V +PInd +1P +Sg, telling us that bebo is the present indicative first person singular form of the Spanish verb beber, ‘to drink’.
Morphological Analyser
To build a morphological analyser we need:
lexicon: the list of stems and affixes, together with basic information about them
morphotactics: the model of morpheme ordering (e.g., the English plural morpheme follows a noun rather than a verb)
orthographic rules: spelling rules that model the changes that occur in a word, usually when two morphemes combine (e.g., fly + s = flies)
Lexicon & Morphotactics
Typically, the list of word parts (lexicon) and the model of ordering can be combined into an FSA which will recognise all the valid word forms.
For this to be possible the word parts must first be classified into sublexicons.
The FSA defines the morphotactics (ordering constraints).
Sublexicons to classify the list of word parts
reg-noun    irreg-pl-noun    irreg-sg-noun    plural
cat         mice             mouse            -s
fox         sheep            sheep
            geese            goose
FSA Expresses Morphotactics (ordering model)
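The slide's figure is not preserved in this transcript. As a rough stand-in, the following Python sketch shows one way the sublexicons above could be wired into an FSA that checks morpheme ordering. The state names (q0, q1, q2) and the function name `accepts` are illustrative assumptions, not taken from the lecture.

```python
# Sublexicons from the table above (class names follow the slide).
SUBLEXICONS = {
    "reg-noun": {"cat", "fox"},
    "irreg-pl-noun": {"mice", "sheep", "geese"},
    "irreg-sg-noun": {"mouse", "sheep", "goose"},
    "plural": {"s"},
}

# Arcs of the FSA: state -> list of (sublexicon label, next state).
# The state names are hypothetical; the original figure is not available.
ARCS = {
    "q0": [("reg-noun", "q1"), ("irreg-sg-noun", "q2"), ("irreg-pl-noun", "q2")],
    "q1": [("plural", "q2")],
}
FINAL = {"q1", "q2"}


def accepts(word, state="q0"):
    """Return True if the FSA can consume all of `word` starting in `state`."""
    if word == "":
        return state in FINAL
    for sublexicon, next_state in ARCS.get(state, []):
        for morph in SUBLEXICONS[sublexicon]:
            # Try every morph of this sublexicon that is a prefix of the input.
            if word.startswith(morph) and accepts(word[len(morph):], next_state):
                return True
    return False
```

Note that this machine models ordering only: `accepts("cats")` holds, while `accepts("foxes")` does not, because the e-insertion needed for "foxes" belongs to the spelling rules rather than to the morphotactics.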
Towards the Analyser
We can use lexc or xfst to build such an FSA
To augment this to produce an analysis we must create a transducer Tnum which maps between the lexical level and an "intermediate" level that is needed to handle the spelling rules of English.
Three Levels of Analysis
1. Tnum: Noun Number Inflection
• multi-character symbols
• morpheme boundary ^
• word boundary #
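A minimal sketch of what Tnum computes in the generation direction, assuming the tag spellings +N, +Sg and +Pl and the small word lists from the sublexicon table. It is written as an ordinary Python function rather than a compiled transducer, and the function name `lexical_to_intermediate` is hypothetical.

```python
# Stems taken from the sublexicon table; the dictionary layout is an assumption.
REGULAR_NOUNS = {"cat", "fox"}
IRREGULAR_PLURALS = {"mouse": "mice", "goose": "geese", "sheep": "sheep"}


def lexical_to_intermediate(lexical):
    """Map a lexical form such as 'fox +N +Pl' to an intermediate form 'fox^s#'.

    '^' marks a morpheme boundary and '#' marks the word boundary.
    """
    stem, *tags = lexical.split()
    known = stem in REGULAR_NOUNS or stem in IRREGULAR_PLURALS
    if tags == ["+N", "+Sg"] and known:
        return stem + "#"
    if tags == ["+N", "+Pl"]:
        if stem in REGULAR_NOUNS:
            # Regular plural: stem, morpheme boundary, plural morpheme s.
            return stem + "^s#"
        if stem in IRREGULAR_PLURALS:
            # Irregular plural: the whole form is listed, no boundary needed.
            return IRREGULAR_PLURALS[stem] + "#"
    raise ValueError(f"no analysis for {lexical!r}")
```

For example, `lexical_to_intermediate("fox +N +Pl")` yields `fox^s#`, which still needs a spelling rule to become "foxes".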
Intermediate Form to Surface
The reason we need to have an intermediate form is that funny things happen at morpheme boundaries, e.g.
cat^s → cats
fox^s → foxes
fly^s → flies
The rules which describe these changes are called orthographic rules or "spelling rules".
More English Spelling Rules
consonant doubling: beg/begging
y replacement: try/tries
k insertion: panic/panicked
e deletion: make/making
e insertion: watch/watches
Each rule can be stated in more detail ...
Spelling Rules
Chomsky & Halle (1968) introduced a rewrite-rule notation, originally for phonological rules, that can also be used to state spelling rules.
A very similar notation is embodied in the "conditional replacement" rules of xfst.
E -> F || L _ R
which means: replace E with F when it appears between left context L and right context R.
A Particular Spelling Rule
This rule does e-insertion
^ -> e || x _ s#
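To see what this rule does to concrete strings, here is a small Python sketch that imitates it with a regular-expression substitution. This is a swapped-in technique for illustration, not xfst and not a compiled FST; the function names `e_insertion` and `to_surface` are hypothetical, and the clean-up step that erases leftover boundary symbols is an assumption about how the surface form is obtained.

```python
import re


def e_insertion(intermediate):
    """Imitate '^ -> e || x _ s#': rewrite ^ as e between x and s#."""
    # Lookbehind (?<=x) is the left context, lookahead (?=s#) the right context.
    return re.sub(r"(?<=x)\^(?=s#)", "e", intermediate)


def to_surface(intermediate):
    """Apply e-insertion, then erase any remaining boundary symbols."""
    return e_insertion(intermediate).replace("^", "").replace("#", "")
```

With these definitions, `e_insertion("fox^s#")` gives `foxes#`, while `cat^s#` is left untouched by the rule and only loses its boundary symbols in `to_surface`. Forms such as fly^s are not handled here, since they need the separate y-replacement rule.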
e insertion over 3 levels
The rule corresponds to the mapping between surface and intermediate levels
e insertion as an FST
Incorporating Spelling Rules
Spelling rules, each corresponding to an FST, can be run in parallel provided that they are "aligned".
The set of spelling rules is positioned between the surface level and the intermediate level.
Parallel execution of FSTs can be carried out:
by simulation: in this case the FSTs must first be aligned.
by first constructing a single FST corresponding to their intersection.
Putting it all together
Execution of the FSTi takes place in parallel
Kaplan and Kay: The Xerox View
FSTi are aligned but separate
FSTi intersected together
Finite State Transducers
The simple story:
Add another tape.
Add extra symbols to the transitions.
On one tape we read “cats”, on the other we write “cat +N +PL”, or the other way around.
FSTs
English Plural
surface    lexical
cat        cat+N+Sg
cats       cat+N+Pl
foxes      fox+N+Pl
mice       mouse+N+Pl
sheep      sheep+N+Pl
sheep      sheep+N+Sg
Transitions
c:c means read a c on one tape and write a c on the other.
+N:ε means read a +N symbol on one tape and write nothing on the other.
+PL:s means read +PL and write an s.
c:c a:a t:t +N:ε +PL:s
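The transition sequence above can be run in either direction. The sketch below, which assumes a single linear path with no branching, walks the pairs and reads from one side while writing to the other. The names `PATH`, `EPS` and `transduce` are illustrative; ε is represented by the empty string.

```python
EPS = ""  # stands for ε: nothing is read or written on that tape

# The path from the slide as (lexical symbol, surface symbol) pairs.
PATH = [("c", "c"), ("a", "a"), ("t", "t"), ("+N", EPS), ("+PL", "s")]


def transduce(symbols, path, read_side):
    """Follow `path`, reading from tape `read_side` and writing to the other.

    read_side=0 reads lexical symbols (generation);
    read_side=1 reads surface symbols (analysis).
    Returns the list of output symbols, or None if the input is rejected.
    """
    write_side = 1 - read_side
    remaining = list(symbols)
    output = []
    for pair in path:
        if pair[read_side] != EPS:
            # A non-ε label must match the next input symbol.
            if not remaining or remaining[0] != pair[read_side]:
                return None
            remaining.pop(0)
        if pair[write_side] != EPS:
            output.append(pair[write_side])
    # Accept only if the whole input was consumed.
    return output if not remaining else None
```

Generation is `transduce(["c", "a", "t", "+N", "+PL"], PATH, 0)`, which writes the symbols of "cats"; analysis is `transduce(list("cats"), PATH, 1)`, which writes c, a, t, +N, +PL. A real lexicon transducer has many branching paths, so it needs a search rather than this single walk.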
Typical Uses
Typically, we’ll read from one tape using the first symbol on the machine transitions (just as in a simple FSA).
And we’ll write to the second tape using the other symbols on the transitions.
Ambiguity
Recall that in non-deterministic recognition multiple paths through a machine may lead to an accept state. It didn’t matter which path was actually traversed.
In FSTs the path to an accept state does matter, since different paths represent different parses and produce different outputs.
Ambiguity
What’s the right parse for unionizable?
Union-ize-able
Un-ion-ize-able
Each represents a valid path through the derivational morphology machine.
Ambiguity
There are a number of ways to deal with this problem:
Simply take the first output found.
Find all the possible outputs (all paths) and return them all (without choosing).
Bias the search so that only one or a few likely paths are explored.
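The first two strategies can be illustrated on the English Plural table from earlier, where "sheep" has two lexical analyses. This sketch treats the transducer as an enumerated list of (surface, lexical) pairs, which is an assumption made for brevity, and the function names are hypothetical. The third strategy needs some scoring of paths, which the slides do not specify, so it is not sketched.

```python
# (surface, lexical) pairs from the English Plural table.
RELATION = [
    ("cat", "cat+N+Sg"),
    ("cats", "cat+N+Pl"),
    ("foxes", "fox+N+Pl"),
    ("mice", "mouse+N+Pl"),
    ("sheep", "sheep+N+Pl"),
    ("sheep", "sheep+N+Sg"),
]


def all_analyses(surface):
    """Strategy 2: return every lexical form paired with `surface`."""
    return [lexical for s, lexical in RELATION if s == surface]


def first_analysis(surface):
    """Strategy 1: return the first analysis found, or None if there is none."""
    for s, lexical in RELATION:
        if s == surface:
            return lexical
    return None
```

Here `all_analyses("sheep")` returns both the plural and the singular reading, while `first_analysis("sheep")` commits to whichever happens to be listed first.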
The Gory Details
Of course, it’s not as easy as “cat +N +PL” <-> “cats”.
As we saw earlier there are geese, mice and oxen.
But there are also a whole host of spelling/pronunciation changes that go along with inflectional changes:
Cats vs Dogs
Fox and Foxes
Multi-Tape Machines
To deal with this we can simply add more tapes and use the output of one tape machine as the input to the next
So to handle irregular spelling changes we’ll add intermediate tapes with intermediate symbols
Generativity
Nothing is really privileged about the directions.
We can write to one tape and read from the other, or vice versa.
One way is generation, the other way is analysis.
Multi-Level Tape Machines
We use one machine to transduce between the lexical and the intermediate level, and another to handle the spelling changes to the surface tape
Lexical to Intermediate Level
Intermediate to Surface
The “add an e” rule, as in fox^s# <-> foxes#
Foxes
Note
A key feature of this machine is that it doesn’t do anything to inputs to which it doesn’t apply.
Meaning that they are written out unchanged to the output tape.
Turns out the multiple tapes aren’t really needed; they can be compiled away.
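One way to picture "compiling away" the intermediate tape is relation composition: pairs (lexical, intermediate) and (intermediate, surface) are joined on the shared middle element, which then disappears. The sketch below does this for small enumerated relations. Real toolkits compose the machines themselves rather than listing pairs, so treat this as an illustration under that simplifying assumption; the names are hypothetical.

```python
def compose(first, second):
    """Compose two finite relations given as sets of pairs.

    (a, c) is in the result when some b exists with
    (a, b) in `first` and (b, c) in `second`.
    """
    return {(a, c) for a, b in first for b2, c in second if b == b2}


# Lexical -> intermediate, then intermediate -> surface (examples from the slides).
LEX_TO_INTER = {("fox +N +PL", "fox^s#"), ("cat +N +PL", "cat^s#")}
INTER_TO_SURF = {("fox^s#", "foxes"), ("cat^s#", "cats")}

# The composed relation no longer mentions the intermediate level.
LEX_TO_SURF = compose(LEX_TO_INTER, INTER_TO_SURF)
```

After composition, `LEX_TO_SURF` pairs "fox +N +PL" directly with "foxes" and "cat +N +PL" with "cats", so a single two-tape machine suffices at run time.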
Overall Scheme
We now have one FST that has explicit information about the lexicon (actual words, their spelling, facts about word classes and regularity).
It maps the lexical level to intermediate forms.
We have a larger set of machines that capture orthographic/spelling rules.
They map intermediate forms to surface forms.
Overall Scheme
http://nltk.sourceforge.net/index.php/Documentation