TRANSCRIPT
Speech and Language Processing
Lecture 3: Chapter 3 of SLP
Today
● English Morphology
● Finite-State Transducers (FST)
● Morphology using FST
● Example with flex
● Tokenization, sentence breaking
● Spelling correction, edit distance
Words
● Finite-state methods are particularly useful in dealing with a lexicon.
● Many devices, most with limited memory, need access to large lists of words
● And they need to perform fairly sophisticated tasks with those lists
● So we’ll first talk about some facts about words and then come back to computational methods
English Morphology
● Morphology is the study of the ways that words are built up from smaller meaningful units called morphemes
● We can usefully divide morphemes into two classes:
○ Stems: the core meaning-bearing units
○ Affixes: bits and pieces that adhere to stems to change their meanings and grammatical functions
English Morphology
● We can further divide morphology up into two broad classes:
○ Inflectional
○ Derivational
Word Classes
● By word class, we have in mind familiar notions like noun and verb.
● We’ll go into more details in the next lesson (part-of-speech tagging).
● Right now we’re concerned with word classes because the way that stems and affixes combine is based to a large degree on the word class of the stem
Inflectional Morphology
● Inflectional morphology concerns the combination of stems and affixes where the resulting word:
○ Has the same word class as the original
○ Serves a grammatical/semantic purpose that is
■ Different from the original
■ But is nevertheless transparently related to the original
Nouns and Verbs in English
● English nouns are simple
○ Markers for plural and possessive.
○ More complicated in other languages:
■ Gender (masculine, feminine, neuter, …)
■ Cases (nominative, accusative, vocative, …)
■ Agglutinative languages.
■ Etc.
● Verbs are only slightly more complex
○ Markers appropriate to the tense of the verb.
○ Again, more complicated in other languages.
Regulars and Irregulars
● Things are complicated a little by the fact that some words do not follow the rules:
● For nouns:
○ Rule for plural: add -s (-es, -(i)es, …)
○ Exceptions: mouse/mice, goose/geese, ox/oxen
Regulars and Irregulars
● For verbs:
● Other verbs do not follow the rules, e.g.
■ Go/went, fly/flew
■ Be / was / were / am / are / is / been / being
● The terms regular and irregular are used to refer to words that follow the rules and those that don’t
Morphological class
Base                    walk      merge     map
3rd person              walks     merges    maps
-ing participle         walking   merging   mapping
Past, past participle   walked    merged    mapped
Regular and Irregular Verbs
● Regulars…
○ Walk, walks, walking, walked, walked
● Irregulars
○ Eat, eats, eating, ate, eaten
○ Catch, catches, catching, caught, caught
○ Cut, cuts, cutting, cut, cut
Inflectional Morphology
● So inflectional morphology in English is fairly straightforward.
○ The regular rules are productive: they apply to new words.
■ blogger
■ to fax
● But morphology is complicated by the fact that there are irregularities.
Derivational Morphology
● Derivational morphology is not as straightforward:
○ Quasi-systematicity
○ Irregular meaning change
○ Changes of word class
Derivational Examples
Verbs and Adjectives to Nouns
-ation computerize computerization
-ee appoint appointee
-er kill killer
-ness fuzzy fuzziness
Derivational Examples
Nouns and Verbs to Adjectives
-al computation computational
-able embrace embraceable
-less clue clueless
Example: Compute
● Many paths are possible…
● Start with compute
○ computer → computerize → computerization
○ computer → computerize → computerizable → computerizability
● But not all paths/operations are equally good (allowable?)
○ clue
■ clue → *clueable
Morphology and FSAs
● We’d like to use the machinery provided by FSAs to capture these facts about morphology
○ Accept strings that are in the language.
○ Reject strings that are not.
○ And do so in a way that doesn’t require us to, in effect, list all the words in the language.
Start Simple
● Regular singular nouns are ok
● Regular plural nouns have an -s on the end
● Irregulars are ok as is
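As a rough sketch of these three rules in Python (using a hypothetical toy lexicon, not the lecture's actual word lists):

```python
# Sketch only: a hypothetical toy lexicon stands in for the real word lists.
REGULAR_STEMS = {"cat", "dog", "fox"}
IRREGULAR_FORMS = {"goose", "geese", "mouse", "mice", "ox", "oxen"}

def accepts(word):
    """Accept regular singulars, regular plurals (stem + s), and irregulars as-is."""
    if word in REGULAR_STEMS or word in IRREGULAR_FORMS:
        return True
    # Regular plural rule: strip the final -s and look for a regular stem.
    if word.endswith("s") and word[:-1] in REGULAR_STEMS:
        return True
    return False
```

Note that this naive version rejects "foxes": the e-insertion spelling rule is not handled yet, which is exactly what the intermediate tapes later in the lecture are for.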
Simple Rules
Now Plug in the Words
Derivational Rules
If everything is an accept state, how do things ever get rejected?
Parsing/Generation vs. Recognition
● We can now run strings through these machines to recognize strings in the language
● But recognition is usually not quite what we need
○ Often if we find some string in the language we might like to assign a structure to it (parsing)
○ Or we might have some structure and we want to produce a surface form for it (production/generation)
● Example
○ From “cats” to “cat +N +PL”
Finite State Transducers
● The simple story
○ Add another tape
○ Add extra symbols to the transitions
○ On one tape we read “cats”; on the other we write “cat +N +PL”
FSTs
Applications
● The kind of parsing we’re talking about is normally called morphological analysis
● It can either be
○ An important stand-alone component of many applications (spelling correction, information retrieval)
○ Or simply a link in a chain of further linguistic analysis
Transitions
● c:c means read a c on one tape and write a c on the other
● +N:ε means read a +N symbol on one tape and write nothing on the other
● +PL:s means read +PL and write an s
c:c   a:a   t:t   +N:ε   +PL:s
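Put together, these transitions can be simulated directly. The sketch below uses a toy encoding (my assumption, not the lecture's exact machine): each transition is (state, read, write, next state), with the empty string playing the role of ε.

```python
# Toy FST for "cats" <-> "cat +N +PL"; "" stands in for epsilon on the write side.
TRANSITIONS = [
    (0, "c", "c", 1),
    (1, "a", "a", 2),
    (2, "t", "t", 3),
    (3, "+N", "", 4),
    (4, "+PL", "s", 5),
]
ACCEPT = {5}

def transduce(symbols):
    """Read a sequence of input symbols; return the output tape, or None on rejection."""
    state, out = 0, []
    for sym in symbols:
        for (s, read, write, nxt) in TRANSITIONS:
            if s == state and read == sym:
                out.append(write)
                state = nxt
                break
        else:
            return None  # no transition matched: reject
    return "".join(out) if state in ACCEPT else None
```

Running the machine in the other direction (swapping the read and write fields) would parse "cats" into "cat +N +PL" instead of generating it.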
Typical Uses
● Typically, we’ll read from one tape using the first symbol on the machine transitions (just as in a simple FSA).
● And we’ll write to the second tape using the other symbols on the transitions.
Ambiguity
● Recall that in non-deterministic recognition multiple paths through a machine may lead to an accept state.
○ It didn’t matter which path was actually traversed
● In FSTs the path to an accept state does matter, since different paths represent different parses and different outputs will result.
Ambiguity
● What’s the right parse (segmentation) for
○ Unionizable
○ Union-ize-able
○ Un-ion-ize-able
● Each represents a valid path through the derivational morphology machine.
Ambiguity
● There are a number of ways to deal with this problem
○ Simply take the first output found
○ Find all the possible outputs (all paths) and return them all (without choosing)
○ Bias the search so that only one or a few likely paths are explored
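The "find all outputs" option can be sketched as an exhaustive search over segmentations. The morpheme list below is an illustrative assumption (with "izable" standing in for ize+able after e-deletion, since this sketch has no spelling rules):

```python
# Assumed toy morpheme inventory, not the lecture's derivational machine.
MORPHEMES = {"un", "union", "ion", "izable"}

def segmentations(word, prefix=()):
    """Return every way of covering `word` with morphemes, as tuples."""
    if word == "":
        return [prefix]
    results = []
    for i in range(1, len(word) + 1):
        if word[:i] in MORPHEMES:
            # This prefix is a morpheme: recurse on the remainder.
            results += segmentations(word[i:], prefix + (word[:i],))
    return results
```

For "unionizable" this returns both paths from the slide, union-izable and un-ion-izable, without choosing between them.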
Some more details
● Of course, it’s not as easy as “cat +N +PL” <-> “cats”
● As we saw earlier there are geese, mice and oxen
● But there are also a whole host of spelling/pronunciation changes that go along with inflectional changes
○ cats vs. dogs
○ fox and foxes
Multi-Tape Machines
● To deal with these complications, we will add more tapes and use the output of one tape machine as the input to the next
● So to handle irregular spelling changes we’ll add intermediate tapes with intermediate symbols
Multi-Level Tape Machines
We use one machine to transduce between the lexical and the intermediate level, and another to handle the spelling changes to the surface tape
Lexical to Intermediate Level
Intermediate to Surface
● The “add an e” rule, as in fox^s# <-> foxes#
Foxes
Note
● A key feature of this machine is that it doesn’t do anything to inputs to which it doesn’t apply.
● Meaning that they are written out unchanged to the output tape.
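A sketch of this intermediate-to-surface step in Python, using the boundary symbols from the slides (^ for morpheme boundary, # for word boundary); restricting the rule to stems ending in x, s, or z is my assumption:

```python
import re

def e_insertion(intermediate):
    """Insert 'e' before ^s when the stem ends in x, s, or z (assumed context)."""
    return re.sub(r"([xsz])\^s#", r"\1es#", intermediate)

def to_surface(intermediate):
    """Apply the spelling rule, then strip the boundary symbols."""
    return e_insertion(intermediate).replace("^", "").replace("#", "")
```

Inputs the rule does not apply to, such as cat^s#, pass through unchanged, matching the behavior described above.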
Overall Scheme
● We now have one FST that has explicit information about the lexicon (actual words, their spelling, facts about word classes and regularity).
○ Lexical level to intermediate forms
● We have a larger set of machines that capture orthographic/spelling rules.
○ Intermediate forms to surface forms
Overall Scheme
Cascades
● This is an architecture that we’ll see again and again
○ Overall processing is divided up into distinct rewrite steps
○ The output of one layer serves as the input to the next
○ The intermediate tapes may or may not wind up being useful in their own right
Composition
1. Create a set of new states that correspond to each pair of states from the original machines (New states are called (x,y), where x is a state from M1, and y is a state from M2)
2. Create a new FST transition table for the new machine according to the following intuition…
Composition
There should be a transition between two states in the new machine if the output of a transition from a state of M1 is the same as the input to a transition from M2, or…
Composition
● δ3((xa,ya), i:o) = (xb,yb) iff
○ There exists c such that
○ δ1(xa, i:c) = xb AND
○ δ2(ya, c:o) = yb
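This definition translates almost line by line into code. The sketch below assumes a toy representation (my choice, not the book's): each machine is a dict mapping (state, input symbol, output symbol) to the next state.

```python
def compose(d1, d2):
    """Build delta3 over state pairs:
    delta3((xa, ya), i:o) = (xb, yb) iff there is a c with
    delta1(xa, i:c) = xb and delta2(ya, c:o) = yb."""
    d3 = {}
    for (xa, i, c), xb in d1.items():
        for (ya, c2, o), yb in d2.items():
            if c == c2:  # M1's output symbol matches M2's input symbol
                d3[((xa, ya), i, o)] = (xb, yb)
    return d3
```

For example, composing a machine with the single transition a:b with one with the single transition b:c yields a machine with the single transition a:c.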
Example using flex
[demonstrated in the class]
● You can use “aspell dump master” or similar commands to dump an aspell dictionary. They are available for many languages!
● Using it, it is possible to construct a finite automaton that recognizes the disjunction of all these words.
Example using flex
Example 1:
/* A flex scanner to recognize dictionary words */
%option noyywrap
%{
#include <stdio.h>
%}

SPACE    [ \t]
NO_SPACE [^ \t]

%%
{SPACE}*      printf("\n");
W             printf("%s: valid\n", yytext);
w             printf("%s: valid\n", yytext);
WW            printf("%s: valid\n", yytext);
WWW           printf("%s: valid\n", yytext);
WY            printf("%s: valid\n", yytext);
(…) // The aspell dump can be copied here
{NO_SPACE}*   printf("%s: invalid\n", yytext);
%%

int main(int argc, char* argv[]) {
    yylex();
    return 0;
}
Example using flex
Example 2: it is possible to create a disjunction of the words.

/* A flex scanner to recognize dictionary words */
%option noyywrap
%{
#include <stdio.h>
%}

SPACE    [ \t]
NO_SPACE [^ \t]

%%
{SPACE}*        printf("\n");
W|w|WW|WWW|...  printf("%s: valid\n", yytext);
{NO_SPACE}*     printf("%s: invalid\n", yytext);
%%

int main(int argc, char* argv[]) {
    yylex();
    return 0;
}
Example using flex
● Compiling and running:
$ flex file.flex
$ gcc lex.yy.c
$ ./a.out
● Why can't we use all the dictionary dump?
○ Flex transforms the regular expression into a non-deterministic finite automaton with one path per dictionary word. With 100,000 words, there are hundreds of thousands of states.
○ Then it makes it deterministic (which usually increases the number of states; remember the algorithm).
○ Finally it minimizes the automaton, making it more compact.
○ During these intermediate calculations, if we use the whole dictionary dump, the number of states is too large for flex to handle (though it should be computationally feasible).
Example using flex
Example 3: instead of recognizing the words, we can lemmatize them.

/* A flex scanner to lemmatize dictionary words */
%option noyywrap
%{
#include <stdio.h>
// Define ChangeSuffix(n, ending) as a function that copies yytext,
// removes n characters from the end, appends "ending" to it and
// prints the result to stdout.
%}

SPACE    [ \t]
NO_SPACE [^ \t]
AlphaNum [a-zA-Z0-9]*

%%
{AlphaNum}men  ChangeSuffix(2, "an");  // men -> man
{AlphaNum}lves ChangeSuffix(3, "f");   // wolves -> wolf
{AlphaNum}ives ChangeSuffix(3, "fe");  // wives -> wife
...
%%

int main(int argc, char* argv[]) {
    yylex();
    return 0;
}
Tokenization
Mr. Sherwood said reaction to Sea Containers' proposal has been “very positive.” In New York Stock Exchange composite trading yesterday, Sea Containers closed at $62.625, up 62.5 cents.
Tokenization
Mr. Sherwood said reaction to Sea Containers' proposal has been “very positive.” In New York Stock Exchange composite trading yesterday, Sea Containers closed at $62.625, up 62.5 cents.
[sequence of alphanumeric symbols]
Tokenization
Mr. Sherwood said reaction to Sea Containers' proposal has been “very positive.” In New York Stock Exchange composite trading yesterday, Sea Containers closed at $62.625, up 62.5 cents.
[separated by whitespaces]
Tokenization
Mr. Sherwood said reaction to
Sea Containers ' proposal has been “ very positive . ” In
New York Stock Exchange composite trading yesterday ,
Sea Containers closed at $62.625 , up 62.5 cents.
[correct?]
Tokenization
● In English, usually solved with simple regular expressions handling:
○ Possessive markers and clitics.
○ Numbers.
○ Abbreviations.
○ Quotations and punctuation.
○ Combined later on with Named Entity recognizers.
● In languages written without whitespace, very different methods are used.
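A minimal sketch of such a regex tokenizer; the patterns and the tiny abbreviation list are illustrative assumptions, not a complete solution:

```python
import re

# Alternatives are tried in order, so abbreviations and numbers come first.
TOKEN_RE = re.compile(r"""
    (?:Mr|Mrs|Dr|etc)\.      # a few hard-coded abbreviations (assumed list)
  | \$?\d+(?:\.\d+)?         # numbers, optionally with $ and a decimal part
  | \w+(?:'\w+)?             # words, keeping clitics like "it's" together
  | 's | '                   # possessive marker / trailing apostrophe
  | [^\w\s]                  # any other single punctuation mark
""", re.VERBOSE)

def tokenize(text):
    return TOKEN_RE.findall(text)
```

On the example sentence this keeps "Mr." and "$62.625" as single tokens while splitting off the possessive apostrophe and the final period.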
Sentence splitting
● Identification of where sentences start and end.
● Some sentence markers are usually unambiguous, e.g. ? or !
● Ambiguity comes from the double meaning of dots (abbreviation marker vs. sentence marker).
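One common way to resolve that ambiguity is a heuristic pass over tokens: a period ends a sentence unless it belongs to a known abbreviation, while ? and ! always do. A sketch (the abbreviation list is an assumption):

```python
# Assumed abbreviation list; a real system would use a much larger one.
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "etc.", "e.g.", "i.e."}

def split_sentences(text):
    """Split whitespace-separated tokens into sentences at unambiguous markers."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith(("?", "!")) or (token.endswith(".") and token not in ABBREVIATIONS):
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing material with no final marker
        sentences.append(" ".join(current))
    return sentences
```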
Spelling correction
● Non-word error detection (e.g. flex example seen before).
● Isolated-word error correction, e.g. graffe → giraffe.
● In-context error correction (examples using a search engine).
Spelling correction
● Isolated word error correction:
○ Three operations allowed:
■ Insertion
■ Deletion
■ Substitution
○ Find the closest dictionary word in terms of edit distance: the number of operations needed to go from the observed non-dictionary word to a dictionary word.
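This distance is computed with the standard minimum edit distance dynamic program; a sketch with unit cost for each of the three operations:

```python
def edit_distance(source, target):
    """Minimum number of insertions, deletions and substitutions
    turning `source` into `target` (Levenshtein distance)."""
    n, m = len(source), len(target)
    # dist[i][j] = cost of turning source[:i] into target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i                            # i deletions
    for j in range(m + 1):
        dist[0][j] = j                            # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # substitution / match
    return dist[n][m]
```

The corrector then proposes the dictionary word minimizing this distance to the observed string, e.g. graffe → giraffe at distance 1.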
Homework
● The next class is part-of-speech tagging. You can read Chapter 5 from the book.
● There are many natural language toolkits on the web (NLTK, Stanford CoreNLP, LingPipe, OpenNLP, etc.):
○ Try downloading one of them and running it on the DUC texts until you are satisfied with the tokenization, lemmatization and sentence splitting output.
○ Alternatively, try writing your own tokenization and sentence splitting module from scratch!