TRANSCRIPT
Speech and Language Processing
Lecture 3: Chapter 3 of SLP
Today
● English Morphology
● Finite-State Transducers (FST)
● Morphology using FST
● Example with flex
● Tokenization, sentence breaking
● Spelling correction, edit distance
Words
● Finite-state methods are particularly useful in dealing with a lexicon.
● Many devices, most with limited memory, need access to large lists of words
● And they need to perform fairly sophisticated tasks with those lists
● So we’ll first talk about some facts about words and then come back to computational methods
English Morphology
● Morphology is the study of the ways that words are built up from smaller meaningful units called morphemes
● We can usefully divide morphemes into two classes:
○ Stems: the core meaning-bearing units
○ Affixes: bits and pieces that adhere to stems to change their meanings and grammatical functions
English Morphology
● We can further divide morphology up into two broad classes:
○ Inflectional
○ Derivational
Word Classes
● By word class, we have in mind familiar notions like noun and verb.
● We’ll go into more details in the next lesson (part-of-speech tagging).
● Right now we’re concerned with word classes because the way that stems and affixes combine is based to a large degree on the word class of the stem
Inflectional Morphology
● Inflectional morphology concerns the combination of stems and affixes where the resulting word:
○ Has the same word class as the original
○ Serves a grammatical/semantic purpose that is
■ Different from the original
■ But is nevertheless transparently related to the original
Nouns and Verbs in English
● English nouns are simple
○ Markers for plural and possessive.
○ More complicated in other languages:
■ Gender (masculine, feminine, neuter, …)
■ Cases (nominative, accusative, vocative, …)
■ Agglutinative languages.
■ Etc.
● Verbs are only slightly more complex
○ Markers appropriate to the tense of the verb.
○ Again, more complicated in other languages.
Regulars and Irregulars
● Things are complicated a little by the fact that some words do not follow the rules:
● For nouns:
○ Rule for plural: add -s (-es, -(i)es, …)
○ Exceptions: mouse/mice, goose/geese, ox/oxen
Regulars and Irregulars
● For verbs:
● Other verbs do not follow the rules, e.g.
■ Go/went, fly/flew
■ Be / was / were / am / are / is / been / being
● The terms regular and irregular are used to refer to words that follow the rules and those that don’t
Morphological class
Base                    walk      merge     map
3rd person              walks     merges    maps
-ing participle         walking   merging   mapping
Past, past participle   walked    merged    mapped
Regular and Irregular Verbs
● Regulars…
○ Walk, walks, walking, walked, walked
● Irregulars
○ Eat, eats, eating, ate, eaten
○ Catch, catches, catching, caught, caught
○ Cut, cuts, cutting, cut, cut
Inflectional Morphology
● So inflectional morphology in English is fairly straightforward.
○ The regular rules are productive: they apply to new words.
■ blogger
■ to fax
● But morphology is complicated by the fact that there are irregularities.
Derivational Morphology
● Derivational morphology is not as straightforward:
○ Quasi-systematicity
○ Irregular meaning change
○ Changes of word class
Derivational Examples
Verbs and Adjectives to Nouns
-ation computerize computerization
-ee appoint appointee
-er kill killer
-ness fuzzy fuzziness
Derivational Examples
Nouns and Verbs to Adjectives
-al computation computational
-able embrace embraceable
-less clue clueless
Example: Compute
● Many paths are possible…
● Start with compute
○ computer → computerize → computerization
○ computer → computerize → computerizable → computerizability
● But not all paths/operations are equally good (allowable?)
○ clue
■ clue → *clueable
Morphology and FSAs
● We’d like to use the machinery provided by FSAs to capture these facts about morphology
○ Accept strings that are in the language.
○ Reject strings that are not.
○ And do so in a way that doesn’t require us to, in effect, list all the words in the language.
Start Simple
● Regular singular nouns are ok
● Regular plural nouns have an -s on the end
● Irregulars are ok as is
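As a rough sketch of these three rules in Python (using a hypothetical toy lexicon, not the lecture's actual word lists):

```python
# Sketch only: a hypothetical toy lexicon stands in for the real word lists.
REGULAR_STEMS = {"cat", "dog", "fox"}
IRREGULAR_FORMS = {"goose", "geese", "mouse", "mice", "ox", "oxen"}

def accepts(word):
    """Accept regular singulars, regular plurals (stem + s), and irregulars as-is."""
    if word in REGULAR_STEMS or word in IRREGULAR_FORMS:
        return True
    # Regular plural rule: strip the final -s and look for a regular stem.
    if word.endswith("s") and word[:-1] in REGULAR_STEMS:
        return True
    return False
```

Note that this naive version rejects "foxes": the e-insertion spelling rule is not handled yet, which is exactly what the intermediate tapes later in the lecture are for.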
Simple Rules
Now Plug in the Words
Derivational Rules
If everything is an accept state, how do things ever get rejected?
Parsing/Generation vs. Recognition
● We can now run strings through these machines to recognize strings in the language
● But recognition is usually not quite what we need
○ Often if we find some string in the language we might like to assign a structure to it (parsing)
○ Or we might have some structure and we want to produce a surface form for it (production/generation)
● Example
○ From “cats” to “cat +N +PL”
Finite State Transducers
● The simple story
○ Add another tape
○ Add extra symbols to the transitions
○ On one tape we read “cats”; on the other we write “cat +N +PL”
FSTs
Applications
● The kind of parsing we’re talking about is normally called morphological analysis
● It can either be
○ An important stand-alone component of many applications (spelling correction, information retrieval)
○ Or simply a link in a chain of further linguistic analysis
Transitions
● c:c means read a c on one tape and write a c on the other
● +N:ε means read a +N symbol on one tape and write nothing on the other
● +PL:s means read +PL and write an s
c:c   a:a   t:t   +N:ε   +PL:s
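Put together, these transitions can be simulated directly. The sketch below uses a toy encoding (my assumption, not the lecture's exact machine): each transition is (state, read, write, next state), with the empty string playing the role of ε.

```python
# Toy FST for "cats" <-> "cat +N +PL"; "" stands in for epsilon on the write side.
TRANSITIONS = [
    (0, "c", "c", 1),
    (1, "a", "a", 2),
    (2, "t", "t", 3),
    (3, "+N", "", 4),
    (4, "+PL", "s", 5),
]
ACCEPT = {5}

def transduce(symbols):
    """Read a sequence of input symbols; return the output tape, or None on rejection."""
    state, out = 0, []
    for sym in symbols:
        for (s, read, write, nxt) in TRANSITIONS:
            if s == state and read == sym:
                out.append(write)
                state = nxt
                break
        else:
            return None  # no transition matched: reject
    return "".join(out) if state in ACCEPT else None
```

Running the machine in the other direction (swapping the read and write fields) would parse "cats" into "cat +N +PL" instead of generating it.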
Typical Uses
● Typically, we’ll read from one tape using the first symbol on the machine transitions (just as in a simple FSA).
● And we’ll write to the second tape using the other symbols on the transitions.
Ambiguity
● Recall that in non-deterministic recognition multiple paths through a machine may lead to an accept state.
○ It didn’t matter which path was actually traversed
● In FSTs the path to an accept state does matter, since different paths represent different parses and different outputs will result.
Ambiguity
● What’s the right parse (segmentation) for
○ Unionizable
○ Union-ize-able
○ Un-ion-ize-able
● Each represents a valid path through the derivational morphology machine.
Ambiguity
● There are a number of ways to deal with this problem
○ Simply take the first output found
○ Find all the possible outputs (all paths) and return them all (without choosing)
○ Bias the search so that only one or a few likely paths are explored
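The "find all outputs" option can be sketched as an exhaustive search over segmentations. The morpheme list below is an illustrative assumption (with "izable" standing in for ize+able after e-deletion, since this sketch has no spelling rules):

```python
# Assumed toy morpheme inventory, not the lecture's derivational machine.
MORPHEMES = {"un", "union", "ion", "izable"}

def segmentations(word, prefix=()):
    """Return every way of covering `word` with morphemes, as tuples."""
    if word == "":
        return [prefix]
    results = []
    for i in range(1, len(word) + 1):
        if word[:i] in MORPHEMES:
            # This prefix is a morpheme: recurse on the remainder.
            results += segmentations(word[i:], prefix + (word[:i],))
    return results
```

For "unionizable" this returns both paths from the slide, union-izable and un-ion-izable, without choosing between them.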
Some more details
● Of course, it’s not as easy as “cat +N +PL” <-> “cats”
● As we saw earlier there are geese, mice and oxen
● But there are also a whole host of spelling/pronunciation changes that go along with inflectional changes
○ cats vs. dogs
○ fox and foxes
Multi-Tape Machines
● To deal with these complications, we will add more tapes and use the output of one tape machine as the input to the next
● So to handle irregular spelling changes we’ll add intermediate tapes with intermediate symbols
Multi-Level Tape Machines
We use one machine to transduce between the lexical and the intermediate level, and another to handle the spelling changes to the surface tape
Lexical to Intermediate Level
Intermediate to Surface
● The “add an e” rule, as in fox^s# <-> foxes#
Foxes
Note
● A key feature of this machine is that it doesn’t do anything to inputs to which it doesn’t apply.
● Meaning that they are written out unchanged to the output tape.
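A sketch of this intermediate-to-surface step in Python, using the boundary symbols from the slides (^ for morpheme boundary, # for word boundary); restricting the rule to stems ending in x, s, or z is my assumption:

```python
import re

def e_insertion(intermediate):
    """Insert 'e' before ^s when the stem ends in x, s, or z (assumed context)."""
    return re.sub(r"([xsz])\^s#", r"\1es#", intermediate)

def to_surface(intermediate):
    """Apply the spelling rule, then strip the boundary symbols."""
    return e_insertion(intermediate).replace("^", "").replace("#", "")
```

Inputs the rule does not apply to, such as cat^s#, pass through unchanged, matching the behavior described above.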
Overall Scheme
● We now have one FST that has explicit information about the lexicon (actual words, their spelling, facts about word classes and regularity).
○ Lexical level to intermediate forms
● We have a larger set of machines that capture orthographic/spelling rules.
○ Intermediate forms to surface forms
Overall Scheme
Cascades
● This is an architecture that we’ll see again and again
○ Overall processing is divided up into distinct rewrite steps
○ The output of one layer serves as the input to the next
○ The intermediate tapes may or may not wind up being useful in their own right
Composition
1. Create a set of new states that correspond to each pair of states from the original machines (New states are called (x,y), where x is a state from M1, and y is a state from M2)
2. Create a new FST transition table for the new machine according to the following intuition…
Composition
There should be a transition between two states in the new machine if the output of a transition from a state of M1 is the same as the input to a transition from M2, or…
Composition
● δ3((xa,ya), i:o) = (xb,yb) iff
○ There exists c such that
○ δ1(xa, i:c) = xb AND
○ δ2(ya, c:o) = yb
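This definition translates almost line by line into code. The sketch below assumes a toy representation (my choice, not the book's): each machine is a dict mapping (state, input symbol, output symbol) to the next state.

```python
def compose(d1, d2):
    """Build delta3 over state pairs:
    delta3((xa, ya), i:o) = (xb, yb) iff there is a c with
    delta1(xa, i:c) = xb and delta2(ya, c:o) = yb."""
    d3 = {}
    for (xa, i, c), xb in d1.items():
        for (ya, c2, o), yb in d2.items():
            if c == c2:  # M1's output symbol matches M2's input symbol
                d3[((xa, ya), i, o)] = (xb, yb)
    return d3
```

For example, composing a machine with the single transition a:b with one with the single transition b:c yields a machine with the single transition a:c.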
Example using flex
[demonstrated in the class]
● You can use “aspell dump master” or similar commands to dump an aspell dictionary. They are available for many languages!
● Using it, it is possible to construct a finite automaton that recognizes the disjunction of all these words.
Example using flex
Example 1:
/* A flex scanner to recognize dictionary words */
%option noyywrap
%{
#include <stdio.h>
%}

SPACE    [ \t]
NO_SPACE [^ \t]

%%
{SPACE}*      printf("\n");
W             printf("%s: valid\n", yytext);
w             printf("%s: valid\n", yytext);
WW            printf("%s: valid\n", yytext);
WWW           printf("%s: valid\n", yytext);
WY            printf("%s: valid\n", yytext);
(…) // The aspell dump can be copied here
{NO_SPACE}*   printf("%s: invalid\n", yytext);
%%

int main(int argc, char* argv[]) {
    yylex();
    return 0;
}
Example using flex
Example 2: it is possible to create a disjunction of the words.

/* A flex scanner to recognize dictionary words */
%option noyywrap
%{
#include <stdio.h>
%}

SPACE    [ \t]
NO_SPACE [^ \t]

%%
{SPACE}*        printf("\n");
W|w|WW|WWW|...  printf("%s: valid\n", yytext);
{NO_SPACE}*     printf("%s: invalid\n", yytext);
%%

int main(int argc, char* argv[]) {
    yylex();
    return 0;
}
Example using flex
● Compiling and running:
$ flex file.flex
$ gcc lex.yy.c
$ ./a.out
● Why can't we use all the dictionary dump?
○ Flex transforms the regular expression into a non-deterministic finite automaton with one path per dictionary word. With 100,000 words, there are hundreds of thousands of states.
○ Then it makes it deterministic (which usually increases the number of states; remember the algorithm).
○ Finally it minimizes the automaton, making it more compact.
○ During these intermediate calculations, if we use the whole dictionary dump, the number of states is too large for flex to handle (though it should be computationally feasible).
Example using flex
Example 3: instead of recognizing the words, we can lemmatize them.

/* A flex scanner to lemmatize dictionary words */
%option noyywrap
%{
#include <stdio.h>
// Define ChangeSuffix(n, ending) as a function that copies yytext,
// removes n characters from the end, appends "ending" to it and
// prints the result to stdout.
%}

SPACE    [ \t]
NO_SPACE [^ \t]
AlphaNum [a-zA-Z0-9]*

%%
{AlphaNum}men  ChangeSuffix(2, "an");  // men -> man
{AlphaNum}lves ChangeSuffix(3, "f");   // wolves -> wolf
{AlphaNum}ives ChangeSuffix(3, "fe");  // wives -> wife
...
%%

int main(int argc, char* argv[]) {
    yylex();
    return 0;
}
Tokenization
Mr. Sherwood said reaction to Sea Containers' proposal has been “very positive.” In New York Stock Exchange composite trading yesterday, Sea Containers closed at $62.625, up 62.5 cents.
Tokenization
Mr. Sherwood said reaction to Sea Containers' proposal has been “very positive.” In New York Stock Exchange composite trading yesterday, Sea Containers closed at $62.625, up 62.5 cents.
[sequence of alphanumeric symbols]
Tokenization
Mr. Sherwood said reaction to Sea Containers' proposal has been “very positive.” In New York Stock Exchange composite trading yesterday, Sea Containers closed at $62.625, up 62.5 cents.
[separated by whitespaces]
Tokenization
Mr. Sherwood said reaction to
Sea Containers ' proposal has been “ very positive . ” In
New York Stock Exchange composite trading yesterday ,
Sea Containers closed at $62.625 , up 62.5 cents.
[correct?]
Tokenization
● In English, usually solved with simple regular expressions handling:
○ Possessive markers and clitics.
○ Numbers.
○ Abbreviations.
○ Quotations and punctuation.
○ Combined later on with Named Entity recognizers.
● In languages written without whitespace, very different methods are used.
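A minimal sketch of such a regex tokenizer; the patterns and the tiny abbreviation list are illustrative assumptions, not a complete solution:

```python
import re

# Alternatives are tried in order, so abbreviations and numbers come first.
TOKEN_RE = re.compile(r"""
    (?:Mr|Mrs|Dr|etc)\.      # a few hard-coded abbreviations (assumed list)
  | \$?\d+(?:\.\d+)?         # numbers, optionally with $ and a decimal part
  | \w+(?:'\w+)?             # words, keeping clitics like "it's" together
  | 's | '                   # possessive marker / trailing apostrophe
  | [^\w\s]                  # any other single punctuation mark
""", re.VERBOSE)

def tokenize(text):
    return TOKEN_RE.findall(text)
```

On the example sentence this keeps "Mr." and "$62.625" as single tokens while splitting off the possessive apostrophe and the final period.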
Sentence splitting
● Identification of where sentences start and end.
● Some sentence markers are usually unambiguous, e.g. ? or !
● Ambiguity comes from the double meaning of dots (abbreviation marker vs. sentence marker).
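One common way to resolve that ambiguity is a heuristic pass over tokens: a period ends a sentence unless it belongs to a known abbreviation, while ? and ! always do. A sketch (the abbreviation list is an assumption):

```python
# Assumed abbreviation list; a real system would use a much larger one.
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "etc.", "e.g.", "i.e."}

def split_sentences(text):
    """Split whitespace-separated tokens into sentences at unambiguous markers."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith(("?", "!")) or (token.endswith(".") and token not in ABBREVIATIONS):
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing material with no final marker
        sentences.append(" ".join(current))
    return sentences
```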
Spelling correction
● Non-word error detection (e.g. flex example seen before).
● Isolated-word error correction, e.g. graffe → giraffe.
● In-context error correction (examples using a search engine).
Spelling correction
● Isolated word error correction:
○ Three operations allowed:
■ Insertion
■ Deletion
■ Substitution
○ Find the closest dictionary word in terms of edit distance: the number of operations needed to go from the observed non-dictionary word to a dictionary word.
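This distance is computed with the standard minimum edit distance dynamic program; a sketch with unit cost for each of the three operations:

```python
def edit_distance(source, target):
    """Minimum number of insertions, deletions and substitutions
    turning `source` into `target` (Levenshtein distance)."""
    n, m = len(source), len(target)
    # dist[i][j] = cost of turning source[:i] into target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i                            # i deletions
    for j in range(m + 1):
        dist[0][j] = j                            # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # substitution / match
    return dist[n][m]
```

The corrector then proposes the dictionary word minimizing this distance to the observed string, e.g. graffe → giraffe at distance 1.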
Homework
● The next class is part-of-speech tagging. You can read Chapter 5 from the book.
● There are many natural language toolkits on the web (NLTK, Stanford CoreNLP, LingPipe, OpenNLP, etc.):
○ Try downloading one of them and running it on the DUC texts until you are satisfied with the tokenization, lemmatization and sentence splitting output.
○ Alternatively, try writing your own tokenization and sentence splitting module from scratch!