speech and language processing - systems group · notions like noun and verb. ... regulars and...

57
Speech and Language Processing Lecture 3 Chapter 3 of SLP

Upload: lamkhue

Post on 19-Sep-2018

220 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Speech and Language Processing - Systems Group · notions like noun and verb. ... Regulars and Irregulars For verbs: Other verbs do not follow the rules, e.g. ... remember the algorithm)

Speech and Language Processing

Lecture 3Chapter 3 of SLP

Page 2: Speech and Language Processing - Systems Group · notions like noun and verb. ... Regulars and Irregulars For verbs: Other verbs do not follow the rules, e.g. ... remember the algorithm)

Today

● English Morphology● Finite-State Transducers (FST)● Morphology using FST● Example with flex● Tokenization, sentence breaking● Spelling correction, edit distance

Page 3: Speech and Language Processing - Systems Group · notions like noun and verb. ... Regulars and Irregulars For verbs: Other verbs do not follow the rules, e.g. ... remember the algorithm)

Words

● Finite-state methods are particularly useful in dealing with a lexicon.

● Many devices, most with limited memory, need access to large lists of words

● And they need to perform fairly sophisticated tasks with those lists

● So we’ll first talk about some facts about words and then come back to computational methods

Page 4: Speech and Language Processing - Systems Group · notions like noun and verb. ... Regulars and Irregulars For verbs: Other verbs do not follow the rules, e.g. ... remember the algorithm)

English Morphology

● Morphology is the study of the ways that words are built up from smaller meaningful units called morphemes

● We can usefully divide morphemes into two classes○ Stems: The core meaning-bearing units○ Affixes: Bits and pieces that adhere to stems to

change their meanings and grammatical functions

Page 5: Speech and Language Processing - Systems Group · notions like noun and verb. ... Regulars and Irregulars For verbs: Other verbs do not follow the rules, e.g. ... remember the algorithm)

English Morphology

● We can further divide morphology up into two broad classes:

○ Inflectional○ Derivational

Page 6: Speech and Language Processing - Systems Group · notions like noun and verb. ... Regulars and Irregulars For verbs: Other verbs do not follow the rules, e.g. ... remember the algorithm)

Word Classes

● By word class, we have in mind familiar notions like noun and verb.

● We’ll go into more details in the next lesson (part-of-speech tagging).

● Right now we’re concerned with word classes because the way that stems and affixes combine is based to a large degree on the word class of the stem

Page 7: Speech and Language Processing - Systems Group · notions like noun and verb. ... Regulars and Irregulars For verbs: Other verbs do not follow the rules, e.g. ... remember the algorithm)

Inflectional Morphology

● Inflectional morphology concerns the combination of stems and affixes where the resulting word:○ Has the same word class as the original○ Serves a grammatical/semantic purpose that is

■ Different from the original■ But is nevertheless transparently related to the

original

Page 8: Speech and Language Processing - Systems Group · notions like noun and verb. ... Regulars and Irregulars For verbs: Other verbs do not follow the rules, e.g. ... remember the algorithm)

Nouns and Verbs in English

● English nouns are simple○ Markers for plural and possessive.○ More complicated in other languages:

■ Gender (masculine, femenine, neutral, …)■ Cases (nominative, accusative, vocative, …)■ Agglutinative languages.■ Etc.

● Verbs are only slightly more complex○ Markers appropriate to the tense of the verb.○ Again, more complicated in other languages.

Page 9: Speech and Language Processing - Systems Group · notions like noun and verb. ... Regulars and Irregulars For verbs: Other verbs do not follow the rules, e.g. ... remember the algorithm)

Regulars and Irregulars

● It is a little complicated by the fact that some words do not follow the rules:

● For nouns:○ Rule for plural: add -s (-es, -(i)es, ...)○ Exceptions: Mouse/mice, goose/geese,

ox/oxen

Page 10: Speech and Language Processing - Systems Group · notions like noun and verb. ... Regulars and Irregulars For verbs: Other verbs do not follow the rules, e.g. ... remember the algorithm)

Regulars and Irregulars

● For verbs:

● Other verbs do not follow the rules, e.g.■ Go/went, fly/flew■ Be / was / were / am / are / is / been / being

● The terms regular and irregular are used to refer to words that follow the rules and those that don’t

Morphological class

Base walk merge map3rd. person walks merges maps-ing participle walking merging mappingPast, Past participle walked merged mapped

Page 11: Speech and Language Processing - Systems Group · notions like noun and verb. ... Regulars and Irregulars For verbs: Other verbs do not follow the rules, e.g. ... remember the algorithm)

Regular and Irregular Verbs

● Regulars…○ Walk, walks, walking, walked, walked

● Irregulars○ Eat, eats, eating, ate, eaten○ Catch, catches, catching, caught, caught○ Cut, cuts, cutting, cut, cut

Page 12: Speech and Language Processing - Systems Group · notions like noun and verb. ... Regulars and Irregulars For verbs: Other verbs do not follow the rules, e.g. ... remember the algorithm)

Inflectional Morphology

● So inflectional morphology in English is fairly straightforward.○ The regular words are productive, they apply

to new words.■ blogger■ To fax

● But morphology complicated by the fact that are irregularities

Page 13: Speech and Language Processing - Systems Group · notions like noun and verb. ... Regulars and Irregulars For verbs: Other verbs do not follow the rules, e.g. ... remember the algorithm)

Derivational Morphology

● Derivational morphology is not as straightforward:○ Quasi-systematicity○ Irregular meaning change○ Changes of word class

Page 14: Speech and Language Processing - Systems Group · notions like noun and verb. ... Regulars and Irregulars For verbs: Other verbs do not follow the rules, e.g. ... remember the algorithm)

Derivational Examples

Verbs and Adjectives to Nouns

-ation computerize computerization

-ee appoint appointee

-er kill killer

-ness fuzzy fuzziness

Page 15: Speech and Language Processing - Systems Group · notions like noun and verb. ... Regulars and Irregulars For verbs: Other verbs do not follow the rules, e.g. ... remember the algorithm)

Derivational Examples

Nouns and Verbs to Adjectives

-al computation computational

-able embrace embraceable

-less clue clueless

Page 16: Speech and Language Processing - Systems Group · notions like noun and verb. ... Regulars and Irregulars For verbs: Other verbs do not follow the rules, e.g. ... remember the algorithm)

Example: Compute

● Many paths are possible…

● Start with compute○ Computer → computerize → computerization

○ Computer → computerize → computerizable → computerizability

● But not all paths/operations are equally good (allowable?)○ Clue

■ Clue -> *clueable

Page 17: Speech and Language Processing - Systems Group · notions like noun and verb. ... Regulars and Irregulars For verbs: Other verbs do not follow the rules, e.g. ... remember the algorithm)

Morpholgy and FSAs

● We’d like to use the machinery provided by FSAs to capture these facts about morphology○ Accept strings that are in the language.○ Reject strings that are not.○ And do so in a way that doesn’t require us to

in effect list all the words in the language.

Page 18: Speech and Language Processing - Systems Group · notions like noun and verb. ... Regulars and Irregulars For verbs: Other verbs do not follow the rules, e.g. ... remember the algorithm)

Start Simple

● Regular singular nouns are ok● Regular plural nouns have an -s on the

end● Irregulars are ok as is

Page 19: Speech and Language Processing - Systems Group · notions like noun and verb. ... Regulars and Irregulars For verbs: Other verbs do not follow the rules, e.g. ... remember the algorithm)

Simple Rules

Page 20: Speech and Language Processing - Systems Group · notions like noun and verb. ... Regulars and Irregulars For verbs: Other verbs do not follow the rules, e.g. ... remember the algorithm)

Now Plug in the Words

Page 21: Speech and Language Processing - Systems Group · notions like noun and verb. ... Regulars and Irregulars For verbs: Other verbs do not follow the rules, e.g. ... remember the algorithm)

Derivational Rules

If everything is an accept state how do things ever get rejected?

Page 22: Speech and Language Processing - Systems Group · notions like noun and verb. ... Regulars and Irregulars For verbs: Other verbs do not follow the rules, e.g. ... remember the algorithm)

Parsing/Generation vs. Recognition

● We can now run strings through these machines to recognize strings in the language

● But recognition is usually not quite what we need ○ Often if we find some string in the language we might

like to assign a structure to it (parsing)○ Or we might have some structure and we want to

produce a surface form for it (production/generation)

● Example○ From “cats” to “cat +N +PL”

Page 23: Speech and Language Processing - Systems Group · notions like noun and verb. ... Regulars and Irregulars For verbs: Other verbs do not follow the rules, e.g. ... remember the algorithm)

Finite State Transducers

● The simple story○ Add another tape○ Add extra symbols to the transitions

○ On one tape we read “cats”, on the other we write “cat +N +PL”

Page 24: Speech and Language Processing - Systems Group · notions like noun and verb. ... Regulars and Irregulars For verbs: Other verbs do not follow the rules, e.g. ... remember the algorithm)

FSTs

Page 25: Speech and Language Processing - Systems Group · notions like noun and verb. ... Regulars and Irregulars For verbs: Other verbs do not follow the rules, e.g. ... remember the algorithm)

Applications

● The kind of parsing we’re talking about is normally called morphological analysis

● It can either be ○ An important stand-alone component of

many applications (spelling correction, information retrieval)

○ Or simply a link in a chain of further linguistic analysis

Page 26: Speech and Language Processing - Systems Group · notions like noun and verb. ... Regulars and Irregulars For verbs: Other verbs do not follow the rules, e.g. ... remember the algorithm)

Transitions

● c:c means read a c on one tape and write a c on the other

● +N:ε means read a +N symbol on one tape and write nothing on the other

● +PL:s means read +PL and write an s

c:c a:a t:t+N: ε +PL:s

Page 27: Speech and Language Processing - Systems Group · notions like noun and verb. ... Regulars and Irregulars For verbs: Other verbs do not follow the rules, e.g. ... remember the algorithm)

Typical Uses

● Typically, we’ll read from one tape using the first symbol on the machine transitions (just as in a simple FSA).

● And we’ll write to the second tape using the other symbols on the transitions.

Page 28: Speech and Language Processing - Systems Group · notions like noun and verb. ... Regulars and Irregulars For verbs: Other verbs do not follow the rules, e.g. ... remember the algorithm)

Ambiguity

● Recall that in non-deterministic recognition multiple paths through a machine may lead to an accept state.○ Didn’t matter which path was actually

traversed● In FSTs the path to an accept state does

matter since different paths represent different parses and different outputs will result.

Page 29: Speech and Language Processing - Systems Group · notions like noun and verb. ... Regulars and Irregulars For verbs: Other verbs do not follow the rules, e.g. ... remember the algorithm)

Ambiguity

● What’s the right parse (segmentation) for○ Unionizable○ Union-ize-able○ Un-ion-ize-able

● Each represents a valid path through the derivational morphology machine.

Page 30: Speech and Language Processing - Systems Group · notions like noun and verb. ... Regulars and Irregulars For verbs: Other verbs do not follow the rules, e.g. ... remember the algorithm)

Ambiguity

● There are a number of ways to deal with this problem○ Simply take the first output found○ Find all the possible outputs (all paths) and

return them all (without choosing)○ Bias the search so that only one or a few

likely paths are explored

Page 31: Speech and Language Processing - Systems Group · notions like noun and verb. ... Regulars and Irregulars For verbs: Other verbs do not follow the rules, e.g. ... remember the algorithm)

Some more details

● Of course, its not as easy as “cat +N +PL” <-> “cats”

● As we saw earlier there are geese, mice and oxen

● But there are also a whole host of spelling/pronunciation changes that go along with inflectional changesCats vs DogsFox and Foxes

Page 32: Speech and Language Processing - Systems Group · notions like noun and verb. ... Regulars and Irregulars For verbs: Other verbs do not follow the rules, e.g. ... remember the algorithm)

Multi-Tape Machines

● To deal with these complications, we will add more tapes and use the output of one tape machine as the input to the next

● So to handle irregular spelling changes we’ll add intermediate tapes with intermediate symbols

Page 33: Speech and Language Processing - Systems Group · notions like noun and verb. ... Regulars and Irregulars For verbs: Other verbs do not follow the rules, e.g. ... remember the algorithm)

Multi-Level Tape Machines

We use one machine to transduce between the lexical and the intermediate level, and another to handle the spelling changes to the surface tape

Page 34: Speech and Language Processing - Systems Group · notions like noun and verb. ... Regulars and Irregulars For verbs: Other verbs do not follow the rules, e.g. ... remember the algorithm)

Lexical to Intermediate Level

Page 35: Speech and Language Processing - Systems Group · notions like noun and verb. ... Regulars and Irregulars For verbs: Other verbs do not follow the rules, e.g. ... remember the algorithm)

Intermediate to Surface

●The add an “e” rule as in fox^s# <-> foxes#

Page 36: Speech and Language Processing - Systems Group · notions like noun and verb. ... Regulars and Irregulars For verbs: Other verbs do not follow the rules, e.g. ... remember the algorithm)

Foxes

Page 37: Speech and Language Processing - Systems Group · notions like noun and verb. ... Regulars and Irregulars For verbs: Other verbs do not follow the rules, e.g. ... remember the algorithm)

Note

● A key feature of this machine is that it doesn’t do anything to inputs to which it doesn’t apply.

● Meaning that they are written out unchanged to the output tape.

Page 38: Speech and Language Processing - Systems Group · notions like noun and verb. ... Regulars and Irregulars For verbs: Other verbs do not follow the rules, e.g. ... remember the algorithm)

Overall Scheme

● We now have one FST that has explicit information about the lexicon (actual words, their spelling, facts about word classes and regularity).○ Lexical level to intermediate forms

● We have a larger set of machines that capture orthographic/spelling rules.○ Intermediate forms to surface forms

Page 39: Speech and Language Processing - Systems Group · notions like noun and verb. ... Regulars and Irregulars For verbs: Other verbs do not follow the rules, e.g. ... remember the algorithm)

Overall Scheme

Page 40: Speech and Language Processing - Systems Group · notions like noun and verb. ... Regulars and Irregulars For verbs: Other verbs do not follow the rules, e.g. ... remember the algorithm)

Cascades

● This is an architecture that we’ll see again and again○ Overall processing is divided up into distinct

rewrite steps○ The output of one layer serves as the input

to the next○ The intermediate tapes may or may not wind

up being useful in their own right

Page 41: Speech and Language Processing - Systems Group · notions like noun and verb. ... Regulars and Irregulars For verbs: Other verbs do not follow the rules, e.g. ... remember the algorithm)

Composition

1. Create a set of new states that correspond to each pair of states from the original machines (New states are called (x,y), where x is a state from M1, and y is a state from M2)

2. Create a new FST transition table for the new machine according to the following intuition…

Page 42: Speech and Language Processing - Systems Group · notions like noun and verb. ... Regulars and Irregulars For verbs: Other verbs do not follow the rules, e.g. ... remember the algorithm)

Composition

There should be a transition between two states in the new machine if it’s the case that the output for a transition from a state from M1, is the same as the input to a transition from M2 or…

Page 43: Speech and Language Processing - Systems Group · notions like noun and verb. ... Regulars and Irregulars For verbs: Other verbs do not follow the rules, e.g. ... remember the algorithm)

Composition

● δ3((xa,ya), i:o) = (xb,yb) iff○ There exists c such that○ δ1(xa, i:c) = xb AND○ δ2(ya, c:o) = yb

Page 44: Speech and Language Processing - Systems Group · notions like noun and verb. ... Regulars and Irregulars For verbs: Other verbs do not follow the rules, e.g. ... remember the algorithm)

Example using flex

[demonstrated in the class]

● You can use “aspell dump master” or similar commands to dump an aspell dictionary. They are available for many languages!

● Using it, it is possible to construct a finite automaton that recognizes the disjunction of all these words.

Page 45: Speech and Language Processing - Systems Group · notions like noun and verb. ... Regulars and Irregulars For verbs: Other verbs do not follow the rules, e.g. ... remember the algorithm)

Example using flexExample:

/* A flex scanner to recognize dictionary words */%option noyywrap %{#include <stdio.h>%}

SPACE [ \t]NO_SPACE [^ \t]%%

{SPACE}* printf("\n");W printf("%s: valid\n", yytext);w printf("%s: valid\n", yytext);WW printf("%s: valid\n", yytext);WWW printf("%s: valid\n", yytext);WY printf("%s: valid\n", yytext);(…) // The aspell dump can be copied here{NO_SPACE}* printf("%s: invalid\n", yytext);%%

void main(int argc, char* argv[]) { yylex();}

Page 46: Speech and Language Processing - Systems Group · notions like noun and verb. ... Regulars and Irregulars For verbs: Other verbs do not follow the rules, e.g. ... remember the algorithm)

Example using flexExample 2: it is possible to create a disjunction of the words

/* A flex scanner to recognize dictionary words */%option noyywrap %{#include <stdio.h>%}

SPACE [ \t]NO_SPACE [^ \t]%%

{SPACE}* printf("\n");W|w|WW|WWW|... printf("%s: valid\n", yytext);{NO_SPACE}* printf("%s: invalid\n", yytext);%%

void main(int argc, char* argv[]) { yylex();}

Page 47: Speech and Language Processing - Systems Group · notions like noun and verb. ... Regulars and Irregulars For verbs: Other verbs do not follow the rules, e.g. ... remember the algorithm)

Example using flex

● Compiling and running:

$ flex file.flex$ gcc lex.yy.c$ ./a.out

● Why can't we use all the dictionary dump?○ Flex transforms the regular expression into a Non-deterministic

finite automata with one path per dictionary words. With 100,000 words, there are hundreds of thousands of states.

○ Then it makes it deterministic (which usually makes the number of states bigger, remember the algorithm)

○ Finally it minimizes the automaton making it more compact.○ During these intermediate calculations, if we use the whole

dictionary dump, the number of states is too large for flex to handle (but it should be computationally feasible).

Page 48: Speech and Language Processing - Systems Group · notions like noun and verb. ... Regulars and Irregulars For verbs: Other verbs do not follow the rules, e.g. ... remember the algorithm)

Example using flexExample 3: instead of recognizing the words, we can lemmatize them.

/* A flex scanner to recognize dictionary words */%option noyywrap %{#include <stdio.h>

// Define ChangeSuffix(n1, ending) as a function that copies yytext, removes n1 characters from the end, appends “ending” to it and prints the result to stdout.%}

SPACE [ \t]NO_SPACE [^ \t]AlphaNum [a-zA-Z0-9]*%%{AlphaNum}men ChangeSuffix(2, “an”); // men → man{AlphaNum}lves ChangeSuffix(3, “f”); // wolves → wolf{AlphaNum}ives ChangeSuffix(3, “fe”); // wives → wife...%%

void main(int argc, char* argv[]) { yylex();}

Page 49: Speech and Language Processing - Systems Group · notions like noun and verb. ... Regulars and Irregulars For verbs: Other verbs do not follow the rules, e.g. ... remember the algorithm)

Tokenization

Mr. Sherwood said reaction to Sea Containers' proposal has been “very positive.” In New York Stock Exchange composite trading yesterday, Sea Containers closed at $62.625, up 62.5 cents.

Page 50: Speech and Language Processing - Systems Group · notions like noun and verb. ... Regulars and Irregulars For verbs: Other verbs do not follow the rules, e.g. ... remember the algorithm)

Tokenization

Mr. Sherwood said reaction to Sea Containers' proposal has been “very positive.” In New York Stock Exchange composite trading yesterday, Sea Containers closed at $62.625, up 62.5 cents.

[sequence of alphanumeric symbols]

Page 51: Speech and Language Processing - Systems Group · notions like noun and verb. ... Regulars and Irregulars For verbs: Other verbs do not follow the rules, e.g. ... remember the algorithm)

Tokenization

Mr. Sherwood said reaction to Sea Containers' proposal has been “very positive.” In New York Stock Exchange composite trading yesterday, Sea Containers closed at $62.625, up 62.5 cents.

[separated by whitespaces]

Page 52: Speech and Language Processing - Systems Group · notions like noun and verb. ... Regulars and Irregulars For verbs: Other verbs do not follow the rules, e.g. ... remember the algorithm)

Tokenization

Mr. Sherwood said reaction to

Sea Containers ' proposal has been “ very positive . ” In

New York Stock Exchange composite trading yesterday ,

Sea Containers closed at $62.625 , up 62.5 cents.

[correct?]

Page 53: Speech and Language Processing - Systems Group · notions like noun and verb. ... Regulars and Irregulars For verbs: Other verbs do not follow the rules, e.g. ... remember the algorithm)

Tokenization

● In English, usually solved with simple regular expressions handling:○ Possessive markers and clitics.○ Numbers.○ Abbreviations.○ Quotations and punctuation.○ Combined later on with Named Entity

recognizers.

● In languages without whitespaces, very different methods are used.

Page 54: Speech and Language Processing - Systems Group · notions like noun and verb. ... Regulars and Irregulars For verbs: Other verbs do not follow the rules, e.g. ... remember the algorithm)

Sentence splitting

● Identification of where sentences start and end.

● Some sentence markers are usually unambiguous, e.g. ? or !

● Ambiguity comes from the double meaning of dots (abbreviation marker vs. sentence marker).

Page 55: Speech and Language Processing - Systems Group · notions like noun and verb. ... Regulars and Irregulars For verbs: Other verbs do not follow the rules, e.g. ... remember the algorithm)

Spelling correction

● Non-word error detection (e.g. flex example seen before).

● Isolated-word error correction, e.g. graffe → giraffe.

● In-context error correction (examples using a search engine)

Page 56: Speech and Language Processing - Systems Group · notions like noun and verb. ... Regulars and Irregulars For verbs: Other verbs do not follow the rules, e.g. ... remember the algorithm)

Spelling correction

● Isolated word error correction:○ Three operations allowed:

■ Insertion■ Deletion■ Substitution

○ Find the closest dictionary word in terms of edit distance operations needed to go from the observed non-dictionary word to a dictionary word.

Page 57: Speech and Language Processing - Systems Group · notions like noun and verb. ... Regulars and Irregulars For verbs: Other verbs do not follow the rules, e.g. ... remember the algorithm)

Homework

● The next class is part-of-speech tagging. You can read Chapter 5 from the book.

● There are many natural language toolkits on the web (NLTK, Stanford CoreNLP, LingPipe, OpenNLP, etc.):○ Try downloading one of them and running it on the

DUC texts until you are satisfied with the tokenization, lemmatization and sentence splitting output.

○ Alternatively, try writing your own tokenization and sentence splitting module from scratch!