fall 2004 lecture notes #2 eecs 595 / ling 541 / si 661 natural language processing

47
Fall 2004 Lecture Notes #2 EECS 595 / LING 541 / SI 661 Natural Language Processing

Upload: leo-jonah-white

Post on 29-Dec-2015

213 views

Category:

Documents


0 download

TRANSCRIPT

Fall 2004

Lecture Notes #2

EECS 595 / LING 541 / SI 661

Natural Language Processing

Course logistics• Instructor: Prof. Dragomir Radev ([email protected])• Class times: Tu 1:10-3:55 PM, in 412, WH• Office hours: M 10-11, Tu 11-12 in 3080, WH

http://www.si.umich.edu/~radev/NLP-fall2004

Home page:

Regular Expressions andAutomata

Regular expressions

• Searching for “woodchuck”

• Searching for “woodchucks” with an optional final “s”

• Regular expressions

• Finite-state automata (singular: automaton)

Regular expressions• Basic regular expression patterns• Perl-based syntax (slightly different from other notations for regular expressions)• Disjunctions [abc]• Ranges [A-Z]• Negations [^Ss]• Optional characters ? and *• Wild cards .• Anchors ^ and $, also \b and \B• Disjunction, grouping, and precedence |

fig02.01.pdf

fig02.02.pdf

fig02.03.pdf

fig02.04.pdf

fig02.05.pdf

Writing correct expressions

• Exercise: write a Perl regular expression to match the English article “the”:

/the//[tT]he//\b[tT]he\b//[^a-zA-Z][tT]he[^a-zA-Z]//(^|[^a-zA-Z])[tT]he[^a-zA-Z]/

A more complex example

• Exercise: Write a regular expression that will match “any PC with more than 500MHz and 32 Gb of disk space for less than $1000”:

/$[0-9]+//$[0-9]+\.[0-9][0-9]//\b$[0-9]+(\.[0-9][0-9])?\b//\b[0-9]+ *(MHz|[Mm]egahertz|Ghz| [Gg]igahertz)\b//\b[0-9]+ *(Mb|[Mm]egabytes?)\b//\b[0-9](\.[0-9]+) *(Gb|[Gg]igabytes?)\b/

Advanced operators

fig02.06.pdf fig02.07.pdf fig02.08.pdf

Substitutions and memory

• Substitutions

• Memory (\1, \2, etc. refer back to matches)

s/colour/color/

s/([0-9]+)/<\1>/

Eliza [Weizenbaum, 1966]

User: Men are all alike

ELIZA: IN WHAT WAY

User: They’re always bugging us about something or other

ELIZA: CAN YOU THINK OF A SPECIFIC EXAMPLE?

User: Well, my boyfriend made me come here

ELIZA: YOUR BOYFRIEND MADE YOU COME HERE

User: He says I’m depressed much of the time

ELIZA: I AM SORRY TO HEAR THAT YOU ARE DEPRESSED

Eliza-style regular expressions

s/.* YOU ARE (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/

s/.* YOU ARE (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1/

s/.* all .*/IN WHAT WAY/

s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/

Step 1: replace first person references with second person referencesStep 2: use additional regular expressions to generate repliesStep 3: use scores to rank possible transformations

Finite-state automata

• Finite-state automata (FSA)

• Regular languages

• Regular expressions

Finite-state automata (machines)

baa!baaa!baaaa!baaaaa!...

q0 q1 q2 q3 q4

b a a !

a

baa+!

state transition finalstate

Input tape

a b a ! b

q0

Finite-state automata

• Q: a finite set of N states q0, q1, … qN

: a finite input alphabet of symbols

• q0: the start state

• F: the set of final states(q,i): transition function

State-transition tables

Input

State b a !

0 1 0 0

1 0 2 0

2 0 3 0

3 0 3 4

4 0 0 0

The FSM toolkit and friends

• Developed at AT&T Research (Riley, Pereira, Mohri, Sproat)

• Download: http://www.research.att.com/sw/tools/fsm/tech.htmlhttp://www.research.att.com/sw/tools/lextools/

• Tutorial available• 4 useful parts: FSM, Lextools, GRM, Dot (separate)

– /clair3/tools/fsm-3.6/bin– /clair3/tools/lextools/bin– /clair3/tools/dot/bin

D-RECOGNIZEfunction D-RECOGNIZE (tape, machine) returns accept or reject index Beginning of tape current-state Initial state of machine loop if End of input has been reached then if current-state is an accept state then return accept else return reject elsif transition-table [current-state, tape[index]] is empty then return reject else current-state transition-table [current-state, tape[index]] index index + 1end

Adding a failing state

q0 q1 q2 q3 q4

b a a !

a

qFa

!

b

! b ! b

b

a

!

Languages and automata

• Formal languages: regular languages, non-regular languages

• deterministic vs. non-deterministic FSAs• Epsilon () transitions

fig02.10.pdf fig02.18.pdf fig02.19.pdf

Using NFSAs to accept strings

• Backup: add markers at choice points, then possibly revisit underexplored markers

• Look-ahead: look ahead in input

• Parallelism: look at alternatives in parallel

Using NFSAs

Input

State b a ! 0 1 0 0 0

1 0 2 0 0

2 0 2,3 0 0

3 0 0 4 0

4 0 0 0 0

More about FSAs

• Transducers

• Equivalence of DFSAs and NFSAs

• Recognition as search: depth-first, breadth-search

Recognition using NFSAs

fig02.21.pdf fig02.22.pdf fig02.23.pdf

Regular languages

• Operations on regular languages and FSAs: concatenation, closure, union

• Properties of regular languages (closed under concatenation, union, disjunction, intersection, difference, complementation, reversal, Kleene closure)

fig02.24.pdf fig02.25.pdf fig02.26.pdf

An exercise

• J&M 2.8. Write a regular expression for the language accepted by the NFSA in the Figure.

fig02.27.pdf

Morphology andFinite-State Transducers

Morphemes

• Stems, affixes• Affixes: prefixes, suffixes, infixes: hingi

(borrow) – humingi (agent) in Tagalog, circumfixes: sagen – gesagt in German

• Concatenative morphology • Templatic morphology (Semitic languages): lmd (learn), lamad (he studied), limed (he

taught), lumad (he was taught)

Morphological analysis

• rewrites

• unbelievably

Inflectional morphology

• Tense, number, person, mood, aspect

• Five verb forms in English

• 40+ forms in French

• Six cases in Russian:http://www.departments.bucknell.edu/russian/language/case.html

• Up to 40,000 forms in Turkish (you cause X to cause Y to … do Z)

fig03.01.pdf

Derivational morphology

• Nominalization: computerization, appointee, killer, fuzziness

• Formation of adjectives: computational, embraceable, clueless

Finite-state morphological parsing

• Cats: cat +N +PL• Cat: cat +N +SG• Cities: city +N +PL• Geese: goose +N +PL• Ducks: (duck +N +PL) or (duck +V +3SG)• Merging: +V +PRES-PART• Caught: (catch +V +PAST-PART) or (catch +V

+PAST)

Principles of morphological parsing

• Lexicon

• Morphotactics (e.g., plural follows noun)

• Orthography (easy easier)

• Irregular nouns: e.g., geese, sheep, mice

• Irregular verbs: e.g., caught, ate, eate

fig03.02.pdf fig03.03.pdf

FSA for adjectives

• Big, bigger, biggest• Cool, cooler, coolest, coolly• Red, redder, reddest• Clear, clearer, clearest, clearly, unclear, unclearly• Happy, happier, happiest, happily• Unhappy, unhappier, unhappiest, unhappily• What about: unbig, redly, and realest?

fig03.04.pdf fig03.05.pdf

Using FSA for recognition

• Is a string a legitimate word or not?

• Two-level morphology: lexical level + surface level (Koskenniemi 83)

• Finite-state transducers (FST) – used for regular relations

• Inversion and composition of FST

fig03.06.pdf fig03.07.pdf

fig03.08.pdf

fig03.09.pdf fig03.10.pdf

fig03.11.pdf fig03.12.pdf

fig03.13.pdf

Orthographic rules

• Beg/begging

• Make/making

• Watch/watches

• Try/tries

• Panic/panicked

#__^/ s

z

s

x

dcba __/

fig03.14.pdf fig03.15.pdf

Combining FST lexicon and rules

• Cascades of transducers:the output of one becomes the input of another

fig03.16.pdf fig03.17.pdf fig03.18.pdf

Weighted Automata

Phonetic symbols

• IPA

• Arpabet

• Examplesfig04.01.pdf

fig04.02.pdf

fig04.03.pdf

Using WFST for language modeling

• Phonetic representation

• Part-of-speech taggingfig05.12.pdf fig05.13.pdf

Word Classes andPart Of Speech Tagging

Some POS statistics

• Preposition list from COBUILD

• Single-word particles

• Conjunctions

• Pronouns

• Modal verbs

fig08.01.pdf

fig08.02.pdf

fig08.03.pdf

fig08.04.pdf

fig08.05.pdf

Tagsets for English

• Penn Treebank

• Other tagsets (see Week 1 slides)fig08.06.pdf

POS ambiguity

• Degrees of ambiguity (DeRose 1988)• Rule-based POS tagging

– ENGTWOL (Voutilainen et al. )

– Sample rule:• Adverbial-That rule (“it isn’t that odd”)

(“Given input: “that”if (+1 A/ADV/QUANT); (+2 SENT-LIM); (NOT –1 SVOC/A); (not a verb like “consider”)then eliminate non-ADV tagselse eliminate ADV tag

fig08.07.pdf

fig08.08.pdf

Evaluating POS taggers

• Percent correct

• What is the lower bound on a system’s performance?

• What about the upper bound?

Kappa

• N: number of items (index i)

• n: number of categories (index j)

• k: number of annotators

• when > .8 – agreement is considered high

)(1

)()(

EP

EPAP

N

i

n

jij k

mkNk

AP1 1

2

1

1

)1(

1)(

2

1

1

)(

Nk

mEP

N

iijn

j

Readings for next time

• J&M Chapters 5.9, 8, 9

• Lecture notes #2