CPSC 503 Computational Linguistics, Lecture 3, Giuseppe Carenini
6/10/2015 CPSC503 Winter 2009 1
NLP research at UBC
TOPICS
• Generation and Summarization of Evaluative Text (e.g., customer reviews)
• Summarization of conversations (emails, blogs, meetings)
• Subjectivity Detection, Domain Adaptation, Rhetorical Parsing
PEOPLE: G. Carenini & R. Ng (Profs), G. Murray (Postdoc) + Students
SUPPORT: NSERC, Google, BObjects (now SAP)
COLLABORATIONS: MSResearch
04/18/23 CPSC503 Winter 2009 2
http://people.cs.ubc.ca/~rjoty/Webpage/
Linguistic Knowledge Formalisms and associated Algorithms
• State Machines (no prob.): Finite State Automata (and Regular Expressions), Finite State Transducers -> (English) Morphology
• Rule systems (and prob. version), e.g., (Prob.) Context-Free Grammars -> Syntax
• Logical formalisms (First-Order Logics) -> Semantics
• AI planners -> Pragmatics, Discourse and Dialogue
Computational tasks in Morphology
• Recognition: recognize whether a string is an English/… word (FSA)
• Parsing/Generation: word <-> stem, class, lexical features
  – e.g., bought <-> buy +V +PAST-PART
  – bought <-> buy +V +PAST
  – ….
• Stemming: word -> stem
  – ….
Today Sept 16
• Finite State Transducers (FSTs) and Morphological Parsing
• Stemming (Porter Stemmer)
FST definition (Recap.)
• Q: a finite set of states
• I, O: input and output alphabets (which may include ε)
• Σ: a finite alphabet of complex symbols i:o, i ∈ I and o ∈ O
• q0: the start state
• F: a set of accept/final states (F ⊆ Q)
• A transition relation δ that maps Q × Σ to 2^Q
E.g., |Q| =3 ; I={a,b,c, ε} ; O={a,b}; |Σ|=?; 0 <= |δ| <= ?
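One way to work through the exercise above, as a sketch only. It assumes Σ contains every pair i:o in I × O (the slide does not say whether Σ is the full product) and counts δ as a set of (state, symbol, next state) triples. The variable names are illustrative.

```python
from itertools import product

# Values taken from the exercise.
num_states = 3
input_alphabet = {"a", "b", "c", "ε"}
output_alphabet = {"a", "b"}

# Assumption: Σ is the full product I × O.
sigma = {f"{i}:{o}" for i, o in product(input_alphabet, output_alphabet)}

# Upper bound on |δ|: every (state, symbol) pair may lead to every state.
max_transitions = num_states * len(sigma) * num_states

print(len(sigma))       # 8
print(max_transitions)  # 72
```

Under these assumptions |Σ| = 4 × 2 = 8 and 0 <= |δ| <= 3 × 8 × 3 = 72. A smaller Σ gives smaller bounds.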
FST can be used as…
• Translators: input one string from I, output another from O (or vice versa)
• Recognizers: input a string from IxO
• Generators: output a string from IxO
Terminology warning!
E.g., if I={a,b} ; O={a,b,ε}; ……
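The translator and recognizer views can be sketched with a very small transducer class. This is an illustration, not a standard library API: the class name `FST`, the arc encoding and the toy machine are all assumptions. ε is written as the empty string and is only supported on the output side.

```python
class FST:
    """A minimal nondeterministic finite-state transducer (illustrative only)."""

    def __init__(self, start, finals, arcs):
        self.start = start
        self.finals = set(finals)
        # arcs: list of (state, input_symbol, output_symbol, next_state);
        # the empty string "" plays the role of ε on the output side.
        self.arcs = arcs

    def translate(self, string):
        """Translator: return every output string for the given input string."""
        # Each configuration is (current state, output produced so far).
        configs = {(self.start, "")}
        for symbol in string:
            configs = {
                (nxt, out + o)
                for (state, out) in configs
                for (src, i, o, nxt) in self.arcs
                if src == state and i == symbol
            }
        return {out for (state, out) in configs if state in self.finals}

    def recognize(self, pairs):
        """Recognizer: accept or reject a sequence of (input, output) pairs."""
        states = {self.start}
        for i, o in pairs:
            states = {
                nxt
                for (src, ai, ao, nxt) in self.arcs
                if src in states and (ai, ao) == (i, o)
            }
        return bool(states & self.finals)


# Toy machine with I={a,b}, O={a,b,ε}: copies a, deletes b.
copy_a_delete_b = FST(start=0, finals={0},
                      arcs=[(0, "a", "a", 0), (0, "b", "", 0)])
print(copy_a_delete_b.translate("abba"))                    # {'aa'}
print(copy_a_delete_b.recognize([("a", "a"), ("b", "")]))   # True
```

Running the same arcs with input and output swapped would give the "vice versa" direction. That direction needs ε on the input side, which this sketch leaves out.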
FST: inflectional morphology of plural
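[Figure: FST for the plural of nouns, with paths for some regular nouns and some irregular nouns (e.g., arc o:i)]
Notes:
• X -> X:X (a single symbol X abbreviates the pair X:X)
• Arc labels read lexical:surface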
Computational Morphology: Problems/Challenges
1. Ambiguity: one word can correspond to multiple structures (more critical in morphologically richer languages)
2. Spelling changes: may occur when two morphemes are combined, e.g., butterfly + -s -> butterflies
Ambiguity: more complex example
• What’s the right parse for Unionizable?
  – Union-ize-able
  – Un-ion-ize-able
• Each would represent a valid path through an FST for derivational morphology.
• Both Adj……
Deal with Morphological Ambiguity
• Find all the possible outputs (all paths) and return them all (without choosing)
• Then Part-of-speech tagging to choose…… look at the neighboring words
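"Return all paths" can be sketched with a recursive segmenter that enumerates every way of splitting a word into known pieces. The piece table is a hypothetical toy lexicon. Treating the surface piece "iz" as the morpheme "ize" is an assumption made here so that "unionizable" can be segmented without modelling e-deletion.

```python
# Hypothetical toy lexicon: surface piece -> morpheme.
SURFACE_TO_MORPHEME = {
    "un": "un",
    "ion": "ion",
    "union": "union",
    "iz": "ize",     # assumed surface form of -ize before -able
    "able": "able",
}


def all_segmentations(word):
    """Return every segmentation of word into known pieces (no choosing)."""
    if not word:
        return [[]]
    results = []
    for piece, morpheme in SURFACE_TO_MORPHEME.items():
        if word.startswith(piece):
            for rest in all_segmentations(word[len(piece):]):
                results.append([morpheme] + rest)
    return results


for parse in all_segmentations("unionizable"):
    print("-".join(parse))
# un-ion-ize-able
# union-ize-able
```

Both parses come back. Choosing between them is left to a later step such as part-of-speech tagging.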
(2) Spelling Changes
When morphemes are combined inflectionally the spelling at the boundaries may change.
Examples
• E-insertion: when -s is added to a word, -e is inserted if word ends in -s, -z, -sh, -ch, -x (e.g., kiss, miss, waltz, bush, watch, rich, box)
• Y-replacement: when -s or -ed are added to a word ending with a -y, -y changes to -ie or -i respectively (e.g., butterfly, try)
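The two rules can be sketched directly on strings, before any transducer machinery. The function name `add_suffix` is illustrative. The consonant check for Y-replacement is an added assumption that is not on the slide; it keeps "boy" + -s as "boys".

```python
import re


def add_suffix(word, suffix):
    """Attach -s or -ed, applying E-insertion and Y-replacement."""
    # E-insertion: insert e before -s after s, z, sh, ch, x.
    if suffix == "s" and re.search(r"(s|z|sh|ch|x)$", word):
        return word + "es"
    # Y-replacement: y -> ie before -s, y -> i before -ed.
    # Assumption (not on the slide): only when y follows a consonant.
    if word.endswith("y") and len(word) > 1 and word[-2] not in "aeiou":
        if suffix == "s":
            return word[:-1] + "ies"
        if suffix == "ed":
            return word[:-1] + "ied"
    return word + suffix


print(add_suffix("watch", "s"))      # watches
print(add_suffix("butterfly", "s"))  # butterflies
print(add_suffix("try", "ed"))       # tried
```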
Solution: Multi-Tape Machines
• Add intermediate tape
• Use the output of one tape machine as the input to the next
• Add intermediate symbols
  – ^ morpheme boundary
  – # word boundary
Multi-Level Tape Machines
• FST-1 translates between the lexical and the intermediate level
• FST-2 handles the spelling changes (due to one rule) to the surface tape
[Figure: lexical tape <-> FST-1 <-> intermediate tape <-> FST-2 <-> surface tape]
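FST-1 for inflectional morphology of plural (Lexical <-> Intermediate)
[Figure: FST-1, with paths for some regular nouns and some irregular nouns (e.g., arc o:i); paths end in arcs labeled +PL:^s# or #]
Note: +PL:^s# abbreviates the arc sequence +PL:^ ε:s ε:#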
Example
lexical:      f o x +N +PL
intermediate:
lexical:      m o u s e +N +PL
intermediate:
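A rough sketch of what FST-1 computes, using a dictionary lookup as a stand-in for walking the transducer. The lexicon entries, the function name and the `+SG` tag are assumptions for illustration; only `+N`, `+PL`, `^` and `#` appear on the slides.

```python
# Hypothetical toy lexicon.
REGULAR_NOUNS = {"fox", "cat", "dog"}
IRREGULAR_PLURALS = {"mouse": "mice", "goose": "geese"}


def lexical_to_intermediate(lexical):
    """Map a lexical-tape string such as 'fox +N +PL' to the intermediate tape."""
    stem, *features = lexical.split()
    if features == ["+N", "+PL"]:
        if stem in IRREGULAR_PLURALS:
            # Irregular plural: the stem changes, no ^s is added.
            return IRREGULAR_PLURALS[stem] + "#"
        if stem in REGULAR_NOUNS:
            # Regular plural: +PL maps to ^s#.
            return stem + "^s#"
    elif features == ["+N", "+SG"]:  # +SG is an assumed tag name
        if stem in REGULAR_NOUNS or stem in IRREGULAR_PLURALS:
            return stem + "#"
    raise ValueError(f"no path for {lexical!r}")


print(lexical_to_intermediate("fox +N +PL"))    # fox^s#
print(lexical_to_intermediate("mouse +N +PL"))  # mice#
```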
FST-2 for E-insertion (Intermediate <-> Surface)
E-insertion: when -s is added to a word, -e is inserted if word ends in -s, -z, -sh, -ch, -x
…as in fox^s# <-> foxes
[Figure: FST-2 for E-insertion; arc labels include #:ε]
Examples
intermediate: f o x ^ s #
surface:
intermediate: b o x ^ i n g #
surface:
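The intermediate-to-surface step can be approximated with a regular-expression rewrite over the intermediate tape. This is a sketch of the rule's effect, not the transducer drawn on the slide, and the function name is illustrative.

```python
import re


def intermediate_to_surface(tape):
    """Apply E-insertion, then erase the boundary symbols ^ and #."""
    # Insert e between a stem ending in s, z, sh, ch, x and the suffix s.
    tape = re.sub(r"(s|z|sh|ch|x)\^s#", r"\1^es#", tape)
    # Boundary symbols map to ε on the surface tape.
    return tape.replace("^", "").replace("#", "")


print(intermediate_to_surface("fox^s#"))    # foxes
print(intermediate_to_surface("box^ing#"))  # boxing
print(intermediate_to_surface("cat^s#"))    # cats
```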
Intersection (FST1, FST2) = FST3
• States of FST1 and FST2 : Q1 and Q2
• States of intersection: (Q1 x Q2)
• Transitions of FST1 and FST2 : δ1, δ2
• Transitions of intersection : δ3
For all i,j,n,m,a,b: δ3((q1i,q2j), a:b) = (q1n,q2m) iff
  – δ1(q1i, a:b) = q1n AND
  – δ2(q2j, a:b) = q2m
[Figure: an arc a:b from q1i to q1n in FST1 and an arc a:b from q2j to q2m in FST2 give an arc a:b from (q1i,q2j) to (q1n,q2m) in FST3]
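The definition above is a product construction, which can be sketched over transition tables stored as dictionaries. The dictionary encoding and the state names are assumptions for illustration. ε arcs are not treated specially, and start and final states (which would be the pair of start states and F1 x F2) are left out.

```python
def intersect(delta1, delta2):
    """Build δ3 from δ1 and δ2.

    Each δ maps (state, (a, b)) -> next_state, where (a, b) stands for a:b.
    """
    delta3 = {}
    for (q1, pair1), n1 in delta1.items():
        for (q2, pair2), n2 in delta2.items():
            # Keep the arc only if both machines read the same pair a:b.
            if pair1 == pair2:
                delta3[((q1, q2), pair1)] = (n1, n2)
    return delta3


# Hypothetical toy machines.
delta1 = {("p0", ("a", "b")): "p1", ("p1", ("a", "a")): "p1"}
delta2 = {("r0", ("a", "b")): "r1", ("r1", ("b", "b")): "r1"}
print(intersect(delta1, delta2))
# {(('p0', 'r0'), ('a', 'b')): ('p1', 'r1')}
```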
Composition (FST1, FST2) = FST3
• States of FST1 and FST2 : Q1 and Q2
• States of composition : Q1 x Q2
• Transitions of FST1 and FST2 : δ1, δ2
• Transitions of composition : δ3
For all i,j,n,m,a,b: δ3((q1i,q2j), a:b) = (q1n,q2m) iff
  – There exists c such that
  – δ1(q1i, a:c) = q1n AND
  – δ2(q2j, c:b) = q2m
[Figure: an arc a:c from q1i to q1n in FST1 and an arc c:b from q2j to q2m in FST2 give an arc a:b from (q1i,q2j) to (q1n,q2m) in FST3]
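Composition differs from intersection only in the matching condition: the output symbol of FST1 must equal the input symbol of FST2. A sketch over dictionary-encoded transitions follows. The encoding and state names are assumptions, ε arcs are not handled, and targets are kept in sets because different intermediate symbols c can lead to different target pairs.

```python
def compose(delta1, delta2):
    """Build δ3 from δ1 and δ2.

    δ1 and δ2 map (state, (in, out)) -> next_state.
    The result maps ((q1, q2), (a, b)) -> set of (n1, n2).
    """
    delta3 = {}
    for (q1, (a, c1)), n1 in delta1.items():
        for (q2, (c2, b)), n2 in delta2.items():
            # FST1 writes c and FST2 reads the same c.
            if c1 == c2:
                delta3.setdefault(((q1, q2), (a, b)), set()).add((n1, n2))
    return delta3


# Hypothetical toy machines: FST1 maps a to c, FST2 maps c (or d) to b.
delta1 = {("p0", ("a", "c")): "p1"}
delta2 = {("r0", ("c", "b")): "r1", ("r0", ("d", "b")): "r1"}
print(compose(delta1, delta2))
# {(('p0', 'r0'), ('a', 'b')): {('p1', 'r1')}}
```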
FSTs in Practice
• Install an FST package…… (pointers)
• Describe your “formal language” (e.g., lexicon, morphotactics and rules) in a RegExp-like notation (pointer)
• Your specification is compiled into a single FST
Ref: “Finite State Morphology” (Beesley and Karttunen, 2003, CSLI Publications)
Complexity/Coverage:
• FSTs for the morphology of a natural language may have 10^5 – 10^7 states and arcs
• Spanish (1996) 46x10^3 stems; 3.4 x 10^6 word forms
• Arabic (2002?) 131x10^3 stems; 7.7 x 10^6 word forms
Other important applications of FST in NLP
From segmenting words into morphemes to…
• Tokenization:
– finding word boundaries in text (?!) …maxmatch
– finding sentence boundaries: punctuation… but "." is ambiguous (look at example in Fig. 3.22)
• Shallow syntactic parsing: e.g., find only noun phrases
• Phonological Rules…… (Chpt. 11)
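The maxmatch idea mentioned under tokenization can be sketched as greedy longest-match segmentation against a word list. The dictionary below is a hypothetical toy, and the single-character fallback is one common way to handle unknown material.

```python
def max_match(text, dictionary):
    """Greedy longest-match segmentation of text written without spaces."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for j in range(len(text), i, -1):
            if text[i:j] in dictionary or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens


words = {"the", "theta", "table", "down", "there", "bled", "own"}
print(max_match("thetabledownthere", words))
# ['theta', 'bled', 'own', 'there']
```

The output shows why the slide marks word-boundary finding with "(?!)": the greedy choice of "theta" blocks the intended reading "the table down there".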
Computational tasks in Morphology
• Recognition: recognize whether a string is an English word (FSA)
• Parsing/Generation: word <-> stem, class, lexical features
  – e.g., bought <-> buy +V +PAST-PART
  – bought <-> buy +V +PAST
  – ….
• Stemming: word -> stem
  – ….
Stemmer
• E.g. the Porter algorithm, which is based on a series of sets of simple cascaded rewrite rules:
• (condition) S1->S2
  – ATIONAL -> ATE (relational -> relate)
  – (*v*) ING -> ε if stem contains vowel (motoring -> motor)
• Cascade of rules applied to: computerization
  – ization -> ize (computerize)
  – ize -> ε (computer)
• Errors occur:
  – organization -> organ, university -> universe
Code freely available in most languages: Python, Java,…
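The cascade can be sketched with just the four rules shown above. This is a toy, not the Porter algorithm: the real stemmer has many more rules and conditions on the stem, and the function name `toy_stem` is illustrative.

```python
import re

VOWEL = re.compile(r"[aeiou]")


def toy_stem(word):
    """Apply the slide's four rewrite rules in sequence (a cascade)."""
    # ATIONAL -> ATE
    if word.endswith("ational"):
        word = word[:-7] + "ate"
    # (*v*) ING -> ε, only if the remaining stem contains a vowel
    if word.endswith("ing") and VOWEL.search(word[:-3]):
        word = word[:-3]
    # IZATION -> IZE
    if word.endswith("ization"):
        word = word[:-7] + "ize"
    # IZE -> ε (fires on the output of the previous rule)
    if word.endswith("ize"):
        word = word[:-3]
    return word


for w in ["relational", "motoring", "sing", "computerization", "organization"]:
    print(w, "->", toy_stem(w))
# relational -> relate
# motoring -> motor
# sing -> sing
# computerization -> computer
# organization -> organ
```

The last line reproduces one of the errors listed above: "organization" collapses to "organ".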
Stemming mainly used in Information Retrieval
1. Run a stemmer on the documents to be indexed
2. Run a stemmer on users' queries
3. Compute similarity between queries and documents (based on stems they contain)
Seems to work especially well with smaller documents
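The three steps can be sketched end to end. Both pieces here are assumptions: `strip_suffix` is a hypothetical stand-in for a real stemmer, and Jaccard overlap is just one possible similarity measure, since the slide does not name one.

```python
def strip_suffix(word):
    """Hypothetical stand-in for a real stemmer: strips a few suffixes."""
    for suffix in ("ization", "ations", "ation", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def stems(text):
    """Steps 1 and 2: stem every token of a document or query."""
    return {strip_suffix(w) for w in text.lower().split()}


def similarity(query, document):
    """Step 3: Jaccard overlap between the two stem sets (assumed measure)."""
    q, d = stems(query), stems(document)
    return len(q & d) / len(q | d) if q | d else 0.0


print(similarity("indexing documents", "index the document"))
print(similarity("stemming", "parsing"))
```

With stemming, "indexing documents" and "index the document" share two of three stems. Without it they would share no tokens at all.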
Porter as an FST
• The original exposition of the Porter stemmer did not describe it as a transducer but…
  – Each stage is a separate transducer
  – The stages can be composed to get one big transducer