Natural Language Processing, Arvutiteaduse Instituut (Institute of Computer Science), 2017-10-13
TRANSCRIPT
Morphology
Natural Language Processing: Lecture 6
12.10.2017
Kairit Sirts
Morphology
• Morphology studies the internal structure of words
availabilities → avail +able +ity +es
availability NOUN +nom +pl
avail_abil_iti_es
avail able ity es
Morphemes
• Morphemes are the smallest units of language that carry a semantic meaning
availabilities → avail +able +ity +es
• avail: verbal root
• +able: derivational suffix that transforms a verb into an adjective
• +ity: derivational suffix that transforms an adjective into a noun
• +es: inflectional suffix, nominative plural
Roots and stems
• Roots carry the basic indivisible meaning of a word
• A stem is the base of an inflected word
• Avail is a root – it cannot be divided further
• Available is a stem
• Availability is a stem
• Availabilities is an inflected word
availabilities → avail +able +ity +es
Exercise
• What could be the morphological analysis of the following words? Identify the roots and stems.
readers → read +er +s
carefully → care +ful +ly
ate → ate

Word       Root      Stems
readers    read      read, reader
carefully  care      care, careful
ate        eat/ate   ate
Lexical and grammatical morphemes
• Lexical morphemes carry a semantic meaning by themselves. Most of them can stand on their own:
  • boy, table, yellow, run, waste, etc.
• Grammatical morphemes cannot stand on their own. Their role is to modify the meaning of a lexical morpheme or to specify the relationships between lexical morphemes:
  • -s, -ing, -able, at, in, on
Derivational and inflectional morphemes
• Derivational morphemes change the semantic meaning of a word; they can also change its POS
• Inflectional morphemes change:
  • the number, gender, case, etc. of nouns and adjectives
  • the person, number, mood, tense, etc. of verbs
avail (VERB) +able → available (ADJ); available +ity → availability (NOUN)
dog +pl → dog+s; dog +gen → dog's; turn +past → turn+ed
Affixes
Grammatical morphemes are divided according to their attachment location:
• Suffixes attach to the end of the word: avail+able, table+s, go+ing
• Prefixes attach to the beginning of the word: re+animate, mis+understand, non+compliant
• Circumfixes have two parts: one attaches to the beginning of the word and the other to the end. In German: ge+komm+en, ge+spiel+t
• Infixes attach in the middle of the word. In German: an+zu+fangen, ab+ge+fahren
Compounding
• Compounding uses two or more words (typically nouns) together to form a new meaning
• In English, compounds can be written together:
  • notebook, bookstore, fireman
• … or separately:
  • living room, dinner table, full moon
• In other languages, compounds are usually written together:
  • In Estonian: täis_kuu (full moon), elu_tuba (living room)
• In some languages compounding can be very productive:
  • In German: Kraft_fahr_zeug-Haft_pflicht_versicherung (motor car liability insurance)
Tasks in computational morphology
• Text normalization:
  • stemming
  • lemmatization
• Morphological analysis/parsing
• Morphological tagging/disambiguation
• Morphological generation
Stemming
• Used in information retrieval to reduce sparsity
  • communication → communicat
  • When searching for "communication", also retrieve articles containing "communication", "communications", "communicate", "communicated", "communicating", "communicates", etc.
  • Such a group of related forms is called a conflation set
• The most well-known stemming system is Snowball
  • A small string-based language that can be used to define stemming rules
  • A compiler compiles the rules into a C or Java program
• Course project idea: develop a Snowball stemmer (prototype) for your native language
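Suffix stripping can be sketched in a few lines of plain Python; the rule list below is a toy illustration, not the actual Snowball rule set (the `SUFFIXES` list and the minimum-stem-length condition are my assumptions):

```python
# Toy suffix-stripping stemmer; real Snowball stemmers apply ordered
# rule sets with conditions on the shape of the remaining stem.
SUFFIXES = ["ations", "ation", "ates", "ated", "ating", "ate",
            "ions", "ion", "ing", "ed", "es", "s"]

def toy_stem(word: str) -> str:
    """Strip the longest matching suffix, keeping a stem of >= 4 chars."""
    for suf in SUFFIXES:  # ordered longest-first so "ation" wins over "s"
        if word.endswith(suf) and len(word) - len(suf) >= 4:
            return word[: -len(suf)]
    return word

# All members of the conflation set reduce to the same stem:
words = ["communication", "communications", "communicate",
         "communicated", "communicating", "communicates"]
print({toy_stem(w) for w in words})  # {'communic'}
```

Note that the resulting stem ("communic" here) need not be a real word; it only has to be shared by the whole conflation set.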
What are the stems of the following words?
agreement → agreement
rotting → rot
rolling → roll
new → new
news → news
probe → probe
prove → prove
provable → prove
probable → probable
Lemmatization
• Linguistically motivated task
• The word swimming can have three lemmas:
  • swim (VERB)
  • swimming (ADJ)
  • swimming (NOUN)
Example: "Morphological parsing or stemming applies to many affixes other than plurals"
Lemmatized: Morphological/ADJ parsing/NOUN stemming/NOUN apply/VERB affix/NOUN plural/NOUN
Morphological parsing/analysis
• The task of finding all possible morphological tags for a word
• A morphological parse consists of:
  • lemma/stem
  • POS
  • morphological attributes/features
• Question: should morphological parsing be done in context, or can each word be parsed in isolation?
• Answer: it can be done in isolation, since parsing enumerates all possible analyses; choosing among them (disambiguation) requires context
kestnud
  kest+nud // _V_ nud, //
  kest=nu+d // _S_ pl n, //
  kest=nud+0 // _A_ //
  kest=nud+0 // _A_ sg n, //
  kest=nud+d // _A_ pl n, //
(something) has lasted
Finite state morphological parsing
• The classical approach to morphological parsing uses finite-state methods
• Two-level morphology (Koskenniemi, 1983)
• Resources necessary for developing a finite-state morphological parser:
  • Lexicon – a list of stems and affixes
  • Morphotactics rules – which morphemes can occur with each POS and in what order
  • Orthographic rules – describing stem changes: city+s → cities
Morphotactics model as a finite state acceptor
How to accept the words: caught, sing, walked, eating, speaks?
Σ – alphabet
Q – set of states
S – set of start states
E – set of end states
δ – state transition function
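An acceptor over a lexicon plus morphotactics can be sketched at the morpheme level: a start state reads a stem, and the resulting state may read one inflectional suffix. The toy lexicon below (`STEMS`, `SUFFIXES`, `IRREGULAR`) is an assumption for illustration, not the automaton from the figure:

```python
# Morpheme-level acceptor sketch: q0 --stem--> q1, q1 --suffix--> q2,
# with q1 and q2 accepting; irregular forms are stored whole.
STEMS = {"sing", "walk", "eat", "speak", "catch"}
SUFFIXES = {"s", "ed", "ing"}
IRREGULAR = {"caught", "ate", "sang", "spoke"}

def accepts(word: str) -> bool:
    """Accept irregular forms, bare stems, and stem + allowed suffix."""
    if word in IRREGULAR:
        return True
    for stem in STEMS:
        rest = word[len(stem):]
        if word.startswith(stem) and (rest == "" or rest in SUFFIXES):
            return True
    return False

print([accepts(w) for w in ["caught", "sing", "walked", "eating", "speaks"]])
# [True, True, True, True, True]
```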
Morphological word recognition
• Does a word belong to the language?
• Lexicon + morphotactics
Figure 3.7: https://nlp.stanford.edu/~manning/xyzzy/JurafskyMartinEd2book.pdf
Morphological parser as a finite state transducer
FST modeling English noun morphology
How can we analyse the words: boy, girls, woman's, students'?
Σ – input alphabet
Δ – output alphabet
Q – set of states
S – set of start states
E – set of end states
δ – state transition function
ω – output function
Operations with FSTs
• Inversion – switches the input and output labels
  • What happens when we invert a morphological parser?
• Composition:
  • If T1 maps from I to O1
  • and T2 maps from O1 to O2,
  • then the composition of T1 and T2 maps I to O2
• Intersection:
  • Combine parallel FSTs with the Cartesian product
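Inversion and composition can be illustrated on finite relations (sets of input/output string pairs) instead of full transducers; a minimal sketch with made-up example pairs:

```python
# FST operations sketched on finite relations; a real FST library
# (e.g. OpenFst) operates on transition graphs, not enumerated pairs.
def compose(t1: set, t2: set) -> set:
    """If t1 maps I -> O1 and t2 maps O1 -> O2, return the map I -> O2."""
    return {(i, o2) for (i, o1) in t1 for (m, o2) in t2 if o1 == m}

def invert(t: set) -> set:
    """Swap input and output: a parser becomes a generator."""
    return {(o, i) for (i, o) in t}

lexicon = {("cats", "cat +N +Pl"), ("cat", "cat +N +Sg")}
print(invert(lexicon))  # generator: analysis -> surface form
print(compose({("cats", "cat^s#")}, {("cat^s#", "cat +N +Pl")}))
# {('cats', 'cat +N +Pl')}
```

Composing the surface-to-intermediate and intermediate-to-lexical relations, as in the last line, mirrors how the spelling-rule FSTs are composed with the lexicon FST.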
FST for Morphological parsing
Figure 3.12: https://nlp.stanford.edu/~manning/xyzzy/JurafskyMartinEd2book.pdf
• Parse the word cats into cat +N +Pl
• Input alphabet: the set of surface level symbols
• Output alphabet: the set of lexical level symbols
• How could we make this FST into an FSA?
FST for morphological parsing
• Compose the lexicon and the parser
Figure 3.13: https://nlp.stanford.edu/~manning/xyzzy/JurafskyMartinEd2book.pdf
FST for morphological parsing
• How to handle irregular words? geese → goose +N +Pl
Figure 3.14: https://nlp.stanford.edu/~manning/xyzzy/JurafskyMartinEd2book.pdf
FST for morphological parsing
• What would be the output if the input is fox +N +Pl?
FST for morphological parsing
• What would be the output if the input is fox +N +Pl?
• We still need to get rid of morpheme boundaries (^) and word boundaries (#), and convert foxs to foxes
Figure 3.15: https://nlp.stanford.edu/~manning/xyzzy/JurafskyMartinEd2book.pdf
FSTs for spelling rules
• E-insertion rule: insert e after -s, -z, -x, -ch, -sh and before -s
Figure 3.16: https://nlp.stanford.edu/~manning/xyzzy/JurafskyMartinEd2book.pdf
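The e-insertion rule can be sketched with a regular expression over the intermediate tape, using the slides' `^` (morpheme boundary) and `#` (word boundary) markers; in a real system the rule is compiled into an FST rather than applied as a regex:

```python
import re

def e_insertion(form: str) -> str:
    """Apply the e-insertion spelling rule to an intermediate form."""
    # insert "e" between a sibilant-final stem and the -s suffix
    form = re.sub(r"(s|z|x|ch|sh)\^s#", r"\1es#", form)
    # then drop the remaining morpheme and word boundary markers
    return form.replace("^", "").replace("#", "")

print(e_insertion("fox^s#"))    # foxes
print(e_insertion("cat^s#"))    # cats
print(e_insertion("watch^s#"))  # watches
```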
FSTs for spelling rules
• Can you think of any other spelling rules for English?
https://nlp.stanford.edu/~manning/xyzzy/JurafskyMartinEd2book.pdf, page 63
Combining lexicon, parser and rules
• Each spelling rule has its own FST; the rule FSTs are applied in parallel
Figure 3.19: https://nlp.stanford.edu/~manning/xyzzy/JurafskyMartinEd2book.pdf
Compiling the final model
• Combine the rule FSTs with intersection, then compose with the lexicon FST:
Figure 3.21: https://nlp.stanford.edu/~manning/xyzzy/JurafskyMartinEd2book.pdf
Morphological tagging
1. Morphological parsing – find all possible analyses for each word
2. Morphological disambiguation – find the most likely analysis sequence
Example (Estonian): Teise koha sai seekord Jänes ("This time, Jänes took the second place")
Candidate analyses per word:
  teise: O+adt, O+sg_g, P+adt, P+sg_g
  koha: S+sg_g, S+sg_n, S+sg_p, V+o
  sai: V+s, S+sg_n
  seekord: D
  Jänes: S+sg_n, H+sg_n
Sequence tagging with CRF
Tools
• CRFsuite
• CRF++
• CRFSharp
Figure adapted from: https://www.researchgate.net/figure/263324470_fig20_Figure-1-Linear-chain-conditional-random-fields-modelBlack-nodes-represent-observable
Morphological tagging with CRF
• MarMot – Müller et al., 2013. Efficient Higher-Order CRFs for Morphological Tagging
• Features:
  • the current, preceding and next words as unigrams and bigrams
  • word prefixes and suffixes up to 10 characters for rare words
  • the occurrence of capital characters, digits and special characters
  • a rare word has frequency ≤ 10 in the training set
• All features are combined with the POS tag, the morphological tag and each morphological feature
Construct word unigram features
• Старим +ADJ +Case=Ins|Degree=Pos|Gender=Masc
• Combine all features with the POS tag, morphological tag and all morphological features
• Старим+ADJ
• Старим+Case=Ins|Degree=Pos|Gender=Masc
• Старим+Case=Ins
• Старим+Degree=Pos
• Старим+Gender=Masc
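The expansion step above can be sketched as a small function (illustrative only, not MarMot's actual code; the function name is my own):

```python
# Conjoin a base feature with the POS tag, the full morphological tag,
# and each individual morphological attribute, as in the slide's example.
def expand_features(base: str, pos: str, morph: str) -> list:
    feats = [f"{base}+{pos}", f"{base}+{morph}"]
    feats += [f"{base}+{attr}" for attr in morph.split("|")]
    return feats

for f in expand_features("Старим", "ADJ", "Case=Ins|Degree=Pos|Gender=Masc"):
    print(f)
# Старим+ADJ
# Старим+Case=Ins|Degree=Pos|Gender=Masc
# Старим+Case=Ins
# Старим+Degree=Pos
# Старим+Gender=Masc
```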
Neural morphological tagging
• SyntaxNet – Andor et al., 2016. Globally normalized transition-based neural networks
• Pretrained models for 40 languages
Language  Tokens  Universal POS  Language POS  Morph
English    25096  90.48%         89.71%        91.30%
Estonian   23670  95.92%         96.76%        92.73%
Finnish     9140  94.78%         95.84%        92.42%
German     16268  91.79%         -             -
Kazakh       587  75.47%         75.13%
Chinese    12012  91.32%         90.89%        98.76%
Latvian     3985  80.95%         66.60%        73.60%
Average            94.27%        92.93%        90.38%
Morphological segmentation
• The simplest form of morphological analysis
• Was very popular during the first decade of the 2000s
• Mostly unsupervised or weakly supervised methods:
  • Minimum Description Length principle
  • Bayesian models
  • also sequence tagging with CRF
avail_abil_iti_es
Minimum Description Length
• Occam’s razor
• We want to maximise: P(Model) · P(Data | Model)
• Equivalently, minimise the description length: −log P(Model) − log P(Data | Model)
MDL in morphological segmentation
• Model – the inventory of possible morphemes
• Data – the corpus to be segmented
• −log P(Model) – the description length of the model
• −log P(Data | Model) – the description length of the data encoded with the model
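A toy two-part description length computation, assuming uniform codes over characters (for the model) and over the morpheme lexicon (for the data); this is a deliberate simplification of real MDL systems such as Morfessor:

```python
import math

def description_length(lexicon: list, corpus: list) -> float:
    """Bits to encode the morpheme inventory plus the segmented corpus."""
    # model cost: encode each character of each morpheme (uniform over 26 letters)
    model_bits = sum(len(m) for m in lexicon) * math.log2(26)
    # data cost: encode each morpheme token (uniform over the lexicon)
    n_tokens = sum(len(word) for word in corpus)
    data_bits = n_tokens * math.log2(len(lexicon))
    return model_bits + data_bits

# single-character "morphemes" vs a small morpheme lexicon
chars = description_length(list("abcdefghijklmnopqrstuvwxyz"),
                           [list("availabilities"), list("available")])
morphs = description_length(["avail", "abil", "iti", "es", "able"],
                            [["avail", "abil", "iti", "es"], ["avail", "able"]])
print(chars, morphs)  # here the morpheme lexicon gives the shorter description
```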
Question
• Which model is better for morphological segmentation according to MDL?
  • "Morphemes" consist of single characters: a, b, c, …
  • "Morphemes" consist of whole words
• References:
  • Goldsmith, 2001. Unsupervised Learning of the Morphology of a Natural Language
  • Morfessor – Creutz and Lagus, 2007. Unsupervised Models for Morpheme Segmentation and Morphology Learning
Bayesian morphological segmentation
• Model the posterior probability: P(Model | Data) ∝ P(Data | Model) · P(Model)
• Sirts and Goldwater, 2012. Minimally-supervised morphological segmentation using Adaptor Grammars
Example parse of "availabilities" under the grammar:
Word → Morph Morph Morph Morph → avail abil ity es
P(Word → Morph+) = …
P(Morph → Char+) = …
P(Char → a) = …
P(Char → v) = …
…
Segmentation with sequence modeling
• Ruokolainen et al., 2013. Supervised morphological segmentation in a low-resource setting using Conditional Random Fields
• B – morpheme begins
• M – inside the morpheme
• E – morpheme ends
• Question: Could the labelling scheme be simpler?
drivers → driv_er_s
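Turning a gold segmentation into per-character labels for CRF training can be sketched as below; note that the `S` label for single-character morphemes is my assumption beyond the B/M/E scheme listed above (Ruokolainen et al. discuss several label-set variants):

```python
# Convert an underscore-separated segmentation into one label per character.
def bme_labels(segmentation: str, sep: str = "_") -> list:
    labels = []
    for morph in segmentation.split(sep):
        if len(morph) == 1:
            labels.append("S")  # single-character morpheme (assumed label)
        else:
            labels += ["B"] + ["M"] * (len(morph) - 2) + ["E"]
    return labels

print(list("drivers"))
print(bme_labels("driv_er_s"))  # ['B', 'M', 'M', 'E', 'B', 'E', 'S']
```

The question on the slide hints at the simpler alternative: a binary scheme marking only whether a boundary follows each character.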
Morphological paradigms
http://wwwdrshadiabanjar.blogspot.com.ee/2012/05/lane-333-inflectional-paradigms.html
Tasks based on morphological paradigms
• Morphological synthesis or paradigm generation
  • e.g. the Estonian morphological synthesiser
• Paradigm completion
  • Given a partially filled paradigm, fill in the remaining slots
• Morphological reinflection
  • Given a lemma or an inflected form, generate another inflected form
  • run (lemma) → running (present participle)
  • ran (past) → running (present participle)
Neural model for morphological reinflection
• Kann and Schütze, 2016. MED: The LMU System for the SIGMORPHON 2016 Shared Task on Morphological Reinflection
• Bidirectional gated RNN
• Attention mechanism
• Input (in one sequence):
• Output:
<w> IN=pos=ADJ IN=case=GEN IN=num=PL OUT=pos=ADJ OUT=case=ACC OUT=num=PL i s o l i e r t e r </w>
<w> i s o l i e r t e </w>
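Building the MED-style input sequence can be sketched as follows; the tag format is taken from the example above, but the function itself is hypothetical:

```python
# Assemble source/target feature tags followed by the source form,
# one character per token, as in the MED shared-task input format.
def build_input(in_feats: dict, out_feats: dict, form: str) -> str:
    tags = [f"IN={k}={v}" for k, v in in_feats.items()]
    tags += [f"OUT={k}={v}" for k, v in out_feats.items()]
    return " ".join(["<w>"] + tags + list(form) + ["</w>"])

seq = build_input({"pos": "ADJ", "case": "GEN", "num": "PL"},
                  {"pos": "ADJ", "case": "ACC", "num": "PL"},
                  "isolierter")
print(seq)
# <w> IN=pos=ADJ IN=case=GEN IN=num=PL OUT=pos=ADJ OUT=case=ACC OUT=num=PL i s o l i e r t e r </w>
```

Framing both the feature tags and the characters as tokens of one input sequence is what lets a standard attention-based encoder-decoder handle the task.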
The role of computational morphology in NLP
• Still largely an unexplored area. Why?
• Intuitively, modeling morphology should help to reduce the vocabulary sparsity problems
• Morphological agreement:
  • For instance, verbs and nouns must agree in number
  • In German: der Mann geht vs die Männer gehen (the man goes vs the men go)
• Potentially useful for many downstream tasks:
  • machine translation
  • natural language generation
  • language modeling (for speech recognition)
SIGMORPHON Shared Tasks
• 2016 – Morphological reinflection
• 2017 – Universal morphological reinflection in 52 languages
• 2018 - TBA
• UniMorph initiative
  • Collecting morphologically annotated lexicon data for many languages
• Currently features 51 languages
• Many more are being annotated – you’re welcome to contribute!
Recap
• Morphology is the study of the internal structure of words
• It overlaps with syntax and semantics:
  • "I put the book on the table" vs "Panin raamatu lauale" (in Estonian)
  • "The giraffe bit the zebra" vs "Kaelkirjak hammustas sebrat" or "Sebrat hammustas kaelkirjak"
• Several computational morphological analysis tasks:
  • parsing, tagging/disambiguation, segmentation, generation, reinflection
• How to use morphological models in downstream tasks?