morphology - limsi
TRANSCRIPT
LIMSI-CNRS | 1/72
Fondements du TAL
Delphine Bernhard
Morphology
Contains slides adapted from Pierre Zweigenbaum
LIMSI-CNRS | 2/72
Outline
1. Word segmentation
2. Linguistic morphologya) Morphemesb) Morphological processes
3. Computational morphologya) Normalisation: stemming, lemmatisationb) Analysis: lexical databases, unsupervised
segmentation, rule-based analysis and parsing
4. Applications
LIMSI-CNRS | 4/72
Outline
1. Word segmentation
2. Linguistic morphologya) Morphemesb) Morphological processes
3. Computational morphologya) Normalisation: stemming, lemmatisationb) Analysis: lexical databases, unsupervised
segmentation, rule-based analysis and parsing
4. Applications
LIMSI-CNRS | 5/72
Can you read this (fast!)?
wikipédiaestunprojetdencyclopédiecollectiveétabliesurinternetuniversellemultilingueetfonctionnantsurleprincipeduwikiwikipédiaapourobjectifdoffriruncontenulibrementréutilisableneutreetvérifiablequechacunpeutéditeretaméliorer
Wikipédia est un projet d’encyclopédie collective établie sur Internet, universelle, multilingue et fonctionnant sur le principe du wiki. Wikipédia a pour objectif d’offrir un contenu librement réutilisable, neutre et vérifiable, que chacun peut éditer et améliorer.
LIMSI-CNRS | 6/72
It's only words ...
... but what are they exactly and how can we automatically recognise them?
In speech, there are no obvious breaks So how do babies learn words? According to (Saffran et al.,
1996) they use distributional cues and statistical regularities in speech
LIMSI-CNRS | 7/72
How do we recognise words in speech? (Bauer, 1988)
There are no gaps between words in speech: Menbecomeoldbuttheyneverbecomegood
Thanks to our knowledge of language, we recognise certain strings of sounds/letters: e.g. we can recognise men in the previous sequence because
it also comes up in sequences like:MenareconservativeafterdinnerMenlosetheirtempersindefendingtheirtaste.Afterfortymenhavemarriedtheirhabits.
LIMSI-CNRS | 8/72
Learning to read is difficult for humans
Reading disabilities: Dyslexia: inability to decode, or break down, words into
phonemes Comprehension difficulties
The invention of writing and reading is recent
Contrarily to speech or vision, it is anunnatural process that has to be learned: brains are not wired to read!
LIMSI-CNRS | 9/72
For computers: characters and strings
Control characters: End of line: \n Tabulation: \t
Encodings: ASCII: English alphabet Latin 1, ISO-8859-1: Western European Languages ISO-8859-15: Similar to ISO-8859-1, but replaces some less
common symbols with €, Œ or œ Windows-1252, Cp1252: superset of ISO 8859-1 (includes €,
Œ and œ) UTF-8: can represent every character in the Unicode
character set, backward-compatible with ASCII
LIMSI-CNRS | 10/72
Practical definition of wordsand sentences
Bauer (1988): A word is a unit which, in print, is bounded by spaces on both
sides. We will call this an orthographic word.
Kučera and Francis (1967): A graphic word is a string of contiguous alphanumeric
characters with space on either side; may include hyphens and apostrophes, but no other punctuation marks
Grefenstette and Tapanainen (1994): Sentences end with punctuation.
LIMSI-CNRS | 11/72
What are the "words" and sentences here?
Pacific Lumber Co. was trying to figure out the safest way to bring the activists down.
He doesn't need us.
For additional information see also http://www.limsi.fr
New York is situated on the east coast of the United States.
c’est-à-dire les pommes de terre des U.S.A.
LIMSI-CNRS | 12/72
Tokenisation
Tokenisation: process which divides the input text into word tokens: punctuation marks, word-like units, numbers, etc.
A system which splits texts into word tokens is called a tokeniser
A very simple example: Input text:
John likes Mary and Mary likes John. Tokens:
{"John", "likes", "Mary", "and", "Mary", "likes", "John", "."}
LIMSI-CNRS | 13/72
Problems of tokenisation
Numeric expressions:The corresponding free cortisol fractions in these sera were 4.53 +/- 0.15% and 8.16 +/- 0.23%, respectively.
How many words are there in 4.53 +/- 0.15%? 1 3 9 (“four point five three, plus or minus fifteen percent”) not a word
The answer depends on the application at hand
LIMSI-CNRS | 14/72
Problems of tokenisation
Boundaries
For "simple" words:SpacesPunctuation
Multiword expressions: several units, one wordpomme de terreparce que
Contracted forms: one unit, several wordsaux (à les), des (de les)
LIMSI-CNRS | 15/72
Outline
1. Word segmentation
2. Linguistic morphologya) What is morphology?b) Morphological processes
3. Computational morphologya) Normalisation: stemming, lemmatisationb) Analysis: lexical databases, unsupervised
segmentation, rule-based analysis and parsing
4. Applications
LIMSI-CNRS | 16/72
Morphology
Words can be further decomposed into smaller units:
pneumonoultramicroscopicsilicovolcanoconiosis
lungextreme
microscopicsilicium
volcanodust
disease
lung disease caused by the inhalation of very fine silica dust found in volcanoes
LIMSI-CNRS | 17/72
What is morphology?
Morphology is the branch of linguistics which studies word forms and word formation
Word formation processes Inflection Derivation Composition / Compounding
LIMSI-CNRS | 18/72
Words vs. lexemes vs. lemmas
A lexeme refers to the set of word forms which correspond to the same dictionary entry
small, smaller, smallest → SMALLknife, knives → KNIFE
A lemma is the canonical form of a lexemeSMALL
In the following, capital letters are used to indicate lemmas
LIMSI-CNRS | 19/72
Inflection
Inflection is the process of forming different grammatical forms of a single lexememontrer → montreracheval → chevaux
The grammatical category of the word form remains the same
LIMSI-CNRS | 20/72
Word formation
Word formation is the process of creating new lexemes from existing ones: Derivation: combines bases and affixes Compounding: combines lexemes
LIMSI-CNRS | 21/72
Derivation
Derivation involves the creation of one lexeme from another
re- + create → RECREATEre- is a derivational prefix
recreate + s → recreates-s is an inflectional suffix, it provides another word-form of the lexeme RECREATE!
Derivation might induce a change of the grammatical category
be- + witch → BEWITCH: changes a noun into a verb
LIMSI-CNRS | 22/72
Compounding
A compound involves the creation of one lexeme from two or more other lexemes
popcorn = a kind of corn which popshot dog = a kind of food (opaque compound)
Compounding is particularly frequent in French medical language appendice + ectomie → appendicectomie
LIMSI-CNRS | 23/72
Non concatenative phenomena
Root-and-pattern morphology (e.g. Arabic, Hebrew) the root consists of consonants only (3 by default)
ktb = to write the pattern is a combination of vowels (possibly consonants
too) with slots for the root consonantskaatab = he corresponded
Apophony: vowel changes within a root Ablaut: sing, sang, sung Umlaut: Buch, Bücher
LIMSI-CNRS | 24/72
Outline
1. Word segmentation
2. Linguistic morphologya) Morphemesb) Morphological processes
3. Computational morphologya) Normalisation: stemming, lemmatisationb) Analysis: lexical databases, unsupervised
segmentation, rule-based analysis and parsing
4. Applications
LIMSI-CNRS | 25/72
Morphological normalisation
Morphological normalisation consists in identifying a single canonical representative for morphologically related word-forms
Methods: Stemming Lemmatisation
LIMSI-CNRS | 26/72
Stemming
Stemming is an algorithmic approach to strip off the endings of words
Objective: group words belonging to the same morphological family by transforming them into a similar stemmed representation
Stemming does not distinguish between inflection and derivation
The stems obtained do not necessarily correspond to a genuine word form
The best known stemming algorithms have been developed by Lovins (1968) and Porter (1980)
LIMSI-CNRS | 27/72
Algorithmic stemming method
1) Desuffixing: removal of predefined word endingssitting → sitt
2) Recoding: transform the endings of the previously obtained stems using transformation rulessitt → sit
These 2 phases can be performed successively (Lovins) or simultaneously (Porter)
LIMSI-CNRS | 28/72
Porter's stemmer
Based on a limited set of general cascaded transformational rules:
-ational → -ate : relational → relate
Variants exist for many languages: English, French, Spanish, Portuguese, Italian, Romanian, German Dutch, Swedish, Norwegian, Danish, Russian, Finnish, Hungarian, Turkish
Fast Accurate enough for some applications, e.g.
Information Retrieval Available at http://snowball.tartarus.org/
LIMSI-CNRS | 29/72
Steps in Porter stemming (excerpts)
Step 1a SSES → SS
caresses → caress Step 1b (m>0) EED → EE
feed → feed, agreed → agree Step 1c (*v*) Y → I
happy → happi, sky → sky Step 2 (m>0) ATIONAL → ATE
relational → relate
Step 3 (m>0) ICATE → IC
triplicate → triplic Step 4 (m>1) AL →
revival → reviv Step 5a (m>1) E →
probate → probat Step 5b (m > 1 and *d and *L) → single
letter controll → control
LIMSI-CNRS | 30/72
Porter's stemmer
Original Word Stemmed Wordvision visionvisible visiblvisibility visiblvisionary visionarivisioner visionvisual visual
LIMSI-CNRS | 31/72
Comparison of three stemmers
© 2008 Cambridge University Press, Introduction to Information Retrieval, Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze
LIMSI-CNRS | 32/72
Stemming errors
Under-stemming:adhere → adheradhesion → adhes
Over-stemming: appendicitis → append append → append
LIMSI-CNRS | 33/72
Ambiguity
I saw the saw
Preterite form of the verb
SEE
Singular formof the noun
SAW≠
Homographs: words which have the same spelling but different meanings
Such cases cannot be properly dealt with with stemming only, the word's grammatical category has to be identified
LIMSI-CNRS | 34/72
Lemmatisation
Lemmatisation consists in mapping word forms to their lemma (base form):
sing, sang, sung → sing
Lemmatisation only handles inflection, not derivation
In order to disambiguate ambiguous cases, lemmatisation is usually combined with part-of-speech tagging
Additional morphological information is usually provided with the lemma (more about this later in the presentation)
LIMSI-CNRS | 35/72
Outline
1. Word segmentation
2. Linguistic morphologya) Morphemesb) Morphological processes
3. Computational morphologya) Normalisation: stemming, lemmatisationb) Analysis: lexical databases, unsupervised
segmentation, rule-based analysis and parsing
4. Applications
LIMSI-CNRS | 36/72
Morphological analysis
Aim: split a word into its constituent morphemes : foxes → fox + es get morpho-syntactic information : part-of-speech (POS),
tense, number, person, voice, gender, etc.
Morphological analysis can be perfomed: manually, the analyses are then stored in lexical databases automatically:based on some manually-written rules and lexicons in an unsupervised manner, using no external resources
LIMSI-CNRS | 37/72
Lexical databases: contents
word entries + information surface form, lemma syntactic properties category, POS (Part Of Speech) features: masculine, feminine, etc.
semantic properties semantic relations: synonym, antonym, hypernym semantic type: person, event, object
LIMSI-CNRS | 38/72
CELEX
CELEX is a lexical database which is available for English, Dutch and German
LIMSI-CNRS | 42/72
Unsupervised Segmentation
Unsupervised morphological segmentation consists in automatically breaking down words into their constituent morphemes
Only input dataset: list of words (no language-specific rules or lexicons)
Scientific goals: Learn of the phenomena underlying word construction in natural
languages Discover approaches suitable for a wide range of languages Advance machine learning methodology
See the Morpho Challenge websitehttp://www.cis.hut.fi/morphochallenge2009/
LIMSI-CNRS | 45/72
Segmentation by compression
Minimum Description Length and bayesian inference(Goldsmith, 2001; Creutz & Lagus, 2005)
LIMSI-CNRS | 46/72
Harris (1955):Segmentation by successor counts
At the end of a morpheme (or word) almost any sound can follow:
design + #, design + ation, design + ing, design + ed, ...
However, within morphemes, the choice is more restricted:
desig + n
Basic algorithm: At each position in an utterance, count the number of
different sounds which can possibly follow Peaks in this count indicate morpheme boundaries
LIMSI-CNRS | 47/72
Segmentation of He's quicker
Utterance = He's quicker (hiyzkwikәr) Successors of h: His ship's in? Humans act like simians. ...
Successors of hi: Hip-high in water. Hidden meanings were discovered. ...
LIMSI-CNRS | 48/72
Successor counts
h - i - y - z - k - w - i - k - ә - r -0
5
10
15
20
25
30
35
Sounds
Su
cce
ssor
cou
nts
LIMSI-CNRS | 49/72
Morphological parsing
Aim: break down a word into component morphemes and build a structured representation of the analysis
Example:cats → cat +N +PL
Our focus: finite-state morphological parsing
lemma features
LIMSI-CNRS | 50/72
Finite state automata
A finite state automaton (FSA) recognises a set of strings
An FSA is represented as directed graph: vertices (nodes) represent states directed links between nodes represent transitions
LIMSI-CNRS | 51/72
Sheep Talk FSA
The language of sheep includes to following utterances: baa!, baaa!, baaaa!, baaaaa!, etc.
Regular expression for this language: baa+! FSA that can accept this language:
q0
q2
q3
q4b a a !
a
q1
LIMSI-CNRS | 52/72
Formal definition of an FSA
Q = q0 q
1 q
2 ... q
N-1a finite set of N states
Σ a finite input alphabet of symbols
q0
the start state
F the set of final states δ(q,i) the transition function
For the sheep talk automaton: Q = {q0, q
1, q
2, q
3, q
4},
Σ = {a, b, !}, F = {q4}
LIMSI-CNRS | 53/72
Deterministic vs. non-deterministic FSA
Deterministic FSA for sheep talk
Non-deterministic FSA for sheep talk
q0
q2
q3
q4b a a !
a
q1
q0
q2 q3
q4b a a !
a
q1
LIMSI-CNRS | 54/72
Morphological parsers
Components: lexicon: list of lemmas and affixes morphotactics: word grammar which accounts for
morpheme ordering orthographic rules: model the changes that occur when two
morphemes combinecity + s → cities
Morphological parsers can be implemented as finite-state transducers
LIMSI-CNRS | 55/72
Finite State Transducers
State 0: start state State 1: cat has been recognised as a +N (possible end state) State 2: catch has been recognised as a +V (possible end state) State 3: cats has been recognised as +N +PL (possible end
state)
catch:V
cat:N
s:PL
0
1
2
3
Finite-state transducers map between one representation and another
LIMSI-CNRS | 56/72
Two-level morphology
(Koskenniemi, 1984)
Surface level: words as they are pronounced or written
Lexical level: concatenation of morphemesLexical level: c a t +N +PLSurface level: c a t s
The mapping between the surface and the lexical level is constrained by rules
LIMSI-CNRS | 57/72
Two-level rules
Example rule (Trost, 2003):
+:e ⇐ { s x z [ {s c} h ] } : _s
left context right contextsurface
level
lexicallevel
Application of the rule:# d i s h + s #| | | | | 1 | |0 d i s h e s 0
Spelling rule:
e-insertion
LIMSI-CNRS | 59/72
PC-Kimmo: POS Ambiguity
1:
Word:
[ cat: Word
head: [ pos: V
vform: BASE ]
root: `walk
root_pos:V
clitic:-
drvstem:- ]
2:
Word:
[ cat: Word
head: [ agr: [ 3sg: + ]
number:SG
pos: N
proper:-
verbal:- ]
root: `walk
root_pos:N
clitic:-
drvstem:- ]
LIMSI-CNRS | 60/72
Inflectional Analysis for French: Flemm
Developed by F. Namer http://www.cnrtl.fr/outils/flemm/
Input : word + POS (as provided by the TreeTagger or the Brill tagger) renouent VER:pres renouer
Output: lemma + morpho-syntactic features renouent VER(pres):Vmip3p--1 renouer
|| renouent VER(pres):Vmsp3p--1 renouer Verbe au présent de l'indicatif ou du subjonctif à la troisième personne du pluriel, 1er
groupe
LIMSI-CNRS | 61/72
Inflectional Analysis for French: Flemm
Analyse linguistique : le cas de -èrent en général, -èrent marque les verbes du 3ème groupe au
passé simple : céd-èrent quelquefois, la terminaison est plus courte et -èrent
marque le présent : légifèr-ent très rarement, terminaison ambiguë : lac-èrent et lacèr-
ent Règles et exceptions : le cas de -èrent les partitions ambiguës sont lexicalisées car rares la règle étant le désuffixage sur le suffixe le plus long, les
verbes correspondant au suffixe -ent tels que légifèr- sont lexicalisés
autres cas (e.g. céd-) : désuffixage régulier sur -èrent.
LIMSI-CNRS | 62/72
Derivational Analysis: DériF
Developed by F. Namerhttp://www.cnrtl.fr/outils/DeriF/
Input: form/POS sympathique/ADJ
Output: analysis [ [ sympathie NOM] ique ADJ] (sympathique/ADJ,
sympathie/NOM) " En rapport avec le(s) sympathie"
LIMSI-CNRS | 63/72
Derivational Analysis: DériF
Word formation rules déXiser V [dé [X N] +iser V] Xable A [[X (er) V] +able A] inX A [in [X A] A]
Sequence of decompositions impensable/ADJ in + pensable/ADJ décomposable/ADJ décomposer/VERBE + able/ADJ
Ambiguous analyses implantable/ADJ implanter/VERBE + able/ADJ
im + plantable/ADJ
Produces a gloss : " ( lequel - Que l') on peut implanter" // " Non plantable"
LIMSI-CNRS | 64/72
Analysis of neoclassical compounds: DériF
acrodynie/N Hierarchical decomposition: [ [ acr N* ] [ odyn N* ] ie NOM ]
Definition (gloss): "douleur (du -- liée au) extrémité "
Semantic type:Type = maladie
Lexical and semantic relations with other lexemes:eql:acr/algie, eql:acr/algo, eql:acr/algés, eql:apex/algie,
eql:apex/algo, eql:apex/algés, eql:apex/odyn see:acr/ite, see:apex/ite
LIMSI-CNRS | 65/72
Outline
1. Word segmentation
2. Linguistic morphologya) Morphemesb) Morphological processes
3. Computational morphologya) Normalisation: stemming, lemmatisationb) Analysis: lexical databases, unsupervised
segmentation, rule-based analysis and parsing
4. Applications
LIMSI-CNRS | 66/72
Information RetrievalStemming
Stemming is frequently used in Information Retrieval: Stemming is applied at indexing time User queries are analysed likewise Stems in the user query are matched against stems in
documents
It reduces the number of terms to index
It improves recall (number of documents which are retrieved)
LIMSI-CNRS | 67/72
Information RetrievalMorphological Query Expansion
Morphological variants of a word can be used to perform query expansion
The original word forms are indexed
Query terms are expanded with their morphological variants at retrieval time (Moreau et al., 2007)
Original query: Ineffectiveness of U.S. embargoes or sanctions
Expanded query: ineffectiveness ineffective effectiveness effective ineffectively embargoes embargo embargoed embargoing sanctioning sanction sanctioned sanctions sanctionable
LIMSI-CNRS | 68/72
Text-To-Speech Systems
Aim: take text, in standard spelling, and synthesise a spoken version of the text
Problems Proper names (places, persons) Out of vocabulary words (words unknown to the system)
Solutions from morphology hothouse = hot + house and not hoth + ouse
LIMSI-CNRS | 69/72
Machine Translation
Aim: translate a text from one language into another language
Problems: A word in one language may correspond to two or more
words in another language Out of vocabulary words
How can morphological analysis help? compounds: Aktionsplan (de) → action plan (en) inflection: va, aller (fr) → go (en)
LIMSI-CNRS | 70/72
Meditate on this...
"Maybe in order to understand mankind, we have to look at the word itself. Mankind. Basically, it's made up of two separate words – 'mank' and 'ind'. What do these words mean? It's a mystery, and that's why so is mankind."
Jack Handey (Deep Thoughts)
LIMSI-CNRS | 71/72
Relevant Literature
Creutz, M. & Lagus, K. (2005), Inducing the Morphological Lexicon of a Natural Language from Unannotated Text, in 'Proceedings of the International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR'05)', pp. 106-113.
Goldsmith, J. (2001), 'Unsupervised Learning of the Morphology of a Natural Language', Computational Linguistics 27(2), 153-198.
Harris, Z. (1955), 'From phoneme to morpheme', Language 31(2), 190-222.
Koskenniemi, K. (1984), A general computational model for word-form recognition and production, in 'Proceedings of the 22nd annual meeting on Association for Computational Linguistics', Association for Computational Linguistics, Morristown, NJ, USA, pp. 178--181.
Lepage, Y. (1998), Solving analogies on words: an algorithm, in 'Proceedings of the 17th international conference on Computational Linguistics', Association for Computational Linguistics, Morristown, NJ, USA, pp. 728-734.
LIMSI-CNRS | 72/72
Relevant Literature
Moreau, F.; Claveau, V. & Sébillot, P. (2007), 'Automatic morphological query expansion using analogy-based machine learning.', in Proceedings of the 29th European Conference on Information Retrieval (ECIR 2007), Roma, Italy, April 2007.
Trost, H. (2003), The Oxford Handbook of Computational Linguistics, Oxford University Press, chapter Morphology, pp. 25--47.
Saffran, J. R.; Newport, E. L. & Aslin, R. N. (1996), 'Word Segmentation: The Role of Distributional Cues', Journal of Memory and Language 35(4), 606-621.