morphology - limsi

LIMSI-CNRS | 1/72

Fondements du TAL

Delphine Bernhard

Morphology

Contains slides adapted from Pierre Zweigenbaum

LIMSI-CNRS | 2/72

Outline

1. Word segmentation

2. Linguistic morphologya) Morphemesb) Morphological processes

3. Computational morphologya) Normalisation: stemming, lemmatisationb) Analysis: lexical databases, unsupervised

segmentation, rule-based analysis and parsing

4. Applications

LIMSI-CNRS | 3/72

Levels of linguistic structure

(c) David Groome, 2006

Our focus today

LIMSI-CNRS | 4/72

Outline





4. Applications

LIMSI-CNRS | 5/72

Can you read this (fast!)?

wikipédiaestunprojetdencyclopédiecollectiveétabliesurinternetuniversellemultilingueetfonctionnantsurleprincipeduwikiwikipédiaapourobjectifdoffriruncontenulibrementréutilisableneutreetvérifiablequechacunpeutéditeretaméliorer

Wikipédia est un projet d’encyclopédie collective établie sur Internet, universelle, multilingue et fonctionnant sur le principe du wiki. Wikipédia a pour objectif d’offrir un contenu librement réutilisable, neutre et vérifiable, que chacun peut éditer et améliorer.

LIMSI-CNRS | 6/72

It's only words ...

... but what are they exactly and how can we automatically recognise them?

In speech, there are no obvious breaks So how do babies learn words? According to (Saffran et al.,

1996) they use distributional cues and statistical regularities in speech

LIMSI-CNRS | 7/72

How do we recognise words in speech? (Bauer, 1988)

There are no gaps between words in speech: Menbecomeoldbuttheyneverbecomegood

Thanks to our knowledge of language, we recognise certain strings of sounds/letters: e.g. we can recognise men in the previous sequence because

it also comes up in sequences like:MenareconservativeafterdinnerMenlosetheirtempersindefendingtheirtaste.Afterfortymenhavemarriedtheirhabits.

LIMSI-CNRS | 8/72

Learning to read is difficult for humans

Reading disabilities: Dyslexia: inability to decode, or break down, words into

phonemes Comprehension difficulties

The invention of writing and reading is recent

Contrarily to speech or vision, it is anunnatural process that has to be learned: brains are not wired to read!

LIMSI-CNRS | 9/72

For computers: characters and strings

Control characters: End of line: \n Tabulation: \t

Encodings: ASCII: English alphabet Latin 1, ISO-8859-1: Western European Languages ISO-8859-15: Similar to ISO-8859-1, but replaces some less

common symbols with €, Œ or œ Windows-1252, Cp1252: superset of ISO 8859-1 (includes €,

Œ and œ) UTF-8: can represent every character in the Unicode

character set, backward-compatible with ASCII

LIMSI-CNRS | 10/72

Practical definition of wordsand sentences

Bauer (1988): A word is a unit which, in print, is bounded by spaces on both

sides. We will call this an orthographic word.

Kučera and Francis (1967): A graphic word is a string of contiguous alphanumeric

characters with space on either side; may include hyphens and apostrophes, but no other punctuation marks

Grefenstette and Tapanainen (1994): Sentences end with punctuation.

LIMSI-CNRS | 11/72

What are the "words" and sentences here?

Pacific Lumber Co. was trying to figure out the safest way to bring the activists down.

He doesn't need us.

For additional information see also http://www.limsi.fr

New York is situated on the east coast of the United States.

c’est-à-dire les pommes de terre des U.S.A.

LIMSI-CNRS | 12/72

Tokenisation

Tokenisation: process which divides the input text into word tokens: punctuation marks, word-like units, numbers, etc.

A system which splits texts into word tokens is called a tokeniser

A very simple example: Input text:

John likes Mary and Mary likes John. Tokens:

{"John", "likes", "Mary", "and", "Mary", "likes", "John", "."}

LIMSI-CNRS | 13/72

Problems of tokenisation

Numeric expressions:The corresponding free cortisol fractions in these sera were 4.53 +/- 0.15% and 8.16 +/- 0.23%, respectively.

How many words are there in 4.53 +/- 0.15%? 1 3 9 (“four point five three, plus or minus fifteen percent”) not a word

The answer depends on the application at hand

LIMSI-CNRS | 14/72

Problems of tokenisation

Boundaries

For "simple" words:SpacesPunctuation

Multiword expressions: several units, one wordpomme de terreparce que

Contracted forms: one unit, several wordsaux (à les), des (de les)

LIMSI-CNRS | 15/72

Outline


2. Linguistic morphologya) What is morphology?b) Morphological processes



4. Applications

LIMSI-CNRS | 16/72

Morphology

Words can be further decomposed into smaller units:

pneumonoultramicroscopicsilicovolcanoconiosis

lungextreme

microscopicsilicium

volcanodust

disease

lung disease caused by the inhalation of very fine silica dust found in volcanoes

LIMSI-CNRS | 17/72

What is morphology?

Morphology is the branch of linguistics which studies word forms and word formation

Word formation processes Inflection Derivation Composition / Compounding

LIMSI-CNRS | 18/72

Words vs. lexemes vs. lemmas

A lexeme refers to the set of word forms which correspond to the same dictionary entry

small, smaller, smallest → SMALLknife, knives → KNIFE

A lemma is the canonical form of a lexemeSMALL

In the following, capital letters are used to indicate lemmas

LIMSI-CNRS | 19/72

Inflection

Inflection is the process of forming different grammatical forms of a single lexememontrer → montreracheval → chevaux

The grammatical category of the word form remains the same

LIMSI-CNRS | 20/72

Word formation

Word formation is the process of creating new lexemes from existing ones: Derivation: combines bases and affixes Compounding: combines lexemes

LIMSI-CNRS | 21/72

Derivation

Derivation involves the creation of one lexeme from another

re- + create → RECREATEre- is a derivational prefix

recreate + s → recreates-s is an inflectional suffix, it provides another word-form of the lexeme RECREATE!

Derivation might induce a change of the grammatical category

be- + witch → BEWITCH: changes a noun into a verb

LIMSI-CNRS | 22/72

Compounding

A compound involves the creation of one lexeme from two or more other lexemes

popcorn = a kind of corn which popshot dog = a kind of food (opaque compound)

Compounding is particularly frequent in French medical language appendice + ectomie → appendicectomie

LIMSI-CNRS | 23/72

Non concatenative phenomena

Root-and-pattern morphology (e.g. Arabic, Hebrew) the root consists of consonants only (3 by default)

ktb = to write the pattern is a combination of vowels (possibly consonants

too) with slots for the root consonantskaatab = he corresponded

Apophony: vowel changes within a root Ablaut: sing, sang, sung Umlaut: Buch, Bücher

LIMSI-CNRS | 24/72

Outline





4. Applications

LIMSI-CNRS | 25/72

Morphological normalisation

Morphological normalisation consists in identifying a single canonical representative for morphologically related word-forms

Methods: Stemming Lemmatisation

LIMSI-CNRS | 26/72

Stemming

Stemming is an algorithmic approach to strip off the endings of words

Objective: group words belonging to the same morphological family by transforming them into a similar stemmed representation

Stemming does not distinguish between inflection and derivation

The stems obtained do not necessarily correspond to a genuine word form

The best known stemming algorithms have been developed by Lovins (1968) and Porter (1980)

LIMSI-CNRS | 27/72

Algorithmic stemming method

1) Desuffixing: removal of predefined word endingssitting → sitt

2) Recoding: transform the endings of the previously obtained stems using transformation rulessitt → sit

These 2 phases can be performed successively (Lovins) or simultaneously (Porter)

LIMSI-CNRS | 28/72

Porter's stemmer

Based on a limited set of general cascaded transformational rules:

-ational → -ate : relational → relate

Variants exist for many languages: English, French, Spanish, Portuguese, Italian, Romanian, German Dutch, Swedish, Norwegian, Danish, Russian, Finnish, Hungarian, Turkish

Fast Accurate enough for some applications, e.g.

Information Retrieval Available at http://snowball.tartarus.org/

LIMSI-CNRS | 29/72

Steps in Porter stemming (excerpts)

Step 1a SSES → SS

caresses → caress Step 1b (m>0) EED → EE

feed → feed, agreed → agree Step 1c (*v*) Y → I

happy → happi, sky → sky Step 2 (m>0) ATIONAL → ATE

relational → relate

Step 3 (m>0) ICATE → IC

triplicate → triplic Step 4 (m>1) AL →

revival → reviv Step 5a (m>1) E →

probate → probat Step 5b (m > 1 and *d and *L) → single

letter controll → control

LIMSI-CNRS | 30/72

Porter's stemmer

Original Word Stemmed Wordvision visionvisible visiblvisibility visiblvisionary visionarivisioner visionvisual visual

LIMSI-CNRS | 31/72

Comparison of three stemmers

© 2008 Cambridge University Press, Introduction to Information Retrieval, Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze

LIMSI-CNRS | 32/72

Stemming errors

Under-stemming:adhere → adheradhesion → adhes

Over-stemming: appendicitis → append append → append

LIMSI-CNRS | 33/72

Ambiguity

I saw the saw

Preterite form of the verb

SEE

Singular formof the noun

SAW≠

Homographs: words which have the same spelling but different meanings

Such cases cannot be properly dealt with with stemming only, the word's grammatical category has to be identified

LIMSI-CNRS | 34/72

Lemmatisation

Lemmatisation consists in mapping word forms to their lemma (base form):

sing, sang, sung → sing

Lemmatisation only handles inflection, not derivation

In order to disambiguate ambiguous cases, lemmatisation is usually combined with part-of-speech tagging

Additional morphological information is usually provided with the lemma (more about this later in the presentation)

LIMSI-CNRS | 35/72

Outline





4. Applications

LIMSI-CNRS | 36/72

Morphological analysis

Aim: split a word into its constituent morphemes : foxes → fox + es get morpho-syntactic information : part-of-speech (POS),

tense, number, person, voice, gender, etc.

Morphological analysis can be perfomed: manually, the analyses are then stored in lexical databases automatically:based on some manually-written rules and lexicons in an unsupervised manner, using no external resources

LIMSI-CNRS | 37/72

Lexical databases: contents

word entries + information surface form, lemma syntactic properties category, POS (Part Of Speech) features: masculine, feminine, etc.

semantic properties semantic relations: synonym, antonym, hypernym semantic type: person, event, object

LIMSI-CNRS | 38/72

CELEX

CELEX is a lexical database which is available for English, Dutch and German

LIMSI-CNRS | 39/72

Morphalou

http://www.cnrtl.fr/lexiques/morphalou/

LIMSI-CNRS | 40/72

Prolex

http://www.cnrtl.fr/lexiques/prolex/

LIMSI-CNRS | 41/72

French Verbs (Dubois & Dubois-Charlier)

LIMSI-CNRS | 42/72

Unsupervised Segmentation

Unsupervised morphological segmentation consists in automatically breaking down words into their constituent morphemes

Only input dataset: list of words (no language-specific rules or lexicons)

Scientific goals: Learn of the phenomena underlying word construction in natural

languages Discover approaches suitable for a wide range of languages Advance machine learning methodology

See the Morpho Challenge websitehttp://www.cis.hut.fi/morphochallenge2009/

LIMSI-CNRS | 43/72

Segmentation by analogy

(Lepage, 1998)

LIMSI-CNRS | 44/72

Application of the analogy principle

fahre

fahren

schlafe

X?

LIMSI-CNRS | 45/72

Segmentation by compression

Minimum Description Length and bayesian inference(Goldsmith, 2001; Creutz & Lagus, 2005)

LIMSI-CNRS | 46/72

Harris (1955):Segmentation by successor counts

At the end of a morpheme (or word) almost any sound can follow:

design + #, design + ation, design + ing, design + ed, ...

However, within morphemes, the choice is more restricted:

desig + n

Basic algorithm: At each position in an utterance, count the number of

different sounds which can possibly follow Peaks in this count indicate morpheme boundaries

LIMSI-CNRS | 47/72

Segmentation of He's quicker

Utterance = He's quicker (hiyzkwikәr) Successors of h: His ship's in? Humans act like simians. ...

Successors of hi: Hip-high in water. Hidden meanings were discovered. ...

LIMSI-CNRS | 48/72

Successor counts

h - i - y - z - k - w - i - k - ә - r -0

5

10

15

20

25

30

35

Sounds

Su

cce

ssor

cou

nts

LIMSI-CNRS | 49/72

Morphological parsing

Aim: break down a word into component morphemes and build a structured representation of the analysis

Example:cats → cat +N +PL

Our focus: finite-state morphological parsing

lemma features

LIMSI-CNRS | 50/72

Finite state automata

A finite state automaton (FSA) recognises a set of strings

An FSA is represented as directed graph: vertices (nodes) represent states directed links between nodes represent transitions

LIMSI-CNRS | 51/72

Sheep Talk FSA

The language of sheep includes to following utterances: baa!, baaa!, baaaa!, baaaaa!, etc.

Regular expression for this language: baa+! FSA that can accept this language:

q0

q2

q3

q4b a a !

a

q1

LIMSI-CNRS | 52/72

Formal definition of an FSA

Q = q0 q

1 q

2 ... q

N-1a finite set of N states

Σ a finite input alphabet of symbols

q0

the start state

F the set of final states δ(q,i) the transition function

For the sheep talk automaton: Q = {q0, q

1, q

2, q

3, q

4},

Σ = {a, b, !}, F = {q4}

LIMSI-CNRS | 53/72

Deterministic vs. non-deterministic FSA

Deterministic FSA for sheep talk

Non-deterministic FSA for sheep talk

q0

q2

q3

q4b a a !

a

q1

q0

q2 q3

q4b a a !

a

q1

LIMSI-CNRS | 54/72

Morphological parsers

Components: lexicon: list of lemmas and affixes morphotactics: word grammar which accounts for

morpheme ordering orthographic rules: model the changes that occur when two

morphemes combinecity + s → cities

Morphological parsers can be implemented as finite-state transducers

LIMSI-CNRS | 55/72

Finite State Transducers

State 0: start state State 1: cat has been recognised as a +N (possible end state) State 2: catch has been recognised as a +V (possible end state) State 3: cats has been recognised as +N +PL (possible end

state)

catch:V

cat:N

s:PL

0

1

2

3

Finite-state transducers map between one representation and another

LIMSI-CNRS | 56/72

Two-level morphology

(Koskenniemi, 1984)

Surface level: words as they are pronounced or written

Lexical level: concatenation of morphemesLexical level: c a t +N +PLSurface level: c a t s

The mapping between the surface and the lexical level is constrained by rules

LIMSI-CNRS | 57/72

Two-level rules

Example rule (Trost, 2003):

+:e ⇐ { s x z [ {s c} h ] } : _s

left context right contextsurface

level

lexicallevel

Application of the rule:# d i s h + s #| | | | | 1 | |0 d i s h e s 0

Spelling rule:

e-insertion

LIMSI-CNRS | 58/72

PC-KIMMO

Demo: http://languagelink.let.uu.nl/~lion

LIMSI-CNRS | 59/72

PC-Kimmo: POS Ambiguity

1:

Word:

[ cat: Word

head: [ pos: V

vform: BASE ]

root: `walk

root_pos:V

clitic:-

drvstem:- ]

2:

Word:

[ cat: Word

head: [ agr: [ 3sg: + ]

number:SG

pos: N

proper:-

verbal:- ]

root: `walk

root_pos:N

clitic:-

drvstem:- ]

LIMSI-CNRS | 60/72

Inflectional Analysis for French: Flemm

Developed by F. Namer http://www.cnrtl.fr/outils/flemm/

Input : word + POS (as provided by the TreeTagger or the Brill tagger) renouent VER:pres renouer

Output: lemma + morpho-syntactic features renouent VER(pres):Vmip3p--1 renouer

|| renouent VER(pres):Vmsp3p--1 renouer Verbe au présent de l'indicatif ou du subjonctif à la troisième personne du pluriel, 1er

groupe

LIMSI-CNRS | 61/72

Inflectional Analysis for French: Flemm

Analyse linguistique : le cas de -èrent en général, -èrent marque les verbes du 3ème groupe au

passé simple : céd-èrent quelquefois, la terminaison est plus courte et -èrent

marque le présent : légifèr-ent très rarement, terminaison ambiguë : lac-èrent et lacèr-

ent Règles et exceptions : le cas de -èrent les partitions ambiguës sont lexicalisées car rares la règle étant le désuffixage sur le suffixe le plus long, les

verbes correspondant au suffixe -ent tels que légifèr- sont lexicalisés

autres cas (e.g. céd-) : désuffixage régulier sur -èrent.

LIMSI-CNRS | 62/72

Derivational Analysis: DériF

Developed by F. Namerhttp://www.cnrtl.fr/outils/DeriF/

Input: form/POS sympathique/ADJ

Output: analysis [ [ sympathie NOM] ique ADJ] (sympathique/ADJ,

sympathie/NOM) " En rapport avec le(s) sympathie"

LIMSI-CNRS | 63/72

Derivational Analysis: DériF

Word formation rules déXiser V [dé [X N] +iser V] Xable A [[X (er) V] +able A] inX A [in [X A] A]

Sequence of decompositions impensable/ADJ in + pensable/ADJ décomposable/ADJ décomposer/VERBE + able/ADJ

Ambiguous analyses implantable/ADJ implanter/VERBE + able/ADJ

im + plantable/ADJ

Produces a gloss : " ( lequel - Que l') on peut implanter" // " Non plantable"

LIMSI-CNRS | 64/72

Analysis of neoclassical compounds: DériF

acrodynie/N Hierarchical decomposition: [ [ acr N* ] [ odyn N* ] ie NOM ]

Definition (gloss): "douleur (du -- liée au) extrémité "

Semantic type:Type = maladie

Lexical and semantic relations with other lexemes:eql:acr/algie, eql:acr/algo, eql:acr/algés, eql:apex/algie,

eql:apex/algo, eql:apex/algés, eql:apex/odyn see:acr/ite, see:apex/ite

LIMSI-CNRS | 65/72

Outline





4. Applications

LIMSI-CNRS | 66/72

Information RetrievalStemming

Stemming is frequently used in Information Retrieval: Stemming is applied at indexing time User queries are analysed likewise Stems in the user query are matched against stems in

documents

It reduces the number of terms to index

It improves recall (number of documents which are retrieved)

LIMSI-CNRS | 67/72

Information RetrievalMorphological Query Expansion

Morphological variants of a word can be used to perform query expansion

The original word forms are indexed

Query terms are expanded with their morphological variants at retrieval time (Moreau et al., 2007)

Original query: Ineffectiveness of U.S. embargoes or sanctions

Expanded query: ineffectiveness ineffective effectiveness effective ineffectively embargoes embargo embargoed embargoing sanctioning sanction sanctioned sanctions sanctionable

LIMSI-CNRS | 68/72

Text-To-Speech Systems

Aim: take text, in standard spelling, and synthesise a spoken version of the text

Problems Proper names (places, persons) Out of vocabulary words (words unknown to the system)

Solutions from morphology hothouse = hot + house and not hoth + ouse

LIMSI-CNRS | 69/72

Machine Translation

Aim: translate a text from one language into another language

Problems: A word in one language may correspond to two or more

words in another language Out of vocabulary words

How can morphological analysis help? compounds: Aktionsplan (de) → action plan (en) inflection: va, aller (fr) → go (en)

LIMSI-CNRS | 70/72

Meditate on this...

"Maybe in order to understand mankind, we have to look at the word itself. Mankind. Basically, it's made up of two separate words – 'mank' and 'ind'. What do these words mean? It's a mystery, and that's why so is mankind."

Jack Handey (Deep Thoughts)

LIMSI-CNRS | 71/72

Relevant Literature

Creutz, M. & Lagus, K. (2005), Inducing the Morphological Lexicon of a Natural Language from Unannotated Text, in 'Proceedings of the International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR'05)', pp. 106-113.

Goldsmith, J. (2001), 'Unsupervised Learning of the Morphology of a Natural Language', Computational Linguistics 27(2), 153-198.

Harris, Z. (1955), 'From phoneme to morpheme', Language 31(2), 190-222.

Koskenniemi, K. (1984), A general computational model for word-form recognition and production, in 'Proceedings of the 22nd annual meeting on Association for Computational Linguistics', Association for Computational Linguistics, Morristown, NJ, USA, pp. 178--181.

Lepage, Y. (1998), Solving analogies on words: an algorithm, in 'Proceedings of the 17th international conference on Computational Linguistics', Association for Computational Linguistics, Morristown, NJ, USA, pp. 728-734.

LIMSI-CNRS | 72/72

Relevant Literature

Moreau, F.; Claveau, V. & Sébillot, P. (2007), 'Automatic morphological query expansion using analogy-based machine learning.', in Proceedings of the 29th European Conference on Information Retrieval (ECIR 2007), Roma, Italy, April 2007.

Trost, H. (2003), The Oxford Handbook of Computational Linguistics, Oxford University Press, chapter Morphology, pp. 25--47.

Saffran, J. R.; Newport, E. L. & Aslin, R. N. (1996), 'Word Segmentation: The Role of Distributional Cues', Journal of Memory and Language 35(4), 606-621.

morphology - limsi

Documents