Roots & Patterns vs. Stems plus Grammar-Lexis Specifications: on what basis should a multilingual lexical database centred on Arabic be built?

Joseph Dichy, Université Lumière-Lyon 2, Lyon, France
Ali Farghaly, Systran Software Inc., San Diego (CA), USA

Paper presented at the IXth MT Summit, Workshop on Machine Translation for Semitic Languages: issues and approaches, New Orleans, USA, Sept. 23, 2003

Keywords

• MT, multilingual lexical databases
• NLP and MT feasibility in Semitic languages
• Arabic morphology & morphotactics (word-form structure)
• Semitic roots and patterns
• stem-based lexicons
• morphosyntactic specifiers
• grammar-lexis specifications

Lexical databases in Semitic languages

• ROOT-&-PATTERN grounded analysis of fully ‘vowelled’ Arabic script
  – Pioneering works (D. Cohen, 1961/70, Hlal, 1979)
• STEM + ROOT-&-PATTERN approaches:
  – Arabic computational lexicon of T. Buckwalter (Buckwalter, 1990, Beesley, 2001, Maamouri & Cieri, 2002)
  – Lexeme-Based Morphological treatment of Arabic (after Aronoff, 1994 and Beard, 1995): “Only the stem is morphologically relevant in that realization rules act on it” (Soudi, Cavalli-Sforza, Jamari, 2001).

Lexical databases in Semitic languages (2)

• STEM-grounded database, the entries of which are associated with grammar-lexis specifications.
  – Feasibility conditions for the recognition of Arabic standard vowel-free writing
  – Desclés & al. (1983), Dichy (1984/89), SAMIA (1984), Hassoun (1987), Dichy & Hassoun, eds. (1989), Dichy, Braham, Ghazali & Hassoun (2002)

Paper focuses on:

the precise reasons why and how stem-grounded lexical databases, with entries associated with grammar-lexis specifications, should be recommended in Arabic NLP applications, with special reference to MT.

ROOTS & PATTERNS

As is well known:

• ROOTS are (massively) ordered sequences of three consonants, traditionally considered representative of a semantic field.
• Related nouns, verbs and adjectives are considered as generated through processes of vocalization and affixation, forming a syllabic PATTERN.
• Combination of roots and patterns in linguistic units is both non-concatenative (McCarthy, e.g. 1981) and sensitive to constraints and rules (a point going back to Medieval Arabic linguistics).
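The combination of a root with a pattern can be sketched in a few lines. The snippet is only an illustration and is not taken from the paper: the slot notation (`C` for a root consonant), the function name `interdigitate`, and the example root k-t-b with its three patterns are assumptions, chosen because they are standard textbook illustrations.

```python
def interdigitate(root: str, pattern: str) -> str:
    """Fill each 'C' slot of a pattern with the next root consonant.

    Hypothetical notation: 'C' marks a consonant slot; every other
    character of the pattern (vowels, affixes) is copied unchanged.
    """
    if pattern.count("C") != len(root):
        raise ValueError("pattern slots and root consonants do not match")
    consonants = iter(root)
    return "".join([next(consonants) if ch == "C" else ch for ch in pattern])


# Illustrative data (assumed, not from the source): the root k-t-b.
for pattern in ("CaCaCa", "CâCiC", "maCCûC"):
    print(pattern, "->", interdigitate("ktb", pattern))
```

The non-concatenative character shows in the way the root consonants are interleaved with the pattern rather than appended to it. The constraints and rules that govern real combinations are deliberately left out of this sketch.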

The root-&-pattern paradigm

(Cantineau, 1950), (D. Cohen, 1961/70) and many others:

The assumption is: Roots and patterns define the meaning of lexical entries in Arabic. Nouns, verbs and adjectives result from combination of
(a) the "general meaning" of a given root, and
(b) a "specific meaning" associated with a pattern.

The whole lexicon of the language could then be generated using two databases (a ROOT and a PATTERN dB), to which rules accounting for constraints of various natures would be added.

Limits of the ROOT-&-PATTERN representation (1)

Root-&-pattern representation is only valid for a subset of the lexicon.

A substantial subset of nouns is not subject to analysis in terms of root and pattern (Dichy, 1984/89; Hassoun, 1987).

• Ancient and medieval Arabic examples: ?ismâ‘îl (إسماعيل), “Ishmael”; nâranj (نارنج), “orange”; sunûnû (سنونو), “sparrow”; sirâṭ (سراط), “path, way”;
• Modern standard Arabic examples: fusfât or fuṣfât (فسفات ـ فصفات), “phosphate”; naylûn or nîlûn (نيلون), “nylon”.

Limits of the ROOT-&-PATTERN representation (2)

Root-&-pattern representation is essentially valid for verbs and deverbals.

• Form-to-form derivational relations essentially operate in the domain of verbs and basic verbo-nominal derivatives, such as infinitive forms (maṣdar – مصدر) and active or passive participles (?ism al-fâ‘il, ?ism al-maf‘ûl – اسم الفاعل والمفعول).
• All Arabic verbs and all verbo-nominal derivatives can be analysed in terms of root and pattern (Dichy, 1984/89, 1997).

Limits of the ROOT-&-PATTERN representation (3)

The root-&-pattern paradigm appears to be doubly mistaken.

Extending its representation to the entire lexicon:
(a) leaves a large number of lexical entries unrepresented (a substantial subset of nouns),
(b) does not sufficiently take into account its own effective domain of validity (verbs and basic verbo-nominal derivatives).

Arabic word-form structure

‘Traditional’ representation of the word-form:

     _________ maximal word-form _________
    |                                     |
    |        __ minimal word-form __      |
    |       |                       |     |
    ## PCL # PRF + STEM + SUF # ECL ##

Nucleus-extensions representation:

               NF
              /  \
          aEF  --  pEF
          /  \     /  \
       PCL  PRF  SUF  ECL

Constituents of the word-form in Arabic

• proclitics (PCL)
• prefix (PRF)
• a stem (2 types)
• suffixes (SUF)
• enclitics (ECL)

Hebrew word-form structures are similar to those of Arabic

Sampson (1985: 90-1) analyses graphic word-forms in Hebrew. Not surprisingly, word-form structure analyses in Hebrew are to some extent akin to the one presented here for Arabic.

(Sampson, Geoffrey, 1985. Writing systems. Stanford University Press.)

Arabic word-form structures entail complex grammar-lexis relations

The word-formative grammar divides into (1) EF-EF rules, which hold between extension formatives, and (2) NF-EF rules, which hold between nucleus and extension formatives (Dichy, 1997).

(1) EF-EF rules purely belong to the grammar of the language, e.g.:
• If the proclitics include the preposition bi- or li-, then the case-ending suffixes are those of the indirect case.
• The proclitic article ?al- excludes undetermined case endings known as tanwîn.

(2) NF-EF rules are correlated to NF categories and sub-categories. Their field in the word-formative grammar is that of grammar-lexis relations.
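Because EF-EF rules depend only on the extension formatives, they can be checked without consulting the lexicon. The sketch below encodes the two example rules (bi-/li- with the indirect case, ?al- with tanwîn). The class `Extensions`, its field names and the case labels are hypothetical, and the check is restricted to nominal word-forms for simplicity.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Extensions:
    """Extension formatives of a nominal word-form (hypothetical encoding)."""
    proclitics: Tuple[str, ...] = ()
    case: Optional[str] = None   # assumed labels: "nominative", "accusative", "indirect"
    tanwin: bool = False         # True if the case ending is an undetermined one


def ef_ef_violations(ext: Extensions) -> List[str]:
    """Return the EF-EF rules, among the two examples, that `ext` breaks."""
    problems = []
    # Rule 1: the prepositions bi- and li- require indirect-case endings.
    if {"bi-", "li-"} & set(ext.proclitics) and ext.case != "indirect":
        problems.append("bi-/li- requires the indirect case")
    # Rule 2: the article ?al- excludes tanwîn.
    if "?al-" in ext.proclitics and ext.tanwin:
        problems.append("?al- excludes tanwîn")
    return problems


print(ef_ef_violations(Extensions(("bi-", "?al-"), case="indirect")))
print(ef_ef_violations(Extensions(("?al-",), case="nominative", tanwin=True)))
```

No stem appears anywhere in the check, which is what distinguishes these rules from the NF-EF rules.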

Morphotactic grammar-lexis relations: NF-EF relations - Type 1

NF-EF relations pertain partly to grammar, e.g.:
• the proclitic article ?al- is exclusively compatible with adjectives and common nouns;
• the proclitic morpheme sa-, which denotes the future of verbs, is only compatible with imperfective verb stems;

and for a greater part to grammar-lexis relations, e.g.: enclitic pronouns are associated with verbs according to selection features such as <+ human vs. – human complements> (متعد إلى العقلاء ~ إلى غير العقلاء). One can say, for example: qara?tu-hu (قرأته), "I read it", but not *qara?tu-hum.
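NF-EF relations of this first type can be pictured as compatibility checks between a stem entry and its clitics. The sketch below is an assumption-laden illustration, not the DIINAR.1 encoding: the `StemEntry` class, the category labels, the `object_human` specifier and the stem transcriptions are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class StemEntry:
    stem: str
    category: str   # assumed labels, e.g. "common_noun", "verb_perfective"
    # Allowed values of the <human> feature for the object complement.
    object_human: FrozenSet[bool] = frozenset()


# Hypothetical enclitic table: -hu can refer to a human or a non-human,
# -hum (plural) only to humans.
ENCLITIC_HUMAN = {"-hu": frozenset({True, False}), "-hum": frozenset({True})}


def proclitic_ok(proclitic: str, entry: StemEntry) -> bool:
    """Grammar side: ?al- and sa- select the category of the stem."""
    if proclitic == "?al-":
        return entry.category in {"adjective", "common_noun"}
    if proclitic == "sa-":
        return entry.category == "verb_imperfective"
    return True


def enclitic_ok(enclitic: str, entry: StemEntry) -> bool:
    """Grammar-lexis side: the enclitic must fit the verb's selection feature."""
    return bool(ENCLITIC_HUMAN[enclitic] & entry.object_human)


# qara?- "read" takes a <- human> complement (as in qara?tu-hu).
qara = StemEntry("qara?", "verb_perfective", frozenset({False}))
print(enclitic_ok("-hu", qara), enclitic_ok("-hum", qara))
```

The selection feature is stored on the individual stem entry, since it is a property of the verb and not of its root or pattern.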

Morphotactic grammar-lexis relations: NF-EF relations - Type 2

A large set of NF-EF relations involves "frozen" or "lexicalized" relations between nucleus and extension formatives, as opposed to compositional relations, e.g.:

• The word jâmi‘a& (جامعة) can be analysed either as:
  (a) the active participle jâmi‘, "bringing together", "collecting", to which the fem. suffix –a& is added, or as:
  (b) a lexicalized compound including the meanings of the active participle and the suffix –a& of the res generalis, "the thing that..." (Roman, 1990). The whole compound, which includes a semantic addition (Dichy, 2002), means "university".

In (a), the relation between jâmi‘ and –a& is simply compositional. In (b), it is clearly frozen or lexicalized (deriving from "the thing that brings together").
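One way to picture the difference is to store both readings as separate analyses attached to the same stem and suffix. The record layout below is a hypothetical sketch, not the format of any existing database.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Analysis:
    stem: str
    suffix: str
    relation: str   # "compositional" or "lexicalized"
    gloss: str


# Hypothetical records for the two readings of jâmi‘ + -a&.
ENTRIES = [
    Analysis("jâmi‘", "-a&", "compositional", "bringing together, collecting (fem.)"),
    Analysis("jâmi‘", "-a&", "lexicalized", "university"),
]


def analyses(stem: str, suffix: str) -> List[Analysis]:
    """Return every stored analysis of a stem + suffix combination."""
    return [a for a in ENTRIES if a.stem == stem and a.suffix == suffix]


for a in analyses("jâmi‘", "-a&"):
    print(a.relation, "->", a.gloss)
```

Only the compositional reading could in principle be computed from the participle and the suffix; the lexicalized reading has to be listed.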

Grammar-lexis relations are finite

The two types of NF-EF relations account for a finite and exhaustive set of grammar-lexis relations, which operate in the domain of the Arabic word-form.

They have been formalized in Hassoun (1987) and Dichy (1987, 1990, 1997), and implemented in the DIINAR.1 language database.

They have also been extended to scientific terminological units (Lelubre, 2001).

DIINAR.1 (DIctionnaire INformatisé de l’Arabe), Arabic acronym Ma‘âlî (Mu‘jam al-‘Arabiyya l-’âlî, مـعـالي ــ معجم العربية الآلي)

A comprehensive Arabic language dB of around 130,000 lemmas, comprising approximately 20,000 verbal entries, 79,000 deverbal items, 29,000 nominal entries (to which 10,000 related "broken plural" items are attached), 1,000 proper names and 450 grammatical tool-words (each of which is associated with a specific grammar).

The resource also includes the clitics and affixes of the language.
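As a quick check on the figures above, the listed categories can be summed. The snippet assumes that the 10,000 "broken plural" items are attached to nominal entries and are not counted as separate lemmas; the variable names are illustrative.

```python
# Approximate entry counts as quoted for DIINAR.1.
DIINAR1_COUNTS = {
    "verbal entries": 20_000,
    "deverbal items": 79_000,
    "nominal entries": 29_000,
    "proper names": 1_000,
    "grammatical tool-words": 450,
}
# Assumption: attached items, not separate lemmas.
ATTACHED_BROKEN_PLURALS = 10_000

total_lemmas = sum(DIINAR1_COUNTS.values())
print(total_lemmas)  # 129450, i.e. "around 130,000"
```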

DIINAR.1 morphosyntactic specifiers

• Each lexical unit is associated with morphosyntactic specifiers accounting for grammar-lexis specifications operating at word-form level.
• Specifiers also include derivational links between morphologically related items, such as a verb and its deverbal(s) or, in nouns, a singular and its “broken” plural, etc.

DIINAR.1 has been completed by:
• in Tunis, IRSIT (A. Braham and S. Ghazali), and
• in France, ENSSIB (Ecole Nationale Supérieure des Sciences de l'Information et des Bibliothèques, M. Hassoun) and the Lumière-Lyon 2 University (J. Dichy).

(See: Dichy, Braham, Ghazali & Hassoun, 2002)

Morphosyntactic specifiers can only be based on stems

Grammar-lexis relations are not connected with patterns. They are not predictable on the sole basis of roots and patterns, and can only be associated with actual lexical entries, which can only be identified in a stem-based lexicon.

The root, pattern and rule-based lexicon of the Xerox analyzer

The Xerox Arabic morphological analyzer is considered here because it is accessible on the web: http://www.xrce.xerox.com/research/mltt/arabic

• Based on solid and innovative finite-state technology (Beesley, Kenneth and Karttunen, Lauri, 2003. Finite State Morphology. CSLI Publications, Stanford, California); see also Beesley (2001).
• The approach relies on previous research, including Buckwalter's lexicon presently used at LDC (Maamouri & Cieri, 2002), and a contribution to Two-level Morphology (Beesley, 1989/91).

The Xerox Analyzer/Generator

Beesley (2001) takes up the idea that Arabic words consist of at least two building blocks: the root and the prosodic template (McCarthy, 1981).

Processes applying in generation/analysis:
• First, the process of "interdigitation" or the "merging" of roots and patterns to form stems.
• Second, alternation rules apply to perform deletion, epenthesis, assimilation, gemination and metathesis.
• Third, rules for short vowels and other diacritics are relaxed to allow for variations in the way Arabic words are written.
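The third step can be illustrated with a small matcher that accepts a written word if it equals a fully vowelled form with any subset of its diacritics omitted. This is a plain-Python sketch of the idea only. It is not the finite-state implementation of the Xerox system, and the function name `matches_relaxed` is an assumption.

```python
import re

# Arabic tanwîn, short vowels, shadda and sukûn: U+064B to U+0652.
DIACRITIC = re.compile("[\u064B-\u0652]")


def matches_relaxed(written: str, vowelled: str) -> bool:
    """True if `written` is `vowelled` with zero or more diacritics left out."""
    i = 0
    for ch in vowelled:
        if i < len(written) and written[i] == ch:
            i += 1            # same character in both forms
        elif not DIACRITIC.fullmatch(ch):
            return False      # a base letter is missing or differs
    return i == len(written)


full = "\u0643\u064E\u062A\u064E\u0628\u064E"   # kaf, ta, ba, each with fatha
bare = "\u0643\u062A\u0628"                     # the same letters, unvowelled
print(matches_relaxed(bare, full))
```

A partially vowelled spelling is accepted as well, whereas a spelling carrying a different short vowel is rejected.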

Xerox Arabic Lexicons

Xerox has several lexicons (Beesley, 2001, p. 7):
• The first is a lexicon of roots, which contains 4,930 entries.
• The second is a dictionary of patterns, which includes about 400 entries.
• Each root-entry is manually coded and associated with patterns. The manual association of roots and patterns produces about 90,000 Arabic stems.
• When these stems combine with possible prefixes, suffixes and clitics by composition, 72 million abstract words are generated.

Page 38

Stem-based Arabic Lexicons (1)

Stem-based lexicons, compared to root-based ones, are more intuitive to build (Farghaly and Senellart, 2003), more efficient, and easier to develop and extend.

FIRST POINT:

Unlike the entries of root-&-pattern grounded databases, in a stem-based dictionary, all the lemmas are actual lexical units, not abstract or virtual items.

Page 39

Stem-based Arabic Lexicons (2)

Pure root-&-pattern generation would produce a lexicon of about 2 million stems:
The Xerox lexicons comprise about 5,000 roots and 400 patterns: 5,000 x 400 = 2,000,000

Manual generation and control has produced:
a dictionary of around 90,000 stem-entries at Xerox, and
around 120,000 stems in the DIINAR.1 database.
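
The arithmetic can be spelled out as a short sketch. The figures are the rounded ones quoted on the slides; the variable names are illustrative.

```python
roots = 5_000      # rounded; the Xerox root lexicon has 4,930 entries
patterns = 400     # "about 400" pattern entries
virtual_stems = roots * patterns  # every root combined with every pattern

# Stems retained after manual generation and control.
attested = {"Xerox": 90_000, "DIINAR.1": 120_000}
shares = {name: count / virtual_stems for name, count in attested.items()}

print(f"virtual stems: {virtual_stems:,}")
for name, share in shares.items():
    print(f"{name}: {share:.1%} of the virtual stems")
```

On these rounded figures, the manually controlled stem lists amount to roughly 4.5% (Xerox) and 6% (DIINAR.1) of what unrestricted root-&-pattern combination would yield.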

Page 40

Stem-based Arabic Lexicons (3)

SECOND POINT:

In a lexicon (a) based on stems, and (b) with entries associated with word-form grammar-lexis specifications, rule-governed combination with prefixes, suffixes, proclitics and enclitics only generates existing Arabic word-forms.

This is not the case for the 72 million word-forms generated from the 90,000 stems of the Xerox lexicon, which are clearly virtual or abstract units.
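
The contrast between free composition and specification-governed combination can be sketched on a toy lexicon. This is not the DIINAR.1 data model: the entries, the field names `proclitics` and `enclitics`, and the single co-occurrence rule are hypothetical, and the forms are schematic concatenations in a simplified transliteration (case endings ignored).

```python
from itertools import product

PROCLITICS = ["", "al"]   # "al": definite article
ENCLITICS = ["", "hu"]    # "hu": 3rd person masculine singular pronoun

# Hypothetical entries: each stem lists the clitics it licenses.
LEXICON = {
    "kitaab": {"pos": "noun", "proclitics": {"", "al"}, "enclitics": {"", "hu"}},
    "katab": {"pos": "verb", "proclitics": {""}, "enclitics": {"", "hu"}},
    "naam": {"pos": "verb", "proclitics": {""}, "enclitics": {""}},  # intransitive
}


def licensed(entry, proclitic, enclitic):
    """Check a combination against the entry's specifications."""
    if proclitic not in entry["proclitics"] or enclitic not in entry["enclitics"]:
        return False
    # The definite article and a pronoun enclitic exclude each other.
    return not (proclitic == "al" and enclitic != "")


def generate(lexicon, constrained=True):
    """List schematic word-forms, with or without the licensing check."""
    forms = []
    for stem, entry in lexicon.items():
        for pro, enc in product(PROCLITICS, ENCLITICS):
            if constrained and not licensed(entry, pro, enc):
                continue
            forms.append("+".join(part for part in (pro, stem, enc) if part))
    return forms


print(len(generate(LEXICON, constrained=False)))  # 12 freely composed forms
print(generate(LEXICON))  # the 6 forms licensed by the specifications
```

Free composition yields twelve strings here, half of which (for instance the article attached to a verb) are virtual; the licensing check keeps only the six that the toy specifications allow.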

Page 41

The Xerox Spanish Lexical Transducer and the DIINAR.1 Arabic dB

The Xerox Spanish Lexical Transducer contained, in 1996, over 46,000 baseforms.
It analyzed and generated over 3,400,000 inflected wordforms (Beesley & Karttunen, 2003, p. xvii).

In the DIINAR.1 lexical database, only 6.2 million existing word-forms are generated from the approximately 120,000 stem-based entries (Ouersighni, 2001).
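
A rough way to read these figures is as word-forms generated per lexical entry. The sketch below only divides the numbers quoted on the slides; since several of them are given as "over" or "approximately", the ratios are orders of magnitude rather than exact values.

```python
# (entries, generated word-forms), as quoted on the slides
lexicons = {
    "Xerox Arabic": (90_000, 72_000_000),
    "Xerox Spanish": (46_000, 3_400_000),
    "DIINAR.1": (120_000, 6_200_000),
}

ratios = {name: forms / entries for name, (entries, forms) in lexicons.items()}

for name, ratio in ratios.items():
    print(f"{name}: about {ratio:.0f} word-forms per entry")
```

The Xerox Arabic lexicon comes out at about 800 word-forms per stem, against roughly 74 per baseform for Spanish and roughly 52 per entry for DIINAR.1.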

Page 42

Stem-based Arabic Lexicons (4)

THIRD POINT:
In a stem-based morphological analyzer and/or generator, the process of generating stems from underlying roots is eliminated altogether.

Arabic lexical dB-s based on stems associated with grammar-lexis specifications are crucial in the context of MT: entries associated with morphosyntactic specifiers are much more compatible with the requirements of MT than virtual root-&-pattern generated word-forms.
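
What "eliminating" stem generation means for analysis can be shown with a minimal sketch: candidate clitics are stripped and the remaining stem is looked up directly, with no root or pattern involved. The stem table, clitic lists and function name are hypothetical, the transliteration is simplified, and the grammar-lexis compatibility check that a full system would apply is omitted.

```python
# Hypothetical stem table: stem -> part of speech
STEMS = {"kitaab": "noun", "katab": "verb"}
PROCLITICS = ("", "al")
ENCLITICS = ("", "hu")


def analyze(token):
    """Return every (proclitic, stem, enclitic, pos) split whose stem is listed."""
    analyses = []
    for pro in PROCLITICS:
        if not token.startswith(pro):
            continue
        rest = token[len(pro):]
        for enc in ENCLITICS:
            if enc and not rest.endswith(enc):
                continue
            stem = rest[:len(rest) - len(enc)] if enc else rest
            if stem in STEMS:  # direct lookup, no root extraction
                analyses.append((pro, stem, enc, STEMS[stem]))
    return analyses


print(analyze("alkitaab"))  # [('al', 'kitaab', '', 'noun')]
print(analyze("katabhu"))   # [('', 'katab', 'hu', 'verb')]
```

Each analysis points straight to a lexical entry, which is where morphosyntactic specifiers and, in an MT setting, transfer information would be attached.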