Data Collection and Analysis of Mapudungun Morphology for Spelling Correction Christian Monson, Lori Levin, Rodolfo Vega, Ralf Brown, Ariadna Font Llitjos, Alon Lavie, Jaime Carbonell, Eliseo Cañulef, Rosendo Huisca


Page 1: AVENUE Mapudungun

Data Collection and Analysis of

Mapudungun Morphology for Spelling Correction

Christian Monson, Lori Levin, Rodolfo Vega, Ralf Brown, Ariadna Font Llitjos,

Alon Lavie, Jaime Carbonell, Eliseo Cañulef, Rosendo Huisca

Page 2: AVENUE Mapudungun

AVENUE Mapudungun

• Instituto de Estudios Indígenas
  – Universidad de La Frontera, Temuco, Chile
• Programa de Educación Intercultural Bilingüe
  – Ministry of Education (Mineduc), Chile
• Language Technologies Institute
  – Carnegie Mellon University, USA

Page 3: AVENUE Mapudungun

Goals of AVENUE Mapudungun

Multicultural and Bilingual Education, Mineduc, Chile

AVENUE Project, CMU, USA

NLP tools for bilingual education:

– On-line dictionary
– Bilingual corpus
– Spelling checker

Basic skills taught in Spanish and mother tongue

Use of technology and networking even in rural areas

NLP tools for languages with low resources

Machine learning of morphology and translation rules

Page 4: AVENUE Mapudungun

Outline

• Overview of Mapudungun language

• Plan for on-line dictionary

• Progress on dictionary

• Plan for spelling checker

• Progress on spelling checker

Page 5: AVENUE Mapudungun

Mapudungun

• Mapuche people
  – Around 900,000
  – Chile and Argentina
• Agglutinative/Polysynthetic
  – Up to 36 suffix slots (Smeets, 1989)
    • Typical verb has five or six suffixes
  – Noun incorporation
    • Noun goes immediately after the verb stem
  – Vstem+(noun)+(suffixes)+last-suffix
    • Last suffix for finite verb is mood and person/number of agent or patient
    • Last suffix for non-finite verb is nominalization or adverbialization
    • Other suffixes include aspect, negation, inversive, etc.

Page 6: AVENUE Mapudungun

Examples of Mapudungun verbs

Amu -ke -yngün
go -habitual -3plIndic
They (usually) go

Ngütrümtu -a -lu
call -fut -adverb
While calling (tomorrow), …

nentu -ñma -nge -ymi
extract -mal -pass -2sgIndic
You were extracted (on me)

ngütramka -me -a -fi -ñ
tell -loc -fut -3obj -1sgIndic
I will tell her (away)
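The glossed examples on this slide pair each morpheme with exactly one gloss. A minimal Python sketch of that alignment follows, assuming morphemes and glosses are space-separated and suffixes carry a leading hyphen, as on the slide. The function name `align_gloss` is hypothetical and not part of any AVENUE tool.

```python
def align_gloss(morphemes: str, glosses: str) -> list[tuple[str, str]]:
    """Pair each morpheme with its gloss; both inputs are space-separated."""
    # Drop the leading hyphen that marks a suffix on the slide.
    m = [part.lstrip("-") for part in morphemes.split()]
    g = [part.lstrip("-") for part in glosses.split()]
    if len(m) != len(g):
        raise ValueError("morpheme and gloss counts differ")
    return list(zip(m, g))


pairs = align_gloss("ngütramka -me -a -fi -ñ", "tell -loc -fut -3obj -1sgIndic")
for morpheme, gloss in pairs:
    print(f"{morpheme}\t{gloss}")
```

The length check catches entries in which a morpheme was left without a gloss.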

Page 7: AVENUE Mapudungun

Plans for Dictionary (Mineduc)

• Tri-lingual (Spanish-Mapudungun-English)
• Pronunciation for each word for each language
• Example of use for each Mapudungun word
• Specific users can exchange suggestions and alternate pronunciations
  – Teachers and students of schools in the PEIB/Orígenes program www.origenes.cl
  – Web-based, using Flash
  – Based on shared lesson plans and network communications
• Vocabulary
  – From the corpus of spoken Mapudungun
  – From the Chilean curriculum for first four years of school
  – From the informatics domain
• User interface will be designed by Mineduc

Page 8: AVENUE Mapudungun

Corpus of spoken Mapudungun

• 170 hours of speech
  – 120 hours: Nguluche dialect
  – 30 hours: Lafkenche dialect
  – 20 hours: Pewenche dialect
  – 0 hours: Williche dialect
    • Different and more endangered
• Mapuche interviewer and interviewees
• Dialogues about health problems treated by doctor or traditional healer
• Recorded with DAT recorder
  – Some recordings are poor quality
  – Some high enough quality for training a speech recognizer
• Transcribed using TransEdit
• Translated into Spanish by native speaker of Mapudungun

Page 9: AVENUE Mapudungun

Examples from Mapudungun-Spanish corpus

nmlch-nmjm1_x_0405_nmjm_00:
M: <SPA>no pütokovilu kay ko
C: no, si me lo tomaba con agua

M: chumgechi pütokoki femuechi pütokon pu <Noise>
C: como se debe tomar, me lo tomé pués

nmlch-nmjm1_x_0406_nmlch_00:
M: Chengewerkelafuymiürke
C: Ya no estabas como gente entonces!

Page 10: AVENUE Mapudungun

Progress on Dictionary

• Around 3000 Mapudungun words (stems and fully inflected forms)
  – Spanish translation of the word
  – Sentence from the corpus of spoken Mapudungun containing the word form
  – Spanish translation of the sentence, and
  – Reference into the corpus of spoken Mapudungun identifying the specific cited sentence
  – For 1600 words
    • segmentation of the word into morphemes
    • gloss for each morpheme
• Stored as a Word file with delimiters between fields.
  – Can be easily converted to other formats

Page 11: AVENUE Mapudungun

Examples from Dictionary

• Lichi:
  – leche. [translation]
  – Feychi lichi, ¿chem lichingey? [example]
  – (Esta leche ¿qué leche es?) [translation]
  – nmlch-nmfhp1_x_0051_nmlch_00. Ec/Rh/Fc. Ec/ Rh02-01-03. [index]

Page 12: AVENUE Mapudungun

Examples from Dictionary

• Kümekünueymu:
  – küme-künu-eymu. [segmentation]
  – bien-quedar-él(ella).a.ti [gloss]
  – te ha dejado muy bien. [translation]
  – Ka kümekünueymu tati. [example]
  – (Y te ha dejado muy bien). [translation]
  – nmlch-nmpll1_x_0070_nmlch_00. EC/RH03-02-03. [index]

Page 13: AVENUE Mapudungun

Examples from Dictionary

• Mongepeürkelayan:
  – monge-pe-ürke-la-y-a-n. [segmentation]
  – sanar-tal.vez-acaso-no-0-futuro-yo [gloss]
  – no mejoraré tal vez. [translation]
  – "Feytüfachi operalayaymi, operaeliyu l'ayaymi" pieneu. "Mongepeürkelayan may" pin. Fey l'awen'tueneu, l'awen'tueneu; fey ka tripantun. [example]
  – ("Esta vez no te vas a operar, si te opero te vas a morir" me dijo. "No mejoraré tal vez, entonces", dije. Entonces me medicinó, me medicinó; entonces también estuve un año). [translation]
  – nmlch-nmpll1_x_0042_nmpll_00. Ec/Rh/Fc. Ec/ Rh23-12-02 [index]
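The dictionary entries shown on these slides share a fixed set of fields, with segmentation and gloss present only for some words. A minimal Python sketch of one possible record type follows. The class name, field names, and the `morphemes` helper are hypothetical; the slides do not specify the delimiters used in the Word file, so no parser is attempted here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DictionaryEntry:
    headword: str
    translation: str                    # Spanish translation of the word
    example: str                        # sentence from the spoken corpus
    example_translation: str            # Spanish translation of the sentence
    index: str                          # reference into the spoken corpus
    segmentation: Optional[str] = None  # present for about 1600 entries
    gloss: Optional[str] = None         # one gloss per morpheme

    def morphemes(self) -> list[str]:
        """Split the hyphen-delimited segmentation, dropping the final period."""
        if self.segmentation is None:
            return []
        return self.segmentation.rstrip(".").split("-")


entry = DictionaryEntry(
    headword="Kümekünueymu",
    translation="te ha dejado muy bien.",
    example="Ka kümekünueymu tati.",
    example_translation="(Y te ha dejado muy bien).",
    index="nmlch-nmpll1_x_0070_nmlch_00. EC/RH03-02-03.",
    segmentation="küme-künu-eymu.",
    gloss="bien-quedar-él(ella).a.ti",
)
print(entry.morphemes())  # ['küme', 'künu', 'eymu']
```

Making segmentation and gloss optional reflects the fact that only part of the dictionary has been segmented so far.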

Page 14: AVENUE Mapudungun

Plans for spelling checker

• Goal: identify misspellings even for morphologically complex words.
• We don’t have a morphological analyzer
  – Mapudungun speakers don’t know computational linguistics
  – We don’t know Mapudungun
  – Currently training a field linguist from Argentina (Roberto Aranovich) in computational linguistics
  – Research on automated morphology learning (Christian Monson)
• We want the spelling checker to be compatible with a major word processor.
• Using MySpell and OpenOffice

Page 15: AVENUE Mapudungun

MySpell

• Open-source, standalone version of OpenOffice.org spell-checker
• Functional equivalent of Unix 'ispell'
• Data files specify stems and classes of affixes
  – each base-form word specifies valid affix classes
  – can condition applicability based on characters in base-form word
    ➔ e.g. English plurals formed with -es if word ends in -ch
  – can modify base form prior to adding affix
    ➔ e.g. change -y to -ie before adding -s
• Limitation: at most one prefix and one suffix can be applied to each base form
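The two English examples on this slide can be sketched as conditional suffix rules. The Python below is an illustration only, not MySpell's file format or API: the `SuffixRule` class and its `strip`, `add`, and `condition` fields are hypothetical names chosen to mirror the ideas of conditioning on the end of the base form and modifying the base form before the affix is attached. Changing -y to -ie and then adding -s is written here as stripping "y" and adding "ies".

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SuffixRule:
    strip: str      # characters removed from the end of the base form ("" for none)
    add: str        # characters appended after stripping
    condition: str  # regex that the end of the base form must match

    def apply(self, base: str) -> Optional[str]:
        """Return the affixed form, or None if the rule does not apply."""
        if not re.search(f"(?:{self.condition})$", base):
            return None
        if not base.endswith(self.strip):
            return None
        return base[: len(base) - len(self.strip)] + self.add


PLURAL_RULES = [
    SuffixRule(strip="", add="es", condition="ch"),           # church -> churches
    SuffixRule(strip="y", add="ies", condition="[^aeiou]y"),  # city -> cities
]


def plural(base: str) -> Optional[str]:
    """Apply the first matching rule; None means no rule in this toy set applies."""
    for rule in PLURAL_RULES:
        form = rule.apply(base)
        if form is not None:
            return form
    return None


print(plural("church"), plural("city"), plural("boy"))
```

Each rule attaches a single suffix, which is why the one-suffix limitation matters for a language whose verbs typically carry five or six.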

Page 16: AVENUE Mapudungun

Plans for Spelling Checker

• MySpell for Mapudungun
  – Example of full segmentation
    • Mongepeürkelayan
    • monge-pe-ürke-la-y-a-n.
    • no mejoraré tal vez.
  – Example of segmentation for MySpell
    • monge [stem]
    • peürkelayan [suffix string]
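Collapsing a full segmentation into one stem plus one suffix string, as in the example on this slide, can be sketched in a few lines of Python. The function name `split_for_myspell` is hypothetical, and the sketch assumes the first morpheme is the entire stem, which would not hold for verbs with an incorporated noun or a compound stem.

```python
def split_for_myspell(segmentation: str) -> tuple[str, str]:
    """Turn 'stem-suf1-suf2-...' into (stem, concatenated suffix string)."""
    morphemes = segmentation.rstrip(".").split("-")
    # Assumption: the stem is exactly the first morpheme.
    stem, suffixes = morphemes[0], morphemes[1:]
    return stem, "".join(suffixes)


stem, suffix_string = split_for_myspell("monge-pe-ürke-la-y-a-n.")
print(stem, suffix_string)  # monge peürkelayan
```

Treating the whole suffix sequence as a single unit is what lets a one-suffix-per-word checker accept a form that carries six suffixes.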

Page 17: AVENUE Mapudungun

Progress on Spelling Checker

• Step 1: Devise spelling conventions
  – There are competing standards for Mapudungun spelling
  – First version of spelling checker:
    • AVENUE Mapudungun spelling standards by Cañulef, Huisca, Painequeo, and Carrasco
• Step 2: Get a list of “correctly” spelled words, according to the conventions.
  – Currently have “correct” spelling for the 70,000 most frequent words from the corpus

Page 18: AVENUE Mapudungun

Progress on Spelling Checker

Most frequent 70,000 words corrected by hand

Frequency Rank   Transcribed Word Form   Spelling Corrected Word Form
…                …                       …
103              feli                    feley
104              pichikeche              pichikeche
105              kümey                   kümey
…                …                       …
10,001           chumkunual              chumkünuael
10,002           puedelafuy              puedelafuy
10,003           tulayin                 tulayiñ
10,004           kimngepelay             kimngepelay
…                …                       …

Can we use this list instead of stemming?

Page 19: AVENUE Mapudungun

The bad news

[Figure: type-token curves for Mapudungun and Spanish. x-axis: Tokens, in Thousands (0 to 1,500); y-axis: Types, in Thousands (0 to 140).]
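The figure on this slide plots how the number of distinct word forms (types) grows with the number of running words (tokens). A minimal Python sketch of how such a curve can be computed follows. The function name `type_token_curve` and the `step` parameter are hypothetical, and the token sequence is a toy built from word forms that appear elsewhere in these slides, not corpus data.

```python
def type_token_curve(tokens, step=1):
    """Return (tokens seen, types seen) pairs, sampled every `step` tokens."""
    seen = set()
    curve = []
    for i, token in enumerate(tokens, start=1):
        seen.add(token)
        if i % step == 0:
            curve.append((i, len(seen)))
    return curve


# Toy sequence for illustration only.
toy_tokens = ["kümey", "feley", "kümey", "pichikeche", "feley", "kümey"]
print(type_token_curve(toy_tokens))
# [(1, 1), (2, 2), (3, 2), (4, 3), (5, 3), (6, 3)]
```

A curve that keeps rising steeply means new word forms keep appearing, so a fixed word list covers less of the text that users will actually write.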

Page 20: AVENUE Mapudungun

Progress on Spelling Checker

• Step 3: Iteration of stem/suffix boundaries
  – Start with 1600 segmented words from the dictionary
  – Identify the suffix strings
  – For the next most frequent 1000 words
    • If the word ends in a known suffix string, insert a stem/suffix boundary
    • Oversegments because we don’t check that the remaining stem is known after the suffix string is removed
  – Native speakers correct the boundaries
    • 333 had to be corrected
  – Two more iterations
    • Next most frequent 3000 words (579 were wrong)
    • Next most frequent 5000 words (1175 were wrong)
  – Results in 9000 words with correct stem/suffix boundaries
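The boundary-proposal step described on this slide can be sketched as follows. This is a minimal Python sketch under stated assumptions: "+" marks the stem/suffix boundary, the longest matching suffix string wins (the slide does not say how ties between candidate suffixes are resolved), and the function names are hypothetical. The seed entries are built from segmentations shown earlier in the slides.

```python
def collect_suffix_strings(segmented_words):
    """Gather suffix strings from 'stem+suffixstring' entries."""
    suffixes = set()
    for word in segmented_words:
        if "+" in word:
            suffix = word.split("+", 1)[1]
            if suffix:
                suffixes.add(suffix)
    return suffixes


def propose_boundary(word, known_suffixes):
    """Insert a boundary before the longest known suffix string the word ends in.

    The remaining stem is NOT checked against a stem list, so this can
    oversegment, as the slide notes.
    """
    for suffix in sorted(known_suffixes, key=len, reverse=True):
        if len(word) > len(suffix) and word.endswith(suffix):
            return word[: -len(suffix)] + "+" + suffix
    return word


seed = ["monge+peürkelayan", "amu+keyngün"]
known = collect_suffix_strings(seed)
print(propose_boundary("amukeyngün", known))  # amu+keyngün
print(propose_boundary("feley", known))       # feley (no known suffix string)
```

The correction counts reported on the slide correspond to 33.3% of the first batch (333 of 1000), 19.3% of the second (579 of 3000), and 23.5% of the third (1175 of 5000).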

Page 21: AVENUE Mapudungun

Effect of stemming on number of types

[Figure: Mapudungun T-T Curve, With Stemming vs. Without Stemming. x-axis: Tokens (0 to 30,000); y-axis: Types (0 to 7,000).]

If the suffix string and the stem are in the list of 9000 correctly segmented words, treat it as an instance of the stem.

Otherwise, treat it as a new type.
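The counting rule stated on this slide can be sketched in Python. This follows one reading of the rule, namely that a token counts as an instance of a stem when it splits into a stem and a suffix string that both occur in the segmented list; the slides do not say whether the pair must come from the same listed word. The function names and the toy data are hypothetical.

```python
def type_key(word, stems, suffix_strings):
    """Map a token to its stem if it splits into a known stem + known suffix string."""
    for i in range(1, len(word)):
        if word[:i] in stems and word[i:] in suffix_strings:
            return word[:i]
    # Otherwise the token is its own type (a bare stem maps to itself).
    return word


def count_types(tokens, stems, suffix_strings):
    return len({type_key(token, stems, suffix_strings) for token in tokens})


# Toy data assembled from forms shown in the slides, for illustration only.
stems = {"monge", "amu"}
suffix_strings = {"peürkelayan", "keyngün"}
tokens = ["amukeyngün", "amu", "mongepeürkelayan", "feley"]

print(len(set(tokens)))                            # 4 types without stemming
print(count_types(tokens, stems, suffix_strings))  # 3 types with stemming
```

Because unknown forms stay separate types, the stemmed curve can only fall below the unstemmed one as far as the segmented list covers the tokens.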

Page 22: AVENUE Mapudungun

Conclusion

• Building tools that can be used for bilingual education in Chilean schools

• Large parallel corpus of spoken Mapudungun translated into Spanish

• Small dictionary with examples from the corpus

• Can we build a spelling checker with MySpell?
  – We will let you know at a future conference.