

1

CLIR: opening up possibilities for indigenous languages in South Africa?

Research team: Erica Cosijn1, Heikki Keskustalo2, Ari Pirkola2, Karen de Wet1 & Kalervo Järvelin2

 

1University of Pretoria, Pretoria, South Africa

2University of Tampere, Finland

2

Introduction

• What is CLIR?

• General methodology

• Afrikaans-English CLIR

• Zulu-English CLIR

• The road ahead

• Conclusions

3

What is CLIR?

• The basic idea is to bridge the language boundary by providing access in one language (the source language) to documents written in another language (the target language)

• Source language: the language that gives access to the required information, i.e. the query language

• Target language: the language of the content in the database

4

CLIR (cont.)

• CLIR can be done through:
– query translation and/or document translation from the source language

• Main strategies for query translation
– dictionary-based methods
– corpus-based methods, and
– machine translation

5

CLIR approaches

• Corpus-based methods: work with frequency analysis
– Implication: aboutness of the two collections should be similar

• Machine translation: uses morphological parser etc.

6

CLIR: Machine translation

• Translates source language texts into target language using:
– Translation dictionaries
– Other linguistic resources
– Syntax analysis

• Limited availability

7

CLIR: Dictionary Based

• Problems
– Limitations of dictionaries
– Inflected word forms
– Phrases and compound words
– Lexical ambiguity

• Possible solution
– Approximate string matching

8

[Figure: dictionary-based CLIR process. Source language query → dictionary translation (using a bilingual source-English dictionary and other linguistic resources) → English language query → retrieval in English language database → English result]

9

CLEF

The Cross-Language Evaluation Forum supports global digital library applications by (i) developing an infrastructure for the testing, tuning and evaluation of information retrieval systems operating on European languages in both monolingual and cross-language contexts, and (ii) creating test-suites of reusable data which can be employed by system developers for benchmarking purposes

10

Retrieval system and test data

• Inquery – commercially available

• Probabilistic – i.e. best match, not exact

• “Bag of words” or structured queries

• Used by Finnish partners in their projects

• TEST DATA: CLEF 2001
– 112 000 newspaper articles
– 35 queries (title and description)
– English to English baseline for comparison
– 2 sets
  • Afrikaans/Zulu title
  • Afrikaans/Zulu title and description

11

Afrikaans-English CLIR

• Afrikaans spoken by third largest group in South Africa as first language

• Originated mainly from Dutch

• Germanic language

• Not inflectional

• Good technical vocabulary

• Good resources – e.g. dictionaries, spell checkers, parsers, compound splitters.

12

Methodology : Resources

• Electronic bilingual dictionary
– Filtered commercial dictionary

• Stopword list
– Translated from English and adapted

• Morphological analyzer
– Derived statistically from analysis of large newspaper text body

13

Dictionary Filtering

• Headwords identified by string-based rules

• Alternative spellings separated and listed as separate headwords

• Homonyms: each sense listed as separate headword

• Compounds identified and listed as separate headwords

• Plurals not included, but solved by morph analyzer

• Manual checking and fine-tuning

14

Stopword list

• Translation of existing English stopword list

• Check homonyms, e.g. again = weer = weather

• Large text body – Afrikaans language newspaper articles – 3500 words

• Frequency analysis compared to translated list

• Ad hoc additions

• Accented words added

• N=341
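The comparison of a frequency analysis with the translated list can be sketched roughly as below. This is only an illustration: the function name `stopword_candidates`, the tokenisation rule and the toy sentence are assumptions, not the procedure actually used in the project.

```python
import re
from collections import Counter


def stopword_candidates(text, translated_list, top_n=5):
    """Split the most frequent tokens of a text into those already on the
    translated stopword list and those that are candidates for addition."""
    # Assumption: a token is any run of letters (accented letters included).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    frequent = [word for word, _ in Counter(tokens).most_common(top_n)]
    confirmed = [w for w in frequent if w in translated_list]
    additions = [w for w in frequent if w not in translated_list]
    return confirmed, additions


# Toy data (hypothetical): a short Afrikaans text and a translated stopword list.
sample = "die kat sit op die mat en die hond sit in die huis en die kind speel"
translated = {"die", "en", "in", "op"}
confirmed, additions = stopword_candidates(sample, translated, top_n=3)
print(confirmed, additions)
```

Frequent words that are not on the translated list still need manual checking, since a frequent word is not necessarily a stopword.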

15

Morphological analyser (1)

• Based on patterns in language

• Newspaper text used for manual analysis

• 3500 words sorted by frequency facilitated duplicate removal

• 1200 unique words

16

Morphological analyser: Plurals

• All plural forms manually identified from 1200 words

• 62% of Afrikaans plurals formed by adding -e, -s or -’s to singular

• 13% of plurals have a double vowel in singular and plural is formed by removing one vowel and adding an -e to the end of the word

• Thus 75% of plurals solved by two simple rules
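A minimal sketch of the two plural rules, assuming a set of dictionary headwords is available for checking candidates. The function names and the small headword set are hypothetical, and the double vowels are limited to oo, aa, ee and uu.

```python
DOUBLE_VOWELS = "aeou"  # singulars containing aa, ee, oo or uu


def singular_candidates(word):
    """Generate possible singular forms for a (suspected) plural."""
    candidates = []
    # Rule 1: the plural was formed by adding -'s, -s or -e to the singular.
    for suffix in ("'s", "\u2019s", "s", "e"):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            candidates.append(word[: -len(suffix)])
    # Rule 2: one vowel of a double vowel was dropped and -e was added,
    # e.g. boom -> bome. Undo this by doubling the vowel again.
    if len(word) >= 4 and word.endswith("e"):
        stem = word[:-1]
        vowel, consonant = stem[-2], stem[-1]
        if vowel in DOUBLE_VOWELS and consonant not in "aeiou" and stem[-3] != vowel:
            candidates.append(stem[:-1] + vowel + consonant)
    return candidates


def normalise_plural(word, headwords):
    """Return the first candidate found among the headwords, else None."""
    if word in headwords:
        return word
    for candidate in singular_candidates(word):
        if candidate in headwords:
            return candidate
    return None


headwords = {"boek", "tafel", "foto", "boom", "muur"}  # hypothetical sample
for plural in ("boeke", "tafels", "foto's", "bome", "mure"):
    print(plural, "->", normalise_plural(plural, headwords))
```

Checking each candidate against the headword list keeps the rules from over-generating: a wrong candidate such as "bom" for "bome" is simply not found.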

17

Morphological analyser: Affixes

Manual analysis of text shows

• Past tense indicated by ge- prefix, but sometimes embedded, e.g. aangesteek

• Various suffixes are common: -te, -ste, -er, -ing, -ke, -le, -de, etc.

• Suffix stripping possible by longest common substring (LCS) matching
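Suffix stripping by longest common substring matching could look roughly like the sketch below. The selection rule (longest shared substring, minimum length 3, ties going to the shorter headword) and the sample headwords are assumptions made for illustration only.

```python
def longest_common_substring(a, b):
    """Return the longest contiguous string shared by a and b (dynamic programming)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def lcs_headword(word, headwords, min_len=3):
    """Pick the headword sharing the longest substring with the word."""
    best_head, best_key = None, None
    for head in sorted(headwords):
        common = longest_common_substring(word, head)
        if len(common) < min_len:
            continue
        key = (len(common), -len(head))  # longer overlap first, then shorter headword
        if best_key is None or key > best_key:
            best_head, best_key = head, key
    return best_head


headwords = {"groot", "groet", "werk", "vergader"}  # hypothetical sample
for inflected in ("grootste", "werker", "vergadering"):
    print(inflected, "->", lcs_headword(inflected, headwords))
```

With this rule the suffixes -ste, -er and -ing never need to be listed explicitly; whatever is left over after the shared part is ignored.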

18

Morphological analyser: Compounds

Manual analysis of text shows

• Relatively high occurrence of compounds in Afrikaans - 1%

• Different types of compounds

• With or without fogemorphemes (joining morphemes)

• Only two fogemorphemes identified, namely -s- and -e-
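A minimal sketch of compound splitting with the two fogemorphemes. Note the simplification: it uses plain dictionary lookup on the parts instead of repeated LCS runs, and the headword set and minimum part length are assumptions.

```python
FOGEMORPHEMES = ("", "s", "e")  # no joining morpheme, -s- or -e-


def split_compound(word, headwords, min_part=3):
    """Return the list of headwords making up the word, or None."""
    if word in headwords:
        return [word]
    for i in range(min_part, len(word) - min_part + 1):
        left, rest = word[:i], word[i:]
        if left not in headwords:
            continue
        for joiner in FOGEMORPHEMES:
            # Strip an optional joining morpheme before splitting the remainder.
            if rest.startswith(joiner) and len(rest) - len(joiner) >= min_part:
                parts = split_compound(rest[len(joiner):], headwords, min_part)
                if parts:
                    return [left] + parts
    return None


headwords = {"baba", "kos", "regering", "beleid", "perd", "stal"}  # hypothetical sample
for compound in ("babakos", "regeringsbeleid", "perdestal"):
    print(compound, "->", split_compound(compound, headwords))
```

Each part found this way can then be translated separately, which is how a compound such as babakos ends up as "baby food" in the target query.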

19

Morphological analyser test data: Statistics - solvable

#   Category                                                          N      %
1   Stopwords                                                        150   14,0
2   Headwords                                                        565   52,7
3   oo, aa, ee, uu rule solvable                                      18    1,7
4   e, s, ’s rule (OR Longest Common Substring)                       85    7,9
5   More LCS matching                                                 59    5,5
6   Stripping prefix, e.g. ge-                                        13    1,2
7   Compound splitting (multiple LCS runs + fogemorpheme stripping)   50    4,7
    Total                                                            940   87,7

20

Morphological analyser test data: Statistics – not solvable

#   Category                                   N      %
8   Compounds incorrectly solved               8   0,7%
9   Past tense -ge- embedded in word          26   2,4%
10  Not solvable by morphological analyser    16   1,5%
11  Misspelt in original text                  2   0,2%
    Total                                     52   4,8%
12  Proper nouns                              80   7,5%

21

[Flow chart: processing of an original Afrikaans query key, with Y/N branches. Decision boxes and the actions attached to them:
– Preprocess key (verify character set used: preserve both uppercase and lowercase letters)
– Is the key a stopword? → Remove
– Is the key found as-is (i.e. as a translation dictionary entry)?
– Does the key start with an uppercase letter? → Modify uppercase to lowercase
– Is the key found after removal of the ge- prefix? → Remove the prefix from the word
– Is the key recognized as the plural of a “double vowel singular” case? → Normalize the word to singular form
– Is the key a compound (i.e. decomposable using the LCS method)? → Decompose the word utilizing fogemorphemes
– Modify lowercase to uppercase; is the uppercase form found as-is in the dictionary?
– Unrecognized Afrikaans key → Fuzzy matching (target index) → Most similar words from the English database
– Is the word (or decomposed part) a stop word? → Remove
– Translate using Afr-Eng dictionary → Word (or component) translations in English]

22

Morphological analyser – steps (condensed from flow chart)

• Match words found in dictionary

• Uppercase becomes lower case

• Remove ge- prefix

• Double vowel plural case

• Match longest common subsequence (suffixes as well as compounds solved)

• Modify lower case to uppercase (probably proper noun)

• Fuzzy match “as is” with target language database
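The condensed steps can be read as a cascade in which the first successful step wins. The sketch below is a simplification under several assumptions: the LCS step is left out, Python's `difflib.get_close_matches` stands in for the fuzzy matching actually used, the target index is assumed to be lowercased, and the tiny dictionary and index are hypothetical.

```python
import difflib


def resolve_key(key, dictionary, target_index):
    """Return (step name, result) for one Afrikaans query key."""
    if key in dictionary:
        return "as-is", dictionary[key]
    lower = key.lower()
    if lower in dictionary:
        return "lowercased", dictionary[lower]
    if lower.startswith("ge") and lower[2:] in dictionary:
        return "ge- prefix removed", dictionary[lower[2:]]
    # Double vowel plural, e.g. bome -> boom.
    if len(lower) >= 4 and lower.endswith("e") and lower[-3] in "aeou":
        singular = lower[:-2] + lower[-3] + lower[-2]
        if singular in dictionary:
            return "double vowel plural", dictionary[singular]
    # (The LCS step for suffixes and compounds would go here.)
    capitalised = lower.capitalize()
    if capitalised in dictionary:
        return "capitalised", dictionary[capitalised]
    # Last resort: match the untranslated key against the English index.
    return "fuzzy match", difflib.get_close_matches(lower, target_index, n=3, cutoff=0.6)


dictionary = {"kos": ["food"], "maak": ["make"], "boom": ["tree"],
              "Duitsland": ["Germany"]}            # hypothetical sample
target_index = ["europe", "pesticide", "food"]     # hypothetical English index terms
for key in ("kos", "Kos", "gemaak", "bome", "duitsland", "Europa"):
    print(key, "->", resolve_key(key, dictionary, target_index))
```

The fuzzy step is what lets an untranslatable proper noun such as Europa still reach a near-identical English index term.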

23

Example

Database used: CLEF

English title: Pesticides in Baby Food

Afrikaans source query: Plaagdoders in babakos

English baseline query: #sum(pesticide baby food)

The English target query translated from the Afrikaans source query: #sum(#syn(nullstr lues die van plague plague blight infestation pest affliction vexation killer) #syn( nullstr) #syn( baby food))
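How the translations of each source key end up grouped in the target query can be sketched as follows. The `#sum` and `#syn` operators are taken from the example above; the function name, the removal of duplicate translations and the shortened translation lists are assumptions.

```python
def build_structured_query(translations_per_key):
    """Wrap the translations of each source key in #syn and combine with #sum."""
    groups = []
    for translations in translations_per_key:
        unique = list(dict.fromkeys(translations))  # drop duplicates, keep order
        if unique:
            groups.append("#syn(" + " ".join(unique) + ")")
    return "#sum(" + " ".join(groups) + ")"


# Shortened, hypothetical translation lists for "plaagdoders" and "babakos".
query = build_structured_query([
    ["plague", "plague", "pest", "killer"],
    ["baby", "food"],
])
print(query)  # #sum(#syn(plague pest killer) #syn(baby food))
```

Grouping alternatives per source key keeps a key with many dictionary translations from outweighing a key with only one.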

24

Results

25

Conclusions

• Dictionary probably too large

• Normalizer worked quite well

• Compound splitting by LCS methods mostly successful

• Stopword list adequate

• Results quite promising

26

Zulu-English CLIR

• isiZulu spoken by 8,8 million – largest number of speakers for a single language in SA

• Agglutinative – grammatical information conveyed by attaching pre- and suffixes to roots and stems

• Nouns: Grammatical genders – 8 classes in Zulu with distinctive prefixes in every class for singular and plural forms

• Verbs: Affixes mark grammatical relations such as object, subject, tense, mood, aspect

27

Methodology: Zulu to English

[Figure: Zulu-to-English process. Zulu source query → approximate dictionary matching (monolingual Zulu dictionary) → Zulu base form query → dictionary translation (Zulu-Engl. dictionary) → English query → retrieval in English database (CLEF English database) → English result]

28

Methodology (1)

• Monolingual word list
– No electronic bilingual dictionary

• Approximate matching
– Of all five metric and non-metric similarity measures tested, skipgrams yielded best results
– The Zulu word could be identified within three words 80% of the time
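A simplified sketch of skipgram matching against a monolingual word list. The skip distances (0 and 1), the Jaccard-style overlap score and the small word list are assumptions; the measure used in the experiments may be defined differently.

```python
def skipgrams(word, max_skip=1):
    """Character pairs formed with 0..max_skip characters skipped between them."""
    grams = set()
    for skip in range(max_skip + 1):
        for i in range(len(word) - skip - 1):
            grams.add((skip, word[i], word[i + skip + 1]))
    return grams


def similarity(a, b, max_skip=1):
    """Overlap of the two skipgram sets (shared grams / all grams)."""
    ga, gb = skipgrams(a, max_skip), skipgrams(b, max_skip)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 0.0


def best_matches(word, word_list, top_n=3):
    """Rank the word list by similarity to the query word; keep the top_n."""
    ranked = sorted(word_list, key=lambda w: (-similarity(word, w), w))
    return ranked[:top_n]


word_list = ["izindlu", "izenzo", "isizwe", "imibhalo", "umhlaba"]  # hypothetical sample
print(best_matches("nezindlu", word_list))
```

Keeping the top three candidates mirrors the observation above that the correct word is usually found within three words.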

29

Methodology (2)

• Translations from Zulu source words into English done manually

• Problems experienced in this process
– Paraphrasing due to disparate vocabularies
  E.g. isinyabulala – person weak from age
– Homonyms – single words with various meanings
  E.g. –zwe isizwe izizwe = tribe OR rapidly spreading brain disease

30

Example of paraphrasing

Find documents that describe acts of terrorism or vandalism against European synagogues since the end of the Second World War.

Thola imibhalo echaza = Find scriptures that describe

izenzo zokuphekulazikhuni noma = acts of terror and

izinto ngobudlova elwa = the breaking of things with violent force that fight

nezindlu zesonto zamaJuda = the houses of Sunday of the Jews

ase-Europe kusukela ekupheleni = of Europe from the end

kwezimpi zesibili zomhlaba = of the war of second of the world

31

Analysis of translation problems

[Bar chart: number of occurrences (axis 0 to 60) per error type. Error types: borrowed words, zululisations, proper names, paraphrases, palatalisations, pre-nasalisation, vowel coalescence, vowel elision, homonyms, locatives, conjunctives, verb extensions, enclitic, interrogative. Data labels as extracted: 28, 20, 41, 46, 10, 5, 17, 5, 55, 9, 6, 12, 2, 2]

32

The road forward

• Parsers and morphological analysers in process

• Spellcheckers have extensive word lists

• Increasing web presence of indigenous languages, especially government sites and newspapers, leads to the possibility of parallel corpora

• Cross Cultural Information Retrieval?

33

Conclusions

• Indigenous Knowledge is a valuable resource – it is important to make it accessible

• Learn from international research and create a good product from the outset

• Many opportunities for research

34

Cross Language Information Retrieval (CLIR)

• To provide access in one language to documents written in another language

• Query translation or document translation

• Approaches
– Corpus-based techniques
– Machine translation
– Dictionary-based techniques