1
CLIR: opening up possibilities for indigenous languages in South Africa?
Research team: Erica Cosijn (1), Heikki Keskustalo (2), Ari Pirkola (2), Karen de Wet (1) & Kalervo Järvelin (2)
(1) University of Pretoria, Pretoria, South Africa
(2) University of Tampere, Finland
2
Introduction
• What is CLIR?
• General methodology
• Afrikaans-English CLIR
• Zulu-English CLIR
• The road ahead
• Conclusions
3
What is CLIR?
• The basic idea is to bridge the language boundary by providing access in one language (the source language) to documents written in another language (the target language)
• Source language: the language that gives access to the required information, i.e. the query language
• Target language: the language of the content in the database
4
CLIR (cont.)
• Use CLIR in:
– query translation and/or document translation from the source language.
• Main strategies for query translation
– dictionary-based methods
– corpus-based methods, and
– machine translation
5
CLIR approaches
• Corpus-based methods: work with frequency analysis
– Implication: aboutness of the two collections should be similar
• Machine translation: uses morphological parser etc.
6
CLIR: Machine translation
• Translates source language texts into target language using:
– Translation dictionaries
– Other linguistic resources
– Syntax analysis
• Limited availability
7
CLIR: Dictionary Based
• Problems
– Limitations of dictionaries
– Inflected word forms
– Phrases and compound words
– Lexical ambiguity
• Possible solution
– Approximate string matching
8
[Figure: dictionary-based CLIR process. Source language query -> dictionary translation (using a bilingual source-English dictionary and other linguistic resources) -> English language query -> retrieval in English language database -> English result]
9
CLEF
The Cross-Language Evaluation Forum supports global digital library applications by (i) developing an infrastructure for the testing, tuning and evaluation of information retrieval systems operating on European languages in both monolingual and cross-language contexts, and (ii) creating test-suites of reusable data which can be employed by system developers for benchmarking purposes
10
Retrieval system and test data
• Inquery – commercially available
• Probabilistic – i.e. best match, not exact
• “Bag of words” or structured queries
• Used by Finnish partners in their projects
• TEST DATA: CLEF 2001
– 112 000 newspaper articles
– 35 queries (title and description)
– English to English baseline for comparison
– 2 sets
  • Afrikaans/Zulu title
  • Afrikaans/Zulu title and description
11
Afrikaans-English CLIR
• Afrikaans is spoken as a first language by the third largest group in South Africa
• Originated mainly from Dutch
• Germanic language
• Not inflectional
• Good technical vocabulary
• Good resources – e.g. dictionaries, spell checkers, parsers, compound splitters.
12
Methodology : Resources
• Electronic bilingual dictionary
– Filtered commercial dictionary
• Stopword list
– Translated from English and adapted
• Morphological analyzer
– Derived statistically from analysis of large newspaper text body
13
Dictionary Filtering
• Headwords identified by string-based rules
• Alternative spellings separated and listed as separate headwords
• Homonyms: each sense listed as separate headword
• Compounds identified and listed as separate headwords
• Plurals not included, but solved by morph analyzer
• Manual checking and fine-tuning
14
Stopword list
• Translation of existing English stopword list
• Check homonyms, e.g. again = weer = weather
• Large text body – Afrikaans language newspaper articles – 3500 words
• Frequency analysis compared to translated list
• Ad hoc additions
• Accented words added
• N=341
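The comparison of a frequency analysis against the translated stopword list could be sketched as follows. This is a minimal illustration, not the team's actual tooling: the function names, the tokenisation rule and the `top_n` cut-off are all assumptions.

```python
import re
from collections import Counter


def frequent_words(text, top_n):
    """Return the top_n most frequent word forms in text (lowercased)."""
    # [^\W\d_]+ matches runs of letters, including accented ones.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [word for word, _ in Counter(tokens).most_common(top_n)]


def stopword_candidates(text, translated_list, top_n=50):
    """Frequent words that are missing from the translated stopword list.

    These are only candidates; the slide describes manual, ad hoc additions.
    """
    return [w for w in frequent_words(text, top_n) if w not in translated_list]


sample = "die kat en die hond en die man"
print(stopword_candidates(sample, {"die"}, top_n=2))
```

Words that are frequent in the newspaper text but absent from the translated list would then be reviewed by hand before being added.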
15
Morphological analyser (1)
• Based on patterns in language
• Newspaper text used for manual analysis
• 3500 words sorted by frequency facilitated duplicate removal
• 1200 unique words
16
Morphological analyser: Plurals
• All plural forms manually identified from 1200 words
• 62% of Afrikaans plurals formed by adding -e, -s or -’s to singular
• 13% of plurals have a double vowel in singular and plural is formed by removing one vowel and adding an -e to the end of the word
• Thus 75% of plurals solved by two simple rules
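The two plural rules can be sketched as a small normaliser that proposes singular candidates and accepts the first one found among the dictionary headwords. This is a hedged illustration: the function names, the candidate ordering and the sample headwords are assumptions, not the analyser built by the research team.

```python
DOUBLING_VOWELS = "aeou"  # vowels that occur doubled: aa, ee, oo, uu


def singular_candidates(word):
    """Propose singular forms for a possible Afrikaans plural."""
    word = word.replace("\u2019", "'")  # treat the curly apostrophe like '
    candidates = []
    # Rule 2: double vowel singular, e.g. 'bome' -> 'boom'.
    if len(word) >= 4 and word.endswith("e"):
        stem = word[:-1]
        vowel, final = stem[-2], stem[-1]
        if vowel in DOUBLING_VOWELS and final not in "aeiou" and stem[-3:-2] != vowel:
            candidates.append(stem[:-1] + vowel + final)
    # Rule 1: plural formed by adding -'s, -e or -s to the singular.
    for suffix in ("'s", "e", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            candidates.append(word[: -len(suffix)])
    return candidates


def normalise_plural(word, headwords):
    """Return the first candidate singular that is a known headword, else None."""
    for candidate in singular_candidates(word):
        if candidate in headwords:
            return candidate
    return None


headwords = {"boom", "hond", "motor", "foto"}
for plural in ("bome", "honde", "motors", "foto's"):
    print(plural, "->", normalise_plural(plural, headwords))
```

Checking each candidate against the headword list keeps the rules from over-stripping words that merely happen to end in -e or -s.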
17
Morphological analyser: Affixes
Manual analysis of text shows
• Past tense indicated by ge- prefix, but sometimes embedded, e.g. aangesteek
• Various suffixes are common: -te, -ste, -er, -ing, -ke, -le, -de, etc.
• Suffix stripping possible by longest common substring (LCS) matching
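Suffix stripping by longest common substring matching might look roughly like the sketch below. It uses Python's standard `difflib` to find the longest common substring between a word and each headword, and accepts a headword only when it is matched in full at the start of the word. That acceptance rule, the `min_len` threshold and the function names are assumptions made for illustration; the slides do not specify them.

```python
from difflib import SequenceMatcher


def longest_common_substring(a, b):
    """Return the longest contiguous substring shared by a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a : match.a + match.size]


def strip_suffix(word, headwords, min_len=3):
    """Return the longest headword that the word starts with, else None."""
    best = None
    for head in headwords:
        if len(head) < min_len:
            continue
        common = longest_common_substring(word, head)
        # Accept only a complete headword found at the start of the word,
        # so that what remains is a suffix such as -ste or -ing.
        if common == head and word.startswith(head):
            if best is None or len(head) > len(best):
                best = head
    return best


print(strip_suffix("grootste", {"groot", "gro", "ste"}))
```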
18
Morphological analyser: Compounds
Manual analysis of text shows
• Relatively high occurrence of compounds in Afrikaans - 1%
• Different types of compounds
• With or without fogemorphemes (joining morphemes)
• Only two fogemorphemes identified, namely -s- and -e-
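Compound splitting with the two fogemorphemes can be illustrated with a simple recursive splitter. Note that this sketch uses a greedy longest-headword-first dictionary match as a stand-in for the LCS-based method named in the slides; the function name, the `min_len` threshold and the sample headwords are assumptions.

```python
FOGEMORPHEMES = ("s", "e")  # the two joining morphemes identified in the text


def split_compound(word, headwords, min_len=3):
    """Split word into headwords, allowing -s- or -e- between parts.

    Returns a list of components, or None if no full split is found.
    """
    if word in headwords:
        return [word]
    # Try the longest possible first component first.
    for end in range(len(word) - min_len, min_len - 1, -1):
        head = word[:end]
        if head not in headwords:
            continue
        rest = word[end:]
        # The remainder may start directly or after a fogemorpheme.
        tails = [rest] + [rest[1:] for joiner in FOGEMORPHEMES if rest.startswith(joiner)]
        for tail in tails:
            if len(tail) >= min_len:
                parts = split_compound(tail, headwords, min_len)
                if parts:
                    return [head] + parts
    return None


headwords = {"baba", "kos", "stad", "raad", "hond", "hok"}
for compound in ("babakos", "stadsraad", "hondehok"):
    print(compound, "->", split_compound(compound, headwords))
```

Each component can then be looked up and translated separately, which is how a compound absent from the dictionary can still contribute query terms.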
19
Morphological analyser test data: Statistics - solvable
No.  Category                                                           N      %
1    Stopwords                                                        150   14,0
2    Headwords                                                        565   52,7
3    oo, aa, ee, uu rule solvable                                      18    1,7
4    e, s, ’s rule (OR Longest Common Substring)                       85    7,9
5    More LCS matching                                                 59    5,5
6    Stripping prefix, e.g. ge-                                        13    1,2
7    Compound splitting (multiple LCS runs + fogemorpheme stripping)   50    4,7
     Total                                                            940   87,7
20
Morphological analyser test data: Statistics – not solvable
No.  Category                                   N      %
8    Compounds incorrectly solved               8   0,7%
9    Past tense -ge- embedded in word          26   2,4%
10   Not solvable by morphological analyser    16   1,5%
11   Misspelt in original text                  2   0,2%
     Total                                     52   4,8%
12   Proper nouns                              80   7,5%
21
[Flow chart: processing of an original Afrikaans query key, with Y/N branches. Recoverable boxes:
Start: original Afrikaans query key; preprocess key (verify character set used; preserve both uppercase and lowercase letters)
Decisions: Is the key found as-is (i.e. as a translation dictionary entry)? Does the key start with an uppercase letter? Is the key found after removal of the ge- prefix? Is the key recognized as the plural of a “double vowel singular” case? Is the key a compound (i.e. decomposable using the LCS method)? Is the uppercase form found as-is in the dictionary? Is the key a stopword? Is the word (or decomposed part) a stop word?
Actions: modify uppercase to lowercase; remove the prefix from the word; normalize the word to singular form; decompose the word utilizing fogemorphemes; modify lowercase to uppercase; remove (stopwords); translate using Afr-Eng dictionary; fuzzy matching (target index)
Outputs: word (or component) translations in English; for an unrecognized Afrikaans key, most similar words from the English database]
22
Morphological analyser – steps (condensed from flow chart)
• Match words found in dictionary
• Uppercase becomes lower case
• Remove ge- prefix
• Double vowel plural case
• Match longest common substring (suffixes as well as compounds solved)
• Modify lower case to uppercase (probably proper noun)
• Fuzzy match “as is” with target language database
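The cascade above can be read as an ordered list of transformations, each tried in turn until the key is found. The sketch below covers only the first three steps and leaves everything else as "unrecognised" (where fuzzy matching would take over). The step table, function names, return labels and the point at which stopwords are checked are assumptions for illustration.

```python
def strip_ge(word):
    """Remove a leading ge- prefix; return None when it does not apply."""
    return word[2:] if word.startswith("ge") and len(word) > 4 else None


# Ordered steps: (label, transformation applied to the original key).
STEPS = [
    ("as-is", lambda w: w),
    ("lowercased", str.lower),
    ("ge- prefix removed", lambda w: strip_ge(w.lower())),
]


def analyse_key(key, dictionary, stopwords):
    """Return (label, form) for the first step whose result is recognised."""
    for label, transform in STEPS:
        form = transform(key)
        if form is None:
            continue
        if form in stopwords:
            return ("stopword", form)
        if form in dictionary:
            return (label, form)
    return ("unrecognised", key)  # candidate for fuzzy matching


dictionary = {"werk": ["work"], "kos": ["food"]}
stopwords = {"in", "die"}
for key in ("kos", "Kos", "gewerk", "In", "Pretoria"):
    print(key, "->", analyse_key(key, dictionary, stopwords))
```

Keeping the steps in a list makes the order explicit and lets further rules, such as plural normalisation or compound splitting, be slotted in without changing the control flow.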
23
Example
Database used: CLEF
English title: Pesticides in Baby Food
Afrikaans source query: Plaagdoders in babakos
English baseline query: #sum(pesticide baby food)
The English target query translated from the Afrikaans source query: #sum(#syn(nullstr lues die van plague plague blight infestation pest affliction vexation killer) #syn( nullstr) #syn( baby food))
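A structured query of this shape can be assembled by wrapping the translations of each source key in `#syn(...)` and combining the groups under `#sum(...)`. The sketch below shows only that string assembly. The function name is hypothetical, and it simply drops keys without translations rather than reproducing the `nullstr` placeholder seen in the example.

```python
def build_structured_query(translations_per_key):
    """Build a #sum(#syn(...) ...) query string.

    translations_per_key: one list of English translations per source key.
    Keys with no translations are skipped.
    """
    groups = [
        "#syn(" + " ".join(translations) + ")"
        for translations in translations_per_key
        if translations
    ]
    return "#sum(" + " ".join(groups) + ")"


print(build_structured_query([["plague", "pest", "killer"], [], ["baby", "food"]]))
```

Grouping alternative translations as synonyms keeps a key with many dictionary senses from outweighing a key with only one.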
25
Conclusions
• Dictionary probably too large
• Normalizer worked quite well
• Compound splitting by LCS methods mostly successful
• Stopword list adequate
• Results quite promising
26
Zulu-English CLIR
• isiZulu spoken by 8,8 million – largest number of speakers for a single language in SA
• Agglutinative – grammatical information conveyed by attaching pre- and suffixes to roots and stems
• Nouns: Grammatical genders – 8 classes in Zulu with distinctive prefixes in every class for singular and plural forms
• Verbs: Affixes mark grammatical relations such as object, subject, tense, mood, aspect
27
Methodology: Zulu to English
[Figure: Zulu to English CLIR process. Zulu source query -> approximate dictionary matching (against a monolingual Zulu dictionary) -> Zulu baseform query -> dictionary translation (Zulu-English dictionary) -> English query -> retrieval in English database (CLEF English database) -> English result]
28
Methodology (1)
• Monolingual word list
– No electronic bilingual dictionary
• Approximate matching
– Of all five metric and non-metric similarity measures tested, skipgrams yielded best results
– The Zulu word could be identified within three words 80% of the time
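Approximate matching with skipgrams can be sketched as follows: each word is turned into a set of character pairs, formed both from adjacent characters and from characters one position apart, and words are ranked by the overlap of these sets. The skip distances, the pooling of all pairs into one set and the Jaccard-style overlap score are assumptions; the slides do not state which variant was used.

```python
def skipgrams(word, skips=(0, 1)):
    """Return the set of character pairs with the given skip distances.

    skip 0 gives ordinary adjacent digrams; skip 1 skips one character.
    """
    grams = set()
    for skip in skips:
        step = skip + 1
        for i in range(len(word) - step):
            grams.add(word[i] + word[i + step])
    return grams


def similarity(a, b, skips=(0, 1)):
    """Overlap of the two skipgram sets, between 0.0 and 1.0."""
    grams_a, grams_b = skipgrams(a, skips), skipgrams(b, skips)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def best_matches(query_word, word_list, n=3):
    """Return the n word-list entries most similar to query_word."""
    ranked = sorted(word_list, key=lambda w: similarity(query_word, w), reverse=True)
    return ranked[:n]


word_list = ["isizwe", "umuntu", "inkomo"]
print(best_matches("ezizweni", word_list, n=3))
```

Returning the top three candidates mirrors the evaluation above, where the correct Zulu word was counted as identified if it appeared within three words.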
29
Methodology (2)
• Translations from Zulu source words into English done manually
• Problems experienced in this process
– Paraphrasing due to disparate vocabularies
  E.g. isinyabulala – person weak from age
– Homonyms – single words with various meanings
  E.g. –zwe isizwe izizwe = tribe OR rapidly spreading brain disease
30
Example of paraphrasing
English query: Find documents that describe acts of terrorism or vandalism against European synagogues since the end of the Second World War.

Zulu translation                    Literal back-translation
Thola imibhalo echaza               Find scriptures that describe
izenzo zokuphekulazikhuni noma      acts of terror and
izinto ngobudlova elwa              the breaking of things with violent force that fight
nezindlu zesonto zamaJuda           the houses of Sunday of the Jews
ase-Europe kusukela ekupheleni      of Europe from the end
kwezimpi zesibili zomhlaba          of the war of second of the world
31
Analysis of translation problems
[Figure: bar chart of error types against number of occurrences (axis 0 to 60). Error types and counts, paired in the order extracted: borrowed words 28, zululisations 20, proper names 41, paraphrases 46, palatalisations 10, pre-nasalisation 5, vowel coalescence 17, vowel elision 5, homonyms 55, locatives 9, conjunctives 6, verb extensions 12, enclitic 2, interrogative 2]
32
The road forward
• Parsers and morphological analysers in process
• Spellcheckers have extensive word lists
• Increasing web presence of indigenous languages, especially government sites and newspapers, leads to the possibility of parallel corpora
• Cross Cultural Information Retrieval?
33
Conclusions
• Indigenous Knowledge is a valuable resource – it is important to make it accessible
• Learn from international research and create a good product from the outset
• Many opportunities for research