The METIS Project

Peter Dirix

July 8, 2002

Centre for Computational Linguistics

Katholieke Universiteit Leuven

The METIS Project

EU-sponsored project

Statistical machine translation

Partners: ILSP (Athens), KU Leuven - CCL

Subcontractors: University of Antwerp, KUB (Tilburg)

The METIS Project

MT = Holy Grail of computational linguistics

Since 50s: word-by-word systems

Later: rule-based systems

But: bottleneck was reached long ago

Since 80s: SMT, new techniques

The METIS Project

Disadvantage of SMT: needs large bilingual corpora (bitexts, usually not available)

METIS: uses only large monolingual corpora (widely available)

Furthermore, you need a bilingual lexicon and tag-mapping rules

Only minimal effort for a new language pair

The METIS Project

First language pairs: Dutch-English and Modern Greek-English

In this internship: creation of lexical resources for Dutch and British English

This means: monolingual corpora for Dutch and English and a bilingual Dutch-English lexicon

Resources conform to PAROLE/EAGLES standards

Creation of tag-mapping rules

Corpora: Dutch

No extensive written Dutch corpus

Take parts of Corpus Spoken Dutch (CGN) consisting of read-aloud written text

Add written-text parts of Eindhoven corpus (newspaper texts of 60s and 70s)

Tilburg corpus (recent newspaper texts) is not available

Corpora: Dutch

Together:

- CGN: 1,580,000 words (out of 10 million)

- Eindhoven: 600,000 words (out of 720,000)

CGN has CGN tag set, Eindhoven has WOTAN tagset

Corpora: English

British National Corpus (BNC) is largest available text corpus for British English

About 100 million words

Tagged with CLAWS5 tagset

About 2 million words get enriched tagset (CLAWS6)

Very good tagging quality

Bilingual Lexicon: The Search

Criteria: correctness, generality, availability, cost, number of words

Our choice: combination

Dutch EuroWordNet & Ergane

Dutch EuroWordNet

Entry not given per word, but per synset (set of synonymous words)

About 45,000 synsets

Gives language-internal (semantic) relations, part of speech and equivalence link (translation) to American WordNet 1.5

Fairly cheap (about 440 €)

Ergane

Multilingual Internet dictionaries

Uses Esperanto as interlingua

Dutch-English pair was available on the net

Contains about 50,000 translations

Free

Corpora

CGN: only punctuation needs to be reinserted

Eindhoven corpus: will be retagged with CGN tags and lemmatized

BNC: needs to be lemmatized

Tasks will be performed by Antwerp/Tilburg group

Format of bilingual lexicon

An Excel format was agreed upon

But the lexicon is too big (Excel only allows 64K rows)

So a text file with 3 fields per line: Dutch lemma, English translation (only one per line) and PoS

Fields separated by tabs
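
The three-field format can be illustrated with a short reader. This is a minimal sketch in Python, not the project's own tooling; the names `LexiconEntry` and `read_lexicon` and the sample lines are assumptions made for illustration.

```python
from typing import NamedTuple


class LexiconEntry(NamedTuple):
    """One lexicon line: Dutch lemma, a single English translation, PoS."""
    dutch_lemma: str
    english: str
    pos: str


def read_lexicon(lines):
    """Parse tab-separated lexicon lines into LexiconEntry records."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(f"expected 3 tab-separated fields: {line!r}")
        entries.append(LexiconEntry(*fields))
    return entries


# Hypothetical sample lines in the agreed format.
sample = [
    "aanbesteding\ttender\tn\n",
    "aanbesteding\tpublic tender\tn\n",
]
entries = read_lexicon(sample)
```

Because each line holds only one translation, a Dutch lemma with several translations simply appears on several lines.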

Dutch EuroWordNet

Extract information from WordNet files, using Perl scripts

Two WN files needed: the Dutch WordNet (DWN) and the Interlingual Index (ILI)

DWN refers to ILI, using eq_synonym and eq_near_synonym links to translations

Information of both lists was combined, using Perl scripts

PoS is also extracted from DWN

File in text format of target dictionary, about 100,000 lines
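
The combination of DWN and ILI can be sketched as a simple join. The project used Perl scripts on the actual WordNet files; the Python below is only an illustration, and the in-memory structures (`dwn_synsets`, `ili`), the identifier `ili-1` and the function `dwn_to_lexicon` are hypothetical stand-ins, not the real file formats.

```python
# Equivalence relations that the slides name as links to translations.
EQ_RELATIONS = {"eq_synonym", "eq_near_synonym"}


def dwn_to_lexicon(dwn_synsets, ili):
    """Join Dutch synsets with ILI records into (Dutch, English, PoS) triples.

    dwn_synsets: list of dicts with keys "lemmas", "pos", "eq_links"
                 (hypothetical structure)
    ili:         dict mapping an ILI identifier to its English words
                 (hypothetical structure)
    """
    entries = []
    for synset in dwn_synsets:
        for relation, ili_id in synset["eq_links"]:
            if relation not in EQ_RELATIONS:
                continue  # ignore other equivalence relations
            for english in ili.get(ili_id, []):
                for dutch in synset["lemmas"]:
                    entries.append((dutch, english, synset["pos"]))
    return entries


# Hypothetical sample data.
dwn_synsets = [
    {
        "lemmas": ["huis", "woning"],
        "pos": "n",
        "eq_links": [("eq_synonym", "ili-1"), ("eq_has_hyperonym", "ili-2")],
    }
]
ili = {"ili-1": ["house"], "ili-2": ["building"]}

triples = dwn_to_lexicon(dwn_synsets, ili)
```

Every word in a Dutch synset is paired with every English word of the linked ILI record, which would explain why about 45,000 synsets can yield roughly 100,000 lines.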

Ergane

Contains information in this form:

aanbesteding: 1. tender | 2. public tender | 3. tender | 4. tender | 5. tender<BR>

Contains HTML tags: removed by Perl script; same for colons, numbers and bars

Each translation put in different entry

PoS is automatically assigned: n

File in text format of target dictionary, about 50,000 lines
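
The clean-up of an Ergane line can be sketched as follows. The original work was done with a Perl script; this Python version is an assumption-laden illustration, and the function name `parse_ergane_line` is hypothetical. It strips HTML tags, numbering and bars, emits one entry per translation, and assigns the default PoS `n`.

```python
import re


def parse_ergane_line(line, default_pos="n"):
    """Turn one Ergane line into (Dutch, English, PoS) entries."""
    # Remove HTML tags such as <BR>.
    text = re.sub(r"<[^>]+>", "", line).strip()
    if ":" not in text:
        return []
    # The headword is everything before the first colon.
    headword, _, rest = text.partition(":")
    headword = headword.strip()
    entries = []
    # Translations are separated by bars and prefixed with "1.", "2.", ...
    for chunk in rest.split("|"):
        translation = re.sub(r"^\s*\d+\.\s*", "", chunk).strip()
        if translation:
            entries.append((headword, translation, default_pos))
    return entries


line = ("aanbesteding: 1. tender | 2. public tender | 3. tender "
        "| 4. tender | 5. tender<BR>")
entries = parse_ergane_line(line)
```

Repeated translations are deliberately kept at this stage, since duplicate entries are removed later when the lexica are merged and sorted.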

Compiling one lexicon

Two lexica were merged into one file

Unix command-line program sort was used to put the list into alphabetical order and to remove duplicate entries

File with about 117,000 lines

Typos were corrected manually

Wrong translations were deleted
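
The merge-and-sort step is roughly equivalent to the sketch below. The project used the Unix `sort` program; this Python function (`merge_lexica`, a hypothetical name) only mimics the effect of sorting and dropping duplicate lines. Python orders strings by code point, so the result may differ from a locale-aware `sort` for accented or upper-case words.

```python
def merge_lexica(*lexica):
    """Merge several lists of lexicon lines, sort them, drop duplicates."""
    merged = set()
    for lexicon in lexica:
        for line in lexicon:
            line = line.rstrip("\n")
            if line:
                merged.add(line)  # a set keeps each distinct line once
    return sorted(merged)


# Hypothetical sample lines from the two source lexica.
from_dwn = ["huis\thouse\tn", "aanbesteding\ttender\tn"]
from_ergane = ["aanbesteding\ttender\tn", "aanbesteding\tpublic tender\tn"]

merged = merge_lexica(from_dwn, from_ergane)
```

Only lines that are identical in all three fields are collapsed, so the same Dutch-English pair with two different PoS values survives as two lines and still needs the manual correction described on these slides.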

Compiling one lexicon

PoS was corrected manually, also the ones introduced in Ergane

Collocations were moved to a separate file (PoS determined by use)

Difference in PoS between lexicon and CGN will be handled later in the project

Complete lexicon covers 115,756 lines

Tag-mapping rules

CGN tags purely on a word basis

Lemmatization to base form

Tag = list of lexical and morpho-syntactic features

Always includes PoS

Tag-mapping rules

BNC: CLAWS6 tagset is chosen

Tagset also on a grammatical basis, but includes some semantics (e.g. names of months, …)

More general tag subsumes less general one

Tag-mapping rules

For each PoS category, map features and values from Dutch to English

E.g.: N(eigen,mv,*) → NP, NP2, NPD2, NPM2

74 rules were constructed, sometimes to multiple-tag categories in English

Not implemented yet, because the MATLAB environment was not ready
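
One way to represent such a rule is as a pattern over CGN features with `*` as a wildcard. The sketch below is not the project's implementation (which was planned for MATLAB); it assumes the slide's example reads as N(eigen,mv,*) mapping to NP, NP2, NPD2 and NPM2, and the helper names `parse_tag`, `matches` and `map_tag` are hypothetical.

```python
import re

# A tag looks like POS(feature,feature,...).
TAG_RE = re.compile(r"^(\w+)\((.*)\)$")

# Single illustrative rule, as read from the slide's example (assumption).
RULES = [
    ("N(eigen,mv,*)", ["NP", "NP2", "NPD2", "NPM2"]),
]


def parse_tag(tag):
    """Split a tag into its PoS and its list of feature values."""
    match = TAG_RE.match(tag)
    if not match:
        return tag, []
    pos, features = match.groups()
    if not features:
        return pos, []
    return pos, [f.strip() for f in features.split(",")]


def matches(pattern, tag):
    """True if the tag fits the pattern; '*' matches any feature value."""
    p_pos, p_feats = parse_tag(pattern)
    t_pos, t_feats = parse_tag(tag)
    if p_pos != t_pos or len(p_feats) != len(t_feats):
        return False
    return all(p == "*" or p == t for p, t in zip(p_feats, t_feats))


def map_tag(tag, rules=RULES):
    """Return the English tags of the first matching rule, or an empty list."""
    for pattern, targets in rules:
        if matches(pattern, tag):
            return targets
    return []


# Plural proper noun (tag value chosen for illustration).
plural_proper = map_tag("N(eigen,mv,basis)")
```

Returning a list of target tags reflects the slide's remark that a Dutch tag sometimes maps to a multiple-tag category in English.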

Conclusion

Lexical resources and tag-mapping rules needed for METIS were constructed

Not easy to get appropriate resources

Problems in the future:

* generality of tag-mapping rules

* adjacency of collocations and separable verbs in Dutch
