The METIS Project Peter Dirix July 8, 2002 Centre for Computational Linguistics Katholieke Universiteit Leuven

Page 1: The METIS Project

The METIS Project

Peter Dirix

July 8, 2002

Centre for Computational Linguistics

Katholieke Universiteit Leuven

Page 2: The METIS Project

The METIS Project

EU-sponsored project

Statistical machine translation

Partners: ILSP (Athens), KU Leuven - CCL

Subcontractors: University of Antwerp, KUB (Tilburg)

Page 3: The METIS Project

The METIS Project

MT = Holy Grail of computational linguistics

Since the 1950s: word-by-word systems

Later: rule-based systems

But: a bottleneck was reached long ago

Since the 1980s: SMT and other new techniques

Page 4: The METIS Project

The METIS Project

Disadvantage of SMT: it needs large bilingual corpora (bitexts, usually not available)

METIS: uses only large monolingual corpora (widely available)

In addition, a bilingual lexicon and tag-mapping rules are needed

Only minimal effort for a new language pair

Page 5: The METIS Project

The METIS Project

First language pairs: Dutch-English and Modern Greek-English

In this internship: creation of lexical resources for Dutch and British English

This means: monolingual corpora for Dutch and English, and a bilingual Dutch-English lexicon

Resources conform to the PAROLE/EAGLES standards

Creation of tag-mapping rules

Page 6: The METIS Project

Corpora: Dutch

No extensive written Dutch corpus

Take the parts of the Spoken Dutch Corpus (Corpus Gesproken Nederlands, CGN) that consist of read-aloud written text

Add the written-text parts of the Eindhoven corpus (newspaper texts from the 1960s and 1970s)

Tilburg corpus (recent newspaper texts) is not available

Page 7: The METIS Project

Corpora: Dutch

Together:

- CGN: 1,580,000 words (out of 10 million)

- Eindhoven: 600,000 words (out of 720,000)

CGN has CGN tag set, Eindhoven has WOTAN tagset

Page 8: The METIS Project

Corpora: English

The British National Corpus (BNC) is the largest available text corpus for British English

About 100 million words

Tagged with CLAWS5 tagset

About 2 million words are tagged with an enriched tagset (CLAWS6)

Very good tagging quality

Page 9: The METIS Project

Bilingual Lexicon: The Search

Criteria: correctness, generality, availability, cost, number of words

Our choice: combination

Dutch EuroWordNet & Ergane

Page 10: The METIS Project

Dutch EuroWordNet

Entry not given per word, but per synset (set of synonymous words)

About 45,000 synsets

Gives language-internal (semantic) relations, part of speech and equivalence link (translation) to American WordNet 1.5

Fairly cheap (about €440)

Page 11: The METIS Project

Ergane

Multilingual Internet dictionaries

Uses Esperanto as interlingua

Dutch-English pair was available on the net

Contains about 50,000 translations

Free

Page 12: The METIS Project

Corpora

CGN: only punctuation needs to be reinserted

Eindhoven corpus: will be retagged with CGN tags and lemmatized

BNC: needs to be lemmatized

Tasks will be performed by Antwerp/Tilburg group

Page 13: The METIS Project

Format of bilingual lexicon

An Excel format was agreed upon

But the lexicon is too big (Excel allows only 64K rows)

So: a text file with 3 fields per line (Dutch lemma, English translation (only one per line) and PoS)

Fields separated by tabs
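A minimal sketch of reading and writing one line of this three-field, tab-separated format. The original tooling is not shown on the slides; Python is used here for illustration only, and the names `LexiconEntry`, `parse_line` and `format_entry`, as well as the sample PoS value `n`, are assumptions.

```python
from typing import NamedTuple


class LexiconEntry(NamedTuple):
    """One lexicon line: Dutch lemma, a single English translation, PoS."""
    dutch_lemma: str
    english: str
    pos: str


def parse_line(line: str) -> LexiconEntry:
    """Split one tab-separated line into its three fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        raise ValueError(f"expected 3 tab-separated fields, got {len(fields)}")
    return LexiconEntry(*fields)


def format_entry(entry: LexiconEntry) -> str:
    """Serialize an entry back to the tab-separated form (no newline)."""
    return "\t".join(entry)


# Hypothetical sample line, following the field order given on the slide.
entry = parse_line("aanbesteding\ttender\tn\n")
print(entry.dutch_lemma, entry.english, entry.pos)
```

Because each line holds only one translation, a Dutch lemma with several translations simply appears on several lines.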

Page 14: The METIS Project

Dutch EuroWordNet

Extract information from WordNet files, using Perl scripts

Two WN files needed: the Dutch WordNet (DWN) and the Interlingual Index (ILI)

DWN refers to ILI, using eq_synonym and eq_near_synonym links to translations

Information of both lists was combined, using Perl scripts

PoS is also extracted from DWN

File in text format of target dictionary, about 100,000 lines

Page 15: The METIS Project

Ergane

Contains information in this form:

aanbesteding: 1. tender | 2. public tender | 3. tender | 4. tender | 5. tender<BR>

Contains HTML tags: removed by Perl script; same for colons, numbers and bars

Each translation put in different entry

PoS is automatically assigned: n

File in text format of target dictionary, about 50,000 lines
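The cleanup described on this slide can be sketched as follows. The project used a Perl script that is not reproduced in the slides, so this Python version is only an approximation of the steps listed (strip HTML tags, colons, numbers and bars; one entry per translation; PoS set to `n`). The function name `ergane_to_entries` is an assumption.

```python
import re


def ergane_to_entries(raw_line: str, pos: str = "n") -> list[tuple[str, str, str]]:
    """Turn one Ergane line into (Dutch lemma, English translation, PoS) tuples."""
    # Remove HTML tags such as <BR>.
    text = re.sub(r"<[^>]+>", "", raw_line).strip()
    # The headword comes before the first colon.
    headword, _, rest = text.partition(":")
    entries = []
    # Translations are separated by bars and prefixed with "1.", "2.", ...
    for part in rest.split("|"):
        translation = re.sub(r"^\s*\d+\.\s*", "", part).strip()
        if translation:
            entries.append((headword.strip(), translation, pos))
    return entries


sample = "aanbesteding: 1. tender | 2. public tender | 3. tender | 4. tender | 5. tender<BR>"
for lemma, translation, tag in ergane_to_entries(sample):
    print(f"{lemma}\t{translation}\t{tag}")
```

This sketch deliberately keeps repeated translations (here, `tender` several times); duplicates are only removed in the later sorting step.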

Page 16: The METIS Project

Compiling one lexicon

Two lexica were merged into one file

Unix command-line program sort was used to put the list into alphabetical order and to remove duplicate entries

File with about 117,000 lines

Typos were corrected manually

Wrong translations were deleted
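The merge-and-sort step can be approximated in a few lines. The slide only says that the Unix `sort` program was used; the sketch below is a Python stand-in under that description, and `merge_lexica` is a hypothetical name. Note that Python's default string ordering is by code point, which may differ from the locale-dependent order `sort` produces.

```python
def merge_lexica(*lexica: list[str]) -> list[str]:
    """Merge several lists of lexicon lines, dropping exact duplicates."""
    merged = set()
    for lexicon in lexica:
        for line in lexicon:
            line = line.rstrip("\n")
            if line:  # skip empty lines
                merged.add(line)
    # Sort the unique lines (code-point order, not locale-aware).
    return sorted(merged)


# Hypothetical miniature lexica in the three-field tab-separated format.
from_wordnet = ["boek\tbook\tn\n", "aanbesteding\ttender\tn\n"]
from_ergane = ["aanbesteding\ttender\tn\n", "aanbesteding\tpublic tender\tn\n"]
for line in merge_lexica(from_wordnet, from_ergane):
    print(line)
```

Only lines that are identical in all three fields collapse into one, so the same lemma-translation pair with two different PoS values survives as two lines and has to be resolved by hand.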

Page 17: The METIS Project

Compiling one lexicon

PoS was corrected manually, including the PoS values introduced for the Ergane entries

Collocations were moved to a separate file (PoS determined by use)

Difference in PoS between lexicon and CGN will be handled later in the project

Complete lexicon contains 115,756 lines

Page 18: The METIS Project

Tag-mapping rules

CGN tags are assigned purely on a word basis

Lemmatization to base form

Tag = list of lexical and morpho-syntactic features

Always includes the PoS

Page 19: The METIS Project

Tag-mapping rules

BNC: CLAWS6 tagset is chosen

Also a grammatically based tagset, but it includes some semantics (e.g. names of months, …)

More general tag subsumes less general one

Page 20: The METIS Project

Tag-mapping rules

For each PoS category, map features and values from Dutch to English

E.g.: N(eigen,mv,*) -> NP, NP2, NPD2, NPM2

74 rules were constructed, sometimes mapping to multiple tags in English

Not implemented yet, because the MATLAB environment was not ready
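One way to read the example rule is as a pattern over CGN features, where `*` matches any value, paired with a list of candidate English tags. The sketch below illustrates that reading only; the slides state that the rules were not yet implemented, so the rule representation, the function names, and the assumption that `*` stands for exactly one feature slot are all hypothetical.

```python
def parse_tag(tag: str) -> tuple[str, list[str]]:
    """Split a tag like 'N(eigen,mv,basis)' into its PoS and feature list."""
    pos, _, rest = tag.partition("(")
    features = rest.rstrip(")").split(",") if rest else []
    return pos, features


def matches(pattern: str, tag: str) -> bool:
    """True if the tag fits the pattern; '*' matches any single feature value."""
    p_pos, p_feats = parse_tag(pattern)
    t_pos, t_feats = parse_tag(tag)
    if p_pos != t_pos or len(p_feats) != len(t_feats):
        return False
    return all(p == "*" or p == t for p, t in zip(p_feats, t_feats))


# Only the rule quoted on the slide; the other rules are not listed there.
RULES = [("N(eigen,mv,*)", ["NP", "NP2", "NPD2", "NPM2"])]


def map_tag(cgn_tag: str, rules=RULES) -> list[str]:
    """Return the English tags of the first matching rule, or an empty list."""
    for pattern, english_tags in rules:
        if matches(pattern, cgn_tag):
            return english_tags
    return []


# Illustrative input: a plural proper noun in CGN-style notation.
print(map_tag("N(eigen,mv,basis)"))
```

A rule that maps to several English tags, as here, leaves the final choice to a later stage.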

Page 21: The METIS Project

Conclusion

Lexical resources and tag-mapping rules needed for METIS were constructed

Not easy to get appropriate resources

Problems in the future:

* generality of tag-mapping rules

* adjacency of collocations and separable verbs in Dutch