The METIS Project
Peter Dirix
July 8, 2002
Centre for Computational Linguistics
Katholieke Universiteit Leuven
The METIS Project
EU-sponsored project
Statistical machine translation
Partners: ILSP (Athens), KU Leuven - CCL
Subcontractors: University of Antwerp, KUB (Tilburg)
The METIS Project
MT = Holy Grail of computational linguistics
Since 50s: word-by-word systems
Later: rule-based systems
But: bottleneck reached long ago
Since 80s: SMT, new techniques
The METIS Project
Disadvantage of SMT: needs large bilingual corpora (bitexts, usually not available)
METIS: uses only large monolingual corpora (widely available)
In addition, a bilingual lexicon and tag-mapping rules are needed
Only minimal effort for a new language pair
The METIS Project
First language pairs: Dutch-English and Modern Greek-English
In this internship: creation of lexical resources for Dutch and British English
This means: monolingual corpora for Dutch and English and a bilingual lexicon Dutch-English
Resources conform to PAROLE/EAGLES standards
Creation of tag-mapping rules
Corpora: Dutch
No extensive written Dutch corpus
Take parts of the Spoken Dutch Corpus (Corpus Gesproken Nederlands, CGN) consisting of read-aloud written text
Add written-text parts of Eindhoven corpus (newspaper texts of 60s and 70s)
Tilburg corpus (recent newspaper texts) is not available
Corpora: Dutch
Together:
- CGN: 1,580,000 words (out of 10 million)
- Eindhoven: 600,000 words (out of 720,000)
CGN has the CGN tagset, Eindhoven has the WOTAN tagset
Corpora: English
British National Corpus (BNC) is largest available text corpus for British English
About 100 million words
Tagged with CLAWS5 tagset
About 2 million words get enriched tagset (CLAWS6)
Very good tagging quality
Bilingual Lexicon: The Search
Criteria: correctness, generality, availability, cost, number of words
Our choice: combination
Dutch EuroWordNet & Ergane
Dutch EuroWordNet
Entry not given per word, but per synset (set of synonymous words)
About 45,000 synsets
Gives language-internal (semantic) relations, part of speech and equivalence link (translation) to American WordNet 1.5
Fairly cheap (about 440 €)
Ergane
Multilingual Internet dictionaries
Uses Esperanto as interlingua
Dutch-English pair was available on the net
Contains about 50,000 translations
Free
Corpora
CGN: only punctuation needs to be reinserted
Eindhoven corpus: will be retagged with CGN tags and lemmatized
BNC: needs to be lemmatized
Tasks will be performed by Antwerp/Tilburg group
Format of bilingual lexicon
An Excel format was agreed upon
But the lexicon was too big (Excel only allows 64K lines)
So: text file with 3 fields per line (Dutch lemma, English translation and PoS), with only one translation per line
Fields separated by tabs
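As an illustration of this three-field, tab-separated layout, here is a minimal Python sketch of a reader and writer for one lexicon line. The names LexEntry, parse_lexicon_line and format_lexicon_line are hypothetical and not taken from the project; the field order (Dutch lemma, English translation, PoS) follows the slide.

```python
from typing import NamedTuple


class LexEntry(NamedTuple):
    """One lexicon line: Dutch lemma, a single English translation, PoS."""
    dutch_lemma: str
    english: str
    pos: str


def parse_lexicon_line(line: str) -> LexEntry:
    """Split one tab-separated line into its three fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        raise ValueError(f"expected 3 tab-separated fields, got {len(fields)}")
    return LexEntry(*fields)


def format_lexicon_line(entry: LexEntry) -> str:
    """Write an entry back in the same tab-separated layout."""
    return "\t".join(entry)
```

A lemma with several translations then simply occupies several lines, one per translation.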
Dutch EuroWordNet
Extract information from WordNet files, using Perl scripts
Two WN files needed: the Dutch WordNet (DWN) and the Interlingual Index (ILI)
DWN refers to ILI, using eq_synonym and eq_near_synonym links to translations
Information of both lists was combined, using Perl scripts
PoS is also extracted from DWN
File in text format of target dictionary, about 100,000 lines
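The project used Perl scripts for this step. The following Python sketch only illustrates the join logic described on this slide. It assumes the two files have already been read into simple in-memory structures: the record layout, the keys "lemmas", "pos" and "eq_links", and the identifiers in the usage note are hypothetical and do not reflect the actual EuroWordNet file format.

```python
# Equivalence relations named on the slide; other link types are ignored.
EQ_RELATIONS = {"eq_synonym", "eq_near_synonym"}


def build_entries(dwn_synsets, ili):
    """Join Dutch synsets with ILI records into (lemma, translation, pos) tuples.

    dwn_synsets: list of dicts with keys "lemmas", "pos" and "eq_links",
                 where "eq_links" is a list of (relation, ili_id) pairs
                 (hypothetical layout).
    ili:         dict mapping an ILI id to a list of English words
                 (hypothetical layout).
    """
    entries = []
    for synset in dwn_synsets:
        for relation, ili_id in synset["eq_links"]:
            if relation not in EQ_RELATIONS:
                continue
            for english in ili.get(ili_id, []):
                # Every Dutch synonym in the synset gets every translation.
                for lemma in synset["lemmas"]:
                    entries.append((lemma, english, synset["pos"]))
    return entries
```

For example, with invented toy data, a synset {"lemmas": ["hond"], "pos": "n", "eq_links": [("eq_synonym", "ili-1")]} and an index {"ili-1": ["dog"]} yield the single entry ("hond", "dog", "n"). Because each synonym is paired with each translation, the output has more lines than there are synsets.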
Ergane
Contains information in this form:
aanbesteding: 1. tender | 2. public tender | 3. tender | 4. tender | 5. tender<BR>
Contains HTML tags: removed by Perl script; same for colons, numbers and bars
Each translation put in different entry
PoS is automatically assigned: n (noun)
File in text format of target dictionary, about 50,000 lines
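The clean-up itself was done with a Perl script. As a rough Python sketch of the same steps (strip HTML tags, drop the colon, numbering and bars, emit one entry per translation, assign the default PoS n), under the assumption that every raw line looks like the sample shown on this slide; the function name parse_ergane_line is hypothetical:

```python
import re


def parse_ergane_line(line, default_pos="n"):
    """Turn one raw Ergane line into (lemma, translation, pos) tuples."""
    # Remove HTML tags such as <BR>.
    line = re.sub(r"<[^>]+>", "", line).strip()
    # The headword is everything before the first colon.
    headword, sep, rest = line.partition(":")
    if not sep:
        return []
    entries = []
    for part in rest.split("|"):
        # Drop the leading "1. ", "2. ", ... numbering.
        translation = re.sub(r"^\s*\d+\.\s*", "", part).strip()
        if translation:
            entries.append((headword.strip(), translation, default_pos))
    return entries
```

Applied to the sample line, this gives five entries for aanbesteding, four of them with the translation tender and one with public tender. Repeated translations are deliberately kept at this stage; they disappear when duplicates are removed from the merged lexicon.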
Compiling one lexicon
Two lexica were merged into one file
Unix command-line program sort was used to put the list into alphabetical order and to remove duplicate entries
File with about 117,000 lines
Typos were corrected manually
Wrong translations were deleted
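The merge, sort and duplicate-removal step was done with the Unix sort program. A Python equivalent could look like the sketch below; merge_lexica is a hypothetical name, and the ordering here is by Unicode code point, which may differ from the locale-dependent order that sort produces.

```python
def merge_lexica(*lexica):
    """Merge several lists of lexicon lines, drop exact duplicates, and sort."""
    unique_lines = set()
    for lexicon in lexica:
        for line in lexicon:
            line = line.rstrip("\n")
            if line:  # skip empty lines
                unique_lines.add(line)
    return sorted(unique_lines)
```

Only lines that are identical in all three fields count as duplicates, so the same translation pair with two different PoS values survives as two lines and has to be resolved by hand.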
Compiling one lexicon
PoS was corrected manually, including the tags assigned automatically to the Ergane entries
Collocations were moved to a separate file (PoS determined by use)
Differences in PoS between lexicon and CGN will be handled later in the project
Complete lexicon contains 115,756 lines
Tag-mapping rules
CGN tags purely on a word basis
Lemmatization to base form
Tag = list of lexical and morpho-syntactic features
Always includes PoS
Tag-mapping rules
BNC: CLAWS6 tagset is chosen
Tagset also on a grammatical basis, but includes some semantics (e.g. names of months, …)
More general tag subsumes less general one
Tag-mapping rules
For each PoS category, map features and values from Dutch to English
E.g.: N(eigen,mv,*) → NP, NP2, NPD2, NPM2
74 rules were constructed, sometimes mapping to multiple tags in English
Not implemented yet, because the MATLAB environment was not ready
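Since the rules were not yet implemented at the time of the talk, the following Python sketch is only one possible way to represent and apply such a rule, not the project's implementation. It uses the example rule from this slide and assumes that a trailing * in a CGN pattern matches any remaining feature values; the function names and the rule-table layout are hypothetical.

```python
def parse_cgn_tag(tag):
    """Split a CGN tag such as 'N(eigen,mv,basis)' into PoS and feature list."""
    pos, _, rest = tag.partition("(")
    features = [f for f in rest.rstrip(")").split(",") if f]
    return pos, features


def matches(pattern, tag):
    """Check a CGN tag against a pattern; a trailing '*' matches the rest."""
    p_pos, p_feats = parse_cgn_tag(pattern)
    t_pos, t_feats = parse_cgn_tag(tag)
    if p_pos != t_pos:
        return False
    if p_feats and p_feats[-1] == "*":
        prefix = p_feats[:-1]
        return t_feats[:len(prefix)] == prefix
    return p_feats == t_feats


# Rule table: CGN pattern -> candidate English tags (example rule from the slide).
RULES = [
    ("N(eigen,mv,*)", ["NP", "NP2", "NPD2", "NPM2"]),
]


def map_tag(tag, rules=RULES):
    """Return the English tags of the first matching rule, or an empty list."""
    for pattern, targets in rules:
        if matches(pattern, tag):
            return targets
    return []
```

With this table, map_tag("N(eigen,mv,basis)") returns the four English candidate tags, while a singular proper noun such as N(eigen,ev,basis,zijd,stan) matches no rule and returns an empty list. Returning a list keeps the one-to-many character of the mapping visible, so a later step can choose among the candidates.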
Conclusion
Lexical resources and tag-mapping rules needed for METIS were constructed
Not easy to get appropriate resources
Problems in the future:
* generality of tag-mapping rules
* adjacency of collocations and separable verbs in Dutch