
On-line Compilation of Comparable Corpora and Their Evaluation

Radu ION, Dan TUFIŞ, Tiberiu BOROŞ, Alexandru CEAUŞU and Dan ŞTEFĂNESCU

Research Institute for Artificial Intelligence (RACAI)

FASSBL-7, Dubrovnik, Croatia

October 4-6, 2010

Introduction

Multilingual Comparable Corpora (MCC) are usually easier to find and gather than parallel corpora

There are many types of MCC, distinguished by their degree of relatedness: strongly comparable, weakly comparable, very non-parallel, etc.

Our working definition (Munteanu & Marcu, 2006): a set of paired documents that, even though they are not translations of one another, are related and convey overlapping information

For instance, news about your favorite local football team suffering a defeat last night

Document pairing in MCC

It is very important to acknowledge that, in order to use large MCC, we need to pair documents from the source and target languages

Suppose that we gather some type of news corpora (sports, for instance) in two languages by streaming news sites in those languages

Suppose that we do not keep the individual documents but join them into one large document per language

Now, if the source and target documents have 1M words each (a very optimistic scenario), we will need at least $10^6 \times 10^6 = 10^{12}$ operations to word-align the documents!

But if we had 1000 documents of 1000 words each (in each of the languages) and managed to align the documents first, we would need only $1000 \times 1000^2 = 10^9$ operations
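
The two estimates above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the function name is made up for illustration, and it assumes one operation per source-target word pair, which is the cost model implied by the figures above. The cost of pairing the documents themselves is not counted.

```python
def word_alignment_ops(num_doc_pairs: int, words_per_doc: int) -> int:
    """Word pairs to compare when every document pair is aligned exhaustively.

    Assumes both documents of a pair have `words_per_doc` words and that
    each source word is compared with each target word exactly once.
    """
    return num_doc_pairs * words_per_doc * words_per_doc


# One merged 1M-word document per language.
merged = word_alignment_ops(1, 1_000_000)

# 1000 already-paired documents of 1000 words each.
paired = word_alignment_ops(1_000, 1_000)

print(f"merged: {merged:.0e}, paired: {paired:.0e}, saving: {merged // paired}x")
```

Under these assumptions, pairing the documents first reduces the word-alignment work by a factor of 1000.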

Wikipedia as an MCC

Wikipedia is an extremely valuable resource in that it is a free collection of (generally) good-quality articles that have versions in many languages

Many Wikipedia articles are linked to their versions in other languages, a feature that makes Wikipedia an inherently large MCC

English Wikipedia has 3,431,874 articles; Romanian Wikipedia has 150,797 articles

We have employed two different strategies for building MCC from Wikipedia:

using Romanian “quality articles” (very good articles that are complete, well written and approved by senior Wikipedia administrators)

using Princeton English WordNet (to be explained…)

MCC from Wikipedia quality articles

Having a list of Romanian quality articles …

We have gathered 128 pairs of English-Romanian documents from Wikipedia (602K/502K words), using one of the following heuristics:

Following the English link from the Romanian article gave us the English pair of the Romanian document

We took English articles that had exactly the same name as the Romanian articles (“Alicia Keys”, “Evanescence”, etc.)

We automatically translated the title of the Romanian page into an English query by using translation lexicons (we considered the first 2 translations of every Romanian content word). We retrieved the first 10 results and manually found the pair of the Romanian document, but an automatic method is also available (to be described…)
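
The third heuristic can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the toy lexicon, the stop-word list and the function name are hypothetical, and the slides do not say how the translations are combined, so the sketch simply concatenates them into one bag-of-words query.

```python
# Hypothetical toy lexicon: Romanian word -> English translations, best first.
LEXICON = {
    "război": ["war", "warfare", "battle"],
    "mondial": ["world", "global", "worldwide"],
}

# Assumed list of Romanian function words that are not translated.
STOP_WORDS = {"al", "de", "din", "la", "și", "în"}


def title_to_query(title, lexicon=LEXICON, max_translations=2):
    """Build one English query from a Romanian page title.

    Each content word contributes its first `max_translations` translations;
    words missing from the lexicon (e.g. proper names) are kept unchanged.
    """
    terms = []
    for word in title.lower().split():
        if word in STOP_WORDS:
            continue
        terms.extend(lexicon.get(word, [word])[:max_translations])
    return " ".join(terms)


query = title_to_query("Război mondial")
print(query)
```

The resulting query would then be sent to a search engine and the first 10 results inspected, as described above; that retrieval step is left out of the sketch.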

MCC from Wikipedia using WordNet

Using Princeton WordNet (wordnet.princeton.edu), extract a list of named entities (literals that are capitalized and usually in the “instance_of” relation with their parents)

Transform these literals into Wikipedia page names by replacing spaces with underscores (“_”) and adding the Wikipedia URL prefix en.wikipedia.org/wiki/

Extract all the English pages we can find and, for each page, the Romanian and/or German versions, if they exist, by following the interlingual links

We strip the HTML markup from the documents, retaining only the UTF-8 text, and we also store the categories of each document so that different domain corpora can be selected later
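
Two of the steps above, building the page URL and stripping the HTML, can be sketched with the Python standard library. The function and class names are illustrative assumptions, the example literal is only a plausible named entity, and the HTML handling is deliberately minimal (it drops `script` and `style` content and keeps all other text).

```python
from html.parser import HTMLParser

WIKI_PREFIX = "http://en.wikipedia.org/wiki/"


def wikipedia_url(literal: str) -> str:
    """Turn a WordNet literal into a Wikipedia page URL (spaces -> underscores)."""
    return WIKI_PREFIX + literal.strip().replace(" ", "_")


class _TextExtractor(HTMLParser):
    """Collects text nodes, ignoring the content of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def strip_html(html: str) -> str:
    """Return the visible text of an HTML page as one space-separated string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


url = wikipedia_url("Albert Einstein")
text = strip_html(
    "<html><head><style>p{}</style></head>"
    "<body><h1>Albert Einstein</h1><p>A <b>physicist</b>.</p></body></html>"
)
```

Following the interlingual links and collecting the page categories would need additional parsing of the page and is not shown here.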


Sizes of Collected MCC corpora

Using the WordNet named entities method we were able to gather the following data (in thousands of words):

Document pairing in MCC

The problem is to automatically pair documents (1:1 mapping) from the source language set with those in the target language set

To do this, we replaced each word in every document with its translation equivalents, allowing at most 3 translations per word and considering only those source words that have a low translation entropy score (at most 0.5)

If two candidate documents are represented as binary vectors $x = (x_1, x_2, \ldots, x_n)$ and $y = (y_1, y_2, \ldots, y_n)$ in which a position is 1 if the corresponding term is found in the document …

Percent disagreement d(x, y)

Percent disagreement is the measure that best differentiates good pairs from bad ones (tested against the Euclidean, squared Euclidean and Manhattan distances)

We obtained 72% accuracy when aligning the 128-document test set (the quality articles) from Romanian Wikipedia
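
The pairing procedure can be sketched as below. Several things here are assumptions rather than details from the slides: percent disagreement is taken in its usual form, the fraction of positions at which the two vectors differ, $d(x, y) = \frac{1}{n}\,|\{i : x_i \neq y_i\}|$; the translation entropy scores are treated as precomputed inputs; the greedy lowest-distance-first assignment is just one simple way to obtain a 1:1 mapping; and the lexicon, entropy values and document names are toy data.

```python
def translate(words, lexicon, entropy, max_translations=3, max_entropy=0.5):
    """Replace source words by up to 3 translations, skipping high-entropy words."""
    terms = []
    for word in words:
        if word in lexicon and entropy.get(word, float("inf")) <= max_entropy:
            terms.extend(lexicon[word][:max_translations])
    return terms


def binary_vector(terms, vocabulary):
    """1 at position i if vocabulary[i] occurs in the document, else 0."""
    present = set(terms)
    return [1 if term in present else 0 for term in vocabulary]


def percent_disagreement(x, y):
    """Fraction of positions at which two equal-length vectors differ."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if not x:
        return 0.0
    return sum(1 for a, b in zip(x, y) if a != b) / len(x)


def pair_documents(source_docs, target_docs):
    """Greedy 1:1 pairing; both arguments map document name -> list of terms."""
    all_docs = list(source_docs.values()) + list(target_docs.values())
    vocabulary = sorted({term for doc in all_docs for term in doc})
    src = {n: binary_vector(t, vocabulary) for n, t in source_docs.items()}
    tgt = {n: binary_vector(t, vocabulary) for n, t in target_docs.items()}
    scored = sorted(
        (percent_disagreement(sv, tv), s, t)
        for s, sv in src.items()
        for t, tv in tgt.items()
    )
    pairs, used_targets = {}, set()
    for _, s, t in scored:  # closest candidates are assigned first
        if s not in pairs and t not in used_targets:
            pairs[s] = t
            used_targets.add(t)
    return pairs


# Toy data (hypothetical lexicon and entropy scores).
lexicon = {"fotbal": ["football", "soccer"], "echipă": ["team"],
           "înfrângere": ["defeat"], "cântăreață": ["singer"],
           "album": ["album"], "pian": ["piano"],
           "face": ["do", "make", "act", "render"]}
entropy = {"fotbal": 0.3, "echipă": 0.1, "înfrângere": 0.2,
           "cântăreață": 0.1, "album": 0.0, "pian": 0.0, "face": 2.1}

romanian = {"ro_1": ["fotbal", "echipă", "face", "înfrângere"],
            "ro_2": ["cântăreață", "album", "pian"]}
english = {"en_a": ["singer", "album", "award"],
           "en_b": ["football", "team", "match", "defeat"]}

translated = {n: translate(w, lexicon, entropy) for n, w in romanian.items()}
pairs = pair_documents(translated, english)
```

In this toy run the ambiguous word "face" is filtered out by the entropy threshold, and each Romanian document is paired with the English document from which it differs in the fewest vocabulary positions.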

Focused MCC crawling

Usually the task of collecting corpora from the web is undertaken once and then all the related tools and resources are forgotten …

… until a new corpus has to be built, in which case the whole suite of scripts is usually rewritten to cope with the new requirements

To avoid this unnecessary duplication of work, we developed a graphical web crawler that, based on an input list of URLs, crawls the web, stores the documents in text form and, optionally, runs them through a suite of NLP tools of the user’s choice

The script-based web crawler
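
The general idea of such a crawler (seed URLs in, plain text out, optional processing in between) might look like the sketch below. It is not the tool presented here: all names are hypothetical, the "NLP tools" are modelled as simple text-to-text functions, documents are kept in a dictionary rather than written to disk, and the HTML handling is a crude regular-expression approximation. The demo uses a fake fetcher so that it runs without network access.

```python
import re
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

LINK = re.compile(r'(?i)href="([^"#]+)"')


def fetch(url: str) -> str:
    """Download a page and decode it (UTF-8 unless the server says otherwise)."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def to_text(html: str) -> str:
    """Crude HTML-to-text conversion: drop scripts, styles and tags."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", html)
    return " ".join(text.split())


def crawl(seed_urls, tools=(), fetch_page=fetch, max_pages=10):
    """Breadth-first crawl from the seeds; returns {url: processed text}."""
    queue = deque(seed_urls)
    seen = set(seed_urls)
    corpus = {}
    while queue and len(corpus) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except OSError:  # network errors (URLError is an OSError): skip page
            continue
        text = to_text(html)
        for tool in tools:  # optional processing chain chosen by the user
            text = tool(text)
        corpus[url] = text
        for href in LINK.findall(html):
            link = urljoin(url, href)
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)
    return corpus


# Offline demo with a fake fetcher standing in for the network.
PAGES = {
    "http://example.org/a": '<p>Hello <b>world</b></p><a href="/b">next</a>',
    "http://example.org/b": "<p>Second page</p>",
}


def fake_fetch(url):
    if url not in PAGES:
        raise OSError("page not found")
    return PAGES[url]


corpus = crawl(["http://example.org/a"], tools=[str.lower], fetch_page=fake_fetch)
```

Passing the fetcher and the tool chain as parameters is what lets the same loop be reused for a new corpus without rewriting the scripts.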

Conclusions

Comparable corpora are easier to obtain than parallel corpora, and in the ACCURAT project (http://www.accurat-project.eu/) we intend to exploit comparable corpora to obtain parallel data that will complement and improve existing translation models

We have collected around 46M words of English-Romanian comparable corpora and around 26M words of Romanian-German comparable corpora from Wikipedia

We have also developed a generic graphical web crawler that will collect even more comparable corpora from the web

