the use of machine translation tools for cross-lingual text-mining blaz fortuna jozef stefan...

The use of machine translation tools for cross-lingual text-mining

Blaz Fortuna, Jozef Stefan Institute, Ljubljana

John Shawe-Taylor, Southampton University

Outline

Cross-lingual text mining
Kernel CCA
Machine translation
Information retrieval experiment
Classification experiment
Conclusions

Cross-lingual text mining

When text mining is applied to multilingual text corpora, language-specific issues appear:

Information retrieval: retrieved documents should depend only on the meaning of the query and not its language.

Classification: only one classifier should be learned, not a separate classifier for each language.

Clustering: documents should be grouped into clusters based on their content, not on the language they are written in.

KCCA (Kernel Canonical Correlation Analysis)

KCCA learns a semantic representation of the text from a corpus of unlabeled paired documents.

Input: a set of paired documents (for each document we have a version in each language).

Output: a set of mappings from each native language space into a “language independent space”, a subspace with semantic dimensions.

[Vinokourov et al., 2002]

[Figure: KCCA maps paired English and German terms onto shared semantic dimensions. Example dimensions: "loss, income, company, quarter" / "verlust, einkommen, firma, viertel" and "wage, payment, negotiations, union" / "zahlung, volle, gewerkschaft, verhandlungsrunde".]
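For reference, the regularized KCCA objective can be written as below. This is the form commonly given in the KCCA literature, not a formula taken from these slides, so the symbols are assumptions: K_x and K_y are the kernel (Gram) matrices of the paired training documents in the two languages, alpha and beta are the dual weight vectors that define one semantic dimension, and kappa is a regularization parameter.

```latex
\rho = \max_{\alpha,\beta}
\frac{\alpha^{\top} K_x K_y \beta}
     {\sqrt{\left(\alpha^{\top} K_x^{2} \alpha + \kappa\,\alpha^{\top} K_x \alpha\right)
            \left(\beta^{\top} K_y^{2} \beta + \kappa\,\beta^{\top} K_y \beta\right)}}
```

Under this formulation, a new document in the first language is mapped onto a semantic dimension by taking its vector of kernel values against the training documents and computing the inner product with alpha (and with beta for the second language).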

Paired training set and machine translation

KCCA needs a paired dataset for training. When no paired dataset is available, we have two options:

We use a human-made dataset from some other domain. This could be unreliable because of a big semantic and vocabulary gap.

We use machine translation tools to generate a paired dataset. In our experiments we used Google Language Tools for translating documents.
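The second option can be sketched as follows. This is a minimal illustration, not the interface used in the experiments: `build_paired_corpus` and `translate` are hypothetical names, and the toy dictionary translator only stands in for a real machine translation service.

```python
def build_paired_corpus(source_docs, translate):
    """Pair every source document with its machine translation.

    `translate` is a hypothetical callable (text -> text) standing in
    for whatever machine translation tool is available.
    """
    return [(doc, translate(doc)) for doc in source_docs]


# Toy stand-in for a translation service (assumption, for illustration only).
toy_dictionary = {"loss": "perte", "company": "entreprise"}


def toy_translate(text):
    # Word-by-word lookup; unknown words are copied unchanged.
    return " ".join(toy_dictionary.get(word, word) for word in text.split())


pairs = build_paired_corpus(["company loss"], toy_translate)
```

The resulting list of (original, translation) pairs plays the role of the paired training set that KCCA requires.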

Experiments

We investigated how the quality of a training set generated by machine translation compares with that of a true human-generated paired corpus.

Two major issues are addressed. How much do we win or lose by using machine translation when a human-generated corpus is available:

for the target domain?

only for a different domain?

Experiment #1 – Information retrieval

We compared two paired corpora:

Hansard corpus: aligned pairs of text chunks from the official records of the 36th Canadian Parliament Proceedings. [Germann, 2001]

Artificial corpus: half of the English and half of the French translations from the Hansard corpus were replaced by machine translation.

Queries were generated from each test document by extracting 5 words with the highest TFIDF weights and using them as a query.
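The query generation step can be sketched as below. The whitespace tokenization and the particular TF-IDF variant (raw term count times log of inverse document frequency) are assumptions; the slides do not say which weighting or preprocessing was used, and `top_tfidf_terms` is a hypothetical helper name.

```python
import math
from collections import Counter


def top_tfidf_terms(docs, doc_index, k=5):
    """Return the k terms of docs[doc_index] with the highest TF-IDF weight."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Term frequency within the selected document.
    tf = Counter(tokenized[doc_index])
    weights = {t: count * math.log(n_docs / df[t]) for t, count in tf.items()}
    # Highest weight first; ties are broken alphabetically for a stable order.
    ranked = sorted(weights, key=lambda t: (-weights[t], t))
    return ranked[:k]


docs = [
    "the company reported a loss this quarter",
    "the union rejected the wage offer",
    "the company and the union resumed negotiations",
]
query = top_tfidf_terms(docs, 0, k=5)
```

Terms that occur in every document (here "the") get weight zero and drop out of the query, which is the reason for using TF-IDF rather than raw counts.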

The goal was to retrieve the paired document.

Experimental procedure (for each corpus):
(1) KCCA was trained on 1500 paired documents.
(2) All 896 test documents (in both languages) were projected into the KCCA semantic space.
(3) Each query was projected into the KCCA semantic space, and documents were retrieved by nearest-neighbour search using cosine distance to the query.
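Step (3) can be sketched as follows, assuming the query and the documents have already been projected into the semantic space. The two-dimensional vectors are made-up toy values, and `cosine` and `retrieve` are hypothetical helper names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_vec, doc_vecs, top_n=10):
    """Return indices of the top_n documents, most similar to the query first."""
    order = sorted(range(len(doc_vecs)),
                   key=lambda i: -cosine(query_vec, doc_vecs[i]))
    return order[:top_n]


# Toy projected documents and query (assumed values, for illustration only).
doc_vecs = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]]
ranking = retrieve([1.0, 0.2], doc_vecs)
```

Because both languages are projected into the same space, the same function serves monolingual and cross-lingual retrieval; only the mapping used to project the query differs.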

Results

             En-En     En-Fr     Fr-En     Fr-Fr
Hansard      87 / 99   66 / 96   65 / 95   84 / 99
Artificial   86 / 99   58 / 91   59 / 90   83 / 99

Each cell gives two percentages. For example, in the Hansard Fr-En cell (65 / 95), the correct document appeared in first place for 65% of queries and among the first 10 results for 95% of queries.
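The two numbers in each cell can be computed from the retrieval rankings as sketched below. The rankings and target indices are made-up toy values, and `hits_at_k` is a hypothetical helper name.

```python
def hits_at_k(rankings, targets, k):
    """Percentage of queries whose target document is among the first k results."""
    hits = sum(1 for ranking, target in zip(rankings, targets)
               if target in ranking[:k])
    return 100.0 * hits / len(targets)


# One ranking (list of document indices, best first) per query.
rankings = [[3, 1, 2], [0, 2, 1], [2, 0, 1], [1, 0, 3]]
# Index of the paired document that each query should retrieve.
targets = [3, 2, 1, 1]

top1 = hits_at_k(rankings, targets, 1)
top2 = hits_at_k(rankings, targets, 2)
```

In the experiment the same measure is taken at k = 1 and k = 10.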

There is practically no difference between the two corpora when query and document are in the same language.

When query and document are in different languages, there is a drop of around 5-10% in retrieval accuracy.

Experiment #2 – Classification

The Reuters multilingual corpus (English and French) was used as the dataset.

[Reuters, 2004]

The first paired training set, Hansard, was taken from the previous experiment; it comes from a different domain than news articles.

The second paired training set was generated from the Reuters dataset using machine translation (Google).

Experimental procedure (for each corpus):
(1) KCCA was trained on 1500 paired documents.
(2) The whole Reuters corpus was projected into the KCCA semantic space.
(3) A linear SVM classifier was learned in the KCCA semantic space on a subset of 3000 documents and tested on a subset of 50,000 (results are averaged over 5 random splits).
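Step (3) can be sketched as below for the cross-lingual case, where the classifier is trained on documents of one language and tested on the other. The trainer is a simple Pegasos-style stochastic subgradient method without a bias term, chosen here to keep the example self-contained; the slides do not say which SVM solver or parameters were used, and the projected vectors are made-up toy values.

```python
import random


def train_linear_svm(X, y, lam=0.01, epochs=50, seed=0):
    """Pegasos-style linear SVM (hinge loss, L2 penalty, no bias); labels are +1/-1."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    t = 0
    for _ in range(epochs):
        order = list(range(len(X)))
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            # Shrink the weights (gradient of the L2 penalty).
            w = [(1.0 - 1.0 / t) * wj for wj in w]
            # Hinge-loss step only for examples inside the margin.
            if margin < 1:
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w


def predict(w, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1


def accuracy(w, X, y):
    return sum(predict(w, x) == label for x, label in zip(X, y)) / len(y)


# Toy documents already projected into the shared semantic space (assumed values).
train_X = [[1.0, 0.2], [0.8, 0.1], [0.9, -0.1], [1.1, 0.3],
           [-1.0, -0.2], [-0.8, 0.1], [-0.9, -0.3], [-1.1, 0.0]]
train_y = [1, 1, 1, 1, -1, -1, -1, -1]   # e.g. documents in one language
test_X = [[0.7, 0.3], [-0.6, -0.1], [0.9, 0.0], [-1.0, 0.2]]
test_y = [1, -1, 1, -1]                  # e.g. documents in the other language

w = train_linear_svm(train_X, train_y)
test_accuracy = accuracy(w, test_X, test_y)
```

The point of the sketch is that only one classifier is trained: once both languages live in the same semantic space, a model learned on one language can be applied directly to the other.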

Results

#KCCA dimensions: 800

FE … French training set, English testing set.

The artificial paired training set generates a significantly better semantic space than a training set taken from a different domain!

Conclusions

We have shown that machine translation can be used to generate a training set for Kernel CCA that gives almost as good performance as a training set made by human translators.

When no hand-made translations are available, this can significantly decrease the cost of multilingual text mining.

We would also like to thank Miha Grcar for making an automated interface to Google Language Tools!

Questions?
