The use of machine translation tools for cross-lingual text-mining
Blaz Fortuna, Jozef Stefan Institute, Ljubljana
John Shawe-Taylor, Southampton University
Outline
Cross-lingual text mining
Kernel CCA
Machine translation
Information retrieval experiment
Classification experiment
Conclusions
Cross-lingual text mining
When applying text mining to a multilingual text corpus, language-specific issues appear:
Information retrieval: retrieved documents should depend only on the meaning of the query and not its language.
Classification: only one classifier should be learned, not a separate classifier for each language.
Clustering: documents should be grouped into clusters based on their content, not on the language they are written in.
KCCA (Kernel Canonical Correlation Analysis)
KCCA learns a semantic representation of the text from a corpus of unlabeled paired documents.
On input we have a set of paired documents (for each document we have a version in each language).
On output we get a set of mappings from each native language space into a "language-independent space", a subspace with semantic dimensions.
[Vinokourov et al., 2002]
[Figure: KCCA maps paired English and German texts onto shared semantic dimensions. Example word groups: "loss, income, company, quarter" / "verlust, einkommen, firma, viertel" and "wage, payment, negotiations, union" / "zahlung, volle, gewerkschaft, verhandlungsrunde".]
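The slides do not state the optimisation problem behind KCCA. As a sketch only, one commonly used regularised formulation is given below. The symbols are assumptions that do not appear in the talk: K_x and K_y are the kernel matrices of the paired training documents in the two languages, alpha and beta are the dual weight vectors, and kappa is a regularisation parameter.

```latex
\rho = \max_{\alpha,\beta}
\frac{\alpha^{\top} K_x K_y \beta}
{\sqrt{\left(\alpha^{\top} K_x^{2} \alpha + \kappa\, \alpha^{\top} K_x \alpha\right)
       \left(\beta^{\top} K_y^{2} \beta + \kappa\, \beta^{\top} K_y \beta\right)}}
```

Under this formulation, each solution pair (alpha_i, beta_i) defines one semantic dimension: a new document in the first language is mapped to coordinate i by taking the inner product of alpha_i with the vector of its kernel values against the training documents, and likewise with beta_i for the second language.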
Paired training set and machine translation
KCCA needs a paired dataset for training. When no paired dataset is available, we have two options:
We use a human-made dataset from some other domain. This could be unreliable because of a large semantic and vocabulary gap.
We use machine translation tools to generate a paired dataset. In our experiments we used Google Language Tools for translating documents.
Experiments
We investigated how the quality of a training set generated by machine translation compares with a true human-generated paired corpus.
Two major issues are addressed. How much do we win or lose by using machine translation when a human-generated corpus is available:
for the target domain?
only for a different domain?
Experiment #1 – Information retrieval
We compared two paired corpora:
Hansard corpus: aligned pairs of text chunks from the official records of the 36th Canadian Parliament Proceedings. [Germann, 2001]
Artificial corpus: half of the English and half of the French translations from the Hansard corpus were replaced by machine translation.
Queries were generated from each test document by extracting 5 words with the highest TFIDF weights and using them as a query.
The goal was to retrieve the paired document.
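The query generation step can be sketched as follows. This is a minimal illustration, not the code used in the experiment: the exact TFIDF variant is not given in the talk, so the weighting below (raw term frequency times log(N / df)) is an assumption, and the function name `tfidf_query` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tfidf_query(doc_tokens, corpus_tokens, n_terms=5):
    """Return the n_terms tokens of doc_tokens with the highest TF-IDF weight."""
    n_docs = len(corpus_tokens)
    # Document frequency: number of corpus documents that contain each term.
    df = Counter()
    for tokens in corpus_tokens:
        df.update(set(tokens))
    tf = Counter(doc_tokens)
    # Assumed weighting: raw term frequency times log(N / df).
    weights = {t: tf[t] * math.log(n_docs / df[t]) for t in tf if df[t] > 0}
    # Highest weight first; ties are broken alphabetically to stay deterministic.
    ranked = sorted(weights, key=lambda t: (-weights[t], t))
    return ranked[:n_terms]


# Toy corpus (hypothetical data, already tokenised).
corpus = [
    "the company reported a loss this quarter".split(),
    "the union opened wage negotiations".split(),
    "the company raised income this quarter".split(),
]
query = tfidf_query(corpus[1], corpus, n_terms=2)
```

Words that occur in every document, such as "the" in the toy corpus, get weight zero and so fall to the bottom of the ranking.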
Experimental procedure (for each corpus):
(1) KCCA trained on 1500 paired documents,
(2) All 896 test documents (in both languages) projected into the KCCA semantic space,
(3) Each query was projected into the KCCA semantic space and documents were retrieved using nearest neighbour based on cosine distance to the query.
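Step (3) of the procedure can be sketched as below, assuming the query and the documents have already been projected into the semantic space and are available as plain lists of floats. The function names and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_documents(query_vec, doc_vecs):
    """Return document indices ordered from most to least similar to the query."""
    sims = [cosine_similarity(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: -sims[i])


# Toy vectors standing in for projected documents (hypothetical data).
docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
ranking = rank_documents([0.9, 0.1, 0.0], docs)
```

The nearest neighbour is the first index in `ranking`; the position of the paired document in this list is the rank that the retrieval results are based on.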
Results
            En-En     En-Fr     Fr-En     Fr-Fr
Hansard     87 / 99   66 / 96   65 / 95   84 / 99
Artificial  86 / 99   58 / 91   59 / 90   83 / 99
Reading a cell, e.g. 65 / 95:
For 65% of queries the correct document appeared in first place.
For 95% of queries the correct document appeared among the first 10 results.
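The two numbers in each cell can be computed from the rank at which the paired document was retrieved for each query. A minimal sketch, with a hypothetical function name and made-up ranks rather than data from the experiment:

```python
def top_k_rate(ranks, k):
    """Percentage of queries whose paired document is ranked within the top k.

    ranks holds the 1-based rank of the correct document for each query.
    """
    hits = sum(1 for r in ranks if r <= k)
    return 100.0 * hits / len(ranks)


# Made-up ranks for ten queries (illustration only).
ranks = [1, 3, 1, 12, 7, 1, 2, 25, 1, 4]
top1 = top_k_rate(ranks, 1)
top10 = top_k_rate(ranks, 10)
```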
There is almost no difference (at most 1 percentage point) when the query and the document are in the same language.
When the query and the document are in different languages, there is a drop of around 5-10 percentage points in retrieval accuracy.
Experiment #2 – Classification
The Reuters multilingual corpus (English and French) was used as the dataset.
[Reuters, 2004]
The first paired training set, Hansard, was taken from the previous experiment; it comes from a different domain than news articles.
The second paired training set was generated from the Reuters dataset using machine translation (Google).
Experimental procedure (for each corpus):
(1) KCCA trained on 1500 paired documents,
(2) Whole Reuters corpus was projected into the KCCA semantic space,
(3) Linear SVM classifier was learned in KCCA semantic space on a subset of 3000 documents and tested on a subset of 50,000 (results are averaged over 5 random splits).
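The averaging over random splits in step (3) can be sketched as follows. This only illustrates the evaluation protocol: the classifier is passed in as a function, and the demo uses a nearest-centroid classifier as a stand-in, not the linear SVM from the experiment. All names and the toy data are hypothetical.

```python
import random


def average_over_splits(examples, n_train, n_test, train_and_score,
                        n_splits=5, seed=0):
    """Average a score over n_splits random, disjoint train/test splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_splits):
        shuffled = examples[:]
        rng.shuffle(shuffled)
        train = shuffled[:n_train]
        test = shuffled[n_train:n_train + n_test]
        scores.append(train_and_score(train, test))
    return sum(scores) / len(scores)


def centroid_accuracy(train, test):
    """Stand-in classifier (NOT an SVM): nearest class centroid, returns accuracy."""
    sums, counts = {}, {}
    for x, y in train:
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, v in enumerate(x):
            acc[j] += v
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: [v / counts[y] for v in s] for y, s in sums.items()}

    def predict(x):
        # Pick the label whose centroid has the smallest squared distance to x.
        return min(centroids,
                   key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])))

    correct = sum(1 for x, y in test if predict(x) == y)
    return correct / len(test)


# Toy, well-separated two-class data standing in for projected documents.
examples = ([([1.0 + 0.01 * i, 0.0], 1) for i in range(10)]
            + [([-1.0 - 0.01 * i, 0.0], -1) for i in range(10)])
avg = average_over_splits(examples, n_train=12, n_test=8,
                          train_and_score=centroid_accuracy)
```

In the setting of the slides, `n_train` would be 3000, `n_test` would be 50,000, and `train_and_score` would train and evaluate a linear SVM on the projected documents.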
Results
#KCCA dimensions: 800
FE … French training set, English testing set.
The artificial paired training set generates a significantly better semantic space than a training set taken from a different domain!
Conclusions
We have shown that machine translation can be used to generate a training set for Kernel CCA that gives almost as good performance as a training set made by human translators.
When no hand-made translations are available, this can significantly decrease the cost of multilingual text mining.
We would also like to thank Miha Grcar for making an automated interface to Google Language Tools!
Questions?