AMTA 2008: Translation universals

Translation universals: do they exist? A corpus-based NLP study of convergence and simplification

Gloria Corpas*, Ruslan Mitkov**, Naveed Afzal**, Viktor Pekar***

* University of Málaga
** University of Wolverhampton
*** Oxford University Press



Translation universals (Baker 1993, 1996; Toury 1995)

Translated texts tend to be simpler than non-translated, original texts (simplification)
Translated texts tend to be more explicit than non-translated texts (explicitation)
Translated texts tend to be more similar to each other than non-translated texts (convergence)

Previous research on translation universals

Formulation and initial explanation have been based on intuition and introspection
Follow-up corpus research has been limited to comparatively small corpora, literary or newswire texts, and semi-manual analysis
Insufficient guidance as to which features account for these universals, and so whether they can be regarded as valid

Objective of this study

To test the validity of convergence (translated texts tend to be more similar than non-translated texts)
To test the validity of simplification (translated texts tend to be simpler than non-translated texts)
To propose features which account for convergence and simplification
Test (target) language: Spanish

General methodology: convergence

Employment of NLP techniques on corpora of translated Spanish and on comparable corpora of non-translated (original) Spanish
Similarity computed between every pair of corpora of translated texts and between every pair of corpora of original texts
Similarity is measured in terms of both style and syntax

General methodology: simplification

Employment of NLP techniques on corpora of translated Spanish and on comparable corpora of non-translated (original) Spanish
For every corpus, a set of lexical and stylistic features is computed and compared with its comparable counterpart

Corpora used

Corpus of Medical Spanish Translations by Professionals (MSTP: 1,058,122)
Corpus of Medical Spanish Translations by Students (MSTS: 1,058,122)
Corpus of Technical Spanish Translations (TST: 1,736,027)
Corpus of Original Medical Spanish Comparable to Translations by Professionals (MSTPC: 1,402,172)
Corpus of Original Medical Spanish Comparable to Translations by Students (MSTSC: 1,164,435)
Corpus of Original Technical Spanish Comparable to Technical Translations (TSTC: 1,986,651)

Comparability of corpora

Comparability in terms of

(i) Text types and forms
(ii) Domains and subdomains
(iii) Level of specialisation and formality
(iv) Diatopic restrictions (Peninsular Spanish)
(v) Time span (2005-2008)
(vi) Similar size

Corpus design

[Figure: corpus design diagram. Recoverable labels: non-translated corpus; MSC; MSTSC; MSTPC; TSTC; ES (TT); ES (NT)]

Study 1: Convergence
Specific methodology (1)

Compared:
all 3 pairs of translated texts (MSTP-MSTS; MSTS-TST; MSTP-TST)
all 3 pairs of comparable non-translated texts (MSTPC-MSTSC; MSTSC-TSTC; MSTPC-TSTC)

Premise: if the convergence universal holds, higher similarity is expected for pairs of translated texts than for pairs of non-translated texts.

Study 1: Convergence
Specific methodology (2)

Texts compared on the basis of
(i) style (stylistic features)
(ii) syntax (syntactic features)

Our proposal for stylistic and syntactic features

Style comparison: stylistic features

Lexical density: (number of types) / (total number of tokens present in corpus)

Lexical richness: (number of lemmas) / (number of tokens present in corpus)

Sentence length: (number of tokens in corpus) / (number of sentences)
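The three ratios above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the corpus is already sentence-split, tokenised and lemmatised, that "number of lemmas" means distinct lemmas, and that types are counted case-insensitively. The function name `stylistic_features` and the toy corpus are hypothetical.

```python
def stylistic_features(sentences):
    """Compute the three ratio features for a corpus.

    `sentences` is a list of sentences, each a list of (token, lemma)
    pairs. Lowercasing before counting is an assumption of this sketch.
    """
    tokens = [tok.lower() for sent in sentences for tok, _ in sent]
    lemmas = [lem.lower() for sent in sentences for _, lem in sent]
    n_tokens = len(tokens)
    return {
        # distinct word forms (types) over all tokens
        "lexical_density": len(set(tokens)) / n_tokens,
        # distinct lemmas over all tokens (assumed reading of "number of lemmas")
        "lexical_richness": len(set(lemmas)) / n_tokens,
        # average number of tokens per sentence
        "sentence_length": n_tokens / len(sentences),
    }


# Toy two-sentence corpus, invented for illustration only.
corpus = [
    [("El", "el"), ("paciente", "paciente"), ("toma", "tomar"),
     ("el", "el"), ("fármaco", "fármaco")],
    [("Los", "el"), ("pacientes", "paciente"), ("toman", "tomar"),
     ("fármacos", "fármaco")],
]
features = stylistic_features(corpus)
print(features)
```

On the toy corpus there are 9 tokens, 8 types and 4 lemmas, so the lemma-based ratio is lower than the type-based one, as in the reported results.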

Style comparison: stylistic features (2)

Simple/complex sentences
Discourse markers (Spanish)
Two statistical tests (Chi-square test and T-test) employed

Syntax comparison

Sequences of POS tags compared for every pair of corpora
Corpora represented as frequency vectors of 3-grams (Nerbonne and Wiersma, 2006)
Measures:
Cosine
Recurrence metrics R and Rsq (Kessler, 2001)
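A minimal sketch of this comparison is given below. It assumes that R is the sum of absolute deviations of each 3-gram count from the mean of the two corpora, and that Rsq is the corresponding sum of squared deviations, which is how these metrics are usually presented in Nerbonne and Wiersma's work. The slides do not give the exact formulas or say whether counts were normalised for corpus size, so the sketch uses raw counts. All function names and the toy tag sequences are hypothetical.

```python
import math
from collections import Counter


def pos_trigram_counts(tagged_sentences):
    """Count POS-tag 3-grams; each sentence is a list of POS tags."""
    counts = Counter()
    for tags in tagged_sentences:
        for i in range(len(tags) - 2):
            counts[tuple(tags[i:i + 3])] += 1
    return counts


def cosine_distance(a, b):
    """1 - cosine similarity of two count vectors (the 1-C measure).

    Assumes both vectors are non-empty, otherwise a norm would be zero.
    """
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b)


def recurrence_metrics(a, b):
    """Return (R, Rsq) under the assumed definitions described above."""
    r = rsq = 0.0
    for k in a.keys() | b.keys():
        mean = (a[k] + b[k]) / 2  # a Counter returns 0 for unseen 3-grams
        r += abs(a[k] - mean)
        rsq += (a[k] - mean) ** 2
    return r, rsq


# Toy POS sequences, invented for illustration only.
counts_a = pos_trigram_counts([["DET", "NOUN", "VERB", "DET", "NOUN"]])
counts_b = pos_trigram_counts([["DET", "NOUN", "VERB", "ADJ"]])
distance = cosine_distance(counts_a, counts_b)
r, rsq = recurrence_metrics(counts_a, counts_b)
print(distance, r, rsq)
```

For all three measures, larger values indicate that the two corpora are less similar in their POS 3-gram profiles.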

Experimental results

Computation of stylistic features
Chi-square values for global comparison
T-test values for statistical significance
Measuring vector differences for syntax comparison

Style comparison: stylistic features

Features                       MSTP         MSTS         TST          MSTPC        MSTSC        TSTC
Lexical density                0.027954     0.052715     0.020679     0.042505     0.041159     0.025529
Lexical richness               0.016929     0.037709     0.013281     0.029992     0.028905     0.015591
Average sentence length        25.256248    28.499456    27.292782    20.702349    26.442412    18.124363
Simple sentences (proportion)  0.441768121  0.507205751  0.476949103  0.638889238  0.52120611   0.592110096
Discourse markers (ratio)      0.001268941  0.001852604  0.000763805  0.002022331  0.002099085  0.001649655

Style comparison: Chi-square values

Translated corpora

Corpora          Chi-square value
MSTP - MSTS      0.010622566
MSTP - TST       0.00266151
MSTS - TST       0.023731912
Total            0.037015988
Average          0.012338663

Non-translated corpora

Corpora          Chi-square value
MSTPC - MSTSC    0.059779549
MSTPC - TSTC     0.006140764
MSTSC - TSTC     0.07122404
Total            0.137144352
Average          0.045714784

Style comparison: T-test values

                   Translated corpora                     Non-translated corpora
Features           MSTP-MSTS    MSTS-TST     MSTP-TST     MSTPC-MSTSC  MSTSC-TSTC   MSTPC-TSTC
Lexical density    0.002545387  0.000123172  0.079875166  0.140348431  0.201151185  0.000748439
Lexical richness   0.0006604    6.9792e-06   0.140236542  0.140711253  0.015893183  9.71905e-05
Sentence length    0.011826639  0.522122939  0.202480843  0.145216739  0.002807505  0.368840258
Simple sentences   0.057465277  0.673936375  0.202830407  0.096465071  0.462960518  0.21217697
Discourse markers  0.001048007  0.005746253  0.351552034  0.063428055  0.00084074   0.072337471

Syntax comparison: results
Measuring vector differences

Corpora          1-C              R               Rsq

Translated texts
MSTP - MSTS      0.206015066283   252526.914323   638848591.082
MSTP - TST       0.337626383799   388466.504863   3146471863.13
MSTS - TST       0.176310545152   432725.578482   2643068563.82

Non-translated texts
MSTPC - MSTSC    0.0176469276126  98448.0858054   82218137.9687
MSTPC - TSTC     0.150912596476   364322.217714   851312764.364
MSTSC - TSTC     0.167167511143   372940.61477    1008322991.78

Convergence: discussion (1)

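Stylistic features: the translated texts included in the experiment are more similar to each other than the non-translated texts are (Chi-square test: average 0.012 for translated pairs vs. 0.046 for non-translated pairs)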

Convergence: discussion (2)

T-test observations
There are non-translated texts which are not statistically different in terms of stylistic features, whereas the corresponding translated texts differ statistically
There are non-translated texts which are statistically different in terms of only one stylistic feature, whereas the corresponding translated texts differ statistically with regard to two stylistic features
Translated texts can often differ significantly with regard to certain style features (e.g. lexical density)

Convergence: discussion (3)

Translated texts differ more in terms of syntax for all compared pairs and from the point of view of all measures (1-C, R and Rsq)

Study 2: Simplification
Specific methodology

Stylistic features accounting for ‘simple’ texts
Sentence length
Simple vs. complex sentences

Readability
Automated Readability Index (ARI)
Coleman-Liau Index (CLI)
Flesch-Kincaid Grade Level Readability Test (FK)
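The three readability indices have standard published formulas, sketched below from raw counts. Two caveats: the coefficients were calibrated on English, and the slides do not say whether the authors adapted them for Spanish or how syllables were counted, so the sketch takes all counts as inputs. The function names and the example counts are hypothetical. In all three indices a higher score means a harder text.

```python
def automated_readability_index(characters, words, sentences):
    """ARI = 4.71 * (chars / words) + 0.5 * (words / sentences) - 21.43"""
    return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43


def coleman_liau_index(letters, words, sentences):
    """CLI = 0.0588 * L - 0.296 * S - 15.8, with L and S per 100 words."""
    letters_per_100 = 100.0 * letters / words
    sentences_per_100 = 100.0 * sentences / words
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8


def flesch_kincaid_grade(syllables, words, sentences):
    """FK = 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59"""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Invented counts for a short text: 100 words, 5 sentences,
# 500 letters (also used as the character count) and 150 syllables.
ari = automated_readability_index(characters=500, words=100, sentences=5)
cli = coleman_liau_index(letters=500, words=100, sentences=5)
fk = flesch_kincaid_grade(syllables=150, words=100, sentences=5)
print(round(ari, 2), round(cli, 2), round(fk, 2))
```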

Results compared across pairs of corpora

Comparison of mean values of the lexical and stylistic features between corresponding comparable corpora

                               MSTP-MSTPC              MSTS-MSTSC              TST-TSTC
Features                       MSTP    MSTPC   α       MSTS    MSTSC   α       TST     TSTC    α
Lexical density                .027    .042    0.005   .052    .041    0.4     .02     .025    0.001
Lexical richness               .016    .029    0.005   .037    .028    0.4     .013    .015    0.001
Average sentence length        25.25   20.70   0.2     28.49   26.44   0.1     27.29   18.12   0.001
Simple sentences (proportion)  .441    .638    0.01    .507    .521    0.7     .476    .592    0.002
Discourse markers (ratio)      .0012   .002    0.05    .0018   .0021   0.2     .0007   .0016   0.002
ARI                            16.85   15.08   0.4     19.14   19.01   0.75    17.85   12.85   0.001
CLI                            16.27   16.9    0.3     17.16   18.28   0.05    16.28   15.5    0.1
FK                             19.53   18.21   0.5     21.32   21.51   0.5     20.03   15.46   0.001

Simplification: discussion

Mixed picture

Simplification confirmed on
Lexical richness
Lexical density
Readability

Simplification not confirmed on
Sentence length
Proportion of simple sentences

Implications for translation universals

Convergence
Style: convergence appears to hold broadly, but no definite conclusion can be drawn that convergence is a clear-cut universal
Syntax: there is no evidence that convergence holds in terms of syntax
General: the results do not provide sufficient support for the convergence ‘universal’

Simplification
Mixed picture: insufficient support for simplification

Implication for Machine Translation

Given the mixed picture, not many
But: translated texts have to be more readable than non-translated texts
More research is needed as to which features are ‘stable’
Could stable features be included in an MT model?

Conclusions

There is insufficient evidence that the translation universals studied (convergence, simplification) hold
Features which appear to be ‘stable’ (e.g. readability) could be modelled in MT systems