

INFuture 2007: The Future of Information Sciences
Digital Information and Heritage
Zagreb, Croatia, 7-9 November 2007

Sentence Alignment as the Basis For Translation Memory Database

Sanja Seljan
Faculty of Humanities and Social Sciences – University of Zagreb
Department of Information Sciences
[email protected]

Angelina Gašpar
SOA Centre Split
[email protected]

Damir Pavuna
Integra d.o.o.
[email protected]


Overview

I Introduction
II When to use TMs?
  – Text preparation
III Corpus used
  – Text characteristics
IV Research
  – Tools used
  – Automatic and manual alignment
  – Comparison of TMs
  – Results
V Conclusion


Sentence alignment (SA)

• basis for computer-assisted translation (CAT)
• terminology management
• term extraction
• word alignment
• cross-linguistic information retrieval

Sentence alignment (SA) -> translation memory (TM)
Basis for further research in translation equivalencies


Problems in automatic SA:

• robustness

• discrepancies in layout and omissions

-> influence on alignment accuracy and on the resulting TM


Research:

• SA on Croatian-English parallel texts (laws, regulations, acts, decisions)

• alignment tool: WinAlign 7.5.0 (SDL Trados 2006 Professional)

Aim:

• impact of SA process on the creation of TM

• comparison of 3 types of TMs

• Differences:

– in levels of expert intervention in set up of the alignment program

– in preparation of the source text for the segmentation


II When to use TMs?

• Fast and consistent translation (e.g. EU, multinational agencies)

• Voluminous texts
• Highly repetitive types of texts
• Use of specialized and consistent terminology
• Several languages

• Sharing of common resources (cooperation)
• Time-saving (speeds up the translation process)
• Cost-saving
• Consistent translation


Creation of TM

• Directly through translation
• Use of already translated material (alignment process)
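As a rough illustration of what a TM created in either way holds, the Python sketch below stores source-target pairs and returns the closest stored source for a new sentence. The `TranslationMemory` class, the invented Croatian-English sample pair and the use of `difflib.SequenceMatcher` as the similarity measure are assumptions made for illustration only; CAT tools such as SDL Trados use their own matching.

```python
from difflib import SequenceMatcher


class TranslationMemory:
    """Hypothetical in-memory TM: a list of (source, target) translation units."""

    def __init__(self):
        self.units = []

    def add(self, source, target):
        self.units.append((source, target))

    def lookup(self, sentence):
        """Return (score in percent, source, target) of the best match, or None."""
        best = None
        for source, target in self.units:
            # Character-based similarity; an assumption, not the Trados measure.
            score = round(100 * SequenceMatcher(None, sentence, source).ratio())
            if best is None or score > best[0]:
                best = (score, source, target)
        return best


tm = TranslationMemory()
# Invented sample pair, not taken from the corpus used in the study.
tm.add("Ovaj Zakon stupa na snagu danom objave.",
       "This Act shall enter into force on the day of its publication.")

exact = tm.lookup("Ovaj Zakon stupa na snagu danom objave.")   # identical sentence
fuzzy = tm.lookup("Ova Uredba stupa na snagu danom objave.")   # similar sentence
```

An identical sentence scores 100, while the second sentence only partly overlaps with the stored source and comes back as a fuzzy match that a translator would have to edit.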


III Corpus used

• 9 parallel legislative Croatian-English texts or bitexts related to: acts, laws, regulations, decisions and ordinances;

• For the sake of uniformity: standard presentation and standard formulas;

• 33.15%: word-count ratio in favour of the English translations;


• Reasons:
  – English: an analytic type of language, use of passive voice,
  – Croatian: a highly inflectional system, use of active voice,

• Repetitive legal terms, phrases, sentences

• Main components of a regulation: the title, preamble, enacting terms, addressee, place, date and signature.


• Enacting terms follow strict rules of presentation:
  – subject matter and scope,
  – definitions,
  – provisions conferring implementing power,
  – penalties or legal remedies,
  – transitional and final provisions.

• The standard form prescribes the layout on the page: spacing, paragraphing, punctuation and even typographic characteristics (capitalisation, typeface, boldface and italics)


• Use of verbs in enacting terms
  – Binding Croatian legislation:
    - declarative terms (definitions, amendments)
    - and imperative terms (commands, prohibitions)
  – English “shall” = Croatian present tense, modals (morati, trebati)
  – English “may” for prohibition, permission and authorisation = Croatian present tense (“ne može se”, “može se”).


• Bitext similarities:
  – punctuation, numbers, dates, foreign words;

• Differences:
  – capital letters, hyphens, compound words, synonyms (avoided in target language);

• Common points:
  – consistent terminology, a uniform manner, gender-neutral language;


IV Alignment research

• Texts:
  – translations of Croatian legislative acts (Croatian -> English);

• Tools:
  – AnyCount 4.0 (version 405) – for document structure analysis
  – SDL Trados 2006 Professional (WinAlign 7.5.0) – for the alignment process;


Alignment research

• PREPARATORY ACTIVITIES:
  – comparison of the source and target texts (whether all text is translated)
  – defining set up of end and skip rules (delimiters, creating abbreviation user list)
  – preparation of the source text for better segmentation (spelling, automatic bullets and numbering, deleting of soft returns, hyphens, certain punctuation, tables created with tabs and revision marks)
  – modification of set up rules
  – verification of the alignment (especially 1:2 and 2:1 pairs and commitment of pairs)
  – creation of translation memory and verification
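The preparation and segmentation steps listed above can be sketched in Python. This is a simplified illustration, not WinAlign's segmentation: the function names, the sample abbreviation list, the treatment of a vertical tab as a soft return, and the end rule (a segment ends at `.`, `!` or `?` followed by whitespace) are all assumptions.

```python
import re

# Hypothetical abbreviation user list: a full stop after these does not end a segment.
ABBREVIATIONS = {"Art.", "No.", "br.", "čl.", "st."}


def clean(text):
    """Remove soft hyphens and replace soft returns with spaces."""
    text = text.replace("\u00ad", "")   # soft hyphen
    text = text.replace("\x0b", " ")    # soft return (assumed to arrive as a vertical tab)
    return re.sub(r"[ \t]+", " ", text).strip()


def segment(text):
    """Split text into segments.

    End rule: ., ! or ? followed by whitespace closes a segment.
    Skip rule: a full stop that belongs to a listed abbreviation does not.
    """
    segments, start = [], 0
    for match in re.finditer(r"[.!?]\s+", text):
        candidate = text[start:match.start() + 1]
        if candidate.split()[-1] in ABBREVIATIONS:
            continue  # skip rule: keep reading past the abbreviation
        segments.append(candidate.strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments


raw = "Published in Official Gazette No. 12.\x0bIt enters into force on 1 Janu\u00adary."
prepared = clean(raw)
parts = segment(prepared)
```

Without the abbreviation list, "No." would close a segment and the source and target texts would no longer have matching segment counts, which is the kind of mismatch the set up rules are meant to prevent.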


Alignment research

• Automatic alignment

WinAlign has language-independent algorithms that count:
  – translation units by quality, which can have three levels (low, medium, high)
  – translation units aligned as 1:2 or 2:1 pairs
  – unconnected target segments
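The slides do not describe how WinAlign's algorithms work internally. To illustrate how a language-independent aligner can produce 1:1, 1:2 and 2:1 pairs and leave segments unconnected, here is a minimal Python sketch of length-based alignment with dynamic programming, loosely in the spirit of classic length-based methods. The `align` function, its cost function and the penalty values are assumptions chosen for illustration.

```python
def align(src, tgt, skip_penalty=50, merge_penalty=5):
    """Align two lists of sentences by character length.

    Returns a list of (source_indices, target_indices) pairs of type
    1:1, 1:2, 2:1, 1:0 or 0:1 (the last two are unconnected segments).
    """
    def cost(s_chunk, t_chunk):
        s_len = sum(len(s) for s in s_chunk)
        t_len = sum(len(t) for t in t_chunk)
        if not s_chunk or not t_chunk:
            return skip_penalty + s_len + t_len   # unconnected segment
        extra = 0 if len(s_chunk) == len(t_chunk) == 1 else merge_penalty
        return abs(s_len - t_len) + extra

    n, m = len(src), len(tgt)
    inf = float("inf")
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0
    steps = [(1, 1), (1, 2), (2, 1), (1, 0), (0, 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == inf:
                continue
            for di, dj in steps:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                total = best[i][j] + cost(src[i:ni], tgt[j:nj])
                if total < best[ni][nj]:
                    best[ni][nj] = total
                    back[ni][nj] = (i, j)

    # Walk back from the end to recover the cheapest sequence of pairs.
    pairs, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    return pairs[::-1]


# Placeholder sentences: only their lengths matter for this sketch.
source = ["a" * 10, "b" * 20, "c" * 6]
target = ["A" * 11, "B" * 9, "B" * 12, "C" * 5]
pairs = align(source, target)
# pairs == [([0], [0]), ([1], [1, 2]), ([2], [3])]
```

The second source sentence is paired with two target sentences (a 1:2 pair), which is exactly the kind of suggestion that the verification step singles out for manual checking.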


Alignment research

• Manual alignment
  – source text corresponds to translated target segment (Aligned TM)
  – set up of the alignment program (Aligned TM + set up rules, e.g. segment and skip rules, abbreviation user list)
  – segmentation of the source text (e.g. changes of soft returns, check of colon segmentation)



Alignment research

Match                    Raw TM   Aligned TM   + Setup rules   ++ Segmented source
100%                     121      106          112             120
95%-99%                  0        0            0               0
85%-94%                  2        5            0               0
75%-84%                  2        2            1               0
50%-74%                  1        1            2               0
No match                 6        18           11              0
Total                    132      132          126             120
Share of 100% matches    91.67%   80.30%       88.89%          100%
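The percentages in the last row follow from dividing the number of 100% matches by the total number of segments for each TM. The short Python check below reproduces them, assuming the four data columns are Raw TM, Aligned TM, + Setup rules and ++ Segmented source; the dictionary and function names are illustrative.

```python
# (100% matches, total segments) per TM, copied from the table above.
results = {
    "Raw TM": (121, 132),
    "Aligned TM": (106, 132),
    "+ Setup rules": (112, 126),
    "++ Segmented source": (120, 120),
}


def exact_match_share(exact, total):
    """Share of segments with a 100% match, in percent, rounded to two decimals."""
    return round(100 * exact / total, 2)


shares = {name: exact_match_share(*counts) for name, counts in results.items()}
# shares == {"Raw TM": 91.67, "Aligned TM": 80.3,
#            "+ Setup rules": 88.89, "++ Segmented source": 100.0}
```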


Alignment research

• Conclusion
  – The translation memories created in this study by different types of alignment process give different results regarding the quality of the translated material.
  – The results show that expert intervention is necessary when defining the set up rules, in preparing the source text for segmentation and in verifying the suggested translation units.