IRF Symposium 2007, Vienna, Austria, November 8-9, 2007, Marriott Hotel. Presentation: Machine Translation Chinese-English, Some experiments. Dr. Barrou DIALLO, Head of Research, EPO


Page 1:

IRF Symposium 2007, Vienna, Austria
November 8-9, 2007, Marriott Hotel

Presentation: Machine Translation Chinese-English
Some experiments

Dr. Barrou DIALLO, Head of Research, EPO

Page 2:

EPO Research: The case of Machine Translation

Our Vision & Mission

MT versus Patents

The Chinese language case

Our Experiments

Our Accomplishments

Perspectives

Page 3:

Our Vision & Mission (1/3)

R&D center as a source of Efficiency:

• Efficient Reading

• Accurate Searching

• Fast Granting

Our Vision: Turning Technology into IP Business

Page 4:

Our Vision & Mission (2/3)

The EPO Research Department

Merged in March 2007 into a new Information Management structure; became "horizontal"

• Located in The Hague, Netherlands
• Large portfolio of academic contacts (Labs, Universities)
• Entry point for testing and evaluating industrial solutions since 1990
• Partnerships with international institutions (WIPO, EC)
• Strong background in mathematics, algorithms, and data structures
• Network of active users and testers inside the EPO

Page 5:

Our Vision & Mission (3/3)

• Coordinating research initiatives across departments
• Technology watch and green-field research
• Performing quantitative analysis
• Identifying and communicating business opportunities
• Providing users with sensible options and courses of action
• Ensuring a smooth transition from research to development
• Communicating practices and experiences
• Reporting to and advising decision-makers on technical solutions

Helping to address challenges


Page 7:

MT versus Patents
A Strategic Domain foreseen 5 years ago

• Needs less investment than expected
• Can re-use existing data and knowledge
• Mature enough to improve efficiency
• Satisfies patent professionals
• Offers a key technology for future language challenges

Lessons learned from the European Machine Translation Programme


Page 9:

Chinese language case (1/)

• Issue 1: Sentence + Word Segmentation
• Issue 2: Text Reordering
• Issue 3: Alignment + System training
• Issue 4: Translation with proper terms
• Issue 5: Regeneration
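Issue 1 arises because Chinese text carries no spaces between words. As a minimal sketch of what word segmentation involves, the snippet below applies greedy forward maximum matching against a dictionary. This is one classic baseline, not necessarily the method used in the experiments; the function name and the toy vocabulary are assumptions made for illustration.

```python
def forward_max_match(text, vocabulary, max_len=4):
    """Greedy longest-match segmentation of unspaced text."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy vocabulary (an assumption for illustration, not a real patent lexicon).
VOCAB = {"输入", "信号", "相位", "误差"}

# Fragment meaning roughly "the phase error of the input signal".
print(forward_max_match("输入信号的相位误差", VOCAB))
```

A greedy matcher of this kind depends entirely on dictionary coverage, which is one reason patent terminology makes segmentation hard.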

Page 10:

Example: The Re-ordering Issue

Many years of research on the subject:

[Brown & al. 93] set the foundations of the SMT approach (use of Bayes' theorem).

The [Knight 99] approach (Model 3) to word re-ordering does bring some improvement to the target sentence, but it is rather oriented towards French or English structures.

[Chiang 05] proposes to re-order Chinese sentences by using hierarchical phrase pairs, that is, phrases that contain subphrases. This produces better results than the traditional phrase-based approach.
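For reference, the use of Bayes' theorem in SMT is usually written as the noisy-channel decomposition below. The notation is the standard one from the SMT literature rather than something shown on the slide: f is the source (here Chinese) sentence, e a candidate English sentence, P(e) the language model and P(f | e) the translation model.

```latex
\hat{e} \;=\; \arg\max_{e} P(e \mid f)
        \;=\; \arg\max_{e} \frac{P(e)\, P(f \mid e)}{P(f)}
        \;=\; \arg\max_{e} P(e)\, P(f \mid e)
```

P(f) can be dropped in the last step because it does not depend on e.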

Page 11:

The Re-ordering Issue

Re-ordering: the phrase-based approach

"Australia is diplomatic relations with North Korea is one of the few countries"

Page 12:

Re-ordering: Hierarchical-phrase approach (1/2)

Step 1

Step 2

Page 13:

Re-ordering: Hierarchical-phrase approach (2/2)

Step 3

"Australia is one of the few countries that have diplomatic relations with North Korea".
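The step figures are not captured in this transcript, so the following is only a sketch of how hierarchical phrase pairs can produce the sentence above. The three gapped rules and the pinyin glosses follow the well-known example in [Chiang 05]; the rule labels, the `glue` rule name, and the helper functions are assumptions introduced for illustration.

```python
# Synchronous rules: each pairs a source template with a target template.
# X1 and X2 are linked gaps that are filled by sub-derivations.
RULES = {
    "with":   ("yu X1 you X2", "have X2 with X1"),
    "that":   ("X1 de X2", "the X2 that X1"),
    "one_of": ("X1 zhiyi", "one of X1"),
    "glue":   ("X1 X2", "X1 X2"),  # monotone concatenation
}

# Plain phrase pairs (pinyin source, English target).
LEXICON = {
    "Aozhou": "Australia",
    "shi": "is",
    "Bei Han": "North Korea",
    "bangjiao": "diplomatic relations",
    "shaoshu guojia": "few countries",
}


def fill(template, parts):
    """Replace gap tokens (X1, X2) in a template with sub-phrases."""
    return " ".join(parts.get(token, token) for token in template.split())


def derive(node):
    """Return the (source, target) strings generated by a derivation tree."""
    if isinstance(node, str):
        return node, LEXICON[node]
    rule, *children = node
    src_template, tgt_template = RULES[rule]
    results = [derive(child) for child in children]
    src_parts = {f"X{i}": src for i, (src, _) in enumerate(results, start=1)}
    tgt_parts = {f"X{i}": tgt for i, (_, tgt) in enumerate(results, start=1)}
    return fill(src_template, src_parts), fill(tgt_template, tgt_parts)


DERIVATION = (
    "glue",
    ("glue", "Aozhou", "shi"),
    ("one_of", ("that", ("with", "Bei Han", "bangjiao"), "shaoshu guojia")),
)

source, target = derive(DERIVATION)
print(source)
print(target)
```

Because each gapped rule carries its own reordering, the long-distance swap happens inside the rules instead of being left to a separate distortion model.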

Page 14:

Solution? A semi-automatic approach

• Computer-Assisted Translation (CAT)
• Using high-quality manually-aligned texts based on international organizations' bi-text repositories and translation memories.
• Using a bilingual ontology to align words or phrases which are not present in the training corpora. There are available ontologies of patent vocabulary in English; a manual Chinese translation of the central concepts could be gradually added by IPC category.
• Using syntactic rules to improve lexical choices and collocation processing, e.g. the Univ. of Geneva process (Chomsky syntactic parser for English), to guarantee a well-formed final English sentence.


Page 16:

Comparison of MT systems
An empirical approach (1/3)

3 systems on the test bench:

• Rule-based system (Systran)
• Statistical system (Language Weaver)
• Hybrid system (CCID prototype)

1 evaluation grid:

• Scores of 1-4
• Usability & Readability criteria

Page 17:

Comparison of MT systems (2/3)

Poor (1) Medium (2) Good (3) Excellent (4)

[Chart: Rule-based MT, Hybrid MT and Statistical MT placed on the scale above]

Page 18:

Comparison of MT systems
An empirical approach (3/3)

No MT system performs properly; CAT (Computer-Aided Translation) seems necessary

The hybrid system seems more promising

Post-editors needed for checking outputs?

No statistical significance is to be reported - further investigations needed!

Page 19:

Readability Tests on Human Translations: Flesch et al.

Designed to indicate how difficult a reading passage is to understand.

There are two tests: Flesch Reading Ease and Flesch–Kincaid Grade Level.

These tests have become a standard and are bundled with popular word processing programs.

Page 20:

Readability Tests on Human Translations: Flesch et al.

Flesch Reading Ease score: 206.835 – (1.015 x ASL) – (84.6 x ASW)

Rates text on a 100-point scale; the higher the score, the easier it is to understand the document (60 to 70 for standard docs).

Flesch-Kincaid Grade Level score: (0.39 x ASL) + (11.8 x ASW) – 15.59

Rates text on a U.S. school grade level. A score of 8.0 means that an eighth grader can understand the document (7.0 to 8.0 for standard docs).

Where:
ASL = average sentence length (# of words / # of sentences)
ASW = average number of syllables per word (# of syllables / # of words)
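Both formulas are simple enough to compute directly. The sketch below takes pre-computed word, sentence and syllable counts as input; counting syllables reliably is a separate problem and is left out. The function names and the sample counts are assumptions made for illustration.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher means easier to read."""
    asl = words / sentences   # average sentence length
    asw = syllables / words   # average syllables per word
    return 206.835 - 1.015 * asl - 84.6 * asw


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level: approximate U.S. school grade."""
    asl = words / sentences
    asw = syllables / words
    return 0.39 * asl + 11.8 * asw - 15.59


# Made-up counts: 100 words, 5 sentences, 150 syllables.
print(round(flesch_reading_ease(100, 5, 150), 1))
print(round(flesch_kincaid_grade(100, 5, 150), 1))
```

Since ASL enters both formulas linearly, a text made of one very long sentence is pushed towards a low (even negative) Reading Ease score and a very high grade level.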

Page 21:

Human Translation assessment
Example (1/2)

CN1926077 The Making and Using Methods of Plant/Soil Activated Liquid

Abstract

In the mineral composition ion water of concentrated sulfuric acid, which add the vegetal leavening confected by enzyme and microbe used to produce enzyme and the muscovado made by sugarcane together, under the aerobic condition, the selective preference is, do the commensalisms cultivation at about 25 Centigrade. After decomposing the sugar, before rot and ferment, the selective preference is, spreading on the leaf surface or pouring in the soil during the alcohol fermenting stage.

Flesch Reading Ease score: 13/100
Flesch-Kincaid Grade level: 17
Score: 7/10

Comments: The Abstract and parts of the claims are convoluted/badly structured in parts and some spelling mistakes.

What's Important? Figures or Comments?

Page 22:

Human Translation assessment
Example (2/2)

CN2354381 Claims 1. A time switch of gas appliances, composing of mechanical gear timer and fuel

gas valve, wherein it also comprises round upper cover board subassembly and lower cover board subassembly, a valve switch knob (4) fixed on the upper end of the valve switch spigot shaft (7) is installed on the front of the upper cover board, the valve switch spigot shaft (7) penetrates through the upper cover board (6) and the lower cover board (29), a timer hollow shaft (8) is installed out of the valve switch spigot shaft (7), the timer hollow shaft (8) penetrates through uthe pper cover board (6), a round time knob (5) is installed between the upper end valve switch knob of the timer hollow shaft and the upper cover board (6), a time indicating dial (3) interlocking with the timer hollow shaft (8) is installed between the round time knob (5) and the upper cover board (6); a mechanical gear timer is installed on the reverse side of the upper cover board (6), an unlocking cam(9) is installed out of the timer hollow shaft (8) in the central part;

Flesch-Kincaid Grade level: 49
Flesch Reading Ease score: -45
Score: 9/10
Comments: Long convoluted sentences. Diagrammatical explanations. Minor grammatical and typo errors.

Page 23:

Human vs machine: unfair competition?

Original text:

一种利用相位锁定一捷变频率调制输出信号到梳式发生器形成输出的任何选定信道的装置和方法。跟踪输入信号的相位误差,该输入信号被调制成载波输出频率,和该调制过的输出频率,利用减去该输入信号的方法锁定到梳式发生器输出,并消除该相位误差。

Systran:

One kind to combs the type generator using a phase lock agility frequency modulation output signal to form the output any to designate channel's installment and the method. The track input signal's phase error, this input signal is modulated the carrier output frequency, with should modulate the output frequency, the use subtracts this input signal the method to lock combs the type generator output, and eliminates this phase error

Human translation:

An apparatus and method is disclosed which phase locks a frequency-agile modulated output signal to any selected channel of a comb generated output. The phase error of an input signal is tracked, the input signal is modulated up to a carrier output frequency, and the modulated output frequency is locked to the comb generator output by subtracting the input signal and negating the phase error.

Is such an MT useful?


Page 25:

Our Accomplishments
(June 2006)

Chinese patents showing Priority documents:

• 105000 CN documents with US priorities
• 15000 CN documents with EP priorities
• 15000 CN documents with GB priorities
• 15000 CN documents with EP priorities
• 400 CN documents with WO priorities

A sufficient source for starting-up an alignment?

# of aligned sentences

Page 26:

Manual Data cleaning
Dirty texts generate XML failures

CN86103346

Spherical particles of vinyl resins having high bulk density can be prepared by the suspension polymerization process by using as a dispersant an alkyl hydroxy cellulose having a viscosity of from about 1000 to about 100,000 cps. A suitable dispersant is a hydroxypropyl methyl cellulose polymer having the formula: <IMAGE> +TR <IMAGE> where n is from about 300 to about 1500.

Use of XMLSpy Professional to check text

Page 27:

Methodology of Word Alignment

[OCH93]

Page 28:

First Example of alignment

Page 29:

Second example of alignment

Page 30:

TMX Formatting of aligned texts

<?xml version="1.0" ?>
<!DOCTYPE tmx SYSTEM "tmx14.dtd">
<tmx version="1.4">
  <header creationtoolversion="1.0.0" datatype="plaintext" segtype="sentence" adminlang="EN-US" srclang="EN" o-tmf="txt" creationtool="MetaReadAlign">
  </header>
  <body>
    <tu>
      <tuv xml:lang="EN"><seg> In a preferred embodiment, a low-band isolator network, coupled to the antenna element, provides signal isolation between high-band and low-band signal paths during high-band operation.</seg></tuv>
      <tuv xml:lang="ZH"><seg> NOT DISPLAYABLE </seg></tuv>
    </tu>
  </body>
</tmx>

Provides compatibility with industry standards
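As a rough sketch of how aligned sentence pairs could be written out in this layout, the snippet below builds the same element structure with Python's standard `xml.etree.ElementTree`. The function name `build_tmx` is an assumption, the header attribute values simply mirror the sample above, and the XML declaration and DOCTYPE line are omitted for brevity. The sample pair is taken from the phase-lock abstract shown earlier in this deck.

```python
import xml.etree.ElementTree as ET

# ElementTree serializes this qualified name as the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, src_lang="EN", tgt_lang="ZH"):
    """Return a TMX 1.4 document (as a string) with one <tu> per sentence pair."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "MetaReadAlign",  # value copied from the sample above
        "creationtoolversion": "1.0.0",
        "datatype": "plaintext",
        "segtype": "sentence",
        "adminlang": "EN-US",
        "srclang": src_lang,
        "o-tmf": "txt",
    })
    body = ET.SubElement(tmx, "body")
    for src_text, tgt_text in pairs:
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, src_text), (tgt_lang, tgt_text)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


xml_text = build_tmx([
    ("The phase error of an input signal is tracked", "跟踪输入信号的相位误差"),
])
print(xml_text)
```

Building the tree through an XML library rather than by string concatenation keeps the output well-formed even when a segment contains characters such as `<` or `&`.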

Page 31:

QUALITY CONTROL PANEL BEFORE ALIGNMENT

Evaluation record CN85108669

Welcome EvaluatorX

Match options:

• 100% match
• >70% match
• <50% match
• partial translation
• bad translation
• total mismatch

Radio buttons, multiple entries possible (e.g. partial translation, 100% match), default value "100% match". Entries saved on server.

Buttons:

• Save Status: saves the status for next time
• Reset: resets the complete evaluation process (everything gets reset and lost)
• Transmit Evaluation
• Record Evaluated, Proceed with next: saves the selected buttons for this record and jumps to the next record

Record Status: evaluated/not evaluated. Allows browsing.
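The behaviour described for this panel (several labels per record, a default of "100% match", an evaluated flag, and a global reset) could be modelled roughly as follows. This is only a data-structure sketch; the class, method and function names are assumptions and do not describe the actual EPO tool.

```python
from dataclasses import dataclass, field

LABELS = (
    "100% match", ">70% match", "<50% match",
    "partial translation", "bad translation", "total mismatch",
)
DEFAULT_LABEL = "100% match"


@dataclass
class EvaluationRecord:
    publication: str
    labels: set = field(default_factory=lambda: {DEFAULT_LABEL})
    evaluated: bool = False

    def select(self, *labels):
        """Store one or more labels for this record."""
        unknown = set(labels) - set(LABELS)
        if unknown:
            raise ValueError(f"unknown labels: {sorted(unknown)}")
        self.labels = set(labels)

    def mark_evaluated(self):
        """Flag the record as done, as 'Record Evaluated' does."""
        self.evaluated = True


def reset_all(records):
    """Discard every stored evaluation, as the Reset button does."""
    for record in records:
        record.labels = {DEFAULT_LABEL}
        record.evaluated = False


record = EvaluationRecord("CN85108669")
record.select("partial translation", "100% match")
record.mark_evaluated()
print(sorted(record.labels), record.evaluated)
```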


Page 33:

Acknowledgments

EPO Staff experts in Research & Development

Jan Mannekens

Betty Yang

CrossLanguage

Metaread

University of Geneva

Questions?

[email protected]

Page 34:

References

[Brown & al. 93] Brown, Della Pietra, Della Pietra, Mercer: The Mathematics of Statistical Machine Translation: Parameter Estimation, Computational Linguistics vol. 19 no. 2, 1993

[Knight 99] Kevin Knight: A Statistical MT Tutorial Workbook, April 1999

[Chiang 05] David Chiang: A Hierarchical Phrase-Based Model for Statistical Machine Translation, Proceedings of the 43rd Annual Meeting of the ACL, 2005