Tapta4IPC: helping translation of IPC definitions
Bruno Pouliquen
25 Feb 2013, IPC workshop
Translation assistant for patent titles and abstracts in PATENTSCOPE - potential use in translating IPC definitions collaboration
Statistical Machine Translation: bottom-up approach
no rules, no grammar, no dictionary, no terminology, only the parallel texts (bitexts)
We use an open-source system: Moses
Tapta: Translation of Patent Titles and Abstracts
• Originally built to translate patent applications
• Adapted to various applications
Introduction
data
system
Our system prepares the data for Moses, applies some post-processing (filtering, pruning, binarization, optimization…) and offers a Web interface for translation
Tapta framework
Pipeline (diagram): source-language and target-language texts → gather/convert data → bitexts → clean → re-clean → train model → post-filter → prune → binarize → optimize → publish
Introduction: Tapta
At WIPO, as part of PATENTSCOPE (English, French, German, Chinese, Japanese)
e.g. http://patentscope.wipo.int/translate/simpleTranslate.jsf?id=JP75694586&langpair=jaen
Automatic translation of a patent application only available in Japanese…
At the United Nations (English from/into Arabic, French, Spanish, Russian and Chinese)
Technical workflow
Data preparation (diagram), steps shown: source-language and target-language bitexts, sentence-split, tokenization, sentence-align, score alignment, filter alignments, filter wrong language, bitexts aligned at sentence level
Moses’ training: phrase table, reordering model, language model
Runtime: translation client, translation server, several Moses decoders
Example aligned segments (En / Es):
Strengthening of forum for human dignity : legal aid / Fortalecimiento del foro para la dignidad humana – asistencia jurídica
must respect all aspects of human dignity / debe respetar todos los aspectos de la dignidad humana
should fully respect human dignity / se deben respetar plenamente la dignidad humana
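The filtering steps named in the workflow can be illustrated with a small sketch. The slides do not say which criteria Tapta uses, so the length-ratio heuristic, the thresholds and the function names `keep_pair` and `clean_bitext` below are assumptions chosen for illustration only. The test pairs reuse the En/Es segments from the slide, with one deliberately mismatched pair.

```python
def keep_pair(src: str, tgt: str, max_ratio: float = 3.0, max_len: int = 100) -> bool:
    """Return True if a sentence pair looks like a plausible translation pair.

    Assumed heuristic (not taken from the slides): drop empty sides,
    overlong sentences, and pairs whose token counts differ too much.
    """
    src_tokens, tgt_tokens = src.split(), tgt.split()
    if not src_tokens or not tgt_tokens:
        return False
    if len(src_tokens) > max_len or len(tgt_tokens) > max_len:
        return False
    ratio = len(src_tokens) / len(tgt_tokens)
    return 1 / max_ratio <= ratio <= max_ratio


def clean_bitext(pairs):
    """Keep only the pairs that pass the filter, with whitespace trimmed."""
    return [(s.strip(), t.strip()) for s, t in pairs if keep_pair(s, t)]


pairs = [
    # Well-aligned pair from the slide.
    ("must respect all aspects of human dignity",
     "debe respetar todos los aspectos de la dignidad humana"),
    # Empty target side: dropped.
    ("should fully respect human dignity", ""),
    # Deliberately misaligned pair (2 tokens vs 10 tokens): dropped.
    ("legal aid",
     "Fortalecimiento del foro para la dignidad humana – asistencia jurídica"),
]
cleaned = clean_bitext(pairs)
```

A ratio-based filter is cheap and language-independent, but it cannot detect a sentence in the wrong language; that step would need a separate language identifier.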
IPC context
• Gather data:
  – Get existing definitions
  – Add IPC schema (XML on WIPO website)
  – Add “few” texts from patents
• “Learn” translation model
• Translate new texts
Get existing data, build parallel texts
Existing definitions…
IPC schema…
<ipcEntry kind="1" symbol="B61F0019020000" ipcLevel="A" entryType="K" lang="EN"><textBody><title><titlePart><text>Wheel guards</text></titlePart></title></textBody></ipcEntry>
<ipcEntry kind="1" symbol="B61F0019020000" ipcLevel="A" entryType="K" lang="FR"><textBody><title><titlePart><text>Couvre-roues</text></titlePart></title></textBody></ipcEntry>
Patent texts…
WO/2013/014517
(EN) TYRE FOR VEHICLE WHEELS
(FR) PNEUMATIQUE POUR ROUES DE VÉHICULE
Bitext: training material…
Wheels / roues
Wheel guards / Couvre-roues
Tyre for vehicle wheels / Pneumatique pour roues de véhicule
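As a rough illustration of how the IPC schema entries can become bitext lines, the sketch below pairs English and French titles that share the same `symbol` attribute. It handles only the flat fragment shown on the slide; the real IPC scheme files may use namespaces and nested entries, and the function names `titles_by_symbol` and `build_bitext` are hypothetical.

```python
import xml.etree.ElementTree as ET

# The two entries from the slide, one per language.
EN_XML = ('<ipcEntry kind="1" symbol="B61F0019020000" ipcLevel="A" entryType="K" lang="EN">'
          '<textBody><title><titlePart><text>Wheel guards</text></titlePart></title></textBody>'
          '</ipcEntry>')
FR_XML = ('<ipcEntry kind="1" symbol="B61F0019020000" ipcLevel="A" entryType="K" lang="FR">'
          '<textBody><title><titlePart><text>Couvre-roues</text></titlePart></title></textBody>'
          '</ipcEntry>')


def titles_by_symbol(xml_fragment: str) -> dict:
    """Map each IPC symbol to its title text (assumes flat, non-nested entries)."""
    # Wrap in a dummy root so that several sibling entries can be parsed at once.
    root = ET.fromstring(f"<root>{xml_fragment}</root>")
    titles = {}
    for entry in root.iter("ipcEntry"):
        parts = [t.text.strip() for t in entry.iter("text") if t.text]
        if parts:
            titles[entry.get("symbol")] = "; ".join(parts)
    return titles


def build_bitext(en_xml: str, fr_xml: str) -> list:
    """Pair titles that exist in both languages, ordered by symbol."""
    en, fr = titles_by_symbol(en_xml), titles_by_symbol(fr_xml)
    return [(en[s], fr[s]) for s in sorted(en.keys() & fr.keys())]


bitext = build_bitext(EN_XML, FR_XML)
```

Keying on the symbol means that an entry missing in one language is simply skipped instead of shifting the alignment of everything after it.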
How well does it work?
Automatic evaluation: BLEU score
Principle: similarity of n-grams between evaluated and reference sentences
On IPC definitions, English-French: BLEU = 48%
(without patent data: 44%)
Good quality
needs human post-editing
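The n-gram principle behind BLEU can be made concrete with a simplified, single-reference, sentence-level version. This is a sketch only: it is not the evaluation tool used for the figures above, it applies no smoothing (so any n-gram order without a match gives 0), and BLEU is normally computed over a whole test corpus. The candidate sentence in the example is a hypothetical system output.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Simplified BLEU: geometric mean of clipped n-gram precisions times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(overlap / total))
    # Penalize candidates that are shorter than the reference.
    brevity = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(sum(log_precisions) / max_n)


reference = "Pneumatique pour roues de véhicule"
exact = bleu(reference, reference)
partial = bleu("Pneu pour roues de véhicule", reference)  # hypothetical output
```

An exact match scores 1.0, while the hypothetical output with one wrong word scores lower because the mismatch breaks every n-gram that contains it.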
Tapta4IPC prototype (1)
Live demo using: http://patentscope.wipo.int/translateUN/translateIPC.jsf
Conclusion / future work
This is a prototype, but the quality already looks acceptable
Human evaluation?
Better integrate the tool
In PCA6TRANSDEF?
Other languages?
Tapta4IPC in various languages
Tapta4IPC should work reasonably well for the following languages (we have built some language-specific tools and we have patent corpora):
• German
• Japanese
• Korean
• Spanish
• Dutch
• Portuguese
• Chinese
• Russian
More challenging:
• Czech, Slovak, Polish (many word forms; training corpus?)
• Estonian (even more word forms; would in theory require a larger training corpus)
Other languages: Arabic, Italian, Danish, Swedish, etc.
Thank you for your attention
شكرا لكم على اهتمامكم
Merci pour votre attention!
感谢您的关注
Grazie per la vostra attenzione!
¡Gracias por su atención!
Vielen Dank für Ihre Aufmerksamkeit!
Obrigado pela vossa atenção!
Dziękuję bardzo za Państwa uwagę!
Děkujeme za Vaši pozornost!
Ďakujem ti veľmi pekne za tvoju pozornosť
Tänan tähelepanu eest!
Благодарим за Вашето внимание!
Tak for Jeres opmærksomhed!