
Moses: Past, Present, Future

Hieu Hoang, XRCE 2013

Timeline

2002: Pharaoh decoder, the precursor to Moses

2005: Moses begun as a replacement for Pharaoh

2006: JHU workshop extends Moses significantly

Since late 2006: funded by the EU projects EuroMatrix and EuroMatrixPlus

2012: MosesCore

What is Moses? Common Misconceptions

• Only the decoder
• Only for Linux
• Difficult to use
• Unreliable
• Only phrase-based
• No sparse features
• Developed by one person
• Slow

Only the decoder – a replacement for Pharaoh?

• Training
• Tuning
• Decoder
• Other
– XML server
– Phrase-table pruning/filtering
– Domain adaptation
– Experiment management system

Only works on Linux

• Tested on
– Windows 7 (32-bit) with Cygwin 6.1
– Mac OS X 10.7 with MacPorts
– Ubuntu 12.10, 32- and 64-bit
– Debian 6.0, 32- and 64-bit
– Fedora 17, 32- and 64-bit
– openSUSE 12.2, 32- and 64-bit

• Project files for
– Visual Studio
– Eclipse on Linux and Mac OS X

Difficult to use

• Easier compile and install
– Boost bjam
– No installation required

• Binaries available for
– Linux
– Mac
– Windows/Cygwin

• Moses + Friends
– IRSTLM
– GIZA++ and MGIZA

• Ready-made models trained on Europarl

Unreliable

• Monitor check-ins
• Unit tests
• More regression tests
• Nightly tests
– Run end-to-end training
– http://www.statmt.org/moses/cruise/

• Tested on all major OSes
• Train Europarl models
– Phrase-based, hierarchical, factored
– 8 language pairs
– http://www.statmt.org/moses/RELEASE-1.0/models/

Only phrase-based – a replacement for, or extension of, Pharaoh?

• From the beginning
– Factored models
– Lattice and confusion network input
– Multiple LMs, multiple phrase-tables

• Since 2009
– Hierarchical model
– Syntactic models

No Sparse Features

• Large number of sparse features
– 1+ million features
– Sparse AND dense features (see the scoring sketch below)

• Available sparse features
– Target Bigram
– Target Ngram
– Source Word Deletion
– Target Word Insertion
– Sparse Phrase Table
– Phrase Boundary
– Phrase Length
– Phrase Pair
– Global Lexical Model

• Different tuning algorithms
– MERT
– MIRA
– Batch MIRA (Cherry & Foster, 2012)
– PRO (Hopkins and May, 2011)
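To make the sparse/dense distinction concrete, here is a minimal sketch (in Python, purely illustrative) of the log-linear scoring that both kinds of feature feed into; the feature names and values below are invented, not Moses's actual feature strings.

```python
# Log-linear scoring with mixed dense and sparse features.
# All names and numbers here are hypothetical.

def score(features, weights):
    """Dot product of a sparse feature map with a weight map.
    Features with no tuned weight contribute nothing."""
    return sum(v * weights.get(name, 0.0) for name, v in features.items())

# Dense features: a handful of real-valued scores, always present.
dense = {"lm": -42.7, "phrase_fw": -8.1, "phrase_bw": -9.3, "word_penalty": -6.0}

# Sparse features: indicator features drawn from a space of millions of
# possible names; only a few fire on any one hypothesis.
sparse = {"pp_the-house~das-Haus": 1.0, "tgt_bigram_the-house": 1.0}

weights = {"lm": 0.5, "phrase_fw": 0.2, "phrase_bw": 0.2,
           "word_penalty": -0.3, "pp_the-house~das-Haus": 0.01}

print(score({**dense, **sparse}, weights))
```

Tuning (MERT, MIRA, Batch MIRA, PRO) is then a search for the weights that maximize translation quality on a development set.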

Developed by one person

• ANYONE can contribute
– 50 contributors

‘git blame’ of Moses repository

[Bar chart: share of commits per contributor, y-axis 0%–40%, for Kenneth Heafield, Hieu Hoang, phkoehn, Ondrej Bojar, Barry Haddow, sanmarf, Tetsuo Kiso, Eva Hasler, Rico Sennrich, wlin12, nicolabertoldi, eherbst, Ales Tamchyna, Colin Cherry, Matous Machacek, Phil Williams]

Slow: Decoding

• Fast decoding – thanks to Ken (Heafield)!

Slow: Training

• Multithreaded
• Reduced disk IO
– compress intermediate files (sketched below, after the table)
• Reduced disk space requirement

Training time (mins) by number of cores, and model size:

Model          1-core    2-cores     4-cores     8-cores    Size (MB)
Phrase-based       60    47 (79%)    37 (63%)    33 (56%)         893
Hierarchical     1030    677 (65%)   473 (45%)   375 (36%)       8300
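The sketch below illustrates (in Python, not actual Moses code) the two training speed-ups named above: shards of the data are processed in parallel threads, and intermediate files are written gzip-compressed to cut disk IO and disk space. The file names, shard data, and record format are all made up.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor

def extract_shard(shard_id, sentence_pairs):
    """Stand-in for per-shard phrase extraction: write one compressed
    intermediate file and return its path."""
    path = f"extract.{shard_id}.gz"
    with gzip.open(path, "wt", encoding="utf8") as out:
        for src, tgt in sentence_pairs:
            out.write(f"{src} ||| {tgt}\n")  # placeholder record format
    return path

shards = [[("das Haus", "the house")], [("ein Buch", "a book")]]
with ThreadPoolExecutor(max_workers=8) as pool:
    print(list(pool.map(extract_shard, range(len(shards)), shards)))
```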

What is Moses? Common Misconceptions

• Only the decoder – decoding, training, tuning, server
• Only for Linux – Windows, Linux, Mac
• Difficult to use – easier compile and install
• Unreliable – multi-stage testing
• Only phrase-based – hierarchical, syntax models
• No sparse features – sparse AND dense features
• Developed by one person – everyone
• Slow – fastest decoder, multithreaded training, less IO

Future priorities

• Code cleanup
• MT applications
– Computer-aided translation
– Speech-to-speech
• Incremental training
• Better translation
– smaller models
– bigger data
– faster training and decoding

Code cleanup

• Framework for feature functions
– Easier to add new feature functions (see the sketch below)

• Cleanup
– Refactor
– Delete old code
– Documentation
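As an illustration of what such a framework buys, here is a Python sketch of a minimal feature-function interface: a common base class, so a new feature only has to implement one method. It mirrors the idea behind Moses's C++ feature-function hierarchy, but the class and method names are invented for this sketch, not the real API.

```python
from abc import ABC, abstractmethod

class FeatureFunction(ABC):
    """Common interface every feature function plugs into."""
    def __init__(self, name):
        self.name = name

    @abstractmethod
    def evaluate(self, target_phrase):
        """Return {feature_name: value} for one hypothesis."""

class PhraseLength(FeatureFunction):
    """Indicator feature on the length of the target phrase."""
    def evaluate(self, target_phrase):
        return {f"{self.name}_{len(target_phrase.split())}": 1.0}

for ff in [PhraseLength("PhraseLength")]:
    print(ff.evaluate("the house"))  # {'PhraseLength_2': 1.0}
```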

MT Applications

• Computer-aided translation
– integration with front-ends
– better use of user-feedback information

MT Applications

• Speech-to-speech
– ambiguous input: lattices and confusion networks (see the sketch below)
– translating prosody: factored word representation
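To show what "ambiguous input" means here, the following is a toy Python sketch of a confusion network (not the Moses input format): each slot holds alternative words with probabilities, e.g. from a speech recognizer, and the decoder can explore every path. For brevity the sketch just takes the best word per slot.

```python
# Each slot: list of (word, probability) alternatives.
confusion_network = [
    [("recognize", 0.6), ("wreck a nice", 0.4)],
    [("speech", 0.7), ("beach", 0.3)],
]

# Greedy 1-best path; a real decoder scores all paths jointly
# with the translation and language models.
best_path = [max(slot, key=lambda wp: wp[1])[0] for slot in confusion_network]
print(" ".join(best_path))  # -> "recognize speech"
```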

Incremental Training

• Incremental word alignment
• Dynamic suffix array
• Phrase-table update (sketched below)
• Better integration with the rest of Moses
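A toy Python sketch of the phrase-table update step (not Moses code): rather than retraining from scratch, counts for phrase pairs extracted from newly seen sentence pairs are folded into the model, and probabilities are recomputed on the fly.

```python
from collections import defaultdict

counts = defaultdict(int)      # (src_phrase, tgt_phrase) -> count
src_counts = defaultdict(int)  # src_phrase -> count

def update(phrase_pairs):
    """Fold newly extracted phrase pairs into the model."""
    for src, tgt in phrase_pairs:
        counts[(src, tgt)] += 1
        src_counts[src] += 1

def p_tgt_given_src(src, tgt):
    """Relative-frequency estimate p(tgt | src)."""
    return counts[(src, tgt)] / src_counts[src] if src_counts[src] else 0.0

update([("das Haus", "the house"), ("das Haus", "the building")])
update([("das Haus", "the house")])
print(p_tgt_given_src("das Haus", "the house"))  # 2/3
```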

Smaller files

• Smaller binary – phrase-tables– language models

• Mobile devices• Fits into memory

faster decoding!

• Efficient data structures– suffix arrays– compressed file formats
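The suffix-array idea behind compact phrase tables, as a toy Python sketch (illustrative only; Moses's implementation is far more elaborate): store just the tokenized corpus plus a sorted array of suffix start positions, then locate every occurrence of a phrase by binary search. Requires Python 3.10+ for bisect's key= argument.

```python
import bisect

corpus = "the house is small the house is big".split()
# Suffix array: start positions sorted by the suffix they begin.
suffixes = sorted(range(len(corpus)), key=lambda i: corpus[i:])

def occurrences(phrase):
    """Start positions of `phrase` (a token list) in the corpus."""
    key = lambda i: corpus[i:i + len(phrase)]
    lo = bisect.bisect_left(suffixes, phrase, key=key)
    hi = bisect.bisect_right(suffixes, phrase, key=key)
    return sorted(suffixes[lo:hi])

print(occurrences(["the", "house"]))  # -> [0, 4]
```

Because lookups are resolved at decode time against the corpus itself, newly seen sentences can be appended and re-indexed, which is what makes the dynamic suffix array a natural fit for incremental training.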

Better Translations

• Goal: consistently beat phrase-based models for every language pair – as the BLEU scores below show, hierarchical models are not there yet

Pair    Phrase-Based    Hierarchical
en-es          24.81           24.20
es-en          23.01           22.37
en-cs          11.04           10.93
cs-en          15.72           15.68
en-de          11.87           11.62
de-en          15.75           15.53
en-fr          22.84           22.28
fr-en          25.08           24.37
zh-en          27.46           23.91
ar-en          47.90           46.56

The End
