
Philipp Koehn
Chief Scientist, Omniscien Technologies
Professor of Computer Science, Johns Hopkins University

Philipp Koehn is Professor of Computer Science at Johns Hopkins University and Chief Scientist for Omniscien Technologies. He also holds the Chair for Machine Translation in the School of Informatics at the University of Edinburgh. Philipp is a leader in the field of statistical machine translation research with over 100 publications. He is the author of the seminal textbook in the field. Under his leadership, the open source Moses system became the de-facto standard toolkit for machine translation in research and commercial deployment.

Philipp led international research projects such as EuroMatrix and CASMACAT. Philipp's research has been funded by the European Union, DARPA, Google, Facebook, Amazon, Bloomberg and several other funding agencies.

Philipp received his PhD in 2003 from the University of Southern California and was a postdoctoral research associate at MIT. He was a finalist in the European Patent Office's European Inventor Award in 2013 and received the Award of Honor from the International Association for Machine Translation in 2015.

At Omniscien Technologies, Philipp has refined machine translation technology for use in real-world deployments and helped to develop methods for data acquisition and refinement. Philipp continues to drive innovation and technological development at Omniscien Technologies.

AI, MT and Language Processing Symposium


Philipp Koehn
Chief Scientist, Omniscien Technologies
Professor of Computer Science, Johns Hopkins University

The recent trend of using deep learning to solve a wide variety of problems in Artificial Intelligence has also reached machine translation, thus establishing a new state-of-the-art approach for this application. This approach is not yet settled by any means. New neural architectures are proposed, and ideas coming from such diverse fields as computer vision, game playing, and speech recognition can be applied to machine translation as well.

At the practical end, we are just learning about the deployment challenges of this technology, since old methods, for example for integrating terminology databases or for domain adaptation, no longer apply.

This presentation will give an overview of the latest developments in research and what this means for practical deployment.


Copyright © 2018 Omniscien Technologies. All Rights Reserved.

Facebook.com/omniscien @omniscientech Omniscien Technologies [email protected]

Research in Translation –What is Exciting and Shows Promise Ahead?

Philipp Koehn


Overview

• Evolution of Machine Translation

• Deep Learning

• Neural Machine Translation

• Challenges

• Looking Forward


Evolution of Machine Translation


Machine Translation Paradigms

• Various approaches:
  • Rule-based (1970s)
  • Word-based (1990s)
  • Phrase-based (2000s)
  • Syntax-based (2010s)
  • Neural-based (2016+)

[Diagram (Vauquois triangle): source → target via lexical transfer, syntax transfer, semantic transfer, or interlingua]


Hype and Reality


Better Machine Learning

• Probabilistic models (1990s)

• Increased use of machine learning (2000s)

• Neural networks (since mid 2010s)


Deep Learning


Two Objectives

Fluency

• Translation must be fluent in the target language

• Need model that assigns a language score to each sentence

Adequacy

• Translation must have same meaning as source sentence

• Need model that assigns a translation score to each sentence


Learning from Data

• Detect patterns in aligned segment pairs


Machine Learning

• Key to success:
  • analyze the problem
  • feature engineering

• For instance, machine translation:
  • What features are relevant for word order?
  • What features are relevant for lexical translation?

[Diagram: input → features → output]


Neural Learning

• Promise: no more feature engineering

• Several steps of processing features automatically discovered

[Diagram: input → hidden → output]


Deep Learning

• More layers

• More complex feature interactions

[Diagram: input → hidden → hidden → hidden → output]


Neural Machine Translation


word2vec

• Task: Predict word in the middle


Neural Network Solution

• Learn mapping with a neural network


Map Word to Embedding

• Vector representation of word

• Mathematically:
  • a matrix multiplication
  • followed by a non-linear activation function
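A quick sketch of why "matrix multiplication" here is just a table lookup: multiplying a one-hot vector by the embedding matrix selects one row. All values below are toy numbers for illustration.

```python
import numpy as np

V, D = 5, 3  # toy vocabulary size and embedding dimension
rng = np.random.default_rng(1)
E = rng.normal(size=(V, D))  # embedding matrix: one row per word

def one_hot(i, n):
    v = np.zeros(n)
    v[i] = 1.0
    return v

word_id = 2
# The "matrix multiplication" from the slide: one-hot vector times matrix...
emb_matmul = one_hot(word_id, V) @ E
# ...which selects exactly one row, so in practice it is implemented as a lookup
emb_lookup = E[word_id]

# A non-linear activation (tanh here) can then follow, as on the slide
activated = np.tanh(emb_matmul)
```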


Visualizing Neural Relationships and Features

Relationships are built much like the human brain: collections of concepts and vocabulary.


Visualizing Neural Relationships and Features

Distance indicates closeness of relationships. Groupings are formed.


Visualizing Neural Relationships and Features

Groups are directly and indirectly interrelated, e.g. Sports + Broadcasting and Entertainment.


Neural Machine Translation

• Recall: two models

• Language model

… to ensure fluent output

• Translation model

… to ensure adequate translations

Language Models

• Sequential language models: predict the next word

I … → I like … → I like to … → I like to learn … → I like to learn about … → I like to learn about machine … → I like to learn about machine translation .
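The next-word prediction task above can be demonstrated without any neural network at all; a count-based bigram model makes the task concrete. The tiny corpus below is hypothetical.

```python
from collections import Counter, defaultdict

# Tiny hypothetical training corpus
sentences = [
    "i like to learn about machine translation",
    "i like to read about machine learning",
    "i like machine translation",
]

# Count bigrams: how often does each word follow the previous one?
follow = defaultdict(Counter)
for s in sentences:
    words = ["<s>"] + s.split() + ["</s>"]
    for prev, nxt in zip(words, words[1:]):
        follow[prev][nxt] += 1

def predict_next(word):
    """Return the most frequent continuation after `word`."""
    return follow[word].most_common(1)[0][0]

def complete(prefix, max_len=10):
    """Greedily extend a prefix one predicted word at a time."""
    words = prefix.split()
    while len(words) < max_len:
        nxt = predict_next(words[-1])
        if nxt == "</s>":
            break
        words.append(nxt)
    return " ".join(words)
```

A neural language model replaces the count table with a learned function, but the interface — given the words so far, score the next word — is the same.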


Recurrent Neural Language Model

Predict the first word of a sentence (same as before, just drawn top-down).

[Diagram: given word <s> → embedding → hidden state → predicted word "the"]


Recurrent Neural Language Model

Predict the second word of a sentence, re-using the hidden state from the first word prediction.

[Diagram: given words <s>, the → embeddings → hidden states → predicted words "the", "house"]


Recurrent Neural Language Model

[Diagram: the full sequence: given words <s> the house is big . → predicted words the house is big . </s>]
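The recurrent language model in these slides (given word → embedding → hidden state → predicted word) can be sketched as a forward pass in NumPy. All weights are random and untrained here; the dimensions are toy values chosen for illustration.

```python
import numpy as np

V, D, H = 6, 4, 5  # toy vocabulary, embedding, and hidden sizes
rng = np.random.default_rng(2)
E = rng.normal(0, 0.1, (V, D))    # word embeddings
W_h = rng.normal(0, 0.1, (H, H))  # hidden-to-hidden: re-uses the previous state
W_x = rng.normal(0, 0.1, (H, D))  # embedding-to-hidden
W_o = rng.normal(0, 0.1, (V, H))  # hidden-to-output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_lm(word_ids):
    """Run the recurrent LM over a sentence: at each position, return a
    probability distribution over the next word."""
    h = np.zeros(H)
    predictions = []
    for wid in word_ids:
        x = E[wid]                       # embedding of the given word
        h = np.tanh(W_h @ h + W_x @ x)   # new hidden state re-uses the old one
        predictions.append(softmax(W_o @ h))
    return predictions

# Example sentence as word ids: <s>=0, the=1, house=2, is=3, big=4, .=5
preds = rnn_lm([0, 1, 2, 3, 4, 5])
```

Training would adjust E, W_h, W_x, and W_o so that each distribution puts high probability on the actual next word.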


Encoder Decoder Model

• We predicted the words of a sentence

• Why not also predict their translations?


Encoder Decoder Model

[Diagram: two recurrent networks chained: the encoder reads "the house is big . </s>", then the decoder predicts "das Haus ist groß . </s>" word by word]

• Obviously madness

• Proposed by Google (Sutskever et al. 2014)


Attention Mechanism

• What is missing?

• Alignment of source words to target words

• Solution: attention mechanism
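The attention mechanism can be sketched in a few lines. The 2016-era model in this deck uses an additive (MLP-based) scorer; the version below uses a simpler dot-product score, which illustrates the same idea: score each input position against the decoder state, normalize, and take a weighted sum. All tensors are random toy values.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention(decoder_state, encoder_states):
    """Dot-product attention sketch: score each input position against the
    current decoder state, normalize the scores into weights (the 'soft
    alignment'), and return the weighted sum of encoder states."""
    scores = encoder_states @ decoder_state  # one score per input word
    weights = softmax(scores)                # alignment distribution over inputs
    context = weights @ encoder_states       # weighted sum = input context
    return context, weights

rng = np.random.default_rng(3)
enc = rng.normal(size=(4, 8))  # 4 input words, hidden size 8
dec = rng.normal(size=8)       # current decoder hidden state
context, weights = attention(dec, enc)
```

The weights play the role of a word alignment: a high weight on input position i means the next output word is mostly translated from word i.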

Neural Machine Translation, 2016

[Architecture diagram: input words → input word embeddings → left-to-right and right-to-left recurrent NNs → alignment → input context → hidden state → output words]

Neural Machine Translation, 2016

[Same architecture diagram as on the previous slide]

• State of the art

• Used by Google, WIPO, Systran, Omniscien…

Input Sentence

[Architecture diagram with the input sentence highlighted]

Encode with Word Embeddings

[Architecture diagram with the input word embeddings highlighted]

Output Sentence

[Architecture diagram with the output words highlighted]

Each Word Predicted by Embedding

[Architecture diagram with the hidden state → output word step highlighted]

Embedding Predicted from Input Context

[Architecture diagram with the input context → hidden state step highlighted]

Input Context Selected By Word Alignment

[Architecture diagram with the alignment step highlighted]

Input Context: Weighted Sum of Input Embeddings

[Architecture diagram with the weighted sum into the input context highlighted]


Benefits

• Each output predicted from:
  • encoding of the full input sentence
  • all previously produced output words

• Word embeddings allow generalization:
  • "cat" and "cats" have similar representations
  • "house" and "home" have similar representations
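The generalization point can be illustrated by comparing embeddings with cosine similarity. The 4-dimensional vectors below are hand-picked hypothetical values (real embeddings are learned, not set by hand), chosen so that related words lie close together.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: 1.0 for parallel vectors, ~0 for unrelated ones."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings for illustration only
emb = {
    "cat":   np.array([0.9, 0.1, 0.0, 0.2]),
    "cats":  np.array([0.8, 0.2, 0.1, 0.2]),
    "house": np.array([0.0, 0.9, 0.8, 0.1]),
    "home":  np.array([0.1, 0.8, 0.9, 0.1]),
}

cat_cats = cosine(emb["cat"], emb["cats"])
cat_house = cosine(emb["cat"], emb["house"])
house_home = cosine(emb["house"], emb["home"])
```

Because "cat" and "cats" share most of their vector, evidence seen for one carries over to the other, which is exactly the generalization the slide describes.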


WMT 2016 Evaluation (News, English-German)

[Chart: evaluation results, Neural MT vs. Statistical MT]


Challenges


Benefits of Neural Machine Translation

• Evidence of overall better translation quality

• Ability to generalize better from training data

• Better handling of sentence-level context

• Better fluency


Neural Machine Translation is Data-Hungry

[Chart: BLEU score (0 to 30) vs. corpus size (roughly 100 thousand to 1 billion words) for phrase-based SMT, phrase-based SMT with a big language model, and neural MT]


Neural Machine Translation Failures


Adequacy or Fluency?

• Language model may take over

• Output unrelated to input


Fluency vs. Adequacy Errors

• Input

Ich will Kuchen essen ("I want to eat cake")

• Fluency error (more common in SMT)

I want cake eat

• Adequacy error (more common in NMT)

I want to cook chicken


Limited Vocabulary

• Words are encoded as high-dimensional vectors

• Only allows for a limited vocabulary size:
  • words are split into subwords
  • maybe even split into characters?
  • fall-back to dictionaries / phrase-based models
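A sketch of how subword splitting keeps the vocabulary small: segment unknown words greedily into known pieces. This is a simplification of BPE-style segmentation, and the subword vocabulary below is hypothetical.

```python
def split_to_subwords(word, vocab):
    """Greedy longest-match segmentation of a word into known subwords.
    Falls back to single characters for unknown material."""
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):  # try the longest match first
            piece = word[start:end]
            if piece in vocab or end - start == 1:
                pieces.append(piece)
                start = end
                break
    return pieces

# Hypothetical subword vocabulary (in practice learned from training data)
subword_vocab = {"trans", "lation", "un", "break", "able", "house"}
```

With a few tens of thousands of subwords, any word can be represented, so the network's softmax never needs the full open vocabulary.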


NMT More Susceptible to Noisy Training Data

• More harmed by:
  • alignment errors
  • bad language
  • wrong language on the target side

• Severely harmed by un-translated source text (over-learns to copy)

• Data cleaning more important
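The cleaning step can be sketched as a few simple filters over segment pairs. This is only a minimal illustration of the idea; production pipelines add language identification, deduplication, and more, and the thresholds here are arbitrary.

```python
def clean_parallel_corpus(pairs, max_ratio=2.0):
    """Filter obviously noisy segment pairs before NMT training:
    drop empty segments, un-translated copies of the source (which NMT
    over-learns to copy), and pairs with extreme length ratios (often
    alignment errors)."""
    kept = []
    for src, tgt in pairs:
        if not src.strip() or not tgt.strip():
            continue  # empty side
        if src.strip() == tgt.strip():
            continue  # un-translated copy of the source
        ls, lt = len(src.split()), len(tgt.split())
        if max(ls, lt) / min(ls, lt) > max_ratio:
            continue  # suspicious length ratio: likely misaligned
        kept.append((src, tgt))
    return kept

pairs = [
    ("das Haus ist groß", "the house is big"),         # good pair
    ("das Haus ist groß", "das Haus ist groß"),        # copied source
    ("ja", "yes this is a very long unrelated text"),  # length mismatch
    ("", "empty source"),                              # empty side
]
cleaned = clean_parallel_corpus(pairs)
```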


NMT is Worse Out-of-Domain

• In nearly all cases, SMT was better than NMT when content was out of domain.

• More data is required for NMT to meet domain-specific needs.

• When sufficient data is available, NMT will usually be better than SMT for typical sentences.


Deployment Challenges for Neural MT

• Speed:
  • training takes weeks
  • decoding slower than traditional SMT

• Hardware requirements:
  • GPUs needed ($2,000 each)
  • Google even has specialized hardware

• Process is not transparent:
  • practically impossible to find out "why wrong?"
  • mistakes cannot be easily fixed


Neural Machine Translation – A Mystery?

• Decisions of statistical systems are often hard to understand

• Neural systems: even harder

input → MAGIC → output

• New studies reveal inner workings:
  • attention mechanism
  • word sense disambiguation


Attention States

• Attention mechanism plays role of “word alignment”

• “Soft alignment”: distributed over several input words


Word Sense Disambiguation

[Visualization: deep embedding of the word "right" in the encoder]


NMT vs SMT: What We Know By Now

• In ideal conditions, NMT much better

• Different types of error (fluency vs. adequacy)

• NMT more susceptible to noise

• NMT less robust (out-of-domain, low-resource, etc.)

=> Hybrid approach of Omniscien Technologies


Looking Forward


Attention Sequence-to-Sequence Model

• Based on recurrent neural networks

• Attention mechanism (alignment)

• Standard Approach 2015-2017


Deeper Models

• More layers in encoder and decoder

• Models more complex relationships between words

• Significantly higher performance


Google’s “Transformer” Model

• Self-attention

• Encoder: Input words inform each other

• Decoder: Attention on some previous output words


Facebook’s Convolutional Model

• Hierarchical (“convolutions”) instead of sequential

• Faster (but more limited context)

• In encoder and decoder


Synthesizing Data

• Neural machine translation trained on parallel data

• Improve with monolingual data:
  • back-translate target-language text into the source language
  • add as training data

• Can be iterated (“dual learning”)
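The back-translation step above can be sketched as a small pipeline. The `toy_reverse` stub below stands in for a trained target-to-source MT system and is entirely hypothetical; the point is the data flow, not the translation quality.

```python
def back_translate(monolingual_target, reverse_translate):
    """Synthesize training data: translate target-language sentences back
    into the source language and pair them up. The source side is
    machine-translated (hence noisy), but the target side is clean text."""
    synthetic = []
    for tgt in monolingual_target:
        src = reverse_translate(tgt)
        synthetic.append((src, tgt))
    return synthetic

# Hypothetical stub: a toy word-by-word "reverse translator" for illustration
toy_lexicon = {"the": "das", "house": "Haus", "is": "ist", "big": "groß"}
def toy_reverse(sentence):
    return " ".join(toy_lexicon.get(w, w) for w in sentence.split())

extra_pairs = back_translate(["the house is big"], toy_reverse)
```

The synthetic pairs are then mixed into the parallel training data; iterating the process in both directions is the "dual learning" idea the slide mentions.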


Domain Adapted Models

• Various techniques explored for customization

• One simple, effective method:
  • train a general system on all available data
  • fine-tune on in-domain data
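The two-stage recipe can be sketched with a deliberately tiny model: train on "general" data first, then continue training the same parameters on in-domain data. A one-parameter least-squares model stands in for the NMT system here; the data and learning rate are made up for illustration.

```python
def sgd_fit(w, data, lr=0.05, epochs=100):
    """Fit y ~ w*x by stochastic gradient descent on squared error,
    starting from (and updating) the given parameter w."""
    for _ in range(epochs):
        for x, y in data:
            w -= lr * 2 * (w * x - y) * x  # gradient of (w*x - y)^2
    return w

# Stage 1: "general" data, underlying slope 1.0
general_data = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
# Stage 2: "in-domain" data, underlying slope 2.0
in_domain_data = [(1.0, 2.0), (2.0, 4.0)]

w_general = sgd_fit(0.0, general_data)           # train general system
w_tuned = sgd_fit(w_general, in_domain_data)     # fine-tune = keep training
```

Fine-tuning is literally just continued training from the general model's parameters, which is why it is simple to deploy.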


Terminology

• Terminology, brand names with fixed translations

Der neue Neurolierer XVQ-72 ist lieferbar.

Neurolizer XVQ-72

• XML markup

Der neue <x translation="Neurolizer XVQ-72">Neurolierer XVQ-72</x> ist lieferbar.

• Use attention states to detect insertion point
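The XML markup scheme on this slide can be sketched as a pre/post-processing pair: wrap known terms in the source, then substitute the fixed translations in the output. This is a simplified illustration; as the slide notes, a real system would use the attention states to place the term at the right position in the output.

```python
import re

# Terminology entry from the slide: source term -> fixed target translation
terminology = {"Neurolierer XVQ-72": "Neurolizer XVQ-72"}

def mark_terms(source, terms):
    """Wrap each known term in XML markup carrying its fixed translation."""
    for src_term, tgt_term in terms.items():
        source = source.replace(
            src_term, f'<x translation="{tgt_term}">{src_term}</x>')
    return source

def apply_markup(text):
    """Replace each marked span by its fixed translation."""
    return re.sub(r'<x translation="([^"]*)">.*?</x>', r"\1", text)

marked = mark_terms("Der neue Neurolierer XVQ-72 ist lieferbar.", terminology)
```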


Dynamic Software Environment

• Major players have released deep learning frameworks:
  • TensorFlow (Google)
  • PyTorch (Facebook)
  • MXNet (Amazon)

• The Theano framework has been discontinued

• Also: dedicated NMT implementations (faster)

• Quick turn-around from research into deployment


Hardware Developments

New GPUs from NVIDIA in 2018

• Faster, more memory

• Enable deeper models


Facebook.com/omniscien @omniscientech Omniscien Technologies [email protected]

Research in Translation –What is Exciting and Shows Promise Ahead?

Philipp Koehn
