Neural Machine Translation: A Machine Learning Perspective Tie-Yan Liu Principal Researcher, Microsoft Research IEEE Fellow, ACM Distinguished Member


Page 1: Neural Machine Translation: A Machine Learning Perspective (tcci.ccf.org.cn/summit/2017/dlinfo/06.pdf)

Neural Machine Translation: A Machine Learning Perspective

Tie-Yan Liu

Principal Researcher, Microsoft Research

IEEE Fellow, ACM Distinguished Member

Page 2:

Neural Machine Translation

7/25/2017 Tie-Yan Liu @ Microsoft Research Asia 2

Page 3:

Neural Machine Translation

Encoder: from the input word sequence to an intermediate context

Decoder: from the intermediate context to a distribution over output word sequences

Various choices for implementing the encoder or decoder:
• FNN
• CNN
• RNN
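The encoder-decoder split above can be sketched in a few lines. This is a toy illustration only: the mean-pooling "encoder", the dot-product softmax "decoder" step, and all embedding values below are made-up assumptions, not any particular NMT architecture.

```python
import math

def encode(embeddings):
    """A minimal 'encoder': mean-pool the input word embeddings into a context."""
    dim = len(embeddings[0])
    return [sum(e[d] for e in embeddings) / len(embeddings) for d in range(dim)]

def decode_step(context, output_embeddings):
    """A minimal 'decoder' step: softmax over dot products with the context,
    giving a distribution over output words."""
    scores = [sum(c * w for c, w in zip(context, emb)) for emb in output_embeddings]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

src = [[1.0, 0.0], [0.0, 1.0]]       # two toy input word embeddings
vocab = [[1.0, 1.0], [1.0, -1.0]]    # two toy output word embeddings
ctx = encode(src)                    # intermediate context
probs = decode_step(ctx, vocab)      # distribution over output words
```

Any of FNN, CNN, or RNN can replace the pooling step here; only the encoder-to-context and context-to-distribution interfaces matter.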

Page 4:

Neural Machine Translation

• Example: RNN-based implementation

• Attention mechanism
  • Using a personalized context vector c_t = Σ_{j=1}^{T_x} α_{tj} h_j, where α_{tj} is the importance of x_j to y_t

(Bengio, ICLR 2015)
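The context-vector formula c_t = Σ_j α_{tj} h_j can be sketched directly. The dot-product scoring used to produce the weights α_{tj} below is one common choice and an assumption here; the slide does not fix the scoring function.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def context_vector(decoder_state, encoder_states):
    """Compute c_t = sum_j alpha_tj * h_j, with alpha_tj a softmax over
    alignment scores between the decoder state and each encoder state h_j."""
    scores = [sum(s * h for s, h in zip(decoder_state, hj)) for hj in encoder_states]
    alphas = softmax(scores)   # alpha_tj: importance of x_j to y_t, sums to 1
    dim = len(encoder_states[0])
    c_t = [sum(a * hj[d] for a, hj in zip(alphas, encoder_states))
           for d in range(dim)]
    return c_t, alphas

h = [[1.0, 0.0], [0.0, 1.0]]   # encoder hidden states h_j (toy values)
s_t = [2.0, 0.0]               # decoder state at step t (toy value)
c_t, alpha = context_vector(s_t, h)
```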

Page 5:

Fast Development of NMT - GNMT

• RNN as encoder/decoder
  • Stacked LSTM-RNN (8 layers for encoder and decoder respectively)
  • Each layer is trained on a separate GPU for speed-up
  • Standard attention model
  • Residual connections for better gradient flow

• Significant improvement over shallow models
  • 39.92 vs. 31.3 (Bengio, ICLR 2015) BLEU on En-Fr

Page 6:

Fast Development of NMT – ConvS2S

• CNN as encoder/decoder
  • Convolutional block structure
  • Gated linear unit + residual connection
  • 15 layers for encoder and decoder respectively

• Multi-step attention
  • Separate attention mechanism for each decoding layer

• Comparable to (slightly better than) RNN-based NMT models
  • 40.46 vs. 39.92 (GNMT) BLEU on En-Fr

Page 7:

Fast Development of NMT - Transformer

• FNN as encoder/decoder
  • 6 layers (each with two sub-layers) for encoder and decoder respectively

• Relying entirely on attention (including multi-head self-attention) to draw global dependencies between input and output

• Comparable to (slightly better than) RNN-based and CNN-based NMT models
  • 41.0 vs. 40.46 (ConvS2S) vs. 39.92 (GNMT) BLEU on En-Fr

Page 8:

Fast Development of NMT - Summary

| Algorithm | Framework | Model | #layers (encoder-decoder) | En→Fr (36M pairs) BLEU | En→Fr training cost | En→De (4.5M pairs) BLEU | En→De training cost |
|---|---|---|---|---|---|---|---|
| Bengio, ICLR 2015 | Theano (open source) | GRU-RNN | 1-1 | 31.3 | - | - | - |
| GNMT | TensorFlow (no code) | LSTM-RNN | 8-8 | 39.92 | 96 K80, 6 days | 24.6 | - |
| Transformer | TensorFlow (open source) | FNN + attention | 12-12 | 41.0 | 8 P100, 4.5 days | 28.4 | 8 P100, 3.5 days |
| ConvS2S | Torch (open source) | CNN | 15-15 | 40.46 | 8 M40, 37 days | 25.16 | 1 M40, 18.5 days |

Page 9:

What’s Done?

• These works verified the strong representation power of deep neural networks
  • No matter FNN, CNN, or RNN, all can be used to fit bilingual training data and achieve good translation performance when sufficiently large training data are given.

• However, this is not surprising at all
  • It was already indicated by the universal approximation theorem, decades ago.

Page 10:

What’s Missing?

• Many unique challenges of machine translation have not been addressed
  • Relying on a huge amount of bilingual training data
  • Relying on myopic beam search during inference
  • Using likelihood maximization for both training and inference, which differs from the true evaluation measure (BLEU)
  • ...

Page 11:

Leveraging Reinforcement Learning to Tackle these Challenges

• Dual learning
  • Leveraging the symmetric structure of machine translation to enable effective learning from monolingual data through reinforcement learning.

• Predictive inference
  • Using end-to-end BLEU as a delayed reward to train value networks
  • Using value networks to guide forward-looking search along the decoding tree

Page 12:

Dual Learning for NMT (NIPS 2016, IJCAI 2017, ICML 2017)

Page 13:

Traditional Solutions to Insufficient Training Data

• Label propagation
• Transductive learning
• Multi-task learning
• Transfer learning

Page 14:

A New View: The Beauty of Symmetry

• Symmetry is almost everywhere in our world, and also in machine translation!


Hello! 你好!

Page 15:

Dual Learning

• A new learning framework that leverages the symmetric (primal-dual) structure of AI tasks to obtain effective feedback or regularization signals to enhance the learning process, especially when lacking labeled training data.

Page 16:

Dual Learning for Machine Translation

English sentence x → Chinese sentence y = f(x) → new English sentence x′ = g(y)

Feedback signals during the loop:
• R(x, x′; f, g): BLEU of x′ given x
• L(y; f), L(x′; g): likelihood and syntactic correctness of y and x′
• R(x, y; f), R(y, x′; g): dictionary-based translation correspondence, etc.

Primal task f: x → y (En→Ch translation); dual task g: y → x (Ch→En translation). In the loop, each translation model acts as an agent, with the other direction serving as its environment.

Policy gradient is used to improve both primal and dual models according to feedback signals

(NIPS 2016)
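The feedback signals in the loop can be sketched as follows. Everything here is a toy stand-in: `unigram_overlap` replaces the BLEU-based reconstruction reward R(x, x′; f, g), and `lm` replaces the language-model score L(y; f); a real system would plug in BLEU and trained models, and would feed this reward into policy-gradient updates.

```python
def unigram_overlap(reference, hypothesis):
    """Crude stand-in for BLEU: fraction of hypothesis words found in the reference."""
    ref = set(reference)
    return sum(1 for w in hypothesis if w in ref) / max(len(hypothesis), 1)

def loop_reward(x, x_prime, y, lm_score, alpha=0.5):
    """Combine the reconstruction reward on x vs. x' with the target-side
    language-model score on y, as in the dual-learning loop."""
    return alpha * unigram_overlap(x, x_prime) + (1 - alpha) * lm_score(y)

lm = lambda y: 1.0 if y else 0.0       # dummy language model (assumption)
x = "hello world".split()              # original English sentence
x_rec = "hello world".split()          # perfect round-trip reconstruction
r = loop_reward(x, x_rec, ["你好"], lm)
```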

Page 17:

Experimental Setting

• Baseline:
  • State-of-the-art NMT model, trained using 100% of the bilingual data
  • "Neural Machine Translation by Jointly Learning to Align and Translate", by Bengio's group (ICLR 2015)

• Our algorithm:
  • Step 1: Initialization
    • Start from a weak NMT model learned from only 10% of the training data
  • Step 2: Dual learning
    • Use the policy gradient algorithm to update the dual models based on monolingual data

[Figure: French→English BLEU for three systems: NMT with 10% bilingual data, dual learning with 10% bilingual data, and NMT with 100% bilingual data; relative differences of ↑0.3 and ↓5.0 are marked]

Starting from initial models obtained from only 10% of the bilingual data, dual learning can achieve accuracy similar to that of the NMT model learned from 100% of the bilingual data!

Page 18:

Probabilistic Nature

• The primal-dual structure implies strong probabilistic connections between the two tasks.

• This can also be used to improve supervised learning, and perhaps even inference
  • Structural regularizer to enhance supervised learning
  • Additional criterion to improve inference

P(x, y) = P(x) P(y|x; f)   (primal view)
        = P(y) P(x|y; g)   (dual view)

Page 19:

“Dual” Supervised Learning

Labeled data x → predicted label y = f(x) → reconstructed data x′ = g(y)

Feedback signal during the loop:
• R(x; f, g) = |P(x) P(y|x; f) − P(y) P(x|y; g)|: the gap between the joint probability P(x, y) computed in the two directions

Primal task f: x → y (agent: Bob); dual task g: y → x (agent: Alice).

Training objectives:
max log P(y|x; f)
max log P(x|y; g)
min |P(x) P(y|x; f) − P(y) P(x|y; g)|

(ICML 2017)
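A minimal sketch of the dual supervised learning objective: the usual negative log-likelihood terms plus a penalty on the duality gap |P(x)P(y|x; f) − P(y)P(x|y; g)|. The probabilities and the weight `lam` below are toy assumptions; real models would produce these quantities, and the slide's constrained formulation is folded into a penalty here.

```python
import math

def dsl_loss(p_x, p_y_given_x, p_y, p_x_given_y, lam=1.0):
    """Negative log-likelihoods of both directions plus a penalty on the
    gap between the two factorizations of the joint P(x, y)."""
    nll = -math.log(p_y_given_x) - math.log(p_x_given_y)
    duality_gap = abs(p_x * p_y_given_x - p_y * p_x_given_y)
    return nll + lam * duality_gap

# When the two factorizations of P(x, y) agree, the penalty vanishes.
consistent = dsl_loss(p_x=0.2, p_y_given_x=0.5, p_y=0.5, p_x_given_y=0.2)
# When they disagree, the penalty pushes the models back toward duality.
inconsistent = dsl_loss(p_x=0.2, p_y_given_x=0.5, p_y=0.5, p_x_given_y=0.4)
```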

Page 20:

Experimental Results

Theoretical analysis:
• Dual supervised learning generalizes better than standard supervised learning
• The hypothesis space is the product space of the two models satisfying probabilistic duality: P(x) P(y|x; f) = P(y) P(x|y; g)

BLEU improvement of dual supervised learning over NMT:
| En→Fr | Fr→En | En→De | De→En |
| ↑2.1 | ↑0.9 | ↑1.4 | ↑0.1 |

Page 21:

“Dual” Inference

Test data x → predicted label y = f(x) → reconstructed data x′ = g(y)

Primal task f: x → y (agent: Bob); dual task g: y → x (agent: Alice).

By Bayes' rule, P(y|x) = P(x|y) P(y) / P(x).

• Standard inference: choose the y that maximizes P(y|x; f)
• Dual inference: choose the y that maximizes both P(y|x; f) and P(y) P(x|y) / P(x)

Dual inference: leverage both the primal model and the dual model for testing

(IJCAI 2017)
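Dual inference can be sketched as rescoring each candidate y with a combination of the primal score P(y|x; f) and the dual-direction score P(y)P(x|y; g)/P(x). The linear interpolation weight `alpha` and all candidate probabilities below are toy assumptions.

```python
def dual_score(p_y_given_x, p_y, p_x_given_y, p_x, alpha=0.5):
    """Interpolate the primal model's score with the Bayes-rule score
    obtained from the dual model and the marginals."""
    return alpha * p_y_given_x + (1 - alpha) * (p_y * p_x_given_y / p_x)

# Two candidates: the primal model slightly prefers y1, but the dual
# model strongly prefers y2, flipping the final decision.
p_x = 0.1
y1 = dual_score(p_y_given_x=0.55, p_y=0.2, p_x_given_y=0.05, p_x=p_x)
y2 = dual_score(p_y_given_x=0.45, p_y=0.2, p_x_given_y=0.30, p_x=p_x)
```

With `alpha=1.0` this reduces to standard inference; smaller values give the dual model more say.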

Page 22:

Experimental Results

Theoretical analysis:
• Dual inference has a generalization guarantee, although training and inference become slightly inconsistent.
• The generalization bound for dual inference is comparable to that of standard inference.

BLEU improvement of dual inference over NMT:
| En→Fr | Fr→En | En→De | De→En |
| ↑0.5 | ↑0.4 | ↑1.2 | ↑0.5 |

Page 23:

Inference with Predicted Reward

Page 24:

Standard Inference Process in NMT

• Beam search + likelihood maximization

[Figure: an RNN encoder (embedding + LSTM/GRU over "I love China") feeding a decoder that proposes candidate target words such as 喜欢 ("like/love") and 中国 ("China") at each step]

At each step, select the top-𝑘 words with the largest translation probability
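The top-k selection described above can be sketched as a small beam search. The `step_probs` function is a toy assumption standing in for the NMT decoder's conditional word distribution.

```python
import math

def beam_search(step_probs, vocab, steps, k):
    """At each step, extend every partial hypothesis with every word and
    keep only the top-k hypotheses by accumulated log-probability."""
    beams = [([], 0.0)]                       # (partial sentence, log-prob)
    for _ in range(steps):
        candidates = []
        for seq, lp in beams:
            probs = step_probs(seq)
            for w in vocab:
                candidates.append((seq + [w], lp + math.log(probs[w])))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams

def step_probs(seq):
    # toy conditional distribution: prefer 'b' after 'a', otherwise prefer 'a'
    if seq and seq[-1] == "a":
        return {"a": 0.1, "b": 0.9}
    return {"a": 0.8, "b": 0.2}

best = beam_search(step_probs, ["a", "b"], steps=2, k=2)[0][0]
```

Note the myopia: only the k locally best prefixes survive each step, which is exactly the limitation the following slides address.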

Page 27:

Inference Errors

• Likelihood ≠ BLEU!

• Myopic local search ≠ global optimum

[Figure: the same encoder-decoder beam-search example as above, with the candidate 他 ("he") among the decoder outputs]

We need a method to predict long-term rewards (e.g., BLEU) during inference

Page 28:

Inspired by AlphaGo

[Figure: in AlphaGo, a CNN implements both a policy network and a value network; analogously, the attention-based NMT model over h_1..h_Tx, contexts c_t, and decoder states r_t outputs word probabilities (the policy network), while an added value network with the same inputs predicts the future BLEU]

Page 29:

Value Networks to Predict Long-term Reward

• Value function in NMT
  • The value function v_π(x, y_<t) estimates the (delayed) BLEU score of the final translation for source sentence x, if we continue to decode from the partially decoded sentence y_<t according to NMT model π.

• Information for estimating the long-term BLEU score v_π(x, y_<t)
  • Semantic correlation between the source x and the partially decoded target y_<t
  • Effectiveness/coverage of the attention mechanism between encoder and decoder

Page 30:

Design of Value Networks

[Figure: value network architecture built on the NMT encoder-decoder, with a Semantic Matching (SM) module producing u_sm and a Context-Coverage (CC) module producing u_cc, both via mean pooling]

• Semantic Matching (SM) module: computes the semantic correlation between the source and target sentences, based on the mean pooling of the encoder/decoder hidden states
• Context-Coverage (CC) module: computes the effectiveness/coverage of the attention mechanism, based on the mean pooling of the encoder hidden states and the contexts
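The inputs to the SM and CC modules described above can be sketched with mean pooling. The vector values and dimensions are toy assumptions, and the small feed-forward scorer that would consume u_sm and u_cc is omitted.

```python
def mean_pool(vectors):
    """Mean-pool a list of equal-length vectors into one vector."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def sm_module(encoder_states, decoder_states):
    """Semantic Matching input: pooled encoder and decoder hidden states."""
    return mean_pool(encoder_states) + mean_pool(decoder_states)

def cc_module(encoder_states, contexts):
    """Context-Coverage input: pooled encoder states and attention contexts."""
    return mean_pool(encoder_states) + mean_pool(contexts)

h = [[1.0, 3.0], [3.0, 1.0]]   # encoder hidden states h_1..h_Tx (toy)
r = [[2.0, 2.0]]               # decoder hidden states r_1..r_t (toy)
c = [[0.0, 4.0]]               # attention context vectors c_1..c_t (toy)
u_sm = sm_module(h, r)
u_cc = cc_module(h, c)
```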

Page 31:

Training of Value Networks

• Training data
  • Generated by a Monte-Carlo method:
    1. Randomly pick a source sentence x from the original training dataset.
    2. Generate y_<t for x using π, with randomly selected t.
    3. Generate K complete translations for each y_<t using π.
    4. Compute the BLEU score of each translation against the ground-truth sentence y.
    5. Use the averaged BLEU score as the labelled value for v_π(x, y_<t).
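The Monte-Carlo recipe above can be sketched as follows. The rollout policy and the unigram-overlap scorer are toy stand-ins (assumptions) for the NMT model π and BLEU; only the truncate/rollout/average structure mirrors the slide.

```python
import random

def score(reference, hypothesis):
    """Crude stand-in for BLEU: fraction of hypothesis words in the reference."""
    ref = set(reference)
    return sum(1 for w in hypothesis if w in ref) / max(len(hypothesis), 1)

def make_value_target(x, y_truth, rollout, K, rng):
    """Truncate a target prefix y_<t at a random t, roll out K complete
    translations, and average their scores as the label for v_pi(x, y_<t)."""
    t = rng.randrange(1, len(y_truth))          # randomly selected t
    prefix = y_truth[:t]                        # partial translation y_<t
    rollouts = [rollout(x, prefix, rng) for _ in range(K)]
    avg = sum(score(y_truth, y) for y in rollouts) / K
    return prefix, avg

def rollout(x, prefix, rng):
    # toy policy: complete the sentence with random words from a tiny vocab
    return prefix + [rng.choice(["好", "的"]) for _ in range(2)]

rng = random.Random(0)
prefix, target = make_value_target(["I", "love", "China"],
                                   ["我", "爱", "中国"], rollout, K=8, rng=rng)
```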

Page 32:

Training of Value Networks

• Pairwise ranking loss minimization
  • For two partial sentences y_{p,1} and y_{p,2} for each x, where y_{p,1} has a larger BLEU score than y_{p,2}, we define a loss function that penalizes ranking y_{p,1} below y_{p,2}.

• Why not directly optimize the BLEU score?
  • The BLEU score is computed according to n-gram precision (n = 1, 2, 3, 4): regression of BLEU scores is more sensitive than pairwise classification (which just needs to differentiate good and bad candidates).
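A pairwise ranking loss of the kind described above can be sketched with a logistic form. The slide does not spell out the exact loss, so the logistic choice here is an assumption; it is small when the value network scores the higher-BLEU sentence higher, and large otherwise.

```python
import math

def pairwise_ranking_loss(v_better, v_worse):
    """-log sigmoid(v_better - v_worse): penalizes scoring the
    higher-BLEU partial sentence below the lower-BLEU one."""
    return math.log(1.0 + math.exp(-(v_better - v_worse)))

good = pairwise_ranking_loss(v_better=0.8, v_worse=0.2)   # correct ranking
bad = pairwise_ranking_loss(v_better=0.2, v_worse=0.8)    # wrong ranking
```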

Page 33:

Experimental Results

• Baselines
  • BS: standard beam search
  • BSO: beam search guided by the predicted instant (but not delayed) BLEU score of the partial decoding result

• Observations
  • Our proposed approach is consistently better than the baselines

Page 34:

More Challenges to NMT

Black magics in algorithm tuning

High computational load

Latent semantics in texts

Greedy one-pass decoding

Decreased diversity in tech roadmap

Insufficient emphasis on teaching

Little attention on the beauty of translation

Page 35:

Black Magics in Algorithm Tuning

• Hyper-parameter tuning
  • Structure: #layers, #nodes, activation types, skip connections, attention, ...
  • Learning rate, momentum, initialization, dropout, batch normalization, ...

• Unreliable results
  • Many published results cannot be readily reproduced
  • Good empirical performance might result from "overfitting" the test data (especially considering that test sets for NMT are too small to be statistically robust)

Page 36:

High Computational Load

• Current NMT models are clumsy, requiring long training times and huge computational power
  • GNMT: 96 GPUs for one week (WMT En→Fr)
  • ConvS2S: 8 GPUs for 5+ weeks (WMT En→Fr)

• Lightweight NMT is desirable
  • Training an NMT model in one hour!

Page 37:

Latent Semantics in Texts

• Whole ≠ sum of its parts
  • 杀鸡取卵 ("kill the hen to get the eggs") ≠ 杀 ("kill") + 鸡 ("hen") + 取 ("take") + 卵 ("egg")
  • Other examples: 春风化雨, 登堂入室, 饮鸩止渴, ... (idioms whose meanings are not the sum of their characters)

• Almost mission-impossible for low-frequency phrases (not enough context) if we use statistical learning only.

• Semantic NMT is sorely needed
  • Combination with linguistics-based methods or external dictionaries

Page 38:

Greedy One-pass Decoding

• One-pass sequential inference can hardly be optimal
  • 僧推月下门 vs. 僧敲月下门 ("the monk pushes the door under the moon" vs. "the monk knocks on the door under the moon", the classic example of deliberating over a single word)
  • 红楼梦 (Dream of the Red Chamber): revised over ten years with five rounds of additions and deletions ("every word is blood; ten years were no ordinary toil")

• Multi-round refinement in the decoder would be better

Page 39:

Decreased Diversity in Tech Roadmap

• Deep neural networks are crowding out other types of learning algorithms
  • Google's MultiModel is used to address different workloads
  • Even for small-sample problems, people tend to consider DNNs as their first choice
  • Other technologies such as SVMs and Bayesian networks are gradually ignored

• Richness and diversity are necessary to ensure the healthy development of science and technology
  • Investigation of non-DNN algorithms for NMT

Page 40:

Insufficient Emphasis on Teaching

• Learning vs. Teaching
  • "没有教不好的学生,只有不会教的老师" ("There are no students who cannot be taught well, only teachers who do not know how to teach")
  • "因材施教、教学相长" ("Teach according to the student's aptitude; teaching and learning advance each other")

• Today, almost all efforts are put into "how to learn", but little into "how to teach"!

Page 41:

Little Attention on the Beauty of Translation

• The three difficulties of translation: 信 (faithfulness), 达 (expressiveness), 雅 (elegance)
  • 信 (faithfulness): the translation does not deviate from the source; it is accurate, with no distortion, omission, or arbitrary addition or removal of meaning
  • 达 (expressiveness): not confined to the form of the source text, the translation is fluent and clear
  • 雅 (elegance): the wording is apt, pursuing classical grace, concision, and elegance

• How to appropriately model 雅 (elegance) in the training process?
  • Culture, customs, allusions, parallel structure, artistry, novelty, ...

Page 42:

[email protected]

http://research.microsoft.com/users/tyliu/
