
Fundamentals of Machine Learning for Machine Translation

Dr. John D. Kelleher
ADAPT Centre for Digital Content Technology [1]
Dublin Institute of Technology, Ireland

Translating Europe Forum 2016
Brussels, Charlemagne Building, Rue de la Loi 170
27th October 2016

[1] The ADAPT Centre is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund.


Outline

Basic Building Blocks: Neurons

What is a Neural Network?

Word Embeddings

Recurrent Neural Networks

Encoders

Decoders (Language Models)

Neural Machine Translation

Conclusions


What is a function?

A function maps a set of inputs (numbers) to an output (number), e.g.:

sum(2, 5, 4) → 11
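As a concrete, purely illustrative sketch, this sum function is easy to write in Python (the name sum_fn is ours, chosen to avoid shadowing Python's built-in sum):

```python
def sum_fn(*numbers):
    """Map a set of input numbers to a single output number."""
    total = 0
    for n in numbers:
        total += n
    return total

print(sum_fn(2, 5, 4))  # -> 11
```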


What is a weightedSum function?

weightedSum([n1, n2, ..., nm], [w1, w2, ..., wm])
    = (n1 × w1) + (n2 × w2) + ... + (nm × wm)

where [n1, ..., nm] are the input numbers and [w1, ..., wm] are the weights. For example:

weightedSum([3, 9], [−3, 1])
    = (3 × −3) + (9 × 1)
    = −9 + 9
    = 0
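A minimal Python sketch of weightedSum (the function name follows the slide; the implementation itself is ours):

```python
def weighted_sum(numbers, weights):
    """Multiply each input number by its weight and add up the results."""
    return sum(n * w for n, w in zip(numbers, weights))

print(weighted_sum([3, 9], [-3, 1]))  # -> 0
```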


What is an activation function?

An activation function takes the output of our weightedSum function and applies another mapping to it.

[Figure: the logistic function, logistic(x), plotted for x from −10 to 10: an S-shaped curve that maps every input onto the range 0.00 to 1.00]


What is an activation function?

activation = logistic(weightedSum([n1, n2, ..., nm], [w1, w2, ..., wm]))

For example:

logistic(weightedSum([3, 9], [−3, 1]))
    = logistic((3 × −3) + (9 × 1))
    = logistic(−9 + 9)
    = logistic(0)
    = 0.5


What is a Neuron?

The simple list of operations that we have just described defines the fundamental building block of a neural network: the Neuron.

Neuron = activation(weightedSum([n1, n2, ..., nm], [w1, w2, ..., wm]))

where [n1, ..., nm] are the input numbers and [w1, ..., wm] are the weights.
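A minimal Python sketch of such a neuron, wiring together the weightedSum and logistic functions defined above (the names follow the slides; the code is illustrative):

```python
import math

def weighted_sum(numbers, weights):
    return sum(n * w for n, w in zip(numbers, weights))

def logistic(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def neuron(inputs, weights):
    """A neuron: a weighted sum of the inputs, passed through an activation."""
    return logistic(weighted_sum(inputs, weights))

print(neuron([3, 9], [-3, 1]))  # -> 0.5
```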


What is a Neural Network?

[Figure: a feed-forward neural network with an input layer (I1, I2), a hidden layer (H1, H2, H3), and an output layer (O1, O2)]
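A hedged sketch of how this 2-3-2 network computes its outputs, treating every unit as the neuron defined above; the weight values are invented purely for illustration:

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weight_rows):
    """Compute one layer: each row of weights defines one neuron."""
    return [logistic(sum(n * w for n, w in zip(inputs, row)))
            for row in weight_rows]

# Invented weights: three hidden neurons with two inputs each,
# then two output neurons with three inputs each.
hidden_weights = [[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4]]
output_weights = [[1.0, -1.0, 0.5], [0.2, 0.7, -0.6]]

inputs = [3.0, 9.0]
hidden = layer(inputs, hidden_weights)   # H1, H2, H3
outputs = layer(hidden, output_weights)  # O1, O2
print(outputs)
```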


Where do the weights come from?

[Figure: the same network with a weight on every connection: w1–w6 on the edges from the input layer (I1, I2) to the hidden layer (H1, H2, H3), and w7–w12 on the edges from the hidden layer to the output layer (O1, O2)]


Training a Neural Network

- We train a neural network by iteratively updating the weights.

- We start by randomly assigning weights to each edge.

- We then show the network examples of inputs and expected outputs, and update the weights using backpropagation so that the network's outputs match the expected outputs (see the sketch below).

- We keep updating the weights until the network is working the way we want.
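A minimal sketch of that training loop in Python, for a single neuron and a single training example. Real backpropagation computes exact gradients through every layer; here, purely for illustration, each weight is nudged using a finite-difference estimate of the error gradient (all names and values are our own assumptions):

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def neuron(inputs, weights):
    return logistic(sum(n * w for n, w in zip(inputs, weights)))

def error(weights, inputs, expected):
    """Squared difference between the network's output and the target."""
    return (neuron(inputs, weights) - expected) ** 2

inputs, expected = [3.0, 9.0], 0.9  # one example: inputs and expected output
weights = [0.1, -0.1]               # start with (pseudo-)random weights
rate, eps = 0.5, 1e-6               # learning rate, finite-difference step

for step in range(1000):
    for i in range(len(weights)):
        bumped = list(weights)
        bumped[i] += eps
        grad = (error(bumped, inputs, expected)
                - error(weights, inputs, expected)) / eps
        weights[i] -= rate * grad   # move this weight downhill

print(neuron(inputs, weights))      # close to 0.9 after training
```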


Word Embeddings

Each word is represented by a vector of numbers that positions the word in a multi-dimensional space, e.g.:

king  = <55, −10, 176, 27>

man   = <10, 79, 150, 83>

woman = <15, 74, 159, 106>

queen = <60, −15, 185, 50>


Word Embeddings

vec(King) − vec(Man) + vec(Woman) ≈ vec(Queen) [2]

[2] Linguistic Regularities in Continuous Space Word Representations (Mikolov et al., 2013)
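Checking the analogy with the toy vectors from the previous slide (these four vectors are small enough to verify by hand; real embeddings have hundreds of dimensions and the match is only approximate):

```python
king  = [55, -10, 176, 27]
man   = [10,  79, 150, 83]
woman = [15,  74, 159, 106]
queen = [60, -15, 185, 50]

# king - man + woman, computed component by component
result = [k - m + w for k, m, w in zip(king, man, woman)]
print(result)           # [60, -15, 185, 50]
print(result == queen)  # True for these toy vectors
```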


Recurrent Neural Networks

A particular type of neural network that is useful for processing sequential data (such as language) is a Recurrent Neural Network.

Using an RNN we process our sequential data one input at a time.

In an RNN the outputs of some of the neurons for one input are fed back into the network as part of the next input.
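A hedged Python sketch of a single RNN step: the hidden activations from the previous input (the memory) are combined with the current input to produce the new hidden state. The weights and sizes are invented for illustration:

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def rnn_step(x, h_prev, w_input, w_memory):
    """One time step: mix the current input x with the previous
    hidden state h_prev to produce the new hidden state."""
    return [logistic(sum(xi * wi for xi, wi in zip(x, w_in_row))
                     + sum(hi * wm for hi, wm in zip(h_prev, w_mem_row)))
            for w_in_row, w_mem_row in zip(w_input, w_memory)]

# Invented weights for a 1-dimensional input and 3 hidden units.
w_input  = [[0.5], [-0.3], [0.8]]
w_memory = [[0.1, 0.2, 0.0],
            [0.0, 0.4, -0.1],
            [0.3, 0.0, 0.2]]

h = [0.0, 0.0, 0.0]              # the memory buffer starts empty
for x in [[1.0], [2.0], [3.0]]:  # a sequence of three inputs
    h = rnn_step(x, h, w_input, w_memory)
print(h)
```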


Recurrent Neural Networks

[Figure: an RNN with an input layer (I1), a hidden layer (H1, H2, H3), a memory buffer (M1, M2, M3), and an output layer (O1); at each step the hidden-layer activations are copied into the memory buffer and fed back into the hidden layer alongside the next input]


Recurrent Neural Networks

[Figure: Recurrent Neural Network, drawn abstractly: the input xt and the previous hidden state ht−1 (the memory) feed into the hidden state ht, which produces the output yt]


Recurrent Neural Networks

[Figure: RNN Unrolled Through Time: inputs x1, x2, x3, ..., xt, xt+1 feed hidden states h1, h2, h3, ..., ht, ht+1, each of which produces an output y1, y2, y3, ..., yt, yt+1 and passes its activations on to the next step]


Recurrent Neural Networks

An RNN can be used in two different ways:

1. RNN Encoders

2. RNN Language Models


Encoders

[Figure: Using an RNN to Generate an Encoding of a Word Sequence: the input words Word1, Word2, ..., Wordm, <eos> are fed in one at a time, producing hidden states h1, h2, ..., hm; the final hidden state, C, is the encoding of the whole sequence]
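A sketch of that encoding loop in Python, reusing rnn_step, w_input and w_memory from the RNN sketch above; the toy embedding table (including the <sos> token used by the decoder sketch later) is our own invention:

```python
# Toy 1-dimensional word embeddings, invented for illustration.
EMBEDDINGS = {"Life": [0.2], "is": [0.5], "beautiful": [0.9],
              "<sos>": [0.1], "<eos>": [0.0]}

def encode(words):
    """Feed the sentence into the RNN one word at a time; the final
    hidden state is the encoding C of the whole word sequence."""
    h = [0.0, 0.0, 0.0]                    # empty memory buffer
    for word in words + ["<eos>"]:
        h = rnn_step(EMBEDDINGS[word], h, w_input, w_memory)
    return h                               # this is C

C = encode(["Life", "is", "beautiful"])
print(C)
```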


Language Models

[Figure: RNN Language Model Unrolled Through Time: at each step the input is the current word (Word1, Word2, Word3, ..., Wordt) and the output is the model's prediction of the next word (∗Word2, ∗Word3, ∗Word4, ..., ∗Wordt+1), computed via hidden states h1, h2, h3, ..., ht]


Decoder

[Figure: Using an RNN Language Model to Generate (Hallucinate) a Word Sequence: only the first word, Word1, is supplied; each predicted word (∗Word2, ∗Word3, ∗Word4, ..., ∗Wordt+1) is fed back in as the input for the next step]
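A sketch of that generation loop, reusing the definitions above. The next_word function is a toy stand-in: a real language model scores every vocabulary word from the hidden state (for example with a softmax) and picks the most probable one:

```python
def next_word(h):
    """Toy stand-in for the real prediction step."""
    return "beautiful" if sum(h) > 1.5 else "<eos>"

def decode(h, first_word="<sos>", max_len=10):
    """Generate a word sequence, feeding each predicted word back in
    as the next input, until <eos> is produced (or max_len is hit)."""
    word, output = first_word, []
    while word != "<eos>" and len(output) < max_len:
        h = rnn_step(EMBEDDINGS[word], h, w_input, w_memory)
        word = next_word(h)
        output.append(word)
    return output
```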


Encoder-Decoder Architecture

[Figure: Sequence to Sequence Translation using an Encoder-Decoder Architecture: the encoder reads Source1, Source2, ..., <eos> through hidden states h1, h2, ... into the encoding C; the decoder then generates Target1, Target2, ..., <eos> through its own hidden states d1, ..., dn]
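Putting the two sketches together, translation is then just encoding followed by decoding, seeded with the start token (the toy vocabulary is the same on both sides here, purely to exercise the pipeline):

```python
def translate(source_words):
    """Encode the source sentence, then decode a target sentence from C."""
    C = encode(source_words)
    return decode(C, first_word="<sos>")

print(translate(["Life", "is", "beautiful"]))
```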


Neural Machine Translation

[Figure: Example Translation using an Encoder-Decoder Architecture: the encoder reads the French source "belle est vie La <eos>" (the sentence "La vie est belle" presented in reverse order) through hidden states h1–h4 into the encoding C; the decoder then generates "Life is beautiful <eos>" through hidden states d1–d3]


Conclusions

- An advantage of the encoder-decoder architecture is that the system processes the entire input before it starts translating.

- This means that the decoder can use both what it has already generated and the entire source sentence when generating the next word in the translation.


Conclusions

- There is ongoing research on the best way to present the source sentence to the encoder.

- There is also ongoing research on giving the decoder the ability to attend to different parts of the input during translation.

- There is also interesting work on improving how these systems handle idiomatic language.


Thank you for your attention

[email protected]

@johndkelleher

www.comp.dit.ie/jkelleher

www.machinelearningbook.com

Acknowledgements: The ADAPT Centre is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund.