
Week 12: Memory networks

Instructor: Ruixuan Wang, [email protected]

School of Data and Computer Science, Sun Yat-Sen University

16 May, 2019


1 Memory network

2 Dynamic memory network

3 Key-value memory network

4 Recurrent entity network

5 Neural Turing machine


Question answering (QA)

Question answering: given a set (sequence) of facts and a question, predict the answer

Facts (sentences), question (sentence) and answer (word)

bAbI dataset: 20 toy tasks of different types

Table here from Weston et al., “Towards AI-complete question answering: a set of prerequisite toy tasks”, arXiv, 2015

Figures in the following slides from Sukhbaatar, Szlam, Weston, Fergus, “End-to-end memory networks”, arXiv, 2015


Memory network

Embed a collection of facts (i.e., sentences)

Compute an attention depending on a query (i.e., the question)

$p_i = \text{Softmax}(u^T m_i)$


The (intermediate) output is a weighted sum of the fact embeddings

$o = \sum_i p_i c_i$

Input embedding is different from output embedding


The output $o$ is then used as input to the classifier

The question embedding $u$ is also part of the input to the classifier

$\hat{a} = \text{Softmax}(W(o + u))$
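To make the single-hop computation concrete, below is a minimal NumPy sketch of the forward pass just described. The embedding matrices A, B, C, the classifier matrix W, and the bag-of-words fact/question vectors are all random stand-ins for illustration, not the paper's actual setup.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
V, d, n_facts = 50, 20, 4            # vocabulary size, embedding dim, number of facts

facts = rng.integers(0, 2, size=(n_facts, V)).astype(float)   # bag-of-words facts (stand-in)
question = rng.integers(0, 2, size=V).astype(float)           # bag-of-words question (stand-in)

A = rng.normal(0, 0.1, (V, d))       # input (memory) embedding
B = rng.normal(0, 0.1, (V, d))       # question embedding
C = rng.normal(0, 0.1, (V, d))       # output embedding
W = rng.normal(0, 0.1, (d, V))       # classifier over the answer vocabulary

m = facts @ A                        # memory vectors m_i
c = facts @ C                        # output vectors c_i
u = question @ B                     # question embedding u

p = softmax(m @ u)                   # p_i = Softmax(u^T m_i)
o = p @ c                            # o = sum_i p_i c_i
a_hat = softmax((o + u) @ W)         # a_hat = Softmax(W(o + u))
print(a_hat.argmax())                # index of the predicted answer word
```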


Model parameters: embeddings $A$, $B$, $C$; the dense output layer $W$

Trained by minimizing a cross-entropy loss between the prediction $\hat{a}$ and the true label $a$


Several ‘hops’ of ‘reasoning’ to produce the right answer

Embedding layers can be tied: $A^1 = A^2 = A^3$, $C^1 = C^2 = C^3$, making it similar to an RNN
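A similarly minimal sketch of the multi-hop case, with the embeddings tied across hops as on this slide; the extra linear map H that the paper adds under layer-wise (RNN-like) tying is omitted here for brevity, and all inputs are random stand-ins.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def hop(u, facts, A, C):
    """One memory hop: attend over the facts with query u, return the read-out o."""
    p = softmax((facts @ A) @ u)     # attention over facts
    return p @ (facts @ C)           # weighted sum of output embeddings

rng = np.random.default_rng(1)
V, d = 50, 20
facts = rng.integers(0, 2, size=(4, V)).astype(float)   # bag-of-words facts (stand-in)
u = rng.normal(0, 0.1, d)            # embedded question

A = rng.normal(0, 0.1, (V, d))       # tied across hops: A1 = A2 = A3
C = rng.normal(0, 0.1, (V, d))       # tied across hops: C1 = C2 = C3
for _ in range(3):                   # three 'hops' of reasoning
    u = u + hop(u, facts, A, C)      # query update between hops
W = rng.normal(0, 0.1, (d, V))
answer = softmax(u @ W).argmax()
```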


Trained end-to-end from (Facts+Questions) to (Answers)

The attended fact at each hop may change during inference


Limitation of the memory network

It does not take the temporal order of the sentences into account


Dynamic memory network

Multiple facts (sentences) are concatenated (with an ‘end-of-sentence’ token) into a single longer sequence

A GRU (RNN) generates a representation $S_i$ for each sentence

A 2nd GRU generates a representation $q$ for the question sentence

Figures here and in the following slides from Kumar et al., “Ask me anything: dynamic memory networks for natural language processing”, ICML, 2016


Attention (blue arrows and the values next to them) is computed by a 2-layer network

The episode $e_t$ is computed by a modified 3rd GRU, with the attention used as a gate to tune the GRU output

The episode at the final time step represents the attended (relevant) facts

A 4th GRU generates the episodic memory $m$ (with $e$ and $q$ as input)
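A bare NumPy sketch of this attention-gated update, assuming the sentence representations and the attention gates are already computed (they are random stand-ins below); it only illustrates how the gate g_t blends the modified GRU's output with the previous hidden state.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """Plain GRU cell (no biases) used inside the episodic memory module."""
    z = sigmoid(Wz @ x + Uz @ h)
    r = sigmoid(Wr @ x + Ur @ h)
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))
    return (1 - z) * h + z * h_tilde

rng = np.random.default_rng(0)
d, T = 16, 5                              # hidden size, number of sentences
params = [rng.normal(0, 0.1, (d, d)) for _ in range(6)]
S = rng.normal(0, 1, (T, d))              # sentence representations s_1..s_T (stand-ins)
g = rng.random(T)                         # attention gates from the 2-layer net (stand-ins)

h = np.zeros(d)
for t in range(T):
    # the attention gate g[t] decides how much the GRU output replaces h
    h = g[t] * gru_step(S[t], h, *params) + (1 - g[t]) * h
episode = h                               # e = h_T, a summary of the attended facts
```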


There are multiple ‘passes’ (iterations) over the episodic memory

In the 2nd pass, the attention takes the previous episodic memory $m^1$ as input

$m^1$ is also input to the 4th GRU, which produces the updated episodic memory $m^2$

$m$ should contain all the information needed to answer the question $q$

Note: only the higher attention values (blue arrows) are drawn


A 5th GRU generates the answer, with the episodic memory $m$ as its initial state

The last generated word and the question vector are concatenated as the input at each step

Trained with a cross-entropy loss between the prediction (a single word $y_t$ or a sequence $\{y_t\}$) and the correct word/sequence
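A sketch of the answer decoding loop under these assumptions: a plain NumPy GRU with random stand-in weights, greedy decoding, and a fixed number of steps; it shows only the data flow (m as initial state, [previous word; question] as input).

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    z = sigmoid(Wz @ x + Uz @ h)
    r = sigmoid(Wr @ x + Ur @ h)
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))
    return (1 - z) * h + z * h_tilde

rng = np.random.default_rng(0)
d, V = 16, 30                                # hidden size, answer vocabulary size
q = rng.normal(0, 1, d)                      # question vector (stand-in)
m = rng.normal(0, 1, d)                      # final episodic memory (stand-in)
Wa = rng.normal(0, 0.1, (V, d))              # output projection
params = [rng.normal(0, 0.1, (d, V + d)) if i % 2 == 0 else rng.normal(0, 0.1, (d, d))
          for i in range(6)]                 # GRU input/recurrent matrices

a = m                                        # decoder state initialised with m
y = np.zeros(V)                              # previous word (one-hot), starts empty
for _ in range(4):                           # generate up to 4 answer words (greedy)
    a = gru_step(np.concatenate([y, q]), a, *params)   # input = [previous word; question]
    probs = softmax(Wa @ a)                  # distribution over answer words
    y = np.eye(V)[probs.argmax()]            # feed the predicted word back in
```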


The DMN captures information about sentence position and temporal order

The entire DMN can be trained end-to-end


The episodic memory module iteratively retrieves and stores facts, gradually incorporating more relevant information


Limitations of dynamic memory network

The DMN requires that supporting facts (i.e., the facts that are relevant for answering a particular question) be labeled during training

The DMN did not work well on bAbI-10k without the supporting facts

Reason 1: when encoding a fact, the GRU cannot use future sentences

Reason 2: in the input module, the word-level GRU does not allow distant supporting sentences to interact with each other

Figure in the next 2 slides from Xiong, Merity, Socher, “Dynamic memory networks for visual and textual question answering”, ICML, 2016


DMN+: improved dynamic memory network

Two-level encoding of the facts (sentences)

The input fusion layer allows information to flow between sentences


The attention score is used inside the GRU at each time step (an attention-based GRU)

Memory update: a ReLU of a linear embedding, with parameters unique to each pass


Extending memory networks

There are other ways to extend memory networks!


Key-value memory network

Key-value memory: e.g., the key is a window of words in a document, and the value is the centre word of the window, or the title of the document, etc.

Key hashing pre-selects the subset of memories whose key shares at least one word with the question


$\Phi_X$, $\Phi_K$: predefined feature maps (sentence encodings); $A$: embedding matrix

Key addressing computes an attention over the keys depending on the question

$p_{h_i} = \text{Softmax}\!\left(A\Phi_X(x) \cdot A\Phi_K(k_{h_i})\right)$


Value reading outputs weighted sum of value embeddings

$o = \sum_i p_{h_i}\, A\Phi_V(v_{h_i})$
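Key addressing and value reading fit in a few NumPy lines; the feature maps Φ_X, Φ_K, Φ_V are replaced by random binary vectors here, which is purely an assumption for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
D, d, N = 100, 32, 6                    # feature dim of Phi_*, embedding dim, number of memories

A = rng.normal(0, 0.1, (d, D))          # shared embedding matrix A
phi_x = rng.integers(0, 2, D).astype(float)          # Phi_X(x): question features (stand-in)
phi_k = rng.integers(0, 2, (N, D)).astype(float)     # Phi_K(k_hi): key features (stand-in)
phi_v = rng.integers(0, 2, (N, D)).astype(float)     # Phi_V(v_hi): value features (stand-in)

q = A @ phi_x                           # embedded question
keys = phi_k @ A.T                      # embedded keys, one row per memory
p = softmax(keys @ q)                   # key addressing: p_hi = Softmax(A Phi_X(x) . A Phi_K(k_hi))
o = p @ (phi_v @ A.T)                   # value reading: o = sum_i p_hi * A Phi_V(v_hi)
```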


Select from the candidate answers $\{y_i\}$, which are all possible answer words or sentences, by matching against $o$ (which contains the answer information):

$\hat{a} = \arg\max_{i=1,\dots,C} \text{Softmax}\!\left(o^T B\Phi_Y(y_i)\right)$

This is another difference from the standard memory network


Addressing and reading can be repeated (‘hops’), each time with an updated query (question embedding): $q_{i+1} = R_i(q_i + o_i)$

Model parameters: $A$, $B$, $R_1, \dots, R_H$

Training: cross-entropy loss between the predicted and true answers
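A minimal sketch of the hop loop with random stand-ins for A, B, R_i and the feature maps; the answer scoring at the end uses the final query, which has already absorbed the read-outs o_i, as a simplification of the matching step above.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(1)
D, d, N, C, H = 100, 32, 6, 10, 2       # feature dim, embed dim, #memories, #candidates, #hops

A = rng.normal(0, 0.1, (d, D))
B = rng.normal(0, 0.1, (d, D))
R = [rng.normal(0, 0.1, (d, d)) for _ in range(H)]    # one R_i per hop

phi_x = rng.integers(0, 2, D).astype(float)           # question features (stand-in)
keys = rng.integers(0, 2, (N, D)).astype(float) @ A.T # embedded keys (stand-in features)
vals = rng.integers(0, 2, (N, D)).astype(float) @ A.T # embedded values (stand-in features)
cands = rng.integers(0, 2, (C, D)).astype(float)      # Phi_Y(y_i) for each candidate (stand-in)

q = A @ phi_x
for i in range(H):                       # repeated addressing + reading ('hops')
    p = softmax(keys @ q)                # key addressing
    o = p @ vals                         # value reading
    q = R[i] @ (q + o)                   # q_{i+1} = R_i(q_i + o_i)

scores = softmax((cands @ B.T) @ q)      # match candidates against the final query
a_hat = scores.argmax()
```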


The model outperforms other methods on the WikiQA and MovieQA datasets


How do humans read and understand a story?

The ability to maintain and update the state of the world is a key feature of intelligent agents!

1. ‘Mary picked up the ball.’  2. ‘Mary went to the garden.’

Explicitly maintain a set of high-level concepts or entities together with their properties, which are updated as new information is received

Figures in the previous slides from Miller et al., “Key-value memory networks for directly reading documents”, arXiv, 2016

Figures in the following slides from Henaff, Weston, Szlam, Bordes, LeCun, “Tracking the world state with recurrent entity networks”, ICLR, 2017


Recurrent entity network

Multiple (5–20) dynamic memory cells: (key $w_j$, value $h_j$) pairs

Novelty 1: each cell is associated with its own ‘processor’, a gated recurrent network that updates the cell value given an input


Recurrent entity network: dynamic memory

Each cell stores the state of one entity (e.g., an object or a person)

The $j$th network independently uses a key $w_j$ and the $t$th input sentence $s_t$ to update the memory $h_j$ at time step $t$

Keys $\{w_j\}$ can be learned as free parameters, or tied to entity words in the document or to candidate answers

$U$, $V$, $W$ are shared by all the gated recurrent networks

The L2 normalization helps forget old memory content

$g_{j,t} = \sigma(s_t^T h_{j,t-1} + s_t^T w_j)$

$\tilde{h}_{j,t} = \phi(U h_{j,t-1} + V w_j + W s_t)$

$h_{j,t} = h_{j,t-1} + g_{j,t} \odot \tilde{h}_{j,t}$

$h_{j,t} \leftarrow h_{j,t} / \|h_{j,t}\|$
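A NumPy sketch of one update step for all J cells, following the four equations above; φ is taken to be tanh (the paper uses a parametric ReLU), and the keys, memories, and sentence encoding are random stand-ins.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d, J = 32, 5                            # embedding dim, number of memory cells

U, V, W = (rng.normal(0, 0.1, (d, d)) for _ in range(3))   # shared by all cells
keys = rng.normal(0, 1, (J, d))         # keys w_j, one per entity (stand-ins)
mem = rng.normal(0, 1, (J, d))          # memory values h_j (stand-ins)
s_t = rng.normal(0, 1, d)               # encoding of the t-th input sentence (stand-in)

for j in range(J):
    g = sigmoid(s_t @ mem[j] + s_t @ keys[j])               # content term + location term
    h_tilde = np.tanh(U @ mem[j] + V @ keys[j] + W @ s_t)   # candidate update, phi = tanh here
    mem[j] = mem[j] + g * h_tilde                           # gated update
    mem[j] = mem[j] / np.linalg.norm(mem[j])                # normalisation forgets old content
```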


Recurrent entity network: gate function

Novelty 2: the input $s_t$ directly interacts with the key $w_j$ and the memory $h_j$

The gate function $g_{j,t}$ contains two terms

The ‘content’ term $s_t^T h_{j,t-1}$ causes the gate to open for memory slots whose content $h_{j,t-1}$ matches the input

The ‘location’ term $s_t^T w_j$ causes the gate to open for memory slots whose key $w_j$ matches the input

$g_{j,t} = \sigma(s_t^T h_{j,t-1} + s_t^T w_j)$

Memories are updated only when new information relevant to their concepts is received, and remain otherwise unchanged

$h_{j,t} = h_{j,t-1} + g_{j,t} \odot \tilde{h}_{j,t}$


An example of how the model works

1st: ‘Mary picked up the ball.’; 2nd: ‘Mary went to the garden.’

Two gated networks (memories) for states of ‘Mary’ and ‘ball’.

Key $w_1$ for the entity ‘Mary’; $w_2$ for the entity ‘ball’

1st time step: $s_1$ encodes ‘Mary picked up the ball.’

The location term $s_1^T w_1$ would open the gate of the ‘Mary’ memory

The location term $s_1^T w_2$ would open the gate of the ‘ball’ memory

Updated state $h_{1,1}$ of the ‘Mary’ entity: she is carrying the ball

Updated state $h_{2,1}$ of the ‘ball’ entity: it is carried by Mary

2nd time step: $s_2$ encodes ‘Mary went to the garden.’

Both the content term $s_2^T h_{1,1}$ and the location term $s_2^T w_1$ help update the ‘Mary’ entity: she is now in the garden

The content term $s_2^T h_{2,1}$ would open the gate of ‘ball’, even though the word ‘ball’ does not occur in the second sentence

Why: information about ‘Mary’ is already in the ‘ball’ memory from the 1st time step, so the new sentence about Mary matches the ‘ball’ memory content


Recurrent entity network: input and output modules

Input module:

$e_i$: the $i$th word in the $t$th input sequence; $f_i$: model parameters

$s_t = \sum_i f_i \odot e_i$

Output module:

$q$: query (question) vector; $R$ and $H$: model parameters

An attention mechanism ($p_j$) over the memories is adopted as well

$p_j = \text{Softmax}(q^T h_j)$

$u = \sum_j p_j h_j$

$y = R\,\phi(q + H u)$

The model (including the gated recurrent networks above) is trained end-to-end
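The output module reduces to a few lines; this sketch assumes the final memories h_j and the question vector q are given, uses random stand-ins for R and H, and again takes φ = tanh as a stand-in nonlinearity.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, J, n_vocab = 32, 5, 40            # embedding dim, number of memory cells, answer vocabulary

mem = rng.normal(0, 1, (J, d))       # final memory values h_j (stand-ins)
q = rng.normal(0, 1, d)              # question vector (stand-in)
R = rng.normal(0, 0.1, (n_vocab, d))
H = rng.normal(0, 0.1, (d, d))

p = softmax(mem @ q)                 # p_j = Softmax(q^T h_j)
u = p @ mem                          # u = sum_j p_j h_j
y = R @ np.tanh(q + H @ u)           # y = R phi(q + H u)
answer = y.argmax()                  # highest-scoring answer word
```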


Recurrent entity network: result

The model solved all 20 bAbI question-answering tasks


Before neural Turing machine

Feedforward networks process input, then generate output

Figures here and in the following slides from https://jamiekang.github.io/


RNNs do similarly, but with recurrent connections inside


Neural Turing machine

The neural Turing machine uses a controller (either feed-forward or RNN) to read from and write to an external memory, then generates the output from the read result.


Neural Turing machine: reading and writing

The memory $\mathbf{M}_t$ is an $N \times M$ matrix: $N$ locations, $M$ elements per location

$w_t = (w_t(0), \dots, w_t(N-1))$: reading weights, produced by the read head

$r_t$: the weighted sum of the memory row vectors, returned by the read head

From the write head: a weighting $w_t$, an erase vector $e_t$, and an add vector $a_t$

This gives fine-grained control over the elements in each memory location
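Reading and the erase-then-add write can be written directly from this description; the weightings and the erase/add vectors below are fixed stand-ins for what the heads would normally emit.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 8, 4                              # N memory locations, M elements per location
memory = rng.normal(0, 1, (N, M))

# --- read ---
w_read = np.ones(N) / N                  # read weighting from the read head (sums to 1)
r = w_read @ memory                      # r_t = sum_i w_t(i) M_t(i)

# --- write: erase, then add ---
w_write = np.zeros(N); w_write[2] = 1.0  # write weighting from the write head (stand-in)
e = rng.random(M)                        # erase vector, elements in [0, 1]
a = rng.normal(0, 1, M)                  # add vector

memory = memory * (1 - np.outer(w_write, e))   # erase: M(i) *= (1 - w(i) * e)
memory = memory + np.outer(w_write, a)         # add:   M(i) += w(i) * a
```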


How to generate the read weights and the write information?

Content-based addressing: attend to locations whose memory content is similar to a key $k_t$ emitted by the controller

Location-based addressing: rotationally shift the elements of a weighting

The two can be combined, together with interpolation and sharpening

Figures here and in the next slide from Graves, Wayne, Danihelka, “Neural Turing machines”, arXiv, 2014
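A sketch of the full addressing chain (content similarity, interpolation with the previous weighting, rotational shift, sharpening); the controller outputs k, β, g, s, γ are fixed stand-in values here.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cosine(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

rng = np.random.default_rng(0)
N, M = 8, 4
memory = rng.normal(0, 1, (N, M))
w_prev = np.ones(N) / N                  # weighting from the previous time step

# values emitted by the controller (stand-ins)
k, beta, g, gamma = rng.normal(0, 1, M), 2.0, 0.7, 2.0
s = np.array([0.1, 0.8, 0.1])            # shift distribution over shifts {-1, 0, +1}

# 1. content-based addressing
w_c = softmax(beta * np.array([cosine(k, memory[i]) for i in range(N)]))
# 2. interpolation with the previous weighting
w_g = g * w_c + (1 - g) * w_prev
# 3. location-based addressing: circular (rotational) shift by convolution
w_s = np.array([sum(w_g[(i - j) % N] * s[j + 1] for j in (-1, 0, 1)) for i in range(N)])
# 4. sharpening
w = w_s ** gamma
w = w / w.sum()
```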


Neural Turing machine: result

The model parameters are mainly in the controller; trained end-to-end

Trained to copy sequences (of length up to 20) of 8-bit random vectors; tested on sequences of length 10, 20, 30, 50, 120

The NTM can copy the longer sequences, while the LSTM cannot

The NTM can infer simple algorithms such as copying, sorting, etc.


Summary

Memory networks often contain an external memory module

Memory networks can be trained end-to-end

Dynamic memory networks consider sentence order

The key-value memory network is more flexible than the standard memory network

Recurrent entity networks can directly update entities’ states

Neural Turing machine can infer simple algorithms

The common focus is how to generate and use memory information

Further reading:

Graves et al., ‘Hybrid computing using a neural network with dynamic external memory’, Nature, 2016

Chandar et al., ‘Hierarchical memory networks’, ICLR, 2017