Week 12: Memory networks
Instructor: Ruixuan Wang ([email protected])
School of Data and Computer Science, Sun Yat-Sen University
16 May, 2019
1 Memory network
2 Dynamic memory network
3 Key-value memory network
4 Recurrent entity network
5 Neural Turing machine
Question answering (QA)
Question answering: given a set (sequence) of facts and a question, predict the answer
Facts (sentences), question (sentence) and answer (word)
bAbI dataset: 20 toy tasks, with different types
Table here from Weston et al., “Towards AI-complete question answering: a set of prerequisite toy tasks”, arXiv, 2015
Figures in next 6 slides from Sukhbaatar, Szlam, Weston, Fergus, “End-to-end memory networks”, arXiv, 2015
Memory network
Embed a collection of facts (i.e., sentences)
Compute an attention depending on a query (i.e., question)
p_i = Softmax(u^T m_i)
Memory network
(Intermediate) output is weighted sum of the fact embeddings
o = Σ_i p_i c_i
Input embedding is different from output embedding
Memory network
The output o is then used as input to classifier
Question embedding is also part of input to classifier
â = Softmax(W(o + u))
Memory network
Model parameters: embeddings A, B, C; dense output layer W
Trained by minimizing a cross-entropy loss between the prediction â and the true label a
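To make the single-hop model concrete, here is a minimal NumPy sketch of the forward pass above (attention p_i, output o, answer distribution). The bag-of-words sentence encoding, the dimensions, and the random weights are illustrative assumptions, not the exact setup of Sukhbaatar et al.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

d, V, n_facts, n_words = 20, 50, 4, 6        # embedding dim, vocab size, #facts, words per sentence
rng = np.random.default_rng(0)

A = rng.normal(scale=0.1, size=(V, d))       # input (memory) embedding
B = rng.normal(scale=0.1, size=(V, d))       # question embedding
C = rng.normal(scale=0.1, size=(V, d))       # output embedding
W = rng.normal(scale=0.1, size=(d, V))       # final dense (classifier) layer

facts = rng.integers(0, V, size=(n_facts, n_words))   # word ids of each fact
question = rng.integers(0, V, size=n_words)           # word ids of the question

m = np.array([A[f].sum(axis=0) for f in facts])  # memory vectors m_i (bag of words)
c = np.array([C[f].sum(axis=0) for f in facts])  # output vectors c_i
u = B[question].sum(axis=0)                      # question embedding u

p = softmax(m @ u)               # p_i = Softmax(u^T m_i)
o = p @ c                        # o = sum_i p_i c_i
a_hat = softmax((o + u) @ W)     # answer distribution over the vocabulary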
Memory network
Several ‘hops’ of ‘reasoning’ to produce the right answer
Embedding layers can be tied: A_1 = A_2 = A_3, C_1 = C_2 = C_3, making it similar to an RNN
Memory network
Trained end-to-end from (Facts+Questions) to (Answers)
The attended fact at each hop may change during inference
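A hedged sketch of the multi-hop reading just described, with fully tied embeddings (A_1 = A_2 = A_3, C_1 = C_2 = C_3) so the query update u ← u + o acts like an RNN over hops. The memory and question vectors here are random stand-ins for the embedded facts and question.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

d, n_facts, K = 20, 4, 3               # embedding dim, #facts, #hops
rng = np.random.default_rng(1)
m = rng.normal(size=(n_facts, d))      # memory embeddings m_i (from A)
c = rng.normal(size=(n_facts, d))      # output embeddings c_i (from C)
u = rng.normal(size=d)                 # question embedding (from B)

for _ in range(K):                     # each hop re-reads the memory
    p = softmax(m @ u)                 # attention over facts, conditioned on the current query
    o = p @ c                          # hop output
    u = u + o                          # query for the next hop
# u is then fed to the final softmax classifier W, as in the single-hop model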
Limitation of the memory network
It did not consider temporal order between sentences
Dynamic memory network
Multiple facts (sentences) are concatenated (with ‘end-of-sentence’ tokens) into a single longer sequence
A GRU RNN generates a representation S_i for each sentence
A 2nd GRU generates the representation q for the question sentence
Figures here and in the next 5 slides from Kumar et al., “Ask me anything: dynamic memory networks for natural language processing”, ICML, 2016
Dynamic memory network
Attention (blue arrows & values nearby) by a 2-layer network
Episode e_t by a modified 3rd GRU, with the attention as a gate to tune the GRU output
The episode at the final time step represents the attended (relevant) facts
A 4th GRU generates the episodic memory m (with e and q as input)
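The attention-gated episode update can be sketched as below. The GRU weights, the 2-layer gate network, and the interaction features z(s, m, q) are simplified assumptions rather than the exact DMN formulation.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d = 8
rng = np.random.default_rng(0)
# GRU parameters (update gate z, reset gate r, candidate state)
Wz, Uz = rng.normal(size=(d, d)), rng.normal(size=(d, d))
Wr, Ur = rng.normal(size=(d, d)), rng.normal(size=(d, d))
Wh, Uh = rng.normal(size=(d, d)), rng.normal(size=(d, d))
# 2-layer attention (gate) network
W1, W2 = rng.normal(size=(d, 4 * d)), rng.normal(size=(1, d))

def gru_step(x, h):
    z = sigmoid(Wz @ x + Uz @ h)
    r = sigmoid(Wr @ x + Ur @ h)
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))
    return (1 - z) * h + z * h_tilde

def attention_gate(s, m, q):
    # simplified interaction features between fact s, episodic memory m and question q
    z_feat = np.concatenate([s * q, s * m, np.abs(s - q), np.abs(s - m)])
    return sigmoid(W2 @ np.tanh(W1 @ z_feat))[0]

facts = rng.normal(size=(5, d))   # sentence representations S_i
q = rng.normal(size=d)            # question representation
m = q.copy()                      # episodic memory, initialised with the question
h = np.zeros(d)                   # episode state
for s in facts:
    g = attention_gate(s, m, q)
    h = g * gru_step(s, h) + (1 - g) * h   # attention used as a gate on the GRU output
episode = h                       # e: attended summary of the relevant facts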
Dynamic memory network
There are multiple ‘passes’ (iterations) in episodic memory
In the 2nd pass: the attention takes the previous episodic memory m_1 as input
m_1 is also input to the 4th GRU to produce the updated episodic memory m_2
m should contain all information needed to answer the question q
Note: only the stronger attention weights (blue arrows) are drawn
Dynamic memory network
A 5th GRU generates the answer, with the episodic memory m as its initial state
The last generated word and the question vector are concatenated as input at each step
Trained with a cross-entropy loss between the prediction (a single word y_t or a sequence {y_t}) and the correct word/sequence
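A minimal sketch of this answer module: a GRU whose initial state is the episodic memory m, consuming the previously generated word distribution concatenated with the question vector at each step. All weights and sizes are illustrative placeholders.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

d, V, T = 8, 30, 3                     # hidden size, vocab size, #decoding steps
rng = np.random.default_rng(0)
x_dim = V + d                          # input: [previous word distribution; question vector]
Wz, Uz = rng.normal(size=(d, x_dim)), rng.normal(size=(d, d))
Wr, Ur = rng.normal(size=(d, x_dim)), rng.normal(size=(d, d))
Wh, Uh = rng.normal(size=(d, x_dim)), rng.normal(size=(d, d))
Wa = rng.normal(size=(V, d))           # projection to the output vocabulary

q = rng.normal(size=d)                 # question vector
a = rng.normal(size=d)                 # initial GRU state = episodic memory m
y = np.zeros(V)                        # previous output word (empty at t = 0)
for _ in range(T):
    x = np.concatenate([y, q])
    z = sigmoid(Wz @ x + Uz @ a)
    r = sigmoid(Wr @ x + Ur @ a)
    h_tilde = np.tanh(Wh @ x + Uh @ (r * a))
    a = (1 - z) * a + z * h_tilde
    y = softmax(Wa @ a)                # distribution over answer words y_t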
Dynamic memory network
The DMN captures positional and temporal information
The entire DMN can be trained end-to-end
Dynamic memory network
The episodic memory module helps iteratively retrieve and store facts, gradually incorporating more relevant information
Limitations of dynamic memory network
The DMN requires that supporting facts (i.e., the facts that are relevant for answering a particular question) be labeled during training
DMN did not work well on bAbI-10k without supporting facts
Reason 1: when encoding a fact, the GRU does not use future sentences
Reason 2: in the input module, the word-level GRU does not allow distant supporting sentences to interact with each other
Figure in the next 2 slides from Xiong, Merity, Socher, “Dynamic memory networks for visual and textual question answering”, ICML, 2016
DMN+: improved dynamic memory network
Two-level encoding of facts (sequences)
Input fusion layer allows information flow between sentences
DMN+: improved dynamic memory network
Attention score is used inside GRU at each time step
Memory update: a ReLU of a linear layer over the previous memory, pass context and question, with weights unique per pass (sketch below)
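A short sketch of the DMN+ memory update under these assumptions: the pass context vector c is a stand-in for the attention-GRU output of that pass, and each pass has its own weight matrix (untied).

import numpy as np

d, n_passes = 8, 3
rng = np.random.default_rng(0)
W_pass = rng.normal(size=(n_passes, d, 3 * d))   # one weight matrix per pass
b_pass = np.zeros((n_passes, d))

q = rng.normal(size=d)       # question vector
m = q.copy()                 # episodic memory, initialised with the question
for t in range(n_passes):
    c = rng.normal(size=d)   # stand-in for the pass-t context (attention-GRU output)
    # ReLU of a linear layer over [previous memory; context; question]
    m = np.maximum(0.0, W_pass[t] @ np.concatenate([m, c, q]) + b_pass[t])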
Extending memory networks
There are other ways to extend memory networks!
Key-value memory network
Key-value memory: e.g., the key is a window of words in a doc, the value is the centre word of the window or the title of the doc, etc.
Key hashing pre-selects the subset of memories whose key shares a word with the question
Key-value memory network
Φ_X, Φ_K: predefined sentence coding; A: embedding
Key addressing computes an attention depending on the question
p_{h_i} = Softmax(AΦ_X(x) · AΦ_K(k_{h_i}))
Key-value memory network
Value reading outputs a weighted sum of the value embeddings
o = Σ_i p_{h_i} AΦ_V(v_{h_i})
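A minimal NumPy sketch of one key addressing and value reading step. Φ_X, Φ_K, Φ_V are stand-ins for the predefined feature maps, A is the learned embedding, and the sizes are illustrative.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

D, d, n_mem = 50, 16, 8                # feature dim, embedding dim, #memory slots (after hashing)
rng = np.random.default_rng(0)
A = rng.normal(scale=0.1, size=(d, D))

phi_x = rng.random(D)                  # Phi_X(x): question features
phi_k = rng.random((n_mem, D))         # Phi_K(k_{h_i}): key features
phi_v = rng.random((n_mem, D))         # Phi_V(v_{h_i}): value features

q = A @ phi_x                          # embedded question
keys = phi_k @ A.T                     # embedded keys,   shape (n_mem, d)
vals = phi_v @ A.T                     # embedded values, shape (n_mem, d)

p = softmax(keys @ q)                  # key addressing: p_{h_i}
o = p @ vals                           # value reading:  o = sum_i p_{h_i} A Phi_V(v_{h_i})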
Key-value memory network
Select from the candidate answers {y_i}, which are all possible answer words or sentences, by matching o (which contains the answer information):
â = argmax_{i=1,...,C} Softmax(o^T BΦ_Y(y_i))
This is another difference from standard memory network
Key-value memory network
Addressing and reading can be repeated (‘hops’), each time with an updated query (question embedding): q_{i+1} = R_i(q_i + o_i)
Model parameters: A, B, R_1, ..., R_H
Training: cross-entropy loss between predicted and true answers
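A hedged sketch of repeated hops and the final answer selection. The keys, values, and candidate answers are pre-embedded random stand-ins; R_j and B are the learned matrices, and the final match uses the query after the last hop.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

d, n_mem, n_cand, H = 16, 8, 5, 2
rng = np.random.default_rng(1)
keys = rng.normal(size=(n_mem, d))     # A Phi_K(k_{h_i})
vals = rng.normal(size=(n_mem, d))     # A Phi_V(v_{h_i})
cands = rng.normal(size=(n_cand, d))   # B Phi_Y(y_i) for the candidate answers
R = rng.normal(size=(H, d, d))         # one R_j per hop

q = rng.normal(size=d)                 # embedded question q_1
for j in range(H):
    p = softmax(keys @ q)              # key addressing
    o = p @ vals                       # value reading
    q = R[j] @ (q + o)                 # query update: q_{j+1} = R_j (q_j + o_j)

scores = softmax(cands @ q)            # match the final query against candidates
a_hat = int(np.argmax(scores))         # predicted answer index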
Key-value memory network
The model performs better than others on the WikiQA and MOVIEQA datasets
How does a human read and understand a story?
The ability to maintain and update the state of the world is a key feature of intelligent agents!
1. ‘Mary picked up the ball.’
2. ‘Mary went to the garden.’
Explicitly maintain a set of high-level concepts or entities together with their properties, which are updated as new information is received
Figures in the previous 6 slides from Miller et al., “Key-value memory networks for directly reading documents”, arXiv, 2016
Figures in the next 6 slides from Henaff, Weston, Szlam, Bordes, LeCun, “Tracking the world state with recurrent entity networks”, ICLR, 2017
Recurrent entity network
Multiple (5 to 20) dynamic memory cells: (key w_j, value h_j) pairs
Novelty 1: each cell is associated with its own ‘processor’, a gated recurrent network that updates the cell value given an input
Recurrent entity network: dynamic memory
Each cell stores state of one entity (e.g., objects, persons)
The j-th network independently uses a key w_j and the t-th input sentence s_t to update the memory h_j at time step t
Keys {w_j} can be learned as parameters, or set to entity words in the doc or to candidate answers
U, V, W are shared by all gated recurrent networks
The L2 normalization helps forget old memory content
g_{j,t} = σ(s_t^T h_{j,t−1} + s_t^T w_j)
h̃_{j,t} = φ(U h_{j,t−1} + V w_j + W s_t)
h_{j,t} = h_{j,t−1} + g_{j,t} ⊙ h̃_{j,t}
h_{j,t} ← h_{j,t} / ‖h_{j,t}‖
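The memory-cell update above can be sketched as follows. φ is taken to be tanh here (the paper uses a parametric ReLU), and U, V, W, the keys and the states are random placeholders.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d, n_cells = 8, 4
rng = np.random.default_rng(0)
U, V, W = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))  # shared across cells
keys = rng.normal(size=(n_cells, d))      # keys w_j
h = rng.normal(size=(n_cells, d))         # memory states h_j

def entnet_step(s_t, h, keys):
    for j in range(len(keys)):
        g = sigmoid(s_t @ h[j] + s_t @ keys[j])               # gate: content term + location term
        h_tilde = np.tanh(U @ h[j] + V @ keys[j] + W @ s_t)   # candidate update (phi = tanh here)
        h_new = h[j] + g * h_tilde                            # gated write
        h[j] = h_new / (np.linalg.norm(h_new) + 1e-8)         # normalisation forgets old content
    return h

s_t = rng.normal(size=d)   # encoding of the t-th input sentence
h = entnet_step(s_t, h, keys)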
Recurrent entity network: gate function
Novelty 2: the input s_t directly interacts with the key w_j and the memory h_j
The gate function g_{j,t} contains two terms
The ‘content’ term s_t^T h_{j,t−1} causes the gate to open for memory slots whose content h_{j,t−1} matches the input
The ‘location’ term s_t^T w_j causes the gate to open for memory slots whose key w_j matches the input
g_{j,t} = σ(s_t^T h_{j,t−1} + s_t^T w_j)
Memories are updated only when new information relevant to their concepts is received, and remain otherwise unchanged
h_{j,t} = h_{j,t−1} + g_{j,t} ⊙ h̃_{j,t}
An example of how the model works
1st: ‘Mary picked up the ball.’; 2nd: ‘Mary went to the garden.’
Two gated networks (memories) for the states of ‘Mary’ and ‘ball’
Key w_1 for the entity ‘Mary’; w_2 for the entity ‘ball’
1st time step: s_1 encodes ‘Mary picked up the ball.’
The location term s_1^T w_1 would open the gate of the ‘Mary’ memory
The location term s_1^T w_2 would open the gate of the ‘ball’ memory
Updated state h_{1,1} of the ‘Mary’ entity: she is carrying the ball
Updated state h_{2,1} of the ‘ball’ entity: it is carried by Mary
2nd time step: s_2 encodes ‘Mary went to the garden.’
Both the content term s_2^T h_{1,1} and the location term s_2^T w_1 help update the ‘Mary’ entity: she is now in the garden
The content term s_2^T h_{2,1} would open the gate of ‘ball’, even though the word ‘ball’ does not occur in the second sentence
Why: information about ‘Mary’ is in the ‘ball’ memory from the 1st time step, so the ball’s location can also be updated to the garden
Recurrent entity network: input and output modules
Input module:
e_i: the i-th word in the t-th input sequence; f_i: model parameters
s_t = Σ_i f_i ⊙ e_i
Output module:
q: query (question) vector; R and H: model parameters
An attention mechanism (p_j) is adopted as well
p_j = Softmax(q^T h_j)
u = Σ_j p_j h_j
y = Rφ(q + Hu)
Model (including previous gated networks) trained end-to-end
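A minimal sketch of the input and output modules above. The dimensions, the random weights, and the choice φ = tanh are illustrative assumptions.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

d, n_words, n_cells, n_answers = 8, 5, 4, 30
rng = np.random.default_rng(0)

# input module: s_t = sum_i f_i * e_i
E = rng.normal(size=(n_words, d))      # word embeddings e_i of the t-th sentence
f = rng.normal(size=(n_words, d))      # learned multiplicative masks f_i
s_t = (f * E).sum(axis=0)

# output module
H = rng.normal(size=(d, d))            # parameter H
R = rng.normal(size=(n_answers, d))    # parameter R
h = rng.normal(size=(n_cells, d))      # final memory states h_j
q = rng.normal(size=d)                 # question vector

p = softmax(h @ q)                     # p_j = Softmax(q^T h_j)
u = p @ h                              # u = sum_j p_j h_j
y = R @ np.tanh(q + H @ u)             # scores over candidate answers (phi = tanh here)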
Recurrent entity network: result
The model solved all 20 bAbI question-answering tasks
Before neural Turing machine
Feedforward networks process input, then generate output
Figures here and in next 4 slides from https://jamiekang.github.io/
Before neural Turing machine
RNNs do similarly, with recurrent connections inside
Neural Turing machine
A neural Turing machine uses a controller (either feed-forward or RNN) to read from and write to an external memory, then generates the output from the read result.
Neural Turing machine: reading and writing
Memory M is N × M: N locations, M elements per location
w_t = (w_t(0), ..., w_t(N−1)): reading weights, from the read head
r_t: weighted sum of the memory’s row vectors; returned by the read head
From the write head: weight w_t, erase vector e_t, add vector a_t
Fine-grained control over elements in each memory location
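A short NumPy sketch of one read and one write using the quantities above. The weighting, erase, and add vectors are random stand-ins for what the heads would actually emit.

import numpy as np

N, M_dim = 6, 4                        # N locations, M elements per location
rng = np.random.default_rng(0)
M = rng.random((N, M_dim))             # memory matrix M_{t-1}

w = rng.random(N); w /= w.sum()        # head weighting over locations (sums to 1)
e = rng.random(M_dim)                  # erase vector, elements in (0, 1)
a = rng.normal(size=M_dim)             # add vector

# reading: r_t = sum_i w_t(i) M(i)
r = w @ M

# writing: erase, then add, at every location i
M_erased = M * (1.0 - np.outer(w, e))  # M~(i) = M(i) * (1 - w(i) e)
M_new = M_erased + np.outer(w, a)      # M_t(i) = M~(i) + w(i) a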
How to generate read weight and write information?
Content-based addressing: attend to locations whose memory content is similar to the key k_t emitted by the controller
Location-based addressing: by rotationally shifting the elements of a weighting
These can be combined, together with interpolation and sharpening (sketch below)
Figures here and in next slide from Graves, Wayne, Danihelka, “Neural Turing machines”, arXiv, 2014
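A hedged sketch of the full addressing chain (content focus, interpolation, shift, sharpening). The controller outputs, i.e. the key k, key strength β, interpolation gate g, shift distribution, and sharpening exponent γ, are fixed placeholders here.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

N, M_dim = 6, 4
rng = np.random.default_rng(0)
M = rng.random((N, M_dim))             # memory
w_prev = np.ones(N) / N                # previous weighting

k = rng.normal(size=M_dim)             # key emitted by the controller
beta, g, gamma = 5.0, 0.7, 2.0         # key strength, interpolation gate, sharpening exponent
shift = np.array([0.1, 0.8, 0.1])      # distribution over shifts {-1, 0, +1}

w_c = softmax(beta * np.array([cosine(k, M[i]) for i in range(N)]))   # content-based focus
w_g = g * w_c + (1 - g) * w_prev                                      # interpolate with previous weighting
w_s = sum(shift[j] * np.roll(w_g, j - 1) for j in range(3))           # rotational (circular) shift
w = w_s ** gamma
w = w / w.sum()                                                        # sharpen and renormalise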
Neural Turing machine: result
Model parameters are mainly in the controller; trained end-to-end
Trained to copy sequences (length up to 20) of 8-bit random vectors; tested on sequences of length 10, 20, 30, 50, 120
NTM can copy longer sequences (below), while LSTM cannot
NTM can infer simple algorithms like copying, sorting, etc.
Summary
Memory networks often contain an external memory module
Memory networks can be trained end-to-end
Dynamic memory networks consider sentence order
The key-value memory network is more flexible than the standard memory network
Recurrent entity networks can directly update entities’ states
Neural Turing machines can infer simple algorithms
The focus is how to generate and use memory information
Further reading:
Graves et al., ‘Hybrid computing using a neural network withdynamic external memory’, Nature, 2016
Chandar et al., ‘Hierarchical memory networks’, ICLR, 2017