Neural Turing Machine
Authors: Alex Graves, Greg Wayne, Ivo Danihelka
Presented By: Tinghui Wang (Steve)
Introduction
• Neural Turing Machine: couple a neural network with external memory resources
• The combined system is analogous to a Turing Machine, but is differentiable from end to end
• NTM can infer simple algorithms!
In Computation Theory…
Finite State Machine
$(\Sigma, S, s_0, \delta, F)$
• $\Sigma$: alphabet
• $S$: finite set of states
• $s_0$: initial state
• $\delta$: state transition function
$\delta: S \times \Sigma \to \mathcal{P}(S)$
• $F$: set of final states
Push Down Automata
$(Q, \Sigma, \Gamma, \delta, q_0, Z, F)$
• $\Sigma$: input alphabet
• $\Gamma$: stack alphabet
• $Q$: finite set of states
• $q_0$: initial state
• $Z$: initial stack symbol
• $\delta$: state transition function
$\delta: Q \times \Sigma \times \Gamma \to \mathcal{P}(Q \times \Gamma)$
• $F$: set of final states
Turing Machine
$(Q, \Gamma, b, \Sigma, \delta, q_0, F)$
• $Q$: finite set of states
• $q_0$: initial state
• $\Gamma$: tape alphabet (finite)
• $b$: blank symbol (occurs infinitely often on the tape)
• $\Sigma$: input alphabet ($\Sigma = \Gamma \setminus \{b\}$)
• $\delta$: transition function
$\delta: (Q \setminus F) \times \Gamma \to Q \times \Gamma \times \{L, R\}$
• $F$: set of final (accepting) states
Neural Turing Machine - Add Learning to TM!!
[Figure: Finite State Machine (program) vs. Neural Network ("I can learn!")]
A Little History of Neural Networks
• 1950s: Frank Rosenblatt, the perceptron – classification based on a linear predictor
• 1969: Minsky – proof that the perceptron sucks! Not able to learn XOR
• 1980s: back-propagation
• 1990s: recurrent neural networks, Long Short-Term Memory
• Late 2000s – now: deep learning, fast computers
Feed-Forward Neural Net and Back Propagation
Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. (8 October 1986). "Learning representations by back-propagating errors". Nature 323 (6088): 533–536.
• Total input of unit $j$:
$x_j = \sum_i y_i\, w_{ji}$
• Output of unit $j$ (logistic function):
$y_j = \dfrac{1}{1 + e^{-x_j}}$
• Total error (mean square):
$E_{total} = \dfrac{1}{2} \sum_c \sum_j \big(d_{j,c} - y_{j,c}\big)^2$
• Gradient descent:
$\dfrac{\partial E}{\partial w_{ji}} = \dfrac{\partial E}{\partial y_j} \cdot \dfrac{\partial y_j}{\partial x_j} \cdot \dfrac{\partial x_j}{\partial w_{ji}}$
Recurrent Neural Network
• $\mathbf{y}(t)$: $n$-tuple of outputs at time $t$
• $\mathbf{x}^{net}(t)$: $m$-tuple of external inputs at time $t$
• $U$: set of indices $k$ such that $x_k$ is the output of a unit in the network
• $I$: set of indices $k$ such that $x_k$ is an external input
• $x_k(t) = \begin{cases} x_k^{net}(t), & \text{if } k \in I \\ y_k(t), & \text{if } k \in U \end{cases}$
• $y_k(t+1) = f_k(s_k(t+1))$, where $f_k$ is a differentiable squashing function
• $s_k(t+1) = \sum_{l \in I \cup U} w_{kl}\, x_l(t)$ (a sketch of these dynamics follows the reference below)
R. J. Williams and D. Zipser. Gradient-based learning
algorithms for recurrent networks and their computational
complexity. In Back-propagation: Theory, Architectures and
Applications. Hillsdale, NJ: Erlbaum, 1994.
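A minimal sketch of these forward dynamics for a fully connected recurrent net with a logistic squashing function (the function names and array shapes are assumptions for illustration):

import numpy as np

def logistic(s):
    return 1.0 / (1.0 + np.exp(-s))

def rnn_forward(W, x_net, y0):
    # W:     (n, n + m) weights w_kl over unit outputs and external inputs
    # x_net: (T, m) external inputs x^net(t)
    # y0:    (n,) initial unit outputs y(0)
    ys = [y0]
    for t in range(x_net.shape[0]):
        x = np.concatenate([ys[-1], x_net[t]])   # x_l(t): unit outputs, then inputs
        s = W @ x                                # s_k(t+1) = sum_l w_kl x_l(t)
        ys.append(logistic(s))                   # y_k(t+1) = f_k(s_k(t+1))
    return np.stack(ys)                          # (T+1, n) trajectory of outputs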
Recurrent Neural Network – Time Unrolling
• Error function:
$J(t) = -\tfrac{1}{2} \sum_{k \in U} e_k(t)^2 = -\tfrac{1}{2} \sum_{k \in U} \big(d_k(t) - y_k(t)\big)^2$
$J^{total}(t', t) = \sum_{\tau = t'+1}^{t} J(\tau)$
• Back-propagation through time:
$\varepsilon_k(t) = \dfrac{\partial J(t)}{\partial y_k(t)} = e_k(t) = d_k(t) - y_k(t)$
$\delta_k(\tau) = \dfrac{\partial J(t)}{\partial s_k(\tau)} = \dfrac{\partial J(t)}{\partial y_k(\tau)} \cdot \dfrac{\partial y_k(\tau)}{\partial s_k(\tau)} = f_k'\big(s_k(\tau)\big)\, \varepsilon_k(\tau)$
$\varepsilon_k(\tau - 1) = \dfrac{\partial J(t)}{\partial y_k(\tau - 1)} = \sum_{l \in U} \dfrac{\partial J(t)}{\partial s_l(\tau)} \cdot \dfrac{\partial s_l(\tau)}{\partial y_k(\tau - 1)} = \sum_{l \in U} w_{lk}\, \delta_l(\tau)$
$\dfrac{\partial J^{total}}{\partial w_{ij}} = \sum_{\tau = t'+1}^{t} \dfrac{\partial J(t)}{\partial s_i(\tau)} \cdot \dfrac{\partial s_i(\tau)}{\partial w_{ij}}$
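A self-contained sketch of the backward pass implied by these recursions, for a target given only at the final step (names and shapes are assumptions; it reuses the logistic forward dynamics sketched earlier):

import numpy as np

def logistic(s):
    return 1.0 / (1.0 + np.exp(-s))

def bptt_grad(W, x_net, y0, d_final):
    # Forward pass, recording x(tau) and y(tau+1) at every step.
    n = y0.shape[0]
    xs, ys = [], [y0]
    for t in range(x_net.shape[0]):
        x = np.concatenate([ys[-1], x_net[t]])
        xs.append(x)
        ys.append(logistic(W @ x))
    # Backward pass: eps_k(t) = d_k - y_k(t), delta_k = f'(s_k) * eps_k,
    # eps_k(tau-1) = sum_l w_lk delta_l(tau); accumulate dJ/dw_ij = sum_tau delta_i x_j.
    grad = np.zeros_like(W)
    eps = d_final - ys[-1]
    for tau in reversed(range(len(xs))):
        delta = ys[tau + 1] * (1.0 - ys[tau + 1]) * eps   # logistic f'(s) = y (1 - y)
        grad += np.outer(delta, xs[tau])
        eps = W[:, :n].T @ delta
    return grad

Since J is the negative squared error, updating with W += lr * grad ascends J, i.e. descends the error.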
RNN & Long Short-Term Memory
• RNN is Turing complete
With a proper configuration, it can simulate any sequence generated by a Turing Machine
• Error vanishing and exploding
Consider a single unit with a self-loop:
$\delta_k(\tau) = \dfrac{\partial J(t)}{\partial s_k(\tau)} = \dfrac{\partial J(t)}{\partial y_k(\tau)} \cdot \dfrac{\partial y_k(\tau)}{\partial s_k(\tau)} = f_k'\big(s_k(\tau)\big)\, \varepsilon_k(\tau) = f_k'\big(s_k(\tau)\big)\, w_{kk}\, \delta_k(\tau + 1)$
The error signal is rescaled by $f_k'(s_k)\, w_{kk}$ at every step back in time, so it shrinks or grows exponentially (a numeric illustration follows the citation below).
• Long Short-Term Memory
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8):1735β1780.
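A tiny numeric illustration of the effect (the weight values are made up for illustration):

# Back-propagating through a single self-looped logistic unit multiplies the
# error by f'(s) * w_kk at every step; |f'(s)| <= 0.25 for the logistic function.
for w in (1.5, 5.0):
    delta = 1.0
    for step in range(50):
        delta *= 0.25 * w          # delta_k(tau) = f'(s) * w_kk * delta_k(tau+1)
    print(w, delta)                # w=1.5 -> ~5e-22 (vanishes), w=5.0 -> ~7e4 (explodes)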
Purpose of Neural Turing Machine
• Enrich the capabilities of standard RNNs to simplify the solution of algorithmic tasks
NTM Read/Write Operation
[Figure: a head addresses the N memory locations, each holding an M-size vector, with weights w_1 ... w_N]
• Read: $\mathbf{r}_t = \sum_i w_t(i)\, \mathbf{M}_t(i)$
• Erase: $\tilde{\mathbf{M}}_t(i) = \mathbf{M}_{t-1}(i)\,\big[\mathbf{1} - w_t(i)\, \mathbf{e}_t\big]$
• Add: $\mathbf{M}_t(i) = \tilde{\mathbf{M}}_t(i) + w_t(i)\, \mathbf{a}_t$
$\mathbf{w}_t$: vector of weights over the N locations, emitted by a head at time t
$\mathbf{e}_t$: erase vector, emitted by the head
$\mathbf{a}_t$: add vector, emitted by the head
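A minimal NumPy sketch of the read and the erase-then-add write (array shapes are assumptions; the controller that emits w, e, and a is omitted):

import numpy as np

def ntm_read(M, w):
    # r_t = sum_i w_t(i) M_t(i); M is (N, M_size), w is (N,)
    return w @ M

def ntm_write(M_prev, w, e, a):
    # M_t(i) = M_{t-1}(i) * (1 - w_t(i) e_t) + w_t(i) a_t
    M_erased = M_prev * (1.0 - np.outer(w, e))   # erase vector e in (0,1)^M_size
    return M_erased + np.outer(w, a)             # add vector a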
Attention & Focus
• Attention and focus are adjusted by the weight vector over the whole memory bank!
Content Based Addressing
• Find similar data in memory
$\mathbf{k}_t$: key vector
$\beta_t$: key strength
$w_t^c(i) = \dfrac{\exp\!\big(\beta_t\, K[\mathbf{k}_t, \mathbf{M}_t(i)]\big)}{\sum_j \exp\!\big(\beta_t\, K[\mathbf{k}_t, \mathbf{M}_t(j)]\big)}$
$K[\mathbf{u}, \mathbf{v}] = \dfrac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert\, \lVert \mathbf{v} \rVert}$
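A sketch of content-based addressing as a softmax over cosine similarities (the small epsilon for numerical safety is an added assumption, not part of the formula above):

import numpy as np

def content_addressing(M, k, beta):
    # w^c_t(i) = softmax_i( beta * K[k, M_t(i)] ) with K the cosine similarity
    sims = (M @ k) / (np.linalg.norm(M, axis=1) * np.linalg.norm(k) + 1e-8)
    scores = beta * sims
    scores -= scores.max()          # numerical stability for the softmax
    w = np.exp(scores)
    return w / w.sum()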
May not depend on content only…
• Gating content addressing
$g_t$: interpolation gate in the range $(0, 1)$
$\mathbf{w}_t^g = g_t\, \mathbf{w}_t^c + (1 - g_t)\, \mathbf{w}_{t-1}$
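As a tiny sketch (names assumed):

def interpolate(w_c, w_prev, g):
    # w^g_t = g_t * w^c_t + (1 - g_t) * w_{t-1}; g is a scalar gate in (0, 1)
    return g * w_c + (1.0 - g) * w_prev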
Convolutional Shift – Location Addressing
$\mathbf{s}_t$: shift weighting, a convolutional vector of size N
$\tilde{w}_t(i) = \sum_{j=0}^{N-1} w_t^g(j)\, s_t(i - j)$
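A sketch of this circular convolution, assuming s_t is given as a full length-N vector of shift weights:

import numpy as np

def circular_shift(w_g, s):
    # w~_t(i) = sum_{j=0}^{N-1} w^g_t(j) * s_t(i - j), indices taken modulo N
    N = len(w_g)
    return np.array([sum(w_g[j] * s[(i - j) % N] for j in range(N)) for i in range(N)])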
Convolutional Shift – Location Addressing
$\gamma_t$: sharpening factor
$w_t(i) = \dfrac{\tilde{w}_t(i)^{\gamma_t}}{\sum_j \tilde{w}_t(j)^{\gamma_t}}$
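A sketch of the sharpening step:

def sharpen(w_tilde, gamma):
    # w_t(i) = w~_t(i)^gamma / sum_j w~_t(j)^gamma; gamma >= 1 re-peaks the distribution
    w = w_tilde ** gamma
    return w / w.sum()

Chaining content_addressing, interpolate, circular_shift, and sharpen reproduces the whole addressing pipeline sketched above; emission of k, beta, g, s, and gamma by the controller is omitted.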
Neural Turing Machine - Experiments
• Goal: demonstrate that NTM is able to solve the problems by learning compact internal programs
• Three architectures:
NTM with a feedforward controller
NTM with an LSTM controller
Standard LSTM network
• Applications:
Copy
Repeat Copy
Associative Recall
Dynamic N-Grams
Priority Sort
NTM Experiments: Copy
• Task: store and recall a long sequence of arbitrary information
• Training: 8-bit random vectors with sequence lengths of 1–20
• No inputs are presented while the network generates the targets
[Figure: copy results for NTM and LSTM]
NTM Experiments: Copy – Learning Curve
NTM Experiments: Repeated Copy
• Task: repeat a sequence a specified number of times
• Training: random-length sequences of random binary vectors, followed by a scalar value indicating the desired number of copies
NTM Experiments: Repeated Copy – Learning Curve
NTM Experiments: Associative Recall
• Task: after a sequence of items has been propagated to the network, ask the network to produce the next item given the current item
• Training: each item is composed of three six-bit binary vectors, with 2–8 items per episode
NTM Experiments: Dynamic N-Grams
• Task: learn an N-gram model – rapidly adapt to a new predictive distribution
• Training: 6-gram distributions over binary sequences; 200 successive bits are generated from a look-up table of 32 probabilities drawn from a Beta(0.5, 0.5) distribution
• Compare to the optimal estimator:
$P(B = 1 \mid N_1, N_0, \mathbf{c}) = \dfrac{N_1 + 0.5}{N_1 + N_0 + 1}$
where $N_1$ and $N_0$ are the numbers of ones and zeros observed so far following context $\mathbf{c}$
[Figure: NTM predictions for example contexts c = 01111 and c = 00010]
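A one-function sketch of this estimator (illustrative):

def optimal_bit_estimate(n_ones, n_zeros):
    # P(B = 1 | N1, N0, c) = (N1 + 0.5) / (N1 + N0 + 1)
    # for the current context c (the previous 5 bits in a 6-gram model)
    return (n_ones + 0.5) / (n_ones + n_zeros + 1.0)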
NTM Experiments: Priority Sorting
• Task: sort binary vectors based on their priorities
• Training: 20 binary vectors with corresponding priorities; the target is the 16 highest-priority vectors
NTM Experiments: Learning Curve
Some Detailed Parameters
[Table: detailed parameters for NTM with a feedforward controller, NTM with an LSTM controller, and LSTM]
Additional Information
• Literature references:
Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. (8 October 1986). "Learning representations by back-propagating errors". Nature 323 (6088): 533–536.
R. J. Williams and D. Zipser. Gradient-based learning algorithms for recurrent networks and their computational complexity. In Back-propagation: Theory, Architectures and Applications. Hillsdale, NJ: Erlbaum, 1994.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780.
Y. Bengio and Y. LeCun. Scaling Learning Algorithms towards AI. In Large-Scale Kernel Machines, Bottou, L., Chapelle, O., DeCoste, D., and Weston, J., Eds., MIT Press, 2007.
• NTM implementation:
https://github.com/shawntan/neural-turing-machines
• NTM Reddit discussion:
https://www.reddit.com/r/MachineLearning/comments/2m9mga/implementation_of_neural_turing_machines/