TRANSCRIPT
Neural Turing Machine
Authors: Alex Graves, Greg Wayne, Ivo Danihelka
Presented By: Tinghui Wang (Steve)
Introduction
• Neural Turing Machine: couples a neural network with external memory resources
• The combined system is analogous to a Turing Machine, but is differentiable end to end
• The NTM can infer simple algorithms!
In Computation Theory…
Finite State Machine
(Σ, S, s₀, δ, F)
• Σ: Alphabet
• S: Finite set of states
• s₀: Initial state
• δ: State transition function: S × Σ → P(S)
• F: Set of final states
In Computation Theory…
Push Down Automaton
(Q, Σ, Γ, δ, q₀, Z, F)
• Σ: Input alphabet
• Γ: Stack alphabet
• Q: Finite set of states
• q₀: Initial state
• Z: Initial stack symbol
• δ: State transition function: Q × Σ × Γ → P(Q × Γ)
• F: Set of final states
Turing Machine
(Q, Γ, b, Σ, δ, q₀, F)
• Q: Finite set of states
• q₀: Initial state
• Γ: Tape alphabet (finite)
• b: Blank symbol (the only symbol allowed to occur infinitely often on the tape)
• Σ: Input alphabet (Σ = Γ \ {b})
• δ: Transition function: (Q \ F) × Γ → Q × Γ × {L, R}
• F: Set of final (accepting) states
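To make the transition function concrete, here is a minimal sketch of a single Turing machine step in Python (not from the slides; the tape dictionary, head position, and the example rule are hypothetical illustrations of δ: (Q \ F) × Γ → Q × Γ × {L, R}):

# One step of a Turing machine: look up delta on (state, symbol),
# write a symbol, move the head left or right, and enter the new state.
def tm_step(state, head, tape, delta, blank='b'):
    symbol = tape.get(head, blank)                 # the blank symbol fills the rest of the tape
    new_state, write_symbol, move = delta[(state, symbol)]
    tape[head] = write_symbol
    head += -1 if move == 'L' else 1
    return new_state, head

# Hypothetical rule: in state q0 reading '1', write '0', move right, stay in q0.
delta = {('q0', '1'): ('q0', '0', 'R')}
state, head, tape = 'q0', 0, {0: '1'}
state, head = tm_step(state, head, tape, delta)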
Neural Turing Machine - Add Learning to TM!!
Finite State Machine (program) → Neural Network ("I can learn!")
A Little History of Neural Networks
• 1950s: Frank Rosenblatt, Perceptron – classification based on a linear predictor
• 1969: Minsky – proof that the Perceptron falls short: not able to learn XOR
• 1980s: Back-propagation
• 1990s: Recurrent Neural Networks, Long Short-Term Memory
• Late 2000s – now: deep learning, fast computers
Feed-Forward Neural Net and Back Propagation
Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. (8 October 1986). "Learning representations by back-propagating errors". Nature 323 (6088): 533–536.
• Total input to unit j:
  x_j = Σ_i y_i w_ji
• Output of unit j (logistic function):
  y_j = 1 / (1 + e^(−x_j))
• Total error (sum of squares over cases c and output units j):
  E_total = ½ Σ_c Σ_j (d_{j,c} − y_{j,c})²
• Gradient descent via the chain rule:
  ∂E/∂w_ji = (∂E/∂y_j)(∂y_j/∂x_j)(∂x_j/∂w_ji),
  with each weight updated in proportion to −∂E/∂w_ji
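A minimal numerical sketch of these formulas in Python (assuming a single layer of logistic units and the sum-of-squares error above; the array sizes, example values, and learning rate are illustrative, not from the slides):

import numpy as np

def forward(y_in, w):
    x = w @ y_in                          # total input x_j = sum_i y_i * w_ji
    y = 1.0 / (1.0 + np.exp(-x))          # logistic output y_j
    return y

def gradient(y_in, w, d):
    y = forward(y_in, w)
    dE_dy = y - d                         # dE/dy_j for E = 1/2 * sum_j (d_j - y_j)^2
    dy_dx = y * (1.0 - y)                 # derivative of the logistic function
    # chain rule: dE/dw_ji = dE/dy_j * dy_j/dx_j * dx_j/dw_ji, with dx_j/dw_ji = y_i
    return np.outer(dE_dy * dy_dx, y_in)

y_in = np.array([0.5, -1.0, 2.0])         # outputs of the previous layer (illustrative)
w = np.zeros((2, 3))                      # weights w_ji: 2 output units, 3 inputs
d = np.array([1.0, 0.0])                  # target values
w -= 0.1 * gradient(y_in, w, d)           # one gradient-descent update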
Recurrent Neural Network
• y(t): n-tuple of outputs at time t
• x^net(t): m-tuple of external inputs at time t
• U: set of indices k for which x_k is the output of a unit in the network
• I: set of indices k for which x_k is an external input
• x_k(t) = x_k^net(t) if k ∈ I, and y_k(t) if k ∈ U
• y_k(t+1) = f_k(s_k(t+1)), where f_k is a differentiable squashing function
• s_k(t+1) = Σ_{i ∈ I ∪ U} w_ki x_i(t)  (a short sketch follows the reference below)
R. J. Williams and D. Zipser. Gradient-based learning
algorithms for recurrent networks and their computational
complexity. In Back-propagation: Theory, Architectures and
Applications. Hillsdale, NJ: Erlbaum, 1994.
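A small sketch of that forward recurrence in Python (the sizes, random weights, and the choice of tanh as the squashing function are assumptions for illustration):

import numpy as np

def rnn_step(y_prev, x_ext, W):
    x = np.concatenate([x_ext, y_prev])   # x_k: external inputs (k in I) and unit outputs (k in U)
    s = W @ x                             # s_k(t+1) = sum_i w_ki * x_i(t)
    return np.tanh(s)                     # y_k(t+1) = f_k(s_k(t+1)), with tanh as the squashing f_k

n_units, n_inputs = 4, 3
W = np.random.randn(n_units, n_inputs + n_units) * 0.1
y = np.zeros(n_units)
for t in range(5):                        # run the recurrence over a short random input sequence
    y = rnn_step(y, np.random.randn(n_inputs), W)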
Recurrent Neural Network – Time Unrolling
• Error function:
  J(t) = −½ Σ_{k∈U} e_k(t)² = −½ Σ_{k∈U} (d_k(t) − y_k(t))²
  J_total(t′, t) = Σ_{τ=t′+1}^{t} J(τ)
• Back-propagation through time:
  ε_k(t) = ∂J(t)/∂y_k(t) = e_k(t) = d_k(t) − y_k(t)
  δ_k(τ) = ∂J(t)/∂s_k(τ) = (∂J(t)/∂y_k(τ)) (∂y_k(τ)/∂s_k(τ)) = f_k′(s_k(τ)) ε_k(τ)
  ε_k(τ−1) = ∂J(t)/∂y_k(τ−1) = Σ_{l∈U} (∂J(t)/∂s_l(τ)) (∂s_l(τ)/∂y_k(τ−1)) = Σ_{l∈U} w_lk δ_l(τ)
  ∂J_total/∂w_ij = Σ_{τ=t′+1}^{t} (∂J(t)/∂s_i(τ)) (∂s_i(τ)/∂w_ij)
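These deltas can be accumulated by stepping backwards through the unrolled network. A compact Python sketch, assuming the tanh recurrence from the sketch above and an error injected only at the final time step (both simplifications for brevity):

import numpy as np

n_inputs, n_units, T = 2, 3, 4
W = np.random.randn(n_units, n_inputs + n_units) * 0.1
xs = [np.random.randn(n_inputs) for _ in range(T)]    # external inputs x^net(t)
d = np.zeros(n_units)                                  # illustrative target at the final step

# Forward pass, storing each step's input vector and pre-activation s(tau).
y = np.zeros(n_units)
x_trace, s_trace = [], []
for x_ext in xs:
    x = np.concatenate([x_ext, y])
    s = W @ x
    x_trace.append(x)
    s_trace.append(s)
    y = np.tanh(s)

# Backward pass: dJ_total/dw_ij = sum_tau dJ/ds_i(tau) * ds_i(tau)/dw_ij.
dW = np.zeros_like(W)
eps = d - y                                            # eps_k(t) = d_k(t) - y_k(t)
for tau in range(T - 1, -1, -1):
    delta = (1.0 - np.tanh(s_trace[tau]) ** 2) * eps   # delta_k(tau) = f'(s_k(tau)) * eps_k(tau)
    dW += np.outer(delta, x_trace[tau])                # ds_i(tau)/dw_ij = x_j(tau)
    eps = W[:, n_inputs:].T @ delta                    # eps_k(tau-1) = sum_l w_lk * delta_l(tau)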
RNN & Long Short-Term Memory
• The RNN is Turing complete: with a proper configuration, it can simulate any sequence generated by a Turing Machine
• Error vanishing and exploding: consider a single unit with a self loop:
  δ_k(τ) = ∂J(t)/∂s_k(τ) = (∂J(t)/∂y_k(τ)) (∂y_k(τ)/∂s_k(τ)) = f_k′(s_k(τ)) ε_k(τ) = f_k′(s_k(τ)) w_kk δ_k(τ+1)
  Each step back in time multiplies the delta by f_k′(s_k) w_kk, so over long time lags the gradient either vanishes or explodes (a tiny numeric demo follows the reference below)
• Long Short-Term Memory addresses this with gated memory cells
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780.
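A tiny numeric illustration of that recursion (the constant 0.9 stands in for f′(s_k), and the weight values are made up):

# delta_k(tau) = f'(s_k(tau)) * w_kk * delta_k(tau+1): after T steps the delta is
# scaled by roughly (f' * w_kk)^T, which vanishes for |f' * w_kk| < 1 and explodes otherwise.
for w_kk in (0.5, 2.0):
    delta = 1.0
    for _ in range(50):
        delta *= 0.9 * w_kk
    print(w_kk, delta)    # about 5e-18 for w_kk = 0.5, about 6e+12 for w_kk = 2.0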
Purpose of Neural Turing Machine
• Enrich the capabilities of standard RNNs to simplify the solution of algorithmic tasks
NTM Read/Write Operation
The memory M_t is a bank of N locations, each holding an M-size vector, addressed through a weighting w_t(1) … w_t(N).

Read:
  r_t = Σ_i w_t(i) M_t(i)

Write (erase, then add):
  Erase: M̃_t(i) = M_{t−1}(i) [1 − w_t(i) e_t]
  Add:   M_t(i) = M̃_t(i) + w_t(i) a_t

• w_t(i): vector of weights over the N locations, emitted by a read or write head at time t
• e_t: erase vector, emitted by the write head
• a_t: add vector, emitted by the write head
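A minimal numpy sketch of these read and write operations (N, M, and the example values are arbitrary, and the weighting w is assumed to be already normalized):

import numpy as np

N, M = 5, 4                                  # N memory locations, each an M-size vector
memory = np.zeros((N, M))                    # M_{t-1}
w = np.array([0.7, 0.2, 0.1, 0.0, 0.0])      # w_t(i): weighting over the N locations

# Read: r_t = sum_i w_t(i) * M_t(i)
r = w @ memory

# Write: erase, then add
e = np.random.rand(M)                        # erase vector e_t, components in (0, 1)
a = np.random.randn(M)                       # add vector a_t
memory = memory * (1.0 - np.outer(w, e))     # erase: M~_t(i) = M_{t-1}(i) [1 - w_t(i) e_t]
memory = memory + np.outer(w, a)             # add:   M_t(i) = M~_t(i) + w_t(i) a_t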
Attention & Focus
• Attention and focus are adjusted by a weighting vector over the whole memory bank!
Content Based Addressing
• Find similar data in memory
• k_t: key vector
• β_t: key strength
  w_t^c(i) = exp(β_t K[k_t, M_t(i)]) / Σ_j exp(β_t K[k_t, M_t(j)])
  K[u, v] = (u · v) / (‖u‖ ‖v‖)
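A short sketch of content-based addressing, assuming a numpy memory matrix of shape (N, M) as in the read/write sketch above (the small epsilon in the denominator is an added numerical-stability detail, not from the slides):

import numpy as np

def content_weighting(key, beta, memory):
    # K[u, v]: cosine similarity between the key k_t and each memory row M_t(i)
    sim = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
    z = np.exp(beta * sim)                   # scaled by the key strength beta_t
    return z / z.sum()                       # softmax over locations gives w_t^c

memory = np.random.randn(6, 4)
w_c = content_weighting(key=memory[2], beta=5.0, memory=memory)   # peaks at location 2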
May not depend on content only…
• Gating content addressing
• g_t: interpolation gate in the range (0, 1)
  w_t^g = g_t w_t^c + (1 − g_t) w_{t−1}
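The interpolation itself is a one-liner; a sketch with made-up values (w_c would come from content addressing, w_prev from the previous time step):

import numpy as np

w_c = np.array([0.1, 0.7, 0.2])        # content-based weighting w_t^c
w_prev = np.array([0.3, 0.3, 0.4])     # previous weighting w_{t-1}
g = 0.8                                # interpolation gate g_t in (0, 1)
w_g = g * w_c + (1.0 - g) * w_prev     # gated weighting w_t^g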
Convolutional Shift – Location Addressing
• s_t: shift weighting – a convolutional vector of size N
  w̃_t(i) = Σ_{j=0}^{N−1} w_t^g(j) s_t(i − j), with indices taken modulo N
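A sketch of that circular convolution (the example vectors are made up; all of the shift mass is placed on a shift of +1):

import numpy as np

def circular_shift(w_g, s):
    N = len(w_g)
    # w~_t(i) = sum_j w_t^g(j) * s_t(i - j), indices modulo N
    return np.array([sum(w_g[j] * s[(i - j) % N] for j in range(N)) for i in range(N)])

w_g = np.array([0.0, 1.0, 0.0, 0.0, 0.0])
s = np.zeros(5); s[1] = 1.0                # all shift mass on +1
w_shifted = circular_shift(w_g, s)         # focus moves from location 1 to location 2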
Convolutional Shift – Location Addressing
• γ_t: sharpening factor (γ_t ≥ 1)
  w_t(i) = w̃_t(i)^γ_t / Σ_j w̃_t(j)^γ_t
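And a sketch of the final sharpening step (example values made up):

import numpy as np

def sharpen(w, gamma):
    # w_t(i) = w~_t(i)^gamma_t / sum_j w~_t(j)^gamma_t; gamma_t >= 1 re-concentrates the focus
    p = w ** gamma
    return p / p.sum()

w_shifted = np.array([0.05, 0.8, 0.1, 0.05])
w_final = sharpen(w_shifted, gamma=3.0)    # even more sharply peaked around location 1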
Neural Turing Machine - Experiments
• Goal: demonstrate that the NTM is
  able to solve the problems
  by learning compact internal programs
• Three architectures:
  NTM with a feedforward controller
  NTM with an LSTM controller
  A standard LSTM network
• Applications:
  Copy
  Repeat Copy
  Associative Recall
  Dynamic N-Grams
  Priority Sort
NTM Experiments: Copy
• Task: store and recall a long sequence of arbitrary information
• Training: 8-bit random vectors with sequence lengths of 1-20 (a data-generation sketch follows the figure below)
• No inputs are presented while the network generates the targets
[Figure: copy-task outputs and targets, NTM vs. LSTM]
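A sketch of how copy-task training examples might be generated (the delimiter channel and the exact tensor layout are assumptions for illustration; the slides only specify random 8-bit vectors with lengths 1-20):

import numpy as np

def copy_example(max_len=20, width=8):
    T = np.random.randint(1, max_len + 1)
    seq = np.random.randint(0, 2, size=(T, width))   # random 8-bit vectors
    # Input: the sequence, a delimiter flag, then blanks while the target is produced.
    inputs = np.zeros((2 * T + 1, width + 1))
    inputs[:T, :width] = seq
    inputs[T, width] = 1.0                           # assumed delimiter marking end of input
    targets = np.zeros((2 * T + 1, width))
    targets[T + 1:] = seq                            # recall the sequence with no further input
    return inputs, targets

x, y = copy_example()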
NTM Experiments: Copy – Learning Curve
NTM Experiments: Repeated Copy
• Task: repeat a copied sequence a specified number of times
• Training: random-length sequences of random binary vectors, followed by a scalar value indicating the desired number of copies
NTM Experiments: Repeated Copy – Learning Curve
NTM Experiments: Associative Recall
• Task: ask the network to produce the next item, given the current item, after a sequence of items has been propagated to the network
• Training: each item is composed of three six-bit binary vectors, with 2-8 items per episode
NTM Experiments: Dynamic N-Grams
• Task: learn an N-gram model – rapidly adapt to a new predictive distribution
• Training: 6-gram distributions over binary sequences; 200 successive bits are generated using a look-up table built by drawing 32 probabilities from a Beta(0.5, 0.5) distribution
• Compare to the optimal estimator (sketched below):
  P(B = 1 | N_1, N_0, c) = (N_1 + 0.5) / (N_1 + N_0 + 1)
  where N_1 and N_0 count the ones and zeros observed so far after context c
[Figure: example contexts C=01111 and C=00010]
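A Python sketch of that optimal estimator applied to a random binary sequence (for brevity the sequence here is uniformly random rather than drawn from the Beta(0.5, 0.5) look-up table described above):

import numpy as np

def optimal_estimate(bits, context_len=5):
    # P(B=1 | N1, N0, c) = (N1 + 0.5) / (N1 + N0 + 1), with counts kept per 5-bit context c
    counts, preds = {}, []
    for i in range(context_len, len(bits)):
        c = tuple(bits[i - context_len:i])
        n0, n1 = counts.get(c, (0, 0))
        preds.append((n1 + 0.5) / (n1 + n0 + 1))                    # probability the next bit is 1
        counts[c] = (n0 + 1, n1) if bits[i] == 0 else (n0, n1 + 1)  # update counts after observing the bit
    return preds

bits = np.random.randint(0, 2, size=200)             # a 200-bit binary sequence
p = optimal_estimate(bits)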
NTM Experiments: Priority Sorting
• Task: sort binary vectors according to their priorities
• Training: 20 binary vectors with corresponding priorities; output the 16 highest-priority vectors
NTM Experiments: Learning Curve
Some Detailed Parameters
[Table: network and training parameters for NTM with a feedforward controller, NTM with an LSTM controller, and standard LSTM]
Additional Information
• Literature references:
Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. (8 October 1986). "Learning representations by back-propagating errors". Nature 323 (6088): 533–536.
R. J. Williams and D. Zipser. Gradient-based learning algorithms for recurrent networks and their computational complexity. In Back-propagation: Theory, Architectures and Applications. Hillsdale, NJ: Erlbaum, 1994.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780.
Y. Bengio and Y. LeCun. Scaling learning algorithms towards AI. In Large-Scale Kernel Machines, Bottou, L., Chapelle, O., DeCoste, D., and Weston, J., Eds., MIT Press, 2007.
• NTM implementation: https://github.com/shawntan/neural-turing-machines
• NTM Reddit discussion: https://www.reddit.com/r/MachineLearning/comments/2m9mga/implementation_of_neural_turing_machines/