Transcript
Page 1:

Neural Turing Machine

Authors: Alex Graves, Greg Wayne, Ivo Danihelka

Presented By: Tinghui Wang (Steve)

Page 2:

Introduction

β€’ Neural Turing Machine: couples a neural network with external memory resources
β€’ The combined system is analogous to a Turing Machine, but is differentiable end to end
β€’ An NTM can infer simple algorithms!


Page 3:

In Computation Theory…

Finite State Machine

$(\Sigma, S, s_0, \delta, F)$
β€’ $\Sigma$: alphabet
β€’ $S$: finite set of states
β€’ $s_0$: initial state
β€’ $\delta$: state transition function, $\delta: S \times \Sigma \to P(S)$
β€’ $F$: set of final states


Page 4:

Push Down Automata

$(Q, \Sigma, \Gamma, \delta, q_0, Z, F)$
β€’ $\Sigma$: input alphabet
β€’ $\Gamma$: stack alphabet
β€’ $Q$: finite set of states
β€’ $q_0$: initial state
β€’ $\delta$: state transition function, $\delta: Q \times \Sigma \times \Gamma \to P(Q \times \Gamma)$
β€’ $F$: set of final states

In Computation Theory…


Page 5:

Turing Machine

$(Q, \Gamma, b, \Sigma, \delta, q_0, F)$
β€’ $Q$: finite set of states
β€’ $q_0$: initial state
β€’ $\Gamma$: tape alphabet (finite)
β€’ $b$: blank symbol (occurs infinitely often on the tape)
β€’ $\Sigma$: input alphabet ($\Sigma = \Gamma \setminus \{b\}$)
β€’ $\delta$: transition function, $\delta: (Q \setminus F) \times \Gamma \to Q \times \Gamma \times \{L, R\}$
β€’ $F$: set of final (accepting) states


Page 6:

Neural Turing Machine - Add Learning to TM!!


(Figure: the Turing Machine's Finite State Machine controller, i.e. the program, is replaced by a Neural Network, which can learn!)

Page 7:

A Little History on Neural Network

β€’ 1950s: Frank Rosenblatt's perceptron – classification based on a linear predictor
β€’ 1969: Minsky and Papert – proved the perceptron's limits: a single-layer perceptron cannot learn XOR
β€’ 1980s: Backpropagation
β€’ 1990s: Recurrent Neural Networks, Long Short-Term Memory
β€’ Late 2000s – now: deep learning, fast computers


Page 8:

Feed-Forward Neural Net and Back Propagation


Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. (8 October 1986). "Learning representations by back-propagating errors". Nature 323 (6088): 533–536.

β€’ Total input of unit $j$: $x_j = \sum_i y_i w_{ji}$
β€’ Output of unit $j$ (logistic function): $y_j = \dfrac{1}{1 + e^{-x_j}}$
β€’ Total error (mean square): $E_{total} = \dfrac{1}{2} \sum_c \sum_j \left(d_{c,j} - y_{c,j}\right)^2$
β€’ Gradient descent (chain rule): $\dfrac{\partial E}{\partial w_{ji}} = \dfrac{\partial E}{\partial y_j}\,\dfrac{\partial y_j}{\partial x_j}\,\dfrac{\partial x_j}{\partial w_{ji}}$
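
As an illustration only (not code from the talk), here is a minimal NumPy sketch of these four equations for a single layer of logistic units; all variable names and sizes are made up for the example:

```python
import numpy as np

def logistic(x):
    # y_j = 1 / (1 + exp(-x_j))
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
w = rng.normal(size=(2, 3))      # w[j, i]: weight from input i to unit j
y_in = rng.normal(size=3)        # y_i: inputs feeding the layer
d = np.array([0.0, 1.0])         # d_j: target outputs for one training case

x = w @ y_in                     # x_j = sum_i y_i w_ji
y = logistic(x)                  # unit outputs
E = 0.5 * np.sum((d - y) ** 2)   # mean-square error for this case

# Chain rule: dE/dw_ji = (dE/dy_j) (dy_j/dx_j) (dx_j/dw_ji)
dE_dy = y - d                    # derivative of 0.5 (d_j - y_j)^2 w.r.t. y_j
dy_dx = y * (1.0 - y)            # derivative of the logistic function
grad = np.outer(dE_dy * dy_dx, y_in)

w -= 0.1 * grad                  # one gradient-descent step
```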

Page 9:

Recurrent Neural Network

β€’ $y(t)$: $n$-tuple of outputs at time $t$
β€’ $x^{net}(t)$: $m$-tuple of external inputs at time $t$
β€’ $U$: set of indices $k$ such that $x_k$ is the output of a unit in the network
β€’ $I$: set of indices $k$ such that $x_k$ is an external input
β€’ $x_k(t) = \begin{cases} x_k^{net}(t) & \text{if } k \in I \\ y_k(t) & \text{if } k \in U \end{cases}$
β€’ $y_k(t+1) = f_k\!\left(s_k(t+1)\right)$, where $f_k$ is a differentiable squashing function
β€’ $s_k(t+1) = \sum_{i \in I \cup U} w_{ki}\, x_i(t)$


R. J. Williams and D. Zipser. Gradient-based learning algorithms for recurrent networks and their computational complexity. In Back-propagation: Theory, Architectures and Applications. Hillsdale, NJ: Erlbaum, 1994.

Page 10:

Recurrent Neural Network – Time Unrolling

β€’ Error function:
$J(t) = -\dfrac{1}{2} \sum_{k \in U} e_k(t)^2 = -\dfrac{1}{2} \sum_{k \in U} \left(d_k(t) - y_k(t)\right)^2$
$J^{total}(t', t) = \sum_{\tau = t'+1}^{t} J(\tau)$
β€’ Back-propagation through time:
$\varepsilon_k(t) = \dfrac{\partial J(t)}{\partial y_k(t)} = e_k(t) = d_k(t) - y_k(t)$
$\delta_k(\tau) = \dfrac{\partial J(t)}{\partial s_k(\tau)} = \dfrac{\partial J(t)}{\partial y_k(\tau)}\,\dfrac{\partial y_k(\tau)}{\partial s_k(\tau)} = f_k'\!\left(s_k(\tau)\right)\varepsilon_k(\tau)$
$\varepsilon_k(\tau - 1) = \dfrac{\partial J(t)}{\partial y_k(\tau - 1)} = \sum_{l \in U} \dfrac{\partial J(t)}{\partial s_l(\tau)}\,\dfrac{\partial s_l(\tau)}{\partial y_k(\tau - 1)} = \sum_{l \in U} w_{lk}\, \delta_l(\tau)$
$\dfrac{\partial J^{total}}{\partial w_{ji}} = \sum_{\tau = t'+1}^{t} \dfrac{\partial J(t)}{\partial s_j(\tau)}\,\dfrac{\partial s_j(\tau)}{\partial w_{ji}}$

Page 11:

RNN & Long Short-Term Memory

β€’ RNN is Turing complete: with a proper configuration, it can simulate any sequence generated by a Turing Machine
β€’ Error vanishing and exploding: consider a single unit with a self loop (a numerical sketch follows the reference below):
$\delta_k(\tau) = \dfrac{\partial J(t)}{\partial s_k(\tau)} = \dfrac{\partial J(t)}{\partial y_k(\tau)}\,\dfrac{\partial y_k(\tau)}{\partial s_k(\tau)} = f_k'\!\left(s_k(\tau)\right)\varepsilon_k(\tau) = f_k'\!\left(s_k(\tau)\right) w_{kk}\, \delta_k(\tau + 1)$
β€’ Long Short-Term Memory


Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8):1735–1780.
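
To make the vanishing/exploding behaviour concrete, here is a tiny numerical illustration (not from the slides) of the repeated factor $f_k'(s_k)\, w_{kk}$ for a tanh unit:

```python
import numpy as np

def delta_after(steps, w_kk, s_k=0.5, delta_T=1.0):
    # Each step back in time multiplies delta_k by f'(s_k) * w_kk.
    fprime = 1.0 - np.tanh(s_k) ** 2        # f'(s_k) for f = tanh
    return delta_T * (fprime * w_kk) ** steps

print(delta_after(50, w_kk=0.9))   # tiny value: the error signal vanishes
print(delta_after(50, w_kk=2.0))   # huge value: the error signal explodes
```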

Page 12:

Purpose of Neural Turing Machine

β€’ Enrich the capabilities of standard RNN to simplify the

solution of algorithmic tasks


Page 13:

NTM Read/Write Operation


(Figure: a memory bank of N locations, each holding an M-size vector, addressed by weights $w_1, w_2, \dots, w_N$.)

β€’ Read: $r_t = \sum_i w_t(i)\, M_t(i)$
β€’ Erase: $\tilde{M}_t(i) = M_{t-1}(i)\left[1 - w_t(i)\, e_t\right]$
β€’ Add: $M_t(i) = \tilde{M}_t(i) + w_t(i)\, a_t$

β€’ $w_t(i)$: weighting over the N locations emitted by a read or write head at time $t$
β€’ $e_t$: erase vector, emitted by the write head
β€’ $a_t$: add vector, emitted by the write head
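
A minimal NumPy sketch of these read, erase, and add operations, assuming the memory $M_t$ is an N×M matrix (N locations, M-size vectors), the weighting $w$ sums to one, and the erase vector has entries in (0, 1); the function names are my own:

```python
import numpy as np

def ntm_read(M, w):
    # r_t = sum_i w_t(i) M_t(i): convex combination of the N memory rows
    return w @ M

def ntm_write(M_prev, w, e, a):
    # Erase: M~_t(i) = M_{t-1}(i) * (1 - w_t(i) e_t)   (elementwise)
    M_erased = M_prev * (1.0 - np.outer(w, e))
    # Add:   M_t(i) = M~_t(i) + w_t(i) a_t
    return M_erased + np.outer(w, a)

# Usage: N = 4 locations, M = 3 elements per location
M = np.zeros((4, 3))
w = np.array([0.0, 1.0, 0.0, 0.0])          # focus entirely on location 1
M = ntm_write(M, w, e=np.ones(3), a=np.array([0.5, -0.5, 1.0]))
r = ntm_read(M, w)                           # recovers [0.5, -0.5, 1.0]
```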

Page 14:

Attention & Focus

β€’ Attention and focus are adjusted by a weight vector over the whole memory bank!


Page 15:

Content Based Addressing

β€’ Find Similar Data in memory


π’Œπ‘‘: Key Vector

𝛽𝑑: Key Strength

𝑀𝑑𝑐 𝑖 =

exp 𝛽𝑑𝐾 π’Œπ‘‘, 𝑴𝑑 𝑖

𝑗 exp 𝛽𝑑𝐾 π’Œπ‘‘, 𝑴𝑑 𝑗

𝐾 𝒖, 𝒗 =𝒖 βˆ™ 𝒗

𝒖 βˆ™ 𝒗
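
A sketch of content-based addressing under the same assumptions (M is N×M, k_t an M-size key); the small epsilon is my addition to avoid division by zero:

```python
import numpy as np

def content_weights(M, k, beta):
    # K(u, v) = u . v / (||u|| ||v||): cosine similarity between the key and each row
    sim = (M @ k) / (np.linalg.norm(M, axis=1) * np.linalg.norm(k) + 1e-8)
    # w^c_t(i) = exp(beta K(k, M(i))) / sum_j exp(beta K(k, M(j)))
    z = np.exp(beta * sim)
    return z / z.sum()
```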

Page 16:

May not depend on content only…

β€’ Gating content addressing


β€’ $g_t$: interpolation gate in the range $(0, 1)$
$w_t^g = g_t\, w_t^c + (1 - g_t)\, w_{t-1}$
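
The interpolation step is a one-liner; an illustrative sketch:

```python
def gated_weights(w_content, w_prev, g):
    # w^g_t = g_t * w^c_t + (1 - g_t) * w_{t-1}
    return g * w_content + (1.0 - g) * w_prev
```

With $g_t = 0$ the content-based weighting is ignored entirely and the previous focus is carried forward unchanged.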

Page 17:

Convolutional Shift – Location Addressing


β€’ $s_t$: shift weighting – a convolutional vector of size $N$
$\tilde{w}_t(i) = \sum_{j=0}^{N-1} w_t^g(j)\, s_t(i - j)$ (indices taken modulo $N$)

Page 18:

Convolutional Shift – Location Addressing


β€’ $\gamma_t$: sharpening factor
$w_t(i) = \dfrac{\tilde{w}_t(i)^{\gamma_t}}{\sum_j \tilde{w}_t(j)^{\gamma_t}}$

Page 19:

Neural Turing Machine - Experiments

β€’ Goal: demonstrate that the NTM is able to solve the problems by learning compact internal programs
β€’ Three architectures:
  – NTM with a feedforward controller
  – NTM with an LSTM controller
  – Standard LSTM network (no external memory)
β€’ Applications: Copy, Repeat Copy, Associative Recall, Dynamic N-Grams, Priority Sort


Page 20:

NTM Experiments: Copy

β€’ Task: store and then recall a long sequence of arbitrary information
β€’ Training: sequences of 8-bit random vectors, with sequence lengths from 1 to 20
β€’ No inputs are presented while the network generates the targets


(Figure: copy results for NTM and for LSTM.)

Page 21:

NTM Experiments: Copy – Learning Curve


Page 22:

NTM Experiments: Repeated Copy


β€’ Task: Repeat a sequence a specified

number of times

β€’ Training: random length sequences of

random binary vectors followed by a

scalar value indicating the desired

number of copies

Page 23:

NTM Experiments: Repeated Copy – Learning Curve


Page 24:

NTM Experiments: Associative Recall


β€’ Task: after a sequence of items has been propagated to the network, ask it to produce the next item given the current item
β€’ Training: each item consists of three six-bit binary vectors, with 2–8 items per episode

Page 25:

NTM Experiments: Dynamic N-Grams

β€’ Task: learn an N-gram model – rapidly adapt to a new predictive distribution
β€’ Training: 6-gram distributions over binary sequences; each sequence is 200 successive bits generated from a look-up table built by drawing 32 probabilities from a Beta(0.5, 0.5) distribution
β€’ Compare to the optimal estimator, where $N_1$ and $N_0$ are the counts of ones and zeros observed so far in context $\boldsymbol{c}$:
$P\!\left(B = 1 \mid N_1, N_0, \boldsymbol{c}\right) = \dfrac{N_1 + 0.5}{N_1 + N_0 + 1}$


(Figure: example contexts c = 01111 and c = 00010.)
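
For reference, the optimal estimator above is easy to compute; a sketch (function and argument names are mine):

```python
def p_next_bit_is_one(n_ones, n_zeros):
    # Bayesian estimate under a Beta(0.5, 0.5) prior over the per-context probability:
    # P(B = 1 | N_1, N_0, c) = (N_1 + 0.5) / (N_1 + N_0 + 1)
    return (n_ones + 0.5) / (n_ones + n_zeros + 1.0)

# Example: the context c has been followed by 1 three times and by 0 once
print(p_next_bit_is_one(3, 1))   # 0.7
```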

Page 26:

NTM Experiments: Priority Sorting

β€’ Task: sort binary vectors according to their priorities
β€’ Training: 20 binary vectors with corresponding priorities; target output is the 16 highest-priority vectors


Page 27:

NTM Experiments: Learning Curve


Page 28:

Some Detailed Parameters


(Table: detailed parameters for NTM with a feedforward controller, NTM with an LSTM controller, and standard LSTM.)

Page 29:

Additional Information

β€’ Literature references:
Rumelhart, D. E.; Hinton, G. E.; Williams, R. J. (1986). "Learning representations by back-propagating errors". Nature 323 (6088): 533–536.
Williams, R. J. and Zipser, D. (1994). Gradient-based learning algorithms for recurrent networks and their computational complexity. In Back-propagation: Theory, Architectures and Applications. Hillsdale, NJ: Erlbaum.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780.
Bengio, Y. and LeCun, Y. (2007). Scaling learning algorithms towards AI. In Large-Scale Kernel Machines, Bottou, L., Chapelle, O., DeCoste, D., and Weston, J., Eds. MIT Press.

β€’ NTM implementation: https://github.com/shawntan/neural-turing-machines

β€’ NTM Reddit discussion: https://www.reddit.com/r/MachineLearning/comments/2m9mga/implementation_of_neural_turing_machines/


Page 30: