
Page 1

Neural Turing Machine

Author: Alex Graves, Greg Wayne, Ivo Danihelka

Presented By: Tinghui Wang (Steve)

Page 2

Introduction

• Neural Turing Machine: couples a neural network with external memory resources
• The combined system is analogous to a Turing Machine, but is differentiable from end to end
• NTM can infer simple algorithms!

Page 3

In Computation Theory…

Finite State Machine: $(\Sigma, S, s_0, \delta, F)$

• $\Sigma$: alphabet
• $S$: finite set of states
• $s_0$: initial state
• $\delta$: state transition function, $\delta: S \times \Sigma \to \mathcal{P}(S)$
• $F$: set of final states

Page 4

Pushdown Automaton: $(Q, \Sigma, \Gamma, \delta, q_0, Z, F)$

• $\Sigma$: input alphabet
• $\Gamma$: stack alphabet
• $Q$: finite set of states
• $q_0$: initial state
• $Z$: initial stack symbol
• $\delta$: state transition function, $\delta: Q \times \Sigma \times \Gamma \to \mathcal{P}(Q \times \Gamma^*)$
• $F$: set of final states

In Computation Theory…

Page 5

Turing Machine: $(Q, \Gamma, b, \Sigma, \delta, q_0, F)$

• $Q$: finite set of states
• $q_0$: initial state
• $\Gamma$: tape alphabet (finite)
• $b$: blank symbol (occurs infinitely often on the tape)
• $\Sigma$: input alphabet, $\Sigma = \Gamma \setminus \{b\}$
• $\delta$: transition function, $\delta: (Q \setminus F) \times \Gamma \to Q \times \Gamma \times \{L, R\}$
• $F$: set of final (accepting) states

Page 6

Neural Turing Machine - Add Learning to TM!!


Finite State Machine (Program) → Neural Network (I Can Learn!!)

Page 7

A Little History of Neural Networks

• 1950s: Frank Rosenblatt, Perceptron – classification based on a linear predictor
• 1969: Minsky – proof of the perceptron's limits: it cannot learn XOR
• 1980s: Back-propagation
• 1990s: Recurrent Neural Networks, Long Short-Term Memory
• Late 2000s – now: Deep Learning, fast computers

Page 8

Feed-Forward Neural Net and Back Propagation


Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. (8 October 1986). "Learning representations by back-propagating errors". Nature 323 (6088): 533–536.

• Total input to unit $j$: $x_j = \sum_i y_i w_{ji}$

• Output of unit $j$ (logistic function): $y_j = \frac{1}{1 + e^{-x_j}}$

• Total error (mean square): $E = \frac{1}{2} \sum_c \sum_j (d_{c,j} - y_{c,j})^2$

• Gradient descent (chain rule): $\frac{\partial E}{\partial w_{ji}} = \frac{\partial E}{\partial y_j} \frac{\partial y_j}{\partial x_j} \frac{\partial x_j}{\partial w_{ji}}$, with weight update $\Delta w_{ji} \propto -\frac{\partial E}{\partial w_{ji}}$

Page 9

Recurrent Neural Network

• $\mathbf{y}(t)$: n-tuple of outputs at time $t$
• $\mathbf{x}^{net}(t)$: m-tuple of external inputs at time $t$
• $U$: set of indices $k$ such that $x_k$ is the output of a unit in the network
• $I$: set of indices $k$ such that $x_k$ is an external input
• $x_k(t) = x_k^{net}(t)$ if $k \in I$, and $x_k(t) = y_k(t)$ if $k \in U$
• $y_k(t+1) = f_k(s_k(t+1))$, where $f_k$ is a differentiable squashing function
• $s_k(t+1) = \sum_{i \in I \cup U} w_{ki}\, x_i(t)$


R. J. Williams and D. Zipser. Gradient-based learning algorithms for recurrent networks and their computational complexity. In Back-propagation: Theory, Architectures and Applications. Hillsdale, NJ: Erlbaum, 1994.
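A small sketch (my own NumPy rendering, not from the slides) of the recurrence defined above: every unit's next state is a squashed weighted sum over the external inputs and the unit outputs, with tanh assumed as the squashing function.

```python
import numpy as np

def rnn_step(y, x_net, W):
    """One recurrence step: s_k(t+1) = Σ_{i∈I∪U} w_ki x_i(t), then y_k(t+1) = f_k(s_k(t+1)).
    y: unit outputs at time t, x_net: external inputs at time t, W: weights over [x_net, y]."""
    x = np.concatenate([x_net, y])   # x_i(t) for i in I ∪ U
    return np.tanh(W @ x)            # tanh assumed as the differentiable squashing function

rng = np.random.default_rng(1)
W = rng.normal(scale=0.1, size=(4, 3 + 4))   # 4 units (U), 3 external inputs (I)
y = np.zeros(4)
for x_net in rng.normal(size=(5, 3)):        # run 5 time steps
    y = rnn_step(y, x_net, W)
```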

Page 10

Recurrent Neural Network – Time Unrolling

• Error function:

$J(t) = -\frac{1}{2} \sum_{k \in U} e_k(t)^2 = -\frac{1}{2} \sum_{k \in U} \big( d_k(t) - y_k(t) \big)^2$

$J^{total}(t', t) = \sum_{\tau = t'+1}^{t} J(\tau)$

• Back-propagation through time:

$\varepsilon_k(t) = \frac{\partial J(t)}{\partial y_k(t)} = e_k(t) = d_k(t) - y_k(t)$

$\delta_k(\tau) = \frac{\partial J(t)}{\partial s_k(\tau)} = \frac{\partial J(t)}{\partial y_k(\tau)} \frac{\partial y_k(\tau)}{\partial s_k(\tau)} = f_k'(s_k(\tau))\, \varepsilon_k(\tau)$

$\varepsilon_k(\tau - 1) = \frac{\partial J(t)}{\partial y_k(\tau - 1)} = \sum_{l \in U} \frac{\partial J(t)}{\partial s_l(\tau)} \frac{\partial s_l(\tau)}{\partial y_k(\tau - 1)} = \sum_{l \in U} w_{lk}\, \delta_l(\tau)$

$\frac{\partial J^{total}}{\partial w_{ji}} = \sum_{\tau = t'+1}^{t} \frac{\partial J(t)}{\partial s_j(\tau)} \frac{\partial s_j(\tau)}{\partial w_{ji}}$
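A rough NumPy sketch (assumed, not from the slides) of these backward equations on top of the rnn_step recurrence above: the injected errors ε flow backwards through time, producing δ at each step and accumulating the weight gradient.

```python
import numpy as np

def bptt(xs, ds, W):
    """Back-propagation through time for the tanh recurrence above.
    xs: external inputs x_net(t), ds: targets d(t), W: weights over [x_net, y]."""
    n_in = len(xs[0])
    n_u = W.shape[0]
    ys, ins = [np.zeros(n_u)], []
    for x_net in xs:                          # forward pass, storing the concatenated inputs
        ins.append(np.concatenate([x_net, ys[-1]]))
        ys.append(np.tanh(W @ ins[-1]))
    dW, eps = np.zeros_like(W), np.zeros(n_u)
    for t in reversed(range(len(xs))):        # backward pass through time
        eps += ds[t] - ys[t + 1]              # inject ε_k(t) = d_k(t) - y_k(t)
        delta = (1.0 - ys[t + 1] ** 2) * eps  # δ_k(τ) = f'(s_k(τ)) ε_k(τ), f = tanh
        dW += np.outer(delta, ins[t])         # accumulate ∂J_total/∂w_ji
        eps = W[:, n_in:].T @ delta           # ε_k(τ-1) = Σ_l w_lk δ_l(τ)
    return dW                                 # ∂J_total/∂w; stepping W += lr*dW reduces the squared error

rng = np.random.default_rng(2)
W = rng.normal(scale=0.1, size=(3, 2 + 3))
dW = bptt(list(rng.normal(size=(4, 2))), list(rng.normal(size=(4, 3))), W)
```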

Page 11

RNN & Long Short-Term Memory

• RNN is Turing complete: with a proper configuration, it can simulate any sequence generated by a Turing Machine
• Error vanishing and exploding: consider a single unit with a self loop (see the numeric sketch after this list):

$\delta_k(\tau) = \frac{\partial J(t)}{\partial s_k(\tau)} = \frac{\partial J(t)}{\partial y_k(\tau)} \frac{\partial y_k(\tau)}{\partial s_k(\tau)} = f_k'(s_k(\tau))\, \varepsilon_k(\tau) = f_k'(s_k(\tau))\, w_{kk}\, \delta_k(\tau + 1)$

• Long Short-Term Memory
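A tiny numeric sketch (not from the slides) of why this recursion is a problem: unrolling it for T steps multiplies δ by f'(s)·w_kk at every step, so the backpropagated signal either decays to zero or blows up exponentially.

```python
import numpy as np

def unrolled_delta(w_kk, s=0.5, T=50, delta_T=1.0):
    """Apply δ_k(τ) = f'(s_k(τ)) * w_kk * δ_k(τ+1) backwards for T steps (f = logistic, constant s)."""
    y = 1.0 / (1.0 + np.exp(-s))
    f_prime = y * (1.0 - y)            # logistic derivative, at most 0.25
    delta = delta_T
    for _ in range(T):
        delta *= f_prime * w_kk
    return delta

print(unrolled_delta(w_kk=1.0))    # vanishes: on the order of 1e-32 after 50 steps
print(unrolled_delta(w_kk=10.0))   # explodes: on the order of 1e+18 after 50 steps
```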


Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780.

Page 12

Purpose of Neural Turing Machine

• Enrich the capabilities of standard RNNs to simplify the solution of algorithmic tasks

Page 13

NTM Read/Write Operation


(Figure: a memory bank of N locations, each holding an M-size vector, addressed through the weights $w_1, \ldots, w_N$.)

• Read: $r_t = \sum_i w_t(i)\, M_t(i)$
• Erase: $M_t(i) = M_{t-1}(i)\,\big[\mathbf{1} - w_t(i)\, e_t\big]$
• Add: $M_t(i) \mathrel{+}= w_t(i)\, a_t$

where
• $w_t(i)$: vector of weights over the N locations emitted by a read or write head at time t
• $e_t$: erase vector, emitted by the write head
• $a_t$: add vector, emitted by the write head
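A minimal NumPy sketch (assumed, not the authors' code) of these three memory operations on an N×M memory matrix.

```python
import numpy as np

def read(M, w):
    """r_t = Σ_i w_t(i) M_t(i): weighted combination of the memory rows."""
    return w @ M

def write(M, w, e, a):
    """Erase then add, as on the slide: M_t(i) = M_{t-1}(i) * (1 - w_t(i) e_t) + w_t(i) a_t."""
    M = M * (1.0 - np.outer(w, e))   # erase
    return M + np.outer(w, a)        # add

N, M_size = 8, 4
memory = np.zeros((N, M_size))
w = np.eye(N)[2]                     # weighting focused entirely on location 2
memory = write(memory, w, e=np.ones(M_size), a=np.array([1.0, 0.0, 1.0, 0.0]))
print(read(memory, w))               # -> [1. 0. 1. 0.]
```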

Page 14

Attention & Focus

• Attention and focus are adjusted by the weighting vector across the whole memory bank!

Page 15

Content Based Addressing

• Find similar data in memory

$\boldsymbol{k}_t$: key vector
$\beta_t$: key strength

$w_t^c(i) = \frac{\exp\big(\beta_t\, K(\boldsymbol{k}_t, \boldsymbol{M}_t(i))\big)}{\sum_j \exp\big(\beta_t\, K(\boldsymbol{k}_t, \boldsymbol{M}_t(j))\big)}$

$K(\boldsymbol{u}, \boldsymbol{v}) = \frac{\boldsymbol{u} \cdot \boldsymbol{v}}{\lVert\boldsymbol{u}\rVert\,\lVert\boldsymbol{v}\rVert}$

Page 16

May not depend on content only…

• Gating Content Addressing

$g_t$: interpolation gate in the range $(0, 1)$

$\boldsymbol{w}_t^g = g_t\, \boldsymbol{w}_t^c + (1 - g_t)\, \boldsymbol{w}_{t-1}$

Page 17

Convolutional Shift – Location Addressing

$\boldsymbol{s}_t$: shift weighting – a convolution vector of size N

$\tilde{w}_t(i) = \sum_{j=0}^{N-1} w_t^g(j)\, s_t(i - j)$, with indices taken modulo N (circular convolution)

Page 18

Convolutional Shift – Location Addressing

$\gamma_t$: sharpening factor

$w_t(i) = \frac{\tilde{w}_t(i)^{\gamma_t}}{\sum_j \tilde{w}_t(j)^{\gamma_t}}$

Page 19

Neural Turing Machine - Experiments

• Goal: demonstrate that the NTM is able to solve the problems by learning compact internal programs
• Three architectures: NTM with a feedforward controller, NTM with an LSTM controller, and a standard LSTM network
• Applications: Copy, Repeat Copy, Associative Recall, Dynamic N-Grams, Priority Sort

Page 20

NTM Experiments: Copy

• Task: store and recall a long sequence of arbitrary information
• Training: 8-bit random vectors, with sequence lengths of 1–20
• No inputs are presented while the network generates the targets


(Figure: copy-task results for NTM and LSTM.)
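A sketch (my assumption of the data layout, not the authors' code) of how copy-task episodes of this shape could be generated: a random-length sequence of 8-bit vectors plus a delimiter channel, with the target being the same sequence and no inputs presented during recall.

```python
import numpy as np

def copy_example(rng, width=8, max_len=20):
    """One copy-task episode: inputs carry the sequence then a delimiter; targets are the sequence."""
    length = rng.integers(1, max_len + 1)
    seq = rng.integers(0, 2, size=(length, width)).astype(float)
    inputs = np.zeros((2 * length + 1, width + 1))
    inputs[:length, :width] = seq        # present the sequence
    inputs[length, width] = 1.0          # delimiter flag: "now reproduce it"
    targets = np.zeros((2 * length + 1, width))
    targets[length + 1:] = seq           # no inputs while the targets are produced
    return inputs, targets

inputs, targets = copy_example(np.random.default_rng(0))
```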

Page 21

NTM Experiments: Copy – Learning Curve

Page 22

NTM Experiments: Repeated Copy


• Task: repeat a sequence a specified number of times
• Training: random-length sequences of random binary vectors, followed by a scalar value indicating the desired number of copies

Page 23

NTM Experiments: Repeated Copy – Learning Curve

Page 24

NTM Experiments: Associative Recall


• Task: after a sequence of items has been propagated through the network, present one item and ask the network to produce the next item in the sequence
• Training: each item is composed of three six-bit binary vectors, with 2–8 items per episode

Page 25

NTM Experiments: Dynamic N-Grams

• Task: learn an N-gram model – rapidly adapt to a new predictive distribution
• Training: 6-gram distributions over binary sequences; 200 successive bits are generated using a look-up table of 32 probabilities drawn from a Beta(0.5, 0.5) distribution
• Compare to the optimal estimator:

$P(B = 1 \mid N_1, N_0, \boldsymbol{c}) = \frac{N_1 + 0.5}{N_1 + N_0 + 1}$


(Figure: NTM predictions for the contexts c = 01111 and c = 00010.)
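A tiny sketch (not from the slides) of the optimal estimator, where N1 and N0 count the ones and zeros observed so far following context c.

```python
def optimal_estimate(n_ones, n_zeros):
    """Bayesian optimal estimator under a Beta(0.5, 0.5) prior:
    P(B = 1 | N1, N0, c) = (N1 + 0.5) / (N1 + N0 + 1)."""
    return (n_ones + 0.5) / (n_ones + n_zeros + 1)

# After a context has been followed by a 1 four times and a 0 once:
print(optimal_estimate(4, 1))  # -> 0.75
```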

Page 26

NTM Experiments: Priority Sorting

• Task: sort binary vectors according to their priorities
• Training: 20 binary vectors with corresponding priorities; the target output is the 16 highest-priority vectors

Page 27

NTM Experiments: Learning Curve

Page 28

Some Detailed Parameters


(Table: detailed experiment parameters for NTM with a feedforward controller, NTM with an LSTM controller, and a standard LSTM network.)

Page 29

Additional Information

• Literature references:
Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. (8 October 1986). "Learning representations by back-propagating errors". Nature 323 (6088): 533–536.
R. J. Williams and D. Zipser. Gradient-based learning algorithms for recurrent networks and their computational complexity. In Back-propagation: Theory, Architectures and Applications. Hillsdale, NJ: Erlbaum, 1994.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780.
Y. Bengio and Y. LeCun. Scaling Learning Algorithms towards AI. In Large-Scale Kernel Machines, Bottou, L., Chapelle, O., DeCoste, D., and Weston, J., Eds., MIT Press, 2007.
• NTM implementation: https://github.com/shawntan/neural-turing-machines
• NTM Reddit discussion: https://www.reddit.com/r/MachineLearning/comments/2m9mga/implementation_of_neural_turing_machines/
