
Recurrent Neural Networks

Fall 2020 (2020-10-16)

CMPT 413 / 825: Natural Language Processing

How to model sequences using neural networks?

(Some slides adapted from Chris Manning, Abigail See, Andrej Karpathy)

SFUNatLangLab

Adapted from slides from Anoop Sarkar, Danqi Chen, Karthik Narasimhan, and Justin Johnson

1

Overview

• What is a recurrent neural network (RNN)?
• Simple RNNs
• Backpropagation through time
• Long short-term memory networks (LSTMs)
• Applications
• Variants: Stacked RNNs, Bidirectional RNNs

2

Recurrent neural networks (RNNs)

A class of neural networks that can handle variable-length inputs

A function: y = RNN(x1, x2, …, xn) ∈ ℝd

where x1, …, xn ∈ ℝdin

3

Recurrent neural networks (RNNs)

RNNs have proven to be a highly effective approach to language modeling, sequence tagging, and text classification tasks:

(Figure: examples of language modeling, sequence tagging, and text classification, e.g. "The movie sucks ." → 👎)

4

Recurrent neural networks (RNNs)

RNNs form the basis of modern approaches to machine translation, question answering, and dialogue:

5

Why variable-length? Recall the feedforward neural LMs we learned earlier:

The dogs are barking

the dogs in the neighborhood are ___

x = [e_the, e_dogs, e_are] ∈ ℝ3d

(fixed-window size = 3)

6

RNNs vs Feedforward NNs

7

RNNs for sequence processing

(Figure: RNN input/output configurations for vanilla neural networks, text generation, text classification, sequence tagging, and neural machine translation)

8

Recurrent Neural Language Models

9

Language Modeling

Predict the probability of a sequence of words:

<latexit sha1_base64="8u8r93p2B+o46uuE0CaGrvKG1tw=">AAACJHicbVBNSwMxFMzWr/pd9eglWIQWStkVRQ8tFL14rNCq0NYlm03b0OxmSd5ayrp/xoO/xYOXKh68+FtMaw9qHQgMM/N4eeNFgmuw7Q8rs7C4tLySXV1b39jc2s7t7F5rGSvKmlQKqW49opngIWsCB8FuI8VI4Al24w0uJv7NPVOay7ABo4h1AtILeZdTAkZyc5V6QRdxFdcLQ9cp4bbwJegSHrqNidqOlPTdBKpOepc00mkK8IOxkwqkRTeXt8v2FHieODOSRzPU3dy47UsaBywEKojWLceOoJMQBZwKlq61Y80iQgekx1qGhiRgupNMr0zxoVF83JXKvBDwVP05kZBA61HgmWRAoK//ehPxP68VQ/esk/AwioGF9HtRNxYYJJ5Uhn2uGAUxMoRQxc1fMe0TRSiYYk0Hzt+L58n1Udk5KdtXx/na+ayNLNpHB6iAHHSKaugS1VETUfSIntEYvVpP1ov1Zr1/RzPWbGYP/YL1+QXWcqIq</latexit>

P (s) = P (w1, . . . , wT ) =TY

t=1

P (wt|w<t)

10

with n-grams: P(wt ∣ w<t) ≈ P(wt ∣ wt−n+1, …, wt−1)

with HMMs: P(wt ∣ w<t) ≈ P(wt ∣ ht) P(ht ∣ ht−n+1, …, ht−1)

with neural networks: P(wt ∣ w<t) ≈ P(wt ∣ ϕ(w1, …, wt−1))

with fixed window: P(wt ∣ w<t) ≈ P(wt ∣ ϕ(wt−n+1, …, wt−1))

with RNNs: P(wt ∣ w<t) ≈ P(wt ∣ ht), ht = f(ht−1, wt−1)

Simple (vanilla) RNNs

h0 ∈ ℝd is an initial state

ht = fW(ht−1, xt) ∈ ℝd

ht = g(Whhht−1 + Uxt + b) ∈ ℝd

Simple (vanilla) RNNs:

Whh ∈ ℝd×d, U ∈ ℝd×din, b ∈ ℝd

g: nonlinearity (e.g., tanh)

ht: hidden state, which stores information from x1 to xt

(Figure: x → RNN → y)

Output label at each time step: yt = softmax(Woht), where Wo ∈ ℝ|V|×d

(In ht = fW(ht−1, xt): ht is the new state, ht−1 the old state, xt the input at time t, and fW a function with weights W.)

11

Simple RNNs

Key idea: apply the same weights W repeatedly

ht = g(Wht−1 + Uxt + b) ∈ ℝd

12
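As a concrete illustration of reusing the same weights at every step, here is a minimal NumPy sketch of the recurrence ht = g(W ht−1 + U xt + b); the sizes and random initialization are illustrative assumptions, not values from the slides.

```python
# A minimal sketch of the simple (vanilla) RNN recurrence in NumPy.
# d, d_in, and the random weights are illustrative assumptions.
import numpy as np

d, d_in = 4, 3                                  # hidden size, input size
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(d, d))          # hidden-to-hidden weights
U = rng.normal(scale=0.1, size=(d, d_in))       # input-to-hidden weights
b = np.zeros(d)                                 # bias

def rnn_step(h_prev, x_t):
    # h_t = g(W h_{t-1} + U x_t + b), with g = tanh
    return np.tanh(W @ h_prev + U @ x_t + b)

xs = [rng.normal(size=d_in) for _ in range(5)]  # a length-5 input sequence
h = np.zeros(d)                                 # h_0: initial state
for x_t in xs:                                  # the same W, U, b are reused at every step
    h = rnn_step(h, x_t)
print(h.shape)                                  # (d,)
```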

Recurrent Neural Language Models (RNNLMs)

P(w1, w2, …, wn) = P(w1) × P(w2 ∣ w1) × P(w3 ∣ w1, w2) × … × P(wn ∣ w1, w2, …, wn−1)

= P(w1 ∣ h0) × P(w2 ∣ h1) × P(w3 ∣ h2) × … × P(wn ∣ hn−1)

• Denote yt = softmax(Woht), where Wo ∈ ℝ|V|×d

• Cross-entropy loss:

L(θ) = −(1/n) ∑t=1..n log yt−1(wt)

(Example: "the students opened their ___" → exams)

θ = {W, U, b, Wo, E}

13
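To make the RNNLM and its cross-entropy loss concrete, here is a minimal PyTorch sketch; the vocabulary size, hidden size, and toy batch are illustrative assumptions. The parameters θ = {W, U, b, Wo, E} correspond roughly to rnn, out, and emb below.

```python
# A minimal RNN language model sketch in PyTorch; sizes and the toy batch are
# illustrative assumptions. emb ~ E, rnn ~ {W, U, b}, out ~ Wo.
import torch
import torch.nn as nn

vocab_size, d = 1000, 64
emb = nn.Embedding(vocab_size, d)                  # word embeddings E
rnn = nn.RNN(d, d, batch_first=True)               # simple RNN with weights W, U, b
out = nn.Linear(d, vocab_size)                     # output projection Wo

tokens = torch.randint(0, vocab_size, (1, 6))      # one toy sequence of 6 word ids
inputs, targets = tokens[:, :-1], tokens[:, 1:]    # predict w_t from the prefix w_<t
h, _ = rnn(emb(inputs))                            # hidden states h_1 ... h_{n-1}
logits = out(h)                                    # scores Wo h_t (softmax is inside the loss)
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
print(loss.item())                                 # average cross-entropy over time steps
```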

Progress on language models

On the Penn Treebank (PTB) dataset; metric: perplexity

(Mikolov and Zweig, 2012): Context dependent recurrent neural network language model

KN5: Kneser-Ney 5-gram

14

Progress on language models

(Yang et al, 2018): Breaking the Softmax Bottleneck: A High-Rank RNN Language Model

On the Penn Treebank (PTB) dataset; metric: perplexity

dropping perplexity

15

Training the RNN

16

RNN Computation Graph

(Figure, built up over slides 17–22: the RNN is unrolled over time. Starting from h0, the same cell fW, with shared weights W, maps (ht−1, xt) to ht for t = 1, …, T; each ht produces an output yt, each yt contributes a loss Lt, and the per-step losses are summed into the total loss L.)

Training RNNLMs

• Backpropagation? Yes, but not that simple!

• The algorithm is called Backpropagation Through Time (BPTT).

23

Backpropagation through time

h1 = g(Wh0 + Ux1 + b)

h2 = g(Wh1 + Ux2 + b)

h3 = g(Wh2 + Ux3 + b)

L3 = − log y3(w4)

You should know how to compute ∂L3/∂h3.

∂L3/∂W = (∂L3/∂h3)(∂h3/∂W) + (∂L3/∂h3)(∂h3/∂h2)(∂h2/∂W) + (∂L3/∂h3)(∂h3/∂h2)(∂h2/∂h1)(∂h1/∂W)

In general: ∂L/∂W = (1/n) ∑t=1..n ∑k=1..t (∂Lt/∂ht) (∏j=k+1..t ∂hj/∂hj−1) (∂hk/∂W)

24

Loss

Forward through entire sequence to compute loss, then backward through entire sequence to compute gradient

25

Truncated backpropagation through time

• Backpropagation is very expensive for long sequences

• Run forward and backward through chunks of the sequence instead of whole sequence

• Carry hidden states forward in time forever, but only backpropagate for some smaller number of steps

26
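A possible PyTorch sketch of truncated BPTT is shown below: the hidden state is carried forward across chunks but detached, so gradients are only backpropagated within each chunk. The chunk length, model sizes, learning rate, and toy data are illustrative assumptions.

```python
# A sketch of truncated backpropagation through time in PyTorch: carry the hidden
# state across chunks, but detach it so gradients flow only within the current chunk.
import torch
import torch.nn as nn

vocab_size, d, chunk = 1000, 64, 32
emb = nn.Embedding(vocab_size, d)
rnn = nn.RNN(d, d, batch_first=True)
out = nn.Linear(d, vocab_size)
params = list(emb.parameters()) + list(rnn.parameters()) + list(out.parameters())
opt = torch.optim.SGD(params, lr=0.1)

data = torch.randint(0, vocab_size, (1, 10 * chunk + 1))   # one long toy sequence
h = None
for start in range(0, data.size(1) - 1, chunk):
    inputs = data[:, start:start + chunk]
    targets = data[:, start + 1:start + 1 + chunk]
    hs, h = rnn(emb(inputs), h)                            # forward through this chunk only
    loss = nn.functional.cross_entropy(out(hs).reshape(-1, vocab_size), targets.reshape(-1))
    opt.zero_grad()
    loss.backward()                                        # backprop within the chunk
    opt.step()
    h = h.detach()                                         # keep the state, cut the gradient path
```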

Vanishing gradients

27

(Figure, slides 28–29: a vanilla RNN cell; ht−1 and xt are stacked, multiplied by W, and passed through tanh to produce ht. Backpropagation flows through this cell at every time step.)

Bengio et al., “Learning long-term dependencies with gradient descent is difficult”, IEEE Transactions on Neural Networks, 1994
Pascanu et al., “On the difficulty of training recurrent neural networks”, ICML 2013

Vanishing gradients

30

Vanishing Gradients

31

Computing the gradient of h0 involves many factors of W (and repeated tanh)

Largest singular value of W > 1: exploding gradients
Largest singular value of W < 1: vanishing gradients

Remedy for exploding gradients: gradient clipping (scale the gradient if its norm is too big)
Remedy for vanishing gradients: change the RNN architecture

32
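For reference, gradient clipping is a one-liner in PyTorch; a minimal sketch, where the model, input, and max_norm threshold are illustrative assumptions:

```python
# A minimal gradient clipping sketch in PyTorch.
import torch
import torch.nn as nn

model = nn.RNN(16, 32)                      # input size 16, hidden size 32
x = torch.randn(10, 1, 16)                  # (T x B x F) toy input
out, _ = model(x)
out.sum().backward()                        # a stand-in loss, just to produce gradients
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)  # rescale if the norm is too big
# ... then call optimizer.step()
```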

Different RNN cells

33

Long Short-term Memory (LSTM)

• A type of RNN proposed by Hochreiter and Schmidhuber in 1997 as a solution to the vanishing gradients problem

ht = f(ht−1, xt) ∈ ℝd

• Work extremely well in practice

• Basic idea: turning multiplication into addition

• Use “gates” to control how much information to add/erase

• At each timestep, there is a hidden state ht ∈ ℝd and also a cell state ct ∈ ℝd

• ct stores long-term information

• We write/erase ct after each step

• We read ht from ct

34

Long Short-term Memory (LSTM)

There are 4 gates:

• Input gate (how much to write): it = σ(W(i)ht−1 + U(i)xt + b(i)) ∈ ℝd

• Forget gate (how much to erase): ft = σ(W( f )ht−1 + U( f )xt + b( f )) ∈ ℝd

• Output gate (how much to reveal): ot = σ(W(o)ht−1 + U(o)xt + b(o)) ∈ ℝd

• New memory cell (what to write): gt = tanh(W(c)ht−1 + U(c)xt + b(c)) ∈ ℝd

How many parameters in total?

• Final memory cell: ct = ft ⊙ ct−1 + it ⊙ gt

• Final hidden state: ht = ot ⊙ tanh(ct)

(⊙: element-wise product)

Backpropagation from ct to ct−1 involves only element-wise multiplication by ft, with no matrix multiplication by W

Answer: 4 × (d·d + d·din + d) = 4d(d + din + 1) parameters, since each of the 4 gates has W ∈ ℝd×d, U ∈ ℝd×din, and b ∈ ℝd

35
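The gate equations above can be transcribed almost directly into code. Below is a minimal NumPy sketch of a single LSTM step; the sizes and random initialization are illustrative assumptions.

```python
# A NumPy sketch of one LSTM step, following the gate equations above.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d, d_in = 4, 3
rng = np.random.default_rng(0)
P = {name: (rng.normal(scale=0.1, size=(d, d)),     # W^(name)
            rng.normal(scale=0.1, size=(d, d_in)),  # U^(name)
            np.zeros(d))                            # b^(name)
     for name in ("i", "f", "o", "c")}

def lstm_step(h_prev, c_prev, x_t):
    gate = lambda n, g: g(P[n][0] @ h_prev + P[n][1] @ x_t + P[n][2])
    i_t = gate("i", sigmoid)              # input gate: how much to write
    f_t = gate("f", sigmoid)              # forget gate: how much to erase
    o_t = gate("o", sigmoid)              # output gate: how much to reveal
    g_t = gate("c", np.tanh)              # new memory cell candidate
    c_t = f_t * c_prev + i_t * g_t        # final memory cell (element-wise products)
    h_t = o_t * np.tanh(c_t)              # final hidden state
    return h_t, c_t

h, c = np.zeros(d), np.zeros(d)
h, c = lstm_step(h, c, rng.normal(size=d_in))
```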

LSTM cell intuitively

36

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Long Short-term Memory (LSTM)

• LSTM doesn’t guarantee that there is no vanishing/exploding gradient, but it does provide an easier way for the model to learn long-distance dependencies

• LSTMs were invented in 1997 but only started working well around 2013–2015.

37

Is the LSTM architecture optimal?

(Jozefowicz et al, 2015): An Empirical Exploration of Recurrent Network Architectures

38

Simple RNN vs GRU vs LSTM

Simple RNN GRU LSTM

39

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Batching

• Apply RNNs to batches of sequences

• Present the data as a 3D tensor of shape (T × B × F)

40

Batching

• Use a mask matrix to aid computations that should ignore padded zeros, e.g. (see the sketch after this slide):

1 1 1 1 0 0

1 0 0 0 0 0

1 1 1 1 1 1

1 1 1 0 0 0

41
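A possible PyTorch sketch of this masking idea: build a (T × B) mask from the sequence lengths and use it to average a per-step loss over real tokens only. The lengths, feature size, and the stand-in loss are illustrative assumptions.

```python
# A sketch of padding + masking in PyTorch.
import torch
import torch.nn as nn

lengths = torch.tensor([4, 1, 6, 3])                   # as in the mask shown above
B, T, F = len(lengths), int(lengths.max()), 8
x = torch.randn(T, B, F)                               # padded (T x B x F) tensor
mask = (torch.arange(T).unsqueeze(1) < lengths.unsqueeze(0)).float()  # (T, B): 1 = real token

rnn = nn.RNN(F, 16)
out, _ = rnn(x)                                        # (T, B, 16), padded steps included
per_step_loss = out.pow(2).mean(dim=-1)                # stand-in for a real per-step loss
loss = (per_step_loss * mask).sum() / mask.sum()       # average over real tokens only
```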

Batching

• Sorting (at least partially) can help create more efficient mini-batches
• However, the input is then less randomized

(Figure: unsorted vs. sorted mini-batches)

42

Overview

• What is a recurrent neural network (RNN)?
• Simple RNNs
• Backpropagation through time
• Long short-term memory networks (LSTMs)
• Applications
• Variants: Stacked RNNs, Bidirectional RNNs

43

Application: Text Generation

You can generate text by repeated sampling: the sampled output becomes the next step's input.

44
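A minimal PyTorch sketch of this sampling loop; the model sizes, the start-token id (0), and the generation length are illustrative assumptions.

```python
# A sketch of text generation by repeated sampling: the sampled token is fed back
# as the next step's input.
import torch
import torch.nn as nn

vocab_size, d = 1000, 64
emb = nn.Embedding(vocab_size, d)
rnn = nn.RNN(d, d, batch_first=True)
out = nn.Linear(d, vocab_size)

token, h, generated = torch.tensor([[0]]), None, []    # assume id 0 is a start token
for _ in range(20):
    hs, h = rnn(emb(token), h)                         # one RNN step, carrying the state h
    probs = torch.softmax(out(hs[:, -1]), dim=-1)      # distribution over the next word
    token = torch.multinomial(probs, num_samples=1)    # sample; this becomes the next input
    generated.append(token.item())
print(generated)                                       # a list of sampled word ids
```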

Fun with RNNs

Andrej Karpathy “The Unreasonable Effectiveness of Recurrent Neural Networks”

Examples: Obama speeches, LaTeX generation

45

Application: Sequence Tagging

P(yi = k) = softmaxk(Wohi), where Wo ∈ ℝC×d

L = −(1/n) ∑i=1..n log P(yi = k), where k is the gold tag of word i

Input: a sentence of n words: x1, …, xn

Output: y1, …, yn, yi ∈ {1,…C}

46
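A minimal PyTorch sketch of RNN sequence tagging, with one softmax over C tags per word; the vocabulary size, tag count, and toy batch are illustrative assumptions.

```python
# A sketch of RNN sequence tagging: one softmax over C tags at every position.
import torch
import torch.nn as nn

vocab_size, d, C = 1000, 64, 5
emb = nn.Embedding(vocab_size, d)
rnn = nn.RNN(d, d, batch_first=True)
out = nn.Linear(d, C)                              # Wo in R^{C x d}

words = torch.randint(0, vocab_size, (1, 7))       # x_1 ... x_n
tags = torch.randint(0, C, (1, 7))                 # gold y_1 ... y_n
h, _ = rnn(emb(words))                             # h_i for every word
logits = out(h)                                    # one score vector per word
loss = nn.functional.cross_entropy(logits.reshape(-1, C), tags.reshape(-1))
```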

Application: Text Classification

the movie was terribly exciting !

hn

P(y = k) = softmaxk(Wohn), where Wo ∈ ℝC×d

Input: a sentence of n words

Output: y ∈ {1,2,…, C}

47
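A minimal PyTorch sketch of RNN text classification from the final hidden state hn; the sizes and the toy input are illustrative assumptions.

```python
# A sketch of RNN text classification from the final hidden state h_n.
import torch
import torch.nn as nn

vocab_size, d, C = 1000, 64, 2
emb = nn.Embedding(vocab_size, d)
rnn = nn.RNN(d, d, batch_first=True)
out = nn.Linear(d, C)                              # Wo in R^{C x d}

words = torch.randint(0, vocab_size, (1, 6))       # e.g. "the movie was terribly exciting !"
_, h_n = rnn(emb(words))                           # final hidden state, shape (1, 1, d)
logits = out(h_n.squeeze(0))                       # scores for P(y = k) = softmax_k(Wo h_n)
pred = logits.argmax(dim=-1)                       # predicted class
```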

Multi-layer RNNs

• RNNs are already “deep” on one dimension (unroll over time steps)

• We can also make them “deep” in another dimension by applying multiple RNNs

• Multi-layer RNNs are also called stacked RNNs.

48

Multi-layer RNNs

The hidden states from RNN layer i are the inputs to RNN layer i + 1

• In practice, using 2 to 4 layers is common (usually better than 1 layer)
• Transformer-based networks can be up to 24 layers with lots of skip-connections

49
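In frameworks such as PyTorch, stacking is exposed through a num_layers argument; a minimal sketch, with sizes as illustrative assumptions:

```python
# Stacked (multi-layer) RNNs in PyTorch: num_layers stacks RNN layers so that
# layer i's hidden states are the inputs to layer i+1.
import torch
import torch.nn as nn

rnn = nn.LSTM(input_size=64, hidden_size=64, num_layers=3, batch_first=True)
x = torch.randn(2, 10, 64)            # batch of 2 sequences, 10 steps each
out, (h_n, c_n) = rnn(x)              # out: top-layer hidden states, shape (2, 10, 64)
print(h_n.shape)                      # (num_layers, batch, hidden) = (3, 2, 64)
```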

Bidirectional RNNs

• Bidirectionality is important in language representations:

terribly:
• left context “the movie was”
• right context “exciting !”

50

Bidirectional RNNs

ht = f(ht−1, xt) ∈ ℝd

Forward: h→t = f1(h→t−1, xt), t = 1, 2, …, n

Backward: h←t = f2(h←t+1, xt), t = n, n−1, …, 1

Concatenate: ht = [h→t ; h←t] ∈ ℝ2d

51
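In PyTorch this corresponds to bidirectional=True, which runs a forward and a backward RNN and concatenates their hidden states; a minimal sketch, with sizes as illustrative assumptions:

```python
# Bidirectional RNNs in PyTorch: outputs concatenate forward and backward states,
# so they have size 2*d.
import torch
import torch.nn as nn

d = 64
rnn = nn.LSTM(input_size=d, hidden_size=d, bidirectional=True, batch_first=True)
x = torch.randn(2, 10, d)
out, _ = rnn(x)
print(out.shape)                      # (2, 10, 2*d): [forward h_t ; backward h_t]
```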

Bidirectional RNNs

• Sequence tagging: Yes!
• Text classification: Yes! With slight modifications.
• Text generation: No. Why?

(Figure: a bidirectional RNN over “the movie was terribly exciting !”; the sentence encoding is an element-wise mean/max over the hidden states.)

52
