sta141c: big data & high performance statistical computingchohsieh/teaching/sta141c... · time...

17
STA141C: Big Data & High Performance Statistical Computing Lecture 15: Recurrent Neural Networks Cho-Jui Hsieh UC Davis June 7, 2018

Upload: others

Post on 08-Aug-2020

7 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: STA141C: Big Data & High Performance Statistical Computingchohsieh/teaching/STA141C... · Time Series/Sequence Data Input: fx 1;x 2; ;x Tg Each x t is the feature at time step t Each

STA141C: Big Data & High Performance StatisticalComputing

Lecture 15: Recurrent Neural Networks

Cho-Jui HsiehUC Davis

June 7, 2018

Page 2: STA141C: Big Data & High Performance Statistical Computingchohsieh/teaching/STA141C... · Time Series/Sequence Data Input: fx 1;x 2; ;x Tg Each x t is the feature at time step t Each

Recurrent Neural Network

Page 3: STA141C: Big Data & High Performance Statistical Computingchohsieh/teaching/STA141C... · Time Series/Sequence Data Input: fx 1;x 2; ;x Tg Each x t is the feature at time step t Each

Time Series/Sequence Data

Input: {x1, x2, · · · , xT}Each xt is the feature at time step tEach xt can be an d-dimensional vector

Output: {y1, y2, · · · , yT}Each yt is the output at step tMulti-class output or Regression output:

yt ∈ {1, 2, · · · , L} or yt ∈ R

Page 4: STA141C: Big Data & High Performance Statistical Computingchohsieh/teaching/STA141C... · Time Series/Sequence Data Input: fx 1;x 2; ;x Tg Each x t is the feature at time step t Each

Example: Time Series Prediction

Climate Data:

xt : temperature at time tyt : temperature (or temperature change) at time t + 1

Stock Price: Predicting stock price

Page 5: STA141C: Big Data & High Performance Statistical Computingchohsieh/teaching/STA141C... · Time Series/Sequence Data Input: fx 1;x 2; ;x Tg Each x t is the feature at time step t Each

Example: Time Series Prediction

Climate Data:

xt : temperature at time tyt : temperature (or temperature change) at time t + 1

Stock Price: Predicting stock price

Page 6: STA141C: Big Data & High Performance Statistical Computingchohsieh/teaching/STA141C... · Time Series/Sequence Data Input: fx 1;x 2; ;x Tg Each x t is the feature at time step t Each

Example: Language Modeling

xt : one-hot encoding to represent the word at step t([0, . . . , 0, 1, 0, . . . , 0])yt : the next word

yt ∈ {1, · · · ,V } V: Vocabulary size

Page 7: STA141C: Big Data & High Performance Statistical Computingchohsieh/teaching/STA141C... · Time Series/Sequence Data Input: fx 1;x 2; ;x Tg Each x t is the feature at time step t Each

Example: Language Modeling

xt : one-hot encoding to represent the word at step t([0, . . . , 0, 1, 0, . . . , 0])yt : the next word

yt ∈ {1, · · · ,V } V: Vocabulary size

Page 8: STA141C: Big Data & High Performance Statistical Computingchohsieh/teaching/STA141C... · Time Series/Sequence Data Input: fx 1;x 2; ;x Tg Each x t is the feature at time step t Each

Example: POS Tagging

Part of Speech Tagging:

Labeling words with their Part-Of-Speech (Noun, Verb, Adjective,· · · )xt : a vector to represent the word at step t

yt : label of word t

Page 9: STA141C: Big Data & High Performance Statistical Computingchohsieh/teaching/STA141C... · Time Series/Sequence Data Input: fx 1;x 2; ;x Tg Each x t is the feature at time step t Each

Recurrent Neural Network (RNN)

xt : t-th input

st : hidden state at time t (“memory” of the network)

st = f (Uxt + Wst−1)

W : transition matrix s0 usually set to be 0

Predicted output at time t:

ot = arg maxi

(V st)i

Page 10: STA141C: Big Data & High Performance Statistical Computingchohsieh/teaching/STA141C... · Time Series/Sequence Data Input: fx 1;x 2; ;x Tg Each x t is the feature at time step t Each

Recurrent Neural Network (RNN)

Training: Find U,W ,V to minimize empirical loss:

Loss of a sequence:T∑t=1

loss(V st , yt)

(st is a function of U,W ,V )

Loss on the whole dataset:

Average loss of all sequence

Solve by Stochastic Gradient Descent (SGD)

Page 11: STA141C: Big Data & High Performance Statistical Computingchohsieh/teaching/STA141C... · Time Series/Sequence Data Input: fx 1;x 2; ;x Tg Each x t is the feature at time step t Each

Recurrent Neural Network (RNN)

Training: Find U,W ,V to minimize empirical loss:

Loss of a sequence:T∑t=1

loss(V st , yt)

(st is a function of U,W ,V )

Loss on the whole dataset:

Average loss of all sequence

Solve by Stochastic Gradient Descent (SGD)

Page 12: STA141C: Big Data & High Performance Statistical Computingchohsieh/teaching/STA141C... · Time Series/Sequence Data Input: fx 1;x 2; ;x Tg Each x t is the feature at time step t Each

Recurrent Neural Network (RNN)

Training: Find U,W ,V to minimize empirical loss:

Loss of a sequence:T∑t=1

loss(V st , yt)

(st is a function of U,W ,V )

Loss on the whole dataset:

Average loss of all sequence

Solve by Stochastic Gradient Descent (SGD)

Page 13: STA141C: Big Data & High Performance Statistical Computingchohsieh/teaching/STA141C... · Time Series/Sequence Data Input: fx 1;x 2; ;x Tg Each x t is the feature at time step t Each

RNN: Text Classification

Not necessary to output at each step

Text Classification:Sentence → category

Output only at the final step

Model: add a fully connected network to the final embedding

Page 14: STA141C: Big Data & High Performance Statistical Computingchohsieh/teaching/STA141C... · Time Series/Sequence Data Input: fx 1;x 2; ;x Tg Each x t is the feature at time step t Each

RNN: Neural Machine Translation

Page 15: STA141C: Big Data & High Performance Statistical Computingchohsieh/teaching/STA141C... · Time Series/Sequence Data Input: fx 1;x 2; ;x Tg Each x t is the feature at time step t Each

Problems of Classical RNN

Hard to capture long-term dependencies

Hard to solve (vanishing gradient problem)

Solution:

LSTM (Long Short Term Memory networks)GRU (Gated Recurrent Unit)· · ·

Page 16: STA141C: Big Data & High Performance Statistical Computingchohsieh/teaching/STA141C... · Time Series/Sequence Data Input: fx 1;x 2; ;x Tg Each x t is the feature at time step t Each

LSTM

RNN:

LSTM:

Page 17: STA141C: Big Data & High Performance Statistical Computingchohsieh/teaching/STA141C... · Time Series/Sequence Data Input: fx 1;x 2; ;x Tg Each x t is the feature at time step t Each

Conclusions

Final project.

Questions?