sta141c: big data & high performance statistical computingchohsieh/teaching/sta141c... · time...

STA141C: Big Data & High Performance Statistical Computing Lecture 15: Recurrent Neural Networks Cho-Jui Hsieh UC Davis June 7, 2018

Upload: others

Post on 08-Aug-2020

7 views

Category:

Documents

0 download

Report

Download

Embed Size (px):

TRANSCRIPT

STA141C: Big Data & High Performance StatisticalComputing

Lecture 15: Recurrent Neural Networks

Cho-Jui HsiehUC Davis

June 7, 2018

Page 2: STA141C: Big Data & High Performance Statistical Computingchohsieh/teaching/STA141C... · Time Series/Sequence Data Input: fx 1;x 2; ;x Tg Each x t is the feature at time step t Each

Recurrent Neural Network

Time Series/Sequence Data

Input: {x1, x2, · · · , xT}Each xt is the feature at time step tEach xt can be an d-dimensional vector

Output: {y1, y2, · · · , yT}Each yt is the output at step tMulti-class output or Regression output:

yt ∈ {1, 2, · · · , L} or yt ∈ R

Page 4: STA141C: Big Data & High Performance Statistical Computingchohsieh/teaching/STA141C... · Time Series/Sequence Data Input: fx 1;x 2; ;x Tg Each x t is the feature at time step t Each

Example: Time Series Prediction

Climate Data:

xt : temperature at time tyt : temperature (or temperature change) at time t + 1

Stock Price: Predicting stock price

Page 5: STA141C: Big Data & High Performance Statistical Computingchohsieh/teaching/STA141C... · Time Series/Sequence Data Input: fx 1;x 2; ;x Tg Each x t is the feature at time step t Each

Example: Time Series Prediction

Climate Data:

xt : temperature at time tyt : temperature (or temperature change) at time t + 1

Stock Price: Predicting stock price

Page 6: STA141C: Big Data & High Performance Statistical Computingchohsieh/teaching/STA141C... · Time Series/Sequence Data Input: fx 1;x 2; ;x Tg Each x t is the feature at time step t Each

Example: Language Modeling

xt : one-hot encoding to represent the word at step t([0, . . . , 0, 1, 0, . . . , 0])yt : the next word

yt ∈ {1, · · · ,V } V: Vocabulary size

Page 7: STA141C: Big Data & High Performance Statistical Computingchohsieh/teaching/STA141C... · Time Series/Sequence Data Input: fx 1;x 2; ;x Tg Each x t is the feature at time step t Each

Example: Language Modeling

xt : one-hot encoding to represent the word at step t([0, . . . , 0, 1, 0, . . . , 0])yt : the next word

yt ∈ {1, · · · ,V } V: Vocabulary size

Page 8: STA141C: Big Data & High Performance Statistical Computingchohsieh/teaching/STA141C... · Time Series/Sequence Data Input: fx 1;x 2; ;x Tg Each x t is the feature at time step t Each

Example: POS Tagging

Part of Speech Tagging:

Labeling words with their Part-Of-Speech (Noun, Verb, Adjective,· · · )xt : a vector to represent the word at step t

yt : label of word t

Page 9: STA141C: Big Data & High Performance Statistical Computingchohsieh/teaching/STA141C... · Time Series/Sequence Data Input: fx 1;x 2; ;x Tg Each x t is the feature at time step t Each

Recurrent Neural Network (RNN)

xt : t-th input

st : hidden state at time t (“memory” of the network)

st = f (Uxt + Wst−1)

W : transition matrix s0 usually set to be 0

Predicted output at time t:

ot = arg maxi

(V st)i

Page 10: STA141C: Big Data & High Performance Statistical Computingchohsieh/teaching/STA141C... · Time Series/Sequence Data Input: fx 1;x 2; ;x Tg Each x t is the feature at time step t Each

Recurrent Neural Network (RNN)

Training: Find U,W ,V to minimize empirical loss:

Loss of a sequence:T∑t=1

loss(V st , yt)

(st is a function of U,W ,V )

Loss on the whole dataset:

Average loss of all sequence

Solve by Stochastic Gradient Descent (SGD)

Page 11: STA141C: Big Data & High Performance Statistical Computingchohsieh/teaching/STA141C... · Time Series/Sequence Data Input: fx 1;x 2; ;x Tg Each x t is the feature at time step t Each

Recurrent Neural Network (RNN)

Training: Find U,W ,V to minimize empirical loss:

Loss of a sequence:T∑t=1

loss(V st , yt)

(st is a function of U,W ,V )

Loss on the whole dataset:

Average loss of all sequence

Solve by Stochastic Gradient Descent (SGD)

Page 12: STA141C: Big Data & High Performance Statistical Computingchohsieh/teaching/STA141C... · Time Series/Sequence Data Input: fx 1;x 2; ;x Tg Each x t is the feature at time step t Each

Recurrent Neural Network (RNN)

Training: Find U,W ,V to minimize empirical loss:

Loss of a sequence:T∑t=1

loss(V st , yt)

(st is a function of U,W ,V )

Loss on the whole dataset:

Average loss of all sequence

Solve by Stochastic Gradient Descent (SGD)

Page 13: STA141C: Big Data & High Performance Statistical Computingchohsieh/teaching/STA141C... · Time Series/Sequence Data Input: fx 1;x 2; ;x Tg Each x t is the feature at time step t Each

RNN: Text Classification

Not necessary to output at each step

Text Classification:Sentence → category

Output only at the final step

Model: add a fully connected network to the final embedding

Page 14: STA141C: Big Data & High Performance Statistical Computingchohsieh/teaching/STA141C... · Time Series/Sequence Data Input: fx 1;x 2; ;x Tg Each x t is the feature at time step t Each

RNN: Neural Machine Translation

Page 15: STA141C: Big Data & High Performance Statistical Computingchohsieh/teaching/STA141C... · Time Series/Sequence Data Input: fx 1;x 2; ;x Tg Each x t is the feature at time step t Each

Problems of Classical RNN

Hard to capture long-term dependencies

Hard to solve (vanishing gradient problem)

Solution:

LSTM (Long Short Term Memory networks)GRU (Gated Recurrent Unit)· · ·

Page 16: STA141C: Big Data & High Performance Statistical Computingchohsieh/teaching/STA141C... · Time Series/Sequence Data Input: fx 1;x 2; ;x Tg Each x t is the feature at time step t Each

LSTM

RNN:

LSTM:

Page 17: STA141C: Big Data & High Performance Statistical Computingchohsieh/teaching/STA141C... · Time Series/Sequence Data Input: fx 1;x 2; ;x Tg Each x t is the feature at time step t Each

Conclusions

Final project.

Questions?

Warm Up Solve each equation. 1. –5a = 30 2. Graph each inequality. 5. x ≥ –10 6. x < –3 –6 –10 3.4

STA141C: Big Data & High Performance Statistical …chohsieh/teaching/STA141C_Spring...Then WSVD = U k k is the context representation of each word (Each row is a k-dimensional feature

1-6 Function Operations and Composition of · PDF fileFind ( f + g)(x), (f ± g)(x), (f { g)(x), and (x) for each f(x) and g(x). State the domain of each new function. f(x) = x2 +

Grant Austin Display Packaging · 200GWOE Medals, Plaques Each 198 x 133 x 40 725GWOE Medals, Plaques Each 198 x 206 x 40 SPD33 * Medallion (double door) 6 143 x 205 x 30 SPD79 **

STA141C: Big Data & High Performance Statistical Computing ...chohsieh/teaching/STA141C_Spring2018/lecture3.pdfArray Array of size n: stored in n contiguous memory space Access any

Warm-Up 2/19 Factor each polynomial. 1. x² – x – 20 (x – 5)(x + 4)

8.3 Multiplying Exponents. 8.3 – Multiplying Exponents Warm Up Write each expression using an exponent. 1. 2 2 2 2. x x x x 3. Write each expression without

DEVON · Normal Price £14.95 Each Nett To £7.48 Each Offer Price £14.50 Normal Price £17.25 Each Contains: 1 x Head Jog 76 1 x Head Jog 77 1 x Head Jog 78 1 x Head Jog 79 1 x

STA141C: Big Data & High Performance Statistical Computing ...chohsieh/teaching/STA141C_Spring2017/lect… · STA141C: Big Data & High Performance Statistical Computing Lecture 8:

Warm-Up Solve for x in each equation. a) 3 x = b) log 2 x = –4 c) 5 x = 30

TUdbridenmath.weebly.com/uploads/8/8/1/2/88121160/segments_in_circ… · Find x and length of each chord. Find x and length of each chord. Given: # &=12 Find x and length of each

x b x c a x )( P x a c x P x a x b x c x c alibrary.cusat.ac.in/cat/2017/BTECH -2017.pdf · each unanswered question,0 marks for each wrong answer and 4 marks for each correct answer

Globetrotting! - Amazon Web Services · Page 2 Cutting Directions Cut each strip 1.5” x 13” Cut each strip 1.5” x 7” Cut 4 squares 4” x 4” Cut 1 square 3.5” x 3.5”

Working Paper Seriesrgea.webs.uvigo.es/workingpapers/rgea1105.pdf · take x,y ∈ RA. We say y ≤ x when y i ≤ x i for each i ∈ A and y < x when y i < x i for each i ∈

1.Compare these three graphs. 2.What conclusion do you make from each graph? 3.What conclusion would your students make from each graph? x x x x x xx x

Print - National Fire Protection Association voltage X X System output voltage, each leg X X System output current, each leg X X System output frequency (ac output systems only

“Tubers” Concrete, steel, oxides Variable 3.5’ x 4’ x 4’ each 2005

STA141C: Big Data & High Performance Statistical Computing ...chohsieh/teaching/STA141C_Spring2017/lecture4.pdf · Array Array of size n: stored in n contiguous memory space Access

STA141C: Big Data & High Performance Statistical Computingchohsieh/teaching/STA141C_Spring2017/lecture7.pdfSTA141C: Big Data & High Performance Statistical Computing Lecture 7: Linear

SCOPE OF WORK - USEmbassy.gov...1.19 MADERA 2" X 3" X 10´ (TENDALES) 2" X 3" X 10´ WOOD (TENDALES) Each 120.0 1.20 MADERA 1" X 3" X 10´ (REGLAS) 1" X 3" X 10´ WOOD (REGLAS) Each

STA141C: Big Data & High Performance Statistical …chohsieh/teaching/STA141C_Spring2017/lecture...STA141C: Big Data & High Performance Statistical Computing Lecture 12: Parallel Computing

Each X will increase the oxidation number of metal by +1. Each L and X will supply 2 electrons to the electron count

Consider each of the following functions: f(x) = |x – 1| / (x – 1) g(x) = (1 – x) / (1 – x) The domain for each function includes {x | x 1, x 0},

New MDF Different web browsers Pricing Cutouts Cutouts... · 2018. 3. 19. · Shoe 6” x 7 1/2” $5.00 each WC-SPR-SM / Small Spur 3 ½” x 8” 2” x 4” $4.00 each $1.10 each

Formed by nature - crafted by Strata€¦ · RIALTO SILVER MIST RIALTO (Sandstone) Size (mm) 2Pcs./Pack m /Pack 300 x 300 x 20 12 each 600 x 300 x 20 26 each 11.52 600 x 600 x 20

STA141C: Big Data & High Performance Statistical Computing

x-merry mask catalogue(x-a series 100pcs each)

Pizza Supplies - wincous.com€¦ · 20/06/2019 · HAC-082 8"Dia x 2"H Each 6/24 HAC-102 10"Dia x 2"H Each 6/24 HAC-122 12"Dia x 2"H Each 6/24 HAC-142 14"Dia x 2"H Each 6/24 HAC-162

UNDERSTAND PRACTICE€¦ · placing photos in a mat for a gallery show. Each mat she uses is x in. wide on each side. The total area of each photo and mat is shown. x x x x Area =

8.3 – Multiplying Exponents Warm Up Write each expression using an exponent. 1. 2 2 2 2. x x x x 3. Write each expression without using an exponent. 4

STA141C: Big Data & High Performance Statistical Computingchohsieh/teaching/STA141C_Spring2017/lecture5.pdf · STA141C: Big Data & High Performance Statistical Computing Lecture 5:

Warm Up Simplify each expression. 1. 2. 3. 4. 5. f(x) = x 2 – 18x + 16 f(x) = x 2 + 8x – 24 Find the zeros of each function

ECE 307-Techniques for Engineering Decisions · 1.5 0. 2 1.5 20 2 1.5 0.5 8 5 20 max Z = x x x y x x x lumber y x x x finishing y x x x carpentry x, ... pay for each unit of each

Section 1.1. Warm Up Graph each inequality. 1. x ≥ 3 2. 2 ≤ x ≤ 6 3. x 0

Simplify each expression. 1. 90 – ( x + 20) 2. 180 – (3 x – 10)