lstms exploit linguistic attributes of datanfliu/papers/liu+levy+schwartz+tan+smith.re… · code:...

Code: git.io/lstms-exploit Paper: bit.ly/lstms-exploit ACL 2018 Workshop on Representation Learning for NLP TL;DR • Data with linguistic attributes helps LSTMs learn a non-linguistic memorization task. • To solve the task, LSTMs use individual neurons to count timesteps. • We hypothesize that LSTMs pick up on the patterns and structure in linguistic data and use them as additional noisy training signal. Testbed Memorization Task • Given a constant-length sequence of tokens, predict the identity of the middle token seen. w 1 LSTM LSTM w 2 ... LSTM w k 2 LSTM ... LSTM w k -1 LSTM w k LSTM Target [ Pierre Vinken , 61 ] [ years old , will ] [ join the board as ] Directly take sequences 1. Language setting Training Datasets with Various Linguistic Attributes 3. Uniform setting [ volume vice , a ] [ pilots or sign corp. ] [ amid in <unk> banks ] Randomly pick tokens from vocabulary 2. n-gram setting 2. Permute the chunks [ Pierre Vinken ] [ , 61 ] [ years old ] [ , will ] [ join the ] [ board as ] … 1. Chunk corpus into pieces of size n (n = 2 in this example) [ join the Pierre Vinken] [ , 61 , will ] [ years old board as ] 3. Split into sequences of desired length (4 in this example) [ g m d r p j w f h c ] [ 3 5 6 8 4 0 2 7 9 1 ] [ ! " # ↩ % & ' ( ) * ] • This task is inherently nonlinguistic. Experiments Models trained on data with linguistic features generalize better • Test data: uniform distribution over the 100 rarest words in the PTB. • Ensures that models truly generalize and are not just using training data-speciﬁc features. What happens if we add more hidden units? To solve the task, RNNs learn to count The RNN exploits linguistic features to bootstrap itself early in training and learns to generalize later. Figure 4: Validation set has the same distribution as train. We see validation accuracy improves faster than test at early epochs (exploiting linguistic features), but the two curves move in unison in later epochs (true memorization). Figure 2: Models trained on Uniform data do poorly, even with more hidden units. Further Analysis • We further study an LSTM with 100 hidden units trained on Language, where train and test sequences are of length 300. Nelson F. Liu ♠ ♡ , Omer Levy ♠ , Roy Schwartz ♠♣ , Chenhao Tan ♢ , Noah A. Smith ♠♣ ♠Paul G. Allen School of Computer Science & Engineering; ♡Department of Linguistics, University of Washington; ♣Allen Institute for Artiﬁcial Intelligence; ♢Department of Computer Science, University of Colorado Boulder LSTMs Exploit Linguistic Attributes of Data UW NLP

Upload: others

Post on 21-Jun-2020

5 views

Category:

Documents

0 download

Report

Download

Embed Size (px):

TRANSCRIPT

Page 1: LSTMs Exploit Linguistic Attributes of Datanfliu/papers/liu+levy+schwartz+tan+smith.re… · Code: git.io/lstms-exploit Paper: bit.ly/lstms-exploit ACL 2018 Workshop on Representation

Code: git.io/lstms-exploit Paper: bit.ly/lstms-exploit ACL 2018 Workshop on Representation Learning for NLP

TL;DR

• Data with linguistic attributes helps LSTMs learn a non-linguistic memorization task.

• To solve the task, LSTMs use individual neurons to count timesteps.

• We hypothesize that LSTMs pick up on the patterns and structure in linguistic data and use them as additional noisy training signal.

Testbed Memorization Task• Given a constant-length sequence of tokens,

predict the identity of the middle token seen.

LSTM LSTM

w2 . . .

LSTM

w k2

LSTM

. . .

LSTM

wk�1

LSTM

Target

[ Pierre Vinken , 61 ][ years old , will ][ join the board as ]

Directly take sequences

1. Language setting

Training Datasets with Various Linguistic Attributes

3. Uniform setting

[ volume vice , a ][ pilots or sign corp. ][ amid in <unk> banks ]

Randomly pick tokens from vocabulary

2. n-gram setting

2. Permute the chunks

[ Pierre Vinken ] [ , 61 ] [ years old ][ , will ] [ join the ] [ board as ] …

1. Chunk corpus into pieces of size n(n = 2 in this example)

[ join the Pierre Vinken]

[ , 61 , will ][ years old board as ]3. Split into sequences

of desired length (4 in this example)

[ g m d r p j w f h c ] [ 3 5 6 8 4 0 2 7 9 1 ]

[ ! " # ↩ % & ' ( ) * ]

• This task is inherently nonlinguistic.

Experiments

Models trained on data with linguistic features generalize better

• Test data: uniform distribution over the 100 rarest words in the PTB.

• Ensures that models truly generalize and are not just using training data-specific features.

What happens if we add more hidden units?

To solve the task, RNNs learn to count

The RNN exploits linguistic features to bootstrap itself early in training and learns to generalize later.

Figure 4: Validation set has the same distribution as train. We see validation accuracy improves faster than test at early epochs (exploiting linguistic features), but the two curves move in unison in later epochs (true memorization).

Figure 2: Models trained on Uniform data do poorly, even with more hidden units.

Further Analysis• We further study an LSTM with 100 hidden units trained on

Language, where train and test sequences are of length 300.

Nelson F. Liu♠ ♡, Omer Levy♠, Roy Schwartz♠♣, Chenhao Tan♢, Noah A. Smith♠♣♠Paul G. Allen School of Computer Science & Engineering; ♡Department of Linguistics, University of Washington;♣Allen Institute for Artificial Intelligence; ♢Department of Computer Science, University of Colorado Boulder

LSTMs Exploit Linguistic Attributes of Data UWNLP

https://git.io/lstms-exploit

http://bit.ly/lstms-exploit

Encouraging LSTMs to Anticipate Actions Very Early...Encouraging LSTMs to Anticipate Actions Very Early Mohammad Sadegh Aliakbarian1,3, Fatemeh Sadat Saleh1,3, Mathieu Salzmann2, Basura

Introduction to Tree-LSTMs

Beyond detection: GANs and LSTMs to pay attention at human … · 2017. 10. 27. · Beyond detection: GANs and LSTMs to pay attention at human presence Rita Cucchiara Imagelab, Dipartimento

Complete document available: bit.ly/respondingsuicide

Deep Learning Class #3 - Take Two LSTMs

bit.ly/2Msr7Dd - hsd.k12.or.us

Pocket Code (the app): //bit.ly/pocketcode2 Pocket Paint (tool for creating images): //bit.ly/pocketpaint

Scalable GP-LSTMs with Semi-Stochastic Gradients · 2019-10-17 · Scalable GP-LSTMs with Semi-Stochastic Gradients Maruan Al-Shedivat Andrew Gordon Wilson Yunus Saatchi Zhiting Hu

Time series predictions using LSTMs