
Page 1: Recent Progress in RNN and NLP

Recent Progress in RNN and NLP

Tohoku University, Inui and Okazaki Lab.

Sosuke Kobayashi (⼩林 颯介)

Page 2: Recent Progress in RNN and NLP

• Revised presentation slides from the 2016/6/22 NLP-DL MTG @ Preferred Networks and the 2016/6/30 Inui and Okazaki Lab. talk

• Overview of basic progress in RNNs since late 2014
• Attention is not included.

c.f. http://www.slideshare.net/yutakikuchi927/deep-learning-nlp-attention

• arXiv papers not (yet) formally published are marked with ” ”
• Reference:

https://docs.google.com/document/d/1nmkidNi_MsRPbB65kHsmyMfGqmaQ0r5dW518J8k_aeI/edit?usp=sharing( https://goo.gl/kE6GCM )

Note


Page 3: Recent Progress in RNN and NLP

• Basic RNN
• RNN units
• Benchmarking various RNNs
• Connections in RNNs
• RNNs and trees
• Regularization and tricks for training RNNs
• Decoding

Agenda


Page 4: Recent Progress in RNN and NLP

• Benchmarking various RNN units
• Variants of LSTM and GRU
• Examination of the gates in LSTM
• An initialization trick for LSTM
• High performance from simple units

• Visualization and analysis

1. Unit and Benchmark


Page 5: Recent Progress in RNN and NLP

LSTM and GRU
• LSTM [Hochreiter & Schmidhuber, 97]
• GRU [Cho+14]

(Biases are omitted.)
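The LSTM and GRU equations on this slide are figures that did not survive extraction. As a stand-in, here is a minimal NumPy sketch of one step of the standard LSTM [Hochreiter & Schmidhuber, 97] and GRU [Cho+14]; the gate ordering and weight stacking are this sketch's own convention, not the slide's.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    # W: (4d, n), U: (4d, d), b: (4d,); rows stacked as [input i; forget f; output o; candidate g]
    z = W @ x + U @ h + b
    i, f, o, g = np.split(z, 4)
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h_new = sigmoid(o) * np.tanh(c_new)
    return h_new, c_new

def gru_step(x, h, W, U, b):
    # W: (3d, n), U: (3d, d), b: (3d,); rows stacked as [update z; reset r; candidate]
    d = h.shape[0]
    z = sigmoid(W[:d] @ x + U[:d] @ h + b[:d])            # update gate
    r = sigmoid(W[d:2*d] @ x + U[d:2*d] @ h + b[d:2*d])   # reset gate
    h_tilde = np.tanh(W[2*d:] @ x + U[2*d:] @ (r * h) + b[2*d:])
    return (1.0 - z) * h + z * h_tilde                    # interpolate old and new state
```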

Page 6: Recent Progress in RNN and NLP

• Search for better unit structures by mutating the computation graphs of LSTM and GRU
• Arith.: arithmetic with noise tokens
• XML: character-based prediction of XML tags
• PTB: language modeling

Discovered Units [Jozefowicz+15]


Page 7: Recent Progress in RNN and NLP

• The better units found are similar to GRU
• Perhaps due to a bias in the search algorithm?

• MUT1: the update gate is controlled only by x (not h), which looks reasonable for the Arith. task

(Note in the figure: shift h_t to h_{t-1}.)


[Jozefowicz+15]


Discovered Units


Page 8: Recent Progress in RNN and NLP

• What if LSTM's input, forget, or output gate is removed?
• What if LSTM's forget-gate bias is initialized to +1? (≒ initially keeping about 73% of the cell value, since sigmoid(1) ≈ 0.73)

• Initializing the forget gate with a positive bias is good ([Gers+2000] also reported this.)

• Dropout improves LSTM, not GRU, in language modeling

• Gate importance: f >> i > o.

Examination of LSTM [Jozefowicz+15]
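A minimal sketch of the initialization trick discussed above, assuming the (i, f, o, g) bias stacking used in the earlier LSTM sketch: only the forget-gate slice of the bias is set to +1, so sigmoid(1) ≈ 0.73 of the cell value is retained at the start of training.

```python
import numpy as np

def init_lstm_params(n_in, d, forget_bias=1.0, seed=0):
    # Small random weights; only the forget-gate bias slice is set to +1.
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(4 * d, n_in))
    U = rng.normal(scale=0.1, size=(4 * d, d))
    b = np.zeros(4 * d)
    b[d:2 * d] = forget_bias  # forget-gate slice in the (i, f, o, g) stacking above
    return W, U, b
```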


Page 9: Recent Progress in RNN and NLP

• “LSTM: A Search Space Odyssey.” Cool title.
• A thorough examination of LSTM
• Gates, peepholes, tanh before the output, forget gate = 1 − input gate (as in GRU)
• Full gate recurrence: gates are also conditioned on the gate values at the previous step
• Peepholes are not important; the forget gate is important; f = 1 − i works well and saves parameters (see the sketch below)
• Recommendation: use a standard, common LSTM

Examination of LSTM [Greff+15]
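A hedged sketch of the coupled-gate variant examined in [Greff+15], where the forget gate is tied to the input gate as f = 1 − i and its parameters are dropped; the three-way weight stacking is this sketch's own convention.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def coupled_lstm_step(x, h, c, W, U, b):
    # W: (3d, n), U: (3d, d), b: (3d,); rows stacked as [input i; output o; candidate g]
    z = W @ x + U @ h + b
    i_pre, o_pre, g = np.split(z, 3)
    i = sigmoid(i_pre)
    c_new = (1.0 - i) * c + i * np.tanh(g)      # forget gate tied: f = 1 - i
    h_new = sigmoid(o_pre) * np.tanh(c_new)
    return h_new, c_new
```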


Page 10: Recent Progress in RNN and NLP

• Structurally Constrained Recurrent Network (SCRN) [Mikolov+15]: an RNN with a simple cell updated by a weighted sum

• IRNN [Le+15]: a simple RNN with its recurrent matrix initialized to the identity and ReLU instead of tanh (sketched below)

• Effects of diagonal and orthogonal recurrent matrices in RNNs [Henaff+16]

Other Devised Units

Q is diagonal
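A minimal sketch of IRNN [Le+15] as described in the bullet above: a plain RNN whose recurrent matrix is initialized to the identity and whose nonlinearity is ReLU; the input-weight scale is an arbitrary choice for illustration.

```python
import numpy as np

def init_irnn(n_in, d, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.01, size=(d, n_in))  # small random input weights
    U = np.eye(d)                               # recurrent matrix = identity
    b = np.zeros(d)
    return W, U, b

def irnn_step(x, h, W, U, b):
    return np.maximum(0.0, W @ x + U @ h + b)   # ReLU instead of tanh
```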


Page 11: Recent Progress in RNN and NLP

• Minimal Gated Unit (MGU) [Zhou+, 16] (sketched after this list)

Other GRU-like Units
• Simple Gated Unit (SGU) [Gao+, 16]

• Deep SGU (DSGU) [Gao+, 16]
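A hedged sketch of the Minimal Gated Unit as I read the [Zhou+16] formulation: a single forget gate plays the roles of both GRU's update and reset gates; variable names and shapes are this sketch's own.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mgu_step(x, h, Wf, Uf, bf, Wh, Uh, bh):
    f = sigmoid(Wf @ x + Uf @ h + bf)              # single forget gate
    h_tilde = np.tanh(Wh @ x + Uh @ (f * h) + bh)  # candidate, gated like GRU's reset
    return (1.0 - f) * h + f * h_tilde             # the same gate also interpolates
```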


Page 12: Recent Progress in RNN and NLP

• Multiplicative Integration [Wu+16]
• Improves RNNs by using multiplication together with addition, changing σ(Wx + Uh + b) into σ(Wx ⊙ Uh + b) (see the sketch below)
• Applied similarly inside LSTM and GRU
• Improves performance on many tasks
• Will this become common in the near future...?

Multiplicative Integration

[Embedded paper: Y. Wu, S. Zhang, Y. Zhang, Y. Bengio, R. Salakhutdinov, “On Multiplicative Integration with Recurrent Neural Networks,” arXiv:1606.06630. MI replaces the additive building block σ(Wx + Uz + b) with the Hadamard-product block σ(Wx ⊙ Uz + b); the general form σ(α ⊙ Wx ⊙ Uz + β1 ⊙ Uz + β2 ⊙ Wx + b) adds almost no parameters and includes the additive block as a special case. The gradient ∂h_t/∂h_{t−n} changes from ∏ U^T diag(σ′_k) to ∏ U^T diag(Wx_k) diag(σ′_k), so the input directly modulates the backward flow.]
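A minimal NumPy sketch of a vanilla RNN step with Multiplicative Integration, following the simplest and the general formulations quoted above; parameter shapes and defaults are this sketch's own choices.

```python
import numpy as np

def mi_rnn_step(x, h, W, U, b, alpha=None, beta1=None, beta2=None):
    wx, uh = W @ x, U @ h
    if alpha is None:
        pre = wx * uh + b                                     # simplest MI block
    else:
        pre = alpha * wx * uh + beta1 * uh + beta2 * wx + b   # general form
    return np.tanh(pre)
```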


Page 13: Recent Progress in RNN and NLP

• Visualization of a character-based language model
• One cell acquired the function of tracking the opening and closing of apostrophes (quotes)
• But most other cells are not interpretable

Visualization

[Embedded excerpt from the ICLR 2016 submission of [Karpathy+15]: Figure 2 shows cells with interpretable activations in the Linux Kernel and War and Peace LSTMs, with text colored by tanh(c) from red (−1) to blue (+1). Figure 3 shows gate saturation plots: many forget gates are frequently right-saturated, i.e. cells acting as nearly perfect integrators; no cells behave purely feed-forward; and first-layer gate activations are much more diffuse than in higher layers.]

tanh(cell) value: red = −1 ←→ +1 = blue

[Karpathy+15]


Page 14: Recent Progress in RNN and NLP

Word Ablation [Kádár+16]

• Analyzing a GRU's output with an omission score when encoding an image caption
• The model predicting the image's vector (CNN output) focuses on nouns
• The language model focuses more evenly

omission(i, S) = 1 − cosine(h_end(S), h_end(S\i))
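A minimal sketch of the omission score above; the `encode` argument is a hypothetical stand-in for the trained GRU encoder returning its final hidden state h_end.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def omission(i, S, encode):
    # S: list of tokens; encode: returns the encoder's final hidden state h_end
    full = encode(S)
    ablated = encode(S[:i] + S[i + 1:])  # S with the i-th word removed
    return 1.0 - cosine(full, ablated)
```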


Page 15: Recent Progress in RNN and NLP

• Removing the word ‘pizza’ removes just the pizza from the retrieved image (searched from the dataset)

Word Ablation [Kádár+16]


Page 16: Recent Progress in RNN and NLP

[Embedded excerpt from [Kádár+16]: Figure 2 shows the distribution of omission scores per POS tag and dependency-relation label for the TEXTUAL and VISUAL pathways; Figure 3 shows the log ratios of TEXTUAL to VISUAL omission scores per label (labels occurring at least 500 times). For the VISUAL pathway, omission scores are high mostly for nouns, lower for adjectives and verbs, and low for function words; for TEXTUAL the differences across labels are smaller.]

• Analyzing mean omission scores per POS tag over the dataset: the image-prediction model focuses on NN > JJ > VB, CD > ...

Word Ablation [Kádár+16]


Page 17: Recent Progress in RNN and NLP

• Connections of RNNs
• Tree structures and RNNs (LSTMs)
• Tree-based composition by shift-reduce

2. Connections and Trees


Page 18: Recent Progress in RNN and NLP

[Embedded excerpt from [Serban+2015]: Figure 1 shows the computational graph of the HRED architecture for a three-turn dialogue. An encoder RNN maps each utterance to an utterance vector, a higher-level context RNN summarizes the dialogue so far from the sequence of utterance vectors, and a decoder RNN generates the next utterance conditioned on the context RNN's hidden state; encoder and decoder parameters are shared across utterances.]

• Clockwork RNN: a combination of RNNs operating at different time-step intervals [Koutník+14, Liu+15, (Chung+16)]
• Gated Feedback RNN: feeds outputs back into lower layers through gates [Chung+15]
• Depth-Gated LSTM, Highway LSTM: cells are connected to the upper layer's cells through gates [Yao+15, Chen+15]
• The k-th layer's input comes from the (k−1)-th layer's input and output [Zhou+16]
• Hierarchical RNN [Serban+2015]

Connections in Multi-RNNs


Page 19: Recent Progress in RNN and NLP

[Embedded excerpt from “Pixel Recurrent Neural Networks” [Oord+16]: Figure 2 shows that pixel x_i is generated conditioned on all previously generated pixels to its left and above; the Row LSTM has a roughly triangular dependency field that misses context to the sides, while the Diagonal BiLSTM covers the entire available context. Figure 3 shows how the input map is skewed so the Diagonal BiLSTM can be parallelized along diagonals. The Row LSTM processes an image row by row with a k × 1 input-to-state convolution, and its state-to-state step is [o_i, f_i, i_i, g_i] = σ(K_ss ⊛ h_{i−1} + K_is ⊛ x_i), c_i = f_i ⊙ c_{i−1} + i_i ⊙ g_i, h_i = o_i ⊙ tanh(c_i).]

• Grid LSTM: each axis has its own LSTM, for multi-dimensional applications [Kalchbrenner+15]
• RNNs over DAGs and (image) pixels [Shuai+15, Zhu+16, Oord+16]
• Structural complexity of RNN models [Zhang+16]

[Embedded excerpt from “Grid Long Short-Term Memory” [Kalchbrenner+15] (ICLR 2016 submission): Figure 1 compares a standard LSTM block with 1d, 2d, and 3d Grid LSTM blocks; unlike the standard block, the 2d Grid LSTM also carries a memory vector along the vertical (depth) dimension. Grid LSTM outperforms stacked LSTM on 15-digit addition and sequence memorization, reaches 1.47 bits-per-character on the 100M-character Hutter Wikipedia dataset, and defines a translation model that re-encodes the source sentence based on the target words generated so far, outperforming a phrase-based CDEC baseline on IWSLT BTEC Chinese-to-English.]

Connections in Multi-RNNs


Page 20: Recent Progress in RNN and NLP

• Tree-LSTM [Tai+15]: applies an LSTM along the directed edges (child to parent) of a tree structure; the most-cited “Tree-LSTM” (sketched below)
• S-LSTM [Zhu+15]: adds peepholes and removes the input x
• LSTM-RecursiveNN [Le+15]: controls the forget and input gates with untied matrices for each cell and output (h); the input gate is applied before tanh
• Top-down TreeLSTM [Zhang+16]: sentence generation from the root of a dependency tree

Tree-LSTM
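A hedged sketch of one node of the Child-Sum Tree-LSTM of [Tai+15]: gates are computed from the node input and the sum of its children's hidden states, with a separate forget gate per child applied to that child's memory cell; the weight stacking is this sketch's own convention.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def child_sum_tree_lstm_node(x, child_h, child_c, W, U, b):
    # W: (4d, n), U: (4d, d), b: (4d,); rows stacked as [input i; forget f; output o; candidate u]
    d = b.shape[0] // 4
    h_sum = np.sum(child_h, axis=0) if child_h else np.zeros(d)
    z = W @ x + b
    i = sigmoid(z[:d] + U[:d] @ h_sum)
    o = sigmoid(z[2*d:3*d] + U[2*d:3*d] @ h_sum)
    u = np.tanh(z[3*d:] + U[3*d:] @ h_sum)
    c = i * u
    for h_k, c_k in zip(child_h, child_c):        # one forget gate per child
        f_k = sigmoid(z[d:2*d] + U[d:2*d] @ h_k)
        c = c + f_k * c_k
    h = o * np.tanh(c)
    return h, c
```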

[Embedded first page of Tai, Socher, and Manning, “Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks,” ACL-IJCNLP 2015, pp. 1556–1566: the Tree-LSTM generalizes the chain-structured LSTM to tree-structured network topologies and outperforms strong LSTM baselines on semantic relatedness (SemEval 2014, Task 1) and sentiment classification (Stanford Sentiment Treebank). Figure 1 contrasts a chain-structured LSTM with a tree-structured LSTM of arbitrary branching factor.]

[Figure 4 from Zhang et al. (2016), shown on the slide: generation of the left and right dependents of node w0 according to LDTREELSTM, using GEN-L/GEN-NX-L and GEN-R/GEN-NX-R actions plus a left-dependent (LD) LSTM.]

[Embedded excerpt from Zhang et al. (2016), shown on the slide. The layer-l TreeLSTM update for edge type z is
u_t = tanh(W^{z,l}_{ux} h^{l-1}_t + W^{z,l}_{uh} h^l_{t'}),
i_t = σ(W^{z,l}_{ix} h^{l-1}_t + W^{z,l}_{ih} h^l_{t'}),
f_t = σ(W^{z,l}_{fx} h^{l-1}_t + W^{z,l}_{fh} h^l_{t'}),
c^l_t = f_t ⊙ c^l_{t'} + i_t ⊙ u_t,
o_t = σ(W^{z,l}_{ox} h^{l-1}_t + W^{z,l}_{oh} h^l_{t'}),
h^l_t = o_t ⊙ tanh(c^l_t),
with the top layer giving h_t. Because TREELSTM conditions only on the dependency path D(w), it may have to predict an object (e.g., "cars") from its verb without seeing the subject ("manufacturer"). The Left-Dependent TreeLSTM (LDTREELSTM) therefore runs an extra LSTM from the furthest to the closest left dependent, m_k = W_e · e(v_{t,k}), q_k = LSTM_LD(m_k, q_{k-1}); the final state q_K is concatenated with the word embedding as input when generating the first right dependent, r_t = [W_e · e(w_{t'}); q_K], h_t = LSTM_GEN-R(r_t, H[:, t']), and this information percolates to later right dependents. Training minimizes the negative log-likelihood L_NLL(θ) = −(1/|S|) Σ_{S∈S} log P(S|T) over sentences S with dependency trees T.]

Tohoku University, Inui and Okazaki Lab. Sosuke Kobayashi

Page 21: Recent Progress in RNN and NLP

[Embedded excerpt from Eriguchi et al. (2016), shown on the slide. Figure 2: the attentional encoder-decoder model. The context vector is d_j = Σ_{i=1}^n α_j(i) h_i; the attentional hidden state is s̃_j = tanh(W_d [s_j; d_j] + b_d); the j-th target word is predicted with p(y_j | y_<j, x) = softmax(W_s s̃_j + b_s). The training objective is J(θ) = (1/|D|) Σ_{(x,y)∈D} log p(y|x), optimized with SGD. Figure 3: the proposed tree-to-sequence attentional NMT model. Existing NMT models treat a sentence as a flat word sequence; the proposed tree-based encoder follows the binary phrase structure of the sentence (as in HPSG) and builds phrase vectors bottom-up on top of a standard sequential encoder: the k-th parent hidden unit is h^{(phr)}_k = f_tree(h^l_k, h^r_k), where the leaf units are the sequential LSTM states and each non-leaf node is computed with a Tree-LSTM unit (Tai et al., 2015) over its two children's hidden units and memory cells.]

• Tree-based and sequential encoder for attention [Eriguchi+16]: Tree-LSTM composition over leaf nodes output from a seq-LSTM (a minimal sketch follows below).
• "The cutest approach!", as Kyunghyun Cho said at SedMT, NAACL 2016.
• The seq-LSTM undercoat makes nodes more context-aware and less ambiguous. +[Bowman+16]

Combination of Tree and Seq
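A minimal sketch of the idea, under my own simplifications: encode the words with a chain LSTM to get context-aware leaf states, then compose phrases bottom-up over a given binary tree. The paper composes with a Tree-LSTM node; the sketch below uses a plain tanh composition to stay short, so treat it as an outline rather than the exact model.

```python
import numpy as np

def lstm_step(x, h, c, W, b):
    """One step of a standard chain LSTM (NumPy sketch)."""
    z = W @ np.concatenate([x, h]) + b
    i, f, o, g = np.split(z, 4)
    sig = lambda v: 1.0 / (1.0 + np.exp(-v))
    c = sig(f) * c + sig(i) * np.tanh(g)
    return sig(o) * np.tanh(c), c

def tree_over_sequence(word_vecs, tree, W, b, W_tree, b_tree):
    """Encode leaves with a seq-LSTM, then compose phrases bottom-up over `tree`,
    a nested tuple of word indices, e.g. ((0, 1), (2, 3))."""
    d = b_tree.shape[0]
    h, c, leaves = np.zeros(d), np.zeros(d), []
    for x in word_vecs:                     # sequential "undercoat" over the words
        h, c = lstm_step(x, h, c, W, b)
        leaves.append(h)

    def compose(node):
        if isinstance(node, int):
            return leaves[node]             # leaf = seq-LSTM state at that word
        left, right = (compose(child) for child in node)
        return np.tanh(W_tree @ np.concatenate([left, right]) + b_tree)

    return compose(tree)                    # phrase vector for the whole sentence

d, x_dim = 8, 4
rng = np.random.default_rng(0)
W, b = rng.normal(0, 0.1, (4 * d, x_dim + d)), np.zeros(4 * d)
W_tree, b_tree = rng.normal(0, 0.1, (d, 2 * d)), np.zeros(d)
sent = [rng.normal(size=x_dim) for _ in range(4)]
vec = tree_over_sequence(sent, ((0, 1), (2, 3)), W, b, W_tree, b_tree)
```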

Tohoku University, Inui and Okazaki Lab. Sosuke Kobayashi

Page 22: Recent Progress in RNN and NLP

[Figure 5 from Dyer et al. (2016), shown on the slide: the neural architecture defining a distribution over a_t given representations of the stack (S_t), the output buffer (T_t), and the history of actions (a_<t), for the example "The hungry cat"; details of the NP composition, the action-history LSTM, and the other stack elements are not shown. The architecture corresponds to the generator state at line 7 of Figure 4.]

[Embedded excerpt from Dyer et al. (2016), shown on the slide. The forward and reverse LSTM states of the composition function are concatenated, passed through an affine transformation and a tanh nonlinearity to become the subtree embedding; this recursive composition handles an unbounded number of children and makes use of the nonterminal label, unlike many earlier composition functions. Word generation is split into deciding to GEN and then choosing the word with a class-factored softmax using √|Σ| Brown clusters, reducing the per-word cost from O(|Σ|) to O(√|Σ|). Parameters are trained to maximize the likelihood of a corpus of trees; a discriminative parsing variant replaces the output-buffer embedding with an embedding of the input buffer B_t. Because exact marginalization p(x) = Σ_{y'∈Y(x)} p(x, y') and MAP parsing are intractable, the model is evaluated with importance sampling using a conditional proposal distribution q(y | x).]

• Generation by sequential actions from {GEN(word), REDUCE, NT(non-terminal symbol)}. The features for action decisions are LSTM encodings of (1) the generated terminals, (2) the stack, and (3) the action history (a toy replay of the action sequence follows below).
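A toy replay of the generator's transitions for "The hungry cat meows ." (the example in the paper's Figure 4). This only simulates the symbolic stack and terminal bookkeeping; the neural scoring of actions is omitted, and the helper below is my own illustration.

```python
def run_generator(actions):
    """Replay RNNG generator transitions: NT(X) pushes an open nonterminal,
    GEN(w) pushes a terminal and appends it to the output buffer, and REDUCE
    pops up to the nearest open nonterminal and pushes the closed constituent."""
    stack, terminals = [], []
    for act, arg in actions:
        if act == "NT":
            stack.append(("OPEN", arg))            # open-nonterminal marker
        elif act == "GEN":
            stack.append(arg)
            terminals.append(arg)
        elif act == "REDUCE":
            children = []
            while not (isinstance(stack[-1], tuple) and stack[-1][0] == "OPEN"):
                children.append(stack.pop())
            _, label = stack.pop()
            stack.append("(" + label + " " + " ".join(reversed(children)) + ")")
    return stack, terminals

actions = [("NT", "S"), ("NT", "NP"), ("GEN", "The"), ("GEN", "hungry"), ("GEN", "cat"),
           ("REDUCE", None), ("NT", "VP"), ("GEN", "meows"), ("REDUCE", None),
           ("GEN", "."), ("REDUCE", None)]
stack, terminals = run_generator(actions)
print(stack[0])                 # (S (NP The hungry cat) (VP meows) .)
print(" ".join(terminals))      # The hungry cat meows .
```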

Recurrent Neural Network Grammars

[Embedded excerpt from Dyer et al. (2016), continued. Runtime: the total number of stack pops is linear in the number of input symbols, and under a fixed maximum-depth constraint parsing/generation runs in linear time; unlike previous stack-based algorithms, the generator builds rooted trees top-down and can emit arbitrary (non-binarized) tree structures. The generative model defines
p(x, y) = Π_{t=1}^{|a(x,y)|} p(a_t | a_{<t}) = Π_t exp(r_{a_t}^T u_t + b_{a_t}) / Σ_{a'∈A_G(T_t, S_t, n_t)} exp(r_{a'}^T u_t + b_{a'}),
where the algorithm-state embedding is u_t = tanh(W [o_t; s_t; h_t] + c), combining RNN encodings of the output buffer (o_t), the stack (s_t, a stack LSTM, since the stack holds open nonterminals, terminals, and full trees and is both pushed and popped), and the action history (h_t). On REDUCE, the popped children and the open nonterminal are composed with a bidirectional LSTM (Figure 6: the label embedding, e.g. NP, is read first in both directions, followed by the child embeddings, which "notifies" each LSTM what sort of head to look for).]

[Dyer+16]

[Figures 1-4 from Dyer et al. (2016), shown on the slide. Figure 1: parser transitions over the stack S, buffer B, and open-nonterminal count n (NT(X) pushes an open nonterminal, SHIFT moves a terminal from the buffer to the stack, REDUCE closes the most recent open nonterminal). Figure 2: a top-down parsing example for "The hungry cat meows .". Figure 3: generator transitions, where GEN(x) replaces SHIFT and T records the generated terminals. Figure 4: joint generation of the parse tree (S (NP The hungry cat) (VP meows) .) and its sentence.]

• The REDUCE action weaves a new chunk vector with a bi-LSTM over "NP → the → hungry → cat" and pushes it back onto the stack (a minimal sketch of this composition follows below).
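A minimal NumPy sketch of that composition step: the nonterminal label embedding is read first by a forward and a backward LSTM, followed by the child vectors in forward and reverse order, and the two final states are combined with an affine layer and tanh. Names and shapes are my own.

```python
import numpy as np

def lstm_encode(vectors, W, b):
    """Run a plain LSTM over a list of vectors and return its final hidden state."""
    d = b.shape[0] // 4
    h, c = np.zeros(d), np.zeros(d)
    sig = lambda v: 1.0 / (1.0 + np.exp(-v))
    for x in vectors:
        z = W @ np.concatenate([x, h]) + b
        i, f, o, g = np.split(z, 4)
        c = sig(f) * c + sig(i) * np.tanh(g)
        h = sig(o) * np.tanh(c)
    return h

def compose_constituent(label_vec, child_vecs, W_fwd, b_fwd, W_bwd, b_bwd, V, c):
    """Bi-LSTM composition used on REDUCE: label first, then children (forward
    and reversed); final states concatenated -> affine -> tanh."""
    h_fwd = lstm_encode([label_vec] + list(child_vecs), W_fwd, b_fwd)
    h_bwd = lstm_encode([label_vec] + list(reversed(child_vecs)), W_bwd, b_bwd)
    return np.tanh(V @ np.concatenate([h_fwd, h_bwd]) + c)

d = 8
rng = np.random.default_rng(0)
W_f, W_b = (rng.normal(0, 0.1, (4 * d, 2 * d)) for _ in range(2))
b_f, b_b = np.zeros(4 * d), np.zeros(4 * d)
V, c = rng.normal(0, 0.1, (d, 2 * d)), np.zeros(d)
nt_label = rng.normal(size=d)                       # embedding of "NP"
children = [rng.normal(size=d) for _ in ("the", "hungry", "cat")]
constituent = compose_constituent(nt_label, children, W_f, b_f, W_b, b_b, V, c)
```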

Tohoku University, Inui and Okazaki Lab. Sosuke Kobayashi

Page 23: Recent Progress in RNN and NLP

• Joint learning of shift-reduce parsing and sentence-level classification with shift-reduce-based Tree-LSTM composition: on REDUCE, the top two stack entries are composed by a Tree-LSTM.
• Speedy tree composition (comparable to a recurrent NN); see the sketch below.

SPINN [Bowman+16]
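A minimal sketch of SPINN's shift-reduce interpretation loop, with my own names, a plain tanh composition standing in for the Tree-LSTM composition layer, and the tracking LSTM / built-in parser left out.

```python
import numpy as np

def spinn_encode(word_vecs, transitions, W_comp, b_comp):
    """Shift-reduce interpretation: 'shift' pushes the next word vector,
    'reduce' pops the top two stack entries and pushes their composition.
    Returns the sentence vector left on top of the stack."""
    buffer = list(word_vecs)        # consumed left to right
    stack = []
    for t in transitions:
        if t == "shift":
            stack.append(buffer.pop(0))
        elif t == "reduce":
            right, left = stack.pop(), stack.pop()
            stack.append(np.tanh(W_comp @ np.concatenate([left, right]) + b_comp))
    return stack[-1]

d = 8
rng = np.random.default_rng(0)
W_comp, b_comp = rng.normal(0, 0.1, (d, 2 * d)), np.zeros(d)
words = [rng.normal(size=d) for _ in ("the", "cat", "sat", "down")]
# ((the cat) (sat down)), as in Figure 2 of the paper
transitions = ["shift", "shift", "reduce", "shift", "shift", "reduce", "reduce"]
sentence_vec = spinn_encode(words, transitions, W_comp, b_comp)
```

Because every sentence reduces to the same flat sequence of shift/reduce steps, the per-step operations can be batched across sentences, which is the speed advantage the bullets mention.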

[Figure 2 and accompanying text from Bowman et al. (2016), shown on the slide. Figure 2(a): the SPINN model unrolled for two transitions while processing "the cat sat down"; 'tracking', 'transition', and 'composition' are neural network layers, and gray arrows indicate connections blocked by a gating function. Figure 2(b): the fully unrolled SPINN for the same sentence, with network layers omitted for clarity. The text argues that TreeRNNs have been overlooked because they are incompatible with batched computation (each sentence has its own structure) and rely on external parsers at test time. SPINN executes the computations of a tree-structured model as a linearized sequence and can incorporate a neural network parser that produces the parse structure on the fly; it parses and interprets unparsed sentences at almost no extra cost, supports batched computation for parsed and unparsed sentences, and adds a tree-sequence hybrid for local linear context. It significantly outperforms other sentence-encoding models on SNLI and runs up to 25x faster than a standard TreeRNN implementation; related work covers transition-based neural parsers and generative models built on the same architecture.]


Stack-augmented Parser-Interpreter Neural Network

Tohoku University, Inui and Okazaki Lab. Sosuke Kobayashi

Page 24: Recent Progress in RNN and NLP

• Repeated attention with an LSTM captures input vectors as a set [Vinyals+15].
• (End-to-end) Memory Networks [Sukhbaatar+15]: sentence encoding by a weighted sum. An earlier word's vector gets larger weights at smaller dimensions; a later word's vector gets larger weights at larger dimensions.
• e.g., with sentence length d = 10 and vector dimension J = 20 (see the sketch below):
value of the 1st vector at the 1st dim: (1-1/20) - (1/10)(1-2*1/20) = 0.86
value of the 1st vector at the 20th dim: (1-20/20) - (1/10)(1-2*20/20) = 0.1
value of the 10th vector at the 1st dim: (1-1/20) - (10/10)(1-2*1/20) = 0.05
value of the 10th vector at the 20th dim: (1-20/20) - (10/10)(1-2*20/20) = 1.0
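A small sketch that reproduces those numbers from the position-encoding weight l_kj = (1 − j/J) − (k/d)(1 − 2j/J). The paper uses J for the sentence length and d for the embedding dimension; since the expression is symmetric in the word-position and dimension fractions, the slide's assignment (10 words, 20 dimensions) gives the same corner values.

```python
import numpy as np

def position_encoding(num_words, dim):
    """Weight matrix L[k, j] for word k and dimension j (both 1-based in the
    formula): l_kj = (1 - j/J) - (k/d)(1 - 2j/J). Here the word position enters
    as k/num_words and the dimension as j/dim, matching the slide's numbers."""
    k = np.arange(1, num_words + 1)[:, None] / num_words   # word-position fraction
    j = np.arange(1, dim + 1)[None, :] / dim               # dimension fraction
    return (1.0 - j) - k * (1.0 - 2.0 * j)

L = position_encoding(num_words=10, dim=20)
print(L[0, 0], L[0, -1], L[-1, 0], L[-1, -1])   # ~0.86, 0.10, 0.05, 1.00
# The sentence vector is then m_i = sum_j l_j * (A x_ij) (element-wise product).
```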

(Encoders without C/RNN)

[Embedded excerpt from Sukhbaatar et al. (2015), shown on the slide. Position encoding (PE): the sentence vector is m_i = Σ_j l_j · A x_ij (element-wise product), with l_kj = (1 − j/J) − (k/d)(1 − 2j/J), J the number of words and d the embedding dimension, so word order affects m_i; the same representation is used for questions, memory inputs, and memory outputs. Temporal encoding adds a learned row T_A(i) (and T_C(i) for the outputs), with sentences indexed in reverse order, and 10% "dummy" empty memories are injected as random noise to regularize T_A. Training details: 10% of the bAbI training set is held out for validation; SGD with learning-rate annealing, gradients clipped to l2 norm 40, memory capacity limited to the most recent 50 sentences, linear start (LS) training that removes the internal softmaxes until validation loss plateaus, K = 3 hops with adjacent weight sharing, and 10 random restarts keeping the lowest training error. Baselines: the strongly supervised MemNN (AM+NG+NL), a weakly supervised heuristic MemNN-WSH, and a standard LSTM. The model (MemN2N) is related to the attention of Bahdanau et al. but attends over many sentences and makes several hops before producing an output; it is evaluated on the 20 bAbI QA tasks (vocabulary size 177; 1k and 10k training problems per task), where only a subset of the statements is relevant and the supporting facts are not given, and also on language modeling.]

[Embedded excerpt from "Order Matters: Sequence to Sequence for Sets" (Vinyals et al., ICLR 2016), shown on the slide. The order in which input data is shown to a seq2seq model affects learning, so the paper builds a model that makes no assumption about input ordering: a memory that grows with the size of the set and is order-invariant, viewable as a special case of a Memory Network or Neural Turing Machine. The Read-Process-and-Write model (Figure 1) has a reading block that embeds each element x_i into a memory vector m_i with a small shared network; a process block that runs an LSTM without inputs or outputs for T steps of content-based attention over the memories,
q_t = LSTM(q*_{t-1}), e_{i,t} = f(m_i, q_t), a_{i,t} = exp(e_{i,t}) / Σ_j exp(e_{j,t}), r_t = Σ_i a_{i,t} m_i, q*_t = [q_t; r_t],
so that permuting the memories leaves the readout unchanged and the final state q*_T is a permutation-invariant embedding of the set; and a write block that decodes.]
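A minimal NumPy sketch of the process block summarized above: an LSTM that takes no external input and, at each of T steps, attends over the memory vectors with content-based (here dot-product) scoring. The scoring function and shapes are my own choices.

```python
import numpy as np

def process_block(memories, T, W, b, d):
    """Read-Process block: an input-less LSTM whose state q* = [q; r] is updated
    by repeatedly attending over the (order-invariant) set of memory vectors."""
    sig = lambda v: 1.0 / (1.0 + np.exp(-v))
    q_star, c = np.zeros(2 * d), np.zeros(d)
    for _ in range(T):
        z = W @ q_star + b                      # LSTM over q*_{t-1}, no external input
        i, f, o, g = np.split(z, 4)
        c = sig(f) * c + sig(i) * np.tanh(g)
        q = sig(o) * np.tanh(c)
        e = memories @ q                        # content-based (dot-product) scores
        a = np.exp(e - e.max()); a /= a.sum()   # softmax over the set
        r = a @ memories                        # attention readout r_t
        q_star = np.concatenate([q, r])         # q*_t = [q_t; r_t]
    return q_star                               # permutation-invariant set embedding

d, n = 8, 5
rng = np.random.default_rng(0)
W, b = rng.normal(0, 0.1, (4 * d, 2 * d)), np.zeros(4 * d)
mem = rng.normal(size=(n, d))
emb = process_block(mem, T=3, W=W, b=b, d=d)
```

Shuffling the rows of `mem` leaves `emb` unchanged, which is the permutation invariance the excerpt emphasizes.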


Tohoku University, Inui and Okazaki Lab. Sosuke Kobayashi

Page 25: Recent Progress in RNN and NLP

• Regularizations
• Dropout in RNN
• Batch Normalization in RNN
• Other Regularizations
• Multi-task learning and pre-training of encoder(-decoder)

3. Learning Tricks

Tohoku University, Inui and Okazaki Lab. Sosuke Kobayashi

Page 26: Recent Progress in RNN and NLP

• Dropout [Hinton+12, Srivastava+14]: drop nodes with probability p and multiply the rest by 1/(1-p).
• RNNs (mainly LSTM and GRU) need some tricks (a minimal sketch follows below):
• apply it at the upward (inter-layer) connections [Zaremba+14]
• apply it at the update terms [Semeniuta+16]
• use one consistent dropout mask within a sequence (the benefit still looks unclear) [Semeniuta+16, Gal15]
• Zoneout: stochastically preserve the previous c and h [Krueger+16]
• word dropout: stochastically use a zero/mean/<unk> vector as the word vector [Iyyer+15, Dai&Le15, Dyer+15, Bowman+15]

Dropout
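A minimal sketch of the update-term dropout from [Semeniuta+16] applied inside an LSTM step, with an optional per-sequence mask; the inverted-dropout scaling matches the first bullet, and the layout is my own.

```python
import numpy as np

def dropout(x, p, rng, mask=None):
    """Inverted dropout: zero units with probability p and scale the rest by
    1/(1-p). Pass a fixed `mask` to reuse one mask for a whole sequence."""
    if mask is None:
        mask = (rng.random(x.shape) >= p) / (1.0 - p)
    return x * mask, mask

def lstm_with_recurrent_dropout(xs, W, b, p, rng, per_sequence=True):
    d = b.shape[0] // 4
    sig = lambda v: 1.0 / (1.0 + np.exp(-v))
    h, c = np.zeros(d), np.zeros(d)
    seq_mask = (rng.random(d) >= p) / (1.0 - p) if per_sequence else None
    for x in xs:
        z = W @ np.concatenate([x, h]) + b
        i, f, o, g = np.split(z, 4)
        g_dropped, _ = dropout(np.tanh(g), p, rng, mask=seq_mask)
        c = sig(f) * c + sig(i) * g_dropped     # dropout on the update term only,
        h = sig(o) * np.tanh(c)                 # so stored memory is never erased
    return h, c

d, x_dim = 8, 4
rng = np.random.default_rng(0)
W, b = rng.normal(0, 0.1, (4 * d, x_dim + d)), np.zeros(4 * d)
xs = [rng.normal(size=x_dim) for _ in range(5)]
h, c = lstm_with_recurrent_dropout(xs, W, b, p=0.25, rng=rng)
```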

[Embedded excerpt from Semeniuta et al. (2016), shown on the slide, with Figure 1 illustrating where dropout is applied (dashed squares) in an RNN, an LSTM, and a GRU. Dropout is d(x) = x ∗ mask / (1 − p) with a Bernoulli mask of success probability 1 − p. For a vanilla RNN, h_t = f(W_h [x_t, d(h_{t-1})] + b_h), and the mask can be sampled per step or per sequence. For the LSTM,
(i_t, f_t, o_t, g_t) = (σ, σ, σ, f)(W [x_t, h_{t-1}] + b), c_t = f_t ∗ c_{t-1} + i_t ∗ g_t, h_t = o_t ∗ f(c_t),
the paper applies dropout to the cell-update vector only, c_t = f_t ∗ c_{t-1} + i_t ∗ d(g_t), whereas Moon et al. (2015) drop the cell values themselves, c_t = d(f_t ∗ c_{t-1} + i_t ∗ g_t), with per-sequence sampling. For the GRU, the analogous proposal is h_t = (1 − z_t) ∗ h_{t-1} + z_t ∗ d(g_t). Dropping the stored state directly is problematic: with per-step masks a unit's memory is erased very often (at rate 0.2 the probability of being zeroed within 20 steps is almost 0.99), forcing the network to behave like a vanilla RNN, while per-sequence masks leave some neurons effectively without dropout; in addition, the hidden state is repeatedly rescaled at test time, h_t = ((((h_0 + g_0)p + g_1)p + ...)p + g_t)p. Restricting dropout to the update terms avoids both issues.]

Tohoku University, Inui and Okazaki Lab. Sosuke Kobayashi

Page 27: Recent Progress in RNN and NLP

• Batch Normalization [Ioffe+15]: normalization plus a learned scale-and-shift at intermediate layers; popular for CNNs (and DNNs in general).
• RNNs need tricks [Cooijmans+16, Laurent+15] (a minimal sketch follows below):
• the mean and variance used for normalization are kept separately for each time step
• don't insert it into the cell's recurrence
• initialize the scale γ to a small value (0.1) to prevent sigmoid/tanh saturation early in training

Batch Normalization
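A minimal sketch of recurrent batch normalization following the bullets above: separate statistics at every time step, no BN inside the c_t update, and γ initialized to 0.1. For brevity it normalizes with the current minibatch's statistics only; a real implementation would also keep per-step population estimates for test time. All names are my own.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize a (batch, features) matrix with this time step's own statistics."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def bn_lstm(xs, Wx, Wh, b, d, gamma=0.1):
    """xs: list over time of (batch, x_dim) inputs. BN is applied separately to
    the input and recurrent terms at every step; the c_t update is left alone."""
    sig = lambda v: 1.0 / (1.0 + np.exp(-v))
    batch = xs[0].shape[0]
    h, c = np.zeros((batch, d)), np.zeros((batch, d))
    g_x, g_h, g_c = np.full(4 * d, gamma), np.full(4 * d, gamma), np.full(d, gamma)
    beta_c = np.zeros(d)
    for x in xs:                                  # fresh statistics at each time step
        z = batch_norm(x @ Wx, g_x, 0.0) + batch_norm(h @ Wh, g_h, 0.0) + b
        i, f, o, g = np.split(z, 4, axis=1)
        c = sig(f) * c + sig(i) * np.tanh(g)      # no BN inside the cell recurrence
        h = sig(o) * np.tanh(batch_norm(c, g_c, beta_c))
    return h, c

d, x_dim, batch, T = 8, 4, 16, 5
rng = np.random.default_rng(0)
Wx, Wh = rng.normal(0, 0.1, (x_dim, 4 * d)), rng.normal(0, 0.1, (d, 4 * d))
b = np.zeros(4 * d)
xs = [rng.normal(size=(batch, x_dim)) for _ in range(T)]
h, c = bn_lstm(xs, Wx, Wh, b, d)
```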

[Embedded excerpt from Cooijmans et al. (2016), shown on the slide. Batch normalization is introduced into the LSTM as
(f_t, i_t, o_t, g_t) = BN(W_h h_{t-1}; γ_h, β_h) + BN(W_x x_t; γ_x, β_x) + b,
c_t = σ(f_t) ⊙ c_{t-1} + σ(i_t) ⊙ tanh(g_t),
h_t = σ(o_t) ⊙ tanh(BN(c_t; γ_c, β_c)).
The recurrent term W_h h_{t-1} and the input term W_x x_t are normalized separately (with β_h = β_x = 0, leaving the bias to b), and BN is not applied inside the cell update so that the LSTM dynamics and the gradient flow through c_t are preserved. Sharing normalization statistics across time severely degrades performance, because the activation statistics during the initial transient differ from the stationary ones, so separate statistics are kept for each time step 1, ..., T_max (using the T_max statistics for longer test sequences); during training the statistics come from the minibatch, and at test time population estimates averaged over the training set are used. Section 4 shows why the initialization of γ matters: with large γ (unit pre-activation variance) the gradient norm in a simple RNN on sequential MNIST decays quickly when propagated back in time, because tanh'(x) = 1 − tanh²(x) drops well below 1 as the input standard deviation grows, whereas a small γ keeps the derivative near 1.]

[Embedded excerpt from Ioffe and Szegedy (2015), shown on the slide. Fully whitening each layer's inputs is costly and not everywhere differentiable, so two simplifications are made: (1) each scalar feature is normalized independently to zero mean and unit variance, x̂^(k) = (x^(k) − E[x^(k)]) / √Var[x^(k)], with learned parameters γ^(k), β^(k) that scale and shift the normalized value, y^(k) = γ^(k) x̂^(k) + β^(k), so the transform can represent the identity and preserve the layer's representation power; (2) the statistics are estimated per mini-batch so that they participate in backpropagation. Algorithm 1 (Batch Normalizing Transform BN_{γ,β} over a mini-batch B = {x_1...m}):
μ_B = (1/m) Σ_{i=1}^m x_i (mini-batch mean),
σ²_B = (1/m) Σ_{i=1}^m (x_i − μ_B)² (mini-batch variance),
x̂_i = (x_i − μ_B) / √(σ²_B + ε) (normalize),
y_i = γ x̂_i + β ≡ BN_{γ,β}(x_i) (scale and shift).]

The BN transform can be added to a network to manip-ulate any activation. In the notation y = BNγ,β(x), we

3

Tohoku University, Inui and Okazaki Lab. Sosuke Kobayashi

Page 28: Recent Progress in RNN and NLP

• Norm-stabilizer [Krueger+15]• Penalize the difference between the norms of hidden

vectors at successive time steps

• Temporal Coherence Loss [Jonschkowski&Brock15]• Penalize the difference between the hidden vectors

at successive time steps.

Regularization for Successiveness

Published as a conference paper at ICLR 2016

REGULARIZING RNNS BY STABILIZING ACTIVATIONS

David Krueger & Roland Memisevic

Department of Computer Science and Operations ResearchUniversity of MontrealMontreal, QC H3T 1J4, Canada{[email protected], [email protected]}

ABSTRACT

We stabilize the activations of Recurrent Neural Networks (RNNs) by penalizingthe squared distance between successive hidden states’ norms. This penalty termis an effective regularizer for RNNs including LSTMs and IRNNs, improvingperformance on character-level language modeling and phoneme recognition, andoutperforming weight noise and dropout. We achieve competitive performance(18.6% PER) on the TIMIT phoneme recognition task for RNNs evaluated withoutbeam search or an RNN transducer. With this penalty term, IRNN can achievesimilar performance to LSTM on language modeling, although adding the penaltyterm to the LSTM results in superior performance. Our penalty term also preventsthe exponential growth of IRNN’s activations outside of their training horizon,allowing them to generalize to much longer sequences.

1 INTRODUCTION

Overfitting in machine learning is addressed by restricting the space of hypotheses ( i.e. functions)considered. This can be accomplished by reducing the number of parameters or using a regularizerwith an inductive bias for simpler models, such as early stopping. More effective regularizationcan be achieved by incorporating more sophisticated prior knowledge. Keeping an RNN’s hiddenactivations on a reasonable path can be difficult, especially across long time-sequences. With thisin mind, we devise a regularizer for the state representation learned by temporal models, such asRNNs, that aims to encourage stability of the path taken through representation space. Specifically,we propose the following additional cost term for Recurrent Neural Networks (RNNs):

�1

T

TX

t=1

(khtk2 � kht�1k2)2

Where ht is the vector of hidden activations at time-step t, and � is a hyperparameter controlling theamounts of regularization. We call this penalty the norm-stabilizer, as it successfully encourages thenorms of the hiddens to be stable (i.e. approximately constant across time). Unlike the “temporalcoherence” penalty of Jonschkowski & Brock (2015), our penalty does not encourage the staterepresentation to remain constant, only its norm.

In the absence of inputs and nonlinearities, a constant norm would imply orthogonality of the hidden-to-hidden transition matrix for simple RNNs (SRNNs). However, in the case of an orthogonal tran-sition matrix, inputs and nonlinearities can still change the norm of the hidden state, resulting ininstability. This makes targeting the hidden activations directly a more attractive option for achiev-ing norm stability. Stability becomes especially important when we seek to generalize to longersequences at test time than those seen during training (the “training horizon”).

The hidden state in LSTM (Hochreiter & Schmidhuber, 1997) is usually the product of two squash-ing nonlinearities, and hence bounded. The norm of the memory cell, however, can grow linearlywhen the input, input modulation, and forget gates are all saturated at 1. Nonetheless, we find thatthe memory cells exhibit norm stability far past the training horizon, and suggest that this may bepart of what makes LSTM so successful.

1

arX

iv:1

511.

0840

0v7

[cs.N

E] 2

6 A

pr 2

016

Tohoku University, Inui and Okazaki Lab. Sosuke Kobayashi

Page 29: Recent Progress in RNN and NLP

• Auto-(sentence-)encoder and language modeling as pre-training for sentence classifications. (But, joint learning is not good.) [Dai&Le15]

• Multi-task learning for encoder-decoder.The coefficients of tasks’ losses are so important.(Multi-language translation, parsing, image captioning, auto-encoder, skip-thought vectors) [Luong+15]

Multi-task Learning

Published as a conference paper at ICLR 2016

English (unsupervised)

German (translation)

Tags (parsing)English

Figure 2: One-to-many Setting – one encoder, multiple decoders. This scheme is useful for eithermulti-target translation as in Dong et al. (2015) or between different tasks. Here, English and Ger-man imply sequences of words in the respective languages. The α values give the proportions ofparameter updates that are allocated for the different tasks.

for constituency parsing as used in (Vinyals et al., 2015a), (b) a sequence of German words for ma-chine translation (Luong et al., 2015a), and (c) the same sequence of English words for autoencodersor a related sequence of English words for the skip-thought objective (Kiros et al., 2015).

3.2 MANY-TO-ONE SETTING

This scheme is the opposite of the one-to-many setting. As illustrated in Figure 3, it consists of mul-tiple encoders and one decoder. This is useful for tasks in which only the decoder can be shared, forexample, when our tasks include machine translation and image caption generation (Vinyals et al.,2015b). In addition, from a machine translation perspective, this setting can benefit from a largeamount of monolingual data on the target side, which is a standard practice in machine translationsystem and has also been explored for neural MT by Gulcehre et al. (2015).

English (unsupervised)

Image (captioning) English

German (translation)

Figure 3: Many-to-one setting – multiple encoders, one decoder. This scheme is handy for tasks inwhich only the decoders can be shared.

3.3 MANY-TO-MANY SETTING

Lastly, as the name describes, this category is the most general one, consisting of multiple encodersand multiple decoders. We will explore this scheme in a translation setting that involves sharingmultiple encoders and multiple decoders. In addition to the machine translation task, we will includetwo unsupervised objectives over the source and target languages as illustrated in Figure 4.

3.4 UNSUPERVISED LEARNING TASKS

Our very first unsupervised learning task involves learning autoencoders from monolingual corpora,which has recently been applied to sequence to sequence learning (Dai & Le, 2015). However, inDai & Le (2015)’s work, the authors only experiment with pretraining and then finetuning, but notjoint training which can be viewed as a form of multi-task learning (MTL). As such, we are veryinterested in knowing whether the same trend extends to our MTL settings.

Additionally, we investigate the use of the skip-thought vectors (Kiros et al., 2015) in the context ofour MTL framework. Skip-thought vectors are trained by training sequence to sequence models onpairs of consecutive sentences, which makes the skip-thought objective a natural seq2seq learningcandidate. A minor technical difficulty with skip-thought objective is that the training data must

3

Published as a conference paper at ICLR 2016

English (unsupervised)

German (translation)

Tags (parsing)English

Figure 2: One-to-many Setting – one encoder, multiple decoders. This scheme is useful for eithermulti-target translation as in Dong et al. (2015) or between different tasks. Here, English and Ger-man imply sequences of words in the respective languages. The α values give the proportions ofparameter updates that are allocated for the different tasks.

for constituency parsing as used in (Vinyals et al., 2015a), (b) a sequence of German words for ma-chine translation (Luong et al., 2015a), and (c) the same sequence of English words for autoencodersor a related sequence of English words for the skip-thought objective (Kiros et al., 2015).

3.2 MANY-TO-ONE SETTING

This scheme is the opposite of the one-to-many setting. As illustrated in Figure 3, it consists of mul-tiple encoders and one decoder. This is useful for tasks in which only the decoder can be shared, forexample, when our tasks include machine translation and image caption generation (Vinyals et al.,2015b). In addition, from a machine translation perspective, this setting can benefit from a largeamount of monolingual data on the target side, which is a standard practice in machine translationsystem and has also been explored for neural MT by Gulcehre et al. (2015).

English (unsupervised)

Image (captioning) English

German (translation)

Figure 3: Many-to-one setting – multiple encoders, one decoder. This scheme is handy for tasks inwhich only the decoders can be shared.

3.3 MANY-TO-MANY SETTING

Lastly, as the name describes, this category is the most general one, consisting of multiple encodersand multiple decoders. We will explore this scheme in a translation setting that involves sharingmultiple encoders and multiple decoders. In addition to the machine translation task, we will includetwo unsupervised objectives over the source and target languages as illustrated in Figure 4.

3.4 UNSUPERVISED LEARNING TASKS

Our very first unsupervised learning task involves learning autoencoders from monolingual corpora,which has recently been applied to sequence to sequence learning (Dai & Le, 2015). However, inDai & Le (2015)’s work, the authors only experiment with pretraining and then finetuning, but notjoint training which can be viewed as a form of multi-task learning (MTL). As such, we are veryinterested in knowing whether the same trend extends to our MTL settings.

Additionally, we investigate the use of the skip-thought vectors (Kiros et al., 2015) in the context ofour MTL framework. Skip-thought vectors are trained by training sequence to sequence models onpairs of consecutive sentences, which makes the skip-thought objective a natural seq2seq learningcandidate. A minor technical difficulty with skip-thought objective is that the training data must

3

Published as a conference paper at ICLR 2016

German (translation)

English (unsupervised) German (unsupervised)

English

Figure 4: Many-to-many setting – multiple encoders, multiple decoders. We consider this schemein a limited context of machine translation to utilize the large monolingual corpora in both thesource and the target languages. Here, we consider a single translation task and two unsupervisedautoencoder tasks.

consist of ordered sentences, e.g., paragraphs. Unfortunately, in many applications that includemachine translation, we only have sentence-level data where the sentences are unordered. To addressthat, we split each sentence into two halves; we then use one half to predict the other half.

3.5 LEARNING

Dong et al. (2015) adopted an alternating training approach, where they optimize each task for afixed number of parameter updates (or mini-batches) before switching to the next task (which is adifferent language pair). In our setting, our tasks are more diverse and contain different amounts oftraining data. As a result, we allocate different numbers of parameter updates for each task, whichare expressed with the mixing ratio values αi (for each task i). Each parameter update consists oftraining data from one task only. When switching between tasks, we select randomly a new task iwith probability αi!

j αj.

Our convention is that the first task is the reference task with α1 = 1.0 and the number of trainingparameter updates for that task is prespecified to beN . A typical task iwill then be trained for αi

α1·N

parameter updates. Such convention makes it easier for us to fairly compare the same reference taskin a single-task setting which has also been trained for exactlyN parameter updates.

When sharing an encoder or a decoder, we share both the recurrent connections and the correspond-ing embeddings.

4 EXPERIMENTS

We evaluate the multi-task learning setup on a wide variety of sequence-to-sequence tasks: con-stituency parsing, image caption generation, machine translation, and a number of unsupervisedlearning as summarized in Table 1.

4.1 DATA

Our experiments are centered around the translation task, where we aim to determine whether othertasks can improve translation and vice versa. We use the WMT’15 data (Bojar et al., 2015) forthe English!German translation problem. Following Luong et al. (2015a), we use the 50K mostfrequent words for each language from the training corpus.1 These vocabularies are then sharedwith other tasks, except for parsing in which the target “language” has a vocabulary of 104 tags. Weuse newstest2013 (3000 sentences) as a validation set to select our hyperparameters, e.g., mixingcoefficients. For testing, to be comparable with existing results in (Luong et al., 2015a), we use thefiltered newstest2014 (2737 sentences)2 for the English→German translation task and newstest2015(2169 sentences)3 for the German→English task. See the summary in Table 1.

1The corpus has already been tokenized using the default tokenizer from Moses. Words not in these vocab-ularies are represented by the token <unk>.

2http://statmt.org/wmt14/test-filtered.tgz3http://statmt.org/wmt15/test.tgz

4

Tohoku University, Inui and Okazaki Lab. Sosuke Kobayashi

Page 30: Recent Progress in RNN and NLP

• Lighten Calculation of Softmax output• Copy mechanism• Character-based• Global Optimization of Decoding

4. Decoding

Tohoku University, Inui and Okazaki Lab. Sosuke Kobayashi

Page 31: Recent Progress in RNN and NLP

• Softmax over large vocabulary (class) has large time and space computational complexity.Lighten (replace) it by• Sampled Softmax• Class-factored Softmax• Hierarchical Softmax• BlackOut• Noise Constrastive Estimation; NCE• Self-normalization• Negative Sampling

Lighten Softmax

Tohoku University, Inui and Okazaki Lab. Sosuke Kobayashi

Page 32: Recent Progress in RNN and NLP

• Copy function from source sentence• [Gulcehre+16] calculates attention distribution over source

sentence(‘s LSTM outputs).(Pointer Networks [Vinyals+15])Sigmoid gate-based weighted sum ofcommon vocabulary output probability andcopy vocabulary probability distribution.

• [Gu+16]Similar, but morecomplicatedstructure

Copy Mechanism

hello , my name is Tony Jebara .

AttentiveRead

hi , Tony Jebara

<eos> hi , Tony

h1 h2 h3 h4 h5

s1 s2 s3 s4

h6 h7 h8“Tony”

DNN

Embedding for “Tony”Selective Read for “Tony”

(a) Attention-based Encoder-Decoder (RNNSearch)(c) State Update

s4

SourceVocabulary

SoftmaxProb(“Jebara”) = Prob(“Jebara”, g) + Prob(“Jebara”, c)

… ...

(b) Generate-Mode & Copy-Mode

!

M

M

Figure 1: The overall diagram of COPYNET. For simplicity, we omit some links for prediction (seeSections 3.2 for more details).

Decoder: An RNN that reads M and predictsthe target sequence. It is similar with the canoni-cal RNN-decoder in (Bahdanau et al., 2014), withhowever the following important differences

• Prediction: COPYNET predicts words basedon a mixed probabilistic model of two modes,namely the generate-mode and the copy-mode, where the latter picks words from thesource sequence (see Section 3.2);

• State Update: the predicted word at time t�1is used in updating the state at t, but COPY-NET uses not only its word-embedding butalso its corresponding location-specific hid-den state in M (if any) (see Section 3.3 formore details);

• Reading M: in addition to the attentive readto M, COPYNET also has“selective read”to M, which leads to a powerful hybrid ofcontent-based addressing and location-basedaddressing (see both Sections 3.3 and 3.4 formore discussion).

3.2 Prediction with Copying and GenerationWe assume a vocabulary V = {v1, ..., vN}, anduse UNK for any out-of-vocabulary (OOV) word.In addition, we have another set of words X , forall the unique words in source sequence X =

{x1, ..., xTS}. Since X may contain words notin V , copying sub-sequence in X enables COPY-NET to output some OOV words. In a nutshell,the instance-specific vocabulary for source X isV [ UNK [ X .

Given the decoder RNN state s

t

at time t to-gether with M, the probability of generating anytarget word y

t

, is given by the “mixture” of proba-bilities as follows

p(y

t

|st

, y

t�1, ct,M) = p(y

t

, g|st

, y

t�1, ct,M)

+ p(y

t

, c|st

, y

t�1, ct,M) (4)

where g stands for the generate-mode, and c thecopy mode. The probability of the two modes aregiven respectively by

p(y

t

, g|·)=

8>><

>>:

1

Z

e

g(yt), y

t

2 V0, y

t

2 X \ ¯

V

1

Z

e

g(UNK)y

t

62 V [ X(5)

p(y

t

, c|·)=(

1

Z

Pj:xj=yt

e

c(xj), y

t

2 X0 otherwise

(6)

where

g

(·) and

c

(·) are score functions forgenerate-mode and copy-mode, respectively, andZ is the normalization term shared by the twomodes, Z =

Pv2V[{UNK} e

g(v)+

Px2X e

c(x).

Due to the shared normalization term, the twomodes are basically competing through a softmaxfunction (see Figure 1 for an illustration with ex-ample), rendering Eq.(4) different from the canon-ical definition of the mixture model (McLachlanand Basford, 1988). This is also pictorially illus-trated in Figure 2. The score of each mode is cal-culated:

Tohoku University, Inui and Okazaki Lab. Sosuke Kobayashi

Page 33: Recent Progress in RNN and NLP

forms and their meanings is non-trivial (de Saus-sure, 1916). While some compositional relation-ships exist, e.g., morphological processes such asadding -ing or -ly to a stem have relatively reg-ular effects, many words with lexical similaritiesconvey different meanings, such as, the word pairslesson () lessen and coarse () course.

3 C2W Model

Our compositional character to word (C2W)model is based on bidirectional LSTMs (Gravesand Schmidhuber, 2005), which are able tolearn complex non-local dependencies in sequencemodels. An illustration is shown in Figure 1. Theinput of the C2W model (illustrated on bottom) isa single word type w, and we wish to obtain isa d-dimensional vector used to represent w. Thismodel shares the same input and output of a wordlookup table (illustrated on top), allowing it to eas-ily replace then in any network.

As input, we define an alphabet of charactersC. For English, this vocabulary would contain anentry for each uppercase and lowercase letter aswell as numbers and punctuation. The input wordw is decomposed into a sequence of charactersc1, . . . , cm, where m is the length of w. Each c

i

is defined as a one hot vector 1ci , with one on the

index of ci

in vocabulary M . We define a projec-tion layer P

C

2 RdC⇥|C|, where dC

is the numberof parameters for each character in the characterset C. This of course just a character lookup table,and is used to capture similarities between charac-ters in a language (e.g., vowels vs. consonants).Thus, we write the projection of each input char-acter c

i

as eci = P

C

· 1ci .

Given the input vectors x1, . . . ,xm

, a LSTMcomputes the state sequence h1, . . . ,hm+1 by it-eratively applying the following updates:

i

t

= �(Wix

x

t

+W

ih

h

t�1 +W

ic

c

t�1 + b

i

)

f

t

= �(Wfx

x

t

+W

fh

h

t�1 +W

fc

c

t�1 + b

f

)

c

t

= f

t

� c

t�1+

i

t

� tanh(W

cx

x

t

+W

ch

h

t�1 + b

c

)

o

t

= �(Wox

x

t

+W

oh

h

t�1 +W

oc

c

t

+ b

o

)

h

t

= o

t

� tanh(c

t

),

where � is the component-wise logistic sig-moid function, and � is the component-wise(Hadamard) product. LSTMs define an extra cellmemory c

t

, which is combined linearly at each

cats

cat

cats

job

....

....

........

cats

c a t s

a

c

t

....

....

s

CharacterLookupTable

........

Word Lookup Table

Bi-LSTM

embeddings for word "cats"

embeddings for word "cats"

Figure 1: Illustration of the word lookup tables(top) and the lexical Composition Model (bottom).Square boxes represent vectors of neuron activa-tions. Shaded boxes indicate that a non-linearity.

timestamp t. The information that is propagatedfrom c

t�1 to c

t

is controlled by the three gates it

,f

t

, and o

t

, which determine the what to includefrom the input x

t

, the what to forget from c

t�1 andwhat is relevant to the current state h

t

. We writeW to refer to all parameters the LSTM (W

ix

,W

fx

, bf

, . . . ). Thus, given a sequence of charac-ter representations e

C

c1, . . . , eC

cmas input, the for-

ward LSTM, yields the state sequence sf0 , . . . , sf

m

,while the backward LSTM receives as input the re-verse sequence, and yields states sb

m

, . . . , sb0. BothLSTMs use a different set of parameters Wf andWb. The representation of the word w is obtainedby combining the forward and backward states:

e

C

w

= D

f

s

f

m

+D

b

s

b

0 + b

d

,

where D

f , Db and b

d

are parameters that deter-

• In/output chunk is a character, not a defined word• Language model, various task’s input features,

machine translation’s decoding LM: [Sutskever+11, Graves13, Ling+15a, Kim+15], MT: [Chung+16, Ling+15b, Costa-jussa&Fonollosa16, Luong+16]

• Combination of words and characters[Kang+11, Józefowicz+16, Miyamoto&Cho16]

• (Not only RNN composition,but also CNN)

• Good in terms of morphologyand rare word problem

Character-based

Tohoku University, Inui and Okazaki Lab. Sosuke Kobayashi

Page 34: Recent Progress in RNN and NLP

Figure 1: Illustration of the Scheduled Sampling approach,where one flips a coin at every time step to decide to use thetrue previous token or one sampled from the model itself.

Figure 2: Examples of decayschedules.

We thus propose to use a schedule to decrease ✏i as a function of i itself, in a similar manner usedto decrease the learning rate in most modern stochastic gradient descent approaches. Examples ofsuch schedules can be seen in Figure 2 as follows:

• Linear decay: ✏i = max(✏, k � ci) where 0 ✏ < 1 is the minimum amount of truth to begiven to the model and k and c provide the offset and slope of the decay, which depend onthe expected speed of convergence.

• Exponential decay: ✏i = k

i where k < 1 is a constant that depends on the expected speedof convergence.

• Inverse sigmoid decay: ✏i = k/(k+exp(i/k)) where k � 1 depends on the expected speedof convergence.

We call our approach Scheduled Sampling. Note that when we sample the previous token yt�1 fromthe model itself while training, we could back-propagate the gradient of the losses at times t ! T

through that decision. This was not done in the experiments described in this paper and is left forfuture work.

3 Related Work

The discrepancy between the training and inference distributions has already been noticed in theliterature, in particular for control and reinforcement learning tasks.

SEARN [9] was proposed to tackle problems where supervised training examples might be differentfrom actual test examples when each example is made of a sequence of decisions, like acting in acomplex environment where a few mistakes of the model early in the sequential decision processmight compound and yield a very poor global performance. Their proposed approach involves ameta-algorithm where at each meta-iteration one trains a new model according to the current policy(essentially the expected decisions for each situation), applies it on a test set and modifies the nextiteration policy in order to account for the previous decisions and errors. The new policy is thus acombination of the previous one and the actual behavior of the model.

In comparison to SEARN and related ideas [10, 11], our proposed approach is completely online: asingle model is trained and the policy slowly evolves during training, instead of a batch approach,which makes it much faster to train3 Furthermore, SEARN has been proposed in the context ofreinforcement learning, while we consider the supervised learning setting trained using stochasticgradient descent on the overall objective.

Other approaches have considered the problem from a ranking perspective, in particular for parsingtasks [12] where the target output is a tree. In this case, the authors proposed to use a beam searchboth during training and inference, so that both phases are aligned. The training beam is used to find

3In fact, in the experiments we report in this paper, our proposed approach was not meaningfully slower(nor faster) to train than the baseline.

4

• Reinventing the wheel (of non-NN research)?(Even so, these are useful and good next steps.)

• Use model’s prediction as next input in training (while usually only true input is used) [Bengio+15]• Similar to DAgger [Daumé III16; Blog]• Use dynamic oracle [Daumé III16; Blog]

[Ballesteros+16,Goldberg&Nivre13]

Global Decoding

Tohoku University, Inui and Okazaki Lab. Sosuke Kobayashi

Page 35: Recent Progress in RNN and NLP

• REINFORCE to optimize BLEU/ROUGE [Ranzato+15]

• Minimum Risk Training [Shen+15, Ayana+16]• Optimization for beam search [Wiseman&Rush16]

Global Decoding

Published as a conference paper at ICLR 2016

h2 = �✓

(�, h1)

p✓

(w|�, h1)

XENT

h1

w2 w3XENT

top-kw0

1,...,k

p✓

(w|w01,...,k

, h2) w001,...,k

h3 = �✓

(w01,...,k

, h2)

top-k

Figure 3: Illustration of the End-to-End BackProp method. The first steps of the unrolled sequence(here just the first step) are exactly the same as in a regular RNN trained with cross-entropy. How-ever, in the remaining steps the input to each module is a sparse vector whose non-zero entries arethe k largest probabilities of the distribution predicted at the previous time step. Errors are back-propagated through these inputs as well.

While this algorithm is a simple way to expose the model to its own predictions, the loss functionoptimized is still XENT at each time step. There is no explicit supervision at the sequence levelwhile training the model.

3.2 SEQUENCE LEVEL TRAINING

We now introduce a novel algorithm for sequence level training, which we call Mixed IncrementalCross-Entropy Reinforce (MIXER). The proposed method avoids the exposure bias problem, andalso directly optimizes for the final evaluation metric. Since MIXER is an extension of the REIN-FORCE algorithm, we first describe REINFORCE from the perspective of sequence generation.

3.2.1 REINFORCE

In order to apply the REINFORCE algorithm (Williams, 1992; Zaremba & Sutskever, 2015) to theproblem of sequence generation we cast our problem in the reinforcement learning (RL) frame-work (Sutton & Barto, 1988). Our generative model (the RNN) can be viewed as an agent, whichinteracts with the external environment (the words and the context vector it sees as input at everytime step). The parameters of this agent defines a policy, whose execution results in the agent pick-ing an action. In the sequence generation setting, an action refers to predicting the next word inthe sequence at each time step. After taking an action the agent updates its internal state (the hid-den units of RNN). Once the agent has reached the end of a sequence, it observes a reward. Wecan choose any reward function. Here, we use BLEU (Papineni et al., 2002) and ROUGE-2 (Lin& Hovy, 2003) since these are the metrics we use at test time. BLEU is essentially a geometricmean over n-gram precision scores as well as a brevity penalty (Liang et al., 2006); in this work, weconsider up to 4-grams. ROUGE-2 is instead recall over bi-grams. Like in imitation learning, wehave a training set of optimal sequences of actions. During training we choose actions according tothe current policy and only observe a reward at the end of the sequence (or after maximum sequencelength), by comparing the sequence of actions from the current policy against the optimal actionsequence. The goal of training is to find the parameters of the agent that maximize the expectedreward. We define our loss as the negative expected reward:

L✓

= �X

w

g1 ,...,w

gT

p✓

(wg

1 , . . . , wg

T

)r(wg

1 , . . . , wg

T

) = �E[wg1 ,...w

gT ]⇠p✓

r(wg

1 , . . . , wg

T

), (9)

where wg

n

is the word chosen by our model at the n-th time step, and r is the reward associatedwith the generated sequence. In practice, we approximate this expectation with a single samplefrom the distribution of actions implemented by the RNN (right hand side of the equation aboveand Figure 9 of Supplementary Material). We refer the reader to prior work (Zaremba & Sutskever,2015; Williams, 1992) for the full derivation of the gradients. Here, we directly report the partialderivatives and their interpretation. The derivatives w.r.t. parameters are:

@L✓

@✓=

X

t

@L✓

@ot

@ot

@✓(10)

6

Published as a conference paper at ICLR 2016

h2 = �✓

(�, h1)

p✓

(w|�, h1)

XENT

h1

w2 w3XENT

top-kw0

1,...,k

p✓

(w|w01,...,k

, h2) w001,...,k

h3 = �✓

(w01,...,k

, h2)

top-k

Figure 3: Illustration of the End-to-End BackProp method. The first steps of the unrolled sequence(here just the first step) are exactly the same as in a regular RNN trained with cross-entropy. How-ever, in the remaining steps the input to each module is a sparse vector whose non-zero entries arethe k largest probabilities of the distribution predicted at the previous time step. Errors are back-propagated through these inputs as well.

While this algorithm is a simple way to expose the model to its own predictions, the loss functionoptimized is still XENT at each time step. There is no explicit supervision at the sequence levelwhile training the model.

3.2 SEQUENCE LEVEL TRAINING

We now introduce a novel algorithm for sequence level training, which we call Mixed IncrementalCross-Entropy Reinforce (MIXER). The proposed method avoids the exposure bias problem, andalso directly optimizes for the final evaluation metric. Since MIXER is an extension of the REIN-FORCE algorithm, we first describe REINFORCE from the perspective of sequence generation.

3.2.1 REINFORCE

In order to apply the REINFORCE algorithm (Williams, 1992; Zaremba & Sutskever, 2015) to theproblem of sequence generation we cast our problem in the reinforcement learning (RL) frame-work (Sutton & Barto, 1988). Our generative model (the RNN) can be viewed as an agent, whichinteracts with the external environment (the words and the context vector it sees as input at everytime step). The parameters of this agent defines a policy, whose execution results in the agent pick-ing an action. In the sequence generation setting, an action refers to predicting the next word inthe sequence at each time step. After taking an action the agent updates its internal state (the hid-den units of RNN). Once the agent has reached the end of a sequence, it observes a reward. Wecan choose any reward function. Here, we use BLEU (Papineni et al., 2002) and ROUGE-2 (Lin& Hovy, 2003) since these are the metrics we use at test time. BLEU is essentially a geometricmean over n-gram precision scores as well as a brevity penalty (Liang et al., 2006); in this work, weconsider up to 4-grams. ROUGE-2 is instead recall over bi-grams. Like in imitation learning, wehave a training set of optimal sequences of actions. During training we choose actions according tothe current policy and only observe a reward at the end of the sequence (or after maximum sequencelength), by comparing the sequence of actions from the current policy against the optimal actionsequence. The goal of training is to find the parameters of the agent that maximize the expectedreward. We define our loss as the negative expected reward:

L✓

= �X

w

g1 ,...,w

gT

p✓

(wg

1 , . . . , wg

T

)r(wg

1 , . . . , wg

T

) = �E[wg1 ,...w

gT ]⇠p✓

r(wg

1 , . . . , wg

T

), (9)

where wg

n

is the word chosen by our model at the n-th time step, and r is the reward associatedwith the generated sequence. In practice, we approximate this expectation with a single samplefrom the distribution of actions implemented by the RNN (right hand side of the equation aboveand Figure 9 of Supplementary Material). We refer the reader to prior work (Zaremba & Sutskever,2015; Williams, 1992) for the full derivation of the gradients. Here, we directly report the partialderivatives and their interpretation. The derivatives w.r.t. parameters are:

@L✓

@✓=

X

t

@L✓

@ot

@ot

@✓(10)

6

We can optimize the loss L using a two-step pro-cess: (1) in a forward pass, we compute candidatesets St and record margin violations (sequences withnon-zero loss); (2) in a backward pass, we back-propagate the errors through the seq2seq RNNs. Un-like standard seq2seq training, the first-step requiresrunning search (in our case beam search) to findmargin violations. The second step can be doneby adapting back-propagation through time (BPTT).We next discuss the details of this process.

4.2 Forward: Find Violations

In order to minimize this loss, we need to spec-ify a procedure for constructing candidate sequencesy(K)1:t at each time step t so that we find margin vi-

olations. We follow LaSO (rather than early-update2; see Section 2) and build candidates in a recursivemanner. If there was no margin violation at t�1,then St is constructed using a standard beam searchupdate. If there was a margin violation, St is con-structed as the K best sequences assuming the goldhistory y1:t�1 through time-step t�1.

Formally, let the function succ map a sequencew1:t�1 2Vt�1 to the set of all valid sequences oflength t that can be formed by appending a validword w2V onto the end of w1:t�1. In the simplest,unconstrained case, we will have

succ(w1:t�1) = {w1:t�1, w | w 2 V}.

Note, however, that for some problems it may bepreferable to define a succ function which imposeshard constraints on successor sequences. For in-stance, if we would like to use seq2seq models forparsing (by emitting a constituency or dependencystructure encoded into a sequence in some way),we will have hard constraints on the sequences themodel can output, namely, that they represent validparses. While hard constraints such as these wouldbe difficult to add to standard seq2seq at trainingtime, in our framework they can naturally be addedto the succ function, allowing us to train with hardconstraints; we experiment along these lines in Sec-tion 5.3.

2We found that training with early update rather than (de-layed) LaSO did not work well, even after pre-training. Giventhe success of early update in many NLP tasks this was some-what surprising. We leave this question to future work.

Time Step

a red dog smells home today

the dog dog barks quickly Friday

red blue cat barks straight now

runs today

a red dog runs quickly today

blue dog barks home today

Figure 1: Top: possible y(k)1:t formed in training with a

beam of size 3 and with gold sequence y1:6 = “a red dogruns quickly today”. The gold sequence is highlighted inyellow, and the predicted prefixes involved in margin vio-lations (at t=4 and t=6) are in gray. Note that time-stepT =6 uses a different loss criterion. Bottom: prefixes thatactually participate in the loss, arranged to illustrate theback-propagation process.

Having defined an appropriate succ function, wecan specify the candidate set as:

St = topK

(succ(y1:t�1) violation at t�1

SKk=1 succ(y

(k)1:t�1) otherwise,

where we have a margin violation at t�1 ifff(yt�1,ht�2) < f(y

(k)t�1,

ˆ

h

(k)t�2)+1, and where topK

considers the scores given by f . This search proce-dure is illustrated in the top portion of Figure 1.

In the forward pass of our training algorithm,shown as the first part of Algorithm 1, we run thisversion of beam search and collect all sequences andtheir hidden states that lead to losses.

4.3 Backward: Merge SequencesOnce we have collected margin violations we canrun standard back-propagation through time to com-pute parameter updates. Assume a margin viola-tion occurs at time-step t between the predicted his-tory y

(K)1:t and the gold history y1:t. As in standard

seq2seq training we must back-propagate this errorthrough the gold history; however, unlike seq2seqwe also have a gradient for the wrongly predictedhistory.

In the worst case, there is one violation at eachtime-step, which could lead to T independent se-quences. Since we need to call BRNNO(T ) timesfor each sequence, naively running this every timethere was a violation could lead to an O(T 2

) back-ward pass, rather than the O(T ) time required forthe standard seq2seq approach.

Tohoku University, Inui and Okazaki Lab. Sosuke Kobayashi

Page 36: Recent Progress in RNN and NLP

• Better RNN’s units and connections are produced,however, their impacts are small now(compared to ones between “vanilla and LSTM” or “1-layer or multi-layer”.)

• Analysis is more needed in general and each task• Designing models with (reasonable) idea may good

result, e.g., tree composition• Regularization and learning tricks increased• Other decoding training or inference algorithms are

required

Summary

Tohoku University, Inui and Okazaki Lab. Sosuke Kobayashi