
deep learning for natural language processing

Sergey I. Nikolenko1

AIST 2017, Moscow, July 28, 2017

1Steklov Institute of Mathematics at St. Petersburg

Random facts:

• on July 28, 1814, Percy Bysshe Shelley, married to Harriet Westbrook at the time, ran from England to France with a 16-year-old Mary Wollstonecraft Godwin, later Mary Shelley

• on July 28, 1974, KGB chairman Yuri Andropov created the Alpha Group (Spetsgruppa A)

• July 28 -- Russia's Baptism Day; the Duma passed this bill on May 21, 2010

• July 28 -- World Hepatitis Day; World Health Assembly passed this resolution on May 21, 2010

plan

• The deep learning revolution has not left natural language processing alone.

• DL in NLP started with standard architectures (RNNs, CNNs) but has since branched out into new directions.

• Our plan for today:
(1) a very brief intro to neural networks and deep learning;
(2) an intro to distributed word representations;
(3) a primer on sentence embeddings and character-level models;
(4) a (very, very) brief overview of the most promising directions in modern NLP based on deep learning.

• We will concentrate on directions that have given rise to new models and architectures.

2


neural networks

machine learning as optimization

• Machine learning has many kinds of problems: supervised, unsupervised, and so on.

• They are usually solved with Bayes' theorem, recomputing our prior ideas about parameters into posteriors:

$p(\theta \mid D) = \frac{p(\theta)\, p(D \mid \theta)}{p(D)}.$

4

machine learning as optimization

• This usually means optimizing a multivariate function.
• Algorithmically, machine learning is about approximation and optimization of complicated functions (likelihoods, posteriors, error functions, etc.).

• A deep neural network is a big and complicated function of many variables, highly nonconvex.

• For nonconvex functions this is usually done with some version of gradient descent.

4

gradient descent

• Gradient descent is the main and virtually only way to optimize really complicated functions.

• We take the gradient ∇E(w), move a little in the direction opposite to it, and repeat.
• And in neural networks gradients are easy to compute because NNs are just huge compositions of very simple elements.
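As a toy illustration of this update rule (not from the slides), here is gradient descent in plain Python on a one-dimensional quadratic; the function, learning rate, and step count are made up for the example:

```python
# Minimal sketch: minimize E(w) = (w - 3)^2 by repeatedly stepping
# against the gradient dE/dw = 2(w - 3).
def grad_E(w):
    return 2.0 * (w - 3.0)

w, learning_rate = 0.0, 0.1
for _ in range(100):
    w -= learning_rate * grad_E(w)   # move opposite to the gradient
print(w)  # approaches the minimum at w = 3
```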

5

perceptron

• The perceptron is the building block of neural networks: $y = h(w^\top x) = h\left(\sum_i w_i x_i\right)$.

• Different nonlinear functions:
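A minimal sketch of this building block (not from the slides), with a few commonly used choices of the nonlinearity h; the weights and input are illustrative:

```python
import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))
def relu(z):    return np.maximum(0.0, z)

w = np.array([0.5, -1.0, 2.0])      # weights
x = np.array([1.0, 0.5, 0.25])      # inputs
z = w @ x                           # w^T x = sum_i w_i x_i
print(sigmoid(z), np.tanh(z), relu(z))   # different nonlinear functions h
```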

6

uniting perceptrons into networks

• A network of perceptrons.
• Hornik, 1990: two levels suffice to approximate any function.
• But deep networks are more efficient — the power of distributed representations.

• Usually neural networks are organized into layers.

7

computational graph

• Suppose we have represented a complex function as a composition of simple ones.

• E.g., $f(x, y) = x^2 + xy + (x + y)^2$:

8

computational graph

• We can now compute derivatives on the computational graph.
• Forward propagation: derivatives of every node with respect to a given one (an input).

• Backprop: derivatives of one node (the output) with respect to every other.
• We can now compute gradients of arbitrarily complicated functions.

• Backprop gets us derivatives of the error function for gradient descent.

• theano, TensorFlow, Torch are automatic differentiation libraries.
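A hand-rolled sketch of forward and backward passes on the computational graph of f(x, y) = x² + xy + (x + y)² (the example above), written without any library to show that backprop is just the chain rule applied node by node:

```python
def f_and_grads(x, y):
    # forward pass: build the graph node by node
    a = x * x            # a = x^2
    b = x * y            # b = x*y
    c = x + y            # c = x + y
    d = c * c            # d = (x + y)^2
    f = a + b + d
    # backward pass: df/d(node) for every node, then into the inputs
    da = db = dd = 1.0
    dc = 2.0 * c * dd
    dx = 2.0 * x * da + y * db + dc
    dy = x * db + dc
    return f, dx, dy

print(f_and_grads(2.0, 3.0))  # (35.0, 17.0, 12.0): df/dx = 2x + y + 2(x+y), df/dy = x + 2(x+y)
```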

8

gradient descent in nns

• A neural network has a lot of parameters, so training requires:
  • regularization (L2, L1, dropout);
  • initialization (unsupervised pretraining, random init);
  • normalization (batchnorm, layer normalization);
  • improved gradient descent (momentum, Nesterov momentum, Adadelta, Adam).

• Two main classes of architectures, both from the 1980s:
  • convolutional neural networks;
  • recurrent neural networks.

9

gradient descent in nns

• A neural network has a lot of parameters:
  • L2- and L1-regularization;
  • early stopping;
  • dropout.

10

gradient descent in nns

• These parameters have to be initialized:
  • unsupervised pretraining;
  • random initialization:
    • Xavier initialization (Glorot, Bengio, 2010);
    • He initialization (He et al., 2015);
  • the idea is to control variances so that the gradients do not vanish along the layers.

10

gradient descent in nns

• Batch normalization:
  • covariate shift: the next layer's inputs change whenever the previous layer changes;
  • it would be good to normalize over the whole dataset, but that is too hard;
  • hence, normalize by mini-batches, with additional parameters γ, β for more expressivity.
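A minimal numpy sketch of the batch-normalization forward pass over one mini-batch (shapes and epsilon are illustrative); γ and β are the extra learnable parameters mentioned above:

```python
import numpy as np

def batchnorm_forward(X, gamma, beta, eps=1e-5):
    mu = X.mean(axis=0)                      # per-feature mini-batch mean
    var = X.var(axis=0)                      # per-feature mini-batch variance
    X_hat = (X - mu) / np.sqrt(var + eps)    # normalize
    return gamma * X_hat + beta              # scale and shift

X = np.random.randn(32, 4)                   # batch of 32 examples, 4 features
out = batchnorm_forward(X, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(6), out.std(axis=0).round(3))
```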

10

gradient descent in nns

• Gradient descent improvements:
  • gradually reducing the learning rate;
  • momentum, Nesterov momentum;
  • adaptive gradient descent (Adagrad, Adadelta, RMSprop, Adam).

10


convolutional neural networks

• Basic convolutional layer: convolution, nonlinearity, pooling.

• Classical architecture – LeNet:

11


convolutional neural networks

• Convolutional nets are usually very deep (AlexNet):

• Usually trained on ImageNet (> 15M images, ≈ 22K classes).

11


convolutional neural networks

• VGG: it suffices to take 3 × 3 convolutions:

• Network in Network, Inception:

11

convolutional neural networks

• Residual connections (ResNet) are a very important idea; they led to a "revolution of depth".

11

convolutional neural networks

• Other objective functions can be added (e.g., segmentation):

• In general, it is very easy to combine architectures: just concatenate inputs or add parallel layers that do something new.

• We will see lots of examples of this in NLP.

11


recurrent neural networks

• Various kinds of problems are based on sequences:

• A “simple” RNN:

12

recurrent neural networks

• Bidirectional RNN:

• Two problems with RNNs:
  • exploding gradients;
  • vanishing gradients.

12


recurrent neural networks

• Solving exploding gradients is easy: simply clip the gradients.
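A minimal sketch of one common way to "simply clip the gradients", clipping by global norm; the threshold is an illustrative choice:

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-8))   # shrink only if the norm is too large
    return [g * scale for g in grads]

grads = [np.random.randn(10, 10) * 100, np.random.randn(10) * 100]
clipped = clip_by_global_norm(grads)
print(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))  # at most ~5.0
```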

12

recurrent neural networks

• Vanishing gradients: you always kind of "multiply by W", by the matrix of recurrent weights.

• This does not let us implement long-term memory.

• We need to let the gradients flow freely.

12


recurrent neural networks

• Constant error carousel – LSTM:
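A minimal numpy sketch of one LSTM step in its standard formulation (the weight shapes, gate ordering, and initialization here are illustrative, not taken from the slides); the additive cell update is the "constant error carousel":

```python
import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    # W: (4H, D), U: (4H, H), b: (4H,) hold the input, forget, output
    # and candidate ("g") transformations stacked together.
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i, f, o = sigmoid(z[:H]), sigmoid(z[H:2*H]), sigmoid(z[2*H:3*H])
    g = np.tanh(z[3*H:])
    c = f * c_prev + i * g        # additive cell update lets gradients flow
    h = o * np.tanh(c)
    return h, c

H, D = 8, 5
W, U, b = np.random.randn(4*H, D) * 0.1, np.random.randn(4*H, H) * 0.1, np.zeros(4*H)
h, c = lstm_step(np.random.randn(D), np.zeros(H), np.zeros(H), W, U, b)
print(h.shape, c.shape)   # (8,) (8,)
```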

12


recurrent neural networks

• Or simplified constructions, e.g., GRU:

12


a typical recurrent network

• A simple seq2seq architecture (Sutskever et al. 2014):

13

a typical recurrent network

• A slightly more complicated seq2seq architecture:

13

autoencoders

• Autoencoders: how to do unsupervised learning with NNs by reconstructing the input at the output:

• Why can’t the network simply copy input to output?

14

autoencoders

• Undercomplete vs. overcomplete autoencoders.
• Regularization helps a lot, especially dropout.
• Denoising autoencoders: add noise to the input, reconstruct the true input:

14

other stuff

• Many other directions we have not touched:
  • deep reinforcement learning;
  • generative networks, in particular GANs;
  • Bayesian inference and neural networks;
  • …

• But now let us proceed to NLP.

15


nlp problems

nlp problems

• Syntactic, more or less well-defined problems:
  • part-of-speech tagging;
  • morphological segmentation;
  • stemming or lemmatization;
  • sentence boundary disambiguation;
  • word segmentation;
  • named entity recognition;
  • word sense disambiguation;
  • syntactic parsing;
  • coreference resolution.

17

nlp problems

• But this also requires understanding in hard cases:

• Anaphora resolution:
  • «the laptop did not fit in the bag because it was too small»;
  • «the laptop did not fit in the bag because it was too big».

17

nlp problems

• More complex problems that require understanding even more often, but we still know the right answers and can get quality metrics:
  • language models;
  • sentiment analysis;
  • relationship extraction, fact extraction;
  • question answering.

17

nlp problems

• Problems where we not only understand text but try to generate new text:
  • text generation per se;
  • automatic summarization;
  • machine translation;
  • dialog and conversational models.

• There are deep learning models for all these problems.

17

word embeddings, sentence embeddings, and character-level models

word embeddings

• Distributional hypothesis in linguistics: words with similar meaning will occur in similar contexts.

• Distributed word representations map words to a Euclidean space (usually of dimension several hundred):
  • started in earnest in (Bengio et al. 2003; 2006), although there were earlier ideas;
  • word2vec (Mikolov et al. 2013): train weights that serve best for simple prediction tasks between a word and its context: continuous bag-of-words (CBOW) and skip-gram;
  • GloVe (Pennington et al. 2014): train word weights to decompose the (log) cooccurrence matrix.

19

word embeddings

• The CBOW word2vec model operates as follows:
  • inputs are one-hot word representations of dimension V;
  • the hidden layer is the matrix of vector embeddings W;
  • the hidden layer's output is the average of the input vectors;
  • as output we get an estimate u_j for each word, and

$\hat{p}(i \mid c_1, \ldots, c_n) = \frac{\exp(u_i)}{\sum_{j'=1}^{V} \exp(u_{j'})}.$

19
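A minimal numpy sketch of the CBOW forward pass just described: average the context word embeddings, score every vocabulary word, and apply a softmax; all sizes and the random weights are illustrative:

```python
import numpy as np

V, d = 1000, 50                        # vocabulary size, embedding dimension
W_in = np.random.randn(V, d) * 0.01    # the embedding matrix ("hidden layer")
W_out = np.random.randn(d, V) * 0.01   # output weights

context_ids = [3, 17, 42, 911]         # one-hot inputs given as word indices
h = W_in[context_ids].mean(axis=0)     # average of the context vectors
u = h @ W_out                          # a score u_j for every word j
p = np.exp(u - u.max()); p /= p.sum()  # softmax over the vocabulary
print(p.shape, p.sum())                # (1000,) 1.0
```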

word embeddings

• In skip-gram, it's the opposite:
  • we predict each context word from the central word;
  • so now there are several multinomial distributions, one softmax for each context word:

$\hat{p}(c_k \mid i) = \frac{\exp(u_{k, c_k})}{\sum_{j'=1}^{V} \exp(u_{j'})}.$

19

word embeddings

• How do we train a model like that?
• E.g., in skip-gram we choose θ to maximize

$L(\theta) = \prod_{i \in D} \left( \prod_{c \in C(i)} p(c \mid i; \theta) \right) = \prod_{(i,c) \in D} p(c \mid i; \theta),$

and we parametrize

$p(c \mid i; \theta) = \frac{\exp(\tilde{w}_c^\top w_i)}{\sum_{c'} \exp(\tilde{w}_{c'}^\top w_i)}.$

20

word embeddings

• This leads to the total likelihood

$\arg\max_\theta \prod_{(i,c) \in D} p(c \mid i; \theta) = \arg\max_\theta \sum_{(i,c) \in D} \log p(c \mid i; \theta) = \arg\max_\theta \sum_{(i,c) \in D} \left( \tilde{w}_c^\top w_i - \log \sum_{c'} \exp(\tilde{w}_{c'}^\top w_i) \right),$

which we maximize with negative sampling.
• Question: why do we need separate $\tilde{w}$ and $w$ vectors?
• Live demo: nearest neighbors, simple geometric relations.
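A minimal numpy sketch of the negative-sampling version of this objective for a single (word, context) pair: pull the true pair together and push k randomly sampled "negative" contexts away. The sizes, k, and the random vectors are illustrative:

```python
import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

V, d, k = 1000, 50, 5
W  = np.random.randn(V, d) * 0.01     # word vectors w
Wt = np.random.randn(V, d) * 0.01     # separate context vectors w~

def neg_sampling_loss(i, c, rng):
    negatives = rng.integers(0, V, size=k)                  # k random negative contexts
    pos = np.log(sigmoid(Wt[c] @ W[i]))                     # true pair: push together
    neg = np.sum(np.log(sigmoid(-Wt[negatives] @ W[i])))    # negatives: push apart
    return -(pos + neg)                                     # negative log-likelihood to minimize

print(neg_sampling_loss(3, 17, np.random.default_rng(0)))
```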

20

glove

• GloVe: we are trying to approximate the cooccurrence matrix $X \in \mathbb{R}^{V \times V}$:

$p_{ij} = p(j \mid i) = \frac{X_{ij}}{X_i} = \frac{X_{ij}}{\sum_k X_{ik}}.$

• More precisely, the ratios $\frac{p_{ij}}{p_{kj}}$.

• Example from the Russian wiki (number of occurrences of word k, total and together with клуб "club" and команда "team", the corresponding probabilities, and their ratio):

Word k     Total     with клуб   with команда   p(k | клуб), ×10⁻⁴   p(k | команда), ×10⁻⁴   p(k | клуб) / p(k | команда)
футбол     29988     54          34             18.0                 11.3                    1.588
хоккей     10957     16          7              14.6                 6.39                    2.286
гольф      2721      11          1              40.4                 3.68                    11.0
корабль    100127    0           30             0.0                  3.00                    0.0

21

glove

• We train a function $F(w_i, w_j; \tilde{w}_k) = \frac{p_{ij}}{p_{kj}}$.
• Even simpler, we train

$F\left((w_i - w_j)^\top \tilde{w}_k\right) = \frac{F(w_i^\top \tilde{w}_k)}{F(w_j^\top \tilde{w}_k)} = \frac{p_{ij}}{p_{kj}}.$

• This should actually be an exponent:

$w_i^\top \tilde{w}_k = \log(p_{ik}) = \log(X_{ik}) - \log(X_i).$

• We can hide $\log(X_i)$ in bias terms $b_i$.
• And the objective function for GloVe will be

$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2.$
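A minimal numpy sketch of evaluating the objective J above on a tiny random cooccurrence matrix; the weighting function f with its x_max and α, and all sizes, are illustrative choices:

```python
import numpy as np

V, d = 20, 8
rng = np.random.default_rng(0)
X = rng.integers(0, 100, size=(V, V)).astype(float)     # toy cooccurrence counts
W, Wt = rng.normal(0, 0.1, (V, d)), rng.normal(0, 0.1, (V, d))
b, bt = np.zeros(V), np.zeros(V)

def f(x, x_max=100.0, alpha=0.75):
    return np.minimum(1.0, (x / x_max) ** alpha)         # standard GloVe-style weighting

mask = X > 0                                             # only nonzero cooccurrences count
diff = W @ Wt.T + b[:, None] + bt[None, :] - np.log(np.where(mask, X, 1.0))
J = np.sum(f(X) * mask * diff ** 2)
print(J)
```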

21

how to use word vectors

• Next we can use recurrent architectures on top of word vectors.
• E.g., LSTMs for sentiment analysis:

• Train a network of LSTMs for language modeling, then use either the last output or averaged hidden states for sentiment.

22


how to use word vectors

• Or a CNN with one-dimensional convolutions:

22

up and down from word embeddings

• Word embeddings are the first step of most DL models in NLP.
• But we can go both up and down from word embeddings.
• First, a sentence is not necessarily the sum of its words.
• Second, a word is not quite as atomic as the word2vec model would like to think.

23

sentence embeddings

• How do we combine word vectors into "text chunk" vectors?
• The simplest idea is to use the sum and/or mean of word embeddings to represent a sentence/paragraph:
  • a baseline in (Le and Mikolov 2014);
  • a reasonable method for short phrases in (Mikolov et al. 2013);
  • shown to be effective for document summarization in (Kageback et al. 2014).
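A minimal sketch of that simplest baseline, averaging word vectors into a sentence vector; the toy vocabulary and random embeddings are illustrative:

```python
import numpy as np

d = 50
rng = np.random.default_rng(0)
embeddings = {w: rng.normal(size=d) for w in "the cat sat on mat".split()}

def sentence_vector(tokens, embeddings, d=d):
    vecs = [embeddings[t] for t in tokens if t in embeddings]  # skip unknown words
    return np.mean(vecs, axis=0) if vecs else np.zeros(d)

print(sentence_vector("the cat sat on the mat".split(), embeddings).shape)  # (50,)
```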

24

sentence embeddings

• How do we combine word vectors into "text chunk" vectors?
• Distributed Memory Model of Paragraph Vectors (PV-DM) (Le and Mikolov 2014):
  • a sentence/paragraph vector is an additional vector for each paragraph;
  • it acts as a "memory" that provides longer context;

24

sentence embeddings

• How do we combine word vectors into "text chunk" vectors?
• Distributed Bag of Words Model of Paragraph Vectors (PV-DBOW) (Le and Mikolov 2014):
  • the model is forced to predict words randomly sampled from a specific paragraph;
  • the paragraph vector is trained to help predict words from the same paragraph in a small window.

24

sentence embeddings

• How do we combine word vectors into "text chunk" vectors?
• A number of convolutional architectures (Ma et al., 2015; Kalchbrenner et al., 2014).

• (Kiros et al. 2015): skip-thought vectors capture the meaning of a sentence by training on a skip-gram-like objective constructed over sentences.

• (Djuric et al. 2015): model large text streams with hierarchical neural language models with a document level and a token level.

24

recursive neural networks

• Recursive neural networks (Socher et al., 2012):
  • a neural network composes a chunk of text with another part in a tree;

• works its way up from word vectors to the root of a parse tree.

25

recursive neural networks

• Recursive neural networks (Socher et al., 2012):
  • representing a node with a matrix-vector pair;
  • by training this in a supervised way, one can get a very effective approach to sentiment analysis (Socher et al. 2013).

25

deep recursive networks

• Deep recursive networks for sentiment analysis (Irsoy, Cardie, 2014).

• First idea: decouple leaves and internal nodes.
• In recursive networks, we apply the same weights throughout the tree:

$x_v = f(W_L x_{l(v)} + W_R x_{r(v)} + b).$

• Now, we use different matrices for leaves (input words) and hidden nodes:
  • we can now have fewer hidden units than the word vector dimension;
  • we can use ReLU: sparse inputs and dense hidden units do not cause a discrepancy.
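A minimal numpy sketch of the basic recursive composition x_v = f(W_L x_l(v) + W_R x_r(v) + b) over a tiny binary parse tree; the dimensions, random weights, and the tree itself are illustrative:

```python
import numpy as np

d = 4
rng = np.random.default_rng(0)
W_L, W_R, b = rng.normal(0, 0.1, (d, d)), rng.normal(0, 0.1, (d, d)), np.zeros(d)
f = np.tanh

def compose(node, word_vecs):
    if isinstance(node, str):                       # leaf: return the word vector
        return word_vecs[node]
    left, right = node                              # internal node: (left, right)
    return f(W_L @ compose(left, word_vecs) + W_R @ compose(right, word_vecs) + b)

word_vecs = {w: rng.normal(size=d) for w in ["the", "movie", "was", "great"]}
tree = (("the", "movie"), ("was", "great"))         # ((the movie) (was great))
print(compose(tree, word_vecs))                     # vector for the whole phrase
```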

26

deep recursive networks

• Second idea: add depth to get hierarchical representations:

$h_v^{(i)} = f(W_L^{(i)} h_{l(v)}^{(i)} + W_R^{(i)} h_{r(v)}^{(i)} + V^{(i)} h_v^{(i-1)} + b^{(i)}).$

• Great architecture for sentiment analysis... if you have the parse trees.

• Stanford Sentiment TreeBank for English.

26

character-level models

• Word embeddings have important shortcomings:
  • vectors are independent, but words are not; consider, in particular, morphology-rich languages like Russian;
  • the same applies to out-of-vocabulary words: a word embedding cannot be extended to new words;
  • word embedding models may grow large; it's just lookup, but the whole vocabulary has to be stored in memory with fast access.

• E.g., "polydistributional" gets 48 results on Google, so you probably have never seen it, and there's very little training data:

• Do you have an idea what it means? Me too.

27

character-level models

• Hence, character-level representations:
  • they began by decomposing a word into morphemes (Luong et al. 2013; Botha and Blunsom 2014; Soricut and Och 2015);
  • but this adds errors, since morphological analyzers are also imperfect, and basically a part of the problem simply shifts to training a morphology model;
  • two natural approaches on the character level: LSTMs and CNNs;
  • in any case, the model is slow, but we do not have to apply it to every word: we can store embeddings of common words in a lookup table as before and only run the model for rare words, a nice natural tradeoff.

27


character-level models

• C2W (Ling et al. 2015) is based on bidirectional LSTMs:

27

character-level models

• The approach of the Deep Structured Semantic Model (DSSM) (Huang et al., 2013; Gao et al., 2014a; 2014b):
  • sub-word embeddings: represent a word as a bag of letter trigrams;
  • the vocabulary shrinks to the set of possible letter trigrams (tens of thousands instead of millions), and collisions are very rare;
  • the representation is robust to misspellings (very important for user-generated texts).
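A minimal sketch of the bag-of-letter-trigrams idea; the '#' boundary markers are an illustrative convention, not necessarily the exact one used in the DSSM papers:

```python
from collections import Counter

def letter_trigrams(word):
    padded = f"#{word.lower()}#"                  # mark word boundaries
    return Counter(padded[i:i+3] for i in range(len(padded) - 2))

print(letter_trigrams("word"))    # {'#wo': 1, 'wor': 1, 'ord': 1, 'rd#': 1}
print(letter_trigrams("woord"))   # a misspelling still shares most trigrams with "word"
```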

27

character-level models

• ConvNet (Zhang et al. 2015): text understanding from scratch, from the level of individual characters, based on CNNs.

• Character-level models and their extensions appear to be very important, especially for morphology-rich languages like Russian.

27


modern char-based language model: kim et al., 2015

Sample modern character-based language model (Kim et al., 2015):

Unites CNN, RNN, highway networks, embeddings...

28

word vectors with external information

• Other modifications of word embeddings add external information.

• E.g., the RC-NET model (Xu et al. 2014) extends skip-grams with relations (semantic and syntactic) and categorical knowledge (sets of synonyms, domain knowledge, etc.).

29

word vectors with external information

• The basic word2vec model gets a regularizer for every relation that tries to bring it closer to a linear relation between the vectors, so that, e.g.,

$w_{\text{Hinton}} - w_{\text{Wimbledon}} \approx r_{\text{born at}} \approx w_{\text{Euler}} - w_{\text{Basel}}.$

29

word sense disambiguation

• Another important problem with both word vectors and char-level models: homonyms.

• How do we distinguish different senses of the same word?
  • the model usually just chooses one meaning;
  • e.g., let's check nearest neighbors for the word коса ("braid" / "scythe" / "sand spit") and other homonyms.

• We have to add latent variables for the different meanings and infer them from context.

• To train the meanings with latent variables: Bayesian inference with stochastic variational inference (Bartunov et al., 2015).

30


general approaches

text generation with rnns

• Language modeling and text generation are a natural direct application of NN-based NLP; word embeddings started as a "neural probabilistic language model" (Bengio et al., 2003).

• First idea: sequence learning with RNNs/LSTMs.
• Surprisingly, simple RNNs can produce quite reasonable-looking text even by training character by character, with no knowledge of the words ("The Unreasonable Effectiveness..."), including the famous example from (Sutskever et al. 2011):

The meaning of life is the tradition of the ancient human reproduction: it is less favorable to the good boy for when to remove her bigger...

• This is, of course, not "true understanding" (whatever that means), only short-term memory effects.

• We need to go deeper in terms of both representations and sequence modeling.

32

text generation with rnns

• One can change the diversity (sampling temperature) and get different styles of absurdist texts. Random example with seed «обещал, на рождество, но спустя семь лет. имя глав».

• Low diversity gets you Lucky's monologue from "Waiting for Godot":

обещал, на рождество, но спустя семь лет. имя главного командования в составе советского союза с 1976 года. после проведения в 1992 году в составе советского союза (1977). в 1967 году в составе советского союза состоялся в 1952 году в составе советско-финской войны 1877 - 1877 годов. в 1966 году в составе советского союза с 1965 года по 1977 год...

• Mid-range diversity produces something close to meaningful text:

обещал, на рождество, но спустя семь лет. имя главного рода собственно вновь образовалась в россии и народном состоянии. после присказа с постановлением союза писателей россии и генеральной диссертации о спортивном училище с 1980 года. в 1970-х годах был основан в составе комитета высшего совета театра в польши. в 1957 - 1962 годах - начальник батальона сан-аухаров...

• High diversity leads to Khlebnikov's zaum:

обещал, на рождество, но спустя семь лет. имя главы философии пововпели nol-lнози - врайу-7 на луосече. человеческая восстания покторов извоенного чомпдеи э. дроссенбурга, … карл уним-общекрипских. эйелем хфечак от этого списка сравнивала имущно моря в юнасториансический индристское носительских женатов в церкви испании....

33
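A minimal sketch of sampling with a temperature ("diversity") from a character-level model's output distribution; the logits below are random stand-ins for a real model's predictions:

```python
import numpy as np

def sample_with_temperature(logits, temperature, rng):
    p = np.exp(logits / temperature)
    p /= p.sum()                              # softmax with rescaled logits
    return rng.choice(len(logits), p=p)

rng = np.random.default_rng(0)
logits = rng.normal(size=30)                  # pretend scores over 30 characters
for T in (0.2, 0.7, 1.5):                     # low, mid, high diversity
    print(T, [int(sample_with_temperature(logits, T, rng)) for _ in range(10)])
```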

text generation with rnn

• Another example – «12 chairs», 3 layers of LSTM of dim. 128.

• Low diversity:

– вы думаете, что он подвергается опасности? не понимаете на девушки и со всего большого секретара. на поставитель из столики с колодции под собой по столовом под нарипальное одного обедать вы получить стулья. но все не собирался. подводой под события не подошел к двери. он серебренной при столики под водом воробьяниновской порочение и подошел к стулом.

• Average diversity:

– что это значит? – спросил ипполит матвеевич, вспоминая только подкладка, идиость выкрасть, что уже совершенно всего упасы, по рексе оборанный решали на ним ответственное колоно горячи облиганта ветерность ”правосудель” застояли пределицу и из одобрания из на порахнитостью. но кричался воему тогу. его не смотрел ордеров с мы толстений принимать выдержание то преходитель.

• High diversity:

– ну, и я вы умоли полтуча, – сказал остап, нади гадалкий во столбор не черта не надо предражало. ответил золотый и стулья и нов. срековое зарабоварил стооспастук, и обычно, и строи тираживым господура моя животую столу, почто неуличного беспарные такиме судьберского есть денегальный извер.

34

text generation with rnn

• Last example – 3 layers of dim. 128, «Eugene Onegin».

• Low diversity – basically reciting:

но вот уж близко. перед ними
уж белокаменной москвы
как жар, крестами золотыми
горят старинные главы.

• Average diversity – large verbatim quotes:

не правда ль? вам была не новость
смиренной девочки, поврама
он был любим... по крайней мере
так думал он на супруге.

• High diversity – Khlebnikov again:

простой живеть по полном в,
бал уж канит; три несала
до глаза подерень преданьем
поедет, смертаю себя.

35

poroshok

• And here are some poroshki generated with LSTMs from a relatively small dataset:

заходит к солнцу отдаётесь
что он летел а может быть
и вовсе не веду на стенке
на поле пять и новый год
и почему то по башке
в квартире голуби и боли
и повзрослел и умирать

страшней всего когда ты выпил
без показания зонта

однажды я тебя не вышло
и ты

я захожу в макдоналисту
надену отраженный дождь
под ужин почему местами
и вдруг подставил человек

ты мне привычно верил крышу
до дна

я подползает под кроватью
чтоб он исписанный пингвин
и ты мне больше никогда
но мы же после русских классик
барто солдаты для любви

36

dssm

• A general approach to NLP based on CNNs is given by the Deep Structured Semantic Model (DSSM) (Huang et al., 2013; Gao et al., 2014a; 2014b):
  • one-hot target vectors for classification (speech recognition, image recognition, language modeling).

37

dssm

• A general approach to NLP based on CNNs is given by the Deep Structured Semantic Model (DSSM) (Huang et al., 2013; Gao et al., 2014a; 2014b):

• vector-valued targets for semantic matching.

37

dssm

• A general approach to NLP based on CNNs is given by the Deep Structured Semantic Model (DSSM) (Huang et al., 2013; Gao et al., 2014a; 2014b):
  • it can capture different targets (one-hot, vector);
  • to train with vector targets, use reflection: bring source and target vectors closer.

37

dssm

• A general approach to NLP based on CNNs is given by the Deep Structured Semantic Model (DSSM) (Huang et al., 2013; Gao et al., 2014a; 2014b):

• DSSMs can be applied in a number of different contexts when we can specify a supervised dataset:
  • semantic word embeddings: word by context;
  • web search: web documents by query;
  • question answering: knowledge base relation/entity by pattern;
  • recommendations: interesting documents by read/liked documents;
  • translation: target sentence by source sentence;
  • text/image: labels by images or vice versa.

• Basically, this is an example of a general architecture that can be trained to do almost anything.

37

dssm

• A general approach to NLP based on CNNs is given by the Deep Structured Semantic Model (DSSM) (Huang et al., 2013; Gao et al., 2014a; 2014b):

• Deep Structured Semantic Models (DSSM) (Huang et al., 2013; Gao et al., 2014a; 2014b): a deep convolutional architecture trained on similar text pairs.

• Can be used for information retrieval: model relevance by bringing relevant documents closer to their queries (both document and query go through the same convolutional architecture).

• November 2016: a post by Yandex saying that they use a (modified) DSSM in their new Palekh search algorithm.

37


dependency parsing


dependency parsing

• We mentioned parse trees; but how do we construct them?

39

dependency parsing

• Current state of the art – continuous-state parsing: the current state is encoded in $\mathbb{R}^d$.

• Stack LSTMs (Dyer et al., 2015) – the parser manipulates three basic data structures:
(1) a buffer B that contains the sequence of words;
(2) a stack S that stores partially constructed parses;
(3) a list A of actions already taken by the parser.

39

dependency parsing with morphology

• Important extension – (Ballesteros et al., 2015):
  • in morphologically rich natural languages, we have to take into account morphology;
  • so they represent the words by bidirectional character-level LSTMs;
  • they report improved results in Arabic, Basque, French, German, Hebrew, Hungarian, Korean, Polish, Swedish, and Turkish;
  • this direction probably can be further improved (and where's Russian in the list above?..).

40

evaluation for sequence-to-sequence models

• Next we will consider specific models for machine translation, dialog models, and question answering.

• But how do we evaluate NLP models that produce text?
• Quality metrics for comparing with reference sentences produced by humans:
  • BLEU (Bilingual Evaluation Understudy): reweighted precision (incl. multiple reference translations);
  • METEOR: harmonic mean of unigram precision and unigram recall;
  • TER (Translation Edit Rate): number of edits between the output and reference divided by the average number of reference words;
  • LEPOR: combines basic factors and language metrics with tunable parameters.

• The same metrics apply to paraphrasing and, generally, to all problems where the (supervised) answer should be a free-form text.

• There is one problem...

41

evaluation for sequence-to-sequence models

• They don't work at all!

• Well, actually they do work, but it's more complicated than it seems.

41


machine translation

machine translation

• Translation is a very convenient problem for modern NLP:
  • on one hand, it is very practical and obviously important;
  • on the other hand, it's very high-level, virtually impossible without deep understanding, so if we do well on translation, we probably do something right about understanding;
  • on the third hand (oops), it's quantifiable (BLEU, TER, etc.) and has relatively large available datasets (parallel corpora).

43

machine translation

• Statistical machine translation (SMT): model the conditional probability p(y | x) of the target y (translation) given the source x (text).

• Classical SMT: model log p(y | x) with a linear combination of features and then construct these features.

• NNs have been used both for reranking n-best lists of possible translations and as part of the feature functions:

43

machine translation

• NNs are still used for feature engineering with state-of-the-art results, but here we are more interested in sequence-to-sequence modeling.

• Basic idea:
  • RNNs can be naturally used to probabilistically model a sequence $X = (x_1, x_2, \ldots, x_T)$ as $p(x_1)$, $p(x_2 \mid x_1)$, ..., $p(x_T \mid x_{<T}) = p(x_T \mid x_{T-1}, \ldots, x_1)$, and then the joint probability $p(X)$ is just their product:

$p(X) = p(x_1)\, p(x_2 \mid x_1) \cdots p(x_k \mid x_{<k}) \cdots p(x_T \mid x_{<T});$

  • this is how RNNs are used for language modeling;
  • we predict the next word based on the hidden state learned from all previous parts of the sequence;

• How do we apply this idea to SMT?
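A minimal sketch of this chain-rule factorization: the log-probability of a sequence is the sum of stepwise log p(x_t | x_<t). The next-token distribution below is a uniform stand-in for what an RNN's hidden state would actually produce:

```python
import numpy as np

V = 10                                     # toy vocabulary size

def next_token_probs(prefix):
    # placeholder for an RNN: in reality this depends on the hidden state
    return np.full(V, 1.0 / V)

def sequence_log_prob(tokens):
    return sum(np.log(next_token_probs(tokens[:t])[x]) for t, x in enumerate(tokens))

print(sequence_log_prob([1, 4, 2]))        # 3 * log(1/10)
```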

43

encoder-decoder architectures

• Encoder-decoder architectures (Sutskever et al., 2014; Cho et al., 2014):

• First encode, then decode back.

44

encoder-decoder architectures

• The same idea works with translation.

• Problem: we need to compress the entire sentence into a single vector.

• And it does not work at all with longer fragments...

44

attention in neural networks

• Possible solution: train special weights that show how important a certain part of the input is for the currently generated part of the output.

• This is somewhat similar to human attention: what do we put into working memory?

• First applications in NNs – foveal glimpses with RBMs (Larochelle, Hinton, 2010).

45

attention in neural networks

• A direct application – bidirectional LSTM + attention (Bahdanau et al. 2014):

45

attention in neural networks

• Soft attention (Luong et al. 2015a; 2015b; Jean et al. 2015):
  • the encoder is a bidirectional RNN;
  • an attention network estimates relevance: are we translating this word right now?

45

attention in neural networks

• Formally very simple: compute attention weights $\alpha_{tj}$ and re-weigh context vectors:

$e_{tj} = a(z_{t-1}, h_j), \quad \alpha_{tj} = \mathrm{softmax}(e_{tj}; e_{t*}), \quad c_t = \sum_j \alpha_{tj} h_j, \quad \text{and now } z_t = f(z_{t-1}, y_{t-1}, c_t).$

45
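A minimal numpy sketch of this attention step: score every encoder state h_j against the current decoder state, softmax the scores into weights α, and take the weighted sum as the context vector c_t. A plain dot product stands in for the scoring function a(·, ·), which is an illustrative choice:

```python
import numpy as np

def attention(decoder_state, encoder_states):
    scores = encoder_states @ decoder_state       # e_tj for every position j
    alphas = np.exp(scores - scores.max())
    alphas /= alphas.sum()                        # softmax -> attention weights
    context = alphas @ encoder_states             # c_t = sum_j alpha_tj * h_j
    return context, alphas

h = np.random.randn(6, 10)    # 6 encoder states of dimension 10
z = np.random.randn(10)       # current decoder state
c, a = attention(z, h)
print(c.shape, a.round(2))
```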


attention in neural networks

• As a result we can visualize what the network is looking at:

45


attention in neural networks

• The word order is much better this way:

45

attention in neural networks

• Other applications of attention mechanisms can be NLP-related too.

• Show, Attend, and Tell (Xu et al., 2015): descriptions of images.

45

attention in neural networks

• Soft attention vs. hard attention (stochastically choose a specific part of the image):

• Hard attention is trained by maximizing a variational lower bound.

45


attention in neural networks

• Often pretty good results:

45

attention in neural networks

• Even closer – «Grammar as a Foreign Language» (Vinyals et al., 2015).

45

google translate

• September 2016: Wu et al., Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation:
  • shows how Google Translate actually works;
  • the basic architecture is the same: encoder, decoder, attention;
  • RNNs have to be deep enough to capture language irregularities, so 8 layers for the encoder and the decoder each:

46

google translate

• September 2016: Wu et al., Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation:
  • but simply stacking LSTMs does not really work: 4-5 layers are OK, 8 layers don't work;
  • so they add residual connections between the layers, similar to (He, 2015):

46

google translate

• September 2016: Wu et al., Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation:
  • and it makes sense to make the bottom layer bidirectional in order to capture as much context as possible:

46

google translate

• September 2016: Wu et al., Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation:
  • GNMT also uses two ideas for word segmentation:
  • wordpiece model: break words into wordpieces (with a separate model); example from the paper:

Jet makers feud over seat width with big orders at stake

becomes

_J et _makers _fe ud _over _seat _width _with _big _orders _at _stake

• mixed word/character model: use the word model, but convert out-of-vocabulary words into characters (specifically marked so that they cannot be confused); example from the paper:

Miki becomes <B>M <M>i <M>k <E>i

46
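A minimal sketch of that mixed word/character conversion: in-vocabulary words stay as single tokens, while out-of-vocabulary words are split into specially marked characters as in the "Miki" example; the tiny vocabulary is illustrative:

```python
vocab = {"jet", "makers", "feud", "over"}          # toy in-vocabulary words

def mixed_tokenize(word):
    if word.lower() in vocab:
        return [word]
    chars = list(word)                             # OOV: split into marked characters
    return [f"<B>{chars[0]}"] + [f"<M>{c}" for c in chars[1:-1]] + [f"<E>{chars[-1]}"]

print(mixed_tokenize("Miki"))   # ['<B>M', '<M>i', '<M>k', '<E>i']
print(mixed_tokenize("over"))   # ['over']
```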

teaching machines to read

• (Hermann et al., 2015): «Teaching machines to read and comprehend» (Google DeepMind).

• A new way to construct a dataset for understanding by automated construction of (context, query, answer) triples from news items or similar texts.

47


teaching machines to read

• The model is based on a deep network of LSTMs:

• But it does not work very well this way.

47


teaching machines to read

• Attentive Reader – train which part of the document to look at:

47

teaching machines to read

• Impatient Reader – re-read parts of the document as the query is processed:

47


teaching machines to read

• Reasonable attention maps:

47


attention is all you need

• June 2017: «Attention is all you need» (Vaswani et al., Google)

48

attention is all you need

• Nothing but attention!
• Parallel attention maps are combined into matrices:

• Self-attention: each encoder position can "attend to" each position of the previous level.

• SMT results improve over the state of the art, and training is 100x faster.

48


dialog and conversation

Page 110: Deep Learning for Natural Language Processingsergey/slides/N17_AISTDLNLP.pdf · deeplearning fornaturallanguageprocessing SergeyI.Nikolenko1 AIST2017 Moscow,July28,2017 1SteklovInstituteofMathematicsatSt.Petersburg

dialog and conversational models

• Dialog models attempt to model and predict dialogue;conversational models actively talk to a human.

• Applications – automatic chat systems for business etc.
• Vinyals and Le (2015) use seq2seq (Sutskever et al. 2014):

• feed previous sentences ABC as context to the RNN;
• predict the next word of reply WXYZ based on the previous word and hidden state.

• Datasets: general (MovieSubtitles) or domain-specific (IT helpdesk). A toy sketch of the decoding loop is given below.
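A toy numpy sketch of this decoding loop (random, untrained weights and hypothetical shapes; it only shows the shape of the computation, not a working chatbot):

```python
import numpy as np

def greedy_reply(context_ids, E, W_h, W_x, W_out, eos_id, max_len=20):
    """Toy seq2seq-style reply generation: run an RNN over the context
    tokens, then generate the reply greedily word by word until <eos>.
    E: embeddings, W_h/W_x: recurrent weights, W_out: output projection."""
    h = np.zeros(W_h.shape[0])
    for t in context_ids:                      # encode the context sentences
        h = np.tanh(W_h @ h + W_x @ E[t])
    reply, x = [], E[eos_id]                   # start decoding from <eos>
    for _ in range(max_len):
        h = np.tanh(W_h @ h + W_x @ x)
        token = int(np.argmax(W_out @ h))      # greedy choice of the next word
        if token == eos_id:
            break
        reply.append(token)
        x = E[token]
    return reply

V, d = 50, 16                                  # toy vocabulary and state size
rng = np.random.default_rng(2)
E = rng.normal(size=(V, d))
W_h, W_x = 0.1 * rng.normal(size=(d, d)), 0.1 * rng.normal(size=(d, d))
W_out = rng.normal(size=(V, d))
print(greedy_reply([3, 7, 12], E, W_h, W_x, W_out, eos_id=0))
```

In practice the recurrent cell would be an LSTM/GRU and decoding would use beam search rather than greedy argmax.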


dialog and conversational models

• Hierarchical recurrent encoder-decoder (HRED) architecture; first proposed for query suggestion in IR (Sordoni et al. 2015), used for dialog systems in (Serban et al. 2015).

• The dialogue is modeled as a two-level system: a sequence of utterances, each of which is in turn a sequence of words. To model this two-level system, HRED trains:
(1) an encoder RNN that maps each utterance in a dialogue into a single utterance vector;
(2) a context RNN that processes all previous utterance vectors and combines them into the current context vector;
(3) a decoder RNN that predicts the tokens in the next utterance, one at a time, conditioned on the context RNN.
A minimal structural sketch of this hierarchy follows.
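A schematic numpy sketch of the three-RNN hierarchy (plain tanh cells and made-up dimensions instead of the GRU/LSTM cells used in practice; only the data flow matters here):

```python
import numpy as np

def rnn(xs, h, W_h, W_x):
    """Run a simple tanh RNN over a sequence and return the final state."""
    for x in xs:
        h = np.tanh(W_h @ h + W_x @ x)
    return h

def hred_step(utterances, p):
    """One HRED step: encode each utterance, update the context RNN over
    the utterance vectors, and return the context vector that would
    condition the decoder RNN for the next utterance."""
    utt_vectors = [rnn(u, np.zeros(p["d_enc"]), p["enc_Wh"], p["enc_Wx"])
                   for u in utterances]                       # (1) encoder RNN
    context = rnn(utt_vectors, np.zeros(p["d_ctx"]),
                  p["ctx_Wh"], p["ctx_Wx"])                   # (2) context RNN
    return context                                            # fed to (3) decoder RNN

d_w, d_enc, d_ctx = 8, 12, 10
rng = np.random.default_rng(3)
params = {
    "d_enc": d_enc, "d_ctx": d_ctx,
    "enc_Wh": 0.1 * rng.normal(size=(d_enc, d_enc)),
    "enc_Wx": 0.1 * rng.normal(size=(d_enc, d_w)),
    "ctx_Wh": 0.1 * rng.normal(size=(d_ctx, d_ctx)),
    "ctx_Wx": 0.1 * rng.normal(size=(d_ctx, d_enc)),
}
dialogue = [[rng.normal(size=d_w) for _ in range(5)],     # utterance 1 (word vectors)
            [rng.normal(size=d_w) for _ in range(3)]]     # utterance 2
print(hred_step(dialogue, params).shape)                  # (10,)
```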


dialog and conversational models

• HRED architecture:


dialog and conversational models

• Some recent developments:
• (Li et al., 2016a) apply, again, reinforcement learning (DQN) to improve dialogue generation;

• (Li et al., 2016b) add personas with latent variables, so dialogue can be more consistent (yes, it’s the same Li);

• (Wen et al., 2016) use snapshot learning, adding some weak supervision in the form of particular events occurring in the output sequence (whether we still want to say something or have already said it);

• (Su et al., 2016) improve dialogue systems with online active reward learning, a tool from reinforcement learning.

• Generally, chatbots are becoming commonplace, but there is still a long way to go before actual general-purpose dialogue.


question answering


question answering

• Question answering (QA) is one of the hardest NLP challenges, close to true language understanding.

• Let us begin with evaluation:
• it’s easy to find datasets for information retrieval;
• these questions can be answered with knowledge base approaches: map questions to logical queries over a graph of facts;

• in a multiple choice setting (Quiz Bowl), map the question and possible answers to a semantic space and find nearest neighbors (Socher et al. 2014);

• but this is not exactly general question answering.

• (Weston et al. 2015): a dataset of simple (for humans) questions that do not require any special knowledge.

• But they do require reasoning and understanding of semantic structure...


question answering

• Sample questions:

Task 1: Single Supporting Fact
Mary went to the bathroom.
John moved to the hallway.
Mary travelled to the office.
Where is Mary? A: office

Task 4: Two Argument Relations
The office is north of the bedroom.
The bedroom is north of the bathroom.
The kitchen is west of the garden.
What is north of the bedroom? A: office
What is the bedroom north of? A: bathroom

Task 7: Counting
Daniel picked up the football.
Daniel dropped the football.
Daniel got the milk.
Daniel took the apple.
How many objects is Daniel holding? A: two

Task 10: Indefinite Knowledge
John is either in the classroom or the playground.
Sandra is in the garden.
Is John in the classroom? A: maybe
Is John in the office? A: no

Task 15: Basic Deduction
Sheep are afraid of wolves.
Cats are afraid of dogs.
Mice are afraid of cats.
Gertrude is a sheep.
What is Gertrude afraid of? A: wolves

Task 20: Agent’s Motivations
John is hungry.
John goes to the kitchen.
John grabbed the apple there.
Daniel is hungry.
Where does Daniel go? A: kitchen
Why did John go to the kitchen? A: hungry


question answering

• Naive approach – generate representations for question and answer and combine them:

• One problem is that we have to remember the context set throughout the whole question...


question answering

• ...so the current state of the art is memory networks (Weston et al. 2014).

• An array of objects (memory) and the following components learned during training (a minimal toy skeleton is sketched after the list):

I (input feature map) converts the input to the internal feature representation;
G (generalization) updates old memories after receiving new input;
O (output feature map) produces new output given a new input and a memory state;
R (response) converts the output of O into the output response format (e.g., text).
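A bare-bones toy instance of the I/G/O/R framework (hand-coded bag-of-words scoring standing in for the learned components; purely for illustration, not the paper's model):

```python
import numpy as np

class SimpleMemoryNetwork:
    """Toy I/G/O/R memory network: I maps text to a bag-of-words vector,
    G appends it to memory, O retrieves the most relevant memory for a
    query, R produces an answer word. Real memory networks learn these
    components; here they are hand-coded for illustration."""
    def __init__(self, vocab):
        self.vocab = {w: i for i, w in enumerate(vocab)}
        self.memory = []                      # list of (vector, text) pairs

    def I(self, text):
        v = np.zeros(len(self.vocab))
        for w in text.lower().strip(".?").split():
            if w in self.vocab:
                v[self.vocab[w]] += 1
        return v

    def G(self, text):
        self.memory.append((self.I(text), text))

    def O(self, query_vec):
        scores = [m_vec @ query_vec for m_vec, _ in self.memory]
        return self.memory[int(np.argmax(scores))][1]

    def R(self, supporting_fact):
        return supporting_fact.strip(".").split()[-1]   # last word as the answer

vocab = ["mary", "john", "went", "moved", "to", "the", "bathroom",
         "hallway", "office", "where", "is"]
mn = SimpleMemoryNetwork(vocab)
for fact in ["Mary went to the bathroom.", "John moved to the hallway."]:
    mn.G(fact)
print(mn.R(mn.O(mn.I("Where is John?"))))    # -> hallway
```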


question answering

• Dynamic memory networks (Kumar et al. 2015).
• Episodic memory unit that chooses which parts of the input to focus on with an attention mechanism:


question answering

• End-to-end memory networks (Sukhbaatar et al. 2015).
• A continuous version of memory networks, with multiple hops (computational steps) per output symbol.

• Regular memory networks require supervision on each layer; end-to-end ones can be trained with input-output pairs (the attention hops are sketched below):
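A numpy sketch of the attention hops in this scheme (random matrices standing in for the learned embeddings A, B, C, and a single shared embedding across hops, whereas the paper uses per-hop matrices with weight tying):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def memn2n_hops(story_bows, query_bow, A, B, C, W, hops=3):
    """End-to-end memory network forward pass: embed sentences into
    memories, attend to them with the query, and repeat for several hops."""
    u = B @ query_bow                              # query embedding
    m = np.stack([A @ s for s in story_bows])      # input memories
    c = np.stack([C @ s for s in story_bows])      # output memories
    for _ in range(hops):
        p = softmax(m @ u)                         # attention over memories
        o = p @ c                                  # weighted output memory
        u = u + o                                  # hop update
    return softmax(W @ u)                          # distribution over answers

V, d, n_sent = 30, 16, 4
rng = np.random.default_rng(4)
A, B, C = [0.1 * rng.normal(size=(d, V)) for _ in range(3)]
W = 0.1 * rng.normal(size=(V, d))
story = [rng.integers(0, 2, size=V).astype(float) for _ in range(n_sent)]
query = rng.integers(0, 2, size=V).astype(float)
print(memn2n_hops(story, query, A, B, C, W).shape)   # (30,)
```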


question answering

• Should work like this:


question answering

• Generally, the idea of training control algorithms is very important now.

• Neural Turing Machines (Graves et al., 2014):

• Read/write/erase operations on memory with attention-like weights (sketched below).
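A numpy sketch of such soft read/write operations with content-based addressing (made-up sizes, no learned controller):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def content_address(M, key, beta=5.0):
    """Attention-like weights over memory rows by cosine similarity to a key."""
    sims = M @ key / (np.linalg.norm(M, axis=1) * np.linalg.norm(key) + 1e-8)
    return softmax(beta * sims)

def read(M, w):
    return w @ M                                   # soft read: weighted sum of rows

def write(M, w, erase, add):
    """Soft write: erase a bit of every row, then add, both scaled by w."""
    M = M * (1 - np.outer(w, erase))
    return M + np.outer(w, add)

rows, width = 6, 8
rng = np.random.default_rng(5)
M = rng.normal(size=(rows, width))
key = M[2] + 0.01 * rng.normal(size=width)          # a key close to row 2
w = content_address(M, key)
print(w.round(2))                                   # weights concentrate on row 2
M = write(M, w, erase=np.ones(width), add=rng.normal(size=width))
print(read(M, w).shape)                             # (8,)
```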


question answering

• The weights are a kind of addressing mechanism:

• Can be trained with reinforcement learning (Zaremba, Sutskever, 2015):


question answering

• There are plenty of other extensions; one problem is how to link QA systems with knowledge bases to answer questions that require both reasoning and knowledge.

• I expect a lot of exciting things to happen here.
• But allow me to suggest...


what? where? when?

• «What? Where? When?»: a team game of answering questions. Sometimes it looks like this...


what? where? when?

• ...but usually it looks like this:


what? where? when?

• Teams of ≤ 6 players answer questions; whoever gets the most correct answers wins.

• db.chgk.info – a database of about 300K questions.
• Some of them come from “Своя игра”, a Jeopardy clone, but often with less direct questions:

• Logos. ITS logo is a combination of the runes “berkana” and “hagalaz”, the initials of King Harald.
• Abbreviations. Given HIS large role in the creation of the first tanks, those tanks were often jokingly called water closets.
• Horses. HIS attempt to make advances on Galician mares ended sadly.


what? where? when?

• Most are “Что? Где? Когда?” questions, even harder for automated analysis:

• Once Richard Feynman was explaining to his girlfriend why Descartes was talking nonsense when he proved the existence of God. Not wanting to get into an argument, the girl said that, apparently, any subject can be looked at from different sides. Answer in two words: what did Feynman do in response?
• The second corresponded to blasphemy, the third to forging money or documents, the fourth to drunkenness. What did the first correspond to?
• In fact, the Czechoslovak line of audio equipment and radio components got its name because it is low-current equipment. Write this name.
• One construction in mathematical logic considers an immortal agent who at every moment of time must either perform an action or formally prove that it will perform this action at some later moment. The name of this construction consists of two words beginning with the same letter. Write this name.

• I believe it is a great and very challenging QA dataset.
• How far in the future do you think it is? :)


thank you!

Thank you for your attention!
