
  • deep learning for natural language processing

    Sergey I. Nikolenko

    AIST 2017 Moscow, July 28, 2017

    Steklov Institute of Mathematics at St. Petersburg

    Random facts:

    • on July 28, 1814, Percy Bysshe Shelley, married to Harriet Westbrook at the time, eloped from England to France with the 16-year-old Mary Wollstonecraft Godwin, later Mary Shelley

    • on July 28, 1974, KGB chairman Yuri Andropov created the Alpha Group (Spetsgruppa A)

    • July 28 -- Russia's Baptism Day; the Duma passed this bill on May 21, 2010

    • July 28 -- World Hepatitis Day; the World Health Assembly passed this resolution on May 21, 2010

  • plan

    • The deep learning revolution has not passed natural language processing by.

    • DL in NLP started with standard architectures (RNNs, CNNs) but has since branched out into new directions.

    • Our plan for today: (1) a very brief intro to neural networks and deep learning; (2) an intro to distributed word representations; (3) a primer on sentence embeddings and character-level models; (4) a (very, very) brief overview of the most promising directions in modern NLP based on deep learning.

    • We will concentrate on directions that have given rise to new models and architectures.

  • neural networks

  • machine learning as optimization

    • Machine learning deals with many kinds of problems: supervised, unsupervised, and so on.

    • They are usually solved with Bayes’ theorem, recomputing our prior beliefs about the parameters into posteriors:

    $p(\theta \mid D) = \dfrac{p(\theta)\, p(D \mid \theta)}{p(D)}$.
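
    • As a quick illustration (a made-up example, not from the talk): the posterior over a coin’s bias after 7 heads in 10 flips, computed on a discretized grid.

      import numpy as np

      # Hypothetical example: posterior over a coin's bias theta after 7 heads in 10 flips.
      thetas = np.linspace(0.01, 0.99, 99)        # discretized parameter grid
      prior = np.ones_like(thetas) / len(thetas)  # uniform prior p(theta)
      heads, flips = 7, 10
      likelihood = thetas**heads * (1 - thetas)**(flips - heads)  # p(D | theta)
      posterior = prior * likelihood
      posterior /= posterior.sum()                # normalize by p(D)
      print(thetas[np.argmax(posterior)])         # MAP estimate, ~0.7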

  • machine learning as optimization

    • This usually means optimizing a multivariate function.

    • Algorithmically, machine learning is about approximation and optimization of complicated functions (likelihoods, posteriors, error functions, etc.).

    • A deep neural network is a big and complicated function of many variables, highly nonconvex.

    • For nonconvex functions this is usually done with some version of gradient descent.

  • gradient descent

    • Gradient descent is the main and virtually the only way to optimize really complicated functions.

    • We take the gradient $\nabla E(w)$, move a little against it, and repeat.

    • In neural networks, gradients are easy to compute because NNs are just huge compositions of very simple elements.
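
    • A minimal sketch of this loop (the quadratic error function and data here are arbitrary placeholders):

      import numpy as np

      # Toy error function E(w) = ||Xw - y||^2 with made-up data.
      rng = np.random.default_rng(0)
      X, y = rng.normal(size=(100, 3)), rng.normal(size=100)

      def grad_E(w):
          return 2 * X.T @ (X @ w - y)   # gradient of the squared error

      w = np.zeros(3)
      lr = 0.001                          # learning rate: "move a little"
      for _ in range(1000):
          w -= lr * grad_E(w)             # step against the gradient, repeat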

  • perceptron

    • The perceptron is the basic building block of neural networks: $y = h(w^\top x) = h\left(\sum_i w_i x_i\right)$.

    • Different nonlinear activation functions $h$ can be used:
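
    • A sketch of a single perceptron with a few common choices of $h$ (sigmoid, tanh, ReLU); the inputs and weights are arbitrary:

      import numpy as np

      def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))
      def tanh(z):    return np.tanh(z)
      def relu(z):    return np.maximum(0.0, z)

      def perceptron(x, w, h=sigmoid):
          return h(np.dot(w, x))          # y = h(w^T x)

      x = np.array([1.0, -2.0, 0.5])
      w = np.array([0.3, 0.1, -0.4])
      print(perceptron(x, w, sigmoid), perceptron(x, w, relu))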

  • uniting perceptrons into networks

    • A network of perceptrons.

    • Hornik, 1990: two layers suffice to approximate any function.

    • But deep networks are more efficient: the power of distributed representations.

    • Usually neural networks are organized into layers.

  • computational graph

    • Suppose we have represented a complex function as a composition of simple ones.

    • E.g., $f(x, y) = x^2 + xy + (x + y)^2$:

  • computational graph

    • We can now compute derivatives on the computational graph.

    • Forward propagation: derivatives of every node with respect to a given one (an input).

    • Backpropagation: derivatives of one node (the output) with respect to every other.

    • We can now compute gradients of arbitrarily complicated functions.

    • Backprop gives us the derivatives of the error function needed for gradient descent.

    • Theano, TensorFlow, and Torch are automatic differentiation libraries.
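
    • For example, a sketch with TensorFlow 2 (assuming the tf.GradientTape API) differentiating the toy function $f(x, y) = x^2 + xy + (x + y)^2$ from the computational-graph example:

      import tensorflow as tf

      x = tf.Variable(3.0)
      y = tf.Variable(2.0)
      with tf.GradientTape() as tape:
          f = x**2 + x*y + (x + y)**2      # build the computational graph
      df_dx, df_dy = tape.gradient(f, [x, y])
      print(df_dx.numpy(), df_dy.numpy())  # 2x + y + 2(x+y) = 18, x + 2(x+y) = 13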

  • gradient descent in nns

    • A neural network has a lot of parameters, and many tricks help train them:

      • regularization ($L_2$, $L_1$, dropout);

      • initialization (unsupervised pretraining, random init);

      • normalization (batchnorm, layer normalization);

      • improved gradient descent (momentum, Nesterov momentum, Adadelta, Adam).

    • Two main classes of architectures, both from the 1980s:

      • convolutional neural networks;

      • recurrent neural networks.

  • gradient descent in nns

    • A neural network has a lot of parameters, so regularization is important:

      • $L_2$- and $L_1$-regularization;

      • early stopping;

      • dropout.
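
    • A sketch of (inverted) dropout at training time; the dropout rate and activation shapes are arbitrary:

      import numpy as np

      def dropout(activations, p_drop=0.5, training=True):
          if not training:
              return activations                       # no-op at test time
          mask = (np.random.rand(*activations.shape) >= p_drop)
          return activations * mask / (1.0 - p_drop)   # rescale to keep expectations

      h = np.random.randn(4, 8)                        # a batch of hidden activations
      print(dropout(h, p_drop=0.5))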

  • gradient descent in nns

    • These parameters have to be initialized:

      • unsupervised pretraining;

      • random initialization:

        • Xavier initialization (Glorot, Bengio, 2010);

        • He initialization (He et al., 2015);

      • the idea is to control the variances so that the gradients do not vanish along the layers.
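
    • A sketch of both initialization schemes for a fully connected layer, using the commonly stated variance formulas (exact variants differ between papers):

      import numpy as np

      def xavier_init(n_in, n_out):
          # Glorot & Bengio (2010): variance ~ 2 / (n_in + n_out)
          limit = np.sqrt(6.0 / (n_in + n_out))
          return np.random.uniform(-limit, limit, size=(n_in, n_out))

      def he_init(n_in, n_out):
          # He et al. (2015), for ReLU units: variance ~ 2 / n_in
          return np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)

      W1 = xavier_init(256, 128)
      W2 = he_init(256, 128)
      print(W1.std(), W2.std())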

  • gradient descent in nns

    • Batch normalization:

      • covariate shift: the inputs to the next layer change whenever the previous layer changes;

      • it would be good to normalize over the whole dataset, but that is too hard;

      • hence, we normalize over mini-batches, with additional parameters $\gamma$, $\beta$ for more expressivity.
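
    • A sketch of the training-time batchnorm transform over a mini-batch (running statistics for test time are omitted):

      import numpy as np

      def batchnorm_train(x, gamma, beta, eps=1e-5):
          # x: (batch_size, features); normalize each feature over the mini-batch
          mu = x.mean(axis=0)
          var = x.var(axis=0)
          x_hat = (x - mu) / np.sqrt(var + eps)
          return gamma * x_hat + beta        # gamma, beta restore expressivity

      x = np.random.randn(32, 64) * 5 + 3
      gamma, beta = np.ones(64), np.zeros(64)
      print(batchnorm_train(x, gamma, beta).mean(), batchnorm_train(x, gamma, beta).std())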

  • gradient descent in nns

    • Gradient descent improvements:

      • gradually reducing the learning rate;

      • momentum, Nesterov momentum;

      • adaptive gradient descent (Adagrad, Adadelta, RMSprop, Adam).
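
    • A sketch of two of these update rules, classical momentum and Adam, with the usual default hyperparameters (not prescribed by the slides):

      import numpy as np

      def momentum_step(w, grad, v, lr=0.01, mu=0.9):
          v = mu * v - lr * grad                 # accumulate a velocity
          return w + v, v

      def adam_step(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
          m = b1 * m + (1 - b1) * grad           # first moment (mean of gradients)
          v = b2 * v + (1 - b2) * grad**2        # second moment (uncentered variance)
          m_hat = m / (1 - b1**t)                # bias correction
          v_hat = v / (1 - b2**t)
          return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v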

  • convolutional neural networks

    • Basic convolutional layer: convolution, nonlinearity, pooling.

    • Classical architecture – LeNet:
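
    • A LeNet-style stack in tf.keras as an illustrative sketch (not the original LeNet-5 hyperparameters): convolution + nonlinearity + pooling blocks followed by dense layers.

      import tensorflow as tf
      from tensorflow.keras import layers

      # A LeNet-flavoured sketch for 28x28 grayscale inputs (e.g., MNIST-like data).
      model = tf.keras.Sequential([
          layers.Conv2D(6, 5, activation='relu', input_shape=(28, 28, 1)),
          layers.MaxPooling2D(2),
          layers.Conv2D(16, 5, activation='relu'),
          layers.MaxPooling2D(2),
          layers.Flatten(),
          layers.Dense(120, activation='relu'),
          layers.Dense(84, activation='relu'),
          layers.Dense(10, activation='softmax'),
      ])
      model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')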

  • convolutional neural networks

    • Convolutional nets are usually very deep (AlexNet):

    • Usually trained on ImageNet (> 15M images, ≈ 22K classes).

  • convolutional neural networks

    • VGG: it suffices to take 3 × 3 convolutions:

    • Network in Network, Inception:

  • convolutional neural networks

    • Residual connections (ResNet) are a very important idea; they led to a “revolution of depth”.
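
    • The block learns a residual $F(x)$ that is added to its input; a minimal sketch in tf.keras (layer sizes are arbitrary):

      import tensorflow as tf
      from tensorflow.keras import layers

      def residual_block(x, filters=64):
          # F(x): two conv layers; the output is F(x) + x, so gradients can flow
          # through the identity shortcut.
          f = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
          f = layers.Conv2D(filters, 3, padding='same')(f)
          out = layers.Add()([f, x])
          return layers.Activation('relu')(out)

      inputs = tf.keras.Input(shape=(32, 32, 64))
      outputs = residual_block(inputs)
      model = tf.keras.Model(inputs, outputs)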

  • convolutional neural networks

    • Other objective functions can be added (e.g., segmentation):

    • In general, it is very easy to combine architectures: just concatenate inputs or add parallel layers that do something new.

    • We will see lots of examples of this in NLP.

  • recurrent neural networks

    • Various kinds of problems are based on sequences:

    • A “simple” RNN:
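
    • One step of such a simple (Elman-style) RNN cell as a sketch; $W$, $U$, $b$ are the recurrent, input, and bias parameters:

      import numpy as np

      def rnn_step(x_t, h_prev, W, U, b):
          # h_t = tanh(W h_{t-1} + U x_t + b); the same W, U, b are reused at every step
          return np.tanh(W @ h_prev + U @ x_t + b)

      hidden, inp = 16, 8
      W = np.random.randn(hidden, hidden) * 0.1
      U = np.random.randn(hidden, inp) * 0.1
      b = np.zeros(hidden)

      h = np.zeros(hidden)
      for x_t in np.random.randn(5, inp):      # unroll over a length-5 sequence
          h = rnn_step(x_t, h, W, U, b)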

  • recurrent neural networks

    • Bidirectional RNN:

    • Two problems with RNNs:

      • exploding gradients;

      • vanishing gradients.

  • recurrent neural networks

    • Solving exploding gradients is easy: simply clip the gradients.
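
    • A sketch of clipping by global norm (the threshold is an arbitrary choice):

      import numpy as np

      def clip_by_norm(grads, max_norm=5.0):
          norm = np.sqrt(sum(np.sum(g**2) for g in grads))
          if norm > max_norm:
              grads = [g * (max_norm / norm) for g in grads]
          return grads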

  • recurrent neural networks

    • Vanishing gradients: on every step the gradient is, roughly speaking, multiplied by $W$, the matrix of recurrent weights.

    • This does not let the network implement long-term memory.

    • We need to let the gradients flow freely.

  • recurrent neural networks

    • Constant error carousel – LSTM:
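
    • The standard LSTM equations (as commonly written); the additive update of the cell state $c_t$ is the constant error carousel that lets gradients flow:

      $i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$, $\quad f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$, $\quad o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$,

      $\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$, $\quad c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$, $\quad h_t = o_t \odot \tanh(c_t)$.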

  • recurrent neural networks

    • Or simplified variants, e.g., the GRU:

  • a typical recurrent network

    • A simple seq2seq architecture (Sutskever et al. 2014):

  • a typical recurrent network

    • A slightly more complicated seq2seq architecture:

  • autoencoders

    • Autoencoders: how to do unsupervised learning with NNs. Reconstruct the input at the output:

    • Why can’t the network simply copy input to output?

  • autoencoders

    • Undercomplete vs. overcomplete autoencoders.

    • Regularization helps a lot, especially dropout.

    • Denoising autoencoders: add noise to the input, reconstruct the true input:
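
    • A minimal denoising autoencoder sketch in tf.keras (sizes, noise level, and data are placeholders): corrupt the input with Gaussian noise and train the network to reconstruct the clean version.

      import numpy as np
      import tensorflow as tf
      from tensorflow.keras import layers

      inputs = tf.keras.Input(shape=(784,))
      encoded = layers.Dense(64, activation='relu')(inputs)      # undercomplete code
      decoded = layers.Dense(784, activation='sigmoid')(encoded)
      autoencoder = tf.keras.Model(inputs, decoded)
      autoencoder.compile(optimizer='adam', loss='mse')

      x_clean = np.random.rand(1000, 784).astype('float32')      # placeholder data
      x_noisy = np.clip(x_clean + 0.1 * np.random.randn(1000, 784), 0.0, 1.0).astype('float32')
      autoencoder.fit(x_noisy, x_clean, epochs=1, batch_size=32) # reconstruct the clean input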

  • other stuff

    • Many other directions we have not touched:

      • deep reinforcement learning;

      • generative networks, in particular GANs;

      • Bayesian inference and neural networks;

      • …

    • But now let us proceed to NLP.

  • nlp problems

  • nlp problems

    • Syntactic, more or less well-defined problems:

      • part-of-speech tagging;

      • morphological segmentation;

      • stemming or lemmatization;

      • sentence boundary disambiguation;

      • word segmentation;

      • named entity recognition;

      • word sense disambiguation;

      • syntactic parsing;

      • coreference resolution.

  • nlp problems

    • But even these require understanding in hard cases:

    • Anaphora resolution:

      • “the laptop did not fit in the bag because it was too small”;

      • “the laptop did not fit in the bag because it was too big”.

  • nlp problems

    • More complex problems that require understanding more often, but where we still know the right answers and can compute quality metrics:

      • language models;

      • sentiment analysis;

      • relationship extraction, fact extraction;

      • question answering.

  • nlp problems

    • Problems where we not only have to understand text but also to generate new text:

      • text generation per se;

      • automatic summarization;

      • machine translation;

      • dialog and conversational models.

    • There are deep learning models for all these problems.

  • word embeddings, sentence embeddings, and character-level models
