
  • deep learning for natural language processing

    Sergey I. Nikolenko

    AIST 2017 Moscow, July 28, 2017

    Steklov Institute of Mathematics at St. Petersburg

    Random facts:

    • on July 28, 1814, Percy Bysshe Shelley, married to Harriet Westbrook at the time, eloped from England to France with the 16-year-old Mary Wollstonecraft Godwin, later Mary Shelley

    • on July 28, 1974, KGB chairman Yuri Andropov created the Alpha Group (Spetsgruppa A)

    • July 28 -- Russia's Baptism Day; the Duma passed this bill on May 21, 2010

    • July 28 -- World Hepatitis Day; the World Health Assembly passed this resolution on May 21, 2010

  • plan

    • The deep learning revolution has not passed natural language processing by.

    • DL in NLP started with standard architectures (RNNs, CNNs) but has since branched out into new directions.

    • Our plan for today: (1) a very brief intro to neural networks and deep learning; (2) an intro to distributed word representations; (3) a primer on sentence embeddings and character-level models; (4) a (very, very) brief overview of the most promising directions in modern NLP based on deep learning.

    • We will concentrate on directions that have given rise to new models and architectures.

  • neural networks

  • machine learning as optimization

    • Machine learning deals with many kinds of problems: supervised, unsupervised, and so on.

    • They are usually solved with Bayes’ theorem, recomputing our prior beliefs about the parameters into posteriors:

    $p(\theta \mid D) = \dfrac{p(\theta)\, p(D \mid \theta)}{p(D)}$.
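
    • As a quick illustration (a made-up example, not from the talk): the posterior over a coin’s bias after 7 heads in 10 flips, computed on a discretized grid.

      import numpy as np

      # Hypothetical example: posterior over a coin's bias theta after 7 heads in 10 flips.
      thetas = np.linspace(0.01, 0.99, 99)        # discretized parameter grid
      prior = np.ones_like(thetas) / len(thetas)  # uniform prior p(theta)
      heads, flips = 7, 10
      likelihood = thetas**heads * (1 - thetas)**(flips - heads)  # p(D | theta)
      posterior = prior * likelihood
      posterior /= posterior.sum()                # normalize by p(D)
      print(thetas[np.argmax(posterior)])         # MAP estimate, ~0.7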

  • machine learning as optimization

    • This usually means optimizing a multivariate function.

    • Algorithmically, machine learning is about approximation and optimization of complicated functions (likelihoods, posteriors, error functions, etc.).

    • A deep neural network is a big and complicated function of many variables, highly nonconvex.

    • For nonconvex functions this is usually done with some version of gradient descent.

  • gradient descent

    • Gradient descent is the main and virtually the only way to optimize really complicated functions.

    • We take the gradient $\nabla E(w)$, move a little against it, and repeat.

    • In neural networks, gradients are easy to compute because NNs are just huge compositions of very simple elements.
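
    • A minimal sketch of this loop (the quadratic error function and data here are arbitrary placeholders):

      import numpy as np

      # Toy error function E(w) = ||Xw - y||^2 with made-up data.
      rng = np.random.default_rng(0)
      X, y = rng.normal(size=(100, 3)), rng.normal(size=100)

      def grad_E(w):
          return 2 * X.T @ (X @ w - y)   # gradient of the squared error

      w = np.zeros(3)
      lr = 0.001                          # learning rate: "move a little"
      for _ in range(1000):
          w -= lr * grad_E(w)             # step against the gradient, repeat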

  • perceptron

    • The perceptron is the basic building block of neural networks: $y = h(w^\top x) = h\left(\sum_i w_i x_i\right)$.

    • Different nonlinear activation functions $h$ can be used:
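
    • A sketch of a single perceptron with a few common choices of $h$ (sigmoid, tanh, ReLU); the inputs and weights are arbitrary:

      import numpy as np

      def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))
      def tanh(z):    return np.tanh(z)
      def relu(z):    return np.maximum(0.0, z)

      def perceptron(x, w, h=sigmoid):
          return h(np.dot(w, x))          # y = h(w^T x)

      x = np.array([1.0, -2.0, 0.5])
      w = np.array([0.3, 0.1, -0.4])
      print(perceptron(x, w, sigmoid), perceptron(x, w, relu))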

  • uniting perceptrons into networks

    • A network of perceptrons.

    • Hornik, 1990: two layers suffice to approximate any function.

    • But deep networks are more efficient: the power of distributed representations.

    • Usually neural networks are organized into layers.

  • computational graph

    • Suppose we have represented a complex function as a composition of simple ones.

    • E.g., $f(x, y) = x^2 + xy + (x + y)^2$:

  • computational graph

    • We can now compute derivatives on the computational graph.

    • Forward propagation: derivatives of every node with respect to a given one (an input).

    • Backpropagation: derivatives of one node (the output) with respect to every other.

    • We can now compute gradients of arbitrarily complicated functions.

    • Backprop gives us the derivatives of the error function needed for gradient descent.

    • Theano, TensorFlow, and Torch are automatic differentiation libraries.
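
    • For example, a sketch with TensorFlow 2 (assuming the tf.GradientTape API) differentiating the toy function $f(x, y) = x^2 + xy + (x + y)^2$ from the computational-graph example:

      import tensorflow as tf

      x = tf.Variable(3.0)
      y = tf.Variable(2.0)
      with tf.GradientTape() as tape:
          f = x**2 + x*y + (x + y)**2      # build the computational graph
      df_dx, df_dy = tape.gradient(f, [x, y])
      print(df_dx.numpy(), df_dy.numpy())  # 2x + y + 2(x+y) = 18, x + 2(x+y) = 13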

  • gradient descent in nns

    • A neural network has a lot of parameters, and many tricks help train them:

      • regularization ($L_2$, $L_1$, dropout);

      • initialization (unsupervised pretraining, random init);

      • normalization (batchnorm, layer normalization);

      • improved gradient descent (momentum, Nesterov momentum, Adadelta, Adam).

    • Two main classes of architectures, both from the 1980s:

      • convolutional neural networks;

      • recurrent neural networks.

  • gradient descent in nns

    • A neural network has a lot of parameters, so regularization is important:

      • $L_2$- and $L_1$-regularization;

      • early stopping;

      • dropout.
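
    • A sketch of (inverted) dropout at training time; the dropout rate and activation shapes are arbitrary:

      import numpy as np

      def dropout(activations, p_drop=0.5, training=True):
          if not training:
              return activations                       # no-op at test time
          mask = (np.random.rand(*activations.shape) >= p_drop)
          return activations * mask / (1.0 - p_drop)   # rescale to keep expectations

      h = np.random.randn(4, 8)                        # a batch of hidden activations
      print(dropout(h, p_drop=0.5))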

  • gradient descent in nns

    • These parameters have to be initialized:

      • unsupervised pretraining;

      • random initialization:

        • Xavier initialization (Glorot, Bengio, 2010);

        • He initialization (He et al., 2015);

      • the idea is to control the variances so that the gradients do not vanish along the layers.
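
    • A sketch of both initialization schemes for a fully connected layer, using the commonly stated variance formulas (exact variants differ between papers):

      import numpy as np

      def xavier_init(n_in, n_out):
          # Glorot & Bengio (2010): variance ~ 2 / (n_in + n_out)
          limit = np.sqrt(6.0 / (n_in + n_out))
          return np.random.uniform(-limit, limit, size=(n_in, n_out))

      def he_init(n_in, n_out):
          # He et al. (2015), for ReLU units: variance ~ 2 / n_in
          return np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)

      W1 = xavier_init(256, 128)
      W2 = he_init(256, 128)
      print(W1.std(), W2.std())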

  • gradient descent in nns

    • Batch normalization:

      • covariate shift: the inputs to the next layer change whenever the previous layer changes;

      • it would be good to normalize over the whole dataset, but that is too hard;

      • hence, we normalize over mini-batches, with additional parameters $\gamma$, $\beta$ for more expressivity.
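
    • A sketch of the training-time batchnorm transform over a mini-batch (running statistics for test time are omitted):

      import numpy as np

      def batchnorm_train(x, gamma, beta, eps=1e-5):
          # x: (batch_size, features); normalize each feature over the mini-batch
          mu = x.mean(axis=0)
          var = x.var(axis=0)
          x_hat = (x - mu) / np.sqrt(var + eps)
          return gamma * x_hat + beta        # gamma, beta restore expressivity

      x = np.random.randn(32, 64) * 5 + 3
      gamma, beta = np.ones(64), np.zeros(64)
      print(batchnorm_train(x, gamma, beta).mean(), batchnorm_train(x, gamma, beta).std())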

  • gradient descent in nns

    • Gradient descent improvements:

      • gradually reducing the learning rate;

      • momentum, Nesterov momentum;

      • adaptive gradient descent (Adagrad, Adadelta, RMSprop, Adam).
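
    • A sketch of two of these update rules, classical momentum and Adam, with the usual default hyperparameters (not prescribed by the slides):

      import numpy as np

      def momentum_step(w, grad, v, lr=0.01, mu=0.9):
          v = mu * v - lr * grad                 # accumulate a velocity
          return w + v, v

      def adam_step(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
          m = b1 * m + (1 - b1) * grad           # first moment (mean of gradients)
          v = b2 * v + (1 - b2) * grad**2        # second moment (uncentered variance)
          m_hat = m / (1 - b1**t)                # bias correction
          v_hat = v / (1 - b2**t)
          return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v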

  • convolutional neural networks

    • Basic convolutional layer: convolution, nonlinearity, pooling.

    • Classical architecture – LeNet:
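
    • A LeNet-style stack in tf.keras as an illustrative sketch (not the original LeNet-5 hyperparameters): convolution + nonlinearity + pooling blocks followed by dense layers.

      import tensorflow as tf
      from tensorflow.keras import layers

      # A LeNet-flavoured sketch for 28x28 grayscale inputs (e.g., MNIST-like data).
      model = tf.keras.Sequential([
          layers.Conv2D(6, 5, activation='relu', input_shape=(28, 28, 1)),
          layers.MaxPooling2D(2),
          layers.Conv2D(16, 5, activation='relu'),
          layers.MaxPooling2D(2),
          layers.Flatten(),
          layers.Dense(120, activation='relu'),
          layers.Dense(84, activation='relu'),
          layers.Dense(10, activation='softmax'),
      ])
      model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')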

  • convolutional neural networks

    • Convolutional nets are usually very deep (AlexNet):

    • Usually trained on ImageNet (> 15M images, ≈ 22K classes).

  • convolutional neural networks

    • VGG: it suffices to take 3 × 3 convolutions:

    • Network in Network, Inception:

  • convolutional neural networks

    • Residual connections (ResNet) are a very important idea; they led to a “revolution of depth”.
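
    • The block learns a residual $F(x)$ that is added to its input; a minimal sketch in tf.keras (layer sizes are arbitrary):

      import tensorflow as tf
      from tensorflow.keras import layers

      def residual_block(x, filters=64):
          # F(x): two conv layers; the output is F(x) + x, so gradients can flow
          # through the identity shortcut.
          f = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
          f = layers.Conv2D(filters, 3, padding='same')(f)
          out = layers.Add()([f, x])
          return layers.Activation('relu')(out)

      inputs = tf.keras.Input(shape=(32, 32, 64))
      outputs = residual_block(inputs)
      model = tf.keras.Model(inputs, outputs)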

  • convolutional neural networks

    • Other objective functions can be added (e.g., segmentation):

    • In general, it is very easy to combine architectures: just concatenate inputs or add parallel layers that do something new.

    • We will see lots of examples of this in NLP.

  • recurrent neural networks

    • Various kinds of problems are based on sequences:

    • A “simple” RNN:
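
    • One step of such a simple (Elman-style) RNN cell as a sketch; $W$, $U$, $b$ are the recurrent, input, and bias parameters:

      import numpy as np

      def rnn_step(x_t, h_prev, W, U, b):
          # h_t = tanh(W h_{t-1} + U x_t + b); the same W, U, b are reused at every step
          return np.tanh(W @ h_prev + U @ x_t + b)

      hidden, inp = 16, 8
      W = np.random.randn(hidden, hidden) * 0.1
      U = np.random.randn(hidden, inp) * 0.1
      b = np.zeros(hidden)

      h = np.zeros(hidden)
      for x_t in np.random.randn(5, inp):      # unroll over a length-5 sequence
          h = rnn_step(x_t, h, W, U, b)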

  • recurrent neural networks

    • Bidirectional RNN:

    • Two problems with RNNs:

      • exploding gradients;

      • vanishing gradients.

  • recurrent neural networks

    • Solving exploding gradients is easy: simply clip the gradients.
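
    • A sketch of clipping by global norm (the threshold is an arbitrary choice):

      import numpy as np

      def clip_by_norm(grads, max_norm=5.0):
          norm = np.sqrt(sum(np.sum(g**2) for g in grads))
          if norm > max_norm:
              grads = [g * (max_norm / norm) for g in grads]
          return grads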

  • recurrent neural networks

    • Vanishing gradients: on every step the gradient is, roughly speaking, multiplied by $W$, the matrix of recurrent weights.

    • This does not let the network implement long-term memory.

    • We need to let the gradients flow freely.

  • recurrent neural networks

    • Constant error carousel – LSTM:
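
    • The standard LSTM equations (as commonly written); the additive update of the cell state $c_t$ is the constant error carousel that lets gradients flow:

      $i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$, $\quad f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$, $\quad o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$,

      $\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$, $\quad c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$, $\quad h_t = o_t \odot \tanh(c_t)$.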

  • recurrent neural networks

    • Or simplified variants, e.g., the GRU:

  • a typical recurrent network

    • A simple seq2seq architecture (Sutskever et al. 2014):

  • a typical recurrent network

    • A slightly more complicated seq2seq architecture:

  • autoencoders

    • Autoencoders: how to do unsupervised learning with NNs. Reconstruct the input at the output:

    • Why can’t the network simply copy input to output?

  • autoencoders

    • Undercomplete vs. overcomplete autoencoders.

    • Regularization helps a lot, especially dropout.

    • Denoising autoencoders: add noise to the input, reconstruct the true input:
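
    • A minimal denoising autoencoder sketch in tf.keras (sizes, noise level, and data are placeholders): corrupt the input with Gaussian noise and train the network to reconstruct the clean version.

      import numpy as np
      import tensorflow as tf
      from tensorflow.keras import layers

      inputs = tf.keras.Input(shape=(784,))
      encoded = layers.Dense(64, activation='relu')(inputs)      # undercomplete code
      decoded = layers.Dense(784, activation='sigmoid')(encoded)
      autoencoder = tf.keras.Model(inputs, decoded)
      autoencoder.compile(optimizer='adam', loss='mse')

      x_clean = np.random.rand(1000, 784).astype('float32')      # placeholder data
      x_noisy = np.clip(x_clean + 0.1 * np.random.randn(1000, 784), 0.0, 1.0).astype('float32')
      autoencoder.fit(x_noisy, x_clean, epochs=1, batch_size=32) # reconstruct the clean input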

  • other stuff

    • Many other directions we have not touched:

      • deep reinforcement learning;

      • generative networks, in particular GANs;

      • Bayesian inference and neural networks;

      • …

    • But now let us proceed to NLP.

  • nlp problems

  • nlp problems

    • Syntactic, more or less well-defined problems:

      • part-of-speech tagging;

      • morphological segmentation;

      • stemming or lemmatization;

      • sentence boundary disambiguation;

      • word segmentation;

      • named entity recognition;

      • word sense disambiguation;

      • syntactic parsing;

      • coreference resolution.

  • nlp problems

    • But even these require understanding in hard cases:

    • Anaphora resolution:

      • “the laptop did not fit in the bag because it was too small”;

      • “the laptop did not fit in the bag because it was too big”.

  • nlp problems

    • More complex problems that require understanding more often, but where we still know the right answers and can compute quality metrics:

      • language models;

      • sentiment analysis;

      • relationship extraction, fact extraction;

      • question answering.

  • nlp problems

    • Problems where we not only have to understand text but also to generate new text:

      • text generation per se;

      • automatic summarization;

      • machine translation;

      • dialog and conversational models.

    • There are deep learning models for all these problems.

  • word embeddings, sentence embeddings, and character-level models
