
Page 1:

Neural models in NLP

Natural Language Processing: Lecture 4

28.09.2017

Kairit Sirts

Page 2:

The goal of today’s lecture

• Explain word embeddings

• Explain the recurrent neural models used in NLP

Page 3:

Log-linear language model

y – the next word to predict

x – the context sequence: words, annotations, etc.

v – model parameters

f(x, y) – feature vector for the input-output pair (x, y)
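The model equation itself did not survive extraction; with the definitions above, the standard log-linear form (my reconstruction, not the slide's own rendering) is:

p(y \mid x; v) = \frac{\exp(v \cdot f(x, y))}{\sum_{y'} \exp(v \cdot f(x, y'))}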

Page 4:

The problem with log-linear models

Feature engineering

• Developing feature templates

• Which features are relevant to which problems?

• Experiment with subsets of features

• Features can be very complex

Page 5:

What if we could let the model learn the relevant features automatically?

Neural networks

Page 6:

1-hot representation

the girl with flowers is cute are were flower … …

The 1 0 0 0 0 0 0 0 0 0 0

girl 0 1 0 0 0 0 0 0 0 0 0

with 0 0 1 0 0 0 0 0 0 0 0

the 1 0 0 0 0 0 0 0 0 0 0

flowers 0 0 0 1 0 0 0 0 0 0 0

is 0 0 0 0 1 0 0 0 0 0 0

cute 0 0 0 0 0 1 0 0 0 0 0

… … … … … … … … … … … …

flower 0 0 0 0 0 0 0 0 1 0 0

Page 7:

What is the similarity between vectors for flower and flowers?

the girl with flowers is cute are were flower … …

flowers 0 0 0 1 0 0 0 0 0 0 0

flower 0 0 0 0 0 0 0 0 1 0 0
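A minimal sketch (not part of the slides) that answers the question: any two distinct one-hot vectors are orthogonal, so the cosine similarity of flower and flowers under this representation is 0.

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

flowers = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]  # one-hot vector for "flowers"
flower  = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]  # one-hot vector for "flower"
print(cosine(flowers, flower))  # 0.0 - the vectors share no non-zero dimension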

Page 8:

Features as distributed representations

Deep Learning: What is meant by a distributed representation? https://www.quora.com/Deep-Learning-What-is-meant-by-a-distributed-representation/answer/Rangan-Majumder

Page 9:

Distributed word representations


f1 f2 f3 f4

flower 6 3 0 4

flowers 1 7 2 8

What is the cosine similarity between flower and flowers now?
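A minimal sketch (not part of the slides) computing the answer for the vectors in the table above: the cosine similarity is now about 0.70 instead of 0.

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

flower  = [6, 3, 0, 4]   # the row for "flower" in the table above
flowers = [1, 7, 2, 8]   # the row for "flowers"
print(round(cosine(flower, flowers), 2))  # ~0.7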

Page 10:

Learning distributed word representations

The girl with the flowers is cute.

She has the flowers in her hand.

I picked these flowers myself.

The girl with a flower is cute.

She has a flower in her hand.

I picked this flower myself.

Contexts observed with "flowers": with the, has the, picked the, is cute, in her, myself

Contexts observed with "flower": with a, has a, picked a, is cute, in her, myself

Page 11:

http://metaoptimize.s3.amazonaws.com/cw-embeddings-ACL2010/embeddings-mostcommon.EMBEDDING_SIZE=50.png

Page 12:

Page 13:

Word2Vec

Mikolov et al., 2013. Efficient Estimation of Word Representations in Vector Space

Page 14:

CBOW – continuous bag of words

• w(t-2), w(t-1), w(t+1), w(t+2) – one-hot vectors of the context words

• the embedding of each context word – a row in the parameter matrix

• C – the set of context vectors

• c – the size of the context window

• the projection from the context vectors to the hidden layer is linear (a reconstructed form follows this list)

• d – embedding size
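The projection formula did not survive extraction; a standard CBOW form (my notation, with E for the embedding matrix and W' for the output weights, neither taken from the slide) is:

h = \frac{1}{2c} \sum_{-c \le j \le c,\, j \ne 0} E\, w(t+j), \qquad p(w(t) \mid \text{context}) = \mathrm{softmax}(W' h)

That is, each one-hot context vector selects a row of E, the rows are averaged into a d-dimensional hidden vector h, and the middle word is predicted with a softmax.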

Page 15:

Skip-gram model

• Predict the context words

• w(t) – one-hot vector

• Maximize the probability of each context word given the middle word (the softmax form is reconstructed below)
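The maximized quantity was lost in extraction; in the notation of Mikolov et al. (2013), with v_w for input (middle-word) embeddings, v'_w for output (context-word) embeddings and V for the vocabulary size, the skip-gram softmax is:

p(w_O \mid w_I) = \frac{\exp({v'_{w_O}}^{\top} v_{w_I})}{\sum_{w=1}^{V} \exp({v'_w}^{\top} v_{w_I})}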

Page 16:

Training word embeddings

• General principle – maximize the probability of:

• the middle word, given the context words (CBOW)

• the context words, given the middle word (skip-gram)

• In the case of skip-gram, given a training text of T words:

• Maximize: the average log-probability of the context words (reconstructed below)

• Minimize: the corresponding negative log-likelihood
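Neither formula survived extraction; the standard skip-gram objective (my reconstruction) for a training text w_1, …, w_T with window size c is:

Maximize: \frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t)

Minimize: the negative of the same quantity (the negative log-likelihood).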

Page 17:

Training word embeddings

• Typically trained with gradient descent

• You will learn more sophisticated methods in other courses

• Initialize the parameter vectors/matrices (somehow)

• Repeat until convergence: update the parameters with a gradient step (the rule is reconstructed below)

• the set of all trainable parameters (written θ below)

• the learning rate (written α below)
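The update rule itself is missing from the transcript; writing θ for the set of all trainable parameters, α for the learning rate and L(θ) for the loss (symbols chosen here, not taken from the slide), the standard gradient-descent step repeated until convergence is:

\theta \leftarrow \theta - \alpha\, \nabla_{\theta} L(\theta)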

Page 18:

Softmax vs log-linear model

Softmax is a log-linear model

Log-linear:

Softmax:
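The two formulas were lost in extraction; standard forms consistent with the notation of page 3 (my reconstruction) are:

Log-linear: p(y \mid x; v) = \frac{\exp(v \cdot f(x, y))}{\sum_{y'} \exp(v \cdot f(x, y'))}

Softmax: \mathrm{softmax}(z)_j = \frac{\exp(z_j)}{\sum_{k} \exp(z_k)}

Taking z_y = v \cdot f(x, y) as the score of output y, the softmax over these scores is exactly the log-linear distribution.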

Page 19:

The gradient of a log-linear model

Empirical count vs. expected count
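The gradient formula was an image; for the log-linear model of page 3 the standard result (my reconstruction) is:

\frac{\partial \log p(y \mid x; v)}{\partial v} = f(x, y) - \sum_{y'} p(y' \mid x; v)\, f(x, y')

The first term is the empirical count of the features observed in the data, the second is their expected count under the model.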

Page 20:

The gradients in skip-gram model

c – context word

w – middle word
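The formulas themselves did not survive extraction; by analogy with the log-linear gradient on the previous slide, the gradient of the skip-gram log-probability with respect to the middle-word embedding v_w (a standard result, not copied from the slide) is:

\frac{\partial \log p(c \mid w)}{\partial v_w} = v'_c - \sum_{c' \in V} p(c' \mid w)\, v'_{c'}

Note the sum over the whole vocabulary V in the second (expected-count) term.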

Page 21:

The problem with softmax gradients

• Computing the softmax and its gradients is computationally very expensive. Why?

• The gradients always include the sum over the whole vocabulary

• This makes computation very inefficient

Page 22:

Negative sampling

The general idea:

Maximize the probability of the (word, context) pairs that came from the training data (instead of the probability of the context given the word)

Previously: maximize

Now: maximize
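Both objectives were images in the original slide; a standard way to write them (following Goldberg and Levy, 2014, cited on page 25) is:

Previously: maximize \sum_{(w, c) \in D} \log p(c \mid w)

Now: maximize \sum_{(w, c) \in D} \log p(D = 1 \mid w, c), \quad \text{with } p(D = 1 \mid w, c) = \sigma({v'_c}^{\top} v_w)

where D is the set of (word, context) pairs in the training data and σ is the sigmoid function.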

Page 23:

Skip-gram objective with negative sampling

Maximize:

• the set of random negative samples

• In practice, the number of negative samples per positive sample is between 2 and 20
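The maximized expression is missing from the transcript; the standard skip-gram negative-sampling objective for one (w, c) pair (Mikolov et al., 2013, cited on page 25) is:

\log \sigma({v'_c}^{\top} v_w) + \sum_{c_N \in N} \log \sigma(-{v'_{c_N}}^{\top} v_w)

where N is the set of random negative samples for this pair.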

Page 24:

Tools for training word embeddings

• Word2vec – Gensim includes both CBOW and skip-gram implementations (see the sketch after this list)

• GloVe – optimizes the prediction of co-occurrence counts between words

• Polyglot

• Dependency-based word embeddings
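A minimal usage sketch, not part of the slides: training skip-gram embeddings with Gensim's Word2Vec. The toy corpus and all parameter values are made up; argument names follow Gensim 4.x (older versions use size instead of vector_size).

from gensim.models import Word2Vec

sentences = [
    ["the", "girl", "with", "the", "flowers", "is", "cute"],
    ["she", "has", "the", "flowers", "in", "her", "hand"],
    ["i", "picked", "this", "flower", "myself"],
]
model = Word2Vec(
    sentences,
    vector_size=50,  # d - embedding size
    window=2,        # c - context window size
    sg=1,            # 1 = skip-gram, 0 = CBOW
    negative=5,      # negative samples per positive sample
    min_count=1,
)
print(model.wv["flower"][:5])                    # first dimensions of one embedding
print(model.wv.similarity("flower", "flowers"))  # cosine similarity of the two vectors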

Page 25:

Further reading on word embeddings

• Mikolov et al., 2013. Distributed representations of words and phrases and their compositionality

• Mikolov et al., 2013. Efficient estimation of word representations in vector space

• Goldberg and Levy, 2014. word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method

• Pennington et al., 2014. GloVe: Global Vectors for Word Representation

• Al-Rfou’ et al., 2013. Polyglot: Distributed Word Representations for Multilingual NLP

• Levy and Goldberg, 2014. Dependency-based word embeddings

Page 26:

Regularities between word embeddings

Vector Representations of Words: https://www.tensorflow.org/tutorials/word2vec

Page 27:

Word embedding models as neural networks

CS231n Convolutional Neural Networks for Visual Recognition: http://cs231n.github.io/assets/nn1/neural_net.jpeg

One-hot vector of the input word

Prediction of the context word (softmax) or whether the (context, word) pair belongs to the data (negative sampling)

The row corresponding to the input word in the parameter matrix – the word embedding

Page 28:

Recurrent Neural Networks

http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/

Page 29:

RNN Language Model

https://www.linkedin.com/pulse/what-i-learned-from-deep-learning-summer-school-2016-hamid-palangi

Page 30:

Machine Translation with RNN

http://cs224d.stanford.edu/lectures/CS224d-Lecture8.pdf

Page 31:

RNN music generation

Music Language Modeling with Recurrent Neural Networks: http://yoavz.com/music_rnn/

Page 32:

Sequence Models

The Unreasonable Effectiveness of Recurrent Neural Networks: http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Page 33:

Recurrent Neural Networks

- the initial state (written s_0 below)

- a nonlinear function, e.g. tanh (written f below)

(In the original figure the network is unrolled over the input sequence, starting from <s>, and so on.)
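The recurrence itself did not survive extraction; in the notation of the WildML tutorial linked on page 28 (my reconstruction), with x_t the input vector at step t, s_t the hidden state, U, W, V parameter matrices and f a nonlinearity such as tanh:

s_t = f(U x_t + W s_{t-1}), \qquad o_t = \mathrm{softmax}(V s_t), \qquad s_0 = \text{the initial state}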

Page 34:

Non-linear activation functions

http://adilmoujahid.com/posts/2016/06/introduction-deep-learning-python-caffe/

Page 35:

Cross-entropy loss function

https://theneuralperspective.com/2016/10/02/02-logistic-regression/
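The formula itself was only present as an image; the standard cross-entropy loss for a predicted distribution \hat{y} and a true (one-hot) distribution y is:

L(y, \hat{y}) = -\sum_{i} y_i \log \hat{y}_i

which for binary classification reduces to -[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,].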

Page 36:

Training neural networks

• Typically with stochastic or mini-batch gradient descent (a small sketch of the loop follows this list)

• (Full batch) GD – gradients are computed based on all training items

• Mini-batch GD – at each step the gradients are computed on a small number (a mini-batch) of training items, for instance 20, 32 or 128

• Stochastic GD – gradients are computed based on a single training item

• Gradients are computed using back-propagation

• BP is an algorithm for efficiently applying the chain rule

• There are several versions of gradient descent that set the learning rates in a clever way

• RMSProp, AdaGrad, AdaDelta, Momentum, Adam
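A minimal sketch, not part of the slides, of the mini-batch gradient-descent loop described above; loss_and_grads is a placeholder standing in for the forward pass plus back-propagation of an actual network.

import random

def minibatch_sgd(params, data, loss_and_grads, lr=0.01, batch_size=32, epochs=10):
    for _ in range(epochs):
        random.shuffle(data)
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]               # a mini-batch of training items
            loss, grads = loss_and_grads(params, batch)  # back-propagation returns the gradients
            for name in params:                          # gradient step on every parameter
                params[name] = params[name] - lr * grads[name]
    return params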

Page 37:

Gated units

• RNNs are supposed to remember long contexts, but in practice they don’t

• Gated units, such as LSTM or GRU, include gates that control (a standard parameterisation is sketched after this list):

• How much of the next input is read in

• How much of the previous hidden state is remembered or forgotten

• How much of the cell state is used in the output
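A standard LSTM parameterisation, not taken from the slide, that matches the three gates listed above (σ is the sigmoid, ⊙ elementwise multiplication):

i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)  (input gate: how much of the new input is read in)

f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)  (forget gate: how much of the previous state is kept)

o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)  (output gate: how much of the cell state reaches the output)

\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c), \quad c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \quad h_t = o_t \odot \tanh(c_t)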

Figure 12 from Herath et al., 2016. Going Deeper into Action Recognition: A Survey.

Page 38:

Tools for creating and training neural networks

• Python libraries that perform symbolic gradient computation:

• Keras

• Tensorflow

• Theano

• PyTorch

• Dynet

• …

• The field is developing rapidly

Page 39:

RNN LM and word embeddings

• Inputs x – one-hot vectors

• Parameter matrix U – the word embeddings

• Training embeddings with word2vec or a similar model is faster than with RNNLM

• Pretrained word embeddings can be used to initialise the U matrix in RNNLM (a sketch follows this list)

• Transfer learning
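A minimal sketch, not part of the slides, of this kind of transfer learning in PyTorch: the embedding layer (the U matrix above) is initialised from pretrained vectors. The vocabulary size, dimensions and the pretrained tensor are made up; in practice the tensor would be loaded from a trained word2vec model.

import torch
import torch.nn as nn

vocab_size, emb_size, hidden_size = 10000, 100, 256
pretrained = torch.randn(vocab_size, emb_size)   # stand-in for loaded word2vec vectors

embedding = nn.Embedding.from_pretrained(pretrained, freeze=False)  # fine-tuned further during training
rnn = nn.RNN(emb_size, hidden_size, batch_first=True)
output_layer = nn.Linear(hidden_size, vocab_size)

word_ids = torch.tensor([[1, 5, 42]])          # a batch with one three-word context
states, _ = rnn(embedding(word_ids))           # embedded words fed through the RNN
logits = output_layer(states)                  # a score for every vocabulary word at each position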

Page 40:

Further reading

• Understanding LSTM networks

• Mikolov et al., 2013. Linguistic Regularities in Continuous Space Word Representations

Page 41:

Recap

• Word embeddings are dense distributed representations of words

• Word embeddings are trained from (word, context) pairs using neural models

• Word embeddings can be viewed as automatically learned feature vectors

• Recurrent neural networks are neural sequence models often used in NLP

• Pretrained word embeddings can be used to initialize the embedding layer of recurrent neural models that take textual input
