Deep Learning for NLP
Many slides from Greg Durrett
Instructor: Wei Xu, Ohio State University
CSE 5525
Outline
‣ Motivation for neural networks
‣ Feedforward neural networks
‣ Applying feedforward neural networks to NLP
‣ Convolutional neural networks
‣ Application examples
‣ Tools
Sentiment Analysis
the movie was very good 👍
Sentiment Analysis with Linear Models

Example                                     Label   Feature                        Type
the movie was very good                     👍      I[good]                        unigrams
the movie was very bad                      👎      I[bad]                         unigrams
the movie was not bad                       👍      I[not bad]                     bigrams
the movie was not very good                 👎      I[not very good]               trigrams
the movie was not really very enjoyable     👎      I[not really very enjoyable]   4-grams!
Drawbacks
‣ More complex features capture interactions but scale badly (13M unigrams, 1.3B 4-grams in Google n-grams)
  the movie was not really very enjoyable
‣ Can we do better than seeing every n-gram once in the training data?
  not very good ≈ not so great
‣ Instead of more complex linear functions, let’s use simpler nonlinear functions, namely neural networks
Neural Networks: XOR
‣ Let’s see how we can use neural nets to learn a simple nonlinear function
‣ Inputs: x1, x2 (generally x = (x1, . . . , xm))
‣ Output: y (generally y = (y1, . . . , yn)); here y = x1 XOR x2

x1   x2   x1 XOR x2
0    0    0
0    1    1
1    0    1
1    1    0

[plot: the four points in the (x1, x2) plane]
Neural Networks: XOR

x1   x2   x1 XOR x2
0    0    0
0    1    1
1    0    1
1    1    0

‣ A linear function y = a1x1 + a2x2 cannot fit XOR
‣ Add a nonlinear term: y = a1x1 + a2x2 + a3 tanh(x1 + x2), where x1 + x2 acts like an “or” of the inputs (tanh looks like the action potential in a neuron)
Neural Networks: XOR
‣ y = a1x1 + a2x2 (linear; cannot fit XOR)
‣ y = a1x1 + a2x2 + a3 tanh(x1 + x2), e.g. an “or” plus a linear part:
  y = -x1 - x2 + 2 tanh(x1 + x2)

[plot: the resulting nonlinear decision boundary in the (x1, x2) plane]
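A quick numerical check of that formula (a minimal NumPy sketch; the 0.25 threshold for reading off the XOR decision is my own choice):

```python
import numpy as np

# y = -x1 - x2 + 2*tanh(x1 + x2) from the slide above
def xor_net(x1, x2):
    return -x1 - x2 + 2 * np.tanh(x1 + x2)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    y = xor_net(x1, x2)
    print(x1, x2, x1 ^ x2, round(y, 3))
# outputs are roughly 0.0, 0.523, 0.523, -0.073:
# thresholding y at 0.25 recovers x1 XOR x2 on all four inputs
```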
Neural Networks: XOR
‣ Relabel the inputs as features of a sentiment example: x1 = I[not], x2 = I[good]
  y = -2x1 - x2 + 2 tanh(x1 + x2)

[plot: the nonlinear decision boundary in the (I[not], I[good]) plane]
Neural Networks
Taken from http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

(Linear model: y = w · x + b)
y = g(w · x + b)
y = g(Wx + b)
Warp space (W), shift (b), nonlinear transformation (g)
Neural Networks
Taken from http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/
Linear classifier
Neural network
…possible because we transformed the space!
Deep Neural Networks
Adopted from Chris Dyer

y1 = g(w1 · x + b1)
(this was our neural net from the XOR example)

z = g(Vy + c), where y = g(Wx + b) is the output of the first layer
z = g(Vg(Wx + b) + c)

Input → Hidden Layer → Output
Deep Neural Networks
Taken from http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/
Deep Neural Networks
Adopted from Chris Dyer

z = g(Vg(Wx + b) + c)
With no nonlinearity: z = VWx + Vb + c, equivalent to z = Ux + d
(so stacking layers without a nonlinearity is no more expressive than a single linear layer)

Input → Hidden Layer → Output
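A minimal NumPy sketch of this two-layer computation (the shapes, 2 inputs, 3 hidden units, 1 output, are made up for illustration); it also verifies numerically that dropping the nonlinearity collapses the network into a single linear layer:

```python
import numpy as np

rng = np.random.default_rng(0)
W, b = rng.normal(size=(3, 2)), rng.normal(size=3)   # first (hidden) layer
V, c = rng.normal(size=(1, 3)), rng.normal(size=1)   # second (output) layer
x = np.array([1.0, 0.0])

g = np.tanh
z = g(V @ g(W @ x + b) + c)      # z = g(V g(Wx + b) + c)

# With no nonlinearity, the two layers collapse into one linear map z = Ux + d:
U, d = V @ W, V @ b + c
assert np.allclose(V @ (W @ x + b) + c, U @ x + d)
```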
Deep Neural Networks
Input → Hidden Layer → Output
‣ Nodes in the hidden layer can learn interactions or conjunctions of features, e.g. a unit that fires on “not OR good”
  x1 = I[not], x2 = I[good]
  y = -2x1 - x2 + 2 tanh(x1 + x2)
Learning Neural Networks
Input → Hidden Layer → Output
‣ To train the network we need gradients: the change in output w.r.t. input decomposes (chain rule) into the change in output w.r.t. hidden times the change in hidden w.r.t. input
‣ Computing these looks like running this network in reverse (backpropagation)
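A small sketch of that decomposition for the tiny network above, with the analytic gradient checked against finite differences (all weights are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(1)
W, b = rng.normal(size=(3, 2)), rng.normal(size=3)
v, c = rng.normal(size=3), 0.5
x = np.array([0.7, -1.2])

def forward(x):
    h = np.tanh(W @ x + b)            # hidden layer
    return v @ h + c                  # scalar output

# change in output w.r.t. input =
#   (change in output w.r.t. hidden) x (change in hidden w.r.t. input)
h = np.tanh(W @ x + b)
d_out_d_hidden = v                          # shape (3,)
d_hidden_d_input = (1 - h**2)[:, None] * W  # shape (3, 2); tanh' = 1 - tanh^2
d_out_d_input = d_out_d_hidden @ d_hidden_d_input

# finite-difference check
eps = 1e-6
numeric = np.array([(forward(x + eps * e) - forward(x - eps * e)) / (2 * eps)
                    for e in np.eye(2)])
assert np.allclose(d_out_d_input, numeric, atol=1e-5)
```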
Outline
‣ Motivation for neural networks
‣ Feedforward neural networks
‣ Applying feedforward neural networks to NLP
‣ Convolutional neural networks
‣ Application examples
‣ Tools
Feedforward Bag-of-words
x = (I[good], I[not], I[bad], I[to], I[a], …)
y = g(Wx + b)
  x: binary vector, length = vocabulary size
  W: real-valued matrix, dims = vocabulary size (~10k) x hidden layer size (~100)

[plot: learned nonlinear decision boundary over I[not] and I[good], as in the XOR example]
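A sketch of how the input x and the hidden layer are computed (the tiny vocabulary and the random W are purely illustrative):

```python
import numpy as np

vocab = {w: i for i, w in enumerate(["the", "movie", "was", "not", "good", "bad", "to", "a"])}

def bow_vector(tokens, vocab):
    """Binary vector of length |vocab|: x[i] = 1 if word i appears in the sentence."""
    x = np.zeros(len(vocab))
    for t in tokens:
        if t in vocab:
            x[vocab[t]] = 1.0
    return x

x = bow_vector("the movie was not good".split(), vocab)

rng = np.random.default_rng(2)
hidden_size = 4
W, b = rng.normal(size=(hidden_size, len(vocab))), np.zeros(hidden_size)
y = np.tanh(W @ x + b)               # y = g(Wx + b)
```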
Drawbacks to FFBoW
x = (I[good], I[not], I[bad], I[to], I[a], …)
‣ really not very good and really not very enjoyable — we don’t know the relationship between good and enjoyable
‣ Doesn’t preserve ordering in the input
‣ Lots of parameters to learn
Word Embeddings
[2-D visualization of word vectors: good, enjoyable, and great cluster together; bad, dog, and is lie elsewhere]
‣ word2vec: turn each word into a 100-dimensional vector
‣ Context-based embeddings: find a vector predictive of a word’s context
‣ Words in similar contexts will end up with similar vectors
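A toy illustration of that last point, using made-up 3-dimensional vectors (real word2vec vectors are ~100-dimensional and learned from large corpora):

```python
import numpy as np

# invented toy embeddings, just to show the geometry
vec = {
    "good":      np.array([ 0.9, 0.1,  0.2]),
    "enjoyable": np.array([ 0.8, 0.2,  0.1]),
    "great":     np.array([ 0.9, 0.0,  0.3]),
    "bad":       np.array([-0.8, 0.1,  0.2]),
    "dog":       np.array([ 0.0, 0.9, -0.5]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(vec["good"], vec["enjoyable"]))   # high: similar contexts
print(cosine(vec["good"], vec["bad"]))         # much lower
```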
Feedforward with word vectors
‣ Can capture word similarity
  the movie was good .
  the movie was great .
‣ Each x now represents multiple “bits” of input
y = g(Wx + b)
  x: real-valued vector, length = sentence length x vector size
  W: hidden layer size (~100) x (sentence length (~10) x vector size (~100))
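A sketch of that input construction: concatenate the sentence’s word vectors into one long x and apply y = g(Wx + b). The dimensions follow the slide; the embeddings here are random placeholders rather than trained vectors.

```python
import numpy as np

rng = np.random.default_rng(3)
vec_size, hidden_size = 100, 100
emb = {w: rng.normal(size=vec_size) for w in ["the", "movie", "was", "good", "."]}

tokens = ["the", "movie", "was", "good", "."]
x = np.concatenate([emb[t] for t in tokens])      # length = sentence length * vec size

W = rng.normal(size=(hidden_size, x.size))        # hidden size x (sent. length * vec size)
b = np.zeros(hidden_size)
y = np.tanh(W @ x + b)                            # y = g(Wx + b)
```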
Feedforward with word vectors
  the movie was good .
  the movie was very good
y = g(Wx + b)
‣ Problem: “good” now sits at a different position in each sentence, so it multiplies different parts of W. We need our model to be shift-invariant, like bag-of-words is
Comparing Architectures
‣ Instead of more complex linear functions, let’s use simpler nonlinear functions
‣ Feedforward bag-of-words: didn’t take advantage of word similarity, lots of parameters to learn
‣ Feedforward with word vectors: our parameters are attached to particular indices in a sentence
‣ Solution: convolutional neural nets
Outline
‣ Motivation for neural networks
‣ Feedforward neural networks
‣ Applying feedforward neural networks to NLP
‣ Convolutional neural networks
‣ Application examples
‣ Tools
Convolutional Networks
  the movie was good .
  “good” filter outputs: 0.03  0.02  0.1  1.1  0.0   →   max = 1.1

Convolutional Networks
  the movie was good .
  “good” filter:       max = 1.1
  “bad” filter:        max = 0.1
  “okay” filter:       max = 0.3
  “terrible” filter:   max = 0.1
Convolutional Networks
‣ Filters are initialized randomly and then learned
‣ Input: n word vectors of length m each; k filters of length m each; k filter outputs of length 1 each
‣ The pooled filter outputs (here [1.1, 0.1, 0.3, 0.1]) are features for a classifier, or input to another neural net layer
‣ Takes variable-length input and turns it into fixed-length output
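A minimal sketch of that computation with width-1 filters: n word vectors of length m, k filters of length m, and k max-pooled outputs (all values are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, k = 5, 100, 4                     # 5 words, 100-dim vectors, 4 filters
word_vecs = rng.normal(size=(n, m))     # e.g. "the movie was good ."
filters = rng.normal(size=(k, m))       # e.g. filters for "good", "bad", "okay", "terrible"

position_scores = filters @ word_vecs.T   # (k, n): each filter at each position
features = position_scores.max(axis=1)    # (k,): max pooling over positions
# `features` has length k no matter how long the sentence is
```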
Convolutional Networks
  the movie was great .
  “good” filter outputs: 0.03  0.02  0.1  1.8  0.0   →   max = 1.8
‣ Word vectors for similar words are similar, so convolutional filters will have similar outputs
Convolutional Networks
  the movie was not good .
  “not good” filter (width 2) outputs: 0.03  0.05  0.1  0.2  1.5   →   max = 1.5
‣ Analogous to bigram features in bag-of-words models
Comparing Architectures
‣ Instead of more complex linear functions, let’s use simpler nonlinear functions
‣ Convolutional networks let us take advantage of word similarity
‣ Convolutional networks are translation-invariant like bag-of-words
‣ Convolutional networks can capture local interactions with filters of width > 1 (e.g. “not good”)
Outline
‣ Motivation for neural networks
‣ Feedforward neural networks
‣ Applying feedforward neural networks to NLP
‣ Convolutional neural networks
‣ Application examples
‣ Tools
Sentence Classification
  the movie was not good .  →  convolutional  →  fully connected  →  prediction
Object Recognition
AlexNet (2012): convolutional layers followed by fully connected layers
[figure: filter visualizations from Conv layer 1 and Conv layer 3]
Neural networks are modular
‣ NNs are built from convolutional layers, fully connected layers, and some other types
‣ Can chain these together into various architectures
‣ Any neural network built this way can be learned from data!
Sentence Classification
  the movie was not good .  →  convolutional  →  fully connected  →  prediction
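Putting the pieces together, a minimal sketch of the whole pipeline (convolution, max pooling, a fully connected layer, and a softmax prediction). All weights are random here; a real system would learn them from labeled sentences.

```python
import numpy as np

rng = np.random.default_rng(5)
m, k, n_classes = 100, 4, 2                        # vec size, filters, {positive, negative}
filters = rng.normal(size=(k, m))
W_out, b_out = rng.normal(size=(n_classes, k)), np.zeros(n_classes)

def predict(word_vecs):                            # word_vecs: (sentence length, m)
    pooled = (filters @ word_vecs.T).max(axis=1)   # convolution + max pooling
    logits = W_out @ pooled + b_out                # fully connected layer
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()                         # softmax over classes

sentence = rng.normal(size=(6, m))                 # placeholder vectors for "the movie was not good ."
print(predict(sentence))
```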
Sentence Classification
‣ Tasks from Kim (2014): movie review sentiment, product reviews, subjectivity/objectivity detection, question type classification
‣ Outperforms highly-tuned bag-of-words model
Taken from Kim (2014)
Entity Linking
Francis-Landau, Durrett, and Klein (NAACL 2016)
Although he originally won the event, the United States Anti-Doping Agency announced in August 2012 that they had disqualified Armstrong from his seven consecutive Tour de France wins from 1999–2005.
Lance Edward Armstrong is an American former professional road cyclist
Armstrong County is a county in Pennsylvania…
‣ Which entity does “Armstrong” refer to?
‣ Conventional approach: compare tf-idf feature vectors for overlap
‣ Convolutional networks can capture many of the same effects: they distill notions of topic from n-grams
Entity Linking
Although he originally won the event, the United States Anti-Doping Agency announced in August 2012 that they had disqualified Armstrong from his seven consecutive Tour de France wins from 1999–2005.
Armstrong County is a county in Pennsylvania…
Lance Edward Armstrong is an American former professional road cyclist
convolutional → topic vector (for the source context and for each candidate entity’s article)
similar — probable link; dissimilar — improbable link
Francis-Landau, Durrett, and Klein (NAACL 2016)
Syntactic Parsing
  He wrote a long report on Mars .
‣ PP attachment ambiguity: “on Mars” can attach to the noun phrase (report—on Mars, NP → NP PP) or to the verb phrase (wrote—on Mars, VP attachment)
[figure: the two candidate parse trees]
Chart Parsing
  He wrote a long report on Mars
  NP → NP PP
chart value = score(rule) + chart(left child) + chart(right child)
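A compact sketch of that recursion (a max-scoring CKY over binary rules; the toy grammar, scores, and span below are invented, and unary rules are omitted):

```python
from collections import defaultdict

def cky(n, lexical, rules):
    """chart[(i, j)][X] = best score of building X over words i..j-1,
    where chart value = score(rule) + chart(left child) + chart(right child)."""
    chart = defaultdict(dict)
    for span, entries in lexical.items():          # width-1 spans
        chart[span] = dict(entries)
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for split in range(i + 1, j):
                for (parent, left, right), rule_score in rules.items():
                    if left in chart[(i, split)] and right in chart[(split, j)]:
                        score = rule_score + chart[(i, split)][left] + chart[(split, j)][right]
                        if score > chart[(i, j)].get(parent, float("-inf")):
                            chart[(i, j)][parent] = score
    return chart

# toy example: "report on Mars" analyzed as NP -> NP PP
lexical = {(0, 1): {"NP": 0.0}, (1, 2): {"P": 0.0}, (2, 3): {"NP": 0.0}}
rules = {("PP", "P", "NP"): 1.0, ("NP", "NP", "PP"): 2.0}
print(cky(3, lexical, rules)[(0, 3)])   # {'NP': 3.0}
```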
Syntactic Parsing
  He wrote a long report on Mars .       (span fenceposts 2, 5, 7)
score(NP → NP PP, 2, 5, 7) = w⊤ f(NP → NP PP, 2, 5, 7)
  example feature: feat = I[left child last word = report ∧ NP → NP PP]
‣ Features need to combine surface information and syntactic information, but looking at words directly ends up being very sparse
‣ Alternative: feed the words around positions 2, 5, 7 through a neural network to get a vector v
Scoring parses with neural nets
Durrett and Klein (ACL 2015)
score(NP → NP PP, 2, 5, 7) = s⊤ v
  s: vector representation of the rule being applied
  v: neural network output for the anchored span (He wrote a long report on Mars, fenceposts 2, 5, 7; word vectors let the model generalize from Mars to Jupiter)

Parsing a sentence:
‣ Discrete feature computation
‣ Feedforward pass on nets
‣ Run CKY dynamic program
(Discrete + Continuous scores are combined)
Syntactic Parsing
Durrett and Klein (ACL 2015)
Machine Translation
le chat a mangé
the cat ate STOP
‣ Long short-term memory units
Long Short-Term Memory Networks
‣ Map sequence of inputs to sequence of outputs
Taken from http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Machine Translation
le chat a mangé
the cat ate STOP
‣ Google is moving towards this architecture; performance is constantly improving compared to phrase-based methods
Neural Network Toolkits
‣ Torch: http://torch.ch/ (Facebook AI Research, Lua)
‣ Tensorflow: https://www.tensorflow.org/ (by Google, actively maintained, bindings for many languages)
‣ Theano: http://deeplearning.net/software/theano/ (University of Montreal, less and less maintained)
Neural Network Toolkits
[figure: a Theano computation graph]
http://tmmse.xyz/content/images/2016/02/theano-computation-graph.png
Word Vector Tools
‣ Word2Vec: https://radimrehurek.com/gensim/models/word2vec.html (Python code, actively maintained)
  https://code.google.com/archive/p/word2vec/
‣ GloVe: http://nlp.stanford.edu/projects/glove/
‣ Word vectors trained on very large corpora
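A tiny usage sketch for the gensim Word2Vec class linked above (parameter names follow recent gensim releases, e.g. vector_size; older versions call it size, so check your installed version):

```python
from gensim.models import Word2Vec

# toy corpus; useful vectors require very large corpora
sentences = [
    "the movie was very good".split(),
    "the movie was very enjoyable".split(),
    "the movie was very bad".split(),
]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, epochs=50)
print(model.wv["good"])                  # the learned 100-dimensional vector
print(model.wv.most_similar("good"))     # nearest neighbors by cosine similarity
```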
Convolutional Networks
‣ CNNs for sentence classification: https://github.com/yoonkim/CNN_sentence
‣ Based on tutorial from: http://deeplearning.net/tutorial/lenet.html
‣ Python code
‣ Trains very quickly
Takeaways
‣ Neural networks have several advantages for NLP:
  ‣ We can use simpler nonlinear functions instead of more complex linear functions
  ‣ We can take advantage of word similarity
  ‣ We can build models that are both position-dependent (feedforward neural networks) and position-independent (convolutional networks)
‣ NNs have natural applications to many problems
‣ While conventional linear models often still do well, neural nets are increasingly the state-of-the-art for many tasks