CS388: Natural Language Processing
Greg Durrett
Lecture 6: Neural Networks
Administrivia
‣ Mini 1 graded later this week
‣ Project 1 due in a week
Recall: CRFs

‣ Naive Bayes : logistic regression :: HMMs : CRFs; local vs. global normalization <-> generative vs. discriminative
(locally normalized discriminative models do exist (MEMMs))

P(y|x) = \frac{1}{Z} \exp\left( \sum_{k=1}^{n} w^\top f_k(x, y) \right)

[Factor graph: tags y_1, y_2 connected to words x_1, x_2, x_3 through factors f_1, f_2, f_3]

‣ HMMs: in the standard setup, emissions consider one word at a time
‣ CRFs: features over many words simultaneously, non-independent features (e.g., suffixes and prefixes), doesn't have to be a generative model
‣ Conditional model: the x's are observed
Recall: Sequential CRFs

‣ Model:

P(y|x) \propto \exp\, w^\top \left[ \sum_{i=2}^{n} f_t(y_{i-1}, y_i) + \sum_{i=1}^{n} f_e(y_i, i, x) \right]

[Chain figure: y_1, y_2, …, y_n linked by transition factors \phi_t, with emission factors \phi_e on each y_i]

‣ Inference: argmax P(y|x) from Viterbi
‣ Learning: run forward-backward to compute posterior probabilities; then

\frac{\partial}{\partial w} \mathcal{L}(y^*, x) = \sum_{i=1}^{n} f_e(y^*_i, i, x) - \sum_{i=1}^{n} \sum_{s} P(y_i = s | x)\, f_e(s, i, x)

‣ Emission features capture word-level info, transitions enforce tag consistency
This Lecture
‣ Finish discussion of NER
‣ Neural network history
‣ Neural network basics
‣ Feedforward neural networks + backpropagation
‣ Applications
‣ Implementing neural networks (if time)
‣ Beam search: in a few lectures
NER
‣ CRF with lexical features can get around 85 F1 on this problem
‣ Other pieces of information that many systems capture
‣ World knowledge:

The delegation met the president at the airport, Tanjug said.
(Is "Tanjug" an ORG or a PER?)
Nonlocal Features

The delegation met the president at the airport, Tanjug said.
The news agency Tanjug reported on the outcome of the meeting.

‣ More complex factor graph structures can let you capture this, or just decode sentences in order and use features on previous sentences

Finkel and Manning (2008), Ratinov and Roth (2009)
Semi-Markov Models

Barack Obama will travel to Hangzhou today for the G20 meeting.
[Span annotation: PER | O | LOC | O | ORG | O]

‣ Chunk-level prediction rather than token-level BIO
‣ y is a set of spans covering the sentence
‣ Pros: features can look at a whole span at once
‣ Cons: there's an extra factor of n in the dynamic programs

Sarawagi and Cohen (2004)
Evaluating NER

Barack Obama will travel to Hangzhou today for the G20 meeting.
B-PER I-PER O O O B-LOC O O O B-ORG O O
(entities: Barack Obama = PERSON, Hangzhou = LOC, G20 = ORG)

‣ Prediction of all Os still gets 66% accuracy on this example!
‣ What we really want to know: how many named entity chunk predictions did we get right?
‣ Precision: of the ones we predicted, how many are right?
‣ Recall: of the gold named entities, how many did we find?
‣ F-measure: harmonic mean of these two
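To make the chunk-level metric concrete, here is a minimal Python sketch (my own illustration, not the course's evaluation script) that extracts (label, start, end) chunks from BIO tags and scores exact matches:

def extract_chunks(tags):
    """Return the set of (label, start, end) spans in a BIO tag sequence."""
    chunks, start = set(), None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a final chunk
        if start is not None and (tag == "O" or tag.startswith("B-")
                                  or tag[2:] != label):
            chunks.add((label, start, i))
            start = None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return chunks

def chunk_f1(gold_tags, pred_tags):
    gold, pred = extract_chunks(gold_tags), extract_chunks(pred_tags)
    tp = len(gold & pred)                     # chunks must match exactly
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if tp else 0.0

On the example above, a system predicting all Os gets 0 F1 despite 66% token accuracy, which is exactly the distinction the slide is drawing.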
How well do NER systems do?

System and F1:
‣ Ratinov and Roth (2009)
‣ Lample et al. (2016)
‣ BiLSTM-CRF + ELMo, Peters et al. (2018): 92.2
‣ BERT, Devlin et al. (2019): 92.8
Modern Entity Typing
‣ More and more classes (17 -> 112 -> 10,000+)
Choi et al. (2018)
Neural Net History

History: NN "dark ages"
‣ Convnets: applied to MNIST by LeCun in 1998
‣ LSTMs: Hochreiter and Schmidhuber (1997)
‣ Henderson (2003): neural shift-reduce parser, not SOTA

2008-2013: A glimmer of light…
‣ Collobert and Weston 2011: "NLP (almost) from scratch"
‣ Feedforward neural nets induce features for sequential CRFs ("neural CRF")
‣ 2008 version was marred by bad experiments and claimed SOTA but wasn't; 2011 version tied SOTA
‣ Socher 2011-2014: tree-structured RNNs working okay
‣ Krizhevsky et al. (2012): AlexNet for vision

2014: Stuff starts working
‣ Sutskever et al. + Bahdanau et al.: seq2seq for neural MT (LSTMs)
‣ Kim (2014) + Kalchbrenner et al. (2014): sentence classification / sentiment (convnets)
‣ Chen and Manning transition-based dependency parser (based on feedforward networks)
‣ 2015: explosion of neural nets for everything under the sun
‣ What made these work? Data (not as important as you might think), optimization (initialization, adaptive optimizers), representation (good word embeddings)
Neural Net Basics

Neural Networks
‣ Linear classification: argmax_y w^\top f(x, y)
‣ Want to learn intermediate conjunctive features of the input

the movie was not all that good
I[contains "not" & contains "good"]

‣ How do we learn this if our feature vector is just the unigram indicators?
I[contains "not"], I[contains "good"]
Neural Networks: XOR

‣ Let's see how we can use neural nets to learn a simple nonlinear function
‣ Inputs: x_1, x_2 (generally x = (x_1, …, x_m))
‣ Output: y (generally y = (y_1, …, y_n)); here y = x_1 XOR x_2

x_1  x_2 | x_1 XOR x_2
 0    0  |  0
 0    1  |  1
 1    0  |  1
 1    1  |  0
Neural Networks: XOR

‣ y = a_1 x_1 + a_2 x_2: no setting of the weights works; a linear model can only carve out "or"-style regions
‣ y = a_1 x_1 + a_2 x_2 + a_3 \tanh(x_1 + x_2): adding a nonlinear term works (tanh looks like the action potential in a neuron)
Neural Networks: XOR

‣ Concretely, taking a_1 = a_2 = -1 and a_3 = 2:

y = -x_1 - x_2 + 2 \tanh(x_1 + x_2)

[Plot: the (x_1, x_2) plane, where this function now separates the XOR points]
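A quick Python sketch (my own check, not from the slides) confirming that this formula reproduces XOR on the four binary inputs:

import math

def f(x1, x2):
    return -x1 - x2 + 2 * math.tanh(x1 + x2)

for x1 in (0, 1):
    for x2 in (0, 1):
        # f is 0.0 at (0,0), about 0.52 at (0,1) and (1,0),
        # and about -0.07 at (1,1), so thresholding at 0.25 gives XOR
        print(x1, x2, round(f(x1, x2), 2), f(x1, x2) > 0.25)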
Neural Networks

‣ Linear model: y = w \cdot x + b
‣ Adding a nonlinearity: y = g(w \cdot x + b), or with a weight matrix, y = g(Wx + b)
‣ The net warps space, shifts it, then applies the nonlinear transformation

[Figures taken from http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/]

[Panels: linear classifier vs. neural network on the same data… possible because we transformed the space!]
Deep Neural Networks

‣ With y = g(Wx + b) as the output of the first layer, stack another layer:

z = g(Vy + c) = g(Vg(Wx + b) + c)

‣ Check: what happens if there is no nonlinearity? Is z = V(Wx + b) + c more powerful than basic linear models?

[Figures taken from http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/]
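Answering the check (a one-line derivation, not spelled out on the slide): without the nonlinearity the composition collapses,

z = V(Wx + b) + c = (VW)x + (Vb + c),

which is again a linear model with weight matrix VW and bias Vb + c, so depth adds no power without g.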
Feedforward Networks, Backpropagation

Logistic Regression with NNs
‣ Single scalar probability:

P(y|x) = \frac{\exp(w^\top f(x, y))}{\sum_{y'} \exp(w^\top f(x, y'))}

‣ Compute scores for all possible labels at once (returns a vector):

P(y|x) = \mathrm{softmax}\left( [w^\top f(x, y)]_{y \in \mathcal{Y}} \right)

‣ softmax: exponentiates and normalizes a given vector:

\mathrm{softmax}(p)_i = \frac{\exp(p_i)}{\sum_{i'} \exp(p_{i'})}

‣ Weight vector per class; W is [num classes x num feats]:

P(y|x) = \mathrm{softmax}(W f(x))

‣ Now with one hidden layer:

P(y|x) = \mathrm{softmax}(W g(V f(x)))
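A minimal numpy sketch of the softmax definition (mine, with the standard max-subtraction trick for numerical stability, which the slide doesn't show but real implementations use):

import numpy as np

def softmax(p):
    shifted = p - np.max(p)   # subtracting the max doesn't change the result
    exps = np.exp(shifted)    # ... but keeps exp from overflowing
    return exps / np.sum(exps)

print(softmax(np.array([2.0, 1.0, -1.0])))  # [0.705 0.259 0.035], sums to 1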
Neural Networks for Classification

P(y|x) = \mathrm{softmax}(W g(V f(x)))

[Network diagram: f(x) (n features) -> V (d x n matrix) -> nonlinearity g (tanh, relu, …) -> z (d hidden units) -> W (num_classes x d matrix) -> softmax -> P(y|x) (num_classes probs)]
Training Neural Networks

‣ Maximize log likelihood of training data
‣ z = g(V f(x)), P(y|x) = \mathrm{softmax}(Wz)
‣ i^*: index of the gold label
‣ e_i: 1 in the ith row, zero elsewhere; dotting with this selects the ith index

\mathcal{L}(x, i^*) = \log P(y = i^* | x) = \log(\mathrm{softmax}(Wz) \cdot e_{i^*})

\mathcal{L}(x, i^*) = Wz \cdot e_{i^*} - \log \sum_j \exp(Wz) \cdot e_j
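A small numpy sketch (my own numbers and names) computing this objective in the expanded form, gold-class score minus log partition:

import numpy as np

def log_likelihood(W, z, gold):
    scores = W @ z                           # the vector Wz
    log_partition = np.log(np.sum(np.exp(scores)))
    return scores[gold] - log_partition      # Wz · e_{i*} - log Σ_j exp(Wz) · e_j

W = np.array([[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]])
z = np.array([0.2, 0.8])
print(log_likelihood(W, z, gold=1))          # log P(y = 1 | x), always <= 0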
Computing Gradients

\mathcal{L}(x, i^*) = Wz \cdot e_{i^*} - \log \sum_j \exp(Wz) \cdot e_j

‣ Gradient with respect to W:

\frac{\partial}{\partial W_{ij}} \mathcal{L}(x, i^*) = \begin{cases} z_j - P(y = i | x)\, z_j & \text{if } i = i^* \\ -P(y = i | x)\, z_j & \text{otherwise} \end{cases}

‣ Looks like logistic regression with z as the features!
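Both cases of that piecewise formula collapse into one outer product, (e_{i^*} - P(y|x)) z^\top. A numpy sketch (my own) that also checks one entry against a finite-difference estimate:

import numpy as np

def softmax(p):
    exps = np.exp(p - np.max(p))
    return exps / exps.sum()

def grad_W(W, z, gold):
    probs = softmax(W @ z)
    e_gold = np.zeros(len(probs)); e_gold[gold] = 1.0
    return np.outer(e_gold - probs, z)       # covers both cases at once

np.random.seed(0)
W, z, gold = np.random.randn(3, 2), np.random.randn(2), 1
L = lambda W: (W @ z)[gold] - np.log(np.exp(W @ z).sum())
Wp = W.copy(); Wp[0, 1] += 1e-6              # perturb a single weight
assert abs((L(Wp) - L(W)) / 1e-6 - grad_W(W, z, gold)[0, 1]) < 1e-4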
Neural Networks for Classification

P(y|x) = \mathrm{softmax}(W g(V f(x)))

[Same network diagram as before, now annotated with the output-layer gradient \partial\mathcal{L}/\partial W at W and the hidden activations z]
Computing Gradients: Backpropagation

z = g(V f(x)): activations at hidden layer

\mathcal{L}(x, i^*) = Wz \cdot e_{i^*} - \log \sum_j \exp(Wz) \cdot e_j

‣ Gradient with respect to V: apply the chain rule

\frac{\partial \mathcal{L}(x, i^*)}{\partial V_{ij}} = \frac{\partial \mathcal{L}(x, i^*)}{\partial z} \frac{\partial z}{\partial V_{ij}}

\mathrm{err(root)} = e_{i^*} - P(y|x) (dim = m)

\frac{\partial \mathcal{L}(x, i^*)}{\partial z} = \mathrm{err}(z) = W^\top \mathrm{err(root)} (dim = d)

[some math…]
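Spelling out the "[some math…]" step in code: a numpy sketch (my own, taking g = tanh so that g'(a) = 1 - tanh^2(a)) that pushes the error signal back through g and into V:

import numpy as np

def softmax(p):
    exps = np.exp(p - np.max(p))
    return exps / exps.sum()

def backprop(V, W, f_x, gold):
    a = V @ f_x                              # pre-activations
    z = np.tanh(a)                           # z = g(V f(x))
    probs = softmax(W @ z)
    e_gold = np.zeros(len(probs)); e_gold[gold] = 1.0
    err_root = e_gold - probs                # dim = m (num classes)
    err_z = W.T @ err_root                   # chain rule through W; dim = d
    dL_da = err_z * (1 - np.tanh(a) ** 2)    # through the tanh nonlinearity
    dL_dV = np.outer(dL_da, f_x)             # dz_i/dV_{ij} brings in f(x)_j
    dL_dW = np.outer(err_root, z)            # output-layer gradient, as before
    return dL_dV, dL_dW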
Backpropagation: Picture

P(y|x) = \mathrm{softmax}(W g(V f(x)))

[Network diagram with err(root) flowing back through W to err(z) at the hidden layer]

‣ Can forget everything after z, treat it as the output and keep backpropping
Backpropagation: Takeaways
‣ Gradients of output weights W are easy to compute; it looks like logistic regression with hidden layer z as the feature vector
‣ Can compute derivative of loss with respect to z to form an "error signal" for backpropagation
‣ Easy to update parameters based on "error signal" from next layer; keep pushing the error signal back as backpropagation
‣ Need to remember the values from the forward computation
Applications

NLP with Feedforward Networks
‣ Part-of-speech tagging with FFNNs

… Fed raises interest rates in order to …

f(x) = [emb(raises); emb(interest); emb(rates); other words, feats, etc.]
(previous word, current word, next word)

‣ Word embeddings for each word form the input
‣ ~1000 features here; a smaller feature vector than in sparse models, but every feature fires on every example
‣ Weight matrix learns position-dependent processing of the words

Botha et al. (2017)
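A PyTorch-style sketch of building this input vector (the name window_features and the dimensions are my own, not from Botha et al.): embeddings of the previous, current, and next words are concatenated, so each position in the vector consistently holds one role:

import torch
import torch.nn as nn

vocab_size, emb_dim = 10000, 50
emb = nn.Embedding(vocab_size, emb_dim)

def window_features(word_ids, i):
    # previous, current, next word; position in the concatenation
    # tells the weight matrix which role each embedding plays
    idxs = torch.tensor([word_ids[i - 1], word_ids[i], word_ids[i + 1]])
    return emb(idxs).view(-1)               # one (3 * emb_dim)-dim vector

sentence = [4, 17, 96, 203, 5]              # hypothetical word ids
f_x = window_features(sentence, 2)          # features for the middle word
print(f_x.shape)                            # torch.Size([150])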
NLP with Feedforward Networks
‣ Hidden layer mixes these different signals and learns feature conjunctions
Botha et al. (2017)
NLP with Feedforward Networks
‣ Multilingual tagging results: [table of results omitted]
‣ Gillick used LSTMs; this is smaller, faster, and better
Botha et al. (2017)
Sentiment Analysis
‣ Deep Averaging Networks: feedforward neural network on average of word embeddings from input
Iyyer et al. (2015)
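A sketch of the DAN idea in PyTorch (layer sizes and names are mine; the model in Iyyer et al. is deeper and adds details like word dropout):

import torch
import torch.nn as nn

class DAN(nn.Module):
    def __init__(self, vocab_size, emb_dim, hid, num_classes):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.ff = nn.Sequential(nn.Linear(emb_dim, hid), nn.Tanh(),
                                nn.Linear(hid, num_classes))

    def forward(self, word_ids):
        avg = self.emb(word_ids).mean(dim=0)   # average the word vectors
        return self.ff(avg)                    # class scores (pre-softmax)

model = DAN(vocab_size=10000, emb_dim=50, hid=100, num_classes=2)
scores = model(torch.tensor([3, 41, 5, 926]))  # hypothetical word ids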
Sentiment Analysis
[Results table comparing bag-of-words models with tree RNN / CNN / LSTM models]
Wang and Manning (2012); Kim (2014); Iyyer et al. (2015)
Implementation Details

Computation Graphs
‣ Computing gradients is hard! The computation graph abstraction allows us to define a computation symbolically and will do this for us
‣ Automatic differentiation: keep track of derivatives / be able to backpropagate through each function:

y = x * x  -> (codegen) ->  (y, dy) = (x * x, 2 * x * dx)

‣ Use a library like PyTorch or TensorFlow. This class: PyTorch
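The same y = x * x example in PyTorch (a tiny demonstration of mine): the library records the graph during the forward pass and produces dy/dx = 2x on backward:

import torch

x = torch.tensor(3.0, requires_grad=True)
y = x * x        # forward pass builds the computation graph
y.backward()     # autograd walks the graph backward
print(x.grad)    # tensor(6.) = 2 * x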
Computation Graphs in PyTorch

P(y|x) = \mathrm{softmax}(W g(V f(x)))

‣ Define the forward pass for the network:

import torch
import torch.nn as nn

class FFNN(nn.Module):
    def __init__(self, inp, hid, out):
        super(FFNN, self).__init__()
        self.V = nn.Linear(inp, hid)       # d x n matrix (plus a bias)
        self.g = nn.Tanh()                 # the nonlinearity
        self.W = nn.Linear(hid, out)       # num_classes x d matrix (plus a bias)
        self.softmax = nn.Softmax(dim=0)

    def forward(self, x):
        return self.softmax(self.W(self.g(self.V(x))))
Computation Graphs in PyTorch

P(y|x) = \mathrm{softmax}(W g(V f(x)))

ffnn = FFNN(inp, hid, out)   # sizes as defined above

def make_update(input, gold_label):
    ffnn.zero_grad()         # clear gradient variables
    probs = ffnn.forward(input)
    loss = torch.neg(torch.log(probs)).dot(gold_label)
    loss.backward()
    optimizer.step()

‣ gold_label: one-hot vector of the gold label e_{i*} (e.g., [0, 1, 0])
Training a Model

Define a computation graph
For each epoch:
    For each batch of data:
        Compute loss on batch
        Autograd to compute gradients
        Take step with optimizer
Decode test set
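The recipe assembled into a runnable PyTorch sketch (the optimizer choice, learning rate, and the train_data / test_data variables are placeholders of mine; it also processes one example at a time rather than true batches):

import torch
import torch.optim as optim

ffnn = FFNN(inp=150, hid=100, out=2)            # the module defined earlier
optimizer = optim.Adam(ffnn.parameters(), lr=1e-3)

for epoch in range(10):
    for input, gold_label in train_data:        # gold_label: one-hot vector
        ffnn.zero_grad()                        # clear old gradients
        probs = ffnn(input)
        loss = torch.neg(torch.log(probs)).dot(gold_label)
        loss.backward()                         # autograd computes gradients
        optimizer.step()                        # take step with optimizer

with torch.no_grad():                           # decode test set
    predictions = [torch.argmax(ffnn(input)) for input in test_data]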
Next Time
‣ Word representations / word vectors
‣ word2vec, GloVe
‣ Training neural networks