CS388: Natural Language Processing
Greg Durrett
Lecture 6: Neural Networks

Page 1:

CS388: Natural Language Processing

Greg Durrett

Lecture 6: Neural Networks

Page 2:

Administrivia

‣ Mini 1 graded later this week

‣ Project 1 due in a week

Page 3:

Recall: CRFs

‣ Naive Bayes : logistic regression :: HMMs : CRFs; local vs. global normalization <-> generative vs. discriminative

(locally normalized discriminative models do exist (MEMMs))

P(y|x) = \frac{1}{Z} \exp\left( \sum_{k=1}^{n} w^\top f_k(x, y) \right)

[Factor graph: tags y1, y2 connected to words x1, x2, x3 through feature factors f1, f2, f3]

‣ HMMs: in the standard setup, emissions consider one word at a time

‣ CRFs: features over many words simultaneously, non-independent features (e.g., suffixes and prefixes); doesn't have to be a generative model

‣ Conditional model: the x's are observed

Page 4:

Recall: Sequential CRFs

‣ Model:

P(y|x) \propto \exp w^\top \left[ \sum_{i=2}^{n} f_t(y_{i-1}, y_i) + \sum_{i=1}^{n} f_e(y_i, i, x) \right]

[Chain-structured factor graph: y1, y2, …, yn with transition factors φt between adjacent tags and emission factors φe at each position]

‣ Inference: argmax P(y|x) from Viterbi (sketched below)

‣ Learning: run forward-backward to compute posterior probabilities; then

\frac{\partial}{\partial w} \mathcal{L}(y^*, x) = \sum_{i=1}^{n} f_e(y_i^*, i, x) - \sum_{i=1}^{n} \sum_{s} P(y_i = s | x) f_e(s, i, x)

‣ Emission features capture word-level info, transitions enforce tag consistency
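A minimal Viterbi sketch for the inference step, assuming precomputed score matrices: emission[i, t] holds w^\top f_e(t, i, x) and transition[s, t] holds w^\top f_t(s, t) (the array and function names here are mine, not the slides'):

import numpy as np

def viterbi(emission, transition):
    # emission: (n, T) per-position tag scores; transition: (T, T) tag-pair scores
    n, T = emission.shape
    score = np.zeros((n, T))            # best score of any sequence ending in tag t at i
    back = np.zeros((n, T), dtype=int)  # backpointers to the previous tag
    score[0] = emission[0]
    for i in range(1, n):
        for t in range(T):
            cand = score[i - 1] + transition[:, t] + emission[i, t]
            back[i, t] = int(np.argmax(cand))
            score[i, t] = cand[back[i, t]]
    tags = [int(np.argmax(score[-1]))]  # best final tag
    for i in range(n - 1, 0, -1):       # follow backpointers to recover the sequence
        tags.append(back[i, tags[-1]])
    return tags[::-1]                   # the argmax tag sequence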

Page 5:

This Lecture

‣ Finish discussion of NER

‣ Neural network history

‣ Neural network basics

‣ Feedforward neural networks + backpropagation

‣ Applications

‣ Implementing neural networks (if time)

‣ Beam search: in a few lectures

Page 6:

NER

Page 7:

NER

‣ CRF with lexical features can get around 85 F1 on this problem

‣ Other pieces of information that many systems capture:

‣ World knowledge:

The delegation met the president at the airport, Tanjug said.

Page 8:

Nonlocal Features

The delegation met the president at the airport, Tanjug said.   [Tanjug: ORG? PER?]

The news agency Tanjug reported on the outcome of the meeting.

‣ More complex factor graph structures can let you capture this, or just decode sentences in order and use features on previous sentences

Finkel and Manning (2008), Ratinov and Roth (2009)

Page 9:

Semi-Markov Models

Barack Obama will travel to Hangzhou today for the G20 meeting.

[Barack Obama]PER [will travel to]O [Hangzhou]LOC [today for the]O [G20]ORG [meeting.]O

‣ Chunk-level prediction rather than token-level BIO

‣ y is a set of spans covering the sentence

‣ Pros: features can look at a whole span at once (see the sketch below)

‣ Cons: there's an extra factor of n in the dynamic programs

Sarawagi and Cohen (2004)
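To see where the extra factor of n comes from, here is a minimal sketch of the segmental Viterbi recurrence, assuming a span_score(j, i, l) callback (which may look at the whole span) and ignoring label-to-label transitions for brevity; all names here are hypothetical, not from Sarawagi and Cohen:

def semi_markov_decode(span_score, n, num_labels, max_len):
    # best[i]: score of the best labeled segmentation of words [0, i)
    best = [float('-inf')] * (n + 1)
    back = [None] * (n + 1)
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):  # extra loop over span starts: factor of n
            for l in range(num_labels):
                s = best[j] + span_score(j, i, l)
                if s > best[i]:
                    best[i], back[i] = s, (j, l)
    spans, i = [], n
    while i > 0:                                 # recover the labeled spans
        j, l = back[i]
        spans.append((j, i, l))
        i = j
    return spans[::-1]                           # spans covering the sentence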

Page 10:

Evaluating NER

Barack Obama will travel to Hangzhou today for the G20 meeting.

B-PER I-PER O O O B-LOC O O O B-ORG O O
(PERSON, LOC, and ORG chunks)

‣ Prediction of all Os still gets 66% accuracy on this example!

‣ What we really want to know: how many named entity chunk predictions did we get right?

‣ Precision: of the ones we predicted, how many are right?

‣ Recall: of the gold named entities, how many did we find?

‣ F-measure: harmonic mean of these two
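A minimal sketch of chunk-level precision/recall/F1 computed from BIO tag sequences (the function names are mine, not from the slides). Note that predicting all Os yields zero predicted chunks, hence 0 F1 despite 66% token accuracy:

def bio_to_chunks(tags):
    # collect (start, end, label) chunks from a BIO sequence;
    # stray I- tags without a preceding B- are ignored for simplicity
    chunks, start, label = set(), None, None
    for i, tag in enumerate(list(tags) + ['O']):  # sentinel flushes the last chunk
        if start is not None and (tag == 'O' or tag.startswith('B-') or tag[2:] != label):
            chunks.add((start, i, label))
            start = None
        if tag.startswith('B-'):
            start, label = i, tag[2:]
    return chunks

def chunk_f1(gold_tags, pred_tags):
    gold, pred = bio_to_chunks(gold_tags), bio_to_chunks(pred_tags)
    tp = len(gold & pred)                         # exactly matching chunks
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if tp else 0.0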

Page 11:

How well do NER systems do?

‣ Ratinov and Roth (2009)

‣ Lample et al. (2016)

‣ BiLSTM-CRF + ELMo, Peters et al. (2018): 92.2

‣ BERT, Devlin et al. (2019): 92.8

Page 12:

Modern Entity Typing

‣ More and more classes (17 -> 112 -> 10,000+)

Choi et al. (2018)

Page 13:

Neural Net History

Page 14:

History: NN “dark ages”

‣ Convnets: applied to MNIST by LeCun in 1998

‣ LSTMs: Hochreiter and Schmidhuber (1997)

‣ Henderson (2003): neural shift-reduce parser, not SOTA

Page 15:

2008-2013: A glimmer of light…

‣ Collobert and Weston 2011: “NLP (almost) from scratch”

‣ Feedforward neural nets induce features for sequential CRFs (“neural CRF”)

‣ The 2008 version was marred by bad experiments and claimed SOTA but wasn't; the 2011 version tied SOTA

‣ Socher 2011-2014: tree-structured RNNs working okay

‣ Krizhevsky et al. (2012): AlexNet for vision

Page 16:

2014: Stuff starts working

‣ Sutskever et al. + Bahdanau et al.: seq2seq for neural MT (LSTMs)

‣ Kim (2014) + Kalchbrenner et al. (2014): sentence classification / sentiment (convnets)

‣ Chen and Manning transition-based dependency parser (based on feedforward networks)

‣ 2015: explosion of neural nets for everything under the sun

‣ What made these work? Data (not as important as you might think), optimization (initialization, adaptive optimizers), representation (good word embeddings)

Page 17:

Neural Net Basics

Page 18:

Neural Networks

‣ Linear classification: argmax_y w^\top f(x, y)

‣ Want to learn intermediate conjunctive features of the input

the movie was not all that good

I[contains not & contains good]

‣ How do we learn this if our feature vector is just the unigram indicators?

I[contains not], I[contains good]

Page 19:

Neural Networks: XOR

‣ Let's see how we can use neural nets to learn a simple nonlinear function

‣ Inputs: x_1, x_2 (generally x = (x_1, …, x_m))

‣ Output: y (generally y = (y_1, …, y_n)); here y = x_1 XOR x_2

x1  x2  y
0   0   0
0   1   1
1   0   1
1   1   0

Page 20:

Neural Networks: XOR

x1  x2  x1 XOR x2
0   0   0
0   1   1
1   0   1
1   1   0

y = a_1 x_1 + a_2 x_2   ✗ (linear: cannot fit XOR)

y = a_1 x_1 + a_2 x_2 + a_3 \tanh(x_1 + x_2)
(the tanh term acts like an “or” of the inputs; looks like the action potential in a neuron)

Page 21:

Neural Networks: XOR

x1  x2  x1 XOR x2
0   0   0
0   1   1
1   0   1
1   1   0

y = a_1 x_1 + a_2 x_2   ✗

y = a_1 x_1 + a_2 x_2 + a_3 \tanh(x_1 + x_2)   (“or”)

y = -x_1 - x_2 + 2 \tanh(x_1 + x_2)

[Plot: y as a function of (x_1, x_2), separating the XOR classes]
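As a quick sanity check (not on the slide), evaluating this handcrafted network on the four inputs reproduces the XOR truth table:

import math

def f(x1, x2):
    # the slide's solution: linear part plus one tanh unit
    return -x1 - x2 + 2 * math.tanh(x1 + x2)

for x1 in (0, 1):
    for x2 in (0, 1):
        y = f(x1, x2)
        print(x1, x2, round(y, 3), int(y > 0.25))  # threshold between the output levels
# 0 0 0.0 0  |  0 1 0.523 1  |  1 0 0.523 1  |  1 1 -0.072 0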

Page 22:

Neural Networks

Linear model: y = w \cdot x + b

y = g(w \cdot x + b)
y = g(Wx + b)
(Wx: warp space; + b: shift; g: nonlinear transformation)

Taken from http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

Page 23:

Neural Networks

Linear classifier vs. neural network: separation becomes possible because we transformed the space!

Taken from http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

Page 24:

Deep Neural Networks

y = g(Wx + b)   (output of first layer)

z = g(Vy + c)

z = g(V g(Wx + b) + c)

Check: what happens with no nonlinearity, i.e., z = V(Wx + b) + c? Is this more powerful than basic linear models? (see the sketch below)

Taken from http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/
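A numeric answer to the check, as a small sketch (shapes and seed are arbitrary choices of mine): without g, the two layers multiply out into a single linear map, so the model is no more powerful than a linear one.

import numpy as np

rng = np.random.default_rng(0)
W, b = rng.normal(size=(3, 4)), rng.normal(size=3)
V, c = rng.normal(size=(2, 3)), rng.normal(size=2)
x = rng.normal(size=4)

two_layer = V @ (W @ x + b) + c           # z = V(Wx + b) + c, no nonlinearity
one_layer = (V @ W) @ x + (V @ b + c)     # one linear model with folded parameters
print(np.allclose(two_layer, one_layer))  # True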

Page 25:

Feedforward Networks, Backpropagation

Page 26:

Logistic Regression with NNs

P(y|x) = \frac{\exp(w^\top f(x, y))}{\sum_{y'} \exp(w^\top f(x, y'))}   ‣ Single scalar probability

P(y|x) = \mathrm{softmax}\left( [w^\top f(x, y)]_{y \in \mathcal{Y}} \right)   ‣ Compute scores for all possible labels at once (returns vector)

\mathrm{softmax}(p)_i = \frac{\exp(p_i)}{\sum_{i'} \exp(p_{i'})}   ‣ softmax: exps and normalizes a given vector (sketched below)

P(y|x) = \mathrm{softmax}(W f(x))   ‣ Weight vector per class; W is [num classes x num feats]

P(y|x) = \mathrm{softmax}(W g(V f(x)))   ‣ Now one hidden layer
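A minimal numpy softmax matching the definition above; subtracting the max before exponentiating is a standard stability trick, not part of the slide's formula:

import numpy as np

def softmax(p):
    e = np.exp(p - np.max(p))  # exponentiate (shifted to avoid overflow)
    return e / e.sum()         # normalize to a probability vector

scores = np.array([2.0, 1.0, -1.0])  # e.g., Wf(x) for three classes
print(softmax(scores))               # [0.705 0.259 0.035], sums to 1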

Page 27:

Neural Networks for Classification

P(y|x) = \mathrm{softmax}(W g(V f(x)))

[Pipeline: f(x) (n features) → V (d x n matrix) → g: nonlinearity (tanh, relu, …) → z (d hidden units) → W (num_classes x d matrix) → softmax → P(y|x) (num_classes probs)]

Page 28:

Training Neural Networks

z = g(V f(x))
P(y|x) = \mathrm{softmax}(Wz)

‣ Maximize log likelihood of training data

\mathcal{L}(x, i^*) = \log P(y = i^* | x) = \log\left( \mathrm{softmax}(Wz) \cdot e_{i^*} \right)

\mathcal{L}(x, i^*) = Wz \cdot e_{i^*} - \log \sum_j \exp(Wz) \cdot e_j

‣ i^*: index of the gold label

‣ e_i: 1 in the i-th row, zero elsewhere; dotting by this = selecting the i-th index

Page 29:

Computing Gradients

\mathcal{L}(x, i^*) = Wz \cdot e_{i^*} - \log \sum_j \exp(Wz) \cdot e_j

‣ Gradient with respect to W (rows i index classes, columns j index hidden units):

\frac{\partial}{\partial W_{ij}} \mathcal{L}(x, i^*) = \begin{cases} z_j - P(y = i | x)\, z_j & \text{if } i = i^* \\ -P(y = i | x)\, z_j & \text{otherwise} \end{cases}

‣ Looks like logistic regression with z as the features!

Page 30:

Neural Networks for Classification

P(y|x) = \mathrm{softmax}(W g(V f(x)))

[Same pipeline diagram as before, with \partial \mathcal{L} / \partial W and z highlighted at the output layer]

Page 31:

Computing Gradients: Backpropagation

z = g(V f(x))   ‣ activations at hidden layer

\mathcal{L}(x, i^*) = Wz \cdot e_{i^*} - \log \sum_j \exp(Wz) \cdot e_j

‣ Gradient with respect to V: apply the chain rule

\frac{\partial \mathcal{L}(x, i^*)}{\partial V_{ij}} = \frac{\partial \mathcal{L}(x, i^*)}{\partial z} \frac{\partial z}{\partial V_{ij}}

err(root) = e_{i^*} - P(y|x)   (dim = m)

\frac{\partial \mathcal{L}(x, i^*)}{\partial z} = err(z) = W^\top err(root)   (dim = d)

[some math…]
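A small numpy sketch (all variable names are mine) that checks the W gradient from the previous slide and the error signal err(z) = W^\top err(root) against finite differences:

import numpy as np

def softmax(p):
    e = np.exp(p - p.max())
    return e / e.sum()

def loss(W, z, i_star):
    # L(x, i*) = Wz . e_{i*} - log sum_j exp(Wz) . e_j
    return (W @ z)[i_star] - np.log(np.exp(W @ z).sum())

rng = np.random.default_rng(0)
d, m, i_star = 4, 3, 1
W, z = rng.normal(size=(m, d)), rng.normal(size=d)

err_root = np.eye(m)[i_star] - softmax(W @ z)  # e_{i*} - P(y|x)
grad_W = np.outer(err_root, z)                 # row i: (1[i = i*] - P(y = i|x)) z
err_z = W.T @ err_root                         # err(z) = W^T err(root)

eps = 1e-6
W2 = W.copy(); W2[0, 0] += eps                 # perturb one entry of W
z2 = z.copy(); z2[0] += eps                    # perturb one entry of z
print(np.isclose((loss(W2, z, i_star) - loss(W, z, i_star)) / eps, grad_W[0, 0], atol=1e-4))  # True
print(np.isclose((loss(W, z2, i_star) - loss(W, z, i_star)) / eps, err_z[0], atol=1e-4))      # True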

Page 32:

Backpropagation: Picture

P(y|x) = \mathrm{softmax}(W g(V f(x)))

[Pipeline diagram with the error signals err(z) and err(root) flowing backwards and \partial \mathcal{L} / \partial W computed at the output layer]

‣ Can forget everything after z, treat it as the output, and keep backpropping

Page 33:

Backpropagation: Takeaways

‣ Gradients of output weights W are easy to compute; looks like logistic regression with hidden layer z as the feature vector

‣ Can compute derivative of loss with respect to z to form an “error signal” for backpropagation

‣ Easy to update parameters based on the “error signal” from the next layer; keep pushing the error signal back as backpropagation

‣ Need to remember the values from the forward computation

Page 34:

Applications

Page 35:

NLP with Feedforward Networks

‣ Part-of-speech tagging with FFNNs:

Fed raises interest rates in order to …

f(x) = [emb(raises); emb(interest); emb(rates); other words, feats, etc.]
(previous word; curr word; next word)

‣ Word embeddings for each word form the input

‣ ~1000 features here; a smaller feature vector than in sparse models, but every feature fires on every example

‣ Weight matrix learns position-dependent processing of the words

Botha et al. (2017)
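A minimal sketch of this input construction in Pytorch; the vocabulary size, embedding dimension, and zero-index boundary padding are my assumptions, not details from Botha et al.:

import torch
import torch.nn as nn

emb = nn.Embedding(10000, 50)  # assumed vocab size and embedding dimension

def window_features(word_ids, i):
    # concatenate embeddings of previous, current, and next word;
    # index 0 stands in for padding at sentence boundaries (a simplification)
    ids = [word_ids[i + k] if 0 <= i + k < len(word_ids) else 0 for k in (-1, 0, 1)]
    return torch.cat([emb(torch.tensor(j)) for j in ids])  # a 150-dim f(x)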

Page 36:

NLP with Feedforward Networks

‣ Hidden layer mixes these different signals and learns feature conjunctions

Botha et al. (2017)

Page 37:

NLP with Feedforward Networks

‣ Multilingual tagging results:

Botha et al. (2017)

‣ Gillick used LSTMs; this is smaller, faster, and better

Page 38:

Sentiment Analysis

‣ Deep Averaging Networks: feedforward neural network on the average of word embeddings from the input

Iyyer et al. (2015)
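A minimal DAN sketch, assuming dimensions of my choosing and omitting the word-dropout trick from the paper:

import torch
import torch.nn as nn

class DAN(nn.Module):
    # average the word embeddings, then run a feedforward network on the average
    def __init__(self, vocab_size, emb_dim, hid, num_classes):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.ff = nn.Sequential(nn.Linear(emb_dim, hid), nn.Tanh(),
                                nn.Linear(hid, num_classes))

    def forward(self, word_ids):
        avg = self.emb(word_ids).mean(dim=0)  # average over the sentence's words
        return torch.softmax(self.ff(avg), dim=0)

# e.g., DAN(10000, 50, 100, 2)(torch.tensor([4, 17, 92])) gives two class probabilities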

Page 39:

Sentiment Analysis

[Table from Iyyer et al. (2015): sentiment accuracies grouped into bag-of-words models vs. tree RNNs / CNNs / LSTMs, with comparisons to Wang and Manning (2012) and Kim (2014)]

Page 40:

Implementation Details

Page 41:

Computation Graphs

‣ Computing gradients is hard! The computation graph abstraction allows us to define a computation symbolically, and it will do this for us

‣ Automatic differentiation: keep track of derivatives / be able to backpropagate through each function:

y = x * x   →(codegen)→   (y, dy) = (x * x, 2 * x * dx)

‣ Use a library like Pytorch or Tensorflow. This class: Pytorch
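The y = x * x example as a tiny Pytorch sketch:

import torch

x = torch.tensor(3.0, requires_grad=True)
y = x * x     # the graph is recorded as the computation runs
y.backward()  # automatic differentiation: dy/dx = 2x
print(x.grad) # tensor(6.)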

Page 42:

Computation Graphs in Pytorch

P(y|x) = \mathrm{softmax}(W g(V f(x)))

import torch.nn as nn

class FFNN(nn.Module):
    def __init__(self, inp, hid, out):
        super(FFNN, self).__init__()
        self.V = nn.Linear(inp, hid)       # d x n matrix
        self.g = nn.Tanh()                 # nonlinearity
        self.W = nn.Linear(hid, out)       # num_classes x d matrix
        self.softmax = nn.Softmax(dim=0)

    def forward(self, x):
        return self.softmax(self.W(self.g(self.V(x))))

‣ Define the forward pass for the network

Page 43:

Computation Graphs in Pytorch

P(y|x) = \mathrm{softmax}(W g(V f(x)))

ffnn = FFNN(inp, hid, out)
# optimizer (e.g., torch.optim.Adam(ffnn.parameters())) is defined elsewhere

def make_update(input, gold_label):
    # gold_label: ei*, a one-hot vector of the label (e.g., [0, 1, 0])
    ffnn.zero_grad()  # clear gradient variables
    probs = ffnn.forward(input)
    loss = torch.neg(torch.log(probs)).dot(gold_label)
    loss.backward()
    optimizer.step()

Page 44:

Training a Model

Define a computation graph

For each epoch:

    For each batch of data:

        Compute loss on batch

        Autograd to compute gradients

        Take step with optimizer

Decode test set
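Putting the pieces together as a sketch, using FFNN and make_update from the previous slides; inp, hid, out, num_epochs, train_data, and test_data are stand-ins of mine, not values from the lecture:

import torch

ffnn = FFNN(inp, hid, out)                # define the computation graph
optimizer = torch.optim.Adam(ffnn.parameters(), lr=1e-3)

for epoch in range(num_epochs):
    for input, gold_label in train_data:  # batches of data
        make_update(input, gold_label)    # loss, autograd, optimizer step

# decode the test set with the learned model
predictions = [torch.argmax(ffnn.forward(x)) for x in test_data]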

Page 45:

Next Time

‣ Word representations / word vectors

‣ word2vec, GloVe

‣ Training neural networks