
Page 1: Neural Networks

Neural Networks

Wei Xu (many slides from Greg Durrett and Philipp Koehn)

Page 2: Recap: Loss Functions

Recap: Loss Functions

Hinge (SVM), logistic, perceptron, and 0-1 (ideal) losses

[Figure: each loss plotted as a function of w^T x]

Page 3: Recap: Logistic Regression

Recap: Logistic Regression

‣ To learn weights: maximize discriminative log likelihood of data P(y|x)

P(y = +|x) = logistic(w^T x) = exp(Σ_{i=1}^n w_i x_i) / (1 + exp(Σ_{i=1}^n w_i x_i))

L(x_j, y_j = +) = log P(y_j = +|x_j) = Σ_{i=1}^n w_i x_{ji} - log(1 + exp(Σ_{i=1}^n w_i x_{ji}))      (sum over features)
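A quick numeric sketch of these two quantities (numpy; the weights and features below are made up for illustration):

    import numpy as np

    # hypothetical weights and features for one positive example
    w = np.array([0.5, -1.2, 2.0])
    x = np.array([1.0, 0.0, 1.0])

    score = w @ x                                 # w^T x
    p_pos = np.exp(score) / (1 + np.exp(score))   # P(y = +|x), the logistic function
    log_lik = score - np.log(1 + np.exp(score))   # log P(y = +|x), as in the formula above
    print(p_pos, log_lik)                         # log_lik equals np.log(p_pos)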

Page 4: Recap: Multiclass Logistic Regression

Recap: Multiclass Logistic Regression

P_w(y|x) = exp(w^T f(x, y)) / Σ_{y' ∈ Y} exp(w^T f(x, y'))      (sum over output space to normalize)

Why? Interpret raw classifier scores as probabilities.

Example ("too many drug trials, too few patients"), scores w^T f(x, y):   Health: +2.2   Sports: +3.1   Science: -0.6

exp → unnormalized probabilities (must be >= 0):   6.05   22.2   0.55

normalize → probabilities (must sum to 1):   0.21   0.77   0.02

compare to correct (gold) probabilities:   1.00   0.00   0.00

L(x, y) = Σ_{j=1}^n log P(y_j^*|x_j),   L(x_j, y_j^*) = w^T f(x_j, y_j^*) - log Σ_y exp(w^T f(x_j, y))

log(0.21) = -1.56
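A minimal numpy sketch of this exponentiate-and-normalize pipeline (the scores are taken from the slide; the printed values will differ slightly from the rounded numbers above):

    import numpy as np

    def softmax(scores):
        e = np.exp(scores - scores.max())   # subtracting the max is a standard stability trick
        return e / e.sum()                  # exponentiate, then normalize to sum to 1

    scores = np.array([2.2, 3.1, -0.6])     # w^T f(x, y) for Health, Sports, Science
    probs = softmax(scores)
    print(probs)                            # nonnegative, sums to 1
    print(np.log(probs))                    # the gold class's entry is the log likelihood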

Page 5: Recap: Multiclass Logistic Regression

Recap: Multiclass Logistic Regression

P_w(y|x) = exp(w^T f(x, y)) / Σ_{y' ∈ Y} exp(w^T f(x, y'))      (sum over output space to normalize)

‣ Training: maximize

L(x, y) = Σ_{j=1}^n log P(y_j^*|x_j) = Σ_{j=1}^n ( w^T f(x_j, y_j^*) - log Σ_y exp(w^T f(x_j, y)) )

Page 6: Administrivia

Administrivia

‣ Homework 1 graded (on Carmen)

‣ Homework 2 will be released soon.

‣ Reading: Eisenstein 3.1-3.3, Jurafsky + Martin 7.1-7.4

Page 7: This Lecture

This Lecture

‣ Feedforward neural networks + backpropagation

‣ Neural network basics

‣ Applications

‣ Neural network history

‣ Implementing neural networks (if time)

Page 8: History: NN "dark ages"

History: NN "dark ages"

‣ ConvNets: applied to MNIST by LeCun in 1998

‣ LSTMs: Hochreiter and Schmidhuber (1997)

‣ Henderson (2003): neural shift-reduce parser, not SOTA

https://www.youtube.com/watch?v=FwFduRA_L6Q&feature=youtu.be
https://www.andreykurenkov.com/writing/ai/a-brief-history-of-neural-nets-and-deep-learning/

Page 9: 2008-2013: A glimmer of light…

2008-2013: A glimmer of light…

‣ Collobert and Weston 2011: "NLP (almost) from scratch"

‣ Feedforward neural nets induce features for sequential CRFs ("neural CRF")

‣ The 2008 version was marred by bad experiments and claimed SOTA but wasn't; the 2011 version tied SOTA

‣ Socher 2011-2014: tree-structured RNNs working okay

‣ Krizhevsky et al. (2012): AlexNet for vision

Page 10: 2014: Stuff starts working

2014: Stuff starts working

‣ Sutskever et al. (2014) + Bahdanau et al. (2015): seq2seq + attention for neural MT (LSTMs work for NLP?)

‣ Kim (2014) + Kalchbrenner et al. (2014): sentence classification / sentiment (convnets work for NLP?)

‣ Chen and Manning (2014): transition-based dependency parser (even feedforward networks work well for NLP?)

‣ 2015: explosion of neural nets for everything under the sun

Page 11: Why didn't they work before?

Why didn't they work before?

‣ Datasets too small: for MT, not really better until you have 1M+ parallel sentences (and you really need a lot more)

‣ Optimization not well understood: good initialization, per-feature scaling + momentum (Adagrad / Adadelta / Adam) work best out of the box

‣ Regularization: dropout is pretty helpful

‣ Inputs: need word representations to have the right continuous semantics

‣ Computers not big enough: can't run for enough iterations

Page 12: Neural Net Basics

Neural Net Basics

Page 13: Neural Networks: motivation

Neural Networks: motivation

‣ How can we do nonlinear classification? Kernels are too slow…

‣ Want to learn intermediate conjunctive features of the input

‣ Linear classification: argmax_y w^T f(x, y)

Example: "the movie was not all that good" with the conjunctive feature I[contains "not" & contains "good"]

Page 14: Neural Networks: XOR

Neural Networks: XOR

‣ Let's see how we can use neural nets to learn a simple nonlinear function

‣ Inputs: x1, x2   (generally x = (x1, ..., xm))

‣ Output: y   (generally y = (y1, ..., yn));  here y = x1 XOR x2

x1  x2 | x1 XOR x2
 0   0 |     0
 0   1 |     1
 1   0 |     1
 1   1 |     0

Page 15: Neural Networks: XOR

Neural Networks: XOR

x1  x2 | x1 XOR x2
 0   0 |     0
 0   1 |     1
 1   0 |     1
 1   1 |     0

y = a1 x1 + a2 x2   ✗  (a linear function of the inputs cannot represent XOR)

y = a1 x1 + a2 x2 + a3 tanh(x1 + x2)      (the tanh term acts like an "or" of the inputs, and looks like the action potential in a neuron)

Page 16: Neural Networks: XOR

Neural Networks: XOR

x1  x2 | x1 XOR x2
 0   0 |     0
 0   1 |     1
 1   0 |     1
 1   1 |     0

y = a1 x1 + a2 x2   ✗

y = a1 x1 + a2 x2 + a3 tanh(x1 + x2)      ("or")

Concrete choice of coefficients: y = -x1 - x2 + 2 tanh(x1 + x2)

[Figure: the four XOR points in the (x1, x2) plane]
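A quick numpy check of this handcrafted function on the four binary inputs:

    import numpy as np

    def xor_approx(x1, x2):
        # the slide's handcrafted function: y = -x1 - x2 + 2 tanh(x1 + x2)
        return -x1 - x2 + 2 * np.tanh(x1 + x2)

    for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(x1, x2, round(xor_approx(x1, x2), 3))
    # (0,0) -> 0.0, (0,1) and (1,0) -> ~0.52, (1,1) -> ~-0.07:
    # the output is high exactly when x1 XOR x2 is true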

Page 17: Neural Networks: XOR

Neural Networks: XOR

Same idea on text features: x1 = I[contains "not"], x2 = I[contains "good"]

"the movie was not all that good"

y = -2 x1 - x2 + 2 tanh(x1 + x2)

[Figure: the ([not], [good]) feature space with the values of y at the four corners]

Page 18: Neural Networks

Neural Networks

Linear model: y = w · x + b

y = g(w · x + b),   y = g(Wx + b)      (g is a nonlinearity such as tanh)

Interpretation: Wx warps the space, + b shifts it, and g applies a nonlinear transformation

Taken from http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

Page 19: Neural Networks

Neural Networks

Linear classifier vs. neural network … possible because we transformed the space!

Taken from http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

Page 20: Deep Neural Networks

Deep Neural Networks

y = g(Wx + b)      (output of first layer)

z = g(Vy + c)      (second layer on top of it)

z = g(V g(Wx + b) + c)      (input → first layer → second layer)

"Feedforward" computation (not recurrent)

‣ Check: what happens if there is no nonlinearity, i.e. z = V(Wx + b) + c? Is that more powerful than basic linear models?

Adopted from Chris Dyer
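For the check above, a small numpy sketch showing that without the nonlinearity the two layers collapse into a single linear model (the shapes and random values are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    W, b = rng.normal(size=(4, 3)), rng.normal(size=4)   # first layer
    V, c = rng.normal(size=(2, 4)), rng.normal(size=2)   # second layer
    x = rng.normal(size=3)

    z_two_layer = V @ (W @ x + b) + c        # z = V(Wx + b) + c, no nonlinearity
    z_one_layer = (V @ W) @ x + (V @ b + c)  # the same thing as one linear layer
    print(np.allclose(z_two_layer, z_one_layer))   # True: no extra power over a linear model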

Page 21: Deep Neural Networks

Deep Neural Networks

[Figure from http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/]

Page 22: Feedforward Networks, Backpropagation

Feedforward Networks, Backpropagation

Page 23: Administrivia

Administrivia

‣ Homework 2 is released, due on February 18 (start early!).

‣ Reading: Eisenstein 7.0-7.4, Jurafsky + Martin Chapter 8

Page 24: Simple Neural Network

Simple Neural Network

‣ Assumes that the labels y are indexed and associated with coordinates in a vector space

[Figure: a small network with two input nodes, two hidden nodes, one output node, and the edge weights used in the computations on the following slides]

• One innovation: bias units (no inputs, always value 1)

(Slides from Philipp Koehn, Machine Translation: Introduction to Neural Networks, 27 September 2018)

Page 25: Simple Neural Network

Simple Neural Network

[Same network diagram as the previous page]

• One innovation: bias units (no inputs, always value 1)

Page 26: Simple Neural Network

Simple Neural Network

‣ Try out two input values: 1.0 and 0.0 (plus the bias unit, fixed at 1)

‣ Hidden unit computation:

sigmoid(1.0 × 3.7 + 0.0 × 3.7 + 1 × -1.5) = sigmoid(2.2) = 1 / (1 + e^{-2.2}) = 0.90

sigmoid(1.0 × 2.9 + 0.0 × 2.9 + 1 × -4.5) = sigmoid(-1.6) = 1 / (1 + e^{1.6}) = 0.17

Page 27: Simple Neural Network

Simple Neural Network

‣ Try out two input values: 1.0 and 0.0

‣ Hidden unit computation (the hidden values come out to 0.90 and 0.17):

sigmoid(1.0 × 3.7 + 0.0 × 3.7 + 1 × -1.5) = sigmoid(2.2) = 1 / (1 + e^{-2.2}) = 0.90

sigmoid(1.0 × 2.9 + 0.0 × 2.9 + 1 × -4.5) = sigmoid(-1.6) = 1 / (1 + e^{1.6}) = 0.17

Page 28: Simple Neural Network

Simple Neural Network

‣ Output unit computation (the output comes out to 0.76):

sigmoid(0.90 × 4.5 + 0.17 × -5.2 + 1 × -2.0) = sigmoid(1.17) = 1 / (1 + e^{-1.17}) = 0.76
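A numpy sketch reproducing this forward pass (the weights are read off the slides' computations; the node names D, E, G follow the labels used on the later worked-example slide):

    import numpy as np

    def sigmoid(s):
        return 1 / (1 + np.exp(-s))

    x = np.array([1.0, 0.0, 1.0])             # two inputs plus the bias unit fixed at 1

    W_hidden = np.array([[3.7, 3.7, -1.5],    # weights into hidden node D
                         [2.9, 2.9, -4.5]])   # weights into hidden node E
    w_output = np.array([4.5, -5.2, -2.0])    # weights from [D, E, hidden bias] into output G

    h = sigmoid(W_hidden @ x)                 # -> approximately [0.90, 0.17]
    y = sigmoid(w_output @ np.append(h, 1))   # -> approximately 0.76
    print(h, y)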

Page 29: Simple Neural Network

Simple Neural Network

Output for all binary inputs:

Input x0   Input x1   Hidden h0   Hidden h1   Output y
   0          0          0.12        0.02      0.18 → 0
   0          1          0.88        0.27      0.74 → 1
   1          0          0.73        0.12      0.74 → 1
   1          1          0.99        0.73      0.33 → 0

• Network implements XOR
  – hidden node h0 is OR
  – hidden node h1 is AND
  – the final layer computes roughly h0 - h1 (OR but not AND)

• Power of deep neural networks: chaining of processing steps, just as more Boolean circuits → more complex computations possible

Page 30: Error

Error

‣ Computed output: y = 0.76

‣ Correct output: t = 1.0

⇒ Q: how do we adjust the weights?

Page 31: Gradient Descent

Gradient Descent

[Figure: error(λ) as a function of a single parameter λ; the gradient at the current λ points the way toward the optimal λ]

Page 32: Gradient Descent

Gradient Descent

[Figure: contours of the error over two weights w1 and w2; the gradient for w1 and the gradient for w2 combine into a single gradient pointing from the current point toward the optimum]

Page 33: Derivative of Sigmoid

Derivative of Sigmoid

‣ Sigmoid function: sigmoid(x) = 1 / (1 + e^{-x})

‣ Reminder, quotient rule: (f(x) / g(x))' = (g(x) f'(x) - f(x) g'(x)) / g(x)^2

‣ Derivative:

d sigmoid(x) / dx = d/dx [ 1 / (1 + e^{-x}) ]
                  = (0 × (1 + e^{-x}) - (-e^{-x})) / (1 + e^{-x})^2
                  = (1 / (1 + e^{-x})) × (e^{-x} / (1 + e^{-x}))
                  = (1 / (1 + e^{-x})) × (1 - 1 / (1 + e^{-x}))
                  = sigmoid(x) (1 - sigmoid(x))

Page 34: Final Layer Update

Final Layer Update

‣ Linear combination of weights: s = Σ_k w_k h_k

‣ Activation function: y = sigmoid(s)

‣ Error (L2 norm): E = ½ (t - y)^2

‣ Derivative of error with regard to one weight w_k:

dE/dw_k = (dE/dy) (dy/ds) (ds/dw_k)

Page 35: Final Layer Update (1)

Final Layer Update (1)

‣ Linear combination of weights: s = Σ_k w_k h_k

‣ Activation function: y = sigmoid(s)

‣ Error (L2 norm): E = ½ (t - y)^2

‣ Derivative of error with regard to one weight w_k:

dE/dw_k = (dE/dy) (dy/ds) (ds/dw_k)

‣ Error E is defined with respect to y:

dE/dy = d/dy [ ½ (t - y)^2 ] = -(t - y)

Page 36: Final Layer Update (2)

Final Layer Update (2)

‣ Linear combination of weights: s = Σ_k w_k h_k

‣ Activation function: y = sigmoid(s)

‣ Error (L2 norm): E = ½ (t - y)^2

‣ Derivative of error with regard to one weight w_k:

dE/dw_k = (dE/dy) (dy/ds) (ds/dw_k)

‣ y with respect to s is sigmoid(s):

dy/ds = d sigmoid(s) / ds = sigmoid(s)(1 - sigmoid(s)) = y(1 - y)

Page 37: Final Layer Update (3)

Final Layer Update (3)

‣ Linear combination of weights: s = Σ_k w_k h_k

‣ Activation function: y = sigmoid(s)

‣ Error (L2 norm): E = ½ (t - y)^2

‣ Derivative of error with regard to one weight w_k:

dE/dw_k = (dE/dy) (dy/ds) (ds/dw_k)

‣ s is a weighted linear combination of the hidden node values h_k:

ds/dw_k = d/dw_k [ Σ_k w_k h_k ] = h_k

Page 38: Putting it All Together

Putting it All Together

‣ Derivative of error with regard to one weight w_k:

dE/dw_k = (dE/dy) (dy/ds) (ds/dw_k) = -(t - y) y(1 - y) h_k

  (the error -(t - y), the derivative of the sigmoid y' = y(1 - y), and the hidden value h_k)

‣ Weight adjustment will be scaled by a fixed learning rate μ:

Δw_k = μ (t - y) y' h_k
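A minimal numpy sketch of this update rule for the output weights (sigmoid output, L2 error, fixed learning rate):

    import numpy as np

    def sigmoid(s):
        return 1 / (1 + np.exp(-s))

    def final_layer_update(w, h, t, mu):
        """One gradient-descent step on the output weights.
        w: weights into the output node, h: hidden values (including the bias unit),
        t: target output, mu: learning rate."""
        y = sigmoid(w @ h)                 # current output
        delta = (t - y) * y * (1 - y)      # error times derivative of sigmoid (y')
        return w + mu * delta * h          # delta w_k = mu * (t - y) * y' * h_k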

Page 39: Multiple Output Nodes

Multiple Output Nodes

• Our example only had one output node

• Typically neural networks have multiple output nodes

‣ Error is computed over all j output nodes:

E = Σ_j ½ (t_j - y_j)^2

‣ Weights w_{j←k} are adjusted according to the node they point to:

Δw_{j←k} = μ (t_j - y_j) y'_j h_k

Page 40: Hidden Layer Update

Hidden Layer Update

‣ In a hidden layer, we do not have a target output value

‣ But we can compute how much each node contributed to downstream error

‣ Definition of the error term of each node:

δ_j = (t_j - y_j) y'_j

‣ Back-propagate the error term (why this way? there is math to back it up…):

δ_i = ( Σ_j w_{j←i} δ_j ) y'_i

‣ Universal update formula:

Δw_{j←k} = μ δ_j h_k

Page 41: Our Example

Our Example

‣ Computed output: y = 0.76; correct output: t = 1.0

‣ Q: how do we adjust the weights? (Label the nodes A, B, C for the inputs and input bias, D, E, F for the hidden nodes and hidden bias, and G for the output.)

• Final layer weight updates (learning rate μ = 10):

  δ_G = (t - y) y' = (1 - 0.76) × 0.181 = 0.0434

  Δw_{GD} = μ δ_G h_D = 10 × 0.0434 × 0.90 = 0.391

  Δw_{GE} = μ δ_G h_E = 10 × 0.0434 × 0.17 = 0.074

  Δw_{GF} = μ δ_G h_F = 10 × 0.0434 × 1 = 0.434

Page 42: Our Example

Our Example

(Same final-layer updates as the previous page: y = 0.76, t = 1.0, μ = 10; δ_G = 0.0434, Δw_{GD} = 0.391, Δw_{GE} = 0.074, Δw_{GF} = 0.434.)

Page 43: Our Example

Our Example

‣ Applying these updates, the final-layer weights become 4.5 + 0.391 = 4.891, -5.2 + 0.074 = -5.126, and -2.0 + 0.434 = -1.566.

Page 44: Hidden Layer Updates

Hidden Layer Updates

• Hidden node D:

  δ_D = ( Σ_j w_{j←i} δ_j ) y'_D = w_{GD} δ_G y'_D = 4.5 × 0.0434 × 0.0898 = 0.0175

  Δw_{DA} = μ δ_D h_A = 10 × 0.0175 × 1.0 = 0.175
  Δw_{DB} = μ δ_D h_B = 10 × 0.0175 × 0.0 = 0
  Δw_{DC} = μ δ_D h_C = 10 × 0.0175 × 1 = 0.175

• Hidden node E:

  δ_E = ( Σ_j w_{j←i} δ_j ) y'_E = w_{GE} δ_G y'_E = -5.2 × 0.0434 × 0.2055 = -0.0464

  Δw_{EA} = μ δ_E h_A = 10 × (-0.0464) × 1.0 = -0.464
  etc.
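A short Python sketch that reproduces the worked example for nodes G and D up to rounding (the slides carry the y' values at slightly different precision, so the printed numbers differ in the third decimal place; node E follows the same recipe):

    mu = 10.0                                  # learning rate from the slides
    h = {"A": 1.0, "B": 0.0, "C": 1.0, "D": 0.90, "E": 0.17, "F": 1.0}
    y, t = 0.76, 1.0                           # computed and correct output at node G

    delta_G = (t - y) * y * (1 - y)            # ~0.043; the slide's 0.0434 uses y' = 0.181
    updates_G = {k: mu * delta_G * h[k] for k in "DEF"}   # ~{D: 0.39, E: 0.07, F: 0.43}

    y_prime_D = h["D"] * (1 - h["D"])          # derivative of sigmoid at hidden node D
    delta_D = 4.5 * delta_G * y_prime_D        # ~0.018; back-propagated through w_GD
    update_DA = mu * delta_D * h["A"]          # ~0.18
    print(delta_G, updates_G, delta_D, update_DA)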

Page 45: Logistic Regression with NNs

Logistic Regression with NNs

P(y|x) = exp(w^T f(x, y)) / Σ_{y'} exp(w^T f(x, y'))      ‣ Single scalar probability

P(y|x) = softmax([w^T f(x, y)]_{y ∈ Y})      ‣ Compute scores for all possible labels at once (returns a vector)

softmax(p)_i = exp(p_i) / Σ_{i'} exp(p_{i'})      ‣ softmax: exps and normalizes a given vector

P(y|x) = softmax(W f(x))      ‣ Weight vector per class; W is [num classes x num feats]

P(y|x) = softmax(W g(V f(x)))      ‣ Now one hidden layer

Page 46: Neural Networks for Classification

Neural Networks for Classification

P(y|x) = softmax(W g(V f(x)))

f(x): n features → V (d x n matrix) → g (nonlinearity: tanh, relu, …) → z (d hidden units) → W (num_classes x d matrix) → softmax → P(y|x) (num_classes probs)

We can think of a neural network classifier with one hidden layer as building a vector z, which is a hidden-layer representation of the input, and then running standard logistic regression on the features that the network develops in z.

Page 47: Training Neural Networks

Training Neural Networks

‣ Maximize log likelihood of training data

z = g(V f(x)),   P(y|x) = softmax(Wz)

L(x, i*) = log P(y = i*|x) = log (softmax(Wz) · e_{i*}) = Wz · e_{i*} - log Σ_j exp(Wz) · e_j

‣ i*: index of the gold label

‣ e_i: 1 in the ith row, zero elsewhere; dotting with e_i selects the ith index

Page 48: Computing Gradients

Computing Gradients

L(x, i*) = Wz · e_{i*} - log Σ_j exp(Wz) · e_j

‣ Gradient with respect to W (W is a num_classes x d matrix; i indexes the output space Y, with i* the index of the gold label, and j indexes the vector z):

∂L(x, i*)/∂W_ij = z_j - P(y = i|x) z_j      if i = i*
∂L(x, i*)/∂W_ij = -P(y = i|x) z_j           otherwise

‣ Looks like logistic regression with z as the features!

Page 49: Neural Networks for Classification

Neural Networks for Classification

P(y|x) = softmax(W g(V f(x)))

[Figure: the pipeline f(x) → V → g → z → W → softmax → P(y|x), with ∂L/∂W computed at the output layer from z]

Page 50: Computing Gradients

Computing Gradients

L(x, i*) = Wz · e_{i*} - log Σ_j exp(Wz) · e_j

‣ Gradient with respect to z:

∂L(x, i*)/∂z = W_{i*} - Σ_j P(y = j|x) W_j

[some math…]

err(root) = e_{i*} - P(y|x)      (dim = num_classes)

∂L(x, i*)/∂z = err(z) = W^T err(root)      (dim = d)

Page 51: Backpropagation: Picture

Backpropagation: Picture

P(y|x) = softmax(W g(V f(x)))

[Figure: the pipeline f(x) → V → g → z → W → softmax → P(y|x), with err(root) at the output, ∂L/∂W, and err(z) flowing back to z]

‣ Can forget everything after z, treat it as the output, and keep backpropping

Page 52: Computing Gradients: Backpropagation

Computing Gradients: Backpropagation

z = g(V f(x))      (activations at the hidden layer)

L(x, i*) = Wz · e_{i*} - log Σ_j exp(Wz) · e_j

‣ Gradient with respect to V: apply the chain rule

∂L(x, i*)/∂V_ij = (∂L(x, i*)/∂z) (∂z/∂V_ij)

err(root) = e_{i*} - P(y|x)      (dim = num_classes)

∂L(x, i*)/∂z = err(z) = W^T err(root)      (dim = d)

[some math…]

Page 53: Backpropagation: Picture

Backpropagation: Picture

P(y|x) = softmax(W g(V f(x)))

[Figure: the same pipeline, now also showing ∂z/∂V at the hidden layer and err(z) flowing back toward f(x)]

Page 54: Backpropagation

Backpropagation

P(y|x) = softmax(W g(V f(x)))

‣ Step 1: compute err(root) = e_{i*} - P(y|x)      (vector)

‣ Step 2: compute derivatives of W using err(root)      (matrix)

‣ Step 3: compute ∂L(x, i*)/∂z = err(z) = W^T err(root)      (vector)

‣ Step 4: compute derivatives of V using err(z)      (matrix)

‣ Step 5+: continue backpropagation (compute err(f(x)) if necessary…)
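A self-contained numpy sketch of these steps for one example, with g = tanh (the sizes, random initialization, and learning rate are made up for illustration):

    import numpy as np

    def softmax(s):
        e = np.exp(s - s.max())
        return e / e.sum()

    rng = np.random.default_rng(0)
    f_x = rng.normal(size=5)                 # feature vector f(x)
    V = 0.1 * rng.normal(size=(4, 5))        # d x n
    W = 0.1 * rng.normal(size=(3, 4))        # num_classes x d
    gold = 1                                 # index i* of the gold label

    z = np.tanh(V @ f_x)                     # hidden layer z = g(V f(x))
    probs = softmax(W @ z)                   # P(y|x)

    err_root = np.eye(3)[gold] - probs       # Step 1: err(root) = e_i* - P(y|x)
    dL_dW = np.outer(err_root, z)            # Step 2: gradient of log likelihood w.r.t. W
    err_z = W.T @ err_root                   # Step 3: err(z) = W^T err(root)
    dL_dV = np.outer(err_z * (1 - z**2), f_x)  # Step 4: chain rule through g (tanh' = 1 - z^2)
    # Step 5+ would continue into f(x) if it were produced by earlier layers

    lr = 0.1                                 # gradient ascent on the log likelihood
    W += lr * dL_dW
    V += lr * dL_dV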

Page 55: Backpropagation: Takeaways

Backpropagation: Takeaways

‣ Gradients of the output weights W are easy to compute: it looks like logistic regression with the hidden layer z as the feature vector

‣ Can compute the derivative of the loss with respect to z to form an "error signal" for backpropagation

‣ Easy to update parameters based on the "error signal" from the next layer; keep pushing the error signal back as backpropagation

‣ Need to remember the values from the forward computation

Page 56: Applications

Applications

Page 57: NLP with Feedforward Networks

NLP with Feedforward Networks

‣ Part-of-speech tagging with FFNNs: "Fed raises interest rates in order to …"

‣ Word embeddings for each word form the input: f(x) is built from emb(raises), emb(interest), emb(rates) for the previous word, current word, and next word, plus other words, feats, etc.

‣ ~1000 features here: a smaller feature vector than in sparse models, but every feature fires on every example

‣ The weight matrix learns position-dependent processing of the words

Botha et al. (2017)
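A sketch of how such an input vector might be assembled (the 50-dimensional random embeddings are placeholders; in the real model the embedding table is learned):

    import numpy as np

    rng = np.random.default_rng(0)
    vocab = ["Fed", "raises", "interest", "rates", "in", "order", "to"]
    emb = {w: rng.normal(size=50) for w in vocab}      # hypothetical embedding table

    # input for tagging "interest": previous, current, and next word embeddings,
    # concatenated so that each position gets its own block of the weight matrix
    f_x = np.concatenate([emb["raises"], emb["interest"], emb["rates"]])
    print(f_x.shape)    # (150,) in this toy setup; roughly 1000 features on the slide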

Page 58: NLP with Feedforward Networks

NLP with Feedforward Networks

‣ The hidden layer mixes these different signals and learns feature conjunctions

Botha et al. (2017)

Page 59: NLP with Feedforward Networks

NLP with Feedforward Networks

‣ Multilingual tagging results: [table from Botha et al. (2017)]

‣ Gillick used LSTMs; this model is smaller, faster, and better

Page 60: Sentiment Analysis

Sentiment Analysis

‣ Deep Averaging Networks: a feedforward neural network on the average of the word embeddings from the input

Iyyer et al. (2015)
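A minimal PyTorch sketch of a deep averaging network; the sizes and the single hidden layer here are illustrative, not the exact Iyyer et al. configuration:

    import torch
    import torch.nn as nn

    class DAN(nn.Module):
        def __init__(self, vocab_size, emb_dim=50, hid=100, num_classes=2):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, emb_dim)
            self.ff = nn.Sequential(nn.Linear(emb_dim, hid), nn.Tanh(),
                                    nn.Linear(hid, num_classes))

        def forward(self, word_ids):              # word_ids: 1-D tensor of token indices
            avg = self.emb(word_ids).mean(dim=0)  # average of the word embeddings
            return torch.log_softmax(self.ff(avg), dim=0)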

Page 61: Sentiment Analysis

Sentiment Analysis

[Table: sentiment accuracies, grouped into bag-of-words models and tree RNNs / CNNs / LSTMs; systems include Wang and Manning (2012), Kim (2014), and Iyyer et al. (2015)]

Page 62: Coreference Resolution

Coreference Resolution

‣ Feedforward networks identify coreference arcs

"President Obama signed…"   …   "He later gave a speech…"   (is there an arc between "He" and "President Obama"?)

Clark and Manning (2015), Wiseman et al. (2015)

Page 63: Implementation Details

Implementation Details

Page 64: Computation Graphs

Computation Graphs

‣ Computing gradients is hard!

‣ Automatic differentiation: instrument code to keep track of derivatives

y = x * x   → (codegen) →   (y, dy) = (x * x, 2 * x * dx)

‣ Computation is now something we need to reason about symbolically

‣ Use a library like PyTorch or TensorFlow. This class: PyTorch

Page 65: Computation Graphs in PyTorch

Computation Graphs in PyTorch

P(y|x) = softmax(W g(V f(x)))

‣ Define the forward pass for P(y|x):

    import torch.nn as nn

    class FFNN(nn.Module):
        def __init__(self, inp, hid, out):
            super(FFNN, self).__init__()
            self.V = nn.Linear(inp, hid)       # V: inp features -> hid hidden units
            self.g = nn.Tanh()                 # nonlinearity g
            self.W = nn.Linear(hid, out)       # W: hid -> num classes
            self.softmax = nn.Softmax(dim=0)

        def forward(self, x):
            return self.softmax(self.W(self.g(self.V(x))))

Page 66: Computation Graphs in PyTorch

Computation Graphs in PyTorch

P(y|x) = softmax(W g(V f(x)))

    ffnn = FFNN(inp, hid, out)

    def make_update(input, gold_label):
        ffnn.zero_grad()        # clear gradient variables
        probs = ffnn.forward(input)
        loss = torch.neg(torch.log(probs)).dot(gold_label)
        loss.backward()
        optimizer.step()

gold_label is e_{i*}: a one-hot vector of the label (e.g., [0, 1, 0])
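The snippet above assumes an optimizer has been created elsewhere; one minimal way to complete and run it (the sizes, learning rate, and choice of Adam are assumptions for illustration, and FFNN / make_update are the definitions from these two pages):

    import torch
    import torch.optim as optim

    inp, hid, out = 10, 25, 3
    ffnn = FFNN(inp, hid, out)
    optimizer = optim.Adam(ffnn.parameters(), lr=1e-3)

    # one update on a made-up example; the gold label here is class 1
    make_update(torch.rand(inp), torch.tensor([0.0, 1.0, 0.0]))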

Page 67: Training a Model

Training a Model

Define a computation graph

For each epoch:

    For each batch of data:

        Compute loss on batch

        Autograd to compute gradients and take step

Decode test set
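The same recipe as a function sketch (the model, optimizer, batches, and loss function are all supplied by the caller; nothing here is specific to the slides' example):

    def train(model, optimizer, train_batches, num_epochs, compute_loss):
        for epoch in range(num_epochs):               # for each epoch
            for batch_x, batch_y in train_batches:    # for each batch of data
                model.zero_grad()
                loss = compute_loss(model(batch_x), batch_y)   # compute loss on batch
                loss.backward()                       # autograd computes gradients
                optimizer.step()                      # ... and takes a step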

Page 68: Batching

Batching

‣ Batching data gives speedups due to more efficient matrix operations

‣ Need to make the computation graph process a batch at the same time

    def make_update(input, gold_label):
        # input is [batch_size, num_feats]
        # gold_label is [batch_size, num_classes]
        ...
        probs = ffnn.forward(input)    # [batch_size, num_classes]
        # sum the per-example negative log likelihoods of the gold labels
        # (note: with batched inputs, the softmax inside FFNN should be over dim=1)
        loss = torch.sum(torch.neg(torch.log(probs)) * gold_label)
        ...

‣ Batch sizes from 1-100 often work well