
  • CS395T: Structured Models for NLP
    Lecture 14: Neural Network Implementation

    Greg Durrett

    Administrivia

    ‣ Project 2 due today

    ‣ Project 3 out today

    ‣ Project zip contains sample Tensorflow code demonstrating what's in today's lecture

    ‣ Sentiment analysis using feedforward neural networks plus your choice of RNNs or CNNs

    Recall: Feedforward NNs

    [Figure: feedforward network. The input f(x) (n features) is multiplied by V (a d x n matrix) and passed through a nonlinearity g (tanh, relu, …) to give the hidden layer z (d hidden units), which is multiplied by W (an m x d matrix) and passed through a softmax to give P(y|x).]

    P(y|x) = softmax(W g(V f(x)))

    Recall: Backpropagation

    [Figure: the same feedforward network, with error signals flowing backward: err(root) at the output gives the gradient ∂L/∂W, and is passed back through ∂z/∂V as err(z) to the hidden layer z and the input f(x).]

    P(y|x) = softmax(W g(V f(x)))

  • This Lecture

    ‣ Training

    ‣ Implementation details

    ‣ Word representations

    Implementation Details

    Computation Graphs

    ‣ Computing gradients is hard!

    ‣ Automatic differentiation: instrument code to keep track of derivatives (see the sketch below)

    x = x * x   --codegen-->   (x, dx) = (x * x, 2 * x * dx)

    ‣ In practice: we need other operations and want more control -> use an external computation graph library
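
    ‣ The transformation above can be sketched directly in plain Python. This is a minimal illustration of forward-mode automatic differentiation using a hypothetical Dual class (not part of the course code): every value carries its derivative, and each operation updates both.

    # A minimal sketch (plain Python, hypothetical class) of forward-mode
    # automatic differentiation: each value is paired with its derivative.
    class Dual:
        def __init__(self, value, deriv):
            self.value = value   # x
            self.deriv = deriv   # dx

        def __mul__(self, other):
            # Product rule: d(x*y) = x*dy + y*dx
            return Dual(self.value * other.value,
                        self.value * other.deriv + other.value * self.deriv)

    x = Dual(3.0, 1.0)   # seed dx = 1 to differentiate with respect to x
    y = x * x            # y.value = 9.0, y.deriv = 6.0 (= 2 * x * dx)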

    Computation Graphs

    http://tmmse.xyz/content/images/2016/02/theano-computation-graph.png

    ‣ Define computation abstractly, in terms of symbols

    ‣ Can compute gradients of c with respect to (x, y, z) easily (a gradient sketch follows the Tensorflow example below)

    ‣ Useful abstraction: supports both CPU and GPU implementations

    ‣ Disadvantage: higher-level specification, so it is hard to control memory allocation and low-level implementation details

  • Tensorflow

    http://tmmse.xyz/content/images/2016/02/theano-computation-graph.png

    x = tf.placeholder(tf.float32, name="x")
    y = tf.placeholder(tf.float32, name="y")
    z = tf.placeholder(tf.float32, name="z")
    a = tf.add(x, y)
    b = tf.multiply(a, z)
    c = tf.add(b, a)

    with tf.Session() as sess:
        output = sess.run([a, c], dict_of_input_values)
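
    ‣ Gradients on this graph can be read off symbolically. A minimal sketch, assuming TF 1.x and the a, b, c graph above: tf.gradients builds new graph nodes computing dc/dx, dc/dy, dc/dz.

    # A minimal sketch (assumes TF 1.x and the graph built above):
    # tf.gradients adds nodes that compute dc/dx, dc/dy, dc/dz symbolically.
    grads = tf.gradients(c, [x, y, z])

    with tf.Session() as sess:
        # c = (x + y) * z + (x + y), so dc/dx = dc/dy = z + 1 and dc/dz = x + y
        print(sess.run(grads, {x: 1.0, y: 2.0, z: 3.0}))   # -> [4.0, 4.0, 3.0]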

    Computation Graph: FFNN

    P(y|x) = softmax(W g(V f(x)))

    fx = tf.placeholder(tf.float32, feat_vec_size)
    V = tf.get_variable("V", [hidden_size, feat_vec_size])
    z = tf.sigmoid(tf.tensordot(V, fx, 1))
    W = tf.get_variable("W", [num_classes, hidden_size])
    probs = tf.nn.softmax(tf.tensordot(W, z, 1))

    ‣ placeholder: input to the system; variable: parameter to learn

    Computation Graph: FFNN

    P(y|x) = softmax(W g(V f(x)))

    fx = tf.placeholder(tf.float32, feat_vec_size)
    V = tf.get_variable("V", [hidden_size, feat_vec_size])
    z = tf.sigmoid(tf.tensordot(V, fx, 1))
    W = tf.get_variable("W", [num_classes, hidden_size])
    probs = tf.nn.softmax(tf.tensordot(W, z, 1))

    label = tf.placeholder(tf.float32, num_classes)
    loss = tf.negative(tf.log(tf.tensordot(probs, label, 1)))

    ‣ Shortcut helper methods exist, like tf.nn.softmax_cross_entropy_with_logits

    ‣ Tensorflow can compute gradients for W and V based on the loss

    Training a Model

    Define a computation graph

    Define an operator that updates the parameters based on an example

    For each epoch:

        For each example:

            Evaluate the training operator on the example

    Decode test set
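
    ‣ A minimal sketch of this recipe in TF 1.x, assuming the fx / label / loss graph defined above; num_epochs, train_examples (a list of (feature vector, one-hot label) pairs), and test_feats are hypothetical names, not the project's actual code.

    # A minimal sketch (assumes TF 1.x and the fx/label/loss graph above;
    # num_epochs, train_examples, and test_feats are hypothetical).
    train_op = tf.train.AdamOptimizer(learning_rate=0.001).minimize(loss)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for epoch in range(num_epochs):
            for feats, gold in train_examples:
                # One gradient step on one example
                sess.run(train_op, feed_dict={fx: feats, label: gold})
        # Decode the test set: run the probs node and take the argmax
        predictions = [sess.run(probs, feed_dict={fx: feats}).argmax()
                       for feats in test_feats]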

  • Batching

    ‣ Batching data gives speedups due to more efficient matrix operations, and leads to better learning outcomes too

    ‣ Need to make the computation graph process a batch at the same time (a batched loss sketch follows below)

    fx = tf.placeholder(tf.float32, [batch_size, feat_vec_size])
    V = tf.get_variable("V", [hidden_size, feat_vec_size])
    z = tf.sigmoid(tf.tensordot(fx, V, [1, 1])) # batch_size x hidden_size
    ...
    loss = [sum over losses from batch]
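
    ‣ One way to fill in the elided loss, as a sketch only (assumes TF 1.x and a batch of one-hot labels; the lecture's actual sample code may differ): compute per-example cross-entropy with the helper mentioned earlier and average over the batch.

    # Sketch of a batched loss (assumes TF 1.x; not the project's actual code).
    W = tf.get_variable("W", [num_classes, hidden_size])
    logits = tf.tensordot(z, W, [1, 1])            # batch_size x num_classes
    labels = tf.placeholder(tf.float32, [batch_size, num_classes])  # one-hot rows
    per_example_loss = tf.nn.softmax_cross_entropy_with_logits(labels=labels,
                                                               logits=logits)
    loss = tf.reduce_mean(per_example_loss)        # average over the batch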

    Batch Training a Model

    Define a computation graph to process a batch of data

    Define an operator that updates the parameters based on a batch

    For each epoch:

        For each batch:

            Evaluate the training operator on the batch

    Decode test set in batches

    Training Tips

    Training Basics

    ‣ Basic formula: compute gradients on a batch, use a first-order optimization method

    ‣ How to initialize? How to regularize? What optimizer to use?

    ‣ This lecture: practical tricks. Take deep learning or optimization courses to understand this further

  • How does initialization affect learning?

    [Figure: the feedforward network from before: f(x) (n features), V (d x n matrix), nonlinearity g (tanh, relu, …), hidden layer z (d hidden units), W (m x d matrix), softmax, P(y|x).]

    P(y|x) = softmax(W g(V f(x)))

    ‣ How do we initialize V and W? What consequences does this have?

    ‣ Why is it important to have small activations?

    ‣ If cell activations are too large in absolute value, gradients are small too

    ‣ ReLU: larger dynamic range (all positive numbers), but can produce big values, can break down if everything is too negative

    How does initialization affect learning?

    Initialization

    1) Can't use zeroes for parameters that produce hidden layers: all values in that hidden layer are always 0 and have gradients of 0, so they never change

    2) Initialize too large and cells are saturated

    ‣ Can do random uniform/normal initialization with an appropriate scale

    ‣ Glorot initializer (a Tensorflow sketch follows below): U[-√(6 / (fan-in + fan-out)), +√(6 / (fan-in + fan-out))]

    ‣ Want the variance of inputs and gradients for each layer to be the same

    ‣ Batch normalization (Ioffe and Szegedy, 2015): periodically shift + rescale each layer to have mean 0 and variance 1 over a batch (useful if the net is deep)
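
    ‣ A minimal sketch of Glorot uniform initialization in TF 1.x, applied to the V matrix from earlier (recent TF 1.x versions also package this formula as tf.glorot_uniform_initializer):

    # Sketch of Glorot/Xavier uniform initialization for V (assumes TF 1.x).
    fan_in, fan_out = feat_vec_size, hidden_size
    scale = (6.0 / (fan_in + fan_out)) ** 0.5
    V = tf.get_variable("V", [hidden_size, feat_vec_size],
                        initializer=tf.random_uniform_initializer(-scale, scale))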

    Dropout

    ‣ Probabilistically zero out parts of the network during training to prevent overfitting; use the whole network at test time

    Srivastava et al. (2014)

    ‣ Similar to the benefits of ensembling: the network needs to be robust to missing signals, so it has redundancy

    ‣ Form of stochastic regularization

  • Dropout

    ‣ In Tensorflow: implemented as an additional layer in a network

    hidden_dropped_out = tf.nn.dropout(hidden, dropout_keep_prob)

    ‣ Often use low dropout (keep a value with probability 0.8) at the input and moderate dropout (keep with probability 0.5) internally in feedforward networks (not in RNNs); a sketch of feeding the keep probability at train vs. test time follows below
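
    ‣ To "use the whole network at test time," one common pattern (a sketch, assuming TF 1.x; not necessarily the project's setup) is to feed the keep probability in as a placeholder and set it to 1.0 when evaluating:

    # Sketch: make the keep probability a placeholder so dropout can be
    # disabled at test time (assumes TF 1.x; hidden is the layer being regularized).
    dropout_keep_prob = tf.placeholder(tf.float32)
    hidden_dropped_out = tf.nn.dropout(hidden, dropout_keep_prob)

    # Training step: keep units with probability 0.5
    # sess.run(train_op, feed_dict={..., dropout_keep_prob: 0.5})
    # Test time: keep everything (no dropout)
    # probs_val = sess.run(probs, feed_dict={..., dropout_keep_prob: 1.0})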

    Optimizer

    ‣ Adam (Kingma and Ba, ICLR 2015) is very widely used

    ‣ Adaptive step size like Adagrad, incorporates momentum

    Optimizer

    ‣ Wilson et al. (NIPS 2017): adaptive methods can actually perform badly at test time (in the figure, Adam is in pink, SGD in black)

    ‣ Check the dev set periodically and decrease the learning rate if not making progress (a sketch follows below)
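
    ‣ A minimal sketch of that schedule (assumes TF 1.x; dev_accuracy() and the data lists are hypothetical helpers): feed the learning rate in as a placeholder and halve it when dev performance stops improving.

    # Sketch of "decrease the learning rate if not making progress"
    # (assumes TF 1.x; dev_accuracy() and the data lists are hypothetical).
    lr = tf.placeholder(tf.float32)
    train_op = tf.train.GradientDescentOptimizer(lr).minimize(loss)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        current_lr, best_dev = 0.1, 0.0
        for epoch in range(num_epochs):
            for feats, gold in train_examples:
                sess.run(train_op, feed_dict={fx: feats, label: gold, lr: current_lr})
            dev_acc = dev_accuracy(sess)      # hypothetical dev-set evaluation
            if dev_acc <= best_dev:
                current_lr *= 0.5             # not improving: halve the step size
            best_dev = max(best_dev, dev_acc)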

    Visualization with Tensorboard

    ‣ Visualize the computation graph and logs of the objective over time (a logging sketch follows below)
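
    ‣ A minimal logging sketch (assumes TF 1.x; the "logs/" directory name is arbitrary): write the graph and a scalar summary of the loss, then point tensorboard --logdir at the directory.

    # Sketch of Tensorboard logging (assumes TF 1.x; "logs/" is an arbitrary path).
    loss_summary = tf.summary.scalar("loss", loss)

    with tf.Session() as sess:
        writer = tf.summary.FileWriter("logs/", sess.graph)  # also logs the graph
        sess.run(tf.global_variables_initializer())
        for step, (feats, gold) in enumerate(train_examples):
            _, summ = sess.run([train_op, loss_summary],
                               feed_dict={fx: feats, label: gold})
            writer.add_summary(summ, step)
        writer.close()
    # View with: tensorboard --logdir logs/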

  • Structured Prediction

    ‣ Four elements of a structured machine learning method:

    ‣ Model: feedforward, RNNs, CNNs can be defined in a uniform framework

    ‣ Objective: many loss functions look similar; it just changes the last layer of the neural network

    ‣ Inference: define the network, Tensorflow takes care of it (mostly…)

    ‣ Training: lots of choices for optimization/hyperparameters

    Word Representations

    Word Representations

    ‣ Continuous model expects continuous semantics from input

    ‣ Neural networks work very well on continuous data, but words are discrete

    Word Embeddings

    Fed raises interest rates in order to …

    [Figure: the input f(x) is formed by concatenating embeddings of the words in a window: emb(raises) for the previous word, emb(interest) for the current word, emb(rates) for the next word, plus other words, feats, etc.]

    ‣ Word embeddings for each word form the input

    ‣ Part-of-speech tagging with FFNNs (see the sketch below)

    Botha et al. (2017)
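
    ‣ A minimal sketch of this input layer (assumes TF 1.x; the window size and names are illustrative, not Botha et al.'s actual architecture): look up an embedding for each word in the window and concatenate them into f(x).

    # Sketch of the windowed embedding input for POS tagging (assumes TF 1.x;
    # names and sizes are illustrative).
    window = tf.placeholder(tf.int32, [3])   # [prev word id, curr word id, next word id]
    embeddings = tf.get_variable("embed", [voc_size, embedding_size])
    looked_up = tf.nn.embedding_lookup(embeddings, window)   # 3 x embedding_size
    fx = tf.reshape(looked_up, [3 * embedding_size])         # concatenated f(x)
    # fx then feeds the feedforward network from earlier:
    # z = tf.sigmoid(tf.tensordot(V, fx, 1)), etc.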

  • Word Embeddings

    [Figure: 2D embedding space with similar words close together: good, enjoyable, and great in one region; bad in another; dog and is elsewhere.]

    ‣ Want a vector space where similar words have similar embeddings

    the movie was great  ~  the movie was good

    Word Representations

    ‣ Continuous model expects continuous semantics from input

    ‣ Neural networks work very well on continuous data, but words are discrete

    ‣ "Can tell a word by the company it keeps" (Firth, 1957)

    Continuous Bag-of-Words (CBOW)

    ‣ Predict a word from its context:  the dog bit the man

    [Figure: the embeddings of the context words "the" and "dog" are summed into a size-d vector, multiplied by W, and passed through a softmax to give P(w | w-1, w+1); gold = bit.]

    ‣ Parameters: d x |V| word vectors, |V| x d output parameters (W)

    ‣ Maximize likelihood of the gold labels (no manual labeling required!)

    Mikolov et al. (2013)

    Skip-Gram

    ‣ Predict one word of context from the word:  the dog bit the man

    [Figure: the embedding of "bit" is multiplied by W and passed through a softmax; gold = dog.]

    ‣ Parameters: d x |V| word vectors, |V| x d output parameters (W)

    ‣ Another training example: bit -> the

    Mikolov et al. (2013)

  • Skip-Gram with Negative Sampling

    the dog bit the man

    ‣ Problem: we want to train on 1B+ words, and multiplying by a |V| x d matrix for each is too expensive

    ‣ Solution: take (word, context) pairs and classify them as "real" or not, creating random negative examples by sampling (a sketch of this objective follows below):

    (bit, the) => +1    (bit, dog) => +1
    (bit, a)   => -1    (bit, fish) => -1

    P(pos | w, c) = e^(w·c) / (e^(w·c) + 1)

    ‣ Words in similar contexts select for similar c vectors

    ‣ d x |V| word vectors, d x |V| context vectors (same # of parameters as before)

    Mikolov et al. (2013)
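
    ‣ A small sketch of this objective for a single (word, context) pair in plain Python/numpy (illustrative only, not the word2vec implementation):

    # Sketch of the SGNS objective for one (word, context) pair (numpy only;
    # illustrative, not the word2vec implementation).
    import numpy as np

    def p_pos(w_vec, c_vec):
        """P(pos | w, c) = e^(w.c) / (e^(w.c) + 1), i.e. the logistic sigmoid."""
        score = np.exp(np.dot(w_vec, c_vec))
        return score / (score + 1.0)

    def pair_loss(w_vec, c_vec, label):
        """Negative log-likelihood: label is +1 for a real pair, -1 for a sampled one."""
        p = p_pos(w_vec, c_vec)
        return -np.log(p) if label == +1 else -np.log(1.0 - p)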

    Regularities in Vector Space

    [Figure: king, queen, man, woman in the embedding space, with parallel offsets.]

    (king - man) + woman = queen

    ‣ Why would this be?

    ‣ woman - man captures the difference in the contexts that these occur in

    king + (woman - man) = queen

    ‣ Dominant change: more "he" with man and "she" with woman, similar to the difference between king and queen

    GloVe

    ‣ Word co-occurrences are what matter directly

    ‣ Weighted least-squares problem to directly predict the word co-occurrence matrix (like matrix factorization)

    Pennington et al. (2014)

    Using Word Embeddings

    ‣ Approach 1: learn embeddings as parameters from your data

    ‣ Approach 2: initialize using GloVe/CBOW/SGNS, keep fixed (faster because there is no need to update these parameters)

    ‣ Approach 3: initialize using GloVe/CBOW/SGNS, fine-tune (typically works best); a sketch of approaches 2 and 3 follows below

    # Indexed sentence of length sent_len, e.g.: [12, 36, 47, 8]
    input_words = tf.placeholder(tf.int32, [sent_len])
    encoder = tf.get_variable("embed", [voc_size, embedding_size])
    embedded_input_words = tf.nn.embedding_lookup(encoder, input_words)
    # embedded_input_words: sent_len x embedding_size tensor
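
    ‣ A minimal sketch of approaches 2 and 3 (assumes TF 1.x and a hypothetical pretrained_matrix numpy array of shape [voc_size, embedding_size], e.g. loaded from GloVe vectors):

    # Sketch of initializing embeddings from pretrained vectors (assumes TF 1.x;
    # pretrained_matrix is a hypothetical [voc_size, embedding_size] numpy array).
    encoder = tf.get_variable("embed", [voc_size, embedding_size],
                              initializer=tf.constant_initializer(pretrained_matrix),
                              trainable=True)   # Approach 3: fine-tune
    # Approach 2: pass trainable=False instead so the vectors stay fixed,
    # which keeps them out of the optimizer's updates and speeds up training.
    embedded_input_words = tf.nn.embedding_lookup(encoder, input_words)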

  • Takeaways

    ‣ Lots to tune with neural networks

    ‣ Training: optimizer, initializer, regularization (dropout), …

    ‣ Hyperparameters: dimensionality of word embeddings, layers, …

    ‣ Word vectors: various choices of pre-trained vectors work well as initializers

    ‣ Next time: RNNs/LSTMs/GRUs