
  • CS395T: Structured Models for NLP
    Lecture 14: Neural Network Implementation

    Greg Durrett

    Administrivia

    ‣ Project 2 due today

    ‣ Project 3 out today

    ‣ Project zip contains sample Tensorflow code demonstrating what's in today's lecture

    ‣ Sentiment analysis using feedforward neural networks plus your choice of RNNs or CNNs

    Recall: Feedforward NNs

    [Figure: feedforward network. The input f(x) (n features) is multiplied by V (a d x n matrix) and passed through a nonlinearity g (tanh, relu, …) to give the hidden layer z (d hidden units), which is multiplied by W (an m x d matrix) and passed through a softmax to give P(y|x).]

    P(y|x) = softmax(W g(V f(x)))

    Recall: Backpropagation

    [Figure: the same feedforward network, with error signals flowing backward: err(root) at the output gives the gradient ∂L/∂W, and is passed back through ∂z/∂V as err(z) to the hidden layer z and the input f(x).]

    P(y|x) = softmax(W g(V f(x)))

  • This Lecture

    ‣ Training

    ‣ Implementation details

    ‣ Word representations

    Implementation Details

    Computation Graphs

    ‣ Computing gradients is hard!

    ‣ Automatic differentiation: instrument code to keep track of derivatives (see the sketch below)

    x = x * x   --codegen-->   (x, dx) = (x * x, 2 * x * dx)

    ‣ In practice: we need other operations and want more control -> use an external computation graph library
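
    ‣ The transformation above can be sketched directly in plain Python. This is a minimal illustration of forward-mode automatic differentiation using a hypothetical Dual class (not part of the course code): every value carries its derivative, and each operation updates both.

    # A minimal sketch (plain Python, hypothetical class) of forward-mode
    # automatic differentiation: each value is paired with its derivative.
    class Dual:
        def __init__(self, value, deriv):
            self.value = value   # x
            self.deriv = deriv   # dx

        def __mul__(self, other):
            # Product rule: d(x*y) = x*dy + y*dx
            return Dual(self.value * other.value,
                        self.value * other.deriv + other.value * self.deriv)

    x = Dual(3.0, 1.0)   # seed dx = 1 to differentiate with respect to x
    y = x * x            # y.value = 9.0, y.deriv = 6.0 (= 2 * x * dx)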

    Computation Graphs

    http://tmmse.xyz/content/images/2016/02/theano-computation-graph.png

    ‣ Define computation abstractly, in terms of symbols

    ‣ Can compute gradients of c with respect to (x, y, z) easily (a gradient sketch follows the Tensorflow example below)

    ‣ Useful abstraction: supports both CPU and GPU implementations

    ‣ Disadvantage: higher-level specification, so it is hard to control memory allocation and low-level implementation details

  • Tensorflow

    http://tmmse.xyz/content/images/2016/02/theano-computation-graph.png

    x = tf.placeholder(tf.float32, name="x")
    y = tf.placeholder(tf.float32, name="y")
    z = tf.placeholder(tf.float32, name="z")
    a = tf.add(x, y)
    b = tf.multiply(a, z)
    c = tf.add(b, a)

    with tf.Session() as sess:
        output = sess.run([a, c], dict_of_input_values)
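
    ‣ Gradients on this graph can be read off symbolically. A minimal sketch, assuming TF 1.x and the a, b, c graph above: tf.gradients builds new graph nodes computing dc/dx, dc/dy, dc/dz.

    # A minimal sketch (assumes TF 1.x and the graph built above):
    # tf.gradients adds nodes that compute dc/dx, dc/dy, dc/dz symbolically.
    grads = tf.gradients(c, [x, y, z])

    with tf.Session() as sess:
        # c = (x + y) * z + (x + y), so dc/dx = dc/dy = z + 1 and dc/dz = x + y
        print(sess.run(grads, {x: 1.0, y: 2.0, z: 3.0}))   # -> [4.0, 4.0, 3.0]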

    Computation Graph: FFNN

    P(y|x) = softmax(W g(V f(x)))

    fx = tf.placeholder(tf.float32, feat_vec_size)
    V = tf.get_variable("V", [hidden_size, feat_vec_size])
    z = tf.sigmoid(tf.tensordot(V, fx, 1))
    W = tf.get_variable("W", [num_classes, hidden_size])
    probs = tf.nn.softmax(tf.tensordot(W, z, 1))

    ‣ placeholder: input to the system; variable: parameter to learn

    Computation Graph: FFNN

    P(y|x) = softmax(W g(V f(x)))

    fx = tf.placeholder(tf.float32, feat_vec_size)
    V = tf.get_variable("V", [hidden_size, feat_vec_size])
    z = tf.sigmoid(tf.tensordot(V, fx, 1))
    W = tf.get_variable("W", [num_classes, hidden_size])
    probs = tf.nn.softmax(tf.tensordot(W, z, 1))

    label = tf.placeholder(tf.float32, num_classes)
    loss = tf.negative(tf.log(tf.tensordot(probs, label, 1)))

    ‣ Shortcut helper methods exist, like tf.nn.softmax_cross_entropy_with_logits

    ‣ Tensorflow can compute gradients for W and V based on the loss

    Training a Model

    Define a computation graph

    Define an operator that updates the parameters based on an example

    For each epoch:

        For each example:

            Evaluate the training operator on the example

    Decode test set
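
    ‣ A minimal sketch of this recipe in TF 1.x, assuming the fx / label / loss graph defined above; num_epochs, train_examples (a list of (feature vector, one-hot label) pairs), and test_feats are hypothetical names, not the project's actual code.

    # A minimal sketch (assumes TF 1.x and the fx/label/loss graph above;
    # num_epochs, train_examples, and test_feats are hypothetical).
    train_op = tf.train.AdamOptimizer(learning_rate=0.001).minimize(loss)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for epoch in range(num_epochs):
            for feats, gold in train_examples:
                # One gradient step on one example
                sess.run(train_op, feed_dict={fx: feats, label: gold})
        # Decode the test set: run the probs node and take the argmax
        predictions = [sess.run(probs, feed_dict={fx: feats}).argmax()
                       for feats in test_feats]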

  • Batching

    ‣ Batching data gives speedups due to more efficient matrix operations, and leads to better learning outcomes too

    ‣ Need to make the computation graph process a batch at the same time (a batched loss sketch follows below)

    fx = tf.placeholder(tf.float32, [batch_size, feat_vec_size])
    V = tf.get_variable("V", [hidden_size, feat_vec_size])
    z = tf.sigmoid(tf.tensordot(fx, V, [1, 1])) # batch_size x hidden_size
    ...
    loss = [sum over losses from batch]
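
    ‣ One way to fill in the elided loss, as a sketch only (assumes TF 1.x and a batch of one-hot labels; the lecture's actual sample code may differ): compute per-example cross-entropy with the helper mentioned earlier and average over the batch.

    # Sketch of a batched loss (assumes TF 1.x; not the project's actual code).
    W = tf.get_variable("W", [num_classes, hidden_size])
    logits = tf.tensordot(z, W, [1, 1])            # batch_size x num_classes
    labels = tf.placeholder(tf.float32, [batch_size, num_classes])  # one-hot rows
    per_example_loss = tf.nn.softmax_cross_entropy_with_logits(labels=labels,
                                                               logits=logits)
    loss = tf.reduce_mean(per_example_loss)        # average over the batch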

    Batch Training a Model

    Define a computation graph to process a batch of data

    Define an operator that updates the parameters based on a batch

    For each epoch:

        For each batch:

            Evaluate the training operator on the batch

    Decode test set in batches

    Training Tips

    Training Basics

    ‣ Basic formula: compute gradients on a batch, use a first-order optimization method

    ‣ How to initialize? How to regularize? What optimizer to use?

    ‣ This lecture: practical tricks. Take deep learning or optimization courses to understand this further

  • How does initialization affect learning?

    [Figure: the feedforward network from before: f(x) (n features), V (d x n matrix), nonlinearity g (tanh, relu, …), hidden layer z (d hidden units), W (m x d matrix), softmax, P(y|x).]

    P(y|x) = softmax(W g(V f(x)))

    ‣ How do we initialize V and W? What consequences does this have?

    ‣ Why is it important to have small activations?

    ‣ If cell activations are too large in absolute value, gradients are small too

    ‣ ReLU: larger dynamic range (all positive numbers), but can produce big values, can break down if everything is too negative

    How does initialization affect learning?

    Initialization

    1) Can't use zeroes for parameters that produce hidden layers: all values in that hidden layer are always 0 and have gradients of 0, so they never change

    2) Initialize too large and cells are saturated

    ‣ Can do random uniform/normal initialization with an appropriate scale

    ‣ Glorot initializer (a Tensorflow sketch follows below): U[-√(6 / (fan-in + fan-out)), +√(6 / (fan-in + fan-out))]

    ‣ Want the variance of inputs and gradients for each layer to be the same

    ‣ Batch normalization (Ioffe and Szegedy, 2015): periodically shift + rescale each layer to have mean 0 and variance 1 over a batch (useful if the net is deep)
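
    ‣ A minimal sketch of Glorot uniform initialization in TF 1.x, applied to the V matrix from earlier (recent TF 1.x versions also package this formula as tf.glorot_uniform_initializer):

    # Sketch of Glorot/Xavier uniform initialization for V (assumes TF 1.x).
    fan_in, fan_out = feat_vec_size, hidden_size
    scale = (6.0 / (fan_in + fan_out)) ** 0.5
    V = tf.get_variable("V", [hidden_size, feat_vec_size],
                        initializer=tf.random_uniform_initializer(-scale, scale))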

    Dropout

    ‣ Probabilistically zero out parts of the network during training to prevent overfitting; use the whole network at test time

    Srivastava et al. (2014)

    ‣ Similar to the benefits of ensembling: the network needs to be robust to missing signals, so it has redundancy

    ‣ Form of stochastic regularization

  • Dropout

    ‣ In Tensorflow: implemented as an additional layer in a network

    hidden_dropped_out = tf.nn.dropout(hidden, dropout_keep_prob)

    ‣ Often use low dropout (keep a value with probability 0.8) at the input and moderate dropout (keep with probability 0.5) internally in feedforward networks (not in RNNs); a sketch of feeding the keep probability at train vs. test time follows below
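
    ‣ To "use the whole network at test time," one common pattern (a sketch, assuming TF 1.x; not necessarily the project's setup) is to feed the keep probability in as a placeholder and set it to 1.0 when evaluating:

    # Sketch: make the keep probability a placeholder so dropout can be
    # disabled at test time (assumes TF 1.x; hidden is the layer being regularized).
    dropout_keep_prob = tf.placeholder(tf.float32)
    hidden_dropped_out = tf.nn.dropout(hidden, dropout_keep_prob)

    # Training step: keep units with probability 0.5
    # sess.run(train_op, feed_dict={..., dropout_keep_prob: 0.5})
    # Test time: keep everything (no dropout)
    # probs_val = sess.run(probs, feed_dict={..., dropout_keep_prob: 1.0})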

    Optimizer

    ‣ Adam (Kingma and Ba, ICLR 2015) is very widely used

    ‣ Adaptive step size like Adagrad, incorporates momentum

    Optimizer

    ‣ Wilson et al. (NIPS 2017): adaptive methods can actually perform badly at test time (in the figure, Adam is in pink, SGD in black)

    ‣ Check the dev set periodically and decrease the learning rate if not making progress (a sketch follows below)
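
    ‣ A minimal sketch of that schedule (assumes TF 1.x; dev_accuracy() and the data lists are hypothetical helpers): feed the learning rate in as a placeholder and halve it when dev performance stops improving.

    # Sketch of "decrease the learning rate if not making progress"
    # (assumes TF 1.x; dev_accuracy() and the data lists are hypothetical).
    lr = tf.placeholder(tf.float32)
    train_op = tf.train.GradientDescentOptimizer(lr).minimize(loss)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        current_lr, best_dev = 0.1, 0.0
        for epoch in range(num_epochs):
            for feats, gold in train_examples:
                sess.run(train_op, feed_dict={fx: feats, label: gold, lr: current_lr})
            dev_acc = dev_accuracy(sess)      # hypothetical dev-set evaluation
            if dev_acc <= best_dev:
                current_lr *= 0.5             # not improving: halve the step size
            best_dev = max(best_dev, dev_acc)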

    Visualization with Tensorboard

    ‣ Visualize the computation graph and logs of the objective over time (a logging sketch follows below)
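
    ‣ A minimal logging sketch (assumes TF 1.x; the "logs/" directory name is arbitrary): write the graph and a scalar summary of the loss, then point tensorboard --logdir at the directory.

    # Sketch of Tensorboard logging (assumes TF 1.x; "logs/" is an arbitrary path).
    loss_summary = tf.summary.scalar("loss", loss)

    with tf.Session() as sess:
        writer = tf.summary.FileWriter("logs/", sess.graph)  # also logs the graph
        sess.run(tf.global_variables_initializer())
        for step, (feats, gold) in enumerate(train_examples):
            _, summ = sess.run([train_op, loss_summary],
                               feed_dict={fx: feats, label: gold})
            writer.add_summary(summ, step)
        writer.close()
    # View with: tensorboard --logdir logs/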

  • Structured Prediction

    ‣ Four elements of a structured machine learning method:

    ‣ Model: feedforward, RNNs, CNNs can be defined in a uniform framework

    ‣ Objective: many loss functions look similar; it just changes the last layer of the neural network

    ‣ Inference: define the network, Tensorflow takes care of it (mostly…)

    ‣ Training: lots of choices for optimization/hyperparameters

    Word Representations

    Word Representations

    ‣ Continuous model expects continuous semantics from input

    ‣ Neural networks work very well on continuous data, but words are discrete

    Word Embeddings

    Fed raises interest rates in order to …

    [Figure: the input f(x) is formed by concatenating embeddings of the words in a window: emb(raises) for the previous word, emb(interest) for the current word, emb(rates) for the next word, plus other words, feats, etc.]

    ‣ Word embeddings for each word form the input

    ‣ Part-of-speech tagging with FFNNs (see the sketch below)

    Botha et al. (2017)
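
    ‣ A minimal sketch of this input layer (assumes TF 1.x; the window size and names are illustrative, not Botha et al.'s actual architecture): look up an embedding for each word in the window and concatenate them into f(x).

    # Sketch of the windowed embedding input for POS tagging (assumes TF 1.x;
    # names and sizes are illustrative).
    window = tf.placeholder(tf.int32, [3])   # [prev word id, curr word id, next word id]
    embeddings = tf.get_variable("embed", [voc_size, embedding_size])
    looked_up = tf.nn.embedding_lookup(embeddings, window)   # 3 x embedding_size
    fx = tf.reshape(looked_up, [3 * embedding_size])         # concatenated f(x)
    # fx then feeds the feedforward network from earlier:
    # z = tf.sigmoid(tf.tensordot(V, fx, 1)), etc.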

  • Word Embeddings

    [Figure: 2D embedding space with similar words close together: good, enjoyable, and great in one region; bad in another; dog and is elsewhere.]

    ‣ Want a vector space where similar words have similar embeddings

    the movie was great  ~  the movie was good

    Word Representations

    ‣ Continuous model expects continuous semantics from input

    ‣ Neural networks work very well on continuous data, but words are discrete

    ‣ "Can tell a word by the company it keeps" (Firth, 1957)

    Continuous Bag-of-Words (CBOW)

    ‣ Predict a word from its context:  the dog bit the man

    [Figure: the embeddings of the context words "the" and "dog" are summed into a size-d vector, multiplied by W, and passed through a softmax to give P(w | w-1, w+1); gold = bit.]

    ‣ Parameters: d x |V| word vectors, |V| x d output parameters (W)

    ‣ Maximize likelihood of the gold labels (no manual labeling required!)

    Mikolov et al. (2013)

    Skip-Gram

    ‣ Predict one word of context from the word:  the dog bit the man

    [Figure: the embedding of "bit" is multiplied by W and passed through a softmax; gold = dog.]

    ‣ Parameters: d x |V| word vectors, |V| x d output parameters (W)

    ‣ Another training example: bit -> the

    Mikolov et al. (2013)

  • Skip-Gram with Negative Sampling

    the dog bit the man

    ‣ Problem: we want to train on 1B+ words, and multiplying by a |V| x d matrix for each is too expensive

    ‣ Solution: take (word, context) pairs and classify them as "real" or not, creating random negative examples by sampling (a sketch of this objective follows below):

    (bit, the) => +1    (bit, dog) => +1
    (bit, a)   => -1    (bit, fish) => -1

    P(pos | w, c) = e^(w·c) / (e^(w·c) + 1)

    ‣ Words in similar contexts select for similar c vectors

    ‣ d x |V| word vectors, d x |V| context vectors (same # of parameters as before)

    Mikolov et al. (2013)
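
    ‣ A small sketch of this objective for a single (word, context) pair in plain Python/numpy (illustrative only, not the word2vec implementation):

    # Sketch of the SGNS objective for one (word, context) pair (numpy only;
    # illustrative, not the word2vec implementation).
    import numpy as np

    def p_pos(w_vec, c_vec):
        """P(pos | w, c) = e^(w.c) / (e^(w.c) + 1), i.e. the logistic sigmoid."""
        score = np.exp(np.dot(w_vec, c_vec))
        return score / (score + 1.0)

    def pair_loss(w_vec, c_vec, label):
        """Negative log-likelihood: label is +1 for a real pair, -1 for a sampled one."""
        p = p_pos(w_vec, c_vec)
        return -np.log(p) if label == +1 else -np.log(1.0 - p)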

    Regularities in Vector Space

    [Figure: king, queen, man, woman in the embedding space, with parallel offsets.]

    (king - man) + woman = queen

    ‣ Why would this be?

    ‣ woman - man captures the difference in the contexts that these occur in

    king + (woman - man) = queen

    ‣ Dominant change: more "he" with man and "she" with woman, similar to the difference between king and queen

    GloVe

    ‣ Word co-occurrences are what matter directly

    ‣ Weighted least-squares problem to directly predict the word co-occurrence matrix (like matrix factorization)

    Pennington et al. (2014)

    Using Word Embeddings

    ‣ Approach 1: learn embeddings as parameters from your data

    ‣ Approach 2: initialize using GloVe/CBOW/SGNS, keep fixed (faster because there is no need to update these parameters)

    ‣ Approach 3: initialize using GloVe/CBOW/SGNS, fine-tune (typically works best); a sketch of approaches 2 and 3 follows below

    # Indexed sentence of length sent_len, e.g.: [12, 36, 47, 8]
    input_words = tf.placeholder(tf.int32, [sent_len])
    encoder = tf.get_variable("embed", [voc_size, embedding_size])
    embedded_input_words = tf.nn.embedding_lookup(encoder, input_words)
    # embedded_input_words: sent_len x embedding_size tensor
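
    ‣ A minimal sketch of approaches 2 and 3 (assumes TF 1.x and a hypothetical pretrained_matrix numpy array of shape [voc_size, embedding_size], e.g. loaded from GloVe vectors):

    # Sketch of initializing embeddings from pretrained vectors (assumes TF 1.x;
    # pretrained_matrix is a hypothetical [voc_size, embedding_size] numpy array).
    encoder = tf.get_variable("embed", [voc_size, embedding_size],
                              initializer=tf.constant_initializer(pretrained_matrix),
                              trainable=True)   # Approach 3: fine-tune
    # Approach 2: pass trainable=False instead so the vectors stay fixed,
    # which keeps them out of the optimizer's updates and speeds up training.
    embedded_input_words = tf.nn.embedding_lookup(encoder, input_words)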

  • Takeaways

    ‣ Lots to tune with neural networks

    ‣ Training: optimizer, initializer, regularization (dropout), …

    ‣ Hyperparameters: dimensionality of word embeddings, layers, …

    ‣ Word vectors: various choices of pre-trained vectors work well as initializers

    ‣ Next time: RNNs/LSTMs/GRUs