RNN Review & Hierarchical Attention Networks
SHANG GAO
web.eecs.utk.edu/~hqi/deeplearning/lecture17-rnn-han.pdf

TRANSCRIPT

Page 1: RNN Review & Hierarchical Attention Networks

SHANG GAO

Page 2: Overview

◦ Review of Recurrent Neural Networks
◦ Advanced RNN Architectures
  ◦ Long Short-Term Memory
  ◦ Gated Recurrent Units
◦ RNNs for Natural Language Processing
  ◦ Word Embeddings
  ◦ NLP Applications
◦ Attention Mechanisms
◦ Hierarchical Attention Networks

Page 3: Feedforward Neural Networks

In a regular feedforward network, each neuron takes in inputs from the neurons in the previous layer and then passes its output to the neurons in the next layer.

The neurons at the end make a classification based only on the data from the current input.

Page 4: Recurrent Neural Networks

In a recurrent neural network, each neuron takes in data from the previous layer AND its own output from the previous time step.

The neurons at the end make a classification decision based NOT ONLY on the input at the current time step BUT ALSO on the input from all time steps before it.

Recurrent neural networks can thus capture patterns over time (e.g. weather, stock market data, speech audio, natural language).

Page 5: Recurrent Neural Networks

In the example below, the neuron at the first time step takes in an input and generates an output.

The neuron at the second time step takes in an input AND ALSO the output from the first time step to make its decision.

The neuron at the third time step takes in an input and also the output from the second time step (which accounted for data from the first time step), so its output is affected by data from both the first and second time steps.

Page 6: Recurrent Neural Networks

Traditional neuron:
output = sigmoid(weights * input + bias)

Recurrent neuron:
output = sigmoid(weights1 * input + weights2 * previous_output + bias)
or
output = sigmoid(weights * concat(input, previous_output) + bias)
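A minimal NumPy sketch of the recurrent-neuron update above; the dimensions, random weights, and sigmoid helper are assumptions for illustration only:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy dimensions (assumed for illustration)
input_size, hidden_size = 4, 3
rng = np.random.default_rng(0)
W_x = rng.normal(size=(hidden_size, input_size))   # weights1: current input -> output
W_h = rng.normal(size=(hidden_size, hidden_size))  # weights2: previous output -> output
b = np.zeros(hidden_size)

def rnn_step(x, prev_output):
    # output = sigmoid(weights1*input + weights2*previous_output + bias)
    return sigmoid(W_x @ x + W_h @ prev_output + b)

# Run over a short sequence: each output feeds into the next time step
h = np.zeros(hidden_size)
for x in rng.normal(size=(5, input_size)):  # 5 time steps of dummy input
    h = rnn_step(x, h)
```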

Page 7: Toy RNN Example: Adding Binary

At each time step, the RNN takes in two values representing binary input.

At each time step, the RNN outputs the sum of the two binary values, taking into account any carryover from the previous time step.

Page 8: Problems with Basic RNNs

In a basic RNN, new data is written into each cell at every time step.

Data from time steps very early on gets diluted because it is written over so many times.

In the example below, data from the first time step is read into the RNN.

At each subsequent time step, the RNN factors in data from the current time step.

By the end of the RNN, the data from the first time step has very little impact on the output of the RNN.

Page 9: Problems with Basic RNNs

Basic RNN cells can't retain information across a large number of time steps.

Depending on the problem, RNNs can lose data in as few as 3-5 time steps.

This causes problems on tasks where information needs to be retained over a long time.

For example, in natural language processing, the meaning of a pronoun may depend on what was stated in a previous sentence.

Page 10: Long Short-Term Memory

Long Short-Term Memory cells are advanced RNN cells that address the problem of long-term dependencies.

Instead of always writing to each cell at every time step, each unit has an internal 'memory' that can be written to selectively.

Page 11: Long Short-Term Memory

Input from the current time step is written to the internal memory based on how relevant it is to the problem (relevance is learned during training through backpropagation).

If the input isn't relevant, no data is written into the cell.

This way, data can be preserved over many time steps and retrieved when it is needed.

Page 12: Long Short-Term Memory

Movement of data into and out of an LSTM cell is controlled by "gates".

The "forget gate" outputs a value between 0 (delete) and 1 (keep) and controls how much of the internal memory to keep from the previous time step.

For example, at the end of a sentence, when a '.' is encountered, we may want to reset the internal memory of the cell.

Page 13: Long Short-Term Memory

The "candidate value" is the processed input value from the current time step that may be added to memory.
◦ Note that tanh activation is used for the "candidate value" to allow for negative values to subtract from memory

The "input gate" outputs a value between 0 (delete) and 1 (keep) and controls how much of the candidate value to add to memory.

Page 14: Long Short-Term Memory

Combined, the "input gate" and "candidate value" determine what new data gets written into memory.

The "forget gate" determines how much of the previous memory to retain.

The new memory of the LSTM cell is the "forget gate" * the previous memory state + the "input gate" * the "candidate value" from the current time step.

Page 15: Long Short-Term Memory

The LSTM cell does not output the contents of its memory directly to the next layer.
◦ Stored data in memory might not be relevant for the current time step, e.g., a cell can store a pronoun reference and only output it when the pronoun appears

Instead, an "output gate" outputs a value between 0 and 1 that determines how much of the memory to output.

The memory goes through a final tanh activation before being passed to the next layer.
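Putting the gate descriptions above together, here is a minimal NumPy sketch of one LSTM cell step; the weight names, shapes, and single concatenated input are assumptions for illustration, and real implementations fuse these operations for speed:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

input_size, hidden_size = 4, 3
rng = np.random.default_rng(0)
# One weight matrix and bias per gate, acting on [input, previous_output]
W_f, W_i, W_c, W_o = (rng.normal(size=(hidden_size, input_size + hidden_size)) for _ in range(4))
b_f, b_i, b_c, b_o = (np.zeros(hidden_size) for _ in range(4))

def lstm_step(x, prev_output, prev_memory):
    z = np.concatenate([x, prev_output])
    f = sigmoid(W_f @ z + b_f)             # forget gate: how much old memory to keep
    i = sigmoid(W_i @ z + b_i)             # input gate: how much of the candidate to add
    c_tilde = np.tanh(W_c @ z + b_c)       # candidate value (tanh allows negative updates)
    memory = f * prev_memory + i * c_tilde # new memory = forget*old + input*candidate
    o = sigmoid(W_o @ z + b_o)             # output gate: how much memory to expose
    output = o * np.tanh(memory)           # final tanh before passing to the next layer
    return output, memory
```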

Page 16: Gated Recurrent Units

Gated Recurrent Units are very similar to LSTMs but use two gates instead of three.

The "update gate" determines how much of the previous memory to keep.

The "reset gate" determines how to combine the new input with the previous memory.

The entire internal memory is output without an additional activation.
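For comparison, a minimal sketch of one GRU step under the same assumptions; the update convention shown here (new state = (1 - update) * old + update * candidate) is one common formulation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

input_size, hidden_size = 4, 3
rng = np.random.default_rng(0)
W_z, W_r, W_h = (rng.normal(size=(hidden_size, input_size + hidden_size)) for _ in range(3))

def gru_step(x, prev_h):
    z = sigmoid(W_z @ np.concatenate([x, prev_h]))            # update gate: keep old vs take new
    r = sigmoid(W_r @ np.concatenate([x, prev_h]))            # reset gate: how much old state to use
    h_tilde = np.tanh(W_h @ np.concatenate([x, r * prev_h]))  # candidate state
    return (1.0 - z) * prev_h + z * h_tilde                   # whole state is the output, no extra activation
```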

Page 17: LSTMs vs GRUs

Greff, et al. (2015) compared LSTMs and GRUs and found they perform about the same.

Jozefowicz, et al. (2015) generated more than ten thousand variants of RNNs and determined that depending on the task, some may perform better than LSTMs.

GRUs train faster than LSTMs because they are less complex.

Generally speaking, tuning hyperparameters (e.g. number of units, size of weights) will probably affect performance more than picking between GRU and LSTM.

Page 18: RNNs for Natural Language Processing

The natural input for a neural network is a vector of numeric values (e.g. pixel intensities for imaging or audio frequencies for speech recognition).

How do you feed language as input into a neural network?

The most basic solution is one-hot encoding (sketched after this list):
◦ A long vector (equal to the length of your vocabulary) where each index represents one word in the vocabulary
◦ For each word, the index corresponding to that word is set to 1, and everything else is set to 0
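A tiny sketch of one-hot encoding; the toy vocabulary is an assumption for illustration:

```python
import numpy as np

# Toy vocabulary (assumed for illustration)
vocab = ["happy", "joyful", "pleased", "sad"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    # Vector as long as the vocabulary: 1 at the word's index, 0 everywhere else
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("joyful"))  # [0. 1. 0. 0.]
```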

Page 19: One-Hot Encoding LSTM Example

Trained an LSTM to predict the next character given a sequence of characters.

Training corpus: all books in the Hitchhiker's Guide to the Galaxy series.

One-hot encoding used to convert each character into a vector.

72 possible characters: lowercase letters, uppercase letters, numbers, and punctuation.

The input vector is fed into a layer of 256 LSTM nodes.

The LSTM output is fed into a softmax layer that predicts the following character.

The character with the highest softmax probability is chosen as the next character.
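A minimal PyTorch sketch of this kind of character-level model; the 72-character vocabulary and 256 LSTM units come from the slide, while everything else (model layout, dummy input, the omitted training loop) is an assumption:

```python
import torch
import torch.nn as nn

VOCAB_SIZE, HIDDEN_SIZE = 72, 256  # from the slide: 72 characters, 256 LSTM nodes

class CharLSTM(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(input_size=VOCAB_SIZE, hidden_size=HIDDEN_SIZE, batch_first=True)
        self.out = nn.Linear(HIDDEN_SIZE, VOCAB_SIZE)  # scores over the next character

    def forward(self, one_hot_chars):           # (batch, seq_len, VOCAB_SIZE)
        lstm_out, _ = self.lstm(one_hot_chars)  # (batch, seq_len, HIDDEN_SIZE)
        return self.out(lstm_out[:, -1, :])     # logits for the character after the sequence

model = CharLSTM()
x = torch.zeros(1, 10, VOCAB_SIZE)              # a dummy one-hot sequence of 10 characters
x[0, torch.arange(10), torch.randint(VOCAB_SIZE, (10,))] = 1.0
next_char = model(x).softmax(dim=-1).argmax(dim=-1)  # pick the highest-probability character
```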

Page 20: Generated Samples

700 iterations: aeae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae aeae ae ae ae ae

4200 iterations: the sand and the said the sand and the said the sand and the said the sand and the said the sand and the said the

36000 iterations: seared to be a little was a small beach of the ship was a small beach of the ship was a small beach of the ship

100000 iterations: the second the stars is the stars to the stars in the stars that he had been so the ship had been so the ship had been

290000 iterations: started to run a computer to the computer to take a bit of a problem off the ship and the sun and the air was the sound

500000 iterations: "I think the Galaxy will be a lot of things that the second man who could not be continually and the sound of the stars

Page 21: One-Hot Encoding Shortcomings

One-hot encoding is lacking because it fails to capture semantic similarity between words, i.e., the inherent meaning of words.

For example, the words "happy", "joyful", and "pleased" all have similar meanings, but under one-hot encoding they are three distinct and unrelated entities.

What if we could capture the meaning of words within a numerical context?

Page 22: Word Embeddings

Word embeddings are vector representations of words that attempt to capture semantic meaning.
Each word is represented as a vector of numerical values.
Each index in the vector represents some abstract "concept":
◦ These concepts are unlabeled and learned during training

Words that are similar will have similar vectors (see the example after the table).

Word      Masculinity   Royalty   Youth   Intelligence
King         0.95         0.95    -0.1       0.6
Queen       -0.95         0.95    -0.1       0.6
Prince       0.8          0.8      0.7       0.4
Woman       -0.95         0.01    -0.1       0.2
Peasant      0.1         -0.95     0.1      -0.3
Doctor       0.12         0.1     -0.2       0.95
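A quick check of "similar words have similar vectors" using the toy vectors from the table above; cosine similarity is one common way to compare embeddings, and the exact values are only illustrative:

```python
import numpy as np

# Toy embeddings from the table: [Masculinity, Royalty, Youth, Intelligence]
embeddings = {
    "King":    np.array([0.95,  0.95, -0.1,  0.6]),
    "Prince":  np.array([0.8,   0.8,   0.7,  0.4]),
    "Peasant": np.array([0.1,  -0.95,  0.1, -0.3]),
}

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# King is much closer to Prince than to Peasant in this toy space
print(cosine_similarity(embeddings["King"], embeddings["Prince"]))   # high (positive)
print(cosine_similarity(embeddings["King"], embeddings["Peasant"]))  # negative
```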

Page 23: Word2Vec

Words that appear in the same context are more likely to have the same meaning:
◦ I am excited to see you today!
◦ I am ecstatic to see you today!

Word2Vec is an algorithm that uses a funnel-shaped, single-hidden-layer neural network (similar to an autoencoder) to create word embeddings.

Given a word (in one-hot encoded format), it tries to predict the neighbors of that word (also in one-hot encoded format), or vice versa.

Words that appear in the same context will have similar embeddings.

Page 24: Word2Vec

The model is trained on a large corpus of text using regular backpropagation.

For each word in the corpus, predict the 5 words to the left and right (or vice versa).

Once the model is trained, the embedding for a particular word is the row of the weight matrix associated with that word.

Many pretrained vectors (e.g. from Google) can be downloaded online.
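In practice, embeddings like these are usually trained with a library such as the Python gensim package (mentioned in the pipeline slide later). A minimal sketch, assuming the gensim 4.x API and a toy two-sentence corpus:

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (a real corpus would be far larger)
corpus = [
    ["i", "am", "excited", "to", "see", "you", "today"],
    ["i", "am", "ecstatic", "to", "see", "you", "today"],
]

# window=5 ~ predict the 5 words to the left and right; sg=1 selects the skip-gram variant
model = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, sg=1)

vec = model.wv["excited"]                   # the learned embedding for a word
similar = model.wv.most_similar("excited")  # nearest words in embedding space
```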

Page 25: Word2Vec on 20 Newsgroups

Page 26: Basic Deep Learning NLP Pipeline

Generate word embeddings
◦ Python gensim package

Feed word embeddings into an LSTM or GRU layer.

Feed the output of the LSTM or GRU layer into a softmax classifier.

Page 27: NLP Applications for RNNs

Language Models
◦ Given a series of words, predict the next word
◦ Understand the inherent patterns in a given language
◦ Useful for autocompletion and machine translation

Sentiment Analysis
◦ Given a sentence or document, classify if it is positive or negative
◦ Useful for analyzing the success of a product launch or automated stock trading based off news

Other forms of text classification
◦ Cancer pathology report classification

Page 28: Advanced Applications

Question Answering
◦ Read a document and then answer questions
◦ Many models use RNNs as their foundation

Automated Image Captioning
◦ Given an image, automatically generate a caption
◦ Many models use both CNNs and RNNs

Machine Translation
◦ Automatically translate text from one language to another
◦ Many models (including Google Translate) use RNNs as their foundation

Page 29: LSTM Improvements: Bi-directional LSTMs

Sometimes, important context for a word comes after the word (especially important for translation):
◦ I saw a crane flying across the sky
◦ I saw a crane lifting a large boulder

Solution: use two LSTM layers, one that reads the input forward and one that reads the input backwards, and concatenate their outputs.
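A minimal PyTorch sketch of the idea, using the built-in bidirectional flag rather than two explicit layers; the dimensions are assumptions:

```python
import torch
import torch.nn as nn

EMBED_SIZE, HIDDEN_SIZE = 300, 128  # assumed dimensions

# bidirectional=True runs one LSTM forward and one backward over the sequence
# and concatenates their outputs, giving 2 * HIDDEN_SIZE features per time step
bilstm = nn.LSTM(input_size=EMBED_SIZE, hidden_size=HIDDEN_SIZE,
                 batch_first=True, bidirectional=True)

word_embeddings = torch.randn(1, 12, EMBED_SIZE)  # a dummy 12-word sentence
outputs, _ = bilstm(word_embeddings)
print(outputs.shape)  # torch.Size([1, 12, 256])
```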

Page 30: LSTM Improvements: Attention Mechanisms

Sometimes only a few words in a sentence or document are important and the rest do not contribute as much meaning.
◦ For example, when classifying cancer location from cancer pathology reports, we may only care about certain keywords like "right upper lung" or "ovarian"

In a traditional RNN, we usually take the output at the last time step.

By the last time step, information from the important words may have been diluted, even with LSTM and GRU units.

How can we capture the information at the most important words?

Page 31: LSTM Improvements: Attention Mechanisms

Naïve solution: to prevent information loss, instead of using the LSTM output at the last time step, take the LSTM output at every time step and use the average.

Better solution: find the important time steps, and weight the output at those time steps much higher when doing the average.

Page 32: LSTM Improvements: Attention Mechanisms

An attention mechanism calculates how important the LSTM output at each time step is.

It's a simple feedforward network with a single (tanh) hidden layer and a softmax output.

At each time step, feed the output from the LSTM/GRU into the attention mechanism.

Page 33: LSTM Improvements: Attention Mechanisms

Once the attention mechanism has all the time steps, it calculates a softmax over all the time steps.
◦ softmax always adds to 1

The softmax tells us how to weight the output at each time step, i.e., how important each time step is.

Multiply the output at each time step with its corresponding softmax weight and add to create a weighted average.

Page 34: LSTM Improvements: Attention Mechanisms

Attention mechanisms can take into account "context" to determine what's important.

Remember that the dot product is a measure of similarity: two vectors that are similar will have a larger dot product.

In a plain softmax attention layer, we take the dot product of the input with randomly initialized weights before applying the softmax function.

Page 35: LSTM Improvements: Attention Mechanisms

Instead, we can take the dot product with a vector that represents "context" to find the words most similar/relevant to that context:
◦ For question answering, it can represent the question being asked
◦ For machine translation, it can represent the previous word
◦ For classification, it can be initialized randomly and learned during training
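Putting pages 32-35 together, here is a minimal NumPy sketch of an attention mechanism with a learned context vector (the classification case); the shapes and names are assumptions for illustration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

hidden_size, attn_size, seq_len = 6, 4, 10
rng = np.random.default_rng(0)
W_attn = rng.normal(size=(attn_size, hidden_size))  # single tanh hidden layer
context = rng.normal(size=attn_size)                # "context" vector, learned during training

def attention(lstm_outputs):                         # (seq_len, hidden_size)
    hidden = np.tanh(lstm_outputs @ W_attn.T)        # (seq_len, attn_size)
    scores = hidden @ context                        # dot product: similarity to the context
    weights = softmax(scores)                        # importance of each time step, sums to 1
    return weights @ lstm_outputs, weights           # weighted average of the LSTM outputs

lstm_outputs = rng.normal(size=(seq_len, hidden_size))
weighted_avg, weights = attention(lstm_outputs)
```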

Page 36: LSTM Improvements: Attention Mechanisms

With attention, you can visualize how important each time step is for a particular task.

Page 37: LSTM Improvements: Attention Mechanisms

With attention, you can visualize how important each time step is for a particular task.

Page 38: CNNs for Text Classification

Start with word embeddings
◦ If you have 10 words and your embedding size is 300, you'll have a 10x300 matrix

3 parallel convolution layers
◦ Take in word embeddings
◦ Sliding window that processes 3, 4, and 5 words at a time (1D conv)
◦ Filter sizes are 3x300x100, 4x300x100, and 5x300x100 (width, in-channels, out-channels)
◦ Each conv layer outputs a 10x100 matrix

Page 39: CNNs for Text Classification

Max-pool and concatenate
◦ For each filter channel, max-pool across the entire width of the sentence
◦ This is like picking the 'most important' word in the sentence for each channel
◦ Also ensures every sentence, no matter how long, is represented by a same-length vector
◦ For each of the three 10x100 matrices, this returns a 1x100 matrix
◦ Concatenate the three 1x100 matrices into a 1x300 matrix

Dense and softmax (see the sketch below)
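A minimal PyTorch sketch of this architecture; the 300-dim embeddings, 3/4/5 filter widths, and 100 output channels follow the slides, while the padding choice and the number of classes are assumptions:

```python
import torch
import torch.nn as nn

EMBED_SIZE, NUM_FILTERS, NUM_CLASSES = 300, 100, 4  # NUM_CLASSES is assumed

class TextCNN(nn.Module):
    def __init__(self):
        super().__init__()
        # Three parallel 1D convolutions sliding over 3, 4, and 5 words at a time
        self.convs = nn.ModuleList([
            nn.Conv1d(EMBED_SIZE, NUM_FILTERS, kernel_size=k, padding=k // 2)
            for k in (3, 4, 5)
        ])
        self.fc = nn.Linear(3 * NUM_FILTERS, NUM_CLASSES)  # dense layer over the 1x300 vector

    def forward(self, embeddings):                 # (batch, num_words, EMBED_SIZE)
        x = embeddings.transpose(1, 2)             # Conv1d expects (batch, channels, width)
        pooled = [conv(x).max(dim=2).values for conv in self.convs]  # max-pool over the sentence
        features = torch.cat(pooled, dim=1)        # concatenate into (batch, 300)
        return self.fc(features)                   # softmax applied by the loss / at inference

model = TextCNN()
logits = model(torch.randn(2, 10, EMBED_SIZE))     # two dummy 10-word "sentences"
probs = logits.softmax(dim=-1)
```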

Page 40: Hierarchical Attention Networks

Page 41: Problem Overview

The National Cancer Institute has asked Oak Ridge National Lab to develop a program that can automatically classify cancer pathology reports.

Pathology reports are what doctors write up when they diagnose cancer, and NCI uses them to calculate national statistics and track health trends.

Challenges:
◦ Different doctors use different terminology to label the same types of cancer
◦ Some diagnoses may reference other types of cancer or other organs that are not the actual cancer being diagnosed
◦ Typos

Task: given a pathology report, teach a program to find the type of cancer, location of cancer, histological grade, etc.

Page 42: Approach

The performance of several classifiers was tested:
◦ Traditional machine learning classifiers: Naive Bayes, Logistic Regression, Support Vector Machines, Random Forests, and XGBoost
◦ Traditional machine learning classifiers require manually defined features, such as n-grams and tf-idf
◦ Deep learning methods: recurrent neural networks, convolutional neural networks, and hierarchical attention networks
◦ Given enough data, deep learning methods can learn their own features, such as which words or phrases are important

The Hierarchical Attention Network is a relatively new deep learning model that came out last year and is one of the top performers.

Page 43: HAN Architecture

The Hierarchical Attention Network (HAN) is a deep learning model for document classification.

Built from bidirectional RNNs composed of GRUs/LSTMs with attention mechanisms.

Composed of "hierarchies" where the outputs of the lower hierarchies become the inputs to the upper hierarchies.

Page 44: HAN Architecture

Before feeding a document into the HAN, we first break it down into sentences (or in our case, lines).

The word hierarchy is responsible for creating sentence embeddings:
◦ This hierarchy reads in one full sentence at a time, in the form of word embeddings
◦ The attention mechanism selects the most important words
◦ The output is a sentence embedding that captures the semantic content of the sentence based on the most important words

Page 45: HAN Architecture

The sentence hierarchy is responsible for creating the final document embedding:
◦ Identical structure to the word hierarchy
◦ Reads in the sentence embeddings output from the word hierarchy
◦ The attention mechanism selects the most important sentences
◦ The output is a document embedding representing the meaning of the entire document

The final document embedding is used for classification.
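A condensed PyTorch sketch of the two hierarchies; the GRU and attention sizes and the fixed document shape are assumptions (the 12 output classes match the primary site task), and a real implementation handles variable-length sentences and documents:

```python
import torch
import torch.nn as nn

EMBED_SIZE, HIDDEN_SIZE, ATTN_SIZE, NUM_CLASSES = 300, 100, 100, 12  # assumed sizes

class AttentionPool(nn.Module):
    """tanh hidden layer + learned context vector + softmax-weighted average."""
    def __init__(self, in_size):
        super().__init__()
        self.proj = nn.Linear(in_size, ATTN_SIZE)
        self.context = nn.Parameter(torch.randn(ATTN_SIZE))  # learned during training

    def forward(self, x):                                     # (batch, steps, in_size)
        scores = torch.tanh(self.proj(x)) @ self.context      # (batch, steps)
        weights = scores.softmax(dim=1).unsqueeze(-1)         # importance of each step
        return (weights * x).sum(dim=1)                       # weighted average

class HAN(nn.Module):
    def __init__(self):
        super().__init__()
        self.word_rnn = nn.GRU(EMBED_SIZE, HIDDEN_SIZE, batch_first=True, bidirectional=True)
        self.word_attn = AttentionPool(2 * HIDDEN_SIZE)
        self.sent_rnn = nn.GRU(2 * HIDDEN_SIZE, HIDDEN_SIZE, batch_first=True, bidirectional=True)
        self.sent_attn = AttentionPool(2 * HIDDEN_SIZE)
        self.classifier = nn.Linear(2 * HIDDEN_SIZE, NUM_CLASSES)

    def forward(self, docs):                   # (batch, num_sents, num_words, EMBED_SIZE)
        b, s, w, e = docs.shape
        words, _ = self.word_rnn(docs.view(b * s, w, e))
        sent_embeds = self.word_attn(words).view(b, s, -1)    # word hierarchy -> sentence embeddings
        sents, _ = self.sent_rnn(sent_embeds)
        doc_embed = self.sent_attn(sents)                     # sentence hierarchy -> document embedding
        return self.classifier(doc_embed)

logits = HAN()(torch.randn(2, 8, 20, EMBED_SIZE))  # 2 documents, 8 sentences of 20 words each
```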

Page 46: Experimental Setup

945 cancer pathology reports, all cases of breast and lung cancer.

10-fold cross validation used, 30 epochs per fold.

Hyperparameter optimization was applied to the models to find optimal parameters.

Two main tasks: primary site classification and histological grade classification.
◦ Uneven class distribution, some classes with only ~10 occurrences in the dataset
◦ F-score used as the performance metric
◦ Micro F-score is the weighted F-score average based on class size
◦ Macro F-score is the unweighted F-score average across all classes (a small example follows below)
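A small illustration of the micro vs. macro distinction using scikit-learn; the labels are made up, but they show how imbalanced classes pull the two averages apart:

```python
from sklearn.metrics import f1_score

# Imbalanced toy labels: class 0 dominates, class 1 is rare and often misclassified
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

print(f1_score(y_true, y_pred, average="micro"))  # ~0.80, dominated by the large class
print(f1_score(y_true, y_pred, average="macro"))  # ~0.69, every class counts equally
```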

Page 47: HAN Performance: Primary Site

12 possible cancer subsite locations:
◦ 5 lung subsites
◦ 7 breast subsites

Deep learning methods outperformed all traditional ML methods except for XGBoost.

HAN had the best performance; pretraining improved performance even further.

Traditional Machine Learning Classifiers

Classifier                                            Primary Site Micro F-Score   Primary Site Macro F-Score
Naive Bayes                                           .554 (.521, .586)            .161 (.152, .170)
Logistic Regression                                   .621 (.589, .652)            .222 (.207, .237)
Support Vector Machine (C=1, gamma=1)                 .616 (.585, .646)            .220 (.205, .234)
Random Forest (num trees=100)                         .628 (.597, .661)            .258 (.236, .283)
XGBoost (max depth=5, n estimators=300)               .709 (.681, .738)            .441 (.404, .474)

Deep Learning Classifiers

Recurrent Neural Network (with attention mechanism)   .694 (.666, .722)            .468 (.432, .502)
Convolutional Neural Network                          .712 (.680, .736)            .398 (.359, .434)
Hierarchical Attention Network (no pretraining)       .784 (.759, .810)            .566 (.525, .607)
Hierarchical Attention Network (with pretraining)     .800 (.776, .825)            .594 (.553, .636)

Page 48: HAN Performance: Histological Grade

4 possible histological grades:
◦ 1-4, indicating how abnormal tumor cells and tumor tissues look under a microscope, with 4 being most abnormal
◦ Indicates how quickly a tumor is likely to grow and spread

Other than RNNs, deep learning models generally outperform traditional ML models.

HAN had the best performance, but pretraining did not help performance.

Traditional Machine Learning Classifiers

Classifier                                            Histological Grade Micro F-Score   Histological Grade Macro F-Score
Naive Bayes                                           .481 (.442, .519)                  .264 (.244, .283)
Logistic Regression                                   .540 (.499, .576)                  .340 (.309, .371)
Support Vector Machine (C=1, gamma=1)                 .520 (.482, .558)                  .330 (.301, .357)
Random Forest (num trees=100)                         .597 (.558, .636)                  .412 (.364, .476)
XGBoost (max depth=5, n estimators=300)               .673 (.634, .709)                  .593 (.516, .662)

Deep Learning Classifiers

Recurrent Neural Network (with attention mechanism)   .580 (.541, .617)                  .474 (.416, .536)
Convolutional Neural Network                          .716 (.681, .750)                  .521 (.493, .548)
Hierarchical Attention Network (no pretraining)       .916 (.895, .936)                  .841 (.778, .895)
Hierarchical Attention Network (with pretraining)     .904 (.881, .927)                  .822 (.744, .883)

Page 49: TF-IDF Document Embeddings

TF-IDF-weighted Word2Vec embeddings reduced to 2 dimensions via PCA for (A) primary site train reports, (B) histological grade train reports, (C) primary site test reports, and (D) histological grade test reports.

Page 50: HAN Document Embeddings

HAN document embeddings reduced to 2 dimensions via PCA for (A) primary site train reports, (B) histological grade train reports, (C) primary site test reports, and (D) histological grade test reports.

Page 51: Pretraining

We have access to more unlabeled data than labeled data (approximately 1500 unlabeled, 1000 labeled).

To utilize the unlabeled data, we trained our HAN to create document embeddings that matched the corresponding TF-IDF weighted word embeddings for that document.

HAN training and validation accuracy with and without pretraining for (A) the primary site task and (B) the histological grade task.

Page 52: HAN Document Annotations

Page 53: Most Important Words per Task

We can also use the HAN's attention weights to find the words that contribute most towards the classification task at hand:

Primary Site: mainstem, adeno, ca, lul, lower, breast, carina, cusa, upper, middle, rul, buttock, temporal, upper, retro, sputum

Histological Grade: poorly, g2, high, iii, dlr, undifferentiated, g3, iii, g1, moderately, intermediate, well, arising, 2

Page 54: Scaling

Relative to other models, the HAN is very slow to train:
◦ On CPU, the HAN takes approximately 4 hours to go through 30 epochs
◦ In comparison, the CNN takes around 40 minutes to go through 30 epochs, and traditional machine learning classifiers take less than a minute
◦ The HAN is slow due to its complex architecture and use of RNNs, so gradients are very expensive to compute

We are currently working to scale the HAN to run on multiple GPUs:
◦ On TensorFlow, RNNs on GPU run slower than on CPU
◦ We are considering exploring a PyTorch implementation to get around this problem

We have successfully developed a distributed CPU-only HAN that runs on TITAN using MPI, with a 4x speedup on 8 nodes.

Page 55: Attention is All You Need

New paper that came out in June 2017 from Google Brain, in which they showed they could get competitive results in machine translation with only attention mechanisms and no RNNs.

We applied the same architecture to replace the RNNs in our HAN.

Because attention mechanisms are just matrix multiplications, it runs about 10x faster than the HAN with RNNs.

This new model performs almost as well as the HAN with RNNs: 0.77 micro-F on primary site (compared to 0.78 in the original HAN), and 0.86 micro-F on histological grade (compared to 0.91 in the original HAN).

Because no RNNs are utilized, this model is much easier to scale on the GPU.

Page 56: Other Future Work

Multitask Learning
◦ Predict histological grade, primary site, and other tasks simultaneously within the same model
◦ Hopefully boost the performance of all tasks by sharing information across tasks

Semi-Supervised Learning
◦ Utilize unlabeled data during training rather than in pretraining, with the goal of improving classification performance
◦ This task is challenging because in most semi-supervised tasks, we know all the labels within the dataset. In our case, we only have a subset of the labels.

Page 57: Questions?