Topic Modeling with LSA, PLSA, LDA & lda2Vec

Joyce Xu in NanoNets
May 25, 2018 · 12 min read
This article is a comprehensive overview of Topic Modeling and its associated techniques.
In natural language understanding (NLU) tasks, there is a hierarchy of lenses through which we can extract meaning — from words to sentences to paragraphs to documents. At the document level, one of the most useful ways to understand text is by analyzing its topics. The process of learning, recognizing, and extracting these topics across a collection of documents is called topic modeling.

In this post, we will explore topic modeling through 4 of the most popular techniques today: LSA, pLSA, LDA, and the newer, deep learning-based lda2vec.

Overview

All topic models are based on the same basic assumption:
each document consists of a mixture of topics, and each topic consists of a collection of words.
In other words, topic models are built around the idea that the semantics of our document are actually being governed by some hidden, or "latent," variables that we are not observing. As a result, the goal of topic modeling is to uncover these latent variables — topics — that shape the meaning of our document and corpus. The rest of this blog post will build up an understanding of how different topic models uncover these latent topics.

LSA

Latent Semantic Analysis, or LSA, is one of the foundational techniques in topic modeling. The core idea is to take a matrix of what we have — documents and terms — and decompose it into a separate document-topic matrix and a topic-term matrix.
The first step is generating our document-term matrix. Given m documents and n words in our vocabulary, we can construct an m × n matrix A in which each row represents a document and each column represents a word. In the simplest version of LSA, each entry can simply be a raw count of the number of times the j-th word appeared in the i-th document. In practice, however, raw counts do not work particularly well because they do not account for the significance of each word in the document. For example, the word "nuclear" probably informs us more about the topic(s) of a given document than the word "test."

Consequently, LSA models typically replace raw counts in the document-term matrix with a tf-idf score. Tf-idf, or term frequency-inverse document frequency, assigns a weight for term j in document i as follows:

w(i, j) = tf(i, j) × log(m / df(j))

where tf(i, j) is the frequency of term j in document i, df(j) is the number of documents containing term j, and m is the total number of documents.
Intuitively, a term has a large weight when it occurs frequently across the document but infrequently across the corpus. The word "build" might appear often in a document, but because it's likely fairly common in the rest of the corpus, it will not have a high tf-idf score. However, if the word "gentrification" appears often in a document, because it is rarer in the rest of the corpus, it will have a higher tf-idf score.
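To make this weighting concrete, here is a minimal sketch using scikit-learn on a made-up three-document toy corpus (the documents and words are purely illustrative):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# toy corpus: "nuclear" appears in only one document, so it should receive
# a higher tf-idf weight than words common across the whole corpus
docs = [
    "nuclear test results released in the test report",
    "the test was a standard test",
    "the report was released",
]

counts = CountVectorizer().fit_transform(docs)  # raw count matrix
tfidf = TfidfVectorizer().fit_transform(docs)   # tf-idf weighted matrix

print(counts.toarray())  # (3 x vocab) raw counts
print(tfidf.toarray())   # same shape, reweighted by inverse document frequency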
Once we have our document-term matrix A, we can start thinking about our latent topics. Here's the thing: in all likelihood, A is very sparse, very noisy, and very redundant across its many dimensions. As a result, to find the few latent topics that capture the relationships among the words and documents, we want to perform dimensionality reduction on A.

This dimensionality reduction can be performed using truncated SVD. SVD, or singular value decomposition, is a technique in linear algebra that factorizes any matrix M into the product of 3 separate matrices: M = U*S*Vᵀ, where S is a diagonal matrix of the singular values of M. Critically, truncated SVD reduces dimensionality by selecting only the t largest singular values, and only keeping the first t columns of U and V. In this case, t is a hyperparameter we can select and adjust to reflect the number of topics we want to find.
Intuitively, think of this as only keeping the t most significant dimensions in our transformed space. In this case, U ∈ ℝ^(m⨉t) emerges as our document-topic matrix, and V ∈ ℝ^(n⨉t) becomes our term-topic matrix. In both U and V, the columns correspond to one of our t topics. In U, rows represent document vectors expressed in terms of topics; in V, rows represent term vectors expressed in terms of topics.

With these document vectors and term vectors, we can now easily apply measures such as cosine similarity (a small numerical sketch follows this list) to evaluate:

- the similarity of different documents
- the similarity of different words
- the similarity of terms (or "queries") and documents (which becomes useful in information retrieval, when we want to retrieve passages most relevant to our search query)
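As a rough numerical sketch of all of the above (a toy example with made-up dimensions, using plain numpy rather than any particular LSA library):

import numpy as np

# toy document-term matrix: 5 documents, 8 terms (tf-idf weighted in practice)
A = np.random.rand(5, 8)

# full SVD: A = U S V^T
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# truncate to t topics by keeping only the t largest singular values
t = 2
doc_topic = U[:, :t] * S[:t]  # documents expressed in topic space
term_topic = Vt[:t, :].T      # terms expressed in topic space

# cosine similarity between the first two documents in topic space
a, b = doc_topic[0], doc_topic[1]
print(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))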
Code

In sklearn, a simple implementation of LSA might look something like this:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline

documents = ["doc1.txt", "doc2.txt", "doc3.txt"]

# raw documents to tf-idf matrix
# (input='filename' tells the vectorizer to read each document from disk)
vectorizer = TfidfVectorizer(input='filename',
                             stop_words='english',
                             use_idf=True,
                             smooth_idf=True)

# SVD to reduce dimensionality
svd_model = TruncatedSVD(n_components=100,  # number of dimensions/topics
                         algorithm='randomized',
                         n_iter=10)

# pipeline of tf-idf + SVD, fit to and applied to documents
svd_transformer = Pipeline([('tfidf', vectorizer),
                            ('svd', svd_model)])
svd_matrix = svd_transformer.fit_transform(documents)

# svd_matrix can later be used to compare documents, compare words,
# or compare queries with documents
LSA is quick and efficient to use, but it does have a few primary drawbacks:

- lack of interpretable embeddings (we don't know what the topics are, and the components may be arbitrarily positive/negative)
- need for a really large set of documents and vocabulary to get accurate results
- less efficient representation
PLSA

pLSA, or Probabilistic Latent Semantic Analysis, uses a probabilistic method instead of SVD to tackle the problem. The core idea is to find a probabilistic model with latent topics that can generate the data we observe in our document-term matrix. In particular, we want a model P(D, W) such that for any document d and word w, P(d, w) corresponds to that entry in the document-term matrix.

Recall the basic assumption of topic models: each document consists of a mixture of topics, and each topic consists of a collection of words. pLSA adds a probabilistic spin to these assumptions:

- given a document d, topic z is present in that document with probability P(z|d)
- given a topic z, word w is drawn from z with probability P(w|z)
Formally, the joint probability of seeing a given document and word together is:

P(D, W) = P(D) Σ_Z P(Z|D) P(W|Z)
Intuitively, the right-hand side of this equation is telling us how likely it is to see some document, and then based upon the distribution of topics of that document, how likely it is to find a certain word within that document.

In this case, P(D), P(Z|D), and P(W|Z) are the parameters of our model. P(D) can be determined directly from our corpus. P(Z|D) and P(W|Z) are modeled as multinomial distributions, and can be trained using the expectation-maximization algorithm (EM).
Without going into a full mathematical treatment of the algorithm, EM is a method of finding the likeliest parameter estimates for a model which depends on unobserved, latent variables (in our case, the topics).

Interestingly, P(D, W) can be equivalently parameterized using a different set of 3 parameters:

P(D, W) = Σ_Z P(Z) P(D|Z) P(W|Z)
We can understand this equivalency by looking at the model as a generative process. In our first parameterization, we were starting with the document with P(d), and then generating the topic with P(z|d), and then generating the word with P(w|z). In this parameterization, we are starting with the topic with P(z), and then independently generating the document with P(d|z) and the word with P(w|z).

(Image source: https://www.slideshare.net/NYCPredictiveAnalytics/introduction-to-probabilistic-latent-semantic-analysis)
The reason this new parameterization is so interesting is because we can see a direct parallel between our pLSA model and our LSA model:

P(D, W) = Σ_Z P(Z) P(D|Z) P(W|Z)  ⟷  A = U*S*Vᵀ

where the probability of our topic P(Z) corresponds to the diagonal matrix of our singular topic probabilities, the probability of our document given the topic P(D|Z) corresponds to our document-topic matrix U, and the probability of our word given the topic P(W|Z) corresponds to our term-topic matrix V.
So what does that tell us? Although it looks quite different and approaches the problem in a very different way, pLSA really just adds a probabilistic treatment of topics and words on top of LSA. It is a far more flexible model, but still has a few problems. In particular:

- Because we have no parameters to model P(D), we don't know how to assign probabilities to new documents
- The number of parameters for pLSA grows linearly with the number of documents we have, so it is prone to overfitting

We will not look at any code for pLSA because it is rarely used on its own.
In general, when people are looking for a topic model beyond the baseline performance LSA gives, they turn to LDA. LDA, the most common type of topic model, extends pLSA to address these issues.

LDA

LDA stands for Latent Dirichlet Allocation. LDA is a Bayesian version of pLSA.
In particular, it uses dirichlet priors for the document-topic and word-topic distributions, lending itself to better generalization.

I am not going into an in-depth treatment of dirichlet distributions, since there are very good intuitive explanations here and here. As a brief overview, however, we can think of dirichlet as a "distribution over distributions." In essence, it answers the question: "given this type of distribution, what are some actual probability distributions I am likely to see?"
Consider the very relevant example of comparing probability distributions of topic mixtures. Let's say the corpus we are looking at has documents from 3 very different subject areas. If we want to model this, the type of distribution we want will be one that very heavily weights one specific topic, and doesn't give much weight to the rest at all. If we have 3 topics, then some specific probability distributions we'd likely see are:

- Mixture X: 90% topic A, 5% topic B, 5% topic C
- Mixture Y: 5% topic A, 90% topic B, 5% topic C
- Mixture Z: 5% topic A, 5% topic B, 90% topic C
If we draw a random probability distribution from this dirichlet distribution, parameterized by large weights on a single topic, we would likely get a distribution that strongly resembles either mixture X, mixture Y, or mixture Z. It would be very unlikely for us to sample a distribution that is 33% topic A, 33% topic B, and 33% topic C.

That's essentially what a dirichlet distribution provides: a way of sampling probability distributions of a specific type.
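As a quick illustration, here is a small numpy sketch of this sampling behavior (the concentration values below are made up for illustration; a symmetric parameter below 1 concentrates each sample on one topic, while larger values push samples toward an even mixture):

import numpy as np

# sparse prior: each sampled mixture tends to load heavily on a single topic,
# like mixtures X, Y, and Z above
print(np.random.dirichlet(alpha=[0.1, 0.1, 0.1], size=3))
# e.g. rows like [0.93, 0.02, 0.05]

# dense prior: samples hover near the uniform 33/33/33 mixture instead
print(np.random.dirichlet(alpha=[10.0, 10.0, 10.0], size=3))
# e.g. rows like [0.32, 0.35, 0.33]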
Recall the model for pLSA:

P(D, W) = P(D) Σ_Z P(Z|D) P(W|Z)
In pLSA, we sample a document, then a topic based on that document, then a word based on that topic. Here is the model for LDA:

θ ∼ Dir(α),  z ∼ Multinomial(θ),  φ ∼ Dir(β),  w ∼ Multinomial(φ_z)
From a dirichlet distribution Dir(α), we draw a random sample representing the topic distribution, or topic mixture, of a particular document. This topic distribution is θ. From θ, we select a particular topic Z based on the distribution.

Next, from another dirichlet distribution Dir(β), we select a random sample representing the word distribution of the topic Z. This word distribution is φ. From φ, we choose the word w.

Formally, the process for generating each word from a document is as follows (beware this algorithm uses c instead of z to represent the topic):
(Algorithm: see https://cs.stanford.edu/~ppasupat/a9online/1140.html)
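To make the generative story concrete, here is a small numpy sketch of drawing one word under this process (all dimensions and hyperparameter values below are made up for illustration):

import numpy as np

num_topics, vocab_size = 3, 10
alpha = np.full(num_topics, 0.1)  # document-topic concentration
beta = np.full(vocab_size, 0.01)  # topic-word concentration

# per-topic word distributions phi: one draw from Dir(beta) per topic
phi = np.random.dirichlet(beta, size=num_topics)

# per-document topic mixture theta: one draw from Dir(alpha)
theta = np.random.dirichlet(alpha)

# generate one word: pick a topic z from theta, then a word w from phi[z]
z = np.random.choice(num_topics, p=theta)
w = np.random.choice(vocab_size, p=phi[z])
print(z, w)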
LDA typically works better than pLSA because it can generalize to new documents easily. In pLSA, the document probability is a fixed point in the dataset. If we haven't seen a document, we don't have that data point. In LDA, the dataset serves as training data for the dirichlet distribution of document-topic distributions. If we haven't seen a document, we can easily sample from the dirichlet distribution and move forward from there.

Code

LDA is easily the most popular (and typically most effective) topic modeling technique out there. It's available in gensim for easy use:
from gensim.corpora import Dictionary, MmCorpus
from gensim.models.ldamodel import LdaModel

document = "This is some document..."

# load id -> word mapping (the dictionary)
id2word = Dictionary.load_from_text('wiki_en_wordids.txt')

# load corpus iterator
mm = MmCorpus('wiki_en_tfidf.mm')

# extract 100 LDA topics, updating once every 10,000 documents
lda = LdaModel(corpus=mm, id2word=id2word, num_topics=100,
               update_every=1, chunksize=10000, passes=1)

# use LDA model: transform new doc to bag-of-words, then apply lda
doc_bow = id2word.doc2bow(document.split())
doc_lda = lda[doc_bow]

# doc_lda is a vector of length num_topics representing the weighted
# presence of each topic in the doc
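As a brief usage note, gensim can also print the most probable words per topic, which is one way to see topic characterizations like the example in the next paragraph (a small sketch continuing from the model above):

# print the top 10 words for each of the first 5 topics
for topic_id, words in lda.show_topics(num_topics=5, num_words=10,
                                       formatted=False):
    print(topic_id, [word for word, prob in words])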
With LDA, we can extract human-interpretable topics from a document corpus, where each topic is characterized by the words it is most strongly associated with. For example, topic 2 could be characterized by terms such as "oil, gas, drilling, pipes, Keystone, energy," etc. Furthermore, given a new document, we can obtain a vector representing its topic mixture, e.g. 5% topic 1, 70% topic 2, 10% topic 3, etc. These vectors are often very useful for downstream applications.

LDA in Deep Learning: lda2vec

So where do these topic models factor into more complex natural language processing problems?
At the beginning of this post, we talked about how important it is to be able to extract meaning from text at every level — word, paragraph, document. At the document level, we now know how to represent the text as mixtures of topics. At the word level, we typically use something like word2vec to obtain vector representations. lda2vec is an extension of word2vec and LDA that jointly learns word, document, and topic vectors.

Here's how it works.

lda2vec specifically builds on top of the skip-gram model of word2vec to generate word vectors. If you're not familiar with skip-gram and word2vec, you can read up on it here, but essentially it's a neural net that learns a word embedding by trying to use the input word to predict surrounding context words.
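For reference, a minimal skip-gram run in gensim might look like this (a sketch with a toy corpus; sg=1 selects the skip-gram architecture, and the vector_size argument assumes gensim 4.x):

from gensim.models import Word2Vec

# toy corpus: a list of tokenized sentences
sentences = [["topic", "models", "uncover", "latent", "topics"],
             ["word", "vectors", "capture", "word", "meaning"]]

# sg=1 -> skip-gram: predict surrounding context words from the input word
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, sg=1)

print(model.wv["topic"])  # the learned 50-dimensional word vector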
With lda2vec, instead of using the word vector directly to predict context words, we leverage a context vector to make the predictions. This context vector is created as the sum of two other vectors: the word vector and the document vector.

The word vector is generated by the same skip-gram word2vec model discussed earlier. The document vector is more interesting. It is really a weighted combination of two other components:

- the document weight vector, representing the "weights" (later to be transformed into percentages) of each topic in the document
- the topic matrix, representing each topic and its corresponding vector embedding
Together, the document vector and the word vector generate "context" vectors for each word in the document. The power of lda2vec lies in the fact that it not only learns word embeddings (and context vector embeddings) for words, it simultaneously learns topic representations and document representations as well.
(lda2vec architecture diagram: https://multithreaded.stitchfix.com/blog/2016/05/27/lda2vec)
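In rough numpy terms, the composition described above looks something like this (a conceptual sketch only, with made-up dimensions; it shows how the pieces combine, not lda2vec's actual training code):

import numpy as np

embedding_dim, num_topics = 300, 20

word_vector = np.random.randn(embedding_dim)       # from skip-gram word2vec
doc_weights = np.random.randn(num_topics)          # unnormalized topic weights for one doc
topic_matrix = np.random.randn(num_topics, embedding_dim)  # one embedding per topic

# softmax turns the document weights into topic percentages
doc_proportions = np.exp(doc_weights) / np.exp(doc_weights).sum()

# document vector: weighted combination of the topic embeddings
doc_vector = doc_proportions @ topic_matrix

# context vector: sum of word vector and document vector, used
# (instead of the word vector alone) to predict context words
context_vector = word_vector + doc_vector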
For a more detailed overview of the model, check out Chris Moody's original blog post (Moody created lda2vec in 2016). Code can be found at Moody's github repository and this Jupyter Notebook example.

Conclusion

All too often, we treat topic models as black-box algorithms that "just work." Fortunately, unlike many neural nets, topic models are actually quite interpretable and much more straightforward to diagnose, tune, and evaluate. Hopefully this blog post has been able to explain the underlying math, motivations, and intuition you need, and leave you enough high-level code to get started. Please leave your thoughts in the comments, and happy hacking!

About Nanonets

Nanonets makes it super easy to use Deep Learning. You can build a model with your own data to achieve high accuracy & use our APIs to integrate the same in your application.