Topic Modeling with LSA, PLSA, LDA & lda2Vec

Joyce Xu in NanoNets
May 25, 2018 · 12 min read
This article is a comprehensive overview of Topic Modeling and its associated techniques.
In natural language understanding (NLU) tasks, there is a hierarchy of lenses through which we can extract meaning — from words to sentences to paragraphs to documents. At the document level, one of the most useful ways to understand text is by analyzing its topics. The process of learning, recognizing, and extracting these topics across a collection of documents is called topic modeling.

In this post, we will explore topic modeling through 4 of the most popular techniques today: LSA, pLSA, LDA, and the newer, deep learning-based lda2vec.

Overview

All topic models are based on the same basic assumption:
each document consists of a mixture of topics, and each topic consists of a collection of words.
In other words, topic models are built around the idea that the semantics of our document are actually being governed by some hidden, or "latent," variables that we are not observing. As a result, the goal of topic modeling is to uncover these latent variables — topics — that shape the meaning of our document and corpus. The rest of this blog post will build up an understanding of how different topic models uncover these latent topics.

LSA

Latent Semantic Analysis, or LSA, is one of the foundational techniques in topic modeling. The core idea is to take a matrix of what we have — documents and terms — and decompose it into a separate document-topic matrix and a topic-term matrix.
The first step is generating our document-term matrix. Given m documents and n words in our vocabulary, we can construct an m × n matrix A in which each row represents a document and each column represents a word. In the simplest version of LSA, each entry can simply be a raw count of the number of times the j-th word appeared in the i-th document. In practice, however, raw counts do not work particularly well because they do not account for the significance of each word in the document. For example, the word "nuclear" probably informs us more about the topic(s) of a given document than the word "test."

Consequently, LSA models typically replace raw counts in the document-term matrix with a tf-idf score. Tf-idf, or term frequency-inverse document frequency, assigns a weight for term j in document i as follows:

w(i, j) = tf(i, j) × log(m / df(j))

where tf(i, j) is the frequency of term j in document i, df(j) is the number of documents containing term j, and m is the total number of documents.
Intuitively, a term has a large weight when it occurs frequently across the document but infrequently across the corpus. The word "build" might appear often in a document, but because it's likely fairly common in the rest of the corpus, it will not have a high tf-idf score. However, if the word "gentrification" appears often in a document, because it is rarer in the rest of the corpus, it will have a higher tf-idf score.
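To make this weighting concrete, here is a minimal sketch using scikit-learn on a made-up three-document toy corpus (the documents and words are purely illustrative):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# toy corpus: "nuclear" appears in only one document, so it should receive
# a higher tf-idf weight than words common across the whole corpus
docs = [
    "nuclear test results released in the test report",
    "the test was a standard test",
    "the report was released",
]

counts = CountVectorizer().fit_transform(docs)  # raw count matrix
tfidf = TfidfVectorizer().fit_transform(docs)   # tf-idf weighted matrix

print(counts.toarray())  # (3 x vocab) raw counts
print(tfidf.toarray())   # same shape, reweighted by inverse document frequency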
Once we have our document-term matrix A, we can start thinking about our latent topics. Here's the thing: in all likelihood, A is very sparse, very noisy, and very redundant across its many dimensions. As a result, to find the few latent topics that capture the relationships among the words and documents, we want to perform dimensionality reduction on A.

This dimensionality reduction can be performed using truncated SVD. SVD, or singular value decomposition, is a technique in linear algebra that factorizes any matrix M into the product of 3 separate matrices: M = U*S*Vᵀ, where S is a diagonal matrix of the singular values of M. Critically, truncated SVD reduces dimensionality by selecting only the t largest singular values, and only keeping the first t columns of U and V. In this case, t is a hyperparameter we can select and adjust to reflect the number of topics we want to find.
Intuitively, think of this as only keeping the t most significant dimensions in our transformed space. In this case, U ∈ ℝ^(m⨉t) emerges as our document-topic matrix, and V ∈ ℝ^(n⨉t) becomes our term-topic matrix. In both U and V, the columns correspond to one of our t topics. In U, rows represent document vectors expressed in terms of topics; in V, rows represent term vectors expressed in terms of topics.

With these document vectors and term vectors, we can now easily apply measures such as cosine similarity (a small numerical sketch follows this list) to evaluate:

- the similarity of different documents
- the similarity of different words
- the similarity of terms (or "queries") and documents (which becomes useful in information retrieval, when we want to retrieve passages most relevant to our search query)
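As a rough numerical sketch of all of the above (a toy example with made-up dimensions, using plain numpy rather than any particular LSA library):

import numpy as np

# toy document-term matrix: 5 documents, 8 terms (tf-idf weighted in practice)
A = np.random.rand(5, 8)

# full SVD: A = U S V^T
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# truncate to t topics by keeping only the t largest singular values
t = 2
doc_topic = U[:, :t] * S[:t]  # documents expressed in topic space
term_topic = Vt[:t, :].T      # terms expressed in topic space

# cosine similarity between the first two documents in topic space
a, b = doc_topic[0], doc_topic[1]
print(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))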
Code

In sklearn, a simple implementation of LSA might look something like this:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline

documents = ["doc1.txt", "doc2.txt", "doc3.txt"]

# raw documents to tf-idf matrix
# (input='filename' tells the vectorizer to read each document from disk)
vectorizer = TfidfVectorizer(input='filename',
                             stop_words='english',
                             use_idf=True,
                             smooth_idf=True)

# SVD to reduce dimensionality
svd_model = TruncatedSVD(n_components=100,  # number of dimensions/topics
                         algorithm='randomized',
                         n_iter=10)

# pipeline of tf-idf + SVD, fit to and applied to documents
svd_transformer = Pipeline([('tfidf', vectorizer),
                            ('svd', svd_model)])
svd_matrix = svd_transformer.fit_transform(documents)

# svd_matrix can later be used to compare documents, compare words,
# or compare queries with documents
LSA is quick and efficient to use, but it does have a few primary drawbacks:

- lack of interpretable embeddings (we don't know what the topics are, and the components may be arbitrarily positive/negative)
- need for a really large set of documents and vocabulary to get accurate results
- less efficient representation
PLSA

pLSA, or Probabilistic Latent Semantic Analysis, uses a probabilistic method instead of SVD to tackle the problem. The core idea is to find a probabilistic model with latent topics that can generate the data we observe in our document-term matrix. In particular, we want a model P(D, W) such that for any document d and word w, P(d, w) corresponds to that entry in the document-term matrix.

Recall the basic assumption of topic models: each document consists of a mixture of topics, and each topic consists of a collection of words. pLSA adds a probabilistic spin to these assumptions:

- given a document d, topic z is present in that document with probability P(z|d)
- given a topic z, word w is drawn from z with probability P(w|z)
Formally, the joint probability of seeing a given document and word together is:

P(D, W) = P(D) Σ_Z P(Z|D) P(W|Z)
Intuitively, the right-hand side of this equation is telling us how likely it is to see some document, and then based upon the distribution of topics of that document, how likely it is to find a certain word within that document.

In this case, P(D), P(Z|D), and P(W|Z) are the parameters of our model. P(D) can be determined directly from our corpus. P(Z|D) and P(W|Z) are modeled as multinomial distributions, and can be trained using the expectation-maximization algorithm (EM).
Without going into a full mathematical treatment of the algorithm, EM is a method of finding the likeliest parameter estimates for a model which depends on unobserved, latent variables (in our case, the topics).

Interestingly, P(D, W) can be equivalently parameterized using a different set of 3 parameters:

P(D, W) = Σ_Z P(Z) P(D|Z) P(W|Z)
We can understand this equivalency by looking at the model as a generative process. In our first parameterization, we were starting with the document with P(d), and then generating the topic with P(z|d), and then generating the word with P(w|z). In this parameterization, we are starting with the topic with P(z), and then independently generating the document with P(d|z) and the word with P(w|z).

(Image source: https://www.slideshare.net/NYCPredictiveAnalytics/introduction-to-probabilistic-latent-semantic-analysis)
The reason this new parameterization is so interesting is because we can see a direct parallel between our pLSA model and our LSA model:

P(D, W) = Σ_Z P(Z) P(D|Z) P(W|Z)  ⟷  A = U*S*Vᵀ

where the probability of our topic P(Z) corresponds to the diagonal matrix of our singular topic probabilities, the probability of our document given the topic P(D|Z) corresponds to our document-topic matrix U, and the probability of our word given the topic P(W|Z) corresponds to our term-topic matrix V.
So what does that tell us? Although it looks quite different and approaches the problem in a very different way, pLSA really just adds a probabilistic treatment of topics and words on top of LSA. It is a far more flexible model, but still has a few problems. In particular:

- Because we have no parameters to model P(D), we don't know how to assign probabilities to new documents
- The number of parameters for pLSA grows linearly with the number of documents we have, so it is prone to overfitting

We will not look at any code for pLSA because it is rarely used on its own.
In general, when people are looking for a topic model beyond the baseline performance LSA gives, they turn to LDA. LDA, the most common type of topic model, extends pLSA to address these issues.

LDA

LDA stands for Latent Dirichlet Allocation. LDA is a Bayesian version of pLSA.
In particular, it uses dirichlet priors for the document-topic and word-topic distributions, lending itself to better generalization.

I am not going into an in-depth treatment of dirichlet distributions, since there are very good intuitive explanations here and here. As a brief overview, however, we can think of dirichlet as a "distribution over distributions." In essence, it answers the question: "given this type of distribution, what are some actual probability distributions I am likely to see?"
Consider the very relevant example of comparing probability distributions of topic mixtures. Let's say the corpus we are looking at has documents from 3 very different subject areas. If we want to model this, the type of distribution we want will be one that very heavily weights one specific topic, and doesn't give much weight to the rest at all. If we have 3 topics, then some specific probability distributions we'd likely see are:

- Mixture X: 90% topic A, 5% topic B, 5% topic C
- Mixture Y: 5% topic A, 90% topic B, 5% topic C
- Mixture Z: 5% topic A, 5% topic B, 90% topic C
If we draw a random probability distribution from this dirichlet distribution, parameterized by large weights on a single topic, we would likely get a distribution that strongly resembles either mixture X, mixture Y, or mixture Z. It would be very unlikely for us to sample a distribution that is 33% topic A, 33% topic B, and 33% topic C.

That's essentially what a dirichlet distribution provides: a way of sampling probability distributions of a specific type.
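As a quick illustration, here is a small numpy sketch of this sampling behavior (the concentration values below are made up for illustration; a symmetric parameter below 1 concentrates each sample on one topic, while larger values push samples toward an even mixture):

import numpy as np

# sparse prior: each sampled mixture tends to load heavily on a single topic,
# like mixtures X, Y, and Z above
print(np.random.dirichlet(alpha=[0.1, 0.1, 0.1], size=3))
# e.g. rows like [0.93, 0.02, 0.05]

# dense prior: samples hover near the uniform 33/33/33 mixture instead
print(np.random.dirichlet(alpha=[10.0, 10.0, 10.0], size=3))
# e.g. rows like [0.32, 0.35, 0.33]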
Recall the model for pLSA:

P(D, W) = P(D) Σ_Z P(Z|D) P(W|Z)
In pLSA, we sample a document, then a topic based on that document, then a word based on that topic. Here is the model for LDA:

θ ∼ Dir(α),  z ∼ Multinomial(θ),  φ ∼ Dir(β),  w ∼ Multinomial(φ_z)
From a dirichlet distribution Dir(α), we draw a random sample representing the topic distribution, or topic mixture, of a particular document. This topic distribution is θ. From θ, we select a particular topic Z based on the distribution.

Next, from another dirichlet distribution Dir(β), we select a random sample representing the word distribution of the topic Z. This word distribution is φ. From φ, we choose the word w.

Formally, the process for generating each word from a document is as follows (beware this algorithm uses c instead of z to represent the topic):
(Algorithm: see https://cs.stanford.edu/~ppasupat/a9online/1140.html)
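To make the generative story concrete, here is a small numpy sketch of drawing one word under this process (all dimensions and hyperparameter values below are made up for illustration):

import numpy as np

num_topics, vocab_size = 3, 10
alpha = np.full(num_topics, 0.1)  # document-topic concentration
beta = np.full(vocab_size, 0.01)  # topic-word concentration

# per-topic word distributions phi: one draw from Dir(beta) per topic
phi = np.random.dirichlet(beta, size=num_topics)

# per-document topic mixture theta: one draw from Dir(alpha)
theta = np.random.dirichlet(alpha)

# generate one word: pick a topic z from theta, then a word w from phi[z]
z = np.random.choice(num_topics, p=theta)
w = np.random.choice(vocab_size, p=phi[z])
print(z, w)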
LDA typically works better than pLSA because it can generalize to new documents easily. In pLSA, the document probability is a fixed point in the dataset. If we haven't seen a document, we don't have that data point. In LDA, the dataset serves as training data for the dirichlet distribution of document-topic distributions. If we haven't seen a document, we can easily sample from the dirichlet distribution and move forward from there.

Code

LDA is easily the most popular (and typically most effective) topic modeling technique out there. It's available in gensim for easy use:
from gensim.corpora import Dictionary, MmCorpus
from gensim.models.ldamodel import LdaModel

document = "This is some document..."

# load id -> word mapping (the dictionary)
id2word = Dictionary.load_from_text('wiki_en_wordids.txt')

# load corpus iterator
mm = MmCorpus('wiki_en_tfidf.mm')

# extract 100 LDA topics, updating once every 10,000 documents
lda = LdaModel(corpus=mm, id2word=id2word, num_topics=100,
               update_every=1, chunksize=10000, passes=1)

# use LDA model: transform new doc to bag-of-words, then apply lda
doc_bow = id2word.doc2bow(document.split())
doc_lda = lda[doc_bow]

# doc_lda is a vector of length num_topics representing the weighted
# presence of each topic in the doc
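As a brief usage note, gensim can also print the most probable words per topic, which is one way to see topic characterizations like the example in the next paragraph (a small sketch continuing from the model above):

# print the top 10 words for each of the first 5 topics
for topic_id, words in lda.show_topics(num_topics=5, num_words=10,
                                       formatted=False):
    print(topic_id, [word for word, prob in words])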
With LDA, we can extract human-interpretable topics from a document corpus, where each topic is characterized by the words it is most strongly associated with. For example, topic 2 could be characterized by terms such as "oil, gas, drilling, pipes, Keystone, energy," etc. Furthermore, given a new document, we can obtain a vector representing its topic mixture, e.g. 5% topic 1, 70% topic 2, 10% topic 3, etc. These vectors are often very useful for downstream applications.

LDA in Deep Learning: lda2vec

So where do these topic models factor into more complex natural language processing problems?
At the beginning of this post, we talked about how important it is to be able to extract meaning from text at every level — word, paragraph, document. At the document level, we now know how to represent the text as mixtures of topics. At the word level, we typically use something like word2vec to obtain vector representations. lda2vec is an extension of word2vec and LDA that jointly learns word, document, and topic vectors.

Here's how it works.

lda2vec specifically builds on top of the skip-gram model of word2vec to generate word vectors. If you're not familiar with skip-gram and word2vec, you can read up on it here, but essentially it's a neural net that learns a word embedding by trying to use the input word to predict surrounding context words.
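For reference, a minimal skip-gram run in gensim might look like this (a sketch with a toy corpus; sg=1 selects the skip-gram architecture, and the vector_size argument assumes gensim 4.x):

from gensim.models import Word2Vec

# toy corpus: a list of tokenized sentences
sentences = [["topic", "models", "uncover", "latent", "topics"],
             ["word", "vectors", "capture", "word", "meaning"]]

# sg=1 -> skip-gram: predict surrounding context words from the input word
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, sg=1)

print(model.wv["topic"])  # the learned 50-dimensional word vector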
With lda2vec, instead of using the word vector directly to predict context words, we leverage a context vector to make the predictions. This context vector is created as the sum of two other vectors: the word vector and the document vector.

The word vector is generated by the same skip-gram word2vec model discussed earlier. The document vector is more interesting. It is really a weighted combination of two other components:

- the document weight vector, representing the "weights" (later to be transformed into percentages) of each topic in the document
- the topic matrix, representing each topic and its corresponding vector embedding
Together, the document vector and the word vector generate "context" vectors for each word in the document. The power of lda2vec lies in the fact that it not only learns word embeddings (and context vector embeddings) for words, it simultaneously learns topic representations and document representations as well.
(lda2vec architecture diagram: https://multithreaded.stitchfix.com/blog/2016/05/27/lda2vec)
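In rough numpy terms, the composition described above looks something like this (a conceptual sketch only, with made-up dimensions; it shows how the pieces combine, not lda2vec's actual training code):

import numpy as np

embedding_dim, num_topics = 300, 20

word_vector = np.random.randn(embedding_dim)       # from skip-gram word2vec
doc_weights = np.random.randn(num_topics)          # unnormalized topic weights for one doc
topic_matrix = np.random.randn(num_topics, embedding_dim)  # one embedding per topic

# softmax turns the document weights into topic percentages
doc_proportions = np.exp(doc_weights) / np.exp(doc_weights).sum()

# document vector: weighted combination of the topic embeddings
doc_vector = doc_proportions @ topic_matrix

# context vector: sum of word vector and document vector, used
# (instead of the word vector alone) to predict context words
context_vector = word_vector + doc_vector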
For a more detailed overview of the model, check out Chris Moody's original blog post (Moody created lda2vec in 2016). Code can be found at Moody's github repository and this Jupyter Notebook example.

Conclusion

All too often, we treat topic models as black-box algorithms that "just work." Fortunately, unlike many neural nets, topic models are actually quite interpretable and much more straightforward to diagnose, tune, and evaluate. Hopefully this blog post has been able to explain the underlying math, motivations, and intuition you need, and leave you enough high-level code to get started. Please leave your thoughts in the comments, and happy hacking!

About Nanonets

Nanonets makes it super easy to use Deep Learning. You can build a model with your own data to achieve high accuracy & use our APIs to integrate the same in your application.