Topic Models - inf.ed.ac.uk
TRANSCRIPT
Topic Models
• We want to find themes (or topics) in documents, which is useful for, e.g., searching or visualising themes in document collections.
• We don't want to do manual annotation – time-consuming and reliant on the availability of human annotators.
• The approach should automatically tease out the topics.
• Essentially a clustering problem, where both words and documents are clustered.
Topic Models
Simple intuition: documents exhibit multiple topics
Latent Dirichlet Allocation (LDA)
• Each topic is a distribution over words
• Each document is a mixture of corpus-wide topics
• Each word is drawn from one of these topics
Posterior Distribution
• In reality, we only observe the documents
• The other structures are hidden variables
Posterior Distribution
• Our goal is to infer the hidden variables
• i.e. compute their distribution conditioned on the documents:
p(topics, proportions, assignments | documents)
LDA: key assumptions
• Documents exhibit multiple topics (but typically not many)
• LDA is a probabilistic model with a corresponding generative process (each document is generated by this process)
• A topic is a distribution over a fixed vocabulary (topics are assumed to be generated first, before the documents)
• Only the number of topics is specified in advance
Document Generation
1. Choose the number of words N the document will have.
2. Choose a topic mixture for the document, according to a Dirichlet distribution over a set of T topics.
3. Generate each word w_i in the document by:
   1. randomly choosing a topic from the distribution over topics;
   2. randomly choosing a word from the corresponding topic (a distribution over the vocabulary).
Document Generation
1. Pick 5 to be the number of words in D.
2. Decide D will be ½ about food and ½ about cute animals.
3. Pick the first word to come from the food topic: broccoli.
4. Pick the second word to come from the cute animals topic: panda.
5. Pick the third word from the cute animals topic: adorable.
6. Pick the fourth word from the food topic: cherries.
7. Pick the fifth word from the food topic: eating.
What is the document generated under the LDA model?
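The worked example above can be run directly. A minimal sketch: the extra words padding out each topic's vocabulary are made up for illustration, only the five words named in the example come from the slide.

```python
import random

# Two hand-built topics, as in the example above; the words beyond
# those named on the slide are made up to fill out each vocabulary.
topics = {
    "food": ["broccoli", "cherries", "eating", "bananas", "spinach"],
    "cute animals": ["panda", "adorable", "kitten", "puppy", "hamster"],
}
# Step 2: D is half about food, half about cute animals.
mixture = {"food": 0.5, "cute animals": 0.5}

def generate_document(n_words, rng):
    """Steps 3-7: for each word, pick a topic, then pick a word from it."""
    words = []
    for _ in range(n_words):
        topic = rng.choices(list(mixture), weights=list(mixture.values()))[0]
        words.append(rng.choice(topics[topic]))
    return words

doc = generate_document(5, random.Random(0))  # step 1: N = 5
```

As for the slide's question: under LDA the generated document is simply the resulting bag of words — in the example, {broccoli, panda, adorable, cherries, eating} — since word order is ignored.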
Learning
1. Go through each document and randomly assign each word in the document to one of T topics.
2. For each document d, go through each word w in d and for each topic t compute:
   1. p(topic t | document d)
   2. p(word w | topic t)
3. Reassign w to a new topic, choosing topic t with probability p(topic t | document d) × p(word w | topic t).
4. After repeating the previous step a large number of times, you will eventually reach a roughly steady state where the assignments are pretty good.
Generative Process

For j = 1 … T topics:
    choose φ^(j) ~ Dirichlet(β)
For d = 1 … D documents:
    choose θ^(d) ~ Dirichlet(α)
    For i = 1 … N_d words in doc d:
        choose z_i ~ Multinomial(θ^(d))
        choose w_i ~ Multinomial(φ^(z_i))

φ^(j) = (φ^(j)_1, …, φ^(j)_V): prob. of each word in topic j
θ^(d) = (θ^(d)_1, …, θ^(d)_T): prob. of each topic in doc d
z_i: topic of word i
w_i: identity of word i
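The generative process above can be written out with numpy's Dirichlet and categorical samplers. A minimal sketch: the sizes T, D, V, the fixed document length, and the hyperparameter values are illustrative choices, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, V = 3, 4, 10        # number of topics, documents, vocabulary size
alpha, beta = 0.5, 0.1    # Dirichlet hyperparameters (illustrative values)
N_d = 20                  # words per document, fixed here for simplicity

# For j = 1 ... T: phi^(j) ~ Dirichlet(beta), a distribution over the vocabulary.
phi = rng.dirichlet(np.full(V, beta), size=T)   # shape (T, V); rows sum to 1

docs = []
for d in range(D):
    # theta^(d) ~ Dirichlet(alpha): this document's topic proportions.
    theta = rng.dirichlet(np.full(T, alpha))
    words = []
    for i in range(N_d):
        z_i = rng.choice(T, p=theta)        # z_i ~ Multinomial(theta^(d))
        w_i = rng.choice(V, p=phi[z_i])     # w_i ~ Multinomial(phi^(z_i))
        words.append(int(w_i))
    docs.append(words)
```

Each document comes out as a list of word indices; with a small alpha, most of a document's words are drawn from only a few topics, matching the "multiple topics, but typically not many" assumption.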
The Graphical Model
• Nodes are random variables; edges indicate dependence • Shaded nodes are observed; plates indicate replicated variables
The Graphical Model
• Nodes are random variables; edges indicate dependence • Shaded nodes are observed; plates indicate replicated variables
[Figure labels: proportions parameter (α); per-document topic proportions (θ); per-word topic assignment (z); observed words (w); topics (φ); topics parameter (β); plates replicate over words and over documents.]
The Graphical Model
P(w_i) = Σ_{j=1..T} P(w_i | z_i = j) P(z_i = j)

• P(z_i = j) is the probability that topic j was sampled for the i-th word token
• P(w_i | z_i = j) is the probability of word w_i under topic j
• φ^(j) = P(w | z = j): the word distribution of topic j
• θ^(d) = P(z): the topic distribution of document d
Dirichlet DistribuAon
p(θ | α) = Dir(α_1, …, α_T) = [Γ(Σ_j α_j) / Π_j Γ(α_j)] · Π_{j=1..T} θ_j^{α_j − 1}
• The Dirichlet distribution is an exponential-family distribution over the simplex, i.e. positive vectors that sum to one.
• It is conjugate to the multinomial distribution: given a multinomial observation, the posterior distribution of θ is a Dirichlet.
• The parameter α smoothes the topic distribution in each document.
• β smoothes the word distribution in every topic.
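Conjugacy makes the posterior update trivial: observing multinomial counts just adds them to the Dirichlet parameters. A minimal sketch, where the prior and the observed counts are made-up numbers:

```python
import numpy as np

T = 3
alpha = np.full(T, 2.0)             # symmetric Dir(2, 2, 2) prior (made up)
counts = np.array([5.0, 0.0, 1.0])  # observed multinomial counts (made up)

# Posterior is Dirichlet(alpha + counts): conjugacy in one line.
posterior = alpha + counts

# The posterior mean is the smoothed, normalized count vector --
# exactly the smoothing role alpha plays in LDA.
posterior_mean = posterior / posterior.sum()

# Samples from the posterior live on the simplex: each row sums to 1.
samples = np.random.default_rng(0).dirichlet(posterior, size=1000)
```

Note how the zero count still gets nonzero posterior mass thanks to the prior: this is the smoothing the slide refers to.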
Inference with Gibbs Sampling
• An iterative process
• Start with a random topic assignment for each word
– Assume you know (from the previous iteration) the topic assignments of all other words
– Determine the probability of each topic assignment for the current word given the rest of the data
– Sample a new assignment from these probabilities
• Iterate until convergence
![Page 23: Topic&Models& - inf.ed.ac.uk · Topic&Models& • We&wantto&find& themes&(or&topics)&in&documents, which&is&useful&for,&e.g.&searching&or&visualising& themes&in&documentcollecAons&](https://reader034.vdocuments.net/reader034/viewer/2022050503/5f95316cff705041a706d193/html5/thumbnails/23.jpg)
Gibbs Sampling
• The collection of documents is represented as a set of word tokens, with a word index w_i and a document index d_i for each token i.
• From the conditional distribution below a topic is sampled and stored as the new topic assignment for the word token:

P(z_i = j | z_{−i}, w_i, d_i, ·) ∝ (C^WT_{w_i,j} + β) / (Σ_{w=1..W} C^WT_{w,j} + Wβ) · (C^DT_{d_i,j} + α) / (Σ_{t=1..T} C^DT_{d_i,t} + Tα)

where:
z_i = j: topic assignment of token i to topic j
z_{−i}: topic assignments of all other word tokens
C^WT: matrix of word–topic counts, with dimensions W × T
C^DT: matrix of document–topic counts, with dimensions D × T
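The update above can be sketched as a collapsed Gibbs sweep over all tokens. A minimal sketch: the tiny corpus, the hyperparameter values, and the number of sweeps are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
W_size, D_size, T = 6, 2, 2     # vocabulary size, #documents, #topics
alpha, beta = 0.1, 0.01

words = np.array([0, 1, 2, 3, 4, 5, 0, 1])   # w_i: word id of token i
docs  = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # d_i: doc id of token i
z = rng.integers(0, T, size=len(words))      # random initial assignments

# Build the word-topic (W x T) and document-topic (D x T) count matrices.
C_wt = np.zeros((W_size, T))
C_dt = np.zeros((D_size, T))
for i in range(len(words)):
    C_wt[words[i], z[i]] += 1
    C_dt[docs[i], z[i]] += 1

def gibbs_sweep():
    for i in range(len(words)):
        w, d = words[i], docs[i]
        # Remove token i's current assignment from the counts (z_{-i}).
        C_wt[w, z[i]] -= 1
        C_dt[d, z[i]] -= 1
        # P(z_i = j | z_{-i}, w_i, d_i) for every topic j, as above.
        p = ((C_wt[w] + beta) / (C_wt.sum(axis=0) + W_size * beta)
             * (C_dt[d] + alpha) / (C_dt[d].sum() + T * alpha))
        z[i] = rng.choice(T, p=p / p.sum())   # sample, then add back
        C_wt[w, z[i]] += 1
        C_dt[d, z[i]] += 1

for _ in range(20):
    gibbs_sweep()
```

Removing the token's own count before computing the conditional is what makes the sampler "collapsed": φ and θ are integrated out and only the count matrices are tracked.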
![Page 24: Topic&Models& - inf.ed.ac.uk · Topic&Models& • We&wantto&find& themes&(or&topics)&in&documents, which&is&useful&for,&e.g.&searching&or&visualising& themes&in&documentcollecAons&](https://reader034.vdocuments.net/reader034/viewer/2022050503/5f95316cff705041a706d193/html5/thumbnails/24.jpg)
Posterior Estimates of φ and θ

φ_{ij} = (C^WT_{ij} + β) / (Σ_{w=1..W} C^WT_{wj} + Wβ)

θ_{dj} = (C^DT_{dj} + α) / (Σ_{t=1..T} C^DT_{dt} + Tα)

• φ_{ij}: normalizes the word–topic count matrix (probability of word type i for topic j)
• θ_{dj}: normalizes the counts in the document–topic matrix (probability that document d belongs to topic j)
• Technically these are the posterior means of the parameters φ and θ.
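These estimates are just smoothed normalizations of the two count matrices; a minimal sketch with made-up counts:

```python
import numpy as np

alpha, beta = 0.1, 0.01
C_wt = np.array([[3, 0],       # word-topic counts, shape (W, T), made up
                 [1, 2],
                 [0, 4]])
C_dt = np.array([[2, 1],       # document-topic counts, shape (D, T), made up
                 [2, 4]])
W, T = C_wt.shape

# phi[i, j] = P(word i | topic j): normalize each topic's column of counts.
phi = (C_wt + beta) / (C_wt.sum(axis=0) + W * beta)

# theta[d, j] = P(topic j | doc d): normalize each document's row of counts.
theta = (C_dt + alpha) / (C_dt.sum(axis=1, keepdims=True) + T * alpha)
```

Each column of phi and each row of theta sums to one, so they can be read off directly as the word distributions of the topics and the topic proportions of the documents.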
![Page 25: Topic&Models& - inf.ed.ac.uk · Topic&Models& • We&wantto&find& themes&(or&topics)&in&documents, which&is&useful&for,&e.g.&searching&or&visualising& themes&in&documentcollecAons&](https://reader034.vdocuments.net/reader034/viewer/2022050503/5f95316cff705041a706d193/html5/thumbnails/25.jpg)
Example Inference
Data: the OCR'ed collection of Science from 1990-2000: 17K documents, 11M words, 20K unique terms (stop words and rare words removed).
Model: a 100-topic LDA model.
Why does LDA "work"?
• LDA trades off two goals:
1. For each document, allocate its words to as few topics as possible.
2. For each topic, assign high probability to as few terms as possible.
• These goals are at odds:
– Putting a document in a single topic makes goal 2 hard: all of its words must have probability under that topic.
– Putting very few words in each topic makes goal 1 hard: to cover a document's words, the model must assign many topics to it.
• Trading off these goals finds groups of tightly co-occurring words.
LDA summary
• LDA is a probabilistic model of text. It casts the problem of discovering themes in large document collections as posterior inference.
• It allows visualisation of the hidden thematic structure of large document collections.
• LDA is a building block of many applications.
• It is popular because organizing and finding patterns in data has become important in the sciences, humanities, industry, etc.
Extending Topic Models
• LDA can be embedded in more complex models, embodying further intuitions about the structure of the text.
• E.g. it is used in models that account for syntax, authors, correlation or hierarchies.
Extending Topic Models
• The data-generating distribution can be changed: we can apply mixed-membership assumptions to many kinds of data.
• E.g. we can build models of images, social networks, music, computer code, genetic data, etc.
Extending Topic Models
• The posterior can be used in creative ways.
• E.g. we can use inferences in information retrieval, recommendation, visualisation, summarization, etc.