topic models - socialdatascience.network
TRANSCRIPT
Dirk Hovy
@dirk_hovy
Natural Language Processing
Topic Models
• Understand what information topic models can and cannot provide
• Learn about the Latent Dirichlet Allocation (LDA) model
• Understand the parameters influencing the output
• Learn about the Structured Author Topic Model
• Learn about evaluation criteria
Goals for Today
2
What Gets Funded?
3
Talley et al., 2011
What do People Want in Smart Devices?
4
Price: price, buy, gift, christmas, worth
Integration: light, control, command, system, integration
Sound: speaker, sound, quality, small, music
Accuracy: question, answer, time, response, quick
Skills: music, weather, news, alarm, timer
Fun: fun, family, kid, useful, helpful
Ease of Use: easy, use, set, setup, simple
Music: music, play, song, playlist, favorite
Nguyen & Hovy (2019)
Topics are Word Lists
• "pasta, pizza, wine, sauce, spaghetti"
• "BLEU, Bert, encoder, decoder, transformer"
5
TOPIC OR NOT?
SOME DOMAIN KNOWLEDGE REQUIRED…
How to use Topic Models
6
CORPUS → MODEL → TOPICS → DESCRIPTORS
CORPUS: preprocess
MODEL: find best #topics, find best parameters
TOPICS: check output, choose top 5 words
DESCRIPTORS: name

[pasta, pizza, wine, sauce, spaghetti] → FOOD
Preprocessing
![Page 8: topic models - socialdatascience.network](https://reader033.vdocuments.net/reader033/viewer/2022042221/625a508277b39c6c5e0bf470/html5/thumbnails/8.jpg)
Preprocessing• Be aggressive:
• lemmatization,
• stopwords,
• replace numbers/user names,
• join collocations
• use TFIDF
• use a minimum document frequency of 10, 20, 50, or even 100
• use a maximum document frequency of 50% down to 10%
8
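These frequency cut-offs map directly onto vectorizer settings. A minimal sketch using scikit-learn's `TfidfVectorizer` (the library choice and the toy corpus are assumptions; the slides don't prescribe a tool):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "the cat ran", "the dog barked"]  # toy corpus

# min_df drops rare words, max_df drops near-ubiquitous ones;
# on a real corpus you would use e.g. min_df=10 and max_df=0.5
vectorizer = TfidfVectorizer(min_df=2, max_df=0.9, stop_words="english")
X = vectorizer.fit_transform(docs)

print(sorted(vectorizer.vocabulary_))  # only 'cat' survives the cut-offs
```

With min_df=2, every word appearing in a single document is pruned, and the stopword list removes "the" before counting.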
Pre-processing steps
9
<div id="text">I've been in New York in 2011, but didn’t like it. I preferred Los Angeles.</div>
GOAL: MINIMIZE VARIATION
Pre-processing steps
• Remove formatting (e.g. HTML)
• Segment sentences
• Tokenize words
• Normalize words
• numbers
• lemmas vs. stems
• Remove unwanted words
• stopwords
• content words (use POS tagging!)
• join collocations
10
I've been in New York in 2011, but didn’t like it. I preferred Los Angeles.
Pre-processing steps
• Remove formatting (e.g. HTML)
• Segment sentences
• Tokenize words
• Normalize words
• numbers
• lemmas vs. stems
• Remove unwanted words
• stopwords
• content words (use POS tagging!)
• join collocations
11
I've been in New York in 2011, but didn’t like it.
I preferred Los Angeles.
• Remove formatting (e.g. HTML)
• Segment sentences
• Tokenize words
• Normalize words
• numbers
• lemmas vs. stems
• Remove unwanted words
• stopwords
• content words (use POS tagging!)
• join collocations
12
I ’ve been in New York in 2011 , but did n’t like it .
I preferred Los Angeles .
Pre-processing steps
• Remove formatting (e.g. HTML)
• Segment sentences
• Tokenize words
• Normalize words
• numbers
• lemmas vs. stems
• Remove unwanted words
• stopwords
• content words (use POS tagging!)
• join collocations
13
i ‘ve been in new york in 0000 , but did n’t like it .
i preferred los angeles .
Pre-processing steps
• Remove formatting (e.g. HTML)
• Segment sentences
• Tokenize words
• Normalize words
• numbers
• lemmas vs. stems
• Remove unwanted words
• stopwords
• content words (use POS tagging!)
• join collocations
14
i have be in new york in 0000 , but do not like it .
i prefer los angeles .
Pre-processing steps
• Remove formatting (e.g. HTML)
• Segment sentences
• Tokenize words
• Normalize words
• numbers
• lemmas vs. stems
• Remove unwanted words
• stopwords
• content words (use POS tagging!)
• join collocations
15
i new york 0000 , like .
i prefer los angeles .
Pre-processing steps
• Remove formatting (e.g. HTML)
• Segment sentences
• Tokenize words
• Normalize words
• numbers
• lemmas vs. stems
• Remove unwanted words
• stopwords
• content words (use POS tagging!)
• join collocations
16
new york 0000 like
prefer los angeles
CONTENT = (NOUN, VERB, NUM)
Pre-processing steps
• Remove formatting (e.g. HTML)
• Segment sentences
• Tokenize words
• Normalize words
• numbers
• lemmas vs. stems
• Remove unwanted words
• stopwords
• content words (use POS tagging!)
• join collocations
17
new_york 0000 like
prefer los_angeles
Pre-processing steps
18
new_york 0000 like
prefer los_angeles
<div id="text">I've been in New York in 2011, but didn’t like it. I preferred Los Angeles.</div>
"BAG OF WORDS"
MINIMAL VARIATION
Pre-processing steps
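The steps above can be sketched end to end in plain Python. This is a toy illustration: the stopword list and collocation set are hand-made stand-ins, and lemmatization is skipped for brevity (a real pipeline would lemmatize, e.g. with spaCy, turning "preferred" into "prefer"):

```python
import re

STOPWORDS = {"i", "ve", "been", "in", "but", "didn", "t", "it"}  # toy list
COLLOCATIONS = {("new", "york"), ("los", "angeles")}             # phrases to join

def preprocess(text):
    text = text.lower()                          # normalize case
    text = re.sub(r"\d+", "0000", text)          # normalize numbers
    tokens = re.findall(r"[a-z0-9]+", text)      # crude tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]
    out, i = [], 0
    while i < len(tokens):                       # join known collocations
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in COLLOCATIONS:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

print(preprocess("I've been in New York in 2011, but didn't like it. I preferred Los Angeles."))
# ['new_york', '0000', 'like', 'preferred', 'los_angeles']
```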
Representing Text
N-grams
20
"As Gregor Samsa awoke one morning from uneasy dreams, he found himself transformed in his bed into a gigantic insect-like creature."
As, Gregor, Samsa, awoke, one, morning, from, uneasy, dreams, ...
As_Gregor, Gregor_Samsa, Samsa_awoke, awoke_one, one_morning, ...
As_Gregor_Samsa, Gregor_Samsa_awoke, Samsa_awoke_one, awoke_one_morning, ...
As_Gregor_Samsa_awoke, Gregor_Samsa_awoke_one, Samsa_awoke_one_morning, ...
Bigrams
Trigrams
4-grams
Unigrams
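N-grams are just sliding windows over the token list; a minimal sketch (the helper name is ours, not from the slides):

```python
def ngrams(tokens, n):
    """All length-n windows over the token list, joined with '_'."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "As Gregor Samsa awoke one morning".split()
print(ngrams(tokens, 1)[:3])  # ['As', 'Gregor', 'Samsa']
print(ngrams(tokens, 2)[:3])  # ['As_Gregor', 'Gregor_Samsa', 'Samsa_awoke']
print(ngrams(tokens, 3)[:2])  # ['As_Gregor_Samsa', 'Gregor_Samsa_awoke']
```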
Bags of words (BOW)
21
{ 'shakespeare': 6, 'in': 20, 'love': 6, 'is': ...}
COUNT WORDS → VECTORIZE FEATURES

      shakespeare  in  love  ...  beer
X = [      6       20    6   ...    0 ]
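Counting and vectorizing can be sketched with the standard library (the vocabulary below is a made-up stand-in):

```python
from collections import Counter

def vectorize(tokens, vocab):
    """Turn a token list into a count vector over a fixed vocabulary."""
    counts = Counter(tokens)           # COUNT WORDS
    return [counts[w] for w in vocab]  # VECTORIZE FEATURES

vocab = ["shakespeare", "in", "love", "beer"]
print(vectorize(["shakespeare", "in", "love", "in"], vocab))  # [1, 2, 1, 0]
```

Words outside the vocabulary are silently dropped, which is exactly what a fitted vectorizer does at transform time.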
Counting Trouble
22
…AND A MAN NAMED ZIPF
50%
THE OTHER 50% ...
Finding Important Words: TF-IDF
Some Words are Just More Interesting…
24
[Word cloud: "the" appears over and over; "sustainable" only twice]
Karen Spärck Jones
1935–2007
• Became a teacher before starting CS career at Cambridge
• Laid the foundation for modern NLP, Google Search, text classification
• Campaigned for more women in CS
• Namesake of prestigious CS prize
Problems with Term Frequency
26
BORING STUFF
OBSCURE STUFF
JUST RIGHT ???
TOLD YA SO…
Document and Term Frequency
27
A FEATURE w appears 2, 5, 1, and 1 times in four documents, and not at all in a fifth (X):
TERM FREQUENCY (SUM): TF(w) = 2 + 5 + 1 + 1 = 9
DOCUMENT FREQUENCY (COUNT): df(w) = 4
IDF(w) = log(N / df(w))
Putting it Together
28
TFIDF(w) = TF(w) · log(N / df(w))
HOW OFTEN WE SAW THE WORD
ADJUSTED BY HOW MANY DOCUMENTS IT APPEARS IN
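The formula translates directly into a few lines of Python (a sketch of the plain definition; toolkits such as scikit-learn use smoothed variants of the same idea):

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists -> {word: TF(w) * log(N / df(w))}."""
    N = len(docs)
    tf = Counter(w for doc in docs for w in doc)       # corpus-wide term frequency
    df = Counter(w for doc in docs for w in set(doc))  # #documents containing w
    return {w: tf[w] * math.log(N / df[w]) for w in tf}

scores = tfidf([["the", "whale"], ["the", "ship"], ["the", "whale", "whale"]])
print(scores["the"])                     # 0.0 -- in every document, so IDF = log(1) = 0
print(scores["whale"] > scores["ship"])  # True -- frequent but not ubiquitous
```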
Document and Term Frequency
29
Words in "Moby Dick":

word     tf    idf       tfidf
ye       467   4.257380  148.497079
chapter  171   5.039475  147.504638
whale    1150  3.262357  139.755743
man      525   3.982412  106.932953
ahab     511   4.019453  103.357774
Latent Dirichlet Allocation
How to Generate Documents
• Draw a topic distribution 𝜃
• For i in N:
  • Draw a topic from 𝜃
  • Sample a word from that topic's word distribution z
31
𝜃 = (0.14, 0.29, 0.29, 0.14, 0.14)
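The generative story can be simulated in a few lines (the topics, words, and probabilities below are invented for illustration):

```python
import random

rng = random.Random(0)

theta = [0.7, 0.3]  # per-document topic distribution, drawn once per document
topic_words = {     # z: one word distribution per topic
    0: (["pasta", "pizza", "wine"], [0.5, 0.3, 0.2]),
    1: (["encoder", "decoder", "transformer"], [0.4, 0.4, 0.2]),
}

def generate_document(n_words):
    doc = []
    for _ in range(n_words):
        topic = rng.choices(range(len(theta)), weights=theta)[0]  # draw a topic from theta
        words, probs = topic_words[topic]
        doc.append(rng.choices(words, weights=probs)[0])          # sample a word from that topic
    return doc

print(generate_document(8))
```

Training LDA inverts this process: given only the words, infer the 𝜃 and z that most plausibly generated them.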
Topics per Document
32
𝜃 = P(topic|document)

             Topic 1  Topic 2  Topic 3  Topic 4  Topic 5
Document 1    0.04     0.65     0.13     0.13     0.04
Document 2    0.14     0.29     0.29     0.14     0.14
Document 3    0.17     0.33     0.17     0.17     0.17
Document 4    0.20     0.07     0.07     0.20     0.47
…
Document N    0.79     0.04     0.04     0.11     0.04
Words per Topic
33
z = P(word|topic)
[Figure: one word distribution per topic (Topics 1–5); each topic's top words are its TOPIC DESCRIPTORS]
Plate Notation
34
𝛂 → 𝜃i → zij → wij ← 𝛃
(REPEAT for N words in each of M documents)
M: #DOCUMENTS, N: #WORDS
𝜃i: TOPIC DISTRO for document i; zij, wij: topic and word at position j; 𝛃 governs the WORD DISTRO per topic
HYPERPARAMETERS:
𝛂: HOW MANY TOPICS PER DOCUMENT?
𝛃: HOW SPECIFIC ARE WORDS TO TOPICS?
Dirichlet Distributions
35
"DISTRIBUTION GENERATOR"
PEAKED PARETO UNIFORM
![Page 36: topic models - socialdatascience.network](https://reader033.vdocuments.net/reader033/viewer/2022042221/625a508277b39c6c5e0bf470/html5/thumbnails/36.jpg)
Parameters: 𝛂
36
𝛂 = 0.01 → MORE PEAKED: ONE DOMINANT TOPIC/DOC, e.g. 𝜃 = (0.79, 0.04, 0.04, 0.11, 0.04)
𝛂 = 1000 → MORE UNIFORM: EVERY TOPIC IN EVERY DOCUMENT, e.g. 𝜃 = (0.19, 0.21, 0.20, 0.19, 0.21)
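The effect of 𝛂 is easy to see by sampling from a symmetric Dirichlet, here via normalized Gamma draws from the standard library (a sketch; `numpy.random.dirichlet` does the same thing directly):

```python
import random

def sample_dirichlet(alpha, k, rng):
    """One draw from a symmetric Dirichlet(alpha) over k topics."""
    gammas = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(gammas)
    return [g / total for g in gammas]

rng = random.Random(42)
peaked = sample_dirichlet(0.01, 5, rng)   # alpha = 0.01: one dominant topic
uniform = sample_dirichlet(1000, 5, rng)  # alpha = 1000: nearly even topic mix
print([round(p, 2) for p in peaked])
print([round(p, 2) for p in uniform])
```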
Parameters: 𝛃
37
𝛃 = 0.01 → WORDS ARE HIGHLY TOPIC-SPECIFIC
𝛃 = 1000 → ALL WORDS FOR ALL TOPICS
Training and Parameters
Evaluating LDA
39
MODEL-INHERENT: FIT the model, then PREDICT held-out text
PERPLEXITY = 2^(−Σx p(x) log2 p(x))

CONTENT-BASED: WORD INTRUSION — WHICH ONE'S WRONG?
[apple, banana, pear, lime, orange]
[apple, banana, foot, lime, orange]
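Perplexity as defined above is just exponentiated entropy; a minimal sketch:

```python
import math

def perplexity(probs):
    """2 ** ( - sum_x p(x) * log2 p(x) ): the model's effective branching factor."""
    return 2 ** (-sum(p * math.log2(p) for p in probs if p > 0))

print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0 -- as unsure as a fair 4-sided die
print(perplexity([1.0]))                     # 1.0 -- a certain model is never surprised
```

Lower is better: a model that assigns higher probability to held-out text has lower perplexity.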
Parameters: K
40
[Plot: EVALUATION CRITERION vs. number of topics (5, 10, 15, 20)]
PICK LOWEST NUMBER WITH BEST SCORE: here K=8
Coherence Scores
41
LOG PROB OF WORD CO-OCCURRENCES
NORMALIZED PMI AND COSINE SIMILARITY
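Coherence scores average pairwise statistics over a topic's top words. A pure-Python sketch of normalized PMI for a single word pair from document co-occurrence (gensim's `CoherenceModel` implements the full measures):

```python
import math

def npmi(w1, w2, docs):
    """Normalized PMI of two words from document-level co-occurrence."""
    N = len(docs)
    d1 = sum(1 for doc in docs if w1 in doc)
    d2 = sum(1 for doc in docs if w2 in doc)
    d12 = sum(1 for doc in docs if w1 in doc and w2 in doc)
    if d12 == 0:
        return -1.0                      # never co-occur: minimal coherence
    p1, p2, p12 = d1 / N, d2 / N, d12 / N
    if p12 == 1.0:
        return 1.0                       # co-occur in every document
    return math.log(p12 / (p1 * p2)) / -math.log(p12)

docs = [{"apple", "banana"}, {"apple", "banana"}, {"foot", "shoe"}, {"foot", "ball"}]
print(npmi("apple", "banana", docs))  # 1.0: a perfectly coherent pair
print(npmi("apple", "foot", docs))    # -1.0: never co-occur
```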
Word and Topic Intrusion
42 Slide credit: Hanh Nguyen
WORD INTRUSION
TOPIC INTRUSION
Adding Constraints
• Maybe we know which words go with a topic
• Fix some probabilities/add smoothing
43
Author Topic Models
• Learn separate topic distributions for external factors
44
TOPICS BY COUNTRY
Plate Notation
45
𝛂 → 𝜃i → zij → wij ← 𝛃
M: #DOCUMENTS, N: #WORDS
𝛂: HOW MANY TOPICS PER DOCUMENT?
Plate Notation
46
ai → xij → zij → wij, with 𝛂 and 𝛃 as before and one of T AUTHOR DISTROS 𝜃 per author
M: #DOCUMENTS, N: #WORDS
ai: AUTHORs of document i; xij: author drawn for word j
𝛂: HOW MANY TOPICS PER DOCUMENT?
Wrapping Up
How to use Topic Models
48
CORPUS → MODEL → TOPICS → DESCRIPTORS
CORPUS: preprocess
MODEL: find best #topics, find best parameters
TOPICS: check output, choose top 5 words
DESCRIPTORS: name

[pasta, pizza, wine, sauce, spaghetti] → FOOD
Caveats!
Topic models ALWAYS need manual assessment, because:
• Random initialization: no two models are the same!
• More likely models ≠ more interpretable topics
• "Interpretable" is subjective
• Topics are not stable from run to run
49
NEVER USE TOPICS AS INPUT TO REGRESSION!
• LDA is one architecture for topic models
• Model document generation conditioned on latent topics
• Topic models are stochastic: each run is different
• Preprocessing and parameters influence performance
• Results need to be interpreted!
• We can introduce constraints through priors or labels
Take-Home Points
50
To Neural and Beyond
51
https://github.com/MilaNLProc/contextualized-topic-models
• Based on neural networks: better coherence
• cross-lingual: train in one language, use in others
• add supervision: use document labels (similar to author topics)
Sentence → Topic

EN: Blackmore’s Night is a British/American traditional folk rock duo [...] → rock, band, bass, formed
IT: Blackmore’s Night sono la band fondatrice del renaissance rock [...] → rock, band, bass, formed
PT: Blackmore’s Night é uma banda de folk rock de estilo [...] → rock, band, bass, formed

EN: Langton’s ant is a two-dimensional Turing machine with [...] → math, theory, space, numbers
FR: On nomme fourmi de Langton un automate cellulaire [...] → math, theory, space, numbers
DE: Die Ameise ist eine Turingmaschine mit einem [...] → math, theory, space, numbers