Models for Authors and Text Documents
Models for Authors and Text Documents

Mark Steyvers
UCI

In collaboration with:
Padhraic Smyth (UCI)
Michal Rosen-Zvi (UCI)
Thomas Griffiths (Stanford)

These viewgraphs were developed by Professor Mark Steyvers and are intended for review by ICS 278 students. If you wish to use them for any other purposes, please contact Professor Smyth ([email protected]) or Professor Steyvers.
Goal

- Automatically extract topical content of documents
- Learn associations of topics to authors of documents
- Propose a new efficient probabilistic topic model: the author-topic model

Some queries the model should be able to answer:
- What topics does author X work on?
- Which authors work on topic X?
- What are interesting temporal patterns in topics?
A topic is represented as a (multinomial) distribution over words:

P(w | z)
TOPIC 209
WORD           PROB.
PROBABILISTIC  0.0778
BAYESIAN       0.0671
PROBABILITY    0.0532
CARLO          0.0309
MONTE          0.0308
DISTRIBUTION   0.0257
INFERENCE      0.0253
PROBABILITIES  0.0253
CONDITIONAL    0.0229
PRIOR          0.0219
...            ...

TOPIC 289
WORD           PROB.
RETRIEVAL      0.1179
TEXT           0.0853
DOCUMENTS      0.0527
INFORMATION    0.0504
DOCUMENT       0.0441
CONTENT        0.0242
INDEXING       0.0205
RELEVANCE      0.0159
COLLECTION     0.0146
RELEVANT       0.0136
...            ...
Documents as Topic Mixtures: a Geometric Interpretation

[Figure: the simplex of word distributions over three words, with axes P(word1), P(word2), P(word3) each running from 0 to 1 and the constraint P(word1) + P(word2) + P(word3) = 1. Topic 1 and topic 2 are points on the simplex; each document is plotted as a point lying between the topics it mixes.]
Previous topic-based models

Hofmann (1999): Probabilistic Latent Semantic Indexing (pLSI)
- EM implementation
- Problem of overfitting

Blei, Ng, & Jordan (2003): Latent Dirichlet Allocation (LDA)
- Clarified the pLSI model
- Variational EM

Griffiths & Steyvers (PNAS, 2004)
- Same generative model as LDA
- Gibbs sampling technique for inference
- Computationally simple
- Efficient (linear in the size of the data)
- Can be applied to >100K documents
Approach with Author-Topic Models

- Combine author models with topic models
- Ignore style, focus on content of documents
- Learn the topics that authors write about

Learn two matrices: an Authors × Topics matrix and a Topics × Words matrix.
Assumptions of the Generative Model

- Each author is associated with a topics mixture
- Each document contains a mixture of topics
- With multiple authors, the document will express a mixture of the topic mixtures of the co-authors
- Each word in a text is generated from one topic and one author (potentially different for each word)
Generative Process

Assume authors A1 and A2 collaborate and produce a paper:
- A1 has multinomial topic distribution θ_A1
- A2 has multinomial topic distribution θ_A2

For each word in the paper:
1. Sample an author x (uniformly) from {A1, A2}
2. Sample a topic z from θ_x
3. Sample a word w from the multinomial topic distribution φ_z
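As a concrete illustration, here is a minimal Python sketch of this generative process. The corpus sizes, hyperparameter values, and variable names are illustrative assumptions, not taken from the slides; only the three sampling steps above are the model's.

    import numpy as np

    rng = np.random.default_rng(0)

    V, T = 1000, 50            # vocabulary size, number of topics (illustrative)
    authors = ["A1", "A2"]     # the two co-authors of the paper

    # Author-topic distributions theta (one row per author) and topic-word
    # distributions phi (one row per topic), drawn here from symmetric
    # Dirichlet priors purely for the sake of the example.
    alpha, beta = 0.1, 0.01
    theta = rng.dirichlet(alpha * np.ones(T), size=len(authors))
    phi = rng.dirichlet(beta * np.ones(V), size=T)

    def generate_paper(n_words):
        words, assignments = [], []
        for _ in range(n_words):
            x = rng.integers(len(authors))   # 1. sample an author uniformly
            z = rng.choice(T, p=theta[x])    # 2. sample a topic from theta_x
            w = rng.choice(V, p=phi[z])      # 3. sample a word from phi_z
            words.append(w)
            assignments.append((authors[x], z))
        return words, assignments

    words, assignments = generate_paper(200)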
Graphical Model

From the set of co-authors a_d:
1. Choose an author x
2. Choose a topic z
3. Choose a word w

[Plate diagram: for each of the D documents, each of its N_d word tokens gets an author assignment x (drawn from the document's co-author set a_d), then a topic assignment z, then an observed word w. The A × T matrix of author-topic distributions (θ) and the T × W matrix of topic-word distributions (φ) parameterize the two sampling steps.]
Model Estimation

- Estimate x and z by Gibbs sampling (assignments of each word to an author and a topic)
- Integrate out θ and φ
- Estimation is efficient: linear in data size
- Infer:
  - Author-topic distributions (θ)
  - Topic-word distributions (φ)
  (point estimates below)
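For reference, the point estimates of θ and φ recovered from the samples have the standard smoothed-count form (using the count matrices C^{AT} and C^{WT} defined on the next slide, with Dirichlet hyperparameters α and β; the slides do not write these out, but this is the usual estimator for this family of models):

\theta_{kj} = \frac{C^{AT}_{kj} + \alpha}{\sum_{j'} C^{AT}_{kj'} + T\alpha},
\qquad
\phi_{jm} = \frac{C^{WT}_{mj} + \beta}{\sum_{m'} C^{WT}_{m'j} + V\beta}

where θ_{kj} is the probability that author k writes about topic j, and φ_{jm} is the probability of word m under topic j.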
Gibbs sampling in Author-Topics

Need full conditional distributions for the variables. The probability of assigning the current word i to topic j and author k, given everything else:

P(z_i = j, x_i = k \mid w_i = m, \mathbf{z}_{-i}, \mathbf{x}_{-i}, \mathbf{w}_{-i}, \mathbf{a}_d)
\;\propto\;
\frac{C^{WT}_{mj} + \beta}{\sum_{m'} C^{WT}_{m'j} + V\beta}
\cdot
\frac{C^{AT}_{kj} + \alpha}{\sum_{j'} C^{AT}_{kj'} + T\alpha}

where C^{WT}_{mj} is the number of times word m is assigned to topic j, and C^{AT}_{kj} is the number of times topic j is assigned to author k (all counts exclude the current word token).
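A minimal Python sketch of one sweep of this update follows. The data layout and function signature are my own choices for illustration; a real implementation would add burn-in, convergence checks, and read-out of θ and φ from the counts.

    import numpy as np

    rng = np.random.default_rng(0)

    def gibbs_step(words, doc_authors, z, x, C_WT, C_AT, alpha, beta):
        """One Gibbs sweep over the word tokens of one document.

        words: word ids; doc_authors: co-author ids for this document;
        z, x: current topic/author assignments per token;
        C_WT: V x T word-topic counts; C_AT: A x T author-topic counts."""
        V, T = C_WT.shape
        authors = doc_authors
        for i, m in enumerate(words):
            # Remove the current token's assignment from the counts.
            C_WT[m, z[i]] -= 1
            C_AT[x[i], z[i]] -= 1
            # Full conditional over all (author k, topic j) pairs.
            p_wt = (C_WT[m, :] + beta) / (C_WT.sum(axis=0) + V * beta)  # (T,)
            p_at = (C_AT[authors, :] + alpha) / \
                   (C_AT[authors, :].sum(axis=1, keepdims=True) + T * alpha)
            p = (p_at * p_wt[None, :]).ravel()
            p /= p.sum()
            idx = rng.choice(len(p), p=p)
            k, j = divmod(idx, T)
            z[i], x[i] = j, authors[k]
            # Add the new assignment back into the counts.
            C_WT[m, j] += 1
            C_AT[authors[k], j] += 1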
Gibbs sampling procedure

 i   w_i           d_i
 1   Bayesian       1
 2   Probability    1
 3   Monte          1
 4   Carlo          1
 5   Methods        1
 6   Inference      1
 7   Procedures     1
 8   Information    2
 9   Retrieval      2
10   Text           2
11   Document       2
...  ...           ...
50   Collection     2
Start with random assignments to topics/authors:

                         ITERATION 1
 i   w_i           d_i   z_i   x_i
 1   Bayesian       1    49    3
 2   Probability    1    54    1
 3   Monte          1    11    2
 4   Carlo          1    61    3
 5   Methods        1    77    4
 6   Inference      1    46    1
 7   Procedures     1    77    4
 8   Information    2    14    5
 9   Retrieval      2    91    5
10   Text           2    49    5
11   Document       2    27    5
...  ...           ...   ...   ...
50   Collection     2    53    5
Use all previous assignments, except for the current word-token, in the full conditional above:

                         ITERATION 1   ITERATION 2
 i   w_i           d_i   z_i   x_i     z_i   x_i
 1   Bayesian       1    49    3       ?     ?
 2   Probability    1    54    1
 3   Monte          1    11    2
 4   Carlo          1    61    3
 5   Methods        1    77    4
 6   Inference      1    46    1
 7   Procedures     1    77    4
 8   Information    2    14    5
 9   Retrieval      2    91    5
10   Text           2    49    5
11   Document       2    27    5
...  ...           ...   ...   ...
50   Collection     2    53    5
Sample a topic and author, and move to the next word-token:

                         ITERATION 1   ITERATION 2
 i   w_i           d_i   z_i   x_i     z_i   x_i
 1   Bayesian       1    49    3       42    2
 2   Probability    1    54    1       ?     ?
 3   Monte          1    11    2
 4   Carlo          1    61    3
 5   Methods        1    77    4
 6   Inference      1    46    1
 7   Procedures     1    77    4
 8   Information    2    14    5
 9   Retrieval      2    91    5
10   Text           2    49    5
11   Document       2    27    5
...  ...           ...   ...   ...
50   Collection     2    53    5
Sample a topic and author, and move to the next word-token:

                         ITERATION 1   ITERATION 2
 i   w_i           d_i   z_i   x_i     z_i   x_i
 1   Bayesian       1    49    3       42    2
 2   Probability    1    54    1       56    4
 3   Monte          1    11    2       ?     ?
 4   Carlo          1    61    3
 5   Methods        1    77    4
 6   Inference      1    46    1
 7   Procedures     1    77    4
 8   Information    2    14    5
 9   Retrieval      2    91    5
10   Text           2    49    5
11   Document       2    27    5
...  ...           ...   ...   ...
50   Collection     2    53    5
Sample a topic and author, and move to the next word-token:

                         ITERATION 1   ITERATION 2
 i   w_i           d_i   z_i   x_i     z_i   x_i
 1   Bayesian       1    49    3       42    2
 2   Probability    1    54    1       56    4
 3   Monte          1    11    2       46    1
 4   Carlo          1    61    3       ?     ?
 5   Methods        1    77    4
 6   Inference      1    46    1
 7   Procedures     1    77    4
 8   Information    2    14    5
 9   Retrieval      2    91    5
10   Text           2    49    5
11   Document       2    27    5
...  ...           ...   ...   ...
50   Collection     2    53    5
Collect samples after >1000 iterations:

                         ITERATION 1   ITERATION 2   ...   ITERATION 2000
 i   w_i           d_i   z_i   x_i     z_i   x_i           z_i   x_i
 1   Bayesian       1    49    3       49    3             49    3
 2   Probability    1    54    1       54    1             54    1
 3   Monte          1    11    2       11    2             11    2
 4   Carlo          1    61    3       61    3             61    3
 5   Methods        1    77    4       77    4             77    4
 6   Inference      1    46    1       46    1             46    1
 7   Procedures     1    77    4       77    4             77    4
 8   Information    2    14    5       14    5             14    5
 9   Retrieval      2    91    5       91    5             91    5
10   Text           2    49    5       49    5             49    5
11   Document       2    27    5       27    5             27    5
...  ...           ...   ...   ...
50   Collection     2    53    5       53    5             53    5
Data

Corpora:
- CiteSeer: 160K abstracts, 85K authors
- NIPS: 1.7K papers, 2K authors
- Enron: 115K emails, 5K authors (senders)

Removed stop words; no stemming.
Word order is irrelevant, just use word counts (see the bag-of-words sketch below).

Processing time:
- NIPS: 2000 Gibbs iterations, about 12 hours on a PC workstation
- CiteSeer: 700 Gibbs iterations, about 111 hours
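A minimal sketch of the kind of preprocessing described here: lowercase, drop stop words, do not stem, and keep only counts. The tiny stop list and example sentence are illustrative only, not the one used in the experiments.

    from collections import Counter

    STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "we", "for"}

    def bag_of_words(text):
        # Strip punctuation from token edges, lowercase, drop stop words;
        # word order is discarded, only counts remain.
        tokens = [t.strip(".,;:()").lower() for t in text.split()]
        return Counter(t for t in tokens if t and t not in STOP_WORDS)

    doc = "We propose a probabilistic model of authors and the topics of documents."
    print(bag_of_words(doc))
    # Counter({'propose': 1, 'probabilistic': 1, 'model': 1, 'authors': 1, ...})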
Four example topics from CiteSeer (T=300)
(columns, left to right: TOPIC 10, TOPIC 209, TOPIC 87, TOPIC 20)

WORD PROB. WORD PROB. WORD PROB. WORD PROB.
SPEECH 0.1134 PROBABILISTIC 0.0778 USER 0.2541 STARS 0.0164
RECOGNITION 0.0349 BAYESIAN 0.0671 INTERFACE 0.1080 OBSERVATIONS 0.0150
WORD 0.0295 PROBABILITY 0.0532 USERS 0.0788 SOLAR 0.0150
SPEAKER 0.0227 CARLO 0.0309 INTERFACES 0.0433 MAGNETIC 0.0145
ACOUSTIC 0.0205 MONTE 0.0308 GRAPHICAL 0.0392 RAY 0.0144
RATE 0.0134 DISTRIBUTION 0.0257 INTERACTIVE 0.0354 EMISSION 0.0134
SPOKEN 0.0132 INFERENCE 0.0253 INTERACTION 0.0261 GALAXIES 0.0124
SOUND 0.0127 PROBABILITIES 0.0253 VISUAL 0.0203 OBSERVED 0.0108
TRAINING 0.0104 CONDITIONAL 0.0229 DISPLAY 0.0128 SUBJECT 0.0101
MUSIC 0.0102 PRIOR 0.0219 MANIPULATION 0.0099 STAR 0.0087
AUTHOR PROB. AUTHOR PROB. AUTHOR PROB. AUTHOR PROB.
Waibel_A 0.0156 Friedman_N 0.0094 Shneiderman_B 0.0060 Linsky_J 0.0143
Gauvain_J 0.0133 Heckerman_D 0.0067 Rauterberg_M 0.0031 Falcke_H 0.0131
Lamel_L 0.0128 Ghahramani_Z 0.0062 Lavana_H 0.0024 Mursula_K 0.0089
Woodland_P 0.0124 Koller_D 0.0062 Pentland_A 0.0021 Butler_R 0.0083
Ney_H 0.0080 Jordan_M 0.0059 Myers_B 0.0021 Bjorkman_K 0.0078
Hansen_J 0.0078 Neal_R 0.0055 Minas_M 0.0021 Knapp_G 0.0067
Renals_S 0.0072 Raftery_A 0.0054 Burnett_M 0.0021 Kundu_M 0.0063
Noth_E 0.0071 Lukasiewicz_T 0.0053 Winiwarter_W 0.0020 Christensen-J 0.0059
Boves_L 0.0070 Halpern_J 0.0052 Chang_S 0.0019 Cranmer_S 0.0055
Young_S 0.0069 Muller_P 0.0048 Korvemaker_B 0.0019 Nagar_N 0.0050
Four example topics from CiteSeer (T=300)
(columns, left to right: TOPIC 205, TOPIC 209, TOPIC 289, TOPIC 10)

WORD PROB. WORD PROB. WORD PROB. WORD PROB.
DATA 0.1563 PROBABILISTIC 0.0778 RETRIEVAL 0.1179 QUERY 0.1848
MINING 0.0674 BAYESIAN 0.0671 TEXT 0.0853 QUERIES 0.1367
ATTRIBUTES 0.0462 PROBABILITY 0.0532 DOCUMENTS 0.0527 INDEX 0.0488
DISCOVERY 0.0401 CARLO 0.0309 INFORMATION 0.0504 DATA 0.0368
ASSOCIATION 0.0335 MONTE 0.0308 DOCUMENT 0.0441 JOIN 0.0260
LARGE 0.0280 DISTRIBUTION 0.0257 CONTENT 0.0242 INDEXING 0.0180
KNOWLEDGE 0.0260 INFERENCE 0.0253 INDEXING 0.0205 PROCESSING 0.0113
DATABASES 0.0210 PROBABILITIES 0.0253 RELEVANCE 0.0159 AGGREGATE 0.0110
ATTRIBUTE 0.0188 CONDITIONAL 0.0229 COLLECTION 0.0146 ACCESS 0.0102
DATASETS 0.0165 PRIOR 0.0219 RELEVANT 0.0136 PRESENT 0.0095
AUTHOR PROB. AUTHOR PROB. AUTHOR PROB. AUTHOR PROB.
Han_J 0.0196 Friedman_N 0.0094 Oard_D 0.0110 Suciu_D 0.0102
Rastogi_R 0.0094 Heckerman_D 0.0067 Croft_W 0.0056 Naughton_J 0.0095
Zaki_M 0.0084 Ghahramani_Z 0.0062 Jones_K 0.0053 Levy_A 0.0071
Shim_K 0.0077 Koller_D 0.0062 Schauble_P 0.0051 DeWitt_D 0.0068
Ng_R 0.0060 Jordan_M 0.0059 Voorhees_E 0.0050 Wong_L 0.0067
Liu_B 0.0058 Neal_R 0.0055 Singhal_A 0.0048 Chakrabarti_K 0.0064
Mannila_H 0.0056 Raftery_A 0.0054 Hawking_D 0.0048 Ross_K 0.0061
Brin_S 0.0054 Lukasiewicz_T 0.0053 Merkl_D 0.0042 Hellerstein_J 0.0059
Liu_H 0.0047 Halpern_J 0.0052 Allan_J 0.0040 Lenzerini_M 0.0054
Holder_L 0.0044 Muller_P 0.0048 Doermann_D 0.0039 Moerkotte_G 0.0053
Four more topics
(columns, left to right: TOPIC 163, TOPIC 87, TOPIC 20, TOPIC 273)

WORD PROB. WORD PROB. WORD PROB. WORD PROB.
MATCHING 0.1295 USER 0.2541 STARS 0.0164 METHOD 0.5851
STRING 0.0552 INTERFACE 0.1080 OBSERVATIONS 0.0150 METHODS 0.3321
LENGTH 0.0536 USERS 0.0788 SOLAR 0.0150 APPLIED 0.0268
ALGORITHM 0.0471 INTERFACES 0.0433 MAGNETIC 0.0145 APPLYING 0.0056
PATTERN 0.0327 GRAPHICAL 0.0392 RAY 0.0144 ORIGINAL 0.0054
STRINGS 0.0316 INTERACTIVE 0.0354 EMISSION 0.0134 DEVELOPED 0.0051
WORD 0.0287 INTERACTION 0.0261 GALAXIES 0.0124 PROPOSE 0.0046
MATCH 0.0235 VISUAL 0.0203 OBSERVED 0.0108 COMBINES 0.0034
PROBLEM 0.0232 DISPLAY 0.0128 SUBJECT 0.0101 PRACTICAL 0.0031
TEXT 0.0217 MANIPULATION 0.0099 STAR 0.0087 APPLY 0.0029
AUTHOR PROB. AUTHOR PROB. AUTHOR PROB. AUTHOR PROB.
Navarro_G 0.0200 Shneiderman_B 0.0060 Linsky_J 0.0143 Yang_T 0.0014
Gasieniec_L 0.0121 Rauterberg_M 0.0031 Falcke_H 0.0131 Zhang_J 0.0014
Amir_A 0.0087 Lavana_H 0.0024 Mursula_K 0.0089 Loncaric_S 0.0014
Baker_B 0.0073 Pentland_A 0.0021 Butler_R 0.0083 Liu_Y 0.0013
Crochemore_M 0.0070 Myers_B 0.0021 Bjorkman_K 0.0078 Benner_P 0.0013
Baeza-Yates_R 0.0067 Minas_M 0.0021 Knapp_G 0.0067 Faloutsos_C 0.0013
Shinohara_A 0.0061 Burnett_M 0.0021 Kundu_M 0.0063 Cortadella_J 0.0012
Szpankowski_W 0.0056 Winiwarter_W 0.0020 Christensen_J 0.0059 Paige_R 0.0011
Rytter_W 0.0056 Chang_S 0.0019 Cranmer_S 0.0055 Tai_X 0.0011
Ferragina_P 0.0051 Korvemaker_B 0.0019 Nagar_N 0.0050 Lee_J 0.0011
Some topics relate to generic word usage

TOPIC 273

WORD PROB.
METHOD 0.5851
METHODS 0.3321
APPLIED 0.0268
APPLYING 0.0056
ORIGINAL 0.0054
DEVELOPED 0.0051
PROPOSE 0.0046
COMBINES 0.0034
PRACTICAL 0.0031
APPLY 0.0029
AUTHOR PROB.
Yang_T 0.0014
Zhang_J 0.0014
Loncaric_S 0.0014
Liu_Y 0.0013
Benner_P 0.0013
Faloutsos_C 0.0013
Cortadella_J 0.0012
Paige_R 0.0011
Tai_X 0.0011
Lee_J 0.0011
Some likely topics per author (CiteSeer)

Author = Andrew McCallum, U Mass:
- Topic 1: classification, training, generalization, decision, data, ...
- Topic 2: learning, machine, examples, reinforcement, inductive, ...
- Topic 3: retrieval, text, document, information, content, ...

Author = Hector Garcia-Molina, Stanford:
- Topic 1: query, index, data, join, processing, aggregate, ...
- Topic 2: transaction, concurrency, copy, permission, distributed, ...
- Topic 3: source, separation, paper, heterogeneous, merging, ...

Author = Paul Cohen, USC/ISI:
- Topic 1: agent, multi, coordination, autonomous, intelligent, ...
- Topic 2: planning, action, goal, world, execution, situation, ...
- Topic 3: human, interaction, people, cognitive, social, natural, ...
Four example topics from NIPS (T=100)
(columns, left to right: TOPIC 19, TOPIC 24, TOPIC 29, TOPIC 87)

WORD PROB. WORD PROB. WORD PROB. WORD PROB.
LIKELIHOOD 0.0539 RECOGNITION 0.0400 REINFORCEMENT 0.0411 KERNEL 0.0683
MIXTURE 0.0509 CHARACTER 0.0336 POLICY 0.0371 SUPPORT 0.0377
EM 0.0470 CHARACTERS 0.0250 ACTION 0.0332 VECTOR 0.0257
DENSITY 0.0398 TANGENT 0.0241 OPTIMAL 0.0208 KERNELS 0.0217
GAUSSIAN 0.0349 HANDWRITTEN 0.0169 ACTIONS 0.0208 SET 0.0205
ESTIMATION 0.0314 DIGITS 0.0159 FUNCTION 0.0178 SVM 0.0204
LOG 0.0263 IMAGE 0.0157 REWARD 0.0165 SPACE 0.0188
MAXIMUM 0.0254 DISTANCE 0.0153 SUTTON 0.0164 MACHINES 0.0168
PARAMETERS 0.0209 DIGIT 0.0149 AGENT 0.0136 REGRESSION 0.0155
ESTIMATE 0.0204 HAND 0.0126 DECISION 0.0118 MARGIN 0.0151
AUTHOR PROB. AUTHOR PROB. AUTHOR PROB. AUTHOR PROB.
Tresp_V 0.0333 Simard_P 0.0694 Singh_S 0.1412 Smola_A 0.1033
Singer_Y 0.0281 Martin_G 0.0394 Barto_A 0.0471 Scholkopf_B 0.0730
Jebara_T 0.0207 LeCun_Y 0.0359 Sutton_R 0.0430 Burges_C 0.0489
Ghahramani_Z 0.0196 Denker_J 0.0278 Dayan_P 0.0324 Vapnik_V 0.0431
Ueda_N 0.0170 Henderson_D 0.0256 Parr_R 0.0314 Chapelle_O 0.0210
Jordan_M 0.0150 Revow_M 0.0229 Dietterich_T 0.0231 Cristianini_N 0.0185
Roweis_S 0.0123 Platt_J 0.0226 Tsitsiklis_J 0.0194 Ratsch_G 0.0172
Schuster_M 0.0104 Keeler_J 0.0192 Randlov_J 0.0167 Laskov_P 0.0169
Xu_L 0.0098 Rashid_M 0.0182 Bradtke_S 0.0161 Tipping_M 0.0153
Saul_L 0.0094 Sackinger_E 0.0132 Schwartz_A 0.0142 Sollich_P 0.0141
ENRON Email: two example topics (T=100)

TOPIC 10

WORD PROB.
BUSH 0.0227
LAY 0.0193
MR 0.0183
WHITE 0.0153
ENRON 0.0150
HOUSE 0.0148
PRESIDENT 0.0131
ADMINISTRATION 0.0115
COMPANY 0.0090
ENERGY 0.0085
SENDER PROB.
NELSON, KIMBERLY (ETS) 0.3608
PALMER, SARAH 0.0997
DENNE, KAREN 0.0541
HOTTE, STEVE 0.0340
DUPREE, DIANNA 0.0282
ARMSTRONG, JULIE 0.0222
LOKEY, TEB 0.0194
SULLIVAN, LORA 0.0073
VILLARREAL, LILLIAN 0.0040
BAGOT, NANCY 0.0026
TOPIC 32

WORD PROB.
ANDERSEN 0.0241
FIRM 0.0134
ACCOUNTING 0.0119
SEC 0.0065
SETTLEMENT 0.0062
AUDIT 0.0054
CORPORATE 0.0053
FINANCIAL 0.0052
JUSTICE 0.0052
INFORMATION 0.0050
SENDER PROB.
HILTABRAND, LESLIE 0.1359
WELLS, TORI L. 0.0865
DUPREE, DIANNA 0.0825
ARMSTRONG, JULIE 0.0316
DENNE, KAREN 0.0208
SULLIVAN, LORA 0.0072
[email protected] 0.0026
WILSON, DANNY 0.0016
HU, SYLVIA 0.0013
MATHEWS, LEENA 0.0012
ENRON Email: two topics not about Enron

TOPIC 38

WORD PROB.
TRAVEL 0.0161
ROUNDTRIP 0.0124
SAVE 0.0118
DEALS 0.0097
HOTEL 0.0095
BOOK 0.0094
SALE 0.0089
FARES 0.0083
TRIP 0.0072
CITIES 0.0070
SENDER PROB.
TRAVELOCITY MEMBER SERVICES 0.0763
BESTFARES.COM HOT DEALS 0.0502
<[email protected]> 0.0315
LISTS.COOLVACATIONS.COM 0.0151
CHEAP TICKETS 0.0111
EXPEDIA FARE TRACKER 0.0106
TRAVELOCITY.COM 0.0096
[email protected] 0.0088
[email protected] 0.0066
LASTMINUTE.COM 0.0051
TOPIC 25

WORD PROB.
NEWS 0.0245
MAIL 0.0182
NYTIMES 0.0149
YORK 0.0128
PAGE 0.0095
TIMES 0.0090
HEADLINES 0.0079
BUSH 0.0077
DELIVERY 0.0070
HTML 0.0068
SENDER PROB.
THE NEW YORK TIMES DIRECT 0.3438
<[email protected]> 0.0104
THE ECONOMIST 0.0029
@TIMES - INSIDE NYTIMES.COM 0.0015
[email protected] 0.0011
AMAZON.COM DELIVERS BESTSELLERS 0.0009
NYTIMES.COM 0.0009
HYATT, JERRY 0.0008
NEWSLETTER_TEXT 0.0008
CHRIS LONG 0.0007
Stability of Topics

Content of topics is arbitrary across runs of the model (e.g., topic #1 is not the same across runs).

However:
- The majority of topics are stable over processing time
- The majority of topics can be aligned across runs (see the alignment sketch below)

Topics represent genuine structure in the data.
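One way to do such an alignment, sketched under my own choices (the slides use a KL distance between topic-word distributions; the symmetrization and the greedy one-to-one matching here are assumptions, not the slides' exact procedure):

    import numpy as np

    def sym_kl(p, q, eps=1e-12):
        """Symmetrized KL distance between two word distributions."""
        p, q = p + eps, q + eps
        return 0.5 * (np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

    def align_topics(phi_run1, phi_run2):
        """Greedily match each topic of run 1 to its nearest topic of run 2.

        phi_run1, phi_run2: (T, V) topic-word matrices from two runs."""
        T = phi_run1.shape[0]
        dist = np.array([[sym_kl(phi_run1[i], phi_run2[j]) for j in range(T)]
                         for i in range(T)])
        matching, used = [], set()
        for i in np.argsort(dist.min(axis=1)):   # most confident matches first
            j = min((j for j in range(T) if j not in used),
                    key=lambda j: dist[i, j])
            matching.append((int(i), j, float(dist[i, j])))
            used.add(j)
        return matching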
Comparing NIPS topics from the same Markov chain

[Figure: matrix of KL distances between the 100 topics at iteration t1 = 1000 and the re-ordered topics at iteration t2 = 2000; distances range from about 2 to 16, with a dark diagonal of well-matched topics.]
BEST KL = 0.54

t1                       t2
ANALOG      .043         ANALOG      .044
CIRCUIT     .040         CIRCUIT     .040
CHIP        .034         CHIP        .037
CURRENT     .025         VOLTAGE     .024
VOLTAGE     .023         CURRENT     .023
VLSI        .022         VLSI        .023
INPUT       .018         OUTPUT      .022
OUTPUT      .018         INPUT       .019
CIRCUITS    .015         CIRCUITS    .015
FIGURE      .014         PULSE       .012
PULSE       .012         SYNAPSE     .012
SYNAPSE     .011         SILICON     .011
SILICON     .011         FIGURE      .010
CMOS        .009         CMOS        .009
MEAD        .008         GATE        .009
WORST KL = 4.78

t1                          t2
FEEDBACK       .040         ADAPTATION     .051
ADAPTATION     .034         FIGURE         .033
CORTEX         .025         SIMULATION     .026
REGION         .016         GAIN           .025
FIGURE         .015         EFFECTS        .016
FUNCTION       .014         FIBERS         .014
BRAIN          .013         COMPUTATIONAL  .014
COMPUTATIONAL  .013         EXPERIMENT     .014
FIBER          .012         FIBER          .013
FIBERS         .011         SITES          .012
ELECTRIC       .011         RESULTS        .012
BOWER          .010         EXPERIMENTS    .012
FISH           .010         ELECTRIC       .011
SIMULATIONS    .009         SITE           .009
CEREBELLAR     .009         NEURO          .009
Comparing NIPS topics from two different Markov chains

[Figure: matrix of KL distances between the 100 topics from chain 1 and the re-ordered topics from chain 2; distances range from about 2 to 16, again with a dark diagonal of aligned topics.]
BEST KL = 1.03

Chain 1                    Chain 2
MOTOR         .041         MOTOR         .040
TRAJECTORY    .031         ARM           .030
ARM           .027         TRAJECTORY    .030
HAND          .022         HAND          .024
MOVEMENT      .022         MOVEMENT      .023
INVERSE       .019         INVERSE       .021
DYNAMICS      .019         JOINT         .021
CONTROL       .018         DYNAMICS      .018
JOINT         .018         CONTROL       .015
POSITION      .017         POSITION      .015
FORWARD       .014         FORWARD       .015
TRAJECTORIES  .014         FORCE         .014
MOVEMENTS     .013         TRAJECTORIES  .013
FORCE         .012         MOVEMENTS     .012
MUSCLE        .011         CHANGE        .010
WORST KL = 9.49

Chain 1                   Chain 2
ORDER        .175         FUNCTION       .091
SCALE        .053         ORDER          .064
HIGHER       .035         EQUATION       .048
MULTI        .028         TERMS          .027
NOTE         .028         TERM           .027
VOLUME       .019         THEORY         .014
TERMS        .019         APPROXIMATION  .014
STRUCTURE    .017         FUNCTIONS      .014
SCALES       .017         FORM           .014
INVARIANT    .012         OBTAINED       .013
SCALING      .011         POINT          .012
COMPLEXITY   .010         RESPECT        .012
MUSIC        .009         GENERAL        .011
NOTES        .009         CASE           .011
TABLE        .008         ASSUME         .011
Detecting Papers on Unusual Topics for Authors

We can calculate perplexity (unusualness) for the words in a document given an author (see the formula below).
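The slides do not spell the formula out; in its standard per-word form it is:

\mathrm{Perplexity}(\mathbf{w}_d \mid a) = \exp\!\left(-\frac{\log p(\mathbf{w}_d \mid a)}{N_d}\right)

where \mathbf{w}_d are the N_d words of document d and a is the author; higher perplexity means the words are more unusual for that author.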
Papers ranked by perplexity for M. Jordan:
Author Separation
A method1 is described which like the kernel1 trick1 in support1 vector1 machines1 SVMs1 lets us generalize distance1 based2 algorithms to operate in feature1 spaces usually nonlinearly related to the input1 space This is done by identifying a class of kernels1 which can be represented as norm1 based2 distances1 in Hilbert spaces It turns1 out that common kernel1 algorithms such as SVMs1 and kernel1 PCA1 are actually really distance1 based2 algorithms and can be run2 with that class of kernels1 too As well as providing1 a useful new insight1 into how these algorithms work the present2 work can form the basis1 for conceiving new algorithms
This paper presents2 a comprehensive approach for model2 based2 diagnosis2 which includes proposals for characterizing and computing2 preferred2 diagnoses2 assuming that the system2 description2 is augmented with a system2 structure2 a directed2 graph2 explicating the interconnections between system2 components2 Specifically we first introduce the notion of a consequence2 which is a syntactically2 unconstrained propositional2 sentence2 that characterizes all consistency2 based2 diagnoses2 and show2 that standard2 characterizations of diagnoses2 such as minimal conflicts1 correspond to syntactic2 variations1 on a consequence2 Second we propose a new syntactic2 variation on the consequence2 known as negation2 normal form NNF and discuss its merits compared to standard variations Third we introduce a basic algorithm2 for computing consequences in NNF given a structured system2 description We show that if the system2 structure2 does not contain cycles2 then there is always a linear size2 consequence2 in NNF which can be computed in linear time2 For arbitrary1 system2 structures2 we show a precise connection between the complexity2 of computing2 consequences and the topology of the underlying system2 structure2 Finally we present2 an algorithm2 that enumerates2 the preferred2 diagnoses2 characterized by a consequence2 The algorithm2 is shown1 to take linear time2 in the size2 of the consequence2 if the preference criterion1 satisfies some general conditions
Written by (1): Scholkopf_B
Written by (2): Darwiche_A

Can the model attribute words to authors correctly within a document?

Test of the model (a sketch follows):
1) Artificially combine abstracts from different authors
2) Check whether each word's author assignment points to the correct original author

(In the merged text above, the digit 1 or 2 after each word marks the author the model assigned that word to.)
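A rough sketch of this evaluation in Python. The model API here (`sample_author_assignments`) is hypothetical, standing in for a fitted author-topic sampler like the one sketched earlier; only the merge-and-score logic is shown.

    def author_separation_accuracy(abstract1, abstract2, author1, author2, model):
        """Merge two single-author abstracts, infer per-token author
        assignments with both authors as candidate co-authors, and score
        how many tokens are attributed to their true original author."""
        words = abstract1 + abstract2                    # token id lists
        truth = [author1] * len(abstract1) + [author2] * len(abstract2)
        x = model.sample_author_assignments(words, coauthors=[author1, author2])
        return sum(xi == ti for xi, ti in zip(x, truth)) / len(words)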
Temporal patterns in topics: hot and cold topics

We have CiteSeer papers from 1986-2001.

We can calculate time-series for topics:
- Hot topics become more prevalent
- Cold topics become less prevalent

Do the time-series correspond with known trends in computer science? (A sketch of the time-series computation follows.)
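The slides plot per-year topic proportions; one simple way to compute such a series from per-document topic mixtures (averaging per year is my assumption of the exact aggregation):

    import numpy as np

    def topic_time_series(doc_topic, doc_year, years):
        """Mean topic probability per year.

        doc_topic: (D, T) per-document topic mixtures;
        doc_year: (D,) publication years; years: years to report
        (assumed to each contain at least one document)."""
        doc_year = np.asarray(doc_year)
        return np.array([doc_topic[doc_year == y].mean(axis=0) for y in years])

    # series[i, j] = average probability of topic j among papers from years[i]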
[Figure: Document and Word Distribution by Year in the UCI CiteSeer Data. Both the number of documents per year (up to about 2 × 10^4) and the number of words per year (up to about 14 × 10^5) grow steadily from 1986 to 2002.]
Hot Topic: machine learning, data mining

[Figure: Topic Proportions by Year in CiteSeer Data; topic probability (×10^-3) vs. year, 1990-2002, rising for:]
- Topic 114: regression, variance, estimator, estimators, bias
- Topic 153: classification, training, classifier, classifiers, generalization
- Topic 205: data, mining, attributes, discovery, association
The inevitability of Bayes...

[Figure: Topic Proportions by Year in CiteSeer Data; topic probability (×10^-3) vs. year, 1990-2002, rising for:]
- Topic 189: statistical, prediction, correlation, predict, statistics
- Topic 209: probabilistic, bayesian, probability, carlo, monte
- Topic 276: random, distribution, probability, markov, distributions
Rise in Web/Mobile topics

[Figure: Topic Proportions by Year in CiteSeer Data; topic probability vs. year, 1990-2002, rising for:]
- Topic 7: web, user, world, wide, users
- Topic 80: mobile, wireless, devices, mobility, ad
- Topic 76: java, remote, interface, platform, implementation
- Topic 275: multicast, multimedia, media, delivery, applications
(Not so) Hot Topics

[Figure: Topic Proportions by Year in CiteSeer Data; topic probability (×10^-3) vs. year, 1990-2002, for:]
- Topic 23: neural, networks, network, training, learning
- Topic 35: wavelet, operator, operators, basis, coefficients
- Topic 242: genetic, evolutionary, evolution, population, ga
Decline in programming languages, OS, ...

[Figure: Topic Proportions by Year in CiteSeer Data; topic probability (×10^-3) vs. year, 1990-2002, declining for:]
- Topic 60: programming, language, concurrent, languages, implementation
- Topic 139: system, operating, file, systems, kernel
- Topic 283: collection, memory, persistent, garbage, stack
- Topic 268: memory, cache, shared, access, performance
Security research reborn...

[Figure: Topic Proportions by Year in CiteSeer Data; topic probability (×10^-3) vs. year, 1990-2002, for:]
- Topic 120: security, secure, access, key, authentication
- Topic 240: key, attack, encryption, hash, keys
Decrease in use of Greek Letters

[Figure: Topic Proportions by Year in CiteSeer Data; topic probability (×10^-3) vs. year, 1990-2002, declining for:]
- Topic 157: gamma, delta, ff, omega, oe
Burst of French writing in the mid-90s?

[Figure: Topic Proportions by Year in CiteSeer Data; topic probability vs. year, 1990-2002, with a mid-1990s spike for:]
- Topic 47: la, les, une, nous, est
Comparison to models that use less information

Topics model (topics, no authors): each document has its own topic mixture; each word's topic z generates the word w. [Plate diagram: z → w over the N_d words of each of D documents, with T topics.]

Author model (authors, no topics): each word is attributed directly to an author x drawn from the co-author set a_d, and that author's word distribution generates w. [Plate diagram: x → w over the N_d words of each of D documents, with A authors.]
Matrix Factorization Interpretation

AUTHOR-TOPIC MODEL:
[Words × Documents] ≈ [Words × Topics] × [Topics × Authors] × [Authors × Documents (A)]

TOPIC MODEL:
[Words × Documents] ≈ [Words × Topics] × [Topics × Documents]

AUTHOR MODEL:
[Words × Documents] ≈ [Words × Authors] × [Authors × Documents (A)]

(A is the fixed document-author matrix given by the observed authorship.)
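A shape-level Python sketch of the author-topic factorization. The dimensions and random parameters are illustrative; the point is only that the product of the three column-stochastic matrices yields a word distribution per document.

    import numpy as np

    W, T, A_n, D = 1000, 50, 200, 500        # words, topics, authors, documents
    rng = np.random.default_rng(0)

    Phi = rng.dirichlet(np.ones(W), size=T).T      # W x T  topic-word
    Theta = rng.dirichlet(np.ones(T), size=A_n).T  # T x A  author-topic
    A = np.zeros((A_n, D))                         # A x D  document-author indicators
    A[rng.integers(A_n, size=D), np.arange(D)] = 1.0   # one author per doc, for simplicity
    A /= A.sum(axis=0, keepdims=True)              # equal weight over each doc's authors

    P = Phi @ Theta @ A                            # W x D  word probabilities per document
    assert np.allclose(P.sum(axis=0), 1.0)         # each column is a distribution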
Comparison Results

Train the models on part of a new document and predict the remaining words.

Without having seen any words from the new document, author-topic information helps in predicting words from that document.

The topics model is more flexible in adapting to the new document after observing a number of words.

[Figure: perplexity of new words vs. number of observed words in the document, for the Author model, the Topics model, and the Author-Topics model; perplexities range from about 2,000 to 14,000.]
Author prediction with CiteSeer

Task: predict the (single) author of new CiteSeer abstracts.

Results:
- For 33% of documents, the author is guessed correctly
- Median rank of the true author = 26 (out of 85,000)

(A sketch of ranking authors by perplexity follows.)
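One plausible way to produce such a ranking from the fitted matrices (the slides do not give the exact procedure; ranking authors by per-word perplexity of the abstract under each author's topic mixture is my assumption):

    import numpy as np

    def rank_true_author(words, theta, phi, true_author):
        """Rank all authors by perplexity of a document's words.

        theta: (A, T) author-topic matrix; phi: (T, V) topic-word matrix;
        words: list of word ids; returns the 1-based rank of true_author
        (smaller perplexity = better fit, rank 1 = best)."""
        word_probs = theta @ phi[:, words]          # (A, n_words): p(w | author)
        log_pp = -np.log(word_probs + 1e-12).mean(axis=1)  # per-word neg. log-lik.
        order = np.argsort(log_pp)                  # best-fitting authors first
        return int(np.where(order == true_author)[0][0]) + 1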
Perplexities for the true author and any random author

[Figure: perplexity (about 2,000-5,500) vs. number of topics (5 to 800), comparing A = true author against A = any author, with percentile curves (0th, 1st, 2nd, 5th, 10th) shown.]
The Author-Topic Browser

[Screenshots: (a) querying on an author (Pazzani_M); (b) querying on a topic relevant to that author; (c) querying on a document written by the author.]

http://www.ics.uci.edu/~michal/KDD/ATM.htm
New Applications / Future Work

Finding relevant email:
- "Find emails similar to this email based on content"
- "Find people who wrote emails similar in content to this one"

Reviewer recommendation:
- "Find reviewers for this set of NSF proposals who are active in relevant topics and have no conflicts of interest"

Change detection/monitoring:
- Which authors are on the leading edge of new topics?
- Characterize the "topic trajectory" of an author over time

Author identification:
- Who wrote this document? Incorporation of stylistic information
Comparing NIPS topics and CiteSeer topics

[Figure: matrix of KL distances between the 100 NIPS topics and the re-ordered CiteSeer topics; distances range from about 4 to 18.]
KL = 2.88

NIPS                     CiteSeer
MODEL       .493         MODEL       .498
MODELS      .143         MODELS      .227
MODELING    .022         MODELING    .055
PARAMETERS  .020         DYNAMIC     .009
BASED       .012         MODELED     .008
PROPOSED    .010         FRAMEWORK   .007
KL = 4.48

NIPS                      CiteSeer
SPEECH       .082         SPEECH       .058
RECOGNITION  .049         RECOGNITION  .047
HMM          .023         WORD         .018
SPEAKER      .022         SYSTEM       .014
CONTEXT      .022         SPEAKER      .012
WORD         .016         ACOUSTIC     .010
KL = 4.92

NIPS                   CiteSeer
SYSTEM    .234         SYSTEM     .497
SYSTEMS   .090         SYSTEMS    .350
REAL      .020         BASED      .012
BASED     .018         PAPER      .012
COMPUTER  .014         COMPLEX    .010
APPROACH  .011         DEVELOPED  .008
KL = 5.0

NIPS                        CiteSeer
FUNCTION       .159         FUNCTIONS      .124
FUNCTIONS      .115         FUNCTION       .118
APPROXIMATION  .069         ORDER          .023
LINEAR         .026         APPROXIMATION  .022
BASIS          .018         LINEAR         .016
APPROXIMATE    .016         INTERVAL       .014