Models for Authors and Text Documents


Page 1: Models for Authors and Text Documents

Models for Authors and Text Documents

Mark Steyvers
UCI

In collaboration with:
Padhraic Smyth (UCI)
Michal Rosen-Zvi (UCI)
Thomas Griffiths (Stanford)

Page 2: Models for Authors and Text Documents

These viewgraphs were developed by Professor Mark Steyvers and are intended for review by ICS 278 students. If you wish to use them for any other purposes, please contact Professor Smyth ([email protected]) or Professor Steyvers ([email protected]).

Page 3: Models for Authors and Text Documents

Goal

- Automatically extract the topical content of documents
- Learn the association of topics to the authors of documents
- Propose a new, efficient probabilistic topic model: the author-topic model

Some queries the model should be able to answer:
- What topics does author X work on?
- Which authors work on topic X?
- What are interesting temporal patterns in topics?

Page 4: Models for Authors and Text Documents

A topic is represented as a (multinomial) distribution over words:

P(w | z)
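To make the representation concrete, here is a minimal Python sketch of a topic as a word distribution, using the head of TOPIC 209 below; renormalizing over only the listed words is an assumption, since the full distribution covers the whole vocabulary.

import random

# A topic is a (multinomial) probability distribution over the vocabulary.
# Only the head of TOPIC 209 is listed, so we renormalize over these words.
topic_209 = {
    "probabilistic": 0.0778, "bayesian": 0.0671, "probability": 0.0532,
    "carlo": 0.0309, "monte": 0.0308, "distribution": 0.0257,
}

# Sample a word w ~ P(w | z = 209).
words, weights = zip(*topic_209.items())
print(random.choices(words, weights=weights, k=1)[0])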

TOPIC 209

WORD PROB.

PROBABILISTIC 0.0778

BAYESIAN 0.0671

PROBABILITY 0.0532

CARLO 0.0309

MONTE 0.0308

DISTRIBUTION 0.0257

INFERENCE 0.0253

PROBABILITIES 0.0253

CONDITIONAL 0.0229

PRIOR 0.0219

.... ...


TOPIC 289

WORD PROB.

RETRIEVAL 0.1179

TEXT 0.0853

DOCUMENTS 0.0527

INFORMATION 0.0504

DOCUMENT 0.0441

CONTENT 0.0242

INDEXING 0.0205

RELEVANCE 0.0159

COLLECTION 0.0146

RELEVANT 0.0136

... ...


Page 5: Models for Authors and Text Documents

Documents as Topic Mixtures: a Geometric Interpretation

[Figure: the simplex P(word1) + P(word2) + P(word3) = 1, with axes P(word1), P(word2), P(word3) running from 0 to 1; topic 1 and topic 2 are points on the simplex, and each document (●) lies between them as a mixture of the two topics.]
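In symbols, the mixture the figure depicts can be written as follows (standard topic-model notation, not verbatim from the slide):

$$
P(w \mid d) = \sum_{j=1}^{T} P(w \mid z = j)\, P(z = j \mid d)
$$

so each document's word distribution is a convex combination of the topic distributions and therefore lies inside the simplex spanned by the topics.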

Page 6: Models for Authors and Text Documents

Previous topic-based models

- Hofmann (1999): Probabilistic Latent Semantic Indexing (pLSI)
  - EM implementation
  - Problem of overfitting
- Blei, Ng, & Jordan (2003): Latent Dirichlet Allocation (LDA)
  - Clarified the pLSI model
  - Variational EM
- Griffiths & Steyvers (PNAS, 2004)
  - Same generative model as LDA
  - Gibbs sampling technique for inference
  - Computationally simple
  - Efficient (linear in the size of the data)
  - Can be applied to >100K documents

Page 7: Models for Authors and Text Documents

Approach with Author-Topic Models

- Combine author models with topic models
- Ignore style; focus on the content of the document
- Learn the topics that authors write about

Learn two matrices: an Authors × Topics matrix and a Topics × Words matrix.

Page 8: Models for Authors and Text Documents

Assumptions of the Generative Model

- Each author is associated with a topics mixture
- Each document contains a mixture of topics
- With multiple authors, the document will express a mixture of the topic mixtures of the co-authors
- Each word in a text is generated from one topic and one author (potentially different for each word)

Page 9: Models for Authors and Text Documents

Generative Process

Assume authors A1 and A2 collaborate and produce a paper:
- A1 has multinomial topic distribution θ1
- A2 has multinomial topic distribution θ2

For each word in the paper:
1. Sample an author x (uniformly) from {A1, A2}
2. Sample a topic z from the topic distribution θx
3. Sample a word w from the multinomial topic distribution φz
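A minimal Python sketch of this generative process; θ and φ are assumed given (here drawn from Dirichlet priors purely for illustration), and the variable names are mine, not the slide's.

import numpy as np

rng = np.random.default_rng(0)
A, T, V = 2, 5, 1000                              # authors, topics, vocabulary size
theta = rng.dirichlet(np.full(T, 0.1), size=A)    # author-topic distributions
phi = rng.dirichlet(np.full(V, 0.01), size=T)     # topic-word distributions

def generate_word(coauthors):
    x = rng.choice(coauthors)          # 1. sample an author uniformly
    z = rng.choice(T, p=theta[x])      # 2. sample a topic from theta_x
    w = rng.choice(V, p=phi[z])        # 3. sample a word from phi_z
    return x, z, w

paper = [generate_word([0, 1]) for _ in range(100)]   # A1 = 0, A2 = 1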

Page 10: Models for Authors and Text Documents

Graphical Model

1. Choose an author x (from the set of co-authors a_d)
2. Choose a topic z
3. Choose a word w

[Plate diagram: for each of the D documents with author set a_d, and for each of its N_d word tokens, the model draws an author x, then a topic z, then a word w; Θ is the A × T matrix of author-topic distributions, Φ is the T × W matrix of topic-word distributions.]

Page 11: Models for Authors and Text Documents

Model Estimation

- Estimate x and z by Gibbs sampling (assignments of each word to an author and a topic)
- Integrate out θ and φ
- Estimation is efficient: linear in data size
- Infer:
  - Author-topic distributions (θ)
  - Topic-word distributions (φ)

Page 12: Models for Authors and Text Documents

Gibbs sampling in the Author-Topic model

- Need full conditional distributions for the variables
- The probability of assigning the current word i to topic j and author k, given everything else:

$$
P(z_i = j, x_i = k \mid w_i = m, \mathbf{z}_{-i}, \mathbf{x}_{-i}, \mathbf{w}_{-i}, \mathbf{a}_d)
\;\propto\;
\frac{C^{WT}_{mj} + \beta}{\sum_{m'} C^{WT}_{m'j} + V\beta}
\cdot
\frac{C^{AT}_{kj} + \alpha}{\sum_{j'} C^{AT}_{kj'} + T\alpha}
$$

where C^{WT}_{mj} is the number of times word m is assigned to topic j, and C^{AT}_{kj} is the number of times topic j is assigned to author k (all counts excluding the current word token i).
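A sketch of one Gibbs update implementing this full conditional; CWT is the V × T word-topic count matrix and CAT the A × T author-topic count matrix. The function name and data layout are assumptions, not the authors' code.

import numpy as np

def gibbs_step(i, w, d, z, x, coauthors, CWT, CAT, alpha, beta, rng):
    """Resample (z_i, x_i) for word token i from the full conditional."""
    V, T = CWT.shape
    m = w[i]
    CWT[m, z[i]] -= 1                 # remove the current token from the counts
    CAT[x[i], z[i]] -= 1
    authors = coauthors[d[i]]         # the co-author set a_d of token i's document
    left = (CWT[m, :] + beta) / (CWT.sum(axis=0) + V * beta)
    right = (CAT[authors, :] + alpha) / (
        CAT[authors, :].sum(axis=1, keepdims=True) + T * alpha)
    p = (right * left).ravel()        # joint over (author k, topic j)
    idx = rng.choice(p.size, p=p / p.sum())
    k, j = divmod(idx, T)
    x[i], z[i] = authors[k], j        # record the new assignment
    CWT[m, j] += 1                    # add the token back with its new labels
    CAT[authors[k], j] += 1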

Page 13: Models for Authors and Text Documents

Gibbs sampling procedure

 i    w_i           d_i
 1    Bayesian      1
 2    Probability   1
 3    Monte         1
 4    Carlo         1
 5    Methods       1
 6    Inference     1
 7    Procedures    1
 8    Information   2
 9    Retrieval     2
10    Text          2
11    Document      2
...   ...           ...
50    Collection    2

Page 14: Models for Authors and Text Documents

Start with random assignments to topics/authors.

                          ITERATION 1
 i    w_i           d_i   z_i   x_i
 1    Bayesian      1     49    3
 2    Probability   1     54    1
 3    Monte         1     11    2
 4    Carlo         1     61    3
 5    Methods       1     77    4
 6    Inference     1     46    1
 7    Procedures    1     77    4
 8    Information   2     14    5
 9    Retrieval     2     91    5
10    Text          2     49    5
11    Document      2     27    5
...   ...           ...   ...   ...
50    Collection    2     53    5

Page 15: Models for Authors and Text Documents

Use all previous assignments, except for the current word token (the full conditional above).

                          ITERATION 1    ITERATION 2
 i    w_i           d_i   z_i   x_i      z_i   x_i
 1    Bayesian      1     49    3        ?     ?
 2    Probability   1     54    1
 3    Monte         1     11    2
 4    Carlo         1     61    3
 5    Methods       1     77    4
 6    Inference     1     46    1
 7    Procedures    1     77    4
 8    Information   2     14    5
 9    Retrieval     2     91    5
10    Text          2     49    5
11    Document      2     27    5
...   ...           ...   ...   ...
50    Collection    2     53    5

Page 16: Models for Authors and Text Documents

Sample a topic and author, and move to the next word token.

                          ITERATION 1    ITERATION 2
 i    w_i           d_i   z_i   x_i      z_i   x_i
 1    Bayesian      1     49    3        42    2
 2    Probability   1     54    1        ?     ?
 3    Monte         1     11    2
 4    Carlo         1     61    3
 5    Methods       1     77    4
 6    Inference     1     46    1
 7    Procedures    1     77    4
 8    Information   2     14    5
 9    Retrieval     2     91    5
10    Text          2     49    5
11    Document      2     27    5
...   ...           ...   ...   ...
50    Collection    2     53    5

Page 17: Models for Authors and Text Documents

Sample a topic and author, and move to the next word token.

                          ITERATION 1    ITERATION 2
 i    w_i           d_i   z_i   x_i      z_i   x_i
 1    Bayesian      1     49    3        42    2
 2    Probability   1     54    1        56    4
 3    Monte         1     11    2        ?     ?
 4    Carlo         1     61    3
 5    Methods       1     77    4
 6    Inference     1     46    1
 7    Procedures    1     77    4
 8    Information   2     14    5
 9    Retrieval     2     91    5
10    Text          2     49    5
11    Document      2     27    5
...   ...           ...   ...   ...
50    Collection    2     53    5

Page 18: Models for Authors and Text Documents

Sample a topic and author, and move to the next word token.

                          ITERATION 1    ITERATION 2
 i    w_i           d_i   z_i   x_i      z_i   x_i
 1    Bayesian      1     49    3        42    2
 2    Probability   1     54    1        56    4
 3    Monte         1     11    2        46    1
 4    Carlo         1     61    3        ?     ?
 5    Methods       1     77    4
 6    Inference     1     46    1
 7    Procedures    1     77    4
 8    Information   2     14    5
 9    Retrieval     2     91    5
10    Text          2     49    5
11    Document      2     27    5
...   ...           ...   ...   ...
50    Collection    2     53    5

Page 19: Models for Authors and Text Documents

Collect samples after >1000 iterations.

                          ITERATION 1    ITERATION 2    ...    ITERATION 2000
 i    w_i           d_i   z_i   x_i      z_i   x_i             z_i   x_i
 1    Bayesian      1     49    3        49    3               49    3
 2    Probability   1     54    1        54    1               54    1
 3    Monte         1     11    2        11    2               11    2
 4    Carlo         1     61    3        61    3               61    3
 5    Methods       1     77    4        77    4               77    4
 6    Inference     1     46    1        46    1               46    1
 7    Procedures    1     77    4        77    4               77    4
 8    Information   2     14    5        14    5               14    5
 9    Retrieval     2     91    5        91    5               91    5
10    Text          2     49    5        49    5               49    5
11    Document      2     27    5        27    5               27    5
...   ...           ...   ...   ...      ...   ...             ...   ...
50    Collection    2     53    5        53    5               53    5

Page 20: Models for Authors and Text Documents

Data

Corpora:
- CiteSeer: 160K abstracts, 85K authors
- NIPS: 1.7K papers, 2K authors
- Enron: 115K emails, 5K authors (senders)

Preprocessing:
- Removed stop words; no stemming
- Word order is irrelevant; just use word counts (a sketch follows)
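For instance, a minimal bag-of-words step might look like this (the stop-word list here is illustrative, not the one used for these corpora):

from collections import Counter

STOP = {"the", "of", "and", "a", "to", "in", "is", "for", "this"}  # illustrative only

def bag_of_words(text):
    # Discard word order and stop words; keep per-document counts, no stemming.
    tokens = [t for t in text.lower().split() if t.isalpha() and t not in STOP]
    return Counter(tokens)

print(bag_of_words("The model infers the topics of this document from word counts"))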

Processing time:
- NIPS: 2000 Gibbs iterations, about 12 hours on a PC workstation
- CiteSeer: 700 Gibbs iterations, about 111 hours

Page 21: Models for Authors and Text Documents

Four example topics from CiteSeer (T=300)

Columns, left to right: TOPIC 10, TOPIC 209, TOPIC 87, TOPIC 20

WORD PROB. WORD PROB. WORD PROB. WORD PROB.

SPEECH 0.1134 PROBABILISTIC 0.0778 USER 0.2541 STARS 0.0164

RECOGNITION 0.0349 BAYESIAN 0.0671 INTERFACE 0.1080 OBSERVATIONS 0.0150

WORD 0.0295 PROBABILITY 0.0532 USERS 0.0788 SOLAR 0.0150

SPEAKER 0.0227 CARLO 0.0309 INTERFACES 0.0433 MAGNETIC 0.0145

ACOUSTIC 0.0205 MONTE 0.0308 GRAPHICAL 0.0392 RAY 0.0144

RATE 0.0134 DISTRIBUTION 0.0257 INTERACTIVE 0.0354 EMISSION 0.0134

SPOKEN 0.0132 INFERENCE 0.0253 INTERACTION 0.0261 GALAXIES 0.0124

SOUND 0.0127 PROBABILITIES 0.0253 VISUAL 0.0203 OBSERVED 0.0108

TRAINING 0.0104 CONDITIONAL 0.0229 DISPLAY 0.0128 SUBJECT 0.0101

MUSIC 0.0102 PRIOR 0.0219 MANIPULATION 0.0099 STAR 0.0087

AUTHOR PROB. AUTHOR PROB. AUTHOR PROB. AUTHOR PROB.

Waibel_A 0.0156 Friedman_N 0.0094 Shneiderman_B 0.0060 Linsky_J 0.0143

Gauvain_J 0.0133 Heckerman_D 0.0067 Rauterberg_M 0.0031 Falcke_H 0.0131

Lamel_L 0.0128 Ghahramani_Z 0.0062 Lavana_H 0.0024 Mursula_K 0.0089

Woodland_P 0.0124 Koller_D 0.0062 Pentland_A 0.0021 Butler_R 0.0083

Ney_H 0.0080 Jordan_M 0.0059 Myers_B 0.0021 Bjorkman_K 0.0078

Hansen_J 0.0078 Neal_R 0.0055 Minas_M 0.0021 Knapp_G 0.0067

Renals_S 0.0072 Raftery_A 0.0054 Burnett_M 0.0021 Kundu_M 0.0063

Noth_E 0.0071 Lukasiewicz_T 0.0053 Winiwarter_W 0.0020 Christensen_J 0.0059

Boves_L 0.0070 Halpern_J 0.0052 Chang_S 0.0019 Cranmer_S 0.0055

Young_S 0.0069 Muller_P 0.0048 Korvemaker_B 0.0019 Nagar_N 0.0050


Page 22: Models for Authors and Text Documents

Four example topics from CiteSeer (T=300)

Columns, left to right: TOPIC 205, TOPIC 209, TOPIC 289, TOPIC 10

WORD PROB. WORD PROB. WORD PROB. WORD PROB.

DATA 0.1563 PROBABILISTIC 0.0778 RETRIEVAL 0.1179 QUERY 0.1848

MINING 0.0674 BAYESIAN 0.0671 TEXT 0.0853 QUERIES 0.1367

ATTRIBUTES 0.0462 PROBABILITY 0.0532 DOCUMENTS 0.0527 INDEX 0.0488

DISCOVERY 0.0401 CARLO 0.0309 INFORMATION 0.0504 DATA 0.0368

ASSOCIATION 0.0335 MONTE 0.0308 DOCUMENT 0.0441 JOIN 0.0260

LARGE 0.0280 DISTRIBUTION 0.0257 CONTENT 0.0242 INDEXING 0.0180

KNOWLEDGE 0.0260 INFERENCE 0.0253 INDEXING 0.0205 PROCESSING 0.0113

DATABASES 0.0210 PROBABILITIES 0.0253 RELEVANCE 0.0159 AGGREGATE 0.0110

ATTRIBUTE 0.0188 CONDITIONAL 0.0229 COLLECTION 0.0146 ACCESS 0.0102

DATASETS 0.0165 PRIOR 0.0219 RELEVANT 0.0136 PRESENT 0.0095

AUTHOR PROB. AUTHOR PROB. AUTHOR PROB. AUTHOR PROB.

Han_J 0.0196 Friedman_N 0.0094 Oard_D 0.0110 Suciu_D 0.0102

Rastogi_R 0.0094 Heckerman_D 0.0067 Croft_W 0.0056 Naughton_J 0.0095

Zaki_M 0.0084 Ghahramani_Z 0.0062 Jones_K 0.0053 Levy_A 0.0071

Shim_K 0.0077 Koller_D 0.0062 Schauble_P 0.0051 DeWitt_D 0.0068

Ng_R 0.0060 Jordan_M 0.0059 Voorhees_E 0.0050 Wong_L 0.0067

Liu_B 0.0058 Neal_R 0.0055 Singhal_A 0.0048 Chakrabarti_K 0.0064

Mannila_H 0.0056 Raftery_A 0.0054 Hawking_D 0.0048 Ross_K 0.0061

Brin_S 0.0054 Lukasiewicz_T 0.0053 Merkl_D 0.0042 Hellerstein_J 0.0059

Liu_H 0.0047 Halpern_J 0.0052 Allan_J 0.0040 Lenzerini_M 0.0054

Holder_L 0.0044 Muller_P 0.0048 Doermann_D 0.0039 Moerkotte_G 0.0053


Page 23: Models for Authors and Text Documents

Four more topics

Columns, left to right: TOPIC 163, TOPIC 87, TOPIC 20, TOPIC 273

WORD PROB. WORD PROB. WORD PROB. WORD PROB.

MATCHING 0.1295 USER 0.2541 STARS 0.0164 METHOD 0.5851

STRING 0.0552 INTERFACE 0.1080 OBSERVATIONS 0.0150 METHODS 0.3321

LENGTH 0.0536 USERS 0.0788 SOLAR 0.0150 APPLIED 0.0268

ALGORITHM 0.0471 INTERFACES 0.0433 MAGNETIC 0.0145 APPLYING 0.0056

PATTERN 0.0327 GRAPHICAL 0.0392 RAY 0.0144 ORIGINAL 0.0054

STRINGS 0.0316 INTERACTIVE 0.0354 EMISSION 0.0134 DEVELOPED 0.0051

WORD 0.0287 INTERACTION 0.0261 GALAXIES 0.0124 PROPOSE 0.0046

MATCH 0.0235 VISUAL 0.0203 OBSERVED 0.0108 COMBINES 0.0034

PROBLEM 0.0232 DISPLAY 0.0128 SUBJECT 0.0101 PRACTICAL 0.0031

TEXT 0.0217 MANIPULATION 0.0099 STAR 0.0087 APPLY 0.0029

AUTHOR PROB. AUTHOR PROB. AUTHOR PROB. AUTHOR PROB.

Navarro_G 0.0200 Shneiderman_B 0.0060 Linsky_J 0.0143 Yang_T 0.0014

Gasieniec_L 0.0121 Rauterberg_M 0.0031 Falcke_H 0.0131 Zhang_J 0.0014

Amir_A 0.0087 Lavana_H 0.0024 Mursula_K 0.0089 Loncaric_S 0.0014

Baker_B 0.0073 Pentland_A 0.0021 Butler_R 0.0083 Liu_Y 0.0013

Crochemore_M 0.0070 Myers_B 0.0021 Bjorkman_K 0.0078 Benner_P 0.0013

Baeza-Yates_R 0.0067 Minas_M 0.0021 Knapp_G 0.0067 Faloutsos_C 0.0013

Shinohara_A 0.0061 Burnett_M 0.0021 Kundu_M 0.0063 Cortadella_J 0.0012

Szpankowski_W 0.0056 Winiwarter_W 0.0020 Christensen_J 0.0059 Paige_R 0.0011

Rytter_W 0.0056 Chang_S 0.0019 Cranmer_S 0.0055 Tai_X 0.0011

Ferragina_P 0.0051 Korvemaker_B 0.0019 Nagar_N 0.0050 Lee_J 0.0011


Page 24: Models for Authors and Text Documents

Some topics relate to generic word usage

TOPIC 273

WORD PROB.

METHOD 0.5851

METHODS 0.3321

APPLIED 0.0268

APPLYING 0.0056

ORIGINAL 0.0054

DEVELOPED 0.0051

PROPOSE 0.0046

COMBINES 0.0034

PRACTICAL 0.0031

APPLY 0.0029

AUTHOR PROB.

Yang_T 0.0014

Zhang_J 0.0014

Loncaric_S 0.0014

Liu_Y 0.0013

Benner_P 0.0013

Faloutsos_C 0.0013

Cortadella_J 0.0012

Paige_R 0.0011

Tai_X 0.0011

Lee_J 0.0011


Page 25: Models for Authors and Text Documents

Some likely topics per author (CiteSeer)

Author = Andrew McCallum, U Mass:
- Topic 1: classification, training, generalization, decision, data, ...
- Topic 2: learning, machine, examples, reinforcement, inductive, ...
- Topic 3: retrieval, text, document, information, content, ...

Author = Hector Garcia-Molina, Stanford:
- Topic 1: query, index, data, join, processing, aggregate, ...
- Topic 2: transaction, concurrency, copy, permission, distributed, ...
- Topic 3: source, separation, paper, heterogeneous, merging, ...

Author = Paul Cohen, USC/ISI:
- Topic 1: agent, multi, coordination, autonomous, intelligent, ...
- Topic 2: planning, action, goal, world, execution, situation, ...
- Topic 3: human, interaction, people, cognitive, social, natural, ...

Page 26: Models for Authors and Text Documents

Four example topics from NIPS (T=100)

Columns, left to right: TOPIC 19, TOPIC 24, TOPIC 29, TOPIC 87

WORD PROB. WORD PROB. WORD PROB. WORD PROB.

LIKELIHOOD 0.0539 RECOGNITION 0.0400 REINFORCEMENT 0.0411 KERNEL 0.0683

MIXTURE 0.0509 CHARACTER 0.0336 POLICY 0.0371 SUPPORT 0.0377

EM 0.0470 CHARACTERS 0.0250 ACTION 0.0332 VECTOR 0.0257

DENSITY 0.0398 TANGENT 0.0241 OPTIMAL 0.0208 KERNELS 0.0217

GAUSSIAN 0.0349 HANDWRITTEN 0.0169 ACTIONS 0.0208 SET 0.0205

ESTIMATION 0.0314 DIGITS 0.0159 FUNCTION 0.0178 SVM 0.0204

LOG 0.0263 IMAGE 0.0157 REWARD 0.0165 SPACE 0.0188

MAXIMUM 0.0254 DISTANCE 0.0153 SUTTON 0.0164 MACHINES 0.0168

PARAMETERS 0.0209 DIGIT 0.0149 AGENT 0.0136 REGRESSION 0.0155

ESTIMATE 0.0204 HAND 0.0126 DECISION 0.0118 MARGIN 0.0151

AUTHOR PROB. AUTHOR PROB. AUTHOR PROB. AUTHOR PROB.

Tresp_V 0.0333 Simard_P 0.0694 Singh_S 0.1412 Smola_A 0.1033

Singer_Y 0.0281 Martin_G 0.0394 Barto_A 0.0471 Scholkopf_B 0.0730

Jebara_T 0.0207 LeCun_Y 0.0359 Sutton_R 0.0430 Burges_C 0.0489

Ghahramani_Z 0.0196 Denker_J 0.0278 Dayan_P 0.0324 Vapnik_V 0.0431

Ueda_N 0.0170 Henderson_D 0.0256 Parr_R 0.0314 Chapelle_O 0.0210

Jordan_M 0.0150 Revow_M 0.0229 Dietterich_T 0.0231 Cristianini_N 0.0185

Roweis_S 0.0123 Platt_J 0.0226 Tsitsiklis_J 0.0194 Ratsch_G 0.0172

Schuster_M 0.0104 Keeler_J 0.0192 Randlov_J 0.0167 Laskov_P 0.0169

Xu_L 0.0098 Rashid_M 0.0182 Bradtke_S 0.0161 Tipping_M 0.0153

Saul_L 0.0094 Sackinger_E 0.0132 Schwartz_A 0.0142 Sollich_P 0.0141


Page 27: Models for Authors and Text Documents

ENRON Email: two example topics (T=100)

TOPIC 10

WORD PROB.

BUSH 0.0227

LAY 0.0193

MR 0.0183

WHITE 0.0153

ENRON 0.0150

HOUSE 0.0148

PRESIDENT 0.0131

ADMINISTRATION 0.0115

COMPANY 0.0090

ENERGY 0.0085

SENDER PROB.

NELSON, KIMBERLY (ETS) 0.3608

PALMER, SARAH 0.0997

DENNE, KAREN 0.0541

HOTTE, STEVE 0.0340

DUPREE, DIANNA 0.0282

ARMSTRONG, JULIE 0.0222

LOKEY, TEB 0.0194

SULLIVAN, LORA 0.0073

VILLARREAL, LILLIAN 0.0040

BAGOT, NANCY 0.0026


TOPIC 32

WORD PROB.

ANDERSEN 0.0241

FIRM 0.0134

ACCOUNTING 0.0119

SEC 0.0065

SETTLEMENT 0.0062

AUDIT 0.0054

CORPORATE 0.0053

FINANCIAL 0.0052

JUSTICE 0.0052

INFORMATION 0.0050

SENDER PROB.

HILTABRAND, LESLIE 0.1359

WELLS, TORI L. 0.0865

DUPREE, DIANNA 0.0825

ARMSTRONG, JULIE 0.0316

DENNE, KAREN 0.0208

SULLIVAN, LORA 0.0072

[email protected] 0.0026

WILSON, DANNY 0.0016

HU, SYLVIA 0.0013

MATHEWS, LEENA 0.0012


Page 28: Models for Authors and Text Documents

ENRON Email: two topics not about Enron

TOPIC 38

WORD PROB.

TRAVEL 0.0161

ROUNDTRIP 0.0124

SAVE 0.0118

DEALS 0.0097

HOTEL 0.0095

BOOK 0.0094

SALE 0.0089

FARES 0.0083

TRIP 0.0072

CITIES 0.0070

SENDER PROB.

TRAVELOCITY MEMBER SERVICES 0.0763

BESTFARES.COM HOT DEALS 0.0502

<[email protected]> 0.0315

LISTS.COOLVACATIONS.COM 0.0151

CHEAP TICKETS 0.0111

EXPEDIA FARE TRACKER 0.0106

TRAVELOCITY.COM 0.0096

[email protected] 0.0088

[email protected] 0.0066

LASTMINUTE.COM 0.0051


TOPIC 25

WORD PROB.

NEWS 0.0245

MAIL 0.0182

NYTIMES 0.0149

YORK 0.0128

PAGE 0.0095

TIMES 0.0090

HEADLINES 0.0079

BUSH 0.0077

DELIVERY 0.0070

HTML 0.0068

SENDER PROB.

THE NEW YORK TIMES DIRECT 0.3438

<[email protected]> 0.0104

THE ECONOMIST 0.0029

@TIMES - INSIDE NYTIMES.COM 0.0015

[email protected] 0.0011

AMAZON.COM DELIVERS BESTSELLERS 0.0009

NYTIMES.COM 0.0009

HYATT, JERRY 0.0008

NEWSLETTER_TEXT 0.0008

CHRIS LONG 0.0007


Page 29: Models for Authors and Text Documents

Stability of Topics

- The content of topics is arbitrary across runs of the model (e.g., topic #1 is not the same across runs)
- However:
  - the majority of topics are stable over processing time
  - the majority of topics can be aligned across runs (a matching sketch follows this list)
- Topics represent genuine structure in the data
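Alignment across runs can be checked by matching topics on a distance between their word distributions; the following slides do this with KL distance. A minimal sketch, assuming smoothed topic-word distributions from the two runs:

import numpy as np

def sym_kl(p, q):
    # Symmetrized KL divergence; assumes strictly positive (smoothed) distributions.
    return 0.5 * (np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def match_topics(phi1, phi2):
    """For each topic in run 1, the closest topic in run 2 and its distance."""
    D = np.array([[sym_kl(p, q) for q in phi2] for p in phi1])
    return D.argmin(axis=1), D.min(axis=1)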

Page 30: Models for Authors and Text Documents

Comparing NIPS topics from the same Markov chain

[Figure: 100 × 100 heatmap of KL distances between topics at t1 = 1000 (x-axis) and re-ordered topics at t2 = 2000 (y-axis); the KL distance color scale runs from 2 to 16.]

BEST KL = 0.54

t1              t2
ANALOG .043     ANALOG .044
CIRCUIT .040    CIRCUIT .040
CHIP .034       CHIP .037
CURRENT .025    VOLTAGE .024
VOLTAGE .023    CURRENT .023
VLSI .022       VLSI .023
INPUT .018      OUTPUT .022
OUTPUT .018     INPUT .019
CIRCUITS .015   CIRCUITS .015
FIGURE .014     PULSE .012
PULSE .012      SYNAPSE .012
SYNAPSE .011    SILICON .011
SILICON .011    FIGURE .010
CMOS .009       CMOS .009
MEAD .008       GATE .009

WORST KL = 4.78

t1                  t2
FEEDBACK .040       ADAPTATION .051
ADAPTATION .034     FIGURE .033
CORTEX .025         SIMULATION .026
REGION .016         GAIN .025
FIGURE .015         EFFECTS .016
FUNCTION .014       FIBERS .014
BRAIN .013          COMPUTATIONAL .014
COMPUTATIONAL .013  EXPERIMENT .014
FIBER .012          FIBER .013
FIBERS .011         SITES .012
ELECTRIC .011       RESULTS .012
BOWER .010          EXPERIMENTS .012
FISH .010           ELECTRIC .011
SIMULATIONS .009    SITE .009
CEREBELLAR .009     NEURO .009

Page 31: Models for Authors and Text Documents

Comparing NIPS topics from two different Markov chains

[Figure: 100 × 100 heatmap of KL distances between topics from chain 1 (x-axis) and re-ordered topics from chain 2 (y-axis); the KL distance color scale runs from 2 to 16.]

BEST KL = 1.03

Chain 1             Chain 2
MOTOR .041          MOTOR .040
TRAJECTORY .031     ARM .030
ARM .027            TRAJECTORY .030
HAND .022           HAND .024
MOVEMENT .022       MOVEMENT .023
INVERSE .019        INVERSE .021
DYNAMICS .019       JOINT .021
CONTROL .018        DYNAMICS .018
JOINT .018          CONTROL .015
POSITION .017       POSITION .015
FORWARD .014        FORWARD .015
TRAJECTORIES .014   FORCE .014
MOVEMENTS .013      TRAJECTORIES .013
FORCE .012          MOVEMENTS .012
MUSCLE .011         CHANGE .010

WORST KL = 9.49

Chain 1             Chain 2
ORDER .175          FUNCTION .091
SCALE .053          ORDER .064
HIGHER .035         EQUATION .048
MULTI .028          TERMS .027
NOTE .028           TERM .027
VOLUME .019         THEORY .014
TERMS .019          APPROXIMATION .014
STRUCTURE .017      FUNCTIONS .014
SCALES .017         FORM .014
INVARIANT .012      OBTAINED .013
SCALING .011        POINT .012
COMPLEXITY .010     RESPECT .012
MUSIC .009          GENERAL .011
NOTES .009          CASE .011
TABLE .008          ASSUME .011

Page 32: Models for Authors and Text Documents

Detecting Papers on Unusual Topics for Authors

We can calculate perplexity (unusualness) for the words in a document given an author (a sketch follows).
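A minimal sketch of this score, assuming the fitted θ (author-topic) and φ (topic-word) matrices; perplexity here is the exponentiated negative mean log-likelihood of the document's words under the author's topic mixture.

import numpy as np

def perplexity_given_author(word_ids, a, theta, phi):
    """exp(-mean log p(w | a)) with p(w | a) = sum_t P(w | t) P(t | a)."""
    p_w = phi.T @ theta[a]               # (V,) word probabilities under author a
    return float(np.exp(-np.log(p_w[word_ids]).mean()))

A high perplexity flags a paper whose words are unusual for that author.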

Papers ranked by perplexity for M. Jordan:

Page 33: Models for Authors and Text Documents

Author Separation

A method1 is described which like the kernel1 trick1 in support1 vector1 machines1 SVMs1 lets us generalize distance1 based2 algorithms to operate in feature1 spaces usually nonlinearly related to the input1 space This is done by identifying a class of kernels1 which can be represented as norm1 based2 distances1 in Hilbert spaces It turns1 out that common kernel1 algorithms such as SVMs1 and kernel1 PCA1 are actually really distance1 based2 algorithms and can be run2 with that class of kernels1 too As well as providing1 a useful new insight1 into how these algorithms work the present2 work can form the basis1 for conceiving new algorithms

This paper presents2 a comprehensive approach for model2 based2 diagnosis2 which includes proposals for characterizing and computing2 preferred2 diagnoses2 assuming that the system2 description2 is augmented with a system2 structure2 a directed2 graph2 explicating the interconnections between system2 components2 Specifically we first introduce the notion of a consequence2 which is a syntactically2 unconstrained propositional2 sentence2 that characterizes all consistency2 based2 diagnoses2 and show2 that standard2 characterizations of diagnoses2 such as minimal conflicts1 correspond to syntactic2 variations1 on a consequence2 Second we propose a new syntactic2 variation on the consequence2 known as negation2 normal form NNF and discuss its merits compared to standard variations Third we introduce a basic algorithm2 for computing consequences in NNF given a structured system2 description We show that if the system2 structure2 does not contain cycles2 then there is always a linear size2 consequence2 in NNF which can be computed in linear time2 For arbitrary1 system2 structures2 we show a precise connection between the complexity2 of computing2 consequences and the topology of the underlying system2 structure2 Finally we present2 an algorithm2 that enumerates2 the preferred2 diagnoses2 characterized by a consequence2 The algorithm2 is shown1 to take linear time2 in the size2 of the consequence2 if the preference criterion1 satisfies some general conditions

Written by (1): Scholkopf_B
Written by (2): Darwiche_A

Each word above is tagged with the author (1 or 2) to whom the model assigned it. Can the model attribute words to authors correctly within a document?

Test of the model:
1. Artificially combine abstracts from different authors
2. Check whether each word's assignment is to the correct original author
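A simplified point-estimate version of this attribution (the model itself marginalizes over assignments during Gibbs sampling; this sketch just takes the most likely co-author per word under a uniform author prior):

import numpy as np

def attribute_words(word_ids, coauthors, theta, phi):
    """Assign each word to the co-author most likely to have generated it."""
    p = np.array([phi.T @ theta[a] for a in coauthors])   # (n_authors, V)
    return [coauthors[int(np.argmax(p[:, w]))] for w in word_ids]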

Page 34: Models for Authors and Text Documents

Temporal patterns in topics: hot and cold topics

- We have CiteSeer papers from 1986-2001
- We can calculate time series for topics (a sketch follows):
  - hot topics become more prevalent
  - cold topics become less prevalent
- Do the time series correspond with known trends in computer science?
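One way to compute such a time series is to average the per-document topic proportions within each year. A sketch; the data structures (per-document topic count vectors from the Gibbs samples, plus each document's year) are assumptions:

import numpy as np
from collections import defaultdict

def topic_proportions_by_year(doc_topic_counts, doc_year):
    """Mean topic-proportion vector per year: hot topics rise, cold topics fall."""
    by_year = defaultdict(list)
    for counts, year in zip(doc_topic_counts, doc_year):
        by_year[year].append(counts / counts.sum())
    return {y: np.mean(v, axis=0) for y, v in sorted(by_year.items())}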

Page 35: Models for Authors and Text Documents

[Figure: "Document and Word Distribution by Year in the UCI CiteSeer Data", 1986-2002; number of documents per year (scale up to 2 × 10^4) and number of words per year (scale up to 14 × 10^5).]

Page 36: Models for Authors and Text Documents

Hot Topic: machine learning, data mining

[Figure: "Topic Proportions by Year in CiteSeer Data", 1990-2002, topic probability on the y-axis; rising curves for:
- Topic 114: regression, variance, estimator, estimators, bias
- Topic 153: classification, training, classifier, classifiers, generalization
- Topic 205: data, mining, attributes, discovery, association]

Page 37: Models for Authors and Text Documents

The inevitability of Bayes...

[Figure: "Topic Proportions by Year in CiteSeer Data", 1990-2002, topic probability on the y-axis; rising curves for:
- Topic 189: statistical, prediction, correlation, predict, statistics
- Topic 209: probabilistic, bayesian, probability, carlo, monte
- Topic 276: random, distribution, probability, markov, distributions]

Page 38: Models for Authors and Text Documents

Rise in Web/Mobile topics

[Figure: "Topic Proportions by Year in CiteSeer Data", 1990-2002, topic probability on the y-axis; rising curves for:
- Topic 7: web, user, world, wide, users
- Topic 80: mobile, wireless, devices, mobility, ad
- Topic 76: java, remote, interface, platform, implementation
- Topic 275: multicast, multimedia, media, delivery, applications]

Page 39: Models for Authors and Text Documents

(Not so) Hot Topics

[Figure: "Topic Proportions by Year in CiteSeer Data", 1990-2002, topic probability on the y-axis; curves for:
- Topic 23: neural, networks, network, training, learning
- Topic 35: wavelet, operator, operators, basis, coefficients
- Topic 242: genetic, evolutionary, evolution, population, ga]

Page 40: Models for Authors and Text Documents

Decline in programming languages, OS, ...

[Figure: "Topic Proportions by Year in CiteSeer Data", 1990-2002, topic probability on the y-axis; declining curves for:
- Topic 60: programming, language, concurrent, languages, implementation
- Topic 139: system, operating, file, systems, kernel
- Topic 283: collection, memory, persistent, garbage, stack
- Topic 268: memory, cache, shared, access, performance]

Page 41: Models for Authors and Text Documents

Security research reborn...

[Figure: "Topic Proportions by Year in CiteSeer Data", 1990-2002, topic probability on the y-axis; curves for:
- Topic 120: security, secure, access, key, authentication
- Topic 240: key, attack, encryption, hash, keys]

Page 42: Models for Authors and Text Documents

Decrease in use of Greek letters

[Figure: "Topic Proportions by Year in CiteSeer Data", 1990-2002, topic probability on the y-axis; declining curve for:
- Topic 157: gamma, delta, ff, omega, oe]

Page 43: Models for Authors and Text Documents

Burst of French writing in the mid-90s?

[Figure: "Topic Proportions by Year in CiteSeer Data", 1990-2002, topic probability on the y-axis; curve for:
- Topic 47: la, les, une, nous, est]

Page 44: Models for Authors and Text Documents

Comparison to models that use less information

Topics model (topics, no authors) and Author model (authors, no topics).

[Plate diagrams: in the topics model, each of the N_d words w in each of the D documents is drawn from a topic z sampled from the document's topic distribution; in the author model, each word w is drawn directly from an author x in the document's author set a_d.]

Page 45: Models for Authors and Text Documents

Matrix Factorization Interpretation

AUTHOR-TOPIC MODEL:
[Words × Documents] = [Words × Topics] [Topics × Authors] [Authors × Documents]

TOPIC MODEL:
[Words × Documents] = [Words × Topics] [Topics × Documents]

AUTHOR MODEL:
[Words × Documents] = [Words × Authors] [Authors × Documents]
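Written as probabilities rather than matrices, the three factorizations are (standard notation; P(a | d) is uniform over the co-authors a_d):

$$
\begin{aligned}
\text{Topic model:} \quad & P(w \mid d) = \sum_{t} P(w \mid t)\, P(t \mid d) \\
\text{Author model:} \quad & P(w \mid d) = \sum_{a \in \mathbf{a}_d} P(w \mid a)\, P(a \mid d) \\
\text{Author-topic model:} \quad & P(w \mid d) = \sum_{a \in \mathbf{a}_d} \sum_{t} P(w \mid t)\, P(t \mid a)\, P(a \mid d)
\end{aligned}
$$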

Page 46: Models for Authors and Text Documents

Comparison Results

- Train the models on part of a new document and predict the remaining words
- Without having seen any words from the new document, author-topic information helps in predicting words from that document
- The topics model is more flexible in adapting to the new document after observing a number of words

[Figure: perplexity of the new (unseen) words, roughly 2000-14000, versus the number of observed words in the document, with curves for the Author model, the Topics model, and the Author-Topics model.]

Page 47: Models for Authors and Text Documents

Author prediction with CiteSeer

Task: predict the (single) author of new CiteSeer abstracts (a ranking sketch follows).

Results:
- For 33% of documents, the author is guessed correctly
- Median rank of the true author = 26 (out of 85,000)
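A sketch of the prediction task: score every author by the likelihood of the new abstract's words under that author's topic mixture and rank. Whether the paper scores authors exactly this way is an assumption; the median-rank result above implies a ranking of this kind.

import numpy as np

def rank_authors(word_ids, theta, phi):
    """Rank all authors by log-likelihood of a new abstract's words."""
    p = theta @ phi                              # (A, V): p(w | a) for each author
    scores = np.log(p[:, word_ids]).sum(axis=1)  # log-likelihood per author
    return np.argsort(-scores)                   # best author first

For 85K authors, theta @ phi is a large dense matrix; scoring authors one at a time would be the memory-friendly variant.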

Page 48: Models for Authors and Text Documents

Perplexities for the true author and any random author

[Figure: perplexity (about 2000-5500) versus the number of topics (5, 10, 20, 50, 100, 200, 400, 800), comparing A = true author against A = any author; additional curves are labeled 0th, 1st, 2nd, 5th, and 10th.]

Page 49: Models for Authors and Text Documents

The Author-Topic Browser

[Screenshots (a)-(c): querying on an author (Pazzani_M), querying on a topic relevant to the author, and querying on a document written by the author.]

http://www.ics.uci.edu/~michal/KDD/ATM.htm

Page 50: Models for Authors and Text Documents

New Applications / Future Work

- Finding relevant email:
  - "Find emails similar to this email based on content"
  - "Find people who wrote emails similar in content to this one"
- Reviewer recommendation:
  - "Find reviewers for this set of NSF proposals who are active in relevant topics and have no conflicts of interest"
- Change detection/monitoring:
  - Which authors are on the leading edge of new topics?
  - Characterize the "topic trajectory" of this author over time
- Author identification:
  - Who wrote this document? Incorporation of stylistic information

Page 51: Models for Authors and Text Documents

Comparing NIPS topics and CiteSeer topics

[Figure: heatmap of KL distances between NIPS topics (x-axis) and re-ordered CiteSeer topics (y-axis); the KL distance color scale runs from 4 to 18.]

Example matched topic pairs, in the order shown on the slide:

KL = 2.88
NIPS               CiteSeer
MODEL .493         MODEL .498
MODELS .143        MODELS .227
MODELING .022      MODELING .055
PARAMETERS .020    DYNAMIC .009
BASED .012         MODELED .008
PROPOSED .010      FRAMEWORK .007

KL = 4.48
NIPS               CiteSeer
SPEECH .082        SPEECH .058
RECOGNITION .049   RECOGNITION .047
HMM .023           WORD .018
SPEAKER .022       SYSTEM .014
CONTEXT .022       SPEAKER .012
WORD .016          ACOUSTIC .010

KL = 4.92
NIPS               CiteSeer
SYSTEM .234        SYSTEM .497
SYSTEMS .090       SYSTEMS .350
REAL .020          BASED .012
BASED .018         PAPER .012
COMPUTER .014      COMPLEX .010
APPROACH .011      DEVELOPED .008

KL = 5.0
NIPS               CiteSeer
FUNCTION .159      FUNCTIONS .124
FUNCTIONS .115     FUNCTION .118
APPROXIMATION .069 ORDER .023
LINEAR .026        APPROXIMATION .022
BASIS .018         LINEAR .016
APPROXIMATE .016   INTERVAL .014