evaluating topic models for digital libraries · evaluating topic models for digital libraries kat...

72
Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, Irvine Youn Noh, Yale University Yale University Library 4 October 2011

Upload: others

Post on 25-Jun-2020

10 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

Evaluating Topic Models for Digital Libraries

Kat Hagedorn, University of Michigan David Newman, University of California, Irvine Youn Noh, Yale University Yale University Library 4 October 2011

Page 2: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

Talk Outline

•  Introduction (Youn) •  Topic Evaluation (Dave) •  Topic Assignment (Youn) •  Topics in Applications (Kat) •  Questions and Discussion (All)

2

Page 3: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

INTRODUCTION

3

Page 4: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

Project Scope, Purpose, and Participants

•  Scope: Research and development project funded by a National Leadership Grant from the Institute of Museum and Library Services

•  Purpose: Investigate applications of topic modeling for library and museum collections

•  Participants: – Yale University (lead) – University of California, Irvine – University of Michigan

4

Page 5: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

Project Goals

•  Propose topic modeling as one solution to organizing the massive number of digital objects produced by large scale digitization projects

•  Measure the extent to which topic modeling improves search and discovery

•  Disseminate these results and findings •  Show how topic modeling can be deployed and

applied by multiple institutions and constituencies •  Produce open source software and courseware for

others to use

5

Page 6: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

Ranganathan’s Five Laws of Library Science

Image: Some rights reserved by Sarah_G

Maybe topic modeling could help!

6

Page 7: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

How can you get involved?

•  Ask questions or get in touch after the talk •  Attend the Teaching with Technology Tuesdays

session today at 11:00 in Bass L01 •  Try out our topic modeling tool:

http://code.google.com/p/topic-modeling-tool/ •  Contact us

– Youn Noh ([email protected]) – David Newman ([email protected]) – Kat Hagedorn ([email protected])

7

Page 8: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

Acknowledgments

Adrian Knight * Arun Balagopalan Bill Dueber * Carolyn Caizzi * Elizabeth Boyce

Emmanuelle Delmas-Glass * Ernie Marinko Francesca Slade * Holly Rushmeier * Huang Huang

Joan Swanekamp * John Weise * Julie Dorsey Justo Arosemena * Karen Kupiec * Karen Rizvi Katie Bauer * Ken Hamma * Marlon Forrester

Meg Bellinger * Megha Okhai * Michael Kargela Mike Friscia * Nick Garver * Pam Franks

Patrick Paczkowski * Rebekah Irwin * Robert Carlucci Robert Nelson * Roger Espinosa * Scott Matheson

Scott Wilcox * Sherlock Campbell * Suzanne Chapman 8

Page 9: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

Evaluating Topic Models 1

Evaluating Topic Models

Page 10: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

Evaluating Topic Models 2

Background

•  IMLS Research Project –  Improving Search and Discovery of Digital Resources Using

Using Topic Modeling –  Yale (lead), UMich, UC Irvine

•  Apply topic modeling to three classes of digital library resources: full-text books, images, and tagged objects

•  Build prototypes of user interfaces that make use of topics

•  Test the prototypes to assess the value of topic modeling for users

Page 11: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

Evaluating Topic Models 3

Collections and challenges

•  Digitized books

•  Images

•  Scientific literature

•  Web 2.0 content

•  ! and more

Page 12: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

Evaluating Topic Models 4

Collections and challenges

•  Digitized books

•  Images

•  Scientific literature

•  Web 2.0 content

•  ! and more

Currently Digitized •  6,182,629 total volumes

3,621,100 book titles 146,505 serial titles 2,163,920,150 pages 230 terabytes 73 miles 5,023 tons

Page 13: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

Evaluating Topic Models 5

Collections and challenges

•  Catalog Search –  Subj: “American Colonial

History” 20,000 results •  Full-Text Search

–  “American Colonial History” 1,000,000 results

•  Limitation –  Users don’t have mental

model –  Users don’t trust metadata

•  Digitized books

•  Images

•  Scientific literature

•  Web 2.0 content

•  ! and more

Page 14: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

Evaluating Topic Models 6

Collections and challenges

•  Digitized books

•  Images

•  Scientific literature

•  Web 2.0 content

•  ! and more

q = “madonna and child”

Page 15: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

Evaluating Topic Models 7

Collections and challenges

•  Digitized books

•  Images

•  Scientific literature

•  Web 2.0 content

•  ! and more

•  1000 new articles daily •  Indexing using MeSH

Page 16: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

Evaluating Topic Models 8

What is Topic Modeling?

•  Topic Modeling (aka Latent Dirichlet Allocation)

•  Updated version of Latent Semantic Analysis

•  State-of-the-art model for collections of text documents

•  Works great on large collections of well written content

Page 17: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

Evaluating Topic Models 9

Topic model learns topics from co-occurring words

Collection of NSF Awards

Topic Model: -  Use words from title

& abstract -  Learn 400 topics

t49 t18 t114 t305

Automatically Learned Topics (10 of 400):

!

t6. conflict violence war international military !

t7. model method data estimation variables !

t8. parameter method point local estimates !

t9. optimization uncertainty optimal stochastic !

t10. surface surfaces interfaces interface !

t11. speech sound acoustic recognition human !

t12. museum public exhibit center informal outreach

t13. particles particle colloidal granular material !

t14. ocean marine scientist cosee oceanography !

t15. atmospheric chemistry ozone air organic !

!

Topic tags for each and every document

“Topics”

“Topic Tags”

Think of topic modeling as automatic assignment of subject headings ! that

you learn

Page 18: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

Evaluating Topic Models 10

A closer look at one automatically learned topic

topic-6: conflict violence war international military domestic political government terrorism national security civil !

•  What is this topic about? Is it a meaningful topic?

•  [How] Do we present this to users? ! What is a good label for this topic?

Page 19: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

Evaluating Topic Models 11

Overarching Questions

Q1: Are individual topics meaningful and usable? Q2: Are assignments of topics to documents meaningful

and usable? Q3: Do topics facilitate better or more efficient document

search, navigation, browsing?

Page 20: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

Evaluating Topic Models 12

Experimental Setup

Collection Sources Volume

Books Internet Archive Hathi Trust

12,000 books 28,000 books

News Articles LDC Gigaword (NY Times articles)

55,000 articles

Grant Abstracts National Institutes of Health

60,000 grants

Image Metadata Yale Library UMich Library

100,000s 100,000s

Page 21: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

Evaluating Topic Models 13

Experimental Setup

•  Topic modeled each document collection (using different topic resolutions). Selected a total of 600 topics for manual coherence scoring

•  Have N = 9-15 annotators score each of the 600 topics on a 3-point scale where 3=“useful” (coherent) and 1=“useless” (less coherent), based on the top-10 topic words –  also asked annotators to identify “best” topic word ! and –  suggest a short label for each topic

where 3=“useful” (coherent)

Page 22: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

Evaluating Topic Models 14

Example Coherent and Incoherent Topics

Books silk lace embroidery tapestry gold embroidered ... trout fish fly fishing water angler stream rod flies ...

Books head friend fellow captain open sure turn human ... soon gave returned replied told appeared arrived ...

News space earth moon science scientist light nasa ... health drug patient medical doctor hospital care ...

News oct sept nov aug dec july sun lite adv globe ! dog moment hand face love self eye turn young ...

Metadata japan scroll kamakura ink hanging oyobe hogge silk ... persian iran manuscript folio firdawsi century ...

Metadata abstraction sculpture united female english tabbaa ... oil john canvas william james england henry robert !

Coherent (unanimous score=3)

Less coherent (unanimous score=1)

Page 23: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

Evaluating Topic Models 15

Automatic Scoring of Topics?

•  Coherence of topic depends on relatedness of all 10-choose-2 pairs of top-10 topic words

•  Idea: Use external data to evaluate word pair

relatedness (e.g. Wikipedia)

Collection being topic modeled

Page 24: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

Evaluating Topic Models 16

Topic: music dance band rock opera ! Pointwise Mutual Information (measure of dependence)

Relatedness of word pairs

)Pr()Pr(),Pr(

log),(21

2121 ww

wwwwPMI =

jiijwwPMIwScorePMIij

ji <!=" # ,101,),()( …

Page 25: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

Evaluating Topic Models 17

Relatedness of Word Pairs

Topic: music dance band rock opera !

PMI-Score = 4.5 + 4.2 + ! + 1.4

Page 26: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

Evaluating Topic Models 18

PMI-score achieves human-level performance

Average human score

PM

I-sco

re

BOOKS (280 topics, corr=0.78)

Page 27: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

Evaluating Topic Models 19

PMI-score achieves human-level performance

Average human score

PM

I-sco

re

NEWS (117 topics, corr=0.77)

Page 28: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

Evaluating Topic Models 20

Experiments

Collection Human-human correlation

PMI-Human correlation

Books 0.76 0.78

News Articles 0.73 0.77

Grant Abstracts 0.48 0.63

Image Metadata 0.51 0.53

Page 29: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

Evaluating Topic Models 21

Outliers

•  PMI-score over-predicts coherence –  thou thy thee hast hath thine mine heart god heaven –  viii vii xii xiii xiv xvi xviii xix xvii main –  century fifteenth thirteenth fourteenth twelfth sixteenth middle . –  want look going deal try bad tell sure feel remember

•  PMI-score under-predicts coherence –  public government america policy nation political issues ! –  british britain england country united national foreign nation ... –  account cost item profit balance statement sale credit loss !

Page 30: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

Evaluating Topic Models 22

Best topic word and suggested Label

Topic Suggested Label trout fish fly fishing water angler stream rod flies ... fly fishing space earth moon science scientist light nasa ... space exploration race car nascar driver racing ! nascar racing

Page 31: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

Evaluating Topic Models 23

Best topic word task

Topic trout fish fly fishing water angler stream rod flies ... space earth moon science scientist light nasa ... race car nascar driver racing ! Features •  PMI( word1, word2 ) •  Prob( word | * ) ! word is evoked by other words ! e.g. space •  Prob( * | word ) ! evokes other words ! e.g. nascar

SVM-rank using above features beats baseline of first topic word (Lau, Newman, Karimi, Baldwin COLING 2010)

Page 32: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

Evaluating Topic Models 24

Suggested Label ! from Wikipedia (work in progress)

Topic Wiki Article Titles trout fish fly fishing water angler stream rod flies ... fly fishing

fishing angling trout !

space earth moon science scientist light nasa ... space exploration

space space science space colonization nasa missions !

Page 33: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

Evaluating Topic Models 25

Large-Scale User Studies

•  Developed prototype user interfaces for Images Collections and Books Collections that use topics

•  Large scale user studies at Yale and UMich underway

•  Qualitative and quantitative assessment value of topics

Page 34: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

Evaluating Topic Models 26

Conclusion

•  Topic models seem to be useful in DLs for creating additional metadata

•  But learned topics can vary in quality and coherence

•  We developed model to automatically evaluate topic coherence that matches human judgments

•  This is a useful step in integrating topic modeling into digital libraries

Page 35: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

TOPIC ASSIGNMENT

9

Page 36: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

Overview (1)

•  We frequently have to categorize things in our everyday lives (email, web pages, documents, photos)

•  But this task is often time-consuming and error-prone and, for library collections, requires specialized training

Image: Library of Congress. (2009). Understanding MARC Bibliographic. 10

Page 37: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

Overview (2)

•  Recall that topic modeling simultaneously learns the topics in a collection and assigns topics to documents

•  Topic modeling is similar to open card sorting •  In an open card sort, documents are sorted into

conceptual categories that haven’t been pre-specified •  For topic modeling:

– The # of topics (or categories) is the only input – Documents (or cards) may belong to more than

one topic (or category)

11

Page 38: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

Overview (3)

Image: GravityFreedom. (2007 June 14). Card Sorting Your Way to an Improved User Experience. 12

Page 39: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

Overview (4)

•  Latent topics are revealed in the sorting process •  In a separate step, categories are labeled based on

the scope and content of cards in the category •  In a similar fashion, topics are labeled based on the

scope and meaning of the words in the topic

13

Page 40: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

Pros •  Automatically generates

topics and assignments •  Takes relatively little

time •  Topics are consistently

and comprehensively assigned

•  Easy to update topics and assignments as collection grows and its scope evolves

Cons •  Topics are lists of words

that need to be labeled •  Domain knowledge is

often required to interpret and label topics

•  Statistically valid topics are not necessarily meaningful to users

Pros and Cons of Topic Modeling

14

Page 41: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

Topic Assignment Study

•  Purpose of study: Explore potential for using learned topics as subjects

•  Research question: How well do learned topics perform, as compared to Library of Congress Subject Headings (LCSH), in clustering like documents?

•  Practical application: Many digital library applications use categories or related items

15

Page 42: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

Scope of Study

•  Participants: 88 students from the University of Michigan and 62 students from Yale University

•  Test Collection: 27,710 digitized books in the subject areas of art and architecture from the University of Michigan Libraries

16

Page 43: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

Study Design

•  21 matched topic-LCSH pairs •  5 books assigned each topic or LCSH

17

Page 44: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

A Sample Pair

muscle bone lower head upper tendon joint front arm finger form surface outer rib spine ...

Anatomy, Artistic

18

Page 45: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

Another Sample Pair

drawer leg chair chest top mahogany front rail seat walnut carved furniture cupboard cabinet pine ...

Furniture

19

Page 46: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

A Sample Question

Do these books belong together?

Strongly disagree 1 2 3 4 5 Strongly

agree

20

Page 47: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

Test Format

•  Testing was conducted online •  Each of the 3 versions of the test had 14 items (7

topic-LCSH pairs presented in randomized order) •  In addition to the cover, participants could look inside

the book, using a Google Preview button

21

Page 48: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

Results

•  We received 47-50 responses for each item •  We applied the Mann Whitney U test to the mean

scores of the 21 topic-LCSH pairs: – 11 (p<.05) received higher topic scores – 4 (p<.05) received higher LCSH scores – 6 (p>.05) received similar scores

•  Overall, topics outperformed LCSH

22

Page 49: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

Observations

•  Some topics brought together words used in different contexts – A sample topic: air heating pipe heat gas hot

supply temperature tank boiler steam equipment electric pressure fuel ...

– Assigned to books on garages, public baths, theaters, and office buildings

•  Topic assignments were generally more specific than LCSH assignments, so the books in the topic sets tended to be more cohesive

23

Page 50: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

!"#$%&'(&)*'$+',##-$%./"+&'

0.1'2.3)*"4+'

(+$5)4&$16'"7'8$%9$3.+':$;4.4$)&'

Page 51: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

<.%=34">+*'

•  (&$+3'.'1"#$%'?"*)-$+3'.-3"4$19?'1"'@+*'4)-)5.+1'?.1)4$.-&'$+'.'-.43)'%"4#>&'"7'1)A1>.-'$1)?&'$&'+"1'+)BC'

•  2"B)5)4D'1"'*.1)'19)4)'9.&';))+'-$E-)'$+5)&/3./"+'$+1"'$1&'>&)7>-+)&&'1"')+*F>&)4&C''

Page 52: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

G+5)&/3./"+H'!B"''8)19"*&'

•  (+?"*)4.1)*'&1>*6'– +"'"+)'1"'9)-#'#.4/%$#.+1&I'– %"?#-$%.1)*'&)1>#'J'#4"%)&&'

•  8"*)4.1)*'&1>*6'–  7.%)F1"F7.%)'*$&%>&&$"+&'– )A#)41&'BC'19)$4'"B+'&).4%9)&'

Page 53: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

2.19$!4>&1H'&#)%$.-'$+1)47.%)'

•  (&)*'.'&>;&)1'"7'19)'2.19$!4>&1'K$3$1.-':$;4.46'1"'1)&1';"19'?)19"*&C'– LMDMNO'*$3$/P)*'5"->?)&''

– %"44)&#"+*$+3'1"'NQDQQN'4)%"4*&''–  74"?'19)'RSR'T@+)'.41&U':$;4.46'"7'V"+34)&&'W>;X)%1'2).*$+3'T:VW2U'%-.&&$@%./"+'

– #4$?.4$-6'$+'19)'Y+3-$&9'-.+3>.3)'

–  -$+=)*'1"';"19'$+F%"#64$391'.+*'7>--F5$)BF.%%)&&'5"->?)&'

Page 54: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

2.19$!4>&1H'&#)%$.-'$+1)47.%)'

Page 55: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

(+?"*)4.1)*H'1)&/+3'&%4$#1'

•  !.&=&'.+*'Z>)&/"+&'*)&$3+)*'1"'>/-$P)'>+?"*)4.1)*'1)&/+3')+5$4"+?)+1';6'.--"B$+3'#.4/%$#.+1'1"'74))-6'@+*'4)&>-1&'19.1'&>$1)*'19)'1.&=C'

•  ,1'19)'&.?)'/?)D'4)Z>$4)*'19)?'1"'*)&%4$;)'19)'4)&>-1&'.+*'4)%"4*&'19)6'7">+*';6'.+&B)4$+3'7"--"BF>#'Z>)&/"+&C'

Page 56: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

(+?"*)4.1)*H'1)&/+3'&%4$#1'

•  !"#$#%"&H'#)47"4?'.'3)+)4.-'&).4%9'7"4'>4;.+'#-.++$+3D'.&'$7'6">'B)4)'#.41'"7'.+'$+14"*>%1"46'.4%9$1)%1>4)'%-.&&'

•  '()$#%"&H'*"'.'*$[)4)+1'&).4%9'$+'19)'&.?)'&>;X)%1'.4).'.+*'19$&'/?)'>&)'19)'S.44"B'W).4%9'%"->?+''

•  *+)$#%"&H'*"'.'*$[)4)+1'&).4%9'.+*'&#)%$@%.--6'7"%>&'"+'19)'1"#$%'7.%)1'

•  W./&7.%/"+'5)%1"4&'T74"?'19)')+*F"7F1.&='&>45)6&UH''–  4)&>-1&'19)6'7">+*'–  ;""='19)6'%9"&)''–  7.%)1&'%"->?+'–  1"#$%&'$+'7.%)1&'%"->?+'

Page 57: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

(+?"*)4.1)*H''&)1>#'

•  V"?#-)1)-6'9.+*&F"['1)&/+3C'

•  !9)'&$3+'B.&'.'9>3)'9)-#I'

\.#)4'#4$+1">1'B.&'5$1.-'1"'3>$*)'#.4/%$#.+1&]'

Page 58: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

(+?"*)4.1)*H'&)1># ''

•  <)%.>&)'8"4.)'B.&'+"1'.%1>.--6'.+'">1F"7F19)F;"A'1""-D'+))*)*'&>##"41'&)45$%)&'1"';>$-*'-.6)4&'.4">+*'$1C'

G+$/.-';4.$+&1"4?$+3]'

Page 59: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

!9$&'#.4/%$#.+1'+.44"B)*';6'19)'

1"#$%'^.4%9$1)%1>4)_'

(+?"*)4.1)*H'>+*)4B.6'

\.4/%$#.+1&'%">-*'-$?$1';6'1"#$%'$+'19)'-)`F9.+*'TS.44"B'W).4%9U'%"->?+'

Page 60: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

(+?"*)4.1)*H'>+*)4B.6'

\.4/%$#.+1&'&1)##)*'

194">39'19)'1.&=&'$+'8"4.)'

Page 61: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

(+?"*)4.1)*H'>+*)4B.6'

,1'19)')+*'"7').%9'1.&='Ta'1"1.-UD'

19)4)'B.&'.'&>45)6'

!9)'&>45)6'B.&'19)'"+-6'?)19"*'B)'%">-*'>&)'1"'@+*'">1'$7'#.4/%$#.+1&'B)4)'&./&@)*'B$19'19)$4'4)&>-1&C'

Page 62: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

(+?"*)4.1)*'4)&>-1&'

•  !9)4)'B)4)'.'-"1'"7'4)&>-1&I'TaOb'#.4/%$#.+1&U'•  W"?)'4)&>-1&'B)4)'%"+14.*$%1"46]'

•  c5)4.--]'>&.3)'"7'1"#$%'7.%)1&'&1.6)*'9$39'.+*'%"+&1.+1'194">39">1'19)'&1>*6C'<)1B))+'MaFQad'"7'#.4/%$#.+1&'>&)*'19)'1"#$%'7.%)1&C'

Page 63: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

V"+14.*$%1"46'4)&>-1&'

!"#$%&%'()#*+,#%-./),%"#$)0%1+%2++$%"1%"22%3"4)1#5%

!"#$%&%#"*#3"4*+,%6",7)%

!"#$%8%'()#*+,#%-./),%"#$)0%1+%2++$%#9)4:;4"22<%"1%1+9:4%3"4)1#5%

!"#$%8%#"*#3"4*+,%6",7)%

e)4)'6">'&./&@)*'B$19'19)';""='6">'%9"&)f'

MMd'&./&@)*' e)4)'6">'&./&@)*'B$19'19)';""='6">'%9"&)f'

gOd'&./&@)*'

e)4)'6">'&./&@)*'B$19'19)'4)&>-1&'6">'7">+*f'

MNd'&./&@)*' e)4)'6">'&./&@)*'B$19'19)'4)&>-1&'6">'7">+*f'

agd'&./&@)*'

K$*'6">'@+*'19)'RS.44"B'W).4%9R'%"->?+'"+'19)'-)`'>&)7>-'$+'4)@+$+3'6">4'4)&>-1&f''

MQd'&./&@)*' K$*'6">'@+*'19)'R!"#$%&R'-$&1)*'$+'19)'RS.44"B'W).4%9R'%"->?+'"+'19)'-)`'>&)7>-'$+'4)@+$+3'6">4'4)&>-1&f'

hhd'&./&@)*'

W./&7.%/"+'4./+3'B)+1'*"B+'.`)4'#.4/%$#.+1&'B)4)'.&=)*'1"'>&)'19)'1"#$%'7.%)1&'&#)%$@%.--6]';>1'&1.6)*'9$39'B9)+'#.4/%$#.+1&'B)4)'.&=)*'1"'-""='.1'.--'7.%)1&C'

Page 64: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

V"+14.*$%1"46'4)&>-1&'•  \.4/%$#.+1&'B9"'>&)*'?"4)'1"#$%'7.%)1&'9.*'9$39)4'&./&7.%/"+'&%"4)&D'B9)+'19)6'B)4)'+"1'&#)%$@%.--6'.&=)*'1"'>&)'19)?C'

'•  W))?&'19.1'>&)'"7D'.+*'&./&7.%/"+'B$19'>&$+3D'19)'1"#$%&'$&'/)*'1"'19)'>&)'"7'"19)4'7.%)1&'$+'19)'7.%)1'%"->?+'

!+9:4%="4)1#%>)2)41)0! ?@)6"7)%>"*#3"4*+,%>4+6)%!"#$%&!

O" LChNOMNh"

N" NCMOhghg"

L" NCQOOOOO"

Ni9$39'&./&7.%/"+D'

gi-"B'&./&7.%/"+'

Page 65: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

8"*)4.1)*H'1)&/+3'&%4$#1'

•  0))+'1"'146'.'*$[)4)+1'?)19"*"-"36H'4)#-$%.1)'4)&>-1&'"4'*$&%"5)4'5.4$./"+&'"+'19)?C'

•  W$+%)'B)'9.*';))+'>+.;-)'1"'*$&%>&&'.+619$+3'B$19'#.4/%$#.+1&'*>4$+3'19)'"4$3$+.-'&1>*6D'7.%)F1"F7.%)'1)&/+3'B.&'.+'";5$">&'+)A1'&1)#C''

•  ,-&"D'>+?"*)4.1)*'&1>*6'*>)'1"'$1&'1)&1'*)&$3+'9.*'7"%>&)*'$+'-.43)'#.41'"+'19)'>&)'"7'19)'7.%)1&'19)?&)-5)&'.+*'1"'.'&?.--')A1)+1'"+'19)'%"+1)+1'"7'19)'%4).1)*'1"#$%&C'

Page 66: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

8"*)4.1)*H'1)&/+3'&%4$#1'

•  !"#$#%"&H'%"?#.4)'1B"'&).4%9)&D'"+)'>&$+3'1"#$%'7.%)1&'1"'-$?$1'19)'&).4%9'

•  '()$#%"&H'%"??)+1'"+'Z>.-$16'.+*'4.+=$+3'"7'&).4%9'#)47"4?)*'>&$+3'6">4'"B+'&).4%9'1)4?&j'%"?#.4)':VW2'J'1"#$%'7.%)1&'

•  *+)$#%"&H'-""='.1'.'4)%"4*'74"?'19$&'&).4%9j'%"?#.4)':VW2'J'1"#$%'7.%)1&j'%"??)+1'"+'.##4"#4$.1)+)&&'"7'1"#$%&'

Page 67: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

8"*)4.1)*'4)&>-1&'

% A("2:1<%+3%6)#(21# >(9)6:+6:1<%B%/)293(2,)##%+3%1+9:4#

?996+96:"1),)##%+3%1+9:4#%:,%6)4+60#

,%"&$!H'4)5$)B'L'&)1&'"7'4)&>-1&'T">4'&).4%9U

Q'"7'Q'TNOOdU'i'3""* +k. +k.

,%"&$'H'4)5$)B'1"#$%&'.+*':VW2'$+'4)&>-1&'T19)$4'&).4%9U

g'"7'Q'TbLCgdU'i'3""*'a'"7'Q'TaMCgdU'i'#""4

h'"7'Q'TgOdU'i'1"#$%&'&>#)4$"4'a'"7'Q''TaMCgdU'i':VW2'&>#)4$"4'N'"7'Q'TNLCgdU'i'+)$19)4

+k.

,%"&$*%H'%9""&)'.+*'4)5$)B'1"#$%'B$19$+'4)&>-1&'T19)$4'&).4%9U

b'"7'Q'TMgdU'i'3""*'L'"7'Q'TLgdU'i'#""4

g'"7'Q'TbLCgdU'i'1"#$%&'9)-#)*'a'"7'Q'TaMCgdU'i'1"#$%&'*$*+l1'9)-#

+k.

,%"&$*-H'4)5$)B'1"#$%&'.+*':VW2'$+'4)%"4*'T19)$4'&).4%9U

+k. b'"7'Q'TMgdU'i':VW2'&>#)4$"4'N'"7'Q'TNLCgdU'i'1"#$%&'&>#)4$"4'N'"7'Q'TNLCgdU'i'+)$19)4

M'"7'Q'TQMCgdU'i'1"#$%&'.##4"#4$.1)'N'"7'Q'TNLCgdU'i'1"#$%&'+"1'.##4"#4$.1)

!"#$%&'B)4)'9)-#7>-'.-19">39':VW2'B.&'%"+&$*)4)*'&>#)4$"4'1"'1"#$%&C'2"B)5)4D'-.43)'?.X"4$16'"7'#.4/%$#.+1&'19">391'1"#$%&'.&&$3+)*'1"'4)%"4*&'B)4)'.##4"#4$.1)C'

Page 68: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

8"*)4.1)*'4)&>-1&'

•  8"&1'$+1)4)&/+3'4)&>-1]'B9)+'#.4/%$#.+1&'-""=)*'.1':VW2'.+*'1"#$%&'1"3)19)4D'gOd'7">+*'1"#$%&'&>#)4$"4'.+*'aMCgd'7">+*':VW2'&>#)4$"4'i'.;">1')Z>.-C'

•  \.4/%$#.+1&'&.6H'–  R!9)6'm1"#$%&'.+*':VW2n'&))?'1"'4)-.1)'B)--'1"').%9'"19)4C_'

–  RG1'm>&)7>-+)&&'194">39">1'19)'&1>*6n'*$*'&B$1%9';)1B))+'1"#$%&'.+*'&>;X)%1'm:VW2nC'W"'$+'19.1'&)+&)'$1o&'34).1'1"'9.5)';"19C_'

•  W))?&'-$=)'%"+X>+%/"+'"7'19)'1B"'16#)&'"7'&>;X)%1F;.&)*'-$?$1)4&'$&'9)-#7>-'1"'#.4/%$#.+1&'9"#$+3'1"'7>419)4'4)@+)'.'&).4%9'4)&>-1C''

Page 69: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

8"4)'&>##"41'7"4'1"#$%&]'

•  ,&=)*'#.4/%$#.+1&'$7'19)6'19">391'19)'>&)'"7'1"#$%&'3.5)'.%%>4.1)'4)&>-1&'T;""=&'%9"&)+'?.1%9)*'1"#$%&'%9"&)+UC'QMCgd'&.$*'6)&C'

•  ,&=)*'#.4/%$#.+1&'$7'19)6'B">-*'>&)'1"#$%'7.%)1&'.3.$+'$+'19$&'$+1)47.%)C'Mgd'&.$*'6)&C'

•  ,-&"]''&"?)'4)&>-1&'&>##"41'%"+%->&$"+'19.1'>&$+3'?"4)'19.+'"+)'1"#$%'.1'.'/?)'1"'+.44"B'"4')A#.+*'.'&).4%9'%.+';)'?"4)')[)%/5)'19.+'>&$+3'X>&1'"+)'1"#$%'"+'$1&'"B+C'T,+*'19$&')[)%/5)+)&&'*)%4).&)&'B9)+'>&$+3'?"4)'19.+'194))'1"#$%&'.1'.'/?)CU'

Page 70: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

V"+%->&$"+&'

•  (+?"*)4.1)*'&1>*6'1"-*'>&H'>&)'"7D'.+*'&./&7.%/"+'B$19'>&$+3D'19)'1"#$%&'$&'/)*'1"'19)'>&)'"7'"19)4'7.%)1&'$+'19)'7.%)1'%"->?+C'

•  8"*)4.1)*'&1>*6'1"-*'>&H'%"+X>+%/"+'"7'19)'1B"'16#)&'"7'&>;X)%1F;.&)*'-$?$1)4&'$&'9)-#7>-'1"'#.4/%$#.+1&'9"#$+3'1"'7>419)4'4)@+)'.'&).4%9'4)&>-1C''

•  e)'%.+'%"+%->*)'19.1'#.4/%$#.+1&'5$)B'1"#$%&'.+*':VW2'.&'.##-)&'.+*'.##-)&D'+"1'.##-)&'.+*'"4.+3)&C'!9)6'B.+1'1"'>&)'19)?'1"3)19)4'1"'4)@+)'19)$4'4)&>-1&C'

Page 71: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

p>419)4'e"4='

•  e9.1'4)&>-1&'B">-*'B)'3)1'>&$+3'.--'"7'19)'4)%"4*&'$+'19)'W>??"+'&)45$%)'74"?'W)4$.-&'W"->/"+&f'

•  e9.1'4)&>-1&'B">-*'B)'3)1'$7'B)'1"#$%'?"*)-)*'.--'"7'19)'2.19$!4>&1'5"->?)&'T%>44)+1-6'"5)4'qCg'?$--$"+Uf'

•  2"B'B">-*'19)'4)&>-1&'*$[)4'$7'B)'>&)*'.'%"--)%/"+'"7':.B'"4',&14"+"?6'1)A1&f'

•  2"B'B">-*'>&)4&')5.->.1)'1"#$%&'.3.$+&1'.'*$[)4)+1'16#)'"7'%-.&&$@%./"+'&>%9'.&'19)',41'J',4%9$1)%1>4)'!9)&.>4>&'T,,!Uf'

Page 72: Evaluating Topic Models for Digital Libraries · Evaluating Topic Models for Digital Libraries Kat Hagedorn, University of Michigan David Newman, University of California, ... –

p>419)4'$+7"4?./"+'

•  0.1'B"4=&'.1'(+$5)4&$16'"7'8$%9$3.+':$;4.4$)&'r9E#HkkBBBC-$;C>?$%9C)*>k-$1k*-#&s'

'

•  V>44)+1'TW)#1)?;)4kc%1";)4U'$&&>)'"7'KF:$;'8.3.P$+)'9.&'7>--'.4/%-)'"+'19$&'B"4='r9E#Hkk*AC*"$C"43kNOCNOhgk&)#1)?;)4LONNF9.3)*"4+s'

'