Subject Metadata Enhancement using Statistical Topic Models. DLF Forum, April 24, 2007. David Newman (UC Irvine), Kat Hagedorn (U Michigan)


1

Subject Metadata Enhancement using

Statistical Topic Models

DLF Forum, April 24, 2007

David Newman

(UC Irvine)

Kat Hagedorn

(U Michigan)

2

Everyone could use better metadata!

?

3

Outline

1. OAIster: Metadata Challenges

2. Clustering: Topic Model

3. Deployment of the Prototype

4. Lessons Learned

4

OAIster

• (as of Sept. 2006)
• 700+ institutions
• 9.6 million records
• Academically-oriented material, research literature, images, and more
• Problem: How to go beyond keyword search?

OAIster collection, e.g.,
– CiteSeer
– PubMed
– Library of Congress
– arXiv.org
– PictureAustralia

plus…
– Xiamen University Repository
– National Library of Serbia
– DSpace at Malmo University
– Deep Blue

5

Our Challenge

• OAIster wants to provide users with the best possible search and discovery experience

• Can improve search and discovery with
– Better metadata
– Better access to the metadata

6

How?

Enhancing access via…

• Search
– limit search results by subject classification

• Browse
– browse subject classification hierarchy

• Built prototype to showcase this

7

Topic Model

• State-of-the-art statistical algorithm

• Learns a set of topics or subjects covered by a collection of text records

• Works by finding patterns of co-occurring words

• Determines the mix of topics associated with each record
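The slides do not name the specific model or inference method. As a minimal sketch of the idea, the code below assumes latent Dirichlet allocation (LDA) fitted by collapsed Gibbs sampling, one common choice for statistical topic models. The function name `fit_topics`, the hyperparameters `alpha` and `beta`, and the toy records are all assumptions made for illustration.

```python
import random

def fit_topics(docs, num_topics, iterations=200, alpha=0.1, beta=0.01,
               top_n=5, seed=0):
    """Toy collapsed Gibbs sampler for LDA (an assumed model, not the authors' code).

    docs is a list of token lists. Returns (mixes, top_words):
    mixes[d][k] is the share of record d's words assigned to topic k,
    top_words[k] lists the most frequent words in topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]        # words per topic in each record
    topic_word = [[0] * V for _ in range(num_topics)]   # word counts per topic
    topic_total = [0] * num_topics
    assignments = []
    # Start from a random topic assignment for every word occurrence.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)
    # Resample each word's topic, favouring topics that are already common
    # in the same record and that already contain the same word.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta) / (topic_total[t] + V * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1
    mixes = [[c / max(len(doc), 1) for c in counts]
             for counts, doc in zip(doc_topic, docs)]
    top_words = [
        [vocab[i] for i in sorted(range(V), key=lambda i: -topic_word[k][i])[:top_n]]
        for k in range(num_topics)
    ]
    return mixes, top_words

# Hypothetical toy records, already tokenised.
toy_docs = [
    ["gene", "dna", "sequence", "gene"],
    ["game", "player", "equilibrium", "game"],
    ["dna", "sequence", "clone"],
    ["player", "strategy", "game"],
]
toy_mixes, toy_top_words = fit_topics(toy_docs, num_topics=2)
```

The two return values correspond to the last two bullets above: `top_words` describes the learned topics, and `mixes` gives the mix of topics for each record.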

8

Process

Cluster: OAI records → preprocess (using vocabulary) → topic model (cluster/learn) → topics

9

Process

Cluster: OAI records → preprocess (using vocabulary) → topic model (cluster/learn) → topics

Classify: OAI records → preprocess (using vocabulary) → topic model (classify) → 1. topics in records 2. records in topics

10

Process

Cluster (clustering is learning the topics): OAI records → preprocess (using vocabulary) → topic model (cluster/learn) → topics

Classify (classification is using the learned topics): OAI records → preprocess (using vocabulary) → topic model (classify) → 1. topics in records 2. records in topics

11

Building Vocabulary

• Preprocessed (sampled) repositories, excluded stopwords

• Only kept words that occurred in more than 10 records

• Result: a final vocabulary with ~ 90,000 words
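The filtering rule above can be sketched as a document-frequency cut-off. This is a minimal illustration, assuming records are already tokenised; the stopword set shown is a tiny placeholder, and the function name `build_vocabulary` is hypothetical.

```python
from collections import Counter

# Placeholder stopword list for illustration only.
STOPWORDS = {"the", "of", "a", "and", "to", "in", "is", "from"}

def build_vocabulary(records, min_records=10, stopwords=STOPWORDS):
    """Keep words that occur in more than `min_records` records."""
    doc_freq = Counter()
    for tokens in records:
        # Count each word once per record, ignoring stopwords.
        doc_freq.update(set(tokens) - stopwords)
    return sorted(w for w, n in doc_freq.items() if n > min_records)

# Hypothetical sample: "map" is in 12 records, "road" in 11, "rare" in 1.
sample_records = [["map", "road", "the"]] * 11 + [["map", "rare"]]
sample_vocab = build_vocabulary(sample_records)
```

Counting records rather than total occurrences means a word repeated many times in a single long record does not enter the vocabulary on that basis alone.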

12

OAI Record (from USC repository)

Coast route from Los Angeles to San Diego. Part one: Los Angeles to Santa Ana, 1927

Strip map of the automobile route from Los Angeles to Santa Ana via the coast route. Includes Los Angeles (top & left), Tustin (bottom), La Habra (right). Due north is about 35 degrees to the right of vertical. Principal features: municipalities, railroads, roads, mileages, road names, hotels, garages, Auto Club offices. Prominent locations (cities; streets; geographical features; institutions): Los Angeles (Broadway, 7th Street, Whittier Boulevard), Pasadena, Alhambra, San Gabriel, Montebello, Downey, Whittier, La Habra (Spadra), Fullerton, Anaheim (Los Angeles Street), Orange, Santa Ana (Main Street, 1st Street, 4th Street), Tustin; Rio Hondo, San Gabriel River; State School, Orange County Hospital.

Maps, Tourist; roadways; Automobile Club of Southern California

13

OAI Record

Harbor Freeway at 3rd Street overpass

14

Preprocessing Example

<ID=oai:CiteSeerPSU:44072>

<title>Reinforcement Learning: A Survey

<description>This paper surveys the field of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but differs considerably in the details and in the use of the word "reinforcement." …

<subject>Leslie Pack Kaelbling, Michael Littman, Andrew Moore. Reinforcement Learning: A Survey

After preprocessing (using the vocabulary):

<ID=oai:CiteSeerPSU:44072>

reinforcement learning survey

survey field reinforcement learning computer science perspective written accessible researcher familiar machine learning historical basis field broad selection current summarized reinforcement learning faced agent learn behavior trial error interaction dynamic environment resemblance psychology differ considerably detail word reinforcement …

leslie pack kaelbling littman andrew moore reinforcement learning survey
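A rough sketch of this kind of preprocessing step is shown below: lowercase the text, split it into alphabetic tokens, and keep only tokens that are in the vocabulary and not stopwords. It is an approximation under stated assumptions. The slide's output also shows word normalisation (for example "researchers" becoming "researcher"), which this sketch does not attempt, and the function name `preprocess` is hypothetical.

```python
import re

def preprocess(text, vocabulary, stopwords):
    """Lowercase, tokenise on letters, and filter by vocabulary and stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t in vocabulary and t not in stopwords]

# Hypothetical vocabulary and stopword list for the record's title field.
example_vocab = {"reinforcement", "learning", "survey"}
example_stopwords = {"a"}
example_title = preprocess("Reinforcement Learning: A Survey",
                           example_vocab, example_stopwords)
```

Applied to the title field of the CiteSeer record, this yields the same three tokens shown on the slide.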

15

Example Topics (1)

Words in Topic: gene sequence genes sequences cdna region amino_acid clones encoding cloned coding dna genomic cloning clone
Topic Label: gene sequencing

Words in Topic: social cultural political culture conflict identity society economic context gender contemporary politic world examines tradition sociology institution ethic discourse
Topic Label: cultural identity

Words in Topic: general_relativity gravity gravitational solution black_hole tensor einstein horizon spacetime equation field metric vacuum scalar matter energy relativity
Topic Label: relativity

Words in Topic: house garden houses dwelling housing homes terrace estate home building architecture residence homestead residences road cottage domestic fences lawn historic
Topic Label: domestic architecture

16

Example Topics (2)

Words in Topic: large small size larger smaller sizes scale sized largest
Usefulness: Reasonable but unusable

Words in Topic: foi para pacientes por foram dos doen resultados grupo das tratamento entre
Usefulness: Topic about patient treatment, in Portuguese

Words in Topic: building street visible santa_ana view avenue public_library front orange corner
Usefulness: Not usable: mix of concept words and specific geographic location words

17

Topics Assigned to a Record

Metadata Record:

Aggregating sets of judgments: two impossibility results compared. (C. List and P. Pettit)

May's celebrated theorem (1952) shows that, if a group of individuals wants to make a choice between two alternatives (say x and y), then majority voting is the unique decision procedure satisfying a set of attractive minimal conditions ...

Topic Labels (% words assigned):

game theory (21%)

argument (12%)

criteria (7%)
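The percentages can be read as the share of a record's words that the model assigned to each topic. A minimal sketch, assuming the model exposes one topic label per word occurrence; the function name `topic_percentages` and the sample assignment (including the "other" bucket) are made up to reproduce the figures above.

```python
from collections import Counter

def topic_percentages(word_topics):
    """Map each topic label to the percentage of the record's words assigned to it."""
    if not word_topics:
        return {}
    counts = Counter(word_topics)
    total = len(word_topics)
    return {topic: round(100 * n / total) for topic, n in counts.most_common()}

# Hypothetical per-word assignments for a 100-word record.
sample_assignments = (["game theory"] * 21 + ["argument"] * 12
                      + ["criteria"] * 7 + ["other"] * 60)
sample_percentages = topic_percentages(sample_assignments)
```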

18

Top Records in “Game Theory”

[GAME THEORY] game games equilibrium preferences player cooperative preference equilibria cooperation collective utility individual choice bargaining coalition nash strategy

1. Fundamental Components of the Gameplay Experience: Analysing Immersion

2. The Ethics of Computer Game Design

3. Backward Induction and Common Knowledge

4. Designing Puzzles for Collaborative Gaming Experience

5. Aggregating sets of judgments: two impossibility results compared

6. Games for Modal and Temporal Logics

7. Configuring the player - subversive behaviour in Project Entropia

8. From Mass Audience to Massive Multiplayer: How Multiplayer Games Create New Media Politics

9. Bargaining with incomplete information; Handbook of Game Theory with Economic Applications

10. Testable Restrictions of General Equilibrium Theory in Exchange Economies with Externalities

Repositories: RePEc; DSpace at ANU (Australian National University); Edinburgh Research Archive; Almae Matris Studiorum (AMS); eScholarship Repository; CiteSeer
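A "records in topics" listing like this one can be sketched as a ranking of records by how much of each record is assigned to the topic. The ranking criterion is an assumption (the slides do not state how the top records were chosen), and the function name `top_records` and the scores below are invented for illustration.

```python
def top_records(topic, record_mixes, n=10):
    """Return up to n record titles, ordered by their share of `topic`."""
    scored = [(mix.get(topic, 0.0), title) for title, mix in record_mixes.items()]
    scored = [(score, title) for score, title in scored if score > 0]
    scored.sort(key=lambda pair: -pair[0])  # highest share first
    return [title for _, title in scored[:n]]

# Hypothetical topic mixes for three records.
sample_mixes = {
    "Backward Induction and Common Knowledge": {"game theory": 0.40},
    "The Ethics of Computer Game Design": {"game theory": 0.55, "ethics": 0.20},
    "Harbor Freeway at 3rd Street overpass": {"roads": 0.70},
}
sample_top = top_records("game theory", sample_mixes, n=2)
```

A ranking of this kind would also explain why records about computer games appear next to records about economic game theory: both use words such as "game" and "player" heavily.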

19

Improving Topics

• First topic model of OAIster resulted in 70% of topics being usable

• We improved topics by
– Reduced Vocabulary: Remove topically low-value words from vocabulary (manual)
– Background Words Model: Automatically detect and remove words specific to repositories (automatic)
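The Background Words Model itself is a statistical model and is not specified in these slides. As a much simpler stand-in, the sketch below uses a frequency heuristic: a word is flagged as repository-specific when it appears in at least `min_records` records of one repository and at least `min_share` of all records containing it come from that repository. The function name, both thresholds, and the sample data are assumptions.

```python
from collections import Counter, defaultdict

def repository_specific_words(records_by_repo, min_share=0.9, min_records=2):
    """Heuristic stand-in (not the Background Words Model) for spotting
    words concentrated in a single repository.

    records_by_repo maps a repository name to a list of token lists.
    """
    per_repo = defaultdict(Counter)  # repo -> word -> number of records containing it
    overall = Counter()              # word -> number of records containing it anywhere
    for repo, records in records_by_repo.items():
        for tokens in records:
            for word in set(tokens):
                per_repo[repo][word] += 1
                overall[word] += 1
    flagged = {}
    for repo in records_by_repo:
        flagged[repo] = sorted(
            word for word, n in per_repo[repo].items()
            if n >= min_records and n / overall[word] >= min_share
        )
    return flagged

# Hypothetical tokenised records from two repositories.
sample_repos = {
    "CDL": [["ireland", "street", "dorothea_lange_collection"],
            ["horse", "child", "dorothea_lange_collection"]],
    "USC": [["street", "map"], ["street", "freeway"]],
}
sample_flagged = repository_specific_words(sample_repos)
```

A heuristic like this would also flag genuinely topical words that happen to occur in only one repository, which is one reason a model-based approach may be preferable.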

20

Background Words Model

METADATA

<T>Country Landscapes</T>

<SU>Ireland; town; street; sidewalk; house; cart; Dorothea Lange Collection; Photographic Essays (1953-1959); Ireland; Ireland, Country Landscapes</SU>

21

METADATA

<T>Horseplay</T>

<SU>Gunlock, UT; child, female, face; horse, back; rein, rope; Dorothea Lange Collection; Photographic Essays (1953-1959); Three Mormon Towns; Gunlock; Gunlock, Horseplay</SU>

22

METADATA

<T>Country Landscapes</T>

<SU>Ireland; town; street; sidewalk; house; cart; Dorothea Lange Collection; Photographic Essays (1953-1959); Ireland; Ireland, Country Landscapes</SU>

METADATA

<T>Horseplay</T>

<SU>Gunlock, UT; child, female, face; horse, back; rein, rope; Dorothea Lange Collection; Photographic Essays (1953-1959); Three Mormon Towns; Gunlock; Gunlock, Horseplay</SU>

California Digital Library (CDL) repository

23

Background Words Model

[Diagram: Shared Topics across Repositories, with Repository-specific words for each repository, e.g., “Dorothea Lange Collection …” for CDL]

24

Improvement of Music topic (Background Words)

Standard Topic Model

[MUSIC] music moa periodical_devoted ladies_repository literature_art song musical mus gov_nla nla_mus music_australian cover instrument piano musician south_america voice_piano drum peru song_piano composer marsh word singing violin playing orchestra vanity_fair

Shared Topic from Background Words Topic Model

[MUSIC] music poster musical dance song theatre instrument actor concert entertainment theater piano sound festival theatrical musician drama performances opera art ballet popular performing performer folk singer composer dancer drum jazz dancing orchestra stage pieces singing recording

25

Improvement of Family Photos topic (Reduced Vocab)

Standard Topic Model

family_photograph mss jpg george_edward anderson_photograph plate_negative women_portrait gelatin_dry photograph_portrait south_africa studio_portrait children_portrait hair standing sitting portrait underwood portrait_portrait front infant_portrait

Reduced Vocab Topic Model

family_photograph wearing woman hair dress clothing shoulder baby suit dressed chair clothing_dress wear hand tie shirt jacket costume boy ribbon collar dark lap bow white full_face beard young_woman leaning striped outdoor

26

Improvement Measures

[Bar chart: Percent (0 to 100) for Usable Topics, Records Enhanced, and Coverage; Original vs. Improved]

27

Deployment

• Added topic labels to a 2.5-million-record subset of OAIster (62 repositories)

• Built search and browse interface for testing purposes

28

Creating labels

• Community of “experts” created labels for the 352 topics

29

Adding classification

• At the same time, associated classification categories with the topics

• “High Level Browse” classification developed at UM, based on LC call numbers

30

Assign to records

• Top-4 topics are assigned to records

• Plus, the chosen classification categories

• Performed for each repository
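The assignment step can be sketched as picking a record's four highest-weighted topics and then looking up the classification category chosen for each. The function name `assign_to_record`, the topic weights, and the category names are hypothetical; the actual "High Level Browse" categories are not listed in these slides.

```python
def assign_to_record(topic_mix, topic_to_category, k=4):
    """Return (top-k topic labels, their classification categories without repeats)."""
    ranked = sorted(topic_mix.items(), key=lambda item: item[1], reverse=True)
    topics = [topic for topic, _ in ranked[:k]]
    categories = []
    for topic in topics:
        category = topic_to_category.get(topic)
        if category is not None and category not in categories:
            categories.append(category)
    return topics, categories

# Hypothetical topic mix for one record and a hypothetical topic-to-category map.
sample_mix = {"game theory": 0.21, "argument": 0.12, "criteria": 0.07,
              "economics": 0.05, "music": 0.01}
sample_map = {"game theory": "Social Sciences", "economics": "Social Sciences",
              "argument": "Humanities"}
sample_topics, sample_categories = assign_to_record(sample_mix, sample_map)
```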

31

Resulting record

32

Performing a search

• Using interface, end-users can choose to limit their searches by subject categories mapped to the topic labels

• e.g., doing a search by “gender” and “diversity”, limited to subject categories
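Limiting a search by subject category amounts to a keyword match followed by a filter on the categories stored with each enhanced record. The sketch below assumes a hypothetical record layout (`id`, `text`, `categories`) and simple substring matching; it is not the prototype's actual search engine.

```python
def search(records, query_terms, subject_category=None):
    """Return ids of records containing every query term, optionally limited
    to records tagged with `subject_category`."""
    hits = []
    for record in records:
        text = record["text"].lower()
        if not all(term.lower() in text for term in query_terms):
            continue
        if subject_category is not None and subject_category not in record["categories"]:
            continue
        hits.append(record["id"])
    return hits

# Hypothetical enhanced records.
sample_index = [
    {"id": "r1", "text": "Gender and diversity in engineering education",
     "categories": ["Social Sciences"]},
    {"id": "r2", "text": "Gender diversity of flowering plants",
     "categories": ["Science"]},
    {"id": "r3", "text": "Diversity of bird species",
     "categories": ["Science"]},
]
all_hits = search(sample_index, ["gender", "diversity"])
limited_hits = search(sample_index, ["gender", "diversity"], "Social Sciences")
```

Because the category filter is applied to the same keyword hits, switching to a different subject category only changes the filter, not the keyword query.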

33

34

Search results

• Can then choose to limit search by a different subject category…on the search results page

35

Resulting record in display

36

Lessons Learned (1)

• Topic modeling allows…
– Narrowing/expanding search results, without having to re-do the search
– Clarification of search

• Labels and classification
– Humanities records fared worse than scientific records, e.g., lack of metadata, use of metaphors
– Classification has some holes, e.g., history of war

37

Lessons Learned (2)

• Quality
– Should use experts to label topics

• Scalability
– Found ways to improve topics
– Some manual, some automatic, reasonably scalable

• Testing
– Real accuracy analysis needed
– Real end-user usability needed

38

Questions?

• David Newman, U. of California Irvine– [email protected]

• Kat Hagedorn, U. of Michigan– [email protected]