Automating Creation of Hierarchical Faceted Metadata Structures Emilia Stoica, Marti Hearst and Megan Richardson* School of Information, Berkeley *Dept. of Mathematical Sciences, NMSU


Post on 16-Dec-2015


Automating Creation of Hierarchical Faceted Metadata Structures

Emilia Stoica, Marti Hearst and Megan Richardson*

School of Information, Berkeley
*Dept. of Mathematical Sciences, NMSU


Focus: Browse Large Datasets

Standard search interface (query box + retrieved results) is not suited for browsing and navigation

User interfaces need to group and organize the results


How do we Create Faceted Hierarchies?

Goals:
- Help an information architect to create the hierarchy
  - Currently they do it all by hand!
- Balance depth and breadth
  - Avoid “skinny” paths
  - Don’t go too deep or too broad
- Choose understandable labels
- Disambiguate between word senses


Related Work

Automated text categorization
- LOTS of work on this
- Assumes that a set of categories is already created

Little if any work on building facet hierarchies


Castanet

Carves out a structure from the hypernym (IS-A) relations within WordNet

Semi-automatic algorithm for creating hierarchical faceted metadata

Produces surprisingly good results for a wide range of subjects, e.g., recipes, medicine, math, news, fine arts image descriptions


WordNet Challenges

A word may have more than one sense

- Fine granularity of word sense distinctions

e.g., newspaper (#1) - daily publication on folded sheets

newspaper (#3) - physical object

- Ambiguity for the same sense

[Diagram: tuna; #1 cactus; #2 fish: food fish, bony fish]


WordNet Challenges (cont.)

The hypernym path may be quite long (e.g., sense #3 of tuna has 14 nodes)

Sparse coverage of proper names and noun phrases (not addressed)


Our Approach

[Pipeline diagram: Documents and WordNet feed into: Select terms, Build core tree, Augment core tree, Compress tree, Remove top-level categories, Divide into facets]


1. Select Terms

Select well-distributed terms from the collection
- Eliminate stopwords
- Retain only those terms with a distribution higher than a threshold (default: top 10%)
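The selection step can be sketched in Python as follows. This is a minimal illustration, not the Castanet implementation: it assumes that a term's “distribution” is the number of documents containing it, and the tokenizer, the tiny stopword list, and the name `select_terms` are all hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real list would be much longer).
STOPWORDS = {"a", "an", "and", "in", "of", "the", "to", "with"}

def select_terms(docs, top_fraction=0.10, stopwords=STOPWORDS):
    """Return {term: document frequency} for the best-distributed terms."""
    doc_freq = Counter()
    for text in docs:
        # Count each term at most once per document.
        tokens = set(re.findall(r"[a-z]+", text.lower()))
        doc_freq.update(tokens - stopwords)
    # Rank by document frequency (ties broken alphabetically), keep the top slice.
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    keep = max(1, round(len(ranked) * top_fraction))
    return dict(ranked[:keep])

docs = ["sundae with chocolate", "sherbet and sundae", "a sundae parfait"]
selected = select_terms(docs, top_fraction=0.25)
```

With four distinct non-stopword terms and `top_fraction=0.25`, only the single most widely distributed term is kept.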


2. Build Core Tree

Get hypernym path if term:
- has only one sense, or
- matches a pre-selected WordNet domain

Adding a new term increases a count at each node on its path by # of docs with the term.

Build a “backbone”
- Create paths from unambiguous terms only
- Bias the structure towards appropriate senses of words

[Diagram: two hypernym paths
sundae: entity > substance, matter > nutriment > dessert > frozen dessert > ice cream sundae > sundae
sherbet: entity > substance, matter > nutriment > dessert > frozen dessert > sherbet, sorbet > sherbet]
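A rough Python sketch of the counting described on this slide, under stated assumptions: `SENSE_PATHS` is a hand-written stand-in for a WordNet lookup, the document counts are made up, node labels stand in for synset identifiers, and the optional domain match is left out.

```python
from collections import Counter

# Hypothetical stand-in for WordNet: term -> one root-first hypernym path per sense.
SENSE_PATHS = {
    "sundae": [["entity", "substance, matter", "nutriment", "dessert",
                "frozen dessert", "ice cream sundae", "sundae"]],
    "sherbet": [["entity", "substance, matter", "nutriment", "dessert",
                 "frozen dessert", "sherbet, sorbet", "sherbet"]],
    # Two senses (paths abbreviated here), so this term is skipped at this stage.
    "date": [["entity", "edible fruit", "date"],
             ["abstraction", "calendar day", "date"]],
}

def build_core_paths(doc_freq, sense_paths):
    """Keep paths of single-sense terms and add each term's doc count to its path."""
    paths, counts = {}, Counter()
    for term, n_docs in doc_freq.items():
        senses = sense_paths.get(term, [])
        if len(senses) != 1:
            continue  # ambiguous or unknown terms are not part of the backbone
        paths[term] = senses[0]
        for node in senses[0]:
            counts[node] += n_docs
    return paths, counts

# Made-up document frequencies for illustration.
paths, counts = build_core_paths({"sundae": 5, "sherbet": 3, "date": 7}, SENSE_PATHS)
```

Shared ancestors such as "dessert" accumulate the counts of both terms, which is what later lets the more populated path win.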


2. Build Core Tree (cont.)

Merge hypernym paths to build a tree

[Diagram: the paths for sundae and sherbet are merged into one tree:
entity > substance, matter > nutriment > dessert > frozen dessert > {ice cream sundae > sundae; sherbet, sorbet > sherbet}]
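Merging paths that share a prefix can be illustrated with a nested dictionary. This is only a sketch: the nested-dict representation and the name `merge_paths` are assumptions, not taken from the slides.

```python
def merge_paths(paths):
    """Merge root-first label paths into a nested dict {label: {child: {...}}}."""
    tree = {}
    for path in paths:
        node = tree
        for label in path:
            # Reuse the node if an earlier path already created it.
            node = node.setdefault(label, {})
    return tree

SUNDAE = ["entity", "substance, matter", "nutriment", "dessert",
          "frozen dessert", "ice cream sundae", "sundae"]
SHERBET = ["entity", "substance, matter", "nutriment", "dessert",
           "frozen dessert", "sherbet, sorbet", "sherbet"]

tree = merge_paths([SUNDAE, SHERBET])
frozen = tree["entity"]["substance, matter"]["nutriment"]["dessert"]["frozen dessert"]
```

The two paths share everything down to "frozen dessert" and branch only below it.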


3. Augment Core Tree

Attach the terms with more than one sense to the core tree

Favor the more common path over other alternatives


Augment Core Tree (cont.)

Two candidate paths for “date”:

Date (p1): entity > substance, matter > food, nutrient > nutriment > food > edible fruit (78) > date (sibling: berries)

Date (p2): abstraction > measure, quantity > fundamental quality > time period > calendar day (18) > date (sibling: Sunday)

Choose path p1 since it has more items assigned (78 vs. 18)
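The choice between candidate paths can be sketched as below. This assumes the comparison uses the count already accumulated on each path's immediate parent node, as the numbers on the slide suggest; the exact scoring used by Castanet may differ, and `choose_path` is a hypothetical name.

```python
P1 = ["entity", "substance, matter", "food, nutrient", "nutriment",
      "food", "edible fruit", "date"]
P2 = ["abstraction", "measure, quantity", "fundamental quality",
      "time period", "calendar day", "date"]

# Counts already assigned in the core tree (values taken from the slide).
CORE_COUNTS = {"edible fruit": 78, "calendar day": 18}

def choose_path(candidate_paths, core_counts):
    """Pick the path whose parent node has the most documents assigned."""
    def support(path):
        # path[-1] is the term itself, path[-2] its would-be parent.
        return core_counts.get(path[-2], 0)
    return max(candidate_paths, key=support)

chosen = choose_path([P1, P2], CORE_COUNTS)
```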


Optional Step: Domains

To disambiguate, use Domains
- WordNet has 212 Domains: medicine, mathematics, biology, chemistry, linguistics, soccer, etc.
- A better collection has been developed by Magnini (2000)
  - Assigns a domain to every noun synset

Automatically scan the collection to see which domains apply

The user selects which of the suggested domains to use, or may add their own

Paths for terms that match the selected domains are added to the core tree


Using Domains

dip glosses:

Sense 1: A depression in an otherwise level surface

Sense 2: The angle that a magnetic needle makes with the horizon

Sense 3: Tasty mixture into which bite-size foods are dipped

dip hypernyms:

Sense 1: solid => concave shape => depression

Sense 2: shape, form => space => angle

Sense 3: food => ingredient, fixings => flavorer

Given domain “food”, choose sense 3
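The domain filter can be illustrated with a small sketch. The data structure and the name `senses_for_domains` are hypothetical; only sense 3 is given a domain label because the slides do not state the domains of senses 1 and 2.

```python
# Hypothetical sense inventory for "dip"; paths are copied from the slide.
DIP_SENSES = [
    {"sense": 1, "domain": None, "path": ["solid", "concave shape", "depression"]},
    {"sense": 2, "domain": None, "path": ["shape, form", "space", "angle"]},
    {"sense": 3, "domain": "food", "path": ["food", "ingredient, fixings", "flavorer"]},
]

def senses_for_domains(senses, selected_domains):
    """Keep only the senses whose domain label was selected by the user."""
    return [s for s in senses if s["domain"] in selected_domains]

matches = senses_for_domains(DIP_SENSES, {"food"})
```

When exactly one sense matches, its hypernym path can be added to the core tree as if the term were unambiguous.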


4. Compress Tree

Rule 1: Eliminate a parent with fewer than k children unless it is the root or its distribution is larger than 0.1*maxdist

[Diagram:
before: dessert > frozen dessert > {ice cream sundae > sundae; sherbet, sorbet > sherbet; parfait}
after: dessert > frozen dessert > {sundae, parfait, sherbet}]
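Rule 1 can be sketched as a bottom-up pass that splices out thin internal nodes. The `Node` class, the document counts, and the value `max_dist=100` (taken to be the largest distribution in a larger tree that is not shown) are assumptions for illustration.

```python
class Node:
    def __init__(self, label, dist, children=()):
        self.label = label
        self.dist = dist                  # number of documents under this node
        self.children = list(children)

def compress_rule1(node, k, max_dist):
    """Replace each non-root parent that has fewer than k children by its children."""
    kept = []
    for child in node.children:
        compress_rule1(child, k, max_dist)          # compress the subtree first
        is_parent = bool(child.children)            # leaves are never removed
        if is_parent and len(child.children) < k and child.dist <= 0.1 * max_dist:
            kept.extend(child.children)             # splice the grandchildren in
        else:
            kept.append(child)
    node.children = kept
    return node                                     # the node passed in is never removed

frozen = Node("frozen dessert", 10, [
    Node("ice cream sundae", 5, [Node("sundae", 5)]),
    Node("sherbet, sorbet", 3, [Node("sherbet", 3)]),
    Node("parfait", 2),
])
root = compress_rule1(Node("dessert", 10, [frozen]), k=2, max_dist=100)
```

With k = 2, the single-child nodes "ice cream sundae" and "sherbet, sorbet" disappear, while "dessert" survives because it is the root of the call.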


4. Compress Tree (cont.)

Rule 2: Eliminate a child whose name appears within the parent’s name

[Diagram:
before: dessert > frozen dessert > {sundae, parfait, sherbet}
after: dessert > {sundae, parfait, sherbet}]
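A sketch of Rule 2 follows, reading the rule the way the diagram suggests: the child "frozen dessert" is removed because its label contains the parent's label "dessert". That reading, the nested-dict tree, and the decision to keep leaf nodes are assumptions; under the opposite literal reading the two arguments of `contains_words` would be swapped.

```python
def contains_words(longer, shorter):
    """True if the words of `shorter` occur contiguously inside `longer`."""
    lw, sw = longer.split(), shorter.split()
    return any(lw[i:i + len(sw)] == sw for i in range(len(lw) - len(sw) + 1))

def compress_rule2(tree):
    """tree is a nested dict {label: subtree}; returns a compressed copy."""
    result = {}
    for label, subtree in tree.items():
        merged = {}
        for child, grandchildren in compress_rule2(subtree).items():
            # Leaves are kept so that no term is lost (an assumption of this sketch).
            if grandchildren and contains_words(child, label):
                merged.update(grandchildren)   # promote the grandchildren
            else:
                merged[child] = grandchildren
        result[label] = merged
    return result

before = {"dessert": {"frozen dessert": {"sundae": {}, "parfait": {}, "sherbet": {}}}}
after = compress_rule2(before)
```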


5. Divide into Facets


5. Divide into Facets (Remove top levels)

Rule 1: Eliminate the top t levels (t = 4 for recipe collection).

Rule 2: For each resulting tree, test if it has at least n children (n = 2). If yes, stop. If not, delete the root and repeat.

[Diagram:
before: entity > substance, matter > food, nutriment > food stuff, food product > ingredient, fixings > flavorer > {sweetening > {sugar, syrup}; herb > {parsley, oregano}}
after: flavorer > {sweetening > {sugar, syrup}; herb > {parsley, oregano}}]

Manual cleaning: remove facets that don’t make sense
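The two rules above can be sketched on a nested-dict tree. The function name, the representation, and the assumption that labels are unique are illustrative choices; the ordering of the top levels in the example follows the chain that makes t = 4 and n = 2 produce the "flavorer" facet.

```python
def divide_into_facets(tree, t, n):
    """Strip the top t levels, then peel roots until each has at least n children."""
    # Rule 1: descend t levels, collecting the subtrees found at that depth.
    forest = dict(tree)
    for _ in range(t):
        next_level = {}
        for subtree in forest.values():
            next_level.update(subtree)
        forest = next_level
    # Rule 2: keep a root with >= n children, otherwise delete it and test its children.
    facets = {}
    pending = list(forest.items())
    while pending:
        label, subtree = pending.pop()
        if len(subtree) >= n:
            facets[label] = subtree
        else:
            pending.extend(subtree.items())
    return facets

flavorer = {"sweetening": {"sugar": {}, "syrup": {}},
            "herb": {"parsley": {}, "oregano": {}}}
tree = {"entity": {"substance, matter": {"food, nutriment": {
    "food stuff, food product": {"ingredient, fixings": {"flavorer": flavorer}}}}}}

facets = divide_into_facets(tree, t=4, n=2)
```

Removing four levels leaves "ingredient, fixings", which has a single child and is therefore deleted, so "flavorer" becomes the facet root.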


Example: Recipes (13,500 docs)


Castanet Output (shown in Flamenco)


Castanet Output


Castanet Evaluation

This is a tool for information architects (IA), so people of this type did the evaluation

Each IA compared Castanet to other state-of-the-art algorithms
- LDA (Blei et al. ’04)
- Subsumption (Sanderson & Croft ’99)
- Baseline: most frequent terms in the collection

Datasets
- 13,000 recipes from Southwestcooking.com


Subsumption Output


Subsumption Output


LDA Output


LDA Output


Evaluation Method

For each of 2 systems’ output:
- Examined and commented on top-level
- Examined and commented on two sub-levels

Then comment on overall properties
- Meaningful?
- Systematic?
- Likely to use in your work?

[Diagram: system pairings L and C (16); S and C (18)]


Evaluation (cont.)

Sample questions for top level categories:
- Would you add/remove/rename any category?
- Did this category match your expectations?

Sample questions for a specific category:
- Would you add/move/remove any sub-categories?
- Would you promote any sub-category to top level?

General questions:
- Would you use Castanet?
- Would you use LDA?
- Would you use Subsumption?
- Would you use list of most frequent terms?


Evaluation Results

“Would you use this system in your work?” (answers “yes definitely” or “yes, in some cases”)

Castanet     85%
LDA           0%
Subsumption  37%
Baseline     74%

Average response to questions about quality (4 = “strongly agree”, 3 = “agree somewhat”, 2 = “disagree somewhat”, 1 = “strongly disagree”)


Evaluation Results

Average responses for top-level categories (4 = “no changes”, 3 = “one or two”, 2 = “a few”, 1 = “many”)

Average responses for 2 subcategories


Needed Improvements

- Take spelling variations and morphological variants into account
- Use verbs and adjectives, not just nouns
- Normalize noun phrases
- Allow terms to have more than one sense
- Improve algorithm for assigning documents to categories.


Conclusions

Castanet builds a set of faceted hierarchies by finding IS-A relations between terms using WordNet.

The method has been tested on various domains: medicine, recipes, math, news, description of images

Usability study shows:
- Castanet is preferred to other state-of-the-art solutions.
- Information architects want to use the tool in their work.

Future work
- Apply to tags (flickr, delicious)


Learn More

Funding: This work supported in part by NSF (IIS-9984741)

For more information: Stoica, E., Hearst, M., and Richardson, M., Automating Creation of Hierarchical Faceted Metadata Structures, NAACL/HLT 2007

See http://flamenco.berkeley.edu