Hierarchical Summaries for Search
By: Dawn J. Lawrie, University of Massachusetts, Amherst


Post on 16-Jan-2016



TRANSCRIPT

Page 1: Hierarchical Summaries

Hierarchical Summaries for Search

By: Dawn J. Lawrie, University of Massachusetts, Amherst

Page 2: Hierarchical Summaries


The Problem

Page 3: Hierarchical Summaries


Possible Solution

Page 4: Hierarchical Summaries


Possible Solution

Page 5: Hierarchical Summaries


Solution: Automatic Hierarchies

Page 6: Hierarchical Summaries


Strengths of Automatic Hierarchies

Word-based summary
Focus on topics of the documents
Allows users to navigate through the results
Easy to understand
Bonus: Useful for summarizing documents

Page 7: Hierarchical Summaries


Example

Hand-generated hierarchy of 50 documents
Query: “Endangered Species (Mammals)”

Endangered Animals (2910)
marine mammals (188)
legislation (64)
permits (102)
Critical Habitat (160)
Endangered plants (70)
Ecosystem Management (20)
Threatened (10)
Endangered Species Act (10)
mammals (1710)
fish (70)
birds (30)
insects (30)
amphibians (10)
sea lions (22)
manatees (11)
whales (74)
jaguars (20)
marine (128)
deer (11)
habitat protection (11)
rats (10)
Hawaii (30)
California (20)
Utah (10)
Virginia (10)
Melicope Species (10)
Waianae Plant Cluster Recovery Plan (10)
Waianae Mountains (10)

Page 8: Hierarchical Summaries


Proposed Framework

Document Set → Language Model → Term Selection Algorithm → Hierarchy

“Term” = word or phrase

Page 9: Hierarchical Summaries


Challenges

Selecting terms for the hierarchy
Displaying the hierarchy
Showing that it works

Page 10: Hierarchical Summaries


Outline

Introduction
Description of framework for creating hierarchies
Examples
Methods of evaluation
Future Improvements

Page 11: Hierarchical Summaries


Methodology

Build probabilistic word model of documents
Find “best” terms
  On topic
  Predictive
Recursive definition creates hierarchy

Page 12: Hierarchical Summaries


Term characteristics

Why topicality?
  Distinguish topic terms from the rest of the vocabulary
  The Secretary of Interior listed bald eagles south of the 40th parallel as endangered under the Endangered Species Preservation Act of 1966.
Why predictiveness?
  Topic words can be strongly related
  Represent different facets of the vocabulary
  Example: P(“Endangered” | “Steller sea lions”) = 1.00

Page 13: Hierarchical Summaries


Statistical Model

A^T refers to topicality with respect to topic T
  Find if the word w is in set T
B refers to predictiveness
  Precondition for other terms to occur
  Find if word w is in set P

$$S^*(V) = \arg\max_{T} P(A^T_i, B_i), \quad t_i \in T \text{ and } T \subseteq V$$

Page 14: Hierarchical Summaries


Probabilistic Word Model

Captures statistical information about text
Called a “language model” in speech recognition
Provides basis for estimation of probabilities

[Figure: chart with axis values from 0 to 0.005]
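As a rough illustration of what a probabilistic word model holds, the sketch below builds a maximum-likelihood unigram model from a handful of strings. The tokenizer, the function names, and the two sample documents are assumptions made for this example; the slides do not say how the model is estimated or smoothed.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def unigram_model(documents):
    """Maximum-likelihood unigram model: P(w) = count(w) / total tokens."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


# Hypothetical two-document collection, for illustration only.
docs = [
    "Marine mammals are protected species",
    "Endangered species include marine mammals",
]
model = unigram_model(docs)
```

Each word maps to its relative frequency, so the values sum to 1. A practical system would likely add smoothing so that unseen words do not receive zero probability.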

Page 15: Hierarchical Summaries


Estimating Topicality

Use term’s contribution to relative entropy
Compares two models using K-L divergence
  Model of documents in hierarchy
  Model of general English
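One way to read "a term's contribution to relative entropy" is as the single summand p(w) * log(p(w) / q(w)) of the K-L divergence D(hier || general). The sketch below computes that summand per term. The probabilities are made up, and the `floor` parameter is an assumed stand-in for whatever smoothing the real system uses.

```python
import math


def kl_contributions(p_hier, p_general, floor=1e-9):
    """Per-term summand p(w) * log(p(w) / q(w)) of D(hier || general).

    `floor` is an assumed minimum probability for terms missing from the
    general-English model; it only keeps the logarithm finite.
    """
    return {
        w: p * math.log(p / max(p_general.get(w, 0.0), floor))
        for w, p in p_hier.items()
        if p > 0
    }


# Made-up probabilities, for illustration only.
p_hier = {"species": 0.4, "marine": 0.3, "the": 0.3}
p_general = {"species": 0.1, "marine": 0.05, "the": 0.85}

contrib = kl_contributions(p_hier, p_general)
ranked = sorted(contrib, key=contrib.get, reverse=True)
divergence = sum(contrib.values())
```

Terms that are much more frequent in the document model than in general English get large positive contributions, while common function words such as "the" contribute negatively, which is why ranking by this value tends to surface topic terms.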

Page 16: Hierarchical Summaries


KL Example

[Figure: Language Model Comparison. y-axis: log(prob(term)); series: Hierarchy, General English]
[Figure: KL Divergence. y-axis: Value; series: D(hier||GE)]
Terms shown: mammal, fishery, species, marine, endangered

Page 17: Hierarchical Summaries


Estimating Predictiveness

Relates the vocabulary to a set of candidate topic terms
Use conditional probability - P_x(t|v)
  x is the maximum distance between t and v

$$P_x(t \mid v) = \frac{n_x(t, v)}{n(v) \cdot \text{size of windows}}$$

$$\frac{1}{|V|} \sum_{v \in V,\, v \neq t} P_x(t \mid v)$$

[Table: P(t|v), with t and v each ranging over the terms mammal, marine, species, and fishery. Values shown: .98, .31, .35, .99, .31, .65, .65, .35, .50, .03, .04, .01]
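To make the windowed conditional probability concrete, here is a minimal sketch that estimates P_x(t|v) as the share of occurrences of v that have t within x tokens on either side. That counting rule is an assumption chosen for simplicity and may differ from the normalization on the slide; the function name and sample sentence are hypothetical.

```python
def windowed_conditional(tokens, t, v, x):
    """Estimate P_x(t | v): share of v's occurrences with t within x tokens."""
    positions_v = [i for i, tok in enumerate(tokens) if tok == v]
    if not positions_v:
        return 0.0
    hits = 0
    for i in positions_v:
        # Up to x tokens before and x tokens after position i, excluding v itself.
        window = tokens[max(0, i - x):i] + tokens[i + 1:i + 1 + x]
        if t in window:
            hits += 1
    return hits / len(positions_v)


# Hypothetical sentence, for illustration only.
tokens = "steller sea lions are endangered and sea otters are threatened".split()
p_lions = windowed_conditional(tokens, "endangered", "lions", x=2)
p_sea = windowed_conditional(tokens, "endangered", "sea", x=2)
```

Here "endangered" appears within two tokens of the only occurrence of "lions" but within two tokens of only one of the two occurrences of "sea", so the estimate is higher for "lions".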

Page 18: Hierarchical Summaries


Dominating Set Approximation

Interpret the predictive language model as a graph whose edges are weighted by the conditional probability
Find terms that are connected to many other terms with high weight
Choose topic terms until the vocabulary is dominated (predicted)
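The greedy idea behind a dominating set approximation can be sketched as follows. This is not the author's algorithm: the domination threshold, the gain function, the stopping rule, and the sample weights are all assumptions chosen to keep the example small.

```python
def greedy_topic_terms(cond_prob, threshold=0.5, max_terms=10):
    """Greedily pick terms until every vocabulary word is dominated.

    cond_prob[t][v] holds P(t | v), the weight of the edge between t and v.
    A word v counts as dominated once a chosen t has P(t | v) >= threshold.
    Both the threshold and the gain function are assumptions of this sketch.
    """
    vocabulary = set(cond_prob)
    for edges in cond_prob.values():
        vocabulary.update(edges)
    undominated = set(vocabulary)
    chosen = []

    def gain(t):
        # Total weight of strong edges from t to words not yet dominated.
        return sum(w for v, w in cond_prob.get(t, {}).items()
                   if v in undominated and w >= threshold)

    while undominated and len(chosen) < max_terms:
        # Sorting makes tie-breaking deterministic (alphabetical).
        candidates = [t for t in sorted(vocabulary) if t not in chosen]
        if not candidates:
            break
        best = max(candidates, key=gain)
        if gain(best) == 0:
            break  # no remaining term dominates anything new
        chosen.append(best)
        undominated.discard(best)
        undominated -= {v for v, w in cond_prob[best].items() if w >= threshold}
    return chosen


# Made-up edge weights, for illustration only.
cond_prob = {
    "marine": {"mammal": 0.9, "fishery": 0.6, "whales": 0.7},
    "species": {"endangered": 0.8, "mammal": 0.4},
    "mammal": {"marine": 0.3},
}
chosen = greedy_topic_terms(cond_prob)
```

With these weights, "marine" is picked first because it strongly predicts three words, and "species" is picked next because it covers "endangered"; at that point every word is dominated and selection stops.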

Page 19: Hierarchical Summaries


Term Selection Example

[Figure: term selection example showing P(t|v) values between terms t and v]

Page 20: Hierarchical Summaries


Generating a Summary

4-step process
(1) Preprocess document set
(2) Generate a language model
(3) Select the terms
(4) Create a hierarchy (recursive)
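The four steps can be sketched as a small recursive routine. The term selector here simply ranks terms by document frequency; it is a placeholder for the topicality and predictiveness based selection, not the method from the talk. All function names, parameters, and sample documents are hypothetical.

```python
from collections import Counter


def select_terms(docs, exclude, k):
    """Stand-in selector: the k terms found in the most documents."""
    df = Counter()
    for doc in docs:
        # Steps 1-2 (simplified): lowercase, split, count document frequency.
        df.update(set(doc.lower().split()) - exclude)
    # Sort by descending count, then alphabetically for deterministic ties.
    ranked = sorted(df.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


def build_hierarchy(docs, exclude=frozenset(), k=2, depth=2):
    """Step 3 selects terms; step 4 recurses on the documents containing each."""
    if depth == 0 or not docs:
        return []
    nodes = []
    for term in select_terms(docs, exclude, k):
        subset = [d for d in docs if term in d.lower().split()]
        children = build_hierarchy(subset, exclude | {term}, k, depth - 1)
        nodes.append({"term": term, "count": len(subset), "children": children})
    return nodes


# Hypothetical mini-collection, for illustration only.
docs = [
    "marine mammals whales",
    "marine mammals sea lions",
    "marine fish permits",
    "endangered plants hawaii",
]
tree = build_hierarchy(docs)
```

Each node records a term, the number of documents that contain it, and the sub-hierarchy built from just those documents, which mirrors the term (count) labels used in the example hierarchies.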

Page 21: Hierarchical Summaries


Outline

Introduction
Description of framework for creating hierarchies
Examples
Methods of evaluation
Future Improvements

Page 22: Hierarchical Summaries


Example Hierarchies

Generated from 50 documents retrieved for the query: Endangered Species - Mammals
Demonstrate the difference between using different topic models
Web hierarchy using same query

Page 23: Hierarchical Summaries


Uniform Topic Model Hierarchy

endangered (86)
Act (41)
State (32)
Committee (43)
address (85)
operations (43)
incidental take (42)
NMFS (64)
population (32)
commercial fishing operations (42)
regulations (124)
fish (117)
permit (146)
number (93)
bill (51)
Secretary (73)
research (105)
amended (154)
marine (187)
species (439)
plan (192)
marine mammals (187)

Page 24: Hierarchical Summaries


KL-Topic Model Hierarchy

species (439)

Marine Mammal Protection Act (73)

marine mammals (187)

management plan (51)

marine (187)

Endangered Species Act (294)

endangered species (204)

habitat (283)

mammals (126)

Marine Mammal Commission (21)

fish (277)

National Marine Fisheries Service (113)

Act (313)

permit (164)

protection (244)

marine mammal stocks (20)

marine mammal species (42)

fishery (53)

Secretary (42)

NMFS (83)

stock (51)

fish species (32)

MMPA (51)

incidental (74)

research (63)

Page 25: Hierarchical Summaries


Web Hierarchies

Submit query to a web search engine
Gather titles and snippets of documents
  Text considered a document
  Documents are about 30 words

Page 26: Hierarchical Summaries


Example of Web Hierarchy

species of marine mammals (1)
Listed Species (1)
Species Information (1)
Endangered Species Act (8)
Protected Resources (2)
sea otter (2)
whales (13)
dolphins (7)
Cetaceans (2)
marine (76)
Mammals species (4)
Canadian Endangered Species (3)
federal Endangered Species (1)
marine mammals (91)
Endangered Mammals (22)
threatened (144)
species of mammals (27)
endangered mammal species (4)
birds (140)
British mammals (4)
animal species (1)
Critically Endangered Mammals (2)
Animal Info (2)
Ecosystems (2)
Scientists (2)
species of marine mammals (1)
Endangered Species Coalition (2)
Endangered Spaces (2)
List of Endangered Species (5)
marine mammals (97)
birds (114)
Endangered Mammals (13)
threatened (78)
small mammals (13)
large mammals (12)
Endangered Species (440)
endangered (491)
mammals (600)
terrestrial mammals (2)
endangered marine species (2)
Species Management (2)
marine species (4)
listing of species (1)
protected species (2)
native species (1)
Candidate species (2)
100 species (1)
new species (1)

Page 27: Hierarchical Summaries


Outline

Introduction
Description of framework for creating hierarchies
Examples
Methods of evaluation
Future Improvements

Page 28: Hierarchical Summaries


Evaluations

Summary Evaluation
  Tests how well the topic terms chosen predict the vocabulary
Access Evaluation
  Compare number of documents a user can find
Relevance Evaluation
  Path length to find all relevant documents

Page 29: Hierarchical Summaries


Automatic Evaluation Test Set

Use 50 standard queries
Document sets
  500 documents retrieved from TREC volumes 4 and 5 (have relevance judgments)
  200 documents retrieved from a news database
  1000 titles and snippets retrieved using Google™ Search Engine

Page 30: Hierarchical Summaries


Evaluating Hypotheses

[Table: TREC Collection and News Documents. Rows (hypotheses): Use KL-topic model; Use sub-collections. Columns (evaluations): Summary, Access, Relevance. Legend: one mark denotes an evaluation that confirmed the hypothesis; another denotes an evaluation that showed no significant difference.]

Page 31: Hierarchical Summaries


Web Document Evaluation

Results completely different
Best hierarchy: uniform topic model
Hierarchies do not look as good to human inspection

Page 32: Hierarchical Summaries


User Study

Include 12 to 16 users
Compare ranked list and hierarchy to ranked list alone
Users asked to find all instances that are relevant to the query
  Only have to identify one document about a particular instance
Study includes 10 queries

Page 33: Hierarchical Summaries


Future Work

Complete user study
Failure Analysis
Explore the use of topic hierarchies in other organizational tasks
  Personal collections of documents
  E-mails

Page 34: Hierarchical Summaries


Conclusions

Developed a formal framework for topic hierarchies
Created hierarchies from full text and snippets of documents
Verified intuition concerning hierarchies generated from full text

Page 35: Hierarchical Summaries


Questions?

Demo: http://www-ciir.cs.umass.edu/~lawrie/categories/google-qry/