hierarchical summaries
Hierarchical Summaries for Search
By: Dawn J. Lawrie, University of Massachusetts, Amherst
The Problem
Possible Solution
Possible Solution
Solution: Automatic Hierarchies
Strengths of Automatic Hierarchies
Word-based summary
Focus on topics of the documents
Allows users to navigate through the results
Easy to understand
Bonus: Useful for summarizing documents
Example
Hand-generated hierarchy of 50 documents
Query: “Endangered Species (Mammals)”
Endangered Animals (2910)
marine mammals (188)
legislation (64)
permits (102)
Critical Habitat (160)
Endangered plants (70)
Ecosystem Management (20)
Threatened (10)
Endangered Species Act (10)
mammals (1710)
fish (70)
birds (30)
insects (30)
amphibians (10)
sea lions (22)
manatees (11)
whales (74)
jaguars (20)
marine (128)
deer (11)
habitat protection (11)
rats (10)
Hawaii (30)
California (20)
Utah (10)
Virginia (10)
Melicope Species (10)
Waianae Plant Cluster Recovery Plan (10)
Waianae Mountains (10)
Proposed Framework
Document Set -> Language Model -> Term Selection Algorithm -> Hierarchy
“Term” = word or phrase
Challenges
Selecting terms for the hierarchy
Displaying the hierarchy
Showing that it works
Outline
Introduction
Description of framework for creating hierarchies
Examples
Methods of evaluation
Future Improvements
Methodology
Build probabilistic word model of documents
Find “best” terms
  On topic
  Predictive
Recursive definition creates hierarchy
Term characteristics
Why topicality?
  Distinguish topic terms from the rest of the vocabulary
  The Secretary of Interior listed bald eagles south of the 40th parallel as endangered under the Endangered Species Preservation Act of 1966.
Why predictiveness?
  Topic words can be strongly related
  Represent different facets of the vocabulary
  Example: P(“Endangered”|“Steller sea lions”) = 1.00
Statistical Model
A^T refers to topicality with respect to topic T
  Find if the word w is in set T
B refers to predictiveness
  Precondition for other terms to occur
  Find if word w is in set P
$S^*(V) = \arg\max_{T} P(A^T, B)$, where $T \subseteq V$
Probabilistic Word Model
Captures statistical information about text
Called a “language model” in speech recognition
Provides basis for estimation of probabilities
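A minimal sketch of such a probabilistic word model, assuming whitespace tokenization and plain maximum-likelihood estimates (the slides do not say which estimator or smoothing was used). The function name `unigram_model` and the two sample documents are hypothetical.

```python
from collections import Counter

def unigram_model(documents):
    """Maximum-likelihood unigram model: P(w) = count(w) / total tokens."""
    counts = Counter()
    for text in documents:
        # Assumption: lowercase, whitespace-separated tokens.
        counts.update(text.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

# Hypothetical two-document collection (7 tokens in total).
docs = ["marine mammals are endangered", "endangered species act"]
model = unigram_model(docs)
print(model["endangered"])  # 2 of the 7 tokens
```

The resulting dictionary is a probability distribution over the vocabulary, which is the kind of estimate the later topicality and predictiveness steps build on.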
Estimating Topicality
Use term’s contribution to relative entropy
Compares two models using K-L divergence
  Model of documents in hierarchy
  Model of general English
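As a rough illustration of a term's contribution to relative entropy, the sketch below computes p_hier(w) * log2(p_hier(w) / p_GE(w)) for each term. The probability floor for words missing from the general-English model, the function name `kl_contributions`, and the toy probabilities are all assumptions made for this example.

```python
import math

def kl_contributions(p_hier, p_general, floor=1e-9):
    """Per-term contribution to D(hier || GE): p * log2(p / q)."""
    return {
        w: p * math.log2(p / max(p_general.get(w, 0.0), floor))
        for w, p in p_hier.items()
        if p > 0
    }

# Toy distributions (hypothetical values, each summing to 1).
p_hier = {"marine": 0.25, "species": 0.25, "the": 0.5}
p_general = {"marine": 0.0625, "species": 0.125, "the": 0.5, "of": 0.3125}

contrib = kl_contributions(p_hier, p_general)
ranked = sorted(contrib, key=contrib.get, reverse=True)
print(ranked)  # terms ordered from most to least topical
```

A word such as "the", which is equally likely in both models, contributes nothing, while a word far more frequent in the hierarchy's documents than in general English scores high.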
KL Example
[Figure, two panels over the terms mammal, fishery, species, marine, endangered. Panel "Language Model Comparison": log(prob(term)) under the Hierarchy and General English models. Panel "KL Divergence": value of each term's contribution to D(hier||GE).]
Estimating Predictiveness
Relates the vocabulary to a set of candidate topic terms
Use conditional probability: P_x(t|v)
  x is the maximum distance between t and v
$P_x(t \mid v) = \dfrac{n_x(t, v)}{n(v) \cdot \text{size of windows}}$
$P_x(t \mid V) = \dfrac{1}{|V|} \sum_{v \in V,\, v \neq t} P_x(t \mid v)$
[Figure: graph over the terms mammal, marine, species, and fishery, with edges between t and v labeled by P(t|v); the labels shown range from .01 to .99.]
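One simple way to estimate a windowed conditional probability is sketched below: the fraction of occurrences of v that have t within x tokens on either side. This normalization is an assumption chosen for illustration and may differ from the one on the slide; it also handles single-word terms only. The function name `predictiveness` and the sample sentence are hypothetical.

```python
def predictiveness(tokens, t, v, x):
    """Fraction of occurrences of v with t within x tokens on either side."""
    positions = [i for i, w in enumerate(tokens) if w == v]
    if not positions:
        return 0.0
    hits = 0
    for i in positions:
        # Window of up to x tokens before and x tokens after position i.
        window = tokens[max(0, i - x):i] + tokens[i + 1:i + 1 + x]
        if t in window:
            hits += 1
    return hits / len(positions)

# Hypothetical token stream.
tokens = "endangered steller sea lions and endangered whales near sea ice".split()

print(predictiveness(tokens, "endangered", "lions", 3))  # every "lions" has "endangered" nearby
print(predictiveness(tokens, "endangered", "sea", 2))    # only one of two "sea" occurrences does
```

A value near 1 means that seeing v almost always implies t is close by, which is the sense in which t "predicts" v.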
Dominating Set Approximation
Interpret the predictive language model as a graph with edges weighted by the conditional probability
Finds terms that are connected to lots of terms with a high weight
Chooses topic terms until vocabulary is dominated (predicted)
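The selection loop can be sketched as a greedy approximation, assuming `weights[t][v]` holds P(t|v) and that a word counts as dominated once a chosen term predicts it with at least `threshold`. The threshold, the function names, and the toy weights are hypothetical; the slides do not give the exact stopping rule.

```python
def gain(weights, term, undominated, threshold):
    """Total edge weight from term to undominated words it predicts."""
    edges = weights.get(term, {})
    return sum(w for v, w in edges.items()
               if v in undominated and w >= threshold)

def select_topic_terms(weights, vocabulary, threshold=0.5):
    """Greedily pick terms until the vocabulary is dominated or no term helps."""
    undominated = set(vocabulary)
    chosen = []
    while undominated:
        candidates = [t for t in weights if t not in chosen]
        if not candidates:
            break
        best = max(candidates,
                   key=lambda t: gain(weights, t, undominated, threshold))
        if gain(weights, best, undominated, threshold) == 0.0:
            break  # nothing left that any candidate can predict
        chosen.append(best)
        undominated -= {v for v, w in weights[best].items() if w >= threshold}
        undominated.discard(best)  # a chosen term no longer needs covering
    return chosen

# Toy graph (hypothetical weights): weights[t][v] = P(t|v).
weights = {
    "marine": {"mammal": 0.9, "fishery": 0.6, "species": 0.3},
    "species": {"mammal": 0.4, "endangered": 0.8, "marine": 0.3},
    "fishery": {"marine": 0.2},
}
vocabulary = ["mammal", "fishery", "species", "endangered", "marine"]
topics = select_topic_terms(weights, vocabulary)
print(topics)
```

Treating a chosen term as covered by itself is a design choice of this sketch; without it the loop could keep searching for a term to predict a word that is already a topic.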
Term Selection Example
[Figure: P(t|v) values for candidate topic terms t and vocabulary terms v]
Generating a Summary
4-step process
(1) Preprocess document set
(2) Generate a language model
(3) Select the terms
(4) Create a hierarchy (recursive)
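The recursive step can be sketched as follows: select terms for a document set, then repeat on the documents that contain each selected term. The selector here is a deliberately simple stand-in (document frequency), not the topicality and predictiveness selection described in the talk; `top_terms`, `build_hierarchy`, and the parameters `k` and `depth` are hypothetical names.

```python
from collections import Counter

def top_terms(docs, exclude, k):
    """Stand-in selector: the k terms that appear in the most documents."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()) - exclude)
    # Sort by descending document frequency, then alphabetically for stability.
    return sorted(df, key=lambda t: (-df[t], t))[:k]

def build_hierarchy(docs, exclude=frozenset(), k=2, depth=2):
    """Select terms, then recurse on the documents containing each term."""
    if depth == 0 or len(docs) < 2:
        return []
    nodes = []
    for term in top_terms(docs, exclude, k):
        subset = [d for d in docs if term in d.lower().split()]
        children = build_hierarchy(subset, exclude | {term}, k, depth - 1)
        nodes.append({"term": term, "count": len(subset), "children": children})
    return nodes

# Hypothetical miniature document set.
docs = ["marine mammals whales", "marine mammals seals",
        "marine fish", "birds habitat"]
tree = build_hierarchy(docs)
print(tree[0]["term"], tree[0]["count"])
```

Each node records the term and the number of documents under it, and the recursion stops when the depth limit is reached or too few documents remain.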
Outline
Introduction
Description of framework for creating hierarchies
Examples
Methods of evaluation
Future Improvements
Example Hierarchies
Generated from 50 documents retrieved for the query: Endangered Species - Mammals
Demonstrate the difference between using different topic models
Web hierarchy using same query
Uniform Topic Model Hierarchy
endangered (86)
Act (41)
State (32)
Committee (43)
address (85)
operations (43)
incidental take (42)
NMFS (64)
population (32)
commercial fishing operations (42)
regulations (124)
fish (117)
permit (146)
number (93)
bill (51)
Secretary (73)
research (105)
amended (154)
marine (187)
species (439)
plan (192)
marine mammals (187)
KL-Topic Model Hierarchy
species (439)
Marine Mammal Protection Act (73)
marine mammals (187)
management plan (51)
marine (187)
Endangered Species Act (294)
endangered species (204)
habitat (283)
mammals (126)
Marine Mammal Commission (21)
fish (277)
National Marine Fisheries Service (113)
Act (313)
permit (164)
protection (244)
marine mammal stocks (20)
marine mammal species (42)
fishery (53)
Secretary (42)
NMFS (83)
stock (51)
fish species (32)
MMPA (51)
incidental (74)
research (63)
Web Hierarchies
Submit query to a web search engine
Gather titles and snippets of documents
  Text considered a document
  Documents are about 30 words
Example of Web Hierarchy
species of marine mammals (1)
Listed Species (1)
Species Information (1)
Endangered Species Act (8)
Protected Resources (2)
sea otter (2)
whales (13)
dolphins (7)
Cetaceans (2)
marine (76)
Mammals species (4)
Canadian Endangered Species (3)
federal Endangered Species (1)
marine mammals (91)
Endangered Mammals (22)
threatened (144)
species of mammals (27)
endangered mammal species (4)
birds (140)
British mammals (4)
animal species (1)
Critically Endangered Mammals (2)
Animal Info (2)
Ecosystems (2)
Scientists (2)
species of marine mammals (1)
Endangered Species Coalition (2)
Endangered Spaces (2)
List of Endangered Species (5)
marine mammals (97)
birds (114)
Endangered Mammals (13)
threatened (78)
small mammals (13)
large mammals (12)
Endangered Species (440)
endangered (491)
mammals (600)
terrestrial mammals (2)
endangered marine species (2)
Species Management (2)
marine species (4)
listing of species (1)
protected species (2)
native species (1)
Candidate species (2)
100 species (1)
new species (1)
Outline
Introduction
Description of framework for creating hierarchies
Examples
Methods of evaluation
Future Improvements
Evaluations
Summary Evaluation
  Tests how well the topic terms chosen predict the vocabulary
Access Evaluation
  Compare number of documents a user can find
Relevance Evaluation
  Path length to find all relevant documents
Automatic Evaluation Test Set
Use 50 standard queries
Document sets
  500 documents retrieved from TREC volumes 4 and 5 (have relevance judgments)
  200 documents retrieved from a news database
  1000 titles and snippets retrieved using Google™ Search Engine
Evaluating Hypotheses
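TREC Collection and News Documents
[Table: rows are the hypotheses "Use KL-topic model" and "Use sub-collections"; columns are the Summary, Access, and Relevance evaluations. Each cell carries a marker denoting either that the evaluation confirmed the hypothesis or that it showed no significant difference.]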
Web Document Evaluation
Results completely different
Best hierarchy uses the uniform topic model
Hierarchies do not look as good to human inspection
User Study
Include 12 to 16 users
Compare ranked list and hierarchy to ranked list alone
Users asked to find all instances that are relevant to the query
  Only have to identify one document about a particular instance
Study includes 10 queries
Future Work
Complete user study
Failure Analysis
Explore the use of topic hierarchies in other organizational tasks
  Personal collections of documents
  E-mails
Conclusions
Developed a formal framework for topic hierarchies
Created hierarchies from full text and snippets of documents
Verified intuition concerning hierarchies generated from full text
Questions?
Demo: http://www-ciir.cs.umass.edu/~lawrie/categories/google-qry/