Dynamic Faceted Search for Discovery-driven Analysis

Debabrata Dash, Jun Rao, Nimrod Megiddo, Anastasia Ailamaki, Guy Lohman (CIKM’08)
Speaker: Li, Huei-Jyun. Advisor: Dr. Koh, Jia-Ling. Date: 2008/12/18

Page 1: Dynamic Faceted Search for Discovery-driven Analysis

Dynamic Faceted Search for Discovery-driven Analysis

Debabrata Dash, Jun Rao, Nimrod Megiddo, Anastasia Ailamaki, Guy Lohman

CIKM’08

Speaker: Li, Huei-Jyun
Advisor: Dr. Koh, Jia-Ling

Date: 2008/12/18

Page 2: Dynamic Faceted Search for Discovery-driven Analysis

Outline

- Introduction
- Terminology and Problem Statement
- Measure of “Interestingness”
- Implementing Dynamic Faceted Search
- Evaluation
- Conclusion and Future Work

Page 3: Dynamic Faceted Search for Discovery-driven Analysis

Introduction

- Today’s faceted search systems are designed for browsing catalog data and are not directly suitable for discovery-driven exploration.
  - To preserve browsing consistency, the facets selected for navigation tend to be “static”.
  - When browsing online catalogs, the navigational facets are single-dimensional only.

Page 4: Dynamic Faceted Search for Discovery-driven Analysis

Introduction

- The paper proposes a dynamic faceted search system for the kind of discovery-driven analysis that is often performed in On-Line Analytical Processing (OLAP) systems.
- From a potentially large search result, the system automatically and dynamically discovers a small set of facets and values that are deemed most “interesting” to a user.

Page 5: Dynamic Faceted Search for Discovery-driven Analysis

Terminology and Problem Statement

- Defn 1. A repository D is a collection of documents, each of which is composed of some free text and one or more <facet: value> pairs.
  - Given a value f in facet F, we call <F: f> an instance of F.
  - All unique values associated with a facet F form the domain of F.

Page 6: Dynamic Faceted Search for Discovery-driven Analysis

Terminology and Problem Statement

- Defn 2. The domains of these facets are organized into a facet hierarchy.
  - Each node in the hierarchy stores a <facet: value> pair.
  - A node <F1: f1> is the parent of another node <F2: f2> if, for each document, F2 = f2 implies F1 = f1.
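Defn 1 and Defn 2 can be made concrete with a minimal Python sketch. It assumes each document carries at most one value per facet. The sample documents, the facet names, and the helpers `domain` and `is_parent` are hypothetical illustrations, not taken from the paper.

```python
# Hypothetical repository: each document has free text plus <facet: value> pairs.
repository = [
    {"text": "query optimization", "facets": {"country": "USA", "city": "Seattle"}},
    {"text": "data mining", "facets": {"country": "USA", "city": "Seattle"}},
    {"text": "stream processing", "facets": {"country": "Canada", "city": "Vancouver"}},
]


def domain(repo, facet):
    """All unique values associated with a facet (Defn 1)."""
    return {doc["facets"][facet] for doc in repo if facet in doc["facets"]}


def is_parent(repo, parent, child):
    """True if, for each document, F2 = f2 implies F1 = f1 (Defn 2).

    parent and child are (facet, value) pairs. The check is vacuously
    true when no document contains the child instance.
    """
    (f1, v1), (f2, v2) = parent, child
    return all(
        doc["facets"].get(f1) == v1
        for doc in repo
        if doc["facets"].get(f2) == v2
    )
```

With this data, `<country: USA>` qualifies as a parent of `<city: Seattle>` but not of `<city: Vancouver>`.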

Page 7: Dynamic Faceted Search for Discovery-driven Analysis

Terminology and Problem Statement

- Defn 3. Assume a query q on the repository has the form “keywords && F1 = f1 && F2 = f2 …”.
  - The result of q is denoted by Dq.
  - Dq includes the set of documents that contain the specified keywords and satisfy all constraints on the selected facets.

Page 8: Dynamic Faceted Search for Discovery-driven Analysis

Terminology and Problem Statement

- Defn 4. Given a query q, a facet summary for a facet set F1, …, Fm is defined as a list of tuples <f1, …, fm, A(f1, …, fm)> over Dq.
  - fi is an instance of facet Fi.
  - A(f1, …, fm) is an aggregate over the documents in Dq that contain all these facet instances.
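Defn 3 and Defn 4 can be sketched in Python as follows, using a document count as the aggregate A. The whitespace-based keyword matching, the sample documents, and the function names `run_query` and `facet_summary` are assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical repository of titles with venue and year facets.
repository = [
    {"text": "distributed query processing", "facets": {"venue": "VLDB", "year": 2001}},
    {"text": "distributed data mining", "facets": {"venue": "KDD", "year": 2001}},
    {"text": "mining frequent patterns", "facets": {"venue": "KDD", "year": 2000}},
    {"text": "distributed transactions", "facets": {"venue": "VLDB", "year": 2001}},
]


def run_query(repo, keywords, constraints):
    """Return Dq: documents containing all keywords and satisfying all facet constraints."""
    result = []
    for doc in repo:
        words = set(doc["text"].lower().split())
        has_keywords = all(k.lower() in words for k in keywords)
        meets_constraints = all(doc["facets"].get(f) == v for f, v in constraints.items())
        if has_keywords and meets_constraints:
            result.append(doc)
    return result


def facet_summary(dq, facet_set):
    """Return tuples (f1, ..., fm, count) over Dq, largest count first."""
    counts = Counter()
    for doc in dq:
        if all(f in doc["facets"] for f in facet_set):
            counts[tuple(doc["facets"][f] for f in facet_set)] += 1
    return [values + (n,) for values, n in counts.most_common()]
```

For example, `facet_summary(run_query(repository, ["distributed"], {"year": 2001}), ["venue"])` summarizes the three matching documents by venue.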

Page 9: Dynamic Faceted Search for Discovery-driven Analysis

Terminology and Problem Statement

- Problem Definition:
  - Given a repository of documents with n facets, a query q, and 2 integers K1 and K2,
  - select K1 facet sets and, for each, a facet summary with up to K2 tuples that are the most “interesting” to a user.

Page 10: Dynamic Faceted Search for Discovery-driven Analysis

Measure of “Interestingness”

Interestingness: How surprising an actual aggregated value is, given a certain expectation


Page 11: Dynamic Faceted Search for Discovery-driven Analysis

Measure of “Interestingness”: Setting the Expectation

- For a given set of facet values f1, …, fm from F1, …, Fm:
  - CD(f1, …, fm): the number of documents in D with all those facet values
  - Cq(f1, …, fm): the number of documents in Dq with all those facet values
  - E[Cq(f1, …, fm)]: an “expected” value for Cq(f1, …, fm)
- Three ways to set the expectation: natural, navigational, ad hoc

Page 12: Dynamic Faceted Search for Discovery-driven Analysis

Measure of “Interestingness”: Setting the Expectation

- Natural:
  - For an individual facet instance <F: f>: (uniformity assumption)
  - For an instance f1, …, fm of a facet set: (independence assumption)
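The formulas for the natural expectation are not preserved in this transcript. One plausible reading of the two named assumptions, offered here only as a sketch, is that uniformity scales the repository count CD(f) by |Dq| / |D|, and independence multiplies the per-facet fractions before scaling by |Dq|. The function names and parameters below are hypothetical.

```python
def expected_count_single(count_in_repo, repo_size, result_size):
    """Uniformity (assumed reading): the facet value is as frequent in Dq as in D."""
    return count_in_repo * result_size / repo_size


def expected_count_joint(counts_in_repo, repo_size, result_size):
    """Independence (assumed reading): multiply the per-facet fractions in D,
    then scale the joint fraction by the size of Dq."""
    fraction = 1.0
    for count in counts_in_repo:
        fraction *= count / repo_size
    return fraction * result_size
```

Under this reading, a value that appears in 100 of 1,000 documents would be expected in 5 of 50 query results. An actual count far from 5 is what the later slides treat as surprising.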

Page 13: Dynamic Faceted Search for Discovery-driven Analysis

Measure of “Interestingness”: Setting the Expectation

- Navigational:
- Ad hoc:
  - The user can tell the system to set the expectation based on an arbitrary query q of the user’s choice.
  - The count for each facet value is set proportionally, based on the distribution of the result of q.

Page 14: Dynamic Faceted Search for Discovery-driven Analysis

Measure of “Interestingness”: Measuring Degree of Interestingness

- Single facet instance:
  - It is evaluated with respect to a scenario in which its associated count is generated by random sampling.
  - The smaller the probability of observing the count under random sampling, the more interesting the facet instance.

Page 15: Dynamic Faceted Search for Discovery-driven Analysis

Measure of “Interestingness”: Measuring Degree of Interestingness

- p-value:
  - Suppose that a certain facet value occurs in r out of R documents in the repository and in q out of Q documents in the output of a certain query.
  - Also suppose [condition not preserved in the transcript].
  - The interestingness of that facet value vis-à-vis the query is the probability that a random sample of size Q contains at least q documents with that facet value.
    - This probability follows the hypergeometric distribution, which can be approximated by a normal distribution or a Poisson distribution.
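The tail probability described on this slide can be computed exactly with the standard hypergeometric formula P[X ≥ q] = Σ C(r, k) · C(R − r, Q − k) / C(R, Q), summed over k from q to min(r, Q). The following is a minimal sketch using exact integer arithmetic. The function name is hypothetical, and a real system would likely use the normal or Poisson approximation for large inputs.

```python
from math import comb


def hypergeom_p_value(r, R, q, Q):
    """Probability that a random sample of Q documents, drawn without replacement
    from R documents of which r have the facet value, contains at least q of them."""
    total = comb(R, Q)
    upper = min(r, Q)  # the sample cannot contain more hits than r or Q
    tail = sum(comb(r, k) * comb(R - r, Q - k) for k in range(q, upper + 1))
    return tail / total
```

A smaller returned value means the observed count q is less likely under random sampling, so the facet value is more interesting.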

Page 16: Dynamic Faceted Search for Discovery-driven Analysis

Measure of “Interestingness”: Measuring Degree of Interestingness

- The whole facet:
  - For each facet F, we consider the p-values of only the k most interesting values in F, replace …
  - The final measure:
    - MaxWeight: assign 1 to w1 and 0 to the rest
    - AvgWeight: assign each wi an equal weight
    - HybridWeight: average the interestingness computed by MaxWeight and AvgWeight
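The three weighting schemes can be sketched as a weighted sum over the k most interesting values of a facet. The exact final measure is not preserved in this transcript, so the sketch assumes each p-value is turned into a score of −log(p), so that smaller p-values score higher. That transform and the function name are assumptions, not the paper's stated formula.

```python
import math


def facet_interestingness(p_values, k, scheme="hybrid"):
    """Aggregate the k most interesting values of one facet into a single score."""
    # Assumed transform: score = -log(p). The floor avoids log(0).
    scores = sorted((-math.log(max(p, 1e-300)) for p in p_values), reverse=True)[:k]
    if not scores:
        return 0.0
    max_score = scores[0]                   # MaxWeight: w1 = 1, all other weights 0
    avg_score = sum(scores) / len(scores)   # AvgWeight: equal weight for each wi
    if scheme == "max":
        return max_score
    if scheme == "avg":
        return avg_score
    if scheme == "hybrid":                  # HybridWeight: mean of the two above
        return (max_score + avg_score) / 2
    raise ValueError(f"unknown scheme: {scheme}")
```

MaxWeight favors a facet with one very surprising value, AvgWeight favors a facet whose top values are all fairly surprising, and HybridWeight sits between the two.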

Page 17: Dynamic Faceted Search for Discovery-driven Analysis

Implementing Dynamic Faceted Search

- Solr indexes facets without storing them.
  - It enumerates every facet instance <F: f> from the index and intersects its posting list with Dq.
  - From the intersected set, it derives the count for facet value f.
  - It caches each posting list as a bitset:
    - If the bitset is dense: a bitmap
    - Otherwise: a hash map of document IDs
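The counting strategy described above can be illustrated with plain Python sets standing in for posting lists. This is a conceptual sketch of intersection-based facet counting, not Solr's actual code or API. The function name and the sample data are hypothetical.

```python
def facet_counts_by_intersection(posting_lists, dq):
    """For every facet instance, intersect its posting list with Dq and count the hits.

    posting_lists maps (facet, value) to the set of IDs of documents containing it.
    dq is the set of IDs of documents matching the query.
    """
    counts = {}
    for instance, doc_ids in posting_lists.items():
        hits = len(doc_ids & dq)  # one intersection per facet instance
        if hits:
            counts[instance] = hits
    return counts


# Hypothetical posting lists and query result.
posting_lists = {
    ("venue", "VLDB"): {1, 4},
    ("venue", "KDD"): {2, 3},
    ("year", 2000): {3},
}
dq = {1, 2, 4}
counts = facet_counts_by_intersection(posting_lists, dq)
```

Every facet instance costs one intersection, even when it shares no document with Dq, which is the cost the bitset tree on a later slide tries to reduce.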

Page 18: Dynamic Faceted Search for Discovery-driven Analysis

Implementing Dynamic Faceted Search

- Improving Solr:
  - Solr limitation 1: a threshold has to be chosen that decides the representation of the bitset.
  - Improvement: represent a bitset as a compressed bitmap using Word-Aligned Hybrid (WAH) code.

Page 19: Dynamic Faceted Search for Discovery-driven Analysis

Implementing Dynamic Faceted Search

- WAH
  - There are 2 types of words:
    - Literal word: a verbatim representation of 31 bits
    - Fill word: encodes the length of a run of all 0’s or all 1’s in 30 bits
  - A bitmap is first broken into groups of 31 bits and then converted into a sequence of literal and fill words.
  - Operations on bitmaps, such as intersection, can be performed on the WAH code directly, without decoding.
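A simplified encoder and decoder illustrate the two word types. The sketch assumes 32-bit words in which the top bit marks a fill word, the next bit holds the fill value, and the low 30 bits hold the run length in 31-bit groups. It covers encoding and decoding only and does not implement intersection on the compressed form. The function names are hypothetical.

```python
GROUP = 31                 # payload bits per 32-bit word
MAX_RUN = (1 << 30) - 1    # largest run length a fill word can hold


def wah_encode(bits):
    """Encode a list of 0/1 ints into a list of 32-bit WAH-style words."""
    padded = bits + [0] * (-len(bits) % GROUP)  # pad to whole 31-bit groups
    words = []
    for start in range(0, len(padded), GROUP):
        value = int("".join(map(str, padded[start:start + GROUP])), 2)
        if value in (0, (1 << GROUP) - 1):      # group is all 0's or all 1's
            fill_bit = 1 if value else 0
            last = words[-1] if words else None
            same_fill = (
                last is not None
                and (last >> 31) == 1
                and ((last >> 30) & 1) == fill_bit
                and (last & MAX_RUN) < MAX_RUN
            )
            if same_fill:
                words[-1] = last + 1            # extend the current run by one group
            else:
                words.append((1 << 31) | (fill_bit << 30) | 1)
        else:
            words.append(value)                 # literal word: top bit stays 0
    return words


def wah_decode(words, n_bits):
    """Expand WAH-style words back into the first n_bits bits."""
    bits = []
    for w in words:
        if w >> 31:                             # fill word
            bits.extend([(w >> 30) & 1] * ((w & MAX_RUN) * GROUP))
        else:                                   # literal word
            bits.extend(int(c) for c in format(w, "031b"))
    return bits[:n_bits]
```

Long runs of identical bits collapse into a single fill word, while mixed groups are stored verbatim as literal words, so no density threshold is needed.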

Page 20: Dynamic Faceted Search for Discovery-driven Analysis

Implementing Dynamic Faceted Search

- Improving Solr:
  - Solr limitation 2: it has to intersect the matching document set Dq with the bitset of every facet instance.
  - Improvement: reduce the number of intersections by building a directory structure, called a bitset tree, on top of the bitsets of a facet.

Page 21: Dynamic Faceted Search for Discovery-driven Analysis

Implementing Dynamic Faceted Search

- Building and Using a Bitset Tree
  - Starting with the leaf nodes: for each bitset b corresponding to facet instance <F: f>, we create an entry <b, null>.
  - Then divide all entries into groups of size s.
  - For each group, we generate a leaf node holding all entries in that group.
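The slide describes only the leaf level of the bitset tree. The sketch below adds upper levels under the assumption that each parent entry stores the union of the bitsets in its child node, so a whole subtree can be skipped when the union does not intersect Dq. That upper-level design, the use of Python sets as bitsets, and all names are assumptions for illustration.

```python
def build_bitset_tree(bitsets, s):
    """Build a tree over the bitsets of one facet; returns the root node's entries.

    bitsets maps a facet value to its set of document IDs.
    Each entry is (bitset, child_entries_or_None, facet_value_or_None).
    """
    if s < 2:
        raise ValueError("group size s must be at least 2")
    # Leaf entries <b, null>, one per facet instance.
    entries = [(frozenset(b), None, value) for value, b in bitsets.items()]
    while len(entries) > s:
        parents = []
        for i in range(0, len(entries), s):
            group = entries[i:i + s]                      # one node of up to s entries
            union = frozenset().union(*(e[0] for e in group))
            parents.append((union, group, None))          # assumed parent entry
        entries = parents
    return entries


def tree_facet_counts(entries, dq, stats):
    """Count facet values in Dq, skipping subtrees whose union misses Dq."""
    counts = {}
    for bitset, children, value in entries:
        stats["intersections"] += 1
        hits = bitset & dq
        if not hits:
            continue                                      # prune this entry and its subtree
        if children is None:
            counts[value] = len(hits)
        else:
            counts.update(tree_facet_counts(children, dq, stats))
    return counts


# Hypothetical facet with six values, grouped with s = 3.
bitsets = {"a": {1, 2}, "b": {2, 3}, "c": {4}, "d": {10}, "e": {11}, "f": {12}}
root = build_bitset_tree(bitsets, s=3)
stats = {"intersections": 0}
counts = tree_facet_counts(root, {2, 3}, stats)
```

In this small example the query touches only the first group, so the tree needs 5 intersections instead of the 6 a flat scan would perform. The saving grows when Dq overlaps few of the groups.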

Page 22: Dynamic Faceted Search for Discovery-driven Analysis

Evaluation: Setup

- DBLP
  - Contains about 13,000 papers published in 26 venues (e.g., SIGMOD, VLDB, TODS) in the past 30 years
  - Has 14 facets organized in 6 hierarchies, including author, venue, time (e.g., decade, year), location (e.g., country, city), number of authors per paper, and number of citations per paper
  - The title of each paper is used as the text for keyword searches
  - Used to conduct the user survey

Page 23: Dynamic Faceted Search for Discovery-driven Analysis

Evaluation: Setup

- Patent
  - Has about 1.8 million U.S. patents from the past 30 years
  - 16 facets organized into 10 hierarchies
  - Used for performance evaluation

Page 24: Dynamic Faceted Search for Discovery-driven Analysis

Evaluation: Result from a User Survey

- Performed tests on 3 keyword queries
  - 2 are provided by the authors: “distributed”, “mining”
  - Users pick the 3rd keyword
- 1 based on natural
- 2 based on navigational
- 1 used the complete repository
- 1 used the previous query

Page 25: Dynamic Faceted Search for Discovery-driven Analysis

Evaluation: Result from a User Survey

Page 26: Dynamic Faceted Search for Discovery-driven Analysis

Evaluation: Result from a User Survey

- Our dynamic approach also received some negative feedback.
- Overall, the feedback for the natural expectation is neutral.
- Different ways of aggregating the degree of interestingness: HybridWeight (7) > MaxWeight (6) > AvgWeight (2)

Page 27: Dynamic Faceted Search for Discovery-driven Analysis

Evaluation: Performance Results

- Environment:
  - Implemented in Java
  - 3GHz P4 desktop machine with 1GB memory
  - A single disk drive, running Linux
- Versions:
  1. simple: inverted index
  2. Solr
  3. compressed: improves Solr with WAH code
  4. tree: improves Solr with bitset trees
  5. compressed-tree: both WAH and bitset trees on Solr

Page 28: Dynamic Faceted Search for Discovery-driven Analysis

Evaluation: Performance Results

- Scaling with Data Size
  - Run a query that matches 25,000 docs using tree
  - Break the total time into search time and summary computation time

Page 29: Dynamic Faceted Search for Discovery-driven Analysis

Evaluation: Performance Results

Page 30: Dynamic Faceted Search for Discovery-driven Analysis

Evaluation: Performance Results

Page 31: Dynamic Faceted Search for Discovery-driven Analysis

Conclusion and Future Work

- Developed a novel dynamic faceted search system that supports OLAP-style discovery-driven analysis on a large set of structured and unstructured data
- Proposed an intuitive and effective way of measuring “interestingness”
- Proposed a novel navigational method of setting a user’s expectation

Page 32: Dynamic Faceted Search for Discovery-driven Analysis

Conclusion and Future Work

- Incorporate user feedback in facet selection
- How to extend the aggregates to functions other than count
  - Sum or average on some numerical measures
- How to support dynamic faceted search in a distributed environment