Dynamic Faceted Search for Discovery-driven Analysis

Debabrata Dash, Jun Rao, Nimrod Megiddo, Anastasia Ailamaki, Guy Lohman

CIKM’08

Speaker: Li, Huei-Jyun
Advisor: Dr. Koh, Jia-Ling

Date: 2008/12/18


Outline

  Introduction
  Terminology and Problem Statement
  Measure of “Interestingness”
  Implementing Dynamic Faceted Search
  Evaluation
  Conclusion and Future Work


Introduction

Today’s faceted search systems are designed for browsing catalog data and are not directly suitable for discovery-driven exploration
  To preserve browsing consistency, facets selected for navigation tend to be “static”
  When browsing online catalogs, the navigational facets are single-dimensional only


Introduction

Propose a dynamic faceted search system for the kind of discovery-driven analysis that is often performed in On-Line Analytical Processing (OLAP) systems

From a potentially large search result, the system automatically and dynamically discovers a small set of facets and values that are deemed most “interesting” to a user


Terminology and Problem Statement

Defn 1. A repository D is a collection of documents
  Each document is composed of some free text and one or more <facet: value> pairs
  Given a value f in facet F, we call <F: f> an instance of F
  All unique values associated with a facet F form the domain of F



Defn 2. Organize the domain of these facets into a facet hierarchy
  Each node in the hierarchy stores a <facet: value> pair
  A node <F1: f1> is the parent of another node <F2: f2> if for each document, F2 = f2 implies F1 = f1
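The parent condition in Defn 2 can be checked directly against the documents. The sketch below is a minimal illustration, assuming each document is a plain dict that maps a facet name to a single value; `is_parent` and the sample records are hypothetical and do not come from the paper.

```python
def is_parent(docs, parent, child):
    """True if, in every document, child facet = value implies parent facet = value.

    Note: if no document carries the child instance, the condition holds vacuously.
    """
    (pf, pv), (cf, cv) = parent, child
    return all(doc.get(pf) == pv for doc in docs if doc.get(cf) == cv)


# Hypothetical documents: each maps a facet name to one value.
docs = [
    {"decade": "1990s", "year": 1998},
    {"decade": "1990s", "year": 1999},
    {"decade": "2000s", "year": 2004},
]

print(is_parent(docs, ("decade", "1990s"), ("year", 1998)))
print(is_parent(docs, ("decade", "2000s"), ("year", 1998)))
```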



Defn 3. Assume a query q on the repository has the form “keywords && F1 = f1 && F2 = f2…”
  The result of q is denoted by Dq
  Dq is the set of documents that contain the specified keywords and satisfy all constraints on the selected facets



Defn 4. Given a query q, define a facet summary for a facet set F1, …, Fm as a list of tuples <f1, …, fm, A(f1, …, fm)> over Dq
  fi is an instance of facet Fi
  A(f1, …, fm) is an aggregate of documents in Dq that contain all these facet instances
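Defn 3 and Defn 4 together can be illustrated with a small in-memory example. This is only a sketch under simplifying assumptions: each document holds one value per facet, keyword matching is a whitespace-token lookup, and the aggregate A is a plain count. The names `run_query` and `facet_summary` and the sample documents are hypothetical.

```python
from collections import Counter


def run_query(docs, keywords, constraints):
    """Return D_q: documents containing every keyword and satisfying every facet constraint."""
    result = []
    for doc in docs:
        words = set(doc["text"].lower().split())
        if all(k.lower() in words for k in keywords) and all(
            doc["facets"].get(f) == v for f, v in constraints.items()
        ):
            result.append(doc)
    return result


def facet_summary(dq, facet_set):
    """List of tuples (f1, ..., fm, count) over D_q, with count as the aggregate."""
    counts = Counter()
    for doc in dq:
        if all(f in doc["facets"] for f in facet_set):
            counts[tuple(doc["facets"][f] for f in facet_set)] += 1
    return [(*values, n) for values, n in counts.items()]


# Hypothetical repository.
docs = [
    {"text": "Mining frequent patterns", "facets": {"venue": "SIGMOD", "decade": "2000s"}},
    {"text": "Distributed query processing", "facets": {"venue": "VLDB", "decade": "1990s"}},
    {"text": "Mining data streams", "facets": {"venue": "VLDB", "decade": "2000s"}},
    {"text": "Graph mining at scale", "facets": {"venue": "VLDB", "decade": "2000s"}},
]

dq = run_query(docs, ["mining"], {"decade": "2000s"})
summary = facet_summary(dq, ["venue", "decade"])
print(summary)
```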



Problem Definition:
  Given a repository of documents with n facets, a query q, and 2 integers K1 and K2
  Select K1 facet sets, and for each a facet summary with up to K2 tuples, that are the most “interesting” to a user


Measure of “Interestingness”

Interestingness: How surprising an actual aggregated value is, given a certain expectation


Measure of “Interestingness”: Setting the Expectation

For a given set of facet values f1, …, fm from F1, …, Fm:
  CD(f1, …, fm): the count of the number of documents with all those facet values in D
  Cq(f1, …, fm): the count of the number of documents with all those facet values in Dq
  E[Cq(f1, …, fm)]: an “expected” value for Cq(f1, …, fm)
    Natural, navigational, ad hoc



Natural:
  For an individual facet instance <F: f>: (uniformity assumption)
  For an instance f1, …, fm of a facet set: (independence assumption)
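One common reading of the two assumptions is sketched below. It is offered as an assumption, not as the paper's exact formulas: under uniformity, a facet value is taken to keep the same share in Dq that it has in D; under independence, the joint share in Dq is taken to be the product of the single-facet shares in Dq. The function names are hypothetical.

```python
from math import prod


def expected_single(c_d, size_d, size_dq):
    """Uniformity (assumed form): the value's share in D_q equals its share in D."""
    return c_d * size_dq / size_d


def expected_joint(single_counts_in_dq, size_dq):
    """Independence (assumed form): joint share is the product of single-facet shares in D_q."""
    return size_dq * prod(c / size_dq for c in single_counts_in_dq)


# A value found in 200 of 10,000 documents, with a query result of 500 documents.
print(expected_single(c_d=200, size_d=10_000, size_dq=500))

# Two facet values that occur in 100 and 50 of the 500 result documents.
print(expected_joint([100, 50], size_dq=500))
```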



Navigational:

Ad hoc:
  The user can tell the system to set the expectation based on an arbitrary query q of the user’s choice
  Set the count for each facet value proportionally, based on the distribution of the result of q
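The proportional scaling for the ad hoc expectation might look like the following. This is a sketch under the assumption that "proportionally" means rescaling each value's count in the reference result to the size of the current result; `adhoc_expectation` and the sample counts are hypothetical.

```python
def adhoc_expectation(reference_counts, size_reference, size_dq):
    """Scale each value's count in the reference query result to the size of D_q."""
    return {
        value: count * size_dq / size_reference
        for value, count in reference_counts.items()
    }


# Hypothetical reference result of 100 documents; the current result has 20.
expected = adhoc_expectation({"SIGMOD": 30, "VLDB": 70}, size_reference=100, size_dq=20)
print(expected)
```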

Measure of “Interestingness”: Measuring Degree of Interestingness

Single facet instance:
  Evaluate it with respect to a scenario in which its associated count is generated by random sampling
  The smaller the probability of observing the count under random sampling, the more interesting the facet instance



p-value:
  Suppose that a certain facet value occurs in r out of R documents in the repository and in q out of Q documents in the output of a certain query
  Also suppose
  The interestingness of that facet value vis-à-vis the query: the probability that in a random sample of size Q there will be at least q documents with that facet value
  Hypergeometric distribution; can be approximated by the normal distribution or Poisson distribution
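The exact tail probability described above can be computed from the hypergeometric distribution. The sketch below uses only the standard library and the slide's symbols r, R, q, Q; the function name is hypothetical, and the exact sum is practical only for moderate sizes, which is where the normal or Poisson approximation would come in.

```python
from math import comb


def hypergeom_p_value(r, R, q, Q):
    """P(X >= q), where X counts documents with the facet value in a random
    sample of Q documents drawn without replacement from R documents,
    r of which carry the value."""
    upper = min(r, Q)
    total = comb(R, Q)
    tail = sum(comb(r, k) * comb(R - r, Q - k) for k in range(q, upper + 1))
    return tail / total


# A value present in 5 of 20 documents shows up in 3 of the 5 query results.
p = hypergeom_p_value(r=5, R=20, q=3, Q=5)
print(p)
```

A smaller p-value means the observed count is less likely under random sampling, so the facet value counts as more interesting.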


The whole facet:
  For each facet F, we consider the p-values of only the k most interesting values in F, replace

The final measure:
  MaxWeight: assign 1 to w1 and 0 to the rest
  AvgWeight: assign each wi an equal weight
  HybridWeight: average the interestingness computed by MaxWeight and AvgWeight
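The three weighting schemes can be sketched as follows. The sketch assumes each facet value already has a numeric score where higher means more interesting; how a p-value is turned into such a score is not specified here, so the scores are taken as given inputs. All function names are hypothetical.

```python
def max_weight(scores):
    """Weight 1 on the most interesting value, 0 on the rest."""
    return max(scores)


def avg_weight(scores):
    """Equal weight on every value."""
    return sum(scores) / len(scores)


def hybrid_weight(scores):
    """Average of the MaxWeight and AvgWeight results."""
    return (max_weight(scores) + avg_weight(scores)) / 2


def facet_score(value_scores, k, combine):
    """Score a facet from its k most interesting values (assumed: higher is better)."""
    top_k = sorted(value_scores, reverse=True)[:k]
    return combine(top_k)


scores = [8.0, 2.0, 2.0, 1.0]  # hypothetical per-value scores
print(facet_score(scores, 3, max_weight))
print(facet_score(scores, 3, avg_weight))
print(facet_score(scores, 3, hybrid_weight))
```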

Implementing Dynamic Faceted Search

Solr: indexes facets without storing them
  Enumerates every facet instance <F: f> from the index and intersects its posting list with Dq
  From the intersected set, it derives the count on facet value f
  Caches each posting list as a bitset
    If the bitset is dense: a bitmap
    Otherwise: a hash map of document IDs
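The counting strategy described above amounts to one intersection per facet instance. The sketch below illustrates the idea with plain Python sets standing in for the cached bitsets; it does not use or reproduce Solr's API, and `facet_counts` and the sample postings are hypothetical.

```python
def facet_counts(posting_lists, dq):
    """posting_lists maps (facet, value) to a set of document IDs;
    dq is the set of IDs matching the query. One intersection per instance."""
    counts = {}
    for instance, postings in posting_lists.items():
        n = len(postings & dq)
        if n:
            counts[instance] = n
    return counts


# Hypothetical posting lists for a "venue" facet.
posting_lists = {
    ("venue", "SIGMOD"): {1, 2, 5},
    ("venue", "VLDB"): {3, 4},
    ("venue", "TODS"): {6},
}
counts = facet_counts(posting_lists, dq={2, 3, 4, 5})
print(counts)
```

Every facet instance costs one intersection even when it shares no documents with Dq, which is the cost the bitset tree later tries to reduce.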



Improving Solr:
  Solr limitation 1: has to choose a threshold that decides the representation of the bitset
  Improvement: represent a bitset as a compressed bitmap using Word-Aligned Hybrid (WAH) code



WAH
  There are 2 types of words:
    Literal words: a verbatim representation of 31 bits
    Fill words: encode, in 30 bits, the length of a run of all 0’s or all 1’s
  A bitmap is first broken into groups of 31 bits and then converted into a sequence of literal and fill words
  Operations on bitmaps such as intersection can be performed on WAH code directly without decoding
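The encoding steps above can be sketched with 32-bit words. The bit layout used here is an assumption consistent with the description: the top bit distinguishes fill from literal words, the next bit of a fill word gives the fill value, and the low 30 bits count the run length in 31-bit groups. The sketch covers only encoding and decoding, not intersection on the compressed form, and the function names are hypothetical.

```python
GROUP = 31                    # payload bits per literal word
FILL_FLAG = 1 << 31           # top bit set marks a fill word (assumed layout)
FILL_ONE = 1 << 30            # second bit gives the fill value
MAX_RUN = (1 << 30) - 1       # run length, in groups, fits in the low 30 bits
ALL_ONES = (1 << GROUP) - 1


def wah_encode(bits):
    """Encode a list of 0/1 values into a list of 32-bit WAH words."""
    padded = bits + [0] * (-len(bits) % GROUP)  # pad the last group with zeros
    words = []
    for i in range(0, len(padded), GROUP):
        group = int("".join(map(str, padded[i:i + GROUP])), 2)
        if group in (0, ALL_ONES):
            fill = FILL_FLAG | (FILL_ONE if group else 0)
            last = words[-1] if words else 0
            # Extend the previous fill word if it has the same value and room left.
            if (last & ~MAX_RUN) == fill and (last & MAX_RUN) < MAX_RUN:
                words[-1] = last + 1
            else:
                words.append(fill | 1)
        else:
            words.append(group)  # literal word: top bit stays 0
    return words


def wah_decode(words, length):
    """Expand WAH words back into the first `length` bits."""
    bits = []
    for w in words:
        if w & FILL_FLAG:
            value = 1 if w & FILL_ONE else 0
            bits.extend([value] * (GROUP * (w & MAX_RUN)))
        else:
            bits.extend(int(c) for c in format(w, "031b"))
    return bits[:length]


bits = [0] * 93 + [1, 0, 1]   # three all-zero groups, then one mixed group
words = wah_encode(bits)
print([hex(w) for w in words])
```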



Improving Solr:
  Solr limitation 2: it has to intersect the matching document set Dq with the bitset of every facet instance
  Improvement: reduce the number of intersections by building a directory structure called a bitset tree on top of the bitsets of a facet



Building and Using a Bitset Tree
  Starting with the leaf nodes, for each bitset b corresponding to facet instance <F: f>, we create an entry <b, null>
  Then divide all entries into groups of size s
  For each group, we generate a leaf node holding all entries in that group
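A possible shape for the bitset tree is sketched below. The leaf construction follows the steps above; the upper levels and the lookup are assumptions made for illustration: each parent entry is taken to hold the union of the bitsets in its child node, so a subtree can be skipped when that union does not intersect Dq. Python integers stand in for bitsets, and all names are hypothetical.

```python
def build_bitset_tree(bitsets, s):
    """bitsets maps a facet value to an int bitmask of document IDs.
    Returns the root node, a list of entries (bitset, child, value)."""
    if s < 2:
        raise ValueError("group size s must be at least 2")
    # Leaf entries <b, null>, tagged with the facet value they belong to.
    entries = [(b, None, value) for value, b in bitsets.items()]
    while True:
        nodes = [entries[i:i + s] for i in range(0, len(entries), s)]
        if len(nodes) <= 1:
            return nodes[0] if nodes else []
        # Assumed upper levels: one entry per node, holding the union of its bitsets.
        entries = []
        for node in nodes:
            union = 0
            for b, _, _ in node:
                union |= b
            entries.append((union, node, None))


def count_facets(node, dq, counts, stats):
    """Collect counts per facet value, skipping subtrees that miss D_q entirely."""
    for b, child, value in node:
        stats["intersections"] += 1
        hit = b & dq
        if not hit:
            continue  # nothing below this entry can match
        if child is None:
            counts[value] = bin(hit).count("1")
        else:
            count_facets(child, dq, counts, stats)


bitsets = {"a": 0b000001, "b": 0b000010, "c": 0b000100,
           "d": 0b001000, "e": 0b010000, "f": 0b100000}
root = build_bitset_tree(bitsets, s=3)
counts, stats = {}, {"intersections": 0}
count_facets(root, dq=0b000011, counts=counts, stats=stats)
print(counts, stats)
```

In this small example the lookup performs 5 intersections instead of the 6 a flat scan over all facet values would need, because the second leaf is skipped after one test against its union.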


Evaluation: Setup

DBLP
  Contains about 13,000 papers published in 26 venues (e.g., SIGMOD, VLDB, TODS) in the past 30 years
  It has 14 facets organized in 6 hierarchies, including author, venue, time (e.g., decade, year), location (e.g., country, city), number of authors per paper, number of citations per paper
  Uses the title of each paper as text for keyword searches
  Used to conduct the user survey

Evaluation: Setup

Patent
  Has about 1.8 million U.S. patents from the past 30 years
  16 facets organized into 10 hierarchies
  Used for performance evaluation


Evaluation: Result from a User Survey

Performed tests on 3 keyword queries
  2 are provided by the authors: “distributed”, “mining”
  Users pick the 3rd keyword
1 based on natural, 2 based on navigational
  1 used the complete repository
  1 used the previous query


Our dynamic approach also received some negative feedback

Overall, the feedback for the natural expectation is neutral

Different ways of aggregating the degree of interestingness:
  HybridWeight(7) > MaxWeight(6) > AvgWeight(2)


Evaluation: Performance Results

Environment:
  Implemented in Java
  3GHz P4 desktop machine with 1GB memory
  A single disk drive, running Linux

Versions:
  1. simple: inverted index
  2. Solr
  3. compressed: improves Solr with WAH code
  4. tree: improves Solr with bitset trees
  5. compressed-tree: both WAH and bitset trees on Solr



Scaling with Data Size
  Run a query that matches 25,000 docs using tree
  Break the total time into search time and summary computation time



Conclusion and Future Work

Develop a novel dynamic faceted search system that supports OLAP-style discovery-driven analysis on a large set of structured and unstructured data

Propose an intuitive and effective way of measuring “interestingness”

Propose a novel navigational method of setting a user’s expectation


Conclusion and Future Work

Incorporate user feedback in facet selection
How to extend the aggregates to functions other than count
  Sum, average on some numerical measures
How to support dynamic faceted search in a distributed environment
