TRANSCRIPT
Addressing Diverse User Preferences in SQL-Query-Result Navigation
SIGMOD ‘07
Zhiyuan Chen (University of Maryland, Baltimore County) and Tao Li (Florida International University)
Recap…
• Exploratory queries on database systems are becoming increasingly common
• These queries return a large set of results, most of them irrelevant to the user
• Categorization (and ranking) helps users locate relevant records
• A user typically does not expend any effort in specifying his/her preferences
Motivation
• Previous work assumed that all users have the same preferences.
• Not true in most scenarios
• Ignoring user preferences leads to construction of sub-optimal navigation trees
Motivation (cont…)
• Key challenges:
1. How to summarize diverse user preferences from the behaviors of all the users in the system?
2. How to decide the subset of preferences associated with a specific user?
System Architecture
[Architecture diagram: the user's Query goes to Query Execution, which returns Results; offline, the Query History feeds Cluster Generation to produce Clusters over Data, which drive Navigation Tree Construction over the query results]
System Architecture (cont…)
• Pre-processing step:
– Analyze query history and generate a set of (non-overlapping) clusters over data
– Each cluster corresponds to one type of user preference
– Each cluster has an associated probability of users being interested in it
• Assumption: Individual preferences can be represented as a subset of these clusters
System Architecture (cont…)
• Generation of the navigation tree
– Occurs when a specific user asks a query
– Intersect the set of clusters generated in the pre-processing step with the answers of the given query
– Construct a navigation tree over the intersected clusters on the fly (a minimal sketch of the intersection step follows)
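A minimal sketch of the intersection step, assuming clusters and query answers are represented as sets of record IDs (the names here are illustrative, not the paper's implementation):

```python
def intersect_clusters(clusters, query_answer):
    """Restrict each precomputed cluster to the records the query returned.

    clusters:     dict cluster label -> set of record IDs
    query_answer: set of record IDs in the query result
    Only the non-empty intersections are kept; the navigation tree is
    then built over them on the fly.
    """
    return {label: ids & query_answer
            for label, ids in clusters.items()
            if ids & query_answer}
```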
Terminology and Definitions
• Query History (H) is of the form {(Q1, U1, F1), …, (Qk, Uk, Fk)}, in chronological order, where:
– Qi is a query
– Ui is a user session ID
– Fi is the weight associated with the query
• Each query Qi is of the form Cond(A1) ∧ Cond(A2) ∧ … ∧ Cond(An)
• Each Cond(Ai) contains only range or equality conditions
Terminology and Definitions (cont…)
• Data (D) is partitioned into a disjoint set of clusters C = {C1, C2, …, Cq}
• Each cluster Ci has an associated probability Pi
• Pi denotes the probability that users are interested in cluster Ci
Definition : Navigation Tree
• Navigation Tree T(V, E, L)
• Satisfying the following conditions:
– Each node v has a label label(v) denoting a Boolean condition
– v contains records that satisfy all conditions on its ancestors, including itself
– Conditions associated with the children of a non-leaf node v are on the same attribute (called the split attribute)
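One way to represent such a node (a Python sketch; field names are assumptions, not the authors' code):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TreeNode:
    label: str                        # Boolean condition, e.g. "price <= 100"
    records: set                      # records satisfying this node's condition
                                      # and all of its ancestors' conditions
    split_attr: Optional[str] = None  # attribute children split on (non-leaf only)
    children: list = field(default_factory=list)
```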
Clusters over Data
• Two records ri and rj are indistinguishable if they always appear in the same set of queries
• Define a binary relation R
– (ri, rj) ∈ R iff the above condition is satisfied
• R is reflexive, symmetric and transitive
=> R is an equivalence relation and partitions D into equivalence classes (clusters) {C1, …, Cq}
Clusters over Data (Example)
[Diagram: records r1–r13 drawn with the regions covered by each query]
D = {r1, …, r13}; Q1 = {r1, …, r10}; Q2 = {r1, …, r9, r11}; Q3 = {r12}
The resulting equivalence classes: {r1, …, r9} (in both Q1 and Q2), {r10} (Q1 only), {r11} (Q2 only), {r12} (Q3 only), {r13} (no query)
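The equivalence classes can be computed by grouping records on their query-membership signature; a sketch using the slide's example:

```python
from collections import defaultdict

def equivalence_clusters(records, queries):
    """Group records that appear in exactly the same set of queries:
    the equivalence classes of the relation R above."""
    by_signature = defaultdict(set)
    for r in records:
        signature = frozenset(q for q, ans in queries.items() if r in ans)
        by_signature[signature].add(r)
    return list(by_signature.values())

D = {f"r{i}" for i in range(1, 14)}
queries = {
    "Q1": {f"r{i}" for i in range(1, 11)},            # r1..r10
    "Q2": {f"r{i}" for i in range(1, 10)} | {"r11"},  # r1..r9, r11
    "Q3": {"r12"},
}
print(equivalence_clusters(D, queries))
# five clusters: {r1..r9}, {r10}, {r11}, {r12}, {r13}
```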
Clusters over Data (Heuristics)
• Problem: Too many clusters!
• Apply heuristics to decrease the number of clusters:
– Prune unimportant queries
• Remove queries with empty answers
• Retain only the most specific query in a given session
– Merge similar queries
Clusters over Data (Merge Similar Queries)
• Algorithm:
1. Compute the result DQi for each query Qi
2. Compute the cluster CQi for each query Qi
3. Repeat until no more merging is possible:
1. Compute the distance between each pair of query clusters
2. Merge any two clusters CQi and CQj whose distance is less than the threshold B
Distance: d(Qi, Qj) = 1 − |DQi ∩ DQj| / |DQi ∪ DQj| (the Jaccard distance between the two result sets)
Merge Similar Queries (Example)
• Let B = 0.2
• d(CQ1, CQ2) = 1 − 9/11 ≈ 0.18; d(CQ1, CQ3) = 1; d(CQ2, CQ3) = 1
• Merge CQ1 and CQ2, since 0.18 < B
• Results in 2 query clusters: CQ1 = {r1, …, r11}, CQ2 = {r12}
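The distance and the merge test on the example, as a small sketch:

```python
def query_distance(dq_i, dq_j):
    """Jaccard distance between two query result sets."""
    union = dq_i | dq_j
    if not union:
        return 1.0
    return 1 - len(dq_i & dq_j) / len(union)

# The slide's numbers, with B = 0.2:
DQ1 = {f"r{i}" for i in range(1, 11)}            # result of Q1: r1..r10
DQ2 = {f"r{i}" for i in range(1, 10)} | {"r11"}  # result of Q2: r1..r9, r11
print(round(query_distance(DQ1, DQ2), 2))        # 0.18 < 0.2, so merge
```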
Merge Similar Queries (Complexity Results)
• O(|H||D| + |H|^3 · td)
– td is the time to compute the distance between two query results
• Can be improved by:
– Sampling
– Pre-computation of distances: O(|H||D| + |H|^2 · td + |H|^2 log|H|)
– Min-wise hashing: O(|H||D| + |H|^2 · k + |H|^2 log|H|)
• k is the hash signature size
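A sketch of the min-wise hashing idea, assuming integer record IDs (the hash family and parameters here are illustrative, not the paper's):

```python
import random

def minhash_signature(record_ids, k=64, seed=42):
    """k min-wise hash values of a set of integer record IDs; the
    fraction of equal positions in two signatures estimates the
    Jaccard similarity of the underlying sets."""
    rng = random.Random(seed)
    p = 2_147_483_647                  # large prime modulus
    params = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(k)]
    return [min((a * x + b) % p for x in record_ids) for a, b in params]

def estimated_distance(sig_i, sig_j):
    """1 - estimated Jaccard similarity; O(k) per pair instead of
    touching the full result sets."""
    matches = sum(u == v for u, v in zip(sig_i, sig_j))
    return 1 - matches / len(sig_i)

# Example: signatures of DQ1 = {1..10} and DQ2 = {1..9, 11}
sig1 = minhash_signature(range(1, 11))
sig2 = minhash_signature(set(range(1, 10)) | {11})
print(estimated_distance(sig1, sig2))  # close to the true 2/11 for large k
```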
Generate Clusters
• QC1, …, QCk are the query clusters generated after query pruning and merging
• For each record ri:
– Generate the set Si of query clusters such that some query in the cluster returns ri
– Group the records by Si; each group forms a cluster Ci and is assigned a class label
– Compute Pi:
• Sum of the frequencies of the queries in Si, divided by the sum of the frequencies of all queries in H (the history)
Example (continuing the running example): P1 = 2/3, P2 = 1/3, P3 = 0
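A sketch of the probability computation on the running example (cluster and query-cluster names assumed as above):

```python
def cluster_probabilities(data_clusters, query_clusters, frequencies):
    """P_i for each data cluster: total frequency of the queries in its
    associated query clusters S_i, over the total frequency in H.

    data_clusters:  dict cluster label -> set of query-cluster labels (S_i)
    query_clusters: dict query-cluster label -> set of query names
    frequencies:    dict query name -> weight F
    """
    total = sum(frequencies.values())
    probs = {}
    for label, s_i in data_clusters.items():
        queries = set().union(*(query_clusters[q] for q in s_i))
        probs[label] = sum(frequencies[q] for q in queries) / total
    return probs

# Running example: QC1 = {Q1, Q2}, QC2 = {Q3}, every query has weight 1.
qcs = {"QC1": {"Q1", "Q2"}, "QC2": {"Q3"}}
dcs = {"C1": {"QC1"}, "C2": {"QC2"}, "C3": set()}
print(cluster_probabilities(dcs, qcs, {"Q1": 1, "Q2": 1, "Q3": 1}))
# {'C1': 0.667, 'C2': 0.333, 'C3': 0.0} (up to float formatting)
```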
Navigation Tree Construction
• Given D, Q and C, find a tree T(V, E, L) such that:
– T contains all records of Q
– There does not exist a T' with NCost(T', C) < NCost(T, C)

NCost(T, C) = Σi Pi · Cost(T, Ci), where Cost(T, Ci) is the number of category labels and records a user interested in cluster Ci examines while navigating T to the records of Ci (i.e., the expected navigation cost)
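One way to operationalize this cost model (a sketch using the TreeNode structure from the definitions above; the paper's exact bookkeeping may differ):

```python
def ncost(root, clusters, probs):
    """Expected navigation cost: with probability P_i a user wants
    cluster C_i; at every internal node they expand, they read all
    child labels, and at every leaf they reach, they read its records."""
    def visit(node, target):
        if not node.children:                 # leaf: read all of its records
            return len(node.records)
        cost = len(node.children)             # read every child's label
        for child in node.children:
            if child.records & target:        # expand only relevant children
                cost += visit(child, target)
        return cost
    return sum(p * visit(root, clusters[i]) for i, p in probs.items())
```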
Navigation Tree Construction
• The optimal-tree construction problem is NP-hard
• Observation: The navigation tree is very similar to a decision tree
• So, any decision-tree construction algorithm can be used…
• Decision-tree algorithms compute information gain to measure how well an attribute classifies the data
• Here, the criterion is to minimize navigation cost instead
Navigation Tree Construction (Decision Tree Construction)
• Precondition: Each record has a class label assigned in the clustering step
• Algorithm (a runnable sketch follows the list):
1. Create a root R and assign all records to it
2. If all the records have the same class label, stop
3. Select the attribute that maximizes the reduction in (global) navigation cost (the analogue of information gain) and expand the tree to the next level
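A self-contained sketch of the greedy construction; it uses plain class entropy as the split measure, a stand-in for the paper's navigation-cost criterion developed on the next slides (the record format, a dict with a "class" key, is an assumption):

```python
import math

def entropy(records):
    """Shannon entropy of the class-label distribution; a stand-in for
    the paper's probability-weighted measure."""
    n = len(records)
    counts = {}
    for r in records:
        counts[r["class"]] = counts.get(r["class"], 0) + 1
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def best_split(records, attributes):
    """Try every (attribute, value) binary split; return the one with the
    largest entropy reduction, or None if no split helps."""
    best_gain, best = 0.0, None
    n = len(records)
    for attr in attributes:
        for v in {r[attr] for r in records}:
            left = [r for r in records if r[attr] <= v]
            right = [r for r in records if r[attr] > v]
            if not left or not right:
                continue
            gain = (entropy(records)
                    - len(left) / n * entropy(left)
                    - len(right) / n * entropy(right))
            if gain > best_gain:
                best_gain, best = gain, (attr, v, left, right)
    return best

def build_tree(records, attributes):
    """Greedy top-down construction; a node becomes a leaf when it is
    pure or no split reduces the measure."""
    if len({r["class"] for r in records}) <= 1:
        return {"records": records}
    split = best_split(records, attributes)
    if split is None:
        return {"records": records}
    attr, v, left, right = split
    return {"split": (attr, v),
            "children": [build_tree(left, attributes),
                         build_tree(right, attributes)]}
```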
Navigation Tree Construction (Splitting Criteria)
• Navigation cost includes:
– Cost of visiting leaf nodes (the results)
– Cost of visiting intermediate nodes (category labels)
Splitting Criteria (Example)
[Diagram: two candidate binary splits of the same eight records (4 of class C1, 4 of class C2)]
– Split on A1 (A1 <= v1 vs. A1 > v1): children (C1, C2, C1, C2) and (C1, C2, C1, C2)
– Split on A2 (A2 <= v2 vs. A2 > v2): children (C1, C2, C2, C2) and (C1, C1, C1, C2)
P(C1) = P(C2) = 0.5; Navigation Cost = 2 + 4 + 2 + 4
Which split is better?
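Working the example with an entropy-based gain (with P(C1) = P(C2) = 0.5 the classes weigh equally, so plain class entropy is a reasonable proxy for the paper's measure):

```python
import math

def entropy(labels):
    n = len(labels)
    return -sum(labels.count(c) / n * math.log2(labels.count(c) / n)
                for c in set(labels))

def igain(parent, children):
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

root = list("11221122")                           # 4 records of C1, 4 of C2
print(igain(root, [list("1212"), list("1212")]))  # A1 split: gain 0.0
print(igain(root, [list("1222"), list("1112")]))  # A2 split: gain ~0.19
# A2 is the better split: its children are purer, so a user interested
# in one cluster reads fewer irrelevant records.
```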
Splitting Criteria: Cost of Visiting Leaf Nodes
• Let t be the node to be split
• Let N(t) be the number of records in t
• Let Pi be the probability that users are interested in cluster Ci
• The gain (reduction in navigation cost) when t is split into t1 and t2: intuitively, for each cluster Ci whose records fall entirely within one child tj, a user interested in Ci no longer reads the other child's records, saving Pi · (N(t) − N(tj))
Splitting Criteria: Cost of Visiting Intermediate Nodes
• Observation:
– Given a perfect tree T with N records and k classes, where each class Ci in T has Ni records, the quantity
Σi (Ni / N) · log(N / Ni)
approximates the average length of the root-to-leaf paths over all records in T
Splitting Criteria: Cost of Visiting Intermediate Nodes (cont…)
[Diagram: a perfect tree rooted at t; the records of each class Ci end up in their own pure subtree, so a record of class Ci sits at depth roughly log N − log Ni]
Splitting Criteria: Combining the Two Costs
• The gain when a node t is split into t1 and t2 combines the leaf-node savings above with the reduction in intermediate-node cost
• Information gain due to a split:
IGain(t, t1, t2) = E(t) − (N1/N)·E(t1) − (N2/N)·E(t2)
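The formula transcribed directly (E may be the plain entropy used in the sketches above, or a probability-weighted measure that also accounts for leaf-visit cost):

```python
def information_gain(E, t, t1, t2):
    """IGain(t, t1, t2) = E(t) - (N1/N)*E(t1) - (N2/N)*E(t2)
    for any per-node impurity measure E over a list of records."""
    n, n1, n2 = len(t), len(t1), len(t2)
    return E(t) - (n1 / n) * E(t1) - (n2 / n) * E(t2)
```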