TRANSCRIPT
Addressing Diverse User Preferences in SQL-Query-Result Navigation
SIGMOD ‘07
Zhiyuan Chen (University of Maryland, Baltimore County) and Tao Li (Florida International University)
Recap…
• Exploratory queries on database systems are becoming increasingly common
• These queries return a large set of results, most of them irrelevant to the user
• Categorization (and ranking) helps users locate relevant records
• A user typically does not expend any effort in specifying his/her preferences
Motivation
• Previous work assumed that all users have the same preferences.
• Not true in most scenarios
• Ignoring user preferences leads to construction of sub-optimal navigation trees
Motivation (cont…)
• Key challenges:
1. How to summarize diverse user preferences from the behaviors of all the users in the system?
2. How to decide the subset of preferences associated with a specific user?
System Architecture
[Architecture diagram: the user's Query goes to Query Execution, which returns Results; offline, the Query History feeds Cluster Generation to produce Clusters over Data, which drive Navigation Tree Construction over the query results]
System Architecture (cont…)
• Pre-processing step:
– Analyze query history and generate a set of (non-overlapping) clusters over data
– Each cluster corresponds to one type of user preference
– Each cluster has an associated probability of users being interested in it
• Assumption: Individual preferences can be represented as a subset of these clusters
System Architecture (cont…)
• Generation of the navigation tree
– Occurs when a specific user asks a query
– Intersect the set of clusters generated in the pre-processing step with the answers of the given query
– Construct a navigation tree over the intersected clusters on the fly (a minimal sketch of the intersection step follows)
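A minimal sketch of the intersection step, assuming clusters and query answers are represented as sets of record IDs (the names here are illustrative, not the paper's implementation):

```python
def intersect_clusters(clusters, query_answer):
    """Restrict each precomputed cluster to the records the query returned.

    clusters:     dict cluster label -> set of record IDs
    query_answer: set of record IDs in the query result
    Only the non-empty intersections are kept; the navigation tree is
    then built over them on the fly.
    """
    return {label: ids & query_answer
            for label, ids in clusters.items()
            if ids & query_answer}
```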
Terminology and Definitions
• Query History (H) is of the form {(Q1, U1, F1), …, (Qk, Uk, Fk)}, in chronological order, where:
– Qi is a query
– Ui is a user session ID
– Fi is the weight associated with the query
• Each query Qi is of the form Cond(A1) ∧ Cond(A2) ∧ … ∧ Cond(An)
• Each Cond(Ai) contains only range or equality conditions
Terminology and Definitions (cont…)
• Data (D) is partitioned into a disjoint set of clusters C = {C1, C2, …, Cq}
• Each cluster Ci has an associated probability Pi
• Pi denotes the probability that users are interested in cluster Ci
Definition : Navigation Tree
• Navigation Tree T(V, E, L)
• Satisfying the following conditions:
– Each node v has a label label(v) denoting a Boolean condition
– v contains records that satisfy all conditions on its ancestors, including itself
– Conditions associated with the children of a non-leaf node v are on the same attribute (called the split attribute)
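One way to represent such a node (a Python sketch; field names are assumptions, not the authors' code):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TreeNode:
    label: str                        # Boolean condition, e.g. "price <= 100"
    records: set                      # records satisfying this node's condition
                                      # and all of its ancestors' conditions
    split_attr: Optional[str] = None  # attribute children split on (non-leaf only)
    children: list = field(default_factory=list)
```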
Clusters over Data
• Two records ri and rj are indistinguishable if they always appear in the same set of queries
• Define a binary relation R
– (ri, rj) ∈ R iff the above condition is satisfied
• R is reflexive, symmetric and transitive
=> R is an equivalence relation and partitions D into equivalence classes (clusters) {C1, …, Cq}
Clusters over Data (Example)
[Diagram: records r1–r13 drawn with the regions covered by each query]
D = {r1, …, r13}; Q1 = {r1, …, r10}; Q2 = {r1, …, r9, r11}; Q3 = {r12}
The resulting equivalence classes: {r1, …, r9} (in both Q1 and Q2), {r10} (Q1 only), {r11} (Q2 only), {r12} (Q3 only), {r13} (no query)
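The equivalence classes can be computed by grouping records on their query-membership signature; a sketch using the slide's example:

```python
from collections import defaultdict

def equivalence_clusters(records, queries):
    """Group records that appear in exactly the same set of queries:
    the equivalence classes of the relation R above."""
    by_signature = defaultdict(set)
    for r in records:
        signature = frozenset(q for q, ans in queries.items() if r in ans)
        by_signature[signature].add(r)
    return list(by_signature.values())

D = {f"r{i}" for i in range(1, 14)}
queries = {
    "Q1": {f"r{i}" for i in range(1, 11)},            # r1..r10
    "Q2": {f"r{i}" for i in range(1, 10)} | {"r11"},  # r1..r9, r11
    "Q3": {"r12"},
}
print(equivalence_clusters(D, queries))
# five clusters: {r1..r9}, {r10}, {r11}, {r12}, {r13}
```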
Clusters over Data (Heuristics)
• Problem: Too many clusters!
• Apply heuristics to decrease the number of clusters:
– Prune unimportant queries
• Remove queries with empty answers
• Retain only the most specific query in a given session
– Merge similar queries
Clusters over Data (Merge Similar Queries)
• Algorithm:
1. Compute the result DQi for each query Qi
2. Compute the cluster CQi for each query Qi
3. Repeat until no more merging is possible:
1. Compute the distance between each pair of query clusters
2. Merge any two clusters CQi and CQj whose distance is less than the threshold B
Distance: d(Qi, Qj) = 1 − |DQi ∩ DQj| / |DQi ∪ DQj| (the Jaccard distance between the two result sets)
Merge Similar Queries (Example)
• Let B = 0.2
• d(CQ1, CQ2) = 1 − 9/11 ≈ 0.18; d(CQ1, CQ3) = 1; d(CQ2, CQ3) = 1
• Merge CQ1 and CQ2, since 0.18 < B
• Results in 2 query clusters: CQ1 = {r1, …, r11}, CQ2 = {r12}
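The distance and the merge test on the example, as a small sketch:

```python
def query_distance(dq_i, dq_j):
    """Jaccard distance between two query result sets."""
    union = dq_i | dq_j
    if not union:
        return 1.0
    return 1 - len(dq_i & dq_j) / len(union)

# The slide's numbers, with B = 0.2:
DQ1 = {f"r{i}" for i in range(1, 11)}            # result of Q1: r1..r10
DQ2 = {f"r{i}" for i in range(1, 10)} | {"r11"}  # result of Q2: r1..r9, r11
print(round(query_distance(DQ1, DQ2), 2))        # 0.18 < 0.2, so merge
```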
Merge Similar Queries (Complexity Results)
• O(|H||D| + |H|^3 · td)
– td is the time to compute the distance between two query results
• Can be improved by:
– Sampling
– Pre-computation of distances: O(|H||D| + |H|^2 · td + |H|^2 log|H|)
– Min-wise hashing: O(|H||D| + |H|^2 · k + |H|^2 log|H|)
• k is the hash signature size
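A sketch of the min-wise hashing idea, assuming integer record IDs (the hash family and parameters here are illustrative, not the paper's):

```python
import random

def minhash_signature(record_ids, k=64, seed=42):
    """k min-wise hash values of a set of integer record IDs; the
    fraction of equal positions in two signatures estimates the
    Jaccard similarity of the underlying sets."""
    rng = random.Random(seed)
    p = 2_147_483_647                  # large prime modulus
    params = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(k)]
    return [min((a * x + b) % p for x in record_ids) for a, b in params]

def estimated_distance(sig_i, sig_j):
    """1 - estimated Jaccard similarity; O(k) per pair instead of
    touching the full result sets."""
    matches = sum(u == v for u, v in zip(sig_i, sig_j))
    return 1 - matches / len(sig_i)

# Example: signatures of DQ1 = {1..10} and DQ2 = {1..9, 11}
sig1 = minhash_signature(range(1, 11))
sig2 = minhash_signature(set(range(1, 10)) | {11})
print(estimated_distance(sig1, sig2))  # close to the true 2/11 for large k
```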
Generate Clusters
• QC1, …, QCk are the query clusters generated after query pruning and merging
• For each record ri:
– Generate the set Si of query clusters such that some query in the cluster returns ri
– Group the records by Si; each group forms a cluster Ci and is assigned a class label
– Compute Pi:
• Sum of the frequencies of the queries in Si, divided by the sum of the frequencies of all queries in H (the history)
Example (continuing the running example): P1 = 2/3, P2 = 1/3, P3 = 0
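A sketch of the probability computation on the running example (cluster and query-cluster names assumed as above):

```python
def cluster_probabilities(data_clusters, query_clusters, frequencies):
    """P_i for each data cluster: total frequency of the queries in its
    associated query clusters S_i, over the total frequency in H.

    data_clusters:  dict cluster label -> set of query-cluster labels (S_i)
    query_clusters: dict query-cluster label -> set of query names
    frequencies:    dict query name -> weight F
    """
    total = sum(frequencies.values())
    probs = {}
    for label, s_i in data_clusters.items():
        queries = set().union(*(query_clusters[q] for q in s_i))
        probs[label] = sum(frequencies[q] for q in queries) / total
    return probs

# Running example: QC1 = {Q1, Q2}, QC2 = {Q3}, every query has weight 1.
qcs = {"QC1": {"Q1", "Q2"}, "QC2": {"Q3"}}
dcs = {"C1": {"QC1"}, "C2": {"QC2"}, "C3": set()}
print(cluster_probabilities(dcs, qcs, {"Q1": 1, "Q2": 1, "Q3": 1}))
# {'C1': 0.667, 'C2': 0.333, 'C3': 0.0} (up to float formatting)
```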
Navigation Tree Construction
• Given D, Q and C, find a tree T(V, E, L) such that:
– T contains all records of Q
– There does not exist a T' with NCost(T', C) < NCost(T, C)

NCost(T, C) = Σi Pi · Cost(T, Ci), where Cost(T, Ci) is the number of category labels and records a user interested in cluster Ci examines while navigating T to the records of Ci (i.e., the expected navigation cost)
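One way to operationalize this cost model (a sketch using the TreeNode structure from the definitions above; the paper's exact bookkeeping may differ):

```python
def ncost(root, clusters, probs):
    """Expected navigation cost: with probability P_i a user wants
    cluster C_i; at every internal node they expand, they read all
    child labels, and at every leaf they reach, they read its records."""
    def visit(node, target):
        if not node.children:                 # leaf: read all of its records
            return len(node.records)
        cost = len(node.children)             # read every child's label
        for child in node.children:
            if child.records & target:        # expand only relevant children
                cost += visit(child, target)
        return cost
    return sum(p * visit(root, clusters[i]) for i, p in probs.items())
```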
Navigation Tree Construction
• The optimal-tree construction problem is NP-hard
• Observation: The navigation tree is very similar to a decision tree
• So, any decision-tree construction algorithm can be used…
• Decision-tree algorithms compute information gain to measure how well an attribute classifies the data
• Here, the criterion is to minimize navigation cost instead
Navigation Tree Construction (Decision Tree Construction)
• Precondition: Each record has a class label assigned in the clustering step
• Algorithm (a runnable sketch follows the list):
1. Create a root R and assign all records to it
2. If all the records have the same class label, stop
3. Select the attribute that maximizes the reduction in (global) navigation cost (the analogue of information gain) and expand the tree to the next level
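A self-contained sketch of the greedy construction; it uses plain class entropy as the split measure, a stand-in for the paper's navigation-cost criterion developed on the next slides (the record format, a dict with a "class" key, is an assumption):

```python
import math

def entropy(records):
    """Shannon entropy of the class-label distribution; a stand-in for
    the paper's probability-weighted measure."""
    n = len(records)
    counts = {}
    for r in records:
        counts[r["class"]] = counts.get(r["class"], 0) + 1
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def best_split(records, attributes):
    """Try every (attribute, value) binary split; return the one with the
    largest entropy reduction, or None if no split helps."""
    best_gain, best = 0.0, None
    n = len(records)
    for attr in attributes:
        for v in {r[attr] for r in records}:
            left = [r for r in records if r[attr] <= v]
            right = [r for r in records if r[attr] > v]
            if not left or not right:
                continue
            gain = (entropy(records)
                    - len(left) / n * entropy(left)
                    - len(right) / n * entropy(right))
            if gain > best_gain:
                best_gain, best = gain, (attr, v, left, right)
    return best

def build_tree(records, attributes):
    """Greedy top-down construction; a node becomes a leaf when it is
    pure or no split reduces the measure."""
    if len({r["class"] for r in records}) <= 1:
        return {"records": records}
    split = best_split(records, attributes)
    if split is None:
        return {"records": records}
    attr, v, left, right = split
    return {"split": (attr, v),
            "children": [build_tree(left, attributes),
                         build_tree(right, attributes)]}
```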
Navigation Tree Construction (Splitting Criteria)
• Navigation cost includes:
– Cost of visiting leaf nodes (the results)
– Cost of visiting intermediate nodes (category labels)
Splitting Criteria (Example)
[Diagram: two candidate binary splits of the same eight records (4 of class C1, 4 of class C2)]
– Split on A1 (A1 <= v1 vs. A1 > v1): children (C1, C2, C1, C2) and (C1, C2, C1, C2)
– Split on A2 (A2 <= v2 vs. A2 > v2): children (C1, C2, C2, C2) and (C1, C1, C1, C2)
P(C1) = P(C2) = 0.5; Navigation Cost = 2 + 4 + 2 + 4
Which split is better?
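Working the example with an entropy-based gain (with P(C1) = P(C2) = 0.5 the classes weigh equally, so plain class entropy is a reasonable proxy for the paper's measure):

```python
import math

def entropy(labels):
    n = len(labels)
    return -sum(labels.count(c) / n * math.log2(labels.count(c) / n)
                for c in set(labels))

def igain(parent, children):
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

root = list("11221122")                           # 4 records of C1, 4 of C2
print(igain(root, [list("1212"), list("1212")]))  # A1 split: gain 0.0
print(igain(root, [list("1222"), list("1112")]))  # A2 split: gain ~0.19
# A2 is the better split: its children are purer, so a user interested
# in one cluster reads fewer irrelevant records.
```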
Splitting Criteria: Cost of Visiting Leaf Nodes
• Let t be the node to be split
• Let N(t) be the number of records in t
• Let Pi be the probability that users are interested in cluster Ci
• The gain (reduction in navigation cost) when t is split into t1 and t2: intuitively, for each cluster Ci whose records fall entirely within one child tj, a user interested in Ci no longer reads the other child's records, saving Pi · (N(t) − N(tj))
Splitting Criteria: Cost of Visiting Intermediate Nodes
• Observation:
– Given a perfect tree T with N records and k classes, where each class Ci in T has Ni records, the quantity
Σi (Ni / N) · log(N / Ni)
approximates the average length of the root-to-leaf paths over all records in T
Splitting Criteria: Cost of Visiting Intermediate Nodes (cont…)
[Diagram: a perfect tree rooted at t; the records of each class Ci end up in their own pure subtree, so a record of class Ci sits at depth roughly log N − log Ni]
Splitting Criteria: Combining the Two Costs
• The gain when a node t is split into t1 and t2 combines the leaf-node savings above with the reduction in intermediate-node cost
• Information gain due to a split:
IGain(t, t1, t2) = E(t) − (N1/N)·E(t1) − (N2/N)·E(t2)
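The formula transcribed directly (E may be the plain entropy used in the sketches above, or a probability-weighted measure that also accounts for leaf-visit cost):

```python
def information_gain(E, t, t1, t2):
    """IGain(t, t1, t2) = E(t) - (N1/N)*E(t1) - (N2/N)*E(t2)
    for any per-node impurity measure E over a list of records."""
    n, n1, n2 = len(t), len(t1), len(t2)
    return E(t) - (n1 / n) * E(t1) - (n2 / n) * E(t2)
```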