
1

Using Trees to Depict a Forest
Bin Liu, H. V. Jagadish
EECS, University of Michigan, Ann Arbor

Presented by Sergey Shepshelvich

2

Motivation

In interactive database querying, we often get more results than we can comprehend immediately

When do you actually click through 2-3 pages of results? 85% of users never go to the second page!

What should we display on the first page?

3

Standard solutions

Sorting by attributes
- Computationally expensive
- Similar results can be distributed many pages apart

Ranking
- Hard to estimate the user's preference
- In database queries, all tuples are equally relevant!

What to do when there are millions of results?

4

Make the First Page Count

Human beings are very capable of learning from examples

Show the most “representative” results
- Best help users learn what is in the result set
- Users can decide further actions based on the representatives

5

The Proposal: MusiqLens Experience (Model-driven Usable Systems for Information Querying)

Suppose a user wants a 2005 Civic

6

but there are too many of them…

7

MusiqLens on the Car Data

Id  | Model | Price   | Year | Mileage | Condition | More like this
872 | Civic | $12,000 | 2005 | 50,000  | Good      | 122
901 | Civic | $16,000 | 2005 | 40,000  | Excellent | 345
725 | Civic | $18,500 | 2005 | 30,000  | Excellent | 86
423 | Civic | $17,000 | 2005 | 42,000  | Good      | 201
132 | Civic | $9,500  | 2005 | 86,000  | Fair      | 185
322 | Civic | $14,000 | 2005 | 73,000  | Good      | 55

8


After Zooming in: 2005 Honda Civics ~ ID 132

Id  | Model | Price   | Year | Mileage | Condition | More like this
342 | Civic | $9,800  | 2005 | 72,000  | Good      | 25
768 | Civic | $10,000 | 2005 | 60,000  | Good      | 10
132 | Civic | $9,500  | 2005 | 86,000  | Fair      | 63
122 | Civic | $9,500  | 2005 | 76,000  | Good      | 5
123 | Civic | $9,100  | 2005 | 81,000  | Fair      | 40
898 | Civic | $9,000  | 2005 | 69,000  | Fair      | 42

10

After Filtering by “Price < 9,500”

Id  | Model | Price  | Year | Mileage | Condition | More like this
123 | Civic | $9,100 | 2005 | 81,000  | Fair      | 40
898 | Civic | $9,000 | 2005 | 69,000  | Fair      | 42
133 | Civic | $9,300 | 2005 | 87,000  | Fair      | 33
126 | Civic | $9,200 | 2005 | 89,000  | Good      | 3
129 | Civic | $8,900 | 2005 | 81,000  | Fair      | 20
999 | Civic | $9,000 | 2005 | 87,000  | Fair      | 12

11

Challenges

Representation Modeling: finding a suitable metric
- What is the best set of representatives?

Representative finding
- How to find them efficiently?

Query Refinement
- How to efficiently adapt to the user's query operations?

12

Finding a Suitable Metric

Users should be the ultimate judges
- Which metric generates the representatives that I can learn the most from?

User study to evaluate the different representation models

Metric Candidates

Sort by attributes

Uniform random sampling
- Small clusters are missed

Density-biased sampling
- Sample more from sparse regions, less from dense regions

Sort by typicality
- Based on probabilistic modeling

K-medoids

14

Metric Candidates - K-medoids

A medoid of a cluster is the object whose dissimilarity to the other objects in the cluster is smallest
- Average medoid (smallest average dissimilarity) and max medoid (smallest maximum dissimilarity)

K-medoids are k objects, each of which is the medoid of a different cluster

Why not k-means?
- K-means cluster centers do not exist in the database
- We must present real objects to users

16
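To make the average-versus-max distinction concrete, here is a minimal sketch (mine, not from the paper) of the two medoid variants over a single cluster; `dist` is whatever dissimilarity function the caller supplies:

```python
import math

def avg_medoid(cluster, dist):
    # Object with the smallest total (equivalently, average) distance to the cluster.
    return min(cluster, key=lambda o: sum(dist(o, p) for p in cluster))

def max_medoid(cluster, dist):
    # Object with the smallest maximum distance to any object in the cluster.
    return min(cluster, key=lambda o: max(dist(o, p) for p in cluster))

points = [(1, 1), (2, 1), (2, 2), (8, 9)]
print(avg_medoid(points, math.dist))  # -> (2, 2) for this toy cluster
print(max_medoid(points, math.dist))  # -> (2, 2) here as well
```

Either way the returned representative is an actual object of the cluster, which is exactly why the authors prefer medoids over k-means centers.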

17

Plotting the Candidates

[Scatter plots of the representatives chosen by uniform random sampling (“Random”) and by density-biased sampling (“Density Biased”). Data: Yahoo! Autos, 3922 data points; price and mileage are normalized to 0..1.]

18

Plotting the Candidates - Typicality

[Scatter plot of the representatives chosen by sorting on typicality (“Typical”), on the same normalized Yahoo! Autos data.]

19

Plotting the Candidates – k-medoids

[Scatter plots of the representatives chosen by k-medoids, one panel for Max-Medoids and one for Avg-Medoids, on the same normalized Yahoo! Autos data.]

20

User Study Procedure

Users are given 7 sets of data, generated using the 7 candidate methods
- Each set consists of 8 representative points

Users predict 4 more data points
- Points that are most likely in the data set
- They should not pick points already given

Measure the prediction error

21

Verdict

K-medoids is the winner
- In this paper, the authors choose average k-medoids
- The proposed algorithm can be extended to max-medoids with small changes

22

Challenges

Representation Modeling: finding a suitable metric
- What is the best set of representatives?

Representative finding
- How to find them efficiently?

Query Refinement
- How to efficiently adapt to the user's query operations?

23

Cover Tree Based Algorithm

Cover Tree was proposed by Beygelzimer, Kakade, and Langford in 2006

Briefly discuss Cover Tree properties

See Cover Tree based algorithms for computing k-medoids

24

Cover Tree Properties (1)

Nesting: C_i ⊆ C_{i+1} for all i

[Figure: points in the data set (one dimension), with the cover sets C_i and C_{i+1}.]

Cover Tree Properties (2)

The distance from a node to any of its descendants is less than its “span” (2/2^i for a node at level i, matching the spans 2, 1, 1/2, 1/4 in the example later)

Covering: a node in C_i is within distance 1/2^i of each of its children in C_{i+1}

Cover Tree Properties (3)

Separation: nodes in C_i are separated by at least 1/2^i

Note: the level index i is allowed to be negative in order to satisfy the above conditions.

27
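To read the three properties together, here is a minimal checking sketch (my own, not from the paper or the cover tree paper); the `levels`/`parent` representation and the 1/2^i radii are assumptions matching the slides above:

```python
import itertools
import math

def check_cover_tree(levels, parent):
    """levels[i] is the set of points (tuples) in C_i; parent[(i + 1, p)] is
    p's parent in C_i. Radii follow the 1/2**i convention used above."""
    for i in sorted(levels):
        r = 2.0 ** (-i)
        # Separation: points within one cover set C_i are at least r apart.
        for p, q in itertools.combinations(levels[i], 2):
            assert math.dist(p, q) >= r, f"separation violated at level {i}"
        if i + 1 not in levels:
            continue
        # Nesting: C_i is a subset of C_{i+1}.
        assert levels[i] <= levels[i + 1], f"nesting violated at level {i}"
        # Covering: each point of C_{i+1} lies within r of its parent in C_i.
        for p in levels[i + 1]:
            assert math.dist(p, parent[(i + 1, p)]) <= r, f"covering violated at level {i}"
    return True
```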

Additional Stats for Cover Tree (2D Example)

[Figure: an example cover tree over points s1-s10 in two dimensions.]

Density (DS): number of points in the subtree (e.g., DS = 10 at the root, DS = 3 for a smaller subtree)

Centroid (CT): geometric center of the points in the subtree

28
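A small sketch of how DS and CT could be maintained bottom-up on a node structure; the `Node` class and the assumption that each point is stored on exactly one node (self-child duplicates collapsed) are mine, not the paper's:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    point: tuple                      # the database object this node represents
    children: list = field(default_factory=list)
    ds: int = 0                       # density: number of points in the subtree
    ct: tuple = ()                    # centroid: geometric center of the subtree

def annotate(node: Node) -> None:
    # Aggregate the children first, then fold in this node's own point.
    ds, sums = 1, list(node.point)
    for child in node.children:
        annotate(child)
        ds += child.ds
        sums = [s + c * child.ds for s, c in zip(sums, child.ct)]
    node.ds = ds
    node.ct = tuple(s / ds for s in sums)
```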

k-medoid Algorithm Outline

Descend the cover tree to a level with more than k nodes

Choose k initial points as the first set of medoids (seeds)
- Bad seeds can lead to local minima with a high distance cost

Assign nodes and repeatedly update until the medoids converge

29

Cover Tree Based Seeding

Descend the cover tree to a level with more than k nodes (call it level m)

Use the parent level as the starting point for seeds
- Each node has a weight, calculated as the product of its span and density (the contribution of the subtree to the distance cost)

Expand nodes using a priority queue

Fetch the first k nodes from the queue as seeds (a sketch follows the example below)

A Simple Example: k = 4

30

[Figure: the example cover tree over s1-s10, with node spans of 2, 1, 1/2, and 1/4 at successive levels.]

Priority queue on node weight (density * span):

S3 (5), S8 (3), S5 (2)

After expansion: S8 (3/2), S5 (1), S3 (1), S7 (1), S2 (1/2)

The first k = 4 nodes from the queue form the final set of seeds.

31
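One plausible rendering of the seeding procedure as code; the node attributes (`density`, `span`, `children`, `point`) and the stopping rule are my assumptions about the slides, not the paper's exact algorithm:

```python
import heapq

def pick_seeds(start_nodes, k):
    """Greedily expand the heaviest cover-tree nodes (weight = density * span)
    with a priority queue until at least k candidates exist, then return the
    k heaviest candidates' points as seeds."""
    # heapq is a min-heap, so weights are negated; id() breaks ties so node
    # objects themselves are never compared.
    heap = [(-n.density * n.span, id(n), n) for n in start_nodes]
    heapq.heapify(heap)
    frozen = []                                   # nodes we will not expand further
    while heap and len(heap) + len(frozen) < k:
        entry = heapq.heappop(heap)               # heaviest remaining node
        node = entry[2]
        if len(node.children) > 1:
            for c in node.children:
                heapq.heappush(heap, (-c.density * c.span, id(c), c))
        else:
            frozen.append(entry)                  # leaf-like node: keep as a candidate
    best = heapq.nsmallest(k, heap + frozen)      # most negative weight = heaviest
    return [n.point for _, _, n in best]
```

Given the queue contents shown in the example above, the first four nodes are S8, S5, S3, and S7.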

Update Process

1. Initially, assign all nodes to the closest seed to form clusters

2. For each cluster, calculate the geometric center
- Use the centroid and density information to approximate each subtree

3. Find the node that is closest to the geometric center and designate it as the new medoid

4. Repeat from step 1 until the medoids converge

32
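A minimal sketch of that update loop, assuming each node exposes a centroid `ct`, a density `ds`, and a representative `point` (my naming, following the DS/CT slide); Euclidean distance is used here for concreteness:

```python
import math

def refine_medoids(nodes, seeds, max_iters=50):
    medoids = list(seeds)                         # medoids are concrete points
    for _ in range(max_iters):
        # Step 1: assign every node to its closest current medoid.
        clusters = [[] for _ in medoids]
        for node in nodes:
            j = min(range(len(medoids)), key=lambda i: math.dist(node.ct, medoids[i]))
            clusters[j].append(node)
        new_medoids = []
        for j, members in enumerate(clusters):
            if not members:
                new_medoids.append(medoids[j])    # keep an empty cluster's medoid
                continue
            # Step 2: geometric center, approximating each subtree by its
            # centroid weighted by its density.
            total = sum(n.ds for n in members)
            dims = len(members[0].ct)
            center = tuple(sum(n.ct[d] * n.ds for n in members) / total for d in range(dims))
            # Step 3: the node closest to the center supplies the new medoid.
            best = min(members, key=lambda n: math.dist(n.ct, center))
            new_medoids.append(best.point)
        # Step 4: stop once the medoids no longer change.
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids
```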

Challenges

Representation Modeling: finding a suitable metric
- What is the best set of representatives?

Representative finding
- How to find them efficiently?

Query Refinement
- How to efficiently adapt to the user's query operations?

33

Query Adaptation

Handle user actions:
- Zooming
- Selection (filtering)

34

Zooming

Zooming in on a medoid:
- Expand all nodes assigned to that medoid
- Run the k-medoid algorithm on the new set of nodes

35
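In code form, zooming could look like the following sketch; `assignment`, the node `children` attribute, and the `kmedoids` callback are hypothetical names standing in for whatever the earlier sketches (or the real system) provide:

```python
def zoom_in(medoid_index, assignment, kmedoids, k):
    """Expand the nodes assigned to one medoid into their children and run a
    k-medoid routine on the finer node set to get new representatives."""
    expanded = []
    for node in assignment[medoid_index]:
        expanded.extend(node.children or [node])  # keep leaves as they are
    return kmedoids(expanded, k)
```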

Selection

Effect of a selection on a node:
- Completely invalid
- Fully valid
- Partially valid

Estimate the validity percentage (VG) of each node

Multiply the VG by the weight of each node (see the sketch after this slide)

[Figure: nodes S1-S7 plotted on price vs. mileage axes, showing how a selection predicate leaves some nodes fully valid, some completely invalid, and some only partially valid.]

37
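A sketch of the reweighting step, plus a toy VG estimator for a price predicate; the `weight`, `density`, `span`, and `price_range` attributes are assumptions for illustration, and a real estimator would use whatever per-node statistics the system keeps:

```python
def reweight_for_selection(nodes, vg):
    """vg(node) estimates the fraction of the node's subtree that satisfies the
    new predicate: 0 = completely invalid, 1 = fully valid, in between = partial."""
    survivors = []
    for node in nodes:
        v = vg(node)
        if v <= 0.0:
            continue                                  # completely invalid: prune
        node.weight = node.density * node.span * v    # scale the weight by VG
        survivors.append(node)
    return survivors

def vg_price_below(node, threshold):
    """Toy VG estimate for `price < threshold`, assuming the node records a
    (lo, hi) price range and its points spread roughly uniformly across it."""
    lo, hi = node.price_range
    if hi < threshold:
        return 1.0
    if lo >= threshold:
        return 0.0
    return (threshold - lo) / (hi - lo)
```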

Experiments – Initial Medoid Quality

Compare with the R-tree based method by M. Ester, H. Kriegel, and X. Xu

Data sets
- Synthetic data set: 2D points with a Zipf distribution
- Real data set: the LA data set from the R-tree Portal, 130k points

Measurement
- Time to compute the medoids
- Average distance from a data point to its medoid
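For clarity, the second measure can be written as a one-liner; this is just the usual k-medoid distance cost, not code from the paper:

```python
import math

def average_distance(points, medoids):
    # Average distance from each data point to its closest (assigned) medoid.
    return sum(min(math.dist(p, m) for m in medoids) for p in points) / len(points)
```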

Results on Synthetic Data

38

[Charts: time in seconds and average distance versus data cardinality (256K to 4096K), comparing the R-tree and Cover Tree based methods.]

For various sizes of data, the Cover Tree based method outperforms the R-tree based method on both time and distance.

39

Results on Real Data

[Charts: average distance and time in seconds versus k (2 to 512), comparing the R-tree and Cover Tree based methods on the real data set.]

For various k values, the Cover Tree based method outperforms the R-tree based method on real data.

Query Adaptation

40

[Charts: distance versus selectivity (0.8 down to 0.2) for Re-Compute versus Incremental maintenance, on synthetic data and on real data.]

Compare with re-building the cover tree and running the k-medoid algorithm from scratch.

The time cost of re-building is orders of magnitude higher than that of incremental computation.

41

Conclusion

The authors proposed the MusiqLens framework for solving the many-answer problem

The authors conducted a user study to select a metric for choosing representatives

The authors proposed an efficient method for computing and maintaining the representatives under user actions

Part of the database usability project at the University of Michigan, led by Prof. H. V. Jagadish
http://www.eecs.umich.edu/db/usable/