Finding Skyline Nodes in Large Networks

Arijit Khan*, Vishwakarma Singh*, Jian Wu#

* Computer Science, University of California, Santa Barbara, USA
# College of Computer Science, Zhejiang University, China

{arijitkhan, vsingh}@cs.ucsb.edu, [email protected]

Motivation

Query in a LinkedIn network: If John is interested in Big Data, Cloud Computing, and MapReduce, who are the top-5 people John should ask about these topics?

Evaluation metrics:

Distance from the query node (John).

Coverage of the query topics (Big Data, Cloud Computing, MapReduce).

Homogeneous Approach?

Score = λ · Distance + (1 − λ) · Coverage

How should λ be chosen?

Weighted Set Cover?

Find the nodes with the smallest aggregate distance from the query node such that together they cover all query topics.

This ignores some interesting nodes.

It cannot rank the results.

[Figure: example input network. Query node u0 = q; query topics Q = {a, b, c}. Node labels: u1: a, u2: b, u3: c, u4: abc, u5: a, u6: cd, u7: abc, u8: de.]

Graph Skyline

Dominance on coverage: u >c v if the query topics covered by node u are a strict superset of the query topics covered by node v.

Dominance on distance: u >d v if the distance of u from q is less than that of v from q.

(u ≥c v and u ≥d v additionally allow equality.)

Dominance: u > v if (1) u >c v and u ≥d v; or (2) u ≥c v and u >d v.

Graph skyline: a node is a skyline node if it is not dominated by any other node in the network.
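The dominance test and the skyline definition can be sketched in a few lines. This is only an illustration under stated assumptions: the nodes p1 to p5, their covered topics, and their hop distances are hypothetical toy data (not the network in the slides), and the quadratic scan is the naive definition, not the algorithm presented later.

```python
from typing import Dict, FrozenSet, List, Tuple

# (covered query topics, hop distance from the query node)
Node = Tuple[FrozenSet[str], int]


def dominates(u: Node, v: Node) -> bool:
    """u > v: u is at least as good on both criteria and strictly better on one."""
    cov_u, dist_u = u
    cov_v, dist_v = v
    at_least_as_good = cov_u >= cov_v and dist_u <= dist_v  # u >=c v and u >=d v
    strictly_better = cov_u > cov_v or dist_u < dist_v      # u >c v or u >d v
    return at_least_as_good and strictly_better


def skyline(nodes: Dict[str, Node]) -> List[str]:
    """Naive skyline: keep every node that no other node dominates."""
    return sorted(
        name
        for name, u in nodes.items()
        if not any(dominates(v, u) for other, v in nodes.items() if other != name)
    )


# Hypothetical toy data, chosen only to exercise both dominance cases.
nodes = {
    "p1": (frozenset("a"), 1),
    "p2": (frozenset("ab"), 2),
    "p3": (frozenset("a"), 2),    # dominated by p1 (same coverage, closer)
    "p4": (frozenset("abc"), 3),
    "p5": (frozenset("b"), 3),    # dominated by p2 (more coverage, closer)
}
print(skyline(nodes))
```

Combining "at least as good on both" with "strictly better on one" is equivalent to the two-case definition above, which keeps the predicate short.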

Ranking of Skyline Nodes

There can be too many skyline nodes, so they need to be ranked.

Problem statement: Given a query node and a set of query topics in a network, find the top-k skyline nodes with the maximum dominance count.

Dominance count: the number of nodes dominated by a skyline node [Lin et al., ICDE '07].

A higher dominance count means more pruning from the candidate set.

1. DC(u4) = {u5, u6, u7}
2. DC(u1) = {u5}
3. DC(u2) = ∅
4. DC(u3) = ∅

Algorithm

Construct a query DAG.

Three variables are associated with each DAG node: Count (C), Dominance (D), Traversal (T).

[Figure: the input network and its query DAG. The query DAG has one node for each non-empty subset of Q: abc; ab, ac, bc; a, b, c.]

Naïve complexity: O(n · 2^r).

Complexity with preprocessing: O(n · r^2).

DAG node   C   D   T
abc        2   -   -
ab         0   -   -
ac         0   -   -
bc         0   -   -
a          2   -   -
b          1   -   -
c          2   -   -

Query DAG Construction

Preprocessing: for each label, find a sorted list of the nodes that contain the label.

Online query DAG construction: incremental DAG construction.

[Figure: query DAG after processing labels a and b. ab: u4, u7; a: u1, u5; b: u2. The sorted list of c (u3, u4, u6, u7) is still to be processed.]

Query DAG Construction (cont.)

Online query DAG construction: consider the labels and their sorted lists in order.

[Figure: query DAG after processing label c. abc: u4, u7; ab: empty; a: u1, u5; b: u2; c: u3, u6.]

[Figure: completed query DAG, with the empty DAG nodes ac and bc added.]
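The preprocessing and the incremental construction can be sketched as follows. This is a minimal illustration, not the authors' implementation: the node labels are read off the example figure (node ids 1 to 8 stand for u1 to u8), the function names are made up for this sketch, and set membership is used for brevity where a merge over the sorted lists would normally be used.

```python
from typing import Dict, FrozenSet, List

# Node labels of the example network; one character per label.
labels = {1: "a", 2: "b", 3: "c", 4: "abc", 5: "a", 6: "cd", 7: "abc", 8: "de"}
query = "abc"


def build_sorted_lists(labels: Dict[int, str]) -> Dict[str, List[int]]:
    """Preprocessing: for each label, the ascending list of node ids carrying it."""
    lists: Dict[str, List[int]] = {}
    for node in sorted(labels):
        for label in labels[node]:
            lists.setdefault(label, []).append(node)
    return lists


def build_query_dag(
    lists: Dict[str, List[int]], query: str
) -> Dict[FrozenSet[str], List[int]]:
    """Online step: fold in one query label at a time, splitting each bucket."""
    buckets: Dict[FrozenSet[str], List[int]] = {}
    for label in query:
        members = set(lists.get(label, []))
        updated: Dict[FrozenSet[str], List[int]] = {}
        for cover, nodes in buckets.items():
            hit = [n for n in nodes if n in members]
            miss = [n for n in nodes if n not in members]
            if hit:
                updated[cover | {label}] = hit
            if miss:
                updated[cover] = miss
        # Nodes that carry this label but none of the earlier query labels.
        seen = {n for nodes in buckets.values() for n in nodes}
        fresh = [n for n in lists.get(label, []) if n not in seen]
        if fresh:
            updated[frozenset({label})] = fresh
        buckets = updated
    return buckets


lists = build_sorted_lists(labels)
dag = build_query_dag(lists, query)
# Count variable C of each non-empty DAG node.
count = {cover: len(nodes) for cover, nodes in dag.items()}
for cover, nodes in dag.items():
    print("".join(sorted(cover)), nodes)
```

Each bucket holds the nodes whose covered query topics are exactly that subset of Q, so the bucket sizes give the Count variable C; subsets with no nodes (ab, ac, bc in the example) simply do not appear.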

Find Dominance Variable

Perform a topological ordering of the DAG nodes to evaluate the Dominance variable (D) of each DAG node.

D is the number of nodes that the DAG node dominates or equals in coverage.

Naïve complexity: O(n · 2^r).

Complexity by topological ordering: O(3^r).

DAG node   C   D   T
abc        2   7   -
ab         0   3   -
ac         0   4   -
bc         0   3   -
a          2   2   -
b          1   1   -
c          2   2   -
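The D values above can be reproduced by summing the Count variable over every non-empty subset of a DAG node's topic set. The sketch below assumes that reading of D; it enumerates subsets directly instead of propagating along a topological order, which keeps it short and has the same 3^r bound on the number of (set, subset) pairs. The function names are illustrative only.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, Iterator


def subsets(topics: Iterable[str]) -> Iterator[FrozenSet[str]]:
    """All non-empty subsets of a topic set."""
    items = sorted(topics)
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def dominance_variables(
    count: Dict[FrozenSet[str], int], query: Iterable[str]
) -> Dict[FrozenSet[str], int]:
    """D(S) = sum of C(S') over all non-empty S' that are subsets of S."""
    return {s: sum(count.get(t, 0) for t in subsets(s)) for s in subsets(query)}


# Count values from the example: C(abc) = 2, C(a) = 2, C(b) = 1, C(c) = 2.
count = {
    frozenset("abc"): 2,
    frozenset("a"): 2,
    frozenset("b"): 1,
    frozenset("c"): 2,
}
D = dominance_variables(count, "abc")
for cover in subsets("abc"):
    print("".join(sorted(cover)), D[cover])
```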

Find Traversal Variable

Perform a breadth-first search (BFS) starting from the query node.

T is the number of nodes not dominated by distance.

Complexity by BFS: O(n+e)

h = 2

DAG node   C   D   T
abc        2   7   1
ab         0   3   0
ac         0   4   0
bc         0   3   0
a          2   2   2
b          1   1   1
c          2   2   2
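A BFS that fills the Traversal variable level by level might look like the sketch below. The edge list is hypothetical (the edges of the example network are not recoverable from this transcript); it is chosen only so that u1, u2, u3 are one hop from q and u4, u5, u6 are two hops away, which is what the T values for h = 1 and h = 2 suggest. Reading T(S) as the number of nodes with exact coverage S reached within h hops is likewise an assumption.

```python
from collections import Counter, deque
from typing import Dict, FrozenSet, List


def bfs_levels(adj: Dict[int, List[int]], source: int) -> Dict[int, int]:
    """Hop distance of every reachable node from the source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def traversal_counts(
    dist: Dict[int, int], cover: Dict[int, FrozenSet[str]], hops: int
) -> Counter:
    """T(S): nodes with exact coverage S within `hops` of the query node."""
    T: Counter = Counter()
    for node, d in dist.items():
        if 0 < d <= hops and cover.get(node):
            T[cover[node]] += 1
    return T


# Hypothetical undirected edges; node 0 is the query node q.
adj = {
    0: [1, 2, 3], 1: [0, 4], 2: [0, 5], 3: [0, 6],
    4: [1, 7], 5: [2], 6: [3, 8], 7: [4], 8: [6],
}
# Covered query topics per node (u8 covers none and is left out).
cover = {
    1: frozenset("a"), 2: frozenset("b"), 3: frozenset("c"),
    4: frozenset("abc"), 5: frozenset("a"), 6: frozenset("c"),
    7: frozenset("abc"),
}
dist = bfs_levels(adj, 0)
T1 = traversal_counts(dist, cover, hops=1)
T2 = traversal_counts(dist, cover, hops=2)
print(dict(T1))
print(dict(T2))
```

The BFS visits each node and edge once, which matches the O(n + e) bound quoted for this step.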

Find Skyline Nodes

Store the DAG nodes in a lookup table, with a skyline bit for each DAG node.

The skyline bit helps to prune non-skyline nodes directly.

Lookup table (h = 1):

DAG node   Skyline bit
abc        0
ab         0
ac         0
bc         0
a          1
b          1
c          1

Find Skyline Nodes (cont.)

Lookup table (h = 2):

DAG node   Skyline bit
abc        1
ab         1
ac         1
bc         1
a          1
b          1
c          1

Dominance Count of Skyline Nodes

DAG node   C   D   T
abc        2   7   0
ab         0   3   0
ac         0   4   0
bc         0   3   0
a          2   2   1
b          1   1   1
c          2   2   1

DC(u4) = D(abc) − T(abc) − T(ab) − T(ac) − T(bc) − T(a) − T(b) − T(c) − 1 = 7 − 0 − 0 − 0 − 0 − 1 − 1 − 1 − 1 = 3

A top-k buffer stores the top-k skyline nodes.
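The formula for DC(u4) generalizes to any node by subtracting the T values of every non-empty subset of the node's coverage. The sketch below assumes that generalization and assumes T holds the counts of nodes traversed strictly before the node in question; the helper names are illustrative.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterator


def subsets(cover: FrozenSet[str]) -> Iterator[FrozenSet[str]]:
    """All non-empty subsets of a coverage set."""
    items = sorted(cover)
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def dominance_count(
    cover: FrozenSet[str],
    D: Dict[FrozenSet[str], int],
    T: Dict[FrozenSet[str], int],
) -> int:
    """DC(u) = D(cover) - sum of T(S) over non-empty S within cover - 1 (u itself)."""
    return D[cover] - sum(T.get(s, 0) for s in subsets(cover)) - 1


# D values from the example; T values as listed when u4 is evaluated.
D = {frozenset("abc"): 7, frozenset("a"): 2, frozenset("b"): 1, frozenset("c"): 2}
T = {frozenset("a"): 1, frozenset("b"): 1, frozenset("c"): 1}

dc_u4 = dominance_count(frozenset("abc"), D, T)
# u1 (coverage a) is reached before any other node covering a, so T is empty then.
dc_u1 = dominance_count(frozenset("a"), D, {})
print(dc_u4, dc_u1)
```

The result for u1 agrees with DC(u1) = {u5}, a single dominated node.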

Pruning and Early Termination

Top-k pruning: a DAG node is pruned when its Dominance variable is smaller than the smallest dominance count in the top-k buffer.

Early termination: stop when the skyline bits of all entries in the lookup table are 1.
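One way to realize the top-k buffer and the two stopping rules is a bounded min-heap, sketched below. The class and method names are hypothetical, and requiring a full buffer before pruning is an assumption of this sketch; the dominance counts fed in are the sizes of the DC sets from the example.

```python
import heapq
from typing import Dict, List, Tuple


class TopKBuffer:
    """Keeps the k skyline nodes with the largest dominance counts."""

    def __init__(self, k: int) -> None:
        self.k = k
        self.heap: List[Tuple[int, str]] = []  # min-heap of (dominance count, node)

    def offer(self, node: str, dc: int) -> None:
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, (dc, node))
        elif dc > self.heap[0][0]:
            heapq.heapreplace(self.heap, (dc, node))

    def can_prune(self, d_value: int) -> bool:
        """Top-k pruning: the DAG node's D is below the smallest count kept."""
        return len(self.heap) == self.k and d_value < self.heap[0][0]

    def ranked(self) -> List[Tuple[int, str]]:
        return sorted(self.heap, reverse=True)


def can_terminate(skyline_bits: Dict[str, int]) -> bool:
    """Early termination: every lookup-table entry has its skyline bit set."""
    return all(bit == 1 for bit in skyline_bits.values())


buffer = TopKBuffer(k=2)
for node, dc in [("u1", 1), ("u2", 0), ("u3", 0), ("u4", 3)]:
    buffer.offer(node, dc)
print(buffer.ranked())
```

Pruning on D is safe under the formula for DC, since a node's dominance count is at most the D value of its DAG node minus one.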

Experimental Results

DBLP: 0.7M nodes, 3M edges, 10 distinct node labels; 5 query topics.

Efficiency

DBLP: 185M nodes, 90M edges, 1000 distinct node labels; 5 query topics, top-5 result nodes.

Conclusion and Future Work

An efficient algorithm to find the top-k skyline nodes in large attributed networks.

Required experimental evaluation on real and synthetic datasets.

Time complexity is linear in the number of nodes and edges in the network.

Distance-based indexing might improve the efficiency.

A top-k skyline set, instead of top-k skyline nodes, might be more effective.

Questions

Thank you!