Download - 1 QSX: Querying Social Graphs Graph Queries and Algorithms Graph search (traversal) PageRank Nearest neighbors Keyword search Graph pattern matching

1

QSX: Querying Social Graphs

Graph Queries and Algorithms

Graph search (traversal)

PageRank

Nearest neighbors

Keyword search

Graph pattern matching

2

Basic graph queries and algorithms


PageRank

Nearest neighbors

Keyword search

Graph pattern matching (a full treatment of itself)

Widely used in graph algorithms

Algorithms in MapReduce (lecture 4) and other parallel models (lecture 5)Algorithms in MapReduce (lecture 4) and other parallel models (lecture 5)

Graph representation

33

Gen

4

Directed graph G = (V, E, fA) attributes fA(u): a tuple (A1 = a1, ..., An = an)

Social Graphs

Med

Soc Eco

AI

Chem

(‘dept’=CS, ‘field’=AI) (‘dept’=CS, ‘field’=AI)

(‘dept’=CS, ‘field’=DB) (‘dept’=CS, ‘field’=DB) (‘dept’=Bio, ‘field’=Gen) (‘dept’=Bio, ‘field’=Gen)

(‘dept’=Bio, ‘field’=Eco) (‘dept’=Bio, ‘field’=Eco)

Social graphs: attributes, extensible with edge labels

DBDB

label, keywords, blogs, comments, rating …label, keywords, blogs, comments, rating …

4

EcoEco

5

Relational representation

Node relation: node( nodeId, label, attributes)e.g., node(02, book, “Graphs”), node(03, author, “John Doe”)

Edge relation: edge( sid, did, label)sid, did: source and destination nodes; e.g., edge(03, 02, write)

Possibly with an attribute relation

Nontrivial to leverage existing relational techniques

Pros and cons Lossless: the original graph can be reconstructed Ignore topological structure Querying efficiency: Requires multi-table joins or self joins for

simple queries, e.g., book[author=“Bush”]/chapter/title requires 3 joins of the edge relation!

Adjacency Matrices

Represent a graph as an n x n square matrix M– n = |V|– Mij = 1 means a link from node i to j

1 2 3 41 0 1 0 12 1 0 1 13 1 0 0 04 1 0 1 0

1

2

3

4

6

Pros and cons Connectivity: O(1) time (rows: outlinks, columns: inlinks)

Too costly to be practical: when G is large, M is |V|2

Inefficiency: real-life graphs are often sparse

Adjacency Lists

Each node carries a list of adjacent edges

1: 2, 42: 1, 3, 43: 14: 1, 3

1 2 3 41 0 1 0 12 1 0 1 13 1 0 0 04 1 0 1 0

7

Pros and cons

efficient, especially for sparse graphs;

easy to compute over outlinks

difficult to compute over inlinks


88

9

Path queries

Reachability• Input: A directed graph G, and a pair of nodes s and t in G• Question: Does there exist a path from s to t in G?

What do you know about these? 9

Distance• Input: A directed weighted graph G, and a node s in G• Output: The lengths of shortest paths from s to all nodes in G

Regular path• Input: A node-labeled directed graph G, a pair of nodes s and t

in G, and a regular expression R• Question: Does there exist a path p from s to t that satisfies R?

10

Reachability queries

How to evaluate reachability queries?10

Applications: a routine operation• Social graphs: are two people related for security reasons?

• Biological networks: find genes that are (directly or indirectly)

influenced by a given molecule

Nodes: molecules, reactions or physical interections

Edges: interactions

Reachability• Input: A directed graph G, and a pair of nodes s and t in G• Question: Does there exist a path from s to t in G?

11

Breadth-first search

What is the complexity?11

BFS (G, s, t):

1. while Que is nonempty do

a. v Que.dequeue();

b. if v = t then return true;

c. for all adjacent edges e = (v, u) of v do

a) if not flag(u)

then flag(u) true; enqueue u onto Que;

2. return false

Use (a) a queue Que, initialized with s, (b) flag(v) for each node, initially false; and (c) adjacency lists to store G

Use (a) a queue Que, initialized with s, (b) flag(v) for each node, initially false; and (c) adjacency lists to store G

Why do we need the test?Why do we need the test?

Breadth-first, by using a queue Breadth-first, by using a queue

Complexity: each node and edge is examined at most once

12

Breadth-first search

How to strike a balance?12

BFS (G, s, t): O(|V| + |E|) time and space


a. v Que.dequeue();

b. if v = t then return true;

c. for all adjacent edges e = (v, u) of v do

a) if not flag(u)

then flag(u) true; enqueue u onto Que;

2. return false

Reachability: NL-complete

O(1) time?

Too costly as a routine operation when G is largeToo costly as a routine operation when G is large

Yes, adjacency matrix, but O(|V|2) space

13

2-hop cover

A number of algorithms for reachability queries (see reading list)13

For each node v in G,

2hop(v) = (Lin(v), Lout(v))

•Lin(v): a set of nodes in G that can reach v

•Lout(v): a set of nodes in G that v can reach To ensure: node s can reach t if and only if

Lout(s) Lin(t) •Testing: better than O(|V| + |E|) on average •Space: O(|V| |E|1/2)

Find a minimum 2-hop cover? NP-hardFind a minimum 2-hop cover? NP-hard

Maintenance cost in response to changes to G? Left as a project (LN 7)Maintenance cost in response to changes to G? Left as a project (LN 7)

14

Distance queries

14

Distance: single-source shortest-path problem• Input: A directed weighted graph G, and a node s in G• Output: The lengths of shortest paths from s to all nodes in G

Application: transportation networks

Dijkstra (G, s, w):

1. for all nodes v in V do

a. d[v] ;

2. d[s] 0; Que V;


a. u ExtractMin(Que);

b. for all nodes v in adj(u) do

a) if d[v] > d[u] + w(u, v) then d[v] d[u] + w(u, v);

Use a priority queue Que; w(u, v): weight of edge (u, v); d(u): the distance from s to u

Use a priority queue Que; w(u, v): weight of edge (u, v); d(u): the distance from s to u

Extract one with the minimum d(u)Extract one with the minimum d(u)

Example: Dijkstra’s algorithm

15

0s

∞

∞

∞

∞

a b

c d

10

5

2 3

1

9

7

2

4 6

Q = {s,a,b,c,d}d: {(a,∞), (b,∞), (c,∞), (d,∞)}

Exa

mpl

e fr

om C

LR

2nd

ed.

p. 5

28

16

0s

10

5

∞

∞

a b

c d

10

5

2 3

1

9

7

2

4 6

Q = {a,b,c,d} d: {(a,10), (b,∞), (c,5), (d,∞)}

Exa

mpl

e fr

om C

LR

2nd

ed.

p. 5

28


17

0s

8

5

14

7

a b

c d

10

5

2 3

1

9

7

2

4 6

Q = {a,b,d} d: {(a,8), (b,14), (c,5), (d,7)}

Exa

mpl

e fr

om C

LR

2nd

ed.

p. 5

28


18

0s

8

5

13

7

a b

c d

10

5

2 3

1

9

7

2

4 6

Q = {a,b}d: {(a,8), (b,13), (c,5), (d,7)}

Exa

mpl

e fr

om C

LR

2nd

ed.

p. 5

28


19

0s

8

5

9

7

a b

c d

10

5

2 3

1

9

7

2

4 6

Q = {b} d: {(a,8), (b,9), (c,5), (d,7)}

Exa

mpl

e fr

om C

LR

2nd

ed.

p. 5

28


20

0s

8

5

9

7

a b

c d

10

5

2 3

1

9

7

2

4 6

Q = {} d: {(a,8), (b,9), (c,5), (d,7)}

Exa

mpl

e fr

om C

LR

2nd

ed.

p. 5

28


O(|V| log|V| + |E|). A beaten-to-death topic ?

How to speed it up by leveraging parallelism?How to speed it up by leveraging parallelism?Shortest paths in terriain? New spatial constraintsShortest paths in terriain? New spatial constraints

21

Why do we care about regular path queries?21

Regular simple path• Input: A node-labeled directed graph G, a pair of nodes s and t

in G, and a regular expression R• Question: Does there exist a simple path p from s to t such

that the labels of adjacent nodes on p form a string in R?

Regular path queries

What is the complexity?

NP-complete, even when R is a fixed regular expression (00)* or 0*10*.

In PTIME when G is a DAG (directed acyclic graph)

A. Mendelzon, and P. T. Wood. Finding Regular Simple Paths In Graph Databases. SICOMP 24(6), 1995

Patterns of social linksPatterns of social links

22Graph queries are nontrivial, even for path queries

22

Regular path• Input: A node-labeled directed graph G, a pair of nodes s and t

in G, and a regular expression R• Question: Does there exist a path p from s to t such that the

labels of adjacent nodes on p form a string in R?

Regular path queries


PTIME

Show that the regular path problem is in O(|G| |R|) timeShow that the regular path problem is in O(|G| |R|) time

23O(|V| + |E|)

23

A strongly connected component in a direct graph G is a set V of nodes in G such that

• for any pair (u, v) in V, u can reach v and v can reach u; and • V is maximal: adding any node to V makes it no longer strongly

connected

Strongly connected components


by extending search algorithms, e.g., BFSby extending search algorithms, e.g., BFS

SCC• Input: A graph G• Question: all strongly connected components of G

Find social circles: how large? How many?Find social circles: how large? How many?

PageRank

2424

25

Introduction to PageRank

To measure the “quality” of a Web page• Input: A directed graph G modelling the Web, in which nodes

represent Web pages, and edges indicate hyperlinks • Output: For each node v in G, P(v): the likelihood that a

random walk over G will arrive at v.

25

Intuition: how a random walk can reach v?• A random jump: (1/|V|)

• : random jump factor (teleportation factor)

• Following a hyperlink: (1 - ) _(u L(v)) P(u)/C(u)

• (1 - ): damping factor

• (1 - ) _(u L(v)) P(u)/C(u): the chances for one to click a hyperlink at a page u and reach v

The chances of hitting v among |V| pages The chances of hitting v among |V| pages

Two parts

26

Intuition

26

Following a hyperlink: (1 - ) _(u L(v)) P(u)/C(u)

• L(v): the set of pages that link to v; • C(u): the out-degree of node u (the number of links on u)• P(u)/C(u):

• the probability of u being visited itself• the probability of clicking the link to v among C(u) many links

on page u

The chances of reaching page v from page uThe chances of reaching page v from page u

Intuition: • the more pages link to v, and • the more popular those pages that link to v,

v has a higher chance to be visited

One of the models

27

Putting together

27

The likelihood that page v is visited by a random walk:

P(v) = (1/|V|) + (1 - ) _(u L(v)) P(u)/C(u)

Recursive computation: for each page v in G,• compute P(v) by using P(u) for all u L(v)

until• converge: no changes to any P(v)• after a fixed number of iterations

random jumprandom jump following a link from other pagesfollowing a link from other pages

too expensive; use an error factortoo expensive; use an error factor

How to speed it up?

costly: trillions of pagescostly: trillions of pages Parallel computationParallel computation

Example: PageRank

28

Exa

mpl

e fr

om C

LR

2nd

ed.

p. 5

28

(d, 0.2)

(a, 0.2)

(b, 0.2)

(c, 0.2)

(e, 0.2)

0.1

0.1

0.10.1

0.2

0.066

0.066

0.066

0.2

Initial stage (assume = 0, P(v) = 0.2)

3-way split3-way split

Example: PageRank

29

Exa

mpl

e fr

om C

LR

2nd

ed.

p. 5

28

(a, 0.066)

(b, 0.166)

(c, 0.166)

(d, 0.3)

(e, 0.3)

(a, 0.2)(b, 0.2)

(c, 0.2)(e, 0.2)

0.1

0.1

0.10.1

0.2

0.066

0.066

0.066

0.2

Iteration 1

Example: PageRank

30

Exa

mpl

e fr

om C

LR

2nd

ed.

p. 5

28

(a, 0.066)(b, 0.166)

(c, 0.166)

(d, 0.3)

(e, 0.3)

0.033

0.033

0.0830.083

0.166

0.1

0.1

0.1

0.3

Iteration 2

(a, 0.1)(b, 0.133)

(c, 0.183)

(d, 0.2)

(e, 0.383)

Find nearest neighbors

3131

32

Nearest neighbor

Nearest neighbor (kNN)• Input: A set S of points in a space M, a query point p in M, a

distance function dist(u, v), and a positive integer k• Output: Find top-k points in S that are closest to p based on

dist(p, u)

A number of techniques 32

Applications• POI recommendation: find me top-k restaurants close to where

I am • Classification: classify an object based on its nearest neighbors• Regression: property value as the average of the values of its k

nearest neighbors Linear search, space partitioning, locality sensitive hashing, compression/clustering based search, …

Euclidean distance, Hamming distance, continuous variables, …

33

kNN join

kNN join• Input: Two datasets R and S, a distance function dist(r, s),

and a positive integer k• Output: pairs (r, s) for all r in R, where s is in S, and is one of

the k-nearest neighbors of r

Can we do better?33

A naive algorithm• Scanning S once for each object in R• O(|R| |S|): expensive when R or S is large

Pairwise comparison

3434

Blocking and windowing

Several indexing and ordering techniques

only pairs in the same block are

compared

windowing

D B2B1

B3 partitioning

pairs into blocks

D D sorting

slidingwindow

window of a fixed size; only pairs in the same window are compared;

blocking

Grid order: rectangle cells, sorted by surrounding points GORDER: An Efficient Method for KNN Join

Processing. VLDB 2004.

Keyword search

3535

Keyword search

36

Ford company

Michigan , city history

Jaguar XJ

USA

result 1

Black Jaguar

history

North America

habitat

result 2

offer

New York, city

United States

Jaguar XK 001

result 3

White Jaguar

historyhabitat

South America

result 4

offer

Chicago

USA

Jaguar XK 007

result 5

Michigan

Query Q: ‘[Jaguar’, ‘America’, ‘history’]

Input: A list Q of keywords, a graph G, a positive integer k Output: top-k “matches” of Q in G

Information retrieval

What makes a match? How to sort the matches? How to efficiently find top-k matches?

Questions to answer

37

Semantics: Steiner tree

What can we do about it?37

Match: a subtree T of G such that • each keyword in Q is contained in a leaf of T

Ranking: • The total weight of T (the sum of w(e) for all edges e in T)

Input: A list Q of keywords, a graph G, a weight function w(e) on the edges on G, and a positive integer k

Output: top-k Steiner trees that match Q PageRank scores

The cost to connect the keywords Complexity?

NP-complete

38

Semantics: distinct-root (tree)

O(|Q| (|V| log |V| + |E|))38

Match: a subtree T of G such that • each keyword in Q is contained in a leaf of T

Ranking: • dist(r, q): from the root of T to a leaf q• The sum of distances from the root to all leaves of T

Input: A list Q of keywords, a graph G, and a positive integer k Output: top-k distinct trees that match Q

To avoid too homogeneous result, e.g., a hub as a root Diversification:

• each match in the top-k answer has a distinct rootHow many candidate matches for root?

39

Semantics: Steiner graphs

Revision: minimum subgraphs39

Match: a subgraph G’ of G such that it is • r-radius: the shortest distance between any pair of nodes in G is

at most r (at least one pair with the distance); and• each keyword is contained in a content node • a Steiner node: on a simple path between a pair of content nodes

Input: A list Q of keywords, an undirected (unweighted) graph G, a positive integer r, and a positive integer k

Output: Find all r-radius Steiner graphs that match Q

Computation: Mr, the r-th power of adjacency graph of G

40

Answering keyword queries

A well studied topic 40

G. Bhalotia, A. Hulgeri, C. Nakhe, S. Chakrabarti, and S. Sudarshan. Keyword searching and browsing in databases using BANKS. ICDE 2002.

V. Kacholia, S. Pandit, S. Chakrabarti, S. Sudarshan, R. Desai, and H. Karambelkar. Bidirectional expansion for keyword search on graph databases. VLDB 2005.

H. He, H. Wang, J. Yang, and P. S. Yu. BLINKS: ranked keyword searches on graphs. SIGMOD 2007.

A host of techniques• Backward search• Bidirectional search• Bi-level indexing• …

However, …

41

Ford company

Michigan , city history

Jaguar XJ

USA

result 1

Black Jaguar

history

North America

habitat

result 2

offer

New York, city

United States

Jaguar XK 001

result 3

White Jaguar

historyhabitat

South America

result 4

offer

Chicago

USA

Jaguar XK 007

result 5

Michigan

Query Q: ‘[Jaguar’, ‘America’, ‘history’]

The semantics is rather “ad hoc”

What does the user really want to find? Tree or graph? How to

explain matches found?

Add semantics to keyword search

company

dealer

history

Americacountry

city

cars

A long way to go

Summing up

4242

43

Graph query languages

Unfortunately, no “standard” query language for social graphs, yet43

SoQL: an SQL-like language to retrieve paths

CRPQ: extending conjunctive queries with regular path expressions• R. Ronen and O. Shmueli. SoQL: A language for querying and creating

data in social networks. ICDE, 2009.• P. Barceló, C. A. Hurtado, L. Libkin, and P. T. Wood. Expressive

languages for path queries over graph-structured data. In PODS, 2010

There have been efforts to develop graph query languages.Querying topological structures

SPARQL: for RDF data• http://www.w3.org/TR/rdf-sparql-query/

Read this

44

Summary and review

Why are reachability queries? Regular path queries? Connected

components? Complexity? Algorithms?

What are factors for PageRank? How does PageRank work?

What are kNN queries? What is a kNN join? Complexity?

What are keyword queries? What is its Steiner tree semantics?

Distinct-root semantics? Steiner graph semantics? Complexity?

Name a few applications of graph queries you have learned.

Find graph queries that are not covered in the lecture.

45

Project (1)

Recall regular path queries:

45

Develop two algorithms for evaluating regular path queries:• a sequential algorithm by using 2-hop covers; and • an algorithm in MapReduce

Prove the correctness of your algorithms and give complexity analysisExperimentally evaluate your algorithms, especially their scalability

• Input: A node-labeled directed graph G, a pair of nodes s and t in G, and a regular expression R

• Question: Does there exist a path p from s to t that satisfies R?

A development project

46

Project (2)

GPath. Extend XPath to query directed, node-labeled graphs.

46

Design GPath, a query language for graphs. A GPath query Q starts

from a context node v in a graph G, traverses G and returns all the

nodes that are reachable from v by following Q. GPath should

support the child axis, wildcard *, self-or-descendants (//), and filters

(aka qualifiers, such as [p = c]). Justify your design.

Develop an algorithm that, given a GPath query Q, a graph G, and

a context node v in G, computes Q(G), the set of nodes reachable

from v in G by following Q.

Give a complexity analysis of your algorithm and show its

correctness.

Experimentally evaluate your algorithm

A research project

47

Project (3)

Study keyword search in graphs.

47

Pick a semantics for keyword search and an algorithm for

implementing keyword search based on the semantics

Justify your choice: semantics and algorithm

Implement the algorithm in whatever language you like

Experimentally evaluate your implementation

Demonstrate your keyword search support

A development project

Reading

48

• A. Mendelzon, and P. Wood. Finding regular simple paths in graph databases. SICOMP, 24(6), 1995. ftp://ftp.db.toronto.edu/pub/papers/sicomp95.ps.Z

• J. Xu and J. Chang. Graph reachability queries: a survey. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.352.2250

• J. Xu, L. Qin, and J. Chang. Keyword Search in Relational Databases: A Survey.

http://sites.computer.org/debull/A10mar/yu-paper.pdf

Acknowledgments: some animation slides are borrowed from

www.cs.kent.edu/~jin/Cloud12Spring/GraphAlgorithms.pptx

Papers for you to review

49

• E. Cohen, E. Halperin, H. Kaplan, and U. Zwick. Reachability and distance queries via 2-hop labels. SODA 2002. http://www.cs.tau.ac.il/~heran/cozygene/publications/papers/labels.pdf

• R. Bramandia, B. Choi, and W. Ng. On Incremental Maintenance of 2-hop Labeling of Graphs. WWW 2008.

http://wwwconference.org/www2008/papers/pdf/p845-bramandia.pdf

• M. Kaul, R.Wong, B. Yang, C. Jensen. Finding Shortest Paths on Terrains by Killing Two Birds with One Stone. VLDB 2014. http://www.vldb.org/pvldb/vol7/p73-kaul.pdf

• C. Xia, H. Lu, B. Ooi and J. Hu. GORDER: An Efficient Method for KNN Join Processing. VLDB 2004. http://www.vldb.org/conf/2004/RS20P2.PDF

Papers for you to review

50

• G. Bhalotia, A. Hulgeri, C. Nakhe, S. Chakrabarti, and S. Sudarshan. Keyword searching and browsing in databases using BANKS. ICDE 2002.

http://www.cse.iitb.ac.in/~sudarsha/Pubs-dir/BanksICDE2002.pdfR• V. Kacholia, S. Pandit, S. Chakrabarti, S. Sudarshan, R. Desai, and H.

Karambelkar. Bidirectional expansion for keyword search on graph databases. VLDB 2005.

http://www.researchgate.net/publication/221310807_Bidirectional_Expansion_For_Keyword_Search_on_Graph_Databases

•H. He, H. Wang, J. Yang, and P. S. Yu. BLINKS: ranked keyword searches on graphs. SIGMOD 2007.

http://db.cs.duke.edu/papers/2007-SIGMOD-hwyy-kwgraph.pdf

•Y. Wu, S. Yang, M. Srivatsa, A. Iyengar, X. Yan. Summarizing Answer Graphs Induced by Keyword Queries.. VLDB 2014. http://www.cs.ucsb.edu/~yinghui/mat/papers/gsum_vldb14.pdf

Download - 1 QSX: Querying Social Graphs Graph Queries and Algorithms Graph search (traversal) PageRank Nearest neighbors Keyword search Graph pattern matching

Top Related