large graph analysis - iscom · • large graphs: social graphs, web graphs … • extremal...

Post on 26-Jul-2020


Large graph analysis

Paola Vocca - Università della Tuscia

Outline

• Large graphs: social graphs, web graphs, …
• Extremal measures: diameter, centrality, eccentricity, average distance, degree of separation
• Exact and approximate algorithms and data structures
  o Exact
  o Sampling
  o Data stream model: probabilistic estimators

3/16/2017 Big Data: Tecnologie, metodologie e applicazioni per l'analisi dei dati massivi

Graphs

• A graph represents relations between «things», or entities
• Nodes (or vertices) represent the entities
• Edges represent the relations

Relation: Who is the master of whom?

Bow tie structure of the Web Graph


• An AltaVista crawl of 200 million pages and 1.5 billion links.
• A giant strongly connected component containing 28% of the nodes.

A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the Web: experiments and models. Computer Networks, 33(1–6):309–320, 2000.

Graphs

Dolphin interactions


Graphs example


Graph Datasets

Hyperlinks (the Web)

Social graphs (Facebook, Twitter, LinkedIn,…)

Email logs, phone call logs, messages

Commerce transactions (Amazon purchases)

Road networks

Communication networks

Protein interactions


Properties

Directed/Undirected

Snapshot or with time dimension (dynamic)

One or more types of entities (people, pages, products)

Meta data associated with nodes or edges (labels)

Some graphs are really large: billions of edges for Facebook and Twitter graphs


Mining the graph

Connected/Strongly connected components

Eccentricity e(v): the maximum distance d(v, u) over all nodes u.

Radius r(G): the minimum eccentricity of the nodes. A node u is central if e(u) = r(G), and the center of G is the set of all central nodes.

Diameter (length of the longest shortest s-t path, i.e. the maximum eccentricity)

Effective diameter (90th percentile of the pairwise distances)

Distance distribution (number of pairs at each distance)

Average distance

Degree distribution

Clustering coefficient: the fraction of connected triples of nodes (paths of length two) that are closed into triangles.

Centrality

…
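The definitions of eccentricity, radius, center and diameter can be illustrated with a minimal Python sketch. It assumes an undirected, connected, unweighted graph stored as an adjacency dictionary; the toy path graph and the function names are hypothetical and are not taken from the slides.

```python
from collections import deque

def bfs_distances(adj, source):
    """Return {node: hop distance from source} for every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def eccentricities(adj):
    """e(v) = max over u of d(v, u); assumes the graph is connected."""
    return {v: max(bfs_distances(adj, v).values()) for v in adj}

# Hypothetical toy graph: the path a - b - c - d - e.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}

ecc = eccentricities(path)
radius = min(ecc.values())      # minimum eccentricity
diameter = max(ecc.values())    # maximum eccentricity
center = sorted(v for v, e in ecc.items() if e == radius)
print(ecc, radius, diameter, center)
```

On this path the middle node is the only central node, and the diameter is twice the radius.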

… Mining the link structure

• Centrality (which are the most important nodes?)
• Similarity of nodes (link prediction, targeted ads, friend/product recommendations, metadata completion)
• Communities: sets of nodes that are more tightly related to each other than to the rest of the graph
• “Cover”: a set of nodes with good coverage (facility location, influence maximization)

Connected components

Number of connected components: 2

Eccentricity

[Figure: example graph with per-node eccentricity labels: two nodes with Ecc = 2, ten nodes with Ecc = 3]

Diameter

Diameter is 3


Distance distribution

#nodes: 13
#edges: 27
#pairs: 78

Distance 1: 27 (number of edges)
Distance 2: 33
Distance 3: 18

[Figure: example graph with nodes labelled 1 to 13]

Average distance

#nodes: 13
#edges: 27
#pairs: 78

$$\mathrm{AvgDist} = \frac{1}{78}\left(27 \cdot 1 + 33 \cdot 2 + 18 \cdot 3\right) = \frac{147}{78} \approx 1.88$$

[Figure: example graph with nodes labelled 1 to 13]

Triangles

[Figure: example graph (nodes 1 to 13) with a closed triangle highlighted]

• Social graphs have many more closed triangles than random graphs
• “Communities” have more closed triangles
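As a rough illustration of how closed triangles can be counted, the sketch below computes a global clustering coefficient as the fraction of connected triples (paths u - v - w centred at v) whose endpoints are also adjacent. The adjacency-set representation and the toy graph are assumptions made for the example.

```python
from itertools import combinations

def global_clustering(adj):
    """Fraction of connected triples u - v - w in which u and w are also adjacent."""
    closed = 0
    triples = 0
    for v, nbrs in adj.items():
        for u, w in combinations(sorted(nbrs), 2):
            triples += 1            # u - v - w is a path of length two
            if w in adj[u]:
                closed += 1         # the path is closed into a triangle
    return closed / triples if triples else 0.0

# Hypothetical toy graph: triangle 1-2-3 plus a pendant node 4 attached to 3.
toy = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
coefficient = global_clustering(toy)
print(coefficient)
```

Each triangle is counted three times (once per centre), so the toy graph has 3 closed triples out of 5.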

Communities

[Figure: example graph (nodes 1 to 13) with communities labelled “Star Wars” and “Ninjago”]

Communities in Les Miserables Network


The network of interactions between major characters in the novel Les Miserables by Victor Hugo

Centrality

• Which are the most important nodes?
  – It depends on the criteria and on what we want to model

• Degree (in/out): largest number of followers or friends. Easy to compute locally. Spammable.
• PageRank: your importance/reputation depends recursively on that of your friends
• Betweenness: your value as a “hub”, i.e. being on a shortest path between many pairs.
• Closeness: centrally located, able to quickly reach/infect many nodes; usually defined as the inverse of the average distance to the other nodes (a measure related to the inverse of eccentricity)

Centrality: example

[Figure: example graph with nodes labelled 1 to 13]

• Nodes that are central with respect to all criteria

Random graphs and real graphs

• Experiments show that statistical measures of real complex networks differ significantly from those of randomly generated graphs
• Biological networks, social networks, the Internet and the Web have similar measures:
  o a high degree of aggregation (clustering coefficient)
  o a low degree of separation (average distance)

Degree distribution

• In a random graph (each link is created according to a uniform probability distribution) all nodes have the same importance: the probability that v is connected to v′ is the same as the probability that it is connected to v″.
• The probability that a node has exactly degree k is

$$P(k) = \frac{\text{number of nodes with degree } k}{\text{number of nodes in the graph}}$$

[Figure: degree distributions of random graphs and of real graphs]

• In random graphs, nodes with a low or a high degree are rare
• Most of the nodes have a degree close to the average
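The empirical degree distribution P(k) can be computed directly from its definition. The sketch below assumes an undirected graph stored as a dictionary of neighbour sets; the star graph is a hypothetical example of a single hub with four leaves.

```python
from collections import Counter

def degree_distribution(adj):
    """P(k) = (number of nodes with degree k) / (number of nodes in the graph)."""
    n = len(adj)
    counts = Counter(len(nbrs) for nbrs in adj.values())
    return {k: c / n for k, c in sorted(counts.items())}

# Hypothetical star graph: node 0 is a hub connected to four leaves.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
P = degree_distribution(star)
print(P)
```

Here four of the five nodes have degree 1 and only the hub has degree 4, a tiny caricature of the skewed distributions seen in real graphs.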

Degree distribution

• Hubs are shortcuts on the paths, so they influence the degree of separation

Small-world effect: very few nodes with many connections, which ensure the speed of transmission of information (or gossip ...) in the network

• Social networks: a few individuals with many friends (celebrities)
• Web: a few websites with lots of links
• Metabolic networks: a few metabolites participating in many metabolic processes
• Internet: it works well because any two computers are connected by no more than 10 or 20 "hops" (physical links)

The small-world effect is crucial
• in the study of epidemic spreading,
• in the service-optimization problem in telecommunications networks,
• in the study of neuronal interconnections in the brain,
• in ecological networks, etc.

Small-world effect

• A network evolves over time, and each new node joining the network does not have the same chance of connecting to every existing node, but
• it is more likely that the new node connects to an already well-connected node than to an isolated node: preferential attachment

Why? Hubs are
• a reason for network resilience: when a node fails, it is rarely a hub, so connectivity remains "guaranteed"
• but they are also the target of deliberate attacks!

Small world effect

Kevin Bacon number    # of people
0                     1
1                     3303
2                     381495
3                     1383150
4                     356429
5                     30815
6                     3640
7                     584
8                     116
9                     26
10                    1

Total number of linkable actors: 2159560
Weighted total of linkable actors: 6522634
Average Kevin Bacon number: 3.020

How good a center is Kevin Bac
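The totals reported for the Kevin Bacon table can be checked with a few lines of Python. The dictionary below simply copies the counts from the table; the variable names are illustrative.

```python
# Number of actors for each Kevin Bacon number, copied from the table.
bacon_counts = {
    0: 1, 1: 3303, 2: 381495, 3: 1383150, 4: 356429, 5: 30815,
    6: 3640, 7: 584, 8: 116, 9: 26, 10: 1,
}

total_actors = sum(bacon_counts.values())
weighted_total = sum(k * c for k, c in bacon_counts.items())
average_bacon_number = weighted_total / total_actors

print(total_actors, weighted_total, f"{average_bacon_number:.3f}")
```

The average is the weighted total divided by the number of linkable actors, i.e. the average distance from a single source node.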

Facebook

• Boldi, Rosa, and Vigna. HyperANF: approximating the neighbourhood function of very large graphs on a budget. In WWW 2011.
• The average distance of Facebook (721.1M nodes and 68.7G edges) is 4.7 and the diameter is 41.

Neighbourhood function

• The neighbourhood function N(t) of a graph returns, for each t ∈ ℕ, the number of pairs of nodes (x, y) such that y is reachable from x in at most t steps.
• The previous distance-based measures (distance distribution, average distance, effective diameter, diameter) can be derived from the computation of N(t).
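To see how distance-based measures follow from N(t), the sketch below rebuilds a cumulative count from the distance histogram of the 13-node example (27, 33 and 18 pairs at distance 1, 2 and 3). As a simplifying assumption it counts unordered pairs of distinct nodes, as the example slides do, and it takes the effective diameter to be the smallest integer t covering 90% of the pairs.

```python
from itertools import accumulate

# Pairs at exactly distance t in the 13-node example (unordered pairs).
pairs_at_distance = {1: 27, 2: 33, 3: 18}

distances = sorted(pairs_at_distance)
cumulative = accumulate(pairs_at_distance[t] for t in distances)
N = dict(zip(distances, cumulative))   # N[t] = pairs within distance t

total_pairs = N[distances[-1]]
# The number of pairs at exactly distance t is N(t) - N(t - 1).
avg_distance = sum(t * (N[t] - N.get(t - 1, 0)) for t in distances) / total_pairs
diameter = min(t for t in distances if N[t] == total_pairs)
effective_diameter = min(t for t in distances if N[t] >= 0.9 * total_pairs)

print(N, round(avg_distance, 2), diameter, effective_diameter)
```

The differences N(t) - N(t - 1) recover the distance distribution, from which the average distance and the diameter follow.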

Real graphs are huge

The Internet 2003

Graph colors:
Asia Pacific – Red
Europe/Middle East/Central Asia/Africa – Green
North America – Blue
Latin American and Caribbean – Yellow
RFC1918 IP Addresses – Cyan
Unknown – White

Figure by the Opte Project (www.opte.org).

The vertices are “class C subnets”: groups of computers with similar Internet addresses, usually managed by a single organization. The connections represent the routes taken by data packets as they hop between subnets. The geometric positions of the vertices are chosen simply to give a pleasing layout and are not related to geographic position.

Internet 2010

Internet 2015

Graph colors:
North America (ARIN)
Europe (RIPE)
Latin America (LACNIC)
Asia Pacific (APNIC)
Africa (AFRINIC)
“Backbone” (highly connected networks)

Exact computations

How can one compute the distance distribution?

• Weighted graphs:
  o Dijkstra (single-source: $O(n^2)$),
  o Floyd-Warshall (all-pairs: $O(n^3)$)
• Unweighted graphs:
  o a single BFS solves the single-source version of the problem: $O(m)$
  o if we repeat it from every source: $O(mn)$
• Matrix multiplication: still too expensive.
  o $O(n^{(3+\omega)/2} \log n)$, where $\omega$ is the exponent of matrix multiplication.
  o U. Zwick. All pairs shortest paths using bridging sets and rectangular matrix multiplication. J. ACM, 49(3):289–317, 2002.
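The "one BFS per source" approach for unweighted graphs can be sketched as follows. The adjacency-list representation and the 6-node cycle are assumptions chosen for the example; the function is an illustration of the O(mn) strategy rather than an optimized implementation.

```python
from collections import Counter, deque

def distance_distribution(adj):
    """Run one BFS per source on an undirected graph: O(mn) time overall."""
    histogram = Counter()
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for d in dist.values():
            if d > 0:
                histogram[d] += 1      # one ordered pair (source, target)
    # In an undirected graph every unordered pair was counted twice.
    return {d: c // 2 for d, c in sorted(histogram.items())}

# Hypothetical example: a cycle on 6 nodes.
cycle = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
hist = distance_distribution(cycle)
print(hist)
```

The counts add up to the 15 unordered pairs of a 6-node graph.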

Computing on Very Large Graphs

Sampling

Approximation

Probabilistic counter

General algorithm design principles:

Keep total computation/communication/storage “linear” in the size of the data

Parallelize (minimize chains of dependencies)

Localize dependencies

Sampling

• Strategy
  o Sample a source x at random
  o Compute a full BFS from x
• It is an unbiased estimator only for undirected and connected graphs
  o It still uses BFS...
  o ...which is not cache friendly
  o ...and not compression friendly

The average distance can be estimated by sampling only $O\!\left(\frac{\log n}{\varepsilon^2}\right)$ random source nodes, rather than all n nodes, with an error of $\varepsilon$, which reduces the time complexity to $O\!\left(\frac{\log n}{\varepsilon^2}\,(n \log n + m)\right)$.

David Eppstein and Joseph Wang. 2001. Fast approximation of centrality. In Proceedings of the twelfth annual ACM-SIAM symposium on Discrete algorithms (SODA '01). Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 228-229.
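A minimal sketch of the sampling strategy, assuming an undirected, connected, unweighted graph: sources are drawn uniformly at random and a full BFS is run from each. The parameter `num_samples` is a free choice in this example and is not tied to the O(log n / ε²) bound; the ring graph is hypothetical.

```python
import random
from collections import deque

def bfs_distances(adj, source):
    """Return {node: hop distance from source} for every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def sampled_average_distance(adj, num_samples, rng):
    """Estimate the average distance from BFS runs on random sources."""
    nodes = list(adj)
    total = 0
    pairs = 0
    for _ in range(num_samples):
        source = rng.choice(nodes)          # sample a source at random
        dist = bfs_distances(adj, source)   # full BFS from the source
        total += sum(dist.values())
        pairs += len(dist) - 1              # the source itself is excluded
    return total / pairs

# Hypothetical example: a ring of 10 nodes, where every source sees the same distances.
ring = {i: [(i - 1) % 10, (i + 1) % 10] for i in range(10)}
estimate = sampled_average_distance(ring, num_samples=3, rng=random.Random(0))
print(estimate)
```

On the ring the estimate coincides with the exact average distance (25/9) whatever sources are drawn; on irregular graphs it varies with the sample.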

Sampling

Sampling algorithms

• Sampling by random node selection
  o Random Node (RN) sampling has been shown not to retain the power-law degree distribution
  o The vertex choice can be made proportional to its PageRank
  o Random Degree Node (RDN) sampling has even more bias towards high-degree nodes
• Sampling by random edge selection: the sampled graphs will be very sparsely connected, will thus have a large diameter, and will not respect community structure.
• Sampling by exploration
  o Random Node Neighbor (RNN)
  o Random Walk (RW)
  o Random Jump (RJ)

Jure Leskovec and Christos Faloutsos. 2006. Sampling from large graphs. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '06). ACM, New York, NY, USA, 631-636.

Diffusion

Basic idea

• ANF: Christopher R. Palmer, Phillip B. Gibbons, and Christos Faloutsos. 2002. ANF: a fast and scalable tool for data mining in massive graphs. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '02). ACM, New York, NY, USA, 81-90.
• HyperANF: Paolo Boldi, Marco Rosa, and Sebastiano Vigna. 2011. HyperANF: approximating the neighbourhood function of very large graphs on a budget. In Proceedings of the 20th international conference on World wide web (WWW '11). ACM, New York, NY, USA, 625-634.

• Let $B_t(x)$ be the ball of radius $t$ about $x$ (the set of nodes at distance at most $t$ from $x$)
• Clearly $B_0(x) = \{x\}$
• Moreover $B_{t+1}(x) = \bigcup_{x \to y} B_t(y) \cup \{x\}$
• So computing $B_{t+1}$ starting from $B_t$ just needs a single (sequential) scan of the graph
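The diffusion recurrence can be written down with exact Python sets before any approximation is introduced. This sketch assumes a directed graph given as a successor dictionary and counts ordered pairs (x, y), including y = x; the chain graph and the function name are hypothetical.

```python
def neighbourhood_function(succ, max_t):
    """Exact diffusion: B_0(x) = {x}, B_{t+1}(x) = {x} united with B_t(y) for every arc x -> y."""
    balls = {x: {x} for x in succ}
    counts = [sum(len(b) for b in balls.values())]   # N(0) = n
    for _ in range(max_t):
        new_balls = {}
        for x, out in succ.items():                  # one sequential scan of the graph
            ball = {x}
            for y in out:
                ball |= balls[y]                     # union with the ball of each successor
            new_balls[x] = ball
        balls = new_balls
        counts.append(sum(len(b) for b in balls.values()))
    return counts

# Hypothetical directed chain 0 -> 1 -> 2 -> 3.
chain = {0: [1], 1: [2], 2: [3], 3: []}
N = neighbourhood_function(chain, 3)
print(N)
```

Every ball is stored as an explicit set here, which is exactly the quadratic memory cost that probabilistic counters are meant to avoid.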

Example

[Figure: a node x with successors x1, x2, x3; the ball B2(x) is obtained from the balls B1(x1), B1(x2) and B1(x3)]

Easy but expensive

• Every set requires 𝑶(𝒏) bits, hence 𝑶(𝒏𝟐) bits overall

• Too many!

• What about using approximated sets?

• We need probabilistic counters, with just two primitives: add and size?

ANF and HyperANF

• ANF: uses the probabilistic counters of Flajolet & Martin (1985) and is implemented in the SNAP framework
• HyperANF: uses HyperLogLog counters [Flajolet et al., 2007] and is implemented in the WebGraph framework to study the web graph.
  o With 40 bits you can count up to 4 billion with a standard deviation of 6%

Probabilistic counter: Streaming model

Sequence of elements from some domain: <x1, x2, x3, x4, ...>

Bounded storage: working memory << stream size, usually $O(\log^k n)$ or $O(n^{\alpha})$ for $\alpha < 1$

Fast processing time per stream element

Counting Distinct Elements

Keys occur multiple times; we want to count the number of distinct keys in the stream.

In this example: 32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4

Number of distinct keys is n = 6

Number of stream elements is 11
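For reference, exact distinct counting on the example stream takes a couple of lines of Python; the price is a set whose size grows with the number of distinct keys.

```python
stream = [32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4]

seen = set()
for key in stream:
    seen.add(key)          # repeated keys leave the set unchanged

print(len(stream), len(seen))
```

The set holds every distinct key seen so far, so its memory is linear in n; this is the cost that sketches try to avoid.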

Distinct Elements: Approximate Counting

Exact counting of n distinct elements requires a structure of size $\Omega(n)$.

We are often happy with an approximate count obtained using a small-size working memory.

We want to be able to compute and maintain a small sketch s(N) of the set N of distinct items seen so far: N = {32, 12, 14, 7, 6, 4}

Wanted: Distinct Elements Sketch

Small size: $|s(N)| \ll |N| = n$

Can query $s(N)$ to get a good estimate $\hat{n}(s)$ of $n$ (small relative error)

Streaming: for a new element $x$, it is easy to compute $s(N \cup \{x\})$ from $s(N)$ and $x$

Mergeability: if $N_1$ and $N_2$ are (possibly overlapping) sets, then we can compute the union sketch $s(N_1 \cup N_2)$ from their sketches $s(N_1)$ and $s(N_2)$

MinHash Sketch: [Flajolet & Martin 85, …]

$h(x) \sim U[0,1]$: $h$ is a random hash function from keys to uniform random numbers in [0, 1]

Maintain the Min-Hash value y:
  Initialize y ← 1
  Processing an element with key x: y ← min{y, h(x)}
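The update rule can be sketched in Python as follows. As an assumption for the example, the random hash function is replaced by a deterministic SHA-256-based stand-in that maps a key to a number in [0, 1]; the helper names `h` and `minhash` are illustrative.

```python
import hashlib

def h(key, seed=0):
    """Deterministic stand-in for a random hash into [0, 1] (an assumption: SHA-256 based)."""
    digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def minhash(keys, seed=0):
    y = 1.0                      # initialize y <- 1
    for x in keys:
        y = min(y, h(x, seed))   # y <- min{y, h(x)}
    return y

stream = [32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4]

y_stream = minhash(stream)
y_distinct = minhash(set(stream))                        # same keys, no repetitions
y_merged = min(minhash(stream[:5]), minhash(stream[5:])) # merge two partial sketches
print(y_stream, y_distinct, y_merged)
```

Because the sketch is a minimum, repeated keys do not change it, and the sketch of a union is the minimum of the two sketches.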

Distinct Elements: Approximate Counting

x      32    12    14    32    7     12    32    7     6     12    4
h(x)   0.45  0.35  0.74  0.45  0.21  0.35  0.45  0.21  0.14  0.35  0.92
y      0.45  0.35  0.35  0.35  0.21  0.21  0.21  0.21  0.14  0.14  0.14
n      1     2     3     3     4     4     4     4     5     5     6

Before the first element: y = 1, n = 0.

The minimum hash value $y = \min_x h(x)$ is non-increasing and unaffected by repeated elements.

Precise relation: $\mathbf{E}[y] = \frac{1}{n+1}$

Distinct Elements: Approximate Counting

How does the minimum hash 𝑦 give information on the number of distinct elements 𝑛 ?

[Figure: the interval [0, 1] with the minimum hash value marked]

The expectation of the minimum is $\mathbf{E}[\min_x h(x)] = \frac{1}{n+1}$

A single value gives only limited information. To boost information, we maintain $k \geq 1$ values.
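One way to use k values is to keep the minimum under k different hash functions and combine them. The sketch below is an illustration under stated assumptions: the hash functions are SHA-256-based stand-ins indexed by a seed, and the estimator (k - 1) / Σ -ln(1 - y_i) is one standard choice for k independent minima; the slides do not say which estimator is used.

```python
import hashlib
import math

def h(key, seed):
    """Deterministic stand-in for the seed-th random hash into [0, 1] (assumption)."""
    digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def k_mins_sketch(keys, k):
    """Keep one minimum hash value per hash function."""
    sketch = [1.0] * k
    for x in keys:
        for i in range(k):
            sketch[i] = min(sketch[i], h(x, i))
    return sketch

def estimate(sketch):
    """Estimate n from k minima; -ln(1 - y_i) behaves like an exponential variable with rate n."""
    k = len(sketch)
    return (k - 1) / sum(-math.log(1.0 - y) for y in sketch)

# Hypothetical stream: 500 distinct keys, each appearing twice.
keys = [i % 500 for i in range(1000)]
sketch = k_mins_sketch(keys, k=256)
print(round(estimate(sketch)))
```

With k = 256 the relative error of this estimator is expected to be a few percent, at the cost of k hash evaluations per stream element.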

Advantages

• Scalability: a minimum of 20 bytes per node
• On a 2 TiB machine, 100 billion nodes
• The algorithms can be implemented on scalable architectures for big data (Hadoop, Spark)

Comparing approaches

THANK YOU!!
