web mining and social network analysis -...

36
Analisi di reti sociali - Aprile 2012 Web mining and Social Network Analysis Dino Pedreschi [email protected] Lecture 2 – Graphs and networks DP1

Upload: others

Post on 05-Jun-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Web mining and Social Network Analysis - didawiki.di.unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/wma/pedreschi.wmr.2012.02.pdf · Analisi di reti sociali - Aprile 2012 Web mining

Analisi di reti sociali - Aprile 2012

Web mining and Social Network AnalysisDino Pedreschi

[email protected]

Lecture 2 – Graphs and networks

DP1

Page 2: Web mining and Social Network Analysis - didawiki.di.unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/wma/pedreschi.wmr.2012.02.pdf · Analisi di reti sociali - Aprile 2012 Web mining

Diapositiva 1

DP1 Dino Pedreschi; 15/04/2011

Page 3: Web mining and Social Network Analysis - didawiki.di.unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/wma/pedreschi.wmr.2012.02.pdf · Analisi di reti sociali - Aprile 2012 Web mining

Analisi di reti sociali - Aprile 2011

““““Natural” Networks and Universality

� Consider many kinds of networks:

� social, technological, business, economic, content,…

� These networks tend to share certain informal properties:

� large scale; continual growth

� distributed, organic growth: vertices “decide” who to link to

� interaction restricted to links

� mixture of local and long-distance connections

� abstract notions of distance: geographical, content, social,…

� Do natural networks share more quantitative universals?

� What would these “universals” be?

� How can we make them precise and measure them?

� How can we explain their universality?

� This is the domain of social network theory

� Sometimes also referred to as link analysis

Page 4: Web mining and Social Network Analysis - didawiki.di.unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/wma/pedreschi.wmr.2012.02.pdf · Analisi di reti sociali - Aprile 2012 Web mining

Peter Mary

Albert

Albert

co-worker

friendbrothers

friend

Protein 1 Protein 2

Protein 5

Protein 9

Movie 1

Movie 3

Movie 2

Actor 3

Actor 1 Actor 2

Actor 4

N=4

L=4

Graphs as common language

Analisi di reti sociali - Aprile 2011

Page 5: Web mining and Social Network Analysis - didawiki.di.unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/wma/pedreschi.wmr.2012.02.pdf · Analisi di reti sociali - Aprile 2012 Web mining

•The choice of the proper network representation

determines our ability to use network theory

successfully.

• In some cases there is a unique, unambiguous

representation.

•In other cases, the representation is by no means

unique.

•For example, for a group of individuals, the way you

assign the links will determine the nature of the

question you can study.

Choosing the proper representation

Analisi di reti sociali - Aprile 2011

Page 6: Web mining and Social Network Analysis - didawiki.di.unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/wma/pedreschi.wmr.2012.02.pdf · Analisi di reti sociali - Aprile 2012 Web mining

If you connect individuals

that work with each other,

you will explore

the professional network.

CHOOSING A PROPER REPRESENTATION

Analisi di reti sociali - Aprile 2011

Page 7: Web mining and Social Network Analysis - didawiki.di.unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/wma/pedreschi.wmr.2012.02.pdf · Analisi di reti sociali - Aprile 2012 Web mining

If you connect those that

have a sexual relationship,

you will be exploring the

sexual networks.

CHOOSING A PROPER REPRESENTATION

Analisi di reti sociali - Aprile 2011

Page 8: Web mining and Social Network Analysis - didawiki.di.unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/wma/pedreschi.wmr.2012.02.pdf · Analisi di reti sociali - Aprile 2012 Web mining

If you connect individuals based on their first name

(all Peters connected to each other), you will be

exploring what?

It is a network, nevertheless.

CHOOSING A PROPER REPRESENTATION

Network Science: Graph Theory January 24, 2011

Analisi di reti sociali - Aprile 2011

Page 9: Web mining and Social Network Analysis - didawiki.di.unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/wma/pedreschi.wmr.2012.02.pdf · Analisi di reti sociali - Aprile 2012 Web mining

3

Aij =

0 1 1 0

1 0 1 1

1 1 0 0

0 1 0 0

Aii =0 Aij =Aji

L=1

2Aij

i, j=1

N

∑ <k>=2L

N

Aij =

0 1 0 0

0 0 1 1

1 0 0 0

0 0 0 0

Aii = 0 Aij ≠ A ji

L = Aiji, j=1

N

∑ < k >=L

N

GRAPHOLOGY 1

Undirected Directed

14

23

2

14

Actor network, protein-protein interactions WWW, citation networks

Analisi di reti sociali - Aprile 2011

Page 10: Web mining and Social Network Analysis - didawiki.di.unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/wma/pedreschi.wmr.2012.02.pdf · Analisi di reti sociali - Aprile 2012 Web mining

Aij =

0 1 1 0

1 0 1 1

1 1 0 0

0 1 0 0

Aii = 0 Aij = A ji

L =1

2Aij

i, j=1

N

∑ < k >=2L

N

Aij =

0 2 0.5 0

2 0 1 4

0.5 1 0 0

0 4 0 0

Aii = 0 Aij = A ji

L =1

2nonzero(Aij )

i, j=1

N

∑ < k >=2L

N

GRAPHOLOGY 2

Unweighted(undirected)

Weighted(undirected)

3

14

23

2

14

protein-protein interactions, www Call Graph, metabolic networks

Analisi di reti sociali - Aprile 2011

Page 11: Web mining and Social Network Analysis - didawiki.di.unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/wma/pedreschi.wmr.2012.02.pdf · Analisi di reti sociali - Aprile 2012 Web mining

Aij =

0 1 1 1

1 0 1 1

1 1 0 1

1 1 1 0

Aii = 0 Ai≠ j =1

L = Lmax =N(N −1)

2< k >= N −1

GRAPHOLOGY 4

Complete Graph(undirected)

3

14

2

Actor network, protein-protein interactions

Analisi di reti sociali - Aprile 2011

Page 12: Web mining and Social Network Analysis - didawiki.di.unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/wma/pedreschi.wmr.2012.02.pdf · Analisi di reti sociali - Aprile 2012 Web mining

Analisi di reti sociali - Aprile 2011

The key basic quantities

� Degree distribution: about connectivity

� what is the typical degree in the network?

� what is the overall distribution?

� Network diameter: about social distance

� maximum (worst-case) or average?

� exclude infinite distances? (disconnected components)

� the small-world phenomenon

� Clustering : about social transitivity

� to what extent that links tend to cluster “locally”?

� what is the balance between local and long-distance connections?

� what roles do the two types of links play?

� Connected components: about social partitioning

� how many, and how large?

Page 13: Web mining and Social Network Analysis - didawiki.di.unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/wma/pedreschi.wmr.2012.02.pdf · Analisi di reti sociali - Aprile 2012 Web mining

Degree distribution

� The degree of a vertex in a network is the number of edges incident on (i.e., connected to) that vertex.

� pk = the fraction of vertices in the network that have degree k.

� Equivalently, pk = the probability that a vertex chosen uniformly at random has degree k.

� A plot of pk for any given network can be formed by a histogram of the degrees of vertices.

� This histogram is the degree distribution for the network

Analisi di reti sociali - Aprile 2011

Page 14: Web mining and Social Network Analysis - didawiki.di.unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/wma/pedreschi.wmr.2012.02.pdf · Analisi di reti sociali - Aprile 2012 Web mining

Degree (k)

P(k

)k

Degree Distribution

Analisi di reti sociali - Aprile 2011

Page 15: Web mining and Social Network Analysis - didawiki.di.unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/wma/pedreschi.wmr.2012.02.pdf · Analisi di reti sociali - Aprile 2012 Web mining

Degree distribution

Degree distribution P(k): probability that

a randomly chosen vertex has degree k

Nk = # nodes with degree k

P(k) = Nk / N ➔➔➔➔ plot

k

P(k)

1 2 3 4

0.1

0.2

0.3

0.4

0.5

0.6

Analisi di reti sociali - Aprile 2011

Page 16: Web mining and Social Network Analysis - didawiki.di.unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/wma/pedreschi.wmr.2012.02.pdf · Analisi di reti sociali - Aprile 2012 Web mining

Size of Cities

Nu

mb

er

of

Cit

ies

Tokyo

∼30 million

New York,

Mexico City

∼15 million

4 x 8 million

cities

16 x 4 million

cities

P∼1/x

There is an equivalent number of people living in cities of all sizes!

Analisi di reti sociali - Aprile 2011

Page 17: Web mining and Social Network Analysis - didawiki.di.unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/wma/pedreschi.wmr.2012.02.pdf · Analisi di reti sociali - Aprile 2012 Web mining

∼ $50 billion

After Bill enters the arena the average income of the public ∼ ∼ ∼ ∼ USD $1,000,000

Analisi di reti sociali - Aprile 2011

Page 18: Web mining and Social Network Analysis - didawiki.di.unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/wma/pedreschi.wmr.2012.02.pdf · Analisi di reti sociali - Aprile 2012 Web mining

Degree distributions for six networks

Analisi di reti sociali - Aprile 2011

Page 19: Web mining and Social Network Analysis - didawiki.di.unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/wma/pedreschi.wmr.2012.02.pdf · Analisi di reti sociali - Aprile 2012 Web mining

Analisi di reti sociali - Aprile 2011

Actor Connectivity (power law)

Nodes: actors

Links: cast jointly

N = 212,250 actors

⟨⟨⟨⟨k⟩⟩⟩⟩ = 28.78

P(k) ~k-γγγγ

Days of Thunder (1990) Far and Away (1992) Eyes Wide Shut (1999)

γγγγ=2.3

Page 20: Web mining and Social Network Analysis - didawiki.di.unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/wma/pedreschi.wmr.2012.02.pdf · Analisi di reti sociali - Aprile 2012 Web mining

Analisi di reti sociali - Aprile 2011

Science Citation Index (power law)

(γγγγ = 3)

Nodes: papers

Links: citations

(S. Redner, 1998)

P(k) ~k-γγγγ

2212

25

1736 PRL papers (1988)

Witten-Sander

PRL 1981

Page 21: Web mining and Social Network Analysis - didawiki.di.unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/wma/pedreschi.wmr.2012.02.pdf · Analisi di reti sociali - Aprile 2012 Web mining

Analisi di reti sociali - Aprile 2011

Sex-Web (power law)

Nodes: people (Females; Males)

Links: sexual relationships

Liljeros et al. Nature 2001

4781 Swedes; 18-74;

59% response rate.

Page 22: Web mining and Social Network Analysis - didawiki.di.unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/wma/pedreschi.wmr.2012.02.pdf · Analisi di reti sociali - Aprile 2012 Web mining

A path is a sequence of nodes in which each node is adjacent to the next one

Pi0,in of length n between nodes i0 and in is an ordered collection of n+1 nodes and n links

Pn = {i0,i1,i2,...,in} Pn = {(i0 ,i1),(i1,i2 ),(i2 ,i3 ),...,(in−1,in )}

•A path can intersect itself and pass through the same

link repeatedly. Each time a link is crossed, it is counted

separately

•A legitimate path on the graph on the right:

ABCBCADEEBA

• In a directed network, the path can follow only the

direction of an arrow.

PATHS

A

B

C

D

E

Analisi di reti sociali - Aprile 2011

Page 23: Web mining and Social Network Analysis - didawiki.di.unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/wma/pedreschi.wmr.2012.02.pdf · Analisi di reti sociali - Aprile 2012 Web mining

A

B

Distance Between A and B?

Analisi di reti sociali - Aprile 2011

Page 24: Web mining and Social Network Analysis - didawiki.di.unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/wma/pedreschi.wmr.2012.02.pdf · Analisi di reti sociali - Aprile 2012 Web mining

The distance (shortest path, geodesic path) between two

nodes is defined as the number of edges along the shortest

path connecting them.

*If the two nodes are disconnected, the distance is infinity.

In directed graphs each path needs to follow the direction of

the arrows.

Thus in a digraph the distance from node A to B (on an AB

path) is generally different from the distance from node B to A

(on a BCA path).

Network Science: Graph Theory January 24, 2011

DISTANCE IN A GRAPH Shortest Path, Geodesic Path

DC

A

B

DC

A

B

Analisi di reti sociali - Aprile 2011

Page 25: Web mining and Social Network Analysis - didawiki.di.unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/wma/pedreschi.wmr.2012.02.pdf · Analisi di reti sociali - Aprile 2012 Web mining

Diameter: the maximum distance between any pair of nodes in the graph.

Average path length/distance for a direct connected graph (component)

or a strongly connected (component of a) digraph.

where lij is the distance from node i to node j

In an undirected (symmetrical) graph lij =lji, we only need to count them once

l ≡1

2Lmax

liji, j≠i

l ≡1

Lmax

liji, j>i

Network Science: Graph Theory January 24, 2011

NETWORK DIAMETER AND AVERAGE DISTANCE

Lmax =N

2

=N(N−1)

2

Analisi di reti sociali - Aprile 2011

Page 26: Web mining and Social Network Analysis - didawiki.di.unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/wma/pedreschi.wmr.2012.02.pdf · Analisi di reti sociali - Aprile 2012 Web mining

IT IS A SMALL WORLD

Analisi di reti sociali - Aprile 2011

Page 27: Web mining and Social Network Analysis - didawiki.di.unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/wma/pedreschi.wmr.2012.02.pdf · Analisi di reti sociali - Aprile 2012 Web mining

Stanley Milgram

160 people1 person

Analisi di reti sociali - Aprile 2011

Page 28: Web mining and Social Network Analysis - didawiki.di.unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/wma/pedreschi.wmr.2012.02.pdf · Analisi di reti sociali - Aprile 2012 Web mining

Stanley Milgram found that the average length

of the chain connecting the sender and receiver was of length

5.5.

But only a few chains were ever completed!

Analisi di reti sociali - Aprile 2011

Page 29: Web mining and Social Network Analysis - didawiki.di.unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/wma/pedreschi.wmr.2012.02.pdf · Analisi di reti sociali - Aprile 2012 Web mining

Clustering coefficient:

what portion of your neighbors are connected?

Node i with degree ki

Ci in [0,1]

CLUSTERING COEFFICIENT

Analisi di reti sociali - Aprile 2011

Page 30: Web mining and Social Network Analysis - didawiki.di.unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/wma/pedreschi.wmr.2012.02.pdf · Analisi di reti sociali - Aprile 2012 Web mining

Clustering coefficient: what portion of your

neighbors are connected?

Node i with degree ki

2

1

3

45

7

6

8

9

10

i=8: k8=2, e8=1, TOT=2*1/2=1 ➔ C8=1/1=1

CLUSTERING COEFFICIENT

Analisi di reti sociali - Aprile 2011

Page 31: Web mining and Social Network Analysis - didawiki.di.unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/wma/pedreschi.wmr.2012.02.pdf · Analisi di reti sociali - Aprile 2012 Web mining

Clustering coefficient: what portion of your

neighbors are connected?

Node i with degree ki

i=4: k4=4, e4=2, TOTAL=4*3/2=6 ➔ C4=2/6=1/3

CLUSTERING COEFFICIENT

2

1

3

45

7

6

8

9

10

Analisi di reti sociali - Aprile 2011

Page 32: Web mining and Social Network Analysis - didawiki.di.unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/wma/pedreschi.wmr.2012.02.pdf · Analisi di reti sociali - Aprile 2012 Web mining

Degree distribution: P(k)

Path length: l

Clustering coefficient:

KEY MEASURES

Analisi di reti sociali - Aprile 2011

Page 33: Web mining and Social Network Analysis - didawiki.di.unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/wma/pedreschi.wmr.2012.02.pdf · Analisi di reti sociali - Aprile 2012 Web mining

Analisi di reti sociali - Aprile 2011

Transitivity – the clustering coefficient

Page 34: Web mining and Social Network Analysis - didawiki.di.unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/wma/pedreschi.wmr.2012.02.pdf · Analisi di reti sociali - Aprile 2012 Web mining

Basic statisics for some published networks

Analisi di reti sociali - Aprile 2011

Page 35: Web mining and Social Network Analysis - didawiki.di.unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/wma/pedreschi.wmr.2012.02.pdf · Analisi di reti sociali - Aprile 2012 Web mining

The giant connected component

Analisi di reti sociali - Aprile 2011

Page 36: Web mining and Social Network Analysis - didawiki.di.unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/wma/pedreschi.wmr.2012.02.pdf · Analisi di reti sociali - Aprile 2012 Web mining

Analisi di reti sociali - Aprile 2011

A “Canonical” Natural Network has…

� Few connected components:

� often only 1 or a small number, indep. of network size

� Small diameter:

� often a constant independent of network size (like 6)

� or perhaps growing only logarithmically with network size or even shrink?

� typically exclude infinite distances

� A high degree of clustering:

� considerably more so than for a random network

� in tension with small diameter

� A heavy-tailed degree distribution:

� a small but reliable number of high-degree vertices

� often of power law form