
Page 1: Five Lectures on Networks - Tuvalutuvalu.santafe.edu/~aaronc/slides/Clauset_CSSS2014_Networks_3.pdf · Pajek data sets! Linkgroup’s list of network data sets! Barabasi lab data


Aaron Clauset
@aaronclauset
Assistant Professor of Computer Science, University of Colorado Boulder
External Faculty, Santa Fe Institute

Five Lectures on Networks

© 2014 Aaron Clauset

lecture 3: describing network position and understanding assortative mixing


http://santafe.edu/~aaronc/courses/5352/

Network Analysis and Modeling
Instructor: Aaron Clauset

This graduate-level course will examine modern techniques for analyzing and modeling the structure and dynamics of complex networks. The focus will be on statistical algorithms and methods, and both lectures and assignments will emphasize model interpretability and understanding the processes that generate real data. Applications will be drawn from computational biology and computational social science. No biological or social science training is required. (Note: this is not a scientific computing course, but there will be plenty of computing for science.)

Full lecture notes online (~150 pages in PDF)


1. defining a network
2. describing a network
3. null models for networks
4. statistical inference


describing networks

position


position = centrality: measure of positional “importance”

describing networks

[bracket labels on the slide: geometric, connectivity]

harmonic centrality
closeness centrality
betweenness centrality
degree centrality
eigenvector centrality
PageRank
Katz centrality
many many more…

Boldi & Vigna, arXiv:1308.2140 (2013)
Borgatti, Social Networks 27, 55–71 (2005)


position = centrality: harmonic, closeness centrality

importance = being in “center” of the network

describing networks

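harmonic:

$$ c_i = \frac{1}{n-1} \sum_{j \neq i} \frac{1}{d_{ij}} $$

distance:

$$ d_{ij} = \begin{cases} \ell_{ij} & \text{if } j \text{ reachable from } i \\ \infty & \text{otherwise} \end{cases} $$

where $\ell_{ij}$ is the length of the shortest path.

Boldi & Vigna, arXiv:1308.2140 (2013)
Borgatti, Social Networks 27, 55–71 (2005)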
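The harmonic centrality formula on this slide can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: it assumes an unweighted graph stored as a dictionary of neighbor lists, and the function name `harmonic_centrality` and the toy graph are hypothetical. Unreachable nodes have infinite distance, so they contribute 1/∞ = 0 to the sum.

```python
from collections import deque

def harmonic_centrality(adj):
    """Return {node: c_i} with c_i = (1 / (n - 1)) * sum over j != i of 1 / d_ij."""
    n = len(adj)
    scores = {}
    for source in adj:
        # Breadth-first search yields shortest-path lengths in an unweighted graph.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        # Nodes never reached are absent from dist; their 1/d_ij term is 0.
        total = sum(1.0 / d for node, d in dist.items() if node != source)
        scores[source] = total / (n - 1) if n > 1 else 0.0
    return scores

# Hypothetical toy graph: a path a - b - c plus an isolated node d.
example = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "d": []}
scores = harmonic_centrality(example)
```

In the toy graph the middle node `b` scores highest, and the isolated node `d` scores zero rather than breaking the computation, which is the practical advantage of summing inverse distances.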


position = centrality: PageRank, Katz, eigenvector centrality

importance = sum of importances of nodes that point at you

describing networks

$$ I_i = \sum_{j \to i} \frac{I_j}{k_j} $$ *

or, the left eigenvector of

$$ A x = \lambda x $$

*modulo several technical details

Boldi & Vigna, arXiv:1308.2140 (2013)
Borgatti, Social Networks 27, 55–71 (2005)
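The recursive definition of importance can be illustrated by repeated substitution (power iteration). The sketch below is an assumption-laden toy, not the lecture's implementation: it takes $k_j$ to be the out-degree of node $j$, requires every node to have at least one outgoing link, and assumes the graph is strongly connected and aperiodic so the iteration settles. Damping and dangling-node handling, which are among the "technical details", are deliberately left out. The name `importance_scores` and the toy graph are hypothetical.

```python
def importance_scores(out_links, iterations=200):
    """Iterate I_i = sum over j -> i of I_j / k_j, starting from uniform scores."""
    nodes = list(out_links)
    score = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(iterations):
        new_score = {v: 0.0 for v in nodes}
        for j in nodes:
            # Node j splits its importance equally among the k_j nodes it points at.
            share = score[j] / len(out_links[j])
            for i in out_links[j]:
                new_score[i] += share
        score = new_score
    return score

# Hypothetical toy graph: a -> b, a -> c, b -> c, c -> a.
toy = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores_pr = importance_scores(toy)
```

Solving the three equations by hand for this toy graph gives I_a = I_c = 0.4 and I_b = 0.2, which the iteration approaches; the total importance stays at 1 because each node passes on exactly what it holds.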


network position

Giovanni de Medici

an example


network position

Giovanni de Medici
Palazzo Medici
Duomo

1993


[Figure: marriage network with node labels Pazzi, Salviati, Medici, Acciaiuoli, Albizzi, Ginori, Tornabuoni, Guadagni, Lamberteschi, Barbadori, Ridolfi, Castellani, Strozzi, Peruzzi, Bischeri]

nodes: Florence families
edges: inter-family marriages

which family is most central?

network position: closeness


network position: closeness

Medici        9.5
Guadagni      7.92
Albizzi       7.83
Strozzi       7.67
Ridolfi       7.25
Bischeri      7.2
Tornabuoni    7.17
Barbadori     7.08
Peruzzi       6.87
Castellani    6.87
Salviati      6.58
Acciaiuoli    5.92
Ginori        5.33
Lamberteschi  5.28
Pazzi         4.77

[Figure: the marriage network, nodes labeled by abbreviated family name]


network position

actually, it’s complicated...


network position

an example

FIG. 1: (a) Commuter rail network in the Boston area. The arrow marks the assumed root of the network. (b) Star graph. (c) Minimum spanning tree. (d) The model of Eq. (3) applied to the same set of stations.

of the networks to two theoretical models that are each optimal by one of these two criteria. If one is interested solely in short, efficient paths to the root vertex then the optimal network is the “star graph,” in which every vertex is connected directly to the root by a single straight edge (see Fig. 1b). Conversely, if one is interested solely in minimizing total edge length, then the optimal network is the minimum spanning tree (MST) (see Fig. 1c). (Given a set of n vertices at specified points on a flat plane, the MST is the set of n − 1 edges joining them such that all vertices belong to a single component and the sum of the lengths of the edges is minimized [17].)

To make the comparison with the star graph, we consider the distance from each non-root vertex to the root first along the edges of the network and second along a simple Euclidean straight line, and calculate the mean ratio of these two distances over all such vertices. Following Ref. [10], we refer to this quantity as the network’s route factor, and denote it q:

$$ q = \frac{1}{n} \sum_{i=1}^{n} \frac{l_{i0}}{d_{i0}}, \qquad (1) $$

where $l_{i0}$ is the distance along the edges of the network from vertex i to the root (which has label 0), and $d_{i0}$ is the direct Euclidean distance. If there is more than one path through the network to the root, we take the shortest one. Thus, for example, q = 2 would imply that on average the shortest path from a vertex to the root through the network is twice as long as a direct straight-line connection. The smallest possible value of the route factor is 1, which is achieved by the star graph.

The route factors for our four networks are shown in Table I. As we can see, the networks are remarkably efficient in this sense, with route factors quite close to 1. Values range from q = 1.13 for the Western Australian gas pipelines to q = 1.59 for the sewer system.

We also show in Table I the total edge lengths for each of our networks, along with the edge lengths for the MST on the same set of vertices and, as the table shows, we again find that our real-world networks are competitive

                          route factor       edge length (km)
network            n      actual   MST      actual    MST      star
sewer system    23 922     1.59    2.93        498     421   102 998
gas (WA)           226     1.13    1.82      5 578   4 374   245 034
gas (IL)           490     1.48    2.42      6 547   4 009    59 595
rail               126     1.14    1.61        559     499     3 272

TABLE I: Number of vertices n, route factor q, and total edge length for each of the networks described in the text, along with the equivalent results for the star graphs and minimum spanning trees on the same vertices. (Note that the route factor for the star graph is always 1 and so has been omitted from the table.)

with the optimal model, the combined edge lengths of the real networks ranging from 1.12 to 1.63 times those of the corresponding MSTs.

But now consider the remaining two columns in the table, which give the route factors for the MSTs and the total edge lengths for the star graphs. As the table shows, these figures are for all networks much poorer than the optimal case and, more importantly, much poorer than the real-world networks too. Thus, although the MST is optimal in terms of total edge length it is very poor in terms of route factor and the reverse is true for the star graph. Neither of these model networks would be a good general solution to the problem of building an efficient and economical distribution network. Real-world networks, on the other hand, appear to find a remarkably good compromise between the two extremes, possessing simultaneously the benefits of both the star graph and the minimum spanning tree, without any of the flaws. In the remainder of the paper we consider mechanisms by which this might occur.

The networks we are dealing with are not, by and large, designed from the outset for global optimality (or near-optimality) of either their total edge length or their route factors. Instead, they form by growing outward from the root, as the population they serve swells and infrastructure is extended and improved. To explore the possibilities of this process we consider a situation in which

Boston commuter rail

data

Gastner & Newman, J. Stat. Mech. P01015 (2006)

559 km of rail


network position


Boston commuter rail

data: 559 km of rail
minimum travel time: 3272 km of rail

Gastner & Newman, J. Stat. Mech. P01015 (2006)

an example


network position


Boston commuter rail

data: 559 km of rail
minimum travel time: 3272 km of rail
minimum weight tree: 499 km of rail

Gastner & Newman, J. Stat. Mech. P01015 (2006)

an example


network position


Gastner & Newman, J. Stat. Mech. P01015 (2006)

route factor

mean ratio of distance along edges to direct Euclidean distance to root

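$$ q = \frac{1}{n} \sum_{i=1}^{n} \frac{\ell_{i0}}{d_{i0}} $$

[Figure annotations: $\ell_{i0}$, $d_{i0}$, root 0]

data: 559 km, q = 1.14
minimum travel time: 3272 km, q = 1
minimum weight tree: 499 km, q = 1.61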
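The route factor is straightforward to compute once vertex coordinates and edges are known. The following is a rough sketch under stated assumptions, not the authors' code: vertices are given as a hypothetical dictionary of planar coordinates, edge lengths are taken to be straight-line distances between endpoints, the network is connected, and no vertex coincides with the root. Route lengths to the root come from Dijkstra's algorithm; the function name `route_factor` and the three-station example are made up for illustration.

```python
import heapq
import math

def route_factor(coords, edges, root=0):
    """Mean over non-root vertices of (route length to root) / (Euclidean distance to root)."""
    adj = {v: [] for v in coords}
    for u, v in edges:
        w = math.dist(coords[u], coords[v])  # edge length = straight-line distance
        adj[u].append((v, w))
        adj[v].append((u, w))
    # Dijkstra's algorithm from the root gives the shortest route length l_i0.
    route = {root: 0.0}
    heap = [(0.0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > route.get(u, math.inf):
            continue  # stale heap entry
        for v, w in adj[u]:
            nd = d + w
            if nd < route.get(v, math.inf):
                route[v] = nd
                heapq.heappush(heap, (nd, v))
    ratios = [route[v] / math.dist(coords[v], coords[root])
              for v in coords if v != root]
    return sum(ratios) / len(ratios)

# Hypothetical stations: root 0 at the origin and two other stops.
stations = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (1.0, 1.0)}
q_star = route_factor(stations, [(0, 1), (0, 2)])   # every stop wired to the root
q_chain = route_factor(stations, [(0, 1), (1, 2)])  # stop 2 reached via stop 1
```

The star layout gives q = 1, the minimum possible value, while the chain forces stop 2 onto a detour and pushes q above 1, mirroring the contrast between the 3272 km and 499 km layouts on the slide.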


network position


Gastner & Newman, J. Stat. Mech. P01015 (2006)

a simple model

embed n vertices in a plane
until all vertices connected: add edge (i, j) with minimum value for

$$ w_{ij} = d_{ij} + \beta \ell_{j0} $$

$d_{ij}$: distance from i to j
$\beta$: parameter
$\ell_{j0}$: route length to root

$\beta = 0$: minimum spanning tree*
$\beta > 0$: prefer shorter paths to root

*this is exactly Prim’s algorithm for MSTs

data: 559 km, q = 1.14
minimum travel time: 3272 km, q = 1
minimum weight tree: 499 km, q = 1.61
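The growth rule on this slide can be sketched as a greedy loop. This is an illustrative reading of the slide, not the paper's implementation: it assumes the network starts from the root alone, that i ranges over vertices not yet connected and j over vertices already in the network, and that the weight parameter is called `beta` (the symbol is garbled in the transcript). The function name `grow_network` and the collinear test points are hypothetical. With `beta = 0` the loop reduces to Prim's algorithm.

```python
import math

def grow_network(coords, beta, root=0):
    """Repeatedly add the edge (i, j) minimizing d_ij + beta * l_j0 until all are connected."""
    route = {root: 0.0}              # l_j0 for vertices already in the network
    edges = []
    remaining = set(coords) - {root}
    while remaining:
        best = None
        for i in sorted(remaining):  # sorted for a deterministic tie-break
            for j in route:
                d_ij = math.dist(coords[i], coords[j])
                w = d_ij + beta * route[j]
                if best is None or w < best[0]:
                    best = (w, i, j, d_ij)
        _, i, j, d_ij = best
        edges.append((i, j))
        route[i] = route[j] + d_ij   # route length from the new vertex to the root
        remaining.remove(i)
    return edges, route

def total_length(coords, edges):
    """Sum of straight-line edge lengths."""
    return sum(math.dist(coords[i], coords[j]) for i, j in edges)

# Hypothetical points on a line, root at the origin.
points = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (2.0, 0.0), 3: (3.0, 0.0)}
edges_mst, _ = grow_network(points, beta=0.0)   # chain: shortest total length
edges_star, _ = grow_network(points, beta=5.0)  # every vertex wired to the root
```

On these four points, `beta = 0` builds the chain with total length 3, while a large `beta` makes any detour through an intermediate vertex too costly and produces the star with total length 6. Intermediate values trade total edge length against route length to the root.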


network position


Gastner & Newman, J. Stat. Mech. P01015 (2006)

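a simple model

embed $n$ vertices in a plane

until all vertices connected: add edge $(i, j)$ with minimum value for $w_{ij} = d_{ij} + \beta \ell_{j0}$

[figure: three versions of the rail network. actual: 559 km, q = 1.14; star graph: 3272 km, q = 1; MST: 499 km, q = 1.61]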

FIG. 3: Route factor q and average edge length l as a function of β for our second model (n = 10 000). Inset: an example model network with β = 0.4.

work. Values of q in the range 1.1 to 1.6 observed in the real-world networks are easily achieved.

When we look at the shape of the network itself however (see figure inset), we get quite a different story. This model produces a symmetric network that fills space out to some approximately constant radius from the root, not unlike the clusters produced by the well-known Eden growth model [15]. The second term in Eq. (3) makes it economically disadvantageous to build connections to outlying areas before closer areas have been connected. Thus all vertices within a given distance of the root are served by the network, without gaps, which is a more realistic situation than the dendritic network of Fig. 2.

And this in fact may be the secret of how low route factors are achieved in reality. Our second model, unlike our first, does not explicitly aim to optimize the route factor. But it does a creditable job nonetheless, precisely because it fills space radially. The main trunk lines in the network are forced to be approximately straight simply because the space to either side of them has already been filled and there’s nowhere else to go but outwards.
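A minimal sketch of the greedy growth rule may make the mechanism concrete. It assumes the weight shown on the slide, w_ij = d_ij + β l_j0, and assumes that at each step i ranges over vertices not yet in the network while j ranges over vertices already connected to the root; the function and variable names are illustrative, and the brute-force search is only suitable for small examples.

```python
import math


def grow_network(positions, beta, root=0):
    """Greedy growth sketch: attach vertices one at a time to the root's component.

    At each step, pick the unconnected vertex i and connected vertex j that
    minimize d_ij + beta * l_j0, where d_ij is the Euclidean distance and
    l_j0 is the distance from j to the root along existing edges.
    Returns (parent, path_len): each vertex's attachment point and its l_i0.
    """
    path_len = {root: 0.0}  # l_j0 for every vertex already in the network
    parent = {}             # parent[i] = j records the edge (i, j)
    remaining = set(positions) - {root}
    while remaining:
        best = None
        for i in remaining:
            for j in path_len:
                d_ij = math.dist(positions[i], positions[j])
                weight = d_ij + beta * path_len[j]
                if best is None or weight < best[0]:
                    best = (weight, i, j, d_ij)
        _, i, j, d_ij = best
        parent[i] = j
        path_len[i] = path_len[j] + d_ij
        remaining.remove(i)
    return parent, path_len


# Hypothetical vertex coordinates, with vertex 0 as the root.
points = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (2.0, 1.0), 3: (3.0, 1.0)}
chain_parent, chain_len = grow_network(points, beta=0.0)
star_parent, star_len = grow_network(points, beta=1000.0)
print(chain_parent, star_parent)
```

With β = 0 the rule reduces to Prim's algorithm and builds a minimum spanning tree; with a very large β every vertex attaches directly to the root, giving the star graph. Intermediate values interpolate between the two extremes.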

Readers familiar with urban geography may argue that real networks, and the towns they serve, are dendritic in form. And this is true, but it is primarily a consequence of other factors, such as ribbon development along highways. In other words, the initial distribution of vertices in real networks is usually non-uniform, unlike our model. It is interesting to see therefore what happens if we apply our model to a realistic scatter of points, and in Fig. 1d we have done this for the stations of the Boston rail system. The figure shows the network generated by our second model for β = 0.4 given the real-world positions of the stations. The result is, with only a couple of exceptions, identical to the true rail network, with a comparable route factor of 1.11 and total edge length 511 km.

To summarize, we have in this paper studied spatial distribution or collection networks such as pipelines and sewers, focusing particularly on their cost in terms of total edge length and their efficiency in terms of the network distance between vertices, as measured by the so-called route factor. While these two quantities are, to some extent, at odds with one another, the first being decreased only at the expense of an increase in the second, our empirical observations indicate that real-world networks find good compromise solutions giving nearly optimal values of both. We have presented two models of spatial networks based on greedy optimization strategies that reproduce this behavior well, showing how networks possessing simultaneously good route factors and low total edge length can be generated by plausible growth mechanisms.

The results presented represent only a fraction of the possibilities in this area. Numerous other networks fall into the class studied here, including various utility, transportation, or shipping networks, as well as some biological networks, such as the circulatory system, fungal mycelia, and others, and we hope that researchers will feel encouraged to investigate these interesting systems.

The authors thank Jonathan Goodwin and Sean Doherty for the pipeline network data and the staff of the University of Michigan’s Numeric and Spatial Data Services for their help. This work was funded in part by the National Science Foundation under grant number DMS–0234188 and by the James S. McDonnell Foundation.

[1] R. Albert and A.-L. Barabasi, Statistical mechanics of complex networks. Rev. Mod. Phys. 74, 47–97 (2002).

[2] S. N. Dorogovtsev and J. F. F. Mendes, Evolution of networks. Advances in Physics 51, 1079–1187 (2002).

[3] M. E. J. Newman, The structure and function of complex networks. SIAM Review 45, 167–256 (2003).

[4] S. H. Yook, H. Jeong, and A.-L. Barabasi, Modeling the Internet’s large-scale topology. Proc. Natl. Acad. Sci. USA 99, 13382–13386 (2001).

[5] S. P. Gorman and R. Kulkarni, Spatial small worlds: New geographic patterns for an information economy. Preprint cond-mat/0310426 (2003).

[6] R. Guimera, S. Mossa, A. Turtschi, and L. A. N. Amaral, Structure and efficiency of the world-wide airport network. Preprint cond-mat/0312535 (2003).

[7] G. Csanyi and B. Szendroi, The fractal/small-world dichotomy in real-world networks. Preprint cond-mat/0406070 (2004).

[8] M. Kaiser and C. C. Hilgetag, Spatial growth of real-world networks. Phys. Rev. E 69, 036103 (2004).

[9] M. T. Gastner and M. E. J. Newman, The spatial structure of networks. Preprint cond-mat/0407680 (2004).

[10] W. R. Black, Transportation: A Geographical Analysis. Guilford Press, New York, NY (2003).

[11] T. A. Witten and L. M. Sander, Diffusion-limited aggre-


FIG. 1: (a) Commuter rail network in the Boston area. The arrow marks the assumed root of the network. (b) Star graph. (c) Minimum spanning tree. (d) The model of Eq. (3) applied to the same set of stations.

of the networks to two theoretical models that are each optimal by one of these two criteria. If one is interested solely in short, efficient paths to the root vertex then the optimal network is the “star graph,” in which every vertex is connected directly to the root by a single straight edge (see Fig. 1b). Conversely, if one is interested solely in minimizing total edge length, then the optimal network is the minimum spanning tree (MST) (see Fig. 1c). (Given a set of n vertices at specified points on a flat plane, the MST is the set of n − 1 edges joining them such that all vertices belong to a single component and the sum of the lengths of the edges is minimized [17].)

To make the comparison with the star graph, we consider the distance from each non-root vertex to the root first along the edges of the network and second along a simple Euclidean straight line, and calculate the mean ratio of these two distances over all such vertices. Following Ref. [10], we refer to this quantity as the network’s route factor, and denote it q:

$$q = \frac{1}{n} \sum_{i=1}^{n} \frac{l_{i0}}{d_{i0}}, \qquad (1)$$

where $l_{i0}$ is the distance along the edges of the network from vertex i to the root (which has label 0), and $d_{i0}$ is the direct Euclidean distance. If there is more than one path through the network to the root, we take the shortest one. Thus, for example, q = 2 would imply that on average the shortest path from a vertex to the root through the network is twice as long as a direct straight-line connection. The smallest possible value of the route factor is 1, which is achieved by the star graph.
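As an illustration of Eq. (1), the sketch below computes the route factor for a small spatial network, using Dijkstra's algorithm for the shortest network distance to the root. The function name and the example coordinates are hypothetical; the sketch assumes every vertex is connected to the root and that no vertex sits exactly on top of the root.

```python
import heapq
import math


def route_factor(positions, edges, root=0):
    """Mean over non-root vertices of (network distance) / (Euclidean distance) to the root."""
    # Adjacency list in which each edge is weighted by its Euclidean length.
    adjacency = {v: [] for v in positions}
    for u, v in edges:
        length = math.dist(positions[u], positions[v])
        adjacency[u].append((v, length))
        adjacency[v].append((u, length))

    # Dijkstra's algorithm from the root gives l_i0 for every vertex i.
    network_dist = {root: 0.0}
    heap = [(0.0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > network_dist.get(u, math.inf):
            continue  # stale heap entry
        for v, length in adjacency[u]:
            candidate = d + length
            if candidate < network_dist.get(v, math.inf):
                network_dist[v] = candidate
                heapq.heappush(heap, (candidate, v))

    ratios = [
        network_dist[v] / math.dist(positions[v], positions[root])
        for v in positions
        if v != root
    ]
    return sum(ratios) / len(ratios)


# Hypothetical three-vertex example with the root at the origin.
points = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (1.0, 1.0)}
q_star = route_factor(points, [(0, 1), (0, 2)])
q_path = route_factor(points, [(0, 1), (1, 2)])
print(q_star, q_path)
```

The star graph on these points gives q = 1, the minimum possible value, while routing vertex 2 through vertex 1 raises q above 1 because that vertex's path is longer than its straight-line distance.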


[figure: model network for the Boston rail stations with β = 0.4: 511 km, q = 1.11]

Page 20

network position

most centralized

most decentralized

vast wilderness of in-between

Page 24

network position

positions:

• geometric description of network structure
• core vs. periphery
• centrality = importance, influence

open questions:

• position and dynamics
• what does position predict?
• when does position not matter?


[figure: network diagram with vertex labels P, Sal, Medici, Ac, Alb, Gi, T, Gu, L, B, RC, Str, Pe, B]

Page 25

describing networks

homophily and assortative mixing: like links with like

Page 26

assortative mixing

homophily and assortative mixing: like links with like

assortativity coefficient $r$ quantifies homophily

three types:

• scalar attributes
• vertex degrees
• categorical variables

[figure: an edge joining two vertices with vertex attributes $\vec{v}_i$ and $\vec{v}_j$]

Newman, Phys. Rev. E 67, 026126 (2003).

Page 27

assortative mixing

homophily and assortative mixing: like links with like

scalar attributes: mean value across ties

$$\mu = \frac{1}{2m} \sum_i \sum_j A_{ij} v_i = \frac{1}{2m} \sum_i k_i v_i$$

Newman, Phys. Rev. E 67, 026126 (2003).

Page 28

assortative mixing

homophily and assortative mixing: like links with like

scalar attributes: covariance across ties

$$\mathrm{cov}(v_i, v_j) = \frac{\sum_{ij} A_{ij} (v_i - \mu)(v_j - \mu)}{\sum_{ij} A_{ij}} = \frac{1}{2m} \sum_{ij} A_{ij} v_i v_j - \mu^2 = \frac{1}{2m} \sum_{ij} \left( A_{ij} - \frac{k_i k_j}{2m} \right) v_i v_j$$

where

$$\mu = \frac{1}{2m} \sum_i k_i v_i$$

Newman, Phys. Rev. E 67, 026126 (2003).

Page 29

assortative mixing

homophily and assortative mixing: like links with like

assortativity coefficient (scalar)

[this is just a Pearson correlation across edges]

$$r = \frac{\mathrm{cov}(v_i, v_j)}{\mathrm{var}(v_i, v_j)} = \frac{\sum_{ij} \left( A_{ij} - k_i k_j / 2m \right) v_i v_j}{\sum_{ij} \left( k_i \delta_{ij} - k_i k_j / 2m \right) v_i v_j}$$

$$-1 \le r \le 1$$

Newman, Phys. Rev. E 67, 026126 (2003).
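A small sketch of the scalar assortativity coefficient, written directly from the adjacency-matrix form r = Σ_ij (A_ij − k_i k_j/2m) v_i v_j / Σ_ij (k_i δ_ij − k_i k_j/2m) v_i v_j. It assumes an undirected network given as a symmetric 0/1 adjacency matrix (a list of lists) and one scalar value per vertex; the function name and the toy data are illustrative only.

```python
def scalar_assortativity(adjacency, values):
    """Pearson correlation of a scalar vertex attribute across the ends of edges."""
    n = len(adjacency)
    degrees = [sum(row) for row in adjacency]
    two_m = sum(degrees)  # 2m: twice the number of edges

    # sum_ij A_ij v_i v_j
    edge_term = sum(
        adjacency[i][j] * values[i] * values[j]
        for i in range(n)
        for j in range(n)
    )
    # sum_ij (k_i k_j / 2m) v_i v_j = (sum_i k_i v_i)^2 / 2m
    null_term = sum(k * v for k, v in zip(degrees, values)) ** 2 / two_m
    # sum_ij k_i delta_ij v_i v_j = sum_i k_i v_i^2
    self_term = sum(k * v * v for k, v in zip(degrees, values))

    return (edge_term - null_term) / (self_term - null_term)


# Two disjoint edges whose endpoints share the same value: perfectly assortative.
pairs_adjacency = [
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]
r_pairs = scalar_assortativity(pairs_adjacency, [1, 1, 5, 5])

# A star graph scored by degree: the hub only touches leaves, so r = -1.
star_adjacency = [
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
r_star = scalar_assortativity(star_adjacency, [3, 1, 1, 1])
print(r_pairs, r_star)
```

The denominator vanishes when the attribute has no variance across edge ends (for example, identical values everywhere), in which case r is undefined and this sketch raises a `ZeroDivisionError`.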

Page 30

assortative mixing

Newman, Phys. Rev. E 67, 026126 (2003).

[figure: top panel, age of wife [years] against age of husband [years]; bottom panel, number against age difference [years]]

FIG. 1: Top: scatter plot of the ages of 1141 married couples at time of marriage, from the 1995 US National Survey of Family Growth [37]. Bottom: a histogram of the age differences (male minus female) for the same data.

In Fig. 1 (top panel) we show a scatter plot of the ages of marriage partners in the 1995 US National Survey of Family Growth [37]. As is clear from the figure, there is a strong positive correlation between the ages, with most of the density in the distribution lying along a rough diagonal in the plot; people, it appears, prefer to marry others of about the same age, although there is some bias towards husbands being older than their wives. In the bottom panel of the same figure we show a histogram of the age differences in the study, which emphasizes the same conclusion [76].

By analogy with the developments of Section II, we can define a quantity $e_{xy}$, which is the fraction of all edges in the network that join together vertices with values x and y for the age or other scalar variable of interest. The values x and y might be either discrete in nature (e.g., integers, such as age to the nearest year) or continuous (exact age), making $e_{xy}$ either a matrix or a function of two continuous variables. Here, for simplicity, we concentrate on the discrete case, but generalization to the continuous case is straightforward.

As before, we can use the matrix $e_{xy}$ to define a measure of assortativity. We first note that $e_{xy}$ satisfies the sum rules

$$\sum_{xy} e_{xy} = 1, \qquad \sum_{y} e_{xy} = a_x, \qquad \sum_{x} e_{xy} = b_y, \qquad (20)$$

where $a_x$ and $b_y$ are, respectively, the fraction of edges that start and end at vertices with values x and y. (On an undirected, unipartite graph, $a_x = b_x$.) Then, if there is no assortative mixing $e_{xy} = a_x b_y$. If there is assortative mixing one can measure it by calculating the standard Pearson correlation coefficient thus:

$$r = \frac{\sum_{xy} xy \, (e_{xy} - a_x b_y)}{\sigma_a \sigma_b}, \qquad (21)$$

where $\sigma_a$ and $\sigma_b$ are the standard deviations of the distributions $a_x$ and $b_y$. The value of r lies in the range −1 ≤ r ≤ 1, with r = 1 indicating perfect assortativity and r = −1 indicating perfect disassortativity (i.e., perfect negative correlation between x and y). For the age data from Fig. 1, for example, we find that r = 0.574, indicating strong assortative mixing once more.
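To see how Eq. (21) is evaluated in practice, the sketch below builds e_xy from a list of (x, y) pairs, one pair per edge, forms the marginals a_x and b_y, and computes r. The function name is illustrative and the sample pairs are invented for demonstration; they are not the survey data.

```python
import math
from collections import Counter


def pearson_from_pairs(pairs):
    """Evaluate Eq. (21) from a list of (x, y) values at the two ends of each edge."""
    total = len(pairs)
    # e_xy: fraction of edges joining a vertex with value x to one with value y.
    e = {xy: count / total for xy, count in Counter(pairs).items()}

    # Marginals a_x = sum_y e_xy and b_y = sum_x e_xy.
    a = Counter()
    b = Counter()
    for (x, y), fraction in e.items():
        a[x] += fraction
        b[y] += fraction

    def mean(dist):
        return sum(value * p for value, p in dist.items())

    def std(dist):
        mu = mean(dist)
        return math.sqrt(sum(p * (value - mu) ** 2 for value, p in dist.items()))

    # sum_xy xy (e_xy - a_x b_y) = sum_xy xy e_xy - (sum_x x a_x)(sum_y y b_y)
    covariance = sum(x * y * f for (x, y), f in e.items()) - mean(a) * mean(b)
    return covariance / (std(a) * std(b))


# Hypothetical (husband age, wife age) pairs, for illustration only.
r_linear = pearson_from_pairs([(20, 22), (30, 32), (40, 42)])
r_reversed = pearson_from_pairs([(1, 3), (2, 2), (3, 1)])
print(r_linear, r_reversed)
```

A perfectly linear increasing relationship gives r = 1 and a perfectly decreasing one gives r = −1, matching the stated range of the coefficient.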

One can construct in a straightforward manner a random graph model of a network with this type of mixing exactly analogous to the model presented in Section II B. It is also possible to generate random representative networks from the ensemble defined by $e_{xy}$ using the algorithm described in Section II C. In this paper however, rather than working further on the general type of mixing described here, we will concentrate on one special example of assortative mixing by a scalar property which is particularly important for many of the networks we are interested in, namely mixing by vertex degree.

A. Mixing by vertex degree

In general, scalar assortative mixing of the type described above requires that the vertices of the network of interest have suitable scalar properties attached to them, such as age or income in social networks. In many cases, however, data are not available for these properties to allow us to assess whether the network is assortatively mixed. But there is one scalar vertex property that is always available for every network, and that is vertex degree. So long as we know the network structure we always know the degree of a vertex, and then we can ask whether vertices of high degree preferentially associate with other vertices of high degree. Do the gregarious people hang out with other gregarious people? This has been a topic of considerable discussion in the physics literature [38, 39, 40, 41, 42]. As we will show, many real-world networks do show significant assortative (or disassortative) mixing by vertex degree.

Assortative mixing by degree can be quantified in exactly the same way as for other scalar properties of vertices, using Eq. (21). Taking the example of an undirected network and using the notation of Ref. 22, we

(top) scatter plot of ages of 1141 married couples at time of marriage [1995 US National Survey of Family Growth]

(bottom) histogram of age differences (M-F) for same data

r = 0.574: strongly assortative

Page 31

assortative mixing

homophily and assortative mixing: like links with like

degree: just another scalar, with $v_i = k_i$ and $v_j = k_j$ *

* the assortativity coefficient formula simplifies somewhat in this case; see Newman, Phys. Rev. E 67, 026126 (2003) for more details.
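One way to see the simplification for degree: substituting v_i = k_i into the scalar formula and multiplying numerator and denominator by 2m = Σ_i k_i gives r = (S_1 S_e − S_2²) / (S_1 S_3 − S_2²), where S_p = Σ_i k_i^p and S_e = Σ_ij A_ij k_i k_j. The sketch below implements that form for an undirected simple graph given as an edge list; the notation S_1, S_2, S_3, S_e and the function name are introduced here for illustration and do not appear on the slide.

```python
def degree_assortativity(edges):
    """Degree assortativity of an undirected simple graph given as (u, v) edges."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1

    s1 = sum(degree.values())                  # sum_i k_i = 2m
    s2 = sum(k ** 2 for k in degree.values())  # sum_i k_i^2
    s3 = sum(k ** 3 for k in degree.values())  # sum_i k_i^3
    # sum_ij A_ij k_i k_j counts every undirected edge in both directions.
    se = 2 * sum(degree[u] * degree[v] for u, v in edges)

    return (s1 * se - s2 ** 2) / (s1 * s3 - s2 ** 2)


r_star = degree_assortativity([(0, 1), (0, 2), (0, 3)])  # hub with three leaves
r_path = degree_assortativity([(0, 1), (1, 2), (2, 3)])  # path on four vertices
print(r_star, r_path)
```

For a regular graph, where every vertex has the same degree, the denominator is zero and r is undefined; this sketch raises a `ZeroDivisionError` in that case.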

Page 32

assortative mixing

Newman, Phys. Rev. E 67, 026126 (2003).

group           network                      type          size n    assortativity r   error σ_r   ref.
social          physics coauthorship         undirected       52 909       0.363         0.002      a
social          biology coauthorship         undirected    1 520 251       0.127         0.0004     a
social          mathematics coauthorship     undirected      253 339       0.120         0.002      b
social          film actor collaborations    undirected      449 913       0.208         0.0002     c
social          company directors            undirected        7 673       0.276         0.004      d
social          student relationships        undirected          573      −0.029         0.037      e
social          email address books          directed         16 881       0.092         0.004      f
technological   power grid                   undirected        4 941      −0.003         0.013      g
technological   Internet                     undirected       10 697      −0.189         0.002      h
technological   World-Wide Web               directed        269 504      −0.067         0.0002     i
technological   software dependencies        directed          3 162      −0.016         0.020      j
biological      protein interactions         undirected        2 115      −0.156         0.010      k
biological      metabolic network            undirected          765      −0.240         0.007      l
biological      neural network               directed            307      −0.226         0.016      m
biological      marine food web              directed            134      −0.263         0.037      n
biological      freshwater food web          directed             92      −0.326         0.031      o

TABLE II: Size n, degree assortativity coefficient r, and expected error σ_r on the assortativity, for a number of social, technological, and biological networks, both directed and undirected. Social networks: coauthorship networks of (a) physicists and biologists [43] and (b) mathematicians [44], in which authors are connected if they have coauthored one or more articles in learned journals; (c) collaborations of film actors in which actors are connected if they have appeared together in one or more movies [5, 7]; (d) directors of Fortune 1000 companies for 1999, in which two directors are connected if they sit on the board of directors of the same company [45]; (e) romantic (not necessarily sexual) relationships between students at a US high school [46]; (f) network of email address books of computer users on a large computer system, in which an edge from user A to user B indicates that B appears in A’s address book [47]. Technological networks: (g) network of high voltage transmission lines in the Western States Power Grid of the United States [5]; (h) network of direct peering relationships between autonomous systems on the Internet, April 2001 [48]; (i) network of hyperlinks between pages in the World-Wide Web domain nd.edu, circa 1999 [49]; (j) network of dependencies between software packages in the GNU/Linux operating system, in which an edge from package A to package B indicates that A relies on components of B for its operation. Biological networks: (k) protein–protein interaction network in the yeast S. Cerevisiae [50]; (l) metabolic network of the bacterium E. Coli [51]; (m) neural network of the nematode worm C. Elegans [5, 52]; trophic interactions between species in the food webs of (n) Ythan Estuary, Scotland [53] and (o) Little Rock Lake, Wisconsin [54].

B. Models of assortative mixing by degree

In Ref. 22 we studied the ensemble of graphs that have a specified value of the matrix $e_{jk}$ and solved exactly for its average properties using generating function methods similar to those of Section II B. We showed that the phase transition at which a giant component first appears in such networks occurs at a point given by det(I − m) = 0, where m is the matrix with elements $m_{jk} = k e_{jk} / q_j$. One can also calculate exactly the size of the giant component, and the distribution of sizes of the small components below the phase transition. While these developments are mathematically elegant, however, their usefulness is limited by the fact that the generating functions involved are rarely calculable in closed form for arbitrary specified $e_{jk}$, and the determinant of the matrix I − m almost never is. In this paper, therefore, we take an alternative approach, making use of computer simulation.

We would like to generate on a computer a random network having, for instance, a particular value of the matrix $e_{jk}$. (This also fixes the degree distribution, via Eq. (23).) In Ref. 22 we discussed one possible way of doing this using an algorithm similar to that of Section II C. One would draw edges from the desired distribution $e_{jk}$ and then join the degree k ends randomly in groups of k to create the network. (This algorithm has also been discussed recently by Dorogovtsev et al. [40].) As we pointed out, however, this algorithm is flawed because in order to create a network without any dangling edges the number of degree k ends must be a multiple of k for all k. It is very unlikely that these constraints will be satisfied by chance, and there does not appear to be any simple way of arranging for them to be satisfied without introducing bias into the ensemble of graphs. Instead, therefore, we use a Monte Carlo sampling scheme which is essentially equivalent to the Metropolis–Hastings method widely used in the mathematical and social sciences for generating model networks [55, 56]. The algorithm is as follows.

1. Given the desired edge distribution $e_{jk}$, we first calculate the corresponding distribution of excess degrees $q_k$ from Eq. (23), and then invert Eq. (22) to find the degree distribution:

$$p_k = \frac{q_{k-1} / k}{\sum_j q_{j-1} / j}. \qquad (27)$$

Note that this equation cannot tell us how many vertices there are of degree zero in the network. This information is not contained in the edge distribution $e_{jk}$ since no edges connect to degree-zero vertices, and so must be specified separately. On

degree: $k_i$, $k_j$

Page 33

assortative mixing

homophily and assortative mixing: like links with like

categorical variables:

let $e_{ij}$ be the fraction of edges connecting vertices of type $i$ to vertices of type $j$

matrix sum:

$$\sum_{ij} e_{ij} = 1$$

marginals:

$$\sum_j e_{ij} = a_i, \qquad \sum_i e_{ij} = b_j$$

Newman, Phys. Rev. E 67, 026126 (2003).

Page 34

assortative mixing

homophily and assortative mixing: like links with like

categorical variables: assortativity coefficient *

$$r = \frac{\sum_i e_{ii} - \sum_i a_i b_i}{1 - \sum_i a_i b_i} = \frac{\mathrm{Tr}\, e - \| e^2 \|}{1 - \| e^2 \|}$$

* this equation is equivalent to the popular modularity measure Q used to score the strength of community structure

Newman, Phys. Rev. E 67, 026126 (2003).

Page 35

assortative mixing

Newman, Phys. Rev. E 67, 026126 (2003).

                          women
men         black   hispanic   white   other     a_i
black       0.258    0.016     0.035   0.013    0.323
hispanic    0.012    0.157     0.058   0.019    0.247
white       0.013    0.023     0.306   0.035    0.377
other       0.005    0.007     0.024   0.016    0.053
b_i         0.289    0.204     0.423   0.084

TABLE I: The mixing matrix $e_{ij}$ and the values of $a_i$ and $b_i$ for sexual partnerships in the study of Catania et al. [23]. After Morris [24].

effect on network structure and behavior. The outline of the paper is as follows. In Section II we study the effects of assortative mixing by discrete characteristics such as language or race. In Section III we study mixing by scalar properties such as age and particularly vertex degree; since degree is itself a property of the network topology, the latter type of mixing leads to some novel network structures not seen with other types. In Section IV we give our conclusions. A preliminary report of some of the results in this paper has appeared previously as Ref. 22.

II. DISCRETE CHARACTERISTICS

In this section we consider assortative mixing according to discrete or enumerative vertex characteristics. Such mixing can be characterized by a quantity $e_{ij}$, which we define to be the fraction of edges in a network that connect a vertex of type i to one of type j. On an undirected network this quantity is symmetric in its indices, $e_{ij} = e_{ji}$, while on directed networks or bipartite networks it may be asymmetric. It satisfies the sum rules

$$\sum_{ij} e_{ij} = 1, \qquad \sum_j e_{ij} = a_i, \qquad \sum_i e_{ij} = b_j, \qquad (1)$$

where $a_i$ and $b_i$ are the fraction of each type of end of an edge that is attached to vertices of type i. On undirected graphs, where the ends of edges are all of the same type, $a_i = b_i$ [75].

For example, Table I shows data on the values of $e_{ij}$ for mixing by race among sexual partners in a 1992 study carried out in the city of San Francisco, California [23]. This part of the study focused on heterosexuals, so this is a bipartite network, the two vertex types representing men and women, with edges running only between vertices of unlike types. This means that in this case the ends of an edge are different and the matrix $e_{ij}$ is asymmetric. As the table shows, mixing is highly assortative in this network, with individuals strongly preferring partners from the same group as themselves.

A. Measuring discrete assortative mixing

To quantify the level of assortative mixing in a networkwe define an assortativity coefficient thus:

r =

"

i eii −"

i aibi

1 −"

i aibi=

Tr e − ∥ e2 ∥

1 − ∥ e2 ∥, (2)

where $\mathbf{e}$ is the matrix whose elements are $e_{ij}$ and $\| \mathbf{x} \|$ means the sum of all elements of the matrix $\mathbf{x}$. This formula gives r = 0 when there is no assortative mixing, since $e_{ij} = a_i b_j$ in that case, and r = 1 when there is perfect assortative mixing and $\sum_i e_{ii} = 1$. If the network is perfectly disassortative, i.e., every edge connects two vertices of different types, then r is negative and has the value

$$r_{\min} = -\frac{\sum_i a_i b_i}{1 - \sum_i a_i b_i}, \tag{3}$$

which lies in general in the range $-1 \le r < 0$. One might ask what this value signifies. Why do we not simply have r = −1 for a perfectly disassortative network? The answer is that a perfectly disassortative network is normally closer to a randomly mixed network than is a perfectly assortative network. When there are several different vertex types (e.g., four in the case shown in Table I) then random mixing will most often pair unlike vertices, so that the network appears to be mostly disassortative. Therefore it is appropriate that the value r = 0 for the random network should be closer to that for the perfectly disassortative network than for the perfectly assortative one.
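The definitions of r and of its minimum value can be checked numerically. The following is a minimal Python sketch, assuming the mixing matrix is supplied as a list of rows; the function name `assortativity` and the toy matrices are illustrative choices, not taken from the paper.

```python
def assortativity(e):
    """Return (r, r_min) for a mixing matrix e given as a list of rows.

    Assumes at least two vertex types actually occur, so that
    sum_i a_i b_i < 1 and the denominators are nonzero.
    """
    total = sum(sum(row) for row in e)
    # Normalize so that all entries sum to 1 (first sum rule).
    e = [[x / total for x in row] for row in e]
    n = len(e)
    a = [sum(e[i]) for i in range(n)]                       # a_i: row sums
    b = [sum(e[i][j] for i in range(n)) for j in range(n)]  # b_j: column sums
    trace = sum(e[i][i] for i in range(n))                  # sum_i e_ii
    expected = sum(a[i] * b[i] for i in range(n))           # sum_i a_i b_i
    r = (trace - expected) / (1 - expected)
    r_min = -expected / (1 - expected)
    return r, r_min


# Hypothetical toy matrices (raw edge counts are fine; they get normalized).
perfect = [[1, 0], [0, 1]]            # every edge joins like to like
random_mix = [[1, 1], [1, 1]]         # e_ij = a_i * b_j
two_type_dis = [[0, 1], [1, 0]]       # every edge joins unlike types
four_type_dis = [[0 if i == j else 1 for j in range(4)] for i in range(4)]

for name, m in [("perfect", perfect), ("random", random_mix),
                ("disassortative, 2 types", two_type_dis),
                ("disassortative, 4 types", four_type_dis)]:
    r, r_min = assortativity(m)
    print(f"{name}: r = {r:.3f}, r_min = {r_min:.3f}")
```

With two equally common types the perfectly disassortative case reaches r = −1, while with four equally common types it only reaches r = −1/3, which illustrates why r_min moves toward zero as the number of types grows.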

A quantity with properties similar to those of Eq. (2) has been proposed previously by Gupta et al. [25]. However the definition of Gupta et al. gives misleading results in certain situations, such as, for example, when one type of vertex is much less numerous than other types, as is the case in Table I. In this paper therefore we use Eq. (2), which doesn’t suffer from this problem. The difference between the two measures is discussed in more detail in Appendix A.

Using the values from Table I in Eq. (2), we find that r = 0.621 for the network of sexual partnerships, implying, as we observed already, that this network is strongly assortative by race; individuals draw their partners from their own group far more often than one would expect on the basis of pure chance.

As another example of the application of Eq. (2), consider the network studied by Girvan and Newman [16] representing the timetable of American college football games, in which vertices represent universities and colleges, and edges represent regular season games between teams during the year in question. Colleges are grouped into “conferences,” which are defined primarily by geography, and teams normally play more often against other teams in their own conference than they do against teams from other conferences. In other words, there should be assortative mixing of colleges by conference in the schedule network. For the 2000 season schedule studied in

r = 0.621 (strongly assortative)

1992 study of heterosexual partnerships in San Francisco* (bipartite network)

*Catania et al., Am. J. Public Health 82, 284-287 (1992).


assortative mixing

r = 0.264 (moderately assortative)

             Northeast   Midwest   South   West    Canada   a_i
Northeast    0.119       0.053     0.074   0.055   0.022    0.322
Midwest      0.031       0.067     0.061   0.026   0.011    0.196
South        0.025       0.027     0.083   0.024   0.006    0.166
West         0.049       0.033     0.043   0.073   0.011    0.209
Canada       0.006       0.005     0.005   0.005   0.085    0.107
b_i          0.229       0.185     0.267   0.184   0.135

4388 Computer Science faculty
• vertices are PhD granting institutions in North America
• edge (u, v) means PhD at u and now faculty at v
• labels are US census regions + Canada
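As a consistency check on the table, the sketch below tests the sum rules (row sums against the printed a_i, column sums against the printed b_i) and lists the asymmetries e_ij − e_ji that mark this network as directed. The entries are copied from the table; the tolerance and variable names are assumptions made here, and because the entries are rounded to three decimals the agreement can only be approximate.

```python
regions = ["Northeast", "Midwest", "South", "West", "Canada"]

# Mixing matrix e_ij as printed in the table (rounded to three decimals).
e = [
    [0.119, 0.053, 0.074, 0.055, 0.022],
    [0.031, 0.067, 0.061, 0.026, 0.011],
    [0.025, 0.027, 0.083, 0.024, 0.006],
    [0.049, 0.033, 0.043, 0.073, 0.011],
    [0.006, 0.005, 0.005, 0.005, 0.085],
]
a_printed = [0.322, 0.196, 0.166, 0.209, 0.107]  # printed row marginals a_i
b_printed = [0.229, 0.185, 0.267, 0.184, 0.135]  # printed column marginals b_j

k = len(regions)
row_sums = [sum(e[i]) for i in range(k)]
col_sums = [sum(e[i][j] for i in range(k)) for j in range(k)]

# Each marginal adds five rounded entries, so allow a few thousandths.
tol = 0.003
max_dev = max(abs(x - y) for x, y in
              zip(row_sums + col_sums, a_printed + b_printed))
print("largest deviation from printed marginals:", round(max_dev, 4))
print("sum rules hold within tolerance:", max_dev <= tol)

# On an undirected network e_ij = e_ji; here the differences are nonzero.
for i in range(k):
    for j in range(i + 1, k):
        diff = e[i][j] - e[j][i]
        print(f"{regions[i]:>9} vs {regions[j]:<9} e_ij - e_ji = {diff:+.3f}")
```

The nonzero differences are what one expects from a directed network, where a_i and b_i need not coincide.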


assortative mixing

homophily and assortative mixing: like links with like

• random graphs tend to be disassortative because the mixing is uniform

• social networks (apparently) highly assortative, in every way (attribute, degree, category)

• extremal values suggest underlying mechanism on that variable

Newman, Phys. Rev. E 67, 026126 (2003).

r 0

r ≈ {−1, 1}


describing networks

community structure


describing networks

community structure: a group of vertices that connect to other groups in similar ways

assortative community structure (edges inside the groups)

[Figure: panels A to D (calculate alignment scores; convert to alignment indicators; remove short aligned regions; extract highly variable regions); axes: alignment position t, a(t)]

community structure

community structure: a group of vertices that connect to other groups in similar ways

“bridge” edges

“internal” edges

assortative community structure (edges inside the groups)


community structure

community structure: a group of vertices that connect to other groups in similar ways


mixing matrix


assortative: edges within groups

disassortative: edges between groups

ordered: linear group hierarchy

core-periphery: dense core, sparse periphery
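The four patterns can be made concrete as group-by-group matrices of connection probabilities. The sketch below is only an illustration under assumptions made here: the probabilities `p_hi` and `p_lo`, the three equal-sized groups, the reading of "ordered" as links between adjacent groups, and the function names are all hypothetical and not taken from the slides.

```python
import random


def block_matrix(kind, k=3, p_hi=0.5, p_lo=0.05):
    """Hypothetical k x k matrix of connection probabilities between groups."""
    m = [[p_lo] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            if kind == "assortative" and i == j:
                m[i][j] = p_hi            # dense inside groups
            elif kind == "disassortative" and i != j:
                m[i][j] = p_hi            # dense between groups
            elif kind == "ordered" and abs(i - j) == 1:
                m[i][j] = p_hi            # groups link to adjacent groups
            elif kind == "core-periphery":
                # group 0 is the core; density decays toward the periphery
                m[i][j] = p_hi / (1 + max(i, j)) ** 2
    return m


def sample_edges(m, group_size=20, seed=0):
    """Draw an undirected graph: each pair is linked with probability m[g_u][g_v]."""
    rng = random.Random(seed)
    group = [v // group_size for v in range(len(m) * group_size)]
    edges = []
    for u in range(len(group)):
        for v in range(u + 1, len(group)):
            if rng.random() < m[group[u]][group[v]]:
                edges.append((u, v))
    return edges, group


def within_fraction(edges, group):
    """Fraction of edges whose two endpoints lie in the same group."""
    return sum(group[u] == group[v] for u, v in edges) / len(edges)


for kind in ["assortative", "disassortative", "ordered", "core-periphery"]:
    edges, group = sample_edges(block_matrix(kind))
    print(f"{kind:>15}: {len(edges)} edges, "
          f"{within_fraction(edges, group):.2f} within groups")
```

The within-group fraction is a crude summary; it separates the assortative pattern from the others but does not distinguish ordered from disassortative structure, which is why the full mixing matrix is the more informative description.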

community structure


• enormous interest, especially since 2000

• dozens of algorithms for extracting various large-scale patterns

• hundreds of papers published

• spanning Physics, Computer Science, Statistics, Biology, Sociology, and more

• this was one of the first:

PNAS 2002

5700+ citations on Google Scholar

assortative disassortative

ordered core-periphery

community structure


network communities

1983

most new job opportunities from “weak ties”
• within-community links = strong
• bridge links = weak

why? information propagates quickly within a community, but slowly between communities


network communities

2004

liberal

conservative

1494 blogs 759 liberal 735 conservative


network communities

amazon.com co-purchasing network

2004


network communities

amazon.com co-purchasing network

find partition that maximizes assortativity r on those groups

2004

n = 409,687 items
m = 2,464,630 edges
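To make "find partition that maximizes assortativity" concrete, here is a small sketch that builds the mixing matrix from an edge list and a group labelling, computes r, and then searches all two-group labellings of a toy graph by brute force. The toy graph (two triangles joined by one bridge edge) and the function names are assumptions made for illustration. Exhaustive search is only feasible for tiny graphs; a network of the size quoted above would need a heuristic instead.

```python
from itertools import product


def mixing_matrix(edges, label, k):
    """Fraction of edge ends joining group i to group j (undirected graph)."""
    e = [[0.0] * k for _ in range(k)]
    for u, v in edges:
        # An undirected edge is counted once in each direction,
        # which makes the matrix symmetric.
        e[label[u]][label[v]] += 1
        e[label[v]][label[u]] += 1
    total = 2 * len(edges)
    return [[x / total for x in row] for row in e]


def assortativity(e):
    """Assortativity coefficient r of a normalized mixing matrix."""
    k = len(e)
    a = [sum(row) for row in e]
    b = [sum(e[i][j] for i in range(k)) for j in range(k)]
    expected = sum(a[i] * b[i] for i in range(k))
    trace = sum(e[i][i] for i in range(k))
    return (trace - expected) / (1 - expected)


# Toy graph: triangle {0,1,2}, triangle {3,4,5}, bridge edge (2,3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]


def score(label):
    return assortativity(mixing_matrix(edges, label, 2))


# Fix vertex 0 in group 0 (removes mirror images) and require group 1 non-empty.
candidates = [(0,) + rest for rest in product((0, 1), repeat=5) if any(rest)]
best_labels = max(candidates, key=score)
best_r = score(best_labels)

print("best partition:", best_labels, "r =", round(best_r, 3))
print("alternating labels r =", round(score((0, 1, 0, 1, 0, 1)), 3))
```

On this toy graph the search recovers the two triangles as the groups, and a labelling that cuts across the triangles scores a negative r.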


network communities

clustered network

purchases = interests
interests = clustered


network communities

political books on amazon (2004)

©2004 Valdis Krebs


• community = vertices with same pattern of inter-community connections

• network macro-structure
• finding them like “network clustering”
• allow us to coarse grain system structure

[decompose heterogeneous structure into homogeneous blocks]

• constrains network synchronization, information flows, diffusion, influence

open questions:

• what processes generate communities?
• what impact on dynamics? network function?

network communities


describing networks

motifs


describing networks

motifs: small subgraphs (of interest), which we then count
compare counts against null model (random graph model)


Network Motifs: Simple Building Blocks of Complex Networks

R. Milo,1 S. Shen-Orr,1 S. Itzkovitz,1 N. Kashtan,1 D. Chklovskii,2 U. Alon1*

Complex networks are studied across many fields of science. To uncover their structural design principles, we defined “network motifs,” patterns of interconnections occurring in complex networks at numbers that are significantly higher than those in randomized networks. We found such motifs in networks from biochemistry, neurobiology, ecology, and engineering. The motifs shared by ecological food webs were distinct from the motifs shared by the genetic networks of Escherichia coli and Saccharomyces cerevisiae or from those found in the World Wide Web. Similar motifs were found in networks that perform information processing, even though they describe elements as different as biomolecules within a cell and synaptic connections between neurons in Caenorhabditis elegans. Motifs may thus define universal classes of networks. This approach may uncover the basic building blocks of most networks.

Many of the complex networks that occur in nature have been shown to share global statistical features (1–10). These include the “small world” property (1–9) of short paths between any two nodes and highly clustered connections. In addition, in many natural networks, there are a few nodes with many more connections than the average node has. In these types of networks, termed “scale-free networks” (4, 6), the fraction of nodes having k edges, p(k), decays as a power law $p(k) \sim k^{-\gamma}$ (where $\gamma$ is often between 2 and 3). To go beyond these global features would require an understanding of the basic structural elements particular to each class of networks (9). To do this, we developed an algorithm for detecting network motifs: recurring, significant patterns of interconnections. A detailed application to a gene regulation network has been presented (11). Related methods were used to test hypotheses on social networks (12, 13). Here we generalize this approach to virtually any type of connectivity graph and find the striking appearance of

1Departments of Physics of Complex Systems and Molecular Cell Biology, Weizmann Institute of Science, Rehovot, Israel 76100. 2Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA.

*To whom correspondence should be addressed. E-mail: [email protected]

Fig. 1. (A) Examples of interactions represented by directed edges between nodes in some of the networks used for the present study. These networks go from the scale of biomolecules (transcription factor protein X binds regulatory DNA regions of a gene to regulate the production rate of protein Y), through cells (neuron X is synaptically connected to neuron Y), to organisms (X feeds on Y). (B) All 13 types of three-node connected subgraphs.


Milo et al., Science 298, 824-827 (2002).


describing networks

motifs: small subgraphs (of interest), which we then count
compare counts against null model (random graph model)

• efficient counting is tricky (combinatorics + graph isomorphism)

• choice of null model key (see Lecture 2)

• lots of work in this area, mainly in molecular biology and neuroscience

• see
  Sporns and Kotter, PLoS Biol. 2, e369 (2004)
  Matias et al., REVSTAT 4, 31-51 (2006)
  Wong et al., Brief. in Bioinfo. 13, 202-215 (2011)
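The count-then-compare recipe can be sketched for the simplest motif, the undirected triangle. The example below assumes a particular null model, degree-preserving randomization by double edge swaps, which may or may not match the one used in Lecture 2; the toy graph (a ring of cliques), the number of null samples, and all function names are illustrative assumptions.

```python
import random
from itertools import combinations
from statistics import mean, pstdev


def count_triangles(edges):
    """Count triangles in a simple undirected graph given as unique edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Every triangle is found once from each of its three edges.
    return sum(len(adj[u] & adj[v]) for u, v in edges) // 3


def rewire(edges, n_swaps, rng):
    """Degree-preserving randomization via double edge swaps (assumed null)."""
    edges = [tuple(sorted(e)) for e in edges]
    present = set(edges)
    done = attempts = 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        if rng.random() < 0.5:
            c, d = d, c
        # Proposed swap: (a,b),(c,d) -> (a,d),(c,b)
        new1, new2 = tuple(sorted((a, d))), tuple(sorted((c, b)))
        # Reject swaps that would create a self-loop or a duplicate edge.
        if a == d or c == b or new1 in present or new2 in present:
            continue
        present -= {edges[i], edges[j]}
        present |= {new1, new2}
        edges[i], edges[j] = new1, new2
        done += 1
    return edges


def ring_of_cliques(n_cliques=6, size=4):
    """Toy graph: cliques joined in a ring by single edges."""
    edges = []
    for c in range(n_cliques):
        edges += list(combinations(range(c * size, (c + 1) * size), 2))
        # Link the last vertex of this clique to the first of the next one.
        edges.append((c * size + size - 1, ((c + 1) % n_cliques) * size))
    return edges


rng = random.Random(1)
graph = ring_of_cliques()
observed = count_triangles(graph)
null_counts = [count_triangles(rewire(graph, 10 * len(graph), rng))
               for _ in range(50)]
sd = pstdev(null_counts)
z = (observed - mean(null_counts)) / sd if sd > 0 else float("inf")
print("observed triangles:", observed)
print("null mean:", round(mean(null_counts), 2), "z-score:", round(z, 1))
```

A large positive z-score indicates that triangles are over-represented relative to this null model. Larger or directed motifs need more careful enumeration, which is where the combinatorics and graph isomorphism issues mentioned above come in.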


end of lecture 3


• The structure and function of complex networks. M. E. J. Newman, SIAM Review 45, 167–256 (2003).

• The Structure and Dynamics of Networks. M. E. J. Newman, A.-L. Barabási, and D. J. Watts, Princeton University Press (2006).

• Hierarchical structure and the prediction of missing links in networks. A. Clauset, C. Moore, and M. E. J. Newman, Nature 453, 98–101 (2008).

• Modularity and community structure in networks. M. E. J. Newman, Proc. Natl. Acad. Sci. USA 103, 8577–8582 (2006).

• Why social networks are different from other types of networks. M. E. J. Newman and J. Park, Phys. Rev. E 68, 036122 (2003)

• Random graphs with arbitrary degree distributions and their applications. M. E. J. Newman, S. H. Strogatz, and D. J. Watts, Phys. Rev. E 64, 026118 (2001).

• Comparing community structure identification. L. Danon, A. Diaz-Guilera, J. Duch and A. Arenas. J. Stat. Mech. P09008 (2005).

• Characterization of Complex Networks: A Survey of measurements. L. daF. Costa, F. A. Rodrigues, G. Travieso and P. R. VillasBoas. arxiv:cond-mat/050585 (2005).

• Evolution in Networks. S.N. Dorogovtsev and J. F. F. Mendes. Adv. Phys. 51, 1079 (2002).

• Revisiting “scale-free” networks. E. F. Keller. BioEssays 27, 1060-1068 (2005).

• Currency metabolites and network representations of metabolism. P. Holme and M. Huss. arxiv:0806.2763 (2008).

• Functional cartography of complex metabolic networks. R. Guimera and L. A. N. Amaral. Nature 433, 895 (2005).

• Graphs over Time: Densification Laws, Shrinking Diameters and Possible Explanations. J. Leskovec, J. Kleinberg and C. Faloutsos. Proc. 11th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining 2005.

• The Structure of the Web. J. Kleinberg and S. Lawrence. Science 294, 1849 (2001).

• Navigation in a Small World. J. Kleinberg. Nature 406 (2000), 845.

• Towards a Theory of Scale-Free Graphs: Definitions, Properties and Implications. L. Li, D. Alderson, J. Doyle, and W. Willinger. Internet Mathematics 2(4), 2006. 

• A First-Principles Approach to Understanding the Internet’s Router-Level Topology. L. Li, D. Alderson, W. Willinger, and J. Doyle. ACM SIGCOMM 2004.

• Inferring network mechanisms: The Drosophila melanogaster protein interaction network. M. Middendorf, E. Ziv and C. H. Wiggins. Proc. Natl. Acad. Sci. USA 102, 3192 (2005).

• Robustness Can Evolve Gradually in Complex Regulatory Gene Networks with Varying Topology. S. Ciliberti, O. C. Martin and A. Wagner. PLoS Comp. Bio. 3, e15 (2007).

• Simple rules yield complex food webs. R. J. Williams and N. D. Martinez. Nature 404, 180 (2000).

• A network analysis of committees in the U.S. House of Representatives. M. A. Porter, P. J. Mucha, M. E. J. Newman and C. M. Warmbrand. Proc. Natl. Acad. Sci. USA 102, 7057 (2005).

• On the Robustness of Centrality Measures under Conditions of Imperfect Data. S. P. Borgatti, K. M. Carley and D. Krackhardt. Social Networks 28, 124 (2006).

selected references