18-networks2

CS109/Stat121/AC209/E-109 Data Science: Network Models II. Hanspeter Pfister & Joe Blitzstein.

Upload: dexxt0r

Post on 28-Apr-2017


Page 1: 18-Networks2

CS109/Stat121/AC209/E-109 Data Science

Network Models II

Hanspeter Pfister & Joe Blitzstein

Page 2: 18-Networks2

This Week

• Project proposals due next Monday (Nov 11)

http://cs109.org/projects/projects.php

• No late days or extensions are possible on project milestones or deadlines!

• HW5 due next Friday (Nov 15)

• Friday lab 10-11:30 am in MD G115

Page 3: 18-Networks2

4.2 Vertex and Edge Characteristics (p. 91)

[Figure 4.4: four panels, (a) to (d)]

Fig. 4.4 Illustration of (b) closeness, (c) betweenness, and (d) eigenvector centrality measures on the graph in (a). Example and figures courtesy of Ulrik Brandes.

Figures 4.4(b)-(d) provide visual summaries of the closeness, betweenness, and eigenvector centralities, respectively, of the vertices in our toy graph. In each case, vertices are arranged using a radial layout, with more central vertices located closer to the center.

Under the closeness centrality, c_Cl(v), the dark blue vertex is judged to be most central under this measure, followed closely by the red and yellow vertices. However, under the betweenness centrality, c_B(v), it can be seen that the yellow vertex is now judged to be most central. In fact, note that the relative positions of all but the most extreme vertices (i.e., small green) have changed noticeably from Fig-

Kolaczyk (2009)

Comparing centrality measures
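To make the comparison concrete, here is a minimal sketch of two of the measures named above, closeness and betweenness, for an unweighted, undirected, connected graph. The function names and the five-vertex path graph are illustrative assumptions, not taken from Kolaczyk's example, and the betweenness routine follows Brandes' accumulation scheme without any normalization.

```python
from collections import deque


def bfs_distances(adj, source):
    """Return hop distances from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def closeness(adj):
    """Closeness: (n - 1) divided by the sum of distances (graph assumed connected)."""
    n = len(adj)
    return {v: (n - 1) / sum(bfs_distances(adj, v).values()) for v in adj}


def betweenness(adj):
    """Unnormalized betweenness: number of shortest paths through each vertex."""
    cb = {v: 0.0 for v in adj}
    for s in adj:
        # BFS from s, counting shortest paths (sigma) and recording predecessors.
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        dist = {s: 0}
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
                if dist[w] == dist[u] + 1:
                    sigma[w] += sigma[u]
                    preds[w].append(u)
        # Accumulate pair dependencies, starting from the farthest vertices.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * (1 + delta[w])
            if w != s:
                cb[w] += delta[w]
    # Each unordered pair was visited from both of its endpoints.
    return {v: c / 2 for v, c in cb.items()}


# Hypothetical toy graph: the path a - b - c - d - e.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
print(closeness(path))
print(betweenness(path))
```

On larger graphs the two dictionaries can rank vertices differently, which is the point of the slide; sorting each dictionary by value is enough to compare the rankings.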

Page 4: 18-Networks2

It’s a small world after all: Watts-Strogatz Model

Nature, vol. 393, 4 June 1998, p. 441 (letters to nature). © Macmillan Publishers Ltd 1998

removed from a clustered neighbourhood to make a short cut has, at most, a linear effect on C; hence C(p) remains practically unchanged for small p even though L(p) drops rapidly. The important implication here is that at the local level (as reflected by C(p)), the transition to a small world is almost undetectable. To check the robustness of these results, we have tested many different types of initial regular graphs, as well as different algorithms for random rewiring, and all give qualitatively similar results. The only requirement is that the rewired edges must typically connect vertices that would otherwise be much farther apart than L_random.

The idealized construction above reveals the key role of short cuts. It suggests that the small-world phenomenon might be common in sparse networks with many vertices, as even a tiny fraction of short cuts would suffice. To test this idea, we have computed L and C for the collaboration graph of actors in feature films (generated from data available at http://us.imdb.com), the electrical power grid of the western United States, and the neural network of the nematode worm C. elegans (ref. 17). All three graphs are of scientific interest. The graph of film actors is a surrogate for a social network (ref. 18), with the advantage of being much more easily specified. It is also akin to the graph of mathematical collaborations centred, traditionally, on P. Erdős (partial data available at http://www.acs.oakland.edu/~grossman/erdoshp.html). The graph of the power grid is relevant to the efficiency and robustness of power networks (ref. 19). And C. elegans is the sole example of a completely mapped neural network.

Table 1 shows that all three graphs are small-world networks. These examples were not hand-picked; they were chosen because of their inherent interest and because complete wiring diagrams were available. Thus the small-world phenomenon is not merely a curiosity of social networks (refs 13, 14) nor an artefact of an idealized model: it is probably generic for many large, sparse networks found in nature.

We now investigate the functional significance of small-world connectivity for dynamical systems. Our test case is a deliberately simplified model for the spread of an infectious disease. The population structure is modelled by the family of graphs described in Fig. 1. At time t = 0, a single infective individual is introduced into an otherwise healthy population. Infective individuals are removed permanently (by immunity or death) after a period of sickness that lasts one unit of dimensionless time. During this time, each infective individual can infect each of its healthy neighbours with probability r. On subsequent time steps, the disease spreads along the edges of the graph until it either infects the entire population, or it dies out, having infected some fraction of the population in the process.

[Figure 1 panels, left to right: Regular (p = 0), Small-world, Random (p = 1); increasing randomness]

Figure 1 Random rewiring procedure for interpolating between a regular ring lattice and a random network, without altering the number of vertices or edges in the graph. We start with a ring of n vertices, each connected to its k nearest neighbours by undirected edges. (For clarity, n = 20 and k = 4 in the schematic examples shown here, but much larger n and k are used in the rest of this Letter.) We choose a vertex and the edge that connects it to its nearest neighbour in a clockwise sense. With probability p, we reconnect this edge to a vertex chosen uniformly at random over the entire ring, with duplicate edges forbidden; otherwise we leave the edge in place. We repeat this process by moving clockwise around the ring, considering each vertex in turn until one lap is completed. Next, we consider the edges that connect vertices to their second-nearest neighbours clockwise. As before, we randomly rewire each of these edges with probability p, and continue this process, circulating around the ring and proceeding outward to more distant neighbours after each lap, until each edge in the original lattice has been considered once. (As there are nk/2 edges in the entire graph, the rewiring process stops after k/2 laps.) Three realizations of this process are shown, for different values of p. For p = 0, the original ring is unchanged; as p increases, the graph becomes increasingly disordered until for p = 1, all edges are rewired randomly. One of our main results is that for intermediate values of p, the graph is a small-world network: highly clustered like a regular graph, yet with small characteristic path length, like a random graph. (See Fig. 2.)
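The rewiring procedure in the Fig. 1 caption can be sketched in a few lines. This is an illustrative reading of the caption, not the authors' code: the names `ring_lattice` and `watts_strogatz` are hypothetical, and the sketch additionally assumes that self-loops are excluded and that an edge is left in place when its vertex is already joined to every other vertex.

```python
import random


def ring_lattice(n, k):
    """Ring of n vertices, each joined to its k nearest neighbours (k even)."""
    adj = {v: set() for v in range(n)}
    for v in range(n):
        for j in range(1, k // 2 + 1):
            w = (v + j) % n
            adj[v].add(w)
            adj[w].add(v)
    return adj


def watts_strogatz(n, k, p, seed=None):
    """Rewire each lattice edge once with probability p, lap by lap."""
    rng = random.Random(seed)
    adj = ring_lattice(n, k)
    for j in range(1, k // 2 + 1):      # lap j: edges to the j-th clockwise neighbour
        for v in range(n):              # move clockwise around the ring
            w = (v + j) % n
            if rng.random() >= p:
                continue                # leave this edge in place
            # Assumption: no self-loops; duplicate edges are forbidden.
            candidates = [u for u in range(n) if u != v and u not in adj[v]]
            if not candidates:
                continue
            new = rng.choice(candidates)
            adj[v].discard(w)
            adj[w].discard(v)
            adj[v].add(new)
            adj[new].add(v)
    return adj


# The schematic setting from the caption: n = 20, k = 4.
g = watts_strogatz(20, 4, p=0.3, seed=1)
print(sum(len(nbrs) for nbrs in g.values()) // 2)  # number of edges, still nk/2
```

Because every rewiring removes one edge and adds one new edge, the vertex and edge counts stay fixed for any p, as the caption requires.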

Table 1 Empirical examples of small-world networks

              L_actual   L_random   C_actual   C_random
Film actors   3.65       2.99       0.79       0.00027
Power grid    18.7       12.4       0.080      0.005
C. elegans    2.65       2.25       0.28       0.05

Characteristic path length L and clustering coefficient C for three real networks, compared to random graphs with the same number of vertices (n) and average number of edges per vertex (k). (Actors: n = 225,226, k = 61. Power grid: n = 4,941, k = 2.67. C. elegans: n = 282, k = 14.) The graphs are defined as follows. Two actors are joined by an edge if they have acted in a film together. We restrict attention to the giant connected component (ref. 16) of this graph, which includes ~90% of all actors listed in the Internet Movie Database (available at http://us.imdb.com), as of April 1997. For the power grid, vertices represent generators, transformers and substations, and edges represent high-voltage transmission lines between them. For C. elegans, an edge joins two neurons if they are connected by either a synapse or a gap junction. We treat all edges as undirected and unweighted, and all vertices as identical, recognizing that these are crude approximations. All three networks show the small-world phenomenon: L ≳ L_random but C ≫ C_random.

[Figure 2 plot: L(p)/L(0) and C(p)/C(0) on the vertical axis (0 to 1) against p on a logarithmic horizontal axis (0.0001 to 1)]

Figure 2 Characteristic path length L(p) and clustering coefficient C(p) for the family of randomly rewired graphs described in Fig. 1. Here L is defined as the number of edges in the shortest path between two vertices, averaged over all pairs of vertices. The clustering coefficient C(p) is defined as follows. Suppose that a vertex v has k_v neighbours; then at most k_v(k_v − 1)/2 edges can exist between them (this occurs when every neighbour of v is connected to every other neighbour of v). Let C_v denote the fraction of these allowable edges that actually exist. Define C as the average of C_v over all v. For friendship networks, these statistics have intuitive meanings: L is the average number of friendships in the shortest chain connecting two people; C_v reflects the extent to which friends of v are also friends of each other; and thus C measures the cliquishness of a typical friendship circle. The data shown in the figure are averages over 20 random realizations of the rewiring process described in Fig. 1, and have been normalized by the values L(0), C(0) for a regular lattice. All the graphs have n = 1,000 vertices and an average degree of k = 10 edges per vertex. We note that a logarithmic horizontal scale has been used to resolve the rapid drop in L(p), corresponding to the onset of the small-world phenomenon. During this drop, C(p) remains almost constant at its value for the regular lattice, indicating that the transition to a small world is almost undetectable at the local level.
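The two statistics defined in the Fig. 2 caption are short to compute directly. The sketch below is illustrative only: the function names are hypothetical, C_v is taken to be 0 for vertices with fewer than two neighbours (the caption does not say), and L is averaged over reachable pairs so that the code does not fail on a disconnected graph.

```python
from collections import deque


def ring_lattice(n, k):
    """Ring of n vertices, each joined to its k nearest neighbours (k even)."""
    adj = {v: set() for v in range(n)}
    for v in range(n):
        for j in range(1, k // 2 + 1):
            w = (v + j) % n
            adj[v].add(w)
            adj[w].add(v)
    return adj


def clustering_coefficient(adj):
    """C: average over v of the fraction of possible neighbour-neighbour edges present."""
    total = 0.0
    for nbrs in adj.values():
        kv = len(nbrs)
        if kv < 2:
            continue  # assumption: C_v = 0 when v has fewer than two neighbours
        nb = list(nbrs)
        links = sum(
            1 for i in range(kv) for j in range(i + 1, kv) if nb[j] in adj[nb[i]]
        )
        total += links / (kv * (kv - 1) / 2)
    return total / len(adj)


def characteristic_path_length(adj):
    """L: shortest-path length in edges, averaged over all reachable vertex pairs."""
    total, pairs = 0, 0
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


lattice = ring_lattice(20, 4)  # the p = 0 graph from the Fig. 1 schematic
print(clustering_coefficient(lattice), characteristic_path_length(lattice))
```

Dividing the values for a rewired graph by these lattice values gives the normalized quantities C(p)/C(0) and L(p)/L(0) plotted in the figure.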

Watts-Strogatz (Nature, 1998)

Page 5: 18-Networks2

Distances and clustering in Watts-Strogatz model

Watts-Strogatz (Nature, 1998)


Page 6: 18-Networks2

Scientific Communication as Sequential Art (Bret Victor)

http://worrydream.com/ScientificCommunicationAsSequentialArt/

Page 7: 18-Networks2

Class Size Paradox

Why do so many schools boast small average class size but then so many students end up in huge classes?

Simple example: each student takes one course; suppose there is one course with 100 students, fifty courses with 2 students.

Dean calculates: (100+50*2)/51 = 3.92

Students calculate: (100*100+100*2)/200 = 51
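The two averages differ only in what is being sampled: the dean averages over courses, the students average over students, which weights each course by its size. A small sketch of that arithmetic (variable names are illustrative):

```python
# One course with 100 students and fifty courses with 2 students each.
class_sizes = [100] + [2] * 50

# Dean's view: average over courses.
dean_average = sum(class_sizes) / len(class_sizes)

# Students' view: each of the s students in a course of size s reports s,
# so the average over students is sum(s * s) / sum(s).
student_average = sum(s * s for s in class_sizes) / sum(class_sizes)

print(round(dean_average, 2), student_average)
```

The size-biased average sum(s^2) / sum(s) is never smaller than the plain average, with equality only when all classes have the same size.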

Page 8: 18-Networks2

Class Size Paradox in Networks

Popular article on this phenomenon by Strogatz:

http://opinionator.blogs.nytimes.com/2012/09/17/friends-you-can-count-on/?_r=0

The average number of friends of a person's friends is greater than the average number of friends of a person (unless everyone has the same number of friends, in which case the two are equal).

Again, a reminder of the importance of considering sampling.
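The network version is the same size-biased sampling: a person with d friends appears d times when we list everyone's friends. A minimal sketch, assuming the "friends of friends" average is taken over all (person, friend) pairs; the function names and the star-shaped toy network are illustrative assumptions.

```python
def mean_degree(adj):
    """Average number of friends of a person."""
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)


def mean_friend_degree(adj):
    """Average number of friends of a friend, over all (person, friend) pairs.

    A vertex of degree d is somebody's friend d times, so this equals
    sum(d^2) / sum(d).
    """
    total_degree = sum(len(nbrs) for nbrs in adj.values())
    return sum(len(nbrs) ** 2 for nbrs in adj.values()) / total_degree


# Hypothetical toy network: person 0 is friends with persons 1 to 4.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
print(mean_degree(star), mean_friend_degree(star))
```

The gap between the two numbers grows with the variance of the degrees, just as the class-size gap grows with the variance of the class sizes.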

Page 9: 18-Networks2

Community Detection

Porter et al survey: http://arxiv.org/pdf/0902.3788v2.pdf

Fig. 0.5. The largest connected component (379 nodes) of the network of network scientists (1589 total nodes), determined by coauthorship of papers listed in two well-known review articles [13, 83] and a small number of papers added manually [86]. Each of the nodes in the network, which we depict using a Kamada-Kawaii visualization [62], is colored according to its community assignment using the leading-eigenvector spectral method [86].

Applications. Armed with the above ideas and algorithms, we turn to selected demonstrations of their efficacy. The increasing rapidity of developments in network community detection has resulted in part from the ever-increasing abundance of data sets (and the ability to extract them, with user cleverness). This newfound wealth (including large, time-dependent data sets) has, in turn, arisen from the massive amount of information that is now routinely collected on websites and by communication companies, governmental agencies, and others. Electronic databases now provide detailed records of human communication patterns, offering novel avenues to map and explore the structure of social, communication, and collaboration networks. Biologists also have extensive data on numerous systems that can be cast into network form and which beg for additional quantitative analyses.

Because of space limitations, we restrict our discussion to five example applications in which community detection has played a prominent role: scientific coauthorship, mobile phone communication, online social networking sites, biological systems, and legislatures. We make no attempt to be exhaustive for any of these examples; we merely survey research (both by others and by ourselves) that we particularly like.

Scientific Collaboration Networks. We know from the obsessive computation of Erdős numbers that scientists can be quite narcissistic. (If you want any further evidence, just take a look at the selection of topics and citations in this section.) In this spirit, we use scientific coauthorship networks as our first example.

A bipartite (two-mode) coauthorship network, with scientists linked to papers

Page 10: 18-Networks2

Community Detection of Committees in Congress

Porter et al survey: http://arxiv.org/pdf/0902.3788v2.pdf

[Figure panels, left and right, each labeled with the House committees: Agriculture, Appropriations, International Relations, Budget, House Administration, Energy/Commerce, Financial Services, Veterans' Affairs, Education, Armed Services, Judiciary, Resources, Rules, Science, Small Business, Official Conduct, Transportation, Government Reform, Ways and Means, Intelligence, Homeland Security]

Fig. 0.4. (Left) The network of committees (squares) and subcommittees (circles) in the 108th U.S. House of Representatives (2003-04), color-coded by the parent standing and select committees and visualized using the Kamada-Kawaii method [62]. The darkness of each weighted edge between committees indicates how strongly they are connected. Observe that subcommittees of the same parent committee are closely connected to each other. (Right) Coarse-grained plot of the communities in this network. Here one can see some close connections between different committees, such as Veterans Affairs/Transportation and Rules/Homeland Security.

until each node is in its own singleton community. This hierarchical partitioning process can then be represented by a tree, or dendrogram (see Fig. 0.2). Such processes can yield a hierarchy of nested modules (see Fig. 0.3), or a collection of modules at one mesoscopic level might be obtained in an algorithm independently from those at another level. However obtained, the community structure of a network refers to the set of graph partitions obtained at each "reasonable" step of such procedures. Note that community structure investigations rely implicitly on using connected network components. (We will assume such connectedness in our discussion of community-detection algorithms below.) Community detection can be applied individually to separate components of networks that are not connected.

Many real-world networks possess a natural hierarchy. For example, the committee assignment network of the U.S. House of Representatives includes the House floor, groups of committees, committees, groups of subcommittees within larger committees, and individual subcommittees [100, 101]. As shown in Fig. 0.4, different House committees are resolved into distinct modules within this network. At a different hierarchical level, small groups of committees belong to larger but less densely-connected modules. To give an example closer to home, let's consider the departmental organization at a university and suppose that the network in Fig. 0.3 represents collaborations among professors. (It actually represents grassland species interactions [23].) At one level of inspection, everybody in the mathematics department might show up in one community, such as the large one in the upper left. Zooming in, however, reveals smaller communities that might represent the department's subfields.

Although network community structure is almost always fairly complicated, several forms of it have nonetheless been observed and shown to be insightful in applications. The structures of communities and between communities are important for the demographic identification of network components and the function of dynamical processes that operate on networks (such as the spread of opinions and diseases) [39]. A community in a social network might indicate a circle of friends, a community in the World Wide Web might indicate a group of pages on closely-related topics, and a

Page 11: 18-Networks2

Community Detection Algorithms

Girvan-Newman algorithm: iteratively remove edges by calculating betweennesses and removing the edge with maximum betweenness.

The quality of the resulting partition is scored with a metric called modularity.

http://www.pnas.org/content/103/23/8577/F1.expansion.html
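A compact sketch of the loop described on this slide: recompute edge betweenness, remove the edge with the largest value, and keep the partition into connected components that scores the highest modularity. This is an illustration under stated assumptions rather than a reference implementation: the function names are hypothetical, ties are broken arbitrarily, vertex labels are assumed to be comparable (integers here), and modularity is the usual Newman-Girvan form, the sum over communities of (internal edges / m) minus (total degree / 2m)^2.

```python
from collections import deque


def edge_betweenness(adj):
    """Number of shortest paths through each edge (Brandes-style accumulation)."""
    eb = {}
    for s in adj:
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        dist = {s: 0}
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
                if dist[w] == dist[u] + 1:
                    sigma[w] += sigma[u]
                    preds[w].append(u)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for u in preds[w]:
                c = sigma[u] / sigma[w] * (1 + delta[w])
                edge = (min(u, w), max(u, w))
                eb[edge] = eb.get(edge, 0.0) + c
                delta[u] += c
    return {e: b / 2 for e, b in eb.items()}  # each pair was counted twice


def components(adj):
    """Connected components as a list of vertex sets."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, queue = {s}, deque([s])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in comp:
                    comp.add(w)
                    queue.append(w)
        seen |= comp
        comps.append(comp)
    return comps


def modularity(adj, communities):
    """Newman-Girvan modularity of a partition, measured on the original graph."""
    m = sum(len(nbrs) for nbrs in adj.values()) / 2
    q = 0.0
    for comm in communities:
        internal = sum(1 for u in comm for w in adj[u] if w in comm) / 2
        degree = sum(len(adj[u]) for u in comm)
        q += internal / m - (degree / (2 * m)) ** 2
    return q


def girvan_newman(adj):
    """Remove max-betweenness edges one by one; return the best-scoring partition."""
    work = {v: set(nbrs) for v, nbrs in adj.items()}
    best = components(work)
    best_q = modularity(adj, best)
    while any(work.values()):
        eb = edge_betweenness(work)
        u, w = max(eb, key=eb.get)
        work[u].discard(w)
        work[w].discard(u)
        parts = components(work)
        q = modularity(adj, parts)
        if q > best_q:
            best, best_q = parts, q
    return best, best_q


# Hypothetical toy graph: two triangles joined by the bridge 2 - 3.
barbell = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
parts, q = girvan_newman(barbell)
print(sorted(sorted(p) for p in parts), round(q, 3))
```

Recomputing betweenness after every removal is what makes the method expensive on large graphs; the sketch favors clarity over speed.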

Page 12: 18-Networks2

Bickel-Chen on Community Detection

Inconsistency result for Newman-Girvan.

http://www.stat.berkeley.edu/~bickel/Bickel%20Chen%2021068.full.pdf


Corollary 1: If the conditions of Theorem 1 hold and

$$\hat{W} \equiv L^{-1} O(\hat{c}, A), \qquad \hat{\pi} = \Big\{ \frac{1}{n} \sum_{i=1}^{n} I(\hat{c}_i = a) : a = 1, \dots, K \Big\},$$

then

$$\sqrt{n}\,(\hat{\pi} - \pi) \Rightarrow N(0, D(\pi) - \pi\pi^T),$$

$$\sqrt{n}\,(\hat{W} - W) \Rightarrow S \cdot (\pi\eta^T + \eta\pi^T) - 2(\pi^T S \eta) W,$$

$$\eta = N(0, D(\pi) - \pi\pi^T),$$

with A · B denoting point-wise product. The limiting variances are what we would get for maximum likelihood estimates if $\hat{c} = c$, i.e. we knew the assignment to begin with. So consistent modularities lead to efficient estimates of the parameters.

This follows since with probability tending to 1, $\hat{c} = c$. To estimate w(·, ·) in the nonparametric case we need K → ∞, and w(·, ·) and π(·) smooth. We approximate by $W_K \sim K^{-2}\|w(aK^{-1}, bK^{-1})\|$, $\pi_K(a) \sim K^{-1}\pi(aK^{-1})$, where w(·, ·), $W_K$ are canonical and the modularity defining $F_K$, $F_K(\cdot, \cdot, \cdot)$ is of order $K^{-2}$. We have preliminary results in that direction but their formulation is complicated and we do not treat them further here.

Consistency of N-G, L-M

We show in SI Appendix using the appropriate $F_{NG}$, $F_{LM}$ that the likelihood modularity is always consistent while the Newman–Girvan is not. This is perhaps not surprising since N-G focuses on the diagonal of O. In fact, we would hope that N-G is consistent under the submodel $\{(\rho, \pi, W) : W_{aa} > \sum_{b \neq a} W_{ab} \text{ for all } a\}$, which corresponds to Newman and Girvan's motivation. We have shown this for K = 2 but it surprisingly fails for K > 2. Here is a counterexample. Let K = 3, $\pi = (1/3, 1/3, 1/3)^T$ and

$$P = \begin{pmatrix} .06 & .04 & 0 \\ .04 & .12 & .04 \\ 0 & .04 & .66 \end{pmatrix}.$$

As n → ∞, with true labeling, $Q_{NG}$ approaches 0.033. However, the maximum $Q_{NG}$, about 0.038, is achieved by merging the first two communities. That is, two sparser communities are merged. This is consistent with an observation of Fortunato and Barthelemy (17).

If for the profile likelihood we maximize only over e such that $W_{aa}(e) > \sum_{b \neq a} W_{ab}(e)$ for all a, we obtain $\hat{c}$ which is consistent under the submodel above, and in the Karate Club example performs like N-G.
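The two limiting values quoted in the counterexample (0.033 for the true labeling, about 0.038 after merging the first two communities) can be checked numerically. The sketch below assumes a particular population form of the Newman-Girvan modularity, Q = sum over groups a of (O_aa − D_a^2 / L) with O_ab = π_a π_b P_ab, D_a = sum_b O_ab and L = sum_ab O_ab. That form is an assumption chosen because it reproduces the quoted numbers; the excerpt itself does not spell out the normalization, and the function name `q_ng` is hypothetical.

```python
def q_ng(P, pi, groups):
    """Limiting Newman-Girvan modularity when the true blocks are merged into `groups`.

    Assumed form: Q = sum_g (O_gg - D_g^2 / L), with O_ab = pi_a * pi_b * P_ab.
    """
    K = len(pi)
    O = [[pi[a] * pi[b] * P[a][b] for b in range(K)] for a in range(K)]
    L = sum(sum(row) for row in O)
    q = 0.0
    for g in groups:
        within = sum(O[a][b] for a in g for b in g)          # edges inside the group
        degree = sum(O[a][b] for a in g for b in range(K))   # all edges touching it
        q += within - degree ** 2 / L
    return q


P = [[0.06, 0.04, 0.00],
     [0.04, 0.12, 0.04],
     [0.00, 0.04, 0.66]]
pi = [1 / 3, 1 / 3, 1 / 3]

true_q = q_ng(P, pi, [[0], [1], [2]])     # true three-community labeling
merged_q = q_ng(P, pi, [[0, 1], [2]])     # first two communities merged
print(round(true_q, 3), round(merged_q, 3))
```

Under this assumed form the merged labeling scores higher than the true one, which is the inconsistency the slide refers to.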

Computational Issues

Computation of optimal assignments using modularities is, in principle, NP hard. However, although the surface is multimodal, in the examples we have considered and generally when the signal is strong, optimization from several starting points using a label switching algorithm (19) works well.

Simulation

We generate random matrices A and maximize $Q_{NG}$, $Q_{LM}$ to obtain node labels respectively, where $Q_{LM}$ is maximized using a label switching algorithm. To make a fair comparison, the initial labeling for $Q_{NG}$ and $Q_{LM}$ is to randomly choose 50% of the nodes with correct labels and the other 50% with random labels. For spectral clustering, we adopt the algorithm of (18) by using the first K eigenvectors of $D(d)^{-1/2} A D(d)^{-1/2}$, where $d = (d_1, \dots, d_n)^T$ and $d_i$ is the degree of the i-th node. We generate the P matrix randomly by forcing symmetry and then add a constant to diagonal entries such that I holds. The π is generated randomly from the simplex. To be precise, the values for Fig. 1 are $\pi = (.203, .286, .511)^T$ and

$$P = b\,n^{-1} \log n \cdot \begin{pmatrix} .43 & .06 & .13 \\ .06 & .34 & .17 \\ .13 & .17 & .40 \end{pmatrix},$$

where n varies from 200 to 1,500 and b varies from 10 to 100. Obviously, Fig. 1 says that the likelihood method exhibits much less incorrect labeling than Newman–Girvan and spectral clustering. This is consistent with theoretical comparison.

Fig. 1. Empirical comparison of Newman–Girvan, likelihood modularities and spectral clustering (18), where K = 3, the number of nodes n varies from 200 to 1500, and the percent of correct labeling is computed from 100 replicates of each simulation case. Here π, P are given in the text.

Data Examples

We compare the L-M and N-G modularity algorithms below with applications to two real data sets. To deal with the issue of non-convex optimization, we simply use many restarting points.

Zachary's "Karate Club" Network. We first compare L-M and N-G with the famous "Karate Club" network of ref. 20, from the social science literature, which has become something of a standard test for community detection algorithms. The network shows the patterns of friendship between the members of a karate club at a US university in the 1970s. The example is of particular interest because shortly after the observation and construction of the network, the club in question split into two components separated by the dashed line as shown in Figs. 2 and 3 as a result of an internal dispute. Fig. 2 Left shows two communities identified by maximizing the likelihood modularity where the shapes of the vertices denote the membership of the corresponding individuals, and similarly the right panel shows communities identified by N-G. Obviously, the N-G communities match the two sub-divisions identified by the split save for one mis-classified individual. The L-M communities are quite different, and obviously one community consists of five individuals with central importance that connect with many other nodes while the other community consists of the remaining individuals. Although not reflecting the split this corresponds to other plausible distinguishing characteristics of the individuals. However, if we force the constraint that within-community density is no less than the density of relationship to all other communities, the submodel we discussed, then we obtain two L-M communities that match the split perfectly. The same partitions as ours with and without constraint have also been reported

Bickel and Chen, PNAS, December 15, 2009, vol. 106, no. 50, p. 21071

Page 13: 18-Networks2

Respondent Driven Sampling (RDS)

• sampling scheme for hard-to-reach populations, based on link-tracing across a social network with coupon incentives

• becoming extremely widely used all over the world; hundreds of studies done or ongoing, e.g., CDC National HIV Behavioral Surveillance (NHBS) studies of injection drug users

• RDS as sampling vs. RDS estimation

[Figure: network diagram with nodes labeled 0 to 12]

Page 14: 18-Networks2

Is RDS contact tracing?

Source: http://www.eurosurveillance.org/

Page 15: 18-Networks2

Recruitment Tree Example

Page 16: 18-Networks2

Volz-Heckathorn RDS Estimator

This is a form of Horvitz-Thompson estimator, reweighting as in importance sampling.

$$\hat{E}(Y) = \frac{\sum_{j=1}^{n} Y_j / d_j}{\sum_{j=1}^{n} 1 / d_j}$$

Relies on a long list of strong assumptions; Handcock-Gile and Blitzstein-Nesterko perform sensitivity analyses under various conditions.
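The estimator is an inverse-degree weighted mean, which takes only a few lines to compute. In this sketch Y_j is the outcome for respondent j and d_j is that respondent's reported degree (network size); the function name and the three-respondent example are hypothetical, and the code makes no attempt to check the assumptions the slide warns about.

```python
def volz_heckathorn(y, d):
    """Inverse-degree weighted mean: sum(Y_j / d_j) / sum(1 / d_j)."""
    if len(y) != len(d) or not y:
        raise ValueError("y and d must be non-empty and of equal length")
    if any(dj <= 0 for dj in d):
        raise ValueError("degrees must be positive")
    numerator = sum(yj / dj for yj, dj in zip(y, d))
    denominator = sum(1 / dj for dj in d)
    return numerator / denominator


# Hypothetical sample: outcome indicator (1 = positive) and reported degree.
outcomes = [1, 0, 1]
degrees = [1, 2, 4]
print(volz_heckathorn(outcomes, degrees), sum(outcomes) / len(outcomes))
```

High-degree respondents are more likely to be recruited, so they are down-weighted; this is why the weighted estimate can differ noticeably from the naive sample mean printed next to it.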

Page 17: 18-Networks2

Goel-Salganik (Stats in Medicine 2009, PNAS 2010):

RDS variances can be extremely large, especially if there are bottlenecks in the network from modularity/communities, and from multiple recruitment. Typical design effects of 5-10, and coverage probabilities much lower than the nominal 95% values!

RDS as MCMC, p. 9

[Figure 2 panels: A, B]

Figure 2. Hypothetical network with an edge between every pair of nodes, where within-group edges have higher weight than between-group edges. Here the two groups are defined by infection status, and a bottleneck exists between healthy and infected individuals. This is the only type of bottleneck that had been considered in the previous RDS literature.

that as long as there were sufficient connections between infected and uninfected individuals, the RDS estimates would be reasonably precise. While this structural feature is certainly a concern, taken in isolation it underestimates the effect of network structure on the variance of RDS estimates. Even when infected and uninfected individuals are relatively well connected, bottlenecks in other parts of the network can lead to large variance.

To illustrate this point, we analyze RDS on two network models in detail. Our examples, while motivated by the qualitative features of real social networks, are not intended to be accurate models of any specific social network. Rather, they provide insight by allowing for exact and interpretable results.

3.1. Two Network Models.

3.1.1. A Two-Group Model. Consider a population V consisting of two groups, A and B, of equal size N/2. Edges exist between every pair of individuals, however within-group edges have weight 1 − c while between-group edges have weight c where 0 < c < 1/2 (see Figure 3(a)) [footnote 12]. That is, within-group relationships are stronger than between-group relationships. In this model, c parameterizes homophily based on group membership, the well-observed social tendency for people to form ties

Footnote 12: This network model allows for self-edges which means that it allows for self-recruitment during the sampling process. This assumption departs from the actual RDS sampling process, but has minimal effect on the qualitative results.

Page 18: 18-Networks2

To consult a statistician after an experiment is finished is often merely to ask him to conduct a post-mortem examination. He can perhaps say what the experiment died of.

-- R. A. Fisher

What would Fisher say?

Page 19: 18-Networks2

To Model or Not To Model; Design-based vs. model-based

• Model the underlying network? What about unknown nodes?

• the recruitment process?

• coupon refusal?

• the outcome variables (such as HIV status)?