
Source: pmittal/publications/link-privacy-ndss13.pdf

Preserving Link Privacy in Social Network Based Systems

Prateek Mittal
University of California, Berkeley
[email protected]

Charalampos Papamanthou
University of California, Berkeley
[email protected]

Dawn Song
University of California, Berkeley
[email protected]

Abstract

A growing body of research leverages social network based trust relationships to improve the functionality of the system. However, these systems expose users' trust relationships, which are considered sensitive information in today's society, to an adversary.

In this work, we make the following contributions. First, we propose an algorithm that perturbs the structure of a social graph in order to provide link privacy, at the cost of a slight reduction in the utility of the social graph. Second, we define general metrics for characterizing the utility and privacy of perturbed graphs. Third, we evaluate the utility and privacy of our proposed algorithm using real world social graphs. Finally, we demonstrate the applicability of our perturbation algorithm on a broad range of secure systems, including Sybil defenses and secure routing.

1 Introduction

In recent years, several proposals have been put forward that leverage users' social network trust relationships to improve system security and privacy. Social networks have been used for Sybil defense [36, 35, 6, 23, 31], secure routing [14, 21, 17], secure reputation systems [32], mitigating spam [19], censorship resistance [30], and anonymous communication [5, 25, 20, 22].

A significant barrier to the deployment of these systems is that they do not protect the privacy of users' trusted social contacts. Information about users' trust relationships is considered sensitive in today's society; in fact, existing online social networks such as Facebook, Google+ and LinkedIn provide explicit mechanisms to limit access to this information. A recent study by Dey et al. [7] found that more than 52% of Facebook users hide their social contact information.

Most protocols that leverage social networks for system security and privacy either explicitly reveal users' trust relationships to an adversary [6] or allow the adversary to easily perform traffic analysis and infer these trust relationships [35]. Thus, the design of these systems is fundamentally in conflict with the current online social network paradigm, which hinders deployment.

In this work, we focus on protecting the privacy of users' trusted contacts (edge/link privacy, not vertex privacy) while still maintaining the utility of higher level systems and applications that leverage the social graph. Our key insight is that for a large class of security applications that leverage social relationships, preserving the exact set of edges in the graph is not as important as preserving the graph-theoretic structural differences between the honest and dishonest users in the system.

This insight motivates a paradigm of structured graph perturbation, in which we introduce noise in the social graph (by deleting real edges and introducing fake edges) such that the local structures in the original social graph are preserved. We believe that for many applications, introducing a high level of noise in such a structured fashion does not reduce the overall system utility.

1.1 Contributions

In this work, we make the following contributions.

• First, we propose a mechanism based on random walks for perturbing the structure of the social graph that provides link privacy at the cost of a slight reduction in application utility (Section 4).

• We define a general metric for characterizing the utility of perturbed graphs. Our utility definition considers the change in graph structure from the perspective of a vertex. We formally relate our notion of utility to global properties of social graphs, such as mixing times and the second largest eigenvalue modulus of graphs, and analyze the utility properties of our perturbation mechanism using real world social networks (Section 5).

• We define several metrics for characterizing link privacy, and consider prior information that an adversary may have for de-anonymizing links. We also formalize the relationship between utility and privacy of perturbed graphs, and analyze the privacy properties of our perturbation mechanism using real world social networks (Section 6).

• Finally, we experimentally demonstrate the real world applicability of our perturbation mechanism on a broad range of secure systems, including Sybil defenses and secure routing (Section 7). In fact, we find that for Sybil defenses, our techniques are of interest even outside the context of link privacy.

2 Related Work

Work in this space can be broadly classified into two categories: (a) mechanisms for protecting the privacy of links between labeled vertices, and (b) mechanisms for protecting node/graph privacy when vertices are unlabeled.

For many real world applications relying on social networks, such as reputation systems, recommendation systems, Sybil defenses, secure routing, and anonymous communication, it is critical for a user to know the identities of the vertices in the social graph it interacts with. Protocols that preserve vertex privacy (unlabeled vertices) would not be appropriate for such applications. On the other hand, these applications could reveal sensitive social contacts to an adversary; thus the focus of this work is on protecting the privacy of links between labeled vertices.

2.1 Link privacy between labeled vertices

There are two main mechanisms for preserving link privacy between labeled vertices. The first approach is to perform clustering of vertices and edges, and aggregate them into super vertices (e.g., [11] and [37]). In this way, information about the corresponding sub-graphs can be anonymized. While these clustering approaches permit analysis of some macro-level graph properties, they are not suitable for black-box application of existing social network based applications, such as Sybil defenses. The second class of approaches aims to introduce perturbation in the social graph by adding and deleting edges and vertices. Next, we discuss this line of research in more detail.

Hay et al. [12] propose a perturbation algorithm which applies a sequence of k edge deletions followed by k random edge insertions. Candidates for edge deletion are sampled uniformly at random from the space of existing edges in graph G, while candidates for edge insertion are sampled uniformly at random from the space of edges not in G. The key difference between our perturbation mechanism and that of Hay et al. is that we sample edges for insertion based on the structure of the original graph (as opposed to random selection). We compare our approach with that of Hay et al. in Section 6.

Ying and Wu [34] study the impact of Hay et al.'s perturbation algorithms [12] on the spectral properties of graphs, as well as on link privacy. They also propose a new perturbation algorithm that aims to preserve the spectral properties of graphs, but do not analyze its privacy properties.

Korolova et al. [13] show that the link privacy of the overall social network can be breached even if information about the local neighborhood of social network nodes is leaked (for example, via a look-ahead feature for friend discovery).

2.2 Anonymizing the vertices

The techniques described above reveal the identity of the vertices in the social graph but add noise to the relationships between them. In contrast, various works in the literature aim at anonymizing the identities of the nodes in the social network. This line of research is orthogonal to our goals, but we describe it for completeness.

The straightforward approach of simply removing the identifiers of the nodes before publishing the social graph does not always guarantee privacy, as shown by Backstrom et al. [2]. To deal with this problem, Liu and Terzi [15] propose a systematic framework for identity anonymization on graphs, where they introduce the notion of k-degree anonymity. Their goal is to minimally modify the graph by changing the degrees of specially-chosen nodes so that the identity of each individual involved is protected. An efficient version of their algorithm was recently implemented by Lu et al. [16].

Another notion of graph anonymity in social networks is presented by Pei and Zhou [38]: a graph is k-anonymous if for every node there exist at least k − 1 other nodes that share isomorphic neighborhoods. This is a stronger definition than the one in [15], where only vertex degrees are considered.

Zhou and Pei [39] recently introduced another notion called l-diversity for social network anonymization. In this case, each vertex is associated with some non-sensitive attributes and some sensitive attributes. Maintaining the privacy of the individual in this scenario relies on the adversary not being able (with high probability) to re-identify the sensitive attribute values of the individual.

Finally, Narayanan and Shmatikov [26] show some of the weaknesses of the above anonymization techniques, propose a generic way of modeling the release of anonymized social networks, and report on successful de-anonymization attacks on popular networks such as Flickr and Twitter.

2.3 Differential privacy and social networks

Sala et al. [29] use differential privacy (a more elaborate tool for adding noise) to publish social networks with privacy guarantees. Given a social network and a desired level of differential privacy guarantee, they extract a detailed structural summary of the graph into degree correlation statistics, introduce noise into these statistics, and use them to generate a new synthetic social network with differential privacy. However, their approach does not preserve the utility of the social graph from the perspective of a vertex (since vertices in their graph are unlabeled), and thus cannot be used for many real world applications such as Sybil defenses.

Also, Rastogi et al. [28] introduce a relaxed notion of differential privacy for data with relationships, so that more expressive queries (e.g., joins) can be supported without significant harm to utility.

2.4 Link privacy preserving applications

X-Vine [21] proposes to perform DHT routing using social links, in a manner that preserves the privacy of social links. However, the threat model in X-Vine excludes adversaries that have prior information about the social graph. Thus, in real world settings, X-Vine is vulnerable to the Narayanan-Shmatikov attack [26]. Moreover, the techniques in X-Vine are specific to DHT routing, and cannot be used to design a general purpose defense mechanism for social network based applications, which is the focus of this work.

3 Basic Theory

Before we introduce our mechanism, we present some necessary notation and background on graph theory.

Let us denote the social graph as G = (V, E), comprising the set of vertices V (wlog assume the vertices have labels 1, . . . , n) and the set of edges E, where |V| = n and |E| = m. The focus of this paper is on undirected graphs, where the edges are symmetric. Let A_G denote the n × n adjacency matrix corresponding to the graph G, namely A_ij = 1 if (i, j) ∈ E, and A_ij = 0 otherwise.

A random walk on a graph G starting at a vertex v is a sequence of vertices comprising a random neighbor v_1 of v, then a random neighbor v_2 of v_1, and so on. A random walk on a graph can be viewed as a Markov chain. We denote the transition probability matrix of the random walk/Markov chain as P, given by

    P_ij = 1/deg(i) if (i, j) is an edge in G, and P_ij = 0 otherwise,    (1)

where deg(i) denotes the degree of the vertex i. At any given iteration t of the random walk, let us denote by π(t) the probability distribution of the random walk state at that iteration (π(t) is a vector of n entries). The state distribution after t iterations is given by π(t) = π(0) · P^t, where π(0) is the initial state distribution. The probability of a t-hop random walk starting from i and ending at j is given by (P^t)_ij. From now on, to simplify the notation, we write (P^t)_ij as P^t_ij to denote the element (i, j) of the matrix P^t.
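As a quick numeric illustration of these definitions, the following sketch builds P from the adjacency matrix of a small toy graph (our own example, not one of the paper's datasets) and evolves a walk distribution:

```python
import numpy as np

# Toy undirected graph on 4 vertices (a 4-cycle with one chord).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)

deg = A.sum(axis=1)                   # deg(i)
P = A / deg[:, None]                  # P_ij = 1/deg(i) if (i, j) in E, else 0

pi0 = np.array([1.0, 0.0, 0.0, 0.0])  # walk starts at vertex 0
Pt = np.linalg.matrix_power(P, 3)     # P^t for t = 3
pi_t = pi0 @ Pt                       # pi(t) = pi(0) . P^t
# (P^t)_ij = Pt[i, j] is the probability of a t-hop walk from i to j
```

Each row of P sums to one, so every π(t) remains a probability distribution.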

For irreducible and aperiodic graphs (which undirected and connected social graphs are), the corresponding Markov chain is ergodic, and the state distribution of the random walk π(t) converges to a unique stationary distribution denoted by π. The stationary distribution π satisfies π = π · P.

For undirected and connected social graphs, we can see that the probability distribution π_i = deg(i)/(2m) satisfies the equation π = π · P, and is thus the unique stationary distribution of the random walk.

Let us denote the eigenvalues of A as λ_1 ≥ λ_2 ≥ . . . ≥ λ_n, and the eigenvalues of P as ν_1 ≥ ν_2 ≥ . . . ≥ ν_n. The eigenvalues of both A and P are real. We denote the second largest eigenvalue modulus (SLEM) of the transition matrix P as µ = max(|ν_2|, |ν_n|). The eigenvalues of the matrices A and P are closely related to structural properties of graphs, and are considered utility metrics in the literature.
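Both facts are easy to check numerically. The sketch below (helper names are ours) verifies π = π · P for π_i = deg(i)/(2m), and computes the SLEM via the symmetric matrix D^(−1/2) A D^(−1/2), which has the same (real) eigenvalues as P:

```python
import numpy as np

def stationary(A):
    """pi_i = deg(i) / (2m) for an undirected graph given by adjacency A."""
    deg = A.sum(axis=1)
    return deg / deg.sum()               # deg.sum() = 2m

def slem(A):
    """Second largest eigenvalue modulus of P = D^{-1} A.

    P is similar to the symmetric matrix D^{-1/2} A D^{-1/2}, so its
    eigenvalues are real and can be computed stably with eigvalsh.
    """
    deg = A.sum(axis=1)
    S = A / np.sqrt(np.outer(deg, deg))
    nu = np.sort(np.linalg.eigvalsh(S))  # ascending: nu_n <= ... <= nu_1 = 1
    return max(abs(nu[-2]), abs(nu[0]))  # mu = max(|nu_2|, |nu_n|)
```

For the complete graph K4, for instance, P = (J − I)/3 has eigenvalues {1, −1/3, −1/3, −1/3}, so slem returns 1/3.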

4 Structured Perturbation

4.1 System model and goals

For the deployment of secure applications that leverage users' trust relationships, we envision a scenario where these applications bootstrap users' trust relationships using existing online social networks (OSNs) such as Facebook or Google+.


However, most applications that leverage this information do not make any attempt to hide it; thus an adversary can exploit protocol messages to learn the entire social graph.

Our vision is that OSNs can support these applications while protecting the link privacy of users by introducing noise in the social graph. Of course, the addition of noise must be done in a manner that still preserves application utility. Moreover, the mechanism for introducing noise should be computationally efficient, and must not present an undue burden to the OSN operator.

We need a mechanism that takes the social graph G as input and produces a transformed graph G′ = (V, E′), such that the vertices in G′ remain the same as in the original input graph G, but the set of edges is perturbed to protect link privacy. The constraint on the mechanism is that the application utility of systems that leverage the perturbed graph should be preserved. Conventional metrics of utility include the degree sequence and graph eigenvalues; we define a general metric for the utility of perturbed graphs in the following section. There is a tradeoff between the privacy of links in the social graph and the utility derived from perturbed graphs: as more and more noise is added to the social graph, link privacy increases, but the corresponding utility decreases.

4.2 Perturbation algorithm

We propose the paradigm of structured graph perturbation, in which we introduce noise in the social graph (by deleting real edges and introducing fake edges) such that the local structures in the original social graph are preserved.

Let t be the parameter that governs how much noise we wish to inject into the social graph. We propose that for each node u in graph G, we perturb all of u's contacts as follows. Suppose that node v is a social contact of node u. Then we perform a random walk of length t − 1 starting from node v. Let node z denote the terminus of the random walk. Our main idea is that instead of the edge (u, v), we introduce the edge (u, z) in the graph G′. It is possible that the random walk terminates at node u itself, or that node z is already a social contact of u in the transformed graph G′ (due to a previously added edge). To avoid self loops and duplicate edges, we perform another random walk from vertex v until a suitable terminus vertex is found, or we reach a threshold number of tries, denoted by the parameter M.

For undirected graphs, the algorithm described so far would double the number of edges in the perturbed graph: for each edge (u, v) in the original graph, an edge would be added between vertex u and the terminus of the random walk from vertex v, as well as between vertex v and the terminus of the random walk from vertex u. To preserve the degree distribution, we could add an edge between vertex u and vertex z in the transformed graph with probability 0.5. However, this could lead to low degree nodes becoming disconnected from the social graph with non-trivial probability. To account for this case, we add the first edge corresponding to vertex u with probability 1, while the remaining edges are accepted with a reduced probability that preserves the degree distribution. The overall algorithm is depicted in Algorithm 1. The computational complexity of our algorithm is O(m).

Algorithm 1 Transform(G, t, M): Perturb undirected graph G using perturbation parameter t and maximum loop count M.

    G′ = null
    foreach vertex u in G
        count = 1
        foreach neighbor v of vertex u
            loop = 1
            do
                perform a (t − 1)-hop random walk from vertex v;
                let z denote the terminal vertex of the random walk;
                loop++
            while (u = z ∨ (u, z) ∈ G′) ∧ (loop ≤ M)
            if loop ≤ M
                if count = 1
                    add edge (u, z) to G′
                else
                    add edge (u, z) to G′ with probability (0.5 · deg(u) − 1)/(deg(u) − 1),
                    where deg(u) denotes the degree of u in G
                count++
    return G′
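Algorithm 1 can be sketched in Python as follows; the dict-of-sets graph representation and helper names are ours, and we assume a connected input graph so that every random walk step has a neighbor to move to:

```python
import random

def transform(G, t, M):
    """Sketch of Algorithm 1: perturb undirected graph G (a dict mapping each
    vertex to a set of neighbors) with perturbation parameter t and maximum
    loop count M. Returns a perturbed graph G' over the same vertex set."""
    Gp = {u: set() for u in G}
    for u in G:
        count = 1
        for v in G[u]:
            loop = 1
            while True:
                z = v                           # (t-1)-hop random walk from v
                for _ in range(t - 1):
                    z = random.choice(tuple(G[z]))
                loop += 1
                # retry while z would create a self loop or a duplicate edge
                if (z != u and z not in Gp[u]) or loop > M:
                    break
            if loop <= M:
                d = len(G[u])                   # deg(u) in the original graph
                # first replacement edge with probability 1, the rest with a
                # reduced probability to preserve the expected degree
                if count == 1 or random.random() < (0.5 * d - 1) / (d - 1):
                    Gp[u].add(z)
                    Gp[z].add(u)
                count += 1
    return Gp
```

By construction, the output has the same vertex set, no self loops, and symmetric edges.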

4.3 Visual depiction of algorithm

For our evaluation, we consider two real world social network topologies: (a) the Facebook friendship graph from the New Orleans regional network [33], comprising 63,392 users with 816,886 edges amongst them, and (b) the Facebook interaction graph from the New Orleans regional network [33], comprising 43,953 users with 182,384 edges amongst them. Mohaisen et al. [24] found that pre-processing social graphs to exclude low degree nodes significantly changes the graph theoretic characteristics. Therefore, we did not pre-process the datasets at all.

Figure 1. Facebook dataset link topology: (a) original graph; perturbed with (b) t = 5, (c) t = 10, (d) t = 15, and (e) t = 20. The color coding in (a) is derived using a modularity based community detection algorithm; the remaining panels reuse the vertex colors from (a). Short random walks preserve the community structure of the social graph, while introducing a significant amount of noise.

Figure 1 depicts the original Facebook friendship graph, and the perturbed graphs generated by our algorithm for varying perturbation parameters, using a force directed algorithm to lay out the graph [3]. The color coding of nodes in the figure was obtained by running a modularity based community detection algorithm [3] on the original Facebook friendship graph, which yielded three communities. For the perturbed graphs, we used the same vertex colors as in the original graph. This representation allows us to visually observe the perturbation of the community structure of the social graph. We can see that for small values of the perturbation parameter, the community structure (related to utility) is strongly preserved, even though the edges between vertices are randomized. As the perturbation parameter increases, the graph loses its community structure, and eventually begins to resemble a random graph.

Figure 2 depicts a similar visualization for the Facebook interaction graph. In this setting, a modularity based community detection algorithm found two communities in the original graph. We can see a similar trend in the Facebook interaction graph as well: for small values of the perturbation parameter, the community structure is somewhat preserved, even though significant randomization has been introduced in the links. In the following sections, we formally quantify the utility and privacy properties of our perturbation mechanism.

5 Utility

In this section, we develop formal metrics to characterize the utility of perturbed graphs, and then analyze the utility of our perturbation algorithm.

5.1 Metrics

One approach to measuring utility would be to consider global graph theoretic metrics, such as the second largest eigenvalue modulus of the graph transition matrix P. However, from a user perspective, it may be the case that a user's position in the perturbed graph relative to malicious users is much worse, even though the global graph properties remain the same. This motivates our first definition of the utility of a perturbed graph from the perspective of a single user.

Figure 2. Facebook dataset interaction topology: (a) original graph; perturbed with (b) t = 5, (c) t = 10, (d) t = 15, and (e) t = 20. Short random walks preserve the community structure of the social graph, while introducing a significant amount of noise.

Definition 1. The vertex utility of a perturbed graph G′ for a vertex v, with respect to the original graph G and an application parameter l, is defined as the statistical distance between the probability distributions induced by l-hop random walks starting from v in graphs G and G′, i.e.,

    VU(v, G, G′, l) = distance(P^l_v(G), P^l_v(G′)) ,

where P^l_v denotes the v-th row of the matrix P^l.

In the above definition, the parameter l is linked to the higher level applications that leverage social graphs. For example, Sybil defense mechanisms exploit the large scale community structure of social networks, where the application parameter is l ≥ 10. For applications such as recommendation systems, it may be more important to preserve local community characteristics, and l could be set to a smaller value.

Random walks are intimately linked to the structure of communities and graphs, so it is natural to consider their use when defining the utility of perturbed graphs. In fact, many security applications, such as Sybil defenses and anonymous communication, directly exploit the trust relationships in social graphs by performing random walks themselves.

There are several ways to define the statistical distance between probability distributions P and Q [4] (assume P and Q are vectors of n entries summing up to one). In this work, we consider the following three notions.

The total variation distance between two probability distributions is a measure of the maximum difference between the probability distributions for any individual element. Namely,

    distance_v(P, Q) = ||P − Q||_tvd = sup_i |p_i − q_i| .

As we will discuss shortly, the total variation distance is closely related to the computation of several graph properties, such as the mixing time and the second largest eigenvalue modulus. However, the total variation distance only considers the maximum difference between the probability distributions for a single element, and not the differences in probabilities corresponding to the other elements of the distribution. This motivates the use of the Hellinger distance, defined as

    distance_h(P, Q) = (1/√2) · sqrt( Σ_{i=1}^{n} (√p_i − √q_i)^2 ) .

The Hellinger distance is related to the Euclidean distance between the square root vectors of P and Q. Finally, we also consider the Jensen-Shannon distance measure, which takes an information theoretic approach of averaging the Kullback-Leibler divergence between P and Q, and between Q and P (since the Kullback-Leibler divergence by itself is not symmetric), and is defined as

    distance_j(P, Q) = (1/2) · Σ_{i=1}^{n} p_i log(p_i/q_i) + (1/2) · Σ_{i=1}^{n} q_i log(q_i/p_i) .
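The three distance measures can be written down directly. A sketch (assuming strictly positive distributions, so the logarithms in the last measure stay finite):

```python
import numpy as np

def distance_v(p, q):
    """Total variation style distance: sup_i |p_i - q_i|."""
    return np.max(np.abs(np.asarray(p) - np.asarray(q)))

def distance_h(p, q):
    """Hellinger distance: (1/sqrt(2)) * ||sqrt(p) - sqrt(q)||_2."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2) / 2.0)

def distance_j(p, q):
    """Averaged (symmetrized) Kullback-Leibler divergence, as defined above.
    Assumes p and q are strictly positive."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return 0.5 * np.sum(p * np.log(p / q)) + 0.5 * np.sum(q * np.log(q / p))
```

All three vanish when p = q, and the last is symmetric by construction.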

Using these notions, we can compute the utility of the perturbed graph with respect to an individual vertex (vertex utility). Note that a lower value of VU(v, G, G′, l) corresponds to higher utility (we want the distance between the probability distributions over the original and perturbed graphs to be low). Using the concept of vertex utility, we can define metrics for the overall utility of a perturbed graph.

Definition 2. The overall mean vertex utility of a perturbed graph G′ with respect to the original graph G and an application parameter l is defined as the mean utility over all vertices in G, i.e.,

    VU_mean(G, G′, l) = (1/|V|) · Σ_{v∈V} distance(P^l_v(G), P^l_v(G′)) .

Similarly, the maximum vertex utility (worst case) of a perturbed graph G′ is defined by computing the maximum of the utility values over all vertices in G, i.e.,

    VU_max(G, G′, l) = max_{v∈V} { distance(P^l_v(G), P^l_v(G′)) } .
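Given any distance function on distributions, both metrics reduce to row-wise comparisons of P^l. A sketch (helper names are ours):

```python
import numpy as np

def walk_rows(A, l):
    """Rows of P^l for the graph with adjacency matrix A."""
    P = A / A.sum(axis=1, keepdims=True)
    return np.linalg.matrix_power(P, l)

def vertex_utilities(A, A_pert, l, distance):
    """VU(v, G, G', l) for every vertex v (lower values mean higher utility)."""
    Pl, Ql = walk_rows(A, l), walk_rows(A_pert, l)
    return np.array([distance(Pl[v], Ql[v]) for v in range(len(A))])

def vu_mean(A, A_pert, l, distance):
    return vertex_utilities(A, A_pert, l, distance).mean()

def vu_max(A, A_pert, l, distance):
    return vertex_utilities(A, A_pert, l, distance).max()
```

Comparing a graph with itself gives zero under both metrics, as expected.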

The notion of maximum vertex utility is particularly interesting, especially in conjunction with the use of the total variation distance, because of its relationship to global graph metrics such as mixing times and the second largest eigenvalue modulus, which we demonstrate next. Our analysis shows the generality of our formal definition of utility.

5.2 Metrics analysis

Towards this end, we first introduce the notion of the mixing time of a Markov process. The mixing time of a Markov process is a measure of the minimum number of steps needed to converge to its unique stationary distribution. Formally, the mixing time of a graph G = (V, E) is defined as

    τ_G(ε) = max_{v∈V} { min_{t>0} { t : |P^t_v(G) − π| < ε } } ,

where π is the stationary distribution defined in Section 3. The following two theorems bound global properties of the perturbed graph using global properties of the original graph and the utility metric. We defer the proofs of these theorems to the Appendix.
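For small graphs, τ_G(ε) can be estimated by brute force, iterating P until the worst case start vertex is within ε of π (a sketch; the sup-norm reading of |P^t_v − π| follows the definition above):

```python
import numpy as np

def mixing_time(A, eps, t_max=1000):
    """Smallest t such that the t-step walk distribution from every start
    vertex is within eps of the stationary distribution pi = deg / 2m.
    Returns None if t_max iterations are not enough."""
    deg = A.sum(axis=1)
    P = A / deg[:, None]
    pi = deg / deg.sum()
    Pt = np.eye(len(A))
    for t in range(1, t_max + 1):
        Pt = Pt @ P
        if np.max(np.abs(Pt - pi)) < eps:  # worst case over start vertices v
            return t
    return None
```

On the complete graph K4 with ε = 0.25 this returns 2: after a single step the walker can never be at its start vertex, so one step is not enough.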

Theorem 1. Let VU_max(G, G′, l) be the maximum vertex utility distance between the perturbed graph G′ and the original graph G. Then τ_G′(ε + VU_max(G, G′, τ_G(ε))) ≤ τ_G(ε).

Theorem 1 relates the mixing time of the perturbed graph to the mixing time of the original graph and the max vertex utility metric, for application parameter l = τ_G(ε).

Theorem 2. Let µ_G be the second largest eigenvalue modulus (SLEM) of the transition matrix P_G of graph G. We can bound the SLEM µ_G′ of a perturbed graph G′ using the mixing time τ_G(ε) of the original graph and the worst case vertex utility distance χ = VU_max(G, G′, τ_G(ε)) between G and G′. Namely, we have

    µ_G′ ≤ 2τ_G(ε) / ( 2τ_G(ε) + log(1/(2ε + 2χ)) ) .

Theorem 2 relates the second largest eigenvalue modulus of the perturbed graph to the mixing time of the original graph and the worst case vertex utility metric, for application parameter l = τ_G(ε).

These theorems show the generality of our utility definitions: mechanisms that provide good utility (i.e., have low values of VU_max) introduce only a small change in the mixing time and SLEM of the perturbed graphs.

5.3 Algorithm analysis

Our results above show the general relationship between our utility metrics and global graph properties (which holds for any perturbation algorithm). Next, we analyze the properties of our proposed perturbation algorithm.

First, we empirically compute the mean vertex utility of the perturbed graphs (VUmean), for varying perturbation parameters and varying application parameters. Figure 3 depicts the mean vertex utility for the Facebook interaction and friendship graphs, using the Jensen-Shannon information-theoretic distance metric. We can see that as the perturbation parameter increases, the distance metric increases. This is not surprising, since additional noise will increase the distance between probability distributions computed from the original and perturbed graphs. We can also see that as the application parameter l increases, the distance metric decreases. This illustrates that our perturbation algorithm is ideally suited for security applications that rely on local or global community structures, as opposed to applications that require exact information about one or two hop neighborhoods. We can see a similar trend when using the Hellinger distance to compute the distance between probability distributions, as shown in Figure 4.

[Figure 3: two panels, (a) and (b); y-axis: Jensen-Shannon distance w.r.t. original graph; x-axis: transient random walk length; curves for t = 2, 3, 4, 5, 10.]

Figure 3. Jensen-Shannon distance between transient probability distributions for the original graph and the transformed graph, using (a) the Facebook interaction graph and (b) the Facebook link graph. We can see that as the original graph is perturbed to a larger degree, the distance between the original and transformed transient distributions increases, decreasing application utility.

[Figure 4: two panels, (a) and (b); y-axis: Hellinger distance w.r.t. original graph; x-axis: transient random walk length; curves for t = 2, 3, 4, 5, 10.]

Figure 4. Hellinger distance between transient probability distributions for the original graph and the transformed graph, using (a) the Facebook interaction graph and (b) the Facebook link graph. We can see that even with different notions of distance between probability distributions (Hellinger/Jensen-Shannon), the distance between the original graph and the perturbed graph increases monotonically with the perturbation degree.
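The per-vertex distance computation behind Figures 3 and 4 can be sketched as follows. This is a hedged illustration, not the authors' code: it assumes the graphs are given as NumPy adjacency matrices, and measures the Jensen-Shannon distance between the l-hop transition distributions of each vertex in the original and perturbed graphs.

```python
import numpy as np

def transition_matrix(A):
    """Row-stochastic random-walk transition matrix from adjacency matrix A."""
    return A / A.sum(axis=1, keepdims=True)

def js_distance(p, q, eps=1e-12):
    """Jensen-Shannon distance (base 2, bounded by 1) between distributions."""
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log2((a + eps) / (b + eps)))
    return np.sqrt(max(0.0, 0.5 * kl(p, m) + 0.5 * kl(q, m)))

def vertex_utility(A, A_pert, l):
    """Per-vertex JS distance between the l-hop transition distributions
    of the original and perturbed graphs."""
    P = np.linalg.matrix_power(transition_matrix(A), l)
    Q = np.linalg.matrix_power(transition_matrix(A_pert), l)
    return np.array([js_distance(P[v], Q[v]) for v in range(len(A))])

# toy example: a 4-cycle vs. the same vertex set with one edge rewired
A = np.array([[0,1,0,1],[1,0,1,0],[0,1,0,1],[1,0,1,0]], dtype=float)
B = np.array([[0,1,1,0],[1,0,1,0],[1,1,0,1],[0,0,1,0]], dtype=float)
d = vertex_utility(A, B, l=3)
```

Averaging `d` over all vertices gives the VUmean quantity plotted in Figure 3; swapping in the Hellinger distance gives the analogue of Figure 4.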

Theorem 3. The expected degree of each node after the perturbation algorithm is the same as in the original graph, i.e., ∀v ∈ V it holds that E(deg′(v)) = deg(v), where deg′(v) denotes the degree of vertex v in G′.

Proof. In expectation, half the degree of any node v is preserved via outgoing random walks from v in the perturbation process. To prove the theorem, we need to show that each node v is the terminal point of deg(v)/2 random walks in the perturbation mechanism (on average). From the time reversibility property of the random walks, we have that P^t_ij · πi = P^t_ji · πj. Thus for any node i, the incoming probability of a random walk starting from node j is P^t_ji = P^t_ij · deg(i)/deg(j), i.e., it is proportional to the node degree of i. Thus the expected number of random walks terminating at node i in the perturbation algorithm is given by Σ_{v∈V} deg(v) · P^{t−1}_vi. This is equivalent to Σ_{v∈V} P^{t−1}_iv · deg(i) = deg(i). Since half of these walks will be added to the graph G′ on average, we have that E(deg′(v)) = deg(v).

Corollary 1. The expected value of the largest eigenvalue of the transformed graph is bounded as max{davg, √dmax} ≤ E(λ′1) ≤ dmax, where davg and dmax are the average and maximum degrees respectively.


[Figure 5: two panels, (a) and (b); y-axis: node degree (log scale); x-axis: node index; curves for the original graph and t = 5, 10.]

Figure 5. Degree distribution of nodes using (a) the Facebook interaction graph and (b) the Facebook link graph. We can see that the expected degree of each node after the perturbation process remains the same as in the original graph.

[Figure 6: two panels, (a) and (b); y-axis: total variation distance (log scale); x-axis: random walk length; curves for the original graph and t = 3, 5, 15, 20.]

Figure 6. Total variation distance as a function of random walk length using (a) the Facebook interaction graph and (b) the Facebook link graph. We can see that increasing the perturbation parameter of our algorithm reduces the mixing time of the graph.

From the Perron-Frobenius theorem, we have that the largest eigenvalue of the graph is related to the notion of average graph degree as follows:

    max{d′avg, √d′max} ≤ λ′1 ≤ d′max.

Taking expectations in the above equation and using the previous theorem yields the corollary.

Next, we show experimental results validating our theorem. Figure 5 depicts the node degrees of the original graphs and the expected node degrees of the perturbed graphs, for all nodes in the Facebook interaction and friendship graphs. In this figure, the points on a vertical line for different perturbed graphs correspond to the same node index. We can see that the degree distributions are nearly identical, validating our theoretical results.

Theorem 4. Using our perturbation algorithm, the mixing time of the perturbed graph is related to the mixing time of the original graph as follows: τG(ε)/t ≤ E(τG′(ε)) ≤ τG(ε).

Theorem 4 bounds the mixing time of the perturbed graph using the mixing time of the original graph and the perturbation parameter t. We defer the proof to the Appendix. Finally, we compute the mixing time of the original and perturbed graphs using simulations. Figure 6 depicts the total variation distance between random walks of length x and the stationary distribution, for the original and perturbed graphs.¹ We can see that as the perturbation parameter increases, the total variation distance (and the mixing time) decreases. Moreover, for small values of the perturbation parameter, the difference from the original topology is small. As an aside, it is interesting to note that the variation distance for the Facebook friendship graph is orders of magnitude smaller than for the Facebook interaction graph. This is because the Facebook interaction graph is a lot sparser, resulting in slow mixing.
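The measurement behind Figure 6 can be sketched as follows. This is an illustrative sketch, assuming the graph is an adjacency matrix: it tracks the total variation distance between the x-step walk distribution and the stationary distribution π_v = deg(v)/2m as x grows.

```python
import numpy as np

def tv_mixing_curve(A, max_len, start=0):
    """Total variation distance between the x-step random-walk
    distribution (started at `start`) and the stationary distribution."""
    P = A / A.sum(axis=1, keepdims=True)
    pi = A.sum(axis=1) / A.sum()             # stationary dist: deg(v) / 2m
    p = np.zeros(len(A)); p[start] = 1.0
    curve = []
    for _ in range(max_len):
        p = p @ P                            # one step of the walk
        curve.append(0.5 * np.abs(p - pi).sum())
    return curve

# toy example: a 4-cycle with one chord (connected and aperiodic)
A = np.array([[0,1,1,0],[1,0,1,1],[1,1,0,1],[0,1,1,0]], dtype=float)
curve = tv_mixing_curve(A, max_len=20)
```

The mixing time τG(ε) is then the smallest x at which the curve (maximized over start vertices) drops below ε.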

6 Privacy

We now address the question of understanding the link privacy of our perturbation algorithm. We use several notions for quantifying link privacy, which fall into two categories: (a) quantifying exact probabilities of de-anonymizing a link given specific adversarial priors, and (b) quantifying the risk of de-anonymizing a link without making specific assumptions about adversarial priors. We also characterize the relationship between utility and privacy of a perturbed graph.

¹ Variation distance has a slight oscillating behavior at odd and even steps of the random walk; this phenomenon is also observed in Figure 8.

6.1 Bayesian formulation for link privacy

Definition 3 (Link privacy). Let G be the original graph and L be a link (edge) belonging to G. Let G′ be the perturbed graph, as computed by our algorithm. The adversary is given G′ and some prior (auxiliary) information H. The privacy of link L is the probability Pr[L ∈ G | G′, H]. Namely, we define the privacy of a link L (or a subgraph) as the probability of existence of the link (or subgraph) in the original graph G under the assumption that the adversary has access to the perturbed graph G′ and prior information H. We write this probability as Pr[L|G′, H].

Note that low values of link probability Pr[L|G′, H] correspond to high privacy. We cast the problem of computing the link probability as a Bayesian inference problem. Using Bayes' theorem, we have that:

    Pr[L|G′, H] = Pr[G′|L, H] · Pr[L|H] / Pr[G′|H].

In the above expression, Pr[L|H] is the prior probability of the link. In Bayesian inference, Pr[G′|H] is a normalization constant that is typically difficult to compute, but this is not an impediment for the analysis, since sampling techniques can be used (as long as the numerator of the Bayesian formulation is computable up to a constant factor [18, 10]). Our key theoretical challenge is to compute Pr[G′|L, H].

To compute Pr[G′|L, H], the adversary has to consider all possible graphs Gp which have the link L and are consistent with the background information H. Thus, we have that:

    Pr[G′|L, H] = Σ_{Gp} Pr[G′|Gp] · Pr[Gp|L, H].   (2)

The adversary can compute Pr[G′|Gp] using knowledge of the perturbation algorithm; we assume that the adversary knows the full details of our perturbation algorithm, including the perturbation parameter t. Observe that given Gp, edges in G′ can be modeled as samples from the probability distribution of t hop random walks from vertices in Gp. Thus we can compute Pr[G′|Gp] using the t hop transition probabilities of vertices in Gp as:

    (1/2)^{2m} · C(2m, m′) · Π_{(i,j)∈E(G′)} (P^t_ij(Gp) + P^t_ji(Gp)) / 2.

[Figure 7: y-axis: CDF; x-axis: link probability (log scale); curves for t = 2, 3, 4, 5, 10.]

Figure 7. Cumulative distribution of link probability Pr[L|G′, H] (x-axis is log scale) under the worst case prior H = G − L, using a synthetic scale-free topology. Small probabilities offer higher privacy protection.

In general, the number of possible graphs Gp that have L as a link and are consistent with the adversary's background information can be very large, and the computation in Equation 2 then becomes intractable.
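The per-graph likelihood term Pr[G′|Gp] can be sketched as follows. This is an illustrative sketch under simplifying assumptions: it evaluates only the product of per-edge terms (P^t_ij(Gp) + P^t_ji(Gp))/2 in log space, dropping the combinatorial prefactor, which is constant across candidate graphs with equal numbers of edges.

```python
import numpy as np

def log_edge_likelihood(A_cand, G_prime_edges, t):
    """Log of the product, over (i, j) in E(G'), of (P^t_ij + P^t_ji)/2,
    where P is the walk transition matrix of the candidate graph."""
    P = A_cand / A_cand.sum(axis=1, keepdims=True)
    Pt = np.linalg.matrix_power(P, t)
    return sum(np.log((Pt[i, j] + Pt[j, i]) / 2.0)
               for (i, j) in G_prime_edges)

# toy: which candidate better explains an observed perturbed edge (0, 2)?
A1 = np.array([[0,1,0],[1,0,1],[0,1,0]], dtype=float)   # path 0-1-2
A2 = np.array([[0,1,1],[1,0,1],[1,1,0]], dtype=float)   # triangle
obs = [(0, 2)]
ll1 = log_edge_likelihood(A1, obs, t=2)
ll2 = log_edge_likelihood(A2, obs, t=2)
```

Comparing such log-likelihoods across candidate graphs is the core step an adversary would repeat inside the sum of Equation 2.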

For evaluation, we consider a special case of this definition: the adversary's prior is the entire original graph without the link L (which is the link for which we want to quantify privacy). Observe that this is a very powerful adversarial prior; we use this prior to shed light on the worst-case link privacy using our perturbation algorithm. Under this prior, we have that:

    Pr[L|G′, G − L] = Pr[G′|L, G − L] · Pr[L|G − L] / Pr[G′|G − L]
                    = Pr[G′|G] · Pr[L|G − L] / Pr[G′|G − L]
                    = Pr[G′|G] · Pr[L|G − L] / Σ_l Pr[G′|G − L + l].   (3)

Using G − L as the adversarial prior constrains the set of possible Gp to a polynomial number. However, even in this setting, we found that the above computation is computationally expensive (> O(n³)) using our real world social networks. Thus, to get an understanding of link privacy in this setting, we generated a 500 node synthetic scale-free topology using the preferential attachment methodology of Nagaraja [25]. The parameters of the scale-free topology were set using the average degree in the Facebook interaction graph. Figure 7 depicts the cumulative distribution of link probability (the probability of de-anonymizing a link, Pr[L|G′, H]) in this setting (worst case prior) for the synthetic scale-free topology. We can see that there is significant variance in the privacy protection received by links in the topology: for example, using perturbation parameter t = 2, 40% of the links have a link probability less than 0.1 (small is better), while 30% of the links have a probability of 1 (can be fully de-anonymized). Note that this is a worst case analysis, since we assume that an attacker knows the entire original graph except the link in question. Furthermore, even in this setting, as the perturbation parameter t increases, the privacy protection received by links substantially improves: for example, using t = 5, all of the links have a link probability less than 0.01, while 70% of the links have a link probability of less than 0.001. Thus we can see that even in this worst case analysis, our perturbation mechanism offers good privacy protection.

We now compare with the work of Hay et al. [12], which proposes a perturbation approach where k real edges are deleted and k fake edges are introduced at random. Even considering k = m/2 (which would substantially hurt utility), 50% of the edges in the perturbed graph are real edges between users; for these edges, Pr[L|G′] = 0.5. Thus we can see that our perturbation mechanism significantly improves privacy compared with the work of Hay et al.

6.2 Relationship between privacy and utility

Intuitively, there is a relationship between link privacy and the utility of the perturbed graphs. Next, we formally quantify this relationship.

Theorem 5. Let the maximum vertex utility of the graph (over all vertices) corresponding to an application parameter l be VUmax(G, G′, l). Then for any two vertices A and B, we have that Pr[LAB|G′] ≥ f(δ), where f(δ) denotes the prior probability of two vertices being friends given that they are both contained in a δ hop neighborhood, and δ is computed as δ = min{k : P^k_AB(G′) − VUmax(G, G′, k) > 0}.

Utility measures the change in graph structure between the original and perturbed graphs. If this change is small (high utility), then an adversary can infer information about the original graph given the perturbed graph. For a given level of utility, the above theorem demonstrates a lower bound on link privacy.

We defer the proof of the above theorem to the Appendix. The above theorem is a general result that holds for all perturbed graphs. To shed some intuition, we specifically analyze the lower bounds on privacy for our perturbation algorithm (where parameter t governs utility), using the transition probability between A and B in the perturbed graph G′ as a feature to assign probabilities to nodes A and B being friends in the original graph G. We are interested in the quantity Pr[LAB | P^k_AB(G′) > x], for different values of k. Using Bayes' theorem, the probability Pr[LAB | P^k_AB(G′) > x] can be written as

    Pr[P^k_AB(G′) > x | LAB] · Pr[LAB] / Pr[P^k_AB(G′) > x].   (4)

Moreover, we have that

    Pr[P^k_AB(G′) > x] = Pr[P^k_AB(G′) > x | LAB] · Pr[LAB] + Pr[P^k_AB(G′) > x | ¬LAB] · Pr[¬LAB],

where ¬LAB denotes the event that vertices A and B do not have a link.

To get some insight, we computed the probability distributions Pr[P^k_AB(G′) | LAB] and Pr[P^k_AB(G′) | ¬LAB] using simulations on our real world social network topologies. Figure 8 depicts the median value of the respective probability distributions, as a function of the parameter k, for different perturbation parameters t, using the Facebook interaction graph. We can see that the median transition probabilities are higher in the scenario where the two users are originally friends (as opposed to the setting where they are not friends). We can also see that the difference between median transition values in the two scenarios is higher when (a) the perturbation parameter t is small, and (b) the parameter k (the random walk length) is small. This difference is related to the privacy of the link: the larger the difference, the greater the loss in privacy. The insight from this figure is that for small perturbation parameters, the closer two nodes are to each other in the perturbed graph, the higher their chances of being friends in the original graph.

Next, we consider the full distribution of transition probabilities (as opposed to only the median values discussed above). Figure 9(a) depicts the complementary cumulative distribution of P^k_AB(G′) using perturbation parameter t = 2 for the Facebook interaction graph. Again, we can see that when A and B are friends, they have a higher transition probability to each other in the perturbed graph, compared to the scenario when A and B are not friends. Moreover, as the value of k increases, the gap between the distributions becomes smaller. Similarly, as the value of t increases, the gap between the distributions becomes smaller, as depicted in Figures 9(b)-(c). We can use these simulation results to compute the probabilities in Equation 4.


[Figure 8: two panels, (a) and (b); y-axis: median transition probability (log scale); x-axis: transient random walk length; panel (a) curves: friends, t = 2, 3, 4, 5, 10; panel (b) curves: random pairs, t = 2, 3, 4, 5, 10.]

Figure 8. Median transition probability between two vertices in the transformed graph when (a) the two vertices were neighbors (friends) in the original graph and (b) the two vertices were not neighbors in the original graph, for the Facebook interaction graph.

[Figure 9: three panels, (a), (b), and (c); y-axis: complementary CDF (log scale); x-axis: transition probability (log scale); curves for k = 1, 2, for friends and non-friends.]

Figure 9. Complementary cumulative distribution of P^k_AB(G′) using (a) perturbation parameter t = 2, (b) perturbation parameter t = 5, and (c) perturbation parameter t = 10, for the Facebook interaction graph.

[Figure 10: y-axis: CDF; x-axis: link probability (log scale); curves for t = 2, 3, 4, 5, 10.]

Figure 10. Cumulative distribution of link probability Pr[L|G′, H] for the Facebook interaction graph. Smaller probabilities offer higher privacy protection.

One way to analyze the lower bound on link privacy is to choose a uniform prior for vertices being friends in the original graph, i.e., Pr[LAB] = m / C(n, 2). Correspondingly, Pr[¬LAB] = 1 − Pr[LAB].

Figure 10 depicts the cumulative distribution of link probability computed using the above methodology, for the Facebook interaction graph. We can again see the variance in privacy protection received by links in the topology. Using t = 2, 80% of the links have a link probability of less than 0.1. Increasing the perturbation parameter t significantly improves performance; using t = 3, 95% of the links have a link probability less than 0.1, and 90% of the links have a link probability less than 0.01. In summary, our analysis shows that a given level of utility translates into a lower bound on the privacy offered by the perturbation mechanism.

6.3 Risk based formulation for link privacy

The dependence of the Bayesian inference based privacy definitions on the prior of the adversary motivates the formulation of new metrics that are not specific to adversarial priors. We first illustrate a preliminary definition of link privacy (which is unable to account for links to degree 1 vertices), and then subsequently improve it.

Definition 4 (ε-SI link privacy). We define the structural impact (SI) of a link L in graph G with respect to a perturbation mechanism M as the statistical distance between the probability distributions of the output M(G) of the perturbation mechanism (i.e., the set of possible perturbed graphs) when using (a) the original graph G as an input to the perturbation mechanism, and (b) the graph G − L as an input to the perturbation mechanism.


[Figure 11: two panels, (a) and (b); y-axis: CDF; x-axis: ε-SI link privacy (log scale); curves for t = 2, 4, 6, 10.]

Figure 11. Cumulative distribution of ε-SI link privacy for (a) the Facebook interaction graph and (b) the Facebook link graph. Note that the SI privacy definition is not applicable to links for which either vertex has degree 1, since removing that link disconnects the graph.

[Figure 12: two panels, (a) and (b); y-axis: CDF; x-axis: K-anonymity set size (0.1-SE link privacy), 0 to 1000; curves for t = 2, 4, 6, 10.]

Figure 12. Cumulative distribution of K-anonymity set size for ε = 0.1 SE link privacy using (a) the Facebook interaction graph and (b) the Facebook link graph.

Let Pr[G′ = M(G)] denote the probability distribution of perturbed graphs G′ using the perturbation mechanism M and input graph G. A link has ε-SI privacy if the statistical distance

    ||Pr[G′ = M(G)] − Pr[G′ = M(G − L)]|| < ε.

Intuitively, if the SI value of a link is high, then the perturbation process leaks more information about that link. On the other hand, if the SI value of a link is low, then the perturbed graph G′ leaks less information about the link.

As before, we consider the total variation distance as our distance metric between probability distributions. Observe that the links in graph G′ are samples from the probability distributions P^t_v(G), for v ∈ V. So we can bound the difference in probability distributions of perturbed graphs generated from G and G − L by the worst case total variation distance between the probability distributions P^t_v(G) and P^t_v(G − L) over all v ∈ V, i.e., by VUmax(G, G − L, t).
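This bound can be sketched as follows. It is an illustrative sketch assuming adjacency matrices: the ε of ε-SI link privacy is bounded by the worst-case total variation distance between the t-hop distributions P^t_v(G) and P^t_v(G − L) over all vertices v.

```python
import numpy as np

def si_bound(A, link, t):
    """Upper bound on the ε of ε-SI link privacy: the max over v of the
    total variation distance between t-hop distributions of G and G - L."""
    i, j = link
    A_minus = A.copy()
    A_minus[i, j] = A_minus[j, i] = 0.0      # remove the link L
    if (A_minus.sum(axis=1) == 0).any():
        raise ValueError("removing L disconnects a degree-1 vertex")
    Pt = np.linalg.matrix_power(A / A.sum(axis=1, keepdims=True), t)
    Qt = np.linalg.matrix_power(A_minus / A_minus.sum(axis=1, keepdims=True), t)
    return 0.5 * np.abs(Pt - Qt).sum(axis=1).max()

# toy example: remove one edge of a small dense graph
A = np.array([[0,1,1,1],[1,0,1,0],[1,1,0,1],[1,0,1,0]], dtype=float)
eps = si_bound(A, link=(0, 1), t=2)
```

The degree-1 guard mirrors the limitation noted below: the SI definition breaks down for links whose removal disconnects a vertex.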

Note that our preliminary attempt at defining link privacy above does not accommodate links where either endpoint has degree 1, since removal of that link disconnects the graph. This is illustrated in Figure 11, which depicts the cumulative distribution of ε-SI link privacy values. We can see that approximately 30% and 20% of the links in the Facebook interaction and friendship graphs respectively do not receive any privacy protection under this definition (because they are connected to degree 1 vertices). For the remaining links, we can see a similar qualitative trend as before: increasing the perturbation parameter t significantly improves link privacy. To overcome the limitations of this definition, we propose an alternate formulation based on the notion of link equivalence.

Definition 5 (K-anonymous ε-SE link privacy). We define the structural equivalence (SE) between a link L′ and a link L in graph G with respect to a perturbation mechanism M as the statistical distance between the probability distributions of the output of the perturbation mechanism when using (a) the original graph G as an input to the perturbation mechanism, and (b) the graph G − L + L′ as the input to the perturbation mechanism. A link L has K-anonymous ε-SE privacy if there exist at least K links L′ such that

    ||Pr[G′ = M(G)] − Pr[G′ = M(G − L + L′)]|| < ε.

Observe that this definition of privacy is able to account for degree 1 vertices, since they can become connected to the graph via the addition of other alternate links L′. For our experiments, we limited the number of alternate links explored to 1000 links (for computational tractability). Figure 12 depicts the cumulative distribution of anonymity set sizes for links using ε = 0.1 for the Facebook interaction and friendship graphs. For t = 2 we see a trend very similar to the previous definition, where a non-trivial fraction of links do not receive much privacy. Unlike the previous setting, however, as we increase the perturbation parameter t, the anonymity set size for even these links improves significantly. Using t = 10, 50% and 70% of the links in the interaction and friendship graphs respectively achieve the maximum tested anonymity set size of 1000 links.

6.4 Relation to differential privacy

There is an interesting relation between our risk-based privacy definitions and differential privacy [9]. A differentially private mechanism adaptively adds noise to the system to ensure that all user records in a database (links in our setting) receive a threshold privacy protection. In our mechanism, we add a fixed amount of noise (governed by the perturbation parameter t) and observe the variance in ε and anonymity set size values.

7 Applications

In this section, we demonstrate the applicability of our perturbation mechanism to social network based systems.

7.1 Secure routing

Several peer-to-peer systems perform routing over the social graph to improve performance and security, such as Sprout [17], Tribler [27], Whanau [14] and X-Vine [21]. Next, we analyze the impact of our perturbation algorithm on the utility of Sprout.

7.1.1 Sprout

Sprout is a routing system that enhances the security of conventional DHT routing by leveraging trusted social links when available. For example, when routing towards a DHT key, if leveraging a social link makes forward progress in the DHT namespace, then the social link is used for routing. The authors of Sprout considered a linear trust decay model, where a user's social contacts are trusted with probability f (set to 0.95 in [17]), and the trust in other users decreases as a linear function of the shortest path distance between the users (a decrement of 0.05 is used in [17]). The decay is bounded below by a value reflecting the probability of a random user in the network being trusted (set to 0.6 in [17]).

Table 1. Path reliability using Sprout for a linear trust decay model.

    Facebook interaction graph      Facebook friendship graph
    Mechanism    Reliability        Mechanism    Reliability
    Original     0.110              Original     0.140
    t = 3        0.101              t = 3        0.126
    t = 5        0.101              t = 5        0.121
    t = 10       0.096              t = 10       0.118
    Chord        0.075              Chord        0.072

The reliability of a DHT lookup in Sprout is defined as the probability that all users in the path are trusted. Table 1 depicts the reliability of routing using a single DHT lookup in Sprout, for the original and perturbed topologies. We used Chord as the underlying DHT system. For each perturbation parameter, our results were averaged over 100 perturbed topologies. We can see that as the perturbation parameter increases, the utility of the application decreases. For example, using the original Facebook interaction topology, the reliability of a single DHT path in Sprout is 0.11, which drops to 0.10 and 0.096 when using perturbed topologies with parameters t = 5 and t = 10 respectively. However, even when using t = 10, the performance is better than in the scenario where social links are not used for routing (Chord's baseline performance of 0.075). We can see similar results for the Facebook friendship graph as well.
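The linear trust decay model used in this evaluation can be sketched as follows. This is a sketch of the model described above, not Sprout's implementation: a relay at social distance d from the source is trusted with probability max(f − 0.05·(d − 1), floor), and a lookup path is reliable only if every relay on it is trusted.

```python
def trust(social_distance, f=0.95, decrement=0.05, floor=0.6):
    """Linear trust decay: direct contacts get f; each extra hop of
    social distance costs `decrement`, bounded below by `floor`."""
    return max(f - decrement * (social_distance - 1), floor)

def path_reliability(social_distances):
    """A lookup succeeds only if every relay on the path is trusted."""
    r = 1.0
    for d in social_distances:
        r *= trust(d)
    return r

# toy path: a direct friend, a friend-of-friend, and a socially distant node
rel = path_reliability([1, 2, 9])
```

Perturbation lengthens social distances slightly, which is why the reliabilities in Table 1 decrease as t grows.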

7.2 Sybil detection

In a Sybil attack [8], a single user or entity emulates the behavior of multiple identities in a system. Many systems are built on the assumption that there is a bound on the fraction of malicious nodes in the system. By inserting a large number of malicious identities, an attacker can compromise the security properties of such systems. Sybil attacks are a powerful threat against both centralized and distributed systems, such as reputation systems, consensus and leader election protocols, peer-to-peer systems, anonymity systems, and recommendation systems.

A wide body of recent work has proposed to leverage trust relationships embedded in social networks for detecting Sybil attacks [36, 35, 6, 23, 31]. However, in all of these mechanisms, an adversary can learn the trust relationships in the social network. Next, we show that our perturbation algorithm preserves the ability of the above mechanisms to detect Sybils, while protecting the privacy of the social network trust relationships.


[Figure 13: two panels, (a) and (b); y-axis: % accepted nodes (0 to 100); x-axis: random walk length; curves for the original graph and t = 5, 10.]

Figure 13. SybilLimit % validated honest nodes as a function of SybilLimit random route length for (a) the Facebook interaction graph and (b) the Facebook link graph. We can see that for a false positive rate of 1-2%, the required random route length for our perturbed topologies is a factor of 2-3 smaller than for the original topology. Random route length is directly proportional to the number of Sybil identities that can be inserted in the system.

[Figure 14: two panels, (a) and (b); y-axis: final attack edges; x-axis: attack edges; curves for the original graph and t = 5, 10.]

Figure 14. Attack edges in the perturbed topologies as a function of attack edges in the original topology. We can see that there is a marginal increase in the number of attack edges in the perturbed topologies. The attack edges are directly proportional to the number of Sybil identities that can be inserted in the system.

[Figure 15: two panels, (a) and (b); y-axis: number of Sybil identities (0 to 30000); x-axis: number of compromised nodes (0 to 2000); curves for the original graph and t = 5, 10.]

Figure 15. Number of Sybil identities accepted by SybilInfer as a function of the number of compromised nodes in the original topology. We can see that there is a significant decline in the number of Sybil identities using our perturbation algorithm.


7.2.1 SybilLimit

We use SybilLimit [35] as a representative protocol for Sybil detection, since it is the most popular and well understood mechanism in the literature. SybilLimit is a probabilistic defense, and has both false positives (honest users misclassified as Sybils) and false negatives (Sybils misclassified as honest users).

We compared the performance of running SybilLimit on the original graph and on the transformed graph, for varying perturbation parameters. For each perturbation parameter, we averaged the results over 100 perturbed topologies. Figure 13 depicts the percentage of honest users validated by SybilLimit using the original graph and perturbed graphs, as a function of the SybilLimit random route length (application parameter w). For any value of the SybilLimit random route length, the percentage of honest nodes accepted by SybilLimit is higher when using perturbed graphs. This is because our perturbation algorithm reduces the mixing time of the graph. In fact, for a false positive percentage of 1-2% (99-98% accepted nodes), the required length of the SybilLimit random routes is a factor of 2-3 smaller than for the original topology. The SybilLimit random route length is directly proportional to the number of Sybil identities that can be inserted in the system.

This improvement in the number of accepted honest nodes (reduction in false positives) comes at the cost of an increase in the number of attack edges between honest users and the attacker. Figure 14 depicts the number of attack edges in the perturbed topologies, for varying values of attack edges in the original graph. We can see that, as expected, there is a marginal increase in the number of attack edges in the perturbed topologies.

Remark: The number of Sybil identities that an adversary can insert is given by S = g′ · w′. We note that the marginal increase in the number of attack edges g′ is offset by the reduced length of the SybilLimit random-route parameter w′ (for any desired false-positive rate), thus achieving performance comparable to the original social graph. In fact, for perturbed topologies, since the required random-route length in SybilLimit is halved for a false-positive rate of 1-2%, and the increase in the number of attack edges is less than a factor of two, Sybil defense performance actually improves under our perturbation mechanism. Thus, for Sybil defenses, our perturbation mechanism is of independent interest, even without considering the benefit of link privacy. We further validate this conclusion using another state-of-the-art detection mechanism, SybilInfer.
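To make the remark concrete, the following sketch plugs hypothetical numbers (not measurements from the paper) into the bound S = g′ · w′; the function name `sybil_bound` and the specific values are assumptions for illustration.

```python
def sybil_bound(attack_edges, route_length):
    """Approximate number of Sybil identities admitted: S = g * w."""
    return attack_edges * route_length

# Original graph: g attack edges, route length w for a ~1-2% false-positive rate.
# (Hypothetical values, chosen only to illustrate the trade-off.)
g, w = 100, 30
s_original = sybil_bound(g, w)             # 3000

# Perturbed graph: attack edges grow by a factor < 2, but the required
# route length is roughly halved (the paper reports a 2-3x reduction).
g_pert, w_pert = int(1.8 * g), w // 2
s_perturbed = sybil_bound(g_pert, w_pert)  # 2700

# The bound on Sybil identities is no worse, despite the extra attack edges.
assert s_perturbed <= s_original
```

Even with nearly twice the attack edges, halving the route length keeps the Sybil bound at or below its original value.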

7.2.2 SybilInfer

We compared the performance of running SybilInfer on real and perturbed topologies. Figure 15 depicts the optimal number of Sybil identities that an adversary can insert before being detected by SybilInfer, as a function of real compromised and colluding users. Again, we can see that the perturbed graphs perform better than the original graphs. This is due to the interplay between the mixing time of the graphs and the number of attack edges in the Sybil defense application. Our perturbation mechanism significantly reduces the mixing time of the graphs, while suffering only a marginal increase in the number of attack edges. It is interesting to see that the advantage of using our perturbation mechanism is smaller in Figure 15(b) than in Figure 15(a). This is because the mixing time of the Facebook friendship graph is much better than the mixing time of the Facebook interaction graph. Thus we conclude that our perturbation mechanism improves the overall Sybil detection performance of existing approaches, especially for interaction-based topologies that exhibit relatively poor mixing characteristics.

8 Conclusion and Future Work

In this work, we proposed a random walk based perturbation algorithm that anonymizes links in a social graph while preserving the graph-theoretic properties of the original graph. We provided formal definitions for the utility of a perturbed graph from the perspective of its vertices, related our definitions to global graph properties, and empirically analyzed the properties of our perturbation algorithm using real-world social networks. Furthermore, we analyzed the privacy of our perturbation mechanism from several perspectives: (a) a Bayesian viewpoint that takes into consideration a specific adversarial prior, and (b) a risk-based viewpoint that is independent of the adversary's prior. We also formalized the relationship between the utility and privacy of perturbed graphs. Finally, we experimentally demonstrated the applicability of our techniques to applications such as Sybil defenses and secure routing. For Sybil defenses, we found that our techniques are of independent interest.

Our work opens several directions for future research, including (a) investigating the applicability of our techniques to directed graphs, (b) developing closed-form expressions for computing link privacy in the Bayesian framework, (c) investigating tighter bounds on ε for computing link privacy in the risk-based framework, and (d) modeling the temporal dynamics of social networks in quantifying link privacy.

By protecting the privacy of trust relationships, we believe that our perturbation mechanism can act as a key enabler for real-world deployment of secure systems that leverage social links.

Acknowledgments

We are very grateful to Satish Rao for helpful discussions on graph theory, including insights on utility metrics for perturbed graphs and relating the mixing times of original and perturbed graphs. We would also like to thank Ling Huang, Adrian Mettler, Nguyen Tran, and Mario Frank for helpful discussions on analyzing link privacy, as well as their feedback on early versions of this work. Our analysis of SybilLimit is based on extending a simulator written by Bimal Viswanath. This work was supported by Intel through the ISTC for Secure Computing, by the National Science Foundation under grant numbers CCF-0424422, 0842695, and 0831501 CT-L, by the Air Force Office of Scientific Research (AFOSR) under MURI award FA9550-09-1-0539 and grant number FA9550-08-1-0352, and by the Office of Naval Research under MURI grant number N000140911081.

References

[1] D. Aldous and J. Fill. Reversible Markov Chains and Random Walks on Graphs. http://www.stat.berkeley.edu/~aldous/RWG/book.html.

[2] L. Backstrom, C. Dwork, and J. Kleinberg. Wherefore art thou r3579x?: Anonymized social networks, hidden patterns, and structural steganography. In WWW, 2007.

[3] M. Bastian, S. Heymann, and M. Jacomy. Gephi: An open source software for exploring and manipulating networks. In International AAAI Conference on Weblogs and Social Media, 2009.

[4] S.-H. Cha. Comprehensive survey of distance/similarity measures between probability density functions. International Journal of Mathematical Models and Methods in Applied Sciences, 1(4), 2007.

[5] G. Danezis, C. Diaz, C. Troncoso, and B. Laurie. Drac: An architecture for anonymous low-volume communications. In PETS, 2010.

[6] G. Danezis and P. Mittal. SybilInfer: Detecting Sybil nodes using social networks. In NDSS, 2009.

[7] R. Dey, Z. Jelveh, and K. Ross. Facebook users have become much more private: A large scale study. Technical report, New York University, 2011.

[8] J. Douceur. The Sybil attack. In IPTPS, 2002.

[9] C. Dwork. Differential privacy: A survey of results. In TAMC, 2008.

[10] W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1), 1970.

[11] M. Hay, G. Miklau, D. Jensen, D. Towsley, and P. Weis. Resisting structural re-identification in anonymized social networks. Proc. VLDB Endow., 1(1), 2008.

[12] M. Hay, G. Miklau, D. Jensen, P. Weis, and S. Srivastava. Anonymizing social networks. Technical Report 07-09, University of Massachusetts, Amherst, 2007.

[13] A. Korolova, R. Motwani, S. U. Nabar, and Y. Xu. Link privacy in social networks. In CIKM, 2008.

[14] C. Lesniewski-Laas and M. F. Kaashoek. Whanaungatanga: A Sybil-proof distributed hash table. In NSDI, 2010.

[15] K. Liu and E. Terzi. Towards identity anonymization on graphs. In SIGMOD, 2008.

[16] X. Lu, Y. Song, and S. Bressan. Fast identity anonymization on graphs. In DEXA, 2012.

[17] S. Marti, P. Ganesan, and H. Garcia-Molina. SPROUT: P2P routing with social networks. In P2P&DB, 2004.

[18] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6), 1953.

[19] A. Mislove, A. Post, P. Druschel, and K. P. Gummadi. Ostra: Leveraging trust to thwart unwanted communication. In NSDI, 2008.

[20] P. Mittal, N. Borisov, C. Troncoso, and A. Rial. Scalable anonymous communication with provable security. In USENIX HotSec, 2010.

[21] P. Mittal, M. Caesar, and N. Borisov. X-Vine: Secure and pseudonymous routing using social networks. In NDSS, 2012.

[22] P. Mittal, M. Wright, and N. Borisov. Pisces: Anonymous communication using social networks. In NDSS, 2013.

[23] A. Mohaisen, N. Hopper, and Y. Kim. Keep your friends close: Incorporating trust into social network-based Sybil defenses. In INFOCOM, 2011.

[24] A. Mohaisen, A. Yun, and Y. Kim. Measuring the mixing time of social graphs. In IMC, 2010.

[25] S. Nagaraja. Anonymity in the wild: Mixes on unstructured networks. In PETS, 2007.

[26] A. Narayanan and V. Shmatikov. De-anonymizing social networks. In IEEE S&P, 2009.

[27] J. A. Pouwelse, P. Garbacki, J. Wang, A. Bakker, J. Yang, A. Iosup, D. H. J. Epema, M. Reinders, M. R. van Steen, and H. J. Sips. Tribler: A social-based peer-to-peer system. Concurr. Comput.: Pract. Exper., 20(2), 2008.

[28] V. Rastogi, M. Hay, G. Miklau, and D. Suciu. Relationship privacy: Output perturbation for queries with joins. In PODS, 2009.

[29] A. Sala, X. Zhao, C. Wilson, H. Zheng, and B. Y. Zhao. Sharing graphs using differentially private graph models. In IMC, 2011.

[30] Y. Sovran, J. Li, and L. Subramanian. Unblocking the internet: Social networks foil censors. Technical report, NYU, 2008.

[31] N. Tran, J. Li, L. Subramanian, and S. Chow. Optimal Sybil-resilient node admission control. In INFOCOM, 2011.

[32] N. Tran, B. Min, J. Li, and L. Subramanian. Sybil-resilient online content voting. In NSDI, 2009.

[33] B. Viswanath, A. Mislove, M. Cha, and K. P. Gummadi. On the evolution of user interaction in Facebook. In WOSN, 2009.

[34] X. Ying and X. Wu. Randomizing social networks: A spectrum preserving approach. In SDM, 2008.

[35] H. Yu, P. B. Gibbons, M. Kaminsky, and F. Xiao. SybilLimit: A near-optimal social network defense against Sybil attacks. In IEEE S&P, 2008.

[36] H. Yu, M. Kaminsky, P. Gibbons, and A. Flaxman. SybilGuard: Defending against Sybil attacks via social networks. In SIGCOMM, 2006.

[37] E. Zheleva and L. Getoor. Preserving the privacy of sensitive relationships in graph data. In PinKDD, 2008.

[38] B. Zhou and J. Pei. Preserving privacy in social networks against neighborhood attacks. In ICDE, 2008.

[39] B. Zhou and J. Pei. The k-anonymity and l-diversity approaches for privacy preservation in social networks against neighborhood attacks. Knowl. Inf. Syst., 28(1), 2011.

A Proof of Theorem 1: Relating vertex utility and mixing time

We now sketch the proof of the above theorem. Since total variation distance is a metric, the triangle inequality gives:

\[
\|P_v^t(G') - \pi\|_{tvd} \le \|P_v^t(G') - P_v^t(G)\|_{tvd} + \|P_v^t(G) - \pi\|_{tvd} .
\]

From the definition of mixing time, we have that for all t ≥ τ_G(ε):

\[
\|P_v^t(G') - \pi\|_{tvd} \le \|P_v^t(G') - P_v^t(G)\|_{tvd} + \varepsilon .
\]

Substituting t = τ_G(ε) in the above equation, and taking the maximum over all vertices, we have that:

\[
\max_v \|P_v^{\tau_G(\varepsilon)}(G') - \pi\|_{tvd}
\le \max_v \|P_v^{\tau_G(\varepsilon)}(G') - P_v^{\tau_G(\varepsilon)}(G)\|_{tvd} + \varepsilon
\le \mathrm{VU}_{\max}(G, G', \tau_G(\varepsilon)) + \varepsilon .
\]

Finally, we have that

\[
\tau_{G'}\!\left(\varepsilon + \mathrm{VU}_{\max}(G, G', \tau_G(\varepsilon))\right) \le \tau_G(\varepsilon) .
\]
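The triangle-inequality step in this proof can be checked numerically on a toy example. The 4-vertex graphs, variable names, and walk length below are assumptions chosen purely for illustration, not graphs from the paper.

```python
import numpy as np

def tvd(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * np.abs(p - q).sum()

# Random-walk matrices for a 4-cycle (G) and the 4-cycle plus a chord (G').
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
A2 = A.copy()
A2[0, 2] = A2[2, 0] = 1                 # perturbed graph G'
P  = A  / A.sum(axis=1, keepdims=True)  # row v of P^t is P_v^t(G)
P2 = A2 / A2.sum(axis=1, keepdims=True)

t, v = 3, 0
pt  = np.linalg.matrix_power(P, t)[v]   # P_v^t(G)
pt2 = np.linalg.matrix_power(P2, t)[v]  # P_v^t(G')
pi  = np.full(4, 0.25)                  # stationary distribution of the cycle

# ||P_v^t(G') - pi|| <= ||P_v^t(G') - P_v^t(G)|| + ||P_v^t(G) - pi||
assert tvd(pt2, pi) <= tvd(pt2, pt) + tvd(pt, pi) + 1e-12
```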

B Proof of Theorem 2: Relating vertex utility and SLEM

We sketch the proof of the theorem. It is known that for undirected graphs, the second largest eigenvalue modulus is related to the mixing time of the graph as follows [1]:

\[
\frac{\mu_{G'}}{2(1 - \mu_{G'})} \log\!\left(\frac{1}{2\varepsilon}\right)
\le \tau_{G'}(\varepsilon)
\le \frac{\log n + \log\!\left(\frac{1}{\varepsilon}\right)}{1 - \mu_{G'}} .
\]

From the above equation, we can bound the SLEM in terms of the mixing time as follows:

\[
1 - \frac{\log n + \log\!\left(\frac{1}{\varepsilon}\right)}{\tau_{G'}(\varepsilon)}
\le \mu_{G'}
\le \frac{2\tau_{G'}(\varepsilon)}{2\tau_{G'}(\varepsilon) + \log\!\left(\frac{1}{2\varepsilon}\right)} .
\]

Set χ = VU_max(G, G′, τ_G(ε)). Replacing ε with ε + χ, we have that

\[
1 - \frac{\log n + \log\!\left(\frac{1}{\varepsilon + \chi}\right)}{\tau_{G'}(\varepsilon + \chi)}
\le \mu_{G'}
\le \frac{2\tau_{G'}(\varepsilon + \chi)}{2\tau_{G'}(\varepsilon + \chi) + \log\!\left(\frac{1}{2\varepsilon + 2\chi}\right)} .
\]

Finally, from Theorem 1, we leverage τ_{G′}(ε + χ) ≤ τ_G(ε) in the above equation to get

\[
\mu_{G'} \le \frac{2\tau_G(\varepsilon)}{2\tau_G(\varepsilon) + \log\!\left(\frac{1}{2\varepsilon + 2\chi}\right)} .
\]
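The quantity bounded in this proof, the SLEM μ of a random walk, can be computed directly from the graph's spectrum. The helper below is a sketch under the assumption of an undirected, unweighted, connected graph; the function name `slem` and the K4 example are illustrative additions, not from the paper.

```python
import numpy as np

def slem(adj):
    """Second largest eigenvalue modulus of the random walk on the
    undirected graph with adjacency matrix adj."""
    d = adj.sum(axis=1)
    # D^{-1/2} A D^{-1/2} is symmetric and has the same spectrum as the
    # walk matrix D^{-1} A, so we can use the symmetric eigensolver.
    S = adj / np.sqrt(np.outer(d, d))
    eig = np.sort(np.abs(np.linalg.eigvalsh(S)))[::-1]
    return eig[1]  # eig[0] == 1 is the trivial eigenvalue

# Complete graph K4: the walk matrix has eigenvalues {1, -1/3, -1/3, -1/3},
# so the SLEM is 1/3.
K4 = np.ones((4, 4)) - np.eye(4)
assert abs(slem(K4) - 1/3) < 1e-9
```

A smaller SLEM means a shorter mixing time, which is the direction of improvement our perturbation targets.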

C Proof of Theorem 4: Bounding mixing time

Observe that the edges in graph G′ can be modeled as samples from the t-hop probability distribution of random walks starting from vertices in G. We prove the lower bound on the mixing time of the perturbed graph G′ by contradiction: suppose that the mixing time of the graph G′ is k < τ_G(ε)/t. Then in the original graph G, a user could have performed random walks of length k · t and achieved a variation distance less than ε. But k · t < τ_G(ε), which is a contradiction. Thus, we have that τ_G(ε)/t ≤ τ_{G′}(ε).

We prove an upper bound on the mixing time of the perturbed graph using the notion of graph conductance. Let us denote the number of edges across the bottleneck cut (say S) of the original topology as g. Observe that the t-hop conductance between the sets S and its complement is strictly larger than the corresponding one-hop conductance (since S is the bottleneck cut in the original topology). Thus, the expected number of edges across this cut in G′ is at least g. Hence the expected graph conductance is an increasing function of the perturbation parameter t, and thus E[τ_{G′}(ε)] ≤ τ_G(ε).
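The model used in this proof, that each edge of G′ is the endpoint of a t-hop random walk in G, can be sketched as follows. The function `perturb`, its one-walk-per-incident-edge policy, and the 6-cycle example are assumptions for illustration; the paper's actual perturbation algorithm (Section 4) may differ in details such as degree preservation.

```python
import random

def perturb(adj, t, seed=0):
    """Sketch: replace each edge (u, .) of G by (u, endpoint of a
    t-hop random walk from u), dropping self-loops."""
    rng = random.Random(seed)
    new_adj = {u: set() for u in adj}
    for u in adj:
        for _ in adj[u]:                       # one walk per original edge
            v = u
            for _ in range(t):
                v = rng.choice(sorted(adj[v]))  # uniform random-walk step
            if v != u:                          # keep the graph simple
                new_adj[u].add(v)
                new_adj[v].add(u)
    return new_adj

# Usage on a 6-cycle: with t = 2, every new edge connects vertices that
# were exactly two hops apart in the original graph.
cycle = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
g2 = perturb(cycle, t=2)
```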


D Proof of Theorem 5: Relating utility and privacy

From the definition of maximum vertex utility, we have that |P^l_{AB}(G) − P^l_{AB}(G′)| ≤ VU_max(G, G′, l). Thus, we can bound P^l_{AB}(G) as follows:

\[
P^l_{AB}(G') - \chi \le P^l_{AB}(G) \le P^l_{AB}(G') + \chi ,
\]

where χ = VU_max(G, G′, l). Thus, for any value of k, if P^k_{AB}(G′) − VU_max(G, G′, k) > 0, then the lower bound on the probability P^k_{AB}(G) is positive, which reveals the information that A and B are within a k-hop neighborhood of each other. The maximum information is revealed when the value of k is minimized while maintaining P^k_{AB}(G′) − VU_max(G, G′, k) > 0, i.e., k = δ. This gives us a lower bound on the probability of A and B being friends in the original graph: the prior probability f(δ) that two vertices in a δ-hop neighborhood are friends. Let m_δ denote the average number of links in a δ-hop neighborhood, and let n_δ denote the average number of vertices in a δ-hop neighborhood. In the special case of a null prior, we have that

\[
f(\delta) = \frac{m_\delta}{\binom{n_\delta}{2}} .
\]
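The null prior above is a one-line computation: absent any other information, the probability that two vertices in a δ-hop neighborhood are linked is just the local edge density. The sketch below uses hypothetical neighborhood statistics (50 vertices, 120 edges) purely for illustration.

```python
from math import comb

def null_prior(m_delta, n_delta):
    """Null-prior link probability f(delta) = m_delta / C(n_delta, 2)
    for a neighborhood with m_delta edges and n_delta vertices."""
    return m_delta / comb(n_delta, 2)

# A delta-hop neighborhood averaging 50 vertices and 120 edges
# (hypothetical numbers): f(delta) = 120 / C(50, 2) = 120 / 1225.
f = null_prior(120, 50)
assert abs(f - 120 / 1225) < 1e-12
```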