Network centrality, inference and local computation


Extracting Knowledge From Networks: Rumors, Communities, and Superstars

Network centrality, inference and local computation
Devavrat Shah

LIDS+CSAIL+EECS+ORC

Massachusetts Institute of Technology

Network centrality
- It's a graph score function
- Given graph G = (V, E), it assigns "scores" to nodes in the graph
- That is, F : G → R^V

A given network centrality, i.e. a graph score function, is designed with the aim of solving a certain task at hand.

Network centrality: example
Degree centrality:
- Given graph G = (V, E)
- Score of node v is its degree
- Useful for measuring how connected each node is
- For example, useful for social clustering [Parth.-Shah-Zaman 14]
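As an illustration (not from the slides), a minimal Python sketch of degree centrality as a graph score function; the adjacency-dict representation and example graph are illustrative choices:

```python
# Minimal sketch: degree centrality as a graph score function F : G -> R^V.
def degree_centrality(adj):
    """Score of each node v is its degree, i.e. the number of neighbors."""
    return {v: len(neighbors) for v, neighbors in adj.items()}

# Example: a small undirected graph as adjacency sets.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b"},
}
print(degree_centrality(adj))  # {'a': 2, 'b': 3, 'c': 2, 'd': 1}
```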

Network centrality: example
Betweenness centrality:
- Given graph G = (V, E)
- Score of node v is proportional to the number of node pairs whose shortest paths pass through it
- Represents how critical each node is to keeping the network connected
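A sketch using networkx's built-in betweenness centrality (Brandes' algorithm); the library choice and the example graph are assumptions, not from the talk:

```python
# Sketch: betweenness centrality via networkx (Brandes' algorithm).
import networkx as nx

G = nx.Graph()
G.add_edges_from([("a", "b"), ("b", "c"), ("c", "d"), ("b", "d"), ("d", "e")])

# Number (here, unnormalized count) of shortest paths between node pairs
# that pass through each node.
scores = nx.betweenness_centrality(G, normalized=False)
print(scores)
```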

Network centrality: example
PageRank centrality:
- Given digraph G = (V, E)
- Score of node v equals its stationary probability under a random walk on the directed graph G

Transition probability matrix of the random walk (RW): if the RW is at node i at a given time step, it will be at node j with probability Qij, where
- if i has a directed edge to j: Qij = (1 - ε)/di + ε/n
- else: Qij = ε/n
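A minimal sketch of this random walk, assuming ε = 0.15 and that every node has at least one out-edge (both illustrative assumptions); it builds Q exactly as defined above and finds the stationary distribution by power iteration:

```python
# Sketch: PageRank transition matrix Q and power iteration.
import numpy as np

def pagerank(out_edges, eps=0.15, n_iters=100):
    nodes = sorted(out_edges)
    idx = {v: k for k, v in enumerate(nodes)}
    n = len(nodes)
    Q = np.full((n, n), eps / n)              # Q_ij = eps/n when no edge i -> j
    for i in nodes:
        for j in out_edges[i]:                # add (1 - eps)/d_i on each edge i -> j
            Q[idx[i], idx[j]] += (1 - eps) / len(out_edges[i])
    pi = np.full(n, 1.0 / n)
    for _ in range(n_iters):                  # power iteration: pi <- pi Q
        pi = pi @ Q
    return dict(zip(nodes, pi))

out_edges = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
print(pagerank(out_edges))
```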

Network centrality: data processing
- Corpus of webpages (networked data) → decision: search for relevant content → PageRank
- Citation data → decision: scientific importance → H-index
Why (or why not) does a given centrality make sense?

Statistical data processing
Data → Statistical Model → Decision
Example task: transmit a message bit B (= 0 or 1)
- Tx: B B B B B;  Rx: each bit is flipped with probability p (= 0.1)
- At Rx, using the received message, decide whether the intended message bit is 0 or 1
- ML estimation: majority rule
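A small sketch of the repetition-code example, assuming a binary symmetric channel simulated with Python's random module:

```python
# Sketch: repeat a bit 5 times over a binary symmetric channel, decode by majority.
import random

def transmit(bit, n=5, p=0.1):
    # Each of the n repetitions is flipped independently with probability p.
    return [bit ^ (random.random() < p) for _ in range(n)]

def ml_decode(received):
    # Majority rule = maximum-likelihood estimate of the sent bit (for p < 1/2).
    return int(sum(received) > len(received) / 2)

rx = transmit(1)
print(rx, "->", ml_decode(rx))
```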

Statistical data processing
Data → Statistical Model → Decision
Data to decision:
- Posit a model connecting data to decision (variables)
- Learn the model
- Subsequently make decisions, for example by solving a stochastic optimization problem

This talk
Network centrality: a statistical view, for processing networked data
- Graph score function = appropriate likelihood function

Explain this view through:
- Searching for the source of an information/infection spread: rumor centrality
- Other examples in later talks: rank centrality, crowd centrality

Local computation
- Stationary probability of a given node in a graph

1854 London Cholera Epidemic
[Map: deaths plotted by residence; x marks the cholera source.]

Dr. John Snow. Center of mass. Can we find the source in a network?

John Snow is arguably the father of epidemiology. The germ theory, which identified specific bacteria as responsible for diseases such as cholera, was only put forth in 1861. Before that, the popular belief was that cholera spread through air pollution. The turning point was the cholera epidemic in London in 1854, during which many people died, primarily in the Soho area. With the help of a local reverend, Snow interviewed the families of the deceased and identified a commonality: almost all of them lived near, and used, the water pump on Broad(wick) Street. Snow therefore concluded that the pump was responsible for this ill fate.

Here I am plotting the points where the deceased were living, and the pump as the cholera source.

There is a further twist to this story: post-fire, everybody had a cesspit underneath their home, and one of the cesspits was leaking into the well of the water pump. It gets more interesting: just before the outbreak, it seems the nappies of a baby with cholera (from another source) were washed into that cesspit!

That confirmed the doubt and led to a water inspection, and the rest is history.

Now suppose we did not know the reverend, nor were we as thoughtful or resourceful, but we had knowledge of the network. How would one find the source?

Well: here you find the centre of gravity!

Stuxnet (and Duqu) worm: who started it?

Searching for the source. Stuxnet: a computer worm that spread in July 2010. It spreads through the network (in a P2P manner). It is a so-called rootkit, which alters the core of the system, and it has programmable logic, i.e. it can keep innovating. It aimed primarily at industrial software, in particular certain types of Siemens equipment. Little is known officially, but it is widely believed to have set back the nuclear program of certain countries by a few years if not a decade, by affecting the centrifuges used for enriching uranium. It is brilliant! Whichever side you are on, you would like to know: (a) who is the likely source, or (b) can the likely source be detected?

Searching for source: cyber-attacks, viral epidemics, social trends.

Searching for source
Data: infected nodes, network → Statistical Model → Decision: how likely is each node to be the source?

Model: Susceptible-Infected (SI)
- Uniform prior: any node is equally likely to be the source
- Spreading times on each edge are i.i.d. random variables
- We will assume an exponential distribution (to develop intuition)
- Results hold for a generic distribution (with no atom at 0)
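A minimal simulation sketch of this SI model, assuming rate-1 exponential spreading times on edges (an illustrative normalization) and an adjacency-list graph:

```python
# Sketch: SI spread with i.i.d. Exp(1) edge spreading times, via an event queue.
import heapq, random

def si_spread(adj, source, n_infected):
    """Return the first n_infected nodes in order of infection."""
    infected, order = {source}, [source]
    events = []  # heap of (infection time, node)
    for u in adj[source]:
        heapq.heappush(events, (random.expovariate(1.0), u))
    while events and len(infected) < n_infected:
        t, v = heapq.heappop(events)
        if v in infected:
            continue  # already infected via an earlier edge
        infected.add(v)
        order.append(v)
        for u in adj[v]:
            if u not in infected:
                heapq.heappush(events, (t + random.expovariate(1.0), u))
    return order

adj = {1: [2, 3], 2: [1, 4], 3: [1], 4: [2]}
print(si_spread(adj, source=1, n_infected=4))
```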

Rumor Source Estimator
- We know the rumor graph G
- We want to find the likelihood function P(G | source = v)
- Not obvious how to calculate it

Rumor Spreading Order
- The rumor spreading order is not known
- Only spreading constraints are available (each newly infected node must neighbor an already-infected one)
- More spreading orders = more likely to be the source
New problem: counting spreading orders.

Example: P(G | source = 2) = P(order 2,1,3,4 | source = 2) + P(order 2,1,4,3 | source = 2).
All spreading orders from a given source are equally likely, so this equals 2 * p(d = 3, N = 4). A brute-force counting sketch follows below.
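A brute-force sketch (illustrative, not the talk's algorithm) that enumerates valid spreading orders, i.e. permutations in which every newly infected node neighbors an already-infected one; the 4-node star is an example choice:

```python
# Sketch: enumerate all valid rumor spreading orders from a given source.
from itertools import permutations

def spreading_orders(adj, source):
    others = [v for v in adj if v != source]
    valid = []
    for perm in permutations(others):
        infected, ok = {source}, True
        for v in perm:
            if not adj[v] & infected:   # v must neighbor an infected node
                ok = False
                break
            infected.add(v)
        if ok:
            valid.append((source,) + perm)
    return valid

adj = {1: {2}, 2: {1, 3, 4}, 3: {2}, 4: {2}}  # star with center 2
for v in adj:
    print(v, len(spreading_orders(adj, v)))
```

On this star, the center admits 6 spreading orders while each leaf admits only 2, so the center is the most likely source under a uniform prior.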

Regular Trees
Regularity of the tree + memoryless property of the exponential ⇒ all spreading orders from a given source are equally likely.

Counting Spreading Orders
R(v, G) = number of rumor spreading orders from v on G.
For a tree: R(v, G) = N! · ∏_{u ∈ V} 1 / T_u^v,
where N = network size and T_u^v = size of the subtree rooted at u (with the tree rooted at v).

Rumor Centrality (Shah-Zaman, 2010)
The source estimator is a graph score function: v̂ ∈ argmax_v R(v, G).

It is the right score function for source detection:
- Likelihood function for regular trees with exponential spreading times
- Can be calculated in linear time (see the sketch below)

Rumor Centrality and Random Walk
Stationary probability of visiting a node ∝ rumor centrality.

- Random walk with transition probability proportional to the sizes of sub-trees
- Stationary distribution = rumor centrality (normalized)
[Figure: example tree with transition probabilities 1/7, 3/7, 5/7, ... on its edges.]
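A minimal sketch of the linear-time computation on a tree, using the formula R(v, G) = N! · ∏_u 1/T_u^v and the re-rooting identity R(child) = R(parent) · T_child / (N − T_child); exact integer arithmetic via math.factorial is an illustrative choice (use logarithms for large N):

```python
# Sketch: rumor centrality R(v, G) for every node of a tree in linear time.
import math

def rumor_centrality(adj):
    nodes = list(adj)
    n, root = len(nodes), nodes[0]
    parent, order = {root: None}, [root]
    for v in order:                          # BFS to orient the tree from a root
        for u in adj[v]:
            if u not in parent:
                parent[u] = v
                order.append(u)
    size = {v: 1 for v in nodes}
    for v in reversed(order[1:]):            # subtree sizes T_u^root (T_root = n)
        size[parent[v]] += size[v]
    r = {root: math.factorial(n) // math.prod(size[v] for v in order)}
    for v in order[1:]:                      # re-root: R(v) = R(parent)*T_v/(n - T_v)
        r[v] = r[parent[v]] * size[v] // (n - size[v])
    return r

adj = {1: [2, 3], 2: [1, 4], 3: [1], 4: [2]}
print(rumor_centrality(adj))  # {1: 3, 2: 3, 3: 1, 4: 1}
```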

Rumor Centrality: General Network
- Rumor spreads on an underlying spanning tree of the graph
- Breadth-first search tree: the most likely tree under fast rumor spreading

Precisely, this is the next-hop information required under shortest-path routing!

Precision of Rumor Centrality
[Two example plots: normalized rumor centrality (scale 0.0 to 1.0) over an infected network, marking the true rumor source and the estimated rumor source.]

Bin Laden Death Twitter Network
Keith Urbahn: first to tweet about the death of Osama bin Laden.
[Plot: estimated rumor source vs. true rumor source in the Twitter network.]

Effectiveness of rumor centrality
Simulations and examples show rumor centrality is useful for finding sources.

Next
- When does it work
- When does it not work
- And, why

Source Estimation = Rumor Center
The rumor center v* has maximal rumor centrality.

The network is balanced around the rumor center: every subtree T_j^{v*} hanging from a neighbor j of v* contains at most half of the infected nodes.
If the rumor spreads in a balanced manner: Source = Rumor Center.

Regular Trees (degree = 2)

Proposition [Shah-Zaman, 2010]: Let a rumor spread for a time t on a regular tree with degree d = 2 as per the SI model with exponential (or arbitrary) spreading times (with non-trivial variance). Then the probability of correct detection goes to 0 as t → ∞. That is, line graphs are hopeless. What about a generic tree?

Some Useful Notation
- Rumor spreads for time t to n(t) nodes
- Let the sequence of infected nodes be {v1, v2, ..., vn(t)}; v1 = rumor source
- C_k^{n(t)} = {rumor center is vk after n(t) nodes are infected}
- C_1^{n(t)} = correct detection


Result 1: Geometric Trees
The number of nodes at distance l from any node grows as l^α (polynomial growth).

Proposition [Shah-Zaman, 2011]: Let a rumor spread for a time t on a (regular) geometric tree with α > 0 from a source with degree > 2 as per the SI model with arbitrary spreading times (with exponential moments). Then the probability of correct detection goes to 1 as t → ∞.

Result 2: Regular Trees (degree > 2)
- Exponential growth
- High-variance rumor graph

Theorem [Shah-Zaman, 2011]: Let a rumor spread for a time t on a regular tree with degree d > 2 as per the SI model with exponential spreading times. Then

lim_{t→∞} P(C_1^{n(t)}) = α_d, where α_d = d · I_{1/2}(1/(d-2), (d-1)/(d-2)) − (d − 1),

and I_x(a, b) is the regularized incomplete Beta function:

I_x(a, b) = [Γ(a+b) / (Γ(a) Γ(b))] ∫_0^x t^{a-1} (1-t)^{b-1} dt.

Result 2: Regular Trees (degree > 2)
[Plot: α_d vs. degree d; α_3 = 0.25, and α_d increases toward 1 − ln(2) ≈ 0.31 as d → ∞.]

Result 2: Regular Trees (degree > 2)
Theorem [Shah-Zaman, 2011]: Let a rumor spread for a time t on a regular tree with degree d > 2 as per the SI model with exponential spreading times. Then, with high probability, the estimate is close to the true source.

Result 3: Generic Random Trees
Start from the root; each node i has h_i children (the h_i are i.i.d.).

Theorem [Shah-Zaman, 2012]: Let a rumor spread for a time t on a random tree with E[h_i] > 1 and E[h_i^2] < ∞ from a source with degree > 2 as per the SI model with arbitrary spreading times (non-atomic at 0). Then the probability of correct detection remains strictly positive as t → ∞.

Implication: sparse random graphs
- Random regular graph ≈ regular tree
- Erdos-Renyi graph ≈ random tree with h_i ~ Binomial distribution (Poisson in the large-graph limit)
- Tree results extend

Erdos-Renyi Graphs
Graph has m nodes; each edge exists independently with probability p = c/m.

[Plot: detection probability; regular tree (degree = 10,000).]

Proof Remarks
[Figure: source v1 with subtrees T1(t), T2(t), T3(t); incorrect detection occurs under imbalance of the subtrees.]

Evaluating the detection probability
[Figure: source v1 with subtrees T1(t), T2(t), T3(t).]

Standard approach:
- Compute E[Tl(t)]
- Show concentration of Tl(t) around its mean E[Tl(t)]
- Use it to evaluate P(Ti(t) > Tj(t))
Issues:
- The variance of Tl(t) is of the same order as its mean, hence the usual concentration is not useful
- Even if it were, it would result in a 0/1-style answer (which is unlikely)

Evaluating the detection probability
[Figure: source v1 with subtrees T1(t), T2(t), T3(t).]

An alternative:
- Understand the ratio Ti(t)/Tj(t)
- Characterize its limiting distribution, that is, Ti(t)/Tj(t) → W
- Use W to evaluate P(Ti(t) > Tj(t)) = P(W > 0.5)

Goal: how to find W?

Evaluating the detection probability
Let Z1(t) = rumor boundary of T1(t), and Z(t) = Z2(t) + Z3(t) = rumor boundary of T2(t) + T3(t).

Initially: T1(0) = 0, T2(0) + T3(0) = 0, Z1(0) = 1, Z2(0) + Z3(0) = 2
First infection (in T1): T1 = 1, T2 + T3 = 0, Z1 = 2, Z2 + Z3 = 2
Second infection (in T2 or T3): T1 = 1, T2 + T3 = 1, Z1 = 2, Z2 + Z3 = 3

In summary: Z1(t) = T1(t) + 1 and Z2(t) + Z3(t) = T2(t) + T3(t) + 2.
Therefore, for large t, T1(t)/(T2(t) + T3(t)) equals Z1(t)/(Z2(t) + Z3(t)): track the ratio of boundaries.

Consider the sub-tree growth: there are subtrees and their boundaries. For a regular tree with d > 2, or a generic tree, each subtree's size grows exponentially in t, so the number of boundary nodes is of the same order as the subtree size itself. A first attempt is to obtain concentration around the mean and then argue separation. But each of these quantities has very high variance, of the same order as the size of the tree, so concentration results only say that the sizes lie within each other's uncertainty regions, which provides no meaningful answer. Further, the average subtree size is very sensitive to time, again with very high variance.

Evaluating the detection probability
Boundary evolution; two types: Z1(t) and Z(t) = Z2(t) + Z3(t)
- Each new infection increases Z1(t) or Z(t) by +1
- Selection of Z1(t) vs. Z(t):
  Z1(t) with prob. Z1(t)/(Z1(t) + Z(t))
  Z(t) with prob. Z(t)/(Z1(t) + Z(t))

This is exactly Polya's Urn with two types of balls.

Evaluating the detection probability
Boundary evolution = Polya's Urn:
- M(t) = Z1(t)/(Z1(t) + Z(t)) converges almost surely to a r.v. W
- Goal: P(T1(t) > T2(t) + T3(t)) = P(W > 0.5)
- Here (d = 3), W has the Beta(1, 2) distribution
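A quick simulation sketch of this urn (the step and sample counts are illustrative): starting with 1 type-1 ball and 2 type-2 balls, the fraction of type-1 balls converges to W ~ Beta(1, 2), for which P(W > 0.5) = (1 − 1/2)^2 = 0.25:

```python
# Sketch: Polya urn with initial counts Z1(0) = 1 and Z2(0)+Z3(0) = 2.
import random

def polya_fraction(z1=1, z2=2, steps=1000):
    # Each step draws a ball proportionally and adds one ball of the drawn type.
    for _ in range(steps):
        if random.random() < z1 / (z1 + z2):
            z1 += 1
        else:
            z2 += 1
    return z1 / (z1 + z2)

samples = [polya_fraction() for _ in range(5000)]
# For W ~ Beta(1, 2): P(W > 0.5) = (1 - 0.5)**2 = 0.25.
print("P(W > 0.5) ~", sum(w > 0.5 for w in samples) / len(samples))
```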


Probability of correct detection
For a generic d-regular tree, the corresponding W is Beta(1/(d-2), (d-1)/(d-2)). Therefore

α_d = 1 − d · P(W > 1/2) = d · I_{1/2}(δ_d, (d-1) δ_d) − (d − 1), where δ_d = 1/(d − 2).

Generic Trees: Branching Process
[Figure: source v1 with subtree T1(t); Z(t) = rumor boundary, a branching process; Z(0) = 1 for T1(t), Z(0) = k − 1 for T2(t) + ... + Tk(t).]
Lemma (Shah-Zaman 12): For large t, Z(t) is proportional to T1(t).

Branching Process Convergence
The following result is known for branching processes (cf. Athreya-Ney 67):
e^{-αt} Z(t) → W almost surely, where
- α is the Malthusian parameter; it depends on the distribution of the spreading times and the node degree
- W is a non-degenerate r.v. with an absolutely continuous distribution
For regular trees with exponential spreading times, W has a Beta distribution.

Summary, thus far
Rumor source detection:
- Useful graph score function: rumor centrality
  - Exact likelihood function for certain networks
  - Can be computed quickly (e.g. using a linear-time iterative algorithm)
- Effectiveness
  - Accurately finds the source on essentially any tree or sparse random graph, for any spreading time distribution

What else can it be useful for?
- Thesis of Zaman: Twitter search engine
- Bhamidi, Steele and Zaman 13

Computing centrality
Computing centrality equals finding the stationary distribution of a random walk on the network, for many settings, including:
- PageRank
- Rumor centrality
- Rank centrality

Well, that should be easy

Computing stationary distribution
Power iteration method [cf. Golub-Van Loan 96]:
- Iteratively multiply matrix and vector
- It primarily requires centralized computation
- 100 GB of RAM limits the (dense) matrix size to ~100k nodes
- But a social network can have more than a million nodes, and the web is much larger

So, it's not that easy.

Computing stationary distribution
PageRank-specific local computation: a collection of clever, powerful solutions
- Jeh et al. 2003, Fogaras et al. 2005, Avrachenkov et al. 2007, Bahmani et al. 2010, Borgs et al. 2012
These rely on the fact that, from each node, a transition to any other node happens with probability greater than or equal to a known fixed positive constant (ε/n). They do not extend to arbitrary random walks or countably infinite graphs.

Markov chain, stationary distribution
Random walk or Markov chain:
- (Unknown) finite or countably infinite state space
- Each node/state can execute the next step of the Markov chain: jump from state i to j with probability Pij
- Irreducible, aperiodic (and positive recurrent): there is a well-defined, unique stationary distribution π

Goal: for any given node i, obtain an estimate of π_i by accessing only the local neighborhood of node i.

Key property
True value: π_i = 1 / E_i[T_i], where T_i = min{t ≥ 1 : X_t = i} is the return time to node i (expected return time).
Estimate: use the average truncated return time in place of E_i[T_i].

Algorithm
Input: Markov chain (Σ, P) and node i. Parameters: truncation threshold θ and number of samples N.
- Gather samples
- Terminate if satisfied
- Update and repeat

Algorithm: Gather Samples
- Sample truncated return paths: run the chain from i until it returns to i or the path length reaches θ
- p̂ = fraction of samples truncated
Then: terminate if satisfied; otherwise update and repeat.

[Animation: a random walk starts at node i and walks until it returns to i ("Returned to i!"); otherwise it keeps walking until the path length exceeds the truncation threshold ("Path length exceeded").]
The key idea of our algorithm is to use truncation; however, we can see that it is a tradeoff. By truncating the walk, we lose information about how much longer the sampled return time would have been, yet on the other hand we also save computation time. In the following slides we will analyze the effect of truncation on the estimate for different nodes.

Algorithm: Update and Repeat
Double θ and increase the number of samples N so that, with probability greater than the desired confidence level, the estimate achieves the desired accuracy:
- closeness of estimate
- confidence

Algorithm: Terminate if Satisfied
Terminate and output the current estimate if:
(a) the node is unimportant enough (the estimate falls below the importance threshold Δ), or
(b) the fraction of truncated samples is small.
Otherwise, update and repeat.

Δ = threshold for importance

Local Computation [Lee-Ozdaglar-Shah 13]

Correctness under termination condition (a) [Lee-Ozdaglar-Shah 13]
Correctness under termination condition (b) [Lee-Ozdaglar-Shah 13]

Simulation: PageRank
[Plots: stationary probability vs. nodes sorted by stationary probability, for a random graph generated by the configuration model with a power-law degree distribution. The algorithm obtains close estimates for the important nodes, and the bias correction (next slide) corrects for the bias!]

Bias Correction
True value: π_i = 1 / E_i[T_i].
Estimate: π̂_i = (fraction of samples not truncated) / (average truncated return time).
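Putting the pieces together, a minimal sketch of the local estimation idea under simplifying assumptions: a fixed truncation threshold θ and sample count instead of the adaptive doubling above, and a toy chain accessed only through a one-step `step` function (all names here are illustrative):

```python
# Sketch: estimate pi_i locally from truncated return times, with bias correction.
import random

def estimate_pi(step, i, theta=1000, num_samples=2000):
    """step(state) -> next state samples one transition of the Markov chain."""
    total_time, returned = 0, 0
    for _ in range(num_samples):
        state, t = step(i), 1
        while state != i and t < theta:   # walk until return to i or truncation
            state, t = step(state), t + 1
        total_time += t                   # truncated return time min(T_i, theta)
        returned += (state == i)          # counts the non-truncated samples
    # pi_i ~ (fraction not truncated) / (average truncated return time)
    return (returned / num_samples) / (total_time / num_samples)

# Example: a biased random walk on {0, ..., 9}, reflecting at the ends.
def step(s):
    s += 1 if random.random() < 0.7 else -1
    return min(max(s, 0), 9)

print(estimate_pi(step, 9))
```

Note that the estimator only needs local access: it never builds the transition matrix, it just runs walks from node i.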

Summary
Network centrality: a useful tool for data processing
- A principled approach: graph score function = likelihood function
- An example: rumor centrality, for accurate source detection

Local computation
- Stationary distribution of a Markov chain / random walk