network centrality, inference and local computation
Extracting Knowledge From Networks: Rumors, Communities, and Superstars
Devavrat Shah
LIDS+CSAIL+EECS+ORC
Massachusetts Institute of Technology
Network centrality
It is a graph score function: given a graph G = (V, E), it assigns a "score" to each node in the graph.
That is, F : G -> R^|V|.
A given network centrality, or graph score function, is designed with the aim of solving a certain task at hand.
Network centrality example: degree centrality
Given a graph G = (V, E), the score of node v is its degree.
Useful for measuring how connected each node is; for example, useful for social clustering [Parth.-Shah-Zaman 14].
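As a concrete illustration (the edge list below is a hypothetical toy graph, not from the talk), degree centrality is a one-pass count:

```python
# Degree centrality: the score of node v is simply its degree.
from collections import defaultdict

def degree_centrality(edges):
    """Return {node: degree} for an undirected edge list."""
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return dict(deg)

scores = degree_centrality([(1, 2), (1, 3), (2, 3), (3, 4)])
print(scores)  # node 3 gets the highest score: it is the most connected
```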
Network centrality example: betweenness centrality
Given a graph G = (V, E), the score of node v is proportional to the number of node pairs whose shortest path passes through v.
It represents how critical each node is for keeping the network connected.
Network centrality example: PageRank centrality
Given a directed graph G = (V, E), the score of node v equals the stationary probability of v under a random walk on the directed graph G.
Transition probability matrix of the random walk (RW): if the RW is at node i at a given time step, it moves to node j with probability Qij, where
  if i has a directed edge to j, then Qij = (1 - eps)/di + eps/n,
  else Qij = eps/n,
with di the out-degree of i, n the number of nodes, and eps the teleportation probability.
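The transition matrix above can be sketched directly (a minimal illustration; the adjacency structure and eps = 0.15 below are assumed toy values, and dangling nodes are not handled):

```python
def pagerank_matrix(out_edges, n, eps=0.15):
    """Q[i][j] = (1 - eps)/d_i + eps/n if i -> j is an edge, else eps/n.

    out_edges: {i: list of j such that i has a directed edge to j}.
    """
    Q = [[eps / n] * n for _ in range(n)]
    for i, nbrs in out_edges.items():
        for j in nbrs:
            Q[i][j] += (1 - eps) / len(nbrs)
    return Q

Q = pagerank_matrix({0: [1, 2], 1: [2], 2: [0]}, n=3)
assert all(abs(sum(row) - 1.0) < 1e-12 for row in Q)  # each row is a distribution
```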
Network centrality: data processing
  Data (networked)       Decision                  Centrality
  Corpus of webpages     Search relevant content   PageRank
  Citation data          Scientific importance     H-index
Why (or why not) does a given centrality make sense?

Statistical data processing
Data -> Statistical model -> Decision
Example task: transmit a message bit B (= 0 or 1).
Tx: B B B B B.  Rx: each bit is flipped independently with probability p (= 0.1).
At the receiver, using the received message, decide whether the intended message bit is 0 or 1.
ML estimation: the majority rule.
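The repetition-code example can be checked directly: the majority rule is the ML decision, and decoding fails only when 3 or more of the 5 copies flip (numbers below follow the slide's p = 0.1):

```python
# Repetition code with majority-rule (ML) decoding, as a minimal sketch.
from math import comb

def majority_decode(received):
    """ML estimate of the message bit from its noisy repeated copies."""
    return 1 if sum(received) > len(received) / 2 else 0

# Majority fails only if 3 or more of the 5 copies are flipped.
p = 0.1
p_err = sum(comb(5, k) * p**k * (1 - p)**(5 - k) for k in range(3, 6))
print(majority_decode([0, 1, 0, 0, 1]), round(p_err, 5))  # -> 0 0.00856
```

So five repetitions cut the raw flip probability 0.1 down to under 1 percent.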
Statistical data processing
Data -> Statistical model -> Decision
Data to decision: posit a model connecting data to decision (variables), learn the model, and subsequently make decisions, for example by solving a stochastic optimization problem.
This talk
Network centrality: a statistical view for processing networked data.
Graph score function = appropriate likelihood function.
This view is explained through searching for the source of an information/infection spread (rumor centrality); other examples appear in later talks (Rank centrality, crowd centrality).
Local computation: estimating the stationary probability of a given node in a graph.
1854 London cholera epidemic
Dr. John Snow. Center of mass. Can we find the source in a network?
John Snow is arguably the father of epidemiology. The germ theory, which identified specific bacteria as responsible for cholera, was only put forth in 1861. Before that, the popular belief was that cholera spread due to air pollution. The turning point was the cholera epidemic in London in 1854, during which many people died, primarily in the Soho area. With the help of a local reverend, Snow interviewed the families of the deceased and identified a commonality: almost all of them lived near, and used, the water pump on Broad (Broadwick) Street. Snow therefore concluded that the pump was responsible for this ill fate.
Here I am plotting the locations where the deceased lived, and the pump as the cholera source.
There is a further twist to this story: after the fire, everybody had a cesspit underneath their home, and one of these cesspits was leaking into the well of the water pump. It gets more interesting: just before the outbreak, it seems the nappies of a baby with cholera (contracted from another source) were washed into that cesspit!
That confirmed the suspicion and led to inspection of the water; the rest is history.
Now suppose we did not know the reverend, nor were we as thoughtful or resourceful, but we had knowledge of the network. How would one find the source?
Well: you find the center of gravity!
Stuxnet (and Duqu) worm: who started it?
Searching for the source. Stuxnet is a computer worm that spread in July 2010. It spreads through the network in a peer-to-peer (P2P) manner. It is a so-called rootkit, which alters the core of the system, and it has programmable logic, i.e., it can keep innovating. It primarily targeted industrial software, in particular certain types of Siemens equipment. Little is known officially, but it is widely believed to have set back the nuclear program of a certain country by a few years if not a decade, by affecting the centrifuges used for enriching uranium. It is brilliant! Whichever side you are on, you would like to know: (a) who is the likely source, or (b) can the likely source be detected?
Cyber-attacks. Viral epidemics.
Social trends
Searching for the source
Data (infected nodes, network) -> Statistical model -> Decision (how likely is each node to be the source?)
Model assumptions:
Uniform prior probability of any node being the source.
Spreading times on each edge are i.i.d. random variables; we assume an exponential distribution (to develop intuition), but the results hold for a generic distribution (with no atom at 0).
Model: Susceptible-Infected (SI).

Rumor source estimator
We know the rumor graph G, and we want to find the likelihood function P(G | source = v) for each node v.
It is not obvious how to calculate it.
Rumor spreading order
More spreading orders consistent with a node means that node is more likely to be the source.
The rumor spreading order is not known; only the spreading constraints (who could have infected whom) are available.
For example, on a 4-node graph the feasible orders might include 1324, 1432, 2314, 1234.
New problem: counting spreading orders.
For example, P(G | source = 2) = P(order 2134 | source = 2) + P(order 2143 | source = 2).
All spreading orders are equally likely, so this equals 2 * p(d = 3, N = 4), where p(d, N) denotes the common probability of a single spreading order.
Regular trees
Regularity of the tree plus the memoryless property of the exponential distribution imply that all spreading orders from a given source are equally likely.
Counting spreading orders
Let R(v, G) = number of rumor spreading orders starting from v on G.
For a tree with N nodes, R(v, G) = N! * prod over nodes u of (1 / T_u^v), where N is the network size and T_u^v is the size of the subtree rooted at u when the tree is rooted at v.

Rumor centrality (Shah-Zaman, 2010)
The source estimator is a graph score function.
It is the right score function for source detection: it is the likelihood function for regular trees with exponential spreading times, and it can be calculated in linear time.

Rumor centrality and random walks
Consider the random walk on the tree whose transition probabilities are proportional to subtree sizes; its stationary distribution equals the normalized rumor centrality.
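One way the linear-time computation can be sketched (using the subtree-size formula above plus a re-rooting pass; the edge-list interface is an illustrative choice):

```python
# R(v) = N! * prod_u 1/T_u^v on a tree: root anywhere, compute subtree sizes
# in one pass, then re-root with R(child) = R(parent) * t_child / (N - t_child).
from math import factorial
from collections import defaultdict

def rumor_centrality(edges):
    """R(v) = number of distinct spreading orders starting from v, for a tree."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    n = len(adj)
    root = next(iter(adj))
    # DFS preorder and parents (iterative, to avoid recursion limits).
    parent, order, stack = {root: None}, [], [root]
    while stack:
        u = stack.pop()
        order.append(u)
        for w in adj[u]:
            if w != parent[u]:
                parent[w] = u
                stack.append(w)
    # Pass 1 (leaves up): subtree sizes with respect to `root`.
    size = {}
    for u in reversed(order):
        size[u] = 1 + sum(size[w] for w in adj[u] if parent[w] == u)
    prod = 1
    for u in order:
        prod *= size[u]
    R = {root: factorial(n) // prod}  # the product divides N! exactly
    # Pass 2 (top down): re-root in O(1) per edge.
    for u in order:
        for w in adj[u]:
            if parent[w] == u:
                R[w] = R[u] * size[w] // (n - size[w])
    return R
```

On a 4-node star the center admits 3! = 6 spreading orders and each leaf only 2, so the center is the rumor center.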
Rumor centrality: general networks
The rumor spreads on an underlying spanning tree of the graph; the breadth-first search (BFS) tree is the most likely tree under fast rumor spreading.
Precisely, it is the next-hop information required under shortest-path routing!

Precision of rumor centrality
[Figure: true rumor source vs. estimated rumor source, shaded by normalized rumor centrality.]

Precision of rumor centrality
Bin Laden death Twitter network
[Figure: true rumor source vs. estimated rumor source on the retweet network, shaded by normalized rumor centrality.]
Keith Urbahn was the first to tweet about the death of Osama bin Laden.
Effectiveness of rumor centrality
Simulations and examples show that rumor centrality is useful for finding sources.
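A small Monte Carlo sketch in the same spirit (the setup, a 3-regular tree grown to 31 infected nodes, is an illustrative choice, not the talk's experiments). With exponential spreading times, the next infection crosses a uniformly random boundary edge, and the rumor center is the node minimizing the product of subtree sizes, which is exactly the node maximizing R(v) = n!/prod_u T_u^v:

```python
import random

def simulate_si(d=3, n=31, rng=None):
    """Grow the infected subtree of an infinite d-regular tree to n nodes."""
    rng = rng or random.Random(0)
    parent = {0: None}   # node 0 is the true source
    free = {0: d}        # susceptible boundary edges per infected node
    while len(parent) < n:
        # exponential clocks: next infection crosses a uniform boundary edge
        u = rng.choices(list(free), weights=list(free.values()))[0]
        free[u] -= 1
        if free[u] == 0:
            del free[u]
        w = len(parent)
        parent[w] = u
        free[w] = d - 1
    return parent

def subtree_product(parent, v):
    """Product of subtree sizes when the infected tree is rooted at v."""
    adj = {u: [] for u in parent}
    for u, p in parent.items():
        if p is not None:
            adj[u].append(p)
            adj[p].append(u)
    prod = 1
    def size(u, par):
        nonlocal prod
        s = 1 + sum(size(w, u) for w in adj[u] if w != par)
        prod *= s
        return s
    size(v, None)
    return prod

rng = random.Random(1)
trials, hits = 200, 0
for _ in range(trials):
    tree = simulate_si(rng=rng)
    estimate = min(tree, key=lambda v: subtree_product(tree, v))
    hits += (estimate == 0)
print("empirical detection probability:", hits / trials)
```

The empirical detection probability stays bounded away from 0, in line with the degree > 2 results discussed next.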
Next: when does it work, when does it not, and why?

Source estimation = rumor center
The rumor center v* is the node with maximal rumor centrality.
The network is balanced around the rumor center; if the rumor spreads in a balanced manner, then source = rumor center.

Regular trees (degree = 2)
Proposition [Shah-Zaman, 2010]: Let a rumor spread for a time t on a regular tree with degree d = 2 as per the SI model with exponential (or arbitrary) spreading times (with non-trivial variance). Then the probability of correct detection vanishes as t grows: the two sub-trees around the source remain balanced only with vanishing probability.
That is, line graphs are hopeless. What about a generic tree?

Some useful notation
A rumor spreads for time t to n(t) nodes; let the sequence of infected nodes be {v1, v2, ..., vn(t)}, with v1 the rumor source.
Let C_k^{n(t)} denote the event that the rumor center is v_k after n(t) nodes are infected; C_1^{n(t)} is the event of correct detection.
Result 1: geometric trees
The number of nodes at distance l from any node grows as l^a (polynomial growth).
Proposition [Shah-Zaman, 2011]: Let a rumor spread for a time t on a (regular) geometric tree with a > 0 from a source with degree > 2, as per the SI model with arbitrary spreading times (with exponential moments). Then the probability of correct detection converges to 1.

Result 2: regular trees (degree > 2)
Exponential growth; high-variance rumor graph.
Theorem [Shah-Zaman, 2011]: Let a rumor spread for a time t on a regular tree with degree d > 2 as per the SI model with exponential spreading times. Then the probability of correct detection converges to a strictly positive limit a_d, expressed in terms of the regularized incomplete Beta function
  I_x(a, b) = Gamma(a + b) / (Gamma(a) * Gamma(b)) * integral_0^x s^(a-1) (1-s)^(b-1) ds.
In particular, a_3 = 0.25, and a_d increases towards 1 - ln(2) (about 0.307) as d grows.
Theorem [Shah-Zaman, 2011]: Let a rumor spread for a time t on a regular tree with degree d > 2 as per the SI model with exponential spreading times. Then, with high probability, the estimate is close to the true source.
Result 3: generic random trees
Start from the root; each node i has h_i children, where the h_i are i.i.d.
Theorem [Shah-Zaman, 2012]: Let a rumor spread for a time t on a random tree with E[h_i] > 1 and E[h_i^2] < infinity, from a source with degree > 2, as per the SI model with arbitrary spreading times (non-atomic at 0). Then the probability of correct detection remains strictly positive as t grows.

Implication: sparse random graphs
A random regular graph locally resembles a regular tree; an Erdos-Renyi graph locally resembles a random tree with h_i following a Binomial distribution (Poisson in the large-graph limit). The tree results extend.
Erdos-Renyi graphs: the graph has m nodes, and each edge exists independently with probability p = c/m.
[Figure: simulation on a regular tree with degree = 10,000.]

Proof remarks
Consider the source v1 with sub-trees T1(t), T2(t), T3(t). Incorrect detection corresponds to imbalance among the sub-trees.
Evaluating the detection probability
Standard approach: compute E[Tl(t)], show concentration of Tl(t) around its mean E[Tl(t)], and use it to evaluate P(Ti(t) > Tj(t)).
Issues: the variance of Tl(t) is of the same order as its mean, hence the usual concentration is not useful. Even if it were, it would result in a 0/1-style answer (which is unlikely to be correct).
Evaluating the detection probability
An alternative: understand the ratio of sub-tree sizes and characterize its limiting distribution; that is, show that the fraction Ti(t)/(Ti(t) + Tj(t)) converges to a random variable W, and use W to evaluate P(Ti(t) > Tj(t)) = P(W > 0.5).
Goal: how to find W?
Evaluating the ratio of sub-tree sizes
Let Z1(t) denote the rumor boundary of T1(t), and Z(t) = Z2(t) + Z3(t) the rumor boundary of T2(t) + T3(t).
Initially: T1(0) = 0, T2(0) + T3(0) = 0, Z1(0) = 1, Z2(0) + Z3(0) = 2.
After the first infection (in T1): T1 = 1, T2 + T3 = 0, Z1 = 2, Z2 + Z3 = 2.
After the second infection: T1 = 1, T2 + T3 = 1, Z1 = 2, Z2 + Z3 = 3.
In summary, Z1(t) = T1(t) + 1 and Z2(t) + Z3(t) = T2(t) + T3(t) + 2. Therefore, for large t, T1(t)/(T2(t) + T3(t)) equals Z1(t)/(Z2(t) + Z3(t)), so it suffices to track the ratio of the boundaries.
Consider the sub-tree growth: there are sub-trees and their boundaries. For a regular tree with d > 2, or a generic tree, each sub-tree size grows exponentially in t, so the number of boundary nodes is of the same order as the tree size itself. A first attempt is to obtain concentration around the mean and then argue separation. But each of these quantities has very high variance, of the same order as the tree size, so concentration results only show that the sizes lie within each other's uncertainty regions, which provides no meaningful answer. Further, the average tree size is very sensitive to time, again with very high variance.
Evaluating the ratio of sub-tree sizes
Boundary evolution: there are two types, Z1(t) and Z(t) = Z2(t) + Z3(t). Each new infection increases Z1(t) or Z(t) by +1; the selection is
  Z1(t) with probability Z1(t)/(Z1(t) + Z(t)),
  Z(t) with probability Z(t)/(Z1(t) + Z(t)).
This is exactly Polya's urn with two types of balls.
Evaluating the ratio of sub-tree sizes
Boundary evolution = Polya's urn. Let M(t) = Z1(t)/(Z1(t) + Z(t)); it converges almost surely to a random variable W.
Goal: P(T1(t) > T2(t) + T3(t)) = P(W > 0.5).
Here W has the Beta(1, 2) distribution.
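The urn limit can be checked numerically (the sample sizes below are arbitrary): start with one type-1 ball (Z1(0) = 1) and two type-2 balls (Z(0) = 2), reinforce the drawn type, and the fraction of type-1 balls converges to W ~ Beta(1, 2), whose mean is 1/3 and for which P(W > 1/2) = (1 - 1/2)^2 = 1/4:

```python
import random

def polya_fraction(steps, rng):
    """Run the urn for `steps` draws; return the final fraction of type-1 balls."""
    z1, z = 1, 2  # Z1(0) = 1, Z(0) = 2
    for _ in range(steps):
        if rng.random() < z1 / (z1 + z):
            z1 += 1  # drew type 1: reinforce type 1
        else:
            z += 1   # drew type 2: reinforce type 2
    return z1 / (z1 + z)

rng = random.Random(0)
samples = [polya_fraction(1000, rng) for _ in range(3000)]
mean = sum(samples) / len(samples)                    # Beta(1, 2) mean is 1/3
tail = sum(s > 0.5 for s in samples) / len(samples)   # P(W > 1/2) = 1/4
print(round(mean, 3), round(tail, 3))
```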
Probability of correct detection
For a generic d-regular tree, the corresponding W has the Beta(1/(d-2), (d-1)/(d-2)) distribution. The limiting detection probability then follows by evaluating the corresponding Beta tail probabilities (with the shorthand nu_d = 1/(d-2)).
Generic trees: branching process
Root the infected tree at v1; the sub-tree T1(t) has rumor boundary Z(t), which evolves as a branching process (started from Z(0) = 1 for the sub-tree of interest, and Z(0) = k - 1 for the remaining sub-trees of a degree-k source).
Lemma (Shah-Zaman 12): For large t, Z(t) is proportional to T1(t), so T1(t) can be tracked through Z(t), and likewise T2(t) + ... + Tk(t).

Branching process convergence
The following result is known for branching processes (cf. Athreya-Ney 67): e^(-a t) Z(t) converges almost surely to W, where a is the Malthusian parameter, which depends on the distribution of the spreading time and the node degrees, and W is a non-degenerate random variable with an absolutely continuous distribution.
For a regular tree with exponential spreading times, W has a Beta distribution.
Summary, thus far
Rumor source detection.
A useful graph score function: rumor centrality. It is the exact likelihood function for certain networks and can be computed quickly (e.g., using a linear iterative algorithm).
Effectiveness: it accurately finds the source on essentially any tree or sparse random graph, for any spreading time distribution.
What else can it be useful for? See the thesis of Zaman (a Twitter search engine) and Bhamidi, Steele and Zaman 13.

Computing centrality
Computing centrality amounts to finding the stationary distribution of a random walk on the network, in reasonably many settings, including PageRank, rumor centrality, and Rank centrality.
Well, that should be easy
Computing the stationary distribution
Power iteration method [cf. Golub-Loan 96]: iteratively multiply the matrix and the vector.
It primarily requires centralized computation: 100 GB of RAM limits the (dense) matrix size to roughly 100k nodes, but a social network can have more than a million nodes, and the web is much larger.
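The centralized baseline looks like this (a toy 3-state chain; the stationary distribution here is uniform because the matrix happens to be doubly stochastic):

```python
def power_iteration(P, iters=200):
    """Repeatedly apply the transition matrix to a starting distribution."""
    n = len(P)
    pi = [0.0] * n
    pi[0] = 1.0  # start with all mass on state 0
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

P = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5]]
pi = power_iteration(P)  # converges to the uniform distribution (1/3, 1/3, 1/3)
```

Every iteration touches the whole matrix, which is exactly the scaling problem described above.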
So, it is not that easy.

Computing the stationary distribution: PageRank-specific local computation solutions.
A collection of clever, powerful solutions: Jeh et al. 2003, Fogaras et al. 2005, Avrachenkov et al. 2007, Bahmani et al. 2010, Borgs et al. 2012.
These rely on the fact that, from each node, a transition to any other node happens with probability greater than or equal to a known fixed positive constant (eps/n, the teleportation term).
They do not extend to arbitrary random walks or to countably infinite graphs.
Markov chain, stationary distribution
Consider a random walk or Markov chain on a state space of (unknown) finite or countably infinite size, where each node/state can execute the next step of the chain, jumping from state i to state j with probability Pij. Assume the chain is irreducible, aperiodic, and positive recurrent, so that there is a well-defined, unique stationary distribution pi.
Goal: for any given node i, obtain an estimate of pi_i by accessing only the local neighborhood of node i.
Key property
True value: pi_i = 1 / E_i[T_i], where T_i is the return time to state i and E_i[T_i] is the expected return time starting from i.
Estimate: replace the expected return time by the average truncated return time over sampled return paths.

Algorithm
Input: a Markov chain (state space, transition matrix P) and a node i. Parameters: the number of samples, the truncation threshold, and accuracy/confidence parameters.
The algorithm iterates three steps: gather samples; terminate if satisfied; otherwise update parameters and repeat.

Algorithm: gather samples
Sample truncated return paths from node i, and record the fraction of samples that were truncated.
Then check the termination condition; if not satisfied, update and repeat.
[Animation: a random walk started at node i. Some sampled paths return to i ("Returned to i!"); others keep walking until the path length exceeds the truncation threshold ("Path length exceeded").]
The key idea of our algorithm is to use truncation; however, it comes with a tradeoff. By truncating the walk we lose information about how much longer the sampled return time would have been, yet on the other hand we save computation time. In the following slides we analyze the effect of truncation on the estimate for different nodes.
Algorithm: update and repeat
Double the truncation threshold and increase the number of samples so that, with probability greater than a prescribed confidence level, the estimate remains within a prescribed closeness of the true value (the two parameters: closeness of estimate, and confidence).
Algorithm: terminate if satisfied
Terminate and output the current estimate if
(a) the node is unimportant enough (its estimated stationary probability falls below a threshold for importance), or
(b) the fraction of truncated samples is small.
Otherwise, update and repeat.

Local computation [Lee-Ozdaglar-Shah 13]
Correctness under condition (a) [Lee-Ozdaglar-Shah 13].
Correctness under condition (b) [Lee-Ozdaglar-Shah 13].
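A sketch of the sampling step (the `step` interface and the 3-cycle example are illustrative assumptions, not the paper's notation): sample return paths from i, truncate each at length theta, and estimate pi_i as one over the average truncated return time, tracking the truncated fraction used by conditions (a)-(b):

```python
import random

def estimate_stationary(step, i, theta, num_samples, rng):
    """step(u, rng) -> next state; only local neighborhoods are touched."""
    total_len, truncated = 0, 0
    for _ in range(num_samples):
        u, length = step(i, rng), 1
        while u != i and length < theta:
            u, length = step(u, rng), length + 1
        total_len += length            # = return time, or theta if truncated
        truncated += (u != i)
    # pi_i estimated as 1 / (average truncated return time); the truncated
    # fraction feeds the termination rule and the bias correction.
    return num_samples / total_len, truncated / num_samples

# Example: symmetric random walk on a 3-cycle (stationary probability 1/3 each).
def cycle_step(u, rng):
    return (u + rng.choice([-1, 1])) % 3

pi_hat, p_trunc = estimate_stationary(
    cycle_step, i=0, theta=100, num_samples=5000, rng=random.Random(0))
print(round(pi_hat, 3), p_trunc)
```

On the 3-cycle the expected return time is 3, so the estimate lands near 1/3, and almost no paths hit the truncation threshold.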
Simulation: PageRank
[Figure: nodes sorted by stationary probability vs. stationary probability, on a random graph generated using the configuration model with a power-law degree distribution.]
The algorithm obtains close estimates for the important nodes.
[Figure: the same plot after bias correction.]
The bias correction corrects for the bias!
Bias correction
True value: pi_i = 1 / E_i[T_i].
Estimate: the truncated-return-time estimate, rescaled using the fraction of samples that were not truncated.
Summary
Network centrality: a useful tool for data processing.
A principled approach: graph score function = likelihood function.
An example: rumor centrality, which achieves accurate source detection.
Local computation: estimating the stationary distribution of a Markov chain / random walk.