Network Degree Distribution Inference Under Sampling
Aleksandrina Goeva (1), Eric Kolaczyk (1), Rich Lehoucq (2)
(1) Department of Mathematics and Statistics, Boston University
(2) Sandia National Labs
August 18, 2016
BU/Keio, Boston, MA
A. Goeva, E. Kolaczyk, R. Lehoucq Network Degree Distribution Inference Under Sampling
Motivation
Sampling introduces randomness in the sampled network.
Sampled network characteristics may not represent those of the true
network well.
Sampling Mechanisms - Examples
[Figure: examples of sampling mechanisms. Source: Kolaczyk (2009)]
Setup
Problem introduced by Frank (1968)
$$\mathbb{E}[N^*] = PN$$
$N = (N_0, N_1, \ldots, N_M)$ is the degree counts vector of the true network,
$N^* = (N^*_0, N^*_1, \ldots, N^*_M)$ is the degree counts vector of the sampled network,
P is a linear operator that depends fully on the sampling scheme and
not on the network itself, and
M is the maximum degree in the true network G
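As an illustration of how P can depend only on the sampling scheme, the sketch below builds P for one specific design that the slides do not name: each node is kept independently with probability p and the sampled network is the induced subgraph. Under that assumption a node of true degree j is observed with degree i with probability p times a Binomial(j, p) probability. The function name `sampling_operator` and the example vector `N` are hypothetical.

```python
import numpy as np
from math import comb


def sampling_operator(max_degree, p):
    """P[i, j] = probability that a node of true degree j is kept and observed with degree i.

    Assumes Bernoulli(p) node sampling with the induced subgraph (an illustrative choice).
    """
    size = max_degree + 1
    P = np.zeros((size, size))
    for j in range(size):
        for i in range(j + 1):
            # Node kept with prob. p; each of its j neighbours kept independently with prob. p.
            P[i, j] = p * comb(j, i) * p**i * (1 - p) ** (j - i)
    return P


# Hypothetical true degree counts N_0, ..., N_10 (600 nodes in total).
N = np.array([0, 50, 120, 150, 120, 80, 40, 20, 10, 7, 3], dtype=float)
p = 0.5
P = sampling_operator(len(N) - 1, p)

# Expected degree counts of the sampled network, E[N*] = P N.
expected_N_star = P @ N
print(expected_N_star)
```

In this design P is upper triangular, because sampling can only lower a node's degree, and each column sums to p, because a node survives sampling with probability p.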
Naive Solution Issues
$$\hat{N}_{\text{naive}} = P^{-1} N^*$$
P is typically non-invertible.
Solutions may not be non-negative.
Footnote 2: Zhang, Kolaczyk, Spencer (2015)
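A small numerical sketch of the instability, again assuming Bernoulli(p) node sampling with the induced subgraph (a design chosen here for illustration, not taken from the slides). In this particular design P happens to be invertible but is severely ill-conditioned, which is already enough to make the naive solution unusable: adding a single extra observed node to the expected counts drives an entry of the naive estimate far below zero. The names and the example vector `N` are hypothetical.

```python
import numpy as np
from math import comb


def sampling_operator(max_degree, p):
    """P[i, j] for Bernoulli(p) node sampling with the induced subgraph (assumed design)."""
    size = max_degree + 1
    P = np.zeros((size, size))
    for j in range(size):
        for i in range(j + 1):
            P[i, j] = p * comb(j, i) * p**i * (1 - p) ** (j - i)
    return P


N = np.array([0, 50, 120, 150, 120, 80, 40, 20, 10, 7, 3], dtype=float)
M = len(N) - 1
P = sampling_operator(M, p=0.3)

# Ratio of largest to smallest singular value of P.
condition_number = np.linalg.cond(P)

# Start from the noise-free expected counts and add one extra node of observed degree M.
N_star = P @ N
N_star[M] += 1.0

# Naive estimate: solve P x = N* directly.
N_naive = np.linalg.solve(P, N_star)

print("condition number:", condition_number)
print("smallest entry of naive estimate:", N_naive.min())
```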
Problem Formulation
$$N^* = PN + E$$
Ill-posed linear inverse problem.
P is not random; it depends only on the sampling design.
E is the noise due to sampling.
$$\mathbb{E}[E] = 0, \qquad \mathbb{E}[EE^T] = C$$
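The decomposition can be checked by simulation. The sketch below is a toy setup, not the authors' experiment: it draws a small random graph, repeatedly applies Bernoulli(p) node sampling with the induced subgraph (an assumed design), and compares the average sampled degree counts with P N. All names, the graph model, and the parameter values are hypothetical.

```python
import numpy as np
from math import comb


def sampling_operator(max_degree, p):
    """P[i, j] for Bernoulli(p) node sampling with the induced subgraph (assumed design)."""
    size = max_degree + 1
    P = np.zeros((size, size))
    for j in range(size):
        for i in range(j + 1):
            P[i, j] = p * comb(j, i) * p**i * (1 - p) ** (j - i)
    return P


def sampled_degree_counts(adj, p, max_degree, rng):
    """Keep each node with probability p; return degree counts of the induced subgraph."""
    keep = rng.random(adj.shape[0]) < p
    induced = adj[np.ix_(keep, keep)]
    return np.bincount(induced.sum(axis=1), minlength=max_degree + 1)


rng = np.random.default_rng(0)
n_nodes, p = 200, 0.5

# Toy "true" network: symmetric 0/1 adjacency matrix without self-loops.
upper = np.triu(rng.random((n_nodes, n_nodes)) < 0.05, k=1)
adj = (upper | upper.T).astype(int)

true_degrees = adj.sum(axis=1)
M = int(true_degrees.max())
N = np.bincount(true_degrees, minlength=M + 1)   # true degree counts
P = sampling_operator(M, p)

# One sampled degree-count vector N* per replicate.
draws = np.array([sampled_degree_counts(adj, p, M, rng) for _ in range(1000)])

# Noise vectors E = N* - P N; their average should be close to zero.
E = draws - P @ N
print("largest |mean noise| over degrees:", np.abs(E.mean(axis=0)).max())
```

The empirical covariance of the rows of `E` (for example `np.cov(E, rowvar=False)`) gives a simulation-based stand-in for C in this toy setting.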
Proposed Approach
Complexity Functional
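$$K(\tilde{N}, \cdot) = (P\tilde{N} - \cdot)^T C^{-1} (P\tilde{N} - \cdot) + \lambda \|D\tilde{N}\|_2^2$$
where $\lambda$ is a regularization parameter
D is a second-order differencing operator
$C = \mathrm{Cov}(N^*) = \mathbb{E}[EE^T]$
Look for a constrained solution
$$\tilde{N} \in \mathcal{C} := \{\tilde{N} : \tilde{N} \ge 0 \text{ and } 1^T \tilde{N} = n_v\}$$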
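To make the smoothness penalty concrete, here is a minimal sketch of one common form of a second-order differencing operator, with rows (1, -2, 1). The slides do not give the exact matrix, so this form, the helper name, and the two test vectors are assumptions.

```python
import numpy as np


def second_difference_operator(size):
    """Matrix D with (D x)[k] = x[k] - 2 x[k+1] + x[k+2]; shape (size - 2, size)."""
    return np.diff(np.eye(size), n=2, axis=0)


D = second_difference_operator(6)

linear = np.arange(6, dtype=float)                      # 0, 1, 2, 3, 4, 5
jagged = np.array([0.0, 5.0, 0.0, 5.0, 0.0, 5.0])

# Squared 2-norm of D x, the quantity multiplied by lambda in K.
penalty_linear = np.sum((D @ linear) ** 2)
penalty_jagged = np.sum((D @ jagged) ** 2)
print(penalty_linear, penalty_jagged)
```

A vector that varies linearly across degrees has zero penalty, while the oscillating vector is penalized heavily, so larger values of the regularization parameter favor smoother estimated degree counts.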
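$\mathcal{C}$-constrained Minimum Empirical Complexity Estimate
Constrained Penalized Weighted Least Squares
$$\min_{\tilde{N}} \; (P\tilde{N} - N^*)^T C^{-1} (P\tilde{N} - N^*) + \lambda \|D\tilde{N}\|_2^2 \quad \text{subject to } \tilde{N} \in \mathcal{C}$$
$\mathcal{C}$-constrained minimum empirical complexity estimate:
$$\hat{N} = \arg\min_{\tilde{N} \in \mathcal{C}} K(\tilde{N}, N^*)$$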
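The optimization problem is a small quadratic program and can be sketched with a general-purpose solver. The example below uses SciPy's SLSQP method, which is a choice made here and not necessarily the solver used by the authors. It also assumes Bernoulli(p) node sampling for P, a diagonal stand-in for C, rounded expected counts as stand-in data, that n_v is the known number of vertices of the true network, and an arbitrary value of the regularization parameter; all names are hypothetical.

```python
import numpy as np
from math import comb
from scipy.optimize import minimize


def sampling_operator(max_degree, p):
    """P[i, j] for Bernoulli(p) node sampling with the induced subgraph (assumed design)."""
    size = max_degree + 1
    P = np.zeros((size, size))
    for j in range(size):
        for i in range(j + 1):
            P[i, j] = p * comb(j, i) * p**i * (1 - p) ** (j - i)
    return P


N_true = np.array([0, 50, 120, 150, 120, 80, 40, 20, 10, 7, 3], dtype=float)
M = len(N_true) - 1
n_v = N_true.sum()                       # assumed known number of vertices
P = sampling_operator(M, p=0.5)

N_star = np.round(P @ N_true)            # stand-in for observed sampled counts
c_diag = np.maximum(N_star, 1.0)         # assumed diagonal approximation of C
D = np.diff(np.eye(M + 1), n=2, axis=0)  # second-order differencing operator
lam = 1e-2                               # arbitrary regularization parameter


def complexity(x):
    """Weighted least squares term plus smoothness penalty, K(x, N*)."""
    r = P @ x - N_star
    return r @ (r / c_diag) + lam * np.sum((D @ x) ** 2)


def gradient(x):
    r = P @ x - N_star
    return 2 * P.T @ (r / c_diag) + 2 * lam * D.T @ (D @ x)


x0 = np.full(M + 1, n_v / (M + 1))       # feasible starting point
result = minimize(
    complexity, x0, jac=gradient, method="SLSQP",
    bounds=[(0.0, None)] * (M + 1),                      # non-negativity
    constraints=[{"type": "eq",                          # entries sum to n_v
                  "fun": lambda x: x.sum() - n_v,
                  "jac": lambda x: np.ones_like(x)}],
    options={"maxiter": 500, "ftol": 1e-10},
)
N_hat = result.x
print(np.round(N_hat, 1))
```

Because both constraints are enforced by the solver, the estimate is a valid vector of degree counts, unlike the naive inverse.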
Quality of the Solution
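We aim to upper-bound the risk:
$$\mathbb{E}\left[\|P\hat{N} - PN\|^2_{C^{-1}}\right] = \mathbb{E}\left[(P\hat{N} - PN)^T C^{-1} (P\hat{N} - PN)\right]$$
$$\le \mathbb{E}\left[K(\hat{N}, PN)\right]$$
$$\le K(N^0, PN) + 2\,\mathbb{E}\left[\langle C^{-1/2}E,\; C^{-1/2}(P\hat{N} - PN^0)\rangle\right]$$
where
$$N^0 = \arg\min_{\tilde{N} \in \mathcal{C}} K(\tilde{N}, PN)$$
is the $\mathcal{C}$-constrained minimizer of theoretical complexity.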
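The second inequality follows from a standard basic-inequality argument. The sketch below fills in the intermediate steps, which are not shown on the slide, assuming C is symmetric positive definite and writing $N^0$ for the theoretical minimizer defined above.

```latex
\begin{align*}
K(\tilde N, N^*) &= K(\tilde N, PN) - 2\,E^T C^{-1}(P\tilde N - PN) + E^T C^{-1} E
  && \text{(substitute } N^* = PN + E\text{)} \\
K(\hat N, N^*) &\le K(N^0, N^*)
  && \text{(}\hat N \text{ minimizes } K(\cdot, N^*) \text{ over } \mathcal{C}\text{, and } N^0 \in \mathcal{C}\text{)} \\
K(\hat N, PN) &\le K(N^0, PN) + 2\,E^T C^{-1}(P\hat N - PN^0)
  && \text{(expand both sides, cancel } E^T C^{-1} E\text{)} \\
  &= K(N^0, PN) + 2\,\langle C^{-1/2}E,\; C^{-1/2}(P\hat N - PN^0)\rangle .
\end{align*}
```

Taking expectations gives the stated bound. The first inequality holds because $K(\hat N, PN)$ equals the weighted squared error plus the non-negative penalty $\lambda \|D\hat N\|_2^2$.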
First Term
$$K(N^0, PN)$$
This is the minimum theoretical complexity.
This term is not random.
Bounded in terms of a functional of the sampling design.
Different Regimes
Underlying all sampling mechanisms there is a fundamental quantity p
controlling the rate of sampling.
The problem behaves differently depending on the value of p.
We identify three regimes:
p = 1: full information - trivial case, no noise, P is diagonal.
small p: the distribution of E is approximately Poisson.
moderate p: the distribution of E is approximately Normal.
Different Regimes - Small p
Small p ≈ 10% to 20%: the distribution of E is approximately Poisson.
[Figure: Q-Q plot of sample quantiles against Poisson theoretical quantiles]
Different Regimes - Moderate p
Moderate p ≈ 30% to 60%: the distribution of E is approximately Normal.
[Figure: Normal Q-Q plot of sample quantiles against theoretical quantiles]
Second Term
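$$\mathbb{E}\left[\langle C^{-1/2}E,\; C^{-1/2}(P\hat{N} - PN^0)\rangle\right] \le \mathbb{E}\left[\sup_{P\hat{N} - PN^0 \,\in\, \text{set}} \langle C^{-1/2}E,\; C^{-1/2}(P\hat{N} - PN^0)\rangle\right]$$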
Under the moderate p regime, the distribution of E is reasonably
close to Gaussian.
Assuming the entries of C
�1/2E are independent standard Gaussian,
we can bound this term using Gaussian widths.
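For reference, the Gaussian width of a set and one crude bound on it can be written as follows. The symbols $T$ and $r$ are placeholders introduced here: the slides do not specify the set over which the supremum is taken, so this is only a sketch of the kind of bound available under the stated Gaussian assumption.

```latex
\begin{align*}
w(T) &:= \mathbb{E}\Big[\sup_{t \in T} \langle g, t \rangle\Big],
  \qquad g \sim \mathcal{N}(0, I_{M+1}), \\
\intertext{so that, with $g = C^{-1/2}E$ and $T$ a set containing $C^{-1/2}(P\hat N - PN^0)$,
the second term is at most $w(T)$. If in addition $\|t\|_2 \le r$ for all $t \in T$, then}
w(T) &\le r\,\mathbb{E}\|g\|_2 \le r\sqrt{\mathbb{E}\|g\|_2^2} = r\sqrt{M+1}
\end{align*}
```

The first step uses the Cauchy-Schwarz inequality and the second uses Jensen's inequality. Sharper bounds depend on the geometry of the actual set.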
Summary
Motivation:
Problem arises in the context of sampled networks.
Under many sampling designs the expectation of the sampled degree
distribution is the product of a design-dependent matrix and the true
underlying degree distribution.
Main Idea:
Unusual ill-conditioned linear inverse problem.
The empirical analysis of Zhang et al. (2015) of the constrained
penalized weighted least squares solution is the first non-parametric
approach to the problem since it was proposed about 35 years ago.
To our knowledge, our work is the first attempt to produce theoretical
guarantees on the performance of the proposed solution.
Thank you!