
Network Degree Distribution Inference Under Sampling

Aleksandrina Goeva (1), Eric Kolaczyk (1), Rich Lehoucq (2)

(1) Department of Mathematics and Statistics, Boston University

(2) Sandia National Labs

August 18, 2016

BU/Keio, Boston, MA

A. Goeva, E. Kolaczyk, R. Lehoucq Network Degree Distribution Inference Under Sampling

Motivation

Sampling introduces randomness in the sampled network.

Sampled network characteristics may not represent those of the true network well.


Sampling Mechanisms - Examples

[Figure: examples of sampling mechanisms. Source: Kolaczyk (2009)]


Setup

Problem introduced by Frank (1968)

$\mathbb{E}[N^*] = P N$

$N = (N_0, N_1, \ldots, N_M)$ is the degree counts vector of the true network,

$N^* = (N^*_0, N^*_1, \ldots, N^*_M)$ is the degree counts vector of the sampled network,

$P$ is a linear operator that depends fully on the sampling scheme and not on the network itself, and

$M$ is the maximum degree in the true network $G$.
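As one concrete instance of such an operator, consider induced-subgraph sampling in which each vertex is kept independently with probability p. This design is an assumption chosen for illustration; the slides do not fix one. A vertex of true degree j is kept with probability p, and each of its j neighbours survives independently with probability p, so its observed degree is Binomial(j, p). That gives P(i, j) = C(j, i) p^(i+1) (1-p)^(j-i) for i <= j. The function name `design_matrix` is hypothetical.

```python
from math import comb

def design_matrix(M, p):
    """(M+1) x (M+1) operator P for Bernoulli(p) induced-subgraph sampling.

    Assumed design, for illustration only. Entry P[i][j] is the probability
    that a vertex of true degree j is sampled and observed with degree i.
    """
    P = [[0.0] * (M + 1) for _ in range(M + 1)]
    for j in range(M + 1):
        for i in range(j + 1):
            P[i][j] = comb(j, i) * p ** (i + 1) * (1 - p) ** (j - i)
    return P

P = design_matrix(3, 0.5)
# Each column sums to p: the chance that the vertex itself is sampled.
column_sums = [sum(P[i][j] for i in range(4)) for j in range(4)]
```

Under this assumed design, setting p = 1 reduces P to the identity matrix, and P depends only on p and M, not on the network.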


Naive Solution Issues

$\hat{N}_{\text{naive}} = P^{-1} N^*$

$P$ is typically non-invertible.

Solutions may not be non-negative (Zhang, Kolaczyk, Spencer, 2015).
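A small numerical illustration of the non-negativity issue, under assumptions that are not in the slides: the 3 x 3 operator below is the Bernoulli(p = 0.5) induced-subgraph matrix for M = 2, which happens to be upper triangular and invertible, and the observed counts `n_star` are made up. Even in this invertible case, solving by back-substitution returns a negative count.

```python
def back_substitute(U, b):
    """Solve U x = b for an upper-triangular U with nonzero diagonal."""
    n = len(b)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(U[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - tail) / U[i][i]
    return x

# Assumed example operator: P[i][j] = C(j, i) * p**(i+1) * (1-p)**(j-i), p = 0.5.
P = [[0.5, 0.25, 0.125],
     [0.0, 0.25, 0.25],
     [0.0, 0.0, 0.125]]
n_star = [3.0, 0.0, 1.0]  # hypothetical observed degree counts

n_naive = back_substitute(P, n_star)
print(n_naive)  # [8.0, -8.0, 8.0]: the estimated degree-1 count is negative
```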


Problem Formulation

$N^* = PN + E$

Ill-posed linear inverse problem.

P is not random, depends only on the sampling design.

E is the noise due to sampling.

$\mathbb{E}[E] = 0$

$\mathbb{E}[E E^T] = C$


Proposed Approach

Complexity Functional

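$K(\tilde{N}, \cdot) = (P\tilde{N} - \cdot)^T C^{-1} (P\tilde{N} - \cdot) + \lambda \|D\tilde{N}\|_2^2$

where $\lambda$ is a regularization parameter,

$D$ is a second-order differencing operator, and

$C = \mathrm{Cov}(N^*) = \mathbb{E}[E E^T]$.

Look for a constrained solution

$\tilde{N} \in \mathcal{C} := \{\tilde{N} : \tilde{N} \geq 0 \text{ and } 1^T \tilde{N} = n_v\}$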
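For concreteness, one common form of the second-order differencing operator acts componentwise as below. This is an assumption: the slides do not spell out $D$ or how it treats the boundary.

```latex
(D\tilde{N})_k = \tilde{N}_k - 2\tilde{N}_{k+1} + \tilde{N}_{k+2},
\quad k = 0, \ldots, M-2,
\qquad
\|D\tilde{N}\|_2^2 = \sum_{k=0}^{M-2}
  \left(\tilde{N}_k - 2\tilde{N}_{k+1} + \tilde{N}_{k+2}\right)^2 .
```

With this form the penalty vanishes exactly when $\tilde{N}$ is linear in the degree index, so a larger $\lambda$ favours smoother estimated degree counts.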


$\mathcal{C}$-constrained Minimum Empirical Complexity Estimate

Constrained Penalized Weighted Least Squares

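$\min_{\tilde{N}} \; (P\tilde{N} - N^*)^T C^{-1} (P\tilde{N} - N^*) + \lambda \|D\tilde{N}\|_2^2$ subject to $\tilde{N} \in \mathcal{C}$

$\mathcal{C}$-constrained minimum empirical complexity estimate:

$\hat{N} = \arg\min_{\tilde{N} \in \mathcal{C}} K(\tilde{N}, N^*)$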
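The constrained problem above is a convex quadratic program. The sketch below solves a toy instance with projected gradient descent, a generic method swapped in for illustration and not necessarily the solver used by the authors. It assumes a diagonal covariance, so that $C^{-1}$ is represented by a weight vector `w`, and the componentwise second-difference form of $D$. All function names, the toy operator, and the data are hypothetical.

```python
def second_diff(x):
    """Apply D: (Dx)_k = x_k - 2 x_{k+1} + x_{k+2}."""
    return [x[k] - 2 * x[k + 1] + x[k + 2] for k in range(len(x) - 2)]

def second_diff_adjoint(d, n):
    """Apply D^T to a vector d of length n - 2."""
    out = [0.0] * n
    for k, dk in enumerate(d):
        out[k] += dk
        out[k + 1] -= 2 * dk
        out[k + 2] += dk
    return out

def residual(x, P, y):
    return [sum(pij * xj for pij, xj in zip(row, x)) - yi for row, yi in zip(P, y)]

def objective(x, P, y, w, lam):
    """Weighted least-squares fit plus lam * ||D x||^2, with C^{-1} = diag(w)."""
    r = residual(x, P, y)
    fit = sum(wi * ri * ri for wi, ri in zip(w, r))
    return fit + lam * sum(d * d for d in second_diff(x))

def gradient(x, P, y, w, lam):
    n = len(x)
    r = residual(x, P, y)
    g_fit = [2 * sum(P[i][j] * w[i] * r[i] for i in range(len(y))) for j in range(n)]
    g_pen = second_diff_adjoint(second_diff(x), n)
    return [gf + 2 * lam * gp for gf, gp in zip(g_fit, g_pen)]

def project(v, total):
    """Euclidean projection onto {x >= 0, sum(x) = total} (sort-based)."""
    theta, running = 0.0, 0.0
    for k, uk in enumerate(sorted(v, reverse=True), start=1):
        running += uk
        t = (running - total) / k
        if uk - t > 0:
            theta = t
    return [max(vi - theta, 0.0) for vi in v]

def fit_pwls(P, y, w, lam, total, iters=2000):
    n = len(P[0])
    # Upper bound on the gradient's Lipschitz constant; ||D^T D|| <= 16.
    lipschitz = 2 * (max(w) * sum(a * a for row in P for a in row) + 16 * lam)
    x = [total / n] * n  # feasible starting point
    for _ in range(iters):
        g = gradient(x, P, y, w, lam)
        x = project([xj - gj / lipschitz for xj, gj in zip(x, g)], total)
    return x

# Toy instance (hypothetical): 3 x 3 operator, made-up counts, n_v = 8.
P = [[0.5, 0.25, 0.125], [0.0, 0.25, 0.25], [0.0, 0.0, 0.125]]
n_star = [3.0, 0.0, 1.0]
n_hat = fit_pwls(P, n_star, w=[1.0, 1.0, 1.0], lam=0.1, total=8.0)
```

Every iterate stays in the constraint set because the projection enforces both non-negativity and the sum constraint. For problems of realistic size a dedicated quadratic programming solver would be the more natural choice.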


Quality of the Solution

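We aim to upper-bound the risk:

$\mathbb{E}[\|P\hat{N} - PN\|^2_{C^{-1}}] = \mathbb{E}[(P\hat{N} - PN)^T C^{-1} (P\hat{N} - PN)]$

$\leq \mathbb{E}[K(\hat{N}, PN)]$

$\leq K(N^0, PN) + 2\,\mathbb{E}[\langle C^{-1/2}E, C^{-1/2}(P\hat{N} - PN^0) \rangle]$

where

$N^0 = \arg\min_{\tilde{N} \in \mathcal{C}} K(\tilde{N}, PN)$

is the $\mathcal{C}$-constrained minimizer of theoretical complexity.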
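Both inequalities in the risk bound can be obtained from a standard "basic inequality" argument, sketched here under the model $N^* = PN + E$. This is the usual route for penalized least squares and may differ in detail from the authors' proof. The shorthand $\langle u, v \rangle_{C^{-1}} = u^T C^{-1} v$ is introduced here for brevity.

```latex
\begin{align*}
\|P\hat{N} - PN\|_{C^{-1}}^2
  &\le \|P\hat{N} - PN\|_{C^{-1}}^2 + \lambda \|D\hat{N}\|_2^2
   = K(\hat{N}, PN), \\
K(\hat{N}, N^*)
  &\le K(N^0, N^*)
  \quad \text{since } \hat{N} \text{ minimizes } K(\cdot, N^*)
  \text{ over } \mathcal{C} \text{ and } N^0 \in \mathcal{C}, \\
K(\tilde{N}, N^*)
  &= K(\tilde{N}, PN)
   - 2 \langle E, P\tilde{N} - PN \rangle_{C^{-1}}
   + \|E\|_{C^{-1}}^2
  \quad \text{for any } \tilde{N}, \\
\Rightarrow \quad K(\hat{N}, PN)
  &\le K(N^0, PN)
   + 2 \langle C^{-1/2} E,\; C^{-1/2} (P\hat{N} - PN^0) \rangle .
\end{align*}
```

Taking expectations of the first and last lines gives the two inequalities, with $K(N^0, PN)$ unchanged because it is not random.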


First Term

$K(N^0, PN)$

This is the minimum theoretical complexity.

This term is not random.

Bounded in terms of a functional of the sampling design.


Different Regimes

Underlying all sampling mechanisms there is a fundamental quantity $p$ controlling the rate of sampling.

The problem behaves differently depending on the values of $p$.

We identify three regimes:

p = 1: full information - trivial case, no noise, P is diagonal.

small p: the distribution of E is approximately Poisson.

moderate p: the distribution of E is approximately Normal.


Different Regimes - Small p

Small $p \approx$ 10% to 20%: the distribution of $E$ is approximately Poisson.

[Figure: Q-Q plot of sample quantiles against Poisson theoretical quantiles]


Different Regimes - Moderate p

Moderate $p \approx$ 30% to 60%: the distribution of $E$ is approximately Normal.

[Figure: Normal Q-Q plot of sample quantiles against theoretical quantiles]


Second Term

$\mathbb{E}[\langle C^{-1/2}E, C^{-1/2}(P\hat{N} - PN^0) \rangle]$

$\leq \mathbb{E}\left[\sup_{P\hat{N} - PN^0 \in \text{set}} \langle C^{-1/2}E, C^{-1/2}(P\hat{N} - PN^0) \rangle\right]$

Under the moderate $p$ regime, the distribution of $E$ is reasonably close to Gaussian.

Assuming the entries of $C^{-1/2}E$ are independent standard Gaussian, we can bound this term using Gaussian widths.
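For reference, the Gaussian width of a set $T \subset \mathbb{R}^{M+1}$ is defined as follows. The symbols $g$ and $T$ are introduced here for illustration; the slides do not specify the set over which the supremum is taken, written "set" in the bound.

```latex
w(T) = \mathbb{E}\left[\, \sup_{t \in T} \langle g, t \rangle \right],
\qquad g \sim \mathcal{N}(0, I_{M+1}).
```

If $g = C^{-1/2}E$ is treated as standard Gaussian and $T = \{ C^{-1/2} v : v \in \text{set} \}$, the supremum term is exactly $w(T)$, so any bound on the width of $T$ bounds the second term.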


Summary

Motivation:

Problem arises in the context of sampled networks.

Under many sampling designs the expectation of the sampled degree distribution is the product of a design-dependent matrix and the true underlying degree distribution.

Main Idea:

Unusual ill-conditioned linear inverse problem.

The empirical analysis of Zhang et al. (2015) of the constrained penalized weighted least squares solution is the first non-parametric approach to the problem since it was proposed about 35 years ago.

To our knowledge, our work is the first attempt to produce theoretical guarantees on the performance of the proposed solution.

Thank you!

