Source: math.bu.edu/keio2016/talks/Goeva.pdf

Network Degree Distribution Inference Under Sampling

Aleksandrina Goeva (1), Eric Kolaczyk (1), Rich Lehoucq (2)

(1) Department of Mathematics and Statistics, Boston University

(2) Sandia National Labs

August 18, 2016

BU/Keio, Boston, MA

A. Goeva, E. Kolaczyk, R. Lehoucq Network Degree Distribution Inference Under Sampling

Motivation

Sampling introduces randomness in the sampled network.

Sampled network characteristics may not represent those of the true network well.

Sampling Mechanisms - Examples

[Figure: examples of sampling mechanisms, from Kolaczyk (2009)]

Setup

Problem introduced by Frank (1968)

$\mathbb{E}[N^*] = P N$

$N = (N_0, N_1, \ldots, N_M)$ is the degree counts vector of the true network,

$N^* = (N^*_0, N^*_1, \ldots, N^*_M)$ is the degree counts vector of the sampled network,

$P$ is a linear operator that depends fully on the sampling scheme and not on the network itself, and

$M$ is the maximum degree in the true network $G$.
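
As a concrete illustration of the operator P, the sketch below builds it for one specific design that the slide does not spell out: induced-subgraph sampling in which each vertex is kept independently with probability p (an assumption made here). Under that design a vertex of true degree j is observed with probability p and keeps each neighbor with probability p, so its observed degree is Binomial(j, p) and P[i][j] = p · C(j, i) · p^i · (1 − p)^(j − i). The names `induced_subgraph_operator` and `apply_operator` are hypothetical.

```python
from math import comb

def induced_subgraph_operator(max_degree, p):
    """Return P as a list of rows, P[i][j] = p * C(j, i) * p**i * (1 - p)**(j - i).

    Assumes Bernoulli(p) vertex sampling with the induced subgraph observed.
    Entries with i > j are zero: sampling can only lower a vertex's degree.
    """
    size = max_degree + 1
    return [
        [p * comb(j, i) * p**i * (1 - p) ** (j - i) if i <= j else 0.0
         for j in range(size)]
        for i in range(size)
    ]

def apply_operator(P, counts):
    """Compute the expected sampled degree counts P N."""
    return [sum(row[j] * counts[j] for j in range(len(counts))) for row in P]

# True degree counts of a path on 4 vertices: no isolated vertices,
# two vertices of degree 1 and two of degree 2.
N = [0, 2, 2]
P = induced_subgraph_operator(max_degree=2, p=0.5)
expected_sampled = apply_operator(P, N)
print(expected_sampled)  # [0.75, 1.0, 0.25]
```

The expected counts sum to 2 = p · 4, the expected number of sampled vertices. With p = 1 the same function returns the identity matrix.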

Naive Solution Issues

$\hat{N}_{\text{naive}} = P^{-1} N^*$

P is typically non-invertible.

Solutions may not be non-negative.

Zhang, Kolaczyk, Spencer (2015)
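
A small numerical illustration of the non-negativity issue, under assumptions that are not taken from the slide: P is the upper-triangular operator for Bernoulli(p) induced-subgraph sampling with p = 0.5 and M = 2 (a case in which P happens to be invertible), and the observed counts in `observed` are made up for the example. Back-substitution then yields a "solution" containing a negative count.

```python
from math import comb

def induced_subgraph_operator(max_degree, p):
    """P[i][j] = p * C(j, i) * p**i * (1 - p)**(j - i) (assumed sampling design)."""
    size = max_degree + 1
    return [
        [p * comb(j, i) * p**i * (1 - p) ** (j - i) if i <= j else 0.0
         for j in range(size)]
        for i in range(size)
    ]

def solve_upper_triangular(U, b):
    """Solve U x = b by back-substitution (U upper triangular, nonzero diagonal)."""
    n = len(b)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(U[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - tail) / U[i][i]
    return x

P = induced_subgraph_operator(max_degree=2, p=0.5)
observed = [3, 0, 1]  # hypothetical sampled degree counts N*
naive = solve_upper_triangular(P, observed)
print(naive)  # [8.0, -8.0, 8.0]: the estimated degree-1 count is negative
```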

Problem Formulation

$N^* = P N + E$

Ill-posed linear inverse problem.

P is not random, depends only on the sampling design.

E is the noise due to sampling.

$\mathbb{E}[E] = 0$

$\mathbb{E}[E E^T] = C$
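
The two moment conditions can be checked exactly on a toy case. The sketch below assumes Bernoulli(p) induced-subgraph sampling (not specified on the slide) on a path with 4 vertices, enumerates all 16 vertex subsets with their probabilities, and accumulates the mean of E = N* − PN and the covariance C. The graph, p = 0.5, and all function names are hypothetical choices for illustration.

```python
from itertools import product
from math import comb

def induced_subgraph_operator(max_degree, p):
    """P[i][j] = p * C(j, i) * p**i * (1 - p)**(j - i) (assumed sampling design)."""
    size = max_degree + 1
    return [
        [p * comb(j, i) * p**i * (1 - p) ** (j - i) if i <= j else 0.0
         for j in range(size)]
        for i in range(size)
    ]

def sampled_degree_counts(edges, keep, max_degree):
    """Degree counts N* of the subgraph induced by the vertices with keep[v] == 1."""
    degree = {v: 0 for v, kept in enumerate(keep) if kept}
    for u, v in edges:
        if keep[u] and keep[v]:
            degree[u] += 1
            degree[v] += 1
    counts = [0] * (max_degree + 1)
    for d in degree.values():
        counts[d] += 1
    return counts

edges = [(0, 1), (1, 2), (2, 3)]  # path on 4 vertices
n_vertices, max_degree, p = 4, 2, 0.5
N = [0, 2, 2]                     # true degree counts of the path
size = max_degree + 1

P = induced_subgraph_operator(max_degree, p)
PN = [sum(P[i][j] * N[j] for j in range(size)) for i in range(size)]

# Enumerate every vertex subset with its exact probability.
mean_noise = [0.0] * size
C = [[0.0] * size for _ in range(size)]
for keep in product([0, 1], repeat=n_vertices):
    prob = p ** sum(keep) * (1 - p) ** (n_vertices - sum(keep))
    n_star = sampled_degree_counts(edges, keep, max_degree)
    noise = [n_star[i] - PN[i] for i in range(size)]
    for i in range(size):
        mean_noise[i] += prob * noise[i]
        for j in range(size):
            C[i][j] += prob * noise[i] * noise[j]
```

After the loop, `mean_noise` holds the exact mean of E and `C` its exact covariance for this toy design.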

Proposed Approach

Complexity Functional

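$K(\tilde{N}, \cdot) = (P\tilde{N} - \cdot)^T C^{-1} (P\tilde{N} - \cdot) + \lambda \|D\tilde{N}\|_2^2$

where $\lambda$ is a regularization parameter,

$D$ is a second-order differencing operator,

$C = \mathrm{Cov}(N^*) = \mathbb{E}[E E^T]$.

Look for a constrained solution

$\tilde{N} \in \mathcal{C} := \{\tilde{N} : \tilde{N} \ge 0 \text{ and } 1^T \tilde{N} = n_v\}$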
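
A minimal sketch of how the functional and the constraint set could be evaluated, assuming for simplicity that C is diagonal (the slide does not say so) and that D has rows (1, −2, 1). All function names and the numbers in the example are hypothetical.

```python
def second_difference_operator(size):
    """(size - 2) x size matrix D with (D x)[k] = x[k] - 2 x[k+1] + x[k+2]."""
    D = [[0.0] * size for _ in range(size - 2)]
    for k in range(size - 2):
        D[k][k], D[k][k + 1], D[k][k + 2] = 1.0, -2.0, 1.0
    return D

def matvec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def complexity(P, variances, lam, candidate, target):
    """K(candidate, target) with C = diag(variances), a simplifying assumption."""
    residual = [r - t for r, t in zip(matvec(P, candidate), target)]
    fit = sum(r * r / v for r, v in zip(residual, variances))
    D = second_difference_operator(len(candidate))
    roughness = sum(d * d for d in matvec(D, candidate))
    return fit + lam * roughness

def in_constraint_set(candidate, n_vertices, tol=1e-9):
    """Check candidate >= 0 and 1^T candidate = n_v."""
    return min(candidate) >= 0 and abs(sum(candidate) - n_vertices) <= tol

# Toy example: P is the 4 x 4 identity (the p = 1 case), unit variances.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
value = complexity(identity, [1.0] * 4, 0.5, [1, 2, 4, 1], [1, 2, 3, 2])
print(value)  # 15.0 = fit 2 + 0.5 * roughness 26
```

A candidate that is linear in the degree index, such as (1, 2, 3, 4), has zero second differences, so the penalty only charges for curvature in the estimated degree counts.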

C-constrained Minimum Empirical Complexity Estimate

Constrained Penalized Weighted Least Squares

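$\min_{\tilde{N}} \; (P\tilde{N} - N^*)^T C^{-1} (P\tilde{N} - N^*) + \lambda \|D\tilde{N}\|_2^2 \quad \text{subject to } \tilde{N} \in \mathcal{C}$

$\mathcal{C}$-constrained minimum empirical complexity estimate:

$\hat{N} = \arg\min_{\tilde{N} \in \mathcal{C}} K(\tilde{N}, N^*)$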
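
One way to compute such an estimate numerically is projected gradient descent onto the constraint set. This is a swapped-in generic technique, not the solver used by the authors. The sketch assumes a diagonal C, the Bernoulli induced-subgraph operator for P, and made-up values for the observed counts, the variances, $n_v$ and $\lambda$. All names are hypothetical.

```python
from math import comb

def induced_subgraph_operator(max_degree, p):
    """P[i][j] = p * C(j, i) * p**i * (1 - p)**(j - i) (assumed sampling design)."""
    size = max_degree + 1
    return [[p * comb(j, i) * p**i * (1 - p) ** (j - i) if i <= j else 0.0
             for j in range(size)] for i in range(size)]

def second_difference_operator(size):
    """Rows (1, -2, 1): (D x)[k] = x[k] - 2 x[k+1] + x[k+2]. Needs size >= 3."""
    D = [[0.0] * size for _ in range(size - 2)]
    for k in range(size - 2):
        D[k][k], D[k][k + 1], D[k][k + 2] = 1.0, -2.0, 1.0
    return D

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def transpose(A):
    return [list(column) for column in zip(*A)]

def objective(P, D, variances, lam, x, observed):
    """K(x, observed) with C = diag(variances), a simplifying assumption."""
    fit = sum((r - y) ** 2 / v for r, y, v in zip(matvec(P, x), observed, variances))
    return fit + lam * sum(d * d for d in matvec(D, x))

def project_onto_scaled_simplex(v, total):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = total}, total > 0."""
    cumulative, theta = 0.0, 0.0
    for k, uk in enumerate(sorted(v, reverse=True), start=1):
        cumulative += uk
        candidate = (cumulative - total) / k
        if uk - candidate > 0:
            theta = candidate  # the last k passing this test gives the shift
    return [max(x - theta, 0.0) for x in v]

def solve_cpwls(P, D, variances, lam, observed, n_vertices, iterations=2000):
    """Projected gradient descent with fixed step 1 / L."""
    size = len(P[0])
    Pt, Dt = transpose(P), transpose(D)
    # L bounds the gradient's Lipschitz constant (Frobenius >= spectral norm).
    L = 2.0 * (sum(a * a for row in P for a in row) / min(variances)
               + lam * sum(d * d for row in D for d in row))
    x = [n_vertices / size] * size  # feasible starting point
    for _ in range(iterations):
        residual = [r - y for r, y in zip(matvec(P, x), observed)]
        grad_fit = matvec(Pt, [r / v for r, v in zip(residual, variances)])
        grad_pen = matvec(Dt, matvec(D, x))
        step = [xi - (2.0 * gf + 2.0 * lam * gp) / L
                for xi, gf, gp in zip(x, grad_fit, grad_pen)]
        x = project_onto_scaled_simplex(step, n_vertices)
    return x

P = induced_subgraph_operator(max_degree=2, p=0.5)
D = second_difference_operator(3)
observed = [3, 0, 1]         # hypothetical sampled degree counts N*
variances = [1.0, 1.0, 1.0]  # hypothetical diagonal of C
n_v, lam = 8, 0.1            # hypothetical vertex count and penalty weight
estimate = solve_cpwls(P, D, variances, lam, observed, n_v)
start = [n_v / 3] * 3        # the starting point used inside the solver
```

By construction every iterate is non-negative and sums to $n_v$, unlike the naive inverse applied to the same observed counts.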

Quality of the Solution

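We aim to upper-bound the risk:

$\mathbb{E}[\|P\hat{N} - PN\|^2_{C^{-1}}] = \mathbb{E}[(P\hat{N} - PN)^T C^{-1} (P\hat{N} - PN)]$

$\le \mathbb{E}[K(\hat{N}, PN)]$

$\le K(N^0, PN) + 2\,\mathbb{E}[\langle C^{-1/2}E,\; C^{-1/2}(P\hat{N} - PN^0) \rangle]$

where

$N^0 = \arg\min_{\tilde{N} \in \mathcal{C}} K(\tilde{N}, PN)$

is the $\mathcal{C}$-constrained minimizer of theoretical complexity.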
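
The slide does not derive the two inequalities. The following is a sketch of the standard argument, writing $\|v\|^2_{C^{-1}} = v^T C^{-1} v$ and using only $N^* = PN + E$ and the fact that $N^0$ lies in the constraint set, so it is a feasible competitor to the estimate.

```latex
\begin{align*}
\|P\hat N - PN\|^2_{C^{-1}}
  &\le \|P\hat N - PN\|^2_{C^{-1}} + \lambda \|D\hat N\|_2^2 = K(\hat N, PN), \\
K(\tilde N, N^*)
  &= \|P\tilde N - PN - E\|^2_{C^{-1}} + \lambda \|D\tilde N\|_2^2 \\
  &= K(\tilde N, PN)
     - 2\langle C^{-1/2}E,\; C^{-1/2}(P\tilde N - PN)\rangle
     + \|E\|^2_{C^{-1}}.
\intertext{Since $\hat N$ minimizes $K(\cdot, N^*)$ over $\mathcal{C}$ and
$N^0 \in \mathcal{C}$, we have $K(\hat N, N^*) \le K(N^0, N^*)$. Expanding both
sides and cancelling $\|E\|^2_{C^{-1}}$ gives}
K(\hat N, PN)
  &\le K(N^0, PN)
     + 2\langle C^{-1/2}E,\; C^{-1/2}(P\hat N - PN^0)\rangle.
\end{align*}
```

Taking expectations, and noting that $K(N^0, PN)$ is not random, yields the bound stated on the slide.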

First Term

$K(N^0, PN)$

This is the minimum theoretical complexity.

This term is not random.

Bounded in terms of a functional of the sampling design.

Different Regimes

Underlying all sampling mechanisms there is a fundamental quantity p controlling the rate of sampling.

The problem behaves differently depending on the values of p.

We identify three regimes:

p = 1: full information - trivial case, no noise, P is diagonal.

small p: the distribution of E is approximately Poisson.

moderate p: the distribution of E is approximately Normal.

Different Regimes - Small p

Small p ≈ 10% to 20%: the distribution of E is approximately Poisson.

[Figure: Q-Q plot of sample quantiles against Poisson theoretical quantiles]

Different Regimes - Moderate p

Moderate p ≈ 30% to 60%: the distribution of E is approximately Normal.

[Figure: Normal Q-Q plot of sample quantiles against theoretical quantiles]

Second Term

$\mathbb{E}[\langle C^{-1/2}E,\; C^{-1/2}(P\hat{N} - PN^0) \rangle]$

$\le \mathbb{E}\Big[\sup_{P\hat{N} - PN^0 \,\in\, \text{set}} \langle C^{-1/2}E,\; C^{-1/2}(P\hat{N} - PN^0) \rangle\Big]$

Under the moderate p regime, the distribution of E is reasonably close to Gaussian.

Assuming the entries of $C^{-1/2}E$ are independent standard Gaussian, we can bound this term using Gaussian widths.
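
For readers unfamiliar with the term, the Gaussian width of a set T is $w(T) = \mathbb{E}[\sup_{t \in T} \langle g, t \rangle]$ with g a standard Gaussian vector. The sketch below estimates it by Monte Carlo for a finite set. The set used here (the signed unit vectors in dimension 5) is purely illustrative and is not the set from the slide, which the slide leaves unspecified. The function name and parameters are hypothetical.

```python
import random

def gaussian_width(points, n_draws=2000, seed=0):
    """Monte Carlo estimate of w(T) = E[max_{t in T} <g, t>], g ~ N(0, I), T finite."""
    rng = random.Random(seed)
    dim = len(points[0])
    total = 0.0
    for _ in range(n_draws):
        g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        total += max(sum(gi * ti for gi, ti in zip(g, t)) for t in points)
    return total / n_draws

# Illustrative set T: the 2d signed unit vectors in dimension d = 5,
# for which max_t <g, t> equals max_i |g_i|.
d = 5
T = [[sign if k == i else 0.0 for k in range(d)]
     for i in range(d) for sign in (1.0, -1.0)]
width = gaussian_width(T)
```

For a finite set of n unit vectors the width is at most $\sqrt{2 \log n}$, about 2.15 for the n = 10 vectors here.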

Summary

Motivation:

Problem arises in the context of sampled networks.

Under many sampling designs the expectation of the sampled degree distribution is the product of a design-dependent matrix and the true underlying degree distribution.

Main Idea:

Unusual ill-conditioned linear inverse problem.

The empirical analysis of Zhang et al. (2015) of the constrained penalized weighted least squares solution is the first non-parametric approach to the problem since it was proposed ~35 years ago.

To our knowledge, our work is the first attempt to produce theoretical guarantees on the performance of the proposed solution.

Thank you!
