Some Recent Progress in Combinatorial Statistics Elchanan Mossel, UC Berkeley + Weizmann Institute http://www.stat.berkeley.edu/~mossel


TRANSCRIPT

Page 1: Some Recent Progress in Combinatorial Statistics

Some Recent Progress in Combinatorial Statistics

Elchanan Mossel,

UC Berkeley + Weizmann Institute

http://www.stat.berkeley.edu/~mossel

Page 2: Some Recent Progress in Combinatorial Statistics

What is Combinatorial Statistics?

• "Combinatorial Statistics" deals with inference of discrete parameters such as graphs, trees, or partitions,

• with the goal of providing algorithms with guarantees on sampling complexity, computational complexity, and success probability.

• Alternatively: sampling or complexity lower bounds.

Page 3: Some Recent Progress in Combinatorial Statistics

Example 1: Graphical Model reconstruction

Page 4: Some Recent Progress in Combinatorial Statistics

Markov random fields / Graphical Models

• A common model for stochastic networks:

• a bounded-degree graph G = (V, E);

• weight functions ψ_C : Σ^|C| → R≥0 for every clique C.

[Figure: example graph on nodes 1, 2, 3, 4 with edges (1, 2), (2, 3), (3, 4), (1, 3).]

Page 5: Some Recent Progress in Combinatorial Statistics

Markov Random Fields / Graphical Models

• A common model for stochastic networks:

• a bounded-degree graph G = (V, E);

• weight functions ψ_C : Σ^|C| → R≥0 for every clique C;

• nodes v are assigned values a_v in an alphabet Σ;

• the distribution over states Σ^V is given by

Pr[a] ~ ∏_C ψ_C(a_u : u ∈ C),

where the product goes over all cliques C.

[Figure: the same four-node graph, now with a clique potential ψ attached to each of the edges (1, 2), (2, 3), (3, 4), (1, 3).]
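The definition above can be made concrete with a small Python sketch. The graph is the four-node example from the figure (0-indexed), but the agreement-favoring potentials and binary alphabet are our own toy choices, not from the talk; the normalizing constant is computed by brute-force enumeration:

```python
from itertools import product

# Hypothetical pairwise MRF on the four-node example graph
# (0-indexed versions of the edges (1,2), (2,3), (3,4), (1,3)), alphabet {0, 1}.
CLIQUES = [(0, 1), (1, 2), (2, 3), (0, 2)]

def psi(values):
    """Toy clique potential (our own choice): favors agreement on an edge."""
    u, v = values
    return 2.0 if u == v else 1.0

def unnormalized(a):
    """Pr[a] ~ product over cliques C of psi_C(a_u : u in C)."""
    p = 1.0
    for c in CLIQUES:
        p *= psi(tuple(a[v] for v in c))
    return p

# Normalize by brute-force enumeration over all |Sigma|^|V| = 2^4 states.
states = list(product([0, 1], repeat=4))
Z = sum(unnormalized(a) for a in states)
prob = {a: unnormalized(a) / Z for a in states}
```

Enumeration is exponential in |V|, which is exactly why the reconstruction algorithms discussed next only ever look at local (neighborhood-sized) statistics.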

Page 6: Some Recent Progress in Combinatorial Statistics

Reconstruction task for Markov random fields

• Suppose we can obtain independent samples from the Markov random field

• Given observed data at the nodes, is it possible to reconstruct the model (network)?

• Important: Do we see data at all of the nodes or only a subset?

Page 7: Some Recent Progress in Combinatorial Statistics

Reconstruction problem

Problem: Given k independent samples of σ = (σ_1, …, σ_n) at all the nodes, find the graph G.

(Given activity at all the nodes, can the network be reconstructed?)

• Restrict attention to graphs with max degree d.

• A structure estimator is a map from the k samples to a graph on n nodes.

Questions:

1. How many samples k are required (asymptotically) in order to reconstruct MRFs with n nodes and max degree d with probability approaching 1?

2. Want an efficient algorithm for reconstruction.

Page 8: Some Recent Progress in Combinatorial Statistics

Related work

• Tree Markov fields can be reconstructed efficiently (even with hidden nodes): [Erdős, Steel, Székely, Warnow, '99], [Daskalakis, Mossel, Roch, '06].

• PAC setup: [Abbeel, Koller, Ng, '06] produce a factorized distribution that is ε-close in Kullback-Leibler divergence to the true distribution.

• No guarantee to reconstruct the correct graph; running time and sampling complexity is n^O(d).

• More restricted problem studied by [Wainwright, Ravikumar, Lafferty, '06]: restricted to the Ising model, sample complexity O(d^5 log n), difficult-to-verify convergence condition; technique based on L1 regularization. Moreover, works for graphs, not for graphical models (clique potentials not allowed).

• Subsequent to our results, [Santhanam, Wainwright, '08] determine the information-theoretic sampling complexity and [Wainwright, Ravikumar, Lafferty, '08] get O(d log n) sampling (restricted to Ising models; still no checkable guarantee for convergence).

Page 9: Some Recent Progress in Combinatorial Statistics

Related work

Method                | Abbeel et al.              | Wainwright et al.           | Bresler et al.
Generative model      | MRF, general               | Collection of edges, Ising  | MRF, general
Reconstruct           | Dist. of small KL distance | Graph                       | Graph
Additional conditions | No                         | Yes (very hard to check)    | No
Running time          | n^d                        | n^5                         | n^d
Sampling complexity   | poly(n)                    | d^5 log n (later: d log n)  | d log n

Page 10: Some Recent Progress in Combinatorial Statistics

Reconstructing General Networks - New Results

Theorem (Bresler-M-Sly-08; lower bound on sample complexity):
• In order to recover G of max degree d, one needs at least c d log n samples, for some constant c.
• Proof follows by "counting the number of networks"; information-theoretic lower bounds.
• More formally: given any prior distribution which is uniform over degree-d graphs (no restrictions on the potentials), in order to recover the correct graph with probability ≥ 2^-n one needs at least c d log n samples.

Theorem (Bresler-M-Sly-08; asymptotically optimal algorithm):
• If the distribution is "non-degenerate", c d log n samples suffice to reconstruct the model with probability ≥ 1 - 1/n^100, for some (other) constant c.
• Running time is n^O(d).

• (Sampling complexity is tight up to a constant factor; optimal running time: unknown.)
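The counting argument can be sketched in one line (our own back-of-the-envelope version, not the paper's proof): there are roughly n^Θ(dn) graphs of max degree d, while k samples over alphabet Σ reveal at most kn log|Σ| bits, so

```latex
k \, n \log|\Sigma| \;\ge\; \log\bigl(\#\text{graphs of max degree } d\bigr)
\;\ge\; c_1 \, d \, n \log n
\quad\Longrightarrow\quad
k \;\ge\; \frac{c_1}{\log|\Sigma|} \, d \log n .
```

which matches the c d log n lower bound up to the constant.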

Page 11: Some Recent Progress in Combinatorial Statistics

Intuition Behind Algorithms

• Observation: knowing the graph is the same as knowing the neighborhoods.
• But a neighborhood is determined by the Markov property.
• Same intuition behind the work of Abbeel et al.

"Algorithm":

Step 1. Compute empirical probabilities, which are concentrated around the true probabilities.

Step 2. For each node, simply test the Markov property of each candidate neighborhood.

Main challenge: show non-degeneracy ⇒ the algorithm works.
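The two steps above can be sketched in Python. The entropy-based test of the Markov property is our own simplification (the actual algorithm tests empirical conditional probabilities directly), and the chain-shaped demo data is hypothetical:

```python
import random
from collections import Counter
from itertools import combinations
from math import log

def cond_entropy(samples, v, S):
    """Empirical conditional entropy H(sigma_v | sigma_S), in nats."""
    joint, marg = Counter(), Counter()
    for s in samples:
        key = tuple(s[u] for u in S)
        joint[(key, s[v])] += 1
        marg[key] += 1
    n = len(samples)
    return -sum(c / n * log(c / marg[key]) for (key, _), c in joint.items())

def estimate_neighborhood(samples, v, nodes, d, tol=0.01):
    """Smallest candidate set S (|S| <= d) that explains sigma_v essentially
    as well as any candidate does: a stand-in for 'test the Markov property
    of each candidate neighborhood'."""
    others = [u for u in nodes if u != v]
    scores = {S: cond_entropy(samples, v, S)
              for k in range(d + 1) for S in combinations(others, k)}
    best = min(scores.values())
    for k in range(d + 1):
        for S in combinations(others, k):
            if scores[S] <= best + tol:
                return set(S)

# Demo on synthetic data from a 3-node chain 0 - 1 - 2 (hypothetical model).
rng = random.Random(0)
samples = []
for _ in range(5000):
    s0 = rng.randint(0, 1)
    s1 = s0 if rng.random() < 0.9 else 1 - s0   # edge (0, 1)
    s2 = s1 if rng.random() < 0.9 else 1 - s1   # edge (1, 2)
    samples.append((s0, s1, s2))
nbhd = estimate_neighborhood(samples, 2, [0, 1, 2], d=2)   # should be {1}
```

The exhaustive loop over candidate sets of size up to d is where the n^O(d) running time comes from.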

Page 12: Some Recent Progress in Combinatorial Statistics

Example 2: Reconstruction of MRF with Hidden nodes

[Figure: the four-node example graph with clique potentials on the edges (1, 2), (2, 3), (3, 4), (1, 3).]

• In many applications only some of the nodes can be observed:

• visible nodes W ⊆ V;

• the Markov random field over the visible nodes is σ_W = (σ_w : w ∈ W).

• Is reconstruction still possible?

• What does "reconstruction" even mean?

Page 13: Some Recent Progress in Combinatorial Statistics

Reconstruction versus distinguishing

• Easy to construct many models that lead to the same distribution (statistically unidentifiable).

• Assuming "this is not a problem", are there computational obstacles to reconstruction?

• In particular: how hard is it to distinguish statistically different models?

Page 14: Some Recent Progress in Combinatorial Statistics

Distinguishing problems

• Let M1, M2 be two models with hidden nodes.

• Problem 1: Can you tell if M1 and M2 are statistically close or far apart (on the visible nodes)?

• Problem 2: Assuming M1 and M2 are statistically far apart, and given access to samples from one of them, can you tell where the samples came from?

Page 15: Some Recent Progress in Combinatorial Statistics

Hardness result with hidden nodes

• In Bogdanov-M-Vadhan-08: Problems 1 and 2 are intractable (in the worst case) unless NP = RP.

• Conversely, if NP = RP then distinguishing (and other forms of reconstruction) are achievable.

Page 16: Some Recent Progress in Combinatorial Statistics

A possible objection

• The “hard” models M1, M2 describe distributions that are not efficiently samplable

• But if nature is efficient, we never need to worry about such distributions!

Page 17: Some Recent Progress in Combinatorial Statistics

Distinguishing problem for samplable distributions

• Problem 3: If M1 and M2 are statistically far apart, and given access to samples from one of them, can you tell where the samples came from, assuming M1 and M2 are efficiently samplable?

• Theorem: Problem 3 is intractable unless computational zero knowledge is trivial.

– We don't know if this is tight.

Page 18: Some Recent Progress in Combinatorial Statistics

Example 3: Consensus Ranking and the Mallows Model

Problem: Given a set of rankings {π_1, π_2, ..., π_N}, find the consensus ranking (or central ranking)

π* = argmin_π Σ_i d(π, π_i),

for d a distance on the set of permutations of n objects. Most natural is d_K, the Kendall distance.
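The objective above can be made concrete with a short Python sketch: the O(n^2) pairwise definition of the Kendall distance, plus a brute-force minimizer that is exponential in n and only meant to illustrate the consensus objective (the efficient algorithms come later):

```python
from itertools import combinations, permutations

def kendall_distance(p, q):
    """d_K(p, q): number of pairs of items that the two rankings order
    differently (O(n^2) pairwise version, fine for illustration)."""
    pos_p = {x: i for i, x in enumerate(p)}
    pos_q = {x: i for i, x in enumerate(q)}
    return sum(1 for a, b in combinations(list(p), 2)
               if (pos_p[a] - pos_p[b]) * (pos_q[a] - pos_q[b]) < 0)

def consensus(rankings, items):
    """Brute-force central ranking: argmin over pi of sum_i d_K(pi, pi_i)."""
    return min(permutations(items),
               key=lambda pi: sum(kendall_distance(pi, r) for r in rankings))
```

For example, `consensus([[1, 2, 3], [1, 2, 3], [2, 1, 3]], [1, 2, 3])` returns `(1, 2, 3)`, with total Kendall distance 1 to the three input rankings.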

Page 19: Some Recent Progress in Combinatorial Statistics

The Mallows Model

Exponential family model in β:

P(π | π_0) = Z(β)^-1 exp(-β d_K(π, π_0))

β = 0: uniform distribution over permutations.

β > 0: π_0 is the unique mode of P_{β, π_0}.

Theorem [Meila, Phadnis, Patterson, Bilmes '07]:

Consensus ranking (i.e. ML estimation of π_0 for constant β) can be solved exactly by a branch and bound (B&B) algorithm.

The B&B algorithm can take (super-)exponential time in some cases, but seems to perform well on simulated data.
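For small n the model can be written out exactly, which makes its two basic properties above easy to check; this sketch (our own illustration) computes Z(β) by brute-force enumeration rather than using the known product formula:

```python
from itertools import combinations, permutations
from math import exp

def d_K(p, q):
    """Kendall distance between two permutations (tuples of items)."""
    pos_q = {x: i for i, x in enumerate(q)}
    return sum(1 for i, j in combinations(range(len(p)), 2)
               if pos_q[p[i]] > pos_q[p[j]])

def mallows(pi0, beta):
    """The full Mallows distribution P(pi | pi0) = Z(beta)^-1 exp(-beta d_K),
    with Z(beta) computed by brute-force enumeration (fine for small n)."""
    perms = list(permutations(pi0))
    w = [exp(-beta * d_K(p, pi0)) for p in perms]
    Z = sum(w)
    return {p: wi / Z for p, wi in zip(perms, w)}
```

With `beta = 0` every permutation gets mass 1/n!, and for `beta > 0` the unique mode is `pi0`, matching the two bullet points above.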

Page 20: Some Recent Progress in Combinatorial Statistics

Related work

ML estimation: [Fligner & Verducci '86] introduce the generalized Mallows model; [Fligner & Verducci '88] (FV) give a heuristic for π_0 estimation.

Consensus ranking: [Cohen, Schapire, Singer '99] greedy algorithm (CSS), + improvement by finding strongly connected components, + missing data (not all π_i rank all n items).

[Ailon, Newman, Charikar '05] randomized algorithm with a guaranteed 11/7-factor approximation (ANC).

[Mathieu '07] (1+ε)-approximation, time O(n^6/ε + 2^(2^O(1/ε))).

[Davenport, Kalagnanam '03] heuristics based on edge-disjoint cycles; [Conitzer, D, K '05] exact algorithm based on integer programming, better bounds.

Page 21: Some Recent Progress in Combinatorial Statistics

Efficient Sorting of the Mallows Model

• [Braverman-Mossel-09]: Given r independent samples from the Mallows model, find the ML solution exactly(!) in time n^b, where

• b = 1 + O((β r)^-1), where r is the number of samples,

• with high probability (say ≥ 1 - n^-100).

Page 22: Some Recent Progress in Combinatorial Statistics

Sorting the Mallows Model

• [Braverman-M-09]: The optimal order can be found in polynomial time and O(n log n) queries.

• Proof ingredient 1: "statistical properties" of the generated permutations π_i in terms of the original order π_0. With high probability:

Σ_x |π_i(x) - π_0(x)| = O(n),

max_x |π_i(x) - π_0(x)| = O(log n).

• Additional ingredient: a dynamic programming algorithm to find the optimum given a starting point where each element is at most k away, with running time O(n 2^(6k)).
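The bounded-displacement structure can be exploited even by a much simpler routine than the paper's dynamic program: if every element sits at most k positions from its sorted place, a sliding min-heap of size k+1 sorts in O(n log k). This is a standard stand-in to illustrate the structure, not the O(n·2^(6k)) DP itself:

```python
import heapq

def sort_nearly_sorted(a, k):
    """Sort a list in which every element is at most k positions away from
    its sorted position, using a sliding (k+1)-element min-heap."""
    heap = list(a[:k + 1])
    heapq.heapify(heap)
    out = []
    for x in a[k + 1:]:
        # The global minimum of the unsorted remainder must lie in the
        # current window, so the heap's minimum is the next output element.
        out.append(heapq.heappop(heap))
        heapq.heappush(heap, x)
    while heap:
        out.append(heapq.heappop(heap))
    return out
```

For example, `sort_nearly_sorted([2, 1, 4, 3, 6, 5], 1)` returns `[1, 2, 3, 4, 5, 6]`. With the Mallows model, k = O(log n) displacement with high probability is what keeps such window-based procedures polynomial.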

Page 23: Some Recent Progress in Combinatorial Statistics

Example 4: MCMC Diagnostics

• Sampling from a joint probability distribution π on Ω.
• Construct a Markov chain X_0, X_1, … which converges to π.

Applications:
• Multidimensional integration.
• Bayesian inference (sampling from "posterior distributions" of the model).
• Computational physics.
• Computational biology.
• Vision, image segmentation.
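A minimal textbook example of such a chain (our own generic sketch, not a specific sampler from the talk) is the Metropolis chain on {0,1}^n: propose a single-bit flip and accept it with probability min(1, π(y)/π(x)), which needs π only up to normalization:

```python
import random

def metropolis(pi, n_bits, steps, seed=0):
    """Generic single-bit-flip Metropolis chain on {0,1}^n targeting pi,
    where pi is an unnormalized weight function. Returns the final state."""
    rng = random.Random(seed)
    x = tuple(rng.randint(0, 1) for _ in range(n_bits))
    for _ in range(steps):
        i = rng.randrange(n_bits)
        y = x[:i] + (1 - x[i],) + x[i + 1:]     # flip one coordinate
        if rng.random() < min(1.0, pi(y) / pi(x)):
            x = y
    return x
```

The catch, and the subject of the next slides, is choosing `steps`: nothing in the code tells you when the chain's distribution is actually close to π.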

Page 24: Some Recent Progress in Combinatorial Statistics

Convergence Diagnostics

• A method to determine t such that P^t(X_0, ·) is "sufficiently close" to π. - MCMC in Practice, Gilks, Richardson & Spiegelhalter

Are we there yet?
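For a chain small enough to write down explicitly, the quantity the diagnostic asks about can be computed exactly. This brute-force sketch (our own illustration, infeasible for the exponential state spaces the hardness results concern) finds the smallest t with d_TV(P^t(X_0, ·), π) < 1/4:

```python
def tv_distance(p, q):
    """Total variation distance between two distributions given as lists."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def mixing_time(P, x0, eps=0.25):
    """Smallest t with d_TV(P^t(x0, .), pi) < eps, for an explicit transition
    matrix P (list of rows). pi is found by power iteration, so the chain is
    assumed ergodic; brute force, for toy chains only."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(10000):   # iterate to (numerical) stationarity
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    dist = [1.0 if j == x0 else 0.0 for j in range(n)]
    t = 0
    while tv_distance(dist, pi) >= eps:
        dist = [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]
        t += 1
    return t
```

For the lazy two-state walk `[[0.9, 0.1], [0.1, 0.9]]` started at state 0, the distance is 0.5·0.8^t, so `mixing_time` returns 4. The hardness results below say that no efficient procedure can do this, even approximately, when the state space is {0,1}^n.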

Page 25: Some Recent Progress in Combinatorial Statistics

Convergence Diagnostics in the Literature

Visual inspection of functions of samples. - MCMC in Practice, Gilks, Richardson & Spiegelhalter

[Raftery-Lewis] Bounding the variance of estimates of quantiles of functions of parameters.

[Gelman-Rubin] Convergence achieved if the separate empirical distributions are approximately the same as the combined distribution.

[Cowles-Carlin]
• Survey of 13 diagnostics designed to identify specific types of convergence failure.
• Recommend combinations of strategies: running parallel chains, computing autocorrelations.

Page 26: Some Recent Progress in Combinatorial Statistics

Diagnostic Algorithm

Diagnostic algorithm A:

Input: a Markov chain P on Ω ⊆ {0,1}^n and a time t such that either
• τ(1/4) < t, or
• τ(1/4) > 100t.

Output:
• "Yes" if τ(1/4) < t;
• "No" if τ(1/4) > 100t.

Page 27: Some Recent Progress in Combinatorial Statistics

Results - The General Case

Theorem [Bhatnagar-Bogdanov-M-09]: Given

1. a Markov chain P on Ω ⊆ {0,1}^n,
2. a time t such that either τ(1/4) < t or τ(1/4) > 100t,

it is a PSPACE-complete problem to distinguish the two.

Page 28: Some Recent Progress in Combinatorial Statistics

PSPACE complete problems

[Crasmaru-Tromp] Ladder problems in Go: does black have a winning strategy?

PSPACE-hard problems are "much harder" than NP-hard problems.

Page 29: Some Recent Progress in Combinatorial Statistics

Results - Polynomial Mixing

Theorem 2 [Bhatnagar-Bogdanov-M-09]: Given

1. a Markov chain P on Ω ⊆ {0,1}^n with a promise that τ(1/4) < poly(n),
2. a time t such that either τ(1/4) < t or τ(1/4) > 100t,

it is as hard as any coNP problem to distinguish the two.

[Figure: two 3-CNF formulas, e.g. (x0+x3+x7)∧(x2+x7+xn)∧…∧(…), one mapped to a chain with τ(1/4) < t and one to a chain with τ(1/4) > 100t.]

Page 30: Some Recent Progress in Combinatorial Statistics

Results - Mixing from a Given State

Theorem 3 [Bhatnagar-Bogdanov-M-09]: Given

1. a Markov chain P on Ω ⊆ {0,1}^n such that τ(1/4) < poly(n),
2. a starting state X_0,
3. a time t such that either τ_X0(1/4) < t or τ_X0(1/4) > 100t,

a diagnostic algorithm can be used to distinguish whether two efficiently samplable distributions are close or far in statistical distance - a "Statistical Zero Knowledge"-complete problem [Sahai-Vadhan].

[Figure: two circuits C1 and C2 mapping random bits r_1 … r_n and r'_1 … r'_n to outputs a_1 … a_m.]

Page 31: Some Recent Progress in Combinatorial Statistics

Comb. Stat. vs. Common ML Approaches

• Some features of Combinatorial Statistics in comparison to ML:

• +: Uses explicit, well-defined models.

• +: Explicit non-asymptotic running time / sampling bounds.

• -: Often not "statistically efficient".

• -: Inputs must be generated from the correct distribution or close to it (but often the problem is NP-hard otherwise, so this is true of all methods).

• +/-: Need to understand the combinatorial structure of each new problem, vs. applying general statistical techniques where the structure is hidden in programming/design, etc.

Page 32: Some Recent Progress in Combinatorial Statistics

Conclusions and Questions

• [Cowles-Carlin] Survey of 13 diagnostics designed to identify specific types of convergence failure. Each can fail to detect the type of convergence it was designed for.

• [Arora-Barak] Efficient algorithms are not believed to exist for PSPACE-complete, coNP-complete or SZK-complete problems.

• Diagnostic algorithms do not exist for large classes of MCMC algorithms, like Gibbs samplers, unless there are efficient algorithms for PSPACE, coNP, or SZK.

• For what classes of samplers are convergence diagnostics possible?

• Hardness of testing convergence for more restricted classes of samplers.

Page 33: Some Recent Progress in Combinatorial Statistics

Collaborators

Salil Vadhan, Harvard

Andrej Bogdanov, Hong Kong

Guy Bresler, Berkeley

Allan Sly, Berkeley

Nayantara Bhatnagar, Berkeley

Mark Braverman, Microsoft

Page 34: Some Recent Progress in Combinatorial Statistics

Thanks!!