Hypothesis Testing for Structured Probability Distributions
Ilias Diakonikolas (USC)
Joint work with Daniel Kane (UCSD)
Vladimir Nikishkin (Edinburgh)
What this talk is about
Basic object of study: probability distributions over an ordered domain: [n] = {1, . . . , n} or I = [a, b] ⊆ R.
Notation: p, q denote either a pmf or a pdf.
Menu
Explaining the title:
• Let D be a family of probability distributions.
• Identity Testing Problem: given samples from an unknown p ∈ D and a known/unknown q ∈ D,
− Distinguish between the cases p = q and dTV(p, q) > ε.
− Minimize sample size and computation time.
Total Variation Distance: dTV(p, q) = (1/2)‖p − q‖_1.
This Talk
Unified Framework for Identity Testing: leads to sample-optimal and computationally efficient estimators for a variety of structured distribution families (with matching information-theoretic lower bounds).
Outline
§ Introduction, Related and Prior Work § Framework Overview § Testing Identity to a Fixed Distribution § Testing Closeness between two Unknown Distributions § Future Directions and Concluding Remarks
Distribution Testing (Hypothesis Testing)
Given samples (observations) from one (or more) unknown probability distribution(s) (model), decide whether it satisfies a certain property.
• Introduced by Karl Pearson (’99).
• Classical Problem in Statistics [Neyman-Pearson’33, Lehmann-Romano’05]
• Last twenty years (TCS): property testing [Goldreich-Ron’00, Batu et al. FOCS’00/JACM’13]
Related Work – Property Testing (I)
Focus has been on arbitrary distributions over support of size n.
Testing Identity to an explicitly known Distribution:
• [Goldreich-Ron’00]: O(√n/ε^4) upper bound for uniformity testing (collision statistics).
• [Batu et al., FOCS’01]: O(√n) · poly(1/ε) upper bound for testing identity to any known distribution.
• [Paninski ’03]: upper bound of O(√n/ε^2) for uniformity testing, assuming ε = Ω(n^{−1/4}). Lower bound of Ω(√n/ε^2).
• [Valiant-Valiant, FOCS’14; D-Kane-Nikishkin, SODA’15]: upper bound of O(√n/ε^2) for identity testing to any known distribution.
Related Work – Property Testing (II)
Focus has been on arbitrary distributions over support of size n.
Testing Closeness between two unknown distributions:
• [Batu et al., FOCS’00]: O(n^{2/3} log n / ε^{8/3}) upper bound for testing closeness between two unknown discrete distributions.
• [P. Valiant, STOC’08]: lower bound of Ω(n^{2/3}) for constant error.
• [Chan-D-Valiant-Valiant, SODA’14]: tight upper and lower bound of Θ(max{n^{2/3}/ε^{4/3}, n^{1/2}/ε^2}).
Summary of Related Work
(support size: n, total variation distance error: ε)

Problem           | Tight Bound                           | Reference
Testing Identity  | Θ(n^{1/2}/ε^2)                        | [Valiant-Valiant’14, D-Kane-Nikishkin’15]
Testing Closeness | Θ(max{n^{2/3}/ε^{4/3}, n^{1/2}/ε^2})  | [Chan-D-Valiant-Valiant’14]
Learning          | Θ(n/ε^2)                              | [folklore]
Estimating Structured Distributions
• Statistical Estimation well-understood for arbitrary discrete distributions.
• How about for structured distributions?
• Long line of work in statistics since the 1950’s [Grenander’56, Rao’69, Wegman’70, Birgé’87, …]. Focus has been on density estimation (learning).
• [Batu-Kumar-Rubinfeld, STOC’04]: identity testing of monotone distributions.
• [Daskalakis-D-Servedio-Valiant-Valiant, SODA’13]: generalization to
k-modal distributions.
Types of Structured Distributions
• Distributions with “shape restrictions”: monotone, bimodal, log-concave, …
• Simple combinations of simple distributions:
− Mixtures of simple distributions (e.g., mixtures of Gaussians)
− Sums of simple distributions (e.g., Poisson Binomial Distributions)
Outline
§ Introduction, Related and Prior Work § Framework Overview § Testing Identity to a Fixed Distribution § Testing Closeness between two Unknown Distributions § Future Directions and Concluding Remarks
First Step: Changing the metric
Identity Testing Problem for family D. Given (sample) access to p, q ∈ D:
• Output “YES” (with high probability) if p = q (completeness)
• Output “NO” (with high probability) if ‖p − q‖_1 ≥ ε (soundness)
Reduces to the Identity Testing Problem under the A_k-distance. Given (sample) access to p, q:
• Output “YES” (with high probability) if p = q
• Output “NO” (with high probability) if ‖p − q‖_{A_k} ≥ ε
A_k-Distance between Distributions (I)
Definition. For p, q : R → [0, 1] and k ≥ 2, we define the A_k-distance between p, q as
‖p − q‖_{A_k} = sup_{I = (I_i)_{i=1}^k} Σ_{i=1}^k |p(I_i) − q(I_i)|,
where the supremum is over collections of k disjoint intervals I_1, . . . , I_k.
Facts:
• For k = 2, (essentially) equivalent to the Kolmogorov distance.
• For any k ≥ 2, we have ‖p − q‖_{A_k} ≤ ‖p − q‖_1.
• We have: lim_{k→∞} ‖p − q‖_{A_k} = ‖p − q‖_1.
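For discrete distributions given as explicit probability vectors, the A_k-distance above can be computed exactly by dynamic programming over domain positions: since |x| = max(x, −x), each interval can be counted with a chosen sign, and the maximization picks the right one. A minimal sketch (the function name and DP layout are ours, not from the talk):

```python
def ak_distance(p, q, k):
    # A_k distance between discrete distributions p, q on {0, ..., n-1}:
    # max over <= k disjoint intervals of sum_I |p(I) - q(I)|.
    n = len(p)
    d = [p[i] - q[i] for i in range(n)]
    NEG = float("-inf")
    # best[j][s]: best value on a prefix, having used j intervals;
    # s = 0: outside any interval, s = 1: inside a "+"-signed interval,
    # s = 2: inside a "-"-signed interval.
    best = [[NEG] * 3 for _ in range(k + 1)]
    best[0][0] = 0.0
    for x in d:
        new = [[NEG] * 3 for _ in range(k + 1)]
        for j in range(k + 1):
            for s in range(3):
                v = best[j][s]
                if v == NEG:
                    continue
                # close the current interval (or stay outside)
                new[j][0] = max(new[j][0], v)
                # extend the current interval with the same sign
                if s == 1:
                    new[j][1] = max(new[j][1], v + x)
                if s == 2:
                    new[j][2] = max(new[j][2], v - x)
                # open a new interval at this position
                if j < k:
                    new[j + 1][1] = max(new[j + 1][1], v + x)
                    new[j + 1][2] = max(new[j + 1][2], v - x)
        best = new
    return max(best[j][s] for j in range(k + 1) for s in range(3))
```

For k equal to the support size this recovers the L1 distance (each point its own interval), matching the limit fact above.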
A_k-Distance between Distributions (II)
Definition (restated). For p, q : R → [0, 1] and k ≥ 1,
‖p − q‖_{A_k} = sup_{I = (I_i)_{i=1}^k} Σ_{i=1}^k |p(I_i) − q(I_i)|.
Upper Bound on Sample Complexity: For a family D of one-dimensional distributions and ε > 0, let k = k(D, ε) be the smallest integer such that for any p, q ∈ D it holds
‖p − q‖_1 ≤ ‖p − q‖_{A_k} + ε/2.
Then, the parameter k is the “right” complexity measure for estimating a property of the family D.
Overview of Framework
(Error parameter: ε > 0)
• Approximation (Existential Step): find the smallest k = k(D, ε) such that for all p, q ∈ D, ‖p − q‖_1 ≤ ‖p − q‖_{A_k} + ε/2.
• Algorithmic Step: run an identity tester under the A_k-distance with error parameter ε′ = ε/2, and output its YES/NO answer.
Together, the two steps yield an L1-identity tester for D.
Second Step: Design an A_k-Distance Tester
Identity Testing Problem under the A_k-distance. Given (sample) access to p, q:
• Output “YES” (with high probability) if p = q
• Output “NO” (with high probability) if ‖p − q‖_{A_k} ≥ ε
Two fundamentally different regimes:
• One of p, q known explicitly [Testing Identity to a Fixed Distribution].
• Both p, q unknown [Testing Closeness].
A_k-distance vs L1 distance
(tight bounds)

Problem           | Support [n], L1 distance              | A_k-distance
Testing Identity  | Θ(n^{1/2}/ε^2)                        | Θ(k^{1/2}/ε^2)
Testing Closeness | Θ(max{n^{2/3}/ε^{4/3}, n^{1/2}/ε^2})  | ?
Learning          | Θ(n/ε^2)                              | Θ(k/ε^2) [VC]
Outline
§ Introduction, Related and Prior Work § Framework Overview § Testing Identity to a Fixed Distribution § Testing Closeness between two Unknown Distributions § Future Directions and Concluding Remarks
A_k-Testing Identity to a Fixed Distribution
Theorem [D-Kane-Nikishkin’15]. For any ε > 0, k ≥ 2, and any explicit q, there exists a computationally efficient algorithm that distinguishes between the case p = q versus ‖p − q‖_{A_k} ≥ ε with constant error probability using O(k^{1/2}/ε^2) samples from p. Moreover, this sample size is information-theoretically necessary for this task.
Remark:
• The upper bound holds both for discrete and continuous distributions.
Applications: L1-Identity Testing for Structured Distributions

Distribution Family    | Parameter k          | Sample Size
t-flat                 | k = O(t)             | O(t^{1/2}/ε^2)
t-piecewise degree-d   | k = O(t(d+1))        | O((t(d+1))^{1/2}/ε^2)
Log-concave            | k = O(ε^{−1/2})      | O(ε^{−9/4})
Log-concave t-mixture  | k = O(t ε^{−1/2})    | O(t^{1/2}/ε^{9/4})
t-modal over [n]       | k = O(t log(n)/ε)    | O((t log n)^{1/2}/ε^{5/2})
MHR over [n]           | k = O(log(n)/ε)      | O((log n)^{1/2}/ε^{5/2})
A_k-Identity Testing: Basic Facts
Lemma: Identity testing reduces to uniformity testing.
Proof Idea: Appropriately “stretch” the domain size. Henceforth, focus on uniformity testing.
Observation: If we knew the partition J_1, . . . , J_k maximizing the discrepancy, i.e.,
‖p − U‖_{A_k} = Σ_{j=1}^k |p(J_j) − U(J_j)|,
we could reduce to L1-identity testing over a domain of size k.
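The “stretching” reduction can be sketched as follows for a discrete known q whose masses are (approximately) rational; the helper names and the rounding precision are our own illustration, not the talk’s exact construction:

```python
import random

def make_stretcher(q, precision=1000):
    # Split element i into a block of roughly q_i * precision sub-elements,
    # so that the known distribution q maps to the (near-)uniform
    # distribution over the stretched domain of size n_new.
    copies = [max(1, round(qi * precision)) for qi in q]
    offsets = [0]
    for c in copies:
        offsets.append(offsets[-1] + c)

    def stretch(i):
        # A sample i from p is sent to a uniformly random sub-element of
        # block i; if p = q, the stretched sample is (near-)uniform.
        return offsets[i] + random.randrange(copies[i])

    return stretch, offsets[-1]
```

Testing identity of p to q then amounts to testing uniformity of the stretched samples.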
A_k-Uniformity Testing: First Approach
• Partition the domain into ℓ = 10k/ε intervals I_1, . . . , I_ℓ of equal length.
• Apply an L1-uniformity tester to the reduced distribution over these intervals.
Claim: ‖p − U‖_{A_k} − ε/2 ≤ Σ_{i=1}^ℓ |p(I_i) − U(I_i)| ≤ ‖p − U‖_{A_k}.
Sample Complexity: O(ℓ^{1/2}/ε^2) = O(k^{1/2}/ε^{5/2}).
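The first approach can be sketched as follows. For clarity, the L1 tester below is a simple plug-in comparison of the empirical reduced distribution against uniform, which needs O(ℓ/ε^2) samples; the O(ℓ^{1/2}/ε^2) rate quoted in the talk requires a collision/chi-square-style tester instead. Function names and the acceptance threshold are illustrative:

```python
import math

def reduced_counts(samples, ell):
    # "Flatten" samples from [0, 1) into ell equal-length intervals.
    counts = [0] * ell
    for x in samples:
        counts[min(int(x * ell), ell - 1)] += 1
    return counts

def uniformity_test_first_approach(samples, k, eps):
    # Partition into ell = ceil(10 * k / eps) intervals, then run a plug-in
    # L1 comparison of the reduced empirical distribution against uniform.
    ell = math.ceil(10 * k / eps)
    m = len(samples)
    counts = reduced_counts(samples, ell)
    l1 = sum(abs(c / m - 1.0 / ell) for c in counts)
    return "YES" if l1 <= eps / 2 else "NO"
```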
A_k-Uniformity Testing: Optimal Algorithm
• Construct several oblivious decompositions of the domain.
• Use an L2-uniformity tester over the reduced distributions.
In more detail:
• Consider M = log(1/ε) equal-length interval partitions I^{(j)} of the domain; partition I^{(j)} consists of ℓ_j = k · 2^j intervals.
• For each j, apply an L2-uniformity tester with L2-error ε_j = ε · 2^{3j/8}/ℓ_j^{1/2}.
• Accept if and only if all testers accept.
Structural Lemma: One of the partitions will detect the discrepancy.
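The multi-scale parameters can be written out directly. The sketch below only computes the partition sizes and per-level L2-error targets from the slide (we take j = 1, …, M with M rounded up; the L2 testers themselves are omitted):

```python
import math

def partition_schedule(k, eps):
    # M = log(1/eps) oblivious equal-length partitions; partition j has
    # ell_j = k * 2^j intervals and L2-error target
    # eps_j = eps * 2^(3j/8) / sqrt(ell_j).
    M = max(1, math.ceil(math.log2(1.0 / eps)))
    schedule = []
    for j in range(1, M + 1):
        ell_j = k * 2 ** j
        eps_j = eps * 2 ** (3 * j / 8) / math.sqrt(ell_j)
        schedule.append((j, ell_j, eps_j))
    return schedule
```

Coarser partitions catch discrepancy spread over few intervals; finer ones (with looser per-level error) catch discrepancy hidden at small scales.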
Outline
§ Introduction, Related and Prior Work § Framework Overview § Testing Identity to a Fixed Distribution § Testing Closeness between two Unknown Distributions § Future Directions and Concluding Remarks
A_k-distance vs L1 distance
(tight bounds)

Problem           | Support [n], L1 distance              | A_k-distance
Testing Identity  | Θ(n^{1/2}/ε^2)                        | Θ(k^{1/2}/ε^2)
Testing Closeness | Θ(max{n^{2/3}/ε^{4/3}, n^{1/2}/ε^2})  | Θ(max{k^{4/5}/ε^{6/5}, k^{1/2}/ε^2})
Learning          | Θ(n/ε^2)                              | Θ(k/ε^2) [VC]
A_k-Equivalence Testing
Theorem. For any ε > 0 and k ≥ 2, and any distributions p, q, there exists a computationally efficient algorithm that distinguishes between the case p = q versus ‖p − q‖_{A_k} ≥ ε with constant error probability using O(max{k^{4/5}/ε^{6/5}, k^{1/2}/ε^2}) samples. Moreover, this sample size is information-theoretically necessary for this task.
Remarks:
• The upper bound holds both for discrete and continuous distributions.
• The lower bound applies to continuous distributions, or discrete distributions over a domain of size N ≥ 2^{poly(k)}.
Ak
A_k-Closeness Testing: Basic Facts
• No oblivious decomposition can work: the discrepancy may be hidden in intervals even though the reduced distributions are the same.
• Can partition the domain into “light” intervals, and apply a standard closeness tester on the reduced distributions over these intervals.
• This inherently leads to Ω(k)-sample algorithms: we need an adaptive partition in which at least one distribution has small mass.
• How do we obtain o(k) sample size?
A_k-Closeness Testing Algorithm
Consider the following “order-based” algorithm:
• Let m = O(k^{4/5}/ε^{6/5}). Draw m_1 = Poi(m) samples S_p from p, and m_2 = Poi(m) samples S_q from q.
• Let S be the union of S_p and S_q sorted in increasing order.
• Let Z = #(pairs of consecutive elements of S from the same distribution) − #(pairs of consecutive elements of S from different distributions).
• If Z > 3√m, return “NO”; otherwise, return “YES.”
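The order-based statistic is easy to implement. The sketch below assumes continuous samples (no ties) and, for simplicity, uses fixed sample sizes rather than the Poissonized Poi(m) sizes of the algorithm:

```python
import math

def order_statistic_Z(sample_p, sample_q):
    # Merge the two labeled samples, sort by value, and count consecutive
    # pairs with equal labels minus pairs with different labels.
    labeled = sorted([(x, 0) for x in sample_p] + [(x, 1) for x in sample_q])
    z = 0
    for (_, a), (_, b) in zip(labeled, labeled[1:]):
        z += 1 if a == b else -1
    return z

def closeness_test(sample_p, sample_q):
    # Reject (output "NO") when Z exceeds the 3 * sqrt(m) threshold.
    m = (len(sample_p) + len(sample_q)) / 2.0
    z = order_statistic_Z(sample_p, sample_q)
    return "NO" if z > 3 * math.sqrt(m) else "YES"
```

Intuitively, when p = q the labels of the sorted merged sample look like fair coin flips and Z concentrates near 0; when p and q differ, samples from the same distribution cluster together and Z grows.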
Closeness Testing: Sketch of Analysis
• Bound the mean E[Z] and variance Var[Z], then apply concentration.
• Completeness: E[Z] = 0 and Var[Z] = 2m − 1.
• Soundness: the main technical step is bounding E[Z] from below.
− Easy to argue: Var[Z] = O(m).
− Highly non-trivial: E[Z] = Ω(m^3 ε^3 / k^2).
Outline
§ Introduction, Related and Prior Work § Framework Overview § Testing Identity to a Fixed Distribution § Testing Closeness between two Unknown Distributions § Future Directions and Concluding Remarks
Future Directions
Unified Technique for Identity Testing: Use the A_k-distance as a proxy.
Concrete Open Problems:
• Understanding the regime log n ≤ k ≤ n [DKN’16].
• Testing Other Properties of Structured Distributions: Independence, Entropy, etc.
A Few Open-ended Challenges:
• Other Criteria: Privacy, Communication
• High-Dimensional Structured Distributions
• Tradeoffs between sample size and computational efficiency?

Thank you for your attention!
Sketch of Lower Bound (I)
• Suppose the algorithm only considers the ordering of the samples.
• Consider the following instance: [Figure: the domain is divided into k buckets, each subdivided into 2k mini-buckets; within each bucket, either p = q or p and q differ.]
• If fewer than 3 samples land in a mini-bucket, they carry no useful information.
Sketch of Lower Bound (II)
• If fewer than 3 samples land in a mini-bucket, an order-based tester gets no useful information from them.
• Expected number of mini-buckets with ≥ 3 samples: ≈ k · (m/k)^3 = m^3/k^2.
• Need this quantity to be Ω(√m); for constant ε, solving m^3/k^2 ≥ √m gives m = Ω(k^{4/5}).
How about for general testers?
• Can embed the above instance into a larger domain, so that order-based testers suffice.
• Non-constructive argument (Ramsey’s theorem).