

Nonparametric Density Estimation under IPM Losses with Statistical Convergence Rates for Generative Adversarial Networks (GANs)

Ananya Uppal, Shashank Singh, Barnabás Póczos (Link to paper)

Introduction

• Nonparametric distribution estimation: Given n IID samples X_{1:n} = X_1, ..., X_n ∼ P from an unknown distribution P, we want to estimate P.
  – Important sub-routine of many statistical methods
  – Usually analyzed in terms of L_2 loss
    * Severe curse of dimensionality
• We provide unified minimax-optimal estimation rates under a large family of losses called Integral Probability Metrics (IPMs), for many function classes (Sobolev, Besov, RKHS).
  – Includes most common metrics on probability distributions
  – Implicitly used in Generative Adversarial Networks (GANs)
    * Allows us to derive statistical guarantees for GANs
  – Reduced curse of dimensionality

Integral Probability Metrics (IPMs)

Definition 1 (IPM). Let P be a class of probability distributions on a sample space X, and F a class of (bounded) functions on X. Then the metric ρ_F : P × P → [0, ∞] on P is defined by

$$\rho_{\mathcal F}(P, Q) := \sup_{f \in \mathcal F} \left| \, \mathbb{E}_{X \sim P}[f(X)] - \mathbb{E}_{X \sim Q}[f(X)] \, \right|.$$
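As a concrete illustration of Definition 1 (a sketch added here, not part of the original poster), the Python snippet below estimates ρ_F between two samples by replacing the expectations with sample means, using a small hand-picked class F of bounded test functions; the function class, the two distributions, and the sample sizes are arbitrary illustrative choices.

```python
import numpy as np

def empirical_ipm(x, y, funcs):
    """Plug-in estimate of the IPM rho_F(P, Q) from Definition 1:
    sup over f in F of |E_P[f(X)] - E_Q[f(X)]|, with expectations
    replaced by sample means over x ~ P and y ~ Q."""
    return max(abs(f(x).mean() - f(y).mean()) for f in funcs)

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=2000)   # samples from P
y = rng.laplace(0.0, 1.0, size=2000)  # samples from Q

# A small, bounded discriminator class F (illustrative choice only).
F = [np.tanh, np.sin, np.cos, lambda t: np.tanh(2 * t - 1)]
print(empirical_ipm(x, y, F))
```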

Definition 2 (Besov Ball). Let β_{j,k} denote the coefficients of a function f in a wavelet basis indexed by j ∈ ℕ, k ∈ [2^j]. For parameters σ ≥ 0 and p, q ∈ [1, ∞], f ∈ L^2 lies in the Besov ball B^σ_{p,q} iff

$$\|f\|_{B^{\sigma}_{p,q}} := \left\| \left\{ 2^{j(\sigma + D(1/2 - 1/p))} \left\| \{\beta_{\lambda}\}_{\lambda \in \Lambda_j} \right\|_{\ell_p} \right\}_{j \in \mathbb{N}} \right\|_{\ell_q} \le 1.$$

The parameter q affects convergence rates only by logarithmic factors, so we omit it in the sequel.
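To make the sequence-space norm in Definition 2 concrete, here is a minimal sketch (added here, not from the poster) that computes a discretized Besov norm for a 1-D signal (D = 1) from its orthogonal wavelet coefficients using PyWavelets; the wavelet family, the coarse-to-fine identification of decomposition levels with the dyadic index j, and the absence of boundary/normalization corrections are simplifying assumptions.

```python
import numpy as np
import pywt

def besov_norm(signal, sigma, p, q, wavelet="db4", D=1):
    """Discretized version of the Besov sequence norm in Definition 2:
    || { 2^{j(sigma + D(1/2 - 1/p))} ||beta_{j,.}||_{l_p} }_j ||_{l_q}.
    Detail levels are indexed coarse-to-fine as j = 1, 2, ... (an assumption)."""
    coeffs = pywt.wavedec(signal, wavelet)   # [cA_J, cD_J (coarsest), ..., cD_1 (finest)]
    level_terms = []
    for j, beta_j in enumerate(coeffs[1:], start=1):
        lp = np.linalg.norm(beta_j, ord=p)
        level_terms.append(2.0 ** (j * (sigma + D * (0.5 - 1.0 / p))) * lp)
    return np.linalg.norm(np.array(level_terms), ord=q)

x = np.linspace(0.0, 1.0, 1024)
f = np.exp(-40.0 * (x - 0.5) ** 2)           # a smooth bump on [0, 1]
print(besov_norm(f, sigma=1.0, p=2, q=2))
```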

Examples of IPMs

Distance                                 F
L_p (including Total Variation / L_1)    B^0_{p'}, with p' = p/(p − 1)
Wasserstein ("earth-mover")              B^1_∞ (1-Lipschitz class)
Kolmogorov-Smirnov                       B^1_1 (total variation ≤ 1)
Max. mean discrepancy (MMD)              RKHS ball
GAN                                      Discriminator parameterized by a neural network


Figure 1: Examples of probability distributions P and Q and corresponding discriminator functions f*. In (a), P and Q are Dirac masses at +1 and −1, resp., and F is the 1-Lipschitz class, so that ρ_F is the Wasserstein metric. In (b), P and Q are standard Gaussian and Laplace distributions, resp., and F is a ball in an RKHS with a Gaussian kernel.

Minimax Rates for General Estimators

Theorem 1. Suppose σ_g ≥ D/p_g and p'_d > p_g. Then, up to polylog factors in n,

$$M\left(B^{\sigma_g}_{p_g}, B^{\sigma_d}_{p_d}\right) := \inf_{\hat p} \, \sup_{p \in B^{\sigma_g}_{p_g}} \mathbb{E}_{X_{1:n}}\left[ \rho_{B^{\sigma_d}_{p_d}}\left(p, \hat p(X_{1:n})\right) \right] \asymp n^{-\frac{\sigma_g + \sigma_d}{2\sigma_g + D}} + n^{-\frac{\sigma_g + \sigma_d + D - D/p_g - D/p'_d}{2\sigma_g + D - 2D/p_g}} + n^{-\frac{1}{2}}.$$

Moreover, this rate is achieved by the wavelet thresholding estimator of Donoho et al. [1].
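The following is a minimal, histogram-based sketch of a wavelet-thresholding density estimator in the spirit of Donoho et al. [1] (added here; it is not the paper's estimator verbatim): the samples are pre-binned into a fine histogram, its wavelet detail coefficients are hard-thresholded at a level scaling like sqrt(log n / n), and the density is reconstructed; the wavelet family, binning, and threshold constant are illustrative assumptions.

```python
import numpy as np
import pywt

def wavelet_threshold_density(samples, n_bins=512, wavelet="db4", c=1.0):
    """Histogram-based sketch of a wavelet-thresholded density estimate:
    hard-threshold the detail coefficients of a fine histogram estimate."""
    n = len(samples)
    lo, hi = samples.min(), samples.max()
    hist, edges = np.histogram(samples, bins=n_bins, range=(lo, hi), density=True)
    coeffs = pywt.wavedec(hist, wavelet)
    # Heuristic threshold proportional to sqrt(log n / n) (constant c is arbitrary).
    thresh = c * np.sqrt(np.log(n) / n) * np.abs(coeffs[-1]).max()
    # Keep the coarse approximation, hard-threshold every detail level.
    coeffs = [coeffs[0]] + [pywt.threshold(d, thresh, mode="hard") for d in coeffs[1:]]
    dens = np.clip(pywt.waverec(coeffs, wavelet)[:n_bins], 0, None)
    # Renormalize so the estimate integrates to 1 over [lo, hi].
    dens /= dens.sum() * (edges[1] - edges[0])
    return edges, dens

rng = np.random.default_rng(0)
edges, dens = wavelet_threshold_density(rng.normal(size=5000))
```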

Minimax Rates for Linear Estimators

Definition 3 (Linear Estimator). A distribution estimate P̂ is said to be linear if there exist measures T_i(X_i, ·) such that, for all measurable A,

$$\hat P(A) = \sum_{i=1}^{n} T_i(X_i, A).$$

Examples include the empirical distribution, kernel density estimates, and orthogonal series estimates; a minimal sketch of the kernel case follows.
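As an illustration of Definition 3 (a sketch added here, not from the poster), a Gaussian kernel density estimate is linear: taking T_i(X_i, A) = (1/n) ∫_A K_h(t − X_i) dt gives P̂(A) = Σ_i T_i(X_i, A); the bandwidth and data below are arbitrary.

```python
import numpy as np
from scipy.stats import norm

def kde_measure_of_interval(samples, a, b, h=0.2):
    """Linear estimate P_hat([a, b]) = sum_i T_i(X_i, [a, b]) where
    T_i(x, [a, b]) = (1/n) * (Phi((b - x)/h) - Phi((a - x)/h)),
    i.e. a Gaussian kernel density estimate integrated over [a, b]."""
    n = len(samples)
    T = (norm.cdf((b - samples) / h) - norm.cdf((a - samples) / h)) / n
    return T.sum()

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
print(kde_measure_of_interval(x, -1.0, 1.0))   # roughly P(-1 < X < 1) ~ 0.68
```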

Theorem 2. Suppose r > σ_g ≥ D/p_g. Then, up to polylog factors in n,

$$M_{\mathrm{lin}}\left(B^{\sigma_g}_{p_g}, B^{\sigma_d}_{p_d}\right) := \inf_{\hat p \ \text{linear}} \, \sup_{p \in B^{\sigma_g}_{p_g}} \mathbb{E}_{X_{1:n}}\left[ \rho_{B^{\sigma_d}_{p_d}}\left(p, \hat p(X_{1:n})\right) \right] \asymp n^{-\frac{\sigma_g + \sigma_d}{2\sigma_g + D}} + n^{-\frac{\sigma_g + \sigma_d - D/p_g + D/p'_d}{2\sigma_g + D - 2D/p_g + 2D/p'_d}} + n^{-\frac{1}{2}},$$

where the infimum is over all linear estimates of p ∈ B^{σ_g}_{p_g}, and µ_p is the distribution with density p.

Error Bounds for GANs

A natural statistical model for a perfectly optimized GAN as a distribution estimator is

$$\hat P := \operatorname*{argmin}_{Q \in \mathcal F_g} \ \sup_{f \in \mathcal F_d} \ \mathbb{E}_{X \sim Q}[f(X)] - \mathbb{E}_{X \sim \tilde P_n}[f(X)], \qquad (1)$$

where F_d and F_g are function classes parametrized by the discriminator and generator, respectively [2].
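A toy numerical sketch of the estimator in (1), added for illustration and far simpler than an actual GAN: the generator class F_g is a one-parameter family of Gaussian location models represented as a pushforward of fixed noise, the discriminator class F_d is a small fixed set of bounded test functions (assumed closed under negation, as Besov balls are), and the outer argmin is solved by grid search over the generator parameter.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(1.5, 1.0, size=2000)      # plays the role of the target sample P~_n
noise = rng.normal(0.0, 1.0, size=2000)     # latent noise shared across generators

# Generator class F_g: location family {N(mu, 1)}, as a pushforward of fixed noise.
def generator(mu):
    return mu + noise

# Discriminator class F_d: a few bounded test functions (illustrative choice).
F_d = [np.tanh, np.sin, lambda t: np.tanh(t - 1.0)]

def objective(x, y, funcs):
    """sup_{f in F_d} E_Q[f] - E_{P~_n}[f] from (1); for a class closed under
    negation (as Besov balls are) this equals sup_f |E_Q[f] - E_{P~_n}[f]|."""
    return max(abs(f(x).mean() - f(y).mean()) for f in funcs)

# argmin over Q in F_g, by grid search over the generator parameter mu.
grid = np.linspace(-3.0, 3.0, 121)
best_mu = min(grid, key=lambda mu: objective(generator(mu), data, F_d))
print(best_mu)    # lands near the true location 1.5
```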

Theorem 3 (Convergence Rate of a Regularized GAN). Fix a Besov density class B^{σ_g}_{p_g} with σ_g > D/p_g and a discriminator class B^{σ_d}_{p_d}. Then, for some constant C > 0 depending only on B^{σ_d}_{p_d} and B^{σ_g}_{p_g}, for any desired approximation error ε > 0, one can construct a GAN P̂ of the form (1) (with P̃_n denoting the wavelet-thresholded distribution) whose discriminator network N_d and generator network N_g are fully-connected ReLU networks, such that

$$\sup_{P \in B^{\sigma_g}_{p_g}} \mathbb{E}\left[ d_{B^{\sigma_d}_{p_d}}\left(\hat P, P\right) \right] \lesssim \varepsilon + n^{-\eta(D, \sigma_d, p_d, \sigma_g, p_g)},$$

where η(D, σ_d, p_d, σ_g, p_g) is the optimal exponent in Theorem 1.

• N_d and N_g have (rate-optimal) depth polylog(1/ε) and width, maximum weight, and sparsity poly(1/ε).

• The proof uses a recent fully-connected ReLU network construction for approximating Besov functions [3].

Example Phase Diagrams

[Figure 2: two phase diagrams over discriminator smoothness σ_d and density smoothness σ_g. Left panel (general estimators): Parametric, Dense, Sparse, and Infeasible regimes. Right panel (linear estimators): Parametric, Nonparametric, and Infeasible regimes.]

Figure 2: Minimax convergence rates as functions of discriminator smoothness σ_d and distribution function smoothness σ_g, in the case D = 4, p_d = 1.2, p_g = 2. Color shows the exponent of the minimax convergence rate (i.e., α(σ_d, σ_g) such that M(B^{σ_d}_{1.2}(ℝ^D), B^{σ_g}_{2}(ℝ^D)) ≍ n^{-α(σ_d, σ_g)}), ignoring polylogarithmic factors.
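To help read Figure 2 (an added sketch, not code from the paper), the functions below evaluate the rate exponents implied by Theorems 1 and 2: each minimax rate is a sum of three decaying terms, so the overall exponent α is the smallest of the three individual exponents.

```python
import numpy as np

def conj(p):
    """Holder conjugate p' = p / (p - 1), with p = 1 mapping to infinity."""
    return np.inf if p == 1 else p / (p - 1)

def exponent_general(sigma_d, sigma_g, D=4, p_d=1.2, p_g=2.0):
    """Exponent alpha such that M ~ n^{-alpha} in Theorem 1 (slowest term wins)."""
    pd_c = conj(p_d)
    dense = (sigma_g + sigma_d) / (2 * sigma_g + D)
    sparse = (sigma_g + sigma_d + D - D / p_g - D / pd_c) / (2 * sigma_g + D - 2 * D / p_g)
    return min(dense, sparse, 0.5)

def exponent_linear(sigma_d, sigma_g, D=4, p_d=1.2, p_g=2.0):
    """Exponent for linear estimators in Theorem 2."""
    pd_c = conj(p_d)
    dense = (sigma_g + sigma_d) / (2 * sigma_g + D)
    lin = (sigma_g + sigma_d - D / p_g + D / pd_c) / (2 * sigma_g + D - 2 * D / p_g + 2 * D / pd_c)
    return min(dense, lin, 0.5)

print(exponent_general(1.0, 2.0), exponent_linear(1.0, 2.0))
```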

Applications/Examples

Example 1. (Total variation/Wasserstein-type losses) If, for some σ_d ≥ 0, F is a ball in B^{σ_d}_∞, we obtain generalizations of the total variation (σ_d = 0) and Wasserstein (σ_d = 1) losses. For these losses, we always have the "Dense" rate

$$M\left(B^{\sigma_g}_{p_g}, B^{\sigma_d}_{\infty}\right) \asymp n^{-\frac{\sigma_g + \sigma_d}{2\sigma_g + D}} + n^{-1/2}.$$

Example 2. (Kolmogorov-Smirnov-type losses) If, for some σ_d ≥ 0, F is a ball in B^{σ_d}_1, we obtain generalizations of the Kolmogorov-Smirnov loss (σ_d = 0). For these losses, we have the "Sparse" rate

$$M\left(B^{\sigma_g}_{p_g}, B^{\sigma_d}_{1}\right) \asymp n^{-\frac{\sigma_g + \sigma_d - D/p_g}{2\sigma_g + D - 2D/p_g}} + n^{-1/2}.$$

Example 3. (Maximum Mean Discrepancy) If F is a ball of radius L in a reproducing kernel Hilbert space with translation-invariant kernel K(x, y) = κ(x − y) for some κ ∈ L^2(X), then

$$\sup_{P \ \text{Borel}} \mathbb{E}\left[ \rho_{\mathcal F}\left(P, \hat P\right) \right] \le \frac{L \, \|\kappa\|_{L^2(\mathcal X)}}{\sqrt{n}}.$$
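As a concrete companion to Example 3 (an added sketch, not from the poster), the snippet below computes a plug-in estimate of the MMD between two samples with a Gaussian kernel; the bandwidth, the distributions (matching Figure 1(b)), and the sample sizes are arbitrary choices.

```python
import numpy as np

def mmd_gaussian(x, y, bandwidth=1.0):
    """Plug-in estimate of the MMD between samples x ~ P and y ~ Q
    with the Gaussian kernel k(a, b) = exp(-(a - b)^2 / (2 h^2))."""
    def k(a, b):
        return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * bandwidth ** 2))
    mmd2 = k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()
    return np.sqrt(max(mmd2, 0.0))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=1000)   # standard Gaussian, as in Figure 1(b)
y = rng.laplace(0.0, 1.0, size=1000)  # Laplace
print(mmd_gaussian(x, y))
```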

Example 4. (Sobolev IPMs) For σ ∈ ℕ, B^σ_2 is the σ-order Hilbert-Sobolev ball B^σ_2 = {f ∈ L^2(X) : ∫_X (f^{(σ)}(x))^2 dx < ∞}, where f^{(σ)} is the σth derivative of f. For these losses, we always have the rate

$$M\left(B^{\sigma_g}_{2}, B^{\sigma_d}_{2}\right) \asymp n^{-\frac{\sigma_g + \sigma_d}{2\sigma_g + D}} + n^{-1/2}.$$

(Note that the parametric term n^{-1/2} dominates if and only if 2σ_d ≥ D.)

References

[1] David L. Donoho, Iain M. Johnstone, Gérard Kerkyacharian, and Dominique Picard. Density estimation by wavelet thresholding. The Annals of Statistics, pages 508–539, 1996.

[2] Shashank Singh, Ananya Uppal, Boyue Li, Chun-Liang Li, Manzil Zaheer, and Barnabás Póczos. Nonparametric density estimation under adversarial losses. In NeurIPS, 2018.

[3] Taiji Suzuki. Adaptivity of deep ReLU networks for learning in Besov and mixed smooth Besov spaces: optimal rate and curse of dimensionality. arXiv preprint arXiv:1810.08033, 2018.