
Nonparametric Regression on Low-Dimensional Manifolds using Deep ReLU Networks

MINSHUO CHEN
ISyE, Georgia Institute of Technology, Atlanta, GA 30332

HAOMING JIANG
ISyE, Georgia Institute of Technology

WENJING LIAO∗
Mathematics, Georgia Institute of Technology
∗Corresponding author: [email protected]

AND

TUO ZHAO
ISyE, Georgia Institute of Technology

Real-world data often exhibit low-dimensional geometric structures, and can be viewed as samples near a low-dimensional manifold. This paper studies nonparametric regression of Hölder functions on low-dimensional manifolds using deep ReLU networks. Suppose $n$ training data are sampled from a Hölder function in $\mathcal{H}^{s,\alpha}$ supported on a $d$-dimensional Riemannian manifold isometrically embedded in $\mathbb{R}^D$, with sub-Gaussian noise. A deep ReLU network architecture is designed to estimate the underlying function from the training data. The mean squared error of the empirical estimator is proved to converge in the order of $n^{-\frac{2(s+\alpha)}{2(s+\alpha)+d}}\log^3 n$. This result shows that deep ReLU networks give rise to a fast convergence rate depending on the data intrinsic dimension $d$, which is usually much smaller than the ambient dimension $D$. It therefore demonstrates the adaptivity of deep ReLU networks to low-dimensional geometric structures of data, and partially explains the power of deep ReLU networks in tackling high-dimensional data with low-dimensional geometric structures.

Keywords: Nonparametric regression, Low-dimensional manifolds, Deep ReLU networks, Sample complexity, Uniform approximation theory

1. Introduction

Deep learning has made astonishing breakthroughs in various real-world applications, such as computer vision (Krizhevsky et al., 2012; Goodfellow et al., 2014; Long et al., 2015), natural language processing (Graves et al., 2013; Bahdanau et al., 2014; Young et al., 2018), healthcare (Miotto et al., 2017; Jiang et al., 2017), robotics (Gu et al., 2017), etc. For example, in image classification, the winner of the 2017 ImageNet challenge achieved a top-5 error rate of 2.25% (Hu et al., 2018), while the data set consists of about 1.2 million labeled high-resolution images in 1000 categories. In speech recognition, Amodei et al. (2016) reported that deep neural networks outperformed humans with a 5.15% word error rate on the LibriSpeech corpus constructed from audio books (Panayotov et al., 2015). Such a data set consists of approximately 1000 hours of 16 kHz read English speech from 8000 audio books.

The empirical success of deep learning has brought new challenges to the conventional wisdom of machine learning. Data sets in these applications are high-dimensional and highly complex. In the existing literature, a minimax lower bound has been established for the optimal algorithm of learning $C^s$ functions in $\mathbb{R}^D$ (Györfi et al., 2006; Tsybakov, 2008). Suppose the underlying function is $f_0$. The minimax lower bound suggests a pessimistic sample complexity: to obtain an estimator $\hat f$ of $C^s$ functions with an $\varepsilon$-error uniformly (i.e., $\sup_{f_0\in C^s}\|\hat f - f_0\|_{L^2}\le\varepsilon$, with $\|\cdot\|_{L^2}$ denoting the function $L^2$ norm), the optimal algorithm requires the sample size $n\gtrsim \varepsilon^{-\frac{2s+D}{s}}$ in the worst scenario (i.e., when $f_0$ is the most difficult for the algorithm to estimate). We instantiate such a sample complexity bound for the ImageNet data set, which consists of RGB images with a resolution of $224\times 224$. The theory above suggests that, to achieve an $\varepsilon$-error, the number of samples has to scale as $\varepsilon^{-224\times 224\times 3/s}$, where the modulus of smoothness $s$ is significantly smaller than $224\times 224\times 3$. Setting $\varepsilon = 0.1$ already gives rise to a huge number of samples far beyond practical applications, well exceeding the 1.2 million labeled images in ImageNet.

To bridge the aforementioned gap between theory and practice, we take the low-dimensional geometric structures in data sets into consideration. This is motivated by the fact that real-world data sets often exhibit low-dimensional structures. Many images consist of projections of a three-dimensional object followed by some transformations, such as rotation, translation, and skeleton. This generating mechanism induces a small number of intrinsic parameters (Hinton and Salakhutdinov, 2006; Osher et al., 2017). Speech data are composed of words and sentences following the grammar, and therefore have small degrees of freedom (Djuric et al., 2015). More broadly, visual, acoustic, textual, and many other types of data often have low-dimensional geometric structures due to rich local regularities, global symmetries, repetitive patterns, or redundant sampling (Tenenbaum et al., 2000; Roweis and Saul, 2000). It is therefore reasonable to assume that data lie on a manifold $\mathcal{M}$ of dimension $d\ll D$.

In this paper, we study nonparametric regression problems (Wasserman, 2006; Györfi et al., 2006; Tsybakov, 2008) using neural networks in exploitation of low-dimensional geometric structures of data. Specifically, we model data as samples from a probability measure supported on a $d$-dimensional manifold $\mathcal{M}$ isometrically embedded in $\mathbb{R}^D$, where $d\ll D$. The goal is to recover the regression function $f_0$ supported on $\mathcal{M}$ using the samples $S_n=\{(x_i,y_i)\}_{i=1}^n$ with $x_i\in\mathcal{M}$ and $y_i\in\mathbb{R}$. The $x_i$'s are i.i.d. sampled from a distribution $\mathcal{D}_x$ on $\mathcal{M}$, and the response $y_i$ satisfies
$$y_i = f_0(x_i) + \xi_i,$$
where the $\xi_i$'s are i.i.d. sub-Gaussian noise independent of the $x_i$'s.

We use multi-layer ReLU (Rectified Linear Unit) neural networks to recover $f_0$. ReLU networks are widely used in computer vision, speech recognition, natural language processing, etc. (Nair and Hinton, 2010; Glorot et al., 2011; Maas et al., 2013). These networks can ease the notorious vanishing gradient issue during training, which commonly arises with sigmoid or hyperbolic tangent activations (Glorot et al., 2011; Goodfellow et al., 2016). Given an input $x$, an $L$-layer ReLU neural network computes an output as
$$f(x) = W_L\cdot\mathrm{ReLU}\big(W_{L-1}\cdots \mathrm{ReLU}(W_1 x + b_1)\cdots + b_{L-1}\big) + b_L, \tag{1.1}$$
where $W_1,\dots,W_L$ and $b_1,\dots,b_L$ are weight matrices and vectors of proper sizes, respectively, and $\mathrm{ReLU}(\cdot)$ denotes the entrywise rectified linear unit activation (i.e., $\mathrm{ReLU}(a)=\max\{0,a\}$).
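To make (1.1) concrete, here is a minimal NumPy sketch of the forward pass of such a network. It is only an illustration of the formula; the function and variable names are ours, not part of the paper.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_network(x, weights, biases):
    """f(x) = W_L * ReLU(W_{L-1} ... ReLU(W_1 x + b_1) ... + b_{L-1}) + b_L."""
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = relu(W @ a + b)               # hidden layers apply the entrywise ReLU
    return weights[-1] @ a + biases[-1]   # the output layer is affine

# Example: D = 8 inputs, two hidden layers of width p = 16, scalar output.
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(16, 8)), rng.normal(size=(16, 16)), rng.normal(size=(1, 16))]
bs = [np.zeros(16), np.zeros(16), np.zeros(1)]
print(relu_network(rng.normal(size=8), Ws, bs))
```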

We further denote $\mathcal{F}$ as a class of neural networks with bounded weight parameters and bounded output:
$$\mathcal{F}(R,\kappa,L,p,K) = \Big\{ f \;\Big|\; f(x)\ \text{in the form (1.1) with $L$ layers and width bounded by $p$},\ \|f\|_\infty\le R,\ \|W_i\|_{\infty,\infty}\le\kappa,\ \|b_i\|_\infty\le\kappa\ \text{for}\ i=1,\dots,L,\ \sum_{i=1}^L\|W_i\|_0+\|b_i\|_0\le K\Big\},$$
where $\|\cdot\|_0$ denotes the number of nonzero entries in a vector or a matrix, $\|\cdot\|_\infty$ denotes the $\ell_\infty$ norm of a function or the entrywise $\ell_\infty$ norm of a vector, and for a matrix $M$, $\|M\|_{\infty,\infty}=\max_{i,j}|M_{ij}|$.

To obtain an estimator $\hat f\in\mathcal{F}(R,\kappa,L,p,K)$ of $f_0$, we minimize the empirical quadratic risk
$$\hat f_n = \mathop{\mathrm{argmin}}_{f\in\mathcal{F}(R,\kappa,L,p,K)}\ \widehat{R}_n(f) = \mathop{\mathrm{argmin}}_{f\in\mathcal{F}(R,\kappa,L,p,K)}\ \frac{1}{n}\sum_{i=1}^n\big(f(x_i)-y_i\big)^2. \tag{1.2}$$
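A correspondingly minimal sketch of the empirical quadratic risk in (1.2); how the minimization over $\mathcal{F}(R,\kappa,L,p,K)$ is carried out in practice (e.g., by stochastic gradient descent under the norm and sparsity constraints) is not specified here.

```python
def empirical_risk(f, xs, ys):
    """R_n(f) = (1/n) * sum_i (f(x_i) - y_i)^2 over the training samples."""
    n = len(xs)
    return sum((f(x) - y) ** 2 for x, y in zip(xs, ys)) / n
```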

The subscript $n$ emphasizes that the estimator is obtained using $n$ pairs of samples. Our theory shows that $\hat f_n$ enjoys a fast rate of convergence to $f_0$, depending on the intrinsic dimension $d$. Let $\mathcal{M}$ be a $d$-dimensional compact Riemannian manifold isometrically embedded in $\mathbb{R}^D$ with $d\ll D$. Assume $\mathcal{M}$ satisfies some mild regularity conditions. For simplicity, we suppose $f_0$ is a $C^s$ function on $\mathcal{M}$. Extensions to Hölder and Sobolev functions are given in Theorem 3.1. For the network class $\mathcal{F}(R,\kappa,L,p,K)$, we choose
$$L = O\Big(\frac{2s+d}{2s}\log n\Big),\quad p = O\big(n^{\frac{d}{2s+d}}\big),\quad K = O\Big(\frac{2s+d}{2s}\, n^{\frac{d}{2s+d}}\log n\Big),\quad R = \|f_0\|_\infty,$$
and set $\kappa$ as a constant depending on $s$ and $\mathcal{M}$. We use $O$ to hide logarithmic factors (e.g., $\log D$) and polynomial factors in $s$ and $d$. Let $\hat f_n$ be the empirical minimizer of (1.2). Then we have
$$\mathbb{E}\Big[\int_{\mathcal{M}}\big(\hat f_n(x)-f_0(x)\big)^2\, d\mathcal{D}_x(x)\Big] \le c\,(R^2+\sigma^2)\, n^{-\frac{2s}{2s+d}}\log^3 n,$$
where the expectation is taken over the training samples $S_n$, $\sigma^2$ is the variance of the $\xi_i$'s, and $c$ is a constant depending on $\log D$, $s$, $\kappa$, and $\mathcal{M}$.

Our theory implies that, in order to estimate a $C^s$ regression function up to an $\varepsilon$-error, the sample complexity is $n\gtrsim \varepsilon^{-\frac{2s+d}{s}}$ up to a log factor. This sample complexity depends on the intrinsic dimension $d$, and thus largely improves on existing theories of nonparametric regression using neural networks, where the sample complexity scales as $O(\varepsilon^{-\frac{2s+D}{s}})$ (Hamers and Kohler, 2006; Kohler and Krzyzak, 2005, 2016; Kohler and Mehnert, 2011; Schmidt-Hieber, 2017). Our result partially explains the success of deep ReLU neural networks in tackling complex high-dimensional data with low-dimensional geometric structures, e.g., images and speech data.

A key ingredient in our analysis is an efficient universal approximation theory of deep ReLU networks for $C^s$ functions on $\mathcal{M}$ (Theorem 3.2), which appeared in Chen et al. (2019). Specifically, we show that, in order to uniformly approximate $C^s$ functions on a $d$-dimensional manifold with an $\varepsilon$-error, the network consists of at most $O(\log\frac{1}{\varepsilon}+\log D)$ layers and $O(\varepsilon^{-d/s}\log\frac{1}{\varepsilon}+D\log\frac{1}{\varepsilon}+D\log D)$ neurons and weight parameters. Similar results also hold for functions in Hölder and Sobolev spaces (see Theorem 3.2). The network size in our approximation theory only weakly depends on the data dimension $D$, which significantly improves existing universal approximation theories of neural networks (Barron, 1993; Mhaskar, 1996; Lu et al., 2017; Hanin, 2017; Yarotsky, 2017), where the network size scales as $O(\varepsilon^{-D/s})$. Figure 1 illustrates the huge gap between the network sizes used in practice (Tan and Le, 2019) and the required size predicted by existing theories (Yarotsky, 2017) for the ImageNet data set. Our network size also matches the lower bound up to logarithmic factors for a given manifold $\mathcal{M}$ (see Theorem 3.3). Our approximation theory partially justifies why networks of moderate size have achieved great success in various applications.

1.1 Related Work

FIG. 1. Practical network sizes for the ImageNet data set (Tan and Le, 2019) versus the required size predicted by existing theories (Yarotsky, 2017). (Axes: number of parameters in millions vs. ImageNet top-1 accuracy in %.)

Nonparametric regression has been widely studied in statistics. A variety of methods has been proposed to estimate the regression function, including kernel methods, wavelets, splines, and local polynomials (Wahba, 1990; Altman, 1992; Fan and Gijbels, 1996; Tsybakov, 2008; Györfi et al., 2006). Nonetheless, there was limited study on regression using deep ReLU networks until recently. The earliest works focused on neural networks with a single hidden layer and smooth activations (e.g., sigmoidal and sinusoidal functions, Barron (1991); McCaffrey and Gallant (1994)). Later results achieved the minimax lower bound for the mean squared error in the order of $O(n^{-\frac{2s}{2s+D}})$ up to a logarithmic factor for $C^s$ functions in $\mathbb{R}^D$ (Hamers and Kohler, 2006; Kohler and Krzyzak, 2005, 2016; Kohler and Mehnert, 2011). Theories for deep ReLU networks were developed in Schmidt-Hieber (2017), where the estimate matches the minimax lower bound up to a logarithmic factor for Hölder functions. Extensions to more general function spaces, such as Besov spaces, can be found in Suzuki (2019), and results for classification problems can be found in Kim et al. (2018); Ohn and Kim (2019).

The rates of convergence in the results above are insufficient to understand the success of deep learning, due to the curse of data dimension with a large $D$. Fortunately, many real-world data sets exhibit low-dimensional geometric structures. It has been demonstrated that some classical methods can automatically adapt to the low-dimensional structures of data, and perform as well as if the low-dimensional structures were known. Results in this direction include local linear regression (Bickel and Li, 2007), k-nearest neighbors (Kpotufe, 2011), kernel regression (Kpotufe and Garg, 2013), and Bayesian Gaussian process regression (Yang et al., 2015), where optimal rates depending on the intrinsic dimension were proved for functions with second-order continuity (Bickel and Li, 2007), globally Lipschitz functions (Kpotufe, 2011), and Hölder functions with Hölder index no more than 1 (Kpotufe and Garg, 2013). However, it was not clear whether deep ReLU networks are adaptable to low-dimensional geometric structures of data.

A crucial ingredient in the statistical analysis of neural networks is the universal approximation ability of neural networks. Early results justified the existence of two-layer networks with continuous sigmoidal activations (a function $\sigma(x)$ is sigmoidal if $\sigma(x)\to 0$ as $x\to-\infty$ and $\sigma(x)\to 1$ as $x\to\infty$) for the universal approximation of continuous functions on a unit hypercube (Irie and Miyake, 1988; Funahashi, 1989; Cybenko, 1989; Hornik, 1991; Chui and Li, 1992; Leshno et al., 1993). In these works, the number of neurons was not explicitly given. Later, Barron (1993) and Mhaskar (1996) proved that the number of neurons can grow as $\varepsilon^{-D/2}$, where $\varepsilon$ is the uniform approximation error. Recently, Lu et al. (2017), Hanin (2017) and Daubechies et al. (2019) extended the universal approximation theorem to networks of bounded width with ReLU activations. The depth of such networks has to grow exponentially with respect to the dimension of data. Yarotsky (2017) showed that ReLU neural networks can uniformly approximate functions in Sobolev spaces, where the network size scales exponentially with respect to the data dimension and matches the lower bound. Zhou (2019) also developed a universal approximation theory for deep convolutional neural networks (Krizhevsky et al., 2012), where the depth of the network scales exponentially with respect to the data dimension.

The aforementioned results focus on functions on a compact subset (e.g., $[0,1]^D$) of $\mathbb{R}^D$. Function approximation on manifolds has been well studied using classical methods, such as local polynomials (Bickel et al., 2007) and wavelets (Coifman and Maggioni, 2006). However, studies using neural networks are limited. Two noticeable works are Chui and Mhaskar (2016) and Shaham et al. (2018). In Chui and Mhaskar (2016), high-order differentiable functions on manifolds are approximated by neural networks with smooth activations, e.g., sigmoid activations and rectified quadratic units ($\max^2\{0,x\}$). These smooth activations are rarely used in mainstream applications such as computer vision (Krizhevsky et al., 2012; Long et al., 2015; Hu et al., 2018). In Shaham et al. (2018), a 4-layer network with ReLU activations was proposed to approximate $C^2$ functions on low-dimensional manifolds that have absolutely summable wavelet coefficients. This theory does not cover arbitrary $C^s$ functions. We are also aware of a work concurrent with ours, Shen et al. (2019), which established an approximation theory of ReLU networks for Hölder functions in terms of a modulus of continuity. When the target function belongs to $\mathcal{H}^{s,\alpha}$ and is supported in a neighborhood of a $d$-dimensional manifold embedded in $\mathbb{R}^D$, Shen et al. (2019) constructed a ReLU network which yields an approximation error in the order of $N^{-2\min(s+\alpha,1)/d_\delta}L^{-2\min(s+\alpha,1)/d_\delta}$, where $N$ and $L$ are the width and depth of the network, and $d<d_\delta<D$. Their proof utilizes a different approach from ours: they first construct a piecewise constant function to approximate the target function, and then implement the piecewise constant function using a ReLU network. Unfortunately, the higher-order smoothness (when $s+\alpha>1$) is not exploited due to the use of piecewise constant approximations.

1.2 Roadmap and Notations

The rest of the paper is organized as follows: Section 2 presents a brief introduction to manifolds and functions on manifolds. Section 3 presents the main theory of efficient statistical recovery using deep ReLU neural networks on low-dimensional manifolds, and a new universal approximation theory of ReLU networks; Section 4 sketches the proofs of the theories in Section 3, and the detailed proofs are deferred to the appendix; Section 5 provides a conclusion of the paper and discusses open questions and future directions.

We use bold-faced letters to denote vectors, and normal font letters with a subscript to denote its coordinates, e.g., $\mathbf{x}\in\mathbb{R}^d$ and $x_k$ being the $k$-th coordinate of $\mathbf{x}$. Given a vector $\mathbf{s}=[s_1,\dots,s_d]^\top\in\mathbb{N}^d$, we define $\mathbf{s}!=\prod_{i=1}^d s_i!$ and $|\mathbf{s}|=\sum_{i=1}^d s_i$. We define $\mathbf{x}^{\mathbf{s}}=\prod_{i=1}^d x_i^{s_i}$. Given a function $f:\mathbb{R}^d\mapsto\mathbb{R}$, we denote its derivative as $D^{\mathbf{s}}f=\frac{\partial^{|\mathbf{s}|}f}{\partial x_1^{s_1}\cdots\partial x_d^{s_d}}$, and its $\ell_\infty$ norm as $\|f\|_\infty=\max_x|f(x)|$. We use $\circ$ to denote the composition operator.

2. Preliminaries

We briefly review manifolds, partition of unity, and function spaces defined on smooth manifolds. Details can be found in Tu (2010) and Lee (2006).

Let M be a d-dimensional Riemannian manifold isometrically embedded in RD.

DEFINITION 2.1 (Chart) A chart for $\mathcal{M}$ is a pair $(U,\phi)$ such that $U\subset\mathcal{M}$ is open and $\phi: U\mapsto\mathbb{R}^d$, where $\phi$ is a homeomorphism (i.e., bijective, and $\phi$ and $\phi^{-1}$ are both continuous).

The open set $U$ is called a coordinate neighborhood, and $\phi$ is called a coordinate system on $U$. A chart essentially defines a local coordinate system on $\mathcal{M}$. We say two charts $(U,\phi)$ and $(V,\psi)$ on $\mathcal{M}$ are $C^k$ compatible if and only if the transition functions,
$$\phi\circ\psi^{-1}:\psi(U\cap V)\mapsto\phi(U\cap V) \quad\text{and}\quad \psi\circ\phi^{-1}:\phi(U\cap V)\mapsto\psi(U\cap V),$$
are both $C^k$. Then we give the definition of an atlas.

DEFINITION 2.2 ($C^k$ Atlas) An atlas for $\mathcal{M}$ is a collection $\{(U_\alpha,\phi_\alpha)\}_{\alpha\in\mathcal{A}}$ of pairwise $C^k$ compatible charts such that $\bigcup_{\alpha\in\mathcal{A}}U_\alpha=\mathcal{M}$.

DEFINITION 2.3 (Smooth Manifold) A smooth manifold is a manifold M together with a C∞ atlas.

Classical examples of smooth manifolds are the Euclidean space $\mathbb{R}^D$, the torus, and the unit sphere. The existence of an atlas on $\mathcal{M}$ allows us to define differentiable functions.

DEFINITION 2.4 ($C^s$ Functions on $\mathcal{M}$) Let $\mathcal{M}$ be a smooth manifold in $\mathbb{R}^D$. A function $f:\mathcal{M}\mapsto\mathbb{R}$ is $C^s$ if, for any chart $(U,\phi)$, the composition $f\circ\phi^{-1}:\phi(U)\mapsto\mathbb{R}$ is continuously differentiable up to order $s$.

REMARK 2.1 The definition of $C^s$ functions is independent of the choice of the chart $(U,\phi)$. Suppose $(V,\psi)$ is another chart and $V\cap U\ne\emptyset$. Then we have
$$f\circ\psi^{-1} = (f\circ\phi^{-1})\circ(\phi\circ\psi^{-1}).$$
Since $\mathcal{M}$ is a smooth manifold, $(U,\phi)$ and $(V,\psi)$ are $C^\infty$ compatible. Thus, $f\circ\phi^{-1}$ is $C^s$ and $\phi\circ\psi^{-1}$ is $C^\infty$, and their composition is $C^s$.

We next introduce the partition of unity, which plays a crucial role in our construction of neural networks.

DEFINITION 2.5 (Partition of Unity) A $C^\infty$ partition of unity on a manifold $\mathcal{M}$ is a collection of nonnegative $C^\infty$ functions $\rho_\alpha:\mathcal{M}\mapsto\mathbb{R}_+$ for $\alpha\in\mathcal{A}$ such that

1. the collection of supports, $\{\mathrm{supp}(\rho_\alpha)\}_{\alpha\in\mathcal{A}}$, is locally finite, i.e., every point on $\mathcal{M}$ has a neighborhood that meets only finitely many of the $\mathrm{supp}(\rho_\alpha)$'s;

2. $\sum_\alpha\rho_\alpha = 1$.

For a smooth manifold, a $C^\infty$ partition of unity always exists.

PROPOSITION 2.6 (Existence of a $C^\infty$ partition of unity) Let $\{U_\alpha\}_{\alpha\in\mathcal{A}}$ be an open cover of a smooth manifold $\mathcal{M}$. Then there is a $C^\infty$ partition of unity $\{\rho_i\}_{i=1}^\infty$ with every $\rho_i$ having a compact support such that $\mathrm{supp}(\rho_i)\subset U_\alpha$ for some $\alpha\in\mathcal{A}$.

Proposition 2.6 gives rise to the decomposition $f=\sum_{i=1}^\infty f_i$ with $f_i=f\rho_i$. Note that the $f_i$'s have the same regularity as $f$, since
$$f_i\circ\phi^{-1} = (f\circ\phi^{-1})\times(\rho_i\circ\phi^{-1})$$
for a chart $(U,\phi)$. This decomposition has the advantage that every $f_i$ is supported in a single chart only. Then controlling the bias of estimating $f$ boils down to the approximation of the $f_i$'s, which are localized and have the same regularity as $f$.

To characterize the curvature of a manifold, we adopt the following geometric concept.

DEFINITION 2.7 (Reach, Definition 2.1 in Aamari et al. (2019)) Denote
$$\mathcal{C}(\mathcal{M}) = \Big\{x\in\mathbb{R}^D:\ \exists\, p\ne q\in\mathcal{M},\ \|p-x\|_2=\|q-x\|_2=\inf_{y\in\mathcal{M}}\|y-x\|_2\Big\}$$
as the set of points that have at least two nearest neighbors on $\mathcal{M}$. The reach $\tau>0$ is defined as
$$\tau = \inf_{x\in\mathcal{M},\,y\in\mathcal{C}(\mathcal{M})}\|x-y\|_2.$$

Reach has a straightforward geometrical interpretation: at each point $x\in\mathcal{M}$, the radius of the osculating circle is greater than or equal to $\tau$. A large reach essentially requires that the manifold $\mathcal{M}$ does not change "rapidly", as shown in Figure 2.

FIG. 2. Manifolds with large and small reaches: a large reach corresponds to slow change, and a small reach to rapid change.

Reach determines a proper choice of an atlas for $\mathcal{M}$. In Section 4, we choose each chart $U_\alpha$ contained in a ball of radius less than $\tau/2$. For smooth manifolds with a small $\tau$, we need a large number of charts. Therefore, the reach of a smooth manifold reflects the complexity of the neural network for function approximation on $\mathcal{M}$.

3. Main Results – Statistical Recovery Theory

This section contains our main recovery theory of learning regression models on low-dimensional manifolds using deep neural networks. We begin with some assumptions on the regression model and the manifold $\mathcal{M}$.

Assumption 1. $\mathcal{M}$ is a $d$-dimensional compact Riemannian manifold isometrically embedded in $\mathbb{R}^D$. There exists a constant $B>0$ such that, for any point $x\in\mathcal{M}$, we have $|x_i|\le B$ for all $i=1,\dots,D$.

Assumption 2. The reach of M is τ > 0.

Assumption 3. The ground truth function $f_0:\mathcal{M}\mapsto\mathbb{R}$ belongs to the Hölder space $\mathcal{H}^{s,\alpha}(\mathcal{M})$ with a positive integer $s$ and $\alpha\in(0,1]$, in the sense that for any chart $(U,\phi)$, we have

1. $f_0\circ\phi^{-1}\in C^{s-1}$ with $|D^{\mathbf{s}}(f_0\circ\phi^{-1})|\le 1$ for any $|\mathbf{s}|<s$, $x\in U$;

2. for any $|\mathbf{s}|=s$ and $x_1,x_2\in U$,
$$\Big|D^{\mathbf{s}}(f_0\circ\phi^{-1})\big|_{\phi(x_1)} - D^{\mathbf{s}}(f_0\circ\phi^{-1})\big|_{\phi(x_2)}\Big| \le \|\phi(x_1)-\phi(x_2)\|_2^\alpha. \tag{3.1}$$

Assumption 3 requires that all $s$-th order derivatives of $f_0\circ\phi^{-1}$ are Hölder continuous. We recover the standard Hölder class on Euclidean spaces by taking $\phi$ as the identity map. We also note that Assumption 3 does not depend on the choice of charts.

The following theorem characterizes the convergence rate for the estimation of $f_0$ using ReLU neural networks. For simplicity, we state the result for Hölder functions. Extensions to Sobolev spaces are straightforward.

THEOREM 3.1 Suppose Assumptions 1 - 3 hold. Let $\hat f_n$ be the minimizer of the empirical risk (1.2) with the network class $\mathcal{F}(R,\kappa,L,p,K)$ properly designed such that
$$L = O\Big(\frac{2(s+\alpha)+d}{2(s+\alpha)}\log n\Big),\quad p = O\big(n^{\frac{d}{2(s+\alpha)+d}}\big),\quad K = O\Big(\frac{2(s+\alpha)+d}{2(s+\alpha)}\,n^{\frac{d}{2(s+\alpha)+d}}\log n\Big),$$
$R=\|f_0\|_\infty$, and $\kappa=\max\{1,B,\sqrt d,\tau^2\}$. Then we have
$$\mathbb{E}\Big[\int_{\mathcal{M}}\big(\hat f_n(x)-f_0(x)\big)^2\, d\mathcal{D}_x(x)\Big] \le c\,(R^2+\sigma^2)\, n^{-\frac{2(s+\alpha)}{2(s+\alpha)+d}}\log^3 n,$$
where the expectation is taken over the training samples $S_n$, and $c$ is a constant depending on $\log D$, $s$, $\kappa$, and $\mathcal{M}$.
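As a rough numerical reading of Theorem 3.1, the snippet below evaluates the prescribed orders of $L$, $p$, $K$ and the risk bound for example values of $n$, $s+\alpha$, and $d$; all constants hidden by $O(\cdot)$ are set to one purely for illustration, so only the scaling is meaningful.

```python
import math

def theorem31_scalings(n, smoothness, d):
    """Orders of L, p, K and the risk bound in Theorem 3.1 with all hidden constants set to 1.

    `smoothness` plays the role of s + alpha.
    """
    L = (2 * smoothness + d) / (2 * smoothness) * math.log(n)
    p = n ** (d / (2 * smoothness + d))
    K = (2 * smoothness + d) / (2 * smoothness) * p * math.log(n)
    risk = n ** (-2 * smoothness / (2 * smoothness + d)) * math.log(n) ** 3
    return {"L": L, "p": p, "K": K, "risk": risk}

# n = 10^6 samples, s + alpha = 2, intrinsic dimension d = 10 (independent of the ambient D):
print(theorem31_scalings(10**6, 2.0, 10))
```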

Theorem 3.1 is established by a bias-variance trade-off (see its proof in Section 4). Here are some remarks:

1. The network class in Theorem 3.1 is sparsely connected, i.e., $K=O(Lp)$, while densely connected networks satisfy $K=O(Lp^2)$.

2. The network class $\mathcal{F}(R,\kappa,L,p,K)$ has outputs uniformly bounded by $R$. Such a requirement can be achieved by appending an additional clipping layer to the end of the network structure, i.e.,
$$g(a) = \max\{-R,\min\{a,R\}\} = \mathrm{ReLU}(a+R)-\mathrm{ReLU}(a-R)-R.$$

3. Each weight parameter in our network class is bounded by a constant $\kappa$ depending only on the curvature $\tau$, the range $B$ of the manifold $\mathcal{M}$, and the smoothness $s$ of $f_0$. Such a boundedness condition is crucial to our theory and can be computationally realized by normalization after each step of stochastic gradient descent.
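The clipping layer in Remark 2 is easy to sanity-check numerically; this is a small sketch (the function names are ours).

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def clip_layer(a, R):
    """max{-R, min{a, R}} expressed with two ReLUs, as in Remark 2."""
    return relu(a + R) - relu(a - R) - R

a = np.linspace(-3.0, 3.0, 13)
assert np.allclose(clip_layer(a, R=1.0), np.clip(a, -1.0, 1.0))
```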

Theorem 3.1 quantifies the network size for learning $f_0$. A natural question is: how is the network structure properly designed? The answer is given by the following universal approximation theory of ReLU networks for Hölder functions supported on the manifold $\mathcal{M}$.

THEOREM 3.2 Suppose Assumptions 1 and 2 hold. Given any $\varepsilon\in(0,1)$, there exists a ReLU network structure such that, for any $f:\mathcal{M}\to\mathbb{R}$ satisfying Assumption 3, if the weight parameters are properly chosen, the network yields a function $\hat f$ satisfying $\|\hat f-f\|_\infty\le\varepsilon$. Such a network has

1. no more than $c_1\big(\log\frac{1}{\varepsilon}+\log D\big)$ layers,

2. at most $c_2\big(\varepsilon^{-\frac{d}{s+\alpha}}\log\frac{1}{\varepsilon}+D\log\frac{1}{\varepsilon}+D\log D\big)$ neurons and weight parameters,

where $c_1,c_2$ depend on $d$, $s$, $f$, $\tau$, and the surface area of $\mathcal{M}$.

The network structure identified by Theorem 3.2 consists of three sub-networks, as shown in Figure 3 (the detailed construction of each sub-network is postponed to Section 4):

• Chart determination sub-network, which assigns each input to its corresponding neighborhoods;

• Taylor approximation sub-network, which approximates $f$ by polynomials in each neighborhood;

• Pairing sub-network, which yields multiplications of the proper pairs of outputs from the chart determination and the Taylor approximation sub-networks.

FIG. 3. The ReLU network identified by Theorem 3.2.

Our result significantly improves upon existing approximation theories (Yarotsky, 2017), where the network size grows exponentially with respect to the ambient dimension $D$, i.e., $\varepsilon^{-D/(s+\alpha)}$. Theorem 3.2 also improves upon Shaham et al. (2018) for $C^s$ functions in the case that $s>2$. When $s>2$, our network size scales like $\varepsilon^{-d/s}$, which is significantly smaller than the one in Shaham et al. (2018), which is of the order $\varepsilon^{-d/2}$.

Moreover, the size of our ReLU network in Theorem 3.2 matches the lower bound in DeVore et al. (1989) up to a logarithmic factor for the approximation of functions in the Hölder space $\mathcal{H}^{s-1,1}([0,1]^d)$ defined on $[0,1]^d$.

THEOREM 3.3 Fix $d$ and $s$. Let $W$ be a positive integer and $T:\mathbb{R}^W\mapsto C([0,1]^d)$ be any mapping. Suppose there is a continuous map $\Theta:\mathcal{H}^{s-1,1}([0,1]^d)\mapsto\mathbb{R}^W$ such that $\|f-T(\Theta(f))\|_\infty\le\varepsilon$ for any $f\in\mathcal{H}^{s-1,1}([0,1]^d)$. Then $W\ge c\varepsilon^{-\frac{d}{s}}$ with $c$ depending on $s$ only.

We take $\mathbb{R}^W$ as the parameter space of a ReLU network, and $T$ as the transformation given by the ReLU network. Theorem 3.3 implies that, to approximate any $f\in\mathcal{H}^{s-1,1}([0,1]^d)$, the ReLU network needs at least $c\varepsilon^{-\frac{d}{s}}$ weight parameters. Although Theorem 3.3 holds for functions defined on $[0,1]^d$, our network size remains of the same order up to a logarithmic factor even when the function is supported on a manifold of dimension $d$.

On the other hand, the lower bound also reveals that the low-dimensional manifold model plays a vital role in reducing the network size. To approximate functions in $\mathcal{H}^{s-1,1}([0,1]^D)$ with accuracy $\varepsilon$, the minimal number of weight parameters is $O(\varepsilon^{-\frac{D}{s}})$. This lower bound cannot be improved without low-dimensional structures of data.
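To get a feel for the gap between the two regimes, one can compare (on a log scale) the size bound of Theorem 3.2 with the ambient-dimension bound $\varepsilon^{-D/(s+\alpha)}$; the snippet below does so with all hidden constants set to one, purely for illustration.

```python
import math

def log10_size_manifold(eps, d, D, s_alpha):
    """log10 of eps^{-d/(s+alpha)} log(1/eps) + D log(1/eps) + D log D (constant c_2 set to 1)."""
    n = eps ** (-d / s_alpha) * math.log(1 / eps) + D * math.log(1 / eps) + D * math.log(D)
    return math.log10(n)

def log10_size_ambient(eps, D, s_alpha):
    """log10 of eps^{-D/(s+alpha)}, the scaling without low-dimensional structure."""
    return (D / s_alpha) * math.log10(1 / eps)

eps, d, D, s_alpha = 0.1, 10, 224 * 224 * 3, 2.0
print(log10_size_manifold(eps, d, D, s_alpha))  # about 6.4: a network of roughly 10^6 parameters
print(log10_size_ambient(eps, D, s_alpha))      # about 75264: a network of roughly 10^75264 parameters
```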

4. Proof of the Statistical Recovery Theory

This section contains the proof sketches of Theorem 3.1 and Theorem 3.2.

4.1 Proof of Theorem 3.1

We dilate the $L^2$ risk of $\hat f_n$ using its empirical counterpart as
$$\mathbb{E}\Big[\int_{\mathcal{M}}\big(\hat f_n(x)-f_0(x)\big)^2 d\mathcal{D}_x(x)\Big] = \underbrace{2\,\mathbb{E}\Big[\frac{1}{n}\sum_{i=1}^n\big(\hat f_n(x_i)-f_0(x_i)\big)^2\Big]}_{T_1} + \underbrace{\mathbb{E}\Big[\int_{\mathcal{M}}\big(\hat f_n(x)-f_0(x)\big)^2 d\mathcal{D}_x(x)\Big] - 2\,\mathbb{E}\Big[\frac{1}{n}\sum_{i=1}^n\big(\hat f_n(x_i)-f_0(x_i)\big)^2\Big]}_{T_2}.$$
The proof of Theorem 3.1 contains two parts for the estimates of $T_1$ and $T_2$, respectively.

Bounding $T_1$. The first term $T_1$ reflects the bias of the estimation, and can be bounded through the following lemma.

LEMMA 4.1 Fix the neural network class $\mathcal{F}(R,\kappa,L,p,K)$. For any constant $\delta\in(0,2R)$, we have
$$T_1 \le 4\inf_{f\in\mathcal{F}(R,\kappa,L,p,K)}\int_{\mathcal{M}}\big(f(x)-f_0(x)\big)^2 d\mathcal{D}_x(x) + \frac{64\sigma^2\log\mathcal{N}\big(\delta,\mathcal{F}(R,\kappa,L,p,K),\|\cdot\|_\infty\big)}{n} + 32\sigma\delta,$$
where $\mathcal{N}\big(\delta,\mathcal{F}(R,\kappa,L,p,K),\|\cdot\|_\infty\big)$ denotes the $\delta$-covering number of $\mathcal{F}(R,\kappa,L,p,K)$ with respect to the $\ell_\infty$ norm.

Proof Sketch. The detailed proof is provided in Appendix C.1. We decompose $T_1$ into two parts:
$$T_1 = 2\,\mathbb{E}\Big[\frac{1}{n}\sum_{i=1}^n\big(\hat f_n(x_i)-f_0(x_i)-\xi_i+\xi_i\big)^2\Big] = 2\,\mathbb{E}\Big[\frac{1}{n}\sum_{i=1}^n\big(\hat f_n(x_i)-y_i+\xi_i\big)^2\Big]$$
$$\le 2\,\mathbb{E}\Big[\inf_{f\in\mathcal{F}(R,\kappa,L,p,K)}\frac{1}{n}\sum_{i=1}^n\big(f(x_i)-y_i\big)^2 + \frac{2}{n}\sum_{i=1}^n\xi_i\hat f_n(x_i)\Big]$$
$$= \underbrace{2\inf_{f\in\mathcal{F}(R,\kappa,L,p,K)}\int_{\mathcal{M}}\big(f(x)-f_0(x)\big)^2 d\mathcal{D}_x(x)}_{(A)} + \underbrace{4\,\mathbb{E}\Big[\frac{1}{n}\sum_{i=1}^n\xi_i\hat f_n(x_i)\Big]}_{(B)},$$
where the inequality is derived using the independence between the noise $\xi_i$ and the response $y_i$, and $\hat f_n$ being the empirical risk minimizer.

As can be seen, term (A) is the smallest $L^2$ risk achieved by the network class $\mathcal{F}(R,\kappa,L,p,K)$, which can be quantified using our approximation theory (Theorem 3.2). Term (B) is a complexity measure of the network class $\mathcal{F}(R,\kappa,L,p,K)$. To upper bound (B), we discretize $\mathcal{F}(R,\kappa,L,p,K)$ as $\{f_i^*\}_{i=1}^{\mathcal{N}(\delta,\mathcal{F}(R,\kappa,L,p,K),\|\cdot\|_\infty)}$. By the definition of covering, there exists $f^*$ such that $\|\hat f_n-f^*\|_\infty\le\delta$. Denote $\|f-f_0\|_n^2=\frac{1}{n}\sum_{i=1}^n\big(f(x_i)-f_0(x_i)\big)^2$; we cast (B) into
$$(B) = \mathbb{E}\Big[\frac{1}{n}\sum_{i=1}^n\xi_i\big(\hat f_n(x_i)-f^*(x_i)+f^*(x_i)-f_0(x_i)\big)\Big] \overset{(i)}{\le} \mathbb{E}\Big[\frac{1}{n}\sum_{i=1}^n\xi_i\big(f^*(x_i)-f_0(x_i)\big)\Big] + \delta\sigma$$
$$= \mathbb{E}\Big[\frac{\|f^*-f_0\|_n}{\sqrt n}\cdot\frac{\sum_{i=1}^n\xi_i\big(f^*(x_i)-f_0(x_i)\big)}{\sqrt n\,\|f^*-f_0\|_n}\Big] + \delta\sigma \overset{(ii)}{\le} \sqrt 2\,\mathbb{E}\Big[\frac{\|\hat f_n-f_0\|_n+\delta}{\sqrt n}\cdot\frac{\sum_{i=1}^n\xi_i\big(f^*(x_i)-f_0(x_i)\big)}{\sqrt n\,\|f^*-f_0\|_n}\Big] + \delta\sigma,$$
where (i) follows from Hölder's inequality and (ii) is obtained by some algebraic manipulation. To break the dependence between $f^*$ and the samples, we replace $f^*$ by any $f_j^*$ in the $\delta$-covering and observe that $\frac{\sum_{i=1}^n\xi_i(f^*(x_i)-f_0(x_i))}{\sqrt n\,\|f^*-f_0\|_n}\le\max_j\frac{\sum_{i=1}^n\xi_i(f_j^*(x_i)-f_0(x_i))}{\sqrt n\,\|f_j^*-f_0\|_n}$. Note that $\frac{\sum_{i=1}^n\xi_i(f_j^*(x_i)-f_0(x_i))}{\sqrt n\,\|f_j^*-f_0\|_n}$ is a sub-Gaussian random variable with parameter $\sigma^2$ for given $x_i$'s. It is well established in the literature on empirical processes (van der Vaart et al., 1996) that the maximum of a collection of sub-Gaussian random variables satisfies
$$\mathbb{E}\Big[\max_j\frac{\sum_{i=1}^n\xi_i\big(f_j^*(x_i)-f_0(x_i)\big)}{\sqrt n\,\|f_j^*-f_0\|_n}\Big] \le 2\sigma\sqrt{\log\mathcal{N}\big(\delta,\mathcal{F}(R,\kappa,L,p,K),\|\cdot\|_\infty\big)}.$$
Substituting the above inequality into (B) and combining (A) and (B), we have
$$T_1 = 2\,\mathbb{E}\big[\|\hat f_n-f_0\|_n^2\big] \le 2\inf_{f\in\mathcal{F}(R,\kappa,L,p,K)}\int_{\mathcal{M}}\big(f(x)-f_0(x)\big)^2 d\mathcal{D}_x(x) + 8\sqrt 2\,\sigma\big(\mathbb{E}\big[\|\hat f_n-f_0\|_n\big]+\delta\big)\sqrt{\frac{\log\mathcal{N}\big(\delta,\mathcal{F}(R,\kappa,L,p,K),\|\cdot\|_\infty\big)}{n}} + 4\delta\sigma.$$
Some algebra further gives rise to the desired result
$$T_1 \le 4\inf_{f\in\mathcal{F}(R,\kappa,L,p,K)}\int_{\mathcal{M}}\big(f(x)-f_0(x)\big)^2 d\mathcal{D}_x(x) + \frac{64\sigma^2\log\mathcal{N}\big(\delta,\mathcal{F}(R,\kappa,L,p,K),\|\cdot\|_\infty\big)}{n} + 32\sigma\delta. \qquad\square$$

Bounding $T_2$. We observe that $T_2$ is the difference between the population $L^2$ risk of $\hat f_n$ and its empirical counterpart. However, bounding such a difference is distinct from conventional concentration results due to the scaling factor 2 before the empirical risk.

LEMMA 4.2 For any constant $\delta\in(0,2R)$, $T_2$ satisfies
$$T_2 \le \frac{52R^2}{3n}\log\mathcal{N}\big(\delta/4R,\mathcal{F}(R,\kappa,L,p,K),\|\cdot\|_\infty\big) + \Big(4+\frac{1}{2R}\Big)\delta.$$

Proof Sketch. The detailed proof is deferred to Appendix C.2. For notational simplicity, we denote $g(x)=(\hat f_n(x)-f_0(x))^2$, which satisfies $\|g\|_\infty\le 4R^2$. Applying the inequality $\int_{\mathcal{M}}g^2\,d\mathcal{D}_x(x)\le 4R^2\int_{\mathcal{M}}g\,d\mathcal{D}_x(x)$ (Barron, 1991), we rewrite $T_2$ as
$$T_2 = \mathbb{E}\Big[\int_{\mathcal{M}}g(x)\,d\mathcal{D}_x(x) - \frac{2}{n}\sum_{i=1}^n g(x_i)\Big] = 2\,\mathbb{E}\Big[\int_{\mathcal{M}}g(x)\,d\mathcal{D}_x(x) - \frac{1}{n}\sum_{i=1}^n g(x_i) - \frac{1}{2}\int_{\mathcal{M}}g(x)\,d\mathcal{D}_x(x)\Big]$$
$$\le 2\,\mathbb{E}\Big[\int_{\mathcal{M}}g(x)\,d\mathcal{D}_x(x) - \frac{1}{n}\sum_{i=1}^n g(x_i) - \frac{1}{8R^2}\int_{\mathcal{M}}g^2(x)\,d\mathcal{D}_x(x)\Big].$$
We now utilize the symmetrization technique in the existing literature on nonparametric statistics (van der Vaart et al., 1996; Györfi et al., 2006). Specifically, let the $\bar x_i$'s be independent replications of the $x_i$'s and the $U_i$'s be i.i.d. Rademacher random variables, i.e., $\mathbb{P}(U_i=1)=\mathbb{P}(U_i=-1)=1/2$. We bound $T_2$ as
$$T_2 \le 2\,\mathbb{E}\Big[\sup_{g\in\mathcal{G}}\int_{\mathcal{M}}g(x)\,d\mathcal{D}_x(x) - \frac{1}{n}\sum_{i=1}^n g(x_i) - \frac{1}{8R^2}\int_{\mathcal{M}}g^2(x)\,d\mathcal{D}_x(x)\Big]$$
$$\le 2\,\mathbb{E}_{x,\bar x}\Big[\sup_{g\in\mathcal{G}}\frac{1}{n}\sum_{i=1}^n\big(g(\bar x_i)-g(x_i)\big) - \frac{1}{8R^2}\int_{\mathcal{M}}g^2(x)\,d\mathcal{D}_x(x)\Big]$$
$$\le 2\,\mathbb{E}_{x,\bar x,U}\Big[\sup_{g\in\mathcal{G}}\frac{1}{n}\sum_{i=1}^n U_i\big(g(\bar x_i)-g(x_i)\big) - \frac{1}{16R^2}\mathbb{E}_{x,\bar x}\big[g^2(x)+g^2(\bar x)\big]\Big],$$
where $\mathcal{G}=\{g=(f-f_0)^2 \mid f\in\mathcal{F}(R,\kappa,L,p,K)\}$ and $\mathbb{E}_x$ denotes the expectation with respect to $x$. Note here that $g^2(x)+g^2(\bar x)$ contributes as the variance term of $U_i(g(\bar x_i)-g(x_i))$, which yields a fast convergence of $T_2$ as $n$ grows.

Similar to bounding $T_1$, we discretize the function space $\mathcal{G}$ using a $\delta$-covering denoted by $\mathcal{G}^*$. This allows us to replace the supremum by the maximum over a finite set:
$$T_2 \le 2\,\mathbb{E}_{x,\bar x,U}\Big[\max_{g^*\in\mathcal{G}^*}\frac{1}{n}\sum_{i=1}^n U_i\big(g^*(\bar x_i)-g^*(x_i)\big) - \frac{1}{16R^2}\mathbb{E}_{x,\bar x}\big[(g^*)^2(x)+(g^*)^2(\bar x)\big]\Big] + \Big(4+\frac{1}{2R}\Big)\delta.$$
We can bound the above maximum by Bernstein's inequality, which yields
$$T_2 \le \frac{52R^2}{3n}\log\mathcal{N}\big(\delta,\mathcal{G},\|\cdot\|_\infty\big) + \Big(4+\frac{1}{2R}\Big)\delta.$$
The last step is to relate the covering number of $\mathcal{G}$ to that of $\mathcal{F}(R,\kappa,L,p,K)$. Specifically, consider any $g_1,g_2\in\mathcal{G}$ with $g_1=(f_1-f_0)^2$ and $g_2=(f_2-f_0)^2$, respectively. We can derive
$$\|g_1-g_2\|_\infty = \sup_{x\in\mathcal{M}}|f_1(x)-f_2(x)|\,|f_1(x)+f_2(x)-2f_0(x)| \le 4R\|f_1-f_2\|_\infty.$$
Therefore, the inequality $\mathcal{N}\big(\delta,\mathcal{G},\|\cdot\|_\infty\big)\le\mathcal{N}\big(\delta/4R,\mathcal{F}(R,\kappa,L,p,K),\|\cdot\|_\infty\big)$ holds, which implies
$$T_2 \le \frac{52R^2}{3n}\log\mathcal{N}\big(\delta/4R,\mathcal{F}(R,\kappa,L,p,K),\|\cdot\|_\infty\big) + \Big(4+\frac{1}{2R}\Big)\delta. \qquad\square$$

Bounding the Covering Number $\mathcal{N}\big(\delta,\mathcal{F}(R,\kappa,L,p,K),\|\cdot\|_\infty\big)$. Since each weight parameter in the network is bounded by a constant $\kappa$, we construct a covering by partitioning the range of each weight parameter into a uniform grid. Consider $f,f'\in\mathcal{F}(R,\kappa,L,p,K)$ with each weight parameter differing by at most $h$. By an induction on the number of layers in the network, we show that the $\ell_\infty$ norm of the difference $f-f'$ scales as
$$\|f-f'\|_\infty \le hL(pB+2)(\kappa p)^{L-1}.$$
As a result, to achieve a $\delta$-covering, it suffices to choose $h$ such that $hL(pB+2)(\kappa p)^{L-1}=\delta$. Therefore, the covering number is bounded by
$$\mathcal{N}\big(\delta,\mathcal{F}(R,\kappa,L,p,K),\|\cdot\|_\infty\big) \le \Big(\frac{\kappa}{h}\Big)^K \le \Big(\frac{\kappa L(pB+2)(\kappa p)^{L-1}}{\delta}\Big)^K.$$
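Plugging representative values into this bound shows how $\log\mathcal{N}$ scales roughly like $KL\log(\kappa p/\delta)$, which is the quantity that enters the risk bound in the next step; the numbers below are illustrative only.

```python
import math

def log_covering_bound(delta, kappa, L, p, K, B):
    """K * log( kappa * L * (p*B + 2) * (kappa*p)^(L-1) / delta ), computed in log form."""
    return K * (math.log(kappa * L * (p * B + 2)) + (L - 1) * math.log(kappa * p) - math.log(delta))

# A sparse network with L = 20 layers, width p = 1000, K = 1e5 nonzero parameters, kappa = B = 1, delta = 1/n:
print(log_covering_bound(delta=1e-6, kappa=1.0, L=20, p=1000, K=1e5, B=1.0))  # about 1.5e7
```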

Choosing $\delta$ and Bounding the $L^2$ Risk. Combining $T_1$ and $T_2$ together and substituting the covering number, we derive
$$\mathbb{E}\Big[\int_{\mathcal{M}}\big(\hat f_n(x)-f_0(x)\big)^2 d\mathcal{D}_x(x)\Big] \le 4\inf_{f\in\mathcal{F}(R,\kappa,L,p,K)}\int_{\mathcal{M}}\big(f(x)-f_0(x)\big)^2 d\mathcal{D}_x(x) + \frac{52R^2+192\sigma^2}{3n}\log\mathcal{N}\big(\delta/4R,\mathcal{F}(R,\kappa,L,p,K),\|\cdot\|_\infty\big) + \Big(4+\frac{1}{2R}+32\sigma\Big)\delta$$
$$\le 4\inf_{f\in\mathcal{F}(R,\kappa,L,p,K)}\int_{\mathcal{M}}\big(f(x)-f_0(x)\big)^2 d\mathcal{D}_x(x) + \frac{52R^2+192\sigma^2}{3n}\,K\log\frac{4R\kappa L(pB+2)(\kappa p)^{L-1}}{\delta} + \Big(4+\frac{1}{2R}+32\sigma\Big)\delta.$$
Choosing $\delta=1/n$ gives rise to
$$\mathbb{E}\Big[\int_{\mathcal{M}}\big(\hat f_n(x)-f_0(x)\big)^2 d\mathcal{D}_x(x)\Big] \le 4\inf_{f\in\mathcal{F}(R,\kappa,L,p,K)}\int_{\mathcal{M}}\big(f(x)-f_0(x)\big)^2 d\mathcal{D}_x(x) + O\Big(\frac{R^2+\sigma^2}{n}\,KL\log(R\kappa Lpn) + \frac{1}{n}\Big).$$
We further set $\inf_{f\in\mathcal{F}(R,\kappa,L,p,K)}\|f(x)-f_0(x)\|_\infty\le\varepsilon$. Theorem 3.2 suggests that we choose $L=O\big(\log\frac{1}{\varepsilon}\big)$ and $K=O\big(\varepsilon^{-\frac{d}{s+\alpha}}\log\frac{1}{\varepsilon}\big)$. Plugging in $L$ and $K$, we have
$$\mathbb{E}\Big[\int_{\mathcal{M}}\big(\hat f_n(x)-f_0(x)\big)^2 d\mathcal{D}_x(x)\Big] = O\Big(\varepsilon^2 + \frac{R^2+\sigma^2}{n}\,\varepsilon^{-\frac{d}{s+\alpha}}\log^2\frac{1}{\varepsilon}\,\log(R\kappa Lpn) + \frac{1}{n}\Big).$$
To balance the error terms, we pick $\varepsilon$ satisfying $\varepsilon^2=\frac{1}{n}\varepsilon^{-\frac{d}{s+\alpha}}$, which gives $\varepsilon=n^{-\frac{s+\alpha}{d+2(s+\alpha)}}$. The proof of Theorem 3.1 is complete by substituting $\varepsilon=n^{-\frac{s+\alpha}{d+2(s+\alpha)}}$ and rearranging terms.
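For completeness, the balancing step is the following short computation (just the algebra behind the last sentence):

```latex
\varepsilon^2 = \frac{1}{n}\,\varepsilon^{-\frac{d}{s+\alpha}}
\;\Longleftrightarrow\;
\varepsilon^{\,2+\frac{d}{s+\alpha}} = \frac{1}{n}
\;\Longleftrightarrow\;
\varepsilon = n^{-\frac{s+\alpha}{2(s+\alpha)+d}},
\qquad\text{so that}\qquad
\varepsilon^2 = n^{-\frac{2(s+\alpha)}{2(s+\alpha)+d}},
```

which, after collecting the logarithmic factors, is exactly the rate in Theorem 3.1.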

4.2 Proof of Theorem 3.2

We next sketch the proof of Theorem 3.2. A preliminary version has appeared in Chen et al. (2019). Before we proceed, we show how to approximate the multiplication operation using ReLU networks. This operation is heavily used in the Taylor approximation sub-network, since Taylor polynomials involve sums of products. We first show that ReLU networks can approximate quadratic functions.

LEMMA 4.3 (Proposition 2 in Yarotsky (2017)) The function $f(x)=x^2$ with $x\in[0,1]$ can be approximated by a ReLU network with any error $\varepsilon>0$. The network has depth, number of neurons, and number of weight parameters no more than $c\log(1/\varepsilon)$ with an absolute constant $c$.

This lemma is proved in Appendix A.1. The idea is to approximate quadratic functions using a weighted sum of a series of sawtooth functions. Those sawtooth functions are obtained by composing the triangular function
$$g(x) = 2\,\mathrm{ReLU}(x) - 4\,\mathrm{ReLU}(x-1/2) + 2\,\mathrm{ReLU}(x-1),$$
which can be implemented by a single-layer ReLU network. We then approximate the multiplication operation by invoking the identity $ab=\frac{1}{4}\big((a+b)^2-(a-b)^2\big)$, where the two squares can be approximated by ReLU networks as in Lemma 4.3.

COROLLARY 4.1 (Proposition 3 in Yarotsky (2017)) Given a constant $C>0$ and $\varepsilon\in(0,C^2)$, there is a ReLU network which implements a function $\times:\mathbb{R}^2\mapsto\mathbb{R}$ such that: 1) for all inputs $x$ and $y$ satisfying $|x|\le C$ and $|y|\le C$, we have $|\times(x,y)-xy|\le\varepsilon$; 2) the depth and the number of weight parameters of the network are no more than $c\log\frac{C^2}{\varepsilon}$ with an absolute constant $c$.
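The constructions in Lemma 4.3 and Corollary 4.1 can be written down directly; the sketch below follows Yarotsky's recipe (compose the triangular function to get sawtooth functions, subtract their weighted sum from $x$ to approximate $x^2$, and use the polarization identity for products). The function names and the error bookkeeping in the comments are ours.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def g_triangle(x):
    """g(x) = 2 ReLU(x) - 4 ReLU(x - 1/2) + 2 ReLU(x - 1): a one-layer 'tent' on [0, 1]."""
    return 2 * relu(x) - 4 * relu(x - 0.5) + 2 * relu(x - 1.0)

def square_hat(x, m):
    """Approximate x^2 on [0,1] by x - sum_{k<=m} g_k(x)/4^k; the error is at most 2^{-2(m+1)}."""
    out, g = np.asarray(x, dtype=float).copy(), np.asarray(x, dtype=float).copy()
    for k in range(1, m + 1):
        g = g_triangle(g)           # g_k = g composed k times (a sawtooth with 2^{k-1} teeth)
        out = out - g / 4 ** k
    return out

def mult_hat(x, y, C, m):
    """Approximate x*y for |x|,|y| <= C via xy = C^2 [ (|x+y|/2C)^2 - (|x-y|/2C)^2 ]."""
    s = relu(x + y) + relu(-(x + y))   # |x + y|, built from ReLUs
    d = relu(x - y) + relu(-(x - y))   # |x - y|
    return C ** 2 * (square_hat(s / (2 * C), m) - square_hat(d / (2 * C), m))

x = np.linspace(0, 1, 101)
print(np.max(np.abs(square_hat(x, m=8) - x ** 2)))        # roughly 2^{-18}
print(abs(mult_hat(0.7, -0.4, C=1.0, m=8) - 0.7 * -0.4))  # small
```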

The ReLU network in Theorem 3.2 is constructed in the following 5 steps.

Step 1. Construction of an atlas. Denote the open Euclidean ball with center $c$ and radius $r$ in $\mathbb{R}^D$ by $B(c,r)$. For any $r$, the collection $\{B(x,r)\}_{x\in\mathcal{M}}$ is an open cover of $\mathcal{M}$. Since $\mathcal{M}$ is compact, there exists a finite collection of points $c_i$ for $i=1,\dots,C_{\mathcal{M}}$ such that $\mathcal{M}\subset\bigcup_i B(c_i,r)$.

FIG. 4. Curvature decides the number of charts: a smaller reach requires more charts.

Now we pick the radius $r<\tau/2$ so that
$$U_i = \mathcal{M}\cap B(c_i,r)$$
is diffeomorphic to a ball in $\mathbb{R}^d$ (Niyogi et al., 2008), as illustrated in Figure 4. Let $\{(U_i,\phi_i)\}_{i=1}^{C_{\mathcal{M}}}$ be an atlas on $\mathcal{M}$, where $\phi_i$ is to be defined in Step 2. The number of charts $C_{\mathcal{M}}$ is upper bounded by
$$C_{\mathcal{M}} \le \Big\lceil\frac{\mathrm{SA}(\mathcal{M})}{r^d}\,T_d\Big\rceil,$$
where $\mathrm{SA}(\mathcal{M})$ is the surface area of $\mathcal{M}$, and $T_d$ is the thickness of the $U_i$'s, which is defined as the average number of $U_i$'s that contain a point on $\mathcal{M}$ (Conway et al., 1987).

REMARK 4.1 The thickness $T_d$ scales approximately linearly in $d$. As shown in Conway et al. (1987), there exist coverings with
$$\frac{d}{e\sqrt e} \lesssim T_d \le d\log d + d\log\log d + 5d.$$
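As a small empirical illustration of Step 1, one can greedily pick ball centers of radius $r$ on sampled points and count how many are needed; the greedy rule below is our own stand-in for the covering argument, not the construction used in the proof.

```python
import numpy as np

def greedy_cover(points, r):
    """Pick centers from `points` until every point is within distance r of some center."""
    centers, uncovered = [], np.ones(len(points), dtype=bool)
    while uncovered.any():
        c = points[np.argmax(uncovered)]                    # first still-uncovered point
        centers.append(c)
        uncovered &= np.linalg.norm(points - c, axis=1) > r
    return np.array(centers)

# A 1-dimensional manifold (unit circle, reach tau = 1) embedded in R^50:
rng = np.random.default_rng(0)
t = rng.uniform(0.0, 2.0 * np.pi, size=2000)
X = np.zeros((2000, 50))
X[:, 0], X[:, 1] = np.cos(t), np.sin(t)
print(len(greedy_cover(X, r=0.4)))   # the chart count reflects d = 1, not the ambient D = 50
```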

Step 2. Projection with rescaling and translation. We denote the tangent space at ci as

Tci(M ) = span(vi1, . . . ,vid),

Page 15: Nonparametric Regression on Low-Dimensional Manifolds

NONPARAMETRIC REGRESSION ON LOW-DIMENSIONAL MANIFOLDS USING RELU NETWORKS 15 of 42

where {vi1, . . . ,vid} form an orthonormal basis. We obtain the matrix Vi = [vi1, . . . ,vid ] ∈ RD×d byconcatenating vi j’s as column vectors.


FIG. 5. Projecting x_j using a matched chart (blue) (U_j, \phi_j), and an unmatched chart (green) (U_i, \phi_i).

Define

\phi_i(x) = b_i\big(V_i^\top (x - c_i) + u_i\big) \in [0,1]^d

for any x \in U_i, where b_i \in (0,1] is a scaling factor and u_i is a translation vector. Since U_i is bounded, we can choose proper b_i and u_i to guarantee \phi_i(x) \in [0,1]^d. We rescale and translate the projection to ease the notation for the development of local Taylor approximations in Step 4. We also remark that each \phi_i is a linear function, and can be realized by a single-layer linear network.
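As a concrete illustration, the following minimal NumPy sketch implements one chart projection \phi_i as an affine map. All numerical values below (the center c_i, tangent basis V_i, and the scaling and translation b_i, u_i) are hypothetical placeholders for a toy one-dimensional manifold; they are not quantities prescribed by the paper.

```python
import numpy as np

def chart_projection(x, c_i, V_i, b_i, u_i):
    """Affine chart map phi_i(x) = b_i * (V_i^T (x - c_i) + u_i), mapping U_i into [0,1]^d."""
    return b_i * (V_i.T @ (x - c_i) + u_i)

# Toy example: a 1-dimensional manifold (the unit circle) embedded in R^2.
D, d = 2, 1
c_i = np.array([1.0, 0.0])                 # chart center on the circle
V_i = np.array([[0.0], [1.0]])             # orthonormal basis of the tangent space at c_i
b_i, u_i = 0.5, np.array([1.0])            # chosen so that phi_i(U_i) lands inside [0,1]^d

x = np.array([np.cos(0.3), np.sin(0.3)])   # a point of the circle near c_i
print(chart_projection(x, c_i, V_i, b_i, u_i))   # approximately [0.648]
```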

Step 3. Chart determination. This step locates the charts that a given input x belongs to. This avoids projecting x using unmatched charts (i.e., x \notin U_j for some j), as illustrated in Figure 5.

An input x can belong to multiple charts, and the chart determination sub-network determines all of these charts. This can be realized by composing an indicator function with the squared Euclidean distance

d_i^2(x) = \|x - c_i\|_2^2 = \sum_{j=1}^{D} (x_j - c_{i,j})^2

for i = 1, \dots, C_{\mathcal{M}}. The squared distance d_i^2(x) is a sum of univariate quadratic functions; thus, we can apply Lemma 4.3 to approximate d_i^2(x) by ReLU networks. Denote by h_{\rm sq} an approximation of the quadratic function x^2 on [0,1] with approximation error \nu. Then we define

\hat d_i^2(x) = 4B^2 \sum_{j=1}^{D} h_{\rm sq}\left( \frac{|x_j - c_{i,j}|}{2B} \right)

as an approximation of d_i^2(x). The approximation error is \|\hat d_i^2 - d_i^2\|_\infty \le 4B^2 D \nu, by the triangle inequality. We consider an approximation of the indicator function of an interval as in Figure 6:

\hat 1_\Delta(a) =
\begin{cases}
1, & a \le r^2 - \Delta + 4B^2 m\nu,\\
-\frac{1}{\Delta - 8B^2 m\nu}\, a + \frac{r^2 - 4B^2 m\nu}{\Delta - 8B^2 m\nu}, & a \in [r^2 - \Delta + 4B^2 m\nu,\; r^2 - 4B^2 m\nu],\\
0, & a > r^2 - 4B^2 m\nu,
\end{cases} \qquad (4.1)

where \Delta (\Delta > 8B^2 m\nu) will be chosen later according to the accuracy \varepsilon.


FIG. 6. Chart determination utilizes the composition of the approximated distance function \hat d_i^2 and the indicator function \hat 1_\Delta.

To implement \hat 1_\Delta(a), we consider a basic step function g = 2\,\mathrm{ReLU}\big(x - 0.5(r^2 - 4B^2 m\nu)\big) - 2\,\mathrm{ReLU}\big(x - r^2 + 4B^2 m\nu\big). It is straightforward to check that

g^k(a) = \underbrace{g \circ \cdots \circ g}_{k}(a) =
\begin{cases}
0, & a < (1 - 2^{-k})(r^2 - 4B^2 m\nu),\\
2^k(a - r^2 + 4B^2 m\nu) + r^2 - 4B^2 m\nu, & a \in \big[(1 - \tfrac{1}{2^k})(r^2 - 4B^2 m\nu),\; r^2 - 4B^2 m\nu\big],\\
r^2 - 4B^2 m\nu, & a > r^2 - 4B^2 m\nu.
\end{cases}

Let \hat 1_\Delta = 1 - \frac{1}{r^2 - 4B^2 m\nu}\, g^k. It suffices to choose k satisfying (1 - \frac{1}{2^k})(r^2 - 4B^2 m\nu) \ge r^2 - \Delta + 4B^2 m\nu, which yields k = \big\lceil \log_2 \frac{r^2 - 4B^2 m\nu}{\Delta - 8B^2 m\nu} \big\rceil. We use \hat 1_\Delta \circ \hat d_i^2 to approximate the indicator function of U_i:

• if x \notin U_i, i.e., d_i^2(x) > r^2, we have \hat 1_\Delta \circ \hat d_i^2(x) = 0;

• if x \in U_i and d_i^2(x) \le r^2 - \Delta, we have \hat 1_\Delta \circ \hat d_i^2(x) = 1.
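For intuition, the chart-determination logic can be sketched in a few lines of NumPy. The sketch below uses the exact squared distance in place of the ReLU approximation \hat d_i^2 (i.e., \nu = 0) and the ramp \hat 1_\Delta from (4.1); the center c_i, radius r, and width \Delta are placeholder values.

```python
import numpy as np

def indicator_ramp(a, r, Delta):
    """Piecewise-linear surrogate of 1{a <= r^2} from (4.1), here with nu = 0."""
    # Equals 1 for a <= r^2 - Delta, 0 for a >= r^2, and ramps linearly in between.
    return np.clip((r**2 - a) / Delta, 0.0, 1.0)

def chart_indicator(x, c_i, r, Delta):
    """Approximate indicator of U_i = M \\cap B(c_i, r): the composition 1_Delta(d_i^2(x))."""
    d2 = np.sum((x - c_i) ** 2)   # exact d_i^2; the paper replaces x^2 by the ReLU net h_sq
    return indicator_ramp(d2, r, Delta)

c_i, r, Delta = np.zeros(3), 0.5, 0.05
print(chart_indicator(np.array([0.1, 0.1, 0.0]), c_i, r, Delta))  # 1.0: well inside U_i
print(chart_indicator(np.array([0.7, 0.0, 0.0]), c_i, r, Delta))  # 0.0: outside U_i
```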

Step 4. Taylor approximation. In each chart (U_i, \phi_i), we locally approximate f using Taylor polynomials of order s, as shown in Figure 7. Specifically, we decompose f as

f = \sum_{i=1}^{C_{\mathcal{M}}} f_i \quad \text{with} \quad f_i = f \rho_i,

where \rho_i is an element of a C^\infty partition of unity on \mathcal{M} which is supported inside U_i. The existence of such a partition of unity is guaranteed by Proposition 2.6. Since \mathcal{M} is compact and \rho_i is C^\infty, f_i preserves the regularity (smoothness) of f, so that f_i \in H^{s,\alpha}(\mathcal{M}) for i = 1, \dots, C_{\mathcal{M}}.


FIG. 7. Locally approximate f in each chart (Ui,φi) using Taylor polynomials.

LEMMA 4.4 Suppose Assumption 3 holds. For i = 1, \dots, C_{\mathcal{M}}, the function f_i belongs to H^{s,\alpha}(\mathcal{M}): there exists a Hölder coefficient L_i depending on d, f_i, and \phi_i such that for any |\mathbf{s}| = s, we have

\Big| D^{\mathbf{s}}(f_i \circ \phi_i^{-1})\big|_{\phi_i(x_1)} - D^{\mathbf{s}}(f_i \circ \phi_i^{-1})\big|_{\phi_i(x_2)} \Big| \le L_i \|\phi_i(x_1) - \phi_i(x_2)\|_2^{\alpha}, \quad \forall x_1, x_2 \in U_i.

Proof Sketch. We provide a sketch here; details can be found in Appendix B.1. Denote g_1 = f \circ \phi_i^{-1} and g_2 = \rho_i \circ \phi_i^{-1}. By the Leibniz rule, we have

D^{\mathbf{s}}(f_i \circ \phi_i^{-1}) = D^{\mathbf{s}}(g_1 \times g_2) = \sum_{|\mathbf{p}| + |\mathbf{q}| = s} \binom{\mathbf{s}}{\mathbf{p}} D^{\mathbf{p}} g_1\, D^{\mathbf{q}} g_2.


Consider each term in the sum: for any x_1, x_2 \in U_i,

\big| D^{\mathbf{p}} g_1 D^{\mathbf{q}} g_2 \big|_{\phi_i(x_1)} - D^{\mathbf{p}} g_1 D^{\mathbf{q}} g_2 \big|_{\phi_i(x_2)} \big|
\le |D^{\mathbf{p}} g_1(\phi_i(x_1))|\, \big| D^{\mathbf{q}} g_2|_{\phi_i(x_1)} - D^{\mathbf{q}} g_2|_{\phi_i(x_2)} \big| + |D^{\mathbf{q}} g_2(\phi_i(x_2))|\, \big| D^{\mathbf{p}} g_1|_{\phi_i(x_1)} - D^{\mathbf{p}} g_1|_{\phi_i(x_2)} \big|
\le \lambda_i \theta_{i,\alpha} \|\phi_i(x_1) - \phi_i(x_2)\|_2^{\alpha} + \mu_i \beta_{i,\alpha} \|\phi_i(x_1) - \phi_i(x_2)\|_2^{\alpha}.

Here \lambda_i and \mu_i are uniform upper bounds on the derivatives of g_1 and g_2, respectively. The last inequality above is derived as follows: by the mean value theorem, we have

\big| D^{\mathbf{q}} g_2|_{\phi_i(x_1)} - D^{\mathbf{q}} g_2|_{\phi_i(x_2)} \big| \le \sqrt{d}\,\mu_i \|\phi_i(x_1) - \phi_i(x_2)\|_2
= \sqrt{d}\,\mu_i \|\phi_i(x_1) - \phi_i(x_2)\|_2^{1-\alpha} \|\phi_i(x_1) - \phi_i(x_2)\|_2^{\alpha}
\le \sqrt{d}\,\mu_i (2r)^{1-\alpha} \|\phi_i(x_1) - \phi_i(x_2)\|_2^{\alpha},

where the last inequality is due to the fact that \|\phi_i(x_1) - \phi_i(x_2)\|_2 \le b_i \|V_i\|_2 \|x_1 - x_2\|_2 \le 2r. Then we set \theta_{i,\alpha} = \sqrt{d}\,\mu_i (2r)^{1-\alpha}, and by a similar argument, \beta_{i,\alpha} = \sqrt{d}\,\lambda_i (2r)^{1-\alpha}. We complete the proof by taking L_i = 2^{s+1}\sqrt{d}\,\lambda_i\mu_i (2r)^{1-\alpha}. □

Lemma 4.4 is crucial for the error estimate in the local approximation of f_i \circ \phi_i^{-1} by Taylor polynomials. This error estimate is given in the following theorem, where some of the proof techniques are from Theorem 1 in Yarotsky (2017).

THEOREM 4.1 Let f_i = f \rho_i as in Step 4. For any \delta \in (0,1), there exists a ReLU network structure such that, if the weight parameters are properly chosen, the network yields an approximation of f_i \circ \phi_i^{-1} uniformly with error \delta. Such a network has

1. no more than c\big(\log\frac{1}{\delta} + 1\big) layers,

2. at most c'\,\delta^{-\frac{d}{s+\alpha}}\big(\log\frac{1}{\delta} + 1\big) neurons and weight parameters,

where c, c' depend on s, d, and f_i \circ \phi_i^{-1}.

Proof Sketch. The detailed proof is provided in Appendix B.2. The proof consists of two steps:

1. Approximate f_i \circ \phi_i^{-1} using a weighted sum of Taylor polynomials;

2. Implement the weighted sum of Taylor polynomials using ReLU networks.

Specifically, we set up a uniform grid and divide [0,1]^d into small cubes, and then approximate f_i \circ \phi_i^{-1} by its s-th order Taylor polynomial in each cube. To implement such polynomials by ReLU networks, we recursively apply the multiplication operator \hat\times in Corollary 4.1, since these polynomials are sums of products of different variables. □

Step 5. Estimating the total error. We have collected all the ingredients to implement the entire ReLU network approximating f on \mathcal{M}. Recall that the network structure consists of three main sub-networks, as demonstrated in Figure 3. Let \hat\times be an approximation of the multiplication operator in the pairing sub-network with error \eta. Accordingly, the function given by the whole network is

\hat f = \sum_{i=1}^{C_{\mathcal{M}}} \hat\times\big( \tilde f_i,\; \hat 1_\Delta \circ \hat d_i^2 \big) \quad \text{with} \quad \tilde f_i = \hat f_i \circ \phi_i,

where \hat f_i is the approximation of f_i \circ \phi_i^{-1} by Taylor polynomials from Theorem 4.1.
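Purely as an illustration of how the three sub-networks are combined (not the paper's construction itself), the assembly of \hat f can be phrased as follows; the chart maps, Taylor sub-networks, and indicators are passed in as callables, and exact multiplication is used in place of \hat\times (so \eta = 0).

```python
import numpy as np

def network_output(x, charts):
    """Sum over charts of (Taylor sub-network in local coordinates) times (chart indicator).

    `charts` is a list of triples (phi_i, f_hat_i, ind_i):
      phi_i   : R^D -> [0,1]^d, the chart projection,
      f_hat_i : [0,1]^d -> R, the Taylor-approximation sub-network for f_i o phi_i^{-1},
      ind_i   : R^D -> [0,1], the chart-determination sub-network 1_Delta o d_i^2.
    """
    return sum(f_hat_i(phi_i(x)) * ind_i(x) for (phi_i, f_hat_i, ind_i) in charts)

# Toy demo with a single chart covering the whole input space.
charts = [(lambda x: x, lambda z: float(np.sum(z**2)), lambda x: 1.0)]
print(network_output(np.array([0.3, 0.4]), charts))  # 0.25
```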

The total error can be decomposed into three components according to the following theorem.


THEOREM 4.2 We have \|\hat f - f\|_\infty \le \sum_{i=1}^{C_{\mathcal{M}}} (A_{i,1} + A_{i,2} + A_{i,3}), where, for each i = 1, \dots, C_{\mathcal{M}},

A_{i,1} = \big\| \hat\times( \tilde f_i, \hat 1_\Delta \circ \hat d_i^2 ) - \tilde f_i \times (\hat 1_\Delta \circ \hat d_i^2) \big\|_\infty \le \eta,

A_{i,2} = \big\| \tilde f_i \times (\hat 1_\Delta \circ \hat d_i^2) - f_i \times (\hat 1_\Delta \circ \hat d_i^2) \big\|_\infty \le \delta,

A_{i,3} = \big\| f_i \times (\hat 1_\Delta \circ \hat d_i^2) - f_i \times 1(x \in U_i) \big\|_\infty \le \frac{c(\pi + 1)}{r(1 - r/\tau)}\Delta \quad \text{for some constant } c.

Here 1(x \in U_i) is the indicator function of U_i. Theorem 4.2 is proved in Appendix B.3. In order to achieve a total approximation error of \varepsilon, i.e., \|\hat f - f\|_\infty \le \varepsilon, we need to control the errors of the three sub-networks. In other words, we need to choose \nu for \hat d_i^2, \Delta for \hat 1_\Delta, \delta for \tilde f_i, and \eta for \hat\times. Note that A_{i,1} is the error from the pairing sub-network, A_{i,2} is the approximation error of the Taylor approximation sub-network, and A_{i,3} is the error from the chart determination sub-network. The bounds on A_{i,1} and A_{i,2} follow directly from the constructions of \hat\times and \tilde f_i. The estimate of A_{i,3} requires some technical analysis, since \|\hat 1_\Delta \circ \hat d_i^2 - 1(x \in U_i)\|_\infty = 1. Note that

\hat 1_\Delta \circ \hat d_i^2(x) - 1(x \in U_i) = 0

whenever \|x - c_i\|_2^2 < r^2 - \Delta or \|x - c_i\|_2^2 > r^2, so we only need to show that |f_i(x)| is sufficiently small in the region K_i defined below.

LEMMA 4.5 For any i = 1, \dots, C_{\mathcal{M}}, denote

K_i = \{x \in \mathcal{M} : r^2 - \Delta \le \|x - c_i\|_2^2 \le r^2\}.

Then there exists a constant c depending on the f_i's and \phi_i's such that

\max_{x \in K_i} |f_i(x)| \le \frac{c(\pi + 1)}{r(1 - r/\tau)}\Delta.

Proof Sketch. The detailed proof is in Appendix B.4. The function f_i \circ \phi_i^{-1} is defined on \phi_i(U_i) \subset [0,1]^d. We extend f_i \circ \phi_i^{-1} to [0,1]^d by setting f_i \circ \phi_i^{-1}(z) = 0 for z \in [0,1]^d \setminus \phi_i(U_i). It is easy to verify that this extension preserves the regularity of f_i \circ \phi_i^{-1}, since supp(f_i) is a compact subset of U_i. By the mean value theorem, for any x, y \in K_i, there exists z = \beta\phi_i(x) + (1 - \beta)\phi_i(y) for some \beta \in (0,1) such that

|f_i(x) - f_i(y)| \le \|\nabla (f_i \circ \phi_i^{-1})(z)\|_2 \|\phi_i(x) - \phi_i(y)\|_2 \le \|\nabla (f_i \circ \phi_i^{-1})(z)\|_2\, b_i \|V_i\|_2 \|x - y\|_2.

We pick y \in \partial U_i (the boundary of U_i) so that f_i(y) = 0. Since f_i \in H^{s,\alpha}(\mathcal{M}) and \mathcal{M} is compact, \|\nabla (f_i \circ \phi_i^{-1})(z)\|_2\, b_i \|V_i\|_2 \le c for some c > 0. To bound |f_i(x)|, the key is to estimate \|x - y\|_2. We next prove that, for any x \in K_i, there exists y \in \partial U_i satisfying

\|x - y\|_2 \le \frac{\pi + 1}{r(1 - r/\tau)}\Delta.

The idea is to consider a geodesic \gamma(t), parameterized by arc length, from x to \partial U_i, as shown in Figure 8. A geodesic is the shortest path between two points on the manifold; we refer readers to Chapter 6 in Lee (2006) for a formal introduction. Denote y = \partial U_i \cap \gamma. Without loss of generality, we shift the center c_i to 0 in the following analysis. To utilize polar coordinates, we define two auxiliary quantities:

\theta(t) = \arccos\big( \dot\gamma(t)^\top \gamma(t) / \|\gamma(t)\|_2 \big) \quad \text{and} \quad \ell(t) = \|\gamma(t)\|_2,

where \dot\gamma denotes the derivative of \gamma with respect to t. We show that there exists a geodesic \gamma(t) satisfying

\inf_t \dot\ell(t) \ge \frac{1 - r/\tau}{\pi + 1} > 0.

This implies that the geodesic continuously moves away from the center c_i. Denote by T the value such that \gamma(T) = y. By the definition of a geodesic, T is the arc length of \gamma(t) between x and y. We have

T \inf_t \dot\ell(t) \le \ell(T) - \ell(0) \le r - \sqrt{r^2 - \Delta} \le \frac{\Delta}{r}.

Therefore, we derive

\|x - y\|_2 \le T \le \frac{\Delta}{r \inf_t \dot\ell(t)} \le \frac{\pi + 1}{r(1 - r/\tau)}\Delta.


FIG. 8. A geometric illustration of θ and ℓ.

Given Theorem 4.2, we choose

\eta = \delta = \frac{\varepsilon}{3C_{\mathcal{M}}} \quad \text{and} \quad \Delta = \frac{r(1 - r/\tau)\varepsilon}{3c(\pi + 1)C_{\mathcal{M}}} \qquad (4.2)

so that the approximation error is bounded by \varepsilon. Moreover, we choose

\nu = \frac{\Delta}{16 B^2 D} \qquad (4.3)

to guarantee \Delta > 8B^2 D \nu, so that the definition of \hat 1_\Delta is valid.

Finally, we quantify the size of the ReLU network. Recall that the chart determination sub-network has c_1 \log\frac{1}{\nu} layers, the Taylor approximation sub-network has c_2 \log\frac{1}{\delta} layers, and the pairing sub-network has c_3 \log\frac{1}{\eta} layers. Here c_2 depends on d, s, f, and c_1, c_3 are absolute constants. Combining these with (4.2) and (4.3) yields the depth in Theorem 3.2. By a similar argument, we can obtain the number of neurons and weight parameters. A detailed analysis is given in Appendix B.5.
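The bookkeeping in (4.2)–(4.3) can be done mechanically; a minimal sketch is given below. The inputs C_M, r, τ, B, D, and the constant c from Theorem 4.2 are user-supplied placeholder values.

```python
import math

def subnetwork_accuracies(eps, C_M, r, tau, c, B, D):
    """Compute (eta, delta, Delta, nu) from equations (4.2)-(4.3) for a target error eps."""
    eta = delta = eps / (3 * C_M)                                      # pairing / Taylor sub-network errors
    Delta = r * (1 - r / tau) * eps / (3 * c * (math.pi + 1) * C_M)    # ramp width, eq. (4.2)
    nu = Delta / (16 * B**2 * D)                                       # squared-distance accuracy, eq. (4.3)
    assert Delta > 8 * B**2 * D * nu                                   # guarantees (4.1) is well defined
    return eta, delta, Delta, nu

# Example with placeholder geometric constants.
print(subnetwork_accuracies(eps=0.1, C_M=10, r=0.5, tau=2.0, c=1.0, B=1.0, D=3))
```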

5. Conclusion

We study nonparametric regression using deep ReLU neural networks when data lie on a d-dimensional manifold \mathcal{M} isometrically embedded in R^D. Our result establishes an efficient recovery theory for general regression functions, including C^s, Hölder, and Sobolev functions supported on manifolds. We show that the L_2 loss for the estimation of f_0 \in H^{s,\alpha}(\mathcal{M}) converges in the order of n^{-\frac{s+\alpha}{2(s+\alpha)+d}}. This implies that, to obtain an \varepsilon-error estimate of f_0, the sample complexity scales in the order of \varepsilon^{-\frac{2(s+\alpha)+d}{s+\alpha}}. This fast rate, which depends on the intrinsic dimension d, reveals that deep neural networks are adaptive to low-dimensional geometric structures of data. Such results provide important insights into why deep learning succeeds in various real-world applications where data exhibit low-dimensional structures.


Acknowledgment

This work is supported by NSF grants DMS 1818751 and III 1717916.

REFERENCES

AAMARI, E., KIM, J., CHAZAL, F., MICHEL, B., RINALDO, A., WASSERMAN, L. ET AL. (2019). Estimating the reach of a manifold. Electronic Journal of Statistics, 13 1359–1399.
ALTMAN, N. S. (1992). An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician.
AMODEI, D., ANANTHANARAYANAN, S., ANUBHAI, R., BAI, J., BATTENBERG, E., CASE, C., CASPER, J., CATANZARO, B., CHENG, Q., CHEN, G. ET AL. (2016). Deep speech 2: End-to-end speech recognition in English and Mandarin. In International Conference on Machine Learning.
BAHDANAU, D., CHO, K. and BENGIO, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
BARRON, A. R. (1991). Complexity regularization with application to artificial neural networks. In Nonparametric functional estimation and related topics. Springer, 561–576.
BARRON, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39 930–945.
BICKEL, P. J. and LI, B. (2007). Local polynomial regression on unknown manifolds. Lecture Notes-Monograph Series, 54 177–186.
BICKEL, P. J., LI, B. ET AL. (2007). Local polynomial regression on unknown manifolds. In Complex datasets and inverse problems. Institute of Mathematical Statistics, 177–186.
CHEN, M., JIANG, H., LIAO, W. and ZHAO, T. (2019). Efficient approximation of deep relu networks for functions on low dimensional manifolds. In Advances in Neural Information Processing Systems.
CHUI, C. K. and LI, X. (1992). Approximation by ridge functions and neural networks with one hidden layer. Journal of Approximation Theory, 70 131–141.
CHUI, C. K. and MHASKAR, H. N. (2016). Deep nets for local manifold learning. arXiv preprint arXiv:1607.07110.
COIFMAN, R. R. and MAGGIONI, M. (2006). Diffusion wavelets. Applied and Computational Harmonic Analysis, 21 53–94.
CONWAY, J. H., SLOANE, N. J. A. and BANNAI, E. (1987). Sphere-packings, Lattices, and Groups. Springer-Verlag, Berlin, Heidelberg.
CYBENKO, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2 303–314.
DAUBECHIES, I., DEVORE, R., FOUCART, S., HANIN, B. and PETROVA, G. (2019). Nonlinear approximation and (deep) relu networks. arXiv preprint arXiv:1905.02199.
DEVORE, R. A., HOWARD, R. and MICCHELLI, C. (1989). Optimal nonlinear approximation. Manuscripta Mathematica, 63 469–478.
DJURIC, N., ZHOU, J., MORRIS, R., GRBOVIC, M., RADOSAVLJEVIC, V. and BHAMIDIPATI, N. (2015). Hate speech detection with comment embeddings. In Proceedings of the 24th International Conference on World Wide Web. ACM.
FAN, J. and GIJBELS, I. (1996). Local Polynomial Modelling and Its Applications. Monographs on Statistics and Applied Probability Series, Chapman & Hall.
FUNAHASHI, K.-I. (1989). On the approximate realization of continuous mappings by neural networks. Neural Networks, 2 183–192.
GLOROT, X., BORDES, A. and BENGIO, Y. (2011). Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics.
GOODFELLOW, I., BENGIO, Y. and COURVILLE, A. (2016). Deep Learning. MIT Press.


GOODFELLOW, I., POUGET-ABADIE, J., MIRZA, M., XU, B., WARDE-FARLEY, D., OZAIR, S., COURVILLE, A. and BENGIO, Y. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems.
GRAVES, A., MOHAMED, A.-R. and HINTON, G. (2013). Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE.
GU, S., HOLLY, E., LILLICRAP, T. and LEVINE, S. (2017). Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE.
GYORFI, L., KOHLER, M., KRZYZAK, A. and WALK, H. (2006). A Distribution-Free Theory of Nonparametric Regression. Springer Science & Business Media.
HAMERS, M. and KOHLER, M. (2006). Nonasymptotic bounds on the L2 error of neural network regression estimates. Annals of the Institute of Statistical Mathematics, 58 131–151.
HANIN, B. (2017). Universal function approximation by deep neural nets with bounded width and relu activations. arXiv preprint arXiv:1708.02691.
HINTON, G. E. and SALAKHUTDINOV, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313 504–507.
HORNIK, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks, 4 251–257.
HU, J., SHEN, L. and SUN, G. (2018). Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
IRIE, B. and MIYAKE, S. (1988). Capabilities of three-layered perceptrons. In IEEE International Conference on Neural Networks, vol. 1.
JIANG, F., JIANG, Y., ZHI, H., DONG, Y., LI, H., MA, S., WANG, Y., DONG, Q., SHEN, H. and WANG, Y. (2017). Artificial intelligence in healthcare: past, present and future. Stroke and Vascular Neurology, 2 230–243.
KIM, Y., OHN, I. and KIM, D. (2018). Fast convergence rates of deep neural networks for classification. arXiv preprint arXiv:1812.03599.
KOHLER, M. and KRZYZAK, A. (2005). Adaptive regression estimation with multilayer feedforward neural networks. Nonparametric Statistics, 17 891–913.
KOHLER, M. and KRZYZAK, A. (2016). Nonparametric regression based on hierarchical interaction models. IEEE Transactions on Information Theory, 63 1620–1630.
KOHLER, M. and MEHNERT, J. (2011). Analysis of the rate of convergence of least squares neural network regression estimates in case of measurement errors. Neural Networks, 24 273–279.
KPOTUFE, S. (2011). k-NN regression adapts to local intrinsic dimension. In Advances in Neural Information Processing Systems 24. 729–737.
KPOTUFE, S. and GARG, V. K. (2013). Adaptivity to local smoothness and dimension in kernel regression. In Advances in Neural Information Processing Systems 26. 3075–3083.
KRIZHEVSKY, A., SUTSKEVER, I. and HINTON, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems.
LEE, J. M. (2006). Riemannian Manifolds: An Introduction to Curvature, vol. 176. Springer Science & Business Media.
LESHNO, M., LIN, V. Y., PINKUS, A. and SCHOCKEN, S. (1993). Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6 861–867.
LONG, J., SHELHAMER, E. and DARRELL, T. (2015). Fully convolutional networks for semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
LU, Z., PU, H., WANG, F., HU, Z. and WANG, L. (2017). The expressive power of neural networks: A view from the width. In Advances in Neural Information Processing Systems.
MAAS, A. L., HANNUN, A. Y. and NG, A. Y. (2013). Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, vol. 30.

MCCAFFREY, D. F. and GALLANT, A. R. (1994). Convergence rates for single hidden layer feedforward networks. Neural Networks, 7 147–158.
MHASKAR, H. N. (1996). Neural networks for optimal approximation of smooth and analytic functions. Neural Computation, 8 164–177.
MIOTTO, R., WANG, F., WANG, S., JIANG, X. and DUDLEY, J. T. (2017). Deep learning for healthcare: review, opportunities and challenges. Briefings in Bioinformatics, 19 1236–1246.
NAIR, V. and HINTON, G. E. (2010). Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10).
NIYOGI, P., SMALE, S. and WEINBERGER, S. (2008). Finding the homology of submanifolds with high confidence from random samples. Discrete & Computational Geometry, 39 419–441.
OHN, I. and KIM, Y. (2019). Smooth function approximation by deep neural networks with general activation functions. Entropy, 21 627.
OSHER, S., SHI, Z. and ZHU, W. (2017). Low dimensional manifold model for image processing. SIAM Journal on Imaging Sciences, 10 1669–1690.
PANAYOTOV, V., CHEN, G., POVEY, D. and KHUDANPUR, S. (2015). Librispeech: an ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE.
ROWEIS, S. T. and SAUL, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290 2323–2326.
SCHMIDT-HIEBER, J. (2017). Nonparametric regression using deep neural networks with relu activation function. arXiv preprint arXiv:1708.06633.
SHAHAM, U., CLONINGER, A. and COIFMAN, R. R. (2018). Provable approximation properties for deep neural networks. Applied and Computational Harmonic Analysis, 44 537–557.
SHEN, Z., YANG, H. and ZHANG, S. (2019). Deep network approximation characterized by number of neurons. arXiv preprint arXiv:1906.05497.
SUZUKI, T. (2019). Adaptivity of deep reLU network for learning in Besov and mixed smooth Besov spaces: optimal rate and curse of dimensionality. In International Conference on Learning Representations.
TAN, M. and LE, Q. V. (2019). Efficientnet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946.
TENENBAUM, J. B., DE SILVA, V. and LANGFORD, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290 2319–2323.
TSYBAKOV, A. B. (2008). Introduction to Nonparametric Estimation. 1st ed. Springer Publishing Company, Incorporated.
TU, L. (2010). An Introduction to Manifolds. Universitext, Springer New York.
VAN DER VAART, A. and WELLNER, J. (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Series in Statistics, Springer.
WAHBA, G. (1990). Spline Models for Observational Data, vol. 59. SIAM.
WASSERMAN, L. (2006). All of Nonparametric Statistics. Springer Science & Business Media.
YANG, Y., TOKDAR, S. T. ET AL. (2015). Minimax-optimal nonparametric regression in high dimensions. The Annals of Statistics, 43 652–674.
YAROTSKY, D. (2017). Error bounds for approximations with deep relu networks. Neural Networks, 94 103–114.
YOUNG, T., HAZARIKA, D., PORIA, S. and CAMBRIA, E. (2018). Recent trends in deep learning based natural language processing. IEEE Computational Intelligence Magazine, 13 55–75.
ZHOU, D.-X. (2019). Universality of deep convolutional neural networks. Applied and Computational Harmonic Analysis.


Supplementary Materials for Nonparametric Regression on Low-Dimensional Manifolds using Deep ReLU Networks

A. Proofs of Preliminary Results in Section 4

A.1 Proof of Lemma 4.3

Proof. We partition the interval [0,1] uniformly into 2^N subintervals I_k = [\frac{k}{2^N}, \frac{k+1}{2^N}] for k = 0, \dots, 2^N - 1. We approximate f(x) = x^2 on these subintervals by the linear interpolation

f_k(x) = \frac{2k+1}{2^N}\Big(x - \frac{k}{2^N}\Big) + \frac{k^2}{2^{2N}}, \quad x \in I_k.

It is straightforward to check that f_k meets f at the endpoints \frac{k}{2^N} and \frac{k+1}{2^N} of I_k.

We evaluate the approximation error of f_k on the interval I_k:

\max_{x \in I_k} \big| f(x) - f_k(x) \big| = \max_{x \in I_k} \Big| x^2 - \frac{2k+1}{2^N} x + \frac{k^2 + k}{2^{2N}} \Big| = \max_{x \in I_k} \Big| \Big(x - \frac{2k+1}{2^{N+1}}\Big)^2 - \frac{1}{2^{2(N+1)}} \Big| = \frac{1}{2^{2(N+1)}}.

Note that this approximation error does not depend on k. Thus, in order to achieve an \varepsilon approximation error, we only need

\frac{1}{2^{2(N+1)}} \le \varepsilon \implies N \ge \frac{1}{2}\log_2\frac{1}{\varepsilon} - 1.

Let N = \big\lceil \frac{1}{2}\log_2\frac{1}{\varepsilon} \big\rceil and denote f_N = \sum_{k=0}^{2^N - 1} f_k\, 1\{x \in I_k\}. We compute the increment from f_{N-1} to f_N for x \in \big[\frac{k}{2^{N-1}}, \frac{k+1}{2^{N-1}}\big] as follows:

f_{N-1} - f_N =
\begin{cases}
\frac{k^2}{2^{2(N-1)}} + \frac{2k+1}{2^{N-1}}\big(x - \frac{k}{2^{N-1}}\big) - \frac{k^2}{2^{2(N-1)}} - \frac{4k+1}{2^N}\big(x - \frac{k}{2^{N-1}}\big), & x \in \big[\frac{k}{2^{N-1}}, \frac{2k+1}{2^N}\big),\\
\frac{k^2}{2^{2(N-1)}} + \frac{2k+1}{2^{N-1}}\big(x - \frac{k}{2^{N-1}}\big) - \frac{(2k+1)^2}{2^{2N}} - \frac{4k+3}{2^N}\big(x - \frac{2k+1}{2^N}\big), & x \in \big[\frac{2k+1}{2^N}, \frac{k+1}{2^{N-1}}\big),
\end{cases}
=
\begin{cases}
\frac{1}{2^N}x - \frac{k}{2^{2N-1}}, & x \in \big[\frac{k}{2^{N-1}}, \frac{2k+1}{2^N}\big),\\
-\frac{1}{2^N}x + \frac{k+1}{2^{2N-1}}, & x \in \big[\frac{2k+1}{2^N}, \frac{k+1}{2^{N-1}}\big).
\end{cases}

We observe that f_{N-1} - f_N is a triangular function on \big[\frac{k}{2^{N-1}}, \frac{k+1}{2^{N-1}}\big]. The maximum is \frac{1}{2^{2N}}, independent of k, attained at x = \frac{2k+1}{2^N}; the minimum is 0, attained at the endpoints \frac{k}{2^{N-1}}, \frac{k+1}{2^{N-1}}. To implement f_N, we consider a triangular function representable by a one-layer ReLU network:

g(x) = 2\sigma(x) - 4\sigma(x - 0.5) + 2\sigma(x - 1).

Denote by g_m = g \circ g \circ \cdots \circ g the composition of m copies of g. Observe that g_m is a sawtooth function with 2^{m-1} peaks at \frac{2k+1}{2^m} for k = 0, \dots, 2^{m-1} - 1, and we have g_m\big(\frac{2k+1}{2^m}\big) = 1 for k = 0, \dots, 2^{m-1} - 1. Then we have f_{N-1} - f_N = \frac{1}{2^{2N}} g_N. By induction, we have

f_N = f_{N-1} - \frac{1}{2^{2N}} g_N = f_{N-2} - \frac{1}{2^{2N}} g_N - \frac{1}{2^{2(N-1)}} g_{N-1} = \cdots = x - \sum_{k=1}^{N} \frac{1}{2^{2k}} g_k.

Therefore, f_N can be implemented by a ReLU network of depth \big\lceil \frac{1}{2}\log_2\frac{1}{\varepsilon} \big\rceil \le c\log\frac{1}{\varepsilon} for an absolute constant c. Each layer consists of at most 3 neurons; hence, the total number of neurons and weight parameters is no more than c'\log\frac{1}{\varepsilon}. □

A.2 Proof of Corollary 4.1

Proof. Let f_\delta be an approximation of the quadratic function on [0,1] with error \delta \in (0,1). We set

\hat\times(x, y) = C^2\Big( f_\delta\Big( \frac{|x+y|}{2C} \Big) - f_\delta\Big( \frac{|x-y|}{2C} \Big) \Big).

Now we determine \delta. We bound the error of \hat\times:

\big| \hat\times(x, y) - xy \big| = C^2 \Big| f_\delta\Big( \frac{|x+y|}{2C} \Big) - \frac{|x+y|^2}{4C^2} - f_\delta\Big( \frac{|x-y|}{2C} \Big) + \frac{|x-y|^2}{4C^2} \Big|
\le C^2 \Big( \Big| f_\delta\Big( \frac{|x+y|}{2C} \Big) - \frac{|x+y|^2}{4C^2} \Big| + \Big| f_\delta\Big( \frac{|x-y|}{2C} \Big) - \frac{|x-y|^2}{4C^2} \Big| \Big) \le 2C^2\delta.

Thus, we pick \delta = \frac{\varepsilon}{2C^2} to ensure \big| \hat\times(x, y) - xy \big| \le \varepsilon for any inputs x and y. As shown in Lemma 4.3, we can implement f_\delta using a ReLU network of depth at most c'\log\frac{1}{\delta} = c\log\frac{C^2}{\varepsilon} with absolute constants c', c. The proof is complete. □
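A minimal sketch of \hat\times built on the same squaring network is given below; C is the assumed bound on the magnitudes of the inputs, and N controls the accuracy of f_δ.

```python
import numpy as np

relu = lambda t: np.maximum(t, 0.0)
g = lambda t: 2 * relu(t) - 4 * relu(t - 0.5) + 2 * relu(t - 1.0)

def f_N(t, N=8):
    # ReLU approximation of t^2 on [0,1] (same construction as in the previous sketch).
    out, gk = np.asarray(t, dtype=float), np.asarray(t, dtype=float)
    for k in range(1, N + 1):
        gk = g(gk)
        out = out - gk / 4**k
    return out

def approx_mult(x, y, C=1.0, N=8):
    """x-hat(x, y) = C^2 * ( f_delta(|x+y|/(2C)) - f_delta(|x-y|/(2C)) ) with f_delta = f_N."""
    return C**2 * (f_N(abs(x + y) / (2 * C), N) - f_N(abs(x - y) / (2 * C), N))

print(approx_mult(0.37, -0.58), 0.37 * -0.58)   # agree up to 2 * C^2 * 2^{-2(N+1)}
```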

B. Proof of Approximation Theory of ReLU Network (Theorem 3.2)

B.1 Proof of Lemma 4.4

Proof. We rewrite f_i \circ \phi_i^{-1} as

\underbrace{(f \circ \phi_i^{-1})}_{g_1} \times \underbrace{(\rho_i \circ \phi_i^{-1})}_{g_2}. \qquad (A.1)

By the definition of the partition of unity, we know g_2 is C^\infty. This implies that g_2 is (s+1) times continuously differentiable. Since supp(\rho_i) is compact, the k-th derivative of g_2 is uniformly bounded by \lambda_{i,k} for any k \le s+1. Let \lambda_i = \max_{k \le s+1} \lambda_{i,k}. We have for any |\mathbf{s}| \le s and x_1, x_2 \in U_i,

|D^{\mathbf{s}} g_2(\phi_i(x_1)) - D^{\mathbf{s}} g_2(\phi_i(x_2))| \le \sqrt{d}\,\lambda_i \|\phi_i(x_1) - \phi_i(x_2)\|_2 \le \sqrt{d}\,\lambda_i b_i^{1-\alpha} \|x_1 - x_2\|_2^{1-\alpha} \|\phi_i(x_1) - \phi_i(x_2)\|_2^{\alpha}.

The last inequality follows from \phi_i(x) = b_i(V_i^\top(x - c_i) + u_i) and \|V_i\|_2 = 1. Observe that U_i is bounded; hence, we have \|x_1 - x_2\|_2^{1-\alpha} \le (2r)^{1-\alpha}. Absorbing \|x_1 - x_2\|_2^{1-\alpha} into \sqrt{d}\,\lambda_i b_i^{1-\alpha}, we conclude that the derivatives of g_2 are Hölder continuous, with coefficient \beta_{i,\alpha} = \sqrt{d}\,\lambda_i b_i^{1-\alpha}(2r)^{1-\alpha} \le \sqrt{d}\,\lambda_i (2r)^{1-\alpha}. Similarly, g_1 is C^{s-1} by Assumption 3. Then there exists a constant \mu_i such that the k-th derivative of g_1 is uniformly bounded by \mu_i for any k \le s-1. These derivatives are also Hölder continuous, with coefficient \theta_{i,\alpha} \le \sqrt{d}\,\mu_i (2r)^{1-\alpha}.

By the Leibniz rule, for any |\mathbf{s}| = s, we expand the \mathbf{s}-th derivative of f_i \circ \phi_i^{-1} as

D^{\mathbf{s}}(g_1 g_2) = \sum_{|\mathbf{p}| + |\mathbf{q}| = s} \binom{\mathbf{s}}{\mathbf{p}} D^{\mathbf{p}} g_1\, D^{\mathbf{q}} g_2.

Consider each summand on the right-hand side. For any x_1, x_2 \in U_i, we derive

\big| D^{\mathbf{p}} g_1(\phi_i(x_1)) D^{\mathbf{q}} g_2(\phi_i(x_1)) - D^{\mathbf{p}} g_1(\phi_i(x_2)) D^{\mathbf{q}} g_2(\phi_i(x_2)) \big|
= \big| D^{\mathbf{p}} g_1(\phi_i(x_1)) D^{\mathbf{q}} g_2(\phi_i(x_1)) - D^{\mathbf{p}} g_1(\phi_i(x_1)) D^{\mathbf{q}} g_2(\phi_i(x_2)) + D^{\mathbf{p}} g_1(\phi_i(x_1)) D^{\mathbf{q}} g_2(\phi_i(x_2)) - D^{\mathbf{p}} g_1(\phi_i(x_2)) D^{\mathbf{q}} g_2(\phi_i(x_2)) \big|
\le |D^{\mathbf{p}} g_1(\phi_i(x_1))|\, |D^{\mathbf{q}} g_2(\phi_i(x_1)) - D^{\mathbf{q}} g_2(\phi_i(x_2))| + |D^{\mathbf{q}} g_2(\phi_i(x_2))|\, |D^{\mathbf{p}} g_1(\phi_i(x_1)) - D^{\mathbf{p}} g_1(\phi_i(x_2))|
\le \mu_i \beta_{i,\alpha} \|\phi_i(x_1) - \phi_i(x_2)\|_2^{\alpha} + \lambda_i \theta_{i,\alpha} \|\phi_i(x_1) - \phi_i(x_2)\|_2^{\alpha}
\le 2\sqrt{d}\,\mu_i\lambda_i (2r)^{1-\alpha} \|\phi_i(x_1) - \phi_i(x_2)\|_2^{\alpha}.

The binomial coefficients in the Leibniz expansion sum to 2^s. Therefore, for any x_1, x_2 \in U_i and |\mathbf{s}| = s, we have

\Big| D^{\mathbf{s}}(f_i \circ \phi_i^{-1})\big|_{\phi_i(x_1)} - D^{\mathbf{s}}(f_i \circ \phi_i^{-1})\big|_{\phi_i(x_2)} \Big| \le 2^{s+1}\sqrt{d}\,\mu_i\lambda_i (2r)^{1-\alpha} \|\phi_i(x_1) - \phi_i(x_2)\|_2^{\alpha}. □

B.2 Proof of Theorem 4.1

Proof. The proof consists of two steps. We first approximate f_i \circ \phi_i^{-1} by a Taylor polynomial, and then implement the Taylor polynomial using a ReLU network. To ease the analysis, we extend f_i \circ \phi_i^{-1} to the whole cube [0,1]^d by assigning f_i \circ \phi_i^{-1}(z) = 0 for z \in [0,1]^d \setminus \phi_i(U_i). It is straightforward to check that this extension preserves the regularity of f_i \circ \phi_i^{-1}, since f_i vanishes on the complement of the compact set supp(\rho_i) \subset U_i. For notational simplicity, we denote f_i^{\phi} = f_i \circ \phi_i^{-1} with this extension.


Step 1. We define a trapezoid function

\psi(x) = \begin{cases} 1, & |x| < 1,\\ 2 - |x|, & 1 \le |x| \le 2,\\ 0, & |x| > 2. \end{cases}

Note that \|\psi\|_\infty = 1. Let N be a positive integer. We form a uniform grid on [0,1]^d by dividing each coordinate into N subintervals. We then consider a partition of unity on this grid defined by

\zeta_m(x) = \prod_{k=1}^{d} \psi\Big( 3N\Big( x_k - \frac{m_k}{N} \Big) \Big).

We can check that \sum_m \zeta_m(x) = 1, as illustrated in Figure A.9.


FIG. A.9. Illustration of the construction of ζm on the k-th coordinate.
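The identity Σ_m ζ_m(x) = 1 is easy to verify numerically in one dimension; the sketch below sums ψ(3N(x − m/N)) over m = 0, …, N on a grid. Only the trapezoid ψ and the grid size N come from the proof; the rest is scaffolding.

```python
import numpy as np

def psi(t):
    # Trapezoid: 1 on |t| < 1, linear ramp 2 - |t| on 1 <= |t| <= 2, and 0 for |t| > 2.
    return np.clip(2.0 - np.abs(t), 0.0, 1.0)

N = 5
x = np.linspace(0.0, 1.0, 2001)
total = sum(psi(3 * N * (x - m / N)) for m in range(N + 1))
print(np.max(np.abs(total - 1.0)))   # ~0: the zeta_m form a partition of unity on [0,1]
```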

We also observe that supp(\zeta_m) = \{x : |x_k - \frac{m_k}{N}| \le \frac{1}{N},\; k = 1, \dots, d\}. Now we construct a Taylor polynomial of degree s for approximating f_i^{\phi} at \frac{m}{N}:

P_m(x) = \sum_{|\mathbf{s}| \le s} \frac{D^{\mathbf{s}} f_i^{\phi}}{\mathbf{s}!}\Big|_{x = \frac{m}{N}} \Big(x - \frac{m}{N}\Big)^{\mathbf{s}}.

Define \bar f_i = \sum_{m \in \{0, \dots, N\}^d} \zeta_m P_m. We bound the approximation error \|\bar f_i - f_i^{\phi}\|_\infty:

\max_{x \in [0,1]^d} \big| \bar f_i(x) - f_i^{\phi}(x) \big|
= \max_x \Big| \sum_m \zeta_m(x)\big( P_m(x) - f_i^{\phi}(x) \big) \Big|
\le \max_x \sum_{m : |x_k - m_k/N| \le 1/N} \big| P_m(x) - f_i^{\phi}(x) \big|
\le \max_x 2^d \max_{m : |x_k - m_k/N| \le 1/N} \big| P_m(x) - f_i^{\phi}(x) \big|
\le \max_x \frac{2^d d^s}{s!} \Big( \frac{1}{N} \Big)^s \max_{|\mathbf{s}| = s} \Big| D^{\mathbf{s}} f_i^{\phi}\big|_{\frac{m}{N}} - D^{\mathbf{s}} f_i^{\phi}\big|_{y} \Big|
\le \max_x \frac{2^d d^s}{s!} \Big( \frac{1}{N} \Big)^s 2^{s+1}\sqrt{d}\,\mu_i\lambda_i (2r)^{1-\alpha} \Big\| \frac{m}{N} - x \Big\|_2^{\alpha}
\le \sqrt{d}\,\mu_i\lambda_i (2r)^{1-\alpha} \frac{2^{d+s+1} d^{s+\alpha/2}}{s!} \Big( \frac{1}{N} \Big)^{s+\alpha}.

Here y is a point on the segment between \frac{m}{N} and x, given by the Taylor remainder. The second-to-last inequality follows from the Hölder continuity established in Lemma 4.4. By setting

\sqrt{d}\,\mu_i\lambda_i (2r)^{1-\alpha} \frac{2^{d+s+1} d^{s+\alpha/2}}{s!} \Big( \frac{1}{N} \Big)^{s+\alpha} \le \frac{\delta}{2},

we get N \ge \Big( \sqrt{d}\,\mu_i\lambda_i (2r)^{1-\alpha} \frac{2^{d+s+2} d^{s+\alpha/2}}{\delta\, s!} \Big)^{\frac{1}{s+\alpha}}. Accordingly, the approximation error is bounded by \|\bar f_i - f_i^{\phi}\|_\infty \le \frac{\delta}{2}.

Step 2. We next implement an approximation \hat f_i of \bar f_i by a ReLU network, with error at most \frac{\delta}{2}. We denote

P_m(x) = \sum_{|\mathbf{s}| \le s} a_{m,\mathbf{s}} \Big(x - \frac{m}{N}\Big)^{\mathbf{s}}, \quad \text{where } a_{m,\mathbf{s}} = \frac{D^{\mathbf{s}} f_i^{\phi}}{\mathbf{s}!}\Big|_{x = \frac{m}{N}}.

Then we rewrite \bar f_i as

\bar f_i(x) = \sum_{m \in \{0, \dots, N\}^d} \sum_{|\mathbf{s}| \le s} a_{m,\mathbf{s}}\, \zeta_m(x) \Big(x - \frac{m}{N}\Big)^{\mathbf{s}}. \qquad (A.2)

Note that (A.2) is a linear combination of products \zeta_m(x)\big(x - \frac{m}{N}\big)^{\mathbf{s}}. Each product involves at most d + s univariate factors: d factors for \zeta_m and at most s factors for \big(x - \frac{m}{N}\big)^{\mathbf{s}}. We recursively apply Corollary 4.1 to implement each product. Specifically, let \hat\times_\varepsilon be the approximation of the product operator in Corollary 4.1 with error \varepsilon, which will be chosen later. Consider the following chained application of \hat\times_\varepsilon:

\hat f_{m,\mathbf{s}}(x) = \hat\times_\varepsilon\Big( \psi(3Nx_1 - 3m_1),\; \hat\times_\varepsilon\big( \dots,\; \hat\times_\varepsilon\big( \psi(3Nx_d - 3m_d),\; \hat\times_\varepsilon\big( x_1 - \tfrac{m_1}{N}, \dots \big) \big) \big) \Big).

Now we estimate the error of the above approximation. Note that |\psi(3Nx_k - 3m_k)| \le 1 and |x_k - \frac{m_k}{N}| \le 1 for all k \in \{1, \dots, d\} and x \in [0,1]^d. We then have

\Big| \hat f_{m,\mathbf{s}}(x) - \zeta_m(x)\Big(x - \frac{m}{N}\Big)^{\mathbf{s}} \Big|
= \Big| \hat\times_\varepsilon\big( \psi(3Nx_1 - 3m_1), \hat\times_\varepsilon(\dots) \big) - \zeta_m(x)\Big(x - \frac{m}{N}\Big)^{\mathbf{s}} \Big|
\le \Big| \hat\times_\varepsilon\big( \psi(3Nx_1 - 3m_1), \hat\times_\varepsilon(\psi(3Nx_2 - 3m_2), \dots) \big) - \psi(3Nx_1 - 3m_1)\,\hat\times_\varepsilon(\psi(3Nx_2 - 3m_2), \dots) \Big|
+ |\psi(3Nx_1 - 3m_1)| \Big| \hat\times_\varepsilon(\psi(3Nx_2 - 3m_2), \dots) - \psi(3Nx_2 - 3m_2)\,\hat\times_\varepsilon(\dots) \Big| + \cdots
\le (d + s)\varepsilon.

Moreover, we have \hat f_{m,\mathbf{s}}(x) = \zeta_m(x)\big(x - \frac{m}{N}\big)^{\mathbf{s}} = 0 if x \notin supp(\zeta_m). Now we define

\hat f_i = \sum_{m \in \{0, \dots, N\}^d} \sum_{|\mathbf{s}| \le s} a_{m,\mathbf{s}}\, \hat f_{m,\mathbf{s}}.


The approximation error is bounded by

\max_x \big| \hat f_i(x) - \bar f_i(x) \big|
= \Big| \sum_{m \in \{0, \dots, N\}^d} \sum_{|\mathbf{s}| \le s} a_{m,\mathbf{s}} \Big( \hat f_{m,\mathbf{s}}(x) - \zeta_m(x)\Big(x - \frac{m}{N}\Big)^{\mathbf{s}} \Big) \Big|
\le \max_x \lambda_i\mu_i 2^{d+s+1} \max_{m : x \in supp(\zeta_m)} \sum_{|\mathbf{s}| \le s} \Big| \hat f_{m,\mathbf{s}}(x) - \zeta_m(x)\Big(x - \frac{m}{N}\Big)^{\mathbf{s}} \Big|
\le \lambda_i\mu_i 2^{d+s+1} d^s (d+s)\varepsilon.

We choose \varepsilon = \frac{\delta}{\lambda_i\mu_i 2^{d+s+2} d^s (d+s)}, so that \|\hat f_i - \bar f_i\|_\infty \le \frac{\delta}{2}. Thus, we eventually have \|\hat f_i - f_i^{\phi}\|_\infty \le \delta.

Now we compute the depth and the number of computational units needed to implement \hat f_i. The function \hat f_i can be implemented by a collection of parallel sub-networks, each computing one \hat f_{m,\mathbf{s}}. The total number of parallel sub-networks is bounded by d^s(N+1)^d. For each sub-network, we observe that \psi can be exactly implemented by a single-layer ReLU network, i.e., \psi(x) = \mathrm{ReLU}(x+2) - \mathrm{ReLU}(x+1) - \mathrm{ReLU}(x-1) + \mathrm{ReLU}(x-2). Corollary 4.1 shows that \hat\times_\varepsilon can be implemented by a ReLU network of depth c_1\log\frac{1}{\varepsilon}. Therefore, the whole network implementing \hat f_i has no more than c_1'\big(\log\frac{1}{\varepsilon} + 1\big) layers and c_1' d^s(N+1)^d\big(\log\frac{1}{\varepsilon} + 1\big) neurons and weight parameters. With \varepsilon = \frac{\delta}{\lambda_i\mu_i 2^{d+s+2} d^s (d+s)} and N = \Big\lceil \big( \frac{\mu_i\lambda_i (2r)^{1-\alpha} 2^{d+s+2} d^{s+\alpha/2}}{\delta\, s!} \big)^{\frac{1}{s+\alpha}} \Big\rceil, we obtain that the whole network has no more than c_1\log\frac{1}{\delta} layers, and at most c_2\,\delta^{-\frac{d}{s+\alpha}}\big(\log\frac{1}{\delta} + 1\big) neurons and weight parameters, for constants c_1, c_2 depending on d, s, and f_i \circ \phi_i^{-1}. □

B.3 Proof of Theorem 4.2

Proof. We expand the approximation error as

\big\| \hat f - f \big\|_\infty
= \Big\| \sum_{i=1}^{C_{\mathcal{M}}} \hat\times\big( \tilde f_i, \hat 1_\Delta \circ \hat d_i^2 \big) - f \Big\|_\infty
= \Big\| \sum_{i=1}^{C_{\mathcal{M}}} \Big( \hat\times\big( \tilde f_i, \hat 1_\Delta \circ \hat d_i^2 \big) - f\rho_i\, 1(x \in U_i) \Big) \Big\|_\infty
\le \sum_{i=1}^{C_{\mathcal{M}}} \Big\| \hat\times\big( \tilde f_i, \hat 1_\Delta \circ \hat d_i^2 \big) - f_i\, 1(x \in U_i) \Big\|_\infty
\le \sum_{i=1}^{C_{\mathcal{M}}} \Big\| \hat\times\big( \tilde f_i, \hat 1_\Delta \circ \hat d_i^2 \big) - \tilde f_i \cdot (\hat 1_\Delta \circ \hat d_i^2) + \tilde f_i \cdot (\hat 1_\Delta \circ \hat d_i^2) - f_i \cdot (\hat 1_\Delta \circ \hat d_i^2) + f_i \cdot (\hat 1_\Delta \circ \hat d_i^2) - f_i \cdot 1(x \in U_i) \Big\|_\infty
\le \sum_{i=1}^{C_{\mathcal{M}}} \Big( \underbrace{\big\| \hat\times( \tilde f_i, \hat 1_\Delta \circ \hat d_i^2 ) - \tilde f_i \times (\hat 1_\Delta \circ \hat d_i^2) \big\|_\infty}_{A_{i,1}} + \underbrace{\big\| \tilde f_i \times (\hat 1_\Delta \circ \hat d_i^2) - f_i \times (\hat 1_\Delta \circ \hat d_i^2) \big\|_\infty}_{A_{i,2}} + \underbrace{\big\| f_i \times (\hat 1_\Delta \circ \hat d_i^2) - f_i \times 1(x \in U_i) \big\|_\infty}_{A_{i,3}} \Big).


The first two terms A_{i,1}, A_{i,2} are straightforward to handle, since by construction we have

A_{i,1} = \big\| \hat\times( \tilde f_i, \hat 1_\Delta \circ \hat d_i^2 ) - \tilde f_i \cdot (\hat 1_\Delta \circ \hat d_i^2) \big\|_\infty \le \eta, \quad \text{and}
A_{i,2} = \big\| \tilde f_i \times (\hat 1_\Delta \circ \hat d_i^2) - f_i \cdot (\hat 1_\Delta \circ \hat d_i^2) \big\|_\infty \le \big\| \tilde f_i - f_i \big\|_\infty \big\| \hat 1_\Delta \circ \hat d_i^2 \big\|_\infty \le \delta.

By Lemma 4.5, we have \max_{x \in K_i} |f_i(x)| \le \frac{c(\pi+1)}{r(1 - r/\tau)}\Delta for a constant c depending on f_i. Then we bound A_{i,3} as

A_{i,3} = \big\| f_i \times (\hat 1_\Delta \circ \hat d_i^2) - f_i \times 1(x \in U_i) \big\|_\infty \le \max_{x \in K_i} |f_i(x)| \le \frac{c(\pi+1)}{r(1 - r/\tau)}\Delta. □

B.4 Proof of Lemma 4.5

Proof. We extend f_i \circ \phi_i^{-1} to the whole cube [0,1]^d as in the proof of Theorem 4.1. We also have f_i(x) = 0 for \|x - c_i\|_2 = r. By the first-order Taylor expansion, for any x, y \in U_i we have

|f_i(x) - f_i(y)| = \big| f_i \circ \phi_i^{-1}(\phi_i(x)) - f_i \circ \phi_i^{-1}(\phi_i(y)) \big| \le \big\| \nabla (f_i \circ \phi_i^{-1})(z) \big\|_2 \|\phi_i(x) - \phi_i(y)\|_2 \le \big\| \nabla (f_i \circ \phi_i^{-1})(z) \big\|_2\, b_i \|V_i\|_2 \|x - y\|_2,

where z is a point on the segment between \phi_i(x) and \phi_i(y) given by the mean value theorem. Since f_i \circ \phi_i^{-1} is C^s on [0,1]^d, its first derivative is uniformly bounded, i.e., \|\nabla (f_i \circ \phi_i^{-1})(z)\|_2 \le \alpha_i for any z \in [0,1]^d. Let y \in U_i satisfy f_i(y) = 0. In order to bound the function value at any x \in K_i, we only need to bound the Euclidean distance between x and y. More specifically, for any x \in K_i, we need to show that there exists y \in U_i satisfying f_i(y) = 0 such that \|x - y\|_2 is sufficiently small.

y ∈Ui satisfying fi(y) = 0. In order to bound the function value for any x ∈Ki, we only need to boundthe Euclidean distance between x and y. More specifically, for any x ∈Ki, we need to show that thereexists y ∈Ui satisfying fi(y) = 0, such that ‖x−y‖2 is sufficiently small.

Before continuing with the proof, we introduce some notations. Let γ(t) be a geodesic on Mparameterized by the arc length. In the following context, we use γ and γ to denote the first and secondderivatives of γ with respect to t. By the definition of geodesic, we have ‖γ(t)‖2 = 1 (unit speed) andγ(t)⊥ γ(t).

Without loss of generality, we shift ci to 0. We consider a geodesic starting from x with initial“velocity” γ(0) = v in the tangent space of M at x. To utilize polar coordinate, we define two auxiliary

quantities: `(t) = ‖γ(t)‖2 and θ(t) = arccos γ(t)> γ(t)‖γ(t)‖2

∈ [0,π]. As can be seen in Figure 8, ` and θ haveclear geometrical interpretations: ` is the radial distance from the center ci, and θ is the angle betweenthe velocity and γ(t).

Suppose y = γ(T ), we need to upper bound T . Note that `(T )− `(0)6 r−√

r2−∆ 6 ∆/r. More-over, observe that the derivative of ` is ˙(t) = cosθ(t), since γ has unit speed. It suffices to find a lowerbound on ˙(t) = cosθ(t) so that T 6 ∆

r inft ˙(t).

We immediately have the second derivative of ` as ¨(t) = −sinθ(t)θ(t). Meanwhile, using theequation `(t) =

√γ(t)>γ(t), we also have

¨(t) =

(γ(t)>γ(t)+ γ(t)>γ(t)

)√γ(t)>γ(t)−

(γ(t)>γ(t)

)2/√

γ(t)>γ(t)γ(t)>γ(t)

. (A.3)


FIG. A.10. Illustration of $\ell$ and $\theta$ along a parametric curve $\gamma$.

Note that by definition, we have $\dot\gamma(t)^\top\dot\gamma(t) = 1$ and $\dot\gamma(t)^\top\gamma(t) = \cos\theta(t)\sqrt{\gamma(t)^\top\gamma(t)}$. Plugging into (A.3), we can derive
$$\ddot\ell(t) = \frac{1 + \ddot\gamma(t)^\top\gamma(t) - \cos^2\theta(t)}{\ell(t)} = \frac{\sin^2\theta(t) + \ddot\gamma(t)^\top\gamma(t)}{\ell(t)}. \tag{A.4}$$
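For readability, the following short calculation (ours, not part of the original argument) expands the two identities used above, taking the reference point of $\ell$ to be the origin so that $\ell(t) = \|\gamma(t)\|_2$, and using that $\gamma$ is parameterized by arc length:
$$\begin{aligned}
2\ell(t)\dot\ell(t) &= \frac{d}{dt}\,\gamma(t)^\top\gamma(t) = 2\dot\gamma(t)^\top\gamma(t)
&&\Longrightarrow\quad \dot\ell(t) = \frac{\dot\gamma(t)^\top\gamma(t)}{\ell(t)} = \cos\theta(t),\\
\ddot\ell(t)\ell(t) + \dot\ell^2(t) &= \ddot\gamma(t)^\top\gamma(t) + \dot\gamma(t)^\top\dot\gamma(t)
&&\Longrightarrow\quad \ddot\ell(t) = \frac{1 + \ddot\gamma(t)^\top\gamma(t) - \cos^2\theta(t)}{\ell(t)},
\end{aligned}$$
which is (A.4) after substituting $1 - \cos^2\theta(t) = \sin^2\theta(t)$.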

Now we find a lower bound on $\ddot\gamma(t)^\top\gamma(t)$. Specifically, by the Cauchy-Schwarz inequality, we have
$$\ddot\gamma(t)^\top\gamma(t) \geq -\|\ddot\gamma(t)\|_2\,\|\gamma(t)\|_2\,\big|\cos\angle(\ddot\gamma(t), \gamma(t))\big| \geq -\frac{r}{\tau}\,\big|\cos\angle(\ddot\gamma(t), \gamma(t))\big|.$$
The last inequality follows from $\|\ddot\gamma(t)\|_2 \leq 1/\tau$ (Niyogi et al., 2008) and $\|\gamma(t)\|_2 \leq r$. We now need to bound $\angle(\ddot\gamma(t), \gamma(t))$, given $\angle(\gamma(t), \dot\gamma(t)) = \theta(t)$ and $\ddot\gamma(t) \perp \dot\gamma(t)$. Consider the following optimization problem,

$$\min_x \; a^\top x, \qquad \text{subject to} \quad x^\top x = 1, \quad b^\top x = 0. \tag{A.5}$$

By assigning $a = \frac{\gamma(t)}{\|\gamma(t)\|_2}$ and $b = \frac{\dot\gamma(t)}{\|\dot\gamma(t)\|_2}$, the optimal objective value is exactly the minimum of $\cos\angle(\gamma(t), \ddot\gamma(t))$. Additionally, we can find the maximum of $\cos\angle(\gamma(t), \ddot\gamma(t))$ by replacing the minimization in (A.5) with maximization. We solve (A.5) by the Lagrangian method. More precisely, let
$$\mathcal{L}(x, \lambda, \mu) = -a^\top x + \lambda(x^\top x - 1) + \mu(b^\top x).$$
The optimal solution $x^*$ satisfies $\nabla_x \mathcal{L} = 0$, which implies $x^* = \frac{1}{2\lambda^*}(a - \mu^* b)$, with $\mu^*$ and $\lambda^*$ being the optimal dual variables. By primal feasibility, we have $\mu^* = a^\top b$ and $\lambda^* = -\frac{1}{2}\sqrt{1 - (a^\top b)^2}$. Therefore, the optimal objective value is $-\sqrt{1 - (a^\top b)^2}$; similarly, the maximum is $\sqrt{1 - (a^\top b)^2}$. Noting that $a^\top b = \cos\theta(t)$, we then get
$$\ddot\gamma(t)^\top\gamma(t) \geq -\frac{r}{\tau}\sin\theta(t).$$
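As a quick numerical sanity check of the constrained optimum used above, the following script (ours, purely illustrative; the vectors a and b simply stand in for $\gamma(t)/\|\gamma(t)\|_2$ and the unit tangent) compares the closed form $\pm\sqrt{1-(a^\top b)^2}$ against a brute-force search over unit vectors orthogonal to b:

import numpy as np

rng = np.random.default_rng(0)
D = 5
a = rng.normal(size=D); a /= np.linalg.norm(a)   # stand-in for gamma(t)/||gamma(t)||_2
b = rng.normal(size=D); b /= np.linalg.norm(b)   # stand-in for the unit tangent direction

vals = []
for _ in range(50000):
    x = rng.normal(size=D)
    x -= (b @ x) * b           # project onto the orthogonal complement of b (constraint b^T x = 0)
    x /= np.linalg.norm(x)     # renormalize (constraint x^T x = 1)
    vals.append(a @ x)

closed_form = np.sqrt(1 - (a @ b) ** 2)
print(min(vals), -closed_form)   # both are close to -sqrt(1 - (a^T b)^2)
print(max(vals), closed_form)    # both are close to +sqrt(1 - (a^T b)^2)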


Substituting into (A.4), we have the following lower bound
$$\ddot\ell(t) = \frac{\sin^2\theta(t) + \ddot\gamma(t)^\top\gamma(t)}{\ell(t)} \geq \frac{1}{\ell(t)}\Big(\sin^2\theta(t) - \frac{r}{\tau}\sin\theta(t)\Big).$$
Now combining with $\ddot\ell(t) = -\sin\theta(t)\,\dot\theta(t)$, we can derive
$$\dot\theta(t) \leq -\frac{1}{\ell(t)}\Big(\sin\theta(t) - \frac{r}{\tau}\Big). \tag{A.6}$$

Inequality (A.6) has an important implication: when $\sin\theta(t) > \frac{r}{\tau}$, as $t$ increases, $\theta(t)$ is monotone decreasing until $\sin\theta(t') = \frac{r}{\tau}$ for some $t' \geq t$. Thus, we distinguish two cases depending on the value of $\theta(0)$. Indeed, we only need to consider $\theta(0) \in [0, \pi/2]$: if $\theta(0) \in (\pi/2, \pi]$, we only need to set the initial velocity in the opposite direction.
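The qualitative behavior described here can be reproduced numerically. The sketch below (ours; the constants r, tau and the initial conditions are arbitrary illustrative choices) integrates the boundary case of (A.6) together with $\dot\ell(t) = \cos\theta(t)$ by a forward Euler scheme:

import numpy as np

r, tau = 0.3, 1.0           # any r < tau/2 behaves similarly
theta, ell = 1.2, 0.05      # initial angle (radians) and initial distance
dt = 1e-4

history = [theta]
while ell < r:
    dtheta = -(np.sin(theta) - r / tau) / ell   # boundary case of inequality (A.6)
    dell = np.cos(theta)                        # since d/dt ell(t) = cos(theta(t))
    theta += dt * dtheta
    ell += dt * dell
    history.append(theta)

print("theta(0) = %.3f, final theta = %.3f, arcsin(r/tau) = %.3f"
      % (history[0], history[-1], np.arcsin(r / tau)))
# theta decreases toward arcsin(r/tau) as long as sin(theta) > r/tau, as claimed.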

Case 1: $\theta(0) \in \big[0, \arcsin\frac{r}{\tau}\big]$. We claim that $\theta(t) \in \big[0, \arcsin\frac{r}{\tau}\big]$ for all $t \leq T$. In fact, suppose there exists some $t_1 \leq T$ such that $\theta(t_1) > \arcsin\frac{r}{\tau}$. By the continuity of $\theta$, there exists $t_0 < t_1$ such that $\theta(t_0) = \arcsin\frac{r}{\tau}$ and $\theta(t) > \arcsin\frac{r}{\tau}$ for $t \in [t_0, t_1]$. This already gives us a contradiction:
$$\theta(t_0) < \theta(t_1) = \theta(t_0) + \underbrace{\int_{t_0}^{t_1} \dot\theta(t)\,dt}_{\leq\,0} \leq \theta(t_0),$$
since $\sin\theta(t) \geq \frac{r}{\tau}$ on $[t_0, t_1]$ and hence $\dot\theta(t) \leq 0$ there by (A.6). Therefore, we have $\dot\ell(t) \geq \cos\arcsin\frac{r}{\tau} = \sqrt{1 - \frac{r^2}{\tau^2}}$, and thus $T \leq \frac{\Delta}{r\sqrt{1 - r^2/\tau^2}}$.

Case 2: $\theta(0) \in \big(\arcsin\frac{r}{\tau}, \pi/2\big]$. It is enough to show that $\theta(0)$ can be bounded sufficiently away from $\pi/2$. Let $\gamma_{c,x} \subset \mathcal{M}$ be a geodesic from $c_i$ to $x$. We define $\theta_{c,x}$ and $\ell_{c,x}$ analogously to the geodesic from $x$ to $y$. Let $T_{r/2} = \sup\{t : \ell_{c,x}(t) \leq r/2 - \Delta/r\}$, and denote $z = \gamma_{c,x}(T_{r/2})$. We must have $\theta_{c,x}(T_{r/2}) \in [0, \pi/2]$ and $\ell_{c,x}(T_{r/2}) = r/2 - \Delta/r$; otherwise there exists $T'_{r/2} > T_{r/2}$ satisfying $\ell_{c,x}(T'_{r/2}) \leq r/2$. Denote by $T_x$ the time satisfying $x = \gamma_{c,x}(T_x)$. We bound $\theta_{c,x}(T_x)$ as follows,
$$\theta_{c,x}(T_x) = \theta_{c,x}(T_{r/2}) + \int_{T_{r/2}}^{T_x} \dot\theta_{c,x}(t)\,dt \leq \frac{\pi}{2} - \int_{T_{r/2}}^{T_x} \frac{1}{\ell_{c,x}(t)}\Big(\sin\theta_{c,x}(t) - \frac{r}{\tau}\Big)\,dt.$$
If there exists some $t \in (T_{r/2}, T_x]$ such that $\sin\theta_{c,x}(t) \leq \frac{r}{\tau}$, then by the previous reasoning we have $\sin\theta_{c,x}(T_x) \leq \frac{r}{\tau}$. Thus, we only need to handle the case where $\sin\theta_{c,x}(t) > \frac{r}{\tau}$ for all $t \in (T_{r/2}, T_x]$. In this case, $\theta_{c,x}(t)$ is monotone decreasing, hence we further have
$$\theta_{c,x}(T_x) \leq \frac{\pi}{2} - \int_{T_{r/2}}^{T_x} \frac{1}{\ell_{c,x}(t)}\Big(\sin\theta_{c,x}(T_x) - \frac{r}{\tau}\Big)\,dt \leq \frac{\pi}{2} - (T_x - T_{r/2})\,\frac{1}{r}\Big(\sin\theta_{c,x}(T_x) - \frac{r}{\tau}\Big) \leq \frac{\pi}{2} - \frac{1}{2}\Big(\sin\theta_{c,x}(T_x) - \frac{r}{\tau}\Big).$$


The last inequality follows from $T_x - T_{r/2} \geq r/2$. Using the fact that $\sin x \geq \frac{2}{\pi}x$ for $x \in [0, \pi/2]$, we can derive
$$\theta_{c,x}(T_x) \leq \frac{\pi}{2} - \frac{1}{2}\Big(\frac{2}{\pi}\theta_{c,x}(T_x) - \frac{r}{\tau}\Big) \quad\Longrightarrow\quad \theta_{c,x}(T_x) \leq \frac{\pi}{2}\Big(\frac{\pi + r/\tau}{\pi + 1}\Big).$$

We can then set $\theta(0) = \theta_{c,x}(T_x)$, and thus
$$\cos\theta(0) \geq \cos\Big(\frac{\pi}{2}\cdot\frac{\pi + r/\tau}{\pi + 1}\Big) = \cos\Big(\frac{\pi}{2}\Big(1 - \frac{1 - r/\tau}{\pi + 1}\Big)\Big) = \sin\Big(\frac{\pi}{2}\cdot\frac{1 - r/\tau}{\pi + 1}\Big) \geq \frac{1 - r/\tau}{\pi + 1}.$$

Therefore, we have $T \leq \frac{\Delta}{r\cos\theta(0)} \leq \frac{\pi + 1}{r(1 - r/\tau)}\Delta$. By the choice of $r < \tau/2$, we immediately have $\frac{\tau}{\sqrt{\tau^2 - r^2}} < \frac{2}{\sqrt{3}} < \pi + 1 \leq \frac{\pi + 1}{1 - r/\tau}$. Hence, combining Case 1 and Case 2, we conclude
$$T \leq \frac{\pi + 1}{r(1 - r/\tau)}\Delta.$$
Therefore, the function value $f(x)$ on $K_i$ is bounded by $\alpha_i\frac{\pi + 1}{r(1 - r/\tau)}\Delta$. It suffices to let $c = \max_i \alpha_i b_i \|V_i\|_2$, and we complete the proof. □

B.5 Characterization of the Size of the ReLU Network

Proof. We evenly split the error $\varepsilon$ into 3 parts for $A_{i,1}$, $A_{i,2}$, and $A_{i,3}$, respectively. We pick $\eta = \frac{\varepsilon}{3C_{\mathcal{M}}}$ so that $\sum_{i=1}^{C_{\mathcal{M}}} A_{i,1} \leq \frac{\varepsilon}{3}$. The same argument yields $\delta = \frac{\varepsilon}{3C_{\mathcal{M}}}$. Analogously, we can choose $\Delta = \frac{r(1 - r/\tau)\varepsilon}{3c(\pi + 1)C_{\mathcal{M}}}$. Finally, we pick $\nu = \frac{\Delta}{16B^2 D}$ so that $8B^2 D\nu < \Delta$. Now we compute the number of layers, and the number of neurons and weight parameters, in the ReLU network identified by Theorem 3.2.

1. For the chart determination sub-network, $1_\Delta$ can be implemented by a single-layer ReLU network. The approximation of the squared distance function $d_i^2$ can be implemented by a network of depth $O\big(\log\frac{1}{\nu}\big)$, and the number of neurons and weight parameters is at most $O\big(\log\frac{1}{\nu}\big)$. Plugging in our choice of $\nu$, the depth is no greater than $c_1\big(\log\frac{1}{\varepsilon} + \log D\big)$ with $c_1$ depending on $d, f, \tau$, and the surface area of $\mathcal{M}$. The number of neurons and weight parameters is also $c'_1\big(\log\frac{1}{\varepsilon} + \log D\big)$, except for a different constant. Note that there are $D$ parallel networks computing $d_i^2$ for each $i = 1, \dots, C_{\mathcal{M}}$. Hence, the total number of neurons and weight parameters is $c'_1 C_{\mathcal{M}} D\big(\log\frac{1}{\varepsilon} + \log D\big)$ with $c'_1$ depending on $d, f, \tau$, and the surface area of $\mathcal{M}$.

2. For the Taylor polynomial sub-network, $\phi_i$ can be implemented by a linear network with at most $Dd$ weight parameters. To implement each $f_i$, we need a ReLU network of depth $c_4\log\frac{1}{\delta}$. The number of neurons and weight parameters is $c'_4\,\delta^{-\frac{d}{s+\alpha}}\log\frac{1}{\delta}$. Here $c_4, c'_4$ depend on $s, d$, and $f_i \circ \phi_i^{-1}$. Substituting $\delta = \frac{\varepsilon}{3C_{\mathcal{M}}}$, the depth is $c_2\log\frac{1}{\varepsilon}$ and the number of neurons and weight parameters is $c'_2\,\varepsilon^{-\frac{d}{s+\alpha}}\log\frac{1}{\varepsilon}$. There are in total $C_{\mathcal{M}}$ parallel $f_i$'s, hence the total number of neurons and weight parameters is $c'_2 C_{\mathcal{M}}\,\varepsilon^{-\frac{d}{s+\alpha}}\log\frac{1}{\varepsilon}$ with $c'_2$ depending on $d, s, f_i \circ \phi_i^{-1}, \tau$, and the surface area of $\mathcal{M}$.

3. For the product sub-network, the analysis is similar to the chart determination sub-network. The depth is $O\big(\log\frac{1}{\eta}\big)$, and the number of neurons and weight parameters is $O\big(\log\frac{1}{\eta}\big)$. The choice of $\eta$ yields a depth of $c_3\log\frac{1}{\varepsilon}$, and the number of neurons and weight parameters is $c'_3\log\frac{1}{\varepsilon}$. There are $C_{\mathcal{M}}$ parallel pairs of outputs from the chart determination and the Taylor polynomial sub-networks. Hence, the total number of weight parameters is $c'_3 C_{\mathcal{M}}\log\frac{1}{\varepsilon}$ with $c'_3$ depending on $d, \tau$, and the surface area of $\mathcal{M}$.

Combining these 3 sub-networks, we see that the depth of the full network is $c\big(\log\frac{1}{\varepsilon} + \log D\big)$ for some constant $c$ depending on $d, s, f, \tau$, and the surface area of $\mathcal{M}$. The total number of neurons and weight parameters is $c'\big(\varepsilon^{-\frac{d}{s+\alpha}}\log\frac{1}{\varepsilon} + D\log\frac{1}{\varepsilon} + D\log D\big)$ for some constant $c'$ depending on $d, s, f, \tau$, and the surface area of $\mathcal{M}$. □
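To make the bookkeeping above concrete, the following sketch (ours; the absolute constants are suppressed and set to 1, and every name is a placeholder rather than part of the construction) tallies the depth and the parameter counts of the three sub-networks as functions of the accuracy eps, the ambient dimension D, the number of charts C_M, and the intrinsic parameters d, s, alpha:

import math

def network_size(eps, D, C_M, d, s, alpha):
    # Order-of-magnitude tally of the construction in the proof (constants set to 1).
    log_eps = math.log(1.0 / eps)
    # Chart determination: C_M * D parallel pieces, each of size O(log(1/eps) + log D).
    chart_params = C_M * D * (log_eps + math.log(D))
    # Taylor polynomial part: C_M local approximators of size O(eps^{-d/(s+alpha)} log(1/eps)).
    taylor_params = C_M * eps ** (-d / (s + alpha)) * log_eps
    # Product part: C_M pairings of size O(log(1/eps)).
    product_params = C_M * log_eps
    depth = log_eps + math.log(D)   # every sub-network has depth O(log(1/eps) + log D)
    return depth, chart_params + taylor_params + product_params

print(network_size(eps=1e-2, D=1000, C_M=50, d=2, s=1, alpha=0.5))

The dominant term is the Taylor polynomial part, which scales like eps^{-d/(s+alpha)} and therefore depends on the intrinsic dimension d rather than the ambient dimension D.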

C. Proof of Statistical Recovery of ReLU Network (Theorem 3.1)

C.1 Proof of Lemma 4.1

Proof. $T_1$ essentially reflects the bias of estimating $f_0$:
$$\begin{aligned}
T_1 &= \mathbb{E}\bigg[\frac{2}{n}\sum_{i=1}^n \big(f_n(x_i) - f_0(x_i) - \xi_i + \xi_i\big)^2\bigg]\\
&= \frac{2}{n}\mathbb{E}\bigg[\sum_{i=1}^n \big(f_n(x_i) - f_0(x_i) - \xi_i\big)^2 + 2\xi_i\big(f_n(x_i) - f_0(x_i) - \xi_i\big) + \xi_i^2\bigg]\\
&\overset{(i)}{\leq} \frac{2}{n}\mathbb{E}\bigg[\sum_{i=1}^n \big(f_n(x_i) - f_0(x_i) - \xi_i\big)^2 + 2\xi_i f_n(x_i)\bigg]\\
&= \frac{2}{n}\mathbb{E}\bigg[\sum_{i=1}^n \big(f_n(x_i) - y_i\big)^2 + 2\xi_i f_n(x_i)\bigg]\\
&= \frac{2}{n}\mathbb{E}\bigg[\inf_{f \in \mathcal{F}(R,\kappa,L,p,K)}\sum_{i=1}^n \big(f(x_i) - y_i\big)^2 + 2\xi_i f_n(x_i)\bigg]\\
&\overset{(ii)}{\leq} 2\inf_{f \in \mathcal{F}(R,\kappa,L,p,K)}\mathbb{E}\big[(f(x) - f_0(x))^2\big] + \mathbb{E}\bigg[\frac{4}{n}\sum_{i=1}^n \xi_i f_n(x_i)\bigg],
\end{aligned} \tag{A.1}$$

where (i) holds since $\mathbb{E}[\xi_i f_0(x_i)] = 0$, and (ii) holds due to Jensen's inequality and $f$ being independent of the $x_i$'s. Now we need to bound $\mathbb{E}\big[\frac{1}{n}\sum_{i=1}^n \xi_i f_n(x_i)\big]$. We discretize the class $\mathcal{F}(R,\kappa,L,p,K)$ into $\mathcal{F}^*(R,\kappa,L,p,K) = \{f_j^*\}_{j=1}^{\mathcal{N}(\delta,\mathcal{F}(R,\kappa,L,p,K),\|\cdot\|_\infty)}$, where $\mathcal{N}(\delta,\mathcal{F}(R,\kappa,L,p,K),\|\cdot\|_\infty)$ denotes the $\delta$-covering number with respect to the $\ell_\infty$ norm. Accordingly, there exists $f^*$ in the covering such that $\|f^* - f_n\|_\infty \leq \delta$.


Denote $\|f_n - f_0\|_n^2 = \frac{1}{n}\sum_{i=1}^n (f_n(x_i) - f_0(x_i))^2$. Then we have
$$\begin{aligned}
\mathbb{E}\bigg[\frac{1}{n}\sum_{i=1}^n \xi_i f_n(x_i)\bigg] &= \mathbb{E}\bigg[\frac{1}{n}\sum_{i=1}^n \xi_i\big(f_n(x_i) - f^*(x_i) + f^*(x_i) - f_0(x_i)\big)\bigg]\\
&\overset{(i)}{\leq} \mathbb{E}\bigg[\frac{1}{n}\sum_{i=1}^n \xi_i\big(f^*(x_i) - f_0(x_i)\big)\bigg] + \delta\sigma\\
&= \mathbb{E}\bigg[\frac{\|f^* - f_0\|_n}{\sqrt{n}}\cdot\frac{\sum_{i=1}^n \xi_i(f^*(x_i) - f_0(x_i))}{\sqrt{n}\,\|f^* - f_0\|_n}\bigg] + \delta\sigma\\
&\overset{(ii)}{\leq} \sqrt{2}\,\mathbb{E}\bigg[\frac{\|f_n - f_0\|_n + \delta}{\sqrt{n}}\cdot\frac{\sum_{i=1}^n \xi_i(f^*(x_i) - f_0(x_i))}{\sqrt{n}\,\|f^* - f_0\|_n}\bigg] + \delta\sigma.
\end{aligned}$$

Here (i) is obtained by applying Hölder's inequality to $\xi_i(f_n(x_i) - f^*(x_i))$ and invoking Jensen's inequality:
$$\mathbb{E}\bigg[\frac{1}{n}\sum_{i=1}^n \xi_i\big(f_n(x_i) - f^*(x_i)\big)\bigg] \leq \mathbb{E}\bigg[\frac{1}{n}\sum_{i=1}^n |\xi_i|\,\big\|f^* - f_n\big\|_\infty\bigg] \leq \frac{1}{n}\sum_{i=1}^n \mathbb{E}[|\xi_i|]\,\delta \leq \frac{1}{n}\sum_{i=1}^n \sqrt{\mathbb{E}[|\xi_i|^2]}\,\delta \leq \delta\sigma.$$

Step (ii) holds since, by invoking the inequality $2ab \leq a^2 + b^2$, we have
$$\begin{aligned}
\|f^* - f_0\|_n &= \sqrt{\frac{1}{n}\sum_{i=1}^n \big(f^*(x_i) - f_n(x_i) + f_n(x_i) - f_0(x_i)\big)^2}\\
&\leq \sqrt{\frac{2}{n}\sum_{i=1}^n \big(f^*(x_i) - f_n(x_i)\big)^2 + \big(f_n(x_i) - f_0(x_i)\big)^2}\\
&\leq \sqrt{\frac{2}{n}\sum_{i=1}^n \big[\delta^2 + \big(f_n(x_i) - f_0(x_i)\big)^2\big]}\\
&\leq \sqrt{2}\,\big\|f_n - f_0\big\|_n + \sqrt{2}\,\delta.
\end{aligned}$$

Now observe that $\frac{\sum_{i=1}^n \xi_i(f^*(x_i) - f_0(x_i))}{\sqrt{n}\,\|f^* - f_0\|_n} \leq \max_j \frac{\sum_{i=1}^n \xi_i(f_j^*(x_i) - f_0(x_i))}{\sqrt{n}\,\|f_j^* - f_0\|_n}$, and each $\frac{\sum_{i=1}^n \xi_i(f_j^*(x_i) - f_0(x_i))}{\sqrt{n}\,\|f_j^* - f_0\|_n}$ is sub-Gaussian with parameter $\sigma^2$. Given the samples $S_n$, the quantity $\frac{\|f_n - f_0\|_n + \delta}{\sqrt{n}}$ is fixed. We then bound


$\mathbb{E}\Big[\max_j \frac{\sum_{i=1}^n \xi_i(f_j^*(x_i) - f_0(x_i))}{\sqrt{n}\,\|f_j^* - f_0\|_n}\,\Big|\,S_n\Big]$ by utilizing the moment generating function: for any $t > 0$, we have
$$\begin{aligned}
\mathbb{E}\bigg[\max_j \frac{\sum_{i=1}^n \xi_i(f_j^*(x_i) - f_0(x_i))}{\sqrt{n}\,\|f_j^* - f_0\|_n}\,\Big|\,S_n\bigg]
&= \frac{1}{t}\log\exp\bigg(t\,\mathbb{E}\bigg[\max_j \frac{\sum_{i=1}^n \xi_i(f_j^*(x_i) - f_0(x_i))}{\sqrt{n}\,\|f_j^* - f_0\|_n}\,\Big|\,S_n\bigg]\bigg)\\
&\leq \frac{1}{t}\log\mathbb{E}\bigg[\exp\bigg(t\max_j \frac{\sum_{i=1}^n \xi_i(f_j^*(x_i) - f_0(x_i))}{\sqrt{n}\,\|f_j^* - f_0\|_n}\bigg)\,\Big|\,S_n\bigg]\\
&\leq \frac{1}{t}\log\mathbb{E}\bigg[\sum_j \exp\bigg(t\,\frac{\sum_{i=1}^n \xi_i(f_j^*(x_i) - f_0(x_i))}{\sqrt{n}\,\|f_j^* - f_0\|_n}\bigg)\,\Big|\,S_n\bigg]\\
&\leq \frac{1}{t}\log\Big(\mathcal{N}(\delta,\mathcal{F}(R,\kappa,L,p,K),\|\cdot\|_\infty)\exp(t^2\sigma^2)\Big).
\end{aligned}$$

Taking $t = \sigma^{-1}\sqrt{\log\mathcal{N}(\delta,\mathcal{F}(R,\kappa,L,p,K),\|\cdot\|_\infty)}$, we have
$$\mathbb{E}\bigg[\max_j \frac{\sum_{i=1}^n \xi_i(f_j^*(x_i) - f_0(x_i))}{\sqrt{n}\,\|f_j^* - f_0\|_n}\bigg] \leq 2\sigma\sqrt{\log\mathcal{N}(\delta,\mathcal{F}(R,\kappa,L,p,K),\|\cdot\|_\infty)}.$$
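The choice of $t$ above is simply the optimizer of the generic bound; writing $N = \mathcal{N}(\delta,\mathcal{F}(R,\kappa,L,p,K),\|\cdot\|_\infty)$, the elementary calculation (spelled out here for completeness) is
$$\min_{t > 0}\Big\{\frac{\log N}{t} + t\sigma^2\Big\} = 2\sigma\sqrt{\log N}, \qquad \text{attained at } t = \frac{\sqrt{\log N}}{\sigma},$$
by the AM-GM inequality, which is exactly the bound displayed above.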

This in turn yields
$$\mathbb{E}\bigg[\frac{1}{n}\sum_{i=1}^n \xi_i f_n(x_i)\bigg] \leq 2\sqrt{2}\,\sigma\,\mathbb{E}\big[\|f_n - f_0\|_n + \delta\big]\sqrt{\frac{\log\mathcal{N}(\delta,\mathcal{F}(R,\kappa,L,p,K),\|\cdot\|_\infty)}{n}} + \delta\sigma.$$

Substituting back into (A.1), we have
$$\begin{aligned}
\mathbb{E}\big[\|f_n - f_0\|_n^2\big] &\leq \inf_{f \in \mathcal{F}(R,\kappa,L,p,K)}\mathbb{E}\big[(f(x) - f_0(x))^2\big] + 2\delta\sigma + 4\sqrt{2}\,\sigma\,\mathbb{E}\big[\|f_n - f_0\|_n + \delta\big]\sqrt{\frac{\log\mathcal{N}(\delta,\mathcal{F}(R,\kappa,L,p,K),\|\cdot\|_\infty)}{n}}\\
&\leq \inf_{f \in \mathcal{F}(R,\kappa,L,p,K)}\mathbb{E}\big[(f(x) - f_0(x))^2\big] + 4\sqrt{2}\,\sigma\sqrt{\mathbb{E}\big[\|f_n - f_0\|_n^2\big]}\sqrt{\frac{\log\mathcal{N}(\delta,\mathcal{F}(R,\kappa,L,p,K),\|\cdot\|_\infty)}{n}}\\
&\quad + 4\sqrt{2}\,\sigma\delta\sqrt{\frac{\log\mathcal{N}(\delta,\mathcal{F}(R,\kappa,L,p,K),\|\cdot\|_\infty)}{n}} + 2\delta\sigma.
\end{aligned}$$

We only need to consider $\log\mathcal{N}(\delta,\mathcal{F}(R,\kappa,L,p,K),\|\cdot\|_\infty) < n$; otherwise $\mathbb{E}\big[\|f_n - f_0\|_n^2\big]$ is naturally bounded by $4R^2$, and the upper bound is trivial for large $n$. We invoke the fact that $x^2 \leq 2ax + b$ implies $x^2 \leq 4a^2 + 2b$. Letting $x^2 = \mathbb{E}\big[\|f_n - f_0\|_n^2\big]$, $a = 2\sqrt{2}\,\sigma\sqrt{\frac{\log\mathcal{N}(\delta,\mathcal{F}(R,\kappa,L,p,K),\|\cdot\|_\infty)}{n}}$, and $b = \inf_{f \in \mathcal{F}(R,\kappa,L,p,K)}\mathbb{E}\big[(f(x) - f_0(x))^2\big] + (4\sqrt{2} + 2)\sigma\delta$, we have
$$T_1 \leq 2\,\mathbb{E}\big[\|f_n - f_0\|_n^2\big] \leq 4\inf_{f \in \mathcal{F}(R,\kappa,L,p,K)}\mathbb{E}\big[(f(x) - f_0(x))^2\big] + 64\sigma^2\,\frac{\log\mathcal{N}(\delta,\mathcal{F}(R,\kappa,L,p,K),\|\cdot\|_\infty)}{n} + (16\sqrt{2} + 8)\sigma\delta.$$
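The elementary fact invoked in the last step can be verified by completing the square (this short check is ours): for $x, a, b \geq 0$,
$$x^2 \leq 2ax + b \;\Longrightarrow\; (x - a)^2 \leq a^2 + b \;\Longrightarrow\; x \leq a + \sqrt{a^2 + b} \;\Longrightarrow\; x^2 \leq 2a^2 + 2(a^2 + b) = 4a^2 + 2b,$$
where the last implication uses $(u + v)^2 \leq 2u^2 + 2v^2$.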


C.2 Proof of Lemma 4.2

Proof. In the following proofs, we use a subscript to denote taking expectation with respect to a certain random variable. For example, $\mathbb{E}_x$, $\mathbb{E}_\xi$, and $\mathbb{E}_{(x,y)}$ denote expectations with respect to $x$, the noise, and the joint distribution of $(x, y)$, respectively. We rewrite $T_2$ as
$$T_2 = \mathbb{E}\bigg[\mathbb{E}_x[g(x)\,|\,S_n] - \frac{2}{n}\sum_{i=1}^n g(x_i)\bigg] = 2\,\mathbb{E}\bigg[\frac{1}{2}\mathbb{E}_x[g(x)\,|\,S_n] - \frac{1}{n}\sum_{i=1}^n g(x_i)\bigg] = 2\,\mathbb{E}\bigg[\mathbb{E}_x[g(x)\,|\,S_n] - \frac{1}{n}\sum_{i=1}^n g(x_i) - \frac{1}{2}\mathbb{E}_x[g(x)\,|\,S_n]\bigg].$$

We lower bound $\mathbb{E}_x[g(x)\,|\,S_n]$ by its second moment:
$$\mathbb{E}_x[g^2(x)\,|\,S_n] = \mathbb{E}_x\Big[\big(f_n(x) - f_0(x)\big)^4\,\big|\,S_n\Big] = \mathbb{E}_x\Big[\big(f_n(x) - f_0(x)\big)^2 g(x)\,\big|\,S_n\Big] \leq \mathbb{E}_x\big[4R^2 g(x)\,\big|\,S_n\big].$$
The last inequality follows from $\big|f_n(x) - f_0(x)\big| \leq 2R$. Now we cast $T_2$ into
$$T_2 \leq 2\,\mathbb{E}\bigg[\mathbb{E}_x[g(x)\,|\,S_n] - \frac{1}{n}\sum_{i=1}^n g(x_i) - \frac{1}{8R^2}\mathbb{E}_x[g^2(x)\,|\,S_n]\bigg]. \tag{A.2}$$

Introducing the second moment allows us to establish a fast convergence rate for $T_2$. Specifically, we denote by $\bar{x}_i$ independent copies of the $x_i$'s following the same distribution. We also denote
$$\mathcal{G} = \big\{g(x) = (f(x) - f_0(x))^2 \;\big|\; f \in \mathcal{F}(R,\kappa,L,p,K)\big\}$$
as the function class induced by $\mathcal{F}(R,\kappa,L,p,K)$. Then we rewrite (A.2) as

$$\begin{aligned}
T_2 &\leq 2\,\mathbb{E}\bigg[\sup_{g \in \mathcal{G}}\;\mathbb{E}_x[g(x)] - \frac{1}{n}\sum_{i=1}^n g(x_i) - \frac{1}{8R^2}\mathbb{E}_x[g^2(x)]\bigg]\\
&\leq 2\,\mathbb{E}\bigg[\sup_{g \in \mathcal{G}}\;\mathbb{E}_{\bar{x}}\bigg[\frac{1}{n}\sum_{i=1}^n g(\bar{x}_i) - g(x_i)\bigg] - \frac{1}{16R^2}\mathbb{E}_{x,\bar{x}}\big[g^2(x) + g^2(\bar{x})\big]\bigg]\\
&\overset{(i)}{\leq} 2\,\mathbb{E}_{x,\bar{x},U}\bigg[\sup_{g \in \mathcal{G}}\;\frac{1}{n}\sum_{i=1}^n U_i\big(g(\bar{x}_i) - g(x_i)\big) - \frac{1}{16R^2}\mathbb{E}_{x,\bar{x}}\big[g^2(x) + g^2(\bar{x})\big]\bigg].
\end{aligned}$$
Here the $U_i$'s are i.i.d. Rademacher random variables, i.e., $\mathbb{P}(U_i = 1) = \mathbb{P}(U_i = -1) = \frac{1}{2}$, independent of the samples $x_i$'s and $\bar{x}_i$'s, and the inequality (i) follows from symmetrization.


We discretize $\mathcal{G}$ with respect to the $\ell_\infty$ norm. The $\delta$-covering number is denoted by $\mathcal{N}(\delta,\mathcal{G},\|\cdot\|_\infty)$ and the elements of the covering are denoted by $\mathcal{G}^* = \{g_j^*\}_{j=1}^{\mathcal{N}(\delta,\mathcal{G},\|\cdot\|_\infty)}$; that is, for any $g \in \mathcal{G}$, there exists a $g^*$ satisfying $\|g - g^*\|_\infty \leq \delta$.

We replace $g \in \mathcal{G}$ by $g^* \in \mathcal{G}^*$ in bounding $T_2$, which then boils down to deriving concentration results on a finite concept class. Specifically, for $g^*$ satisfying $\|g - g^*\|_\infty \leq \delta$, we have
$$U_i\big(g(\bar{x}_i) - g(x_i)\big) = U_i\big(g(\bar{x}_i) - g^*(\bar{x}_i)\big) + U_i\big(g^*(\bar{x}_i) - g^*(x_i)\big) + U_i\big(g^*(x_i) - g(x_i)\big) \leq U_i\big(g^*(\bar{x}_i) - g^*(x_i)\big) + 2\delta.$$

We also have
$$\begin{aligned}
g^2(x) + g^2(\bar{x}) &= \big[g^2(x) - (g^*)^2(x)\big] + \big[(g^*)^2(x) + (g^*)^2(\bar{x})\big] - \big[(g^*)^2(\bar{x}) - g^2(\bar{x})\big]\\
&= (g^*)^2(x) + (g^*)^2(\bar{x}) + \big(g(x) - g^*(x)\big)\big(g(x) + g^*(x)\big) + \big(g(\bar{x}) - g^*(\bar{x})\big)\big(g(\bar{x}) + g^*(\bar{x})\big)\\
&\geq (g^*)^2(x) + (g^*)^2(\bar{x}) - |g(x) - g^*(x)|\,|g(x) + g^*(x)| - |g(\bar{x}) - g^*(\bar{x})|\,|g(\bar{x}) + g^*(\bar{x})|\\
&\geq (g^*)^2(x) + (g^*)^2(\bar{x}) - 2R\delta - 2R\delta.
\end{aligned}$$

Plugging the above two items into (A.2), we upper bound $T_2$ as
$$\begin{aligned}
T_2 &\leq 2\,\mathbb{E}_{x,\bar{x},U}\bigg[\sup_{g^* \in \mathcal{G}^*}\;\frac{1}{n}\sum_{i=1}^n U_i\big(g^*(\bar{x}_i) - g^*(x_i)\big) - \frac{1}{16R^2}\mathbb{E}_{x,\bar{x}}\big[(g^*)^2(x) + (g^*)^2(\bar{x})\big]\bigg] + \Big(4 + \frac{1}{2R}\Big)\delta\\
&= 2\,\mathbb{E}_{x,\bar{x},U}\bigg[\max_j\;\frac{1}{n}\sum_{i=1}^n U_i\big(g_j^*(\bar{x}_i) - g_j^*(x_i)\big) - \frac{1}{16R^2}\mathbb{E}_{x,\bar{x}}\big[(g_j^*)^2(x) + (g_j^*)^2(\bar{x})\big]\bigg] + \Big(4 + \frac{1}{2R}\Big)\delta.
\end{aligned}$$

Denote $h_j(i) = U_i\big(g_j^*(\bar{x}_i) - g_j^*(x_i)\big)$. By symmetry, it is straightforward to see that $\mathbb{E}[h_j(i)] = 0$. The variance of $h_j(i)$ is computed as
$$\mathrm{Var}[h_j(i)] = \mathbb{E}\big[h_j^2(i)\big] = \mathbb{E}\Big[U_i^2\big(g_j^*(\bar{x}_i) - g_j^*(x_i)\big)^2\Big] \overset{(i)}{\leq} 2\,\mathbb{E}\big[(g_j^*)^2(\bar{x}_i) + (g_j^*)^2(x_i)\big].$$
The last inequality (i) uses the inequality $(a - b)^2 \leq 2(a^2 + b^2)$. Therefore, we derive the following upper bound for $T_2$,

$$T_2 \leq 2\,\mathbb{E}\bigg[\max_j\;\frac{1}{n}\sum_{i=1}^n h_j(i) - \frac{1}{32R^2}\cdot\frac{1}{n}\sum_{i=1}^n \mathrm{Var}[h_j(i)]\bigg] + \Big(4 + \frac{1}{2R}\Big)\delta.$$

We invoke the moment generating function to bound $T_2$. Note that we have $\|h_j\|_\infty \leq (2R)^2$. Then by Taylor expansion, for $0 < t < \frac{3}{4R^2}$, we have
$$\begin{aligned}
\mathbb{E}[\exp(t h_j(i))] &= \mathbb{E}\bigg[1 + t h_j(i) + \sum_{k=2}^\infty \frac{t^k h_j^k(i)}{k!}\bigg]\\
&\leq \mathbb{E}\bigg[1 + t h_j(i) + \sum_{k=2}^\infty \frac{t^k h_j^2(i)(4R^2)^{k-2}}{2\times 3^{k-2}}\bigg]\\
&= \mathbb{E}\bigg[1 + t h_j(i) + \frac{t^2 h_j^2(i)}{2}\sum_{k=2}^\infty \frac{t^{k-2}(4R^2)^{k-2}}{3^{k-2}}\bigg]\\
&= \mathbb{E}\bigg[1 + t h_j(i) + \frac{t^2 h_j^2(i)}{2}\cdot\frac{1}{1 - 4tR^2/3}\bigg]\\
&= 1 + t^2\,\mathrm{Var}[h_j(i)]\,\frac{1}{2 - 8tR^2/3} \overset{(i)}{\leq} \exp\bigg(\mathrm{Var}[h_j(i)]\,\frac{3t^2}{6 - 8tR^2}\bigg).
\end{aligned} \tag{A.3}$$
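The second line of (A.3) uses two elementary facts, spelled out here for completeness (this expansion is ours): $|h_j(i)| \leq 4R^2$ gives $h_j^k(i) \leq h_j^2(i)(4R^2)^{k-2}$ for $k \geq 2$, and $k! \geq 2\times 3^{k-2}$, which holds at $k = 2$ and follows inductively from $(k+1)! = (k+1)\,k! \geq 3\,k!$. The fourth line then sums the geometric series
$$\sum_{k=2}^{\infty} \frac{t^{k-2}(4R^2)^{k-2}}{3^{k-2}} = \frac{1}{1 - 4tR^2/3},$$
which converges precisely because $0 < t < \frac{3}{4R^2}$.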

Step (i) follows from the fact that $1 + x \leq \exp(x)$ for $x \geq 0$. Given (A.3), we proceed to bound $T_2$. To ease the presentation, we temporarily neglect the $\big(4 + \frac{1}{2R}\big)\delta$ term and denote $T_2' = T_2 - \big(4 + \frac{1}{2R}\big)\delta$. Then for $0 < t/n < \frac{3}{4R^2}$, we have

$$\begin{aligned}
\exp\Big(\frac{tT_2'}{2}\Big) &= \exp\bigg(t\,\mathbb{E}\bigg[\max_j\;\frac{1}{n}\sum_{i=1}^n h_j(i) - \frac{1}{32R^2}\cdot\frac{1}{n}\sum_{i=1}^n \mathrm{Var}[h_j(i)]\bigg]\bigg)\\
&\overset{(i)}{\leq} \mathbb{E}\bigg[\exp\bigg(t\,\sup_h\bigg\{\frac{1}{n}\sum_{i=1}^n h(i) - \frac{1}{32R^2}\cdot\frac{1}{n}\sum_{i=1}^n \mathrm{Var}[h(i)]\bigg\}\bigg)\bigg]\\
&\leq \mathbb{E}\bigg[\sum_h \exp\bigg(\frac{t}{n}\sum_{i=1}^n h(i) - \frac{1}{32R^2}\cdot\frac{t}{n}\sum_{i=1}^n \mathrm{Var}[h(i)]\bigg)\bigg]\\
&\overset{(ii)}{\leq} \mathbb{E}\bigg[\sum_h \exp\bigg(\sum_{i=1}^n \mathrm{Var}[h(i)]\,\frac{3(t/n)^2}{6 - 8tR^2/n} - \frac{1}{32R^2}\cdot\frac{t}{n}\sum_{i=1}^n \mathrm{Var}[h(i)]\bigg)\bigg]\\
&= \mathbb{E}\bigg[\sum_h \exp\bigg(\sum_{i=1}^n \frac{t}{n}\,\mathrm{Var}[h(i)]\Big(\frac{3t/n}{6 - 8tR^2/n} - \frac{1}{32R^2}\Big)\bigg)\bigg],
\end{aligned}$$
where $h$ ranges over the finite collection $\{h_j\}_j$.

Step (i) follows from Jensen's inequality, and step (ii) invokes (A.3). We now choose $t$ so that $\frac{3t/n}{6 - 8tR^2/n} - \frac{1}{32R^2} = 0$, which yields $t = \frac{3n}{52R^2} < \frac{3n}{4R^2}$. Substituting this choice of $t$ into $\exp(tT_2'/2)$, we have
$$\frac{tT_2'}{2} \leq \log\sum_h \exp(0) \quad\Longrightarrow\quad T_2' \leq \frac{2}{t}\log\mathcal{N}(\delta,\mathcal{G},\|\cdot\|_\infty) = \frac{52R^2}{3n}\log\mathcal{N}(\delta,\mathcal{G},\|\cdot\|_\infty).$$

To complete the proof, we relate the covering number of $\mathcal{G}$ to that of $\mathcal{F}(R,\kappa,L,p,K)$. Consider any $g_1, g_2 \in \mathcal{G}$ with $g_1 = (f_1 - f_0)^2$ and $g_2 = (f_2 - f_0)^2$, respectively, for $f_1, f_2 \in \mathcal{F}(R,\kappa,L,p,K)$. We can derive
$$\|g_1 - g_2\|_\infty = \sup_x \big|(f_1(x) - f_0(x))^2 - (f_2(x) - f_0(x))^2\big| = \sup_x |f_1(x) - f_2(x)|\,|f_1(x) + f_2(x) - 2f_0(x)| \leq 4R\,\|f_1 - f_2\|_\infty.$$

The above characterization immediately implies $\mathcal{N}(\delta,\mathcal{G},\|\cdot\|_\infty) \leq \mathcal{N}(\delta/4R,\mathcal{F}(R,\kappa,L,p,K),\|\cdot\|_\infty)$. Therefore, we derive the desired upper bound on $T_2$:
$$T_2 \leq \frac{52R^2}{3n}\log\mathcal{N}(\delta/4R,\mathcal{F}(R,\kappa,L,p,K),\|\cdot\|_\infty) + \Big(4 + \frac{1}{2R}\Big)\delta.$$

C.3 Proof of Theorem 3.1

Proof. We decompose $\mathbb{E}\big[\mathbb{E}_x\big[\big(f_n(x) - f_0(x)\big)^2\,\big|\,S_n\big]\big]$ using its empirical counterpart:
$$\mathbb{E}\Big[\mathbb{E}_x\big[\big(f_n(x) - f_0(x)\big)^2\,\big|\,S_n\big]\Big] = \underbrace{\mathbb{E}\bigg[\frac{2}{n}\sum_{i=1}^n \big(f_n(x_i) - f_0(x_i)\big)^2\bigg]}_{T_1} + \underbrace{\mathbb{E}\Big[\mathbb{E}_x\big[\big(f_n(x) - f_0(x)\big)^2\,\big|\,S_n\big]\Big] - \mathbb{E}\bigg[\frac{2}{n}\sum_{i=1}^n \big(f_n(x_i) - f_0(x_i)\big)^2\bigg]}_{T_2}.$$

Combining the upper bounds on $T_1$ and $T_2$, we can derive
$$\begin{aligned}
\mathbb{E}\Big[\mathbb{E}_x\big[\big(f_n(x) - f_0(x)\big)^2\,\big|\,S_n\big]\Big] &\leq 4\inf_{f \in \mathcal{F}(R,\kappa,L,p,K)}\mathbb{E}\big[(f(x) - f_0(x))^2\big] + \frac{52R^2 + 192\sigma^2}{3n}\log\mathcal{N}(\delta/4R,\mathcal{F}(R,\kappa,L,p,K),\|\cdot\|_\infty)\\
&\quad + \Big(4 + \frac{1}{2R} + (16\sqrt{2} + 8)\sigma\Big)\delta.
\end{aligned}$$

The remaining step is to characterize the covering number $\mathcal{N}(\delta/4R,\mathcal{F}(R,\kappa,L,p,K),\|\cdot\|_\infty)$. To construct a covering of $\mathcal{F}(R,\kappa,L,p,K)$, we discretize each weight parameter by a uniform grid with grid size $h$. Recall that we write $f \in \mathcal{F}(R,\kappa,L,p,K)$ as $f = W_L \cdot \mathrm{ReLU}(W_{L-1}\cdots \mathrm{ReLU}(W_1 x + b_1)\cdots + b_{L-1}) + b_L$. Let $f, f' \in \mathcal{F}$ have all their weight parameters within $h$ of each other. Denoting the weight parameters in $f$ and $f'$ by $W_L, \dots, W_1, b_L, \dots, b_1$ and $W'_L, \dots, W'_1, b'_L, \dots, b'_1$, respectively, we bound the $\ell_\infty$ difference $\|f - f'\|_\infty$ as
$$\begin{aligned}
\big\|f - f'\big\|_\infty &= \big\|W_L \cdot \mathrm{ReLU}(W_{L-1}\cdots \mathrm{ReLU}(W_1 x + b_1)\cdots + b_{L-1}) + b_L\\
&\qquad - \big(W'_L \cdot \mathrm{ReLU}(W'_{L-1}\cdots \mathrm{ReLU}(W'_1 x + b'_1)\cdots + b'_{L-1}) + b'_L\big)\big\|_\infty\\
&\leq \big\|b_L - b'_L\big\|_\infty + \big\|W_L - W'_L\big\|_1 \big\|W_{L-1}\cdots \mathrm{ReLU}(W_1 x + b_1)\cdots + b_{L-1}\big\|_\infty\\
&\quad + \|W_L\|_1 \big\|W_{L-1}\cdots \mathrm{ReLU}(W_1 x + b_1)\cdots + b_{L-1} - \big(W'_{L-1}\cdots \mathrm{ReLU}(W'_1 x + b'_1)\cdots + b'_{L-1}\big)\big\|_\infty\\
&\leq h + hp\,\big\|W_{L-1}\cdots \mathrm{ReLU}(W_1 x + b_1)\cdots + b_{L-1}\big\|_\infty\\
&\quad + \kappa p\,\big\|W_{L-1}\cdots \mathrm{ReLU}(W_1 x + b_1)\cdots + b_{L-1} - \big(W'_{L-1}\cdots \mathrm{ReLU}(W'_1 x + b'_1)\cdots + b'_{L-1}\big)\big\|_\infty.
\end{aligned}$$

We derive the following bound on $\|W_{L-1}\cdots \mathrm{ReLU}(W_1 x + b_1)\cdots + b_{L-1}\|_\infty$:
$$\begin{aligned}
\big\|W_{L-1}\cdots \mathrm{ReLU}(W_1 x + b_1)\cdots + b_{L-1}\big\|_\infty &\leq \big\|W_{L-1}(\cdots \mathrm{ReLU}(W_1 x + b_1)\cdots)\big\|_\infty + \|b_{L-1}\|_\infty\\
&\leq \|W_{L-1}\|_1 \big\|W_{L-2}(\cdots \mathrm{ReLU}(W_1 x + b_1)\cdots) + b_{L-2}\big\|_\infty + \kappa\\
&\leq \kappa p\,\big\|W_{L-2}(\cdots \mathrm{ReLU}(W_1 x + b_1)\cdots) + b_{L-2}\big\|_\infty + \kappa\\
&\overset{(i)}{\leq} (\kappa p)^{L-1} B + \kappa\sum_{i=0}^{L-3} (\kappa p)^i\\
&\leq (\kappa p)^{L-1} B + \kappa(\kappa p)^{L-2},
\end{aligned}$$
where (i) is obtained by induction and $\|x\|_\infty \leq B$. The last inequality holds since $\kappa p > 1$.

Substituting back into the bound for $\|f - f'\|_\infty$, we have
$$\begin{aligned}
\big\|f - f'\big\|_\infty &\leq \kappa p\,\big\|W_{L-1}\cdots \mathrm{ReLU}(W_1 x + b_1)\cdots + b_{L-1} - \big(W'_{L-1}\cdots \mathrm{ReLU}(W'_1 x + b'_1)\cdots + b'_{L-1}\big)\big\|_\infty\\
&\quad + h + hp\big[(\kappa p)^{L-1} B + \kappa(\kappa p)^{L-2}\big]\\
&\leq \kappa p\,\big\|W_{L-1}\cdots \mathrm{ReLU}(W_1 x + b_1)\cdots + b_{L-1} - \big(W'_{L-1}\cdots \mathrm{ReLU}(W'_1 x + b'_1)\cdots + b'_{L-1}\big)\big\|_\infty\\
&\quad + h(pB + 2)(\kappa p)^{L-1}\\
&\overset{(i)}{\leq} (\kappa p)^{L-1}\big\|W_1 x + b_1 - W'_1 x - b'_1\big\|_\infty + h(L-1)(pB + 2)(\kappa p)^{L-1}\\
&\leq hL(pB + 2)(\kappa p)^{L-1},
\end{aligned}$$
where (i) is obtained by induction. We choose $h$ satisfying $hL(pB + 2)(\kappa p)^{L-1} = \frac{\delta}{4R}$. Then discretizing each parameter uniformly into $\kappa/h$ grid points yields a $\frac{\delta}{4R}$-covering of $\mathcal{F}$. Therefore, the covering number is upper bounded by
$$\mathcal{N}(\delta/4R,\mathcal{F}(R,\kappa,L,p,K),\|\cdot\|_\infty) \leq \Big(\frac{\kappa}{h}\Big)^{\#\text{ of parameters}}. \tag{A.4}$$
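The parameter-perturbation bound underlying (A.4) can be checked empirically. The sketch below (ours, purely illustrative; it uses a generic random ReLU network rather than the specific architecture of Theorem 3.2) perturbs every weight of a small network by at most h and compares the observed change in the output with $hL(pB+2)(\kappa p)^{L-1}$:

import numpy as np

rng = np.random.default_rng(0)
L, p, D, B, kappa, h = 4, 8, 3, 1.0, 0.5, 1e-3

# Random network with entries bounded by kappa: W1 maps R^D -> R^p, hidden layers are p x p.
dims = [D] + [p] * (L - 1) + [1]
Ws = [rng.uniform(-kappa, kappa, size=(dims[i + 1], dims[i])) for i in range(L)]
bs = [rng.uniform(-kappa, kappa, size=dims[i + 1]) for i in range(L)]

def forward(Ws, bs, x):
    for W, b in zip(Ws[:-1], bs[:-1]):
        x = np.maximum(W @ x + b, 0.0)   # ReLU layers
    return Ws[-1] @ x + bs[-1]           # final affine layer

# Perturb every parameter by at most h and measure the output change on random inputs in [-B, B]^D.
Wp = [W + rng.uniform(-h, h, size=W.shape) for W in Ws]
bp = [b + rng.uniform(-h, h, size=b.shape) for b in bs]

gap = max(abs(forward(Ws, bs, x)[0] - forward(Wp, bp, x)[0])
          for x in rng.uniform(-B, B, size=(2000, D)))
bound = h * L * (p * B + 2) * (kappa * p) ** (L - 1)
print("observed gap: %.2e   theoretical bound: %.2e" % (gap, bound))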

By our choice of $\mathcal{F}(R,\kappa,L,p,K)$, there exists a network in the class which yields $f$ satisfying $\|f - f_0\|_\infty \leq \varepsilon$. Such a network consists of $O\big(\log\frac{1}{\varepsilon}\big)$ layers and $O\big(\varepsilon^{-\frac{d}{s+\alpha}}\log\frac{1}{\varepsilon}\big)$ weight parameters. Then we have
$$\mathbb{E}\Big[\mathbb{E}_x\big[\big(f_n(x) - f_0(x)\big)^2\,\big|\,S_n\big]\Big] \leq O\bigg(4\varepsilon^2 + \frac{R^2 + \sigma^2}{n}\,\varepsilon^{-\frac{d}{s+\alpha}}\log\frac{1}{\varepsilon}\log\frac{\kappa}{h} + \delta\bigg).$$


Now we choose $\varepsilon$ to satisfy $\varepsilon^2 = \frac{1}{n}\varepsilon^{-\frac{d}{s+\alpha}}$, which gives $\varepsilon = n^{-\frac{s+\alpha}{d+2(s+\alpha)}}$. It suffices to pick $\delta = \frac{1}{n}$; then we have the desired estimation error bound
$$\mathbb{E}\Big[\mathbb{E}_x\big[\big(f_n(x) - f_0(x)\big)^2\,\big|\,S_n\big]\Big] \leq 4n^{-\frac{2s+2\alpha}{d+2s+2\alpha}} + (R^2 + \sigma^2)\,n^{-\frac{2s+2\alpha}{d+2s+2\alpha}}\log n\,\log\frac{\kappa}{h} + \frac{1}{n} \leq c(R^2 + \sigma^2)\,n^{-\frac{2s+2\alpha}{d+2s+2\alpha}}\log^3 n,$$
where $c$ depends on $d, s, f_0, \tau$, and $\log D$. □
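As a quick numerical reading of the final rate (ours, purely illustrative), the exponent depends only on the intrinsic dimension d; replacing d by the ambient dimension D would make the rate essentially vanish:

# Rate exponent 2(s+alpha) / (2(s+alpha) + dim) for the squared error, as in the bound above.
def exponent(s, alpha, dim):
    return 2 * (s + alpha) / (2 * (s + alpha) + dim)

s, alpha, d, D = 1, 0.5, 2, 1000
print("intrinsic-dimension rate:  n^(-%.3f)" % exponent(s, alpha, d))   # n^(-0.600)
print("ambient-dimension rate:    n^(-%.5f)" % exponent(s, alpha, D))   # n^(-0.00299), nearly flat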


List of Figures

1   Practical network sizes for the ImageNet data set (Tan and Le, 2019) versus the required size predicted by existing theories (Yarotsky, 2017).
2   Manifolds with large and small reaches.
3   The ReLU network identified by Theorem 3.2.
4   Curvature decides the number of charts: a smaller reach requires more charts.
5   Projecting $x_j$ using a matched chart (blue) $(U_j, \phi_j)$, and an unmatched chart (green) $(U_i, \phi_i)$.
6   Chart determination utilizes the composition of the approximated distance function $d_i^2$ and the indicator function $1_\Delta$.
7   Locally approximate $f$ in each chart $(U_i, \phi_i)$ using Taylor polynomials.
8   A geometric illustration of $\theta$ and $\ell$.
A.9   Illustration of the construction of $\zeta_m$ on the $k$-th coordinate.
A.10   Illustration of $\ell$ and $\theta$ along a parametric curve $\gamma$.