High-Dimensional Neural Networks: Statistical and Computational Properties
Andrew R. Barron
YALE UNIVERSITY
DEPARTMENT OF STATISTICS
Presentation, 27 September, 2016
International Conference on Operational Research
KOI 2016, Osijek, Croatia
Joint work with Jason Klusowski
Outline
Flexible high-dimensional function estimation using sinusoid, sigmoid, ramp or polynomial activation functions
Approximation and estimation bounds that reveal the effect of dimension, model size, and sample size
Computation with greedy term selection:
Adaptive Annealing Method (for general designs) to guide the parameter search
Nonlinear Power Method (for specific designs) to improve upon tensor methods for parameter estimation
Andrew Barron High-Dim Neural Nets: Statistics and Computation
Example papers for some of what is to follow
Papers illustrating my background addressing these questions of high dimensionality (available from www.stat.yale.edu):
A. R. Barron, R. L. Barron (1988). Statistical learning networks: a unifying view. Computing Science & Statistics: Proc. 20th Symp. on the Interface, ASA, pp. 192-203.
A. R. Barron (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, Vol. 39, pp. 930-944.
A. R. Barron, A. Cohen, W. Dahmen, R. DeVore (2008). Approximation and learning by greedy algorithms. Annals of Statistics, Vol. 36, pp. 64-94.
J. M. Klusowski, A. R. Barron (2016). Risk bounds for high-dimensional ridge function combinations including neural networks. Submitted to the Annals of Statistics. arXiv:1607.01434v1
Data Setting
Data: (X_i, Y_i), i = 1, 2, ..., n
Inputs: explanatory variable vectors (arbitrary dependence)
X_i = (X_{i,1}, X_{i,2}, ..., X_{i,d})
Domain: cube in R^d
Random design: independent X_i ~ P
Output: response variable Y_i in R, bounded or subgaussian
Relationship: E[Y_i | X_i] = f(X_i), as in:
Perfect observation: Y_i = f(X_i)
Noisy observation: Y_i = f(X_i) + ε_i with ε_i independent
Function: f(x) unknown
Activation functions
Activation functions denoted φ(z) or g(z)
Piecewise constant: 1{z − b ≥ 0} or sgn(z − b)
Sigmoid: (e^z − e^{−z})/(e^z + e^{−z})
Linear spline, ramp: (z − b)_+
Sinusoidal: cos(2πfz), sin(2πfz)
Polynomial: standard z^ℓ, Hermite H_ℓ(z)
Multivariate activation functions are built from products or from ridge forms φ(aᵀx), constructed using a univariate function φ and an internal parameter vector a of dimension d.
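As a concrete sketch (function names and the example values are ours), the univariate activations above and a ridge unit built from one of them can be written as:

```python
import numpy as np

# Univariate activation functions from the list above (a sketch; names are ours).
def step(z, b=0.0):        # piecewise constant 1{z - b >= 0}
    return (z - b >= 0).astype(float)

def sigmoid(z):            # (e^z - e^-z)/(e^z + e^-z), i.e. tanh
    return np.tanh(z)

def ramp(z, b=0.0):        # linear spline (z - b)_+
    return np.maximum(z - b, 0.0)

def sinusoid(z, f=1.0):    # cos(2 pi f z)
    return np.cos(2 * np.pi * f * z)

# A multivariate ridge unit phi(a^T x) built from a univariate phi.
def ridge(phi, a, X):
    """Evaluate phi(a^T x) for each row x of X (an n-by-d design matrix)."""
    return phi(X @ a)

X = np.array([[1.0, 2.0], [0.5, -1.0]])
a = np.array([1.0, 1.0])
print(ridge(ramp, a, X))   # ramp of [3.0, -0.5] -> [3.0, 0.0]
```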
Product and Ridge Bases
Product bases: for polynomials, sinusoids, splines, using continuous powers, frequencies or thresholds:
φ(x, a) = φ(x_1, a_1) φ(x_2, a_2) ··· φ(x_d, a_d)
Ridge bases: for projection pursuit regression, sinusoids, neural networks and polynomial networks:
φ(x, a) = φ(aᵀx) = φ(a_1 x_1 + a_2 x_2 + ... + a_d x_d)
Flexible multivariate function modeling
Internally parameterized models & nonlinear least squares
Functions f_m(x) = ∑_{j=1}^m c_j φ(x, a_j) in the span of a parameterized dictionary Φ = {φ_a(·) = φ(·, a) : a ∈ A} with parameter set A ⊂ R^d
Flexible function approximation
Statistically challenging
Computationally challenging
Notation
Response vector: Y = (Y_i)_{i=1}^n in R^n
Dictionary vectors: Φ^{(n)} = { (φ(X_i, a))_{i=1}^n : a ∈ A } ⊂ R^n
Sample squared norm: ‖f‖²_{(n)} = (1/n) ∑_{i=1}^n f²(X_i)
Population squared norm: ‖f‖² = ∫ f²(x) P(dx)
Normalized dictionary condition: ‖φ‖ ≤ 1 for φ ∈ Φ
Flexible m-term nonlinear optimization
Impractical complete nonlinear least squares optimization
Sample version: f_m achieves min over (a_j, c_j)_{j=1}^m of ‖Y − ∑_{j=1}^m c_j φ_{a_j}‖²_{(n)}
Population version: f_m achieves min over (a_j, c_j)_{j=1}^m of ‖f − ∑_{j=1}^m c_j φ_{a_j}‖²
Optimization of (a_j, c_j)_{j=1}^m in R^{(d+1)m}.
GREEDY OPTIMIZATIONS
Optimize one term at a time
Step 1: Choose a_1, c_1 to produce a single best term:
sample version: min_{a,c} ‖Y − c φ_a‖²_{(n)}
population version: min_{a,c} ‖f − c φ_a‖²
Step m > 1: Arrange f_m = α f_{m−1} + c φ_a with a_m, c_m, α_m providing the term with best improvement:
sample version: min_{a,c,α} ‖Y − α f_{m−1} − c φ_a‖²_{(n)}
population version: min_{a,c,α} ‖f − α f_{m−1} − c φ_a‖²
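A minimal sketch of this one-term-at-a-time scheme (a toy implementation of our own: random candidate directions stand in for the guided parameter searches discussed later in the talk):

```python
import numpy as np

def greedy_fit(X, Y, m, n_candidates=200, seed=0):
    """One-term-at-a-time (relaxed greedy) fit with ridge terms tanh(a^T x).
    Candidate directions a are drawn at random here, a simple stand-in for
    the guided searches discussed later in the talk."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    fit = np.zeros(n)
    for _ in range(m):
        best_sse, best_fit = None, None
        for _ in range(n_candidates):
            a = rng.standard_normal(d)
            phi = np.tanh(X @ a)
            # best (alpha, c) for Y ~ alpha*fit + c*phi, by 2-d least squares
            B = np.column_stack([fit, phi])
            coef, *_ = np.linalg.lstsq(B, Y, rcond=None)
            sse = float(np.sum((Y - B @ coef) ** 2))
            if best_sse is None or sse < best_sse:
                best_sse, best_fit = sse, B @ coef
        fit = best_fit
    return fit

rng = np.random.default_rng(42)
X = rng.standard_normal((100, 3))
Y = np.tanh(X @ np.array([1.0, -2.0, 0.5]))
mse = {m: float(np.mean((Y - greedy_fit(X, Y, m)) ** 2)) for m in (1, 3)}
print(mse)  # the fit improves as terms are added
```

Because each step's least squares includes keeping the previous fit unchanged, the residual is non-increasing in the number of terms.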
Acceptable variants of greedy optimization
Step m > 1: Arrange f_m = α f_{m−1} + c φ_a with a_m, c_m, α_m providing the term with best improvement:
least squares: min_{a,c,α} ‖Y − α f_{m−1} − c φ_a‖²_{(n)}
Acceptable variants:
Inner-product maximization: with Res_i = Y_i − f_{m−1}(X_i), choose a_m to achieve max_a ∑_{i=1}^n Res_i φ(X_i, a) to within a constant factor
Forward stepwise selection: given a_1, ..., a_{m−1}, choose a_m to obtain min_a d(Y, span{φ_{a_1}, ..., φ_{a_{m−1}}, φ_a})
Orthogonal matching pursuit: project onto the span after choosing the inner-product maximizer
ℓ1 penalization: update v_m = α_m v_{m−1} + c_m; choose α_m, c_m to achieve min_{α,c} ‖Y − α f_{m−1} − c φ_a‖²_{(n)} + λ[α v_{m−1} + c]
Basic m-term approximation and computation bounds
For complete or greedy approximation (B. 1993, Lee et al. 1995)
Population version:
‖f − f_m‖ ≤ ‖f‖_Φ / √m, and moreover
‖f − f_m‖² ≤ inf_g { ‖f − g‖² + 2‖g‖²_Φ / m }
Sample version:
‖Y − f_m‖²_{(n)} ≤ ‖Y − f‖²_{(n)} + 2‖f‖²_Φ / m
Optimization with ℓ1 penalization:
‖Y − f_m‖²_{(n)} + λ‖f_m‖_Φ ≤ inf_g { ‖Y − g‖²_{(n)} + λ‖g‖_Φ + 2‖g‖²_Φ / m }
Here ‖g‖_Φ is the minimal ℓ1 norm of coefficients of g in the span of Φ
The norm of f with respect to the dictionary Φ
Minimal ℓ1 norm on coefficients in approximation of f:
‖f‖_Φ = lim_{ε→0} inf{ ∑_j |c_j| : ‖∑_j c_j φ_{a_j} − f‖ ≤ ε }
called the variation of f with respect to Φ (B. 1991)
‖f‖_Φ = V_Φ(f) = inf{ V : f/V ∈ closure(conv(±Φ)) }
It appears in the bound ‖f − f_m‖ ≤ ‖f‖_Φ / √m
Later also called the atomic norm of f with respect to Φ
In the case of the signum activation function it matches the minimal range of neural nets arbitrarily well approximating f
Greedy proof of the approximation bound:
Consider the case ‖f‖_Φ = 1 and take Φ to be closed under sign changes. The minimum over φ is not more than the average over φ, so take the average with respect to the weights representing f:
‖f − f_m‖² ≤ min_φ ‖f − (1−λ) f_{m−1} − λφ‖²
≤ ave_φ ‖f − (1−λ) f_{m−1} − λφ‖²
≤ (1−λ)² ‖f − f_{m−1}‖² + λ²
where the cross term averages to zero (the weighted average of φ is f) and ave_φ ‖f − φ‖² ≤ 1. The bound follows by induction with λ = 1/m:
‖f − f_m‖² ≤ 1/m
Jones (AS 1992), B. (IT 1993); extensions: Lee et al. (IT 1995), DeVore et al. (AS 2008)
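The induction can be checked numerically on a toy finite dictionary of our own (unit vectors, closed under sign change, with f a convex combination of atoms so that ‖f‖_Φ ≤ 1):

```python
import numpy as np

rng = np.random.default_rng(1)

# Finite dictionary of unit vectors in R^D, closed under sign changes,
# and a target f in the convex hull (so ||f||_Phi <= 1). Toy setup of ours.
D, N = 20, 50
Phi = rng.standard_normal((N, D))
Phi /= np.linalg.norm(Phi, axis=1, keepdims=True)
Phi = np.vstack([Phi, -Phi])                 # closure under sign change
w = rng.dirichlet(np.ones(N))
f = w @ Phi[:N]                              # convex combination of atoms

fm = np.zeros(D)
for m in range(1, 21):
    lam = 1.0 / m
    # candidate updates (1-lam)*f_{m-1} + lam*phi over all atoms phi
    cand = (1 - lam) * fm + lam * Phi
    errs = np.linalg.norm(cand - f, axis=1)
    fm = cand[np.argmin(errs)]               # best single-atom improvement
    # the guarantee ||f - f_m||^2 <= 1/m from the induction above
    assert np.sum((f - fm) ** 2) <= 1.0 / m + 1e-12
print("greedy error^2 after 20 steps:", float(np.sum((f - fm) ** 2)))
```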
Relating the variation to a spectral norm of f
Neural net approximation using the Fourier representation
f(x) = ∫ e^{iω·x} f̂(ω) dω
L1 spectral norm: ‖f‖_{spectrum,s} = ∫_{R^d} |f̂(ω)| ‖ω‖_1^s dω
Let Φ be the dictionary for sinusoids, sigmoids or ramps. With unit cube domain for x, the variation ‖f‖_Φ satisfies
‖f‖_sinusoid = ‖f‖_{spectrum,0}
‖f‖_sigmoid ≤ ‖f‖_{spectrum,1}
‖f‖_ramp ≤ ‖f‖_{spectrum,2}
For the sinusoid and sigmoid cases the parameter set is A = R^d; for the ramp case A = {a : ‖a‖_1 ≤ 1}. As we said, this ‖f‖_Φ appears in the approximation bound
‖f − f_m‖ ≤ ‖f‖_Φ / √m.
Statistical Risk
Statistical risk E‖f̂_m − f‖² = E(f̂_m(X) − f(X))²
Expected squared generalization error on new X ~ P
Minimax optimal risk bound, via information theory:
E‖f̂_m − f‖² ≤ ‖f_m − f‖² + c (m/n) log N(Φ, δ)
where log N(Φ, δ) is the metric entropy of Φ at δ_m = 1/m
With greedy optimization using an ℓ1 penalty or suitable m:
E‖f̂ − f‖² ≤ min_{g,m} { ‖g − f‖² + ‖g‖²_Φ / m + c (m/n) log N(Φ, δ_m) }
achieves the ideal approximation-complexity tradeoff.
Statistical Risk with MDL Selection
Again, the general risk bound, using metric entropy log N(Φ, δ):
E‖f̂ − f‖² ≤ min_{g,m} { ‖g − f‖² + ‖g‖²_Φ / m + c (m/n) log N(Φ, δ_m) }.
For the greedy algorithm with Minimum Description Length selection of m:
min_m { ‖Y − f_m‖²_{(n)} + 2c (m/n) log N(Φ, δ) }
performs as well as if the best m* were known in advance.
ℓ1 penalty retains the MDL interpretation and risk (B., Huang, Li, Luo 2008)
The risk bound specializes when ‖f‖_Φ is finite:
E‖f̂ − f‖² ≤ min_m { ‖f‖²_Φ / m + c (m/n) log N(Φ, δ_m) } ≤ c ‖f‖_Φ ( (1/n) log N(Φ, δ_{m*}) )^{1/2}
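The final inequality comes from balancing the two terms over m; writing V = ‖f‖_Φ and L = log N(Φ, δ_m) (treated as roughly constant in m), the tradeoff is:

```latex
\min_m \left\{ \frac{V^2}{m} + c\,\frac{m}{n}\,L \right\}
\le \frac{V^2}{m^*} + c\,\frac{m^*}{n}\,L
= 2\,V \left( \frac{c\,L}{n} \right)^{1/2}
\quad \text{at } m^* = V \left( \frac{n}{c\,L} \right)^{1/2},
```

which is of the stated order, a constant times ‖f‖_Φ ((1/n) log N)^{1/2}.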
Statistical risk for neural nets
Specialize the metric entropy log N(Φ, δ) (Klusowski, Barron 2016).
It is not more than order d log(1/δ) for Lipschitz activation functions such as sigmoids.
With ℓ1-constrained internal parameters, as in the ramp case with finite ‖f‖_{spectrum,2}, it is also not more than order (1/δ) log d.
The risk bound is ‖f‖_Φ [ (d/n) log(n/d) ]^{1/2} or ‖f‖_Φ^{4/3} [ (1/n) log d ]^{1/3}, whichever is smaller.
The [(log d)/n]^{1/3} rate is for the no-noise case. For the case with sub-exponential or sub-Gaussian noise, one may replace it by [(log d)/n]^{1/4}, to within log n factors.
Implication: one can allow d >> n provided n is large enough to accommodate the worse exponent of 1/3 in place of 1/2.
Confronting the computational challenge
Greedy search reduces the dimensionality of the optimization from md to just d.
Obtain a current a_m achieving within a constant factor of the maximum of
J_n(a) = (1/n) ∑_{i=1}^n R_i φ(X_i, a).
This surface can still have many maxima, and we might get stuck at a spurious local maximum.
New computational strategies identify approximate maxima with high probability:
1 Adaptive Annealing
2 Third-order Tensor Methods (pros and cons)
3 Nonlinear Power Methods
These are stochastically initialized search methods
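To see why maximizing J_n(a) is a sensible greedy criterion, note that for a fixed term φ_a the best coefficient gives min_c ‖R − c φ_a‖² = ‖R‖² − (∑ R_i φ_a(X_i))² / ∑ φ_a(X_i)², so maximizing the (normalized) inner product maximizes the drop in residual sum of squares. A quick numerical check of this identity (toy data of our own):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 5
X = rng.standard_normal((n, d))
R = rng.standard_normal(n)          # current residuals R_i

# min_c ||R - c*phi||^2 = ||R||^2 - (R . phi)^2 / (phi . phi)
a = rng.standard_normal(d)
phi = np.tanh(X @ a)                # one candidate dictionary term
c_star = (R @ phi) / (phi @ phi)    # optimal coefficient
lhs = np.sum((R - c_star * phi) ** 2)
rhs = R @ R - (R @ phi) ** 2 / (phi @ phi)
print(abs(lhs - rhs) < 1e-8)        # the two expressions agree
```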
Optimization path for bounded ridge bases
Adaptive Annealing: a more general approach to seek approximate optimization of
J(a) = ∑_{i=1}^n r_i φ(aᵀX_i)
(recent & current work with Luo, Chatterjee, Klusowski)
Sample a_t from the evolving density
p_t(a) = e^{t J(a) − c_t} p_0(a)
along a sequence of values of t from 0 to t_final; use t_final of order (d log d)/n.
Initialize with a_0 drawn from a product prior p_0(a):
Uniform[−1, 1] for each coefficient in the bounded-a case
Normal(0, I) or a product of Cauchy densities in the unbounded-a case
Starting from the random a_0, define the optimization path a_t such that its distribution tracks the target density p_t.
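For intuition, here is a crude particle stand-in of our own for sampling from the annealed densities p_t(a) ∝ e^{t J(a)} p_0(a): reweight and resample along the t-schedule, instead of the deterministic path evolution the talk develops; the toy objective J and all constants are ours:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy objective with a single peak at a* = (0.5, -0.5); ours, for illustration.
def J(A):                                    # A: (particles, d)
    a_star = np.array([0.5, -0.5])
    return -np.sum((A - a_star) ** 2, axis=1)

d, P = 2, 2000
A = rng.uniform(-1, 1, size=(P, d))          # a_0 ~ product Uniform[-1,1]
ts = np.linspace(0, 50, 26)                  # annealing schedule for t
for t0, t1 in zip(ts[:-1], ts[1:]):
    w = np.exp((t1 - t0) * J(A))             # incremental weights e^{(t1-t0) J}
    w /= w.sum()
    idx = rng.choice(P, size=P, p=w)         # resample in proportion to weight
    A = A[idx] + 0.01 * rng.standard_normal((P, d))  # small jitter to diversify

print(np.round(A.mean(axis=0), 2))           # particles concentrate near a*
```

As t grows, the particle cloud tracks the sharpening density and concentrates near the maximizer of J.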
Optimization path
Adaptive Annealing: Arrange a_t from the evolving density
p_t(a) = e^{t J(a) − c_t} p_0(a)
with a_0 drawn from p_0(a)
State evolution with a vector-valued change function G_t(a):
a_{t+h} = a_t − h G_t(a_t)
Better state evolution: a_{t+h} = a* is the solution to
a_t = a* + h G_t(a*),
with small step size h, such that a + h G_t(a) is invertible with a positive definite Jacobian, and G_t solves the equations for the evolution of p_t(a).
As we will see, there are many such change functions G_t(a), though not all are nice.
Boundary requirements
Suppose a is restricted to a bounded domain A with smooth boundary ∂A.
For a ∈ ∂A, let v_a be the outward normal to ∂A at a.
Either G_t(a) = 0 (vanishing at the boundary)
Or G_t(a)ᵀ v_a ≥ 0 (to move inward, not outside)
Likewise, for a near ∂A, if G_t(a) approaches 0, it should do so at order not larger than the distance of a from ∂A.
Solve for the change G_t to track the density p_t
Density evolution, by the Jacobian rule:
p_{t+h}(a) = p_t(a + h G_t(a)) det(I + h ∇G_tᵀ(a))
Up to terms of order h,
p_{t+h}(a) = p_t(a) + h [ G_t(a)ᵀ ∇p_t(a) + p_t(a) ∇ᵀG_t(a) ]
in agreement for small h with the partial differential equation
∂/∂t p_t(a) = ∇ᵀ[ G_t(a) p_t(a) ]
The right side is G_tᵀ(a) ∇p_t(a) + p_t(a) ∇ᵀG_t(a). Dividing by p_t(a), it is expressed in log-density form:
∂/∂t log p_t(a) = ∇ᵀG_t(a) + G_tᵀ(a) ∇ log p_t(a)
Five candidate solutions
Five solutions to the partial differential equation at time t,
∇ᵀ[ G(a) p_t(a) ] = ∂_t p_t(a):
1 Solution in which G(a)p(a) is a gradient
2 Solution using pairs of coefficients
3 Solution with j random, where ∂_{a_j}[ G_j(a) p(a) ] provides ∂_t p_t(a)
4 Solution in which G(a) is a gradient
5 Approximate solutions expressed in terms of u_i = X_iᵀ a
Candidate solution 1.
Solution of smallest L2 norm of G_t(a) p_t(a) at a specific t.
Let G_t(a) p_t(a) = ∇b(a), the gradient of a function b(a)
Let f(a) = ∂/∂t p_t(a)
Let green_d(a) be proportional to 1/‖a‖^{d−2}, harmonic for a ≠ 0
The partial differential equation becomes the Poisson equation:
∇ᵀ∇b(a) = f(a)
Solution, using ∇green_d(a) = c_d a/‖a‖^d:
G_t(a) = (f ∗ ∇green_d)(a) / p_t(a)
If A ⊂ R^d, let the associated Green's function green_A(a, ā) replace green_d(a − ā) in the convolution. Then G_t(a) p_t(a) tends to 0 as a tends to ∂A.
Computational challenge: a high-dimensional convolution integral.
Candidate solution 2.
Solution using 2-dimensional Green integrals.
Write the pde ∇ᵀ[ G_t(a) p_t(a) ] = f(a) in the coordinates G_{t,j}:
∑_{j=1}^d ∂/∂a_j [ G_{t,j}(a) p_t(a) ] = f(a)
Pair consecutive terms to achieve a portion of the solution:
∑_{i∈{j,j+1}} ∂/∂a_i [ G_{t,i}(a) p_t(a) ] = (2/d) f(a)
Solution, for each consecutive pair of coordinates:
( G_{t,j}(a), G_{t,j+1}(a) ) = (2/d) (f ∗ ∇green_2)(a) / p_t(a)
The 2-dimensional Green's function gradient acts on (a_j, a_{j+1}).
This yields a numerical solution. Is it stable for particular J and p_0? Do we lose the desirable boundary behavior?
For p_0 we use a product of 2-dimensional uniform densities on unit disks.
It is stable if J(a) can exhibit only small change from changing two consecutive coordinates.
This is true for Lipschitz sigmoids with variable replication: terms φ(aᵀX) are represented, using small η, as φ(η ∑_{j,r} a_{j,r} X_j). For each X_j the aggregate coefficient is a_j = η ∑_{r=1}^{rep} a_{j,r}.
The challenge is that G(a) is not necessarily zero at the boundary.
Candidate solution 3. Solving 1-dim diff equations
Consider the equation with each a_j in an interval [−1, 1]:
∂_{a_j}[ G_j(a) p(a) ] = ∂_t p_t(a)
Call the right side f(a). It is an ordinary differential equation in a_j if the other coordinates a_{−j} are held fixed:
∂/∂a_j [ G_j(a_j, a_{−j}) p(a_j, a_{−j}) ] = f(a_j, a_{−j})
Solutions take the form
G_j(a) = (1/p(a)) [ ∫_{−1}^{a_j} f(s, a_{−j}) ds − C(a_{−j}) ]
Natural choices for the "constant" C(a_{−j}) are 0 or
S(a_{−j}) = ∫_{−1}^{1} f(s, a_{−j}) ds
so that G_j(a) p(a) is either ∫_{−1}^{a_j} f(s, a_{−j}) ds or −∫_{a_j}^{1} f(s, a_{−j}) ds
Candidate solution 3. Solving 1-dim diff equations
G_j(a) p(a) is either ∫_{−1}^{a_j} f(s, a_{−j}) ds or −∫_{a_j}^{1} f(s, a_{−j}) ds.
These choices make G_j(a) zero as a_j approaches one or the other of the endpoints, but not both.
Thus we are led to define the solution G_j(a) given by
(1/p(a)) [ ∫_{−1}^{a_j} f(s, a_{−j}) ds · 1{S(a_{−j}) ≥ 0} − ∫_{a_j}^{1} f(s, a_{−j}) ds · 1{S(a_{−j}) < 0} ]
This makes the move be inward near the non-zero edge.
If we pick j at random in 1, ..., d, we get a similar density update.
The rule solves the differential equation for the order-h term.
It satisfies some of the desired boundary properties.
Challenge: the boundary is slightly moved at the non-zero edge.
Candidate solution 4.
Perhaps the ideal solution is the one of smallest L2 norm of G_t(a).
It has G_t(a) = ∇b_t(a) equal to the gradient of a function.
The pde in log-density form
∇ᵀG_t(a) + G_tᵀ(a) ∇ log p_t(a) = ∂/∂t log p_t(a)
then becomes an elliptic pde in b_t(a) for fixed t.
With ∇ log p_t(a) and ∂/∂t log p_t(a) arranged to be bounded, the solution may exist and be nice.
But an explicit solution to this elliptic pde is not available (except perhaps numerically in low-dimensional cases).
Next we seek an approximate solution: for ridge bases, decompose into a system of first-order differential equations and integrate.
Candidate solution 5 by decomposition of ridge sum
Optimize J(a) = ∑_{i=1}^n r_i φ(X_iᵀ a)
Target density p_t(a) = e^{t J(a) − c_t} p_0(a) with c′_t = E_{p_t}[J]
The time score is ∂/∂t log p_t(a) = J(a) − E_{p_t}[J]
Specialize the pde in log-density form:
∇ᵀG_t(a) + G_tᵀ(a) ∇ log p_t(a) = J(a) − E_{p_t}[J]
The right side, setting b_{i,t} = E_{p_t}[φ(X_iᵀ a)], takes the form of a sum
∑_i r_i [ φ(X_iᵀ a) − b_{i,t} ].
Likewise ∇ log p_t(a) = t ∇J(a) + ∇ log p_0(a) is the sum
∑_i X_i [ t r_i φ′(X_iᵀ a) − (1/n)(X_iᵀ a) ]
using a Gaussian initial distribution with log p_0(a) equal to −(1/2n) ∑_i aᵀ X_i X_iᵀ a.
Account for the prior by appending d extra input vectors as columns of the identity, with a multiple of the prior score in place of φ′.
Approximate solution for ridge sums
Seek an approximate solution of the form
G_t(a) = ∑_i (X_i / ‖X_i‖²) g_i(u)
with u = (u_1, ..., u_n) evaluated at u_i = X_iᵀ a, for which
∇ᵀG_t(a) = ∑_i ∂/∂u_i g_i(u) + ∑_{i,j: i≠j} (X_iᵀ X_j / ‖X_i‖²) ∂/∂u_j g_i(u)
Can we ignore the coupling in the derivative terms? The ratios X_jᵀ X_i / ‖X_i‖² are small for uncorrelated designs and large d.
Match the remaining terms in the sums to solve for g_i(u).
Arrange g_i(u) to solve the differential equations
∂/∂u_i g_i(u) + g_i(u) [ t r_i φ′(u_i) − u_i/n + rest_i ] = r_i [ φ(u_i) − b_{i,t} ]
where rest_i = ∑_{j≠i} [ t r_j φ′(u_j) − u_j/n ] X_jᵀ X_i / ‖X_i‖².
Integral form of solution
Differential equation for g_i(u_i), suppressing dependence on the coordinates other than i:
∂/∂u_i g_i(u_i) + g_i(u_i) [ t r_i φ′(u_i) − u_i/n + rest_i ] = r_i [ φ(u_i) − b_{i,t} ]
Define the density factor
m_i(u_i) = e^{t r_i φ(u_i) − u_i²/(2n) + u_i rest_i}
which allows the above differential equation to be put back in the form
∂/∂u_i [ g_i(u_i) m_i(u_i) ] = r_i [ φ(u_i) − b_{i,t} ] m_i(u_i)
An explicit solution, evaluated at u_i = X_iᵀ a, is
g_i(u_i) = r_i ∫_{c_i}^{u_i} m_i(s) [ φ(s) − b_{i,t} ] ds / m_i(u_i)
where c_i = c_{i,t} is such that φ(c_{i,t}) = b_{i,t}.
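The integrating-factor identity can be verified numerically (the constants here are toy values of our own, with φ = tanh playing the role of the sigmoid):

```python
import numpy as np

# Check that g(u) = r * int_c^u m(s)[phi(s) - b] ds / m(u) solves
#   g'(u) + g(u)[t r phi'(u) - u/n + rest] = r [phi(u) - b],
# since m'(u)/m(u) = t r phi'(u) - u/n + rest. Toy constants of our own.
t, r, n, rest = 2.0, 0.7, 50.0, 0.1
phi = np.tanh
b = 0.3
c = np.arctanh(b)                    # chosen so that phi(c) = b

u = np.linspace(c, 2.0, 20001)
du = u[1] - u[0]
m = np.exp(t * r * phi(u) - u ** 2 / (2 * n) + u * rest)
integrand = m * (phi(u) - b)
# cumulative trapezoid integral from c to u
integral = np.concatenate(
    [[0.0], np.cumsum((integrand[1:] + integrand[:-1]) / 2) * du])
g = r * integral / m

# verify the differential equation at interior points by finite differences
dg = np.gradient(g, u)
phip = 1 - phi(u) ** 2               # derivative of tanh
lhs = dg + g * (t * r * phip - u / n + rest)
rhs = r * (phi(u) - b)
print(float(np.max(np.abs(lhs - rhs)[100:-100])))   # close to 0
```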
The derived change function $G_t$ for evolution of $a_t$

Include the $u_j$ for $j \ne i$ upon which $\mathrm{rest}_i$ depends. Our solution for $g_{i,t}(u)$ is

$$ g_{i,t}(u) = r_i \int_{c_i}^{u_i} e^{\, t\, r_i (\phi(\bar u_i) - \phi(u_i)) - (\bar u_i^2 - u_i^2)/2n + (\bar u_i - u_i)\, \mathrm{rest}_i(u)} \big[ \phi(\bar u_i) - b_{i,t} \big]\, d\bar u_i $$

Evaluating at $u = X a$ we have the change function

$$ G_t(a) = \sum_i \frac{x_i}{\|x_i\|^2}\, g_{i,t}(X a) $$

for which $a_t$ evolves according to

$$ a_{t+h} = a_t - h\, G_t(a_t) $$

For showing that $g_{i,t}$, $G_t$ and $\nabla G_t$ are nice, assume the activation function $\phi$ and its derivative are bounded (e.g. a logistic sigmoid or a sinusoid).

Run several optimization paths in parallel, starting from independent choices of $a_0$. This allows empirical computation of $b_{i,t} = E_{p_t}[\phi(x_i^T a_t)]$.
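A schematic of the parallel-path evolution (a sketch only: `G_t` below is a placeholder change function, not the slides' derived $g_{i,t}$ construction, and the sizes $n$, $d$, the number of paths, and $h$ are illustrative):

```python
import numpy as np

# Sketch: evolve many parallel paths a_{t+h} = a_t - h G_t(a_t), so averages
# such as b_{i,t} = E_{p_t}[phi(x_i^T a_t)] can be estimated empirically.
# G_t here is a placeholder (a toy gradient field), not the slides' exact
# change function.
rng = np.random.default_rng(0)
n, d, paths, h = 20, 5, 100, 1e-2
X = rng.standard_normal((n, d))
phi = np.tanh

def G_t(A, t):
    # placeholder change function acting on all paths at once
    U = A @ X.T                               # u_i = x_i^T a, per path
    return -t * ((1.0 - phi(U)**2) @ X) / n

A = rng.standard_normal((paths, d)) / np.sqrt(d)   # independent a_0 draws
for step in range(200):
    t = step * h
    b = phi(A @ X.T).mean(axis=0)             # empirical b_{i,t} over paths
    A = A - h * G_t(A, t)

assert b.shape == (n,) and np.all(np.isfinite(A))
```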
Conjectured conclusion
We have derived the desired optimization procedure and the following.

Conjecture: Take step size $h$ of order $1/n^2$, a number of steps of order $n\, d \log d$, and $X_1, X_2, \dots, X_n$ i.i.d. Normal$(0, I)$. With high probability on the design $X$, the above procedure produces optimization paths $a_t$ whose distribution closely tracks the target

$$ p_t(a) = e^{\, t J(a) - c_t}\, p_0(a) $$

such that, with high probability, the solution paths have instances of $J(a_t)$ that are at least $1/2$ the maximum.

Consequently, the relaxed greedy procedure is computationally feasible and achieves the indicated bounds for sparse linear combinations from the dictionary $\Phi = \{\phi(a^T x) : a \in \mathbb{R}^d\}$.
Summary
- Flexible approximation models
  - Subset selection
  - Nonlinearly parameterized bases, as with neural nets
  - $\ell_1$ control on coefficients of the combination
- Accurate approximation with a moderate number of terms
  - Proof analogous to random coding
- Information-theoretic risk bounds
  - Based on the minimum description length principle
  - Show accurate estimation, even for very large dimension
- Computational challenges are being addressed by
  - Adaptive annealing
  - Nonlinear power methods
Tensor and nonlinear power methods (overview)
- Known design distribution $p(X)$
- Target $f(x) = \sum_{k=1}^{m_0} g_k(a_k^T x)$ is a combination of ridge functions with distinct, linearly independent directions $a_k$
- Ideal: maximize $E[f(X)\phi(a^T X)]$ or $(1/n)\sum_i Y_i\, \phi(a^T X_i)$
- Score functions operating on $f(X)$ and $f(X)\, g(a^T X)$ yield population and sample versions of tensors

  $$ E\left[ \frac{\partial^3}{\partial X_{j_1} \partial X_{j_2} \partial X_{j_3}} f(X) \right] $$

  and nonlinearly parameterized matrices

  $$ E\big[ (\nabla \nabla^T f(X))\, g(a^T X) \big] $$

- Spectral decompositions then identify the directions $a_k$
Score method for representing expected derivatives
Score function (tensor) $S^\ell(X)$ of order $\ell$ from the known $p(X)$:

$$ S_{j_1,\dots,j_\ell}(X)\, p(X) = (-1)^\ell \frac{\partial^\ell}{\partial X_{j_1} \cdots \partial X_{j_\ell}} p(X) $$

Gaussian scores:

$$ S^1(X) = X, \qquad S^2(X) = X X^T - I, $$

$$ S^3_{j_1,j_2,j_3}(X) = X_{j_1} X_{j_2} X_{j_3} - X_{j_1} 1_{j_2,j_3} - X_{j_2} 1_{j_1,j_3} - X_{j_3} 1_{j_1,j_2}. $$

Expected derivative:

$$ E\left[ \frac{\partial^\ell}{\partial X_{j_1} \cdots \partial X_{j_\ell}} f(X) \right] = E\big[ f(X)\, S_{j_1,\dots,j_\ell}(X) \big] $$

by repeated integration by parts.
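The order-2 case of this definition can be checked directly: for the standard bivariate normal density, $S^2(x)\,p(x)$ should match the Hessian of $p$ (a small numerical check; the evaluation point is arbitrary):

```python
import numpy as np

# Check S2(x) p(x) = d^2 p / (dx_j dx_k) for the 2-d standard normal,
# using central finite differences at an arbitrary point x.
def p(x):
    return np.exp(-0.5 * x @ x) / (2.0 * np.pi)

x = np.array([0.3, -1.1])
S2 = np.outer(x, x) - np.eye(2)        # Gaussian second-order score

h = 1e-4
H = np.zeros((2, 2))                   # numerical Hessian of p at x
for j in range(2):
    for k in range(2):
        ej, ek = np.eye(2)[j], np.eye(2)[k]
        H[j, k] = (p(x + h*ej + h*ek) - p(x + h*ej - h*ek)
                   - p(x - h*ej + h*ek) + p(x - h*ej - h*ek)) / (4.0*h*h)

assert np.allclose(S2 * p(x), H, atol=1e-5)
```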
Expected derivatives of ridge combinations
Ridge combination target functions:

$$ f(X) = \sum_{k=1}^{m_0} g_k(a_k^T X) $$

Expected Hessian of $f(X)$:

$$ M = \sum_{k=1}^{m_0} a_k a_k^T\, E[g_k''(a_k^T X)] = E\big[ f(X)\, S^2(X) \big]. $$

Principal eigenvector: $\max_a \{ a^T M a \}$.

A linear power method finds the $a_k$ if they are orthogonal (they're not).

Third-order array (Anandkumar et al. 2015, draft):

$$ \sum_{k=1}^{m_0} a_{j_1,k}\, a_{j_2,k}\, a_{j_3,k}\, E[g_k'''(a_k^T X)] = E\big[ f(X)\, S_{j_1,j_2,j_3}(X) \big] $$

It can be whitened, and then a quadratic power method finds the $a_k$.
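A Monte Carlo illustration with a single hypothetical ridge component $f(X) = \cos(a^T X)$, for which $E[g''(a^T X)] = -e^{-\|a\|^2/2}$ is available in closed form (the direction $a$ and sample size are arbitrary choices):

```python
import numpy as np

# Monte Carlo check of the expected-Hessian identity for one ridge term:
# f(X) = cos(a^T X), X ~ N(0, I).  Then E[f(X) S2(X)] should equal
# a a^T E[g''(a^T X)] = -a a^T exp(-||a||^2 / 2).
rng = np.random.default_rng(1)
d, n = 4, 400_000
a = np.array([0.8, -0.3, 0.5, 0.1])
X = rng.standard_normal((n, d))

f = np.cos(X @ a)
# sample average of f(X) (X X^T - I)
M_hat = (X * f[:, None]).T @ X / n - np.eye(d) * f.mean()
M_exact = -np.outer(a, a) * np.exp(-0.5 * a @ a)
assert np.max(np.abs(M_hat - M_exact)) < 0.02
```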
Scoring a Ridge Function
A suitable activation function $\phi(a, X)$ for optimization of $E[f(X)\phi(a, X)]$.

Matrix scoring of a ridge function $g(a^T X)$:

$$ M_{a,X} = S^2\, g(a^T X) - \big[ S^1 a^T + a (S^1)^T \big]\, g'(a^T X) + \big[ a a^T \big]\, g''(a^T X) $$

Activation function formed by scoring a ridge function:

$$ \phi(a, X) = a^T [M_{a,X}]\, a = (a^T S^2 a)\, g(a^T X) - 2 (a^T S^1)(a^T a)\, g'(a^T X) + (a^T a)^2\, g''(a^T X) $$

Scoring a ridge function permits finding the component of $\phi(a,X)$ in the target function, using

$$ E[f(X)\phi(a,X)] = a^T E[f(X) M_{a,X}]\, a = a^T E\big[ (\nabla \nabla^T f(X))\, g(a^T X) \big]\, a $$

by twice integrating by parts.
Scoring a Ridge Function (Gaussian design case)
Matrix scoring of a ridge function $g(a^T X)$:

$$ M_{a,X} = S^2\, g(a^T X) - \big[ S^1 a^T + a (S^1)^T \big]\, g'(a^T X) + \big[ a a^T \big]\, g''(a^T X) $$

Activation function formed by scoring a ridge function:

$$ \phi(a, X) = a^T [M_{a,X}]\, a = (a^T S^2 a)\, g(a^T X) - 2 (a^T S^1)(a^T a)\, g'(a^T X) + (a^T a)^2\, g''(a^T X) $$

Gaussian design case, simplifying when $\|a\| = 1$:

$$ \phi(a, X) = \big[ (a^T X)^2 - 1 \big]\, g(a^T X) - \big[ 2\, a^T X \big]\, g'(a^T X) + g''(a^T X) $$

$$ \phi(z) = (z^2 - 1)\, g(z) - 2 z\, g'(z) + g''(z) $$

Hermite polynomials: if $g(z) = H_{\ell-2}(z)$ then $\phi(z) = H_\ell(z)$ for $\ell \ge 2$.
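A direct check of the Hermite claim with probabilists' Hermite polynomials (note the minus sign on the middle term, coming from the first-order score, which is what makes the stated $H_{\ell-2} \mapsto H_\ell$ identity hold):

```python
import numpy as np
from numpy.polynomial import hermite_e as He

# phi(z) = (z^2 - 1) g(z) - 2 z g'(z) + g''(z) sends g = He_{l-2} to He_l
z = np.linspace(-2.0, 2.0, 9)
for l in range(2, 7):
    g = np.zeros(l - 1)
    g[-1] = 1.0                                 # coefficients of He_{l-2}
    g0 = He.hermeval(z, g)
    g1 = He.hermeval(z, He.hermeder(g, 1))
    g2 = He.hermeval(z, He.hermeder(g, 2))
    phi = (z**2 - 1.0) * g0 - 2.0 * z * g1 + g2
    assert np.allclose(phi, He.hermeval(z, np.eye(l + 1)[l]))   # = He_l(z)
```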
Scored Ridge Function Decomposes $E[f(X)\phi(a,X)]$

Matrix-scored ridge function, providing $\phi(a,X) = a^T M_{a,X}\, a$:

$$ M_{a,X} = S^2\, g(a^T X) - \big[ S^1 a^T + a (S^1)^T \big]\, g'(a^T X) + \big[ a a^T \big]\, g''(a^T X) $$

The amount of $\phi(a,X)$ in $f(X)$, via the matrix decomposition

$$ M_a = E[f(X)\, M_{a,X}] = E\big[ (\nabla \nabla^T f(X))\, g(a^T X) \big] = \sum_{k=1}^{m_0} a_k a_k^T\, G_k(a_k, a), $$

is quantified by

$$ E[f(X)\phi(a,X)] = a^T [M_a]\, a = \sum_{k=1}^{m_0} (a_k^T a)^2\, G_k(a_k, a) $$

Here $G_k(a_k, a) = E[g_k''(a_k^T X)\, g(a^T X)]$ measures the strength of the match of $a$ to the direction $a_k$.

It replaces $E[g_k''(a_k^T X)\, (S^1)^T]\, a = (a_k^T a)\, E[g_k'''(a_k^T X)]$ in the tensor method of Anandkumar et al.
Using Sinusoids or Sigmoids
The amount of $\phi(a,X)$ in $f(X)$, via the matrix decomposition

$$ M_a = E[f(X)\, M_{a,X}] = \sum_{k=1}^{m_0} a_k a_k^T\, G_k(a_k, a), $$

is quantified by

$$ E[f(X)\phi(a,X)] = a^T [M_a]\, a = \sum_{k=1}^{m_0} (a_k^T a)^2\, G_k(a_k, a) $$

Here $G_k(a_k, a) = E[g_k''(a_k^T X)\, g(a^T X)]$ measures the strength of the match of $a$ to the direction $a_k$.

$\cos(z)$, $\sin(z)$ case, with $X$ standard multivariate Normal:

$$ g_k(a_k^T X) = -c_k\, e^{i\, a_k^T X} \quad \text{and} \quad g(a^T X) = e^{-i\, a^T X} $$

with expected product $G_k(a_k, a) = c_k\, e^{-(1/2)\|a_k - a\|^2}$.

Step sigmoid case $\phi(z) = 1_{\{z > 0\}}$: the $G_k(a_k, a)$ is determined by the angle between $a_k$ and $a$.
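The claimed expected product can be checked by reducing to one dimension, since $(a_k - a)^T X \sim N(0, \|a_k - a\|^2)$; a Gauss–Hermite quadrature then evaluates the characteristic function (the directions below are arbitrary examples):

```python
import numpy as np

# (a_k - a)^T X ~ N(0, s^2) with s = ||a_k - a||, so the expected product
# E[e^{i a_k^T X} e^{-i a^T X}] = E[e^{i s Z}], Z ~ N(0,1), which should
# equal exp(-s^2 / 2).
ak = np.array([1.0, 0.2, -0.4])
a = np.array([0.7, 0.5, 0.1])
s = np.linalg.norm(ak - a)

nodes, weights = np.polynomial.hermite.hermgauss(60)
# E[h(Z)] for Z ~ N(0,1):  sum_j w_j h(sqrt(2) x_j) / sqrt(pi)
Ez = np.sum(weights * np.exp(1j * s * np.sqrt(2.0) * nodes)) / np.sqrt(np.pi)

assert abs(Ez - np.exp(-0.5 * s ** 2)) < 1e-12
```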
Using Hermite polynomials
The amount of $\phi(a,X)$ in $f(X)$, via the matrix decomposition

$$ M_a = E[f(X)\, M_{a,X}] = \sum_{k=1}^{m_0} a_k a_k^T\, G_k(a_k, a), $$

is given by

$$ E[f(X)\phi(a,X)] = a^T [M_a]\, a = \sum_{k=1}^{m_0} (a_k^T a)^2\, G_k(a_k, a) $$

Here $G_k(a_k, a) = E[g_k''(a_k^T X)\, g(a^T X)]$ measures the strength of the match of $a$ to the direction $a_k$.

Hermite case: $g(z) = H_{\ell-2}(z)$, with $X \sim$ Normal$(0, I)$. The $H_\ell(a^T X)$ and $H_{\ell'}(a_k^T X)$ are orthogonal for $\ell' \ne \ell$.

$$ G_k(a_k, a) = c_{k,\ell}\, (a_k^T a)^\ell $$

with $c_{k,\ell} = E[g_k(Z) H_\ell(Z)]$ in the expansion $g_k(z) = \sum_{\ell'} c_{k,\ell'} H_{\ell'}(z)$.
Nonlinear Power Method
Maximize $J(a) = E[f(X)\phi(a,X)] = a^T M_a\, a$, subject to $\|a\| = 1$.

Cauchy–Schwarz inequality:

$$ a^T M_a\, a \le \|a\|\, \|M_a a\| $$

with equality iff $a$ is proportional to $M_a a$. This motivates the mapping of the nonlinear power method

$$ V(a) = \frac{M_a\, a}{\|M_a a\|} $$

- Seek fixed points $a^* = V(a^*)$ via the iterations $a_t = V(a_{t-1})$.
- Construct a whitened version.
- Verify that $J(a_t)$ is increasing.

The nonlinear power method provides maximizers of $J(a) = E[f(X)\phi(a,X)]$.
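A toy sketch of the iteration (an illustrative already-whitened Hermite-style setting with orthonormal directions and $G_k(\alpha_k^T u) = (\alpha_k^T u)^\ell$, $\ell = 3$; the dimensions and seed are arbitrary):

```python
import numpy as np

# Toy whitened setting: M(u) = sum_k alpha_k alpha_k^T (alpha_k^T u)^l with
# orthonormal alpha_k, l = 3.  Iterate u_t = M(u) u / ||M(u) u||.
rng = np.random.default_rng(2)
d, m0, l = 6, 3, 3
alpha, _ = np.linalg.qr(rng.standard_normal((d, m0)))  # orthonormal columns

def M(u):
    G = (alpha.T @ u) ** l            # G_k(alpha_k^T u), increasing in the overlap
    return (alpha * G) @ alpha.T

u = rng.standard_normal(d)
u /= np.linalg.norm(u)
J = []
for _ in range(30):
    v = M(u) @ u                      # power step
    u = v / np.linalg.norm(v)
    J.append(u @ M(u) @ u)            # J(u_t) = u^T M_u u

k_star = int(np.argmax(np.abs(alpha.T @ u)))
assert min(np.diff(J)) > -1e-9                        # J(u_t) nondecreasing
assert abs(abs(alpha[:, k_star] @ u) - 1.0) < 1e-8    # converged to an alpha_k
```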
Analysis with Whitening
Suppose $m_0 \le d$ (number of components $\le$ dimension).

Let $\mathrm{Ref} = \sum_k a_k a_k^T\, \beta_k$ be a reference matrix; for instance, $\mathrm{Ref} = M_{a_{\mathrm{ref}}}$ has $\beta_k = G_k(a_k, a_{\mathrm{ref}})$. Let $Q D Q^T$ be its eigendecomposition.

Let $W = Q D^{-1/2}$ be the whitening matrix:

$$ I = W^T \mathrm{Ref}\, W = \sum_k (W^T a_k)(a_k^T W)\, \beta_k = \sum_k \alpha_k \alpha_k^T $$

with orthonormal directions

$$ \alpha_k = W^T a_k \sqrt{\beta_k} $$

Represent $a = W u / \|W u\| = W u \sqrt{\beta}$ for unit vectors $u$. Then $a^T a_k = u^T \alpha_k\, (\beta/\beta_k)^{1/2}$.

Let $u_{\mathrm{ref}}$ be the unit vector proportional to $W^{-1} a_{\mathrm{ref}} = D^{1/2} Q^T a_{\mathrm{ref}}$.
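The construction can be verified numerically (arbitrary example directions and $\beta_k$; since $m_0 < d$ here, the null directions of Ref are dropped):

```python
import numpy as np

# Whitening check: Ref = sum_k beta_k a_k a_k^T = Q D Q^T, W = Q D^{-1/2}
# gives W^T Ref W = I, and alpha_k = sqrt(beta_k) W^T a_k are orthonormal.
rng = np.random.default_rng(3)
d, m0 = 5, 3
A = rng.standard_normal((d, m0))          # columns a_k, linearly independent
beta = np.array([1.5, 0.8, 0.4])

Ref = (A * beta) @ A.T                    # A diag(beta) A^T
D, Q = np.linalg.eigh(Ref)
keep = D > 1e-10                          # rank m0 < d: drop null directions
W = Q[:, keep] / np.sqrt(D[keep])

assert np.allclose(W.T @ Ref @ W, np.eye(m0))
alpha = (W.T @ A) * np.sqrt(beta)         # columns alpha_k
assert np.allclose(alpha.T @ alpha, np.eye(m0), atol=1e-8)
```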
Analysis of the Nonlinear Power Method
Criterion $E[f(X)\phi(a,X)] = a^T M_a\, a = u^T M_u\, u$, where

$$ M_u = \sum_k \alpha_k \alpha_k^T\, G_k(\alpha_k, u)\, \beta/\beta_k $$

and $G_k$ is here expressed via $\alpha_k, u$ in place of $a_k, a$. Example (Hermite case):

$$ G_k(\alpha_k^T u) = c_{k,\ell}\, (\alpha_k^T u)^\ell\, (\beta/\beta_k)^{\ell/2}, \qquad M_u = \sum_k \alpha_k \alpha_k^T\, (\alpha_k^T u / \alpha_k^T u_{\mathrm{ref}})^\ell $$

The power mapping $a_t = M_{a_{t-1}} a_{t-1} / \|\cdot\|$ corresponds to

$$ u_t = M_{u_{t-1}} u_{t-1} / \|\cdot\| $$

- Provably rapidly convergent when $G_k$ is increasing in $\alpha_k^T u$.
- The limit of $u_t$ is $u^* = \pm\alpha_k$ with the largest initial $(\alpha_k^T u_0 / \alpha_k^T u_{\mathrm{ref}})^\ell$.
- Each $+\alpha_k$ or $-\alpha_k$ is a local maximizer.
- The global maximizer corresponds to the largest $1/|\alpha_k^T u_{\mathrm{ref}}|$.
- The corresponding maximizer of $a^T M_a\, a$ is $a^*$ proportional to $W u^*$.
Analysis of Nonlinear Power Method, Polynomial Case
Let $c_k(t) = \alpha_k^T u_t$ be the coefficient of $u_t$ in the direction $\alpha_k$, and let $c_{k,\mathrm{ref}} = \alpha_k^T u_{\mathrm{ref}}$ be the coefficient of $u_{\mathrm{ref}}$ in the direction $\alpha_k$.

$$ M_{u_t} = \sum_k \alpha_k \alpha_k^T\, (\alpha_k^T u_t / \alpha_k^T u_{\mathrm{ref}})^\ell $$

so that

$$ M_{u_t} u_t = \sum_k \alpha_k\, (\alpha_k^T u_t)\, (\alpha_k^T u_t / \alpha_k^T u_{\mathrm{ref}})^\ell $$

Thus the coefficients for $u_{t+1}$ satisfy the recursion

$$ c_k(t+1) = \frac{ \big[ c_k(t)/c_{k,\mathrm{ref}} \big]^{\ell+1}\, c_{k,\mathrm{ref}} }{ \big[ \sum_k (\,\cdot\,)^2 \big]^{1/2} } $$

By induction,

$$ c_k(t) = \frac{ \big[ c_k(0)/c_{k,\mathrm{ref}} \big]^{(\ell+1)^t}\, c_{k,\mathrm{ref}} }{ \big[ \sum_k (\,\cdot\,)^2 \big]^{1/2} } $$

It rapidly concentrates on the index $k$ with the largest

$$ \frac{c_k(0)}{c_{k,\mathrm{ref}}} = \frac{\alpha_k^T u_0}{\alpha_k^T u_{\mathrm{ref}}} $$
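The one-step recursion and its closed form can be checked against each other directly (the coefficient vectors below are illustrative; $\ell = 2$):

```python
import numpy as np

# Iterating c_k(t+1) ∝ (c_k(t)/c_ref)^{l+1} c_ref reproduces the closed form
# c_k(t) ∝ (c_k(0)/c_ref)^{(l+1)^t} c_ref.
l = 2
c0 = np.array([0.9, 0.5, 0.3])
cref = np.array([1.0, 0.7, 0.6])

def unit(c):
    return c / np.linalg.norm(c)

c = unit(c0)
for t in range(1, 4):
    c = unit((c / cref) ** (l + 1) * cref)                     # one power step
    closed = unit((unit(c0) / cref) ** ((l + 1) ** t) * cref)  # induction formula
    assert np.allclose(c, closed)

# the iterate concentrates on argmax_k c_k(0)/c_{k,ref}  (here k = 0)
assert int(np.argmax(c0 / cref)) == int(np.argmax(np.abs(c)))
```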
Analysis of Nonlinear Power Method, Polynomial Case
Suppose $k = 1$ has the largest

$$ \frac{c_k(0)}{c_{k,\mathrm{ref}}} = \frac{\alpha_k^T u_0}{\alpha_k^T u_{\mathrm{ref}}} $$

with the others smaller by the factor $1 - \Delta$. Then

$$ \|u_t - \alpha_1\|^2 \le 2\, (1 - \Delta)^{2(\ell+1)^t} $$

Moreover $J(a_t) = E[f(X)\phi(a_t,X)] = u_t^T M_{u_t} u_t$ equals

$$ \sum_k \big[ c_k(t)/c_{k,\mathrm{ref}} \big]^{\ell+2}\, c_{k,\mathrm{ref}}^2 $$

which is strictly increasing in $t$, proven by applications of Hölder's inequality.

- The factor of increase is quantified by the exponential of a relative entropy.
- The increase at each step is large unless $c_k^2(t)$ is close to concentrated on the maximizers of $\alpha_k^T u_0 / \alpha_k^T u_{\mathrm{ref}}$.
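A small numerical illustration of the monotone increase (example coefficients, $\ell = 2$, so $\ell + 2$ is even and each term is nonnegative):

```python
import numpy as np

# Track J(a_t) = sum_k (c_k(t)/c_{k,ref})^{l+2} c_{k,ref}^2 along the
# polynomial-case power recursion; it should be nondecreasing.
l = 2
cref = np.array([1.0, 0.7, 0.6])
c = np.array([0.9, 0.5, 0.3])
c = c / np.linalg.norm(c)

J = []
for _ in range(6):
    J.append(float(np.sum((c / cref) ** (l + 2) * cref ** 2)))
    c = (c / cref) ** (l + 1) * cref          # power step on coefficients
    c = c / np.linalg.norm(c)

assert all(J[t + 1] >= J[t] - 1e-12 for t in range(len(J) - 1))
assert J[-1] > J[0]
```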