High-Dimensional Neural Networks: Statistical and Computational Properties
Andrew R. Barron
YALE UNIVERSITY
DEPARTMENT OF STATISTICS
Presentation, 27 September, 2016
International Conference on Operational Research
KOI 2016, Osijek, Croatia
Joint work with Jason Klusowski
Outline
Flexible high-dimensional function estimation using sinusoid, sigmoid, ramp or polynomial activation functions
Approximation and estimation bounds that reveal the effect of dimension, model size, and sample size
Computation with greedy term selection:
Adaptive Annealing Method (for general designs) to guide the parameter search
Nonlinear Power Method (for specific designs) to improve upon tensor methods for parameter estimation
Andrew Barron High-Dim Neural Nets: Statistics and Computation
Example papers for some of what is to follow
Papers illustrating my background addressing these questions of high dimensionality (available from www.stat.yale.edu):
A. R. Barron, R. L. Barron (1988). Statistical learning networks: a unifying view. Computing Science & Statistics: Proc. 20th Symp. on the Interface, ASA, pp. 192-203.
A. R. Barron (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, Vol. 39, pp. 930-944.
A. R. Barron, A. Cohen, W. Dahmen, R. DeVore (2008). Approximation and learning by greedy algorithms. Annals of Statistics, Vol. 36, pp. 64-94.
J. M. Klusowski, A. R. Barron (2016). Risk bounds for high-dimensional ridge function combinations including neural networks. Submitted to the Annals of Statistics. arXiv:1607.01434v1
Data Setting
Data: (X_i, Y_i), i = 1, 2, ..., n
Inputs: explanatory variable vectors (arbitrary dependence)
X_i = (X_{i,1}, X_{i,2}, ..., X_{i,d})
Domain: cube in R^d
Random design: independent X_i ~ P
Output: response variable Y_i in R, bounded or subgaussian
Relationship: E[Y_i | X_i] = f(X_i), as in:
Perfect observation: Y_i = f(X_i)
Noisy observation: Y_i = f(X_i) + ε_i with ε_i independent
Function: f(x) unknown
Activation functions
Activation functions denoted φ(z) or g(z)
Piecewise constant: 1{z − b ≥ 0} or sgn(z − b)
Sigmoid: (e^z − e^{−z})/(e^z + e^{−z})
Linear spline, ramp: (z − b)_+
Sinusoidal: cos(2πfz), sin(2πfz)
Polynomial: standard z^ℓ, Hermite H_ℓ(z)
Multivariate activation functions are built from products or from ridge forms φ(aᵀx), constructed using a univariate function φ and an internal parameter vector a of dimension d.
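As a concrete sketch (function names and the example values are ours), the univariate activations above and a ridge unit built from one of them can be written as:

```python
import numpy as np

# Univariate activation functions from the list above (a sketch; names are ours).
def step(z, b=0.0):        # piecewise constant 1{z - b >= 0}
    return (z - b >= 0).astype(float)

def sigmoid(z):            # (e^z - e^-z)/(e^z + e^-z), i.e. tanh
    return np.tanh(z)

def ramp(z, b=0.0):        # linear spline (z - b)_+
    return np.maximum(z - b, 0.0)

def sinusoid(z, f=1.0):    # cos(2 pi f z)
    return np.cos(2 * np.pi * f * z)

# A multivariate ridge unit phi(a^T x) built from a univariate phi.
def ridge(phi, a, X):
    """Evaluate phi(a^T x) for each row x of X (an n-by-d design matrix)."""
    return phi(X @ a)

X = np.array([[1.0, 2.0], [0.5, -1.0]])
a = np.array([1.0, 1.0])
print(ridge(ramp, a, X))   # ramp of [3.0, -0.5] -> [3.0, 0.0]
```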
Product and Ridge Bases
Product bases: for polynomials, sinusoids, splines, using continuous powers, frequencies or thresholds:
φ(x, a) = φ(x_1, a_1) φ(x_2, a_2) ··· φ(x_d, a_d)
Ridge bases: for projection pursuit regression, sinusoids, neural networks and polynomial networks:
φ(x, a) = φ(aᵀx) = φ(a_1 x_1 + a_2 x_2 + ... + a_d x_d)
Flexible multivariate function modeling
Internally parameterized models & nonlinear least squares
Functions f_m(x) = ∑_{j=1}^m c_j φ(x, a_j) in the span of a parameterized dictionary Φ = {φ_a(·) = φ(·, a) : a ∈ A} with parameter set A ⊂ R^d
Flexible function approximation
Statistically challenging
Computationally challenging
Notation
Response vector: Y = (Y_i)_{i=1}^n in R^n
Dictionary vectors: Φ^{(n)} = { (φ(X_i, a))_{i=1}^n : a ∈ A } ⊂ R^n
Sample squared norm: ‖f‖²_{(n)} = (1/n) ∑_{i=1}^n f²(X_i)
Population squared norm: ‖f‖² = ∫ f²(x) P(dx)
Normalized dictionary condition: ‖φ‖ ≤ 1 for φ ∈ Φ
Flexible m-term nonlinear optimization
Impractical complete nonlinear least squares optimization
Sample version: f_m achieves min over (a_j, c_j)_{j=1}^m of ‖Y − ∑_{j=1}^m c_j φ_{a_j}‖²_{(n)}
Population version: f_m achieves min over (a_j, c_j)_{j=1}^m of ‖f − ∑_{j=1}^m c_j φ_{a_j}‖²
Optimization of (a_j, c_j)_{j=1}^m in R^{(d+1)m}.
GREEDY OPTIMIZATIONS
Optimize one term at a time
Step 1: Choose a_1, c_1 to produce a single best term:
sample version: min_{a,c} ‖Y − c φ_a‖²_{(n)}
population version: min_{a,c} ‖f − c φ_a‖²
Step m > 1: Arrange f_m = α f_{m−1} + c φ_a with a_m, c_m, α_m providing the term with best improvement:
sample version: min_{a,c,α} ‖Y − α f_{m−1} − c φ_a‖²_{(n)}
population version: min_{a,c,α} ‖f − α f_{m−1} − c φ_a‖²
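A minimal sketch of this one-term-at-a-time scheme (a toy implementation of our own: random candidate directions stand in for the guided parameter searches discussed later in the talk):

```python
import numpy as np

def greedy_fit(X, Y, m, n_candidates=200, seed=0):
    """One-term-at-a-time (relaxed greedy) fit with ridge terms tanh(a^T x).
    Candidate directions a are drawn at random here, a simple stand-in for
    the guided searches discussed later in the talk."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    fit = np.zeros(n)
    for _ in range(m):
        best_sse, best_fit = None, None
        for _ in range(n_candidates):
            a = rng.standard_normal(d)
            phi = np.tanh(X @ a)
            # best (alpha, c) for Y ~ alpha*fit + c*phi, by 2-d least squares
            B = np.column_stack([fit, phi])
            coef, *_ = np.linalg.lstsq(B, Y, rcond=None)
            sse = float(np.sum((Y - B @ coef) ** 2))
            if best_sse is None or sse < best_sse:
                best_sse, best_fit = sse, B @ coef
        fit = best_fit
    return fit

rng = np.random.default_rng(42)
X = rng.standard_normal((100, 3))
Y = np.tanh(X @ np.array([1.0, -2.0, 0.5]))
mse = {m: float(np.mean((Y - greedy_fit(X, Y, m)) ** 2)) for m in (1, 3)}
print(mse)  # the fit improves as terms are added
```

Because each step's least squares includes keeping the previous fit unchanged, the residual is non-increasing in the number of terms.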
Acceptable variants of greedy optimization
Step m > 1: Arrange f_m = α f_{m−1} + c φ_a with a_m, c_m, α_m providing the term with best improvement:
least squares: min_{a,c,α} ‖Y − α f_{m−1} − c φ_a‖²_{(n)}
Acceptable variants:
Inner-product maximization: with Res_i = Y_i − f_{m−1}(X_i), choose a_m to achieve max_a ∑_{i=1}^n Res_i φ(X_i, a) to within a constant factor
Forward stepwise selection: given a_1, ..., a_{m−1}, choose a_m to obtain min_a d(Y, span{φ_{a_1}, ..., φ_{a_{m−1}}, φ_a})
Orthogonal matching pursuit: project onto the span after choosing the inner-product maximizer
ℓ1 penalization: update v_m = α_m v_{m−1} + c_m; choose α_m, c_m to achieve min_{α,c} ‖Y − α f_{m−1} − c φ_a‖²_{(n)} + λ[α v_{m−1} + c]
Basic m-term approximation and computation bounds
For complete or greedy approximation (B. 1993, Lee et al. 1995)
Population version:
‖f − f_m‖ ≤ ‖f‖_Φ / √m, and moreover
‖f − f_m‖² ≤ inf_g { ‖f − g‖² + 2‖g‖²_Φ / m }
Sample version:
‖Y − f_m‖²_{(n)} ≤ ‖Y − f‖²_{(n)} + 2‖f‖²_Φ / m
Optimization with ℓ1 penalization:
‖Y − f_m‖²_{(n)} + λ‖f_m‖_Φ ≤ inf_g { ‖Y − g‖²_{(n)} + λ‖g‖_Φ + 2‖g‖²_Φ / m }
Here ‖g‖_Φ is the minimal ℓ1 norm of coefficients of g in the span of Φ
The norm of f with respect to the dictionary Φ
Minimal ℓ1 norm on coefficients in approximation of f:
‖f‖_Φ = lim_{ε→0} inf{ ∑_j |c_j| : ‖∑_j c_j φ_{a_j} − f‖ ≤ ε }
called the variation of f with respect to Φ (B. 1991)
‖f‖_Φ = V_Φ(f) = inf{ V : f/V ∈ closure(conv(±Φ)) }
It appears in the bound ‖f − f_m‖ ≤ ‖f‖_Φ / √m
Later also called the atomic norm of f with respect to Φ
In the case of the signum activation function it matches the minimal range of neural nets arbitrarily well approximating f
Greedy proof of the approximation bound:
Consider the case ‖f‖_Φ = 1 and take Φ to be closed under sign changes. The minimum over φ is not more than the average over φ, so take the average with respect to the weights representing f:
‖f − f_m‖² ≤ min_φ ‖f − (1−λ) f_{m−1} − λφ‖²
≤ ave_φ ‖f − (1−λ) f_{m−1} − λφ‖²
≤ (1−λ)² ‖f − f_{m−1}‖² + λ²
where the cross term averages to zero (the weighted average of φ is f) and ave_φ ‖f − φ‖² ≤ 1. The bound follows by induction with λ = 1/m:
‖f − f_m‖² ≤ 1/m
Jones (AS 1992), B. (IT 1993); extensions: Lee et al. (IT 1995), DeVore et al. (AS 2008)
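The induction can be checked numerically on a toy finite dictionary of our own (unit vectors, closed under sign change, with f a convex combination of atoms so that ‖f‖_Φ ≤ 1):

```python
import numpy as np

rng = np.random.default_rng(1)

# Finite dictionary of unit vectors in R^D, closed under sign changes,
# and a target f in the convex hull (so ||f||_Phi <= 1). Toy setup of ours.
D, N = 20, 50
Phi = rng.standard_normal((N, D))
Phi /= np.linalg.norm(Phi, axis=1, keepdims=True)
Phi = np.vstack([Phi, -Phi])                 # closure under sign change
w = rng.dirichlet(np.ones(N))
f = w @ Phi[:N]                              # convex combination of atoms

fm = np.zeros(D)
for m in range(1, 21):
    lam = 1.0 / m
    # candidate updates (1-lam)*f_{m-1} + lam*phi over all atoms phi
    cand = (1 - lam) * fm + lam * Phi
    errs = np.linalg.norm(cand - f, axis=1)
    fm = cand[np.argmin(errs)]               # best single-atom improvement
    # the guarantee ||f - f_m||^2 <= 1/m from the induction above
    assert np.sum((f - fm) ** 2) <= 1.0 / m + 1e-12
print("greedy error^2 after 20 steps:", float(np.sum((f - fm) ** 2)))
```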
Relating the variation to a spectral norm of f
Neural net approximation using the Fourier representation
f(x) = ∫ e^{iω·x} f̂(ω) dω
L1 spectral norm: ‖f‖_{spectrum,s} = ∫_{R^d} |f̂(ω)| ‖ω‖_1^s dω
Let Φ be the dictionary for sinusoids, sigmoids or ramps. With unit cube domain for x, the variation ‖f‖_Φ satisfies
‖f‖_sinusoid = ‖f‖_{spectrum,0}
‖f‖_sigmoid ≤ ‖f‖_{spectrum,1}
‖f‖_ramp ≤ ‖f‖_{spectrum,2}
For the sinusoid and sigmoid cases the parameter set is A = R^d; for the ramp case A = {a : ‖a‖_1 ≤ 1}. As we said, this ‖f‖_Φ appears in the approximation bound
‖f − f_m‖ ≤ ‖f‖_Φ / √m.
Statistical Risk
Statistical risk E‖f̂_m − f‖² = E(f̂_m(X) − f(X))²
Expected squared generalization error on new X ~ P
Minimax optimal risk bound, via information theory:
E‖f̂_m − f‖² ≤ ‖f_m − f‖² + c (m/n) log N(Φ, δ)
where log N(Φ, δ) is the metric entropy of Φ at δ_m = 1/m
With greedy optimization using an ℓ1 penalty or suitable m:
E‖f̂ − f‖² ≤ min_{g,m} { ‖g − f‖² + ‖g‖²_Φ / m + c (m/n) log N(Φ, δ_m) }
achieves the ideal approximation-complexity tradeoff.
Statistical Risk with MDL Selection
Again, the general risk bound, using metric entropy log N(Φ, δ):
E‖f̂ − f‖² ≤ min_{g,m} { ‖g − f‖² + ‖g‖²_Φ / m + c (m/n) log N(Φ, δ_m) }.
For the greedy algorithm with Minimum Description Length selection of m:
min_m { ‖Y − f_m‖²_{(n)} + 2c (m/n) log N(Φ, δ) }
performs as well as if the best m* were known in advance.
ℓ1 penalty retains the MDL interpretation and risk (B., Huang, Li, Luo 2008)
The risk bound specializes when ‖f‖_Φ is finite:
E‖f̂ − f‖² ≤ min_m { ‖f‖²_Φ / m + c (m/n) log N(Φ, δ_m) } ≤ c ‖f‖_Φ ( (1/n) log N(Φ, δ_{m*}) )^{1/2}
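The final inequality comes from balancing the two terms over m; writing V = ‖f‖_Φ and L = log N(Φ, δ_m) (treated as roughly constant in m), the tradeoff is:

```latex
\min_m \left\{ \frac{V^2}{m} + c\,\frac{m}{n}\,L \right\}
\le \frac{V^2}{m^*} + c\,\frac{m^*}{n}\,L
= 2\,V \left( \frac{c\,L}{n} \right)^{1/2}
\quad \text{at } m^* = V \left( \frac{n}{c\,L} \right)^{1/2},
```

which is of the stated order, a constant times ‖f‖_Φ ((1/n) log N)^{1/2}.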
Statistical risk for neural nets
Specialize the metric entropy log N(Φ, δ) (Klusowski, Barron 2016).
It is not more than order d log(1/δ) for Lipschitz activation functions such as sigmoids.
With ℓ1-constrained internal parameters, as in the ramp case with finite ‖f‖_{spectrum,2}, it is also not more than order (1/δ) log d.
The risk bound is ‖f‖_Φ [ (d/n) log(n/d) ]^{1/2} or ‖f‖_Φ^{4/3} [ (1/n) log d ]^{1/3}, whichever is smaller.
The [(log d)/n]^{1/3} rate is for the no-noise case. For the case with sub-exponential or sub-Gaussian noise, one may replace it by [(log d)/n]^{1/4}, to within log n factors.
Implication: one can allow d >> n provided n is large enough to accommodate the worse exponent of 1/3 in place of 1/2.
Confronting the computational challenge
Greedy search reduces the dimensionality of the optimization from md to just d.
Obtain a current a_m achieving within a constant factor of the maximum of
J_n(a) = (1/n) ∑_{i=1}^n R_i φ(X_i, a).
This surface can still have many maxima, and we might get stuck at a spurious local maximum.
New computational strategies identify approximate maxima with high probability:
1 Adaptive Annealing
2 Third-order Tensor Methods (pros and cons)
3 Nonlinear Power Methods
These are stochastically initialized search methods
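To see why maximizing J_n(a) is a sensible greedy criterion, note that for a fixed term φ_a the best coefficient gives min_c ‖R − c φ_a‖² = ‖R‖² − (∑ R_i φ_a(X_i))² / ∑ φ_a(X_i)², so maximizing the (normalized) inner product maximizes the drop in residual sum of squares. A quick numerical check of this identity (toy data of our own):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 5
X = rng.standard_normal((n, d))
R = rng.standard_normal(n)          # current residuals R_i

# min_c ||R - c*phi||^2 = ||R||^2 - (R . phi)^2 / (phi . phi)
a = rng.standard_normal(d)
phi = np.tanh(X @ a)                # one candidate dictionary term
c_star = (R @ phi) / (phi @ phi)    # optimal coefficient
lhs = np.sum((R - c_star * phi) ** 2)
rhs = R @ R - (R @ phi) ** 2 / (phi @ phi)
print(abs(lhs - rhs) < 1e-8)        # the two expressions agree
```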
Optimization path for bounded ridge bases
Adaptive Annealing: a more general approach to seek approximate optimization of
J(a) = ∑_{i=1}^n r_i φ(aᵀX_i)
(recent & current work with Luo, Chatterjee, Klusowski)
Sample a_t from the evolving density
p_t(a) = e^{t J(a) − c_t} p_0(a)
along a sequence of values of t from 0 to t_final; use t_final of order (d log d)/n.
Initialize with a_0 drawn from a product prior p_0(a):
Uniform[−1, 1] for each coefficient in the bounded-a case
Normal(0, I) or a product of Cauchy densities in the unbounded-a case
Starting from the random a_0, define the optimization path a_t such that its distribution tracks the target density p_t.
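For intuition, here is a crude particle stand-in of our own for sampling from the annealed densities p_t(a) ∝ e^{t J(a)} p_0(a): reweight and resample along the t-schedule, instead of the deterministic path evolution the talk develops; the toy objective J and all constants are ours:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy objective with a single peak at a* = (0.5, -0.5); ours, for illustration.
def J(A):                                    # A: (particles, d)
    a_star = np.array([0.5, -0.5])
    return -np.sum((A - a_star) ** 2, axis=1)

d, P = 2, 2000
A = rng.uniform(-1, 1, size=(P, d))          # a_0 ~ product Uniform[-1,1]
ts = np.linspace(0, 50, 26)                  # annealing schedule for t
for t0, t1 in zip(ts[:-1], ts[1:]):
    w = np.exp((t1 - t0) * J(A))             # incremental weights e^{(t1-t0) J}
    w /= w.sum()
    idx = rng.choice(P, size=P, p=w)         # resample in proportion to weight
    A = A[idx] + 0.01 * rng.standard_normal((P, d))  # small jitter to diversify

print(np.round(A.mean(axis=0), 2))           # particles concentrate near a*
```

As t grows, the particle cloud tracks the sharpening density and concentrates near the maximizer of J.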
Optimization path
Adaptive Annealing: Arrange a_t from the evolving density
p_t(a) = e^{t J(a) − c_t} p_0(a)
with a_0 drawn from p_0(a)
State evolution with a vector-valued change function G_t(a):
a_{t+h} = a_t − h G_t(a_t)
Better state evolution: a_{t+h} = a* is the solution to
a_t = a* + h G_t(a*),
with small step size h, such that a + h G_t(a) is invertible with a positive definite Jacobian, and G_t solves the equations for the evolution of p_t(a).
As we will see, there are many such change functions G_t(a), though not all are nice.
Boundary requirements
Suppose a is restricted to a bounded domain A with smooth boundary ∂A.
For a ∈ ∂A, let v_a be the outward normal to ∂A at a.
Either G_t(a) = 0 (vanishing at the boundary)
Or G_t(a)ᵀ v_a ≥ 0 (to move inward, not outside)
Likewise, for a near ∂A, if G_t(a) approaches 0, it should do so at order not larger than the distance of a from ∂A.
Solve for the change G_t to track the density p_t
Density evolution, by the Jacobian rule:
p_{t+h}(a) = p_t(a + h G_t(a)) det(I + h ∇G_tᵀ(a))
Up to terms of order h,
p_{t+h}(a) = p_t(a) + h [ G_t(a)ᵀ ∇p_t(a) + p_t(a) ∇ᵀG_t(a) ]
in agreement for small h with the partial differential equation
∂/∂t p_t(a) = ∇ᵀ[ G_t(a) p_t(a) ]
The right side is G_tᵀ(a) ∇p_t(a) + p_t(a) ∇ᵀG_t(a). Dividing by p_t(a), it is expressed in log-density form:
∂/∂t log p_t(a) = ∇ᵀG_t(a) + G_tᵀ(a) ∇ log p_t(a)
Five candidate solutions
Five solutions to the partial differential equation at time t,
∇ᵀ[ G(a) p_t(a) ] = ∂_t p_t(a):
1 Solution in which G(a)p(a) is a gradient
2 Solution using pairs of coefficients
3 Solution with j random, where ∂_{a_j}[ G_j(a) p(a) ] provides ∂_t p_t(a)
4 Solution in which G(a) is a gradient
5 Approximate solutions expressed in terms of u_i = X_iᵀ a
Candidate solution 1.
Solution of smallest L2 norm of G_t(a) p_t(a) at a specific t.
Let G_t(a) p_t(a) = ∇b(a), the gradient of a function b(a)
Let f(a) = ∂/∂t p_t(a)
Let green_d(a) be proportional to 1/‖a‖^{d−2}, harmonic for a ≠ 0
The partial differential equation becomes the Poisson equation:
∇ᵀ∇b(a) = f(a)
Solution, using ∇green_d(a) = c_d a/‖a‖^d:
G_t(a) = (f ∗ ∇green_d)(a) / p_t(a)
If A ⊂ R^d, let the associated Green's function green_A(a, ā) replace green_d(a − ā) in the convolution. Then G_t(a) p_t(a) tends to 0 as a tends to ∂A.
Computational challenge: a high-dimensional convolution integral.
Candidate solution 2.
Solution using 2-dimensional Green integrals.
Write the pde ∇ᵀ[ G_t(a) p_t(a) ] = f(a) in the coordinates G_{t,j}:
∑_{j=1}^d ∂/∂a_j [ G_{t,j}(a) p_t(a) ] = f(a)
Pair consecutive terms to achieve a portion of the solution:
∑_{i∈{j,j+1}} ∂/∂a_i [ G_{t,i}(a) p_t(a) ] = (2/d) f(a)
Solution, for each consecutive pair of coordinates:
( G_{t,j}(a), G_{t,j+1}(a) ) = (2/d) (f ∗ ∇green_2)(a) / p_t(a)
The 2-dimensional Green's function gradient acts on (a_j, a_{j+1}).
This yields a numerical solution. Is it stable for particular J and p_0? Do we lose the desirable boundary behavior?
For p_0 we use a product of 2-dimensional uniform densities on unit disks.
It is stable if J(a) can exhibit only small change from changing two consecutive coordinates.
This is true for Lipschitz sigmoids with variable replication: terms φ(aᵀX) are represented, using small η, as φ(η ∑_{j,r} a_{j,r} X_j). For each X_j the aggregate coefficient is a_j = η ∑_{r=1}^{rep} a_{j,r}.
The challenge is that G(a) is not necessarily zero at the boundary.
Candidate solution 3. Solving 1-dim diff equations
Consider the equation with each a_j in an interval [−1, 1]:
∂_{a_j}[ G_j(a) p(a) ] = ∂_t p_t(a)
Call the right side f(a). It is an ordinary differential equation in a_j if the other coordinates a_{−j} are held fixed:
∂/∂a_j [ G_j(a_j, a_{−j}) p(a_j, a_{−j}) ] = f(a_j, a_{−j})
Solutions take the form
G_j(a) = (1/p(a)) [ ∫_{−1}^{a_j} f(s, a_{−j}) ds − C(a_{−j}) ]
Natural choices for the "constant" C(a_{−j}) are 0 or
S(a_{−j}) = ∫_{−1}^{1} f(s, a_{−j}) ds
so that G_j(a) p(a) is either ∫_{−1}^{a_j} f(s, a_{−j}) ds or −∫_{a_j}^{1} f(s, a_{−j}) ds
Candidate solution 3. Solving 1-dim diff equations
G_j(a) p(a) is either ∫_{−1}^{a_j} f(s, a_{−j}) ds or −∫_{a_j}^{1} f(s, a_{−j}) ds.
These choices make G_j(a) zero as a_j approaches one or the other of the endpoints, but not both.
Thus we are led to define the solution G_j(a) given by
(1/p(a)) [ ∫_{−1}^{a_j} f(s, a_{−j}) ds · 1{S(a_{−j}) ≥ 0} − ∫_{a_j}^{1} f(s, a_{−j}) ds · 1{S(a_{−j}) < 0} ]
This makes the move be inward near the non-zero edge.
If we pick j at random in 1, ..., d, we get a similar density update.
The rule solves the differential equation for the order-h term.
It satisfies some of the desired boundary properties.
Challenge: the boundary is slightly moved at the non-zero edge.
Candidate solution 4.
Perhaps the ideal solution is the one of smallest L2 norm of G_t(a).
It has G_t(a) = ∇b_t(a) equal to the gradient of a function.
The pde in log-density form
∇ᵀG_t(a) + G_tᵀ(a) ∇ log p_t(a) = ∂/∂t log p_t(a)
then becomes an elliptic pde in b_t(a) for fixed t.
With ∇ log p_t(a) and ∂/∂t log p_t(a) arranged to be bounded, the solution may exist and be nice.
But an explicit solution to this elliptic pde is not available (except perhaps numerically in low-dimensional cases).
Next we seek an approximate solution: for ridge bases, decompose into a system of first-order differential equations and integrate.
Candidate solution 5 by decomposition of ridge sum
Optimize J(a) = ∑_{i=1}^n r_i φ(X_iᵀ a)
Target density p_t(a) = e^{t J(a) − c_t} p_0(a) with c′_t = E_{p_t}[J]
The time score is ∂/∂t log p_t(a) = J(a) − E_{p_t}[J]
Specialize the pde in log-density form:
∇ᵀG_t(a) + G_tᵀ(a) ∇ log p_t(a) = J(a) − E_{p_t}[J]
The right side, setting b_{i,t} = E_{p_t}[φ(X_iᵀ a)], takes the form of a sum
∑_i r_i [ φ(X_iᵀ a) − b_{i,t} ].
Likewise ∇ log p_t(a) = t ∇J(a) + ∇ log p_0(a) is the sum
∑_i X_i [ t r_i φ′(X_iᵀ a) − (1/n)(X_iᵀ a) ]
using a Gaussian initial distribution with log p_0(a) equal to −(1/2n) ∑_i aᵀ X_i X_iᵀ a.
Account for the prior by appending d extra input vectors as columns of the identity, with a multiple of the prior score in place of φ′.
Approximate solution for ridge sums
Seek an approximate solution of the form
G_t(a) = ∑_i (X_i / ‖X_i‖²) g_i(u)
with u = (u_1, ..., u_n) evaluated at u_i = X_iᵀ a, for which
∇ᵀG_t(a) = ∑_i ∂/∂u_i g_i(u) + ∑_{i,j: i≠j} (X_iᵀ X_j / ‖X_i‖²) ∂/∂u_j g_i(u)
Can we ignore the coupling in the derivative terms? The ratios X_jᵀ X_i / ‖X_i‖² are small for uncorrelated designs and large d.
Match the remaining terms in the sums to solve for g_i(u).
Arrange g_i(u) to solve the differential equations
∂/∂u_i g_i(u) + g_i(u) [ t r_i φ′(u_i) − u_i/n + rest_i ] = r_i [ φ(u_i) − b_{i,t} ]
where rest_i = ∑_{j≠i} [ t r_j φ′(u_j) − u_j/n ] X_jᵀ X_i / ‖X_i‖².
Integral form of solution
Differential equation for g_i(u_i), suppressing dependence on the coordinates other than i:
∂/∂u_i g_i(u_i) + g_i(u_i) [ t r_i φ′(u_i) − u_i/n + rest_i ] = r_i [ φ(u_i) − b_{i,t} ]
Define the density factor
m_i(u_i) = e^{t r_i φ(u_i) − u_i²/(2n) + u_i rest_i}
which allows the above differential equation to be put back in the form
∂/∂u_i [ g_i(u_i) m_i(u_i) ] = r_i [ φ(u_i) − b_{i,t} ] m_i(u_i)
An explicit solution, evaluated at u_i = X_iᵀ a, is
g_i(u_i) = r_i ∫_{c_i}^{u_i} m_i(s) [ φ(s) − b_{i,t} ] ds / m_i(u_i)
where c_i = c_{i,t} is such that φ(c_{i,t}) = b_{i,t}.
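The integrating-factor identity can be verified numerically (the constants here are toy values of our own, with φ = tanh playing the role of the sigmoid):

```python
import numpy as np

# Check that g(u) = r * int_c^u m(s)[phi(s) - b] ds / m(u) solves
#   g'(u) + g(u)[t r phi'(u) - u/n + rest] = r [phi(u) - b],
# since m'(u)/m(u) = t r phi'(u) - u/n + rest. Toy constants of our own.
t, r, n, rest = 2.0, 0.7, 50.0, 0.1
phi = np.tanh
b = 0.3
c = np.arctanh(b)                    # chosen so that phi(c) = b

u = np.linspace(c, 2.0, 20001)
du = u[1] - u[0]
m = np.exp(t * r * phi(u) - u ** 2 / (2 * n) + u * rest)
integrand = m * (phi(u) - b)
# cumulative trapezoid integral from c to u
integral = np.concatenate(
    [[0.0], np.cumsum((integrand[1:] + integrand[:-1]) / 2) * du])
g = r * integral / m

# verify the differential equation at interior points by finite differences
dg = np.gradient(g, u)
phip = 1 - phi(u) ** 2               # derivative of tanh
lhs = dg + g * (t * r * phip - u / n + rest)
rhs = r * (phi(u) - b)
print(float(np.max(np.abs(lhs - rhs)[100:-100])))   # close to 0
```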
The derived change function $G_t$ for evolution of $a_t$

Include the $u_j$ for $j \ne i$ upon which $\mathrm{rest}_i$ depends. Our solution for $g_{i,t}(u)$ is

$$ g_{i,t}(u) = r_i \int_{c_i}^{u_i} e^{\, t\, r_i (\phi(\bar u_i) - \phi(u_i)) - (\bar u_i^2 - u_i^2)/2n + (\bar u_i - u_i)\, \mathrm{rest}_i(u)} \big[ \phi(\bar u_i) - b_{i,t} \big]\, d\bar u_i $$

Evaluating at $u = X a$ we have the change function

$$ G_t(a) = \sum_i \frac{x_i}{\|x_i\|^2}\, g_{i,t}(X a) $$

for which $a_t$ evolves according to

$$ a_{t+h} = a_t - h\, G_t(a_t) $$

For showing that $g_{i,t}$, $G_t$ and $\nabla G_t$ are nice, assume the activation function $\phi$ and its derivative are bounded (e.g. a logistic sigmoid or a sinusoid).

Run several optimization paths in parallel, starting from independent choices of $a_0$. This allows empirical computation of $b_{i,t} = E_{p_t}[\phi(x_i^T a_t)]$.
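A schematic of the parallel-path evolution (a sketch only: `G_t` below is a placeholder change function, not the slides' derived $g_{i,t}$ construction, and the sizes $n$, $d$, the number of paths, and $h$ are illustrative):

```python
import numpy as np

# Sketch: evolve many parallel paths a_{t+h} = a_t - h G_t(a_t), so averages
# such as b_{i,t} = E_{p_t}[phi(x_i^T a_t)] can be estimated empirically.
# G_t here is a placeholder (a toy gradient field), not the slides' exact
# change function.
rng = np.random.default_rng(0)
n, d, paths, h = 20, 5, 100, 1e-2
X = rng.standard_normal((n, d))
phi = np.tanh

def G_t(A, t):
    # placeholder change function acting on all paths at once
    U = A @ X.T                               # u_i = x_i^T a, per path
    return -t * ((1.0 - phi(U)**2) @ X) / n

A = rng.standard_normal((paths, d)) / np.sqrt(d)   # independent a_0 draws
for step in range(200):
    t = step * h
    b = phi(A @ X.T).mean(axis=0)             # empirical b_{i,t} over paths
    A = A - h * G_t(A, t)

assert b.shape == (n,) and np.all(np.isfinite(A))
```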
Conjectured conclusion
We have derived the desired optimization procedure and the following.

Conjecture: Take step size $h$ of order $1/n^2$, a number of steps of order $n\, d \log d$, and $X_1, X_2, \dots, X_n$ i.i.d. Normal$(0, I)$. With high probability on the design $X$, the above procedure produces optimization paths $a_t$ whose distribution closely tracks the target

$$ p_t(a) = e^{\, t J(a) - c_t}\, p_0(a) $$

such that, with high probability, the solution paths have instances of $J(a_t)$ that are at least $1/2$ the maximum.

Consequently, the relaxed greedy procedure is computationally feasible and achieves the indicated bounds for sparse linear combinations from the dictionary $\Phi = \{\phi(a^T x) : a \in \mathbb{R}^d\}$.
Summary
- Flexible approximation models
  - Subset selection
  - Nonlinearly parameterized bases, as with neural nets
  - $\ell_1$ control on coefficients of the combination
- Accurate approximation with a moderate number of terms
  - Proof analogous to random coding
- Information-theoretic risk bounds
  - Based on the minimum description length principle
  - Show accurate estimation, even for very large dimension
- Computational challenges are being addressed by
  - Adaptive annealing
  - Nonlinear power methods
Tensor and nonlinear power methods (overview)
- Known design distribution $p(X)$
- Target $f(x) = \sum_{k=1}^{m_0} g_k(a_k^T x)$ is a combination of ridge functions with distinct, linearly independent directions $a_k$
- Ideal: maximize $E[f(X)\phi(a^T X)]$ or $(1/n)\sum_i Y_i\, \phi(a^T X_i)$
- Score functions operating on $f(X)$ and $f(X)\, g(a^T X)$ yield population and sample versions of tensors

  $$ E\left[ \frac{\partial^3}{\partial X_{j_1} \partial X_{j_2} \partial X_{j_3}} f(X) \right] $$

  and nonlinearly parameterized matrices

  $$ E\big[ (\nabla \nabla^T f(X))\, g(a^T X) \big] $$

- Spectral decompositions then identify the directions $a_k$
Score method for representing expected derivatives
Score function (tensor) $S^\ell(X)$ of order $\ell$ from the known $p(X)$:

$$ S_{j_1,\dots,j_\ell}(X)\, p(X) = (-1)^\ell \frac{\partial^\ell}{\partial X_{j_1} \cdots \partial X_{j_\ell}} p(X) $$

Gaussian scores:

$$ S^1(X) = X, \qquad S^2(X) = X X^T - I, $$

$$ S^3_{j_1,j_2,j_3}(X) = X_{j_1} X_{j_2} X_{j_3} - X_{j_1} 1_{j_2,j_3} - X_{j_2} 1_{j_1,j_3} - X_{j_3} 1_{j_1,j_2}. $$

Expected derivative:

$$ E\left[ \frac{\partial^\ell}{\partial X_{j_1} \cdots \partial X_{j_\ell}} f(X) \right] = E\big[ f(X)\, S_{j_1,\dots,j_\ell}(X) \big] $$

by repeated integration by parts.
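The order-2 case of this definition can be checked directly: for the standard bivariate normal density, $S^2(x)\,p(x)$ should match the Hessian of $p$ (a small numerical check; the evaluation point is arbitrary):

```python
import numpy as np

# Check S2(x) p(x) = d^2 p / (dx_j dx_k) for the 2-d standard normal,
# using central finite differences at an arbitrary point x.
def p(x):
    return np.exp(-0.5 * x @ x) / (2.0 * np.pi)

x = np.array([0.3, -1.1])
S2 = np.outer(x, x) - np.eye(2)        # Gaussian second-order score

h = 1e-4
H = np.zeros((2, 2))                   # numerical Hessian of p at x
for j in range(2):
    for k in range(2):
        ej, ek = np.eye(2)[j], np.eye(2)[k]
        H[j, k] = (p(x + h*ej + h*ek) - p(x + h*ej - h*ek)
                   - p(x - h*ej + h*ek) + p(x - h*ej - h*ek)) / (4.0*h*h)

assert np.allclose(S2 * p(x), H, atol=1e-5)
```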
Expected derivatives of ridge combinations
Ridge combination target functions:

$$ f(X) = \sum_{k=1}^{m_0} g_k(a_k^T X) $$

Expected Hessian of $f(X)$:

$$ M = \sum_{k=1}^{m_0} a_k a_k^T\, E[g_k''(a_k^T X)] = E\big[ f(X)\, S^2(X) \big]. $$

Principal eigenvector: $\max_a \{ a^T M a \}$.

A linear power method finds the $a_k$ if they are orthogonal (they're not).

Third-order array (Anandkumar et al. 2015, draft):

$$ \sum_{k=1}^{m_0} a_{j_1,k}\, a_{j_2,k}\, a_{j_3,k}\, E[g_k'''(a_k^T X)] = E\big[ f(X)\, S_{j_1,j_2,j_3}(X) \big] $$

It can be whitened, and then a quadratic power method finds the $a_k$.
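A Monte Carlo illustration with a single hypothetical ridge component $f(X) = \cos(a^T X)$, for which $E[g''(a^T X)] = -e^{-\|a\|^2/2}$ is available in closed form (the direction $a$ and sample size are arbitrary choices):

```python
import numpy as np

# Monte Carlo check of the expected-Hessian identity for one ridge term:
# f(X) = cos(a^T X), X ~ N(0, I).  Then E[f(X) S2(X)] should equal
# a a^T E[g''(a^T X)] = -a a^T exp(-||a||^2 / 2).
rng = np.random.default_rng(1)
d, n = 4, 400_000
a = np.array([0.8, -0.3, 0.5, 0.1])
X = rng.standard_normal((n, d))

f = np.cos(X @ a)
# sample average of f(X) (X X^T - I)
M_hat = (X * f[:, None]).T @ X / n - np.eye(d) * f.mean()
M_exact = -np.outer(a, a) * np.exp(-0.5 * a @ a)
assert np.max(np.abs(M_hat - M_exact)) < 0.02
```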
Scoring a Ridge Function
A suitable activation function $\phi(a, X)$ for optimization of $E[f(X)\phi(a, X)]$.

Matrix scoring of a ridge function $g(a^T X)$:

$$ M_{a,X} = S^2\, g(a^T X) - \big[ S^1 a^T + a (S^1)^T \big]\, g'(a^T X) + \big[ a a^T \big]\, g''(a^T X) $$

Activation function formed by scoring a ridge function:

$$ \phi(a, X) = a^T [M_{a,X}]\, a = (a^T S^2 a)\, g(a^T X) - 2 (a^T S^1)(a^T a)\, g'(a^T X) + (a^T a)^2\, g''(a^T X) $$

Scoring a ridge function permits finding the component of $\phi(a,X)$ in the target function, using

$$ E[f(X)\phi(a,X)] = a^T E[f(X) M_{a,X}]\, a = a^T E\big[ (\nabla \nabla^T f(X))\, g(a^T X) \big]\, a $$

by twice integrating by parts.
Scoring a Ridge Function (Gaussian design case)
Matrix scoring of a ridge function $g(a^T X)$:

$$ M_{a,X} = S^2\, g(a^T X) - \big[ S^1 a^T + a (S^1)^T \big]\, g'(a^T X) + \big[ a a^T \big]\, g''(a^T X) $$

Activation function formed by scoring a ridge function:

$$ \phi(a, X) = a^T [M_{a,X}]\, a = (a^T S^2 a)\, g(a^T X) - 2 (a^T S^1)(a^T a)\, g'(a^T X) + (a^T a)^2\, g''(a^T X) $$

Gaussian design case, simplifying when $\|a\| = 1$:

$$ \phi(a, X) = \big[ (a^T X)^2 - 1 \big]\, g(a^T X) - \big[ 2\, a^T X \big]\, g'(a^T X) + g''(a^T X) $$

$$ \phi(z) = (z^2 - 1)\, g(z) - 2 z\, g'(z) + g''(z) $$

Hermite polynomials: if $g(z) = H_{\ell-2}(z)$ then $\phi(z) = H_\ell(z)$ for $\ell \ge 2$.
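A direct check of the Hermite claim with probabilists' Hermite polynomials (note the minus sign on the middle term, coming from the first-order score, which is what makes the stated $H_{\ell-2} \mapsto H_\ell$ identity hold):

```python
import numpy as np
from numpy.polynomial import hermite_e as He

# phi(z) = (z^2 - 1) g(z) - 2 z g'(z) + g''(z) sends g = He_{l-2} to He_l
z = np.linspace(-2.0, 2.0, 9)
for l in range(2, 7):
    g = np.zeros(l - 1)
    g[-1] = 1.0                                 # coefficients of He_{l-2}
    g0 = He.hermeval(z, g)
    g1 = He.hermeval(z, He.hermeder(g, 1))
    g2 = He.hermeval(z, He.hermeder(g, 2))
    phi = (z**2 - 1.0) * g0 - 2.0 * z * g1 + g2
    assert np.allclose(phi, He.hermeval(z, np.eye(l + 1)[l]))   # = He_l(z)
```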
Scored Ridge Function Decomposes $E[f(X)\phi(a,X)]$

Matrix-scored ridge function, providing $\phi(a,X) = a^T M_{a,X}\, a$:

$$ M_{a,X} = S^2\, g(a^T X) - \big[ S^1 a^T + a (S^1)^T \big]\, g'(a^T X) + \big[ a a^T \big]\, g''(a^T X) $$

The amount of $\phi(a,X)$ in $f(X)$, via the matrix decomposition

$$ M_a = E[f(X)\, M_{a,X}] = E\big[ (\nabla \nabla^T f(X))\, g(a^T X) \big] = \sum_{k=1}^{m_0} a_k a_k^T\, G_k(a_k, a), $$

is quantified by

$$ E[f(X)\phi(a,X)] = a^T [M_a]\, a = \sum_{k=1}^{m_0} (a_k^T a)^2\, G_k(a_k, a) $$

Here $G_k(a_k, a) = E[g_k''(a_k^T X)\, g(a^T X)]$ measures the strength of the match of $a$ to the direction $a_k$.

It replaces $E[g_k''(a_k^T X)\, (S^1)^T]\, a = (a_k^T a)\, E[g_k'''(a_k^T X)]$ in the tensor method of Anandkumar et al.
Using Sinusoids or Sigmoids
The amount of $\phi(a,X)$ in $f(X)$, via the matrix decomposition

$$ M_a = E[f(X)\, M_{a,X}] = \sum_{k=1}^{m_0} a_k a_k^T\, G_k(a_k, a), $$

is quantified by

$$ E[f(X)\phi(a,X)] = a^T [M_a]\, a = \sum_{k=1}^{m_0} (a_k^T a)^2\, G_k(a_k, a) $$

Here $G_k(a_k, a) = E[g_k''(a_k^T X)\, g(a^T X)]$ measures the strength of the match of $a$ to the direction $a_k$.

$\cos(z)$, $\sin(z)$ case, with $X$ standard multivariate Normal:

$$ g_k(a_k^T X) = -c_k\, e^{i\, a_k^T X} \quad \text{and} \quad g(a^T X) = e^{-i\, a^T X} $$

with expected product $G_k(a_k, a) = c_k\, e^{-(1/2)\|a_k - a\|^2}$.

Step sigmoid case $\phi(z) = 1_{\{z > 0\}}$: the $G_k(a_k, a)$ is determined by the angle between $a_k$ and $a$.
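The claimed expected product can be checked by reducing to one dimension, since $(a_k - a)^T X \sim N(0, \|a_k - a\|^2)$; a Gauss–Hermite quadrature then evaluates the characteristic function (the directions below are arbitrary examples):

```python
import numpy as np

# (a_k - a)^T X ~ N(0, s^2) with s = ||a_k - a||, so the expected product
# E[e^{i a_k^T X} e^{-i a^T X}] = E[e^{i s Z}], Z ~ N(0,1), which should
# equal exp(-s^2 / 2).
ak = np.array([1.0, 0.2, -0.4])
a = np.array([0.7, 0.5, 0.1])
s = np.linalg.norm(ak - a)

nodes, weights = np.polynomial.hermite.hermgauss(60)
# E[h(Z)] for Z ~ N(0,1):  sum_j w_j h(sqrt(2) x_j) / sqrt(pi)
Ez = np.sum(weights * np.exp(1j * s * np.sqrt(2.0) * nodes)) / np.sqrt(np.pi)

assert abs(Ez - np.exp(-0.5 * s ** 2)) < 1e-12
```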
Using Hermite polynomials
The amount of $\phi(a,X)$ in $f(X)$, via the matrix decomposition

$$ M_a = E[f(X)\, M_{a,X}] = \sum_{k=1}^{m_0} a_k a_k^T\, G_k(a_k, a), $$

is given by

$$ E[f(X)\phi(a,X)] = a^T [M_a]\, a = \sum_{k=1}^{m_0} (a_k^T a)^2\, G_k(a_k, a) $$

Here $G_k(a_k, a) = E[g_k''(a_k^T X)\, g(a^T X)]$ measures the strength of the match of $a$ to the direction $a_k$.

Hermite case: $g(z) = H_{\ell-2}(z)$, with $X \sim$ Normal$(0, I)$. The $H_\ell(a^T X)$ and $H_{\ell'}(a_k^T X)$ are orthogonal for $\ell' \ne \ell$.

$$ G_k(a_k, a) = c_{k,\ell}\, (a_k^T a)^\ell $$

with $c_{k,\ell} = E[g_k(Z) H_\ell(Z)]$ in the expansion $g_k(z) = \sum_{\ell'} c_{k,\ell'} H_{\ell'}(z)$.
Nonlinear Power Method
Maximize $J(a) = E[f(X)\phi(a,X)] = a^T M_a\, a$, subject to $\|a\| = 1$.

Cauchy–Schwarz inequality:

$$ a^T M_a\, a \le \|a\|\, \|M_a a\| $$

with equality iff $a$ is proportional to $M_a a$. This motivates the mapping of the nonlinear power method

$$ V(a) = \frac{M_a\, a}{\|M_a a\|} $$

- Seek fixed points $a^* = V(a^*)$ via the iterations $a_t = V(a_{t-1})$.
- Construct a whitened version.
- Verify that $J(a_t)$ is increasing.

The nonlinear power method provides maximizers of $J(a) = E[f(X)\phi(a,X)]$.
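A toy sketch of the iteration (an illustrative already-whitened Hermite-style setting with orthonormal directions and $G_k(\alpha_k^T u) = (\alpha_k^T u)^\ell$, $\ell = 3$; the dimensions and seed are arbitrary):

```python
import numpy as np

# Toy whitened setting: M(u) = sum_k alpha_k alpha_k^T (alpha_k^T u)^l with
# orthonormal alpha_k, l = 3.  Iterate u_t = M(u) u / ||M(u) u||.
rng = np.random.default_rng(2)
d, m0, l = 6, 3, 3
alpha, _ = np.linalg.qr(rng.standard_normal((d, m0)))  # orthonormal columns

def M(u):
    G = (alpha.T @ u) ** l            # G_k(alpha_k^T u), increasing in the overlap
    return (alpha * G) @ alpha.T

u = rng.standard_normal(d)
u /= np.linalg.norm(u)
J = []
for _ in range(30):
    v = M(u) @ u                      # power step
    u = v / np.linalg.norm(v)
    J.append(u @ M(u) @ u)            # J(u_t) = u^T M_u u

k_star = int(np.argmax(np.abs(alpha.T @ u)))
assert min(np.diff(J)) > -1e-9                        # J(u_t) nondecreasing
assert abs(abs(alpha[:, k_star] @ u) - 1.0) < 1e-8    # converged to an alpha_k
```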
Analysis with Whitening
Suppose $m_0 \le d$ (number of components $\le$ dimension).

Let $\mathrm{Ref} = \sum_k a_k a_k^T\, \beta_k$ be a reference matrix; for instance, $\mathrm{Ref} = M_{a_{\mathrm{ref}}}$ has $\beta_k = G_k(a_k, a_{\mathrm{ref}})$. Let $Q D Q^T$ be its eigendecomposition.

Let $W = Q D^{-1/2}$ be the whitening matrix:

$$ I = W^T \mathrm{Ref}\, W = \sum_k (W^T a_k)(a_k^T W)\, \beta_k = \sum_k \alpha_k \alpha_k^T $$

with orthonormal directions

$$ \alpha_k = W^T a_k \sqrt{\beta_k} $$

Represent $a = W u / \|W u\| = W u \sqrt{\beta}$ for unit vectors $u$. Then $a^T a_k = u^T \alpha_k\, (\beta/\beta_k)^{1/2}$.

Let $u_{\mathrm{ref}}$ be the unit vector proportional to $W^{-1} a_{\mathrm{ref}} = D^{1/2} Q^T a_{\mathrm{ref}}$.
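The construction can be verified numerically (arbitrary example directions and $\beta_k$; since $m_0 < d$ here, the null directions of Ref are dropped):

```python
import numpy as np

# Whitening check: Ref = sum_k beta_k a_k a_k^T = Q D Q^T, W = Q D^{-1/2}
# gives W^T Ref W = I, and alpha_k = sqrt(beta_k) W^T a_k are orthonormal.
rng = np.random.default_rng(3)
d, m0 = 5, 3
A = rng.standard_normal((d, m0))          # columns a_k, linearly independent
beta = np.array([1.5, 0.8, 0.4])

Ref = (A * beta) @ A.T                    # A diag(beta) A^T
D, Q = np.linalg.eigh(Ref)
keep = D > 1e-10                          # rank m0 < d: drop null directions
W = Q[:, keep] / np.sqrt(D[keep])

assert np.allclose(W.T @ Ref @ W, np.eye(m0))
alpha = (W.T @ A) * np.sqrt(beta)         # columns alpha_k
assert np.allclose(alpha.T @ alpha, np.eye(m0), atol=1e-8)
```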
Analysis of the Nonlinear Power Method
Criterion $E[f(X)\phi(a,X)] = a^T M_a\, a = u^T M_u\, u$, where

$$ M_u = \sum_k \alpha_k \alpha_k^T\, G_k(\alpha_k, u)\, \beta/\beta_k $$

and $G_k$ is here expressed via $\alpha_k, u$ in place of $a_k, a$. Example (Hermite case):

$$ G_k(\alpha_k^T u) = c_{k,\ell}\, (\alpha_k^T u)^\ell\, (\beta/\beta_k)^{\ell/2}, \qquad M_u = \sum_k \alpha_k \alpha_k^T\, (\alpha_k^T u / \alpha_k^T u_{\mathrm{ref}})^\ell $$

The power mapping $a_t = M_{a_{t-1}} a_{t-1} / \|\cdot\|$ corresponds to

$$ u_t = M_{u_{t-1}} u_{t-1} / \|\cdot\| $$

- Provably rapidly convergent when $G_k$ is increasing in $\alpha_k^T u$.
- The limit of $u_t$ is $u^* = \pm\alpha_k$ with the largest initial $(\alpha_k^T u_0 / \alpha_k^T u_{\mathrm{ref}})^\ell$.
- Each $+\alpha_k$ or $-\alpha_k$ is a local maximizer.
- The global maximizer corresponds to the largest $1/|\alpha_k^T u_{\mathrm{ref}}|$.
- The corresponding maximizer of $a^T M_a\, a$ is $a^*$ proportional to $W u^*$.
Analysis of Nonlinear Power Method, Polynomial Case
Let $c_k(t) = \alpha_k^T u_t$ be the coefficient of $u_t$ in the direction $\alpha_k$, and let $c_{k,\mathrm{ref}} = \alpha_k^T u_{\mathrm{ref}}$ be the coefficient of $u_{\mathrm{ref}}$ in the direction $\alpha_k$.

$$ M_{u_t} = \sum_k \alpha_k \alpha_k^T\, (\alpha_k^T u_t / \alpha_k^T u_{\mathrm{ref}})^\ell $$

so that

$$ M_{u_t} u_t = \sum_k \alpha_k\, (\alpha_k^T u_t)\, (\alpha_k^T u_t / \alpha_k^T u_{\mathrm{ref}})^\ell $$

Thus the coefficients for $u_{t+1}$ satisfy the recursion

$$ c_k(t+1) = \frac{ \big[ c_k(t)/c_{k,\mathrm{ref}} \big]^{\ell+1}\, c_{k,\mathrm{ref}} }{ \big[ \sum_k (\,\cdot\,)^2 \big]^{1/2} } $$

By induction,

$$ c_k(t) = \frac{ \big[ c_k(0)/c_{k,\mathrm{ref}} \big]^{(\ell+1)^t}\, c_{k,\mathrm{ref}} }{ \big[ \sum_k (\,\cdot\,)^2 \big]^{1/2} } $$

It rapidly concentrates on the index $k$ with the largest

$$ \frac{c_k(0)}{c_{k,\mathrm{ref}}} = \frac{\alpha_k^T u_0}{\alpha_k^T u_{\mathrm{ref}}} $$
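The one-step recursion and its closed form can be checked against each other directly (the coefficient vectors below are illustrative; $\ell = 2$):

```python
import numpy as np

# Iterating c_k(t+1) ∝ (c_k(t)/c_ref)^{l+1} c_ref reproduces the closed form
# c_k(t) ∝ (c_k(0)/c_ref)^{(l+1)^t} c_ref.
l = 2
c0 = np.array([0.9, 0.5, 0.3])
cref = np.array([1.0, 0.7, 0.6])

def unit(c):
    return c / np.linalg.norm(c)

c = unit(c0)
for t in range(1, 4):
    c = unit((c / cref) ** (l + 1) * cref)                     # one power step
    closed = unit((unit(c0) / cref) ** ((l + 1) ** t) * cref)  # induction formula
    assert np.allclose(c, closed)

# the iterate concentrates on argmax_k c_k(0)/c_{k,ref}  (here k = 0)
assert int(np.argmax(c0 / cref)) == int(np.argmax(np.abs(c)))
```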
Analysis of Nonlinear Power Method, Polynomial Case
Suppose $k = 1$ has the largest

$$ \frac{c_k(0)}{c_{k,\mathrm{ref}}} = \frac{\alpha_k^T u_0}{\alpha_k^T u_{\mathrm{ref}}} $$

with the others smaller by the factor $1 - \Delta$. Then

$$ \|u_t - \alpha_1\|^2 \le 2\, (1 - \Delta)^{2(\ell+1)^t} $$

Moreover $J(a_t) = E[f(X)\phi(a_t,X)] = u_t^T M_{u_t} u_t$ equals

$$ \sum_k \big[ c_k(t)/c_{k,\mathrm{ref}} \big]^{\ell+2}\, c_{k,\mathrm{ref}}^2 $$

which is strictly increasing in $t$, proven by applications of Hölder's inequality.

- The factor of increase is quantified by the exponential of a relative entropy.
- The increase at each step is large unless $c_k^2(t)$ is close to concentrated on the maximizers of $\alpha_k^T u_0 / \alpha_k^T u_{\mathrm{ref}}$.
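A small numerical illustration of the monotone increase (example coefficients, $\ell = 2$, so $\ell + 2$ is even and each term is nonnegative):

```python
import numpy as np

# Track J(a_t) = sum_k (c_k(t)/c_{k,ref})^{l+2} c_{k,ref}^2 along the
# polynomial-case power recursion; it should be nondecreasing.
l = 2
cref = np.array([1.0, 0.7, 0.6])
c = np.array([0.9, 0.5, 0.3])
c = c / np.linalg.norm(c)

J = []
for _ in range(6):
    J.append(float(np.sum((c / cref) ** (l + 2) * cref ** 2)))
    c = (c / cref) ** (l + 1) * cref          # power step on coefficients
    c = c / np.linalg.norm(c)

assert all(J[t + 1] >= J[t] - 1e-12 for t in range(len(J) - 1))
assert J[-1] > J[0]
```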