

Proposal for Qualifying Exam

David Weber

June 6, 2017

Exam Committee

Committee Chairperson:

Prof. James Bremer

Committee Members:

Prof. Naoki Saito

Prof. Thomas Strohmer

Prof. Yong Jae Lee

Prof. James Sharpnack

Exam Logistics

Date:

Friday June 16th, 2017

Time:

11:00 AM

Location:

MSB 2240

Proposed Research Talk

Title. Convolutional Neural Networks, the Scattering Transform, and Extensions

Abstract. Convolutional Neural Networks (CNNs) are a type of deep neural network which have performed well at image and audio classification. One approach to understanding the success of CNNs is Mallat's scattering transform, which formalizes the observation that the filters learned by a CNN have wavelet-like structure. The resulting transform generates a representation that is approximately translation invariant and slowly varying under a wide class of deformations of the input patterns. These features are then used by a linear classifier to perform classification.

In this talk, we will explore the application of the scattering transform and a new variant to understanding the classification of objects using sonar. As we have an explicit model with a fast simulator for this problem, it provides a good test problem for understanding the properties of the scattering transform in the object domain. We obtain over 95% classification accuracy on binary discrimination of shape and speed in the case of synthetic data generated by the fast simulator, and a 95% detection rate for unexploded ordnance, with a false positive rate of 25%, in the case of real data.

We construct a new variant of the scattering transform, the shearlet scattering transform (shattering transform), using the shearlet frame instead of the Morlet wavelets that are often used in the standard scattering transform. This 2D transform has the advantage of sparsifying signals containing curvilinear singularities rather than just point singularities.

Introduction

Deep neural networks, and convolutional neural networks in particular, have proven quite effective at discerning hierarchical patterns in large datasets [LeCun et al., 2015]. Some examples include image classification [Krizhevsky et al., 2012], face recognition [Sun et al., 2013], speech recognition [Dahl et al., 2013], and the game of Go [Silver et al., 2016], among many others. Clearly, something is going very right in the design of CNNs. However, the principles that account for this success are still somewhat elusive. Briefly, a CNN is a deep neural network where each layer is composed of a number of convolutions. Beyond this broad shared


framework, there are many variations, such as the shape of the convolution, the introduction of max or min pooling, the depth, the kinds of nonlinearities, etc. Each of these features can be varied between layers in a dizzying array of parameters and hyper-parameters.

The idea of a CNN is not new; its predecessor, the Neocognitron, was invented in 1980 by K. Fukushima [Fukushima, 1980]. Its structure was inspired by the structure of the visual system, in particular V1, where the layers of simple and complex cells form columns of locally connected structure, which are responsive to increasingly invariant representations [Hubel and Wiesel, 1968]. The first part of the visual cortex, preceding V1, has receptive fields that consist of regions of activation surrounded by inactivation. The simple cells consist of edge detectors, where each cell has a particular location and orientation [Hubel et al., 1995]. The complex cells have less locational sensitivity, and a sensitivity to motion.

The ties between wavelets and the visual system have been developed for decades. In the early 80's, several authors demonstrated that the simple cells are well modeled by Gabor filters (e.g. [Daugman, 1980], [Marcelja, 1980]). This has motivated the use of images generated using Gabor filters to determine properties of human visual processing; for example, Field used Gabor filters to demonstrate that human vision relies on continuity to detect features [Field et al., 1993]. In the 90's, this Gabor-like structure was experimentally demonstrated to emerge from sparse coding of natural images [Field, 1987] [Olshausen and Field, 1996].

Harmonic analysis has come and gone in the neural network literature, much like the field as a whole; in the 90's, when networks of only one hidden layer were the norm, there was some success in developing wavelet neural networks, where either the data was first processed with wavelets, or the dilation and translation parameters for the wavelets were trained simultaneously with the neuron weights [Szu et al., 1992] [Zhang, 1997].

In 2012, Stephane Mallat and Joan Bruna published both theoretical results [Mallat, 2012] and numerical implementations [Bruna and Mallat, 2011] tying together convolutional neural networks and wavelet theory. They demonstrated that scattering networks of wavelets and modulus nonlinearities are translation invariant in the limit of infinite scale, and Lipschitz continuous under non-uniform translation, i.e. $T_\tau(f)(x) = f(x - \tau(x))$ for $\tau$ with bounded gradient. Numerically, they achieved state of the art on image and texture classification problems.

More recent work from Wiatowski and Bolcskei at ETH Zurich has generalized the Lipschitz continuity result from wavelet transforms to frames, and, more importantly, established that increasing the depth of the network also leads to translation invariant features [Wiatowski and Bolcskei, 2015]. There have been a number of follow-up papers, including a discrete version of Wiatowski's result [Wiatowski et al., 2016], a related method on graphs [Chen et al., 2014], and an inverse problem used to solve the phase retrieval problem [Waldspurger et al., 2015]. There have been a number of papers using the scattering transform in such problems as fetal heart rate classification [Chudacek et al., 2014], age estimation from face images [Chang and Chen, 2015], and voice detection in the presence of transient noise [Dov and Cohen, 2014].

Figure 1: Example targets on shore and in the target environment, from [Kargl, 2015]

In this talk, I will apply both the scattering transform of Mallat et al. and a version based on the theory of Wiatowski et al. to the problem of classification of underwater objects using sonar. This is motivated by previous work done by Professor Saito on the detection of unexploded ordnance using synthetic aperture sonar [Marchand et al., 2007]. Coming from this, we have a dataset of 14 different objects at various distances and rotations, partially submerged on the ocean floor, as shown in Fig. 1. Additionally, we can generate synthetic examples using a fast solver for the Helmholtz equation in two regions of differing speed, provided by Dr. Ian Sammis and Professor Bremer. We use this to examine more closely the dependence of both classification and the output of the scattering transform on both material properties and shape variations.

The approximate translation invariance and Lipschitz continuity under small diffeomorphisms of the scattering transform mean that for classes which are invariant under these transformations, members of the same class will be close together in the resulting space. So long as transforming from one class to another requires a $\tau$ with large derivative, the classes will be well separated. Accordingly, we use a linear classifier on the output of the scattering transform. Additionally, because the scattering transform concentrates energy at lower scales and wavelets in general encourage sparsity for smooth signals with singularities, we use a sparse version of logistic regression as our linear classifier called LASSO [Hastie et al., 2015]. Specifically, we use a Julia wrapper around the glmnet code of [Friedman et al., 2010]. As a baseline classification scheme to compare the classification performance, we use logistic regression on the absolute value of the Fourier transform of the signal.
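To illustrate the sparsity that motivates this choice, here is a minimal NumPy sketch of LASSO fit by proximal gradient descent (ISTA) on synthetic data. This is a stand-in for illustration only, not the glmnet implementation used in this work; all data, dimensions, and parameters below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 50
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[[3, 17]] = [2.0, -1.5]               # only two active coefficients
y = X @ beta_true + 0.01 * rng.standard_normal(n)

lam = 0.1                                      # L1 penalty strength
L = np.linalg.norm(X, 2) ** 2 / n              # Lipschitz constant of the gradient
beta = np.zeros(p)
for _ in range(1000):
    grad = X.T @ (X @ beta - y) / n            # gradient of the squared loss
    z = beta - grad / L
    beta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0)  # soft-thresholding
```

The soft-thresholding step is what zeroes out irrelevant coordinates, so the fit recovers (a slightly shrunk version of) the two true coefficients and leaves the rest at exactly zero.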

Deep Convolutional Neural Networks

There are many ways to put together a CNN. For simplicity of presentation, we will consider data $U_0 \in \mathbb{R}^{D_0 \times 1}$ which is 1D, such as audio signals. Instead of a vector, $U_0$ could be a matrix (images), or a tensor (volumetric data), etc., with only minor changes. At layer $m$, there are $K_m$ filters, or receptive fields, $\{G_m^i\}_{i=1}^{K_m} \subset \mathbb{R}^{D_m \times K_{m-1}}$. A small number of adjacent rows of $G_m^i$ are non-zero, corresponding to small local support in the input. There is some Lipschitz continuous nonlinearity $M_m$, such as a rectified linear unit (ReLU), hyperbolic tangent, or logistic sigmoid, that is applied elementwise to the output of the convolution. Finally, there is some pooling function $R_m : \mathbb{R}^{D_{m-1} \times K_m} \to \mathbb{R}^{D_m \times K_m}$, which reduces $D_m \approx D_{m-1}/2$, often by returning only the maximum of 2 elements¹, or their mean. This plays the role of either striding or pooling in the literature. Putting all of these together, we can recursively calculate the matrix $U_m$ at layer $m$ via

$$U_m = R_m\left(M_m\left(G_m^1 \star_c U_{m-1}\right), \cdots, M_m\left(G_m^{K_m} \star_c U_{m-1}\right)\right)$$

where $\star_c$ denotes convolution performed strictly along the columns, followed by summing the results, i.e. if $g_1, \ldots, g_{K_{m-1}}$ are the columns of $G_m^1$ and $u_1, \ldots, u_{K_{m-1}}$ the columns of $U_{m-1}$, then

$$G_m^1 \star_c U_{m-1} := \sum_{i=1}^{K_{m-1}} g_i \star u_i \qquad (1)$$

The weights in the receptive fields $G_m^i$ are learned, but the support size is fixed. A single row $U_m^i$ corresponds to choosing a path consisting of a single filter from each layer.

Say we have $M$ layers, with the last layer being fully connected instead of having the above structure, and we have $N$ examples $\{U_0^j\}_{j=1}^N \subset \mathbb{R}^{D_0 \times 1}$ with labels $\{y^j\}_{j=1}^N \subset \mathbb{R}^{D_M \times 1}$, where $D_M$ is the number of classes. A commonly used objective is [Krizhevsky et al., 2012]

$$\min_{G} \sum_{j=1}^{N} \left\|y^j - U_M^j\right\|_2^2 + \gamma \sum_{n=1}^{M} \sum_{i=1}^{K_n} \left\|G_n^i\right\|_2^2 \qquad (2)$$

The first term corresponds to minimizing the error, while the second is a constraint on the total weight size; [Krizhevsky et al., 2012] found the inclusion of the weight constraint essential to training.
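The layer recursion above can be sketched in a few lines. The following NumPy toy (not the Julia/MATLAB code used in this work; the signal length, filter length, and channel counts are arbitrary choices for illustration) computes one layer $U_m$ using the columnwise convolution of Eq. (1), a ReLU nonlinearity, and max pooling by 2:

```python
import numpy as np

def conv_cols(G, U):
    """Columnwise convolution then sum, as in Eq. (1):
    the columns g_i of G are convolved with the columns u_i of U and summed."""
    return sum(np.convolve(G[:, i], U[:, i], mode="same")
               for i in range(U.shape[1]))

def cnn_layer(U_prev, filters):
    """One CNN layer: convolve each filter G_m^i, apply ReLU, max-pool by 2."""
    cols = [conv_cols(G, U_prev) for G in filters]       # K_m output channels
    A = np.maximum(np.stack(cols, axis=1), 0.0)          # elementwise ReLU (M_m)
    n = A.shape[0] - A.shape[0] % 2
    return A[:n].reshape(-1, 2, A.shape[1]).max(axis=1)  # pooling R_m, D_m = D_{m-1}/2

rng = np.random.default_rng(0)
U0 = rng.standard_normal((64, 1))                         # D_0 = 64, one input channel
filters = [rng.standard_normal((5, 1)) for _ in range(3)] # K_1 = 3 small filters
U1 = cnn_layer(U0, filters)                               # shape (32, 3)
```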

Previous Results

A generalized scattering network (hereafter referred to simply as a scattering network), as defined in [Wiatowski and Bolcskei, 2015]², has an architecture that is a continuous analog of a CNN. For a diagram, see Fig. 4. First, at layer $m$, we start with a family of generators $\{g_{\lambda_m}\}_{\lambda_m \in \Lambda_m} \subset L^1(\mathbb{R}^d) \cap L^2(\mathbb{R}^d)$ for a translation invariant frame $\Psi_m = \{\psi_{b,\lambda_m}\}_{b \in \mathbb{Z}^d,\, \lambda_m \in \Lambda_m} = \{T_b I g_{\lambda_m}\}_{b \in \mathbb{Z}^d,\, \lambda_m \in \Lambda_m}$ with frame bounds $a_m$ and $b_m$, that is

$$a_m \|f\|_2^2 \le \sum_{\lambda_m \in \Lambda_m,\, b \in \mathbb{Z}^d} \left|\langle T_b I g_{\lambda_m}, f \rangle\right|^2 \le b_m \|f\|_2^2$$

for all $f \in L^2(\mathbb{R}^d)$. Here, $T_t[f](x) = f(x - t)$ is the translation operator, $I[f](x) = f(-x)$ is the involution operator, and $\Lambda_m$ is some countable discrete index set, such as $\mathbb{Z}$. This index set needs to tile the frequency plane in some way, for example by indexing scales and rotations. The generators $g_{\lambda_m}$ correspond to the receptive fields found in each layer, with each $g_{\lambda_m}$ corresponding to a $G_m^i$. In addition, at layer $m$ we define a Lipschitz-continuous operator $m_m$ with bound $\gamma_m$ which satisfies $m_m f = 0 \Rightarrow f \equiv 0$. Finally, we select a sub-sampling factor $r_m \ge 1$.

¹ When we increase the input dimension, this will go up to $2^d$ for input dimension $d$.
² A more thorough exposition can be found at arXiv:1512.06293 [cs.IT].

The scattering transform of [Mallat, 2012] for $\mathbb{R}^d$ corresponds to choosing $\Lambda_m = \{a^j h\}_{j > -J,\, h \in H}$ for some finite rotation group $H$, $g_{0,1} = \psi(x)$ for some mother wavelet $\psi$ with its corresponding father wavelet $\phi$ as the averaging function, and $|\cdot|$ as the Lipschitz nonlinearity. However, it lacks a subsampling factor, so $r_m = 1$. Explicitly, the generator corresponding to index $\lambda = (j, h)$ is $g_\lambda = a^{dj}\psi(a^j h^{-1} x)$. To get from layer $m-1$ to layer $m$, we define the function $u_m : \Lambda_m \times L^2(\mathbb{R}^d) \to L^2(\mathbb{R}^d)$ by

$$u_m[\lambda_m](f)(z) = m_m[f \star g_{\lambda_m}](r_m z)$$

We use this to define the output at a layer along a path of indices $q \in \Lambda^m := \Lambda_1 \times \cdots \times \Lambda_m$. The output at layer $m$ is given by the path operator $u$, defined recursively using the $u_i$ by

$$u[q](f) = u_m[\lambda_m]\, u_{m-1}[\lambda_{m-1}] \cdots u_1[\lambda_1] f$$

Choosing $d = 1$ corresponds to audio as in the CNN section. Unlike a CNN, a scattering network has infinite depth, and thus each layer has output. The output is created by choosing some $\lambda_m^* \in \Lambda_m$, and defining the averaging function $\chi_m := g_{\lambda_m^*}$, which is removed from the set of generating functions. The resulting set of features at the $m$th layer is $\Phi_m[f] := \{u[q]f \star \chi_m\}_{q \in \Lambda^m}$, and for the entire network, $\Phi[f] := \bigcup_{m=0}^{\infty} \Phi_m[f]$.

The chief difference between a scattering transform and a CNN is the summation that occurs across filters in a CNN, as in Eq. (1). In a scattering transform, each filter is only applied to a specific path of filters in the layer below, generating a tree-like structure; it is possible to arrive at this in a CNN by simply fixing all but one column to 0.
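The tree structure can be made concrete with a small recursion. This NumPy sketch is a toy, with Gaussian band-pass filters standing in for a proper wavelet frame (not the implementation used in this work); it applies $|f \star g_\lambda|$ with subsampling $r_m = 2$ along every path:

```python
import numpy as np

def filter_bank(n, n_scales):
    """Toy band-pass filters in the Fourier domain (stand-ins for Morlet wavelets)."""
    freqs = np.abs(np.fft.fftfreq(n))
    return [np.exp(-(freqs - 2.0**-j) ** 2 / (2 * (2.0**-j / 4) ** 2))
            for j in range(1, n_scales + 1)]

def scatter(f, n_scales, depth, path=()):
    """Apply u_m[lambda] = |f * g_lambda| (subsampled by 2) along every path."""
    out = {path: f}
    if depth == 0 or len(f) < 4:
        return out
    for j, g_hat in enumerate(filter_bank(len(f), n_scales)):
        u = np.abs(np.fft.ifft(np.fft.fft(f) * g_hat))[::2]  # modulus + subsample
        out.update(scatter(u, n_scales, depth - 1, path + (j,)))
    return out

rng = np.random.default_rng(0)
tree = scatter(rng.standard_normal(64), n_scales=2, depth=2)
# paths: (), (0,), (1,), (0,0), (0,1), (1,0), (1,1) -- a tree, not a sum over channels
```

Note how each output depends on exactly one filter per layer, in contrast with the channel summation of Eq. (1).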

There are two additional conditions that restrict the various operators in a given layer simultaneously. The first is the weak admissibility condition, which requires that the upper frame bound $b_m$ be sufficiently small compared to the sub-sampling factor and Lipschitz constants:

$$\max\left\{b_m, \frac{b_m \gamma_m^2}{r_m^d}\right\} \le 1 \qquad (3)$$

This can be achieved by scaling the $g_m$, and suggests a possible role for the weight constraint in Eq. (2). The second is that the nonlinearities must commute with the translation operator, so $m_m T_t[f] = T_t m_m[f]$. Most nonlinearities used for CNNs are pointwise, i.e. $m_m[f](x) = \rho_m(f(x))$, so they certainly commute with $T_t$. Given these constraints, we can now state the results of [Wiatowski and Bolcskei, 2015] precisely; the first is that the resulting features $\Phi_m$ deform stably with respect to small frequency and space deformations:

Theorem 1. For frequency shift $\omega \in C(\mathbb{R}^d, \mathbb{R})$ and space shift $\tau \in C^1(\mathbb{R}^d)$, define the operator $F_{\tau,\omega}[f](x) := e^{2\pi i \omega(x)} f(x - \tau(x))$. If $\|D\tau\|_\infty \le \frac{1}{2d}$, then there exists a $C > 0$ independent of the choice of $\Phi$ s.t. for all $f \in L^2_a(\mathbb{R}^d)$,

$$\left\|\Phi[F_{\tau,\omega}f] - \Phi[f]\right\|_2 := \sum_{n=1}^{\infty} \sum_{q \in \Lambda^n} \left\|\chi_n \star u[q](F_{\tau,\omega}f) - \chi_n \star u[q]f\right\|_2 \le C\left(r\|\tau\|_\infty + \|\omega\|_\infty\right)\|f\|_2 \qquad (4)$$

where $L^2_a(\mathbb{R}^d)$ is the set of $L^2(\mathbb{R}^d)$ functions whose Fourier transforms are band limited to $[-a, a]$. For the proof, see [Wiatowski and Bolcskei, 2015], Appendices F, G and H; first they show that $\Phi_m$ is non-expansive, and then they show that $\|f - F_{\omega,\tau}f\|_2$ is bounded above by the upper bound in Eq. (4) using Schur's lemma for bounded integral operators. Together these imply the theorem. Mallat shows a similar bound for the specific case that $m_m = |\cdot|$ and $g_m$ is generated by an admissible mother wavelet $\psi$ with a number of conditions [Mallat, 2012] (see Theorem 2.6 for the admissibility conditions).

Their next result, translation invariance that increases with depth, is distinct from the one shown by Mallat, where translation invariance increases with resolution:

Theorem 2. Given the conditions above, for $f \in L^2(\mathbb{R}^d)$ the $m$th layer's features satisfy

$$\Phi_m[T_t f] = T_{t/(r_1 \cdots r_m)}\left[\Phi_m(f)\right] \qquad (5)$$


If there is also a global bound $K$ on the decay of the Fourier transforms of the output features $\chi_m$,

$$|\hat{\chi}_m(\omega)| \le K,$$

then we have the stronger result

$$\sum_{k=1}^{m} \left\|\Phi_k[T_t f] - \Phi_k[f]\right\|_2 \le \frac{2\pi |t| K}{r_1 \cdots r_m}\,\|f\|_2.$$

The proof of Eq. (5) follows almost immediately from the condition that $m_m$ and $T_t$ commute: first note that

$$u_m[\lambda_m] T_t f(x) = m_m[T_t f \star g_{\lambda_m}](r_m x) = m_m[T_t(f \star g_{\lambda_m})](r_m x) = m_m[f \star g_{\lambda_m}](r_m x - t) = m_m[f \star g_{\lambda_m}]\left(r_m\left(x - \frac{t}{r_m}\right)\right) = T_{t/r_m} u_m[\lambda_m] f(x)$$

Then Eq. (5) follows by repeated application along the path $q$. For the second result, see Appendices I and E of [Wiatowski and Bolcskei, 2015]. This is more technical, and relies on Parseval's formula, the previous result, the operator norm of $u_m[\lambda_m]$, and the weak admissibility condition.

Compare this with the result from [Mallat, 2012], which says that for an admissible mother wavelet $\psi$, the scattering transform achieves perfect translation invariance as the scale $J$ goes to infinity:

$$\lim_{J \to \infty} \|\Phi_J[T_t f] - \Phi_J[f]\| = 0$$
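The commutation identity in the proof above is easy to check numerically for circular convolution. The following NumPy sketch (a toy with a random filter and $r_m = 2$, purely illustrative) verifies $u_m[\lambda_m] T_t f = T_{t/r_m} u_m[\lambda_m] f$ for a shift divisible by the subsampling factor:

```python
import numpy as np

rng = np.random.default_rng(0)
n, t, r = 64, 8, 2                 # signal length, shift, subsampling factor
f = rng.standard_normal(n)
g_hat = np.fft.fft(rng.standard_normal(9), n)  # a random filter, zero-padded

def u(f):
    """One scattering layer: modulus of circular convolution, subsampled by r."""
    return np.abs(np.fft.ifft(np.fft.fft(f) * g_hat))[::r]

lhs = u(np.roll(f, t))             # u[T_t f]
rhs = np.roll(u(f), t // r)        # T_{t/r} u[f]
assert np.allclose(lhs, rhs)       # covariance: the output shifts by t / r
```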

Sonar Detection

The problem that we will be investigating with the scattering transform techniques is the classification of objects partially buried on the sea floor using sonar data. This is motivated by using an unmanned underwater vehicle (UUV) equipped with sonar to detect unexploded ordnance. For this problem, we have both real and synthetic examples. The real examples consist of 14 objects buried at various distances and rotations in a well-groomed, sandy, shallow ocean bed. The path of a UUV was simulated using a rail, and at short intervals along it the field of objects was pinged. So for each rotation of the object relative to the rail, there is a 2D wavefield, where each signal corresponds to a location on the rail and the observation time.

The synthetic examples come from considering the 2D Helmholtz equation in regions of differing speeds:

$$\begin{aligned} \Delta u + k_1^2 u &= 0 \quad \text{in } \Omega \\ \Delta v + k_2^2 v &= 0 \quad \text{in } \Omega^c \\ u - v &= g \quad \text{on } \partial\Omega \\ \partial_\nu u - \partial_\nu v &= \partial_\nu g \quad \text{on } \partial\Omega \\ \sqrt{|x|}\left(\partial_{|x|} - i k_2\right) v(x) &\to 0 \quad \text{as } |x| \to \infty, \end{aligned}$$

which gives the response to a sinusoidal signal with frequency $\omega$ on an object with $k_i = \omega/c_i$, where $c_i$ is the speed of sound in the material, ranging from 343 m/s in air, to 1503 m/s in water, to 5100 m/s in aluminum. It is worth noting that this is idealized in several ways: the model we use is 2D rather than 3D, the material is modelled as a fluid with only one layer instead of a solid with multiple different components, and there is no representation of the ocean floor itself.
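As a quick worked example of the relation $k_i = \omega/c_i$ (the ping frequency below is a hypothetical choice for illustration, not one from the dataset):

```python
import numpy as np

# speeds of sound (m/s) in the materials mentioned above
c = {"air": 343.0, "water": 1503.0, "aluminum": 5100.0}

f_hz = 10_000.0                    # hypothetical 10 kHz ping
omega = 2 * np.pi * f_hz           # angular frequency
k = {m: omega / ci for m, ci in c.items()}               # wavenumbers k_i = omega / c_i
wavelength = {m: 2 * np.pi / ki for m, ki in k.items()}  # lambda_i = c_i / f
```

A faster material gives a smaller wavenumber and longer wavelength, so the material speed directly rescales the oscillation of the solution.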

The signals sent out by the UUVs are not pure sinusoids. One can approximate the response to multifrequency signals (e.g. Gabor functions or chirps) by integrating across frequencies. We use a fast solver created by Ian Sammis and James Bremer to synthesize a set of examples, where we can more explicitly test the dependence of the scattering transform on the material properties (represented by the speed) and geometry. The current repository we use was created by Vincent Bodin, a former summer intern supervised by Professor Saito.


There are two ways we have approached classifying these sonar waveforms. Since at each rail location a 1D waveform is recorded, one can either treat each point in space as a different 1D signal, or gather all points on the rail into a single 2D wavefield. In the 1D case, there are far more examples (on the order of 1600 per waveform for the real case, or 480 for the synthetic), but less invariant structure. Specifically, translation in the object domain doesn't correspond to a translation in the signal domain for the 1D case. The only reason we can expect any success from the scattering transform is the first theorem above, of Lipschitz change under diffeomorphism. In the 2D case, translation parallel to the observation rail in the object domain corresponds to translation in the signal domain. However, translation not parallel to the rail modulates both amplitude and location in the signal domain. So the 2D case still relies on the signals not having too large a diffeomorphism between examples in the same class.

Figure 2: The scattering transform coefficients for a triangle, speed 2000 m/s, angle 0. The depth is m = 3. Scale coarsens vertically, while within a given layer, time increases horizontally. Note that only paths of increasing scale have been kept for computational reasons.

There are many design choices in testing the effectiveness of a classifier. As a baseline to compare against, we use the same glmnet logistic classifier on the absolute value of the Fourier transform (AVFT), which is a technique just a step beyond using a linear classifier directly on the data (this direct approach did roughly as well as chance, or 50% in the binary case). To understand the generalization ability of the techniques, we split the data into halves, one half training set and one half test set, uniformly at random 10 times (repeated random-split cross-validation).
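A minimal NumPy sketch of this baseline pipeline, with synthetic two-class signals and a plain gradient-descent logistic fit standing in for glmnet (every detail here is illustrative, not the actual experiment):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_signal(freq, n=128):
    """A noisy sinusoid; the two classes differ only in dominant frequency."""
    t = np.arange(n) / n
    return np.sin(2 * np.pi * freq * t) + 0.1 * rng.standard_normal(n)

X = np.array([make_signal(f) for f in [5] * 50 + [12] * 50])
y = np.array([0] * 50 + [1] * 50)

F = np.abs(np.fft.rfft(X, axis=1))          # AVFT features

idx = rng.permutation(len(y))               # one random half/half split
tr, te = idx[:50], idx[50:]

w, b = np.zeros(F.shape[1]), 0.0            # logistic regression by gradient descent
for _ in range(500):
    p = 1 / (1 + np.exp(-np.clip(F[tr] @ w + b, -30, 30)))
    g = p - y[tr]
    w -= 0.1 * F[tr].T @ g / len(tr)
    b -= 0.1 * g.mean()

pred = (1 / (1 + np.exp(-np.clip(F[te] @ w + b, -30, 30)))) > 0.5
acc = (pred == y[te]).mean()                # near-perfect on this easy toy problem
```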

For the synthetic data, there are two primary problems of interest. The first is determining the importance of varying shape on the scattering transform. An example of the 1D scattering transform for a triangle is in Fig. 2. Since the energy at each layer decays exponentially, layers 1-3 have been scaled to match the intensity of the first layer. Note that only the zeroth layer has negative values; this is because the nonlinearity used by the scattering transform is an absolute value. In the figure, one can clearly see a time-concentrated portion of the signal. At deeper layers, the energy is concentrated at the coarsest scale. The scattering transform was able to successfully discriminate between these similar shapes at over an 80% rate, while the AVFT achieved 70%. While tractable, this problem is more difficult than the other problem, determining the importance of material. Even the AVFT was successful on this problem, achieving a 96% classification accuracy. The most important coordinate for the AVFT was the average. Initially, the scattering transform had the same results as the AVFT, but increasing the quality factor and the depth of the scattering transform improved its accuracy to 99%.

For the real dataset, the problem of interest is somewhat more ambiguous. In addition to a set of UXOs and a set of arbitrary objects, there are some UXO replicas, not all of which are made of the same material. As we saw in the synthetic case, the difference in speed has a much clearer effect on classification accuracy than shape, so it is somewhat ambiguous how to treat these. However, even including these replicas in the UXO category gave us a false negative rate of 5% and a false positive rate of 24% (or 94% correctly identified UXOs and 75% correctly identified others); in comparison, the AVFT had a false negative rate of 9% and a false positive rate of 50% (so no better than random guessing on the non-UXOs). It is not terribly surprising that the non-UXOs are harder to classify as a group using a linear classifier, since they don't share much in common: a SCUBA tank bears more resemblance to a UXO than it does to a rock.

Shearlets

Previous versions of the scattering transform have relied primarily on Morlet wavelets. A problem with this approach is that tensor products of wavelets are best suited to detect point singularities. However, in the sonar problem, and in most images, the most relevant singularities are two dimensional curves, meaning that representations using tensor product wavelets will be much denser than is conceivably necessary. A natural solution to this problem is curvelets, which use a rotation operator and a parabolic scaling matrix:

$$A_j = \begin{pmatrix} 2^j & 0 \\ 0 & 2^{j\alpha/2} \end{pmatrix}$$

Figure 3: Absolute value of a discrete shearlet system, where $\psi_{i,j,k}$ is in cone $i$, with scale $j$ and shearing $k$, and $\phi$ is the averaging function

where $\alpha \in (0, 2)$ controls the degree of anisotropy. If $R_\theta$ is a rotation by $\theta$, then curvelets are defined by $\psi_{j,\theta}(x) = 2^{3j/4}\psi(R_\theta A_j x)$ [Candes, 2003]. Curvelets have been shown to be almost optimal in terms of sparsity for cartoon-like images (those which are smooth, except along a simple, differentiable curve) [Candes and Donoho, 2004]. However, these suffer from numerical instability, since discretization doesn't commute with the rotation operator for non-trivial rotations. Thus the shearlet was created, which uses shearing matrices, with $k \in \mathbb{Z}$ and spacing factor $c$:

$$S_k = \begin{pmatrix} 1 & ck \\ 0 & 1 \end{pmatrix}$$

instead of rotation matrices. Then if one chooses $\psi$ and $\phi$ appropriately, the resulting shearlet system is

$$\psi_{j,k}(x) = 2^{3j/4}\psi(S_k A_j x), \quad \phi(x),$$

which is convolved with the input to generate the transform. As $k \to \infty$, $\psi_{j,k}$ tiles the frequency plane. However, it does so with a directional bias, so when we truncate at finite $k$, more of the first frequency axis is covered than the second. To counter this, two cones of shearlets are used, with the roles of $x_1$ and $x_2$ reversed; for example, in Fig. 3, $\psi_{1,1,1}$ and $\psi_{2,1,1}$ have the same scale and shearing, but different cones. The combination of $S_k$ and $A_j$ leaves the discretization lattice intact if chosen well. The resulting transform is the compactly supported universal shearlet transform. The shearlets used in this paper are defined so that they are approximately classical shearlets:
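The two matrices are simple enough to sketch directly. This NumPy fragment (with arbitrary illustrative choices $\alpha = 1$ and $c = 1$) builds $A_j$ and $S_k$ and checks that shearing, unlike a generic rotation, preserves the integer lattice:

```python
import numpy as np

alpha, c = 1.0, 1.0         # anisotropy in (0, 2) and spacing factor (arbitrary here)

def A(j):                    # parabolic scaling matrix A_j
    return np.array([[2.0**j, 0.0], [0.0, 2.0**(j * alpha / 2)]])

def S(k):                    # shearing matrix S_k
    return np.array([[1.0, c * k], [0.0, 1.0]])

# shearing is volume preserving and maps Z^2 into Z^2
assert np.isclose(np.linalg.det(S(3)), 1.0)
assert np.allclose(S(3) @ np.array([2.0, 5.0]), [17.0, 5.0])

x = S(1) @ A(2) @ np.array([1.0, 1.0])   # a sheared, parabolically scaled point
```

This lattice compatibility is exactly why the discretization of shearlets is stable where that of curvelets is not.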

$$\hat\psi(\xi_1, \xi_2) \approx \hat\psi_1(\xi_1)\,\hat\psi_2(\xi_2/\xi_1)$$

$$\sum_{j \in \mathbb{Z}} \left|\hat\psi_1(2^j \xi)\right|^2 = 1 \quad \text{for a.e. } \xi \in \mathbb{R}$$

$$\left|\hat\psi_2(\xi - 1)\right|^2 + \left|\hat\psi_2(\xi)\right|^2 + \left|\hat\psi_2(\xi + 1)\right|^2 = 1 \quad \text{for a.e. } \xi \in [-1, 1]$$

i.e. $\psi_1$ satisfies a discrete Calderon condition and $\{\hat\psi_2(\xi - \ell)\}_{\ell \in \mathbb{Z}}$ is a partition of unity. This results in wedge-like essential support in the Fourier domain. Further, the construction means that the shearlets generated by $\psi$ are compactly supported, at the cost of being only approximately classical. More details on the construction can be found in [Kutyniok et al., 2016], and on shearlets in the text [Kutyniok and Labate, 2012].

Shattering Transform

With shearlets introduced, we can now define the shattering transform, created by Professor Saito and myself. In each layer of the shattering transform, we use universal shearlets generated by ShearLab [Kutyniok et al., 2016]. The nonlinearity we use in these examples is the ReLU, which has a Lipschitz constant of 1 when all variables are real, and we subsample by a factor of 2 every layer. Estimating frame bounds can be difficult; [Kittipoom et al., 2012] have provided some estimates for the class used by ShearLab. For the purpose of this talk, we will assume that the frame bound $b_m \le 1$, so because the Lipschitz constant of the ReLU is 1, any subsampling rate larger than 1 will satisfy Eq. (3). In this talk, I will present comparisons between the shattering transform, a tensor product version of the Morlet wavelet scattering transform, and the absolute value of the Fourier transform on the synthetic data set. Currently, there are implementation issues with the shattering transform on examples as large as those in the real dataset.


My Contributions

The first project that I undertook was using the scattering transform, as implemented in MATLAB by the Mallat lab, on the real dataset to classify whether an object is unexploded ordnance or not, using the 1D approach. The scattering transform performed significantly better than the AVFT across 10-fold cross validation on the real dataset, as well as on shape and material discrimination in the synthetic case. I created a new way of visualizing the results of the 1D scattering transform, both for selected coordinates and the actual data, as in Fig. 2. In the synthetic case, I have created a Julia wrapper for the Helmholtz equation solver to facilitate the creation of new examples.

For the 2D case, I wrote the shearlet scattering transform in Julia, using ShearLab for the shearlet transform, and compared its classification performance with a Morlet scattering transform and a 2D AVFT. On shape discrimination in the synthetic dataset, the shattering transform improved classification from 91% to 95%, using 2/3 as many coefficients.

Future Plans

In addition to continued exploration of the shattering transform, there are a number of related problems I hope to investigate:

One theoretical problem that arises naturally in the synthetic data is the distinction between invariance properties in the object domain instead of the signal domain. The results thus far discuss translations and diffeomorphisms of the signal, but the real invariances of class in the sonar problem arise from translation and rotation of the object. Capturing these may require examining a different diffeomorphism in place of $F_{\omega,\tau}[f](x) = e^{i\omega(x)}f(x - \tau(x))$, since amplitude shifts can't necessarily be written in this form. To see this, consider $f(x) = e^{-x^2}$ and $g(x) = Af(x)$, with $A > 1$. If we attempt to represent $g(x) = F_{\omega,\tau}[f](x)$, then $\omega(x) = 0$, and $\tau$ is one of $x \pm \sqrt{x^2 - \log A}$, which is imaginary for $x \in \left(-(\log A)^{1/2}, (\log A)^{1/2}\right)$. This simple example, combined with the usefulness of the scattering transform in the sonar problem, suggests that the scattering transform is robust to a more general class of diffeomorphisms which can represent the transformations in the object domain for the sonar scattering problem.

While the Helmholtz equation is particular to sonar scattering, there are other problems which arise from observing the result of a PDE, such as photographs, seismic data, or audio. The simplification of the Helmholtz equation above serves as one example of a class of inverse problems, where we are trying to determine categorical information about an object by observing it indirectly. Providing bounds in terms of transformations in the object domain, rather than the signal domain, more naturally reflects the properties used for classification. Generalizing this in some way would speak to the effectiveness of CNNs in image recognition.

Another theoretical problem is the lack of lower bounds for the scattering transform. Neither Mallat and his group, nor Wiatowski & Bolcskei, give lower bounds on ∥Φ[F_{ω,τ}f] − Φ[f]∥ in terms of τ and ω. If we can derive such bounds, this provides guarantees for when different classes will be well separated.

An essential step for the shattering transform is to show that it satisfies the frame bounds Eq. (3). As mentioned above, for the frames from Shearlab, there are some loose frame bounds provided in [Kittipoom et al., 2012]. It should be a fairly straightforward problem to adapt these to prove Eq. (3). Additionally, for discrete frames, because the optimal frame bounds correspond to the eigenvalues of the frame operator, this enables us to explicitly test whether a given frame satisfies Eq. (3), so long as the dimension is sufficiently small [Christensen, 2003].
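Concretely, if the frame vectors are stacked as the rows of a matrix F, the frame operator is S = FᵀF and its extreme eigenvalues are the optimal frame bounds. A small Python sketch of this eigenvalue test (the random frame here is only a stand-in for a discretized shearlet frame):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 8, 20                       # dimension 8, 20 frame vectors (m >= n)
F = rng.standard_normal((m, n))    # rows are the frame vectors

S = F.T @ F                        # frame operator S = sum_k phi_k phi_k^T
eigs = np.linalg.eigvalsh(S)       # ascending eigenvalues of S
A_opt, B_opt = eigs[0], eigs[-1]   # optimal lower/upper frame bounds

# Sanity check against the definition:
# A ||x||^2 <= sum_k |<x, phi_k>|^2 <= B ||x||^2
x = rng.standard_normal(n)
energy = np.sum((F @ x) ** 2)
assert A_opt * (x @ x) - 1e-9 <= energy <= B_opt * (x @ x) + 1e-9
print(A_opt > 0)                   # a frame iff the lower bound is positive
```

The cost is that of an n × n eigendecomposition, which is why this check is only practical when the dimension is sufficiently small.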

CNNs and scattering transforms work well on Euclidean domains. However, there is much more data that is representable on graphs, such as epidemiological problems, trade networks, and fMRI, to name a few. There is an emerging literature on wavelets and harmonic analysis on graphs [Shuman et al., 2013], and a natural extension of the scattering transform is to apply it to graph domains using an existing graph wavelet transform, such as the generalized Haar-Walsh transform of [Irion and Saito, 2014].
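As a toy illustration of the ingredients such an extension needs, the Python sketch below builds the combinatorial Laplacian of a path graph and filters a graph signal in the graph Fourier domain; the band-pass kernel is an arbitrary illustrative choice, not the generalized Haar-Walsh transform itself:

```python
import numpy as np

# Path graph on n nodes: adjacency matrix and combinatorial Laplacian L = D - A.
n = 16
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
L = np.diag(A.sum(axis=1)) - A

lam, U = np.linalg.eigh(L)        # graph Fourier basis (Laplacian eigenvectors)

def spectral_wavelet(f, g):
    """Filter a graph signal f by a spectral kernel g(lambda):
    transform to the graph Fourier domain, scale, transform back."""
    return U @ (g(lam) * (U.T @ f))

f = np.zeros(n)
f[n // 2] = 1.0                   # delta signal at the middle node
w = spectral_wavelet(f, lambda l: l * np.exp(-l))   # illustrative band-pass kernel
print(w.shape)                    # (16,)
```

A graph scattering transform would iterate this filtering with a modulus nonlinearity, in direct analogy with the Euclidean cascade.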


[Figure 4 (diagram, reconstructed as text): the generalized scattering transform. Layer 0 is the input f(x); layer m propagates U_m J[f] = σ(r_m ψ_λ ⋆ σ(r_{m−1} ψ_λ ⋆ ··· σ(r_1 ψ_λ ⋆ f) ···)) along every filter path, with λ ranging over λ_0, …, λ_N at each layer, and each layer outputs the averaged coefficients S_m J[f] = φ ⋆ U_m J[f].]

Figure 4: Generalized scattering transform


Proposed Exam Syllabus

Harmonic Analysis (MAT 271):

Fourier Analysis [Hunter and Nachtergaele, 2001] [Mallat, 2009] [Stein and Shakarchi, 2003b]: Fourier transform, Fourier series, convergence, DFT, DCT, FFT, Shannon-Nyquist sampling theorem, Parseval/Plancherel theorems, Riemann-Lebesgue lemma, discrete & continuous uncertainty principles

Frames [Christensen, 2003]: Definition, frames of translates, dual frames, Gabor frames, short time Fourier transform

Wavelets [Mallat, 2009]: Haar transform, Walsh transform, multi-resolution analysis, Shannon-Littlewood-Paley wavelets, DWT, curvelets, shearlets, wavelet packets

Optimization & Machine learning (MAT 258A, MAT 226A, MAT 280):

Optimization [Boyd and Vandenberghe, 2004]: Lagrangian duality, KKT conditions, linear programming, conjugate gradient, quasi-Newton, BFGS, augmented Lagrangian, Wolfe conditions

Numerical Linear Algebra [Horn and Johnson, 2012]: QR factorization, LU decomposition, SVD, conditioning, algorithmic stability, Cholesky factorization, Krylov subspaces, power method for eigenvalues

Machine learning [Bishop, 2006] [Hastie et al., 2015]: Support vector machines, logistic regression, generative vs discriminative models (linear discriminant analysis), neural networks, mutual information, relative entropy, Markov random fields

Mathematical Foundations of Big Data (MAT 280, Strohmer notes): Principal component analysis, concentration of measure, clustering/community detection techniques, compressed sensing, lasso, basic spectral graph theory, diffusion maps

Analysis (MAT 201AB, MAT 205):

Real [Rudin, 1987] [Hunter and Nachtergaele, 2001]: Banach spaces, Hilbert spaces, dominated convergence theorem, monotone convergence theorem, reproducing kernel Hilbert spaces

Complex [Stein and Shakarchi, 2003a]: Cauchy's theorem, residue theorem, holomorphic and meromorphic functions, argument principle, Paley-Wiener theorem, Gamma function, conformal mapping

Probability & Statistics (MAT 235AB) [Kallenberg, 1997]: Basic inequalities (Markov's, Chebyshev's, etc.), Borel-Cantelli lemma, Markov processes, convergence in probability vs distribution vs L^p, strong law of large numbers, Portmanteau theorem

References

[Bishop, 2006] Bishop, C. M. (2006). Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA.

[Boyd and Vandenberghe, 2004] Boyd, S. P. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press, New York.

[Bruna and Mallat, 2011] Bruna, J. and Mallat, S. (2011). Classification with scattering operators. In 2011 IEEE Conf. on Computer Vision and Pattern Recognition, pages 1561–1566.

[Candes, 2003] Candes, E. J. (2003). What is... a curvelet? Notices of the American Mathematical Society, 50(11):1402–1403.

[Candes and Donoho, 2004] Candes, E. J. and Donoho, D. L. (2004). New tight frames of curvelets and optimal representations of objects with piecewise C2 singularities. Communications on Pure and Applied Mathematics, 57(2):219–266.


[Chang and Chen, 2015] Chang, K.-Y. and Chen, C.-S. (2015). A learning framework for age rank estimation based on face images with scattering transform. IEEE Transactions on Image Processing, 24(3):785–798.

[Chen et al., 2014] Chen, X., Cheng, X., and Mallat, S. (2014). Unsupervised deep Haar scattering on graphs. Advances in Neural Information Processing Systems 27, pages 1709–1717.

[Christensen, 2003] Christensen, O. (2003). An Introduction to Frames and Riesz Bases. Birkhauser, Boston.

[Chudacek et al., 2014] Chudacek, V., Anden, J., Mallat, S., Abry, P., and Doret, M. (2014). Scattering transform for intrapartum fetal heart rate variability fractal analysis: A case-control study. IEEE Transactions on Biomedical Engineering, 61(4):1100–1108.

[Dahl et al., 2013] Dahl, G. E., Sainath, T. N., and Hinton, G. E. (2013). Improving deep neural networks for LVCSR using rectified linear units and dropout. In 2013 IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pages 8609–8613.

[Daugman, 1980] Daugman, J. G. (1980). Two-dimensional spectral analysis of cortical receptive field profiles. Vision Research, 20(10):847–856.

[Dov and Cohen, 2014] Dov, D. and Cohen, I. (2014). Voice activity detection in presence of transients using the scattering transform. In 2014 IEEE 28th Conv. of Electrical and Electronics Engineers in Israel, pages 1–5.

[Field, 1987] Field, D. J. (1987). Relations between the statistics of natural images and the response properties of cortical cells. Journal of the Optical Society of America A, Optics and Image Science, 4(12):2379–2394.

[Field et al., 1993] Field, D. J., Hayes, A., and Hess, R. F. (1993). Contour integration by the human visual system: Evidence for a local "association field". Vision Research, 33(2):173–193.

[Friedman et al., 2010] Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22.

[Fukushima, 1980] Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4):193–202.

[Hastie et al., 2015] Hastie, T., Tibshirani, R., and Wainwright, M. (2015). Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman & Hall/CRC Press.

[Horn and Johnson, 2012] Horn, R. A. and Johnson, C. R. (2012). Matrix Analysis. Cambridge University Press.

[Hubel et al., 1995] Hubel, D. H., Wensveen, J., and Wick, B. (1995). Eye, Brain, and Vision. Scientific American Library, New York.

[Hubel and Wiesel, 1968] Hubel, D. H. and Wiesel, T. N. (1968). Receptive fields and functional architecture of monkey striate cortex. The Journal of Physiology, 195(1):215–243.

[Hunter and Nachtergaele, 2001] Hunter, J. K. and Nachtergaele, B. (2001). Applied Analysis. World Scientific, Hackensack, NJ.

[Irion and Saito, 2014] Irion, J. and Saito, N. (2014). The generalized Haar-Walsh transform. In 2014 IEEE Workshop on Statistical Signal Processing (SSP).

[Kallenberg, 1997] Kallenberg, O. (1997). Foundations of Modern Probability. Springer Series in Statistics, Probability and Its Applications. Springer-Verlag.

[Kargl, 2015] Kargl, S. (2015). Acoustic response of underwater munitions near a sediment interface: Measurement model comparisons and classification schemes. SERDP and University of Washington Seattle Applied Physics Lab.

[Kittipoom et al., 2012] Kittipoom, P., Kutyniok, G., and Lim, W.-Q. (2012). Construction of compactly supported shearlet frames. Constructive Approximation, 35(1):21–72.

[Krizhevsky et al., 2012] Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Pereira, F., Burges, C. J. C., Bottou, L., and Weinberger, K. Q., editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc.

[Kutyniok and Labate, 2012] Kutyniok, G. and Labate, D., editors (2012). Shearlets: Multiscale Analysis for Multivariate Data. Applied and Numerical Harmonic Analysis. Birkhauser, Basel.

[Kutyniok et al., 2016] Kutyniok, G., Lim, W.-Q., and Reisenhofer, R. (2016). ShearLab 3D: Faithful digital shearlet transforms based on compactly supported shearlets. ACM Transactions on Mathematical Software, 42(1):5:1–5:42.

[LeCun et al., 2015] LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature, 521(7553):436–444.

[Mallat, 2009] Mallat, S. (2009). A Wavelet Tour of Signal Processing. Academic Press.

[Mallat, 2012] Mallat, S. (2012). Group invariant scattering. Communications on Pure and Applied Mathematics, 65(10):1331–1398.

[Marcelja, 1980] Marcelja, S. (1980). Mathematical description of the responses of simple cortical cells. Journal of the Optical Society of America, 70(11):1297–1300.

[Marchand et al., 2007] Marchand, B., Saito, N., and Xiao, H. (2007). Classification of objects in synthetic aperture sonar images. In 2007 IEEE/SP 14th Workshop on Statistical Signal Processing, pages 433–437.

[Olshausen and Field, 1996] Olshausen, B. A. and Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, (381):607–609.

[Rudin, 1987] Rudin, W. (1987). Real and Complex Analysis. McGraw-Hill.


[Shuman et al., 2013] Shuman, D. I., Narang, S. K., Frossard, P., Ortega, A., and Vandergheynst, P. (2013). The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Processing Magazine, 30(3):83–98.

[Silver et al., 2016] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489.

[Stein and Shakarchi, 2003a] Stein, E. and Shakarchi, R. (2003a). Complex Analysis. Princeton University Press, Princeton, NJ.

[Stein and Shakarchi, 2003b] Stein, E. and Shakarchi, R. (2003b). Fourier Analysis: An Introduction. Princeton University Press, Princeton, NJ.

[Sun et al., 2013] Sun, Y., Wang, X., and Tang, X. (2013). Deep convolutional network cascade for facial point detection. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, pages 3476–3483.

[Szu et al., 1992] Szu, H. H., Telfer, B. A., and Kadambe, S. L. (1992). Neural network adaptive wavelets for signal representation and classification. Optical Engineering, 31(9):1907–1916.

[Waldspurger et al., 2015] Waldspurger, I., D'Aspremont, A., and Mallat, S. (2015). Phase recovery, MaxCut and complex semidefinite programming. Mathematical Programming, 149(1-2):47–81.

[Wiatowski and Bolcskei, 2015] Wiatowski, T. and Bolcskei, H. (2015). Deep convolutional neural networks based on semi-discrete frames. In 2015 IEEE Int. Symp. on Information Theory, pages 1212–1216.

[Wiatowski et al., 2016] Wiatowski, T., Tschannen, M., Stanic, A., Grohs, P., and Bolcskei, H. (2016). Discrete deep feature extraction: A theory and new architectures. In Proc. of the Int. Conf. on Machine Learning (ICML).

[Zhang, 1997] Zhang, Q. (1997). Using wavelet network in nonparametric estimation. IEEE Transactions on Neural Networks, 8(2):227–236.
