
Convolutional Neural Networks The forward pass algorithm (without pooling):

Mathematically:

Training stage:

Consider the task of classification, for example.

Given a set of signals $\{\mathbf{X}_j\}_j$ and their corresponding labels $\{h(\mathbf{X}_j)\}_j$, the CNN learns an end-to-end mapping.

Motivation and Goals Convolutional neural networks (CNN) lead to remarkable results in many fields

A clear and profound theoretical understanding is still lacking

Sparse representation is a powerful model

It enjoys a vast body of theoretical study supporting its success

Recently, convolutional sparse coding (CSC) has also been analyzed thoroughly

There seems to be a relation between CSC and CNN:

Both have a convolutional structure

Both use a data-driven approach for training the model

The most popular non-linearity employed in CNN, called ReLU, is known to be connected to sparsity (and shrinkage)

Layered Thresholding

Background:

Going Deeper: Multi-Layer CSC Model

Deep Coding Problem:

Deep Learning Problem:

{vardanp,yromano,elad}@{campus,tx,cs}.technion.ac.il

Convolutional Neural Networks Analyzed via Convolutional Sparse Coding

Vardan Papyan* The Computer Science Department

Yaniv Romano* The Electrical Engineering Department

Michael Elad

The Computer Science Department

Technion – Israel Institute of Technology

*Contributed Equally

This research was supported by the European Research Council under EU’s 7th Framework Program, ERC Grant agreement no. 320649.

Uniqueness of $\mathrm{DCP}_\lambda$

Stability of Layered Thresholding Thus far, our analysis relied on the local sparsity of the underlying solution $\boldsymbol\Gamma$, which was enforced through the $\ell_{0,\infty}$ norm, while the noise was treated globally.

In what follows, we present stability guarantees for the forward pass of CNN that also depend on the local energy in the noise vector $\mathbf{E}$.

This is enforced via the $\ell_{2,\infty}$ norm, defined as $\|\mathbf{E}\|_{2,\infty}^p = \max_i \|\mathbf{R}_i\mathbf{E}\|_2$, where $\mathbf{R}_i$ extracts the $i$-th patch.

Limitations of the Forward Pass Even in the noiseless case, it is incapable of recovering the solution of the DCP

Its success depends on the ratio $|\Gamma_i^{\min}| / |\Gamma_i^{\max}|$. This is a direct consequence of the forward pass algorithm relying on the simple thresholding operator.

The distance between the true sparse vector and the estimated one increases exponentially as a function of the layer depth.

What’s Next? Layered Basis Pursuit (BP) Thresholding is the simplest pursuit known in the field of sparsity

What about replacing the thresholding with the basis pursuit?

Stability of Layered Basis Pursuit

Conclusion The relation between CNN and the sparse-land model has been defined via our multi-layer convolutional sparse coding model

We have shown that the forward pass of CNN is in fact a pursuit algorithm that aims to decompose the signals belonging to our model into their building blocks

Leveraging this connection, we were able to attribute to the CNN architecture theoretical claims such as uniqueness of the representations (feature maps) throughout the network and their stable estimation, all guaranteed under local sparsity conditions

Sitting on these theoretical grounds, we have proposed a better pursuit that is shown to be theoretically superior to the forward pass

$\mathrm{DLP}_\lambda:\ \min_{\{\mathbf{D}_i\}_{i=1}^K,\ \mathbf{U}}\ \sum_j \ell\big(h(\mathbf{X}_j),\ \mathbf{U},\ \mathrm{DCP}_\lambda^\star(\mathbf{X}_j,\{\mathbf{D}_i\})\big)$

$\mathrm{DCP}_\lambda$: Find a set of representations satisfying
$\mathbf{X} = \mathbf{D}_1\boldsymbol\Gamma_1, \quad \|\boldsymbol\Gamma_1\|_{0,\infty}^s \le \lambda_1$
$\boldsymbol\Gamma_1 = \mathbf{D}_2\boldsymbol\Gamma_2, \quad \|\boldsymbol\Gamma_2\|_{0,\infty}^s \le \lambda_2$
$\quad\vdots$
$\boldsymbol\Gamma_{K-1} = \mathbf{D}_K\boldsymbol\Gamma_K, \quad \|\boldsymbol\Gamma_K\|_{0,\infty}^s \le \lambda_K$

$\mathrm{DCP}_\lambda^\star$: the deepest representation obtained by solving the DCP

An extension of the classic sparse representation model, suggesting that signals be modeled as hierarchical sparse representations

Thresholding algorithm: $\hat{\boldsymbol\Gamma} = \mathcal{P}_\beta(\mathbf{D}^T\mathbf{X})$

The layered (soft nonnegative) thresholding and the forward pass algorithm are the same!
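To make the thresholding pursuit concrete, here is a minimal numerical sketch (not from the poster): a two-ortho dictionary built from the identity and the DCT, a representation with three active atoms, and a hard-thresholding step $\hat{\boldsymbol\Gamma} = \mathcal{P}_\beta(\mathbf{D}^T\mathbf{X})$ that recovers the correct support because the coherence is low. All names, sizes, and values are illustrative.

```python
import numpy as np

# Two-ortho dictionary: identity atoms plus orthonormal DCT atoms (low mutual coherence).
N = 64
k = np.arange(N)
C = np.sqrt(2.0 / N) * np.cos(np.pi * (k[:, None] + 0.5) * k[None, :] / N)
C[:, 0] = np.sqrt(1.0 / N)
D = np.hstack([np.eye(N), C])

# A sparse representation: three active identity atoms with unit amplitude.
Gamma = np.zeros(2 * N)
Gamma[[5, 20, 40]] = 1.0
X = D @ Gamma

# Thresholding pursuit: keep only the entries of D^T X whose magnitude exceeds beta.
beta = 0.7
proj = D.T @ X
Gamma_hat = proj * (np.abs(proj) > beta)   # hard thresholding P_beta

# The support (and here even the values) of the true representation are recovered.
assert (np.nonzero(Gamma_hat)[0] == np.nonzero(Gamma)[0]).all()
```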

The problem solved by the training stage of CNN and the DLP are equivalent, assuming the DCP is approximated via the layered thresholding algorithm

Theorem: If a set of solutions $\{\boldsymbol\Gamma_i\}_{i=1}^K$ is found for $(\mathrm{DCP}_\lambda)$ such that
$\|\boldsymbol\Gamma_i\|_{0,\infty}^s \le \lambda_i < \frac{1}{2}\left(1 + \frac{1}{\mu(\mathbf{D}_i)}\right)$
then these are necessarily the unique solution to this problem.

The feature maps CNN aims to recover are unique

Background:

[Papyan et al. (’16)]: If a solution $\boldsymbol\Gamma$ is found for the first layer such that
$\|\boldsymbol\Gamma\|_{0,\infty}^s < \frac{1}{2}\left(1 + \frac{1}{\mu(\mathbf{D})}\right)$
then this is necessarily the unique solution.

Mutual Coherence: $\mu(\mathbf{D}) = \max_{i\ne j} |\mathbf{d}_i^T\mathbf{d}_j|$
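The mutual coherence appearing in these bounds is easy to compute directly. A minimal sketch, assuming the dictionary is a dense matrix with atoms as columns (names and sizes are illustrative):

```python
import numpy as np

def mutual_coherence(D):
    """mu(D) = max_{i != j} |d_i^T d_j|, computed after normalizing the atoms (columns)."""
    D = D / np.linalg.norm(D, axis=0, keepdims=True)
    G = np.abs(D.T @ D)          # absolute inner products between all pairs of atoms
    np.fill_diagonal(G, 0.0)     # exclude the diagonal (i == j)
    return G.max()

rng = np.random.default_rng(0)
D = rng.standard_normal((64, 128))
mu = mutual_coherence(D)
# The uniqueness bound of the theorem above: ||Gamma||_{0,inf}^s < (1 + 1/mu) / 2
print(mu, 0.5 * (1.0 + 1.0 / mu))
```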

Handling Noise: $\mathrm{DCP}_\lambda^{\mathcal{E}}$

So far, we have assumed an ideal signal $\mathbf{X}$. However, in practice we usually have $\mathbf{Y} = \mathbf{X} + \mathbf{E}$, where $\mathbf{E}$ is due to noise or model deviations.

$\mathrm{DCP}_\lambda^{\mathcal{E}}$: Find a set of representations satisfying
$\|\mathbf{Y} - \mathbf{D}_1\boldsymbol\Gamma_1\|_2 \le \mathcal{E}_0, \quad \|\boldsymbol\Gamma_1\|_{0,\infty}^s \le \lambda_1$
$\|\boldsymbol\Gamma_1 - \mathbf{D}_2\boldsymbol\Gamma_2\|_2 \le \mathcal{E}_1, \quad \|\boldsymbol\Gamma_2\|_{0,\infty}^s \le \lambda_2$
$\quad\vdots$
$\|\boldsymbol\Gamma_{K-1} - \mathbf{D}_K\boldsymbol\Gamma_K\|_2 \le \mathcal{E}_{K-1}, \quad \|\boldsymbol\Gamma_K\|_{0,\infty}^s \le \lambda_K$

Theorem: If the true representations $\{\boldsymbol\Gamma_i\}_{i=1}^K$ satisfy
$\|\boldsymbol\Gamma_i\|_{0,\infty}^s \le \lambda_i < \frac{1}{2}\left(1 + \frac{1}{\mu(\mathbf{D}_i)}\right)$
and the error thresholds for $(\mathrm{DCP}_\lambda^{\mathcal{E}})$ are
$\mathcal{E}_0^2 = \|\mathbf{E}\|_2^2, \qquad \mathcal{E}_i^2 = \frac{4\,\mathcal{E}_{i-1}^2}{1 - \big(2\|\boldsymbol\Gamma_i\|_{0,\infty}^s - 1\big)\,\mu(\mathbf{D}_i)}$
then the set of solutions $\{\hat{\boldsymbol\Gamma}_i\}_{i=1}^K$ obtained by solving this problem must be close to the true ones:
$\|\hat{\boldsymbol\Gamma}_i - \boldsymbol\Gamma_i\|_2^2 \le \mathcal{E}_i^2$
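The error thresholds in this theorem are defined by a simple per-layer recursion, which can be evaluated numerically. A minimal sketch with made-up sparsities and coherences (the helper name and the numbers are illustrative, not taken from the poster):

```python
def dcp_error_thresholds(noise_energy_sq, sparsities, coherences):
    """Propagate the DCP^E thresholds: E_0^2 = ||E||_2^2 and
    E_i^2 = 4 * E_{i-1}^2 / (1 - (2*s_i - 1) * mu_i)."""
    E_sq = [noise_energy_sq]
    for s_i, mu_i in zip(sparsities, coherences):
        denom = 1.0 - (2 * s_i - 1) * mu_i
        assert denom > 0, "the theorem's sparsity condition must hold"
        E_sq.append(4.0 * E_sq[-1] / denom)
    return E_sq   # E_i^2 bounds ||Gamma_i_hat - Gamma_i||_2^2

# Three layers with illustrative local sparsities s_i and coherences mu_i.
print(dcp_error_thresholds(noise_energy_sq=0.1,
                           sparsities=[3, 2, 2],
                           coherences=[0.05, 0.10, 0.10]))
```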

The problem CNN aims to solve is stable under certain conditions

Background:

[Papyan et al. (’16)]: If the true representation $\boldsymbol\Gamma$ of the first layer satisfies
$\|\boldsymbol\Gamma\|_{0,\infty}^s = k < \frac{1}{2}\left(1 + \frac{1}{\mu(\mathbf{D})}\right)$
then a solution $\hat{\boldsymbol\Gamma}$ of $\mathbf{P}_{0,\infty}^\epsilon$ must be close to it:
$\|\hat{\boldsymbol\Gamma} - \boldsymbol\Gamma\|_2^2 \le \frac{4\epsilon^2}{1 - (2k - 1)\,\mu(\mathbf{D})}$

Theorem: If
$\|\boldsymbol\Gamma_i\|_{0,\infty}^s < \frac{1}{2}\left(1 + \frac{1}{\mu(\mathbf{D}_i)}\cdot\frac{|\Gamma_i^{\min}|}{|\Gamma_i^{\max}|}\right) - \frac{1}{\mu(\mathbf{D}_i)}\cdot\frac{\varepsilon_L^{i-1}}{|\Gamma_i^{\max}|}$
then the layered soft thresholding will*
1. Find the correct supports
2. Satisfy $\|\hat{\boldsymbol\Gamma}_i^{LT} - \boldsymbol\Gamma_i\|_{2,\infty}^p \le \varepsilon_L^i$

We have defined $\varepsilon_L^0 = \|\mathbf{E}\|_{2,\infty}^p$ and
$\varepsilon_L^i = \sqrt{\|\boldsymbol\Gamma_i\|_{0,\infty}^p}\left(\varepsilon_L^{i-1} + \mu(\mathbf{D}_i)\big(\|\boldsymbol\Gamma_i\|_{0,\infty}^s - 1\big)|\Gamma_i^{\max}| + \beta_i\right)$

* For correctly chosen thresholds

In the case of hard thresholding, the threshold $\beta_i$ does not increase the error $\varepsilon_L^i$

$\|\mathbf{E}\|_{2,\infty}^p = \max_i \|\mathbf{R}_i\mathbf{E}\|_2$, where $\mathbf{R}_i$ is the patch extraction operator
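The local noise level $\|\mathbf{E}\|_{2,\infty}^p$ driving these bounds can be computed by sliding a patch window over the noise vector. A minimal sketch for a 1-D signal (patch length and values are illustrative; boundaries are handled by plain non-wrapping windows):

```python
import numpy as np

def l2_inf_patch_norm(e, n):
    """||E||_{2,inf}^p = max_i ||R_i E||_2, where R_i extracts the length-n patch
    starting at position i of a 1-D signal (no wrap-around, for brevity)."""
    patches = np.lib.stride_tricks.sliding_window_view(e, n)
    return np.max(np.linalg.norm(patches, axis=1))

rng = np.random.default_rng(0)
e = 0.1 * rng.standard_normal(256)    # an illustrative noise vector
print(l2_inf_patch_norm(e, n=8))      # the locally-bounded noise level used above
```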

The stability of the forward pass is guaranteed if the underlying representations are locally sparse and the noise is locally bounded

Deconvolutional networks [Zeiler et al. ‘10]

Guarantee for the Success of Layered BP

The layered BP can retrieve the representations in the noiseless case, a task in which the forward pass fails. Its success does not depend on the ratio $|\Gamma_i^{\min}| / |\Gamma_i^{\max}|$.

Theorem: If a set of solutions $\{\boldsymbol\Gamma_i\}_{i=1}^K$ of $(\mathrm{DCP}_\lambda)$ satisfies
$\|\boldsymbol\Gamma_i\|_{0,\infty}^s \le \lambda_i < \frac{1}{2}\left(1 + \frac{1}{\mu(\mathbf{D}_i)}\right)$
then the layered BP is guaranteed to find them.

Background:

[Papyan et al. (’16)]: If a solution $\boldsymbol\Gamma$ of the first layer satisfies
$\|\boldsymbol\Gamma\|_{0,\infty}^s < \frac{1}{2}\left(1 + \frac{1}{\mu(\mathbf{D})}\right)$
then global BP is guaranteed to find it.

Theorem: Assuming that
$\|\boldsymbol\Gamma_i\|_{0,\infty}^s < \frac{1}{3}\left(1 + \frac{1}{\mu(\mathbf{D}_i)}\right)$
then we are guaranteed that*
1. The support of $\hat{\boldsymbol\Gamma}_i^{LBP}$ is contained in that of $\boldsymbol\Gamma_i$
2. $\|\hat{\boldsymbol\Gamma}_i^{LBP} - \boldsymbol\Gamma_i\|_{2,\infty}^p \le \varepsilon_L^i$
3. Every entry in $\boldsymbol\Gamma_i$ greater than $\varepsilon_L^i / \sqrt{\|\boldsymbol\Gamma_i\|_{0,\infty}^p}$ will be found
where $\varepsilon_L^i = 7.5^i\,\|\mathbf{E}\|_{2,\infty}^p \prod_{j=1}^{i}\sqrt{\|\boldsymbol\Gamma_j\|_{0,\infty}^p}$

* For correctly chosen $\{\xi_i\}_{i=1}^K$

Background: [Papyan et al. (’16)]: For the first layer, if $\xi = 4\|\mathbf{E}\|_{2,\infty}^p$ and $\|\boldsymbol\Gamma\|_{0,\infty}^s < \frac{1}{3}\left(1 + \frac{1}{\mu(\mathbf{D})}\right)$, then:
1. The support of $\hat{\boldsymbol\Gamma}_{BP}$ is contained in that of $\boldsymbol\Gamma$
2. $\|\hat{\boldsymbol\Gamma}_{BP} - \boldsymbol\Gamma\|_\infty \le 7.5\,\|\mathbf{E}\|_{2,\infty}^p$
3. Every entry greater than $7.5\,\|\mathbf{E}\|_{2,\infty}^p$ will be found
4. $\hat{\boldsymbol\Gamma}_{BP}$ is unique.

Layered Basis Pursuit (Lagrangian)

$\hat{\boldsymbol\Gamma}_1^{LBP} = \arg\min_{\boldsymbol\Gamma_1}\ \tfrac{1}{2}\|\mathbf{Y} - \mathbf{D}_1\boldsymbol\Gamma_1\|_2^2 + \xi_1\|\boldsymbol\Gamma_1\|_1$
$\hat{\boldsymbol\Gamma}_2^{LBP} = \arg\min_{\boldsymbol\Gamma_2}\ \tfrac{1}{2}\|\hat{\boldsymbol\Gamma}_1^{LBP} - \mathbf{D}_2\boldsymbol\Gamma_2\|_2^2 + \xi_2\|\boldsymbol\Gamma_2\|_1$

Layered Iterative Thresholding

$\boldsymbol\Gamma_1^{t} = \mathcal{S}_{\xi_1/c_1}\!\left(\boldsymbol\Gamma_1^{t-1} + \tfrac{1}{c_1}\mathbf{D}_1^T\big(\mathbf{Y} - \mathbf{D}_1\boldsymbol\Gamma_1^{t-1}\big)\right)$
$\boldsymbol\Gamma_2^{t} = \mathcal{S}_{\xi_2/c_2}\!\left(\boldsymbol\Gamma_2^{t-1} + \tfrac{1}{c_2}\mathbf{D}_2^T\big(\hat{\boldsymbol\Gamma}_1 - \mathbf{D}_2\boldsymbol\Gamma_2^{t-1}\big)\right)$

This can be seen as a recurrent neural network [Gregor & LeCun ‘10]
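A minimal sketch of this layered iterative soft thresholding, with dense random matrices standing in for the convolutional dictionaries and the step sizes $c_i$ set from the spectral norm; all names, sizes, and parameter values are illustrative:

```python
import numpy as np

def soft(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista(D, y, xi, n_iter=200):
    """Minimize 0.5*||y - D G||_2^2 + xi*||G||_1 by iterative soft thresholding."""
    c = np.linalg.norm(D, 2) ** 2            # Lipschitz constant of the smooth part
    G = np.zeros(D.shape[1])
    for _ in range(n_iter):
        G = soft(G + (1.0 / c) * D.T @ (y - D @ G), xi / c)
    return G

def layered_ista(dictionaries, y, xis, n_iter=200):
    """Layered BP (Lagrangian form): solve a Lasso per layer, feeding each estimate onward."""
    estimates, signal = [], y
    for D_i, xi_i in zip(dictionaries, xis):
        G_i = ista(D_i, signal, xi_i, n_iter)
        estimates.append(G_i)
        signal = G_i                          # the next layer decomposes this representation
    return estimates

rng = np.random.default_rng(0)
D1 = rng.standard_normal((64, 128))
D2 = rng.standard_normal((128, 256))
y = rng.standard_normal(64)
G1_hat, G2_hat = layered_ista([D1, D2], y, xis=[0.1, 0.1])
```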

Convolutional Sparse Coding

The global convolutional model: $\mathbf{X} = \mathbf{D}\boldsymbol\Gamma$. The $i$-th patch of the signal satisfies $\mathbf{R}_i\mathbf{X} = \boldsymbol\Omega\boldsymbol\gamma_i$, where $\boldsymbol\Omega$ is the stripe-dictionary of size $n\times(2n-1)m$ and $\boldsymbol\gamma_i$ is the corresponding stripe-vector.

$(\mathbf{P}_{0,\infty}):\ \min_{\boldsymbol\Gamma} \|\boldsymbol\Gamma\|_{0,\infty}^s\ \ \text{s.t.}\ \ \mathbf{X} = \mathbf{D}\boldsymbol\Gamma, \qquad \|\boldsymbol\Gamma\|_{0,\infty}^s \triangleq \max_i \|\boldsymbol\gamma_i\|_0$
$(\mathbf{P}_{0,\infty}^\epsilon):\ \min_{\boldsymbol\Gamma} \|\boldsymbol\Gamma\|_{0,\infty}^s\ \ \text{s.t.}\ \ \|\mathbf{Y} - \mathbf{D}\boldsymbol\Gamma\|_2 \le \epsilon$
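The $\ell_{0,\infty}$ norm used throughout is a local count of nonzeros over stripes. A minimal sketch, assuming $\boldsymbol\Gamma$ is stored as $N$ spatial positions with $m$ coefficients each and that the $i$-th stripe covers the $2n-1$ positions around $i$ (boundaries are simply clipped; names and sizes are illustrative):

```python
import numpy as np

def l0_inf_stripe_norm(Gamma, N, m, n):
    """||Gamma||_{0,inf}^s = max_i ||gamma_i||_0, counting nonzeros in each stripe
    of (2n-1) spatial positions (m coefficients per position)."""
    G = Gamma.reshape(N, m)
    counts = []
    for i in range(N):
        lo, hi = max(0, i - n + 1), min(N, i + n)
        counts.append(np.count_nonzero(G[lo:hi]))
    return max(counts)

rng = np.random.default_rng(0)
N, m, n = 64, 4, 5
Gamma = rng.standard_normal(N * m) * (rng.random(N * m) < 0.05)   # a sparse vector
print(l0_inf_stripe_norm(Gamma, N, m, n))
```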

A global sparse vector is likely if it can represent every patch in the signal sparsely [Papyan et al. (’16)]

We link deep learning and sparse representations, providing mathematical grounds for analyzing CNN theoretically. In addition, we propose new insights and architectures, with clear theoretical foundations.

References

Vardan Papyan, Jeremias Sulam, and Michael Elad, Working locally thinking globally - Part I: Theoretical guarantees for convolutional sparse coding, submitted to IEEE-TSP, 2016

Vardan Papyan, Jeremias Sulam, and Michael Elad, Working locally thinking globally - Part II: Stability and algorithms for convolutional sparse coding, submitted to IEEE-TSP, 2016

Matthew D. Zeiler, Dilip Krishnan, Graham W. Taylor, and Rob Fergus, Deconvolutional networks, CVPR, 2010

Karol Gregor and Yann LeCun, Learning fast approximations of sparse coding, ICML, 2010

Michael Elad, Sparse and redundant representations: From theory to applications in signal and image processing, Springer, 2010

$\hat{\boldsymbol\Gamma}_1^{LBP} = \arg\min_{\boldsymbol\Gamma_1}\ \|\boldsymbol\Gamma_1\|_1\ \ \text{s.t.}\ \ \mathbf{X} = \mathbf{D}_1\boldsymbol\Gamma_1$
$\hat{\boldsymbol\Gamma}_2^{LBP} = \arg\min_{\boldsymbol\Gamma_2}\ \|\boldsymbol\Gamma_2\|_1\ \ \text{s.t.}\ \ \hat{\boldsymbol\Gamma}_1^{LBP} = \mathbf{D}_2\boldsymbol\Gamma_2$

The forward pass of a two-layer CNN (without pooling):
$f(\mathbf{X}, \{\mathbf{W}_i\}, \{\mathbf{b}_i\}) = \mathrm{ReLU}\big(\mathbf{b}_2 + \mathbf{W}_2^T\,\mathrm{ReLU}(\mathbf{b}_1 + \mathbf{W}_1^T\mathbf{X})\big)$
Here $\mathbf{X}\in\mathbb{R}^N$ is the input signal, $\mathbf{W}_1^T\in\mathbb{R}^{Nm_1\times N}$ applies $m_1$ filters of length $n_0$, $\mathbf{W}_2^T\in\mathbb{R}^{Nm_2\times Nm_1}$ applies $m_2$ filters of length $n_1 m_1$, $\mathbf{b}_1\in\mathbb{R}^{Nm_1}$ and $\mathbf{b}_2\in\mathbb{R}^{Nm_2}$ are the biases, and $\mathbf{Z}_2\in\mathbb{R}^{Nm_2}$ is the output of the second layer.

Estimate $\boldsymbol\Gamma_1$ via the thresholding, then estimate $\boldsymbol\Gamma_2$ via the thresholding:
$\hat{\boldsymbol\Gamma}_2 = \mathcal{P}_{\beta_2}\big(\mathbf{D}_2^T\,\mathcal{P}_{\beta_1}(\mathbf{D}_1^T\mathbf{X})\big)$
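A minimal sketch checking this equivalence numerically: when $\mathbf{W}_i = \mathbf{D}_i$ and $\mathbf{b}_i = -\beta_i$, the layered nonnegative soft thresholding and the two-layer forward pass produce the same output. Dense matrices stand in for the convolutional dictionaries, and all sizes and thresholds are illustrative:

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)
P = lambda z, beta: np.maximum(z - beta, 0.0)     # nonnegative soft thresholding

rng = np.random.default_rng(0)
N, m1, m2 = 64, 2, 3
D1 = rng.standard_normal((N, N * m1))             # layer-1 dictionary (dense stand-in)
D2 = rng.standard_normal((N * m1, N * m2))        # layer-2 dictionary
X = rng.standard_normal(N)
beta1, beta2 = 0.3, 0.2

# Layered thresholding pursuit of the ML-CSC representations.
Gamma2_hat = P(D2.T @ P(D1.T @ X, beta1), beta2)

# CNN forward pass with W_i = D_i and b_i = -beta_i.
Z2 = relu(-beta2 + D2.T @ relu(-beta1 + D1.T @ X))

assert np.allclose(Gamma2_hat, Z2)                # the two computations coincide
```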

$\min_{\{\mathbf{D}_i\}_{i=1}^K,\ \mathbf{U}}\ \sum_j \ell\big(h(\mathbf{X}_j),\ \mathbf{U},\ \mathrm{DCP}^\star(\mathbf{X}_j,\{\mathbf{D}_i\})\big)$, with $\mathrm{DCP}^\star$ computed via the layered thresholding, versus the CNN training objective
$\min_{\{\mathbf{W}_i\},\{\mathbf{b}_i\},\ \mathbf{U}}\ \sum_j \ell\big(h(\mathbf{X}_j),\ \mathbf{U},\ f(\mathbf{X}_j,\{\mathbf{W}_i\},\{\mathbf{b}_i\})\big)$
where $f(\cdot)$ is the output of the last layer, $\mathbf{U}$ is the classifier, and $h(\mathbf{X}_j)$ is the true label.

The multi-layer CSC model (two layers): the signal $\mathbf{X}\in\mathbb{R}^N$ satisfies $\mathbf{X} = \mathbf{D}_1\boldsymbol\Gamma_1$ with a convolutional dictionary $\mathbf{D}_1\in\mathbb{R}^{N\times Nm_1}$ (filters of length $n_0$) and a sparse $\boldsymbol\Gamma_1\in\mathbb{R}^{Nm_1}$; in turn, $\boldsymbol\Gamma_1 = \mathbf{D}_2\boldsymbol\Gamma_2$ with $\mathbf{D}_2\in\mathbb{R}^{Nm_1\times Nm_2}$ (filters of length $n_1 m_1$) and a sparse $\boldsymbol\Gamma_2\in\mathbb{R}^{Nm_2}$.
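A minimal sketch of drawing a signal from this two-layer model: a sparse $\boldsymbol\Gamma_2$ is chosen, $\boldsymbol\Gamma_1 = \mathbf{D}_2\boldsymbol\Gamma_2$ follows, and $\mathbf{X} = \mathbf{D}_1\boldsymbol\Gamma_1$ is the observed signal. Dense random matrices stand in for the convolutional dictionaries, and $\mathbf{D}_2$ is drawn with sparse columns so that $\boldsymbol\Gamma_1$ comes out sparse, as the model requires; all sizes and sparsity levels are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
N, m1, m2 = 64, 2, 3

# Dense random stand-ins for the convolutional dictionaries.
D1 = rng.standard_normal((N, N * m1))
D2 = rng.standard_normal((N * m1, N * m2)) * (rng.random((N * m1, N * m2)) < 0.03)

# Deepest (sparsest) representation.
Gamma2 = rng.standard_normal(N * m2) * (rng.random(N * m2) < 0.02)

# Intermediate representation and the observed signal.
Gamma1 = D2 @ Gamma2     # sparse by construction, and itself a "signal" for layer 2
X = D1 @ Gamma1

print(np.count_nonzero(Gamma2), np.count_nonzero(Gamma1), X.shape)
```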

Convolutional sparsity assumes an inherent structure is present in natural signals.

Similarly, the representations themselves could also be assumed to have such a structure.