Convolutional Neural Networks Analyzed via Convolutional Sparse Coding
Vardan Papyan*, The Computer Science Department
Yaniv Romano*, The Electrical Engineering Department
Michael Elad, The Computer Science Department
Technion – Israel Institute of Technology
{vardanp,yromano,elad}@{campus,tx,cs}.technion.ac.il
*Contributed equally
This research was supported by the European Research Council under the EU's 7th Framework Program, ERC Grant agreement no. 320649.
Convolutional Neural Networks
The forward pass algorithm (without pooling). Mathematically:
$f(\mathbf{X}, \{\mathbf{W}_i\}, \{\mathbf{b}_i\}) = \mathrm{ReLU}\big(\mathbf{b}_2 + \mathbf{W}_2^T\, \mathrm{ReLU}(\mathbf{b}_1 + \mathbf{W}_1^T \mathbf{X})\big)$
Training stage: consider the task of classification, for example. Given a set of signals $\{\mathbf{X}_j\}_j$ and their corresponding labels $\{h(\mathbf{X}_j)\}_j$, the CNN learns an end-to-end mapping:
$\min_{\{\mathbf{W}_i\},\{\mathbf{b}_i\},\mathbf{U}} \sum_j \ell\big(h(\mathbf{X}_j),\, \mathbf{U},\, f(\mathbf{X}_j, \{\mathbf{W}_i\}, \{\mathbf{b}_i\})\big)$
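As a reference point, here is a minimal NumPy sketch of this two-layer forward pass; dense matrices stand in for the convolutional operators, and all names and sizes are illustrative assumptions rather than the poster's exact construction.

```python
import numpy as np

def relu(v):
    # Elementwise ReLU non-linearity.
    return np.maximum(v, 0.0)

def forward_pass(X, W1, b1, W2, b2):
    # f(X, {W_i}, {b_i}) = ReLU(b2 + W2^T ReLU(b1 + W1^T X))
    Z1 = relu(b1 + W1.T @ X)
    return relu(b2 + W2.T @ Z1)
```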
Motivation and Goals
Convolutional neural networks (CNN) lead to remarkable results in many fields, yet a clear and profound theoretical understanding of them is still lacking.
Sparse representation is a powerful model that enjoys a vast body of theoretical study supporting its success. Recently, convolutional sparse coding (CSC) has also been analyzed thoroughly.
There seems to be a relation between CSC and CNN:
Both have a convolutional structure.
Both use a data-driven approach for training the model.
The most popular non-linearity employed in CNN, the ReLU, is known to be connected to sparsity (and shrinkage).
Stability of Layered Thresholding
Thus far, our analysis relied on the local sparsity of the underlying solution $\mathbf{\Gamma}$, which was enforced through the $\ell_{0,\infty}$ norm, while the noise was treated globally. In what follows, we present stability guarantees for the forward pass of CNN that also depend on the local energy in the noise vector $\mathbf{E}$. This is enforced via the $\ell_{2,\infty}$ norm, defined as
$\|\mathbf{E}\|_{2,\infty}^p = \max_i \|\mathbf{R}_i \mathbf{E}\|_2,$
where $\mathbf{R}_i$ is the operator extracting the $i$-th patch.
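A small sketch of how this norm can be computed for a 1-D signal; the fully overlapping, circularly extracted patches are an assumed convention for illustration.

```python
import numpy as np

def l2_inf_patches(E, n):
    # ||E||_{2,inf}^p = max_i ||R_i E||_2, where R_i extracts
    # the i-th length-n patch (here: every circular shift).
    N = len(E)
    patches = np.stack([E[(i + np.arange(n)) % N] for i in range(N)])
    return np.linalg.norm(patches, axis=1).max()
```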
Limitations of the Forward Pass
Even in the noiseless case, it is incapable of recovering the solution of the DCP.
Its success depends on the ratio $|\Gamma_i^{\min}| / |\Gamma_i^{\max}|$; this is a direct consequence of the forward pass relying on the simple thresholding operator.
The distance between the true sparse vector and the estimated one increases exponentially as a function of the layer depth.
What’s Next? Layered Basis Pursuit (BP)
Thresholding is the simplest pursuit known in the field of sparsity. What about replacing the thresholding with the basis pursuit?
$\hat{\mathbf{\Gamma}}_1^{\mathrm{LBP}} = \arg\min_{\mathbf{\Gamma}_1} \|\mathbf{\Gamma}_1\|_1 \;\;\text{s.t.}\;\; \mathbf{X} = \mathbf{D}_1 \mathbf{\Gamma}_1$
$\hat{\mathbf{\Gamma}}_2^{\mathrm{LBP}} = \arg\min_{\mathbf{\Gamma}_2} \|\mathbf{\Gamma}_2\|_1 \;\;\text{s.t.}\;\; \hat{\mathbf{\Gamma}}_1^{\mathrm{LBP}} = \mathbf{D}_2 \mathbf{\Gamma}_2$
Conclusion
The relation between CNN and the sparse-land model has been defined via our multi-layer convolutional sparse coding (ML-CSC) model.
We have shown that the forward pass of CNN is in fact a pursuit algorithm that aims to decompose the signals belonging to our model into their building blocks.
Leveraging this connection, we were able to attribute to the CNN architecture theoretical claims such as the uniqueness of the representations (feature maps) throughout the network and their stable estimation, all guaranteed under local sparsity conditions.
Standing on these theoretical grounds, we have proposed a better pursuit, shown to be theoretically superior to the forward pass.
Going Deeper: Multi-Layer CSC Model
An extension of the classic sparse representation, suggesting to model signals as hierarchical sparse representations.
Deep Coding Problem:
$\mathrm{DCP}_\lambda$: Find a set of representations satisfying
$\mathbf{X} = \mathbf{D}_1 \mathbf{\Gamma}_1, \quad \|\mathbf{\Gamma}_1\|_{0,\infty}^s \le \lambda_1$
$\mathbf{\Gamma}_1 = \mathbf{D}_2 \mathbf{\Gamma}_2, \quad \|\mathbf{\Gamma}_2\|_{0,\infty}^s \le \lambda_2$
$\quad\vdots$
$\mathbf{\Gamma}_{K-1} = \mathbf{D}_K \mathbf{\Gamma}_K, \quad \|\mathbf{\Gamma}_K\|_{0,\infty}^s \le \lambda_K$
$\mathrm{DCP}^\star$: the deepest representation, obtained by solving the DCP.
Deep Learning Problem:
$\mathrm{DLP}_\lambda: \;\min_{\{\mathbf{D}_i\}_{i=1}^K,\, \mathbf{U}} \sum_j \ell\big(h(\mathbf{X}_j),\, \mathbf{U},\, \mathrm{DCP}^\star(\mathbf{X}_j, \{\mathbf{D}_i\})\big)$
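To make the generative view concrete, a minimal sketch that samples a signal from a two-layer version of the model; the dense random matrices, sizes, and sparsity level are illustrative assumptions (actual ML-CSC dictionaries are convolutional and structured).

```python
import numpy as np

rng = np.random.default_rng(1)
N, m1, m2, k = 64, 32, 16, 3          # toy sizes; k non-zeros in Gamma_2

D1 = rng.standard_normal((N, m1))     # stand-in for D1
D2 = rng.standard_normal((m1, m2))    # stand-in for D2

# Sample a sparse deepest representation Gamma_2 ...
Gamma2 = np.zeros(m2)
support = rng.choice(m2, size=k, replace=False)
Gamma2[support] = rng.standard_normal(k)

# ... and propagate it up: Gamma_1 = D2 Gamma_2, X = D1 Gamma_1.
# (With a random dense D2, Gamma_1 is generally not sparse; in the
# ML-CSC model the dictionaries are structured so that it is.)
Gamma1 = D2 @ Gamma2
X = D1 @ Gamma1
```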
Layered Thresholding
The thresholding algorithm: $\hat{\mathbf{\Gamma}} = \mathcal{P}_\beta(\mathbf{D}^T \mathbf{X})$, applied layer after layer.
The layered (soft non-negative) thresholding and the forward pass algorithm are one and the same!
The problem solved by the training stage of CNN and the DLP are equal, assuming that the DCP is approximated via the layered thresholding algorithm.
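The claimed equivalence is easy to check numerically: soft non-negative thresholding with threshold beta equals a ReLU with bias -beta. A minimal sketch with toy dense dictionaries (all sizes and threshold values are assumptions for illustration):

```python
import numpy as np

def soft_nonneg(v, beta):
    # Soft non-negative thresholding: max(v - beta, 0).
    return np.maximum(v - beta, 0.0)

rng = np.random.default_rng(0)
N, m1, m2 = 64, 32, 16
D1 = rng.standard_normal((N, m1))
D2 = rng.standard_normal((m1, m2))
x = rng.standard_normal(N)
beta1, beta2 = 0.5, 0.5

# Layered thresholding: P_{beta2}(D2^T P_{beta1}(D1^T x)).
layered = soft_nonneg(D2.T @ soft_nonneg(D1.T @ x, beta1), beta2)

# Forward pass with W_i = D_i and b_i = -beta_i.
relu = lambda v: np.maximum(v, 0.0)
forward = relu(-beta2 + D2.T @ relu(-beta1 + D1.T @ x))

assert np.allclose(layered, forward)  # identical by construction
```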
Uniqueness of $\mathrm{DCP}_\lambda$
Theorem: If a set of solutions $\{\mathbf{\Gamma}_i\}_{i=1}^K$ is found for $(\mathrm{DCP}_\lambda)$ such that
$\|\mathbf{\Gamma}_i\|_{0,\infty}^s \le \lambda_i < \frac{1}{2}\left(1 + \frac{1}{\mu(\mathbf{D}_i)}\right),$
then these are necessarily the unique solution to this problem.
The feature maps CNN aims to recover are unique.
Background [Papyan et al. ('16)]: If a solution $\mathbf{\Gamma}$ is found for the first layer such that
$\|\mathbf{\Gamma}\|_{0,\infty}^s < \frac{1}{2}\left(1 + \frac{1}{\mu(\mathbf{D})}\right),$
then this is necessarily the unique solution.
Mutual coherence: $\mu(\mathbf{D}) = \max_{i \ne j} |\mathbf{d}_i^T \mathbf{d}_j|$
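For concreteness, a short sketch of the mutual coherence computation; the column normalization is included so the definition applies even if the atoms are not pre-normalized.

```python
import numpy as np

def mutual_coherence(D):
    # mu(D) = max_{i != j} |d_i^T d_j| over normalized atoms (columns).
    Dn = D / np.linalg.norm(D, axis=0, keepdims=True)
    G = np.abs(Dn.T @ Dn)        # Gram matrix of the normalized atoms
    np.fill_diagonal(G, 0.0)     # exclude the diagonal (i == j)
    return G.max()
```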
Handling Noise: $\mathrm{DCP}_\lambda^{\mathcal{E}}$
So far, we have assumed an ideal signal $\mathbf{X}$. However, in practice we usually have $\mathbf{Y} = \mathbf{X} + \mathbf{E}$, where $\mathbf{E}$ is due to noise or model deviations.
$\mathrm{DCP}_\lambda^{\mathcal{E}}$: Find a set of representations satisfying
$\|\mathbf{Y} - \mathbf{D}_1 \mathbf{\Gamma}_1\|_2 \le \mathcal{E}_0, \quad \|\mathbf{\Gamma}_1\|_{0,\infty}^s \le \lambda_1$
$\|\mathbf{\Gamma}_1 - \mathbf{D}_2 \mathbf{\Gamma}_2\|_2 \le \mathcal{E}_1, \quad \|\mathbf{\Gamma}_2\|_{0,\infty}^s \le \lambda_2$
$\quad\vdots$
$\|\mathbf{\Gamma}_{K-1} - \mathbf{D}_K \mathbf{\Gamma}_K\|_2 \le \mathcal{E}_{K-1}, \quad \|\mathbf{\Gamma}_K\|_{0,\infty}^s \le \lambda_K$
Theorem: If the true representations $\{\mathbf{\Gamma}_i\}_{i=1}^K$ satisfy
$\|\mathbf{\Gamma}_i\|_{0,\infty}^s \le \lambda_i < \frac{1}{2}\left(1 + \frac{1}{\mu(\mathbf{D}_i)}\right),$
and the error thresholds for $(\mathrm{DCP}_\lambda^{\mathcal{E}})$ are
$\mathcal{E}_0^2 = \|\mathbf{E}\|_2^2, \qquad \mathcal{E}_i^2 = \frac{4\,\mathcal{E}_{i-1}^2}{1 - \big(2\|\mathbf{\Gamma}_i\|_{0,\infty}^s - 1\big)\,\mu(\mathbf{D}_i)},$
then the set of solutions $\{\hat{\mathbf{\Gamma}}_i\}_{i=1}^K$ obtained by solving this problem must be close to the true ones:
$\|\hat{\mathbf{\Gamma}}_i - \mathbf{\Gamma}_i\|_2^2 \le \mathcal{E}_i^2.$
The problem CNN aims to solve is stable under certain conditions.
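To see how the thresholds propagate, here is a tiny evaluation of the recursion; the coherence and sparsity values are illustrative assumptions, not numbers from the poster.

```python
# E_i^2 = 4 * E_{i-1}^2 / (1 - (2*s_i - 1) * mu_i), with E_0^2 = ||E||_2^2.
mu = [0.05, 0.05, 0.05]   # assumed mutual coherences mu(D_i)
s  = [5, 5, 5]            # assumed ||Gamma_i||_{0,inf}^s
E2 = [1.0]                # ||E||_2^2 normalized to 1
for mu_i, s_i in zip(mu, s):
    denom = 1.0 - (2 * s_i - 1) * mu_i
    assert denom > 0      # the sparsity condition must hold
    E2.append(4.0 * E2[-1] / denom)
print(E2)                 # the bound grows roughly 4x per layer here
```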
Background [Papyan et al. ('16)]: If the true representation $\mathbf{\Gamma}$ of the first layer satisfies
$\|\mathbf{\Gamma}\|_{0,\infty}^s = k < \frac{1}{2}\left(1 + \frac{1}{\mu(\mathbf{D})}\right),$
then a solution $\hat{\mathbf{\Gamma}}$ of $\mathbf{P}_{0,\infty}^\epsilon$ must be close to it:
$\|\hat{\mathbf{\Gamma}} - \mathbf{\Gamma}\|_2^2 \le \frac{4\epsilon^2}{1 - (2k - 1)\,\mu(\mathbf{D})}.$
Theorem (stability of the layered thresholding): If
$\|\mathbf{\Gamma}_i\|_{0,\infty}^s < \frac{1}{2}\left(1 + \frac{1}{\mu(\mathbf{D}_i)} \cdot \frac{|\Gamma_i^{\min}|}{|\Gamma_i^{\max}|}\right) - \frac{1}{\mu(\mathbf{D}_i)} \cdot \frac{\varepsilon_L^{i-1}}{|\Gamma_i^{\max}|},$
then the layered soft thresholding will*:
1. Find the correct supports;
2. Satisfy $\|\hat{\mathbf{\Gamma}}_i^{\mathrm{LT}} - \mathbf{\Gamma}_i\|_{2,\infty}^p \le \varepsilon_L^i$.
Here we have defined $\varepsilon_L^0 = \|\mathbf{E}\|_{2,\infty}^p$ and
$\varepsilon_L^i = \sqrt{\|\mathbf{\Gamma}_i\|_{0,\infty}^p} \cdot \left(\varepsilon_L^{i-1} + \mu(\mathbf{D}_i)\big(\|\mathbf{\Gamma}_i\|_{0,\infty}^s - 1\big)\,|\Gamma_i^{\max}| + \beta_i\right).$
* For correctly chosen thresholds.
In the case of hard thresholding, the threshold $\beta_i$ does not increase the error $\varepsilon_L^i$.
Recall $\|\mathbf{E}\|_{2,\infty}^p = \max_i \|\mathbf{R}_i \mathbf{E}\|_2$, with $\mathbf{R}_i$ the patch extraction operator.
The stability of the forward pass is guaranteed if the underlying representations are locally sparse and the noise is locally bounded.
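A tiny numeric evaluation of this recursion illustrates the exponential growth with depth noted above; all constants are assumed for illustration.

```python
import numpy as np

# eps_L^i = sqrt(s_p) * (eps_L^{i-1} + mu*(s - 1)*|Gamma_max| + beta)
mu, s, s_p, g_max, beta = 0.02, 4, 8, 1.0, 0.1   # assumed constants
eps = 0.5                                        # eps_L^0 = ||E||_{2,inf}^p
for i in range(1, 4):
    eps = np.sqrt(s_p) * (eps + mu * (s - 1) * g_max + beta)
    print(f"layer {i}: eps_L = {eps:.3f}")       # grows with depth
```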
See also: deconvolutional networks [Zeiler et al. '10].
Guarantee for the Success of Layered BP
The layered BP can retrieve the representations in the noiseless case, a task at which the forward pass fails; its success does not depend on the ratio $|\Gamma_i^{\min}| / |\Gamma_i^{\max}|$.
Theorem: If a set of solutions $\{\mathbf{\Gamma}_i\}_{i=1}^K$ of $(\mathrm{DCP}_\lambda)$ satisfies
$\|\mathbf{\Gamma}_i\|_{0,\infty}^s \le \lambda_i < \frac{1}{2}\left(1 + \frac{1}{\mu(\mathbf{D}_i)}\right),$
then the layered BP is guaranteed to find them.
Background [Papyan et al. ('16)]: If a solution $\mathbf{\Gamma}$ of the first layer satisfies
$\|\mathbf{\Gamma}\|_{0,\infty}^s < \frac{1}{2}\left(1 + \frac{1}{\mu(\mathbf{D})}\right),$
then global BP is guaranteed to find it.
Stability of Layered Basis Pursuit
Theorem: Assuming that
$\|\mathbf{\Gamma}_i\|_{0,\infty}^s < \frac{1}{3}\left(1 + \frac{1}{\mu(\mathbf{D}_i)}\right),$
then we are guaranteed that*:
1. The support of $\hat{\mathbf{\Gamma}}_i^{\mathrm{LBP}}$ is contained in that of $\mathbf{\Gamma}_i$;
2. $\|\hat{\mathbf{\Gamma}}_i^{\mathrm{LBP}} - \mathbf{\Gamma}_i\|_{2,\infty} \le \varepsilon_L^i$;
3. Every entry in $\mathbf{\Gamma}_i$ greater than $\varepsilon_L^i / \sqrt{\|\mathbf{\Gamma}_i\|_{0,\infty}^p}$ will be found,
where $\varepsilon_L^i = 7.5^i\, \|\mathbf{E}\|_{2,\infty} \prod_{j=1}^{i} \sqrt{\|\mathbf{\Gamma}_j\|_{0,\infty}^p}$.
* For correctly chosen $\{\xi_i\}_{i=1}^K$.
Background [Papyan et al. ('16)]: For the first layer, if $\xi = 4\|\mathbf{E}\|_{2,\infty}^p$ and $\|\mathbf{\Gamma}\|_{0,\infty}^s < \frac{1}{3}\left(1 + \frac{1}{\mu(\mathbf{D})}\right)$, then:
1. The support of $\hat{\mathbf{\Gamma}}^{\mathrm{BP}}$ is contained in that of $\mathbf{\Gamma}$;
2. $\|\hat{\mathbf{\Gamma}}^{\mathrm{BP}} - \mathbf{\Gamma}\|_\infty \le 7.5\, \|\mathbf{E}\|_{2,\infty}^p$;
3. Every entry greater than $7.5\, \|\mathbf{E}\|_{2,\infty}^p$ will be found;
4. $\hat{\mathbf{\Gamma}}^{\mathrm{BP}}$ is unique.
Layered Basis Pursuit (Lagrangian):
$\hat{\mathbf{\Gamma}}_1^{\mathrm{LBP}} = \arg\min_{\mathbf{\Gamma}_1} \frac{1}{2}\|\mathbf{Y} - \mathbf{D}_1 \mathbf{\Gamma}_1\|_2^2 + \xi_1 \|\mathbf{\Gamma}_1\|_1$
$\hat{\mathbf{\Gamma}}_2^{\mathrm{LBP}} = \arg\min_{\mathbf{\Gamma}_2} \frac{1}{2}\|\hat{\mathbf{\Gamma}}_1^{\mathrm{LBP}} - \mathbf{D}_2 \mathbf{\Gamma}_2\|_2^2 + \xi_2 \|\mathbf{\Gamma}_2\|_1$
Layered Iterative Thresholding:
$\mathbf{\Gamma}_1^t = \mathcal{S}_{\xi_1/c_1}\!\left(\mathbf{\Gamma}_1^{t-1} + \frac{1}{c_1}\mathbf{D}_1^T\big(\mathbf{Y} - \mathbf{D}_1 \mathbf{\Gamma}_1^{t-1}\big)\right)$
$\mathbf{\Gamma}_2^t = \mathcal{S}_{\xi_2/c_2}\!\left(\mathbf{\Gamma}_2^{t-1} + \frac{1}{c_2}\mathbf{D}_2^T\big(\hat{\mathbf{\Gamma}}_1 - \mathbf{D}_2 \mathbf{\Gamma}_2^{t-1}\big)\right)$
Can be seen as a recurrent neural network [Gregor & LeCun '10].
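A compact sketch of the layered pursuit via ISTA; dense matrices again stand in for the convolutional dictionaries, and the penalties xi_i are assumed to be given.

```python
import numpy as np

def soft_threshold(v, t):
    # S_t(v) = sign(v) * max(|v| - t, 0).
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(D, y, xi, n_iters=200):
    # ISTA for min_G 0.5*||y - D G||_2^2 + xi*||G||_1,
    # with step size 1/c, c >= ||D||_2^2 (squared spectral norm).
    c = np.linalg.norm(D, 2) ** 2
    G = np.zeros(D.shape[1])
    for _ in range(n_iters):
        G = soft_threshold(G + D.T @ (y - D @ G) / c, xi / c)
    return G

def layered_bp(dictionaries, y, xis):
    # Layered BP: solve a Lagrangian BP per layer, feeding each
    # layer's estimate in as the "signal" of the next layer.
    gammas, signal = [], y
    for D, xi in zip(dictionaries, xis):
        signal = ista(D, signal, xi)
        gammas.append(signal)
    return gammas
```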
Convolutional Sparse Coding
$\mathbf{X} = \mathbf{D} \mathbf{\Gamma}$, where $\mathbf{D}$ is a convolutional (banded) dictionary built of $m$ local filters of length $n$.
The $i$-th patch of the signal is $\mathbf{R}_i \mathbf{X} = \mathbf{\Omega} \boldsymbol{\gamma}_i$, where $\mathbf{\Omega} \in \mathbb{R}^{n \times (2n-1)m}$ is the stripe dictionary and $\boldsymbol{\gamma}_i$ is the corresponding stripe vector.
$\|\mathbf{\Gamma}\|_{0,\infty}^s \triangleq \max_i \|\boldsymbol{\gamma}_i\|_0$
$\mathbf{P}_{0,\infty}: \;\min_{\mathbf{\Gamma}} \|\mathbf{\Gamma}\|_{0,\infty}^s \;\;\text{s.t.}\;\; \mathbf{X} = \mathbf{D}\mathbf{\Gamma}$
$\mathbf{P}_{0,\infty}^\epsilon: \;\min_{\mathbf{\Gamma}} \|\mathbf{\Gamma}\|_{0,\infty}^s \;\;\text{s.t.}\;\; \|\mathbf{Y} - \mathbf{D}\mathbf{\Gamma}\|_2 \le \epsilon$
A global sparse vector is likely if it can represent every patch in the signal sparsely. [Papyan et al. ('16)]
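As an illustration of these definitions, a sketch that builds a 1-D convolutional dictionary and evaluates the l_{0,inf} norm over stripes; the circular boundary handling is an assumed convention.

```python
import numpy as np

def conv_dictionary(filters, N):
    # Build D in R^{N x N*m} from m filters of length n:
    # each column is one filter, circularly shifted.
    n, m = filters.shape
    D = np.zeros((N, N * m))
    for shift in range(N):
        for j in range(m):
            rows = (shift + np.arange(n)) % N
            D[rows, shift * m + j] = filters[:, j]
    return D

def l0_inf(gamma, N, m, n):
    # ||Gamma||_{0,inf}^s = max_i ||gamma_i||_0, where the stripe
    # gamma_i holds the (2n-1)*m coefficients whose atoms can
    # overlap the i-th patch.
    coeffs = gamma.reshape(N, m)
    counts = []
    for i in range(N):
        idx = np.arange(i - n + 1, i + n) % N   # 2n-1 shifts
        counts.append(np.count_nonzero(coeffs[idx]))
    return max(counts)
```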
We link deep learning with sparse representations, providing mathematical grounds for analyzing CNN theoretically. In addition, we propose new insights and architectures with clear theoretical foundations.
References
Vardan Papyan, Jeremias Sulam, and Michael Elad, "Working locally thinking globally - Part I: Theoretical guarantees for convolutional sparse coding," submitted to IEEE-TSP, 2016.
Vardan Papyan, Jeremias Sulam, and Michael Elad, "Working locally thinking globally - Part II: Stability and algorithms for convolutional sparse coding," submitted to IEEE-TSP, 2016.
Matthew D. Zeiler, Dilip Krishnan, Graham W. Taylor, and Rob Fergus, "Deconvolutional networks," CVPR, 2010.
Karol Gregor and Yann LeCun, "Learning fast approximations of sparse coding," ICML, 2010.
Michael Elad, Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing, Springer, 2010.
[Figure: the forward pass of a two-layer CNN. The signal $\mathbf{X} \in \mathbb{R}^N$ is multiplied by $\mathbf{W}_1^T \in \mathbb{R}^{Nm_1 \times N}$ ($m_1$ local filters of length $n_0$), shifted by the bias $\mathbf{b}_1 \in \mathbb{R}^{Nm_1}$ and passed through a ReLU; the result is multiplied by $\mathbf{W}_2^T \in \mathbb{R}^{Nm_2 \times Nm_1}$ ($m_2$ filters of size $n_1 m_1$), shifted by $\mathbf{b}_2 \in \mathbb{R}^{Nm_2}$ and passed through a second ReLU, giving $\mathbf{Z}_2 \in \mathbb{R}^{Nm_2}$.]
Estimate $\mathbf{\Gamma}_1$ via the thresholding, then estimate $\mathbf{\Gamma}_2$ via the thresholding:
$\hat{\mathbf{\Gamma}}_2 = \mathcal{P}_{\beta_2}\!\big(\mathbf{D}_2^T\, \mathcal{P}_{\beta_1}(\mathbf{D}_1^T \mathbf{X})\big)$
Training via the layered thresholding: the DLP objective
$\min_{\{\mathbf{D}_i\}_{i=1}^K,\, \mathbf{U}} \sum_j \ell\big(h(\mathbf{X}_j),\, \mathbf{U},\, \mathrm{DCP}^\star(\mathbf{X}_j, \{\mathbf{D}_i\})\big)$
coincides with the CNN training objective
$\min_{\{\mathbf{W}_i\},\{\mathbf{b}_i\},\mathbf{U}} \sum_j \ell\big(h(\mathbf{X}_j),\, \mathbf{U},\, f(\mathbf{X}_j, \{\mathbf{W}_i\}, \{\mathbf{b}_i\})\big),$
where the loss $\ell$ compares the true label $h(\mathbf{X}_j)$ with a classifier $\mathbf{U}$ applied to the output of the last layer.
[Figure: the multi-layer CSC model. The signal $\mathbf{X} \in \mathbb{R}^N$ is represented as $\mathbf{D}_1 \mathbf{\Gamma}_1$, with a convolutional dictionary $\mathbf{D}_1 \in \mathbb{R}^{N \times Nm_1}$ (filters of length $n_0$, $m_1$ channels) and a sparse $\mathbf{\Gamma}_1 \in \mathbb{R}^{Nm_1}$; $\mathbf{\Gamma}_1$ is in turn represented as $\mathbf{D}_2 \mathbf{\Gamma}_2$, with $\mathbf{D}_2 \in \mathbb{R}^{Nm_1 \times Nm_2}$ (filters of size $n_1 m_1$, $m_2$ channels) and a sparse $\mathbf{\Gamma}_2 \in \mathbb{R}^{Nm_2}$.]
Convolutional sparsity assumes an inherent structure is present in natural signals.
Similarly, the representations themselves could also be assumed to have such a structure.