TRANSCRIPT
TOTAL VARIATION AND CALCULUS
Lecture 3 - CMSC764
(cs.umd.edu/~tomg/course/cmsc764/l3_tv_fft_calc.pdf)
IMAGE GRADIENT

Neumann boundary:
$\begin{pmatrix} u_1 - u_0 \\ u_2 - u_1 \\ u_3 - u_2 \end{pmatrix}$

Circulant boundary:
$\begin{pmatrix} u_1 - u_0 \\ u_2 - u_1 \\ u_3 - u_2 \\ u_0 - u_3 \end{pmatrix}$
TOTAL VARIATION

$TV(x) = |\nabla x| = \sum_k |x_{k+1} - x_k|$

How to compute? Use the stencil $[-1\ \ 1]$:
• Convolve with stencil [0, 1, -1]
• Take absolute value
• Take sum
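A minimal NumPy sketch of this recipe (the test signal here is made up, not from the slides):

```python
import numpy as np

x = np.array([0.0, 1.0, 1.0, 3.0, 2.0])

# Forward differences x[k+1] - x[k]
dx = x[1:] - x[:-1]
tv = np.sum(np.abs(dx))            # take absolute value, then sum

# Same thing phrased as a convolution with the kernel [1, -1]
# (np.convolve flips the kernel, so this applies the stencil [-1, 1])
dx_conv = np.convolve(x, [1, -1], mode='valid')
assert np.allclose(dx, dx_conv)
print(tv)                          # 4.0
```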
TV IN 2D

$(\nabla x)_{ij} = (x_{i+1,j} - x_{i,j},\ x_{i,j+1} - x_{i,j})$

Anisotropic: $|(\nabla x)_{ij}| = |x_{i+1,j} - x_{i,j}| + |x_{i,j+1} - x_{i,j}|$

Isotropic: $\|(\nabla x)_{ij}\| = \sqrt{(x_{i+1,j} - x_{ij})^2 + (x_{i,j+1} - x_{ij})^2}$

TOTAL VARIATION: 2D

$(\nabla x)_{ij} = (x_{i+1,j} - x_{ij},\ x_{i,j+1} - x_{ij}), \qquad g_x = D_x I, \quad g_y = D_y I$

$TV_{iso}(x) = |\nabla x| = \sum_{ij} \sqrt{(x_{i+1,j} - x_{ij})^2 + (x_{i,j+1} - x_{ij})^2}$

$TV_{an}(x) = |\nabla x| = \sum_{ij} |x_{i+1,j} - x_{ij}| + |x_{i,j+1} - x_{ij}|$
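A sketch of both flavors in NumPy, assuming replicated (Neumann-style) boundaries so the last row/column of differences is zero; the test image is just random data:

```python
import numpy as np

def tv_2d(x, isotropic=True):
    """Isotropic / anisotropic TV of a 2D array with replicated boundary."""
    dx = np.zeros_like(x)
    dy = np.zeros_like(x)
    dx[:-1, :] = x[1:, :] - x[:-1, :]   # x_{i+1,j} - x_{i,j}
    dy[:, :-1] = x[:, 1:] - x[:, :-1]   # x_{i,j+1} - x_{i,j}
    if isotropic:
        return np.sum(np.sqrt(dx**2 + dy**2))
    return np.sum(np.abs(dx) + np.abs(dy))

img = np.random.rand(64, 64)
print(tv_2d(img, isotropic=True), tv_2d(img, isotropic=False))
```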
COMPUTING TV ON IMAGES

Stencils: $[-1\ \ 1]$ and $\begin{pmatrix} -1 \\ 1 \end{pmatrix}$

• Two linear filters: x-stencil = (-1 1 0), y-stencil = (-1 1 0)'
• Two linear convolutions: x-kernel = (0 1 -1), y-kernel = (0 1 -1)'

…or…
COMPUTING TV ON IMAGES: Circulant Boundary

Fast transforms: use the FFT.

• Two linear convolutions: x-kernel = (0 1 -1), y-kernel = (0 1 -1)'
FOURIER TRANSFORM

$\hat{x}_k = \sum_n x_n e^{-i 2\pi k n / N}$

$\hat{x}_k = \langle x, F_k \rangle$

[Figure: Fourier modes 2, 4, and 10]
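A quick NumPy check that the definition above (inner products with the Fourier modes) matches the fast transform; the signal and its length are arbitrary:

```python
import numpy as np

N = 64
x = np.random.rand(N)

# Fourier mode k: F_k[n] = exp(-i 2 pi k n / N); x_hat_k = <x, F_k>
n = np.arange(N)
xhat = np.array([np.sum(x * np.exp(-2j * np.pi * k * n / N)) for k in range(N)])

# Same thing via the fast transform
assert np.allclose(xhat, np.fft.fft(x))
```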
CONVOLUTION

$f * g(x) = \sum_s f(x - s)\, g(s) = \sum_s f(x + s)\, g(-s)$

Don't forget: convolution slides a backwards (flipped) kernel across the signal. Sliding the kernel without flipping it is WRONG!!
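A small NumPy demo of the flipped kernel: np.convolve flips the kernel as the definition requires, while plain correlation does not (the delta-function input is just for illustration):

```python
import numpy as np

f = np.array([0.0, 0.0, 1.0, 0.0, 0.0])   # a shifted delta
g = np.array([1.0, 2.0, 3.0])

conv = np.convolve(f, g, mode='same')      # flips g internally: true convolution
corr = np.correlate(f, g, mode='same')     # no flip: NOT convolution

print(conv)   # [0. 1. 2. 3. 0.]
print(corr)   # [0. 3. 2. 1. 0.]  (the WRONG answer if you wanted convolution)
```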
PROPERTIES OF FFT

• Classical FFT is ORTHOGONAL: $F^H F = I$
• Computable in $O(N \log N)$ time (Cooley-Tukey):

$Fx = \begin{pmatrix} I & B \\ I & -B \end{pmatrix} \begin{pmatrix} F x_e \\ F x_o \end{pmatrix}$

(Here $x_e$ and $x_o$ are the even- and odd-indexed entries of $x$, and $B$ is a diagonal matrix of twiddle factors.)
THE CONVOLUTION THEOREM

• Convolution = multiplication in the Fourier domain:

$x * y = F^T (Fx \cdot Fy)$

What does this mean?

A BETTER WAY TO THINK ABOUT IT

$x * y = F^T (Fx \cdot Fy)$

• The FFT DIAGONALIZES convolution matrices:

$Kx = F^T D F x$, with $F$ orthogonal and $D$ diagonal.
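A sketch verifying the (circular) convolution theorem numerically in NumPy; the vectors are random and the direct sum is just the definition written out:

```python
import numpy as np

N = 8
x = np.random.rand(N)
y = np.random.rand(N)

# Circular convolution done directly from the definition
direct = np.array([sum(x[(n - s) % N] * y[s] for s in range(N)) for n in range(N)])

# Convolution theorem: transform, multiply pointwise, transform back
via_fft = np.fft.ifft(np.fft.fft(x) * np.fft.fft(y)).real

assert np.allclose(direct, via_fft)
```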
Trivia!
• Is $K_1 K_2 = K_2 K_1$?
• Does $\partial_x \partial_y u = \partial_y \partial_x u$?
HOW DO WE FIGURE OUT D?

• The diagonal matrix is just the FFT of the backward filter stencil $[-1\ \ 1]$
• Define stencil: [-1 1 0]
• Convolve with: [0 1 -1]
• Embed the stencil in a matrix
• Compute the diagonal: $D = Fs$
• Perform the convolution using the EVD: $Kx = F^T D F x$
• Example: derivative, $s = [1\ \ {-1}\ \ 0\ \ 0\ \ 0\ \ 0\ \ 0\ \ 0]$
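A sketch of this recipe with NumPy and scipy.linalg.circulant, taking the slide's stencil s as the first column of the convolution matrix K (an assumption about the convention):

```python
import numpy as np
from scipy.linalg import circulant

N = 8
s = np.array([1.0, -1.0, 0, 0, 0, 0, 0, 0])   # derivative stencil from the slide
x = np.random.rand(N)

K = circulant(s)          # circulant matrix whose first column is the stencil
D = np.fft.fft(s)         # "the diagonal is just the FFT of the stencil"

# K x  =  F^{-1} diag(D) F x : convolution performed with the eigendecomposition
assert np.allclose(K @ x, np.fft.ifft(D * np.fft.fft(x)).real)
```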
CONDITION NUMBER

$Kx = F^T D F x, \qquad s = [-1\ \ 0\ \ 0\ \ 0\ \cdots\ 0\ \ 0\ \ 0\ \ 1]$

How poorly conditioned is a convolution?
How poorly conditioned is the FFT?

[Figure: magnitudes of the diagonal entries of D for this stencil, ranging from 0 up to 2]

Zero: why? What is the eigenvector?
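A short numeric answer to these questions, assuming the circulant difference stencil above: the diagonal D = Fs has a zero entry at the DC mode (the constant vector is the corresponding eigenvector), so the convolution is singular, while the FFT itself is a scaled unitary matrix and has condition number 1:

```python
import numpy as np

N = 64
s = np.zeros(N)
s[0], s[-1] = -1.0, 1.0           # circular difference stencil from the slide

lam = np.abs(np.fft.fft(s))       # |diagonal of D| = |eigenvalues| of the convolution
print(lam.min())                  # 0.0: the convolution matrix is singular
# The eigenvector for the zero eigenvalue is the constant (DC) Fourier mode;
# differencing a constant signal gives exactly zero.
print(lam.max() / lam[1:].min())  # condition number ignoring the DC mode
```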
WHY IS THE CHIRP WELL-CONDITIONED?
Echolocation chirp: brown bat
Linear chirp
FFT IN THE WILD

Matlab's DFT is NOT ORTHOGONAL!

WHY? Because then convolution with a delta function = identity.

FFT still diagonalizes convolutions!

$F_\perp = \tfrac{1}{\sqrt{N}} F, \qquad F^T = \sqrt{N}\, F_\perp^T$

$F^{-1} D F = \tfrac{1}{\sqrt{N}}\, F_\perp^T\, D\, \sqrt{N}\, F_\perp = F_\perp^T D F_\perp$

Same D.
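A numeric sketch of "same D": whether we use NumPy's unnormalized fft (the same convention as Matlab's) or the orthonormal transform, the diagonalized convolution comes out identical:

```python
import numpy as np

N = 8
s = np.array([1.0, -1.0, 0, 0, 0, 0, 0, 0])
x = np.random.rand(N)

# Unnormalized transform (what numpy/Matlab fft computes) ...
K_x1 = np.fft.ifft(np.fft.fft(s) * np.fft.fft(x)).real

# ... vs the orthonormal transform: the diagonal D = fft(s) is unchanged,
# because rescaling F rescales F^{-1} by the reciprocal factor.
F = np.fft.fft(np.eye(N), axis=0, norm='ortho')   # orthonormal DFT matrix
Finv = np.conj(F).T                               # unitary => inverse is the adjoint
K_x2 = (Finv @ (np.fft.fft(s) * (F @ x))).real

assert np.allclose(K_x1, K_x2)
```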
CONTINUOUS GRADIENTS

WHY WE NEED GRADIENTS

First-order conditions: $\nabla f(x^\star) = 0$

First-order methods: gradient descent

THE CALCULUS YO MOMMA TAUGHT YOU

$\nabla f(x, y, z) = \partial_x f\, \mathbf{i} + \partial_y f\, \mathbf{j} + \partial_z f\, \mathbf{k}, \qquad f(x, y, z): \mathbb{R}^n \to \mathbb{R}$

What letter do we use for the 4th variable?
What letter do we use for the 1,000,000th variable?
DERIVATIVE

The Fréchet derivative of $f: \mathbb{R}^n \to \mathbb{R}^m$ is a linear operator $D$ with

$\lim_{\|h\| \to 0} \frac{f(x + h) - f(x) - Dh}{\|h\|} = 0$

A linear operator that locally behaves like $f$.

In optimization: variables are usually column vectors.
DERIVATIVE

In optimization: variables are usually column vectors.

$Df = \begin{pmatrix} D_{x_1}f & D_{x_2}f & D_{x_3}f \end{pmatrix} = \begin{pmatrix} D_{x_1}f_1 & D_{x_2}f_1 & D_{x_3}f_1 \\ D_{x_1}f_2 & D_{x_2}f_2 & D_{x_3}f_2 \\ D_{x_1}f_3 & D_{x_2}f_3 & D_{x_3}f_3 \end{pmatrix}$

In this case the derivative is the Jacobian.
GRADIENT

In optimization: variables are usually column vectors.

For scalar-valued functions:

$\nabla f = \begin{pmatrix} \partial_{x_1} f \\ \partial_{x_2} f \\ \partial_{x_3} f \end{pmatrix}, \qquad Df = \begin{pmatrix} \partial_{x_1} f & \partial_{x_2} f & \partial_{x_3} f \end{pmatrix}$

Definition: the gradient is the adjoint of the derivative.
GRADIENT

Better definition: the gradient of a scalar-valued function is a matrix of derivatives that is the same shape as the unknowns.

Example: matrix of unknowns

$x = \begin{pmatrix} x_{11} & x_{12} & x_{13} \\ x_{21} & x_{22} & x_{23} \\ x_{31} & x_{32} & x_{33} \end{pmatrix}, \qquad \nabla f = \begin{pmatrix} \partial_{11} f & \partial_{12} f & \partial_{13} f \\ \partial_{21} f & \partial_{22} f & \partial_{23} f \\ \partial_{31} f & \partial_{32} f & \partial_{33} f \end{pmatrix}$
WHAT'S THE GRADIENT?!?!

Function: $f(x) = Ax$                    Gradient: $\nabla f(x) = A^T$
Function: $f(x) = \|x\|^2$               Gradient: $\nabla f(x) = 2x$
Function: $f(x) = \frac{1}{2} x^T A x$   Gradient: $\nabla f(x) = Ax$
CHAIN RULE

Single-variable chain rule: $\partial f(g(x)) = f'(g)\, g'(x)$

Multi-variable chain rule, for $h(x) = f \circ g(x) = f(g(x))$:

$Df(g(x)) = Df \circ Dg(x)$

Swap the order for gradients:

$\nabla f(g(x)) = \nabla g(x)\, \nabla f(g)$
EXAMPLE: LEAST SQUARES

$\frac{1}{2}\|Ax - b\|^2 = f(g(x)), \qquad g(x) = Ax - b, \quad f(\cdot) = \frac{1}{2}\|\cdot\|^2$

$\nabla g(x)\, \nabla f(g) = A^T (Ax - b)$

How would you minimize this?
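A sketch of this example in NumPy: the chain-rule gradient $A^T(Ax - b)$, and one way to minimize it (plain gradient descent with step 1/L); the matrix sizes and step rule are illustrative choices, not from the slides:

```python
import numpy as np

np.random.seed(0)
A = np.random.randn(20, 5)
b = np.random.randn(20)

def f(x):      # f(g(x)) with g(x) = Ax - b and f(z) = 0.5*||z||^2
    return 0.5 * np.linalg.norm(A @ x - b) ** 2

def grad(x):   # chain rule: grad g(x) * grad f(g) = A^T (Ax - b)
    return A.T @ (A @ x - b)

x = np.zeros(5)
step = 1.0 / np.linalg.norm(A, 2) ** 2      # 1/L, L = ||A||^2 is the gradient's Lipschitz constant
for _ in range(500):
    x = x - step * grad(x)

print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0], atol=1e-4))   # True
```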
IMPORTANT RULES

If $h(x) = f(Ax)$, then $\nabla h(x) = A^T f'(Ax)$.

If $h(x) = f(xA)$, then $\nabla h(x) = f'(xA)\, A^T$.

Why?
EXAMPLE: RIDGE REGRESSION

$f(x) = \frac{\lambda}{2}\|x\|^2 + \frac{1}{2}\|Ax - b\|^2$

Optimality condition:

$\nabla f(x) = \lambda x + A^T(Ax - b) = 0$

$(A^T A + \lambda I)\, x = A^T b$

$x^\star = (A^T A + \lambda I)^{-1} A^T b$
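A sketch of the closed-form solution in NumPy (sizes and λ are arbitrary):

```python
import numpy as np

np.random.seed(0)
A = np.random.randn(20, 5)
b = np.random.randn(20)
lam = 0.1

# Closed form from the optimality condition: (A^T A + lam*I) x = A^T b
x_star = np.linalg.solve(A.T @ A + lam * np.eye(5), A.T @ b)

# Check the first-order condition: lam*x* + A^T(A x* - b) = 0
assert np.allclose(lam * x_star + A.T @ (A @ x_star - b), 0)
```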
REGULARIZED GRADIENTS

Huber regularization:
$|x| \approx h(x) = \begin{cases} \frac{1}{2}x^2, & \text{for } |x| \le \delta \\ \delta\,(|x| - \frac{1}{2}\delta), & \text{otherwise} \end{cases}$

Hyperbolic regularization:
$|x| \approx \sqrt{x^2 + \epsilon^2}$

[Figure: Huber and hyperbolic approximations of the absolute value on [-3, 3]]
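The two regularizers written out in NumPy (the δ and ε values are arbitrary):

```python
import numpy as np

def huber(x, delta=0.5):
    """Quadratic near zero, linear in the tails (the piecewise formula above)."""
    return np.where(np.abs(x) <= delta,
                    0.5 * x**2,
                    delta * (np.abs(x) - 0.5 * delta))

def hyperbolic(x, eps=0.5):
    return np.sqrt(x**2 + eps**2)

x = np.linspace(-3, 3, 7)
print(huber(x))        # smooth, differentiable stand-ins for |x|
print(hyperbolic(x))
```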
GRADIENT OF TV

$|\nabla x| + \frac{\mu}{2}\|x - f\|^2$

Regularize the L1 term:

$h(\nabla x) + \frac{\mu}{2}\|x - f\|^2$

Differentiate:

$\nabla^T h'(\nabla x) + \mu(x - f), \qquad h'(x) = \frac{x}{\sqrt{x^2 + \epsilon^2}}$ (divide component-wise)
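A 1D sketch of this gradient in NumPy, using forward differences for $\nabla$ and their adjoint for $\nabla^T$, checked against finite differences (the signal, μ, and ε are arbitrary):

```python
import numpy as np

def tv_objective(x, f, mu=1.0, eps=1e-2):
    gx = x[1:] - x[:-1]                         # (grad x) by forward differences
    return np.sum(np.sqrt(gx**2 + eps**2)) + 0.5 * mu * np.sum((x - f)**2)

def tv_gradient(x, f, mu=1.0, eps=1e-2):
    gx = x[1:] - x[:-1]
    hp = gx / np.sqrt(gx**2 + eps**2)           # h'(grad x), component-wise
    g = np.zeros_like(x)
    g[:-1] -= hp                                # grad^T h'(grad x): adjoint of the
    g[1:] += hp                                 # forward-difference operator
    return g + mu * (x - f)

# Finite-difference sanity check of the derived gradient
x = np.random.rand(30); f = np.random.rand(30)
num = np.zeros_like(x); h = 1e-6
for i in range(len(x)):
    e = np.zeros_like(x); e[i] = h
    num[i] = (tv_objective(x + e, f) - tv_objective(x - e, f)) / (2 * h)
assert np.allclose(num, tv_gradient(x, f), atol=1e-4)
```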
LOGISTIC REGRESSION

Data vectors $\{D_i\}$, labels $\{l_i\}$

Model: $\sigma(D_i x) = \frac{e^{D_i x}}{1 + e^{D_i x}}$

[Figure: the logistic sigmoid $\sigma$]

Maximum likelihood (MAP) estimator: maximize $p(x \mid D, l)$

$\mathrm{minimize}_x \sum_{k=1}^m \log\left(1 + \exp(-l_k (D_k x))\right) = f_{lr}(LDx)$
LOGISTIC REGRESSION

$\mathrm{minimize}_x \sum_{k=1}^m \log\left(1 + \exp(-l_k (D_k x))\right) = f_{lr}(LDx), \qquad \hat{D} = LD$

Break the problem down into a coordinate-separable part and a linear operator:

$f_{lr}(z) = \sum_{k=1}^m \log(1 + \exp(-z_k)), \qquad \mathrm{minimize}_x\ f_{lr}(\hat{D}x)$

$\nabla\{f_{lr}(\hat{D}x)\} = \hat{D}^T \nabla f_{lr}(\hat{D}x), \qquad \nabla f_{lr}(z) = \frac{-\exp(-z)}{1 + \exp(-z)}$
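A sketch in NumPy: build $\hat{D} = LD$, evaluate $f_{lr}$ and its chain-rule gradient $\hat{D}^T \nabla f_{lr}(\hat{D}x)$, and run plain gradient descent (the synthetic data and step size are illustrative):

```python
import numpy as np

np.random.seed(0)
m, n = 50, 5
D = np.random.randn(m, n)                         # data vectors D_i as rows
l = np.where(np.random.rand(m) < 0.5, -1.0, 1.0)  # labels in {-1, +1}
Dhat = l[:, None] * D                             # D_hat = L D, L = diag(labels)

def f_lr(z):                                      # coordinate-separable part
    return np.sum(np.log(1 + np.exp(-z)))

def grad_f_lr(z):
    return -np.exp(-z) / (1 + np.exp(-z))

def gradient(x):                                  # chain rule: D_hat^T grad f_lr(D_hat x)
    return Dhat.T @ grad_f_lr(Dhat @ x)

x = np.zeros(n)
for _ in range(1000):                             # simple gradient descent
    x -= 0.01 * gradient(x)
print(f_lr(Dhat @ x))
```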
NON-NEGATIVE MATRIX FACTORIZATION

$\mathrm{minimize}_{X \ge 0,\ Y \ge 0}\ E(X, Y) = \frac{1}{2}\|XY - S\|^2$

$\partial_Y E = X^T(XY - S), \qquad \partial_X E = (XY - S)\, Y^T$
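One simple way to use these gradients (not from the slides) is projected gradient descent, alternating a gradient step with projection onto the nonnegative orthant; the sizes, rank, and step size here are arbitrary:

```python
import numpy as np

np.random.seed(0)
S = np.random.rand(30, 20)                  # data matrix to factor
r = 5
X = np.random.rand(30, r)                   # X >= 0
Y = np.random.rand(r, 20)                   # Y >= 0

step = 1e-3
for _ in range(2000):                       # projected gradient descent
    R = X @ Y - S
    gX = R @ Y.T                            # dE/dX = (XY - S) Y^T
    gY = X.T @ R                            # dE/dY = X^T (XY - S)
    X = np.maximum(X - step * gX, 0)        # gradient step + projection onto X >= 0
    Y = np.maximum(Y - step * gY, 0)

print(0.5 * np.linalg.norm(X @ Y - S)**2)
```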
NEURAL NETS: LOGISTIC REGRESSION ON STEROIDS

MLP = "Multilayer Perceptron"

$\sigma(\sigma(\sigma(d_i W_1) W_2) W_3) = y_3$
GRADIENT OF NET

Input $d_i$, output $y_i$:

$\sigma(d_i W_1) = y_1, \qquad \sigma(y_1 W_2) = y_2, \qquad \sigma(y_2 W_3) = y_3$

$\sigma(\sigma(\sigma(d_i W_1) W_2) W_3) = y_3$
FINER BREAKDOWN

$\sigma(\sigma(\sigma(d_i W_1) W_2) W_3) = y_3$

$d W_1 = z_1, \qquad y_1 W_2 = z_2, \qquad y_2 W_3 = z_3$
$\sigma(z_1) = y_1, \qquad \sigma(z_2) = y_2, \qquad \sigma(z_3) = y_3$
GRADIENT OF NET

$\sigma(\sigma(\sigma(d_i W_1) W_2) W_3) = y_3$

Derivative of the output with respect to the input:

$\frac{\partial y_3}{\partial d} = \frac{\partial y_3}{\partial z_3} \circ \frac{\partial z_3}{\partial y_2} \circ \frac{\partial y_2}{\partial z_2} \circ \frac{\partial z_2}{\partial y_1} \circ \frac{\partial y_1}{\partial z_1} \circ \frac{\partial z_1}{\partial d}$

…or, written with gradients…

$\nabla_d y_3 = \nabla_d z_1\, \nabla_{z_1} y_1\, \nabla_{y_1} z_2\, \nabla_{z_2} y_2\, \nabla_{y_2} z_3\, \nabla_{z_3} y_3$

Where do we evaluate these gradients?
FORWARD PASS

Compute intermediate values:

$d W_1 = z_1, \qquad y_1 W_2 = z_2, \qquad y_2 W_3 = z_3$

Note: only the z's need to be stored.
BABY STEPS

$\frac{\partial y_3}{\partial d} = \frac{\partial y_3}{\partial z_3} \circ \frac{\partial z_3}{\partial y_2} \circ \frac{\partial y_2}{\partial z_2} \circ \frac{\partial z_2}{\partial y_1} \circ \frac{\partial y_1}{\partial z_1} \circ \frac{\partial z_1}{\partial d}$

Evaluate the factors one at a time.

For an activation layer, $\sigma(z_3) = y_3$:

$D\sigma = \mathrm{diag}(\sigma'_1, \sigma'_2, \cdots, \sigma'_n) = \mathrm{diag}(\sigma'), \qquad \nabla_x \sigma = \mathrm{diag}(\sigma')$

For a linear layer, $y_2 W_3 = z_3$:

$D_{y_2} z_3 = W_3, \qquad D_{y_2} z_3(\cdot) = (\cdot)\, W_3$

Note: this is multiplication on the right.

$\nabla_{y_2} z_3(\cdot) = W_3^T(\cdot), \qquad \nabla_{y_2} z_3 = W_3^T$

Now fill in the chain factor by factor (as a product acting on row vectors):

• $d W_1 = z_1$ contributes $W_1$:  $\frac{\partial y_3}{\partial d} = W_1 \cdots$
• $\sigma_1(z_1) = y_1$ contributes $\sigma'_1(z_1)$:  $\frac{\partial y_3}{\partial d} = W_1\, \sigma'_1(z_1) \cdots$
• $y_1 W_2 = z_2$ contributes $W_2$:  $\frac{\partial y_3}{\partial d} = W_1\, \sigma'_1(z_1)\, W_2 \cdots$
• $\sigma_2(z_2) = y_2$ contributes $\sigma'_2(z_2)$:  $\frac{\partial y_3}{\partial d} = W_1\, \sigma'_1(z_1)\, W_2\, \sigma'_2(z_2) \cdots$
• $y_2 W_3 = z_3$ contributes $W_3$:  $\frac{\partial y_3}{\partial d} = W_1\, \sigma'_1(z_1)\, W_2\, \sigma'_2(z_2)\, W_3 \cdots$
• $\sigma_3(z_3) = y_3$ contributes $\sigma'_3(z_3)$:  $\frac{\partial y_3}{\partial d} = W_1\, \sigma'_1(z_1)\, W_2\, \sigma'_2(z_2)\, W_3\, \sigma'_3(z_3)$
WHAT'S THE GRADIENT?

$\frac{\partial y_3}{\partial d} = \frac{\partial y_3}{\partial z_3} \circ \frac{\partial z_3}{\partial y_2} \circ \frac{\partial y_2}{\partial z_2} \circ \frac{\partial z_2}{\partial y_1} \circ \frac{\partial y_1}{\partial z_1} \circ \frac{\partial z_1}{\partial d}$

The Fréchet derivative:

$\frac{\partial y_3}{\partial d} = W_1\, \sigma'_1(z_1)\, W_2\, \sigma'_2(z_2)\, W_3\, \sigma'_3(z_3)$

The gradient (its adjoint):

$\nabla_{d_i} y_3 = \sigma'_3(z_3)\, W_3^T\, \sigma'_2(z_2)\, W_2^T\, \sigma'_1(z_1)\, W_1^T$
ADD A LOSS FUNCTION

$\ell(\cdot): \mathbb{R}^n \to \mathbb{R}$

$\frac{\partial \ell}{\partial d} = \frac{\partial \ell}{\partial y_3} \cdot \frac{\partial y_3}{\partial z_3} \circ \frac{\partial z_3}{\partial y_2} \circ \frac{\partial y_2}{\partial z_2} \circ \frac{\partial z_2}{\partial y_1} \circ \frac{\partial y_1}{\partial z_1} \circ \frac{\partial z_1}{\partial d}$

$\frac{\partial \ell}{\partial d} = W_1\, \sigma'_1(z_1)\, W_2\, \sigma'_2(z_2)\, W_3\, \sigma'_3(z_3)\, \ell'(y_3)$

$\nabla_{d_i} \ell = \ell'(y_3)\, \sigma'_3(z_3)\, W_3^T\, \sigma'_2(z_2)\, W_2^T\, \sigma'_1(z_1)\, W_1^T$
SANITY CHECK

What shape should the output be?

$\nabla_{d_i} \ell = \ell'(y_3)\, \sigma'_3(z_3)\, W_3^T\, \sigma'_2(z_2)\, W_2^T\, \sigma'_1(z_1)\, W_1^T$

(Here $\ell' = \nabla \ell$, and the product ends in $W_1^T$, so the result has the same shape as the unknown $d_i$.)
INSANITY CHECK: WHAT ABOUT WEIGHTS?

$\ell(\sigma(\sigma(\sigma(d_i W_1) W_2) W_3))$

$d W_1 = z_1, \qquad y_1 W_2 = z_2, \qquad y_2 W_3 = z_3$
$\sigma(z_1) = y_1, \qquad \sigma(z_2) = y_2, \qquad \sigma(z_3) = y_3$

$\frac{\partial \ell}{\partial W_2} = \frac{\partial \ell}{\partial z_2} \cdot \frac{\partial z_2}{\partial W_2}$

The factors of $\frac{\partial \ell}{\partial z_2}$ are $\sigma'_2(z_2)\, W_3\, \sigma'_3(z_3)\, \ell'(y_3)$; the factor from $\frac{\partial z_2}{\partial W_2}$ is $y_1$.
THE GRADIENT

$\ell(\sigma(\sigma(\sigma(d_i W_1) W_2) W_3))$

$\nabla_{W_2} \ell = \nabla_{W_2} z_2\, \nabla_{z_2} \ell$

$\nabla_{W_2} \ell = y_1^T\, \ell'(y_3)\, \sigma'_3(z_3)\, W_3^T\, \sigma'_2(z_2)$
COMPLETE BACKPROP

$\ell(\sigma(\sigma(\sigma(d_i W_1) W_2) W_3))$

$d W_1 = z_1, \qquad y_1 W_2 = z_2, \qquad y_2 W_3 = z_3$
$\sigma(z_1) = y_1, \qquad \sigma(z_2) = y_2, \qquad \sigma(z_3) = y_3$

$\nabla_{z_3} \ell = \ell'(y_3)\, \sigma'_3(z_3) \qquad\qquad \nabla_{W_3} \ell = y_2^T\, \nabla_{z_3} \ell$

$\nabla_{z_2} \ell = \ell'(y_3)\, \sigma'_3(z_3)\, W_3^T\, \sigma'_2(z_2) \qquad\qquad \nabla_{W_2} \ell = y_1^T\, \nabla_{z_2} \ell$

$\nabla_{z_1} \ell = \ell'(y_3)\, \sigma'_3(z_3)\, W_3^T\, \sigma'_2(z_2)\, W_2^T\, \sigma'_1(z_1) \qquad\qquad \nabla_{W_1} \ell = d^T\, \nabla_{z_1} \ell$
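A NumPy sketch of the complete forward/backward pass above for a 3-layer MLP, with sigmoid activations and a squared-error loss ℓ(y3) = ½||y3 - t||² (the loss, layer sizes, and data are illustrative assumptions), checked against finite differences:

```python
import numpy as np

np.random.seed(0)
sigma  = lambda z: 1.0 / (1.0 + np.exp(-z))          # logistic activation
dsigma = lambda z: sigma(z) * (1.0 - sigma(z))       # its derivative

d  = np.random.randn(1, 4)                           # one data vector (row)
W1 = np.random.randn(4, 5)
W2 = np.random.randn(5, 3)
W3 = np.random.randn(3, 2)
t  = np.random.randn(1, 2)                           # target for l(y3) = 0.5*||y3 - t||^2

# Forward pass: store the intermediate z's (and y's)
z1 = d @ W1;  y1 = sigma(z1)
z2 = y1 @ W2; y2 = sigma(z2)
z3 = y2 @ W3; y3 = sigma(z3)

# Backward pass, exactly as on the slide (diag(sigma') becomes an element-wise product)
lprime = y3 - t                                      # l'(y3)
g_z3 = lprime * dsigma(z3)                           # grad_{z3} l = l'(y3) sigma_3'(z3)
g_W3 = y2.T @ g_z3                                   # grad_{W3} l = y2^T grad_{z3} l
g_z2 = (g_z3 @ W3.T) * dsigma(z2)                    # grad_{z2} l
g_W2 = y1.T @ g_z2                                   # grad_{W2} l = y1^T grad_{z2} l
g_z1 = (g_z2 @ W2.T) * dsigma(z1)                    # grad_{z1} l
g_W1 = d.T @ g_z1                                    # grad_{W1} l = d^T grad_{z1} l

# Finite-difference check on one weight matrix
def loss(W2_):
    y = sigma(sigma(sigma(d @ W1) @ W2_) @ W3)
    return 0.5 * np.sum((y - t)**2)

num = np.zeros_like(W2); h = 1e-6
for i in range(W2.shape[0]):
    for j in range(W2.shape[1]):
        E = np.zeros_like(W2); E[i, j] = h
        num[i, j] = (loss(W2 + E) - loss(W2 - E)) / (2 * h)
assert np.allclose(num, g_W2, atol=1e-6)
```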
CONVOLUTIONAL NETS

Convolution-based template matching.

Regular-old MLP: $y_1 = \sigma(d W_1)$

Fancy-schmancy conv-net: $y_1 = \sigma(d K_1) = \sigma(d * k_1)$  ($K_1$ is a convolution matrix, $k_1$ its stencil)
CONVOLUTIONAL NETS

$y = \sigma(\sigma(\sigma(d K_1) K_2) K_3)$

$\nabla_{z_1} \ell = \ell'(y_3)\, \sigma'_3(z_3)\, K_3^T\, \sigma'_2(z_2)\, K_2^T\, \sigma'_1(z_1)$

Backprop now needs the adjoint of a convolution.
CONVOLUTIONAL NETS

$K = F^H D F$, where the diagonal is $D = \mathrm{diag}(Fk)$ for the stencil $k$.

Adjoint of convolution: $K^T = F^H \bar{D} F$, with diagonal $\mathrm{diag}(F\, \mathrm{flip}(k))$, i.e. convolution with the flipped stencil.

$\nabla_x \sigma_i(x K_i) = \mathrm{diag}(\sigma'_i)\, K_i^T$
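A NumPy/SciPy sketch of the adjoint-of-convolution facts on this slide: a circulant K is diagonalized by the DFT with diagonal Fk, and its adjoint K^T is convolution with the flipped stencil, whose spectrum is the conjugate D̄ (the stencil values are made up; F here is the unnormalized DFT matrix, so F^{-1} plays the role of F^H up to scaling):

```python
import numpy as np
from scipy.linalg import circulant

N = 8
k = np.zeros(N); k[:3] = [0.5, -1.0, 0.25]        # an example convolution stencil
K = circulant(k)                                   # circulant matrix that convolves with k

F = np.fft.fft(np.eye(N), axis=0)                  # (unnormalized) DFT matrix
D = np.diag(np.fft.fft(k))                         # diagonal = FFT of the stencil

# K = F^{-1} D F, and the adjoint is K^T = F^{-1} conj(D) F
assert np.allclose(K, np.linalg.inv(F) @ D @ F)
assert np.allclose(K.T, np.linalg.inv(F) @ np.conj(D) @ F)

k_flip = np.roll(k[::-1], 1)                       # flip(k): k_flip[n] = k[-n mod N]
assert np.allclose(K.T, circulant(k_flip))                        # adjoint = flipped convolution
assert np.allclose(np.fft.fft(k_flip), np.conj(np.fft.fft(k)))    # F flip(k) = conj(F k)
```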