Divergence matching criteria for registration, indexing and retrieval
Alfred O. Hero
Dept. EECS, Dept Biomed. Eng., Dept. Statistics
University of Michigan - Ann Arbor
http://www.eecs.umich.edu/~hero
Collaborators: Olivier Michel, Bing Ma, John Gorman
Outline
1. Statistical framework: entropy measures, error exponents
2. Registration, indexing and retrieval
3. α-entropy and α-MI estimation
4. Graph theoretic entropy estimation methods
(a) Image I_1 (b) Image I_0 (c) Registration result

Figure 1: A multidate image registration example
Statistical Framework
• X: an image
• Z = Z(X): an image feature vector
• Θ: a parameter space
• f(z|θ): feature density (likelihood)
• X_R: a reference image
• {X(i)}: a database of K images

Z_R = Z(X_R) ~ f(z|θ_R)
Z_i = Z(X_i) ~ f(z|θ_i),  i = 1, …, K

⇒ Similarity between X_i and X_R lies in the similarity between their models
Divergence Measures
Refs: [Csiszar:67, Basseville:SP89]

Define the densities

f_i = f(z|θ_i),  f_R = f(z|θ_R)

The Rényi α-divergence of fractional order α ∈ [0,1] [Renyi:61,70]:

D_α(f_i ‖ f_R) = D(θ_i ‖ θ_R) = (1/(α−1)) ln ∫ f_R (f_i/f_R)^α dx = (1/(α−1)) ln ∫ f_i^α f_R^{1−α} dx
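The definition above carries over to discrete feature densities by replacing the integral with a sum. A minimal numerical sketch (an illustration, not from the slides), with the α → 1 branch handled by the Kullback-Leibler limit:

```python
import numpy as np

def renyi_divergence(p, q, alpha):
    """Renyi alpha-divergence D_alpha(p || q): the discrete analogue of
    (1/(alpha-1)) * ln( integral of p^alpha * q^(1-alpha) )."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    if np.isclose(alpha, 1.0):
        # alpha -> 1 limit: Kullback-Leibler divergence
        return float(np.sum(p * np.log(p / q)))
    return float(np.log(np.sum(p**alpha * q**(1.0 - alpha))) / (alpha - 1.0))

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])
print(renyi_divergence(p, q, 0.5))   # order-1/2 divergence (symmetric in p, q)
print(renyi_divergence(p, q, 1.0))   # KL(p || q)
```

For α = 1/2 the sum is the Bhattacharyya coefficient, which is why D_{1/2} is symmetric in its two arguments.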
Rényi α-Divergence: Special cases
• α-Divergence vs. Bhattacharyya-Hellinger distance

D_{1/2}(f_i ‖ f_R) = −ln ( ∫ √(f_i f_R) dx )²

D²_BH(f_i ‖ f_R) = ∫ (√f_i − √f_R)² dx = 2 ( 1 − ∫ √(f_i f_R) dx )

• α-Divergence vs. Kullback-Leibler divergence

lim_{α→1} D_α(f_i ‖ f_R) = ∫ f_i ln (f_i/f_R) dx
Rényi α-Divergence and Error Exponents
Observe an i.i.d. sample W = [W_1, …, W_n]

H_0: W_j ~ f(w|θ_0)
H_1: W_j ~ f(w|θ_1)

Bayes probability of error:

P_e(n) = β(n) P(H_1) + α(n) P(H_0)

The large deviations principle (LDP) gives the Chernoff bound [Dembo&Zeitouni:98]:

lim inf_{n→∞} (1/n) log P_e(n) = − sup_{α∈[0,1]} { (1−α) D_α(θ_1 ‖ θ_0) }
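The Chernoff exponent can be evaluated numerically by gridding α. A sketch for two unit-variance Gaussian hypotheses (an illustrative assumption; the slides treat general f(w|θ)), for which the supremum is attained at α = 1/2 and equals (μ_1 − μ_0)²/8:

```python
import numpy as np

def chernoff_exponent(f0, f1, x, alphas=np.linspace(0.01, 0.99, 99)):
    """sup over alpha of (1-alpha)*D_alpha(f1 || f0), densities tabulated on x."""
    dx = x[1] - x[0]
    vals = []
    for a in alphas:
        integral = np.sum(f1**a * f0**(1 - a)) * dx   # int f1^a f0^(1-a)
        d_alpha = np.log(integral) / (a - 1)          # Renyi alpha-divergence
        vals.append((1 - a) * d_alpha)
    return max(vals)

x = np.linspace(-10, 10, 4001)
mu = 2.0
f0 = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)           # N(0,1)
f1 = np.exp(-(x - mu)**2 / 2) / np.sqrt(2 * np.pi)    # N(mu,1)
print(chernoff_exponent(f0, f1, x))   # ~ mu^2 / 8 = 0.5
```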
Indexing via α-divergence
Refs: Vasconcelos&Lippman:DCC98, Stoica&etal:ICASSP98, Do&Vetterli:ICIP00

H_0: Z_R^i ~ f(z|θ_0)
H_1: Z_R^i ~ f(z|θ_1)

Clairvoyant indexing rule:

X(i) ≻ X(j) ⇔ D_α(f_i ‖ f_R) < D_α(f_j ‖ f_R)

Indexing problem: find θ_i attaining min_{θ_i ≠ θ_R} D_α(θ_i ‖ θ_R)
1. Image classification: the f_i index model classes [Stoica&etal:INRIA98]
2. Target detection: f_R is the noise reference and the f_i are target references. Declare a detection if min_{θ_i ≠ θ_R} D_α(θ_i ‖ θ_R) > threshold
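As a toy instance of the clairvoyant rule, database entries can be ranked by their divergence from the reference model. The sketch below assumes hypothetical N(m_i, 1) Gaussian feature models (chosen only because the divergence then has the closed form D_α = α(m_i − m_R)²/2):

```python
# Toy indexing sketch: rank database entries X(i) by D_alpha(f_i || f_R).
# Unit-variance Gaussian models are an illustrative assumption; for them
# the Renyi divergence is D_alpha = alpha * (m_i - m_R)**2 / 2.

def rank_by_divergence(db_means, m_ref, alpha=0.5):
    div = {i: alpha * (m - m_ref) ** 2 / 2 for i, m in enumerate(db_means)}
    return sorted(div, key=div.get)   # most similar entry first

db_means = [3.0, 0.2, -1.5, 0.05]    # hypothetical model parameters theta_i
print(rank_by_divergence(db_means, m_ref=0.0))   # [3, 1, 2, 0]
```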
Registration via α-Mutual-Information
Ref: Viola&Wells:ICCV95
1. Reference X_R and target X_T.
2. Set of rigid transformations {T_i}
3. Derived feature vectors

Z_R = Z(X_R),  Z_i = Z(T_i(X_T))
H_0: {Z_j^R, Z_j^i} independent
H_1: {Z_j^R, Z_j^i} dependent

The error exponent is the α-MI (Pluim&etal:SPIE01, Neemuchwala&etal:ICIP01):

MI_α(Z^R, Z^i) = (1/(α−1)) ln ∫ f^α(Z^R, Z^i) ( f(Z^R) f(Z^i) )^{1−α} dZ^R dZ^i
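A rough plug-in sketch of MI_α from a 2D histogram of paired feature samples (an illustration only; later slides replace density plug-in with minimal-graph estimates):

```python
import numpy as np

# Plug-in alpha-MI estimate: bin counts stand in for the joint density of
# (Z_R, Z_i); the bin geometry cancels in the joint/marginal ratio.

def alpha_mi(x, y, alpha=0.5, bins=16):
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()                        # joint cell probabilities
    px = pxy.sum(axis=1, keepdims=True)          # marginals
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0                               # empty cells contribute 0
    s = np.sum(pxy[mask]**alpha * (px @ py)[mask]**(1 - alpha))
    return np.log(s) / (alpha - 1)

rng = np.random.default_rng(0)
z = rng.normal(size=5000)
dependent = alpha_mi(z, z + 0.1 * rng.normal(size=5000))
independent = alpha_mi(z, rng.normal(size=5000))
print(dependent > independent)   # dependent pair has larger alpha-MI
```

MI_α is zero for independent pairs (up to plug-in bias) and grows with statistical dependence, which is what makes it usable as a registration criterion.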
Ultrasound Registration Example
Figure 2: Three ultrasound breast scans. From top to bottom are: case 151,
case 142 and case 162.
MI Scatterplots

[Scatterplots of joint pixel intensities over the range 0–255; images omitted.]

Figure 3: MI scatterplots. 1st column: target = reference slice. 2nd column: target = reference+1 slice.
[Plot: α-divergence (α-MI) vs. angle of rotation (deg), one curve per α ∈ {0, 0.1, …, 0.9}, for single-pixel intensity features.]

Figure 4: α-Divergence as a function of rotation angle for ultrasound image registration
[Plot: curvature of the α-MI vs. α curve, for single-pixel intensity features.]

Figure 5: Resolution of α-Divergence as a function of α
Feature Trees

Figure 6: Part of feature tree data structure (root node, depth 1, depth 2; deeper branches not examined further).

Figure 7: Leaves of feature tree data structure (terminal nodes at depth 16).
ICA Features
Figure 8: Estimated ICA basis set for ultrasound breast image database
Simple Example
Figure 9: Bar images with contrast 1.02, 1.07 and 1.78. The background is low-variance white Gaussian noise while the bar has uniform intensity.
Single Pixel vs. Feature Tag

[Plots: mutual information (α ≈ 1) vs. degree of rotation, for intensity ratios 1.02, 1.07 and 1.78. Top: single-pixel-based MI peak with submergence of structure in background. Bottom: tag-based MI peak with submergence of structure in background.]

Figure 10: Upper curves are single-pixel-based MI trajectories while lower curves are 4×4 tag-based MI trajectories for the bar images.
US Registration Comparisons

                151          142         162
pixel           0.6/0.9      0.6/0.3     0.6/0.3
tag             0.5/3.6      0.5/3.8     0.4/1.4
spatial-tag     0.99/14.6    0.99/8.4    0.6/8.3

                151/8        151/16      151/32
ICA             0.7/4.1      0.7/3.9     0.99/7.7

Table 1: Numerator = optimal value of α and denominator = maximum resolution of mutual α-information for registering various images (cases 151, 142, 162) using various features (pixel, tag, spatial-tag, ICA). 151/8, 151/16, 151/32 correspond to the ICA algorithm with 8, 16 and 32 basis elements run on case 151.
Methods of Divergence Estimation
• Z = Z(X): a statistic (MI, reduced-rank feature, etc.)
• {Z_i}: n i.i.d. realizations from f(Z; θ)

Objective: estimate D_α(f_i ‖ f_R) from the Z_i's
1. Parametric density estimation methods
2. Non-parametric density estimation methods
3. Non-parametric minimal-graph estimation methods
Non-parametric estimation methods
Given an i.i.d. sample X = {X_1, …, X_n}.
Density "plug-in" estimator:

Ĥ_α(f̂_n) = (1/(1−α)) ln ∫_{ℝ^d} f̂^α(x) dx

Previous work was limited to the Shannon entropy H(f) = −∫ f(x) ln f(x) dx:
• Histogram plug-in [Gyorfi&VanDerMeulen:CSDA87]
• Kernel density plug-in [Ahmad&Lin:IT76]
• Sample-spacing plug-in [Hall:JMS86] (d = 1)
• Performance degrades as the density f becomes non-smooth
• Unclear how to robustify f̂ against outliers
• d-dimensional integration might be difficult
• ⇒ the function {f(x) : x ∈ ℝ^d} over-parameterizes the entropy functional
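A 1D histogram plug-in sketch of Ĥ_α (illustrative only; the bullets above explain why such estimators degrade in higher dimension and for non-smooth f):

```python
import numpy as np

# Histogram "plug-in" estimate of the Renyi alpha-entropy
# H_alpha(f) = (1/(1-alpha)) * ln( integral of f^alpha(x) dx ), in 1D.

def plugin_renyi_entropy(x, alpha=0.5, bins=32):
    counts, edges = np.histogram(x, bins=bins)
    widths = np.diff(edges)
    f_hat = counts / (counts.sum() * widths)     # piecewise-constant density
    integral = np.sum(f_hat**alpha * widths)     # int f_hat^alpha dx
    return np.log(integral) / (1 - alpha)

rng = np.random.default_rng(1)
u = rng.uniform(0, 1, size=20000)
print(plugin_renyi_entropy(u))   # ~ 0 for Uniform[0,1]
```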
Direct α-entropy estimation
• MST estimator of α-entropy [Hero&Michel:IT99]:

Ĥ_α = (1/(1−α)) ln ( L_γ(X_n)/n^α )

• Direct entropy estimator: faster convergence for non-smooth densities
• The parameter α is varied by varying the interpoint distance measure
• Optimally pruned k-MST graphs robustify against outliers
• Greedy multi-scale MST approximations reduce the combinatorial complexity
Minimal Graphs: Minimal Spanning Tree (MST)
Let M_n = M(X_n) denote the possible sets of edges in the class of acyclic graphs spanning X_n (spanning trees). The Euclidean power-weighted MST achieves

L_MST(X_n) = min_{M_n} Σ_{e ∈ M_n} ‖e‖^γ
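The power-weighted MST length can be computed directly with SciPy. A minimal sketch (not from the slides): since t ↦ t^γ is increasing for γ > 0, the ordinary MST of the powered pairwise distances attains L_MST.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

# Euclidean power-weighted MST length: L_MST(X_n) = min sum ||e||^gamma.
# A monotone map of edge weights preserves which spanning tree is minimal,
# so building the MST on ||e||^gamma directly is valid for gamma > 0.

def mst_length(points, gamma=1.0):
    d = squareform(pdist(points)) ** gamma   # powered pairwise distances
    return minimum_spanning_tree(d).sum()    # sum of the tree's edge weights

pts = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]])
print(mst_length(pts))             # edges of length 1 and 2 -> 3.0
print(mst_length(pts, gamma=2.0))  # 1^2 + 2^2 = 5.0
```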
[Plots: 256 random samples on [0,1]² and their MST.]

Figure 11: A sample data set and the MST
Minimal Graphs: Pruned MST
Fix k, 1 ≤ k ≤ n. Let M_{n,k} = M(x_{i_1}, …, x_{i_k}) be a minimal graph connecting k distinct vertices x_{i_1}, …, x_{i_k}. The k-MST T*_{n,k} = T*(x_{i*_1}, …, x_{i*_k}) is the minimum over all k-point MSTs:

L*_{n,k} = L*(X_{n,k}) = min_{i_1,…,i_k} min_{M_{n,k}} Σ_{e ∈ M_{n,k}} ‖e‖^γ
[Plots: k-MST with k = 99, 98, 62 and 25, rejecting 1, 2, 38 and 75 outliers respectively.]

Figure 12: k-MST for 2D torus density with and without the addition of uniform "outliers".
Convergence of MST

[Plots: uniform and triangular 2D samples (n = 100) on [0,1]², their MSTs, and the mean MST length as a function of n for each distribution.]

Figure 13: 2D triangular vs. uniform sample study for the MST.
[Plots: MST length and 2·log(L_n/√n) vs. number of samples n, for the uniform (o) and triangular (*) distributions.]

Figure 14: MST and log MST weights as a function of number of samples for the 2D uniform vs. triangular study.
Figure 15: A continuous quasi-additive Euclidean functional satisfies the "self-similarity" property at any scale.
Asymptotics: the BHH Theorem and entropy estimation

Theorem 1 (Beardwood&etal:Camb59, Steele:95, Redmond&Yukich:SPA96) Let L be a continuous quasi-additive Euclidean functional with power exponent γ, and let X_n = {X_1, …, X_n} be an i.i.d. sample drawn from a distribution on [0,1]^d with an absolutely continuous component having (Lebesgue) density f(x). Then

lim_{n→∞} L_γ(X_n)/n^{(d−γ)/d} = β_{L_γ,d} ∫ f(x)^{(d−γ)/d} dx   (a.s.)     (1)

Or, letting α = (d−γ)/d,

lim_{n→∞} L_γ(X_n)/n^α = β_{L_γ,d} exp( (1−α) H_α(f) )   (a.s.)
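Theorem 1 is what makes ln(L_γ(X_n)/n^α)/(1−α) an estimator of H_α(f) up to the constant β. A numerical sketch that cancels the unknown β by differencing against a Uniform[0,1]² calibration sample, for which H_α = 0 (the calibration trick is an illustration, not from the slides):

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

# MST entropy statistic: ln(L_gamma(X_n)/n^alpha)/(1-alpha) with
# alpha = (d-gamma)/d.  By the BHH theorem it converges to
# H_alpha(f) + ln(beta)/(1-alpha); the beta term cancels in differences.

def mst_stat(x, gamma=1.0):
    n, d = x.shape
    alpha = (d - gamma) / d
    L = minimum_spanning_tree(squareform(pdist(x)) ** gamma).sum()
    return np.log(L / n**alpha) / (1 - alpha)

rng = np.random.default_rng(2)
ref = mst_stat(rng.uniform(size=(2000, 2)))               # H_0.5 = 0 on [0,1]^2
half = mst_stat(rng.uniform(size=(2000, 2)) * [0.5, 1.0]) # uniform on half area
print(half - ref)   # roughly ln(0.5), the alpha-entropy of the smaller support
```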
Extension to Pruned Graphs
Fix α ∈ [0,1] and let k = ⌊αn⌋. Then as n → ∞ (Hero&Michel:IT99)

L(X*_{n,k})/⌊αn⌋^ν → β_{L_γ,d} min_{A: P(A) ≥ α} ∫ f^ν(x | x ∈ A) dx   (a.s.)

or, alternatively, with

H_ν(f | x ∈ A) = (1/(1−ν)) ln ∫ f^ν(x | x ∈ A) dx,

L(X*_{n,k})/⌊αn⌋^ν → β_{L,γ} exp( (1−ν) min_{A: P(A) ≥ α} H_ν(f | x ∈ A) )   (a.s.)
Asymptotics: Plug-in estimation of H_α(f)
Class of Hölder continuous functions over [0,1]^d:

Σ_d(κ,c) = { f(x) : |f(z) − p_x^{⌊κ⌋}(z)| ≤ c ‖x−z‖^κ }

where p_x^{⌊κ⌋} is the Taylor polynomial of f of degree ⌊κ⌋ at x.

Class of functions of bounded variation (BV) over [0,1]^d:

BV_d(c) = { f(x) : sup_{{x_i}} Σ_i |f(x_i) − f(x_{i−1})| ≤ c }

Proposition 1 (Hero&Ma:IT01) Assume that f^α ∈ Σ_d(κ,c). Then, if f̂^α is a minimax estimator,

sup_{f^α ∈ Σ_d(κ,c)} E^{1/p}[ | ∫ f̂^α(x)dx − ∫ f^α(x)dx |^p ] = O( n^{−κ/(2κ+d)} )
Asymptotics: Minimal-graph estimation of H_α(f)

Proposition 2 (Hero&Ma:IT01) Let d ≥ 2 and α = (d−γ)/d ∈ [1/2, (d−1)/d]. Assume that f^α ∈ Σ_d(κ,c) where κ ≤ 1 and c < ∞. Then for any continuous quasi-additive Euclidean functional L_γ,

sup_{f^α ∈ Σ_d(κ,c)} E^{1/p}[ | L_γ(X_1,…,X_n)/n^α − β_{L_γ,d} ∫ f^α(x)dx |^p ] ≤ O( n^{−1/(d+1)} )

Conclude: the minimal-graph estimator converges faster for κ < d/(d−1).
As Σ_d(1,c) ⊂ BV_d(c), we have

Corollary 1 (Hero&Ma:IT01) Let d ≥ 2 and α = (d−γ)/d ∈ [1/2, (d−1)/d]. Assume that f^α is of bounded variation over [0,1]^d. Then

sup_{f^α ∈ BV_d(c)} E^{1/p}[ | ∫ f̂^α(x)dx − ∫ f^α(x)dx |^p ] ≤ O( n^{−1/(d+2)} )

sup_{f^α ∈ BV_d(c)} E^{1/p}[ | L_γ(X_1,…,X_n)/n^α − β_{L_γ,d} ∫ f^α(x)dx |^p ] ≤ O( n^{−1/(d+1)} )
Observations
• The minimal-graph rates are valid for the MST, k-NN graph, TSP, Steiner tree, etc.
• An analogous rate bound holds for the progressive-resolution algorithm

L_γ^G(X_n) = Σ_{i=1}^{m^d} L_γ(X_n ∩ Q_i)

where {Q_i} is a uniform partition of [0,1]^d into cells of volume 1/m^d
• The optimal sequence of cell volumes is m^{−d} = n^{−1/(d+1)}
• These results also apply to the greedy multi-resolution k-MST
Application: Image Registration
Two independent data samples from unknown distributions:
• X = [X_1, …, X_m] ~ f(x)
• Y = [Y_1, …, Y_n] ~ g(x)

Suppose g(x) = f(Ax+b), AᵀA = I.
Objective: find the rigid transformation A, b.
• Two methods:
1. α-MI of {(X_i, Y_i)}_{i=1}^n
2. α-Entropy of {X_i}_{i=1}^m ∪ {Y_i}_{i=1}^n
[Images: (a) reference image, (b) image at 300, 0, 110 rotation, (c) image at 290, −20, 130 rotation.]

Figure 16: Reference and target SAR/DEM images
O(n^{−1/(2d+1)}) algorithm for α-MI estimation

MI_α(X,Y) = (1/(α−1)) ln ∫ f_{X,Y}^α(x,y) ( f_X(x) f_Y(y) )^{1−α} dx dy

Algorithm:
1. Kernel estimates f̂_X, f̂_Y ( O(n^{−1/(d+2)}) )
2. Uniformizing probability transformations: X̃ = F̂_X(X), Ỹ = F̂_Y(Y)
3. Graph entropy estimate of MI_α(X̃,Ỹ) ( O(n^{−1/(2d+1)}) ):

L_γ({(X̃_1,Ỹ_1), …, (X̃_n,Ỹ_n)})/n^α → β_{L_γ,d} ∫ f_{X̃,Ỹ}^α(x̃,ỹ) dx̃ dỹ
= β_{L_γ,d} ∫ f_{X,Y}^α(x,y) ( f_X(x) f_Y(y) )^{1−α} dx dy   (w.p. 1)
O(n^{−1/(d+1)}) criterion: α-Jensen difference
• Jensen difference between f_0, f_1:

∆J_α = H_α(ε f_1 + (1−ε) f_0) − ε H_α(f_1) − (1−ε) H_α(f_0) ≥ 0

• f_0, f_1 are two densities, ε satisfies 0 ≤ ε ≤ 1
• Let X, Y be i.i.d. features extracted from two images:

X = {X_1, …, X_m},  Y = {Y_1, …, Y_n}

• Each realization in the unordered sample Z = {X, Y} has marginal

f_Z(z) = ε f_X(z) + (1−ε) f_Y(z),  ε = m/(n+m)

• α-Jensen difference for a rigid transformation T:

∆J_α(T) = H_α(ε f_X + (1−ε) f_Y) − [ε H_α(f_X) + (1−ε) H_α(f_Y)]

where the bracketed term is constant in T.
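Because the marginal-entropy term is constant in T, minimizing ∆J_α(T) reduces, via the MST estimator of H_α, to minimizing the MST length of the pooled unordered sample {X, T(Y)}. A sketch on synthetic 2D point features with a 1D translation search (the scenario and names are illustrative, not from the slides):

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

# alpha-Jensen registration sketch: sweep a candidate shift and keep the
# one minimizing the MST length of the pooled sample {X, T(Y)}.

def pooled_mst_length(x, y, gamma=1.0):
    z = np.vstack([x, y])                    # unordered pooled sample
    return minimum_spanning_tree(squareform(pdist(z)) ** gamma).sum()

rng = np.random.default_rng(3)
x = rng.normal(size=(300, 2))
y = rng.normal(size=(300, 2)) + [2.0, 0.0]   # copy misaligned by a 2.0 shift

shifts = np.linspace(0.0, 4.0, 21)
lengths = [pooled_mst_length(x, y - [s, 0.0]) for s in shifts]
best = shifts[int(np.argmin(lengths))]
print(best)   # a shift near 2.0 minimizes the pooled MST length
```

Alignment makes the pooled sample denser, which shortens the MST; misalignment spreads it into two clusters and lengthens the tree.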
[Images: (a) reference image, (b) image at 300, 0, 110 rotation, (c) Voronoi regions for image (290, −20, 130), (d) Voronoi regions for the reference image.]

Figure 17: Reference and target SAR/DEM images

[Plots: (a) misaligned points and (b) the MST over them, in (X, Y, intensity) coordinates.]

Figure 18: MST demonstration for misaligned images
[Plots: (a) aligned points and (b) the MST over them, in (X, Y, intensity) coordinates.]

Figure 19: MST demonstration for aligned images
Conclusions
1. α-divergence for indexing can be justified via decision theory
2. Non-parametric estimation of the Jensen difference is a low-complexity alternative to α-divergence estimation
3. Non-parametric estimation of the Jensen difference is possible without density estimation
4. Minimal-graph estimation outperforms plug-in estimation for non-smooth densities
Divergence vs. Jensen: Asymptotic Comparison
For ε ∈ [0,1] and g a p.d.f., define

f_ε = ε f_1 + (1−ε) f_0,  E_g[Z] = ∫ Z(x) g(x) dx,  f̃_{1/2}^α = f_{1/2}^α / ∫ f_{1/2}^α dx

Then

∆J_α = (α ε(1−ε)/2) [ E_{f̃_{1/2}^α}( [(f_1 − f_0)/f_{1/2}]² ) + (α/(1−α)) ( E_{f̃_{1/2}^α}[ (f_1 − f_0)/f_{1/2} ] )² ] + O(∆)

D_α(f_1 ‖ f_0) = (α/4) ∫ f_{1/2} [ (f_1 − f_0)/f_{1/2} ]² dx + O(∆)