TRANSCRIPT
UCSD
27 January 2006
Quantization, Compression, and Classification: Extracting discrete information
from a continuous world

Robert M. Gray, Information Systems Laboratory, Dept. of Electrical Engineering,
Stanford, CA. [email protected]
Partially supported by the National Science Foundation, Norsk Electro-Optikk,and Hewlett Packard Laboratories.
A pdf of these slides may be found at http://ee.stanford.edu/~gray/ucsd06.pdf
Quantization 1
Introduction
continuous world −→ discrete representations

x −→ [encoder α] −→ 011100100 · · · −→ [decoder β] −→ 13
                                    ↘ [decoder β] −→ “Picture of a privateer”
How well can you do it?
What do you mean by “well”?
How do you do it?
Quantization 2
The dictionary (Random House) definition of quantization:
the division of a quantity into a discrete number of small
parts, often assumed to be integral multiples of a common
quantity.

In general: the mapping of a continuous quantity into a
discrete quantity.

Converting a continuous quantity into a discrete quantity
causes a loss of accuracy and information. Usually we need to
minimize that loss.
Quantization 3
Quantization
Old example: round off real numbers to the nearest integer.
Estimate densities by histograms [Sheppard (1898)].

More generally, a quantizer of a space A (e.g., ℝ^k, L²([0, 1]²)) consists of

• An encoder α : A → I, I an index set, e.g., the nonnegative
integers ⇔ a partition S = {S_i; i ∈ I}: S_i = {x : α(x) = i}.

• A decoder β : I → C ⇔ a codebook C = β(α(A)).
Scalar quantizer:
[Figure: the real line partitioned into cells S_0, S_1, S_2, S_3, S_4,
with a reproduction point β(0)∗, β(1)∗, β(2)∗, β(3)∗, β(4)∗ inside each cell.]
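To make the α/β pair concrete, here is a minimal sketch (not from the talk) of a uniform scalar quantizer in Python; the step size delta and the midpoint decoder are illustrative choices:

    import numpy as np

    # Minimal sketch (illustrative): a uniform scalar quantizer with step
    # size delta. The encoder alpha maps x to the index of its cell; the
    # decoder beta returns the midpoint of cell S_i as the reproduction.

    def alpha(x, delta=1.0):
        """Encoder: index of the cell containing x."""
        return np.floor(x / delta).astype(int)

    def beta(i, delta=1.0):
        """Decoder: midpoint of cell S_i."""
        return (i + 0.5) * delta

    x = np.random.randn(100_000)      # stand-in source samples
    x_hat = beta(alpha(x))
    print(np.mean((x - x_hat) ** 2))  # empirical distortion, ~ delta^2/12 at high rate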
Quantization 4
Two-dimensional example: centroidal Voronoi diagram:
nearest neighbor partition, Euclidean centroids.

E.g., complex numbers, nearest mail boxes, sensors, repeaters,
fish fortresses . . . territories of the male Tilapia mossambica
[G. W. Barlow, Hexagonal Territories, Animal Behavior, Volume 22, 1974]
Quantization 5
Three dimensions: Voronoi partition around spheres
http://www-math.mit.edu/dryfluids/gallery/
Quantization 6
Quality/Distortion/Cost
Quality of a quantizer measured by the goodness of the
resulting reproduction in comparison to the original.
Assume a distortion measure d(x, y) ≥ 0 which measures
penalty if an input x results in an output y = β(α(x)).
Small (large) distortion ⇔ good (bad) quality
System quality measured by average distortion
Assume X a random object (variable, vector, process, field) with
probability distribution PX: pixel intensities, samples, features,
fields, DCT or wavelet transform coefficients, moments, etc.
If A = ℝ^k, P_X ⇔ a pdf f, a pmf p, or an empirical distribution P_L from a training/learning set L = {x_l; l = 1, 2, . . . , |L|}
Quantization 7
Average Distortion wrt P_X: D(α, β) = E[d(X, β(α(X)))]
Examples:
• Mean squared error (MSE): d(x, y) = ‖x − y‖² = ∑_{i=1}^k |x_i − y_i|².
Useful if y is intended to reproduce or approximate x.
More generally, an input- or output-weighted quadratic, with B_x positive
definite: (x − y)^t B_x (x − y), (x − y)^t B_y (x − y)
• X is the observation, but the unseen Y is the desired information.
“Sensor” described by P_{X|Y}, e.g., X = Y + W.
Quantize x, reconstruct Ŷ = κ(i). Bayes risk C(y, ŷ).
d(x, i) = E[ C(Y, κ(i)) | X = x ]
E[ d(X, α(X)) ] = E[ C(Y, κ(α(X))) ]
Quantization 8
• Y discrete ⇒ classification, detection
• Y continuous ⇒ estimation, regression
Y −→ [SENSOR P_{X|Y}] −→ X −→ [ENCODER α] −→ α(X) = i −→ [DECODER β] −→ X̂ = β(i)
                                                       ↘ [DECODER κ] −→ Ŷ = κ(i)

Or a weighted combination of MSE for X and Bayes risk for Y.

Problem: need to know or estimate P_{Y|X}, which is harder to estimate
than P_X.
Quantization 9
Want the average distortion small, which we can do with big
codebooks and small cells, but . . .

. . . there is usually a cost r(i) of choosing index i in terms
of storage or transmission capacity or computational capacity —
more cells imply more memory, more time, more bits.
For example,
Quantization 10
• r(i) = ln |C|: all codewords have equal cost. Fixed-rate, classic
quantization.

• r(i) = ℓ(i), a length function satisfying ∑_i e^{−ℓ(i)} ≤ 1, the
Kraft inequality ⇔ a uniquely decodable lossless code exists with
lengths ℓ(i). The number of bits or nats required to communicate/store
the index.

• r(i) = ℓ(i) = −ln Pr(α(X) = i): Shannon codelengths,
E[ℓ(α(X))] = H(α(X)) = −∑_i p_i ln p_i, the Shannon entropy.
The optimal choice of ℓ satisfying Kraft; optimal lossless
compression of α(X). Minimizes the required channel capacity
for communication. (See the sketch after this list.)

• r(i) = (1 − η)ℓ(i) + η ln |C|, η ∈ [0, 1]: combined transmission
length and memory constraints. (current research)
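A small sketch (with an assumed example pmf, not from the slides) showing that the Shannon codelengths meet the Kraft inequality with equality and average to the entropy:

    import numpy as np

    # Sketch: Shannon codelengths l(i) = -ln p_i for a hypothetical pmf.
    # They satisfy Kraft with equality, and E[l(alpha(X))] equals the
    # Shannon entropy H = -sum_i p_i ln p_i (in nats).

    p = np.array([0.5, 0.25, 0.125, 0.125])  # hypothetical index probabilities
    lengths = -np.log(p)                     # Shannon codelengths (nats)

    print(np.sum(np.exp(-lengths)))          # Kraft sum: 1.0 (must be <= 1)
    print(np.sum(p * lengths))               # E[l] ...
    print(-np.sum(p * np.log(p)))            # ... equals the entropy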
Quantization 11
Issues
If we know the distributions:

What is the optimal tradeoff of distortion vs. rate?
(Want both to be small!)

How do we design good codes?

If we do not know the distributions, how do we use training/learning
data to estimate tradeoffs and design good codes? (statistical
learning, machine learning)
Quantization 12
Applications
• Communications & signal processing: A/D conversion, data compression.
Compression is required for efficient transmission
– send more data in the available bandwidth
– send the same data in less bandwidth
– more users on the same bandwidth
and storage
– can store more data
– can compress for local storage, put details on cheaper media

Graphic courtesy of Jim Storer.
Quantization 13
• Statistical clustering: grouping bird songs, designing Mao
suits, grouping gene features, taxonomy.
• Placing points in space: mailboxes, wireless repeaters, wireless
sensors
• How best approximate continuous probability distributions by
discrete ones? Quantize space of distributions.
• Numerical integration and optimal quadrature rules. Want to
estimate the integral ∫ g(x) dx by the sum ∑_{i∈I} V(S_i) g(y_i), where
V(S_i) is the volume of cell S_i in the partition S. How best to
choose the points y_i and cells S_i? (A small sketch follows this list.)
• Pick best Gauss mixture model from a finite set based on
training data. Plug into Bayes estimator.
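For the quadrature bullet above, a minimal sketch (assumed 1-D setup on [0, 1]): partition into N equal cells, take y_i as the cell midpoints, and form the sum ∑_i V(S_i) g(y_i):

    import numpy as np

    # Sketch: quadrature as quantization on [0,1]. Each cell S_i has
    # volume V(S_i) = 1/N and representative point y_i at its midpoint.

    def cell_quadrature(g, n_cells=64):
        edges = np.linspace(0.0, 1.0, n_cells + 1)
        y = 0.5 * (edges[:-1] + edges[1:])   # y_i: one point per cell
        vol = np.diff(edges)                 # V(S_i)
        return np.sum(vol * g(y))

    g = lambda x: np.exp(-x ** 2)            # hypothetical integrand
    print(cell_quadrature(g))                # ~0.7468, close to the true integral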
Quantization 14
Distortion-rate optimization: q = (α, β, d, r)

D(q) = E[d(X, β(α(X)))] vs. R(q) = E[r(α(X))]

Minimize D(q) for R(q) ≤ R:  δ(R) = inf_{q : R(q) ≤ R} D(q)

Minimize R(q) for D(q) ≤ D:  r(D) = inf_{q : D(q) ≤ D} R(q)

Lagrangian approach: ρ_λ(x, i) = d(x, β(i)) + λ r(i), λ ≥ 0

Minimize ρ(f, λ, η, q) ≜ E[ρ_λ(X, α(X))]
= D(q) + λ R(q)
= E[d(X, β(α(X)))] + λ [ (1 − η) E[ℓ(α(X))] + η ln |C| ]

ρ(f, λ, η) = inf_q ρ(f, λ, η, q)

Traditional fixed-rate: η = 1; variable-rate: η = 0
Quantization 15
Three Theories of Quantization
• Rate-Distortion Theory: Shannon distortion-rate function
D(R) ≤ δ(R), achievable in the asymptopia of large dimension k
and fixed rate R [Shannon (1949, 1959), Gallager (1978)]

• Nonasymptotic (exact) results: necessary conditions for
optimal codes ⇒ iterative design algorithms
⇔ statistical clustering [Steinhaus (1956), Lloyd (1957)]

• High rate theory: optimal performance in the asymptopia of fixed
dimension k and large rate R [Bennett (1948), Lloyd
(1957), Zador (1963), Gersho (1979), Bucklew, Wise (1982)]
Quantization 16
Lloyd Optimality Properties: q = (α, β, d, r)

Any component can be optimized for the others.

Encoder: α(x) = argmin_i ( d(x, β(i)) + λ r(i) ) — minimum distortion

Decoder: β(i) = argmin_y E[ d(X, y) | α(X) = i ] — Lloyd centroid

Length function: ℓ(i) = −ln P_f(α(X) = i) — Shannon codelength

Pruning: remove any index i if the pruned code (α′, β′, r′) satisfies
D(α′, β′) + λ R(α′, r′) ≤ D(α, β) + λ R(α, ℓ)

⇒ the Lloyd clustering algorithm (rediscovered as k-means,
grouped coordinate descent, alternating optimization, principal
points); a sketch follows below.

For the estimation application, it is often nearly optimal to quantize
the optimal nonlinear estimate (Ephraim)
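A minimal sketch of the resulting iteration (an assumed 1-D squared-error setup; the λ·ℓ term in the encoder gives the entropy-constrained variant, and λ = 0 recovers plain k-means):

    import numpy as np

    # Sketch of the Lloyd iteration for squared error on 1-D data.
    # Encoder: Lagrangian nearest neighbor, d(x, c_i) + lam * l(i).
    # Decoder: centroid of each cell. Length: Shannon codelengths.

    def lloyd(x, n_codewords=8, lam=0.0, iters=50):
        code = np.sort(np.random.choice(x, n_codewords, replace=False))
        lengths = np.full(n_codewords, np.log(n_codewords))
        for _ in range(iters):
            cost = (x[:, None] - code[None, :]) ** 2 + lam * lengths[None, :]
            idx = np.argmin(cost, axis=1)            # encoder step
            for i in range(n_codewords):
                if np.any(idx == i):
                    code[i] = x[idx == i].mean()     # centroid step
            p = np.bincount(idx, minlength=n_codewords) / len(x)
            lengths = -np.log(np.maximum(p, 1e-12))  # length step
        return code, lengths

    x = np.random.randn(10_000)
    code, lengths = lloyd(x, lam=0.05)
    print(np.round(np.sort(code), 3))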
Quantization 17
The classic case of fixed length and squared error ⇒ Voronoi partition
(Dirichlet tessellation, Thiessen polygons, Wigner–Seitz zones, medial
axis transforms) with centroidal codebook: a centroidal Voronoi partition.

Voronoi diagrams are common in many fields including biology,
cartography, crystallography, chemistry, physics, astronomy,
meteorology, geography, anthropology, computational geometry
. . .
Quantization 18
Scott Snibbe’s
Boundary Functions
http://www.snibbe.com/scott/bf/
The Voronoi game: Competitive facility
location
http://www.voronoigame.com
Quantization 19
High Rate (Resolution) Theory
Traditional form (Zador, Gersho, Bucklew, Wise): pdf f, MSE:

lim_{R→∞} e^{2R/k} δ(R) = a_k ‖f‖_{k/(k+2)}   fixed-rate (R = ln N)
                        = b_k e^{2h(f)/k}     variable-rate (R = H), where

h(f) = −∫ f(x) ln f(x) dx,   ‖f‖_{k/(k+2)} = ( ∫ f(x)^{k/(k+2)} dx )^{(k+2)/k}

a_k ≥ b_k are Zador constants (they depend on k and d, not on f!)

a_1 = b_1 = 1/12,   a_2 = 5/(18√3),   b_2 = ?,   a_k, b_k = ? for k ≥ 3

lim_{k→∞} a_k = lim_{k→∞} b_k = 1/(2πe)
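For k = 1 the fixed-rate constant can be checked numerically (a sketch with an assumed uniform source, for which the optimal N-level quantizer is uniform with MSE = 1/(12N²)):

    import numpy as np

    # Sketch: check a_1 = 1/12. For X uniform on [0,1], e^{2R} * delta(R)
    # = N^2 * MSE -> 1/12 as N grows (k = 1, R = ln N).

    for n in (4, 16, 64, 256):
        x = np.random.rand(200_000)
        idx = np.minimum((x * n).astype(int), n - 1)  # encoder: cell index
        x_hat = (idx + 0.5) / n                       # decoder: cell midpoint
        print(n, n ** 2 * np.mean((x - x_hat) ** 2))  # -> 0.0833... = 1/12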
Quantization 20
Lagrangian form (Gray, Linder, Li, Gill):

variable-rate:
lim_{λ→0} [ inf_q ( ρ(f, λ, 0, q)/λ ) + (k/2) ln λ ] = θ_k + h(f)

θ_k ≜ inf_{λ>0} [ inf_q ( ρ(u, λ, 0, q)/λ ) + (k/2) ln λ ] = (k/2) ln( 2e b_k / k )

fixed-rate:
lim_{λ→0} [ inf_q ( ρ(f, λ, 1, q)/λ ) + (k/2) ln λ ] = ψ_k + ln ‖f‖^{k/2}_{k/(k+2)}

ψ_k ≜ inf_{λ>0} [ inf_q ( ρ(u, λ, 1, q)/λ ) + (k/2) ln λ ] = (k/2) ln( 2e a_k / k )

Both have the form: optimum for the uniform density u + a function of f.
Quantization 21
Ongoing: conjecture similar results for combined constraints:

lim_{λ→0} ( inf_q [ E_f d(X, β(α(X)))/λ + (1 − η) H_f(q) + η ln N_q ] + (k/2) ln λ )
= θ_k(η) + h(f, η)

(the bracketed infimum is ρ(f, λ, η)/λ), with

θ_k(η) ≜ inf_{λ>0} ( ρ(u, λ, η)/λ + (k/2) ln λ )

where h(f, η) is given by a convex minimization:

h(f, η) = h(f) + inf_ν [ (k/2) ∫ f(x) [ e^{ν(x)} − ν(x) − 1 ] dx + η H(f‖Λ) ]

where

Λ(x) = e^{−kν(x)/2} / ∫ e^{−kν(y)/2} dy,   H(f‖Λ) = ∫ f(x) ln( f(x)/Λ(x) ) dx   (1)
Quantization 22
Proofs
Rigorous proofs of the traditional cases are painfully tedious, but
they follow Zador's original approach:
Step 1 Prove the result for a uniform density on a cube.
Step 2 Prove the result for a pdf that is piecewise constant on
disjoint cubes of equal volume.
Step 3 Prove the result for a general pdf on a cube by
approximating it by a piecewise constant pdf on small cubes.
Step 4 Extend the result for a general pdf on the cube to general
pdfs on ℝ^k by limiting arguments.
Quantization 23
Heuristic proofs developed by Gersho: assume the existence of a
quantizer point density function Λ(x) for a sequence of codebooks
of size N:

lim_{N→∞} (1/N) × (# reproduction vectors in a set S) = ∫_S Λ(x) dx, for all S,

where ∫_{ℝ^k} Λ(x) dx = 1.

Assume f_X(x) is smooth, the rate is large, and the minimum distortion
quantizer has cells S_i ≈ scaled, rotated, and translated copies of
the S∗ achieving

c_k = min_{tessellating convex polytopes S} M(S).
Quantization 24
M(S) = ( 1 / (k V(S)^{2/k}) ) ∫_S ( ‖x − centroid(S)‖² / V(S) ) dx

Then hand-waving approximations of integrals relate
N(q), D_f(q), H_f(q) ⇒

D_f(q) ≈ c_k E_f[ ( 1 / (N(q) Λ(X)) )^{2/k} ]

H_f(q) ≈ h(f) − E[ ln( 1 / (N(q) Λ(X)) ) ] = ln N(q) − H(f‖Λ)

Optimizing using Hölder's inequality or Jensen's inequality
yields the classic fixed-rate and variable-rate results.
Quantization 25
Can also apply this to the combined-constraint case to get

θ(f, λ, η, Λ) = (k/2) ln( k e c_k / 2 ) + (1 − η) h(f)
+ (k/2) ln( E_f[ (Λ(X))^{−2/k} ] ) + (1 − η) E_f[ ln Λ(X) ],   (2)

Goal is to minimize over Λ.

If we substitute θ_k(η) for the c_k term, this is equivalent to the
conjecture proposed here with Λ(x) defined by (1)!
This suggests connections between the rigorous and heuristic proofs,
which may aid insight and simplify the proofs.

Current state: a rigorous proof for the uniform density u and upper
bounds for the other steps. Converses are lacking.
Quantization 26
Variable-rate: Asymptotically Optimal Quantizers

Theorem ⇒ for λ_n ↓ 0 there is a λ_n-asymptotically optimal
sequence of quantizers q_n = (α_n, β_n, ℓ_n):

lim_{n→∞} ( E_f[d(X, β_n(α_n(X)))]/λ_n + E_f[ℓ_n(α_n(X))] + (k/2) ln λ_n ) = θ_k + h(f)

⇒ λ controls distortion and entropy separately [DCC 05]:

lim_{n→∞} 2 D_f(q_n)/(k λ_n) = 1 ,   lim_{n→∞} ( H(q_n) + (k/2) ln λ_n ) = h(f) + θ_k .
(Similar results for fixed-rate and combined constraints)
Quantization 27
Variable-rate codes: Mismatch
What if we design an a.o. sequence of quantizers q_n for pdf g
but apply it to f? ⇒ mismatch

The answer is in terms of relative entropy:

Theorem [Gray, Linder (2003)]. If q_n is λ_n-a.o. for g, then

lim_{n→∞} ( D_f(q_n)/λ_n + E_f[ℓ_n(α_n(X))] + (k/2) ln λ_n ) = h(f) + θ_k + H(f‖g).
Quantization 28
Worst case (variable-rate): asymptotic performance depends on
f only through h(f) ⇒ the worst case source given the constraints is
the maximum-h source; e.g., given mean m and covariance K, the worst
case is Gaussian.

If we use the worst case Gaussian for g to design codes and apply them
to f, the codes are also robust in the sense of Sakrison: Gaussian codes
yield the same performance on other sources with the same covariance.
The loss from the optimum is H(f‖g).

⇒ suggests using relative entropy (mismatch) as a
“distance” or “distortion measure” on pdfs in order to “quantize”
the space of pdfs and fit models to observed data.
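When f and g are both Gaussian, the mismatch H(f‖g) has the standard closed form, which is what makes quantizing the space of pdfs computable. A sketch (the setup is assumed; the formula itself is standard):

    import numpy as np

    # Sketch: relative entropy (nats) between Gaussians f = N(m0, K0)
    # and g = N(m1, K1), used as the mismatch "distance" between a
    # source model and a code-design model.

    def gauss_kl(m0, K0, m1, K1):
        k = len(m0)
        K1_inv = np.linalg.inv(K1)
        dm = m1 - m0
        return 0.5 * (np.trace(K1_inv @ K0) + dm @ K1_inv @ dm - k
                      + np.log(np.linalg.det(K1) / np.linalg.det(K0)))

    m0, K0 = np.zeros(2), np.eye(2)
    m1, K1 = np.array([-2.0, 2.0]), np.array([[1.0, 1.0], [1.0, 1.5]])
    print(gauss_kl(m0, K0, m1, K1))   # nats lost by designing for g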
Quantization 29
Classified Codes and Mixture Models
Problem with a single worst case: too conservative.

Alternative idea: fit a different Gauss source to distinct groups
of inputs instead of to the entire collection of inputs. Robust codes
for local behavior.

Suppose we have a partition S = {S_m; m ∈ I} of ℝ^k.

If the input falls in S_m, use codes designed for a Gaussian model
g_m = N(μ_m, K_m) ∈ M, the space of all nonsingular Gaussian
pdfs on ℝ^k.

Design asymptotically optimal codes q_{m,n}, n = 1, 2, . . ., for
each m and a common λ_n → 0
Quantization 30
Construct an overall composite quantizer sequence q_n by using
q_{m,n} if the input falls in S_m. Defining

f_m(x) = f(x)/p_m for x ∈ S_m, 0 otherwise;   p_m = ∫_{S_m} f(x) dx   ⇒

lim_{n→∞} ( D_f(q_n)/λ_n + E_f[ℓ_n(α_n(X))] + (k/2) ln λ_n )
= θ_k + h(f) + ∑_m p_m H(f_m‖g_m),

where the final sum is the mismatch distortion.
What is the best (smallest) value we can achieve over all partitions
S = {S_m; m ∈ I} and collections G = {g_m, p_m; m ∈ I}?

⇔ Quantize the space of Gauss models using the mismatch distortion.

The outcome of the optimization is a Gauss mixture.
An alternative to the Baum-Welch/EM algorithm.
Quantization 31
Can use the Lloyd algorithm to optimize.

The Lloyd conditions for the mismatch distortion imply:

Decoder (centroid): g_m = argmin_{g∈M} H(f_m‖g) = N(μ_m, K_m)

Encoder (partition): α(x) = argmin_m d(x, m) ≜
argmin_m ( −ln p_m + (1/2) ln |K_m| + (1/2)(x − μ_m)^t K_m^{−1} (x − μ_m) )

Length function: L(m) = −ln p_m

Can adjust the Lagrange weighting of ln p_m.

Gauss mixture vector quantization (GMVQ); a sketch follows below.
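A minimal sketch of this loop (assumed setup and initialization; the encoder cost is exactly the −ln p_m + ½ ln|K_m| + ½ quadratic term above, and the centroid step refits each Gaussian from its cell):

    import numpy as np

    # Sketch of the GMVQ Lloyd iteration. Encoder: assign each sample to
    # the component minimizing
    #   -ln p_m + 0.5 ln|K_m| + 0.5 (x - mu_m)^t K_m^{-1} (x - mu_m).
    # Centroid: Gaussian fit within each cell. Length: -ln p_m.

    def gmvq_lloyd(x, mus, Ks, ps, iters=20):
        m = len(ps)
        for _ in range(iters):
            cost = np.empty((len(x), m))
            for j in range(m):
                d = x - mus[j]
                Kinv = np.linalg.inv(Ks[j])
                quad = np.einsum('ni,ij,nj->n', d, Kinv, d)
                cost[:, j] = (-np.log(ps[j])
                              + 0.5 * np.log(np.linalg.det(Ks[j]))
                              + 0.5 * quad)
            idx = np.argmin(cost, axis=1)
            for j in range(m):
                sel = x[idx == j]
                if len(sel) > x.shape[1]:        # enough points to refit
                    ps[j] = len(sel) / len(x)
                    mus[j] = sel.mean(axis=0)
                    Ks[j] = np.cov(sel, rowvar=False)
        return mus, Ks, ps

On the toy two-component source of the next slide, initializing with two random data points, identity covariances, and equal weights is enough; the slides that follow show such a run converging in a dozen iterations.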
Quantization 32
Toy example: GMVQ design of GM
Source: a 2-component GM:

m_1 = (0, 0)^t,   K_1 = [ 1 0 ; 0 1 ],   p_1 = 0.8
m_2 = (−2, 2)^t,  K_2 = [ 1 1 ; 1 1.5 ], p_2 = 0.2

[Three scatter plots of samples drawn from the two-component mixture.]
Quantization 33
Data and Initialization:
[Scatter plots of the training data and the initial two-component codebook.]
Quantization 34
[Scatter plots of Lloyd iterations 1 through 12 (slides 35–46), showing
the two fitted components migrating toward the source components.]

Converged!

Quantization 47
The algorithm for fitting a Gauss mixture to a dataset provides
a supervised learning method for classifier design: given a
collection of classes of interest, design a GMVQ for each.

Given a new vector (e.g., an image):

Traditional approach: plug a density estimate into the optimal Bayes
classifier.

Best codebook approach: code the image using each of the
Gauss mixture VQs and select the class corresponding to the
quantizer with the smallest average mismatch distortion.
(A sketch follows below.)

Codebook matching: classification by compression.
“Nearest-neighbor” in flavor; need not (explicitly) estimate P_{Y|X}
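A sketch of the best-codebook rule (assumed setup; class_models is a hypothetical container holding the GMVQ parameters trained per class as in the earlier sketch):

    import numpy as np

    # Sketch of classification by compression: encode the blocks of a new
    # image with each class's GMVQ and pick the class whose codebook gives
    # the smallest average mismatch (Lagrangian) cost.

    def gmvq_cost(blocks, mus, Ks, ps):
        """Average best-component cost of blocks under one class's GMVQ."""
        costs = []
        for mu, K, p in zip(mus, Ks, ps):
            d = blocks - mu
            Kinv = np.linalg.inv(K)
            quad = np.einsum('ni,ij,nj->n', d, Kinv, d)
            costs.append(-np.log(p) + 0.5 * np.log(np.linalg.det(K))
                         + 0.5 * quad)
        return np.min(np.stack(costs, axis=1), axis=1).mean()

    def classify(blocks, class_models):
        """class_models: dict class name -> (mus, Ks, ps), one GMVQ per class."""
        return min(class_models,
                   key=lambda c: gmvq_cost(blocks, *class_models[c]))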
Quantization 48
Examples
Aerial images: man-made vs. natural. White: man-made; gray: natural.

Original 8 bpp gray-scale images with hand-labeled classified
images. GMVQ with probability of error 12.23%.
Quantization 49
Content-Addressable Databases/Image Retrieval

8 × 8 blocks, 9 training images for each of 50 classes, 500-image
database (cross-validated).

Precision (fraction of retrieved images that are relevant) ≈
recall (fraction of relevant images that are retrieved) ≈ 0.94.

Query examples:
Quantization 50
North Sea Gas Pipeline Data
normal pipeline image, field joint, longitudinal weld
— which is which???
Quantization 51
• GMVQ: best codebook wins
• MAP-GMVQ: plug the ECVQ-produced GM into MAP
• 1-NN
• MART (boosted classification tree)
• Regularized QDA (Gaussian class models)
• MAP-ECVQ (GM class models)
• MAP-EM: plug the EM-produced GM into MAP
Quantization 52
Method    | Recall (S / V / W)       | Precision (S / V / W)    | Accuracy
MART      | 0.9608 / 0.9000 / 0.8718 | 0.9545 / 0.9000 / 0.8947 | 0.9387
Reg. QDA  | 0.9869 / 1.0000 / 0.9487 | 0.9869 / 0.9091 / 1.0000 | 0.9811
1-NN      | 0.9281 / 0.7000 / 0.8462 | 0.9221 / 0.8750 / 1.0000 | 0.8915
MAP-ECVQ  | 0.9737 / 0.9000 / 0.9437 | 0.9739 / 0.9000 / 0.9487 | 0.9623
MAP-EM    | 0.9739 / 0.9000 / 0.9487 | 0.9739 / 0.9000 / 0.9487 | 0.9623
MAP-GMVQ  | 0.9935 / 0.8500 / 0.9487 | 0.9682 / 1.0000 / 0.9737 | 0.9717
GMVQ      | 0.9673 / 0.8000 / 0.9487 | 0.9737 / 0.7619 / 0.9487 | 0.9481
Class | Subclasses
S     | Normal, Osmosis Blisters, Black Lines, Small Black Corrosion Dots,
      | Grinder Marks, MFL Marks, Corrosion Blisters, Single Dots
V     | Longitudinal Welds
W     | Weld Cavity, Field Joint

Recall = Pr(declare as class i | class i);  Precision = Pr(class i | declare as class i)
Implementation is simple, convergence is fast,
classification performance is good.
Quantization 53
Parting Thoughts
• Quantization ideas useful for A/D, compression, classification,
detection, and modeling. Alternative to EM for Gauss mixture
design.
• Connections with Gersho's conjecture on asymptotically optimal
cell shapes (and its implications)
Gersho’s approach provides intuitive, but nonrigorous, proofs
of high rate results based on conjectured geometry of optimal
cell shapes.
Has many interesting implications that have never been proved:
– Existence of quantizer point density functions at high rate
(known for fixed-rate [Bucklew])
Quantization 54
– Optimal high-rate variable-rate quantizer point density is
uniform.
– At high rate optimal fixed-rate quantizers on u are maximum
entropy.
– Equality of fixed-rate and entropy constrained optimal
performance (ak = bk)
• Combined constraint approach provides
– a unified development of the traditional cases and a formulation
consistent with Gersho's approach, allowing rigorous proofs of
several of Gersho's steps and parts of his conjecture
– describes separate behavior of distortion, entropy, and
codebook size of asymptotically optimum quantizers
Quantization 55
– provides equivalent conditions for implications of Gersho’s
conjectures and a possible means of eventually
demonstrating or refuting some of these implications
E.g., if Gersho's conjecture is true, then θ_k(η) is constant with zero
derivative, a_k = b_k, and optimal fixed-rate codes have maximum
entropy.
– provides a new Lloyd optimality condition: pruning
– high rate results complement Shannon results:
the mixed constraint theory does not have a Shannon
counterpart — in high dimensions D(R) describes both
asymptotic codebook size and quantizer entropy (asymptotic
equipartition property)
– adding components to an optimization problem may help
find better local optima.
Quantization 56