
EE 372 Introductory Lecture

4 April 2007

Quantization, Compression, and Classification: Extracting discrete information

from a continuous world

Robert M. Gray, Information Systems Laboratory, Dept. of Electrical Engineering

Stanford, CA 94305, [email protected]

Underlying research partially supported by the National Science Foundation, Norsk Electro-Optikk, and Hewlett Packard Laboratories.


Introduction

[Diagram: continuous world → encoder E → discrete representations (011100100 · · · , 13) → decoder D → "Picture of a privateer"]

How well can you do it? What do you mean by "well"?

How do you do it?

A dictionary definition of quantization:


— the division of a quantity into a discrete number of small parts, often assumed to be integral multiples of a common quantity.

In general: the mapping of a continuous quantity into a discrete quantity.

Converting a continuous quantity into a discrete quantity causes loss of accuracy and information.

Goal: minimize the loss

but need to define “loss” first . . .

and there are other constraints.


Quantization

Old example: Round off real numbers to nearest integer; estimate densities by histograms [Sheppard (1898)]

In general, a quantizer q of a space A (e.g., ℝ^k, L_2([0, 1]^2)) consists of:

• An encoder E : A → I, I an index set, e.g., the nonnegative integers ⇔ partition S = {S_i; i ∈ I}: S_i = {x : E(x) = i}.

• A decoder D : I → C ⇔ codebook C = D(E(A)).

[Diagram: X → Encoder E → i = E(X) → Decoder D → D(E(X))]


⇒ q = (E, D) = (S, C) = {S_i, D(i); i ∈ I}

Examples. Scalar quantizer:

[Figure: the real line split into partition cells S_0, S_1, S_2, S_3, S_4, each containing its codebook reproduction point D(0), D(1), D(2), D(3), D(4).]
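
To make the encoder/partition/decoder/codebook pieces concrete, here is a minimal Python sketch of a fixed-rate uniform scalar quantizer (the names, step size, and clamping are my own illustration, not from the lecture); the empirical mean squared error estimates the average distortion D(q).

import numpy as np

def encode(x, delta=1.0, num_cells=8):
    # Encoder E: map each x to the index of its partition cell S_i.
    i = np.floor(x / delta).astype(int)
    return np.clip(i, 0, num_cells - 1)        # clamp the overload region

def decode(i, delta=1.0):
    # Decoder D: map each index i to its codeword (the cell midpoint).
    return (i + 0.5) * delta

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 8.0, size=100_000)        # samples of the source X
xhat = decode(encode(x))                       # reproduction D(E(X))
print(np.mean((x - xhat) ** 2), 1.0 / 12)      # empirical MSE vs. delta^2 / 12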

Two-dimensional example: centroidal Voronoi diagram — nearest-neighbor partition, Euclidean centroids. E.g., complex numbers, nearest mail boxes, sensors, repeaters, fish fortresses . . .


[Figure: territories of the male Tilapia mossambica; G. W. Barlow, Hexagonal Territories, Animal Behavior, Volume 22, 1974]

Three dimensions: Voronoi partition around spheres. http://www-math.mit.edu/dryfluids/gallery/


Performance: Distortion

Quality of a quantizer measured by the goodness of the resulting reproduction in comparison to the original.

Assume a distortion measure d(x, y) which measures the penalty/cost if an input x results in an output y = D(E(x)).

Small (large) distortion ⇔ good (bad) quality

Performance measured by average distortion: assume X is a random object (variable, vector, process, field) described by probability distribution P_X: pixel intensities, samples, features, fields, DCT or wavelet transform coefficients, moments, . . .

E.g., if A = ℝ^k, P_X ⇔ pdf f, or empirical distribution P_L from training/learning set L = {x_l; l = 1, 2, . . . , |L|}


Average distortion:

D(q) = D(E, D) = E[d(X, D(E(X)))] = ∫ d(x, D(E(x))) dP_X(x)

Examples: x, y ∈ ℝ^k

• Mean squared error (MSE): d(x, y) = ||x − y||^2 = ∑_{i=1}^{k} |x_i − y_i|^2

• Input- or output-weighted quadratic, B_x a positive definite matrix: d(x, y) = (x − y)^t B_x (x − y) or (x − y)^t B_y (x − y)

Used in perceptual coding, statistical classification.


• Functions of norms: d(x, y) = ρ(||x − y||_p), where the ℓ_p norm for p ≥ 1 is

||x − y||_p = ( ∑_{i=0}^{k−1} |x_i − y_i|^p )^{1/p}

and ρ is convex.

Mostly of mathematical interest, limited demonstrated usefulness in applications.

• Vector X might be a pmf; the distortion could be a distribution distance, relative entropy (Kullback-Leibler distance), etc.


• Bayes distortion, “task-driven” quantization.

[Diagram: Y → SENSOR P_{X|Y} → X → ENCODER E → E(X) = i → DECODER (D, κ) → X̂ = D(i), Ŷ = κ(i)]

E.g., X = Y + noise: remote or noisy source coding.

“denoise” by compression

Overall distortion: Bayes risk C(y, ŷ). Define the quantizer distortion

d(x, κ(i)) = E[C(Y, κ(i)) | X = x] ⇒ E[d(X, κ(E(X)))] = E[C(Y, κ(E(X)))]

Y discrete ⇒ classification, detection
Y continuous ⇒ estimation, regression


Performance: Rate

Want average distortion small, doable with big codebooks and small cells, but there is usually a cost (instantaneous rate) r(i) of choosing index i, with average rate R(q) = E[r(E(X))], e.g.,

• r(i) = ln |C|: fixed-rate, classic quantization

• r(i) = ℓ(i): variable-rate quantization; the length function ℓ satisfies the Kraft inequality ∑_i e^{−ℓ(i)} ≤ 1

• r(i) = − ln p(i), p(i) = Pr(E(X) = i): Shannon codelengths; R(q) = H(E(X)) = −∑_i p(i) ln p(i), entropy coding

• r(i) = (1 − η)ℓ(i) + η ln |C|, η ∈ [0, 1]: combined constraints, proposed by Zador (1982) to unify the separate cases.
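
A small Python sketch (my own, with made-up index probabilities) of the rate options above: Shannon codelengths, the resulting entropy rate, a Kraft-inequality check, and the fixed-rate cost ln |C|, all in nats.

import numpy as np

counts = np.array([500, 250, 125, 125])    # hypothetical index counts from E(X)
p = counts / counts.sum()                  # p(i) = Pr(E(X) = i)

ell = -np.log(p)                           # Shannon codelengths r(i) = -ln p(i)
entropy_rate = np.sum(p * ell)             # R(q) = H(E(X)) = -sum_i p(i) ln p(i)
kraft_sum = np.sum(np.exp(-ell))           # Kraft inequality: sum_i e^{-ell(i)} <= 1
fixed_rate = np.log(p.size)                # r(i) = ln|C|, the fixed-rate cost

print(entropy_rate, kraft_sum, fixed_rate) # entropy <= ln|C|, equality iff p uniform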


Quantizer Model

ℓ is best thought of as part of the quantizer, a design choice.

Alternative view: codeword weighting w(i) = e^{−ℓ(i)}, w ⇔ ℓ

ℓ(i) = ∞ ⇔ w(i) = 0; infinite cost, never used.

N(w) = |{i : w(i) > 0}| = |{i : ℓ(i) finite}| (codebook size)

Kraft inequality on ℓ ⇔ w a sub-pmf: w(i) ≥ 0, ∑_i w(i) ≤ 1

r(i) = (1 − η) ln(1/w(i)) + η ln N(w)

⇒ p must be absolutely continuous wrt w for finite rate.

Quantizer model: q = (E, D, ℓ) = (S, C, w)


Major Issues

• If we know the distributions,

  – what is the optimal tradeoff of distortion D(E, D) vs. rate R(E, w)? (Want both to be small!)

  – how do we design good codes?

• If we do not know the distributions, how do we use training/learning data to estimate tradeoffs and design good codes? (clustering, statistical or machine learning)


Applications

• Communications & signal processing: A/D conversion, data compression. Compression is required for efficient transmission and storage:

  – send more data in the available bandwidth

  – send the same data in less bandwidth

  – more users on the same bandwidth

[Graphic courtesy of Jim Storer.]


• Statistical clustering: grouping bird songs, designing Mao suits, grouping gene features, taxonomy.

• Placing points in space: mailboxes, wireless sensors

• Image and video classification/segmentation

• Speech recognition and speaker identification

• Numerical integration and optimal quadrature rules, e.g., how best to choose {S_i, D(i)} for ∫ g(x) f(x) dx ≈ ∑_i P_X(S_i) g(D(i))? (See the sketch after this list.)

• Approximating continuous probabilistic models by discrete models. Simulating continuous processes.

• Fitting and choosing Gauss mixture models for observed data.
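
As promised in the quadrature item above, a hedged Python sketch (my own toy setup, midpoint reproduction points rather than true centroids) of approximating ∫ g(x) f(x) dx by ∑_i P_X(S_i) g(D(i)) for a standard normal f and g(x) = x^2.

import numpy as np
from math import erf, sqrt

Phi = lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0)))   # standard normal cdf
g = lambda x: x ** 2                               # integrand g(x)

edges = np.linspace(-5.0, 5.0, 41)                 # cell boundaries defining S_i
probs = np.diff([Phi(t) for t in edges])           # P_X(S_i) for the standard normal f
reps = 0.5 * (edges[:-1] + edges[1:])              # reproduction points D(i) (midpoints)

approx = float(np.sum(probs * g(reps)))            # sum_i P_X(S_i) g(D(i))
print(approx)                                      # close to E[X^2] = 1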


Distortion-rate optimization: q = (E ,D, w)

D(q) = E[d(X,D(E(X)))] vs. R(q) = E[r(E(X))]

Minimize D(q) for R(q) ≤ R:  δ(R) = inf_{q: R(q) ≤ R} D(q)

Minimize R(q) for D(q) ≤ D:  r(D) = inf_{q: D(q) ≤ D} R(q)

Lagrangian approach: ρ_λ(x, i) = d(x, D(i)) + λ r(i), λ ≥ 0

Minimize the Lagrangian distortion

ρ(f, λ, η, q) ≜ E[ρ_λ(X, E(X))] = D(q) + λ R(q) = E[d(X, D(E(X)))] + λ [(1 − η) E[ℓ(E(X))] + η ln N(ℓ)]

ρ(f, λ, η) = inf_q ρ(f, λ, η, q)

Traditional cases: fixed-rate η = 1, variable-rate η = 0
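
A sketch (my own toy setup) of the Lagrangian view: evaluate D(q) + λR(q) for a small family of uniform quantizers of a Gaussian source; as λ shrinks, the minimizer moves to higher-rate, lower-distortion codes, tracing the distortion-rate tradeoff.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200_000)                        # Gaussian source samples

def dist_and_rate(x, n):
    # Uniform n-level quantizer on [-4, 4]: return (MSE distortion, entropy rate in nats).
    edges = np.linspace(-4.0, 4.0, n + 1)
    idx = np.clip(np.digitize(x, edges) - 1, 0, n - 1)
    codebook = 0.5 * (edges[:-1] + edges[1:])
    D = np.mean((x - codebook[idx]) ** 2)
    p = np.bincount(idx, minlength=n) / x.size
    R = -np.sum(p[p > 0] * np.log(p[p > 0]))
    return D, R

family = {n: dist_and_rate(x, n) for n in (2, 4, 8, 16, 32, 64)}
for lam in (1.0, 0.1, 0.01, 0.001):
    best = min(family, key=lambda n: family[n][0] + lam * family[n][1])
    D, R = family[best]
    print(lam, best, D, R)                          # smaller lambda favors higher rate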


Three Theories of Quantization

• Rate-distortion theory: Shannon distortion-rate function D(R) ≤ δ(R), achievable in the asymptopia of large dimension k and fixed rate R. [Shannon (1949, 1959), Gallager (1978)]

• Nonasymptotic (exact) results: necessary conditions for optimal codes ⇒ iterative design algorithms ⇔ statistical clustering. [Steinhaus (1956), Lloyd (1957)]

• High rate theory: optimal performance in the asymptopia of fixed dimension k and large rate R. [Bennett (1948), Lloyd (1957), Zador (1963), Gersho (1979), Bucklew, Wise (1982)]


Lloyd Optimality Properties: q = (E, D, ℓ)

Any component can be optimized for the others.

Encoder (minimum distortion): E(x) = argmin_i [d(x, D(i)) + λ r(i)]

Decoder (Lloyd centroid): D(i) = argmin_y E[d(X, y) | E(X) = i]

Codebook size (prune): remove i if Pr(E(X) = i) = 0

Length function (Shannon codelength): ℓ(i) = − ln Pr(E(X) = i)

Iterate the steps ⇒ Lloyd clustering algorithm (rediscovered as k-means, grouped coordinate descent, alternating optimization, principal points)

If fixed length and squared error ⇒ centroidal Voronoi partition, with centroid = D(i) = E[X | E(X) = i] (the MMSE estimate)
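
A minimal Python sketch (my own illustration) of the two classic conditions for squared error and fixed rate: the minimum-distortion (nearest-codeword) encoder and the Lloyd centroid (conditional mean) decoder; iterating the pair is the k-means flavor of the algorithm.

import numpy as np

def min_distortion_encoder(x, codebook):
    # E(x) = argmin_i d(x, D(i)): assign each sample to its nearest codeword.
    d2 = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def centroid_decoder(x, idx, codebook):
    # D(i) = E[X | E(X) = i]: replace each codeword by the centroid of its cell.
    new = codebook.copy()
    for i in range(len(codebook)):
        if np.any(idx == i):
            new[i] = x[idx == i].mean(axis=0)       # empty cells are left unchanged
    return new

rng = np.random.default_rng(1)
x = rng.normal(size=(5_000, 2))
codebook = x[:4].copy()                             # crude initial codebook
for _ in range(10):                                 # a few Lloyd iterations
    idx = min_distortion_encoder(x, codebook)
    codebook = centroid_decoder(x, idx, codebook)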


Lloyd conditions ⇒
• Encoder partition S ⇒ optimal decoder D = D(S) and length function ℓ = ℓ(S)
• Decoder D + length function ℓ ⇒ optimal encoder partition S = S(D, ℓ)

Equivalent optimization problems:

ρ(f, λ, η) = inf_{S,D,w} ρ(f, λ, η, S, D, w) = inf_S ρ(f, λ, η, S) = inf_{D,w} ρ(f, λ, η, D, w)

A good quantizer is characterized either by the partition S or by the weighted decoder (D, w). Suggests another Lloyd-style optimality condition.


Subcodes and Supercodes, Pruning and Growing

A partition S′ is a subpartition of a partition S if every atom of S′ is a union of atoms of S: S refines S′, or S′ ⊂ S.

A quantizer q′ determined by S′ is a partition subcode of the quantizer q determined by S if S′ is a subpartition of S. q is a partition supercode of q′.

A weighted codebook (D′, w′) is a codebook subcode of a weighted codebook (D, w) if {D′(i), i : w′(i) > 0} ⊂ {D(i), i : w(i) > 0} and w′(i) = α w(i), α ∈ (0, 1). Conversely, (D, w) is a codebook supercode of (D′, w′).

Generally the smaller code has larger distortion and smaller rate.


Provides an additional Lloyd-style necessary condition:

Codebook size: if a quantizer q is optimal, there can be no subcode or supercode q′ for which D(q′) + λ R(q′) < D(q) + λ R(q) (pruning/growing).

Generalization of zero-probability cell pruning.

Can test likely subcodes/supercodes for possible improvement.

Can also use to increase/decrease λ and grow or prune code.

Combining everything ⇒ Lloyd clustering algorithm

A descent algorithm, so distortion converges.


Step 0: Initialization. Given initial (D_0, w_0), compute ρ_0 = ρ(S(D_0, w_0), D_0, w_0). Set m = 1.

Step 1: Partition improvement. Given (D_{m−1}, w_{m−1}), form an optimum partition S_m = S(D_{m−1}, w_{m−1}).

Step 2: Weighted codebook improvement. Given the partition S_m, form an optimum weighted codebook (D_m, w_m) = (D(S_m), w(S_m)). Compute ρ_m = ρ(S_m, D_m, w_m).

Step 3: Test. Test ρ_{m−1} − ρ_m. If small enough, go to Step 4. Else set m = m + 1 and go to Step 1.

Step 4: Grow/Prune. Test sub/super codes for possible improvement, for fixed λ or for changed λ. Quit or go to Step 1.
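
A hedged Python sketch of Steps 0-3 above for squared error with Shannon codelengths ℓ(i) = −ln p(i) (variable rate; the grow/prune step is omitted); the function and variable names are my own.

import numpy as np

def lloyd_ecvq(x, m=8, lam=0.05, tol=1e-6, max_iter=200, seed=0):
    # Step 0: initial decoder D_0 (random training points) and length function ell_0.
    rng = np.random.default_rng(seed)
    codebook = x[rng.choice(len(x), size=m, replace=False)].copy()
    ell = np.full(m, np.log(m))
    prev_rho = np.inf
    for _ in range(max_iter):
        # Step 1: optimum partition S_m: E(x) = argmin_i ||x - D(i)||^2 + lam * ell(i).
        d2 = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        idx = (d2 + lam * ell).argmin(axis=1)
        # Step 2: optimum weighted codebook: Lloyd centroids and ell(i) = -ln p(i);
        # cells with p(i) = 0 get infinite length, i.e., they are pruned.
        p = np.bincount(idx, minlength=m) / len(x)
        for i in np.flatnonzero(p > 0):
            codebook[i] = x[idx == i].mean(axis=0)
        ell = np.where(p > 0, -np.log(np.where(p > 0, p, 1.0)), np.inf)
        keep = p > 0
        rho = (np.mean(((x - codebook[idx]) ** 2).sum(axis=1))
               + lam * np.sum(p[keep] * ell[keep]))     # rho_m = D + lam * R
        # Step 3: test the descent and stop when the improvement is small.
        if prev_rho - rho < tol:
            break
        prev_rho = rho
    return codebook, ell, rho

rng = np.random.default_rng(2)
data = rng.normal(size=(10_000, 2))
codebook, ell, rho = lloyd_ecvq(data, m=16, lam=0.1)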


Bayes quantization

If we use the Bayes distortion for statistical classification or regression with a rate constraint, e.g., with the Hamming Bayes cost E[C(Y, κ(E(X)))] = P_e, then

κ(i) = argmax_y Pr(Y = y | E(X) = i)

E(x) = argmax_i [Pr(Y = κ(i) | X = x) − λ r(i)]

Need to estimate P_{Y|X} via a model or binning/quantization of X.

Can add constraints, e.g., constrain the encoder to have a low-complexity tree structure (CART/BFOS/TSVQ).
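
A sketch (toy numbers, assumed posterior model for P_{Y|X}) of the rate-constrained Bayes encoder above with Hamming cost: choose the index whose attached class decision κ(i) has the largest posterior probability, discounted by λ r(i).

import numpy as np

def bayes_encoder(posterior, kappa, r, lam):
    # E(x) = argmax_i [ Pr(Y = kappa(i) | X = x) - lam * r(i) ]
    # posterior: (n, num_classes) rows of Pr(Y = y | X = x)
    # kappa:     (m,) class decision attached to each index i
    # r:         (m,) instantaneous rate of each index i
    score = posterior[:, kappa] - lam * r
    return score.argmax(axis=1)

posterior = np.array([[0.7, 0.2, 0.1],       # two samples, three classes
                      [0.1, 0.1, 0.8]])
kappa = np.array([0, 1, 2, 2])               # four indices; two map to class 2
r = np.array([1.0, 1.0, 1.0, 2.0])           # e.g., codelengths in nats
print(bayes_encoder(posterior, kappa, r, lam=0.1))   # chosen index per sample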


High Rate (Resolution) Theory

Traditional form (Zador, Gersho, Bucklew, Wise): pdf f, MSE

lim_{R→∞} e^{2R/k} δ(R) = a_k ||f||_{k/(k+2)}   (fixed-rate, R = ln N)
lim_{R→∞} e^{2R/k} δ(R) = b_k e^{2h(f)/k}   (variable-rate, R = H), where

h(f) = −∫ f(x) ln f(x) dx,   ||f||_{k/(k+2)} = ( ∫ f(x)^{k/(k+2)} dx )^{(k+2)/k}

a_k ≥ b_k are Zador constants (they depend on k and d, not on f!)

a_1 = b_1 = 1/12,  a_2 = 5/(18√3),  b_2 = ?,  a_k, b_k = ? for k ≥ 3

lim_{k→∞} a_k = lim_{k→∞} b_k = 1/(2πe)
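
A quick numerical sanity check (my own, for k = 1 and a uniform source on [0, 1], where ||f||_{1/3} = 1): e^{2R} δ(R) for an N-level uniform quantizer should approach a_1 = 1/12 as R = ln N grows.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(size=1_000_000)                   # X uniform on [0, 1], k = 1

for N in (4, 16, 64, 256):
    idx = np.minimum((x * N).astype(int), N - 1)  # N-level uniform quantizer
    xhat = (idx + 0.5) / N                        # midpoint codewords
    D = np.mean((x - xhat) ** 2)                  # empirical distortion delta(R)
    R = np.log(N)                                 # fixed rate R = ln N (nats)
    print(N, np.exp(2 * R) * D)                   # approaches 1/12 ~= 0.0833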


Lagrangian form.

Variable-rate (η = 0):

lim_{λ→0} [ inf_q ( ρ(f, λ, 0, q)/λ ) + (k/2) ln λ ] = θ_k + h(f)

θ_k ≜ inf_{λ>0} [ inf_q ( ρ(u, λ, 0, q)/λ ) + (k/2) ln λ ] = (k/2) ln(2e b_k / k)

Fixed-rate (η = 1):

lim_{λ→0} [ inf_q ( ρ(f, λ, 1, q)/λ ) + (k/2) ln λ ] = ψ_k + ln ||f||_{k/(k+2)}^{k/2}

ψ_k ≜ inf_{λ>0} [ inf_q ( ρ(u, λ, 1, q)/λ ) + (k/2) ln λ ] = (k/2) ln(2e a_k / k)

Both have the form: the optimum for the uniform density u plus a function of f.


Proofs

Rigorous proofs of the traditional cases are painfully tedious, but follow Zador's original approach.

Heuristic proofs developed by Gersho are based on the assumptions of

• the existence of an asymptotically optimal quantizer point density function Λ(x)

• the existence of an asymptotically optimal quantizer partition cell shape


Then hand-waving approximations of integrals relate N(q), D_f(q), H_f(q) ⇒

D_f(q) ≈ c_k E_f[ (1/(N(q)Λ(X)))^{2/k} ]

H_f(q(X)) ≈ h(X) − E[ ln(1/(N(q)Λ(X))) ] = ln N(q) − ∫ f(x) ln(f(x)/Λ(x)) dx,

where the last integral is H(f‖Λ), the relative entropy.

Optimizing using Hölder's inequality or Jensen's inequality yields the classic fixed-rate and variable-rate results.
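
For the variable-rate case the Jensen step can be spelled out; a sketch of the heuristic argument (my own rendering of the standard calculation, using the two approximations above):

\[
D_f(q) \approx c_k\, E_f\!\Big[\big(N(q)\Lambda(X)\big)^{-2/k}\Big]
\;\ge\; c_k \exp\!\Big(-\tfrac{2}{k}\, E_f\big[\ln\big(N(q)\Lambda(X)\big)\big]\Big)
\]
by Jensen's inequality ($e^{-2t/k}$ is convex in $t$), while
\[
H_f(q(X)) \approx h(f) + E_f\big[\ln\big(N(q)\Lambda(X)\big)\big]
\;\Longrightarrow\;
D_f(q) \gtrsim c_k\, e^{2h(f)/k}\, e^{-2H_f(q(X))/k},
\]
with equality in the Jensen step iff $N(q)\Lambda(X)$ is constant on the support of $f$, i.e., the optimal variable-rate point density is uniform; this is the variable-rate Zador form with the constant playing the role of $b_k$.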


Can also apply to the combined-constraint case to get a conjectured solution for general η.

There appear to be connections between the rigorous and heuristic proofs, which may help insight and simplify the proofs.

Gersho’s conjectures and approximations have many yet unprovedbut commonly believed implications,

e.g., ak = bk all k, the best fixed-rate high rate codes have maximumentropy, asymptotically optimal quantizer point density functions exist.

Henceforth focus on variable-rate case, η = 0.


Mismatch distortion

High-rate results ⇒ for small λ the optimal quantizer satisfies

ρ(f, λ, 0, q)/λ + (k/2) ln λ ≈ θ_k + h(f)

Worst case: performance depends on f only through h(f) ⇒ the worst-case source given constraints is the maximum-entropy source, e.g., given mean m and covariance K, the worst case is Gaussian.

Mismatch: design the code for pdf g, but apply it to f; for small λ

ρ(f, λ, 0, q)/λ + (k/2) ln λ ≈ h(f) + θ_k (optimal for f) + H(f‖g) (mismatch).


Put together: if g is Gaussian with the same mean and covariance as f, then

ρ(f, λ, 0, q)/λ − ρ(g, λ, 0, q)/λ ≈ 0

⇒ the performance of the code designed for g asymptotically equals that when the code is applied to f: a robust code in the information-theoretic sense! (≈ "generalization guarantees" in machine learning)

Robust and worst-case code ⇒ H(f‖g) measures the mismatch distance from f to the Gaussian g with the same second-order moments. I.e., for a pdf f, how much performance is lost relative to the optimum for f if we use the optimal code for g on f?

Problem with single worst case: too conservative

Gaussian can be a very bad fit, very suboptimal, e.g., speech.


Divide & conquer: composite code

Alternative idea: fit a different Gauss source to distinct groups of inputs instead of the entire collection of inputs. Robust codes for local behavior.

Partition the input space ℝ^k into S = {S_m; m = 1, . . . , M}, assign a Gaussian component g_m = N(µ_m, K_m) to each S_m, and construct an optimal quantizer q_m = (E_m, D_m, ℓ_m) for each g_m:

ρ(f, λ, 0, q)/λ + (k/2) ln λ ≈ θ_k + h(f) (optimal for f) + ∑_m p_m H(f_m‖g_m) (mismatch distortion)

where f_m(x) = f(x)/p_m for x ∈ S_m, and p_m = ∫_{S_m} f(x) dx.


What is the best (smallest) value one can make ∑_m p_m H(f_m‖g_m) over all partitions S = {S_m; m ∈ I} and collections G = {g_m, p_m; m ∈ I}?

⇔ Quantize ℝ^k into the space of Gauss models using the quantizer mismatch distortion

d_QM(x, m) = − ln p_m + (1/2) ln |K_m| + (1/2)(x − µ_m)^t K_m^{−1} (x − µ_m)

Similar to MDI distortion, Itakura-Saito distortion, log-likelihood distortion for Gaussians, Stein's spectral distortion, . . .

Codebook {(µ_m, K_m, p_m)} ↔ Gauss mixture model.

Can use the Lloyd algorithm to optimize: Gauss mixture vector quantization (GMVQ).
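
A hedged Python sketch (illustrative names, not the lecture's code) of the minimum-mismatch-distortion encoder: assign x to the Gauss component m that minimizes d_QM(x, m), which is the nearest-model step of a Lloyd iteration for GMVQ.

import numpy as np

def d_qm(x, mu, K, p):
    # d_QM(x, m) = -ln p_m + (1/2) ln|K_m| + (1/2)(x - mu_m)^t K_m^{-1} (x - mu_m)
    diff = x - mu
    maha = diff @ np.linalg.solve(K, diff)
    return -np.log(p) + 0.5 * np.log(np.linalg.det(K)) + 0.5 * maha

def gmvq_encode(x, components):
    # Nearest-model encoder: pick the component with the smallest mismatch distortion.
    costs = [d_qm(x, mu, K, p) for (mu, K, p) in components]
    return int(np.argmin(costs))

# toy Gauss mixture codebook {(mu_m, K_m, p_m)} with two components
components = [
    (np.zeros(2), np.eye(2), 0.7),
    (np.array([3.0, 3.0]), 2.0 * np.eye(2), 0.3),
]
print(gmvq_encode(np.array([2.5, 2.8]), components))   # selects the second component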


• GMVQ converts a training sequence into a GM model

• The derivation was based on a high-rate analysis of quantizing the original source using a collection of codes designed for local Gaussian models. Minimizing the overall MSE ⇔ GMVQ.

• Alternative to Baum-Welch/EM algorithm for GM design.

Advantages over EM:

• Lower complexity (about 1/2).

• Faster convergence.

• Incorporates compression ⇒ can use low-complexity search, e.g., TSVQ, useful for large dimensions, large datasets, networks.


Examples

Aerial images: man-made vs. natural. [Images: white = man-made, gray = natural; original 8 bpp gray-scale images and hand-labeled classified images.] GMVQ with probability of error 12.23%.


Content-Addressable Databases/Image Retrieval

8 × 8 blocks, 9 training images for each of 50 classes, 500-image database (cross-validated).
Precision (fraction of retrieved images that are relevant) ≈ recall (fraction of relevant images that are retrieved) ≈ .94.

[Figure: query examples.]


Texture Classification

Examples from the Brodatz database. Compared well with Gauss mixture random field methods.


North Sea Gas Pipeline Data

[Images: normal pipeline image, field joint, longitudinal weld. Which is which?]


Methods compared:

• GMVQ: best codebook wins

• MAP-GMVQ: plug the GMVQ-produced GMM into a MAP classifier

• MAP-EM: plug the EM-produced GMM into a MAP classifier

• 1-NN

• MART (boosted classification tree)

• Regularized QDA (Gaussian class models)

• MAP-ECVQ (GM class models)


             Recall                  Precision               Accuracy
Method       S       V       W       S       V       W
MART         0.9608  0.9000  0.8718  0.9545  0.9000  0.8947  0.9387
Reg. QDA     0.9869  1.0000  0.9487  0.9869  0.9091  1.0000  0.9811
1-NN         0.9281  0.7000  0.8462  0.9221  0.8750  1.0000  0.8915
MAP-ECVQ     0.9737  0.9000  0.9437  0.9739  0.9000  0.9487  0.9623
MAP-EM       0.9739  0.9000  0.9487  0.9739  0.9000  0.9487  0.9623
MAP-GMVQ     0.9935  0.8500  0.9487  0.9682  1.0000  0.9737  0.9717
GMVQ         0.9673  0.8000  0.9487  0.9737  0.7619  0.9487  0.9481

Class  Subclasses
S      Normal, Osmosis Blisters, Black Lines, Small Black Corrosion Dots, Grinder Marks, MFL Marks, Corrosion Blisters, Single Dots
V      Longitudinal Welds
W      Weld Cavity, Field Joint

Recall = Pr(declare as class i | class i)
Precision = Pr(class i | declare as class i)
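
A small Python sketch (my own, with made-up counts) of how per-class recall and precision come from a confusion matrix whose entry C[i, j] counts class-i items declared as class j.

import numpy as np

C = np.array([[95,  3,  2],      # rows: true class, columns: declared class
              [ 4, 90,  6],
              [ 1,  2, 97]])

recall = np.diag(C) / C.sum(axis=1)      # Pr(declare as class i | class i)
precision = np.diag(C) / C.sum(axis=0)   # Pr(class i | declare as class i)
accuracy = np.trace(C) / C.sum()
print(recall, precision, accuracy)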

Implementation is simple, convergence is fast, classification performance is good.


Closing Introductory Thoughts

• Quantization ideas useful for A/D conversion, quantizer error analysis, data compression, statistical classification, modeling, and estimation with a rate constraint.

• Lloyd-designed GMVQ alternative to EM for Gauss mixture design in statistical classification and clustering.

• Gersho’s approach provides intuitive, but nonrigorous, proofs of highrate results based on conjectured geometry of optimal cell shapes.

New tools may help develop rigorous proof reflecting simple heuristics& means of proving/disproving implications of Gersho’s conjecture.

• Basic tools include information theory, signal processing, optimization.
