IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 55, NO. 5, MAY 2009 1945

Joint Universal Lossy Coding and Identification of Stationary Mixing Sources With General Alphabets

Maxim Raginsky, Member, IEEE

Abstract—In this paper, we consider the problem of joint universal variable-rate lossy coding and identification for parametric classes of stationary $\beta$-mixing sources with general (Polish) alphabets. Compression performance is measured in terms of Lagrangians, while identification performance is measured by the variational distance between the true source and the estimated source. Provided that the sources are mixing at a sufficiently fast rate and satisfy certain smoothness and Vapnik–Chervonenkis (VC) learnability conditions, it is shown that, for bounded metric distortions, there exist universal schemes for joint lossy compression and identification whose Lagrangian redundancies converge to zero as $O(\sqrt{V_n \log n / n})$ as the block length $n$ tends to infinity, where $V_n$ is the VC dimension of a certain class of decision regions defined by the $n$-dimensional marginal distributions of the sources; furthermore, for each $n$, the decoder can identify the $n$-dimensional marginal of the active source up to a ball of radius $O(\sqrt{V_n \log n / n})$ in variational distance, eventually with probability one. The results are supplemented by several examples of parametric sources satisfying the regularity conditions.

Index Terms—Learning, minimum-distance density estimation, two-stage codes, universal vector quantization, Vapnik–Chervonenkis (VC) dimension.

I. INTRODUCTION

IT is well known that lossless source coding and statistical modeling are complementary objectives. This fact is captured by the Kraft inequality (see [1, Sec. 5.2]), which provides a correspondence between uniquely decodable codes and probability distributions on a discrete alphabet. If one has full knowledge of the source statistics, then one can design an optimal lossless code for the source, and vice versa. However, in practice it is unreasonable to expect that the source statistics are known precisely, so one has to design universal schemes that perform asymptotically optimally within a given class of sources. In universal coding, too, as Rissanen has shown in [2] and [3], the coding and modeling objectives can be accomplished jointly: given a sufficiently regular parametric family of discrete-alphabet sources, the encoder can acquire the source statistics via maximum-likelihood estimation on a sufficiently

Manuscript received December 19, 2006; revised January 08, 2009. Current version published April 22, 2009. This work was supported by the Beckman Institute Fellowship. The material in this paper was presented in part at the IEEE International Symposium on Information Theory (ISIT), Nice, France, June 2007.

The author was with the Beckman Institute for Advanced Science and Technology, University of Illinois, Urbana, IL 61801 USA. He is now with the Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708 USA (e-mail: [email protected]).

Communicated by M. Effros, Associate Editor for Source Coding.
Digital Object Identifier 10.1109/TIT.2009.2015987

long data sequence and use this knowledge to select an appropriate coding scheme. Even in nonparametric settings (e.g., the class of all stationary ergodic discrete-alphabet sources), universal schemes such as Ziv–Lempel [4] amount to constructing a probabilistic model for the source. In the reverse direction, Kieffer [5] and Merhav [6], among others, have addressed the problem of statistical modeling (parameter estimation or model identification) via universal lossless coding.

Once we consider lossy coding, though, the relationship between coding and modeling is no longer so simple. On the one hand, having full knowledge of the source statistics is certainly helpful for designing optimal rate-distortion codebooks. On the other hand, apart from some special cases [e.g., for independent and identically distributed (i.i.d.) Bernoulli sources and the Hamming distortion measure or for i.i.d. Gaussian sources and the squared-error distortion measure], it is not at all clear how to extract a reliable statistical model of the source from its reproduction via a rate-distortion code (although, as shown recently by Weissman and Ordentlich [7], the joint empirical distribution of the source realization and the corresponding codeword of a "good" rate-distortion code converges to the distribution solving the rate-distortion problem for the source). This is not a problem when the emphasis is on compression, but there are situations in which one would like to compress the source and identify its statistics at the same time. For instance, in indirect adaptive control (see, e.g., [8, Ch. 7]) the parameters of the plant (the controlled system) are estimated on the basis of observation, and the controller is modified accordingly. Consider the discrete-time stochastic setting, in which the plant state sequence is a random process whose statistics are governed by a finite set of parameters. Suppose that the controller is geographically separated from the plant and connected to it via a noiseless digital channel with a given capacity in bits per use. Then, given the time horizon, the objective is to design an encoder and a decoder for the controller to obtain reliable estimates of both the plant parameters and the plant state sequence from the possible outputs of the decoder.

To state the problem in general terms, consider an information source emitting a sequence of random variables taking values in an alphabet $\mathcal{X}$. Suppose that the process distribution of the source is not specified completely, but it is known to be a member of some parametric class $\{P_\theta : \theta \in \Theta\}$. We wish to answer the following two questions:

1) Is the class $\{P_\theta : \theta \in \Theta\}$ universally encodable with respect to a given single-letter distortion measure, by codes with a given structure (e.g., all fixed-rate block codes with a given per-letter rate, all variable-rate block codes, etc.)?


In other words, does there exist a scheme that is asymptotically optimal for each $\theta \in \Theta$?

2) If the answer to question 1) is positive, can the codes be constructed in such a way that the decoder can not only reconstruct the source, but also identify its process distribution $P_\theta$, in an asymptotically optimal fashion?

In previous work [9], [10], we have addressed these two questions in the context of fixed-rate lossy block coding of stationary memoryless (i.i.d.) continuous-alphabet sources with parameter space a bounded subset of $\mathbb{R}^k$ for some finite $k$. We have shown that, under appropriate regularity conditions on the distortion measure and on the source models, there exist joint universal schemes for lossy coding and source identification whose redundancies (that is, the gap between the actual performance and the theoretical optimum given by the Shannon distortion-rate function) and source estimation fidelity both converge to zero as $O(\sqrt{\log n / n})$, as the block length $n$ tends to infinity. The code operates by coding each block with the code matched to the source with the parameters estimated from the preceding block. Comparing this convergence rate to the $O(\log n / n)$ convergence rate, which is optimal for redundancies of fixed-rate lossy block codes [11], we see that there is, in general, a price to be paid for doing compression and identification simultaneously. Furthermore, the constant hidden in the $O(\cdot)$ notation increases with the "richness" of the model class $\{P_\theta : \theta \in \Theta\}$, as measured by the Vapnik–Chervonenkis (VC) dimension [12] of a certain class of measurable subsets of the source alphabet associated with the sources.

The main limitation of the results of [9] and [10] is the i.i.d. assumption, which is rather restrictive as it excludes many practically relevant model classes (e.g., autoregressive sources, or Markov and hidden Markov processes). Furthermore, the assumption that the parameter space is bounded may not always hold, at least in the sense that we may not know the diameter of $\Theta$ a priori. In this paper, we relax both of these assumptions and study the existence and the performance of universal schemes for joint lossy coding and identification of stationary sources satisfying a mixing condition, when the sources are assumed to belong to a parametric model class $\{P_\theta : \theta \in \Theta\}$, with $\Theta$ being an open subset of $\mathbb{R}^k$ for some finite $k$. Because the parameter space is not bounded, we have to use variable-rate codes with countably infinite codebooks, and the performance of the code is assessed by a composite Lagrangian functional [13], which captures the tradeoff between the expected distortion and the expected rate of the code. Our result is that, under certain regularity conditions on the distortion measure and on the model class, there exist universal schemes for joint lossy source coding and identification such that, as the block length $n$ tends to infinity, the gap between the actual Lagrangian performance and the optimal Lagrangian performance achievable by variable-rate codes at that block length, as well as the source estimation fidelity at the decoder, converge to zero as $O(\sqrt{V_n \log n / n})$, where $V_n$ is the VC dimension of a certain class of decision regions induced by the collection of the $n$-dimensional marginals of the source process distributions.

This result shows very clearly that the price to be paid for universality, in terms of both compression and identification, grows with the richness of the underlying model class, as captured by the VC dimension sequence $\{V_n\}$. The richer the model class, the harder it is to learn, which affects the compression performance of our scheme because we use the source parameters learned from past data to decide how to encode the current block. Furthermore, comparing the rate at which the Lagrangian redundancy decays to zero under our scheme with the $O(\log n / n)$ result of Chou, Effros, and Gray [14], whose universal scheme is not aimed at identification, we immediately see that, in ensuring to satisfy the twin objectives of compression and modeling, we inevitably sacrifice some compression performance.

The paper is organized as follows. Section II introduces notation and basic concepts related to sources, codes, and VC classes. Section III lists and discusses the regularity conditions that have to be satisfied by the source model class, and contains the statement of our result. The result is proved in Section IV. Next, in Section V, we give three examples of parametric source families (namely, i.i.d. Gaussian sources, Gaussian autoregressive sources, and hidden Markov processes), which fit the framework of this paper under suitable regularity conditions. We conclude in Section VI and outline directions for future research. Finally, the Appendix contains some technical results on Lagrange-optimal variable-rate quantizers.

II. PRELIMINARIES

A. Sources

In this paper, a source is a discrete-time stationary ergodic random process $\{X_i\}_{i \in \mathbb{Z}}$ with alphabet $\mathcal{X}$. We assume that $\mathcal{X}$ is a Polish space (i.e., a complete separable metric space¹) and equip it with its Borel $\sigma$-field. For any pair of indices $i \le j$, let $X_i^j$ denote the segment $(X_i, \ldots, X_j)$ of the process. If $P$ is the process distribution, then we let $\mathbb{E}_P$ denote expectation with respect to $P$, and let $P^n$ denote the marginal distribution of $X_1^n$. Whenever $P$ carries a subscript, e.g., $P_\theta$, we write $\mathbb{E}_\theta$ instead. We assume that there exists a fixed $\sigma$-finite measure $\mu$ on $\mathcal{X}$, such that the $n$-dimensional marginal of any process distribution of interest is absolutely continuous with respect to the product measure $\mu^n$, for all $n$. We denote the corresponding densities by $p^n_\theta$. To avoid notational clutter, we omit the superscript $n$ from $P^n_\theta$, $p^n_\theta$, and $\mathbb{E}^n_\theta$ whenever it is clear from the argument.

Given two probability measures $P$ and $Q$ on a measurable space $(\Omega, \mathcal{F})$, the variational distance between them is defined by
$$d(P, Q) \triangleq \sup_{\{A_i\}} \sum_i |P(A_i) - Q(A_i)|$$
where the supremum is over all finite $\mathcal{F}$-measurable partitions $\{A_i\}$ of $\Omega$ (see, e.g., [15, Sec. 5.2]). If $p$ and $q$ are the densities of $P$ and $Q$, respectively, with respect to a dominating measure $\mu$, then we can write
$$d(P, Q) = \int |p - q|\, d\mu.$$

¹The canonical example is the Euclidean space $\mathbb{R}^d$ for some $d \ge 1$.


A useful property of the variational distance is that, for any bounded measurable function $f$, $|\mathbb{E}_P f - \mathbb{E}_Q f| \le \|f\|_\infty\, d(P, Q)$. When $P^n$ and $Q^n$ are the $n$-dimensional marginals of $P$ and $Q$, respectively, we write $d(P^n, Q^n)$ for the corresponding variational distance. If $\mathcal{G}$ is a $\sigma$-subfield of $\mathcal{F}$, we define the variational distance between $P$ and $Q$ with respect to $\mathcal{G}$ by
$$d_{\mathcal{G}}(P, Q) \triangleq \sup_{\{A_i\}} \sum_i |P(A_i) - Q(A_i)|$$
where the supremum is over all finite $\mathcal{G}$-measurable partitions $\{A_i\}$ of $\Omega$. Given an $\epsilon > 0$ and a probability measure $P$, the variational ball of radius $\epsilon$ around $P$ is the set of all probability measures $Q$ with $d(P, Q) \le \epsilon$.
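As a concrete illustration (not part of the original text), the following Python sketch numerically evaluates the density form $d(P,Q) = \int |p - q|\, d\mu$ for two Gaussian densities on the real line and tests membership in a variational ball; the grid-based integration and the choice of densities are assumptions made purely for the example.

```python
import numpy as np
from scipy.stats import norm

def variational_distance(p, q, grid):
    """Approximate d(P, Q) = integral |p - q| d(mu) on a uniform grid.
    Under the convention above (sup over partitions of sum |P(A_i) - Q(A_i)|),
    this is the full L1 distance between the densities and lies in [0, 2]."""
    dx = grid[1] - grid[0]
    return np.sum(np.abs(p(grid) - q(grid))) * dx

def in_variational_ball(p_center, q, grid, radius):
    """Check whether Q lies in the variational ball of the given radius around P."""
    return variational_distance(p_center, q, grid) <= radius

grid = np.linspace(-10.0, 10.0, 20001)
p = norm(loc=0.0, scale=1.0).pdf
q = norm(loc=0.5, scale=1.2).pdf

print(f"d(P, Q) ~= {variational_distance(p, q, grid):.4f}")
print(in_variational_ball(p, q, grid, radius=0.5))
```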

Given a source with process distribution $P$, let $P^0_{-\infty}$ and $P^\infty_k$ denote the marginal distributions of $P$ on $X^0_{-\infty}$ and $X^\infty_k$, respectively. For each $k$, the $k$th-order absolute regularity coefficient (or $\beta$-mixing coefficient) of $P$ is defined as [16], [17]
$$\beta_k(P) \triangleq \sup \sum_{i,j} \big| P(A_i \cap B_j) - P(A_i)\, P(B_j) \big|$$
where the supremum is over all finite $\sigma(X^0_{-\infty})$-measurable partitions $\{A_i\}$ and all finite $\sigma(X^\infty_k)$-measurable partitions $\{B_j\}$. Observe that
$$\beta_k(P) = d_{\mathcal{G}_k}\big(P,\; P^0_{-\infty} \otimes P^\infty_k\big) \qquad (1)$$
the variational distance between $P$ and the product distribution $P^0_{-\infty} \otimes P^\infty_k$ with respect to the $\sigma$-algebra $\mathcal{G}_k \triangleq \sigma(X^0_{-\infty}, X^\infty_k)$. Since the process is stationary, we can "split" its process distribution at any point $t$ and define $\beta_k(P)$ equivalently by
$$\beta_k(P) = d_{\sigma(X^t_{-\infty},\, X^\infty_{t+k})}\big(P,\; P^t_{-\infty} \otimes P^\infty_{t+k}\big). \qquad (2)$$
Again, if $P$ is subscripted by some $\theta$, we write $\beta_k(\theta)$.

B. Codes

The class of codes we consider here is the collection of all finite-memory variable-rate vector quantizers. Let $\hat{\mathcal{X}}$ be a reproduction alphabet, also assumed to be Polish. We assume that $\mathcal{X} \cup \hat{\mathcal{X}}$ is a subset of a Polish space with a bounded metric $\rho$: there exists some $\rho_{\max} < \infty$, such that $\rho(x, y) \le \rho_{\max}$ for all $x, y$. We take $\rho$, restricted to $\mathcal{X} \times \hat{\mathcal{X}}$, as our (single-letter) distortion function. A variable-rate vector quantizer with block length $n$ and memory length $m$ is an encoder–decoder pair, where the encoder maps the current $n$-block together with the $m$ preceding symbols into a countable collection $\mathcal{S}$ of binary strings satisfying the prefix condition or, equivalently, the Kraft inequality
$$\sum_{s \in \mathcal{S}} 2^{-\ell(s)} \le 1$$
where $\ell(s)$ denotes the length of $s$ in bits, and the decoder maps each binary string to a reproduction block of length $n$. That is, the encoding is done in blocks of length $n$, but the encoder is also allowed to observe the $m$ symbols immediately preceding each block. The effective memory of the code is defined as the smallest subset of the $m$ memory coordinates on which the output of the encoder actually depends, and its size is called the effective memory length. We will often use the same symbol for the code and for the composite mapping obtained by following the encoder with the decoder. When the code has zero memory ($m = 0$), we will denote it more compactly by suppressing the memory argument.

The performance of the code on the source with process distribution $P$ is measured by its expected distortion
$$\mathbb{E}_P\big[\rho_n(X_1^n, \hat{X}_1^n)\big], \qquad \rho_n(x^n, \hat{x}^n) \triangleq \frac{1}{n}\sum_{i=1}^{n} \rho(x_i, \hat{x}_i)$$
where $\rho_n(x^n, \hat{x}^n)$ is the per-letter distortion incurred in reproducing $x^n$ by $\hat{x}^n$, and by its expected rate, namely the expected length in bits of the binary description of the block, normalized by $n$. (We follow Neuhoff and Gilbert [18] and normalize the distortion and the rate by the length $n$ of the reproduction block, not by the combined length $n + m$ of the source block plus the memory input.) When working with variable-rate quantizers, it is convenient [13], [19] to absorb the distortion and the rate into a single performance measure, the Lagrangian distortion
$$\text{expected distortion} + \lambda \cdot \text{expected rate}$$
where $\lambda > 0$ is the Lagrange multiplier which controls the distortion-rate tradeoff. Geometrically, the Lagrangian is the distortion-axis intercept of the line with slope $-\lambda$, passing through the operating point of the code in the rate-distortion plane [20]. If $P$ carries a subscript, $P_\theta$, then we attach the same subscript to the distortion, the rate, and the Lagrangian.
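To make the Lagrangian performance measure concrete, here is a small Python sketch (an illustration only; the quantizer, the code lengths, the source, and the value of $\lambda$ are all assumptions of the example) that estimates the per-letter quantity "distortion plus $\lambda$ times rate" for a toy zero-memory variable-rate scalar quantizer whose code lengths satisfy the Kraft inequality.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy zero-memory variable-rate quantizer: reproduction levels with
# per-codeword binary string lengths (shorter strings for likelier cells).
levels = np.array([-1.5, -0.5, 0.5, 1.5])
lengths = np.array([3, 1, 2, 3])                 # code lengths in bits

# Prefix-code feasibility (Kraft inequality): sum 2^{-l} <= 1.
assert np.sum(2.0 ** -lengths) <= 1.0

def encode(x):
    """Nearest-level encoder; returns the codeword index for each sample."""
    return np.argmin(np.abs(x[:, None] - levels[None, :]), axis=1)

def lagrangian(x_block, lam):
    """Per-letter Lagrangian: distortion + lam * rate (bits per letter)."""
    idx = encode(x_block)
    distortion = np.mean(np.abs(x_block - levels[idx]))   # bounded metric rho
    rate = np.mean(lengths[idx])                          # normalized by block length
    return distortion + lam * rate

# Expected Lagrangian performance, estimated by Monte Carlo over source blocks.
n, lam = 32, 0.1
blocks = rng.normal(0.0, 1.0, size=(2000, n))
print(np.mean([lagrangian(b, lam) for b in blocks]))
```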

C. Vapnik–Chervonenkis Classes

In this paper, we make heavy use of VC theory (see [21], [22], [23], or [24] for detailed treatments). This section contains a brief summary of the needed concepts and results. Let $(\Omega, \mathcal{F})$ be a measurable space. For any collection $\mathcal{A}$ of measurable subsets of $\Omega$ and any $n$-tuple $x^n = (x_1, \ldots, x_n) \in \Omega^n$, define the set $\mathcal{A}(x^n)$ consisting of all distinct binary strings of the form $(\mathbf{1}_{\{x_1 \in A\}}, \ldots, \mathbf{1}_{\{x_n \in A\}})$, $A \in \mathcal{A}$. Then
$$\mathbb{S}_{\mathcal{A}}(n) \triangleq \max_{x^n \in \Omega^n} |\mathcal{A}(x^n)|$$


is called the $n$th shatter coefficient of $\mathcal{A}$. The VC dimension of $\mathcal{A}$, denoted by $V(\mathcal{A})$, is defined as the largest $n$ for which $\mathbb{S}_{\mathcal{A}}(n) = 2^n$ (if $\mathbb{S}_{\mathcal{A}}(n) = 2^n$ for all $n$, then we set $V(\mathcal{A}) = \infty$). If $V(\mathcal{A}) < \infty$, then $\mathcal{A}$ is called a VC class. If $\mathcal{A}$ is a VC class with $V(\mathcal{A}) = V$, then it follows from the results of Vapnik and Chervonenkis [12] and Sauer [25] that $\mathbb{S}_{\mathcal{A}}(n) \le (n+1)^V$.
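The following Python sketch illustrates the shatter coefficient and Sauer's bound by brute force for the class of half-lines $(-\infty, t]$, whose VC dimension is 1; the class, the grid of thresholds, and the random point sets are assumptions of the example.

```python
import numpy as np

def shatter_coefficient(points, sets):
    """Number of distinct binary patterns (1{x_1 in A}, ..., 1{x_n in A})
    realized on `points` by the collection `sets` (indicator functions)."""
    patterns = {tuple(int(A(x)) for x in points) for A in sets}
    return len(patterns)

# Class of half-lines A_t = (-inf, t]; its VC dimension is V = 1.
thresholds = np.linspace(-3.0, 3.0, 601)
half_lines = [lambda x, t=t: x <= t for t in thresholds]

rng = np.random.default_rng(1)
for n in range(1, 5):
    worst = max(shatter_coefficient(rng.normal(size=n), half_lines)
                for _ in range(200))
    print(n, worst, "<=", (n + 1) ** 1)   # Sauer's bound with V = 1
```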

For a VC class $\mathcal{A}$, the so-called VC inequalities (see Lemma 2.1) relate its VC dimension to maximal deviations of the probabilities of the events in $\mathcal{A}$ from their relative frequencies with respect to an i.i.d. sample of size $n$. For any $x^n \in \Omega^n$, let
$$P_{x^n} \triangleq \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}$$
denote the induced empirical distribution, where $\delta_x$ is the Dirac measure (point mass) concentrated at $x$. We then have the following.

Lemma 2.1 (VC Inequalities): Let $P$ be a probability measure on $(\Omega, \mathcal{F})$, and $X^n = (X_1, \ldots, X_n)$ an $n$-tuple of independent random variables with $X_i \sim P$, $i = 1, \ldots, n$. Let $\mathcal{A}$ be a VC class with $V(\mathcal{A}) = V$. Then, for every $\epsilon > 0$
$$\mathbb{P}\Big( \sup_{A \in \mathcal{A}} \big| P_{X^n}(A) - P(A) \big| > \epsilon \Big) \le 8\, \mathbb{S}_{\mathcal{A}}(n)\, e^{-n\epsilon^2/32} \qquad (3)$$
and
$$\mathbb{E}\Big[ \sup_{A \in \mathcal{A}} \big| P_{X^n}(A) - P(A) \big| \Big] \le c\, \sqrt{\frac{V \log(n+1)}{n}} \qquad (4)$$
where $c$ is a universal constant. The probabilities and expectations are with respect to the product measure $P^n$ on $\Omega^n$.

Remark 2.1: A more refined technique involving metric entropies and empirical covering numbers, due to Dudley [26], can yield a much better $O(\sqrt{V/n})$ bound on the expected maximal deviation between the true and the empirical probabilities. This improvement, however, comes at the expense of a much larger constant hidden in the $O(\cdot)$ notation.

Finally, we will need the following lemma, which is a simple corollary of the results of Karpinski and Macintyre [27] (see also [24, Sec. 10.3.5]).

Lemma 2.2: Let $\mathcal{A} = \{A_\theta : \theta \in \mathbb{R}^k\}$ be a collection of measurable subsets of $\Omega$, such that membership of a point in $A_\theta$ is decided by the sign of a function which, for each fixed point, is a polynomial of bounded degree in the components of $\theta$. Then, $\mathcal{A}$ is a VC class, with $V(\mathcal{A})$ bounded in terms of $k$ and the polynomial degree.

III. STATEMENT OF RESULTS

In this section, we state our result concerning universal schemes for joint lossy compression and identification of stationary sources under certain regularity conditions. We work in the usual setting of universal source coding: we are given a source whose process distribution is known to be a member of some parametric class $\{P_\theta : \theta \in \Theta\}$. The parameter space $\Theta$ is an open subset of the Euclidean space $\mathbb{R}^k$ for some finite $k$, and we assume that $\Theta$ has nonempty interior. We wish to design a sequence of variable-rate vector quantizers, such that the decoder can reliably reconstruct the original source sequence and reliably identify the active source in an asymptotically optimal manner for all $\theta \in \Theta$. We begin by listing the regularity conditions.

Condition 1: The sources in $\{P_\theta : \theta \in \Theta\}$ are algebraically $\beta$-mixing: there exists a constant $r > 0$, such that
$$\beta_k(\theta) = O(k^{-r}) \qquad \text{for all } \theta \in \Theta$$
where the constant implicit in the $O(\cdot)$ notation may depend on $\theta$.

This condition ensures that certain finite-block functions of the source can be approximated in distribution by i.i.d. processes, so that we can invoke the VC machinery of Section II-C. This "blocking" technique, which we exploit in the proof of our Theorem 3.1, dates back to Bernstein [28], and was used by Yu [29] to derive rates of convergence in the uniform laws of large numbers for stationary mixing processes, and by Meir [30] in the context of nonparametric adaptive prediction of stationary time series. As an example of when an even stronger decay condition holds, let $\{X_i\}$ be a finite-order autoregressive moving-average (ARMA) process driven by a zero-mean i.i.d. process $\{Z_i\}$, i.e., there exist positive integers $p$, $q$, and real constants $a_1, \ldots, a_p$, $b_1, \ldots, b_q$ such that
$$X_i = \sum_{j=1}^{p} a_j X_{i-j} + Z_i + \sum_{l=1}^{q} b_l Z_{i-l}.$$
Mokkadem [31] has shown that, provided the common distribution of the $Z_i$ is absolutely continuous and the roots of the polynomial $A(z) = 1 - a_1 z - \cdots - a_p z^p$ lie outside the unit circle in the complex plane, the $\beta$-mixing coefficients of $\{X_i\}$ decay to zero exponentially.

Condition 2: For each $\theta \in \Theta$, there exist constants $c_\theta > 0$ and $r_\theta > 0$, such that
$$d\big(P^n_\theta, P^n_{\theta'}\big) \le c_\theta \sqrt{n}\, \|\theta - \theta'\| \qquad \text{for all } n$$
for all $\theta'$ in the open ball of radius $r_\theta$ centered at $\theta$, where $\|\cdot\|$ is the Euclidean norm on $\mathbb{R}^k$.

This condition guarantees that, for any sequence $\{\delta_n\}$ of positive reals such that
$$\sqrt{n}\, \delta_n \to 0 \qquad \text{as } n \to \infty$$
and any sequence $\{\theta_n\}$ in $\Theta$ satisfying $\|\theta_n - \theta\| \le \delta_n$ for a given $\theta$, we have
$$d\big(P^n_\theta, P^n_{\theta_n}\big) \to 0 \qquad \text{as } n \to \infty.$$
It is weaker (i.e., more general) than the conditions of Rissanen [2], [3] which control the behavior of the relative entropy (information divergence) as a function of the source parameters in


terms of the Fisher information and related quantities. Indeed,for each , let

be the normalized th-order relative entropy (informationdivergence) between and . Suppose that, for each ,

is twice continuously differentiable as a func-tion of . Let lie in an open ball of radius around .Since attains its minimum at , the gradient

evaluated at is zero, and we can writethe second-order Taylor expansion of about as

(5)

where the Hessian matrix

under additional regularity conditions, is equal to the Fisher in-formation matrix

(see [32]). Assume now, following [2] and [3], that the sequenceof matrix norms is bounded (by a constant dependingon ). Then, we can write

i.e., the normalized relative entropies are locallyquadratic in . Then, Pinsker’s inequality (see, e.g., [15,Lemma 5.2.8]) implies that ,and we recover our Condition 2. Rissanen’s condition, whilestronger than our Condition 2, is easier to check, the fact whichwe exploit in our discussion of examples of Section V.

Condition 3: For each $n$, let $\mathcal{A}_n$ be the collection of all sets of the form
$$A_{\theta', \theta''} = \big\{ x^n : p_{\theta'}(x^n) > p_{\theta''}(x^n) \big\}, \qquad \theta', \theta'' \in \Theta.$$
Then we require that, for each $n$, $\mathcal{A}_n$ is a VC class, and $V_n \triangleq V(\mathcal{A}_n) < \infty$.

This condition is satisfied, for example, when $V_n$ is bounded independently of $n$, or when $V_n$ grows polynomially in $n$. The use of the class $\mathcal{A}_n$ dates back to the work of Yatracos [33] on minimum-distance density estimation. The ideas of Yatracos were further developed by Devroye and Lugosi [34], [35], who dubbed $\mathcal{A}_n$ the Yatracos class (associated with the densities $\{p_\theta\}$). We will adhere to this terminology. To give an intuitive interpretation to $\mathcal{A}_n$

, let us consider a pair of distinct parameter vectorsand note that the set consists of all

for which the simple hypothesis test

versus (6)

is passed by the null hypothesis under the likelihood-ratiodecision rule. Now, suppose that are drawn inde-pendently from . To each , we can associate a classi-fier defined by . Calltwo sets equivalent with respect to the sample

, and write , if their associatedclassifiers yield identical classification patterns

It is easy to see that is an equivalence relation. From thedefinitions of the shatter coefficients and the VC di-mension (cf., Section II-C), we see that the cardinalityof the quotient set is equal to for all sample sizes

, whereas for , it is bounded fromabove by , which is strictly less than . Thus, the factthat the Yatracos class has finite VC dimension impliesthat the problem of estimating the density from a large i.i.d.sample reduces, in a sense, to a finite number [in fact, polyno-mial in the sample size , for ] of simple hypoth-esis tests of the type (6). Our Condition 1 will then allow us totransfer this intuition to (weakly) dependent samples.

Now that we have listed the regularity conditions that must hold for the sources in $\{P_\theta : \theta \in \Theta\}$, we can state our main result.

Theorem 3.1: Let $\{P_\theta : \theta \in \Theta\}$ be a parametric class of sources satisfying Conditions 1–3. Then, for every $\lambda > 0$, there exists a sequence $\{C_n\}$ of variable-rate vector quantizers with suitably chosen memory lengths and effective memory lengths, such that, for all $\theta \in \Theta$, the gap between the Lagrangian performance of $C_n$ on $P_\theta$ and that of the best finite-memory variable-rate quantizer with block length $n$ is
$$O\big(\sqrt{V_n \log n / n}\big) \qquad (7)$$
where the constants implicit in the $O(\cdot)$ notation depend on $\theta$. Furthermore, for each $n$, the binary description produced by the encoder is such that the decoder can identify the $n$-dimensional marginal of the active source up to a variational ball of radius $O(\sqrt{V_n \log n / n})$ with probability one.

What (7) says is that, for each block length $n$ and each $\theta$, the code $C_n$, which is independent of $\theta$, performs almost as well as the best finite-memory quantizer with block length $n$ that can be designed with full a priori knowledge of the $n$-dimensional marginal $P^n_\theta$. Thus, as far as compression goes, our scheme can compete with all finite-memory variable-rate lossy block codes (vector quantizers), with the additional bonus of allowing the decoder to identify the active source in an asymptotically optimal manner.

It is not hard to see that the double infimum in (7) is achievedalready in the zero-memory case . Indeed, it is immediate


that having nonzero memory can only improve the Lagrangianperformance, i.e.,

On the other hand, given any code , we canconstruct a zero-memory code , such that

for all . To see this, definefor each the set

for some

and let

and . Then, given any , let. We then have

Taking expectations, we see that forall , which proves that

The infimum of over all zero-memory variable-ratequantizers with block length is the operational th-orderdistortion-rate Lagrangian [20]. Because each is er-godic, converges to the distortion-rate Lagrangian

where is the Shannon distortion-rate function of (seeLemma 2 in Appendix of [14]). Thus, our scheme is universalnot only in the th-order sense of (7), but also in the distortion-rate sense, i.e.,

as

for every . Thus, in the terminology of [14], our schemeis weakly minimax universal for .

IV. PROOF OF THEOREM 3.1

A. The Main Idea

In this section, we describe the main idea behind the proofand fix some notation. We have already seen that it suffices toconstruct a universal scheme that can compete with all zero-memory variable-rate quantizers. That is, it suffices to show thatthere exists a sequence of codes, such that

(8)This is what we will prove.

We assume throughout that the “true” source is for some. Our code operates as follows. Suppose the following.

• Both the encoder and the decoder have access to a count-ably infinite “database” , where each

. Using Elias’ universal representation of the integers[36], we can associate to each a unique binary string

with bits.• A sequence of positive reals is given, such that

as

(we will specify the sequence later in the proof).• For each and each , there exists a zero-

memory -block code that achieves (orcomes arbitrarily close to) the th-order Lagrangian op-timum for : .

Fix the block length $n$. Because the source is stationary, it suffices to describe the mapping of one block into its binary description. The encoding is done as follows.

1) The encoder estimates the source parameters from the memory block preceding the current block (using only the effective-memory sub-blocks).

2) The encoder then computes the waiting time: it looks through the database and finds the first entry whose $n$-dimensional distribution is in the variational ball of radius $\delta_n$ around the estimated distribution, with the standard convention that the infimum of the empty set is equal to $+\infty$.

3) If the waiting time is finite, the encoder selects that database entry; otherwise, the encoder selects some default parameter vector.

4) The binary description of the block is a concatenation of the following three binary strings: i) a 1-bit flag to tell whether the waiting time is finite or infinite; ii) a binary string which is equal to the Elias code of the waiting time if it is finite, or to an empty string if it is infinite; iii) the output of the second-stage code matched to the selected parameter vector. The string formed by i) and ii) is called the first-stage description, while iii) is called the second-stage description.

The decoder receives the description, determines the selected parameter vector from the first-stage description, and produces the reproduction using the corresponding second-stage code. Note that when the waiting time is finite (which, as we will show, will happen eventually a.s.), the selected marginal lies in the variational ball of radius $\delta_n$ around the estimated source distribution.
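The sketch below (illustrative only) mimics this first-stage description for a toy i.i.d. Gaussian family: it estimates the parameters from a block, scans a randomly generated codebook for the first entry within $\delta$ in variational distance (the waiting time), and emits a one-bit flag followed by an Elias gamma code of the waiting time. The Gaussian family, the prior used to draw the codebook, the value of $\delta$, and all helper names are assumptions; the paper's actual construction uses the minimum-distance estimator and the $n$-dimensional marginals.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

def elias_gamma(i):
    """Elias gamma code of a positive integer: unary length prefix + binary form."""
    b = bin(i)[2:]
    return "0" * (len(b) - 1) + b

def variational_distance_1d(theta1, theta2, grid=np.linspace(-20, 20, 4001)):
    """d between one-dimensional Gaussian marginals: integral of |p - q|."""
    p = norm(theta1[0], theta1[1]).pdf(grid)
    q = norm(theta2[0], theta2[1]).pdf(grid)
    return np.sum(np.abs(p - q)) * (grid[1] - grid[0])

def first_stage_description(block, codebook, delta):
    """Estimate theta from the block, then return the flag bit plus the Elias code
    of the waiting time (first codebook index within delta of the estimate)."""
    theta_hat = (block.mean(), max(block.std(), 1e-3))
    for i, theta_i in enumerate(codebook, start=1):
        if variational_distance_1d(theta_hat, theta_i) <= delta:
            return "1" + elias_gamma(i), theta_i      # finite waiting time
    return "0", (0.0, 1.0)                            # infinite: fall back to a default

# Codebook entries (mean, std) drawn i.i.d. from a fixed prior; encoder and decoder
# can regenerate the same entries from a shared pseudorandom seed.
codebook = [(rng.normal(0.0, 3.0), abs(rng.normal(1.0, 0.5)) + 0.1)
            for _ in range(10000)]

block = rng.normal(1.0, 2.0, size=500)
bits, theta_selected = first_stage_description(block, codebook, delta=0.3)
print(len(bits), theta_selected)
```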

If the latter is a good estimate of , i.e., asalmost surely (a.s.), then the estimate of the true source

computed by the decoder is only slightly worse. Furthermore, aswe will show, the almost-sure convergence of to zeroas implies that the Lagrangian performance of on

is close to the optimum .Formally, the code comprises the following maps:• the parameter estimator ;• the parameter encoder , where

;• the parameter decoder .

Let denote the composition of the parameter estimatorand the parameter encoder, which we refer to as the first-stageencoder, and let denote the composition of the parameterdecoder and the first-stage encoder. The decoder is the first-stage decoder. The collection defines the second-


stage codes. The encoder and thedecoder of are defined as

and

respectively. To assess the performance of , consider thefunction

The expectation of with respect tois precisely the Lagrangian performance of , at La-

grange multiplier , on the source . We consider separatelythe contributions of the first-stage and the second-stage codes.Define another function by

so that is the (random) La-grangian performance of the code on . Hence

so, taking expectations, we get

(9)

Our goal is to show that the first term in (9) converges to theth-order optimum , and that the second term is .The proof itself is organized as follows. First, we motivate

the choice of the memory lengths in Section IV-B. Then,we indicate how to select the database (Section IV-C) andhow to implement the parameter estimator (Section IV-D) andthe parameter encoder/decoder pair (Section IV-E). Theproof is concluded by estimating the Lagrangian performance ofthe resulting code (Section IV-F) and the fidelity of the sourceidentification at the decoder (Section IV-G). In the following,(in)equalities involving the relevant random variables are as-sumed to hold for all realizations and not just a.s., unless spec-ified otherwise.

B. The Memory Length

Let $a_n$ denote the sub-block length, chosen according to the common decay exponent of the $\beta$-mixing coefficients in Condition 1. We divide the memory block into blocks of length $a_n$ interleaved by gap blocks of

Fig. 1. The structure of the code. The shaded blocks are those used for estimating the source parameters.

a suitable length (see Fig. 1). The parameter estimator, although defined as acting on the entire memory block, effectively will make use only of the interleaved $a_n$-blocks. These blocks are each distributed according to the $a_n$-dimensional marginal of the active source, but they are not independent. Thus, the set of symbols contained in these blocks is the effective memory of the code, and the effective memory length is their total number.
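A minimal sketch of the blocking idea, with illustrative block and gap lengths: the memory is split into kept sub-blocks separated by gaps, and only the kept sub-blocks feed the parameter estimator; under the mixing condition the kept sub-blocks behave approximately like an i.i.d. sample.

```python
import numpy as np

def interleaved_blocks(memory, a, g):
    """Split `memory` into consecutive segments of length a (kept, used for
    parameter estimation) separated by gap segments of length g (discarded).
    Returns an array of shape (number_of_kept_blocks, a)."""
    period = a + g
    k = (len(memory) + g) // period
    kept = [memory[j * period : j * period + a] for j in range(k)]
    return np.array([b for b in kept if len(b) == a])

rng = np.random.default_rng(3)
memory = rng.normal(size=1000)
blocks = interleaved_blocks(memory, a=40, g=10)
print(blocks.shape)   # approximately i.i.d. a-blocks when the source is beta-mixing
```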

Let denote the marginal distribution of , and letdenote the product of copies of . We now show that we canapproximate by in variational distance, increasinglyfinely with . Note that both and are defined on the

-algebra , generated by all except those in ,so that . Therefore, usinginduction and the definition of the -mixing coefficient (cf.,Section II-A), we have

where the last equality follows from Condition 1 and from ourchoice of . This in turn implies the following useful fact (seealso [29, Lemma 4.1]), which we will heavily use in the proof:for any measurable function with

(10)

where the constant hidden in the notation depends onand on .

C. Construction of the Database

The database, or the first-stage codebook, is constructedby random selection. Let be a probability distribution on

which is absolutely continuous with respect to the Lebesguemeasure and has an everywhere positive and continuous den-sity . Let be a collection of independentrandom vectors taking values in , each generated according to

independently of . We use to denote the process distri-bution of .

Note that the first-stage codebook is countably infinite, whichmeans that, in principle, both the encoder and the decoder musthave unbounded memory in order to store it. This difficulty canbe circumvented by using synchronized random number gen-erators at the encoder and at the decoder, so that the entries of


can be generated as needed. Thus, by construction, the en-coder will generate samples (where is the waiting time)and then communicate (a binary encoding of) to the decoder.Since the decoder’s random number generator is synchronizedwith that of the encoder’s, the decoder will be able to recoverthe required entry of .

D. Parameter Estimation

The parameter estimator is constructed as follows. Because the source is stationary, it suffices to describe its action on the memory block. In the notation of Section IV-A, form the empirical distribution of the kept sub-blocks. For every $\theta$, define the maximal discrepancy between $P_\theta$ and this empirical distribution over the Yatracos class defined by the corresponding densities (see Section III). Finally, define the estimate as any $\theta$ whose discrepancy is within a small additive term of the infimum; the extra term is there to ensure that at least one such $\theta$ exists. This is the so-called minimum-distance (MD) density estimator of Devroye and Lugosi [34], [35] (see also [37]), adapted to the dependent-process setting of this paper. The key property of the MD estimator is that the variational distance between the estimated and the true marginal is controlled, up to a constant factor and the small additive term, by the maximal discrepancy between the true marginal and the empirical distribution over the Yatracos class

(11)

(see, e.g., [37, Th. 5.1]). This holds regardless of whether the samples are independent.
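As an illustration of the minimum-distance estimator over a Yatracos class, the following Python sketch works with a finite grid of one-dimensional Gaussian candidates (the grid, the numerical integration, and the sample size are assumptions of the example, not the paper's construction): for each candidate theta it computes the largest discrepancy |mu_n(A) - P_theta(A)| over the Yatracos sets A = {x : p_theta'(x) > p_theta''(x)} and returns the minimizer.

```python
import itertools
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)

# Illustrative finite grid of candidate (mean, std) parameters.
candidates = [(m, s) for m in np.linspace(-2.0, 2.0, 9) for s in (0.5, 1.0, 2.0)]
grid = np.linspace(-12.0, 12.0, 2401)
dx = grid[1] - grid[0]
dens_grid = {th: norm(*th).pdf(grid) for th in candidates}

def md_estimate(sample):
    """Minimum-distance (Yatracos) estimate: pick the candidate minimizing the
    largest discrepancy |mu_n(A) - P_theta(A)| over the Yatracos sets."""
    dens_sample = {th: norm(*th).pdf(sample) for th in candidates}
    # For every ordered pair, precompute the empirical measure of A and the
    # indicator of A on the integration grid.
    pairs = []
    for th1, th2 in itertools.permutations(candidates, 2):
        A_grid = dens_grid[th1] > dens_grid[th2]
        emp = np.mean(dens_sample[th1] > dens_sample[th2])     # mu_n(A)
        pairs.append((A_grid, emp))
    scores = {
        th: max(abs(emp - np.sum(dens_grid[th][A]) * dx) for A, emp in pairs)
        for th in candidates
    }
    return min(scores, key=scores.get)

sample = rng.normal(0.4, 1.0, size=400)
print(md_estimate(sample))   # expected to land near (0.5, 1.0) on this grid
```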

E. Encoding and Decoding of Parameter Estimates

Next we construct the parameter encoder-decoder pair .Given a , define the waiting time

with the standard convention that the infimum of the empty setis equal to . That is, given a , the parameter encoderlooks through the codebook and finds the position of the first

such that the variational distance between the th-orderdistributions and is at most . If no such isfound, the encoder sets . We then define the mapsand by

ifif

and

respectively. Thus, , and the bound

(12)

holds for every , regardless of whether is finite orinfinite.

F. Performance of the Code

Given the random codebook , the expected Lagrangian per-formance of our code on the source is

(13)

We now upper-bound the two terms in (13). We start with thesecond term.

We need to bound the expectation of the waiting time. Our strategy borrows some elements from

[38]. Consider the probability

which is a random function of . From Condition 2, itfollows for sufficiently large that

where . Because the density is every-where positive, the latter probability is strictly positive for al-most all , and so eventually a.s. Thus, thewaiting times will be finite eventually a.s. (with respect toboth the source and the first-stage codebook ). Now, if

, then, conditioned on , the waitingtime is a geometric random variable with parameter , andit is not difficult to show (see, e.g., [38, Lemma 3]) that for any

Setting , we have, for almost all , that

Then, by the Borel–Cantelli lemma

eventually a.s., so that

(14)


for almost every realization of the random codebook and forsufficiently large . We now obtain an asymptotic lower boundon . Define the events

Then, by the triangle inequality, we have

and

and, for sufficiently large, we can write

where (a) follows from the independence of and , (b) fol-lows from the fact that the parameter estimator de-pends only on , and (c) follows from Condition 2 and thefact that . Since the density is everywhere positiveand continuous at , for all forsufficiently large, so

(15)

where is the volume of the unit sphere in . Next, the factthat the minimum-distance estimate depends onlyon implies that the event belongs to the -algebra ,and from (10), we get

(16)

Under , the -blocks are i.i.d. according to ,and we can invoke the VC machinery to lower bound .In the notation of Section IV-D, define the event

Then, implies by (11), and

(17)

where the second bound is by the VC inequality (3) of Lemma2.1. Combining the bounds (16) and (17) and using Condition1, we obtain

(18)Now, if we choose

then the right-hand side of (18) can be further lower bounded by. Combining this with (15), taking logarithms, and

then taking expectations, we obtain

where is a constant that depends only on and . Usingthis and (14), we get that

for -almost every realization of the random codebook , forsufficiently large. Together with (12), this implies that

for -almost all realizations of the first-stage codebook.We now turn to the first term in (13). Recall that, for each

, the code is th-order optimal for . Using this facttogether with the boundedness of the distortion measure , wecan invoke Lemma A.3 in the Appendix and assume withoutloss of generality that each has a finite codebook (of sizenot exceeding ), and each codevector can be describedby a binary string of no more than bits. Hence,

. Let and be the marginaldistributions of on and , respectively.Note that does not depend on . This,together with Condition 1 and the choice of , implies that

Furthermore

where (a) follows by Fubini’s theorem and the boundedness of, while (b) follows from the definition of . The Lagrangian

performance of the code , where , can befurther bounded as


where (a) follows from Lemma A.3 in the Appendix, (b) fol-lows from the th-order optimality of for , (c) follows,overbounding slightly, from the Lagrangian mismatch bound ofLemma A.2 in the Appendix, and (d) follows from the triangleinequality. Taking expectations, we obtain

(19)

The second term in (19) can be interpreted as the esti-mation error due to estimating by , while the firstis the approximation error due to quantization of the parameterestimate . We examine the estimation error first. Using (11),we can write

(20)

Now, each is distributed according to , and we can ap-proximate the expectation of with respect to bythe expectation of with respect to the product measure

where the second estimate follows from the VC inequality (4)and from the choice of . This, together with (20), yields

(21)

As for the first term in (19), we have, by construction ofthe first-stage encoder, that

(22)

eventually a.s., so the corresponding expectation isas well. Summing the estimates (21) and

(22), we obtain

Finally, putting everything together, we see that, eventually

(23)

for -almost every realization of the first-stage codebook .This proves (8), and hence (7).

G. Identification of the Active Source

We have seen that the expected variational distancebetween the -dimensional

marginals of the true source and the estimated sourceconverges to zero as . We wish

to show that this convergence also holds eventually with prob-ability one, i.e.,

(24)

-a.s.Given an , we have by the triangle inequality that

implies

where is the minimum-distance estimate offrom (cf., Section IV-D). Recalling our constructionof the first-stage encoder, we see that this further implies

Finally, using the property (11) of the minimum-distance esti-mator, we obtain that

implies

Therefore

(25)

where (a) follows, as before, from the definition of the -mixingcoefficient and (b) follows by the VC inequality. Now, if wechoose


for an arbitrary small , then (25) can be further upperbounded by , which, owing to Condition1 and the choice , is summable in . Thus

and we obtain (24) by the Borel–Cantelli lemma.

V. EXAMPLES

A. Stationary Memoryless Sources

As a basic check, let us see how Theorem 3.1 applies to stationary memoryless (i.i.d.) sources. Let $\mathcal{X} = \mathbb{R}$, and let $\{P_\theta : \theta \in \Theta\}$ be the collection of all Gaussian i.i.d. processes, where $\theta = (m, \sigma^2)$ consists of the common mean and variance of the letters. Then, the $n$-dimensional marginal for a given $\theta = (m, \sigma^2)$ has the Gaussian density
$$p_\theta(x^n) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x_i - m)^2/(2\sigma^2)}$$
with respect to the Lebesgue measure. This class of sources trivially satisfies Condition 1 with $\beta_k \equiv 0$ for all $k \ge 1$, and it remains to check Conditions 2 and 3.

To check Condition 2, let us examine the normalized $n$th-order relative entropy between $P_\theta$ and $P_{\theta'}$, with $\theta = (m, \sigma^2)$ and $\theta' = (m', \sigma'^2)$. Because the sources are i.i.d., the normalized $n$th-order divergence equals the single-letter divergence
$$\frac{1}{n}\, D\big(P^n_\theta \,\|\, P^n_{\theta'}\big) = \log\frac{\sigma'}{\sigma} + \frac{\sigma^2 + (m - m')^2}{2\sigma'^2} - \frac{1}{2}.$$
Applying the inequality $\log x \le x - 1$ and some straightforward algebra, we get a bound which is quadratic in the differences of the parameters. Now fix a small $r > 0$, and suppose that $\|\theta - \theta'\| \le r$. Then the variances stay bounded away from zero, so we can further upper bound the divergence by a constant multiple of $\|\theta - \theta'\|^2$. Thus, for a given $\theta$, we see that
$$\frac{1}{n}\, D\big(P^n_\theta \,\|\, P^n_{\theta'}\big) \le c_\theta\, \|\theta - \theta'\|^2$$
for all $\theta'$ in the open ball of radius $r$ around $\theta$, with $c_\theta$ depending only on $\theta$ and $r$. Using Pinsker's inequality, we have
$$d\big(P^n_\theta, P^n_{\theta'}\big) \le \sqrt{2 n c_\theta}\, \|\theta - \theta'\|$$
for all such $\theta'$. Thus, Condition 2 holds.
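A quick numerical sanity check of this argument (illustrative; the parameter values are arbitrary, and the convention $d(P,Q) = \int|p-q|\,d\mu$ together with the resulting Pinsker form $d \le \sqrt{2\,\mathrm{KL}}$ follows the definitions used above):

```python
import numpy as np

def kl_gauss(m1, s1, m2, s2):
    """KL divergence D(N(m1, s1^2) || N(m2, s2^2)) in nats (closed form)."""
    return np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

def variational_distance_gauss(m1, s1, m2, s2, grid=np.linspace(-30, 30, 60001)):
    """d(P, Q) = integral |p - q| dx for two Gaussian densities (L1 convention)."""
    dx = grid[1] - grid[0]
    p = np.exp(-(grid - m1) ** 2 / (2 * s1**2)) / (s1 * np.sqrt(2 * np.pi))
    q = np.exp(-(grid - m2) ** 2 / (2 * s2**2)) / (s2 * np.sqrt(2 * np.pi))
    return np.sum(np.abs(p - q)) * dx

# Pinsker in L1 form: d <= sqrt(2 * KL); for n i.i.d. letters the divergence
# scales linearly in n, giving the sqrt(n) * ||theta - theta'|| type bound.
theta, theta_p = (0.0, 1.0), (0.1, 1.05)
kl = kl_gauss(*theta, *theta_p)
print(variational_distance_gauss(*theta, *theta_p), "<=", np.sqrt(2 * kl))
for n in (1, 10, 100):
    print(n, "letters:", np.sqrt(2 * n * kl))
```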

To check Condition 3, note that, for each , the Yatracos classconsists of all sets of the form

(26)

for all ; . Let and. Then, we can rewrite (26) as

This is the set of all such that

where is a third-degree polynomial in the six param-eters . It then follows from Lemma2.2 that is a VC class with . There-fore, Condition 3 holds as well.

B. Autoregressive Sources

Again, let $\mathcal{X} = \mathbb{R}$ and consider the case when the source is a Gaussian autoregressive source of order $p$, i.e., it is the output of an autoregressive filter of order $p$ driven by white Gaussian noise. Then, there exist real parameters $a_1, \ldots, a_p$ (the filter coefficients), such that
$$X_i = \sum_{j=1}^{p} a_j X_{i-j} + Z_i$$
where $\{Z_i\}$ is an i.i.d. Gaussian process with zero mean and unit variance. Let $\Theta$ be the set of all $\theta = (a_1, \ldots, a_p)$, such that the roots of the polynomial $A(z) = 1 - a_1 z - \cdots - a_p z^p$ lie outside the unit circle in the complex plane. This ensures that the source is a stationary process. We now proceed to check that Conditions 1–3 of Section III are satisfied.
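For illustration, the following sketch simulates such a Gaussian AR(p) source and checks the stationarity condition numerically by computing the roots of $1 - a_1 z - \cdots - a_p z^p$; the filter coefficients are arbitrary example values.

```python
import numpy as np

def is_stationary(a):
    """Check that all roots of A(z) = 1 - a_1 z - ... - a_p z^p lie outside the
    unit circle (np.roots takes coefficients from the highest degree down)."""
    coeffs = [-c for c in a[::-1]] + [1.0]
    return bool(np.all(np.abs(np.roots(coeffs)) > 1.0))

def simulate_ar(a, n, burn_in=1000, seed=5):
    """Generate n samples of X_i = sum_j a_j X_{i-j} + Z_i with Z_i ~ N(0, 1),
    discarding a burn-in period so the output is approximately stationary."""
    rng = np.random.default_rng(seed)
    p = len(a)
    x = np.zeros(n + burn_in + p)
    z = rng.normal(size=x.shape)
    for i in range(p, len(x)):
        x[i] = np.dot(a, x[i - p:i][::-1]) + z[i]
    return x[-n:]

a = [0.5, -0.2]               # illustrative AR(2) filter coefficients
print(is_stationary(a))       # True: the source is stationary (and beta-mixing)
x = simulate_ar(a, n=5000)
print(x.mean(), x.var())
```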

The distribution of each $X_i$ is absolutely continuous, and we can invoke the result of Mokkadem [31] to conclude that, for each $\theta \in \Theta$, the process is geometrically mixing, i.e., there exists some $\gamma \in (0,1)$, such that $\beta_k(\theta) = O(\gamma^k)$. Now, for any fixed $r > 0$, $\gamma^k \le k^{-r}$ for $k$ sufficiently large, so Condition 1 holds.

To check Condition 2, note that, for each , the Fisher in-formation matrix is independent of (see, e.g., [39, Sec.VI]). Thus, the asymptotic Fisher information matrix

exists and is nonsingular [39, Th. 6.1], so, re-calling the discussion in Section III, we conclude that Condition2 holds also.

To verify Condition 3, consider the -dimensional marginal, which has the Gaussian density


where is the th-order autocorrela-tion matrix of . Thus, the Yatracos class consists of setsof the form

for all . Now, for every , let. Since is uniquely deter-

mined by , we have for all . Usingthis fact, as well as the easily established fact that the entriesof the inverse covariance matrix are second-degreepolynomials in the filter coefficients , we see that,for each , the condition can be expressed as

, where is quadratic in thereal variables . Thus, we can applyLemma 2.2 to conclude that .Therefore, Condition 3 is satisfied as well.

C. Hidden Markov Processes

A hidden Markov process (or a hidden Markov model; see, e.g., [40]) is a discrete-time bivariate random process consisting of a homogeneous Markov chain together with a sequence of observations that are conditionally independent given the chain, where the conditional distribution of each observation is time-invariant and depends on the chain only through the current state. The Markov chain, also called the regime, is not available for observation. The observable component is the source of interest. In information theory (see, e.g., [41] and references therein), a hidden Markov process is a discrete-time finite-state homogeneous Markov chain observed through a discrete-time memoryless channel, so that the source of interest is the observation sequence at the output of the channel.

Let $N$ denote the number of states of the chain. We assume without loss of generality that the state space is the set $\{1, \ldots, N\}$. Let $Q = (q_{ij})$ denote the transition matrix of the chain. If the chain is ergodic (i.e., irreducible and aperiodic), then there exists a unique probability distribution $\pi$ on $\{1, \ldots, N\}$ such that $\pi Q = \pi$ (the stationary distribution of the chain); see, e.g., [42, Sec. 8]. Because in this paper we deal with two-sided random processes, we assume that the chain has been initialized with its stationary distribution at some time sufficiently far away in the past, and can therefore be thought of as a two-sided stationary process. Now consider a discrete-time memoryless channel with input alphabet $\{1, \ldots, N\}$ and output (observation) alphabet $\mathbb{R}^d$ for some $d \ge 1$. It is specified by a set of transition densities (with respect to the Lebesgue measure on $\mathbb{R}^d$). The channel output sequence is the source of interest.
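A minimal sketch of such a source (illustrative values only): an ergodic two-state chain with all transition probabilities positive, started from its stationary distribution and observed through a memoryless Gaussian channel.

```python
import numpy as np

rng = np.random.default_rng(6)

# Illustrative ergodic transition matrix (all entries positive) and
# Gaussian channel transition densities, one per hidden state.
Q = np.array([[0.90, 0.10],
              [0.30, 0.70]])
emission_mean = np.array([-1.0, 2.0])
emission_std = np.array([1.0, 0.5])

def stationary_distribution(Q):
    """Left eigenvector of Q for eigenvalue 1, normalized to a probability vector."""
    vals, vecs = np.linalg.eig(Q.T)
    pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return pi / pi.sum()

def sample_hmm(n):
    """Sample n letters of the observable component of the hidden Markov source."""
    pi = stationary_distribution(Q)
    states = np.empty(n, dtype=int)
    states[0] = rng.choice(len(pi), p=pi)
    for t in range(1, n):
        states[t] = rng.choice(Q.shape[0], p=Q[states[t - 1]])
    return rng.normal(emission_mean[states], emission_std[states])

y = sample_hmm(10000)
print(stationary_distribution(Q), y.mean())
```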

Let us take as the parameter space the set ofall transition matrices , such that all forsome fixed . For each and each ,the density is given by

where for every . We assume that the channeltransition densities , are fixed a priori, and do notinclude them in the parametric description of the sources. Wedo require, though, that

and

We now proceed to verify that Conditions 1–3 of Section III aremet.

Let denote the -step transitionprobability for states . The positivity of implies thatthe Markov chain is geometrically ergodic, i.e.,

(27)

where and ; see [42, Th. 8.9]. Note that (27)implies that

This in turn implies that the sequence is exponentially-mixing; see [24, Th. 3.10]. Now, one can show (see [24, Sec.

3.5.3]) that there exists a measurable mapping, such that , where is an i.i.d. se-

quence of random variables distributed uniformly on , in-dependently of . It is not hard to show that, if is exponentially

-mixing, then so is the bivariate process . Finally,because is given by a time-invariant deterministic functionof , the -mixing coefficients of are bounded by thecorresponding -mixing coefficients of , and so is ex-ponentially -mixing as well. Thus, for each , there existsa , such that , and conse-quently Condition 1 holds.

To show that Condition 2 holds, we again examine the asymp-totic behavior of the Fisher information matrix as .Under our assumptions on the state transition matrices in andon the channel transition densities , we can in-voke the results of [43, Sec. 6.2] to conclude that the asymptoticFisher information matrix exists (thoughit is not necessarily nonsingular). Thus, Condition 2 is satisfied.

Finally, we check Condition 3. The Yatracos class con-sists of all sets of the form

for all . The condition canbe written as , where for each ,is a polynomial of degree in the parameters , ,

. Thus, Lemma 2.2 implies that, so Condition 3 is satisfied as well.


VI. CONCLUSION AND FUTURE DIRECTIONS

We have shown that, given a parametric family of stationary mixing sources satisfying some regularity conditions, there exists a universal scheme for joint lossy compression and source identification, with the $n$th-order Lagrangian redundancy and the variational distance between the $n$-dimensional marginals of the true and the estimated source both converging to zero as $O(\sqrt{V_n \log n / n})$, as the block length $n$ tends to infinity. The sequence $\{V_n\}$ quantifies the learnability of the $n$-dimensional marginals. This generalizes our previous results from [9] and [10] for i.i.d. sources.

We can outline some directions for future research.

• Both in our earlier work [9], [10] and in this paper, we assume that the dimension of the parameter space is known a priori. It would be of interest to consider the case when the parameter space is finite dimensional, but its dimension is not known. Thus, we would have a hierarchical model class in which, for each dimension, the parameter space is an open subset of the Euclidean space of that dimension, and we could use a complexity regularization technique, such as "structural risk minimization" (see, e.g., [44] or [22, Ch. 6]), to adaptively trade off the estimation and the approximation errors.

• The minimum-distance density estimator of [34] and [35], which plays the key role in our scheme both here and in [9] and [10], is not easy to implement in practice, especially for multidimensional alphabets. On the other hand, there are two-stage universal schemes, such as that of [14], which do not require memory and select the second-stage code based on pointwise, rather than average, behavior of the source. These schemes, however, are geared toward compression, and do not emphasize identification. It would be worthwhile to devise practically implementable universal schemes that strike a reasonable compromise between these two objectives.

• Finally, neither here nor in our earlier work [9], [10] have we considered the issues of optimality. It would be of interest to obtain lower bounds on the performance of any universal scheme for joint lossy compression and identification, say, in the spirit of minimax lower bounds in statistical learning theory (cf., e.g., [21, Ch. 14]).

Conceptually, our results indicate that links between statistical modeling (parameter estimation) and universal source coding, exploited in the lossless case by Rissanen [2], [3], are present in the domain of lossy coding as well. We should also mention that another modeling-based approach to universal lossy source coding, due to Kontoyiannis and others (see, e.g., [45] and references therein), treats code selection as a statistical estimation problem over a class of model distributions in the reproduction space. This approach, while closer in spirit to Rissanen's minimum description length (MDL) principle [46], does not address the problem of joint source coding and identification, but it provides a complementary perspective on the connections between lossy source coding and statistical modeling.

APPENDIX

PROPERTIES OF LAGRANGE-OPTIMAL

VARIABLE-RATE QUANTIZERS

In this Appendix, we detail some properties of Lagrange-op-timal variable-rate vector quantizers. Our exposition is patternedon the work of Linder [19], with appropriate modifications.

As elsewhere in the paper, let be the source alphabetand the reproduction alphabet, both assumed to be Polishspaces. As before, let the distortion function be induced by a

-bounded metric on a Polish space containing .For every , define the metric on by

For any pair of probability measures on , letbe the set of all probability measures on

having and as marginals, and define theWasserstein metric

(see [47] for more details and applications.) Note that, becauseis a bounded metric

for all . Taking the infimum of both sidesover all and observing that

(see, e.g., [48, Sec. I.5]), we get the useful bound

(A.1)

Now, for each , let denote the set of all discrete proba-bility distributions on with finite entropy. That is,if and only if it is concentrated on a finite or a countable set

, and


For every discrete distribution $\nu$ with finite entropy, consider the set of all one-to-one maps from its support into binary strings, such that the resulting collection of strings satisfies the Kraft inequality, and let
$$L^*(\nu) \triangleq \min \sum_{y} \nu(y)\, \ell(y)$$
(the minimum over all such maps) be the minimum expected code length. Since the entropy of $\nu$ is finite, there is always a minimizing map, and the Shannon–Fano bound (see [1, Sec. 5.4]) guarantees that $L^*(\nu) \le H(\nu) + 1$.
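For a finite distribution, the minimum expected code length can be computed exactly with Huffman's algorithm; the sketch below (illustrative, with an arbitrary example distribution) verifies the Kraft inequality and the entropy bounds $H(\nu) \le L^*(\nu) < H(\nu) + 1$.

```python
import heapq
import math

def huffman_lengths(probs):
    """Optimal prefix-code lengths (in bits) for a finite distribution, computed
    by Huffman's algorithm on a min-heap of (probability, tiebreak id, symbols)."""
    if len(probs) == 1:
        return [1]
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)
    while len(heap) > 1:
        p1, _, ids1 = heapq.heappop(heap)
        p2, _, ids2 = heapq.heappop(heap)
        for i in ids1 + ids2:          # every merge adds one bit to these symbols
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, counter, ids1 + ids2))
        counter += 1
    return lengths

probs = [0.4, 0.25, 0.2, 0.1, 0.05]
lengths = huffman_lengths(probs)
entropy = -sum(p * math.log2(p) for p in probs)
avg_len = sum(p * l for p, l in zip(probs, lengths))
print(lengths)
print(sum(2.0 ** -l for l in lengths) <= 1.0)        # Kraft inequality
print(entropy, "<=", avg_len, "<", entropy + 1)      # Shannon-Fano bound
```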

Now, for any , any probability distribution on ,and any , define

To give an intuitive meaning to , let and bejointly distributed random variables with and

, such that their joint distribution achieves. Then, is the expected Lagrangian per-

formance, at Lagrange multiplier , of a stochastic variable-ratequantizer which encodes each point as a binary code-word with length and decodes it to in the support ofwith probability .

The following lemma shows that deterministic quantizers are as good as random ones.

Lemma A.1: Let $L_\lambda(P, C^n)$ be the expected Lagrangian performance of an $n$-block variable-rate quantizer $C^n$ operating on $X^n \sim P$, and let $L^*_\lambda(P) \triangleq \inf_{C^n} L_\lambda(P, C^n)$ be the expected Lagrangian performance, with respect to $P$, of the best $n$-block variable-rate quantizer. Then

$$L^*_\lambda(P) = \inf_{Q \in \mathcal{D}_n} \rho_\lambda(P, Q).$$

Proof: Consider any quantizer $C^n$, with encoder $\psi$ and decoder $\phi$, such that $L_\lambda(P, C^n) < \infty$. Let $Q$ be the distribution of $\widehat{X}^n = \phi(\psi(X^n))$, where $X^n \sim P$.

Clearly, $Q \in \mathcal{D}_n$, and

$$\rho_\lambda(P, Q) = \bar{\rho}_n(P, Q) + \lambda L_Q \le \mathbb{E}_P\bigl[\rho_n(X^n, \widehat{X}^n)\bigr] + \lambda\,\mathbb{E}_P\bigl[|\psi(X^n)|\bigr] = L_\lambda(P, C^n),$$

since the joint distribution of $(X^n, \widehat{X}^n)$ belongs to $\mathcal{M}(P, Q)$ and the codewords of $C^n$ satisfy the Kraft inequality.

Hence, $\inf_{Q \in \mathcal{D}_n} \rho_\lambda(P, Q) \le L^*_\lambda(P)$. To prove the reverse inequality, suppose that $X^n \sim P$ and $Y^n \sim Q$ achieve $\bar{\rho}_n(P, Q)$ for some $Q \in \mathcal{D}_n$. Let $M$ be their joint distribution. Let $\{y_1, y_2, \dots\}$ be the support of $Q$, let $\ell^* \in \mathcal{L}_Q$ achieve $L_Q$, and let $\{s_i \triangleq \ell^*(y_i)\}$ be the associated binary code. Define the quantizer $C^n$, with encoder $\psi$ and decoder $\phi$, by

$$\psi(x^n) = s_i \quad \text{if} \quad \rho_n(x^n, y_i) + \lambda|s_i| \le \rho_n(x^n, y_j) + \lambda|s_j| \ \text{ for all } j$$

(ties broken arbitrarily) and

$$\phi(s_i) = y_i.$$

Then

$$L_\lambda(P, C^n) = \mathbb{E}_P\Bigl[\min_i\bigl(\rho_n(X^n, y_i) + \lambda|s_i|\bigr)\Bigr].$$

On the other hand

$$\rho_\lambda(P, Q) = \mathbb{E}_M\bigl[\rho_n(X^n, Y^n)\bigr] + \lambda\,\mathbb{E}_Q\bigl[|\ell^*(Y^n)|\bigr] = \mathbb{E}_M\bigl[\rho_n(X^n, Y^n) + \lambda|\ell^*(Y^n)|\bigr] \ge \mathbb{E}_M\Bigl[\min_i\bigl(\rho_n(X^n, y_i) + \lambda|s_i|\bigr)\Bigr] = L_\lambda(P, C^n),$$

so that $L^*_\lambda(P) \le \rho_\lambda(P, Q)$; since $Q \in \mathcal{D}_n$ was arbitrary, $L^*_\lambda(P) \le \inf_{Q \in \mathcal{D}_n} \rho_\lambda(P, Q)$, and the lemma is proved.
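The derandomization step in the second half of the proof is easy to visualize numerically: given any coupling of $P$ with a discrete output distribution $Q$ and a Kraft-satisfying length assignment, the deterministic quantizer that maps each source point to the reproduction minimizing distortion plus $\lambda$ times codeword length never does worse than the stochastic quantizer defined by the coupling. The sketch below uses a toy finite setting with randomly generated bounded distortions; all names and numbers are illustrative.

```python
import numpy as np

def lagrangian_of_coupling(coupling, rho, lengths, lam):
    """Expected distortion + lambda * expected length of the stochastic
    quantizer that outputs y with probability coupling[x, y] / p[x]."""
    return float(np.sum(coupling * (rho + lam * lengths[None, :])))

def lagrangian_of_derandomized(p, rho, lengths, lam):
    """Same cost after mapping each x deterministically to its best reproduction."""
    per_x = (rho + lam * lengths[None, :]).min(axis=1)
    return float(np.dot(p, per_x))

rng = np.random.default_rng(1)
lam = 0.5
rho = rng.uniform(0.0, 1.0, size=(4, 3))   # toy bounded distortions rho(x, y)
lengths = np.array([1, 2, 2])              # lengths satisfy Kraft: 1/2 + 1/4 + 1/4 = 1
p = np.array([0.4, 0.3, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])
coupling = np.outer(p, q)                  # one particular coupling of p and q

det = lagrangian_of_derandomized(p, rho, lengths, lam)
sto = lagrangian_of_coupling(coupling, rho, lengths, lam)
print(det, "<=", sto)
assert det <= sto + 1e-12
```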

The following lemma gives a useful upper bound on the Lagrangian mismatch.

Lemma A.2: Let $P$, $Q$ be probability distributions on $\mathcal{X}^n$. Then

$$\bigl|L^*_\lambda(P) - L^*_\lambda(Q)\bigr| \le n\rho_{\max}\, d(P, Q).$$

Proof: Suppose, without loss of generality, that $L^*_\lambda(Q) \le L^*_\lambda(P)$. Let $Q' \in \mathcal{D}_n$ achieve $\rho_\lambda(Q, Q') = L^*_\lambda(Q)$ (or be arbitrarily close). Then

$$L^*_\lambda(P) \overset{\rm (a)}{\le} \rho_\lambda(P, Q') \overset{\rm (b)}{=} \bar{\rho}_n(P, Q') + \lambda L_{Q'} \overset{\rm (c)}{\le} \bar{\rho}_n(P, Q) + \bar{\rho}_n(Q, Q') + \lambda L_{Q'} \overset{\rm (d)}{\le} n\rho_{\max}\, d(P, Q) + L^*_\lambda(Q),$$

where in (a) we used Lemma A.1, in (b) we used the definition of $\rho_\lambda$, in (c) we used the fact that $\bar{\rho}_n$ is a metric and the triangle inequality, and in (d) we used the bound (A.1) together with $\bar{\rho}_n(Q, Q') + \lambda L_{Q'} = \rho_\lambda(Q, Q') = L^*_\lambda(Q)$.

Finally, Lemma A.3 below shows that, for bounded distortion functions, Lagrange-optimal quantizers have finite codebooks.

Lemma A.3: For positive integers $K$ and $m$, let $\mathcal{Q}_n(K, m)$ denote the set of all zero-memory variable-rate quantizers with block length $n$, such that for every $C^n \in \mathcal{Q}_n(K, m)$, the associated binary code $\mathcal{S}$ of $C^n$ satisfies $|\mathcal{S}| \le K$ and $|s| \le m$ for every $s \in \mathcal{S}$. Let $P$ be a probability distribution on $\mathcal{X}^n$. Then

$$L^*_\lambda(P) = \min_{C^n \in \mathcal{Q}_n(K, m)} L_\lambda(P, C^n)$$

with $K = 2^m$ and $m = \lceil 2n\rho_{\max}/\lambda \rceil$.

Proof: Let $C^n$, with encoder $\psi$, decoder $\phi$, and binary code $\mathcal{S}$, achieve the $n$th-order optimum $L^*_\lambda(P)$ for $P$.

Let $s^*$ be the shortest binary string in $\mathcal{S}$, i.e.,

$$|s^*| = \min_{s \in \mathcal{S}} |s|.$$

Without loss of generality, we can take $\psi$ as the minimum-distortion encoder, i.e.,

$$\psi(x^n) = \mathop{\arg\min}_{s \in \mathcal{S}}\,\bigl[\rho_n\bigl(x^n, \phi(s)\bigr) + \lambda|s|\bigr].$$

Thus, for any $x^n$ and any $s \in \mathcal{S}$

$$\rho_n\bigl(x^n, \phi(\psi(x^n))\bigr) + \lambda|\psi(x^n)| \le \rho_n\bigl(x^n, \phi(s)\bigr) + \lambda|s| \le n\rho_{\max} + \lambda|s|.$$

Hence, $|\psi(x^n)| \le |s^*| + n\rho_{\max}/\lambda$ for all $x^n$. Furthermore, $L^*_\lambda(P) = L_\lambda(P, C^n) \ge \lambda\,\mathbb{E}_P\bigl[|\psi(X^n)|\bigr] \ge \lambda|s^*|$.

Now pick an arbitrary reproduction string $z^n \in \widehat{\mathcal{X}}^n$, let $b$ be the empty binary string (of length zero), and let $C^n_0$ be the zero-rate quantizer with the constant encoder $\psi_0(x^n) \equiv b$ and the decoder $\phi_0(b) = z^n$. Then,

$$L_\lambda(P, C^n_0) = \mathbb{E}_P\bigl[\rho_n(X^n, z^n)\bigr] \le n\rho_{\max}.$$

On the other hand, $L_\lambda(P, C^n_0) \ge L^*_\lambda(P) \ge \lambda|s^*|$. Therefore

$$\lambda|s^*| \le n\rho_{\max},$$

so that $|s^*| \le n\rho_{\max}/\lambda$. Hence

$$|\psi(x^n)| \le |s^*| + \frac{n\rho_{\max}}{\lambda} \le \frac{2n\rho_{\max}}{\lambda} \le m$$

for all $x^n$. Since the strings in $\mathcal{S}$ must satisfy Kraft's inequality (and any string in $\mathcal{S}$ that is never produced by $\psi$ can be discarded), we have

$$|\mathcal{S}|\,2^{-m} \le \sum_{s \in \mathcal{S}} 2^{-|s|} \le 1,$$

which implies that $|\mathcal{S}| \le 2^m = K$.

ACKNOWLEDGMENT

The author would like to thank A. R. Barron, I. Kontoyiannis, and M. Madiman for stimulating discussions, and the anonymous reviewers for several useful suggestions that helped improve the paper.

REFERENCES

[1] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York: Wiley, 1991.

[2] J. Rissanen, “Universal coding, information, prediction, and estimation,” IEEE Trans. Inf. Theory, vol. IT-30, no. 4, pp. 629–636, Jul. 1984.

[3] J. Rissanen, “Fisher information and stochastic complexity,” IEEE Trans. Inf. Theory, vol. 42, no. 1, pp. 40–47, Jan. 1996.

[4] J. Ziv and A. Lempel, “Compression of individual sequences by variable-rate coding,” IEEE Trans. Inf. Theory, vol. IT-24, no. 5, pp. 530–536, Sep. 1978.

[5] J. C. Kieffer, “Strongly consistent code-based identification and order estimation for constrained finite-state model classes,” IEEE Trans. Inf. Theory, vol. 39, no. 3, pp. 893–902, May 1993.

[6] N. Merhav, “Bounds on achievable convergence rates of parameter estimation via universal coding,” IEEE Trans. Inf. Theory, vol. 40, no. 4, pp. 1210–1215, Jul. 1994.

[7] T. Weissman and E. Ordentlich, “The empirical distribution of rate-constrained source codes,” IEEE Trans. Inf. Theory, vol. 51, no. 11, pp. 3718–3733, Nov. 2005.

[8] G. Tao, Adaptive Control Design and Analysis. Hoboken, NJ: Wiley, 2003.

[9] M. Raginsky, “Joint fixed-rate universal lossy coding and identification of continuous-alphabet memoryless sources,” IEEE Trans. Inf. Theory, vol. 54, no. 7, pp. 3059–3077, Jul. 2008.

[10] M. Raginsky, “Joint universal lossy coding and identification of i.i.d. vector sources,” in Proc. IEEE Int. Symp. Inf. Theory, Seattle, WA, Jul. 2006, pp. 577–581.

[11] E.-H. Yang and Z. Zhang, “On the redundancy of lossy source coding with abstract alphabets,” IEEE Trans. Inf. Theory, vol. 45, no. 4, pp. 1092–1110, May 1999.

[12] V. N. Vapnik and A. Y. Chervonenkis, “On the uniform convergence of relative frequencies of events to their probabilities,” Theory Probab. Appl., vol. 16, pp. 264–280, 1971.

[13] P. A. Chou, T. Lookabaugh, and R. M. Gray, “Entropy-constrained vector quantization,” IEEE Trans. Acoust. Speech Signal Process., vol. 37, no. 1, pp. 31–42, Jan. 1989.

[14] P. A. Chou, M. Effros, and R. M. Gray, “A vector quantization approach to universal noiseless coding and quantization,” IEEE Trans. Inf. Theory, vol. 42, no. 4, pp. 1109–1138, Jul. 1996.

[15] R. M. Gray, Entropy and Information Theory. New York: Springer-Verlag, 1990.

[16] V. A. Volkonskii and Y. A. Rozanov, “Some limit theorems for random functions, I,” Theory Probab. Appl., vol. 4, pp. 178–197, 1959.

[17] V. A. Volkonskii and Y. A. Rozanov, “Some limit theorems for random functions, II,” Theory Probab. Appl., vol. 6, pp. 186–198, 1961.

[18] D. L. Neuhoff and R. K. Gilbert, “Causal source codes,” IEEE Trans. Inf. Theory, vol. IT-28, no. 5, pp. 701–713, Sep. 1982.

[19] T. Linder, “Learning-theoretic methods in vector quantization,” in Principles of Nonparametric Learning, L. Györfi, Ed. New York: Springer-Verlag, 2001.

[20] M. Effros, P. A. Chou, and R. M. Gray, “Variable-rate source coding theorems for stationary nonergodic sources,” IEEE Trans. Inf. Theory, vol. 40, no. 6, pp. 1920–1925, Nov. 1994.

[21] L. Devroye, L. Györfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition. New York: Springer-Verlag, 1996.

[22] V. N. Vapnik, Statistical Learning Theory. New York: Wiley, 1998.

[23] L. Devroye and G. Lugosi, Combinatorial Methods in Density Estimation. New York: Springer-Verlag, 2001.

[24] M. Vidyasagar, Learning and Generalization, 2nd ed. London, U.K.: Springer-Verlag, 2003.

[25] N. Sauer, “On the density of families of sets,” J. Combin. Theory Series A, vol. 13, pp. 145–147, 1972.

[26] R. M. Dudley, “Central limit theorems for empirical measures,” Ann. Probab., vol. 6, pp. 898–929, 1978.

[27] M. Karpinski and A. Macintyre, “Polynomial bounds for VC dimension of sigmoidal and general Pfaffian neural networks,” J. Comput. Syst. Sci., vol. 54, pp. 169–176, 1997.

[28] S. N. Bernstein, “Sur l’extension du théorème limite du calcul des probabilités aux sommes de quantités dependantes,” Math. Ann., vol. 97, pp. 1–59, 1927.

[29] B. Yu, “Rates of convergence for empirical processes of stationary mixing sequences,” Ann. Probab., vol. 22, no. 1, pp. 94–116, 1994.

[30] R. Meir, “Nonparametric time series prediction through adaptive model selection,” Mach. Learn., vol. 39, pp. 5–34, 2000.

[31] A. Mokkadem, “Mixing properties of ARMA processes,” Stochastic Process. Appl., vol. 29, pp. 309–315, 1988.

[32] B. S. Clarke and A. R. Barron, “Information-theoretic asymptotics of Bayes methods,” IEEE Trans. Inf. Theory, vol. 36, no. 3, pp. 453–471, May 1990.

[33] Y. G. Yatracos, “Rates of convergence of minimum distance estimates and Kolmogorov’s entropy,” Ann. Math. Statist., vol. 13, pp. 768–774, 1985.

[34] L. Devroye and G. Lugosi, “A universally acceptable smoothing factor for kernel density estimation,” Ann. Statist., vol. 24, pp. 2499–2512, 1996.

[35] L. Devroye and G. Lugosi, “Nonasymptotic universal smoothing factors, kernel complexity and Yatracos classes,” Ann. Statist., vol. 25, pp. 2626–2637, 1997.

[36] P. Elias, “Universal codeword sets and representations of the integers,” IEEE Trans. Inf. Theory, vol. IT-21, no. 2, pp. 194–203, Mar. 1975.

[37] L. Devroye and L. Györfi, “Distribution and density estimation,” in Principles of Nonparametric Learning, L. Györfi, Ed. New York: Springer-Verlag, 2001.

[38] I. Kontoyiannis and J. Zhang, “Arbitrary source models and Bayesian codebooks in rate-distortion theory,” IEEE Trans. Inf. Theory, vol. 48, no. 8, pp. 2276–2290, Aug. 2002.

[39] A. Klein and P. Spreij, “The Bezoutian, state space realizations and Fisher’s information matrix of an ARMA process,” Lin. Algebra Appl., vol. 416, pp. 160–174, 2006.

[40] P. J. Bickel, Y. Ritov, and T. Rydén, “Asymptotic normality of the maximum-likelihood estimator for general hidden Markov models,” Ann. Statist., vol. 26, no. 4, pp. 1614–1635, 1997.

[41] Y. Ephraim and N. Merhav, “Hidden Markov processes,” IEEE Trans. Inf. Theory, vol. 48, no. 6, pp. 1518–1569, Jun. 2002.

[42] P. Billingsley, Probability and Measure, 3rd ed. New York: Wiley, 1995.

[43] R. Douc, É. Moulines, and T. Rydén, “Asymptotic properties of the maximum likelihood estimator in autoregressive models with Markov regime,” Ann. Statist., vol. 32, no. 5, pp. 2254–2304, 2004.

[44] G. Lugosi and K. Zeger, “Concept learning using complexity regularization,” IEEE Trans. Inf. Theory, vol. 42, no. 1, pp. 48–54, Jan. 1996.

[45] M. Madiman and I. Kontoyiannis, “Second-order properties of lossy likelihoods and the MLE/MDL dichotomy in lossy compression,” Brown University, APPTS Report-5, 2004 [Online]. Available: http://www.dam.brown.edu/ptg/REPORTS/04-5.pdf

[46] A. Barron, J. Rissanen, and B. Yu, “Minimum description length principle in coding and modeling,” IEEE Trans. Inf. Theory, vol. 44, no. 6, pp. 2743–2760, Oct. 1998.

[47] R. M. Gray, D. L. Neuhoff, and P. S. Shields, “A generalization of Ornstein’s d̄ distance with applications to information theory,” Ann. Probab., vol. 3, no. 2, pp. 315–328, 1975.

[48] T. Lindvall, Lectures on the Coupling Method. New York: Dover, 2002.

Maxim Raginsky (S’99–M’00) received the B.S. and M.S. degrees in 2000 and the Ph.D. degree in 2002 from Northwestern University, Chicago, IL, all in electrical engineering.

From 2002 to 2004, he was a Postdoctoral Researcher at the Center for Photonic Communication and Computing, Northwestern University, where he pursued work on quantum cryptography and quantum communication and information theory. From 2004 to 2007, he was a Beckman Foundation Postdoctoral Fellow at the University of Illinois in Urbana-Champaign, where he carried out research on information theory, statistical learning and computational neuroscience. In September 2007, he joined the Department of Electrical and Computer Engineering at Duke University as a Research Scientist. His interests include statistical signal processing, information theory, statistical learning, and nonparametric estimation. He is particularly interested in problems that combine the communication, signal processing, and machine learning components in a novel and nontrivial way, as well as in the theory and practice of robust statistical inference with limited information.
