
Lecture Notes in Information Theory

Volume II

by

Po-Ning Chen† and Fady Alajaji‡

† Department of Electrical Engineering, Institute of Communication Engineering

National Chiao Tung University, 1001 Ta Hsueh Road

Hsin Chu, Taiwan 30056, Republic of China

Email: [email protected]

‡ Department of Mathematics & Statistics, Queen's University, Kingston, ON K7L 3N6, Canada

Email: [email protected]

December 7, 2014

© Copyright by Po-Ning Chen† and Fady Alajaji‡

December 7, 2014

Preface

The reliable transmission of information bearing signals over a noisy communication channel is at the heart of what we call communication. Information theory—founded by Claude E. Shannon in 1948—provides a mathematical framework for the theory of communication; it describes the fundamental limits to how efficiently one can encode information and still be able to recover it with negligible loss. This course will examine the basic concepts of this theory. What follows is a tentative list of topics to be covered.

1. Volume I:

(a) Fundamentals of source coding (data compression): Discrete memoryless sources, entropy, redundancy, block encoding, variable-length encoding, Kraft inequality, Shannon code, Huffman code.

(b) Fundamentals of channel coding: Discrete memoryless channels, mutual information, channel capacity, coding theorem for discrete memoryless channels, weak converse, channel capacity with output feedback, the Shannon joint source-channel coding theorem.

(c) Source coding with distortion (rate distortion theory): Discrete memoryless sources, rate-distortion function and its properties, rate-distortion theorem.

(d) Other topics: Information measures for continuous random variables, capacity of discrete-time and band-limited continuous-time Gaussian channels, rate-distortion function of the memoryless Gaussian source, encoding of discrete sources with memory, capacity of discrete channels with memory.

(e) Fundamental background on real analysis and probability (Appendix): The concept of set, supremum and maximum, infimum and minimum, boundedness, sequences and their limits, equivalence, probability space, random variable and random process, relation between a source and a random process, convergence of sequences of random variables, ergodicity and laws of large numbers, central limit theorem, concavity and convexity, Jensen's inequality.


2. Volume II:

(a) General information measures: Information spectrum and quantile and their properties, Renyi's information measures.

(b) Advanced topics of lossless data compression: Fixed-length lossless data compression theorem for arbitrary sources, variable-length lossless data compression theorem for arbitrary sources, entropy of English, Lempel-Ziv code.

(c) Measure of randomness and resolvability: Resolvability and source coding, approximation of output statistics for arbitrary channels.

(d) Advanced topics of channel coding: Channel capacity for arbitrary single-user channels, optimistic Shannon coding theorem, strong capacity, ε-capacity.

(e) Advanced topics of lossy data compression.

(f) Hypothesis testing: Error exponent and divergence, large deviations theory, Berry-Esseen theorem.

(g) Channel reliability: Random coding exponent, expurgated exponent, partitioning exponent, sphere-packing exponent, the asymptotic largest minimum distance of block codes, Elias bound, Varshamov-Gilbert bound, Bhattacharyya distance.

(h) Information theory of networks: Distributed detection, data compression over distributed sources, capacity of multiple access channels, degraded broadcast channel, Gaussian multiple terminal channels.

As shown in the list, the lecture notes are divided into two volumes. The first volume is suitable for a 12-week introductory course such as the one given at the Department of Mathematics and Statistics, Queen's University at Kingston, Canada. It also meets the needs of a fundamental course for senior undergraduates, such as the one given at the Department of Computer Science and Information Engineering, National Chi Nan University, Taiwan. For an 18-week graduate course such as the one given at the Department of Communications Engineering, National Chiao-Tung University, Taiwan, the lecturer can selectively add advanced topics from the second volume to enrich the lecture content and provide students with a more complete and advanced view of information theory.

The authors are very much indebted to all people who provided insightful comments on these lecture notes. Special thanks are devoted to Prof. Yunghsiang S. Han of the Department of Computer Science and Information Engineering, National Chi Nan University, Taiwan, for his enthusiasm in testing these lecture notes at his school and for providing the authors with valuable feedback.


Notes to readers. In these notes, all the assumptions, claims, conjectures, corollaries, definitions, examples, exercises, lemmas, observations, properties, and theorems are numbered under the same counter for ease of searching. For example, the lemma that immediately follows Theorem 2.1 will be numbered as Lemma 2.2 instead of Lemma 2.1.

In addition, you may obtain the latest version of the lecture notes from http://shannon.cm.nctu.edu.tw. Interested readers are welcome to return comments to [email protected].


Acknowledgements

Thanks are given to our families for their full support during the period of writing these lecture notes.


Table of Contents

List of Tables

List of Figures

1 Introduction
   1.1 Notations

2 Generalized Information Measures for Arbitrary System Statistics
   2.1 Spectrum and Quantile
   2.2 Properties of quantile
   2.3 Generalized information measures
   2.4 Properties of generalized information measures
   2.5 Examples for the computation of general information measures
   2.6 Renyi's information measures

3 General Lossless Data Compression Theorems
   3.1 Fixed-length data compression codes for arbitrary sources
   3.2 Generalized AEP theorem
   3.3 Variable-length lossless data compression codes that minimize the exponentially weighted codeword length
       3.3.1 Criterion for optimality of codes
       3.3.2 Source coding theorem for Renyi's entropy
   3.4 Entropy of English
       3.4.1 Markov estimate of entropy rate of English text
       3.4.2 Gambling estimate of entropy rate of English text
             A) Sequential gambling
   3.5 Lempel-Ziv code revisited

4 Measure of Randomness for Stochastic Processes
   4.1 Motivation for resolvability: measure of randomness of random variables
   4.2 Notations and definitions regarding resolvability
   4.3 Operational meanings of resolvability and mean-resolvability
   4.4 Resolvability and source coding

5 Channel Coding Theorems and Approximations of Output Statistics for Arbitrary Channels
   5.1 General models for channels
   5.2 Variations of capacity formulas for arbitrary channels
       5.2.1 Preliminaries
       5.2.2 ε-capacity
       5.2.3 General Shannon capacity
       5.2.4 Strong capacity
       5.2.5 Examples
   5.3 Structures of good data transmission codes
   5.4 Approximations of output statistics: resolvability for channels
       5.4.1 Motivations
       5.4.2 Notations and definitions of resolvability for channels
       5.4.3 Results on resolvability and mean-resolvability for channels

6 Optimistic Shannon Coding Theorems for Arbitrary Single-User Systems
   6.1 Motivations
   6.2 Optimistic source coding theorems
   6.3 Optimistic channel coding theorems
   6.4 Examples
       6.4.1 Information stable channels
       6.4.2 Information unstable channels

7 Lossy Data Compression
   7.1 General lossy source compression for block codes

8 Hypothesis Testing
   8.1 Error exponent and divergence
       8.1.1 Composition of sequence of i.i.d. observations
       8.1.2 Divergence typical set on composition
       8.1.3 Universal source coding on compositions
       8.1.4 Likelihood ratio versus divergence
       8.1.5 Exponent of Bayesian cost
   8.2 Large deviations theory
       8.2.1 Tilted or twisted distribution
       8.2.2 Conventional twisted distribution
       8.2.3 Cramer's theorem
       8.2.4 Exponent and moment generating function: an example
   8.3 Theories on large deviations
       8.3.1 Extension of Gartner-Ellis upper bounds
       8.3.2 Extension of Gartner-Ellis lower bounds
       8.3.3 Properties of (twisted) sup- and inf-large deviation rate functions
   8.4 Probabilistic subexponential behavior
       8.4.1 Berry-Esseen theorem for compound i.i.d. sequence
       8.4.2 Berry-Esseen theorem with a sample-size dependent multiplicative coefficient for i.i.d. sequence
       8.4.3 Probability bounds using Berry-Esseen inequality
   8.5 Generalized Neyman-Pearson hypothesis testing

9 Channel Reliability Function
   9.1 Random-coding exponent
       9.1.1 The properties of random coding exponent
   9.2 Expurgated exponent
       9.2.1 The properties of expurgated exponent
   9.3 Partitioning bound: an upper bound for channel reliability
   9.4 Sphere-packing exponent: an upper bound on the channel reliability
       9.4.1 Problem of sphere-packing
       9.4.2 Relation of sphere-packing and coding
       9.4.3 The largest minimum distance of block codes
             A) Distance-spectrum formula on the largest minimum distance of block codes
             B) Determination of the largest minimum distance for a class of distance functions
             C) General properties of distance-spectrum function
             D) General Varshamov-Gilbert lower bound
       9.4.4 Elias bound: a single-letter upper bound formula on the largest minimum distance for block codes
       9.4.5 Gilbert bound and Elias bound for Hamming distance and binary alphabet
       9.4.6 Bhattacharyya distance and expurgated exponent
   9.5 Straight line bound

10 Information Theory of Networks
   10.1 Lossless data compression over distributed sources for block codes
        10.1.1 Full decoding of the original sources
        10.1.2 Partial decoding of the original sources
   10.2 Distributed detection
        10.2.1 Neyman-Pearson testing in parallel distributed detection
        10.2.2 Bayes testing in parallel distributed detection systems
   10.3 Capacity region of multiple access channels
   10.4 Degraded broadcast channel
   10.5 Gaussian multiple terminal channels

List of Tables

2.1 Generalized entropy measures, where δ ∈ [0, 1]
2.2 Generalized mutual information measures, where δ ∈ [0, 1]
2.3 Generalized divergence measures, where δ ∈ [0, 1]

List of Figures

2.1 The asymptotic CDFs of a sequence of random variables $\{A_n\}_{n=1}^{\infty}$: $\bar{u}(\cdot)$ = sup-spectrum of $A_n$; $\underline{u}(\cdot)$ = inf-spectrum of $A_n$; $\bar{U}_{1^-} = \lim_{\xi\uparrow 1}\bar{U}_\xi$
2.2 The spectrum $h(\theta)$ for Example 2.7
2.3 The spectrum $i(\theta)$ for Example 2.7
2.4 The limiting spectrums of $(1/n)h_{Z^n}(Z^n)$ for Example 2.8
2.5 The possible limiting spectrums of $(1/n)i_{X^n,Y^n}(X^n;Y^n)$ for Example 2.8
3.1 Behavior of the probability of block decoding error as the block length n goes to infinity for an arbitrary source X
3.2 Illustration of the generalized AEP theorem; $F_n(\delta;\varepsilon) \triangleq T_n[\bar{H}_\varepsilon(X)+\delta]\setminus T_n[\bar{H}_\varepsilon(X)-\delta]$ is the dashed region
3.3 Notations used in the Lempel-Ziv coder
4.1 Source generator: $\{X_t\}_{t\in I}$ ($I=(0,1)$) is an independent random process with $P_{X_t}(0)=1-P_{X_t}(1)=t$, independent of the selector $Z$; $X_t$ is output if $Z=t$, and the source generator at each time instance is temporally independent
4.2 The ultimate CDF of $-(1/n)\log P_{X^n}(X^n)$: $\Pr\{h_b(Z)\le t\}$
5.1 The ultimate CDFs of $(1/n)\log P_{N^n}(N^n)$
5.2 The ultimate CDF of the normalized information density for Example 5.14, Case B)
5.3 The communication system
5.4 The simulated communication system
7.1 $\lambda_{Z,f(Z)}(D+\gamma) > \varepsilon \Rightarrow \sup[\theta : \lambda_{Z,f(Z)}(\theta)\le\varepsilon] \le D+\gamma$
7.2 The CDF of $(1/n)\rho_n(Z^n, f_n(Z^n))$ for the probability-of-error distortion measure
8.1 The geometric meaning of Sanov's theorem
8.2 The divergence view on hypothesis testing
8.3 Function of $(\pi/6)u - h(u)$
8.4 The Berry-Esseen constant as a function of the sample size n (plotted in log scale)
9.1 BSC with crossover probability ε and input distribution $(p, 1-p)$
9.2 Random coding exponent for the BSC with crossover probability 0.2; also plotted is $s^* = \arg\sup_{0\le s\le 1}[-sR - E_0(s)]$; $R_{cr} = 0.056633$
9.3 Expurgated exponent (solid line) and random coding exponent (dashed line) for the BSC with crossover probability 0.2 (over the range (0, 0.192745))
9.4 Expurgated exponent (solid line) and random coding exponent (dashed line) for the BSC with crossover probability 0.2 (over the range (0, 0.006))
9.5 Partitioning exponent (thick line), random coding exponent (thin line) and expurgated exponent (thin line) for the BSC with crossover probability 0.2
9.6 (a) The shaded area is $U^c_m$; (b) the shaded area is $A^c_{m,m'}$
9.7 $\Lambda_X(R)$ asymptotically lies between $\mathrm{ess\,inf}\,(1/n)\mu_n(X^n,\hat{X}^n)$ and $(1/n)E[\mu_n(X^n,\hat{X}^n)]$ for $R_p(X) < R < R_0(X)$
9.8 General curve of $\Lambda_X(R)$
9.9 Function of $\sup_{s>0}\{-sR - s\log[2s(1-e^{-1/(2s)})]\}$
9.10 Function of $\sup_{s>0}\{-sR - s\log[(2+e^{-1/s})/4]\}$
10.1 The multi-access channel
10.2 Distributed detection with n senders; each observation $Y_i$ may come from one of two categories, and the final decision $D \in \{H_0, H_1\}$
10.3 Distributed detection in $S_n$
10.4 Bayes error probabilities associated with g and g̃
10.5 Upper and lower bounds on $e^*_{NP}(\alpha)$ in Case B

Chapter 1

Introduction

This volume covers some advanced topics in information theory. The mathematical background on which these topics are based can be found in Appendices A and B of Volume I.

1.1 Notations

Here, we clarify some of the easily confused notations used in Volume II of the lecture notes.

For a random variable $X$, we use $P_X$ to denote its distribution. For convenience, we will use interchangeably the two expressions for the probability of $X = x$, i.e., $\Pr\{X = x\}$ and $P_X(x)$. Similarly, for the probability of a set characterized through an inequality, such as $f(x) < a$, its probability mass will be expressed by either
$$P_X\{x \in \mathcal{X} : f(x) < a\}$$
or
$$\Pr\{f(X) < a\}.$$
In the second expression, we view $f(X)$ as a new random variable defined through $X$ and a function $f(\cdot)$.

Obviously, the above expressions can be applied to any legitimate function $f(\cdot)$ defined over $\mathcal{X}$, including any probability function $P_X(\cdot)$ (or $\log P_X(\cdot)$) of a random variable $X$. Therefore, the next two expressions denote the probability of $f(x) = P_X(x) < a$ evaluated under the distribution $P_X$:
$$P_X\{x \in \mathcal{X} : f(x) < a\} = P_X\{x \in \mathcal{X} : P_X(x) < a\}$$
and
$$\Pr\{f(X) < a\} = \Pr\{P_X(X) < a\}.$$

As a result, if we write
$$P_{X,Y}\left\{(x,y) \in \mathcal{X}\times\mathcal{Y} : \log\frac{P_{X,Y}(x,y)}{P_X(x)P_Y(y)} < a\right\} = \Pr\left\{\log\frac{P_{X,Y}(X,Y)}{P_X(X)P_Y(Y)} < a\right\},$$
it means that we define a new function
$$f(x,y) \triangleq \log\frac{P_{X,Y}(x,y)}{P_X(x)P_Y(y)}$$
in terms of the joint distribution $P_{X,Y}$ and its two marginal distributions, and are concerned with the probability of $f(x,y) < a$, where $(x,y)$ is drawn according to $P_{X,Y}$.
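As a quick numerical illustration of this convention, the following sketch (a made-up toy distribution, not one used in the notes) checks that summing $P_X$ over $\{x : f(x) < a\}$ agrees with evaluating $\Pr\{f(X) < a\}$ when $f(x) = P_X(x)$.

```python
# Toy check that P_X{x : f(x) < a} = Pr{P_X(X) < a}; the distribution is illustrative only.
P_X = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
a = 0.2                                              # threshold

f = lambda x: P_X[x]                                 # here f(x) = P_X(x)
lhs = sum(p for x, p in P_X.items() if f(x) < a)     # P_X{x : f(x) < a}
rhs = sum(P_X[x] for x in P_X if P_X[x] < a)         # Pr{P_X(X) < a}
print(lhs, rhs)                                      # both equal 0.2 (= 0.15 + 0.05)
```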

Chapter 2

Generalized Information Measures for Arbitrary System Statistics

In Volume I of the lecture notes, we showed that the entropy, defined by
$$H(X) \triangleq -\sum_{x\in\mathcal{X}} P_X(x)\log P_X(x) = E_X[-\log P_X(X)] \quad \text{nats},$$
of a discrete random variable $X$ is a measure of the average amount of uncertainty in $X$. An extension of this definition to a sequence of random variables $X_1, X_2, \ldots, X_n, \ldots$ is the entropy rate, which is given by
$$\lim_{n\to\infty}\frac{1}{n}H(X^n) = \lim_{n\to\infty}\frac{1}{n}E\left[-\log P_{X^n}(X^n)\right],$$
assuming the limit exists. The above quantities have an operational significance established via Shannon's coding theorems when the stochastic systems under consideration satisfy certain regularity conditions, such as stationarity and ergodicity [3, 5]. However, in more complicated situations, such as when the systems are non-stationary or time-varying, these information rates are no longer valid and lose their operational significance. This results in the need to establish new entropy measures which appropriately characterize the operational limits of arbitrary stochastic systems.

Let us begin with the model of arbitrary system statistics. In general, there are two indices for observations: a time index and a space index. When a sequence of observations is denoted by $X_1, X_2, \ldots, X_n, \ldots$, the subscript $i$ of $X_i$ can be treated as either a time index or a space index, but not both. Hence, when a sequence of observations is a function of both time and space, the notation $X_1, X_2, \ldots, X_n, \ldots$ is by no means sufficient; a new model for a time-varying multiple-sensor system, such as
$$X^{(n)}_1, X^{(n)}_2, \ldots, X^{(n)}_t, \ldots,$$
where $t$ is the time index and $n$ is the space or position index (or vice versa), becomes necessary.

When block-wise compression of such a source (with block length $n$) is considered, the same question as for the compression of an i.i.d. source arises:

what is the minimum compression rate (bits per source sample) for which the error can be made arbitrarily small as the block length goes to infinity? (2.0.1)

To answer the question, information theorists have to find a sequence of data compression codes for each block length $n$ and investigate whether the decompression error goes to zero as $n$ approaches infinity. However, unlike the simple source models considered in Volume I, the arbitrary source may exhibit distinct statistics at each sample for each block length $n$, i.e.,
$$\begin{array}{ll}
n=1: & X^{(1)}_1\\
n=2: & X^{(2)}_1, X^{(2)}_2\\
n=3: & X^{(3)}_1, X^{(3)}_2, X^{(3)}_3\\
n=4: & X^{(4)}_1, X^{(4)}_2, X^{(4)}_3, X^{(4)}_4\\
& \vdots
\end{array} \qquad (2.0.2)$$
and the statistics of $X^{(4)}_1$ could differ from those of $X^{(1)}_1$, $X^{(2)}_1$ and $X^{(3)}_1$. Since this is the most general model for the question in (2.0.1), and the system statistics can be arbitrarily defined, it is named the arbitrary statistics system.

In notation, the triangular array of random variables in (2.0.2) is often denoted by a boldface letter as
$$\boldsymbol{X} \triangleq \{X^n\}_{n=1}^{\infty},$$
where $X^n \triangleq \left(X^{(n)}_1, X^{(n)}_2, \ldots, X^{(n)}_n\right)$; for convenience, the above statement is sometimes abbreviated as
$$\boldsymbol{X} \triangleq \left\{X^n = \left(X^{(n)}_1, X^{(n)}_2, \ldots, X^{(n)}_n\right)\right\}_{n=1}^{\infty}.$$

In this chapter, we will first introduce a new concept for defining information measures for arbitrary system statistics and then analyze their algebraic properties in detail. In the next chapter, we will utilize the new measures to establish general source coding theorems for arbitrary finite-alphabet sources.

2.1 Spectrum and Quantile

Definition 2.1 (inf/sup-spectrum) If $\{A_n\}_{n=1}^{\infty}$ is a sequence of random variables, then its inf-spectrum $\underline{u}(\cdot)$ and its sup-spectrum $\bar{u}(\cdot)$ are defined by
$$\underline{u}(\theta) \triangleq \liminf_{n\to\infty}\Pr\{A_n \le \theta\}$$
and
$$\bar{u}(\theta) \triangleq \limsup_{n\to\infty}\Pr\{A_n \le \theta\}.$$

In other words, $\underline{u}(\cdot)$ and $\bar{u}(\cdot)$ are respectively the liminf and the limsup of the cumulative distribution function (CDF) of $A_n$. Note that by definition, the CDF of $A_n$, namely $\Pr\{A_n \le \theta\}$, is non-decreasing and right-continuous. However, for $\underline{u}(\cdot)$ and $\bar{u}(\cdot)$, only the non-decreasing property remains.¹

¹ It is pertinent to also point out that even if we do not require right-continuity as a fundamental property of a CDF, the spectrums $\underline{u}(\cdot)$ and $\bar{u}(\cdot)$ are not necessarily legitimate CDFs of (conventional real-valued) random variables, since there might exist cases where the "probability mass escapes to infinity" (cf. [1, p. 346]). A necessary and sufficient condition for $\underline{u}(\cdot)$ and $\bar{u}(\cdot)$ to be conventional CDFs (without requiring right-continuity) is that the sequence of distribution functions of $A_n$ is tight [1, p. 346]. Tightness is guaranteed if the alphabet of $A_n$ is finite.

Definition 2.2 (quantile of inf/sup-spectrum) For any $0 \le \delta \le 1$, the quantiles² $\underline{U}_\delta$ and $\bar{U}_\delta$ of the sup-spectrum and the inf-spectrum are defined by³
$$\underline{U}_\delta \triangleq \sup\{\theta : \bar{u}(\theta) \le \delta\}$$
and
$$\bar{U}_\delta \triangleq \sup\{\theta : \underline{u}(\theta) \le \delta\},$$
respectively. It follows from the above definitions that $\underline{U}_\delta$ and $\bar{U}_\delta$ are right-continuous and non-decreasing in $\delta$. Note that the supremum of an empty set is defined to be $-\infty$.

² Generally speaking, one can define the "quantile" in four different ways:
$$\bar{U}_\delta \triangleq \sup\{\theta : \liminf_{n\to\infty}\Pr[A_n \le \theta] \le \delta\}, \qquad \bar{U}_{\delta^-} \triangleq \sup\{\theta : \liminf_{n\to\infty}\Pr[A_n \le \theta] < \delta\},$$
$$\bar{U}^+_\delta \triangleq \sup\{\theta : \liminf_{n\to\infty}\Pr[A_n < \theta] \le \delta\}, \qquad \bar{U}^+_{\delta^-} \triangleq \sup\{\theta : \liminf_{n\to\infty}\Pr[A_n < \theta] < \delta\}.$$
The general relations between these four quantities are as follows:
$$\bar{U}_{\delta^-} = \bar{U}^+_{\delta^-} \le \bar{U}_\delta = \bar{U}^+_\delta.$$
Obviously, $\bar{U}_{\delta^-} \le \bar{U}_\delta \le \bar{U}^+_\delta$ and $\bar{U}_{\delta^-} \le \bar{U}^+_{\delta^-}$ by their definitions. It remains to show that $\bar{U}^+_{\delta^-} \le \bar{U}_\delta$, that $\bar{U}^+_\delta \le \bar{U}_\delta$, and that $\bar{U}^+_{\delta^-} \le \bar{U}_{\delta^-}$.
Suppose $\bar{U}^+_{\delta^-} > \bar{U}_\delta + \gamma$ for some $\gamma > 0$. Then by definition of $\bar{U}^+_{\delta^-}$, $\liminf_{n\to\infty}\Pr[A_n < \bar{U}_\delta + \gamma] < \delta$, which implies $\liminf_{n\to\infty}\Pr[A_n \le \bar{U}_\delta + \gamma/2] \le \liminf_{n\to\infty}\Pr[A_n < \bar{U}_\delta + \gamma] < \delta$ and violates the definition of $\bar{U}_\delta$. This completes the proof of $\bar{U}^+_{\delta^-} \le \bar{U}_\delta$. To prove that $\bar{U}^+_\delta \le \bar{U}_\delta$, note that from the definition of $\bar{U}^+_\delta$, we have that for any $\varepsilon > 0$, $\liminf_{n\to\infty}\Pr[A_n < \bar{U}^+_\delta - \varepsilon] \le \delta$ and hence $\liminf_{n\to\infty}\Pr[A_n \le \bar{U}^+_\delta - 2\varepsilon] \le \delta$, implying that $\bar{U}^+_\delta - 2\varepsilon \le \bar{U}_\delta$. Letting $\varepsilon \downarrow 0$ yields $\bar{U}^+_\delta \le \bar{U}_\delta$, which completes the proof. Proving that $\bar{U}^+_{\delta^-} \le \bar{U}_{\delta^-}$ follows a similar argument.
It is worth noting that $\bar{U}_{\delta^-} = \lim_{\xi\uparrow\delta}\bar{U}_\xi$. Their equality can be proved by first observing that $\bar{U}_{\delta^-} \ge \lim_{\xi\uparrow\delta}\bar{U}_\xi$ by their definitions, and then assuming that $\gamma \triangleq \bar{U}_{\delta^-} - \lim_{\xi\uparrow\delta}\bar{U}_\xi > 0$. Then $\bar{U}_\xi < \bar{U}_\xi + \gamma/2 \le \bar{U}_{\delta^-} - \gamma/2$ implies that $\underline{u}(\bar{U}_{\delta^-} - \gamma/2) > \xi$ for $\xi$ arbitrarily close to $\delta$ from below, which in turn implies $\underline{u}(\bar{U}_{\delta^-} - \gamma/2) \ge \delta$, contradicting the definition of $\bar{U}_{\delta^-}$. In the lecture notes, we will interchangeably use $\bar{U}_{\delta^-}$ and $\lim_{\xi\uparrow\delta}\bar{U}_\xi$ for convenience.
The final note is that $\bar{U}^+_{\delta^-}$ and $\bar{U}^+_\delta$ will not be used in defining our general information measures. They are introduced only for mathematical interest.

³ Note that the usual definition of the quantile function $\phi(\delta)$ of a non-decreasing function $F(\cdot)$ is slightly different from our definition [1, p. 190], where $\phi(\delta) \triangleq \sup\{\theta : F(\theta) < \delta\}$. Remark that if $F(\cdot)$ is strictly increasing, then the quantile is nothing but the inverse of $F(\cdot)$: $\phi(\delta) = F^{-1}(\delta)$.

Based on the above definitions, the liminf in probability $\underline{U}$ of $\{A_n\}_{n=1}^{\infty}$ [4], which is defined as the largest extended real number such that for all $\xi > 0$,
$$\lim_{n\to\infty}\Pr[A_n \le \underline{U} - \xi] = 0,$$
satisfies⁴
$$\underline{U} = \lim_{\delta\downarrow 0}\underline{U}_\delta = \underline{U}_0.$$
Also, the limsup in probability $\bar{U}$ (cf. [4]), defined as the smallest extended real number such that for all $\xi > 0$,
$$\lim_{n\to\infty}\Pr[A_n \ge \bar{U} + \xi] = 0,$$
is exactly⁵
$$\bar{U} = \lim_{\delta\uparrow 1}\bar{U}_\delta = \sup\{\theta : \underline{u}(\theta) < 1\}.$$
Straightforwardly by their definitions,
$$\underline{U} \le \underline{U}_\delta \le \bar{U}_\delta \le \bar{U} \quad\text{for } \delta\in[0,1).$$

Remark that $\underline{U}_\delta$ and $\bar{U}_\delta$ always exist. Furthermore, if $\underline{U}_\delta = \bar{U}_\delta$ for all $\delta$ in $[0,1]$, the sequence of random variables $A_n$ converges in distribution to a random variable $A$, provided the sequence of $A_n$ is tight.

For a better understanding of the quantities defined above, we depict them in Figure 2.1.

⁴ It is obvious from their definitions that
$$\lim_{\delta\downarrow 0}\underline{U}_\delta \ge \underline{U}_0 \ge \underline{U}.$$
The equality of $\lim_{\delta\downarrow 0}\underline{U}_\delta$ and $\underline{U}$ can be proved by contradiction by first assuming
$$\gamma \triangleq \lim_{\delta\downarrow 0}\underline{U}_\delta - \underline{U} > 0.$$
Then $\bar{u}(\underline{U} + \gamma/2) \le \delta$ for arbitrarily small $\delta > 0$, which immediately implies $\bar{u}(\underline{U} + \gamma/2) = 0$, contradicting the definition of $\underline{U}$.

⁵ Since $1 = \lim_{n\to\infty}\Pr\{A_n < \bar{U} + \xi\} \le \lim_{n\to\infty}\Pr\{A_n \le \bar{U} + \xi\} = \underline{u}(\bar{U} + \xi)$, it is straightforward that
$$\bar{U} \ge \sup\{\theta : \underline{u}(\theta) < 1\} = \lim_{\delta\uparrow 1}\bar{U}_\delta.$$
The equality of $\bar{U}$ and $\lim_{\delta\uparrow 1}\bar{U}_\delta$ can be proved by contradiction by first assuming that
$$\gamma \triangleq \bar{U} - \lim_{\delta\uparrow 1}\bar{U}_\delta > 0.$$
Then $1 \ge \underline{u}(\bar{U} - \gamma/2) > \delta$ for $\delta$ arbitrarily close to 1, which implies $\underline{u}(\bar{U} - \gamma/2) = 1$. Accordingly, by
$$1 \ge \liminf_{n\to\infty}\Pr\{A_n < \bar{U} - \gamma/4\} \ge \liminf_{n\to\infty}\Pr\{A_n \le \bar{U} - \gamma/2\} = \underline{u}(\bar{U} - \gamma/2) = 1,$$
we obtain the desired contradiction.

[Figure 2.1: The asymptotic CDFs of a sequence of random variables $\{A_n\}_{n=1}^{\infty}$: $\bar{u}(\cdot)$ = sup-spectrum of $A_n$; $\underline{u}(\cdot)$ = inf-spectrum of $A_n$; $\bar{U}_{1^-} = \lim_{\xi\uparrow 1}\bar{U}_\xi$. The figure marks $\underline{U}$, $\underline{U}_0$, $\underline{U}_\delta$, $\bar{U}_\delta$ and $\bar{U}_{1^-}$ on the $\theta$-axis against the level $\delta\in[0,1]$.]

2.2 Properties of quantile

Lemma 2.3 Consider two random sequences $\{A_n\}_{n=1}^{\infty}$ and $\{B_n\}_{n=1}^{\infty}$. Let $\bar{u}(\cdot)$ and $\underline{u}(\cdot)$ be respectively the sup- and inf-spectrums of $\{A_n\}_{n=1}^{\infty}$. Similarly, let $\bar{v}(\cdot)$ and $\underline{v}(\cdot)$ denote respectively the sup- and inf-spectrums of $\{B_n\}_{n=1}^{\infty}$. Let $\underline{U}_\delta$ and $\bar{U}_\delta$ be the quantiles of the sup- and inf-spectrums of $\{A_n\}_{n=1}^{\infty}$; also let $\underline{V}_\delta$ and $\bar{V}_\delta$ be the quantiles of the sup- and inf-spectrums of $\{B_n\}_{n=1}^{\infty}$.

Now let $\overline{(u+v)}(\cdot)$ and $\underline{(u+v)}(\cdot)$ denote the sup- and inf-spectrums of the sum sequence $\{A_n + B_n\}_{n=1}^{\infty}$, i.e.,
$$\overline{(u+v)}(\theta) \triangleq \limsup_{n\to\infty}\Pr\{A_n + B_n \le \theta\}$$
and
$$\underline{(u+v)}(\theta) \triangleq \liminf_{n\to\infty}\Pr\{A_n + B_n \le \theta\}.$$
Again, let $\underline{(U+V)}_\delta$ and $\overline{(U+V)}_\delta$ be the quantiles with respect to $\overline{(u+v)}(\cdot)$ and $\underline{(u+v)}(\cdot)$, respectively.

Then the following statements hold.

1. $\underline{U}_\delta$ and $\bar{U}_\delta$ are both non-decreasing and right-continuous functions of $\delta$ for $\delta\in[0,1]$.

2. $\lim_{\delta\downarrow 0}\underline{U}_\delta = \underline{U}_0$ and $\lim_{\delta\downarrow 0}\bar{U}_\delta = \bar{U}_0$.

3. For $\delta \ge 0$, $\gamma \ge 0$, and $\delta+\gamma \le 1$,
$$\underline{(U+V)}_{\delta+\gamma} \ge \underline{U}_\delta + \underline{V}_\gamma, \qquad (2.2.1)$$
and
$$\overline{(U+V)}_{\delta+\gamma} \ge \underline{U}_\delta + \bar{V}_\gamma. \qquad (2.2.2)$$

4. For $\delta \ge 0$, $\gamma \ge 0$, and $\delta+\gamma \le 1$,
$$\underline{(U+V)}_\delta \le \underline{U}_{\delta+\gamma} + \bar{V}_{(1-\gamma)}, \qquad (2.2.3)$$
and
$$\overline{(U+V)}_\delta \le \bar{U}_{\delta+\gamma} + \bar{V}_{(1-\gamma)}. \qquad (2.2.4)$$

Proof: Property 1 follows directly from the definitions of $\underline{U}_\delta$ and $\bar{U}_\delta$ and the fact that the inf-spectrum and the sup-spectrum are non-decreasing.

Property 2 can be proved by contradiction as follows. Suppose $\lim_{\delta\downarrow 0}\underline{U}_\delta > \underline{U}_0 + \varepsilon$ for some $\varepsilon > 0$. Then for any $\delta > 0$,
$$\bar{u}(\underline{U}_0 + \varepsilon/2) \le \delta.$$
Since the above inequality holds for every $\delta > 0$, and $\bar{u}(\cdot)$ is a non-negative function, we obtain $\bar{u}(\underline{U}_0 + \varepsilon/2) = 0$, which contradicts the definition of $\underline{U}_0$. We can prove $\lim_{\delta\downarrow 0}\bar{U}_\delta = \bar{U}_0$ in a similar fashion.

To show (2.2.1), we observe that for $\alpha > 0$,
$$\begin{aligned}
\limsup_{n\to\infty}\Pr\{A_n + B_n \le \underline{U}_\delta + \underline{V}_\gamma - 2\alpha\}
&\le \limsup_{n\to\infty}\left(\Pr\{A_n \le \underline{U}_\delta - \alpha\} + \Pr\{B_n \le \underline{V}_\gamma - \alpha\}\right)\\
&\le \limsup_{n\to\infty}\Pr\{A_n \le \underline{U}_\delta - \alpha\} + \limsup_{n\to\infty}\Pr\{B_n \le \underline{V}_\gamma - \alpha\}\\
&\le \delta + \gamma,
\end{aligned}$$
which, by definition of $\underline{(U+V)}_{\delta+\gamma}$, yields
$$\underline{(U+V)}_{\delta+\gamma} \ge \underline{U}_\delta + \underline{V}_\gamma - 2\alpha.$$
The proof is completed by noting that $\alpha$ can be made arbitrarily small.

Similarly, we note that for $\alpha > 0$,
$$\begin{aligned}
\liminf_{n\to\infty}\Pr\{A_n + B_n \le \underline{U}_\delta + \bar{V}_\gamma - 2\alpha\}
&\le \liminf_{n\to\infty}\left(\Pr\{A_n \le \underline{U}_\delta - \alpha\} + \Pr\{B_n \le \bar{V}_\gamma - \alpha\}\right)\\
&\le \limsup_{n\to\infty}\Pr\{A_n \le \underline{U}_\delta - \alpha\} + \liminf_{n\to\infty}\Pr\{B_n \le \bar{V}_\gamma - \alpha\}\\
&\le \delta + \gamma,
\end{aligned}$$
which, by definition of $\overline{(U+V)}_{\delta+\gamma}$ and arbitrarily small $\alpha$, proves (2.2.2).

To show (2.2.3), we first observe that (2.2.3) trivially holds when $\gamma = 1$ (and $\delta = 0$). It remains to prove its validity for $\gamma < 1$. Remark from (2.2.1) that
$$\underline{(U+V)}_\delta + \underline{(-V)}_\gamma \le \underline{(U+V-V)}_{\delta+\gamma} = \underline{U}_{\delta+\gamma}.$$
Hence,
$$\underline{(U+V)}_\delta \le \underline{U}_{\delta+\gamma} - \underline{(-V)}_\gamma.$$
(Note that the case $\gamma = 1$ is not allowed here because it results in $\underline{U}_1 = \underline{(-V)}_1 = \infty$, and the subtraction of two infinite terms is undefined. That is why the case $\gamma = 1$ is excluded in the subsequent proof.) The proof is then completed by showing that
$$-\underline{(-V)}_\gamma \le \bar{V}_{(1-\gamma)}. \qquad (2.2.5)$$
By definition,
$$\overline{(-v)}(\theta) \triangleq \limsup_{n\to\infty}\Pr\{-B_n \le \theta\} = 1 - \liminf_{n\to\infty}\Pr\{B_n < -\theta\}.$$
Then
$$\begin{aligned}
\bar{V}_{(1-\gamma)} &\triangleq \sup\{\theta : \underline{v}(\theta) \le 1-\gamma\}\\
&= \sup\{\theta : \liminf_{n\to\infty}\Pr[B_n \le \theta] \le 1-\gamma\}\\
&\ge \sup\{\theta : \liminf_{n\to\infty}\Pr[B_n < \theta] < 1-\gamma\} \quad (\text{cf. footnote 2})\\
&= \sup\{-\theta : \liminf_{n\to\infty}\Pr[B_n < -\theta] < 1-\gamma\}\\
&= \sup\{-\theta : 1 - \limsup_{n\to\infty}\Pr[-B_n \le \theta] < 1-\gamma\}\\
&= \sup\{-\theta : \limsup_{n\to\infty}\Pr[-B_n \le \theta] > \gamma\}\\
&= -\inf\{\theta : \limsup_{n\to\infty}\Pr[-B_n \le \theta] > \gamma\}\\
&= -\sup\{\theta : \limsup_{n\to\infty}\Pr[-B_n \le \theta] \le \gamma\}\\
&= -\sup\{\theta : \overline{(-v)}(\theta) \le \gamma\} = -\underline{(-V)}_\gamma.
\end{aligned}$$

Finally, to show (2.2.4), we again note that it trivially holds for $\gamma = 1$. We then observe from (2.2.2) that $\overline{(U+V)}_\delta + \underline{(-V)}_\gamma \le \overline{(U+V-V)}_{\delta+\gamma} = \bar{U}_{\delta+\gamma}$. Hence,
$$\overline{(U+V)}_\delta \le \bar{U}_{\delta+\gamma} - \underline{(-V)}_\gamma.$$
Using (2.2.5), we have the desired result. □
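To make property 3 concrete, here is a small numerical sketch of inequality (2.2.1). The two sequences below (sample means of i.i.d. exponential and uniform variables) and the fixed large $n$ used as a proxy for the asymptotic spectrums are toy choices of my own, not prescribed by the notes.

```python
# Empirical check of (2.2.1): the quantile of the sum sequence dominates the sum of quantiles.
import numpy as np

rng = np.random.default_rng(4)
n, trials = 2000, 20000
A = rng.exponential(1.0, size=(trials, n)).mean(axis=1)    # A_n: mean of Exp(1) samples
B = rng.uniform(0.0, 2.0, size=(trials, n)).mean(axis=1)   # B_n: mean of U(0,2) samples

def quantile(samples, delta, grid=np.linspace(0.0, 4.0, 401)):
    """Empirical sup{theta : Pr[samples <= theta] <= delta} at one large n."""
    ok = [t for t in grid if (samples <= t).mean() <= delta]
    return max(ok) if ok else -np.inf

delta, gamma = 0.2, 0.3
lhs = quantile(A + B, delta + gamma)                 # plays the role of (U+V)_{delta+gamma}
rhs = quantile(A, delta) + quantile(B, gamma)        # U_delta + V_gamma
print(lhs, rhs, lhs >= rhs - 1e-2)                   # (2.2.1) holds up to grid/sampling error
```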

2.3 Generalized information measures

In Definitions 2.1 and 2.2, if we let the random variable $A_n$ equal the normalized entropy density⁶
$$\frac{1}{n}h_{X^n}(X^n) \triangleq -\frac{1}{n}\log P_{X^n}(X^n)$$
of an arbitrary source
$$\boldsymbol{X} = \left\{X^n = \left(X^{(n)}_1, X^{(n)}_2, \ldots, X^{(n)}_n\right)\right\}_{n=1}^{\infty},$$
we obtain two generalized entropy measures for $\boldsymbol{X}$, i.e.,

δ-inf-entropy rate: $\underline{H}_\delta(\boldsymbol{X})$ = quantile of the sup-spectrum of $\frac{1}{n}h_{X^n}(X^n)$;

δ-sup-entropy rate: $\bar{H}_\delta(\boldsymbol{X})$ = quantile of the inf-spectrum of $\frac{1}{n}h_{X^n}(X^n)$.

Note that the inf-entropy rate $\underline{H}(\boldsymbol{X})$ and the sup-entropy rate $\bar{H}(\boldsymbol{X})$ introduced in [4] are special cases of the δ-inf/sup-entropy rate measures:
$$\underline{H}(\boldsymbol{X}) = \underline{H}_0(\boldsymbol{X}), \quad\text{and}\quad \bar{H}(\boldsymbol{X}) = \lim_{\delta\uparrow 1}\bar{H}_\delta(\boldsymbol{X}).$$

⁶ The random variable $h_{X^n}(X^n) \triangleq -\log P_{X^n}(X^n)$ is named the entropy density. Hence, the normalized entropy density is equal to $h_{X^n}(X^n)$ divided (or normalized) by the blocklength $n$.

Conceptually, if the random variable $(1/n)h_{X^n}(X^n)$ exhibits a limiting distribution, then the sup-entropy rate is the right margin of the support of that limiting distribution. For example, suppose that the limiting distribution of $(1/n)h_{X^n}(X^n)$ is positive over $(-2,2)$ and zero otherwise. Then $\bar{H}(\boldsymbol{X}) = 2$. Similarly, the inf-entropy rate is the left margin of the support of the limiting distribution, which is $-2$ for the same example.

Analogously, for an arbitrary channel
$$\boldsymbol{W} = (\boldsymbol{Y}|\boldsymbol{X}) = \left\{W^n = \left(W^{(n)}_1, \ldots, W^{(n)}_n\right)\right\}_{n=1}^{\infty},$$
or more specifically $P_{\boldsymbol{W}} = P_{\boldsymbol{Y}|\boldsymbol{X}} = \{P_{Y^n|X^n}\}_{n=1}^{\infty}$ with input $\boldsymbol{X}$ and output $\boldsymbol{Y}$, if we replace $A_n$ in Definitions 2.1 and 2.2 by the normalized information density
$$\frac{1}{n}i_{X^nW^n}(X^n;Y^n) = \frac{1}{n}i_{X^n,Y^n}(X^n;Y^n) \triangleq \frac{1}{n}\log\frac{P_{X^n,Y^n}(X^n,Y^n)}{P_{X^n}(X^n)P_{Y^n}(Y^n)},$$
we get the δ-inf/sup-information rates, denoted respectively by $\underline{I}_\delta(\boldsymbol{X};\boldsymbol{Y})$ and $\bar{I}_\delta(\boldsymbol{X};\boldsymbol{Y})$:

$\underline{I}_\delta(\boldsymbol{X};\boldsymbol{Y})$ = quantile of the sup-spectrum of $\frac{1}{n}i_{X^nW^n}(X^n;Y^n)$;

$\bar{I}_\delta(\boldsymbol{X};\boldsymbol{Y})$ = quantile of the inf-spectrum of $\frac{1}{n}i_{X^nW^n}(X^n;Y^n)$.

Similarly, for a simple hypothesis testing system with arbitrary observation statistics under each hypothesis,
$$H_0: P_{\boldsymbol{X}} \qquad\text{versus}\qquad H_1: P_{\hat{\boldsymbol{X}}},$$
we can replace $A_n$ in Definitions 2.1 and 2.2 by the normalized log-likelihood ratio
$$\frac{1}{n}d_{X^n}(X^n\|\hat{X}^n) \triangleq \frac{1}{n}\log\frac{P_{X^n}(X^n)}{P_{\hat{X}^n}(X^n)}$$
to obtain the δ-inf/sup-divergence rates, denoted respectively by $\underline{D}_\delta(\boldsymbol{X}\|\hat{\boldsymbol{X}})$ and $\bar{D}_\delta(\boldsymbol{X}\|\hat{\boldsymbol{X}})$:

$\underline{D}_\delta(\boldsymbol{X}\|\hat{\boldsymbol{X}})$ = quantile of the sup-spectrum of $\frac{1}{n}d_{X^n}(X^n\|\hat{X}^n)$;

$\bar{D}_\delta(\boldsymbol{X}\|\hat{\boldsymbol{X}})$ = quantile of the inf-spectrum of $\frac{1}{n}d_{X^n}(X^n\|\hat{X}^n)$.

The above replacements are summarized in Tables 2.1–2.3.

Table 2.1: Generalized entropy measures, where δ ∈ [0, 1].

  system: arbitrary source $\boldsymbol{X}$
  normalized entropy density: $\frac{1}{n}h_{X^n}(X^n) \triangleq -\frac{1}{n}\log P_{X^n}(X^n)$
  entropy sup-spectrum: $\bar{h}(\theta) \triangleq \limsup_{n\to\infty}\Pr\left\{\frac{1}{n}h_{X^n}(X^n)\le\theta\right\}$
  entropy inf-spectrum: $\underline{h}(\theta) \triangleq \liminf_{n\to\infty}\Pr\left\{\frac{1}{n}h_{X^n}(X^n)\le\theta\right\}$
  δ-inf-entropy rate: $\underline{H}_\delta(\boldsymbol{X}) \triangleq \sup\{\theta : \bar{h}(\theta)\le\delta\}$
  δ-sup-entropy rate: $\bar{H}_\delta(\boldsymbol{X}) \triangleq \sup\{\theta : \underline{h}(\theta)\le\delta\}$
  sup-entropy rate: $\bar{H}(\boldsymbol{X}) \triangleq \lim_{\delta\uparrow 1}\bar{H}_\delta(\boldsymbol{X})$
  inf-entropy rate: $\underline{H}(\boldsymbol{X}) \triangleq \underline{H}_0(\boldsymbol{X})$

Table 2.2: Generalized mutual information measures, where δ ∈ [0, 1].

  system: arbitrary channel $P_{\boldsymbol{W}} = P_{\boldsymbol{Y}|\boldsymbol{X}}$ with input $\boldsymbol{X}$ and output $\boldsymbol{Y}$
  normalized information density: $\frac{1}{n}i_{X^nW^n}(X^n;Y^n) \triangleq \frac{1}{n}\log\frac{P_{X^n,Y^n}(X^n,Y^n)}{P_{X^n}(X^n)\,P_{Y^n}(Y^n)}$
  information sup-spectrum: $\bar{i}(\theta) \triangleq \limsup_{n\to\infty}\Pr\left\{\frac{1}{n}i_{X^nW^n}(X^n;Y^n)\le\theta\right\}$
  information inf-spectrum: $\underline{i}(\theta) \triangleq \liminf_{n\to\infty}\Pr\left\{\frac{1}{n}i_{X^nW^n}(X^n;Y^n)\le\theta\right\}$
  δ-inf-information rate: $\underline{I}_\delta(\boldsymbol{X};\boldsymbol{Y}) \triangleq \sup\{\theta : \bar{i}(\theta)\le\delta\}$
  δ-sup-information rate: $\bar{I}_\delta(\boldsymbol{X};\boldsymbol{Y}) \triangleq \sup\{\theta : \underline{i}(\theta)\le\delta\}$
  sup-information rate: $\bar{I}(\boldsymbol{X};\boldsymbol{Y}) \triangleq \lim_{\delta\uparrow 1}\bar{I}_\delta(\boldsymbol{X};\boldsymbol{Y})$
  inf-information rate: $\underline{I}(\boldsymbol{X};\boldsymbol{Y}) \triangleq \underline{I}_0(\boldsymbol{X};\boldsymbol{Y})$

Table 2.3: Generalized divergence measures, where δ ∈ [0, 1].

  system: arbitrary sources $\boldsymbol{X}$ and $\hat{\boldsymbol{X}}$
  normalized log-likelihood ratio: $\frac{1}{n}d_{X^n}(X^n\|\hat{X}^n) \triangleq \frac{1}{n}\log\frac{P_{X^n}(X^n)}{P_{\hat{X}^n}(X^n)}$
  divergence sup-spectrum: $\bar{d}(\theta) \triangleq \limsup_{n\to\infty}\Pr\left\{\frac{1}{n}d_{X^n}(X^n\|\hat{X}^n)\le\theta\right\}$
  divergence inf-spectrum: $\underline{d}(\theta) \triangleq \liminf_{n\to\infty}\Pr\left\{\frac{1}{n}d_{X^n}(X^n\|\hat{X}^n)\le\theta\right\}$
  δ-inf-divergence rate: $\underline{D}_\delta(\boldsymbol{X}\|\hat{\boldsymbol{X}}) \triangleq \sup\{\theta : \bar{d}(\theta)\le\delta\}$
  δ-sup-divergence rate: $\bar{D}_\delta(\boldsymbol{X}\|\hat{\boldsymbol{X}}) \triangleq \sup\{\theta : \underline{d}(\theta)\le\delta\}$
  sup-divergence rate: $\bar{D}(\boldsymbol{X}\|\hat{\boldsymbol{X}}) \triangleq \lim_{\delta\uparrow 1}\bar{D}_\delta(\boldsymbol{X}\|\hat{\boldsymbol{X}})$
  inf-divergence rate: $\underline{D}(\boldsymbol{X}\|\hat{\boldsymbol{X}}) \triangleq \underline{D}_0(\boldsymbol{X}\|\hat{\boldsymbol{X}})$

2.4 Properties of generalized information measures

In this section, we introduce the properties of the general information measures defined in the previous section. We begin with the generalization of the simple identity $I(X;Y) = H(Y) - H(Y|X)$.

By taking $\delta = 0$ and letting $\gamma\downarrow 0$ in (2.2.1) and (2.2.3), we obtain
$$\underline{(U+V)} \ge \underline{U}_0 + \lim_{\gamma\downarrow 0}\underline{V}_\gamma \ge \underline{U} + \underline{V}$$
and
$$\underline{(U+V)} \le \lim_{\gamma\downarrow 0}\underline{U}_\gamma + \lim_{\gamma\downarrow 0}\bar{V}_{(1-\gamma)} = \underline{U} + \bar{V},$$
which means that the liminf in probability of the sequence of random variables $A_n + B_n$ is upper bounded by the liminf in probability of $A_n$ plus the limsup in probability of $B_n$, and is lower bounded by the sum of the liminfs in probability of $A_n$ and $B_n$. This fact is used in [5] to show that
$$\underline{I}(\boldsymbol{X};\boldsymbol{Y}) + \underline{H}(\boldsymbol{Y}|\boldsymbol{X}) \le \underline{H}(\boldsymbol{Y}) \le \underline{I}(\boldsymbol{X};\boldsymbol{Y}) + \bar{H}(\boldsymbol{Y}|\boldsymbol{X}),$$
or equivalently,
$$\underline{H}(\boldsymbol{Y}) - \bar{H}(\boldsymbol{Y}|\boldsymbol{X}) \le \underline{I}(\boldsymbol{X};\boldsymbol{Y}) \le \underline{H}(\boldsymbol{Y}) - \underline{H}(\boldsymbol{Y}|\boldsymbol{X}).$$

Other properties of the generalized information measures are summarized in the next lemma.

Lemma 2.4 For a finite alphabet $\mathcal{X}$, the following statements hold.

1. $\underline{H}_\delta(\boldsymbol{X}) \ge 0$ for $\delta\in[0,1]$. (This property also applies to $\bar{H}_\delta(\boldsymbol{X})$, $\underline{I}_\delta(\boldsymbol{X};\boldsymbol{Y})$, $\bar{I}_\delta(\boldsymbol{X};\boldsymbol{Y})$, $\underline{D}_\delta(\boldsymbol{X}\|\hat{\boldsymbol{X}})$, and $\bar{D}_\delta(\boldsymbol{X}\|\hat{\boldsymbol{X}})$.)

2. $\underline{I}_\delta(\boldsymbol{X};\boldsymbol{Y}) = \underline{I}_\delta(\boldsymbol{Y};\boldsymbol{X})$ and $\bar{I}_\delta(\boldsymbol{X};\boldsymbol{Y}) = \bar{I}_\delta(\boldsymbol{Y};\boldsymbol{X})$ for $\delta\in[0,1]$.

3. For $0 \le \delta < 1$, $0 \le \gamma < 1$ and $\delta+\gamma \le 1$,
$$\underline{I}_\delta(\boldsymbol{X};\boldsymbol{Y}) \le \underline{H}_{\delta+\gamma}(\boldsymbol{Y}) - \underline{H}_\gamma(\boldsymbol{Y}|\boldsymbol{X}), \qquad (2.4.1)$$
$$\underline{I}_\delta(\boldsymbol{X};\boldsymbol{Y}) \le \bar{H}_{\delta+\gamma}(\boldsymbol{Y}) - \bar{H}_\gamma(\boldsymbol{Y}|\boldsymbol{X}), \qquad (2.4.2)$$
$$\bar{I}_\gamma(\boldsymbol{X};\boldsymbol{Y}) \le \bar{H}_{\delta+\gamma}(\boldsymbol{Y}) - \underline{H}_\delta(\boldsymbol{Y}|\boldsymbol{X}), \qquad (2.4.3)$$
$$\underline{I}_{\delta+\gamma}(\boldsymbol{X};\boldsymbol{Y}) \ge \underline{H}_\delta(\boldsymbol{Y}) - \bar{H}_{(1-\gamma)}(\boldsymbol{Y}|\boldsymbol{X}), \qquad (2.4.4)$$
and
$$\bar{I}_{\delta+\gamma}(\boldsymbol{X};\boldsymbol{Y}) \ge \bar{H}_\delta(\boldsymbol{Y}) - \bar{H}_{(1-\gamma)}(\boldsymbol{Y}|\boldsymbol{X}). \qquad (2.4.5)$$
(Note that the case $(\delta,\gamma) = (1,0)$ holds for (2.4.1) and (2.4.2), and the case $(\delta,\gamma) = (0,1)$ holds for (2.4.3), (2.4.4) and (2.4.5).)

4. $0 \le \underline{H}_\delta(\boldsymbol{X}) \le \bar{H}_\delta(\boldsymbol{X}) \le \log|\mathcal{X}|$ for $\delta\in[0,1)$, where each $X^{(n)}_i$ takes values in $\mathcal{X}$ for $i = 1,\ldots,n$ and $n = 1,2,\ldots$.

5. $\underline{I}_\delta(\boldsymbol{X},\boldsymbol{Y};\boldsymbol{Z}) \ge \underline{I}_\delta(\boldsymbol{X};\boldsymbol{Z})$ for $\delta\in[0,1]$.

Proof: Property 1 holds because
$$\Pr\left\{-\frac{1}{n}\log P_{X^n}(X^n) < 0\right\} = 0,$$
$$\begin{aligned}
\Pr\left\{\frac{1}{n}\log\frac{P_{X^n}(X^n)}{P_{\hat{X}^n}(X^n)} < -\nu\right\}
&= P_{X^n}\left\{x^n\in\mathcal{X}^n : \frac{1}{n}\log\frac{P_{X^n}(x^n)}{P_{\hat{X}^n}(x^n)} < -\nu\right\}\\
&= \sum_{x^n : P_{X^n}(x^n) < P_{\hat{X}^n}(x^n)e^{-n\nu}} P_{X^n}(x^n)\\
&\le \sum_{x^n : P_{X^n}(x^n) < P_{\hat{X}^n}(x^n)e^{-n\nu}} P_{\hat{X}^n}(x^n)e^{-n\nu}\\
&\le e^{-n\nu}\sum_{x^n\in\mathcal{X}^n} P_{\hat{X}^n}(x^n) \le e^{-n\nu}, \qquad (2.4.6)
\end{aligned}$$
and, by following the same procedure as in (2.4.6),
$$\Pr\left\{\frac{1}{n}\log\frac{P_{X^n,Y^n}(X^n,Y^n)}{P_{X^n}(X^n)P_{Y^n}(Y^n)} < -\nu\right\} \le e^{-n\nu}.$$

Property 2 is an immediate consequence of the definition.

To show the inequalities in property 3, we first remark that
$$\frac{1}{n}h_{Y^n}(Y^n) = \frac{1}{n}i_{X^n,Y^n}(X^n;Y^n) + \frac{1}{n}h_{X^n,Y^n}(Y^n|X^n),$$
where
$$\frac{1}{n}h_{X^n,Y^n}(Y^n|X^n) \triangleq -\frac{1}{n}\log P_{Y^n|X^n}(Y^n|X^n).$$
With this fact, and for $0 \le \delta < 1$, $0 < \gamma < 1$ and $\delta+\gamma \le 1$, (2.4.1) follows directly from (2.2.1); (2.4.2) and (2.4.3) follow from (2.2.2); (2.4.4) follows from (2.2.3); and (2.4.5) follows from (2.2.4). (Note that in (2.4.1), if $\delta = 0$ and $\gamma = 1$, then the right-hand side would be a difference of two infinite terms, which is undefined; such a case is therefore excluded by the condition $\gamma < 1$. For the same reason, we also exclude the cases $\gamma = 0$ and $\delta = 1$.) Now when $0 \le \delta < 1$ and $\gamma = 0$, we can confirm the validity of (2.4.1), (2.4.2) and (2.4.3), again by (2.2.1) and (2.2.2), and also examine the validity of (2.4.4) and (2.4.5) by directly substituting these values. The validity of (2.4.1) and (2.4.2) for $(\delta,\gamma) = (1,0)$, and the validity of (2.4.3), (2.4.4) and (2.4.5) for $(\delta,\gamma) = (0,1)$, can be checked by directly replacing $\delta$ and $\gamma$ with the respective numbers.

Property 4 follows from the facts that $\bar{H}_\delta(\cdot)$ is non-decreasing in $\delta$, $\bar{H}_\delta(\boldsymbol{X}) \le \bar{H}(\boldsymbol{X})$, and $\bar{H}(\boldsymbol{X}) \le \log|\mathcal{X}|$. The last inequality can be proved as follows:
$$\Pr\left\{\frac{1}{n}h_{X^n}(X^n) \le \log|\mathcal{X}| + \nu\right\} = 1 - P_{X^n}\left\{x^n\in\mathcal{X}^n : \frac{1}{n}\log\frac{P_{X^n}(x^n)}{1/|\mathcal{X}|^n} < -\nu\right\} \ge 1 - e^{-n\nu},$$
where the last step is obtained by using the same procedure as in (2.4.6). Therefore, $\underline{h}(\log|\mathcal{X}|+\nu) = 1$ for any $\nu > 0$, which indicates $\bar{H}(\boldsymbol{X}) \le \log|\mathcal{X}|$.

Property 5 can be proved using the fact that
$$\frac{1}{n}i_{X^n,Y^n,Z^n}(X^n,Y^n;Z^n) = \frac{1}{n}i_{X^n,Z^n}(X^n;Z^n) + \frac{1}{n}i_{X^n,Y^n,Z^n}(Y^n;Z^n|X^n).$$
By applying (2.2.1) with $\gamma = 0$ and observing $\underline{I}(\boldsymbol{Y};\boldsymbol{Z}|\boldsymbol{X}) \ge 0$, we obtain the desired result. □


Lemma 2.5 (data processing lemma) Fix $\delta\in[0,1]$. Suppose that for every $n$, $X^n_1$ and $X^n_3$ are conditionally independent given $X^n_2$. Then
$$\underline{I}_\delta(\boldsymbol{X}_1;\boldsymbol{X}_3) \le \underline{I}_\delta(\boldsymbol{X}_1;\boldsymbol{X}_2).$$

Proof: By property 5 of Lemma 2.4, we get
$$\underline{I}_\delta(\boldsymbol{X}_1;\boldsymbol{X}_3) \le \underline{I}_\delta(\boldsymbol{X}_1;\boldsymbol{X}_2,\boldsymbol{X}_3) = \underline{I}_\delta(\boldsymbol{X}_1;\boldsymbol{X}_2),$$
where the equality holds because
$$\frac{1}{n}\log\frac{P_{X^n_1,X^n_2,X^n_3}(x^n_1,x^n_2,x^n_3)}{P_{X^n_1}(x^n_1)\,P_{X^n_2,X^n_3}(x^n_2,x^n_3)} = \frac{1}{n}\log\frac{P_{X^n_1,X^n_2}(x^n_1,x^n_2)}{P_{X^n_1}(x^n_1)\,P_{X^n_2}(x^n_2)}. \qquad\Box$$

Lemma 2.6 (optimality of independent inputs) Fix $\delta\in[0,1)$. Consider a finite-alphabet channel with $P_{W^n}(y^n|x^n) = P_{Y^n|X^n}(y^n|x^n) = \prod_{i=1}^n P_{Y_i|X_i}(y_i|x_i)$ for all $n$. For any input $\boldsymbol{X}$ and its corresponding output $\boldsymbol{Y}$,
$$\underline{I}_\delta(\boldsymbol{X};\boldsymbol{Y}) \le \underline{I}_\delta(\tilde{\boldsymbol{X}};\tilde{\boldsymbol{Y}}) = \underline{I}(\tilde{\boldsymbol{X}};\tilde{\boldsymbol{Y}}),$$
where $\tilde{\boldsymbol{Y}}$ is the output due to $\tilde{\boldsymbol{X}}$, which is an independent process with the same first-order statistics as $\boldsymbol{X}$, i.e., $P_{\tilde{X}^n}(x^n) = \prod_{i=1}^n P_{X_i}(x_i)$.

Proof: First, we observe that
$$\frac{1}{n}\log\frac{P_{W^n}(y^n|x^n)}{P_{Y^n}(y^n)} + \frac{1}{n}\log\frac{P_{Y^n}(y^n)}{P_{\tilde{Y}^n}(y^n)} = \frac{1}{n}\log\frac{P_{W^n}(y^n|x^n)}{P_{\tilde{Y}^n}(y^n)}.$$
In other words,
$$\frac{1}{n}\log\frac{P_{X^nW^n}(x^n,y^n)}{P_{X^n}(x^n)P_{Y^n}(y^n)} + \frac{1}{n}\log\frac{P_{Y^n}(y^n)}{P_{\tilde{Y}^n}(y^n)} = \frac{1}{n}\log\frac{P_{X^nW^n}(x^n,y^n)}{P_{X^n}(x^n)P_{\tilde{Y}^n}(y^n)}.$$
By evaluating the above terms under $P_{X^nW^n} = P_{X^n,Y^n}$ and defining
$$\bar{z}(\theta) \triangleq \limsup_{n\to\infty} P_{X^nW^n}\left\{(x^n,y^n)\in\mathcal{X}^n\times\mathcal{Y}^n : \frac{1}{n}\log\frac{P_{X^nW^n}(x^n,y^n)}{P_{X^n}(x^n)P_{\tilde{Y}^n}(y^n)} \le \theta\right\}$$
and
$$\underline{Z}_\delta(\boldsymbol{X};\tilde{\boldsymbol{Y}}) \triangleq \sup\{\theta : \bar{z}(\theta) \le \delta\},$$
we obtain from (2.2.1) (with $\gamma = 0$) that
$$\underline{Z}_\delta(\boldsymbol{X};\tilde{\boldsymbol{Y}}) \ge \underline{I}_\delta(\boldsymbol{X};\boldsymbol{Y}) + \underline{D}(\boldsymbol{Y}\|\tilde{\boldsymbol{Y}}) \ge \underline{I}_\delta(\boldsymbol{X};\boldsymbol{Y}),$$
since $\underline{D}(\boldsymbol{Y}\|\tilde{\boldsymbol{Y}}) \ge 0$ by property 1 of Lemma 2.4.

Now, since $\tilde{\boldsymbol{X}}$ is independent with marginals equal to those of $\boldsymbol{X}$, the induced $\tilde{\boldsymbol{Y}}$ is also independent and has the same marginal distributions⁷ as $\boldsymbol{Y}$. Hence,
$$\Pr\left\{\left|\frac{1}{n}\sum_{i=1}^n\left(\log\frac{P_{X_iW_i}(X_i,Y_i)}{P_{X_i}(X_i)P_{\tilde{Y}_i}(Y_i)} - E_{X_iW_i}\left[\log\frac{P_{X_iW_i}(X_i,Y_i)}{P_{X_i}(X_i)P_{\tilde{Y}_i}(Y_i)}\right]\right)\right| > \gamma\right\}$$
$$= \Pr\left\{\left|\frac{1}{n}\sum_{i=1}^n\left(\log\frac{P_{X_iW_i}(X_i,Y_i)}{P_{X_i}(X_i)P_{Y_i}(Y_i)} - E_{X_iW_i}\left[\log\frac{P_{X_iW_i}(X_i,Y_i)}{P_{X_i}(X_i)P_{Y_i}(Y_i)}\right]\right)\right| > \gamma\right\} \to 0$$
for any $\gamma > 0$, where the convergence to zero follows from Chebyshev's inequality and the finiteness of the channel alphabets (or, more directly, the finiteness of the individual variances). Consequently, $\bar{z}(\theta) = 1$ for
$$\theta > \liminf_{n\to\infty}\frac{1}{n}\sum_{i=1}^n E_{X_iW_i}\left[\log\frac{P_{X_iW_i}(X_i,Y_i)}{P_{X_i}(X_i)P_{Y_i}(Y_i)}\right] = \liminf_{n\to\infty}\frac{1}{n}\sum_{i=1}^n I(X_i;Y_i),$$
and $\bar{z}(\theta) = 0$ for $\theta < \liminf_{n\to\infty}(1/n)\sum_{i=1}^n I(X_i;Y_i)$, which implies
$$\underline{Z}(\boldsymbol{X};\tilde{\boldsymbol{Y}}) = \underline{Z}_\delta(\boldsymbol{X};\tilde{\boldsymbol{Y}}) = \liminf_{n\to\infty}\frac{1}{n}\sum_{i=1}^n I(X_i;Y_i)$$
for any $\delta\in[0,1)$. Similarly, we can show that
$$\underline{I}(\tilde{\boldsymbol{X}};\tilde{\boldsymbol{Y}}) = \underline{I}_\delta(\tilde{\boldsymbol{X}};\tilde{\boldsymbol{Y}}) = \liminf_{n\to\infty}\frac{1}{n}\sum_{i=1}^n I(X_i;Y_i)$$
for any $\delta\in[0,1)$. Accordingly,
$$\underline{I}(\tilde{\boldsymbol{X}};\tilde{\boldsymbol{Y}}) = \underline{Z}(\boldsymbol{X};\tilde{\boldsymbol{Y}}) \ge \underline{I}_\delta(\boldsymbol{X};\boldsymbol{Y}). \qquad\Box$$

⁷ The claim can be justified as follows:
$$\begin{aligned}
P_{Y_1}(y_1) &= \sum_{y_2^n\in\mathcal{Y}^{n-1}}\sum_{x^n\in\mathcal{X}^n} P_{X^n}(x^n)P_{W^n}(y^n|x^n)\\
&= \sum_{x_1\in\mathcal{X}} P_{X_1}(x_1)P_{W_1}(y_1|x_1)\sum_{y_2^n\in\mathcal{Y}^{n-1}}\sum_{x_2^n\in\mathcal{X}^{n-1}} P_{X_2^n|X_1}(x_2^n|x_1)P_{W_2^n}(y_2^n|x_2^n)\\
&= \sum_{x_1\in\mathcal{X}} P_{X_1}(x_1)P_{W_1}(y_1|x_1)\sum_{y_2^n\in\mathcal{Y}^{n-1}}\sum_{x_2^n\in\mathcal{X}^{n-1}} P_{X_2^nW_2^n|X_1}(x_2^n,y_2^n|x_1)\\
&= \sum_{x_1\in\mathcal{X}} P_{X_1}(x_1)P_{W_1}(y_1|x_1).
\end{aligned}$$
Hence, for a channel with $P_{W^n}(y^n|x^n) = \prod_{i=1}^n P_{W_i}(y_i|x_i)$, the output marginal depends only on the corresponding input marginal.

2.5 Examples for the computation of general information measures

Let the alphabet be binary, $\mathcal{X} = \mathcal{Y} = \{0,1\}$, and let every output be given by
$$Y^{(n)}_i = X^{(n)}_i \oplus Z^{(n)}_i,$$
where $\oplus$ represents modulo-2 addition and $\boldsymbol{Z}$ is an arbitrary binary random process independent of $\boldsymbol{X}$. Assume that $\boldsymbol{X}$ is a Bernoulli uniform input, i.e., an i.i.d. process with equiprobable marginal distribution. Then the resulting $\boldsymbol{Y}$ is also Bernoulli uniform, no matter what distribution $\boldsymbol{Z}$ has.

To compute $\underline{I}_\varepsilon(\boldsymbol{X};\boldsymbol{Y})$ for $\varepsilon\in[0,1)$ (it is known that $\underline{I}_1(\boldsymbol{X};\boldsymbol{Y}) = \infty$), we use the results of property 3 of Lemma 2.4:
$$\underline{I}_\varepsilon(\boldsymbol{X};\boldsymbol{Y}) \ge \underline{H}_0(\boldsymbol{Y}) - \bar{H}_{(1-\varepsilon)}(\boldsymbol{Y}|\boldsymbol{X}) \qquad (2.5.1)$$
and
$$\underline{I}_\varepsilon(\boldsymbol{X};\boldsymbol{Y}) \le \bar{H}_{\varepsilon+\gamma}(\boldsymbol{Y}) - \bar{H}_\gamma(\boldsymbol{Y}|\boldsymbol{X}), \qquad (2.5.2)$$
where $0 \le \varepsilon < 1$, $0 \le \gamma < 1$ and $\varepsilon+\gamma \le 1$. Note that the lower bound in (2.5.1) and the upper bound in (2.5.2) are respectively equal to $-\infty$ and $\infty$ for $\varepsilon = 0$ and $\varepsilon+\gamma = 1$, in which case they become trivial; hence, we further restrict $\varepsilon > 0$ for the lower bound and $\varepsilon+\gamma < 1$ for the upper bound.

Thus, for $\varepsilon\in[0,1)$,
$$\underline{I}_\varepsilon(\boldsymbol{X};\boldsymbol{Y}) \le \inf_{0\le\gamma<1-\varepsilon}\left\{\bar{H}_{\varepsilon+\gamma}(\boldsymbol{Y}) - \bar{H}_\gamma(\boldsymbol{Y}|\boldsymbol{X})\right\}.$$
By the symmetry of the channel, $\bar{H}_\gamma(\boldsymbol{Y}|\boldsymbol{X}) = \bar{H}_\gamma(\boldsymbol{Z})$, which is independent of $\boldsymbol{X}$. Hence,
$$\underline{I}_\varepsilon(\boldsymbol{X};\boldsymbol{Y}) \le \inf_{0\le\gamma<1-\varepsilon}\left\{\bar{H}_{\varepsilon+\gamma}(\boldsymbol{Y}) - \bar{H}_\gamma(\boldsymbol{Z})\right\} \le \inf_{0\le\gamma<1-\varepsilon}\left\{\log 2 - \bar{H}_\gamma(\boldsymbol{Z})\right\},$$
where the last step follows from property 4 of Lemma 2.4. Since $\log 2 - \bar{H}_\gamma(\boldsymbol{Z})$ is non-increasing in $\gamma$,
$$\underline{I}_\varepsilon(\boldsymbol{X};\boldsymbol{Y}) \le \log 2 - \lim_{\gamma\uparrow(1-\varepsilon)}\bar{H}_\gamma(\boldsymbol{Z}).$$

On the other hand, we can derive the lower bound on $\underline{I}_\varepsilon(\boldsymbol{X};\boldsymbol{Y})$ in (2.5.1) by using the fact that $\boldsymbol{Y}$ is Bernoulli uniform. We thus obtain, for $\varepsilon\in(0,1]$,
$$\underline{I}_\varepsilon(\boldsymbol{X};\boldsymbol{Y}) \ge \log 2 - \bar{H}_{(1-\varepsilon)}(\boldsymbol{Z}),$$
and
$$\underline{I}_0(\boldsymbol{X};\boldsymbol{Y}) = \lim_{\varepsilon\downarrow 0}\underline{I}_\varepsilon(\boldsymbol{X};\boldsymbol{Y}) \ge \log 2 - \lim_{\gamma\uparrow 1}\bar{H}_\gamma(\boldsymbol{Z}) = \log 2 - \bar{H}(\boldsymbol{Z}).$$
To summarize,
$$\log 2 - \bar{H}_{(1-\varepsilon)}(\boldsymbol{Z}) \le \underline{I}_\varepsilon(\boldsymbol{X};\boldsymbol{Y}) \le \log 2 - \lim_{\gamma\uparrow(1-\varepsilon)}\bar{H}_\gamma(\boldsymbol{Z}) \quad\text{for } \varepsilon\in(0,1)$$
and
$$\underline{I}(\boldsymbol{X};\boldsymbol{Y}) = \underline{I}_0(\boldsymbol{X};\boldsymbol{Y}) = \log 2 - \bar{H}(\boldsymbol{Z}).$$
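The algebraic fact driving these bounds is that, for the uniform input, the normalized information density equals $\log 2$ minus the normalized entropy density of the noise. The few lines below verify this identity sample by sample under an i.i.d. Bernoulli noise process of my own choosing (purely illustrative).

```python
# Sample-wise check that (1/n) i_{X^n W^n}(X^n; Y^n) = log 2 - (1/n) h_{Z^n}(Z^n)
# for Y_i = X_i xor Z_i with a uniform i.i.d. input X (toy Bernoulli(0.2) noise).
import numpy as np

rng = np.random.default_rng(5)
n, q = 10_000, 0.2
X = rng.integers(0, 2, size=n)                    # uniform input bits
Z = (rng.random(n) < q).astype(int)               # noise bits, P[Z = 1] = q
Y = X ^ Z

# P_{Y^n|X^n}(y|x) = P_{Z^n}(z) and P_{Y^n}(y) = 2^{-n} for the uniform input.
log_PZ = np.where(Z == 1, np.log(q), np.log(1 - q)).sum()
info_density = (log_PZ - n * np.log(0.5)) / n     # (1/n) log [P(y|x) / P(y)]
print(info_density, np.log(2) + log_PZ / n)       # identical up to floating point
```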

An alternative method to compute $\underline{I}_\varepsilon(\boldsymbol{X};\boldsymbol{Y})$ is to derive its corresponding sup-spectrum in terms of the inf-spectrum of the noise process. Under the equally likely Bernoulli input $\boldsymbol{X}$, we can write
$$\begin{aligned}
\bar{i}(\theta) &\triangleq \limsup_{n\to\infty}\Pr\left\{\frac{1}{n}\log\frac{P_{Y^n|X^n}(Y^n|X^n)}{P_{Y^n}(Y^n)} \le \theta\right\}\\
&= \limsup_{n\to\infty}\Pr\left\{\frac{1}{n}\log P_{Z^n}(Z^n) - \frac{1}{n}\log P_{Y^n}(Y^n) \le \theta\right\}\\
&= \limsup_{n\to\infty}\Pr\left\{\frac{1}{n}\log P_{Z^n}(Z^n) \le \theta - \log 2\right\}\\
&= \limsup_{n\to\infty}\Pr\left\{-\frac{1}{n}\log P_{Z^n}(Z^n) \ge \log 2 - \theta\right\}\\
&= 1 - \liminf_{n\to\infty}\Pr\left\{-\frac{1}{n}\log P_{Z^n}(Z^n) < \log 2 - \theta\right\}.
\end{aligned}$$

Hence, for $\varepsilon\in(0,1)$,
$$\begin{aligned}
\underline{I}_\varepsilon(\boldsymbol{X};\boldsymbol{Y}) &= \sup\{\theta : \bar{i}(\theta) \le \varepsilon\}\\
&= \sup\left\{\theta : 1 - \liminf_{n\to\infty}\Pr\left[-\tfrac{1}{n}\log P_{Z^n}(Z^n) < \log 2 - \theta\right] \le \varepsilon\right\}\\
&= \sup\left\{\theta : \liminf_{n\to\infty}\Pr\left[-\tfrac{1}{n}\log P_{Z^n}(Z^n) < \log 2 - \theta\right] \ge 1-\varepsilon\right\}\\
&= \sup\left\{\log 2 - \beta : \liminf_{n\to\infty}\Pr\left[-\tfrac{1}{n}\log P_{Z^n}(Z^n) < \beta\right] \ge 1-\varepsilon\right\}\\
&= \log 2 + \sup\left\{-\beta : \liminf_{n\to\infty}\Pr\left[-\tfrac{1}{n}\log P_{Z^n}(Z^n) < \beta\right] \ge 1-\varepsilon\right\}\\
&= \log 2 - \inf\left\{\beta : \liminf_{n\to\infty}\Pr\left[-\tfrac{1}{n}\log P_{Z^n}(Z^n) < \beta\right] \ge 1-\varepsilon\right\}\\
&= \log 2 - \sup\left\{\beta : \liminf_{n\to\infty}\Pr\left[-\tfrac{1}{n}\log P_{Z^n}(Z^n) < \beta\right] < 1-\varepsilon\right\}\\
&\le \log 2 - \sup\left\{\beta : \liminf_{n\to\infty}\Pr\left[-\tfrac{1}{n}\log P_{Z^n}(Z^n) \le \beta\right] < 1-\varepsilon\right\}\\
&= \log 2 - \lim_{\delta\uparrow(1-\varepsilon)}\bar{H}_\delta(\boldsymbol{Z}).
\end{aligned}$$
Also, for $\varepsilon\in(0,1)$,
$$\begin{aligned}
\underline{I}_\varepsilon(\boldsymbol{X};\boldsymbol{Y}) &\ge \sup\left\{\theta : \limsup_{n\to\infty}\Pr\left[\frac{1}{n}\log\frac{P_{X^n,Y^n}(X^n,Y^n)}{P_{X^n}(X^n)P_{Y^n}(Y^n)} < \theta\right] < \varepsilon\right\} \qquad (2.5.3)\\
&= \log 2 - \sup\left\{\beta : \liminf_{n\to\infty}\Pr\left[-\tfrac{1}{n}\log P_{Z^n}(Z^n) \le \beta\right] \le 1-\varepsilon\right\}\\
&= \log 2 - \bar{H}_{(1-\varepsilon)}(\boldsymbol{Z}),
\end{aligned}$$
where (2.5.3) follows from the fact described in footnote 2. Therefore,
$$\log 2 - \bar{H}_{(1-\varepsilon)}(\boldsymbol{Z}) \le \underline{I}_\varepsilon(\boldsymbol{X};\boldsymbol{Y}) \le \log 2 - \lim_{\gamma\uparrow(1-\varepsilon)}\bar{H}_\gamma(\boldsymbol{Z}) \quad\text{for } \varepsilon\in(0,1).$$
By taking $\varepsilon\downarrow 0$, we obtain
$$\underline{I}(\boldsymbol{X};\boldsymbol{Y}) = \underline{I}_0(\boldsymbol{X};\boldsymbol{Y}) = \log 2 - \bar{H}(\boldsymbol{Z}).$$

Based on this result, we can now compute $\underline{I}_\varepsilon(\boldsymbol{X};\boldsymbol{Y})$ for some specific examples.

Example 2.7 Let $\boldsymbol{Z}$ be an all-zero sequence with probability $\beta$ and Bernoulli($p$) with probability $1-\beta$, where Bernoulli($p$) represents a binary Bernoulli process whose individual components equal one with probability $p$. Then the sequence of random variables $(1/n)h_{Z^n}(Z^n)$ converges to $0$ and $h_b(p)$ with respective masses $\beta$ and $1-\beta$, where $h_b(p) \triangleq -p\log p - (1-p)\log(1-p)$ is the binary entropy function. The resulting $h(\theta)$ is depicted in Figure 2.2, and from (2.5.3) we obtain $i(\theta)$ as shown in Figure 2.3.

[Figure 2.2: The spectrum $h(\theta)$ for Example 2.7: a step function rising from 0 to $\beta$ at $\theta = 0$ and from $\beta$ to 1 at $\theta = h_b(p)$.]

[Figure 2.3: The spectrum $i(\theta)$ for Example 2.7: a step function rising from 0 to $1-\beta$ at $\theta = 1 - h_b(p)$ and from $1-\beta$ to 1 at $\theta = 1$.]

Therefore (in bits),
$$\underline{I}_\varepsilon(\boldsymbol{X};\boldsymbol{Y}) = \begin{cases} 1 - h_b(p), & \text{if } 0 < \varepsilon < 1-\beta;\\ 1, & \text{if } 1-\beta \le \varepsilon < 1.\end{cases}$$
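A quick Monte Carlo check of Example 2.7 is sketched below. The particular values of $p$, $\beta$, the block length, and the dominant-term approximation of $(1/n)h_{Z^n}(Z^n)$ (the $O(1/n)$ contributions of the mixture weights are neglected) are my own choices for illustration.

```python
# Numerical check of Example 2.7: (1/n) h_{Z^n}(Z^n) clusters at 0 and h_b(p),
# and I_eps(X;Y) ~ 1 - bar-H_(1-eps)(Z) in bits.
import numpy as np

rng = np.random.default_rng(2)
p, beta, n, trials = 0.2, 0.4, 5000, 20000
hb = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))     # h_b(0.2) = 0.7219 bits

is_all_zero = rng.random(trials) < beta               # which mixture component generated the block
ones = rng.binomial(n, p, size=trials)                # number of 1's in a Bernoulli(p) block
h_bern = -(ones * np.log2(p) + (n - ones) * np.log2(1 - p)) / n
h_n = np.where(is_all_zero, 0.0, h_bern)              # dominant-term approximation of the density

def sup_quantile(samples, delta, grid):
    """Empirical version of sup{theta : Pr[A_n <= theta] <= delta}."""
    ok = [t for t in grid if (samples <= t).mean() <= delta]
    return max(ok) if ok else -np.inf

grid = np.linspace(-0.05, 1.0, 1051)
for eps in (0.3, 0.7):                                # 1 - beta = 0.6 separates the two regimes
    H_sup = sup_quantile(h_n, 1 - eps, grid)          # approximates bar-H_(1-eps)(Z)
    print(f"eps = {eps}: I_eps ~ {1 - H_sup:.3f}")    # ~ 1 - h_b(p) = 0.278, then ~ 1.0
```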

Example 2.8 If
$$\boldsymbol{Z} = \left\{Z^n = \left(Z^{(n)}_1,\ldots,Z^{(n)}_n\right)\right\}_{n=1}^{\infty}$$
is a non-stationary binary independent sequence with
$$\Pr\left\{Z^{(n)}_i = 0\right\} = 1 - \Pr\left\{Z^{(n)}_i = 1\right\} = p_i,$$
then by the uniform boundedness (in $i$) of the variance of the random variable $-\log P_{Z^{(n)}_i}(Z^{(n)}_i)$, namely
$$\mathrm{Var}\left[-\log P_{Z^{(n)}_i}\left(Z^{(n)}_i\right)\right] \le E\left[\left(\log P_{Z^{(n)}_i}\left(Z^{(n)}_i\right)\right)^2\right] \le \sup_{0<p_i<1}\left[p_i(\log p_i)^2 + (1-p_i)(\log(1-p_i))^2\right] < \log 2,$$
we have (by Chebyshev's inequality)
$$\Pr\left\{\left|-\frac{1}{n}\log P_{Z^n}(Z^n) - \frac{1}{n}\sum_{i=1}^n H\left(Z^{(n)}_i\right)\right| > \gamma\right\} \to 0$$
for any $\gamma > 0$. Therefore,
$$\bar{H}_{(1-\varepsilon)}(\boldsymbol{Z}) = \bar{H}(\boldsymbol{Z}) = \limsup_{n\to\infty}\frac{1}{n}\sum_{i=1}^n H\left(Z^{(n)}_i\right) = \limsup_{n\to\infty}\frac{1}{n}\sum_{i=1}^n h_b(p_i)$$
for $\varepsilon\in(0,1]$, and infinity for $\varepsilon = 0$, where $h_b(p_i) = -p_i\log(p_i) - (1-p_i)\log(1-p_i)$. Consequently (in bits),
$$\underline{I}_\varepsilon(\boldsymbol{X};\boldsymbol{Y}) = \begin{cases} 1 - \bar{H}(\boldsymbol{Z}) = 1 - \displaystyle\limsup_{n\to\infty}\frac{1}{n}\sum_{i=1}^n h_b(p_i), & \text{for } \varepsilon\in[0,1),\\ \infty, & \text{for } \varepsilon = 1.\end{cases}$$
This result is illustrated in Figures 2.4 and 2.5.

[Figure 2.4: The limiting spectrums of $(1/n)h_{Z^n}(Z^n)$ for Example 2.8; the clustering points lie between $\underline{H}(\boldsymbol{Z})$ and $\bar{H}(\boldsymbol{Z})$.]

[Figure 2.5: The possible limiting spectrums of $(1/n)i_{X^n,Y^n}(X^n;Y^n)$ for Example 2.8; the clustering points lie between $\log 2 - \bar{H}(\boldsymbol{Z})$ and $\log 2 - \underline{H}(\boldsymbol{Z})$.]
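A small sketch of Example 2.8 follows. The particular schedule of $p_i$ (switching between 0.1 and 0.5 on dyadic blocks) is a hypothetical choice of my own, used only to make the running average $(1/n)\sum_i h_b(p_i)$ oscillate so that its limsup, rather than a limit, governs $\underline{I}_\varepsilon(\boldsymbol{X};\boldsymbol{Y})$.

```python
# Oscillating running average of h_b(p_i) for a non-stationary independent noise process.
import numpy as np

def hb(p):
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def p_i(i):
    # p_i = 0.5 on dyadic blocks [2^(2k), 2^(2k+1)) and 0.1 otherwise (assumed schedule)
    return 0.5 if int(np.log2(i)) % 2 == 0 else 0.1

N = 2 ** 18
avg = np.cumsum([hb(p_i(i)) for i in range(1, N + 1)]) / np.arange(1, N + 1)
print("running averages at n = 2^k:", [round(avg[2 ** k - 1], 3) for k in range(10, 19)])
# The averages keep oscillating; the largest accumulation point approximates bar-H(Z),
# and 1 - bar-H(Z) approximates I_eps(X;Y) in bits for every eps in [0,1).
print("estimated bar-H(Z):", round(max(avg[N // 4:]), 3))
```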

2.6 Renyi’s information measures

In this section, we introduce alternative generalizations of information measures. They are respectively named Renyi's entropy, Renyi's mutual information and Renyi's divergence.

It is known that Renyi's information measures can be used to provide exponentially tight error bounds for data compression block coding and hypothesis testing systems with i.i.d. statistics. Campbell also found an operational characterization of Renyi's entropy for variable-length data compression codes [2]. Here, we only provide their definitions; their operational meanings will be introduced in subsequent chapters.

Definition 2.9 (Renyi’s entropy) For α > 0, the Renyi’s entropy8 of orderα is defined by:

H(X;α) ,

1

1− αlog

(∑x∈X

[PX(x)]α), for α 6= 1;

limα→1

H(X;α) = −∑x∈X

PX(x) logPX(x), for α = 1.

Definition 2.10 (Renyi’s divergence) For α > 0, the Renyi’s divergence oforder α is defined by:

D(X‖X;α) ,

1

α− 1log

(∑x∈X

[PαX(x)P 1−α

X(x)])

, for α 6= 1;

limα→1

D(X‖X;α) =∑x∈X

PX(x) logPX(x)

PX(x), for α = 1.

8The Renyi’s entropy is usually denoted by Hα(X); however, in order not to confuse withthe ε-inf/sup-entropy rate Hε(X) and Hε(X), we adopt the notation H(X;α) for Renyi’sentropy. Same interpretations apply to other Renyi’s measures.

24
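A short computational sketch of Definitions 2.9 and 2.10 (the two toy distributions are my own choices) is given below; it also checks that the order $\alpha\to 1$ recovers the Shannon entropy and the Kullback-Leibler divergence.

```python
# Renyi entropy and Renyi divergence of order alpha for finite distributions (in nats).
import numpy as np

def renyi_entropy(p, alpha):
    p = np.asarray(p, dtype=float)
    if np.isclose(alpha, 1.0):
        return -(p * np.log(p)).sum()                      # Shannon entropy
    return np.log((p ** alpha).sum()) / (1.0 - alpha)

def renyi_divergence(p, q, alpha):
    p, q = np.asarray(p, float), np.asarray(q, float)
    if np.isclose(alpha, 1.0):
        return (p * np.log(p / q)).sum()                   # Kullback-Leibler divergence
    return np.log((p ** alpha * q ** (1.0 - alpha)).sum()) / (alpha - 1.0)

p, q = [0.5, 0.3, 0.2], [0.2, 0.5, 0.3]
for a in (0.5, 0.999, 1.0, 2.0):
    print(a, round(renyi_entropy(p, a), 4), round(renyi_divergence(p, q, a), 4))
# H(X; alpha) is non-increasing in alpha (cf. Lemma 2.13, property 2), and the values at
# alpha = 0.999 are close to the Shannon/KL values at alpha = 1.
```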

There are two possible Renyi extensions for mutual information. One is based on the observation that
$$I(X;Y) = \sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}}P_X(x)P_{Y|X}(y|x)\log\frac{P_{Y|X}(y|x)}{P_Y(y)} = \min_{P_Y}\sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}}P_X(x)P_{Y|X}(y|x)\log\frac{P_{Y|X}(y|x)}{P_Y(y)},$$
where the minimization is over all output distributions $P_Y$. The latter equality can be proved by taking the derivative of
$$\sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}}P_X(x)P_{Y|X}(y|x)\log\frac{P_{Y|X}(y|x)}{P_Y(y)} + \lambda\left(\sum_{y\in\mathcal{Y}}P_Y(y) - 1\right)$$
with respect to $P_Y(y)$ and setting it to zero, which shows that the minimizing output distribution satisfies
$$P^*_Y(y) = \sum_{x\in\mathcal{X}}P_X(x)P_{Y|X}(y|x).$$
The other extension is a direct generalization of
$$I(X;Y) = D(P_{X,Y}\|P_X\times P_Y) = \min_{P_Y}D(P_{X,Y}\|P_X\times P_Y)$$
to the Renyi divergence.

Definition 2.11 (type-I Renyi's mutual information) For $\alpha > 0$, the type-I Renyi mutual information of order $\alpha$ is defined by
$$I(X;Y;\alpha) \triangleq \begin{cases} \displaystyle\min_{P_Y}\frac{1}{\alpha-1}\sum_{x\in\mathcal{X}}P_X(x)\log\left(\sum_{y\in\mathcal{Y}}P_{Y|X}^\alpha(y|x)P_Y^{1-\alpha}(y)\right), & \text{if } \alpha\ne 1;\\[2ex] \displaystyle\lim_{\alpha\to 1}I(X;Y;\alpha) = I(X;Y), & \text{if } \alpha = 1,\end{cases}$$
where the minimization is taken over all $P_Y$ under fixed $P_X$ and $P_{Y|X}$.

Definition 2.12 (type-II Renyi's mutual information) For $\alpha > 0$, the type-II Renyi mutual information of order $\alpha$ is defined by
$$J(X;Y;\alpha) \triangleq \min_{P_Y}D(P_{X,Y}\|P_X\times P_Y;\alpha) = \begin{cases} \displaystyle\frac{\alpha}{\alpha-1}\log\sum_{y\in\mathcal{Y}}\left(\sum_{x\in\mathcal{X}}P_X(x)P_{Y|X}^\alpha(y|x)\right)^{1/\alpha}, & \text{for } \alpha\ne 1;\\[2ex] \displaystyle\lim_{\alpha\to 1}J(X;Y;\alpha) = I(X;Y), & \text{for } \alpha = 1.\end{cases}$$

Some elementary properties of the Renyi entropy and divergence are summarized below.

Lemma 2.13 For a finite alphabet $\mathcal{X}$, the following statements hold.

1. $0 \le H(X;\alpha) \le \log|\mathcal{X}|$; the first equality holds if, and only if, $X$ is deterministic, and the second equality holds if, and only if, $X$ is uniformly distributed over $\mathcal{X}$.

2. $H(X;\alpha)$ is strictly decreasing in $\alpha$ unless $X$ is uniformly distributed over its support $\{x\in\mathcal{X} : P_X(x) > 0\}$.

3. $\lim_{\alpha\downarrow 0}H(X;\alpha) = \log|\{x\in\mathcal{X} : P_X(x) > 0\}|$.

4. $\lim_{\alpha\to\infty}H(X;\alpha) = -\log\max_{x\in\mathcal{X}}P_X(x)$.

5. $D(X\|\hat{X};\alpha) \ge 0$, with equality if, and only if, $P_X = P_{\hat{X}}$.

6. $D(X\|\hat{X};\alpha) = \infty$ if, and only if, either
$$\{x\in\mathcal{X} : P_X(x) > 0 \text{ and } P_{\hat{X}}(x) > 0\} = \emptyset$$
or
$$\{x\in\mathcal{X} : P_X(x) > 0\} \not\subset \{x\in\mathcal{X} : P_{\hat{X}}(x) > 0\} \quad\text{for } \alpha \ge 1.$$

7. $\lim_{\alpha\downarrow 0}D(X\|\hat{X};\alpha) = -\log P_{\hat{X}}\{x\in\mathcal{X} : P_X(x) > 0\}$.

8. If $P_X(x) > 0$ implies $P_{\hat{X}}(x) > 0$, then
$$\lim_{\alpha\to\infty}D(X\|\hat{X};\alpha) = \max_{x\in\mathcal{X} : P_X(x)>0}\log\frac{P_X(x)}{P_{\hat{X}}(x)}.$$

9. $I(X;Y;\alpha) \ge J(X;Y;\alpha)$ for $0 < \alpha < 1$, and $I(X;Y;\alpha) \le J(X;Y;\alpha)$ for $\alpha > 1$.

Lemma 2.14 (data processing lemma for type-I Renyi's mutual information) Fix $\alpha > 0$. If $X\to Y\to Z$, then $I(X;Y;\alpha) \ge I(X;Z;\alpha)$.


Bibliography

[1] P. Billingsley, Probability and Measure, 2nd edition, New York, NY: John Wiley and Sons, 1995.

[2] L. L. Campbell, "A coding theorem and Renyi's entropy," Information and Control, vol. 8, pp. 423–429, 1965.

[3] R. M. Gray, Entropy and Information Theory, New York: Springer-Verlag, 1990.

[4] T. S. Han and S. Verdu, "Approximation theory of output statistics," IEEE Trans. on Information Theory, vol. IT-39, no. 3, pp. 752–772, May 1993.

[5] S. Verdu and T. S. Han, "A general formula for channel capacity," IEEE Trans. on Information Theory, vol. IT-40, no. 4, pp. 1147–1157, Jul. 1994.


Chapter 3

General Lossless Data Compression Theorems

In Volume I of the lecture notes, we already know that the entropy rate
$$\lim_{n\to\infty}\frac{1}{n}H(X^n)$$
is the minimum data compression rate (nats per source symbol) at which the compression error can be made arbitrarily small for block coding of a stationary-ergodic source. We also mentioned that in more complicated situations where the source becomes non-stationary, the quantity $\lim_{n\to\infty}(1/n)H(X^n)$ may not exist and can no longer be used to characterize the source compression. This results in the need to establish a new entropy measure which appropriately characterizes the operational limits of arbitrary stochastic systems, which was done in the previous chapter.

The role of a source code is to represent the output of a source efficiently. Specifically, a source code design aims to minimize the source description rate of the code subject to a fidelity criterion constraint. One commonly used fidelity criterion is an upper bound on the probability of decoding error $P_e$. If $P_e$ is made arbitrarily small, we obtain a traditional (almost) error-free source coding system¹. Lossy data compression codes form a larger class of codes in the sense that the fidelity criterion used in the coding scheme is a general distortion measure. In this chapter, we only demonstrate the bounded-error data compression theorems for arbitrary (not necessarily stationary, ergodic, information stable, etc.) sources. The general lossy data compression theorems will be introduced in subsequent chapters.

¹ Recall that complete error-free data compression is required only for variable-length codes. A lossless data compression block code only dictates that the compression error can be made arbitrarily small, i.e., asymptotically error-free.

3.1 Fixed-length data compression codes for arbitrary sources

Equipped with the general information measures, we herein demonstrate a generalized Asymptotic Equipartition Property (AEP) theorem and establish expressions for the minimum ε-achievable (fixed-length) coding rate of an arbitrary source $\boldsymbol{X}$.

We make one implicit assumption in the following derivation: the source alphabet $\mathcal{X}$ is finite.²

² Actually, the theorems introduced also apply to sources with countable alphabets. We assume finite alphabets in order to avoid uninteresting cases (such as $\bar{H}_\varepsilon(\boldsymbol{X}) = \infty$) that might arise with countable alphabets.

Definition 3.1 (cf. Definition 4.2 and its associated footnote in Volume I) An $(n,M)$ block code for data compression is a set
$$\mathcal{C}_n \triangleq \{c_1, c_2, \ldots, c_M\}$$
consisting of $M$ sourcewords³ of block length $n$ (together with a binary-indexing codeword for each sourceword $c_i$); each sourceword represents a group of source symbols of length $n$.

³ In Definition 4.2 of Volume I, the $(n,M)$ block data compression code is defined by $M$ codewords, where each codeword represents a group of sourcewords of length $n$. However, we can actually pick one source symbol from each group and equivalently define the code using these $M$ representative sourcewords. Later, it will be shown that this viewpoint facilitates the proof of the general source coding theorem.

Definition 3.2 Fix $\varepsilon\in[0,1]$. $R$ is an ε-achievable data compression rate for a source $\boldsymbol{X}$ if there exists a sequence of block data compression codes $\{\mathcal{C}_n = (n,M_n)\}_{n=1}^{\infty}$ with
$$\limsup_{n\to\infty}\frac{1}{n}\log M_n \le R$$
and
$$\limsup_{n\to\infty}P_e(\mathcal{C}_n) \le \varepsilon,$$
where $P_e(\mathcal{C}_n) \triangleq \Pr(X^n\notin\mathcal{C}_n)$ is the probability of decoding error.

The infimum of all ε-achievable data compression rates for $\boldsymbol{X}$ is denoted by $T_\varepsilon(\boldsymbol{X})$.

Lemma 3.3 (Lemma 1.5 in [3]) Fix a positive integer $n$. There exists an $(n,M_n)$ source block code $\mathcal{C}_n$ for $P_{X^n}$ such that its error probability satisfies
$$P_e(\mathcal{C}_n) \le \Pr\left[\frac{1}{n}h_{X^n}(X^n) > \frac{1}{n}\log M_n\right].$$

Proof: Observe that
$$1 \ge \sum_{x^n : (1/n)h_{X^n}(x^n)\le(1/n)\log M_n} P_{X^n}(x^n) \ge \sum_{x^n : (1/n)h_{X^n}(x^n)\le(1/n)\log M_n} \frac{1}{M_n} = \left|\left\{x^n\in\mathcal{X}^n : \frac{1}{n}h_{X^n}(x^n) \le \frac{1}{n}\log M_n\right\}\right|\frac{1}{M_n}.$$
Therefore, $\left|\left\{x^n\in\mathcal{X}^n : (1/n)h_{X^n}(x^n) \le (1/n)\log M_n\right\}\right| \le M_n$. We can then choose a code
$$\mathcal{C}_n \supset \left\{x^n\in\mathcal{X}^n : \frac{1}{n}h_{X^n}(x^n) \le \frac{1}{n}\log M_n\right\}$$
with $|\mathcal{C}_n| = M_n$ and
$$P_e(\mathcal{C}_n) = 1 - P_{X^n}\{\mathcal{C}_n\} \le \Pr\left[\frac{1}{n}h_{X^n}(X^n) > \frac{1}{n}\log M_n\right]. \qquad\Box$$
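The construction in the proof can be carried out directly for a small toy source. The sketch below (an assumed i.i.d. binary source with exhaustive enumeration; all parameters are illustrative) keeps every sourceword whose normalized entropy density is at most $(1/n)\log M_n$, fills the code with the most probable remaining sourcewords, and checks the error bound of Lemma 3.3.

```python
# Lemma 3.3 construction for a toy i.i.d. source (exhaustive enumeration, small n).
import itertools
import numpy as np

p = {0: 0.7, 1: 0.3}                     # P_X (toy choice)
n, Mn = 12, 2 ** 8                       # block length and code size

words = list(itertools.product(p, repeat=n))
prob = np.array([np.prod([p[s] for s in w]) for w in words])
h_density = -np.log(prob) / n            # (1/n) h_{X^n}(x^n)

threshold = np.log(Mn) / n
inside = h_density <= threshold          # sourcewords forced into the code
assert inside.sum() <= Mn                # the counting argument of the proof
Pe_bound = prob[~inside].sum()           # Pr[(1/n) h_{X^n}(X^n) > (1/n) log M_n]

order = np.argsort(-prob)                # fill the code up to M_n most probable sourcewords
in_code = np.zeros(len(words), dtype=bool)
in_code[order[:Mn]] = True
Pe = prob[~in_code].sum()
print(f"P_e(code) = {Pe:.4f} <= bound {Pe_bound:.4f}")
```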

Lemma 3.4 (Lemma 1.6 in [3]) Every $(n,M_n)$ source block code $\mathcal{C}_n$ for $P_{X^n}$ satisfies
$$P_e(\mathcal{C}_n) \ge \Pr\left[\frac{1}{n}h_{X^n}(X^n) > \frac{1}{n}\log M_n + \gamma\right] - \exp\{-n\gamma\}$$
for every $\gamma > 0$.

Proof: It suffices to prove that
$$1 - P_e(\mathcal{C}_n) = \Pr\{X^n\in\mathcal{C}_n\} < \Pr\left[\frac{1}{n}h_{X^n}(X^n) \le \frac{1}{n}\log M_n + \gamma\right] + \exp\{-n\gamma\}.$$
Clearly,
$$\begin{aligned}
\Pr\{X^n\in\mathcal{C}_n\}
&= \Pr\left\{X^n\in\mathcal{C}_n \text{ and } \tfrac{1}{n}h_{X^n}(X^n) \le \tfrac{1}{n}\log M_n + \gamma\right\} + \Pr\left\{X^n\in\mathcal{C}_n \text{ and } \tfrac{1}{n}h_{X^n}(X^n) > \tfrac{1}{n}\log M_n + \gamma\right\}\\
&\le \Pr\left\{\tfrac{1}{n}h_{X^n}(X^n) \le \tfrac{1}{n}\log M_n + \gamma\right\} + \Pr\left\{X^n\in\mathcal{C}_n \text{ and } \tfrac{1}{n}h_{X^n}(X^n) > \tfrac{1}{n}\log M_n + \gamma\right\}\\
&= \Pr\left\{\tfrac{1}{n}h_{X^n}(X^n) \le \tfrac{1}{n}\log M_n + \gamma\right\} + \sum_{x^n\in\mathcal{C}_n} P_{X^n}(x^n)\cdot\mathbf{1}\left\{\tfrac{1}{n}h_{X^n}(x^n) > \tfrac{1}{n}\log M_n + \gamma\right\}\\
&= \Pr\left\{\tfrac{1}{n}h_{X^n}(X^n) \le \tfrac{1}{n}\log M_n + \gamma\right\} + \sum_{x^n\in\mathcal{C}_n} P_{X^n}(x^n)\cdot\mathbf{1}\left\{P_{X^n}(x^n) < \tfrac{1}{M_n}\exp\{-n\gamma\}\right\}\\
&< \Pr\left\{\tfrac{1}{n}h_{X^n}(X^n) \le \tfrac{1}{n}\log M_n + \gamma\right\} + |\mathcal{C}_n|\tfrac{1}{M_n}\exp\{-n\gamma\}\\
&= \Pr\left\{\tfrac{1}{n}h_{X^n}(X^n) \le \tfrac{1}{n}\log M_n + \gamma\right\} + \exp\{-n\gamma\}. \qquad\Box
\end{aligned}$$

We now apply Lemmas 3.3 and 3.4 to prove a general source coding theorem for block codes.

Theorem 3.5 (general source coding theorem) For any source $\boldsymbol{X}$,
$$T_\varepsilon(\boldsymbol{X}) = \begin{cases} \displaystyle\lim_{\delta\uparrow(1-\varepsilon)}\bar{H}_\delta(\boldsymbol{X}), & \text{for } \varepsilon\in[0,1);\\ 0, & \text{for } \varepsilon = 1.\end{cases}$$

Proof: The case $\varepsilon = 1$ follows directly from the definition; hence, the proof focuses only on the case $\varepsilon\in[0,1)$.

1. Forward part (achievability): $T_\varepsilon(\boldsymbol{X}) \le \lim_{\delta\uparrow(1-\varepsilon)}\bar{H}_\delta(\boldsymbol{X})$

We need to prove the existence of a sequence of block codes $\{\mathcal{C}_n\}_{n\ge 1}$ such that for every $\gamma > 0$,
$$\limsup_{n\to\infty}\frac{1}{n}\log|\mathcal{C}_n| \le \lim_{\delta\uparrow(1-\varepsilon)}\bar{H}_\delta(\boldsymbol{X}) + \gamma \quad\text{and}\quad \limsup_{n\to\infty}P_e(\mathcal{C}_n) \le \varepsilon.$$

Lemma 3.3 ensures the existence (for any $\gamma > 0$) of a source block code $\mathcal{C}_n = \left(n, M_n = \left\lceil\exp\left\{n\left(\lim_{\delta\uparrow(1-\varepsilon)}\bar{H}_\delta(\boldsymbol{X}) + \gamma\right)\right\}\right\rceil\right)$ with error probability
$$P_e(\mathcal{C}_n) \le \Pr\left\{\frac{1}{n}h_{X^n}(X^n) > \frac{1}{n}\log M_n\right\} \le \Pr\left\{\frac{1}{n}h_{X^n}(X^n) > \lim_{\delta\uparrow(1-\varepsilon)}\bar{H}_\delta(\boldsymbol{X}) + \gamma\right\}.$$
Therefore,
$$\begin{aligned}
\limsup_{n\to\infty}P_e(\mathcal{C}_n) &\le \limsup_{n\to\infty}\Pr\left\{\frac{1}{n}h_{X^n}(X^n) > \lim_{\delta\uparrow(1-\varepsilon)}\bar{H}_\delta(\boldsymbol{X}) + \gamma\right\}\\
&= 1 - \liminf_{n\to\infty}\Pr\left\{\frac{1}{n}h_{X^n}(X^n) \le \lim_{\delta\uparrow(1-\varepsilon)}\bar{H}_\delta(\boldsymbol{X}) + \gamma\right\}\\
&\le 1 - (1-\varepsilon) = \varepsilon,
\end{aligned}$$
where the last inequality follows from
$$\lim_{\delta\uparrow(1-\varepsilon)}\bar{H}_\delta(\boldsymbol{X}) = \sup\left\{\theta : \liminf_{n\to\infty}\Pr\left[\frac{1}{n}h_{X^n}(X^n) \le \theta\right] < 1-\varepsilon\right\}. \qquad (3.1.1)$$

2. Converse part: Tε(X) ≥ lim_{δ↑(1−ε)} Hδ(X)

Assume without loss of generality that lim_{δ↑(1−ε)} Hδ(X) > 0. We will prove the converse by contradiction. Suppose that Tε(X) < lim_{δ↑(1−ε)} Hδ(X). Then (∃ γ > 0) Tε(X) < lim_{δ↑(1−ε)} Hδ(X) − 4γ. By definition of Tε(X), there exists a sequence of codes { C∼n} such that

lim sup_{n→∞} (1/n) log | C∼n| ≤ ( lim_{δ↑(1−ε)} Hδ(X) − 4γ ) + γ    (3.1.2)

and

lim sup_{n→∞} Pe( C∼n) ≤ ε.    (3.1.3)

(3.1.2) implies that

(1/n) log | C∼n| ≤ lim_{δ↑(1−ε)} Hδ(X) − 2γ

for all sufficiently large n. Hence, for those n satisfying the above inequality, Lemma 3.4 gives

Pe( C∼n) ≥ Pr[ (1/n)h_{X^n}(X^n) > (1/n) log | C∼n| + γ ] − e^{−nγ}
 ≥ Pr[ (1/n)h_{X^n}(X^n) > ( lim_{δ↑(1−ε)} Hδ(X) − 2γ ) + γ ] − e^{−nγ}.


Therefore,

lim sup_{n→∞} Pe( C∼n) ≥ 1 − lim inf_{n→∞} Pr[ (1/n)h_{X^n}(X^n) ≤ lim_{δ↑(1−ε)} Hδ(X) − γ ] > ε,

where the last inequality follows from (3.1.1). Thus, a contradiction to (3.1.3) is obtained.  □

A few remarks are made based on the previous theorem.

• Note that when ε = 0, lim_{δ↑(1−ε)} Hδ(X) = H̄(X). Hence, the above theorem generalizes the block source coding theorem in [4], which states that the minimum achievable fixed-length source coding rate of any finite-alphabet source is H̄(X).

• Consider the special case where −(1/n) log P_{X^n}(X^n) converges in probability to a constant H, which holds for all information stable sources⁴. In this case, both the inf- and sup-spectrums of X degenerate to a unit step function:

u(θ) = 1, if θ > H;   u(θ) = 0, if θ < H,

where H is the source entropy rate. Thus, Hε(X) = H for all ε ∈ [0, 1). Hence, the general source coding theorem reduces to the conventional source coding theorem.

• More generally, if −(1/n) log P_{X^n}(X^n) converges in probability to a random variable Z whose cumulative distribution function (cdf) is F_Z(·), then the minimum achievable data compression rate subject to the decoding error being no greater than ε is

lim_{δ↑(1−ε)} Hδ(X) = sup{ R : F_Z(R) < 1 − ε }.

Therefore, the relationship between the code rate and the ultimate optimal error probability is also clearly defined. We further explore this case in the next example.

⁴ A source X = {X^n = (X_1^{(n)}, . . . , X_n^{(n)})}∞_{n=1} is said to be information stable if H(X^n) = E[−log P_{X^n}(X^n)] > 0 for all n, and

lim_{n→∞} Pr( | (−log P_{X^n}(X^n)) / H(X^n) − 1 | > ε ) = 0,

for every ε > 0. By this definition, any stationary-ergodic source with finite n-fold entropy is information stable; hence, information stability can be viewed as a generalized source model for stationary-ergodic sources.


Example 3.6 Consider a binary source X where each Xn is Bernoulli(Θ) distributed, and Θ is a random variable defined over (0, 1). This is a stationary but non-ergodic source [1]. We can view the source as a mixture of Bernoulli(θ) processes, where the parameter θ ranges over Θ = (0, 1) and has distribution PΘ [1, Corollary 1]. Therefore, it can be shown by the ergodic decomposition theorem (which states that any stationary source can be viewed as a mixture of stationary-ergodic sources) that −(1/n) log P_{X^n}(X^n) converges in probability to the random variable Z = hb(Θ) [1], where hb(x) , −x log2(x) − (1 − x) log2(1 − x) is the binary entropy function. Consequently, the cdf of Z is F_Z(z) = Pr{hb(Θ) ≤ z}; and the minimum achievable fixed-length source coding rate with compression error no larger than ε is

sup{R : F_Z(R) < 1 − ε} = sup{R : Pr{hb(Θ) ≤ R} < 1 − ε}.
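To make the threshold concrete, the following sketch (our own illustration, not part of the original example) numerically evaluates sup{R : Pr[hb(Θ) ≤ R] < 1 − ε} under the assumption that Θ is uniform on (0, 1); the Monte Carlo size and seed are arbitrary choices.

    import random
    from math import log2

    def hb(p):
        """Binary entropy function in bits."""
        return -p * log2(p) - (1 - p) * log2(1 - p)

    def min_rate(eps, num_samples=200_000, seed=0):
        """Monte Carlo estimate of sup{R : Pr[hb(Theta) <= R] < 1 - eps}
        assuming Theta ~ Uniform(0, 1)."""
        rng = random.Random(seed)
        z = sorted(hb(rng.uniform(1e-12, 1 - 1e-12)) for _ in range(num_samples))
        # the supremum is the (1 - eps)-quantile of Z = hb(Theta)
        index = min(int((1 - eps) * num_samples), num_samples - 1)
        return z[index]

    for eps in (0.0, 0.1, 0.5):
        print(f"eps = {eps:.1f}: minimum compression rate ~ {min_rate(eps):.3f} bits/letter")

For ε = 0 the estimate approaches 1 bit per letter (the essential supremum of hb(Θ)), while larger tolerated error probabilities allow strictly smaller rates, exactly as the formula predicts.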

• From the above example, or from Theorem 3.5, we see that the strong converse theorem (which states that codes with rate below the entropy rate will ultimately have decompression error approaching one, cf. Theorem 4.6 of Volume I of the lecture notes) does not hold in general. However, one can always claim the weak converse statement for arbitrary sources.

Theorem 3.7 (weak converse theorem) For any block code sequence of ultimate rate R < H(X), the probability of block decoding failure Pe cannot be made arbitrarily small. In other words, there exists ε > 0 such that Pe is lower bounded by ε infinitely often in block length n.

The possible behavior of the probability of block decompression error of an arbitrary source is depicted in Figure 3.1. As shown in the figure, there exist two bounds, denoted by H(X) and H0(X), where H(X) is the tight bound for the lossless data compression rate. In other words, it is possible to find a sequence of block codes with compression rate larger than H(X) whose probability of decoding error is asymptotically zero. When the data compression rate lies between H0(X) and H(X), the minimum achievable probability of decoding error is bounded below by a positive constant in (0, 1) infinitely often in blocklength n. When the data compression rate is less than H0(X), the probability of decoding error of all codes eventually goes to 1 (for n infinitely often). This fact tells the block code designer that all codes with long blocklength are bad when the data compression rate is smaller than H0(X). From the strong converse theorem, the two bounds in Figure 3.1 coincide for memoryless sources. In fact, these two bounds coincide even for stationary-ergodic sources.


Figure 3.1: Behavior of the probability of block decoding error as block length n goes to infinity for an arbitrary source X. (The rate axis R is divided by the two thresholds H0(X) < H(X): for R < H0(X), Pe → 1 infinitely often in n; for H0(X) < R < H(X), Pe is lower bounded infinitely often in n; for R > H(X), Pe → 0 as n → ∞.)

We close this section by remarking that the definition that we adopt for the ε-achievable data compression rate is slightly different from, but equivalent to, the one used in [4, Def. 8]. The definition in [4] also yields the same result, which was separately proved by Steinberg and Verdu as a direct consequence of Theorem 10(a) (or Corollary 3) in [5]. To be precise, they showed that Tε(X), denoted by Te(ε,X) in [5], is equal to Rv(2ε) (cf. Def. 17 in [5]). By a simple derivation, we obtain:

Te(ε,X) = Rv(2ε)
 = inf{ θ : lim sup_{n→∞} P_{X^n}[ −(1/n) log P_{X^n}(X^n) > θ ] ≤ ε }
 = inf{ θ : lim inf_{n→∞} P_{X^n}[ −(1/n) log P_{X^n}(X^n) ≤ θ ] ≥ 1 − ε }
 = sup{ θ : lim inf_{n→∞} P_{X^n}[ −(1/n) log P_{X^n}(X^n) ≤ θ ] < 1 − ε }
 = lim_{δ↑(1−ε)} Hδ(X).

Note that Theorem 10(a) in [5] is a lossless data compression theorem for arbitrary sources, which the authors show as a by-product of their results on finite-precision resolvability theory. Specifically, they proved T0(X) = S(X) [4, Thm. 1] and S(X) = H̄(X) [4, Thm. 3], where S(X) is the resolvability⁵ of an arbitrary source X. Here, we establish Theorem 3.5 in a different and more direct way.

3.2 Generalized AEP theorem

For discrete memoryless sources, the data compression theorem is proved by choosing the codebook C∼n to be the weakly δ-typical set and applying the Asymptotic Equipartition Property (AEP), which states that (1/n)h_{X^n}(X^n) converges to H(X) with probability one (and hence in probability). The AEP – which implies that the probability of the typical set is close to one for sufficiently large n – also holds for stationary-ergodic sources. It is however invalid for more general sources – e.g., non-stationary, non-ergodic sources. We herein demonstrate a generalized AEP theorem.

⁵ The resolvability, which is a measure of randomness for random variables, will be introduced in subsequent chapters.

Theorem 3.8 (generalized asymptotic equipartition property for arbitrary sources) Fix ε ∈ [0, 1). Given an arbitrary source X, define

Tn[R] , { x^n ∈ X^n : −(1/n) log P_{X^n}(x^n) ≤ R }.

Then for any δ > 0, the following statements hold.

1. lim inf_{n→∞} Pr{ Tn[Hε(X) − δ] } ≤ ε.    (3.2.1)

2. lim inf_{n→∞} Pr{ Tn[Hε(X) + δ] } > ε.    (3.2.2)

3. The number of elements in

Fn(δ; ε) , Tn[Hε(X) + δ] \ Tn[Hε(X) − δ],

denoted by |Fn(δ; ε)|, satisfies

|Fn(δ; ε)| ≤ exp{ n(Hε(X) + δ) },    (3.2.3)

where the operation A\B between two sets A and B is defined by A\B , A ∩ B^c with B^c denoting the complement of B.

4. There exist ρ = ρ(δ) > 0 and a subsequence {nj}∞_{j=1} such that

|F_{nj}(δ; ε)| > ρ · exp{ nj(Hε(X) − δ) }.    (3.2.4)

Proof: (3.2.1) and (3.2.2) follow from the definitions. For (3.2.3), we have

1 ≥ Σ_{x^n ∈ Fn(δ;ε)} P_{X^n}(x^n)
  ≥ Σ_{x^n ∈ Fn(δ;ε)} exp{ −n(Hε(X) + δ) }
  = |Fn(δ; ε)| exp{ −n(Hε(X) + δ) }.


Figure 3.2: Illustration of the generalized AEP theorem. Fn(δ; ε) , Tn[Hε(X) + δ] \ Tn[Hε(X) − δ] is the dashed region (the difference of the two nested sets Tn[Hε(X) − δ] ⊂ Tn[Hε(X) + δ]).

It remains to show (3.2.4). (3.2.2) implies that there exist ρ = ρ(δ) > 0 and N1 such that for all n > N1,

Pr{ Tn[Hε(X) + δ] } > ε + 2ρ.

Furthermore, (3.2.1) implies that for the previously chosen ρ, there exists a subsequence {n′j}∞_{j=1} such that

Pr{ T_{n′j}[Hε(X) − δ] } < ε + ρ.

Therefore, for all n′j > N1,

ρ < Pr{ T_{n′j}[Hε(X) + δ] \ T_{n′j}[Hε(X) − δ] }
  < | T_{n′j}[Hε(X) + δ] \ T_{n′j}[Hε(X) − δ] | exp{ −n′j (Hε(X) − δ) }
  = | F_{n′j}(δ; ε) | exp{ −n′j (Hε(X) − δ) }.

The desired subsequence {nj}∞_{j=1} is then obtained by letting n1 be the first n′j > N1, n2 the second n′j > N1, and so on.  □

With the illustration depicted in Figure 3.2, we can clearly deduce that Theorem 3.8 is indeed a generalized version of the AEP since:

• The set

Fn(δ; ε) , Tn[Hε(X) + δ] \ Tn[Hε(X) − δ]
        = { x^n ∈ X^n : | −(1/n) log P_{X^n}(x^n) − Hε(X) | ≤ δ }

is nothing but the weakly δ-typical set.

• (3.2.1) and (3.2.2) imply that qn , Pr{Fn(δ; ε)} > 0 infinitely often in n.

• (3.2.3) and (3.2.4) imply that the number of sequences in Fn(δ; ε) (the dashed region) is approximately equal to exp{nHε(X)}, and the probability of each sequence in Fn(δ; ε) can be estimated by qn · exp{−nHε(X)}.

• In particular, if X is a stationary-ergodic source, then Hε(X) is independent of ε ∈ [0, 1) and Hε(X) = H for all ε ∈ [0, 1), where H is the source entropy rate

H = lim_{n→∞} (1/n) E[ −log P_{X^n}(X^n) ].

In this case, (3.2.1)–(3.2.2) and the fact that Hε(X) = H for all ε ∈ [0, 1) imply that the probability of the typical set Fn(δ; ε) is close to one (for n sufficiently large), and (3.2.3) and (3.2.4) imply that there are about e^{nH} typical sequences of length n, each with probability about e^{−nH}. Hence we obtain the conventional AEP.

• The general source coding theorem can also be proved in terms of thegeneralized AEP theorem. For details, readers can refer to [2].
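Before leaving the AEP, a small numerical check may help. The sketch below (an illustration we add, not part of the original notes) counts the weakly δ-typical binary sequences of an i.i.d. Bernoulli(p) source and their total probability, confirming that |Fn(δ)| ≈ 2^{nH} and Pr[Fn(δ)] → 1; the values of p, δ and n are arbitrary choices.

    from math import comb, log2

    def typical_set_stats(n, p=0.3, delta=0.1):
        """Count weakly delta-typical sequences of an i.i.d. Bernoulli(p) source
        of length n and accumulate their probability (exhaustive, by weight k)."""
        H = -p * log2(p) - (1 - p) * log2(1 - p)      # entropy rate in bits
        count, prob = 0, 0.0
        for k in range(n + 1):                        # k = number of ones
            # every sequence with k ones has the same probability
            logp = k * log2(p) + (n - k) * log2(1 - p)
            if abs(-logp / n - H) <= delta:           # weak typicality test
                count += comb(n, k)
                prob += comb(n, k) * 2.0 ** logp
        return count, prob, n * H

    for n in (50, 200, 800):
        count, prob, nH = typical_set_stats(n)
        print(f"n={n}: |F_n| = 2^{log2(count):.1f}, 2^(nH) = 2^{nH:.1f}, Pr[F_n] = {prob:.3f}")

As n grows, the exponent of the typical-set size approaches nH and the typical-set probability approaches one, which is the memoryless special case of Theorem 3.8.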

3.3 Variable-length lossless data compression codes that minimize the exponentially weighted codeword length

3.3.1 Criterion for optimality of codes

In the usual discussion of the source coding theorem, one chooses the criterion of minimizing the average codeword length. Implicit in the use of average code length as a performance criterion is the assumption that cost varies linearly with codeword length. This is not always the case. In some papers, another cost/penalty function of codeword length is introduced, in which the cost is an exponential function of the codeword length. Obviously, linear cost is a limiting case of the exponential cost function.

One exponential cost function that has been introduced is

L(t) , (1/t) log ( Σ_{x∈X} P_X(x) e^{t·ℓ(c_x)} ),

where t is a chosen positive constant, P_X is the distribution of the source X, c_x is the binary codeword for source symbol x, and ℓ(·) is the length of the binary codeword. The criterion for an optimal code now becomes: a code is said to be optimal if its cost L(t) is the smallest among all possible codes.

The physical meaning of the aforementioned cost function is roughly discussed below. When t → 0, L(0) = Σ_{x∈X} P_X(x) ℓ(c_x), which is the average codeword length. When t → ∞, L(∞) = max_{x∈X} ℓ(c_x), which is the maximum codeword length over all binary codewords. As you may have noticed, a longer codeword length has a larger weight in the sum defining L(t). In other words, minimizing L(t) is equivalent to minimizing Σ_{x∈X} P_X(x) e^{tℓ(c_x)}; hence, the weight of codeword c_x is e^{tℓ(c_x)}. For the minimization, events with smaller weight are preferable since they contribute less to the sum.

Therefore, under the minimization of L(t), codewords with long lengths become less likely. In practice, systems with long codewords usually introduce complexity in encoding and decoding, and hence are somewhat infeasible. Consequently, the new criterion is, to some extent, better suited to physical considerations.
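The two limiting cases can be checked numerically. The following sketch (our own illustration; the prefix code and its lengths are hypothetical) evaluates L(t) for several t and compares it with the average and maximum codeword lengths.

    import math

    def exponential_cost(probs, lengths, t):
        """L(t) = (1/t) * log( sum_x P(x) * exp(t * len(c_x)) ), natural log."""
        return math.log(sum(p * math.exp(t * l) for p, l in zip(probs, lengths))) / t

    # a hypothetical prefix code with codeword lengths 1, 2, 3, 3
    probs   = [0.5, 0.25, 0.125, 0.125]
    lengths = [1, 2, 3, 3]

    avg_len = sum(p * l for p, l in zip(probs, lengths))
    for t in (1e-4, 0.5, 2.0, 20.0):
        print(f"t = {t:7.4f}: L(t) = {exponential_cost(probs, lengths, t):.4f}")
    print(f"t -> 0 limit (average length) = {avg_len:.4f}; "
          f"t -> infinity limit (max length) = {max(lengths)}")

For very small t the cost is close to the average length 1.75, and for large t it approaches the maximum length 3, matching the discussion above.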

3.3.2 Source coding theorem for Renyi's entropy

Theorem 3.9 (source coding theorem for Renyi's entropy) The minimum cost L(t) attainable by uniquely decodable codes⁶ is the Renyi entropy of order 1/(1 + t), i.e.,

H(X; 1/(1 + t)) = ((1 + t)/t) log ( Σ_{x∈X} P_X^{1/(1+t)}(x) ).

Example 3.10 Given a source X, if we want to design an optimal lossless code with the smallest possible max_{x∈X} ℓ(c_x), the cost of the optimal code is H(X; 0) = log |X|.

3.4 Entropy of English

The compression of English text is not only a practical application but also an interesting research topic.

One of the main problems in the compression of English text is that (in principle) its statistical model is unclear. Therefore, its entropy rate cannot be immediately computed. To estimate the data compression bound of English text, various stochastic approximations to English have been proposed. One can then design a code, according to the estimated stochastic model, to compress English text. It is obvious that the better the stochastic approximation, the better the resulting compression.

⁶ For its definition, please refer to Section 4.3.1 of Volume I of the lecture notes.

Assumption 3.11 For data compression, we assume that English text contains only the 26 letters and the space symbol. In other words, an upper case letter is treated the same as its lower case counterpart, and special symbols, such as punctuation, are ignored.

3.4.1 Markov estimate of entropy rate of English text

According to the source coding theorem, the first thing that a data compression code designer shall do is to estimate the (δ-sup) entropy rate of the English text. One can start by modeling the English text as a Markov source, and compute the entropy rate according to the estimated Markov statistics.

Zero-order Markov approximation. It has been shown that the frequency of letters in English is far from uniform; for example, the most common letter, E, has Pempirical(E) ≈ 0.13, but the least common letters, Q and Z, have Pempirical(Q) ≈ Pempirical(Z) ≈ 0.001. Therefore, the zero-order Markov approximation apparently does not fit our need in estimating the entropy rate of English text.

1st-order Markov approximation. The frequency of pairs of letters is also far from uniform; the most common pair, TH, has frequency about 0.037. It is fun to know that Q is always followed by U.

A higher order approximation is possible. However, the database may be too large to handle. For example, a 2nd-order approximation requires 27³ = 19683 entries, and one may need millions of samples to make an accurate estimate of its probability.

Here are some examples of Markov approximations to English from Shannon's original paper. Note that the sequence of English letters is generated according to the approximated statistics.

1. Zero-order Markov approximation: The symbols are drawn independently with equiprobable distribution.

Example: XFOML RXKHRJEFJUJ ZLPWCFWKCYJEFJEYVKCQSGXYD QPAAMKBZAACIBZLHJQD


2. 1st-order Markov approximation: The symbols are drawn independently. Frequency of letters matches the 1st-order Markov approximation of English text.

Example: OCRO HLI RGWR NMIELWIS EU LLNBNESEBYA TH EEI ALHENHTTPA OOBTTVA NAH BRL

3. 2nd-order Markov approximation: Frequency of pairs of letters matches English text.

Example: ON IE ANTSOUTINYS ARE T INCTORE ST BES DEAMY ACHIN D ILONASIVE TUCOOWE ATTEASONARE FUSO TIZIN AN DY TOBE SEACE CTISBE

4. 3rd-order Markov approximation: Frequency of triples of letters matches English text.

Example: IN NO IST LAT WHEY CRATICT FROURE BERSGROCID PONDENOME OF DEMONSTURES OF THEREPTAGIN IS REGOACTIONA OF CRE

5. 4th-order Markov approximation: Frequency of quadruples of letters matches English text.

Example: THE GENERATED JOB PROVIDUAL BETTERTRAND THE DISPLAYED CODE, ABOVERY UPONDULTSWELL THE CODE RST IN THESTICAL IT DO HOCKBOTHE MERG.

Another way to simulate the randomness of English text is to use word-approximation.

1. 1st-order word approximation.

Example: REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE.

2. 2nd-order word approximation.

Example: THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED


From the above results, it is obvious that the approximations get closer and closer to resembling English as the approximation order increases.

Using the above models, we can then compute the empirical entropy rate of English text.

    order of the letter approximation model     entropy rate
    zero order                                  log2(27) = 4.76 bits per letter
    1st order                                   4.03 bits per letter
    4th order                                   2.8 bits per letter

One final remark on the Markov estimate of English statistics is that the results are not only useful in compression but also helpful in decryption.
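For concreteness, the sketch below (our own illustration, not from the notes) computes zero-order and first-order per-letter entropy estimates from a text sample, using only the 26 letters plus space as in Assumption 3.11; with a tiny sample the numbers badly underestimate English, and a large corpus is needed for estimates like those in the table.

    from collections import Counter
    from math import log2

    def entropy_estimates(text):
        """Zero-order H(X) and first-order H(X2|X1) estimates in bits per letter."""
        alphabet = set("abcdefghijklmnopqrstuvwxyz ")
        letters = [c for c in text.lower() if c in alphabet]
        n = len(letters)
        uni = Counter(letters)                                  # letter counts
        h0 = -sum((c / n) * log2(c / n) for c in uni.values())
        pairs = Counter(zip(letters, letters[1:]))              # adjacent-pair counts
        first = Counter(letters[:-1])                           # counts as first of a pair
        m = n - 1
        h1 = -sum((c / m) * log2(c / first[a]) for (a, b), c in pairs.items())
        return h0, h1

    sample = ("the better the stochastic approximation the better the resulting "
              "compression of english text")
    h0, h1 = entropy_estimates(sample)
    print(f"zero-order estimate: {h0:.2f} bits/letter, first-order estimate: {h1:.2f} bits/letter")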

3.4.2 Gambling estimate of entropy rate of English text

In this section, we will show that a good gambler is also a good data compressor!

A) Sequential gambling

Given an observed sequence of letters,

x1, x2, . . . , xk,

a gambler needs to bet on the next letter Xk+1 (which is now a random variable) with all the money in hand. It is not necessary for him to put all the money on the same outcome (there are 27 of them, i.e., 26 letters plus "space"). For example, he is allowed to place part of his money on one possible outcome and the rest of the money on another. The only constraint is that he should bet all his money.

Let b(xk+1|x1, . . . , xk) be the proportion of his money bet on the letter xk+1, and assume that the gambler's initial wealth is 1. Then

Σ_{xk+1∈X} b(xk+1|x1, . . . , xk) = 1,

and

(∀ xk+1 ∈ {a, b, . . . , z, SPACE})  b(xk+1|x1, . . . , xk) ≥ 0.

When the next letter appears, the gambler is paid 27 times the amount bet on that letter.


Let Sn be the wealth of the gambler after n bets. Then

S1 = 27 · b1(x1)
S2 = 27 · [b2(x2|x1) × S1]
S3 = 27 · [b3(x3|x1, x2) × S2]
...
Sn = 27^n · Π_{k=1}^{n} bk(xk|x1, . . . , xk−1) = 27^n · b(x1, . . . , xn),

where

b(x^n) = b(x1, . . . , xn) = Π_{k=1}^{n} bk(xk|x1, . . . , xk−1).

We now wish to show that a high value of Sn leads to high data compression. Specifically, if a gambler with some gambling policy attains wealth Sn, then up to log2 Sn bits can be saved in compressing the data.

Lemma 3.12 If a proportional gambling policy results in wealth with expectation E[log2 Sn], there exists a data compression code for the English-text source X^n which yields average codeword length smaller than

log2(27) − (1/n) E[log2 Sn] + 2/n bits,

where X^n represents the random variables of the n bet outcomes.

Proof:

1. Ordering: Index the English letters as

index(a) = 0, index(b) = 1, index(c) = 2, . . . , index(z) = 25, index(SPACE) = 26.

For any two sequences x^n and x̂^n in X^n, X = {a, b, . . . , z, SPACE}, we say x^n ≥ x̂^n if

Σ_{i=1}^{n} index(xi) × 27^{i−1} ≥ Σ_{i=1}^{n} index(x̂i) × 27^{i−1}.


2. Shannon-Fano-Elias coder: Apply the Shannon-Fano-Elias coder (cf. Section 4.3.3 of Volume I of the lecture notes) to the gambling policy b(x^n) according to the ordering defined in step 1. Then the codeword length for x^n is

( ⌈−log2 b(x^n)⌉ + 1 ) bits.

3. Data compression: Now observe that

E[log2 Sn] = E[log2(27^n b(X^n))] = n log2(27) + E[log2 b(X^n)].

Hence, the average codeword length ℓ̄ is upper bounded by

ℓ̄ ≤ (1/n) ( −Σ_{x^n∈X^n} P_{X^n}(x^n) log2 b(x^n) + 2 )
  = −(1/n) E[log2 b(X^n)] + 2/n
  = log2(27) − (1/n) E[log2 Sn] + 2/n.  □

According to the concept behind the source coding theorem, the entropy rate of English text should be upper bounded by the average codeword length of any (variable-length) code, which in turn is bounded above by

log2(27) − (1/n) E[log2 Sn] + 2/n.

Equipped with the proportional gambling model, one can bound the entropy rate of English text by a properly designed gambling policy. An experiment using the book Jefferson the Virginian by Dumas Malone as the database resulted in an estimate of 1.34 bits per letter for the entropy rate of English.
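The bound is easy to reproduce in miniature. The sketch below (our own toy illustration) uses an add-one-smoothed letter model learned on the fly as the gambling policy — an assumption of ours, not the policy used in the cited experiment — and reports the estimate log2(27) − (1/n) log2 Sn.

    from collections import Counter
    from math import log2

    ALPHABET = "abcdefghijklmnopqrstuvwxyz "

    def gambling_entropy_estimate(text):
        """Sequential proportional gambler betting by add-one-smoothed letter counts;
        returns log2(27) - (1/n) log2 Sn, an upper estimate of the entropy rate."""
        letters = [c for c in text.lower() if c in ALPHABET]
        counts = Counter()                 # letters observed so far
        log_wealth = 0.0                   # accumulate log2 Sn to avoid overflow
        for c in letters:
            total = sum(counts.values()) + len(ALPHABET)   # add-one smoothing
            bet = (counts[c] + 1) / total                  # proportion bet on c
            log_wealth += log2(27 * bet)                   # payoff is 27 * bet
            counts[c] += 1
        return log2(27) - log_wealth / len(letters)

    sample = "a good gambler is also a good data compressor " * 20
    print(f"entropy-rate estimate: {gambling_entropy_estimate(sample):.2f} bits/letter")

Highly repetitive text drives the gambler's wealth up and the estimate down, exactly the relationship Lemma 3.12 exploits.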

3.5 Lempel-Ziv code revisited

In Section 4.3.4 of Volume I of the lecture notes, we introduced the famous Lempel-Ziv coder and stated that the coder is universally good for stationary sources. In this section, we will establish the concept behind it.

For simplicity, we assume that the source alphabet is binary, i.e., X = {0, 1}. The optimality of the Lempel-Ziv code can actually be extended to any stationary source with finite alphabet.


Lemma 3.13 The number c(n) of distinct strings in the Lempel-Ziv parsing of a binary sequence of length n satisfies

√(2n) − 1 ≤ c(n) ≤ 2n / log2 n,

where the upper bound holds for n ≥ 2^13, and the lower bound is valid for every n.

Proof: The upper bound can be proved as follows.

For fixed n, the number of distinct strings is maximized when all the phrases are as short as possible. Hence, in the extreme case,

n = 1 + 2 + 2 + 3 + 3 + 3 + 3 + · · ·   (c(n) terms),

which implies that

Σ_{j=1}^{k+1} j 2^j ≥ n ≥ Σ_{j=1}^{k} j 2^j,    (3.5.1)

where k is the integer satisfying

2^{k+1} − 1 > c(n) ≥ 2^k − 1.    (3.5.2)

Observe that

Σ_{j=1}^{k+1} j 2^j = k 2^{k+2} + 2   and   Σ_{j=1}^{k} j 2^j = (k − 1) 2^{k+1} + 2.

Now from (3.5.1), we obtain for k ≥ 7 (which will be justified later by n ≥ 2^13),

n ≤ k 2^{k+2} + 2 < 2^{2(k−1)}   and   n ≥ (k − 1) 2^{k+1} + 2 ≥ (k − 1) 2^{k+1}.

The proof of the upper bound is then completed by noting that the maximum c(n) for fixed n satisfies

c(n) < 2^{k+1} − 1 ≤ 2^{k+1} ≤ n/(k − 1) ≤ 2n / log2(n).

As for the lower bound, we note that the number of distinct strings is minimized when all the phrases are as long as possible. Hence, in the extreme case,

n = 1 + 2 + 3 + 4 + · · ·   (c(n) terms) ≤ Σ_{j=1}^{c(n)} j = c(n)[c(n) + 1]/2 ≤ [c(n) + 1]²/2,

which implies that

c(n) ≥ √(2n) − 1.

Note that when n ≥ 2^13, c(n) ≥ 2^7 − 1, which implies that the assumption of k ≥ 7 in (3.5.2) is valid for n ≥ 2^13.  □

The condition n ≥ 2^13 corresponds to compressing a binary file of size larger than 1K bytes, which, in practice, is a frequently encountered situation. Since what concerns us is the asymptotic optimality of the coder as n goes to infinity, the restriction n ≥ 2^13 is insignificant in this regard.
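The bounds of Lemma 3.13 can be checked experimentally. The sketch below (our own illustration) performs an incremental LZ78-style parsing of a random binary string — each new phrase is the shortest prefix of the remaining input not seen before — and compares the phrase count with the two bounds.

    import random
    from math import log2, sqrt

    def lz_parse_count(bits):
        """Number of phrases in the incremental (LZ78-style) parsing of a binary string."""
        phrases = set()
        count, current = 0, ""
        for b in bits:
            current += b
            if current not in phrases:
                phrases.add(current)
                count += 1
                current = ""
        return count + (1 if current else 0)    # count a trailing partial phrase

    random.seed(1)
    n = 1 << 14                                  # n >= 2^13, so both bounds apply
    bits = "".join(random.choice("01") for _ in range(n))
    c = lz_parse_count(bits)
    print(f"c(n) = {c}, lower bound = {sqrt(2 * n) - 1:.0f}, upper bound = {2 * n / log2(n):.0f}")

For a random string the count sits well inside the interval; highly structured strings push it toward the lower bound.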

Lemma 3.14 (entropy upper bound as a function of the mean) A non-negative integer-valued source X with mean µ and entropy H(X) satisfies

H(X) ≤ (µ + 1) log2(µ + 1) − µ log2 µ.

Proof: The lemma follows directly from the fact that the geometric distribution maximizes the entropy of a non-negative integer-valued source with given mean, which is proved as follows.

For the geometric distribution with mean µ,

P_Z(z) = (1/(1 + µ)) (µ/(1 + µ))^z,  for z = 0, 1, 2, . . . ,

its entropy is

H(Z) = Σ_{z=0}^{∞} −P_Z(z) log2 P_Z(z)
     = Σ_{z=0}^{∞} P_Z(z) [ log2(1 + µ) + z · log2((1 + µ)/µ) ]
     = log2(1 + µ) + µ log2((1 + µ)/µ)
     = Σ_{z=0}^{∞} P_X(z) [ log2(1 + µ) + z · log2((1 + µ)/µ) ],

where the last equality holds for any non-negative integer-valued source X with mean µ. So,

H(X) − H(Z) = Σ_{x=0}^{∞} P_X(x) [ −log2 P_X(x) + log2 P_Z(x) ]
            = Σ_{x=0}^{∞} P_X(x) log2 ( P_Z(x)/P_X(x) )
            = −D(P_X‖P_Z) ≤ 0,

with equality if, and only if, X ≡ Z.  □

Before introducing the main theorems, we set up some notation used in their proofs.

Given the source sequence

x−(k−1), . . . , x−1, x0, x1, . . . , xn,

suppose x1, . . . , xn is Lempel-Ziv-parsed into c distinct strings y1, . . . , yc. Let νi be the location of the first bit of yi, i.e.,

yi , (xνi , . . . , xνi+1−1).

1. Define

si , (xνi−k, . . . , xνi−1)

as the k bits preceding yi.

2. Define cℓ,s as the number of strings in {y1, . . . , yc} with length ℓ and preceding state s.

3. Define Qk as the k-th order Markov approximation of the stationary source X, i.e.,

Qk(x1, . . . , xn | x0, . . . , x−(k−1)) , P_{X_1^n | X_{−(k−1)}^0}(x1, . . . , xn | x0, . . . , x−(k−1)),

where P_{X_1^n | X_{−(k−1)}^0} is the true (stationary) distribution of the source.

For ease of understanding, these notations are graphically illustrated in Figure 3.3. It is easy to verify that

Σ_{ℓ=1}^{n} Σ_{s∈X^k} cℓ,s = c,   and   Σ_{ℓ=1}^{n} Σ_{s∈X^k} ℓ × cℓ,s = n.    (3.5.3)

Lemma 3.15 For any Lempel-Ziv parsing of the source sequence x1 . . . xn, we have

log2 Qk(x1, . . . , xn|s) ≤ Σ_{ℓ=1}^{n} Σ_{s∈X^k} −cℓ,s log2 cℓ,s.    (3.5.4)


Figure 3.3: Notations used in the Lempel-Ziv coder: the parsed phrases y1, y2, . . . , yc of x1, . . . , xn, where each phrase yi is preceded by its k-bit state si = (xνi−k, . . . , xνi−1).

Proof: By the k-th order Markov property of Qk,

log2 Qk(x1, . . . , xn|s)
 = log2 Qk(y1, . . . , yc|s)
 = log2 ( Π_{i=1}^{c} P_{X_1^{νi+1−νi} | X_{−(k−1)}^0}(yi|si) )
 = Σ_{i=1}^{c} log2 P_{X_1^{νi+1−νi} | X_{−(k−1)}^0}(yi|si)
 = Σ_{ℓ=1}^{n} Σ_{s∈X^k} Σ_{i : νi+1−νi=ℓ, si=s} log2 P_{X_1^ℓ | X_{−(k−1)}^0}(yi|si)
 = Σ_{ℓ=1}^{n} Σ_{s∈X^k} cℓ,s [ (1/cℓ,s) Σ_{i : νi+1−νi=ℓ, si=s} log2 P_{X_1^ℓ | X_{−(k−1)}^0}(yi|si) ]
 ≤ Σ_{ℓ=1}^{n} Σ_{s∈X^k} cℓ,s log2 [ Σ_{i : νi+1−νi=ℓ, si=s} (1/cℓ,s) P_{X_1^ℓ | X_{−(k−1)}^0}(yi|si) ]    (3.5.5)
 ≤ Σ_{ℓ=1}^{n} Σ_{s∈X^k} cℓ,s log2 (1/cℓ,s),    (3.5.6)

where (3.5.5) follows from Jensen's inequality and the concavity of log2(·), and (3.5.6) holds since a probability sum is no greater than one.  □

Theorem 3.16 Fix a stationary source X. Given any observation sequence x1, x2, x3, . . .,

lim sup_{n→∞} (c log2 c)/n ≤ H(Xk+1|Xk, . . . , X1)

for any integer k, where c = c(n) is the number of distinct Lempel-Ziv parsed strings of x1, x2, . . . , xn.


Proof: Lemma 3.15 gives

log2 Qk(x1, . . . , xn|s) ≤ Σ_{ℓ=1}^{n} Σ_{s∈X^k} −cℓ,s log2 cℓ,s
 = Σ_{ℓ=1}^{n} Σ_{s∈X^k} −cℓ,s log2(cℓ,s/c) + Σ_{ℓ=1}^{n} Σ_{s∈X^k} −cℓ,s log2 c
 = −c Σ_{ℓ=1}^{n} Σ_{s∈X^k} (cℓ,s/c) log2(cℓ,s/c) − c log2 c.    (3.5.7)

Denote by L and S the random variables with joint distribution

P_{L,S}(ℓ, s) = cℓ,s/c,

for which the sum-to-one property is justified by (3.5.3). Also from (3.5.3), we have

E[L] = Σ_{ℓ=1}^{n} Σ_{s∈X^k} ℓ × (cℓ,s/c) = n/c.

Therefore, by the independence bound on entropy (cf. Theorem 2.19 in Volume I of the lecture notes) and Lemma 3.14, we get

H(L, S) ≤ H(L) + H(S)
 ≤ (E[L] + 1) log2(E[L] + 1) − E[L] log2 E[L] + log2 |X|^k
 = [ (n/c + 1) log2(n/c + 1) − (n/c) log2(n/c) ] + k
 = [ log2(n/c + 1) + (n/c) log2((n/c + 1)/(n/c)) ] + k
 = log2(n/c + 1) + (n/c) log2(1 + c/n) + k,

which, together with Lemma 3.13, implies that for n ≥ 2^13,

(c/n) H(L, S) ≤ (c/n) log2(n/c + 1) + log2(1 + c/n) + (c/n) k
 ≤ (2/log2 n) log2( n/(√(2n) − 1) + 1 ) + log2(1 + 2/log2 n) + (2/log2 n) k.

Finally, we can rewrite (3.5.7) as

(c log2 c)/n ≤ −(1/n) log2 Qk(x1, . . . , xn|s) + (c/n) H(L, S).

As a consequence, by taking the expectation with respect to X^n_{−(k−1)} on both sides of the above inequality, we obtain

lim sup_{n→∞} (c log2 c)/n ≤ lim sup_{n→∞} (1/n) E[ −log2 Qk(X1, . . . , Xn|X−(k−1), . . . , X0) ]
 = H(Xk+1|Xk, . . . , X1).  □

Theorem 3.17 (main result) Let ℓ(x1, . . . , xn) be the Lempel-Ziv codeword length of an observed sequence x1, . . . , xn drawn from a stationary source X. Then

lim sup_{n→∞} (1/n) ℓ(x1, . . . , xn) ≤ lim_{n→∞} (1/n) H(X^n).

Proof: Let c(n) be the number of parsed distinct strings. Then

(1/n) ℓ(x1, . . . , xn) = (1/n) c(n) (⌈log2 c(n)⌉ + 1)
 ≤ (1/n) c(n) (log2 c(n) + 2)
 = (c(n) log2 c(n))/n + 2 c(n)/n.

From Lemma 3.13, we have

lim sup_{n→∞} c(n)/n ≤ lim sup_{n→∞} 2/log2 n = 0.

From Theorem 3.16, we have for any integer k,

lim sup_{n→∞} (c(n) log2 c(n))/n ≤ H(Xk+1|Xk, . . . , X1).

Hence,

lim sup_{n→∞} (1/n) ℓ(x1, . . . , xn) ≤ H(Xk+1|Xk, . . . , X1)

for any integer k. The theorem follows by applying Theorem 4.10 in Volume I of the lecture notes, which states that for a stationary source X, the entropy rate always exists and equals

lim_{n→∞} (1/n) H(X^n) = lim_{k→∞} H(Xk+1|Xk, . . . , X1).  □

We summarize the discussion of this section in the next corollary.


Corollary 3.18 The Lempel-Ziv coder asymptotically achieves the entropy rate of any (unknown) stationary source.

The Lempel-Ziv code is often used in practice to compress data which cannot be characterized by a simple statistical model, such as English text or computer source code. It is simple to implement, and has an asymptotic rate approaching the entropy rate (if it exists) of the source, which is known to be the lower bound on the lossless data compression code rate. This code can be used without knowledge of the source distribution, provided the source is stationary. Some well-known examples of its implementation are the compress program in UNIX and the arc program in DOS, which typically compress ASCII text files by about a factor of 2.
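The factor-of-two behavior is easy to observe with a modern relative of these coders. The sketch below (our own illustration) uses Python's zlib, whose DEFLATE format combines LZ77-style parsing with Huffman coding rather than the exact scheme analyzed above, to compare redundant text against incompressible random bytes.

    import os
    import zlib

    def compression_ratio(data: bytes) -> float:
        """Ratio of original size to compressed size using zlib (DEFLATE)."""
        return len(data) / len(zlib.compress(data, 9))

    text = (b"english text can be characterized by a markov source model "
            b"and compressed accordingly. " * 50)
    random_bytes = os.urandom(len(text))

    print(f"redundant text: ratio ~ {compression_ratio(text):.1f}")
    print(f"random bytes  : ratio ~ {compression_ratio(random_bytes):.2f}")

Redundant text compresses by a large factor, while random data (already at its entropy rate of 8 bits per byte) does not compress at all, in line with the universality discussion above.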


Bibliography

[1] F. Alajaji and T. Fuja, “A communication channel modeled on contagion,” IEEE Trans. on Information Theory, vol. IT–40, no. 6, pp. 2035–2041, Nov. 1994.

[2] P.-N. Chen and F. Alajaji, “Generalized source coding theorems and hypothesis testing,” Journal of the Chinese Institute of Engineering, vol. 21, no. 3, pp. 283–303, May 1998.

[3] T. S. Han, Information-Spectrum Methods in Information Theory (in Japanese), Baifukan Press, Tokyo, 1998.

[4] T. S. Han and S. Verdu, “Approximation theory of output statistics,” IEEE Trans. on Information Theory, vol. IT–39, no. 3, pp. 752–772, May 1993.

[5] Y. Steinberg and S. Verdu, “Simulation of random processes and rate-distortion theory,” IEEE Trans. on Information Theory, vol. IT–42, no. 1, pp. 63–86, Jan. 1996.


Chapter 4

Measure of Randomness for Stochastic Processes

In the previous chapter, it was shown that the sup-entropy rate is indeed the minimum lossless data compression rate achievable by block codes. Hence, finding an optimal block code becomes a well-defined mission, since for any source with a well-formulated statistical model, the sup-entropy rate can be computed and used as a criterion to evaluate the optimality of the designed block code.

In the work of Verdu and Han [2], it was found that, other than being the minimum lossless data compression rate, the sup-entropy rate actually has another operational meaning, called resolvability. In this chapter, we will explore this concept in detail.

4.1 Motivation for resolvability: measure of randomness of random variables

In simulations of statistical communication systems, the generation of random variables by a computer algorithm is essential. The computer usually has access to a basic random experiment (through a pre-defined Application Programming Interface) which generates equally likely random values, such as rand( ), which generates a real number uniformly distributed over (0, 1). Conceptually, random variables with complex models are more difficult to generate by computer than random variables with simple models. The question is how to quantify the “complexity” of generating random variables by computer. One way to define such a complexity measure is:

Definition 4.1 The complexity of generating a random variable is defined as the number of random bits that the most efficient algorithm requires in order to generate the random variable by a computer that has access to an equally likely random experiment.

To understand the above definition quantitatively, a simple example is demonstrated below.

Example 4.2 Consider the generation of the random variable with probability masses PX(−1) = 1/4, PX(0) = 1/2, and PX(1) = 1/4. An algorithm is written as:

    Flip-a-fair-coin;              \\ one random bit
    If "Head", then output 0;
    else
        Flip-a-fair-coin;          \\ one random bit
        If "Head", then output −1;
        else output 1;

On the average, the above algorithm requires 1.5 coin flips, and in the worst case, 2 coin flips are necessary. Therefore, the complexity measure can take two fundamental forms: worst-case or average-case over the range of outcomes of the random variable. Note that we did not show in the above example that the algorithm is the most efficient one in the sense of using the minimum number of random bits; however, it is indeed an optimal algorithm because it achieves the lower bound on the minimum number of random bits. Later, we will show that this bound on the average minimum number of random bits required for generating a random variable is the entropy, which is exactly 1.5 bits in the above example. As for the worst-case bound, a new terminology, resolution, will be introduced. As a result, the above algorithm also achieves the lower bound of the worst-case complexity, which is the resolution of the random variable.
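The average of 1.5 flips can be confirmed by simulation. The sketch below (our own illustration) implements the same coin-flip algorithm and estimates the average number of flips and the output distribution.

    import random

    def generate():
        """Coin-flip algorithm of Example 4.2: returns (value, number_of_flips)."""
        if random.random() < 0.5:          # first fair coin flip: "Head"
            return 0, 1
        if random.random() < 0.5:          # second fair coin flip: "Head"
            return -1, 2
        return 1, 2

    random.seed(0)
    trials = 100_000
    samples = [generate() for _ in range(trials)]
    avg_flips = sum(flips for _, flips in samples) / trials
    freq = {v: sum(1 for x, _ in samples if x == v) / trials for v in (-1, 0, 1)}
    print(f"average flips ~ {avg_flips:.3f} (entropy H(X) = 1.5 bits)")
    print(f"empirical distribution ~ {freq}")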

4.2 Notations and definitions regarding resolvability

Definition 4.3 (M-type) For any positive integer M, a probability distribution P is said to be M-type if

P(ω) ∈ { 0, 1/M, 2/M, . . . , 1 }

for all ω ∈ Ω.


Definition 4.4 (resolution of a random variable) The resolution¹ R(X) of a random variable X is the minimum log(M) such that PX is M-type. If PX is not M-type for any integer M, then R(X) = ∞.

As revealed previously, a random source needs to be resolved (meaning that it can be generated by a computer algorithm with access to equi-probable random experiments). As anticipated, a random variable with finite resolution is resolvable by computer algorithms. Yet, it is possible that the resolution of a random variable is infinite. A quick example is the random variable X with distribution PX(0) = 1/π and PX(1) = 1 − 1/π. (X is not M-type for any finite M.) In such a case, one can alternatively choose another computer-resolvable random variable which resembles the true source within some acceptable range, and use it to simulate the original one.

One criterion that can be used as a measure of resemblance of two random variables is the variational distance. For the same example as in the above paragraph, choose a random variable X̃ with distribution P_X̃(0) = 1/3 and P_X̃(1) = 2/3. Then ‖X − X̃‖ ≈ 0.03, and X̃ is 3-type, hence computer-resolvable².

Definition 4.5 (variational distance) The variational distance (or ℓ1 distance) between two distributions P and Q defined on a common measurable space (Ω, F) is

‖P − Q‖ , Σ_{ω∈Ω} |P(ω) − Q(ω)|.

¹ If the base of the logarithm is 2, the resolution is measured in bits; if the natural logarithm is taken, nats become the basic unit of resolution.

² A program that generates an M-type random variable for any M with log2(M) a positive integer is straightforward. A program that generates the 3-type X̃ is as follows (in C-like pseudocode):

    even = False;
    while (1)
        Flip-a-fair-coin;                          \\ one random bit
        if (Head)
            if (even == True) output 0; even = False;
            else output 1; even = True;
        else
            if (even == True) even = False;
            else even = True;


(Note that an alternative way to formulate the variational distance is

‖P − Q‖ = 2 · sup_{E∈F} |P(E) − Q(E)| = 2 Σ_{x∈X : P(x)≥Q(x)} [P(x) − Q(x)].

These two definitions are actually equivalent.)
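The ≈ 0.03 figure quoted above is easy to verify. The sketch below (our own illustration) computes the variational distance between the 1/π source and its 3-type approximation, using both formulations; for a binary alphabet the maximizing event in the second formula is a single outcome.

    import math

    def variational_distance(p, q):
        """l1 distance between two distributions given as dicts outcome -> probability."""
        support = set(p) | set(q)
        return sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

    p_true = {0: 1 / math.pi, 1: 1 - 1 / math.pi}   # not M-type for any finite M
    p_tilde = {0: 1 / 3, 1: 2 / 3}                  # 3-type approximation

    d = variational_distance(p_true, p_tilde)
    d_alt = 2 * max(abs(p_true[x] - p_tilde[x]) for x in p_true)  # 2 * sup_E |P(E)-Q(E)|
    print(f"||X - X~|| = {d:.4f} (alternative formula gives {d_alt:.4f})")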

Definition 4.6 (ε-achievable resolution) Fix ε ≥ 0. R is an ε-achievable resolution for input X if for all γ > 0, there exists X̃ satisfying

R(X̃) < R + γ   and   ‖X − X̃‖ < ε.

ε-achievable resolution reveals the possibility that one can choose another computer-resolvable random variable whose variational distance to the true source is within an acceptable range ε.

Next we define the ε-achievable resolution rate for a sequence of random variables, which is an extension of the ε-achievable resolution defined for a single random variable. Such an extension is analogous to extending the entropy of a single source to the entropy rate of a random source sequence.

Definition 4.7 (ε-achievable resolution rate) Fix ε ≥ 0 and input X. R is an ε-achievable resolution rate³ for input X if for every γ > 0, there exists X̃ satisfying

(1/n) R(X̃^n) < R + γ   and   ‖X^n − X̃^n‖ < ε,

for all sufficiently large n.

Definition 4.8 (ε-resolvability for X) Fix ε > 0. The ε-resolvability for input X, denoted by Sε(X), is the minimum ε-achievable resolution rate for the same input, i.e.,

Sε(X) , min{ R : (∀ γ > 0)(∃ X̃ and N)(∀ n > N) (1/n) R(X̃^n) < R + γ and ‖X^n − X̃^n‖ < ε }.

³ Note that our definition of resolution rate is different from its original form (cf. Definition 7 in [2] and the statements following Definition 7 of the same paper for its modified definition for a specific input X), which involves an arbitrary channel model W. Readers may treat our definition as a special case of theirs over the identity channel.


Here, we define Sε(X) using the “minimum” instead of the more general “infimum” operation simply because Sε(X) indeed belongs to the range of the minimization, i.e.,

Sε(X) ∈ { R : (∀ γ > 0)(∃ X̃ and N)(∀ n > N) (1/n) R(X̃^n) < R + γ and ‖X^n − X̃^n‖ < ε }.

A similar convention will be applied throughout the rest of this chapter.

Definition 4.9 (resolvability for X) The resolvability for input X, denoted by S(X), is

S(X) , lim_{ε↓0} Sε(X).

Since the ε-resolvability is obviously non-increasing in ε, the resolvability can also be defined using the supremum operation as

S(X) , sup_{ε>0} Sε(X).

The resolvability is pertinent to the worst-case complexity measure for random variables (cf. Example 4.2 and the discussion following it). Using the entropy function, information theorists also define the ε-mean-resolvability and mean-resolvability for input X, which characterize the average-case complexity of random variables.

Definition 4.10 (ε-mean-achievable resolution rate) Fix ε ≥ 0. R is an ε-mean-achievable resolution rate for input X if for all γ > 0, there exists X̃ satisfying

(1/n) H(X̃^n) < R + γ   and   ‖X^n − X̃^n‖ < ε,

for all sufficiently large n.

Definition 4.11 (ε-mean-resolvability for X) Fix ε > 0. The ε-mean-resolvability for input X, denoted by S̄ε(X), is the minimum ε-mean-achievable resolution rate for the same input, i.e.,

S̄ε(X) , min{ R : (∀ γ > 0)(∃ X̃ and N)(∀ n > N) (1/n) H(X̃^n) < R + γ and ‖X^n − X̃^n‖ < ε }.

Definition 4.12 (mean-resolvability for X) The mean-resolvability for input X, denoted by S̄(X), is

S̄(X) , lim_{ε↓0} S̄ε(X) = sup_{ε>0} S̄ε(X).

The only difference between resolvability and mean-resolvability is that the former employs the resolution function, while the latter replaces it by the entropy function. Since entropy is the minimum average codeword length of uniquely decodable codes, one explanation of mean-resolvability is that the new random variable X̃ can be resolved by realizing the optimal variable-length code for it. You can think of the probability mass of each outcome of X̃ as approximately 2^{−ℓ}, where ℓ is the codeword length of the optimal lossless variable-length code for X̃. Such a probability mass can be generated by flipping fair coins ℓ times, and the average number of fair coin flips spent on this outcome is then ℓ × 2^{−ℓ}. As you may expect, the mean-resolvability will be shown to be the average complexity of a random variable.

4.3 Operational meanings of resolvability and mean-resolvability

The operational meanings of the resolution and the entropy (a new operational meaning for entropy other than the one from the source coding theorem) follow from the next theorem.

Theorem 4.13 For a single random variable X,

1. the worst-case complexity is lower-bounded by its resolution R(X) [2];

2. the average-case complexity is lower-bounded by its entropy H(X), and is upper-bounded by the entropy H(X) plus 2 bits [3].

Next, we reveal the operational meanings of resolvability and mean-resolvability in source coding. We begin with some lemmas that are useful in characterizing the resolvability.

Lemma 4.14 (bound on variational distance) For every µ > 0,

‖P − Q‖ ≤ 2µ + 2 · P[ x ∈ X : log (P(x)/Q(x)) > µ ].


Proof:

‖P − Q‖ = 2 Σ_{x∈X : P(x)≥Q(x)} [P(x) − Q(x)]
 = 2 Σ_{x∈X : log[P(x)/Q(x)]≥0} [P(x) − Q(x)]
 = 2 ( Σ_{x∈X : log[P(x)/Q(x)]>µ} [P(x) − Q(x)] + Σ_{x∈X : µ≥log[P(x)/Q(x)]≥0} [P(x) − Q(x)] )
 ≤ 2 ( Σ_{x∈X : log[P(x)/Q(x)]>µ} P(x) + Σ_{x∈X : µ≥log[P(x)/Q(x)]≥0} P(x) [1 − Q(x)/P(x)] )
 ≤ 2 ( P[ x ∈ X : log(P(x)/Q(x)) > µ ] + Σ_{x∈X : µ≥log[P(x)/Q(x)]≥0} P(x) log(P(x)/Q(x)) )   (by the fundamental inequality)
 ≤ 2 ( P[ x ∈ X : log(P(x)/Q(x)) > µ ] + Σ_{x∈X : µ≥log[P(x)/Q(x)]≥0} P(x) · µ )
 = 2 ( P[ x ∈ X : log(P(x)/Q(x)) > µ ] + µ · P[ x ∈ X : µ ≥ log(P(x)/Q(x)) ≥ 0 ] )
 ≤ 2 ( P[ x ∈ X : log(P(x)/Q(x)) > µ ] + µ ).  □

Lemma 4.15

P_{X^n}{ x^n ∈ X^n : −(1/n) log P_{X^n}(x^n) ≤ (1/n) R(X^n) } = 1,

for every n.

Proof: By definition of R(X^n),

P_{X^n}(x^n) ≥ exp{−R(X^n)}

for all x^n ∈ X^n with P_{X^n}(x^n) > 0. Hence, for all such x^n,

−(1/n) log P_{X^n}(x^n) ≤ (1/n) R(X^n).

The lemma then holds.  □

Theorem 4.16 The resolvability for input X is equal to its sup-entropy rate, i.e.,

S(X) = H̄(X).

Proof:

1. S(X) ≥ H̄(X).

It suffices to show that S(X) < H̄(X) contradicts Lemma 4.15.

Suppose S(X) < H̄(X). Then there exists δ > 0 such that

S(X) + δ < H̄(X).

Let

D0 , { x^n ∈ X^n : −(1/n) log P_{X^n}(x^n) ≥ S(X) + δ }.

By definition of H̄(X),

lim sup_{n→∞} P_{X^n}(D0) > 0.

Therefore, there exists α > 0 such that

lim sup_{n→∞} P_{X^n}(D0) > α,

which immediately implies

P_{X^n}(D0) > α   infinitely often in n.

Select 0 < ε < min{α², 1} and observe that Sε(X) ≤ S(X); we can then choose X̃^n to satisfy

(1/n) R(X̃^n) < S(X) + δ/2   and   ‖X^n − X̃^n‖ < ε

for sufficiently large n. Define

D1 , { x^n ∈ X^n : P_{X^n}(x^n) > 0 and |P_{X^n}(x^n) − P_{X̃^n}(x^n)| ≤ √ε · P_{X^n}(x^n) }.

Then

P_{X^n}(D1^c) = P_{X^n}{ x^n ∈ X^n : P_{X^n}(x^n) = 0 or |P_{X^n}(x^n) − P_{X̃^n}(x^n)| > √ε · P_{X^n}(x^n) }
 ≤ P_{X^n}{ x^n : P_{X^n}(x^n) = 0 } + P_{X^n}{ x^n : |P_{X^n}(x^n) − P_{X̃^n}(x^n)| > √ε · P_{X^n}(x^n) }
 = P_{X^n}{ x^n : |P_{X^n}(x^n) − P_{X̃^n}(x^n)| > √ε · P_{X^n}(x^n) }
 = Σ_{x^n ∈ X^n : P_{X^n}(x^n) < (1/√ε)|P_{X^n}(x^n) − P_{X̃^n}(x^n)|} P_{X^n}(x^n)
 ≤ Σ_{x^n ∈ X^n} (1/√ε) |P_{X^n}(x^n) − P_{X̃^n}(x^n)|
 ≤ ε/√ε = √ε.

Consider that

P_{X^n}(D1 ∩ D0) ≥ P_{X^n}(D0) − P_{X^n}(D1^c) ≥ α − √ε > 0,    (4.3.1)

which holds infinitely often in n; and every x^n_0 in D1 ∩ D0 satisfies

P_{X̃^n}(x^n_0) ≥ (1 − √ε) P_{X^n}(x^n_0)

and

−(1/n) log P_{X̃^n}(x^n_0) ≥ −(1/n) log P_{X^n}(x^n_0) + (1/n) log(1/(1 + √ε))
 ≥ (S(X) + δ) + (1/n) log(1/(1 + √ε))
 ≥ S(X) + δ/2,

for n > (2/δ) log(1 + √ε). Therefore, for those n for which (4.3.1) holds,

P_{X̃^n}{ x^n ∈ X^n : −(1/n) log P_{X̃^n}(x^n) > (1/n) R(X̃^n) }
 ≥ P_{X̃^n}{ x^n ∈ X^n : −(1/n) log P_{X̃^n}(x^n) > S(X) + δ/2 }
 ≥ P_{X̃^n}(D1 ∩ D0)
 ≥ (1 − ε^{1/2}) P_{X^n}(D1 ∩ D0)
 > 0,

which contradicts Lemma 4.15 (applied to X̃^n).

2. S(X) ≤ H̄(X).

It suffices to show, for arbitrary γ > 0, the existence of X̃ such that

lim_{n→∞} ‖X^n − X̃^n‖ = 0

and X̃^n has an M-type distribution with

M = ⌊ exp{n(H̄(X) + γ)} ⌋.

Let X̃^n be uniformly distributed over a set

G , { Uj ∈ X^n : j = 1, . . . , M }

drawn randomly (independently) according to P_{X^n}. Define

D , { x^n ∈ X^n : −(1/n) log P_{X^n}(x^n) > H̄(X) + γ + µ/n }.

For each chosen G, we obtain from Lemma 4.14 that

‖X^n − X̃^n‖ ≤ 2µ + 2 · P_{X̃^n}( x^n ∈ X^n : log (P_{X̃^n}(x^n)/P_{X^n}(x^n)) > µ )
 = 2µ + 2 · P_{X̃^n}( x^n ∈ G : log ((1/M)/P_{X^n}(x^n)) > µ )     (since P_{X̃^n}(G^c) = 0)
 = 2µ + 2 · P_{X̃^n}{ x^n ∈ G : −(1/n) log P_{X^n}(x^n) > H̄(X) + γ + µ/n }
 = 2µ + 2 · P_{X̃^n}(G ∩ D)
 = 2µ + (2/M) |G ∩ D|.

Since G is chosen randomly, we can take the expectation (with respect to the random G) on both sides of the above inequality to obtain

E_G[ ‖X^n − X̃^n‖ ] ≤ 2µ + (2/M) E_G[ |G ∩ D| ].

Observe that each Uj is either in D or not in D, and contributes weight 1/M when it is in D. From the i.i.d. assumption on {Uj}_{j=1}^{M}, we can then evaluate (1/M) E_G[|G ∩ D|] as⁴

(1/M) E_G[|G ∩ D|] = (1/M) Σ_{j=1}^{M} Pr[Uj ∈ D] = (1/M) (M P_{X^n}[D]) = P_{X^n}[D].

Hence,

lim sup_{n→∞} E_G[ ‖X^n − X̃^n‖ ] ≤ 2µ + 2 lim sup_{n→∞} P_{X^n}[D] = 2µ,

which implies

lim sup_{n→∞} E_G[ ‖X^n − X̃^n‖ ] = 0    (4.3.2)

since µ can be chosen arbitrarily small. (4.3.2) therefore guarantees the existence of the desired X̃.  □

Theorem 4.17 For any X,

S̄(X) = lim sup_{n→∞} (1/n) H(X^n).

Proof:

⁴ Readers may imagine the cases |G ∩ D| = M, |G ∩ D| = M − 1, . . . , |G ∩ D| = 1 and |G ∩ D| = 0, which occur with the probabilities of the binomial distribution with parameters M and P_{X^n}(D), and contribute (1/M)|G ∩ D| equal to M/M, (M − 1)/M, . . . , 1/M and 0, respectively; averaging over these cases again gives P_{X^n}(D).


1. S̄(X) ≤ lim sup_{n→∞} (1/n) H(X^n).

It suffices to prove that S̄ε(X) ≤ lim sup_{n→∞} (1/n) H(X^n) for every ε > 0. This is equivalent to showing that for all γ > 0, there exists X̃ such that

(1/n) H(X̃^n) < lim sup_{n→∞} (1/n) H(X^n) + γ

and

‖X^n − X̃^n‖ < ε

for sufficiently large n. This can be trivially achieved by letting X̃ = X, since for all sufficiently large n,

(1/n) H(X^n) < lim sup_{n→∞} (1/n) H(X^n) + γ   and   ‖X^n − X^n‖ = 0.

2. S̄(X) ≥ lim sup_{n→∞} (1/n) H(X^n).

Observe that S̄(X) ≥ S̄ε(X) for any 0 < ε < 1/2. Then for any γ > 0 and all sufficiently large n, there exists X̃^n such that

(1/n) H(X̃^n) < S̄(X) + γ    (4.3.3)

and

‖X^n − X̃^n‖ < ε.

Using the fact [1, pp. 33] that ‖X^n − X̃^n‖ ≤ ε ≤ 1/2 implies

|H(X^n) − H(X̃^n)| ≤ ε log (|X|^n / ε),

together with (4.3.3), we obtain

(1/n) H(X^n) − ε log |X| + (1/n) ε log ε ≤ (1/n) H(X̃^n) < S̄(X) + γ,

which implies that

lim sup_{n→∞} (1/n) H(X^n) − ε log |X| < S̄(X) + γ.

Since ε and γ can be taken arbitrarily small, we have

S̄(X) ≥ lim sup_{n→∞} (1/n) H(X^n).  □


4.4 Resolvability and source coding

In the previous chapter, we proved that the lossless data compression rate for block codes is lower bounded by H̄(X). We also showed in Section 4.3 that H̄(X) is the resolvability of the source X. We can therefore conclude that the resolvability is equal to the minimum lossless data compression rate for block codes. We will justify their equivalence directly in this section.

As explained in the AEP theorem for memoryless sources, the set Fn(δ) contains approximately e^{nH(X)} elements, and the probability that source sequences fall outside Fn(δ) eventually goes to 0. Therefore, we can binary-index the source sequences in Fn(δ) with

log( e^{nH(X)} ) = nH(X) nats,

and encode the source sequences outside Fn(δ) by a unique default binary codeword, which results in an asymptotically zero probability of decoding error. This is indeed the main idea of Shannon's source coding theorem for block codes.

By further exploring the above concept, we find that the key is actually the existence of a set An = {x^n_1, x^n_2, . . . , x^n_M} with M ≈ e^{nH(X)} and P_{X^n}(A^c_n) → 0. Thus, if we can find such a typical set, Shannon's source coding theorem for block codes can actually be generalized to more general sources, such as non-stationary sources. Furthermore, extension of the theorems to codes of some specific types becomes feasible.

Definition 4.18 (minimum ε-source compression rate for fixed-length codes) R is an ε-achievable source compression rate for fixed-length codes if there exists a sequence of sets {An}∞_{n=1} with An ⊂ X^n such that

lim sup_{n→∞} (1/n) log |An| ≤ R   and   lim sup_{n→∞} P_{X^n}[A^c_n] ≤ ε.

Tε(X) is the minimum of all such rates.

Note that this definition of Tε(X) is equivalent to the one in Definition 3.2.

Definition 4.19 (minimum source compression rate for fixed-length codes) T(X) represents the minimum source compression rate for fixed-length codes, which is defined as

T(X) , lim_{ε↓0} Tε(X).

Definition 4.20 (minimum source compression rate for variable-length codes) R is an achievable source compression rate for variable-length codes if there exists a sequence of error-free prefix codes { C∼n}∞_{n=1} such that

lim sup_{n→∞} (1/n) ℓ̄n ≤ R,

where ℓ̄n is the average codeword length of C∼n. T̄(X) is the minimum of all such rates.

Recall that for a single source, the measure of its uncertainty is the entropy. Although entropy can also be used to characterize the overall uncertainty of a random sequence X, source coding is concerned more with the “average” entropy of the sequence. So far, we have seen four expressions of “average” entropy:

lim sup_{n→∞} (1/n) H(X^n) , lim sup_{n→∞} (1/n) Σ_{x^n∈X^n} −P_{X^n}(x^n) log P_{X^n}(x^n);

lim inf_{n→∞} (1/n) H(X^n) , lim inf_{n→∞} (1/n) Σ_{x^n∈X^n} −P_{X^n}(x^n) log P_{X^n}(x^n);

H̄(X) , inf_{β∈ℝ} { β : lim sup_{n→∞} P_{X^n}[ −(1/n) log P_{X^n}(X^n) > β ] = 0 };

H̲(X) , sup_{α∈ℝ} { α : lim sup_{n→∞} P_{X^n}[ −(1/n) log P_{X^n}(X^n) < α ] = 0 }.

If

lim_{n→∞} (1/n) H(X^n) = lim sup_{n→∞} (1/n) H(X^n) = lim inf_{n→∞} (1/n) H(X^n),

then lim_{n→∞} (1/n) H(X^n) is named the entropy rate of the source. H̄(X) and H̲(X) are called the sup-entropy rate and inf-entropy rate, respectively; they were introduced in Section 2.3.

Next we will prove that T(X) = S(X) = H̄(X) and T̄(X) = S̄(X) = lim sup_{n→∞} (1/n) H(X^n) for a source X. The operational characterizations of lim inf_{n→∞} (1/n) H(X^n) and H̲(X) will be introduced in Chapter 6.

Theorem 4.21 (equality of resolvability and minimum source coding rate for fixed-length codes)

T(X) = S(X) = H̄(X).

Proof: The equality of S(X) and H̄(X) was already given in Theorem 4.16. Also, T(X) = H̄(X) can be obtained from Theorem 3.5 by letting ε = 0. Here, we provide an alternative proof of T(X) = S(X).


1. T(X) ≤ S(X).

If we can show that, for any fixed ε, Tε(X) ≤ S2ε(X), then the proof is completed. This claim is proved as follows.

• By definition of S2ε(X), we know that for any γ > 0, there exist X̃ and N such that for n > N,

(1/n) R(X̃^n) < S2ε(X) + γ   and   ‖X^n − X̃^n‖ < 2ε.

• Let An , { x^n : P_{X̃^n}(x^n) > 0 }. Since (1/n) R(X̃^n) < S2ε(X) + γ,

|An| ≤ exp{R(X̃^n)} < exp{n(S2ε(X) + γ)}.

Therefore,

lim sup_{n→∞} (1/n) log |An| ≤ S2ε(X) + γ.

• Also,

2ε > ‖X^n − X̃^n‖ = 2 sup_{E⊂X^n} |P_{X^n}(E) − P_{X̃^n}(E)|
 ≥ 2 |P_{X^n}(A^c_n) − P_{X̃^n}(A^c_n)|
 = 2 P_{X^n}(A^c_n)   (since P_{X̃^n}(A^c_n) = 0).

Hence, lim sup_{n→∞} P_{X^n}(A^c_n) ≤ ε.

• Since S2ε(X) + γ is just one of the rates that satisfy the conditions of the minimum ε-source compression rate, and Tε(X) is the smallest of such rates,

Tε(X) ≤ S2ε(X) + γ   for any γ > 0.

2. T(X) ≥ S(X).

Similarly, if we can show that, for any fixed ε, Tε(X) ≥ S3ε(X), then the proof is completed. This claim can be proved as follows.

• Fix α > 0. By definition of Tε(X), we know that for any γ > 0, there exist N and a sequence of sets {An}∞_{n=1} such that for n > N,

(1/n) log |An| < Tε(X) + γ   and   P_{X^n}(A^c_n) < ε + α.

• Choose Mn to satisfy

exp{n(Tε(X) + 2γ)} ≤ Mn ≤ exp{n(Tε(X) + 3γ)}.

Also select one element x^n_0 from A^c_n. Define a new random variable X̃^n as follows:

P_{X̃^n}(x^n) = 0 if x^n ∉ {x^n_0} ∪ An,   and   P_{X̃^n}(x^n) = k(x^n)/Mn if x^n ∈ {x^n_0} ∪ An,

where

k(x^n) , ⌈Mn P_{X^n}(x^n)⌉ if x^n ∈ An,   and   k(x^n_0) , Mn − Σ_{x̂^n∈An} k(x̂^n).

It can then be easily verified that X̃^n satisfies the next four properties:

(a) X̃^n is Mn-type;

(b) P_{X̃^n}(x^n_0) ≤ P_{X^n}(A^c_n) < ε + α, since x^n_0 ∈ A^c_n;

(c) for all x^n ∈ An,

|P_{X̃^n}(x^n) − P_{X^n}(x^n)| = ⌈Mn P_{X^n}(x^n)⌉/Mn − P_{X^n}(x^n) ≤ 1/Mn;

(d) P_{X̃^n}(An) + P_{X̃^n}(x^n_0) = 1.

• Consequently,

(1/n) R(X̃^n) ≤ Tε(X) + 3γ,

and

‖X^n − X̃^n‖ = Σ_{x^n∈An} |P_{X^n}(x^n) − P_{X̃^n}(x^n)| + |P_{X^n}(x^n_0) − P_{X̃^n}(x^n_0)| + Σ_{x^n∈A^c_n−{x^n_0}} |P_{X^n}(x^n) − P_{X̃^n}(x^n)|
 ≤ Σ_{x^n∈An} |P_{X^n}(x^n) − P_{X̃^n}(x^n)| + [P_{X^n}(x^n_0) + P_{X̃^n}(x^n_0)] + Σ_{x^n∈A^c_n−{x^n_0}} |P_{X^n}(x^n) − P_{X̃^n}(x^n)|
 ≤ Σ_{x^n∈An} 1/Mn + P_{X^n}(x^n_0) + P_{X̃^n}(x^n_0) + Σ_{x^n∈A^c_n−{x^n_0}} P_{X^n}(x^n)
 = |An|/Mn + P_{X̃^n}(x^n_0) + Σ_{x^n∈A^c_n} P_{X^n}(x^n)
 ≤ exp{n(Tε(X) + γ)}/exp{n(Tε(X) + 2γ)} + (ε + α) + P_{X^n}(A^c_n)
 ≤ e^{−nγ} + (ε + α) + (ε + α)
 ≤ 3(ε + α),   for n ≥ −log(ε + α)/γ.

• Since Tε(X) is just one of the rates that satisfy the conditions of 3(ε + α)-resolvability, and S3(ε+α)(X) is the smallest of such quantities,

S3(ε+α)(X) ≤ Tε(X).

The proof is completed by noting that α can be made arbitrarily small.  □

This theorem tells us that the minimum source compression rate for fixed-length codes is the resolvability, which in turn is equal to the sup-entropy rate.

Theorem 4.22 (equality of mean-resolvability and minimum source coding rate for variable-length codes)

T̄(X) = S̄(X) = lim sup_{n→∞} (1/n) H(X^n).

Proof: The equality of S̄(X) and lim sup_{n→∞} (1/n) H(X^n) was already given in Theorem 4.17.


1. S̄(X) ≤ T̄(X).

Definition 4.20 states that there exists, for all γ > 0 and all sufficiently large n, an error-free variable-length code whose average codeword length ℓ̄n satisfies

(1/n) ℓ̄n < T̄(X) + γ.

Moreover, the fundamental source coding lower bound for a uniquely decodable code (cf. Theorem 4.18 of Volume I of the lecture notes) is

H(X^n) ≤ ℓ̄n.

Thus, by letting X̃ = X, we obtain ‖X^n − X̃^n‖ = 0 and

(1/n) H(X̃^n) = (1/n) H(X^n) < T̄(X) + γ,

which shows that T̄(X) is an ε-mean-achievable resolution rate of X for any ε > 0, i.e.,

S̄(X) = lim_{ε→0} S̄ε(X) ≤ T̄(X).

2. T̄(X) ≤ S̄(X).

Observe that S̄ε(X) ≤ S̄(X) for 0 < ε < 1/2. Hence, by taking γ satisfying 2ε log |X| > γ > ε log |X|, for all sufficiently large n there exists X̃^n such that

(1/n) H(X̃^n) < S̄(X) + γ

and

‖X^n − X̃^n‖ < ε.    (4.4.1)

On the other hand, Theorem 4.22 of Volume I of the lecture notes proves the existence of an error-free prefix code for X^n with average codeword length ℓ̄n satisfying

ℓ̄n ≤ H(X^n) + 1 (bits).

By the fact [1, pp. 33] that ‖X^n − X̃^n‖ ≤ ε ≤ 1/2 implies

|H(X^n) − H(X̃^n)| ≤ ε log2 (|X|^n / ε),

together with (4.4.1), we obtain

(1/n) ℓ̄n ≤ (1/n) H(X^n) + 1/n
 ≤ (1/n) H(X̃^n) + ε log2 |X| − (1/n) ε log2 ε + 1/n
 ≤ S̄(X) + γ + ε log2 |X| − (1/n) ε log2 ε + 1/n
 ≤ S̄(X) + 2γ,

for n > (1 − ε log2 ε)/(γ − ε log2 |X|). Since γ can be made arbitrarily small, S̄(X) is an achievable source compression rate for variable-length codes; and hence,

T̄(X) ≤ S̄(X).  □

Again, the above theorem tells us that the minimum source compression rate for variable-length codes is the mean-resolvability, and the mean-resolvability is exactly lim sup_{n→∞} (1/n) H(X^n).

Note that lim sup_{n→∞} (1/n) H(X^n) ≤ H̄(X), which follows straightforwardly from the fact that the mean of the random variable −(1/n) log P_{X^n}(X^n) is no greater than the right margin of its spectrum. Also note that for a stationary-ergodic source, all of these quantities are equal, i.e.,

T(X) = S(X) = H̄(X) = T̄(X) = S̄(X) = lim sup_{n→∞} (1/n) H(X^n).

We end this chapter by computing these quantities for a specific example.

Example 4.23 Consider a binary random source X1, X2, . . . where the {Xi}∞_{i=1} are independent random variables with individual distributions

P_{Xi}(0) = Zi   and   P_{Xi}(1) = 1 − Zi,

where the {Zi}∞_{i=1} are pairwise independent with common uniform marginal distribution over (0, 1).

You may imagine that the source is formed by selecting from infinitely many binary number generators as shown in Figure 4.1. The selection process Z is independent at each time instance.

It can be shown that such a source is not stationary. Nevertheless, by means of an argument similar to the AEP theorem, we can show that

−(1/n) [ log P_X(X1) + log P_X(X2) + · · · + log P_X(Xn) ] → hb(Z) in probability,

where hb(a) , −a log2(a) − (1 − a) log2(1 − a) is the binary entropy function. To compute the ultimate average entropy rate in terms of the random variable hb(Z), we require that

−(1/n) [ log P_X(X1) + log P_X(X2) + · · · + log P_X(Xn) ] → hb(Z) in mean,


Figure 4.1: Source generator: {Xt}_{t∈I} (I = (0, 1)) is an independent random process with P_{Xt}(0) = 1 − P_{Xt}(1) = t, and is also independent of the selector Z; Xt is output if Z = t. The source generator at each time instance is independent temporally.

which is a stronger result than convergence in probability. By the fundamental properties of convergence, convergence in probability implies convergence in mean provided the sequence of random variables is uniformly integrable, which is true for −(1/n) Σ_{i=1}^{n} log P_X(Xi):

sup_{n>0} E[ | (1/n) Σ_{i=1}^{n} log P_X(Xi) | ]
 ≤ sup_{n>0} (1/n) Σ_{i=1}^{n} E[ |log P_X(Xi)| ]
 = sup_{n>0} E[ |log P_X(X)| ]    (since the Xi are identically distributed)
 = E[ |log P_X(X)| ]
 = E[ E( |log P_X(X)| | Z ) ]
 = ∫_0^1 E( |log P_X(X)| | Z = z ) dz
 = ∫_0^1 ( z |log(z)| + (1 − z) |log(1 − z)| ) dz
 ≤ ∫_0^1 log(2) dz = log(2).

We therefore have

| E[ −(1/n) log P_{X^n}(X^n) ] − E[hb(Z)] | ≤ E[ | −(1/n) log P_{X^n}(X^n) − hb(Z) | ] → 0.

Consequently,

lim sup_{n→∞} (1/n) H(X^n) = E[hb(Z)] = 0.5 nats or 0.721348 bits.

However, it can be shown that the ultimate cumulative distribution function of −(1/n) log P_{X^n}(X^n) is Pr[hb(Z) ≤ t] for t ∈ [0, log(2)] (cf. Figure 4.2).

The sup-entropy rate of X is then log(2) nats, or 1 bit (the right margin of the ultimate CDF of −(1/n) log P_{X^n}(X^n)). Hence, for this non-stationary source, the minimum average codeword lengths for fixed-length codes and variable-length codes are different, namely 1 bit and 0.721348 bits, respectively.
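The two numbers can be illustrated by simulation. The sketch below (our own illustration) adopts the reading suggested by the convergence statement, namely that a single Z ~ Uniform(0, 1) parameterizes a whole block — an interpretation we assume here — and estimates both the mean of −(1/n) log2 P_{X^n}(X^n) and its largest observed value.

    import random
    from math import log2

    def normalized_self_information(n, rng):
        """One realization of -(1/n) log2 P_{X^n}(X^n), assuming one Z ~ U(0,1) per block."""
        z = rng.uniform(1e-9, 1 - 1e-9)
        zeros = sum(1 for _ in range(n) if rng.random() < z)
        return -(zeros * log2(z) + (n - zeros) * log2(1 - z)) / n

    rng = random.Random(0)
    samples = [normalized_self_information(2000, rng) for _ in range(5000)]
    mean = sum(samples) / len(samples)
    print(f"E[-(1/n) log2 P] ~ {mean:.3f} bits  (E[hb(Z)] = 0.7213 bits)")
    print(f"largest observed value ~ {max(samples):.3f} bits  (sup-entropy rate = 1 bit)")

The empirical mean matches the variable-length rate 0.7213 bits, while the spectrum extends up to 1 bit, the rate a fixed-length code must pay.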


Figure 4.2: The ultimate CDF of −(1/n) log P_{X^n}(X^n), namely Pr{hb(Z) ≤ t}, which increases from 0 to 1 as t ranges from 0 to log(2) nats.


Bibliography

[1] I. Csiszar and J. Korner, Information Theory: Coding Theorems for Discrete Memoryless Systems, Academic, New York, 1981.

[2] T. S. Han and S. Verdu, “Approximation theory of output statistics,” IEEE Trans. on Information Theory, vol. IT–39, no. 3, pp. 752–772, May 1993.

[3] D. E. Knuth and A. C. Yao, “The complexity of random number generation,” in Proceedings of Symposium on New Directions and Recent Results in Algorithms and Complexity, New York: Academic Press, 1976.


Chapter 5

Channel Coding Theorems and Approximations of Output Statistics for Arbitrary Channels

The Shannon channel capacity in Volume I of the lecture notes is derived under the assumption that the channel is memoryless. With moderate modification of the proof, this result can be extended to stationary-ergodic channels, for which the capacity formula becomes the maximization of the mutual information rate:

lim_{n→∞} sup_{X^n} (1/n) I(X^n; Y^n).

Yet, for more general channels, such as non-stationary or non-ergodic channels, a more general expression for the channel capacity needs to be derived.

5.1 General models for channels

The channel transition probability in its most general form is denoted by W , {W^n = P_{Y^n|X^n}}∞_{n=1}, which is abbreviated as W for convenience. Similarly, the input and output random processes are respectively denoted by X and Y.

5.2 Variations of capacity formulas for arbitrary channels

5.2.1 Preliminaries

Now, similar to the definitions of the sup- and inf-entropy rates for a sequence of sources, the sup- and inf-(mutual-)information rates are respectively defined by¹

Ī(X;Y) , sup{ θ : i̲(θ) < 1 }

and

I̲(X;Y) , sup{ θ : ī(θ) ≤ 0 },

where

i̲(θ) , lim inf_{n→∞} Pr{ (1/n) i_{X^n W^n}(X^n; Y^n) ≤ θ }

is the inf-spectrum of the normalized information density,

ī(θ) , lim sup_{n→∞} Pr{ (1/n) i_{X^n W^n}(X^n; Y^n) ≤ θ }

is the sup-spectrum of the normalized information density, and

i_{X^n W^n}(x^n; y^n) , log ( P_{Y^n|X^n}(y^n|x^n) / P_{Y^n}(y^n) )

is the information density.

In 1994, Verdu and Han [2] have shown that the channel capacity in its mostgeneral form is

C , supX

I(X;Y ).

In their proof, they showed the achievability part in terms of Feinstein’s Lemma,and provide a new proof for the converse part. In this section, we will not adoptthe original proof of Verdu and Han in the converse theorem. Instead, we willuse a new and tighter bound [1] established by Poor and Verdu in 1995.

Definition 5.1 (fixed-length data transmission code) An (n,M) fixed-length data transmission code for channel input alphabet X n and output alpha-bet Yn consists of

1. M informational messages intended for transmission;

1In the paper of Verdu and Han [2], these two quantities are defined by:

I(X;Y ) , infβ∈<

β : (∀ γ > 0) lim sup

n→∞PXnWn

[1

niXnWn(Xn;Y n) > β + γ

]= 0

and

I(X;Y ) , supα∈<

α : (∀ γ > 0) lim sup

n→∞PXnWn

[1

niXnWn(Xn;Y n) < α+ γ

]= 0

.

The above definitions are in fact equivalent to ours.

77

2. an encoding function

f : 1, 2, . . . ,M → X n;

3. a decoding functiong : Yn → 1, 2, . . . ,M,

which is (usually) a deterministic rule that assigns a guess to each possiblereceived vector.

The channel inputs in xn ∈ X n : xn = f(m) for some 1 ≤ m ≤ M are thecodewords of the data transmission code.

Definition 5.2 (average probability of error) The average probability oferror for a C∼n = (n,M) code with encoder f(·) and decoder g(·) transmittedover channel QY n|Xn is defined as

Pe( C∼n) =1

M

M∑i=1

λi,

whereλi ,

∑yn∈Yn : g(yn)6=i

QY n|Xn(yn|f(i)).

Under the criterion of average probability of error, all of the codewords aretreated equally, namely the prior probability of the selected M codewords areuniformly distributed.

Lemma 5.3 (Feinstein’s Lemma) Fix a positive n. For every γ > 0 andinput distribution PXn on X n, there exists an (n,M) block code for the transitionprobability PWn = PY n|Xn that its average error probability Pe( C∼n) satisfies

Pe( C∼n) < Pr

[1

niXnWn(Xn;Y n) <

1

nlogM + γ

]+ e−nγ.

Proof:

Step 1: Notations. Define

G ,

(xn, yn) ∈ X n × Yn :

1

niXnWn(xn; yn) ≥ 1

nlogM + γ

.

Let ν , e−nγ + PXnWn(Gc).

78

The Feinstein’s Lemma obviously holds if ν ≥ 1, because then

Pe( C∼n) ≤ 1 ≤ ν , Pr

[1

niXnWn(Xn;Y n) <

1

nlogM + γ

]+ e−nγ.

So we assume ν < 1, which immediately results in

PXnWn(Gc) < ν < 1,

or equivalently,PXnWn(G) > 1− ν > 0.

Therefore,

PXn(A , xn ∈ X n : PY n|Xn(Gxn|xn) > 1− ν) > 0,

where Gxn , yn ∈ Yn : (xn, yn) ∈ G, because if PXn(A) = 0,

(∀ xn with PXn(xn) > 0) PY n|Xn(Gxn|xn) ≤ 1− ν

⇒∑xn∈Xn

PXn(xn)PY n|Xn(Gxn|xn) = PXnWn(G) ≤ 1− ν.

Step 2: Encoder. Choose an xn1 in A (Recall that PXn(A) > 0.) Define Γ1 =Gxn1 . (Then PY n|Xn(Γ1|xn1 ) > 1− ν.)

Next choose, if possible, a point xn2 ∈ X n without replacement (i.e., xn2cannot be identical to xn1 ) for which

PY n|Xn

(Gxn2 − Γ1

∣∣xn2) > 1− ν,

and define Γ2 , Gxn2 − Γ1.

Continue in the following way as for codeword i: choose xni to satisfy

PY n|Xn

(Gxni −

i−1⋃j=1

Γj

∣∣∣∣∣xni)> 1− ν,

and define Γi , Gxni −⋃i−1j=1 Γj.

Repeat the above codeword selecting procedure until either M codewordshave been selected or all the points in A have been exhausted.

Step 3: Decoder. Define the decoding rule as

φ(yn) =

i, if yn ∈ Γiarbitraryy, otherwise.

79

Step 4: Probability of error. For all selected codewords, the error probabi-lity given codeword i is transmitted, λe|i, satisfies

λe|i ≤ PY n|Xn(Γci |xni ) < ν.

(Note that (∀ i) PXn|Xn(Γi|xni ) ≥ 1 − ν by step 2.) Therefore, if we canshow that the above codeword selecting procedures will not terminatedbefore M , then

Pe( C∼n) =1

M

M∑i=1

λe|i < ν.

Step 5: Claim. The codeword selecting procedure in step 2 will not terminatedbefore M .

proof: We will prove it by contradiction.

Suppose the above procedure terminated before M , say at N < M . Definethe set

F ,N⋃i=1

Γi ∈ Yn.

Consider the probability

PXnWn(G) = PXnWn [G ∩ (X n ×F)] + PXnWn [G ∩ (X n ×F c)]. (5.2.1)

Since for any yn ∈ Gxni ,

PY n(yn) ≤PY n|Xn(yn|xni )

M · enγ,

we have

PY n(Γi) ≤ PY n(Gxni )

≤ 1

Me−nγPY n|Xn(Gxni )

≤ 1

Me−nγ.

So the first term of the right hand side in (5.2.1) can be upper bounded by

PXnWn [G ∩ (X n ×F)] ≤ PXnWn(X n ×F)

= PY n(F)

=N∑i=1

PY n(Γi)

≤ N × 1

Me−nγ =

N

Me−nγ.

80

As for the second term of the right hand side in (5.2.1), we can upperbound it by

PXnWn [G ∩ (X n ×F c)] =∑xn∈Xn

PXn(xn)PY n|Xn(Gxn ∩ F c|xn)

=∑xn∈Xn

PXn(xn)PY n|Xn

(Gxn −

N⋃i=1

Γi

∣∣∣∣∣xn)

≤∑xn∈Xn

PXn(xn)(1− ν) ≤ (1− ν),

where the last step follows since for all xn ∈ X n,

PY n|Xn

(Gxn −

N⋃i=1

Γi

∣∣∣∣∣xn)≤ 1− ν.

(Because otherwise we could find the (N + 1)-th codeword.)

Consequently, PXnWn(G) ≤ (N/M)e−nγ + (1− ν). By definition of G,

PXnWn(G) = 1− ν + e−nγ ≤ N

Me−nγ + (1− ν),

which implies N ≥M , a contradiction. 2

Lemma 5.4 (Poor-Verdu Lemma [1]) Suppose X and Y are random vari-ables, where X taking values on a finite (or coutably infinite) set

X = x1, x2, x3, . . . .

The minimum probability of error Pe in estimating X from Y satisfies

Pe ≥ (1− α)PX,Y

(x, y) ∈ X × Y : PX|Y (x|y) ≤ α

for each α ∈ [0, 1].

Proof: It is known that the minimum-error-probability estimate e(y) of X whenreceiving y is

e(y) = arg maxx∈X

PX|Y (x|y). (5.2.2)

Therefore, the error probability incurred in testing among the values of X is

81

given by

1− Pe = PrX = e(Y )

=

∫Y

[ ∑x : x=e(y)

PX|Y (x|y)

]dPY (y)

=

∫Y

(maxx∈X

PX|Y (x|y)

)dPY (y)

=

∫Y

(maxx∈X

fx(y)

)dPY (y)

= E

[maxx∈X

fx(Y )

],

where fx(y) , PX|Y (x|y). Let hj(y) be the j-th element in the re-ordering setof fx1(y), fx2(y), fx3(y), . . . according to ascending element values. In otherwords,

h1(y) ≥ h2(y) ≥ h3(y) ≥ · · ·

andh1(y), h2(y), h3(y), . . . = fx1(y), fx2(y), fx3(y), . . ..

Then1− Pe = E[h1(Y )]. (5.2.3)

For any α ∈ [0, 1], we can write

PX,Y (x, y) ∈ X × Y : fx(y) > α =

∫YPX|Y x ∈ X : fx(y) > α dPY (y).

Observe that

PX|Y x ∈ X : fx(y) > α =∑x∈X

PX|Y (x|y) · 1[fx(y) > α]

=∑x∈X

fx(y) · 1[fx(y) > α]

=∞∑j=1

hj(y) · 1[hj(y) > α],

82

where 1(·) is the indicator function2. Therefore,

PXY (x, y) ∈ X × Y : fx(y) > α =

∫Y

(∞∑j=1

hj(y) · 1[hj(y) > α]

)dPY (y)

≥∫Yh1(y) · 1(h1(y) > α)dPY (y)

= E[h1(Y ) · 1(h1(Y ) > α)].

It remains to relate E[h1(Y ) · 1(h1(Y ) > α)] with E[h1(Y )], which is exactly1− Pe. For any α ∈ [0, 1] and any random variable U with Pr0 ≤ U ≤ 1 = 1,

U ≤ α + (1− α) · U · 1(U > α)

holds with probability one. This can be easily proved by upper-bounding U interms of α when 0 ≤ U ≤ α, and α + (1− α)U , otherwise. Thus

E[U ] ≤ α + (1− α)E[U · 1(U > α)].

By letting U = h1(Y ), together with (5.2.3), we finally obtain

(1− α)PXY (x, y) ∈ X × Y : fx(y) > α ≥ (1− α)E[h1(Y ) · 1[h1(Y ) > α]]

≥ E[h1(Y )]− α= (1− Pe)− α= (1− α)− Pe.

2

Corollary 5.5 Every C∼n = (n,M) code satisfies

Pe( C∼n) ≥(1− e−nγ

)Pr

[1

niXnWn(Xn;Y n) ≤ 1

nlogM − γ

]for every γ > 0, where Xn places probability mass 1/M on each codeword, andPe( C∼n) denotes the error probability of the code.

Proof: Taking α = e−nγ in Lemma 5.4, and replacing X and Y in Lemma 5.4

2I.e., if fx(y) > α is true, 1(·) = 1; else, it is zero.

83

by its n-fold counterparts, i.e., Xn and Y n, we obtain

Pe( C∼n) ≥(1− e−nγ

)PXnWn

[(xn, yn) ∈ X n × Yn : PXn|Y n(xn|yn) ≤ e−nγ

]=

(1− e−nγ

)PXnWn

[(xn, yn) ∈ X n × Yn :

PXn|Y n(xn|yn)

1/M≤ e−nγ

1/M

]=

(1− e−nγ

)PXnWn

[(xn, yn) ∈ X n × Yn :

PXn|Y n(xn|yn)

PXn(xn)≤ e−nγ

1/M

]=

(1− e−nγ

)PXnWn [(xn, yn) ∈ X n × Yn :

1

nlog

PXn|Y n(xn|yn)

PXn(xn)≤ 1

nlogM − γ

]=

(1− e−nγ

)Pr

[1

niXnWn(Xn;Y n) ≤ 1

nlogM − γ

].

2

Definition 5.6 (ε-achievable rate) Fix ε ∈ [0, 1]. R ≥ 0 is an ε-achievablerate if there exists a sequence of C∼n = (n,Mn) channel block codes such that

lim infn→∞

1

nlogMn ≥ R

andlim supn→∞

Pe( C∼n) ≤ ε.

Definition 5.7 (ε-capacity Cε) Fix ε ∈ [0, 1]. The supremum of ε-achievablerates is called the ε-capacity, Cε.

It is straightforward for the definition that Cε is non-decreasing in ε, andC1 = log |X |.

Definition 5.8 (capacity C) The channel capacity C is defined as the supre-mum of the rates that are ε-achievable for all ε ∈ [0, 1]. It follows immediatelyfrom the definition3 that C = inf0≤ε≤1Cε = limε↓0Cε = C0 and that C is the

3The proof of C0 = limε↓0 Cε can be proved by contradiction as follows.Suppose C0 + 2γ < limε↓0 Cε for some γ > 0. For any positive integer j, and by defini-

tion of C1/j , there exists Nj and a sequence of C∼n = (n,Mn) code such that for n > Nj ,

(1/n) logMn > C1/j − γ > C0 + γ and Pe( C∼n) < 2/j. Construct a sequence of codes C∼n =

(n, Mn) as: C∼n = C∼n, if maxj−1i=1 Ni ≤ n < maxji=1Ni. Then lim supn→∞(1/n) log Mn ≥ C0+γ

and lim infn→∞ Pe( C∼n) = 0, which contradicts to the definition of C0.

84

supremum of all the rates R for which there exists a sequence of C∼n = (n,Mn)channel block codes such that

lim infn→∞

1

nlogMn ≥ R,

andlim supn→∞

Pe( C∼n) = 0.

5.2.2 ε-capacity

Theorem 5.9 (ε-capacity) For 0 < ε < 1, the ε-capacity Cε for arbitrarychannels satisfies

Cε = supX

Iε(X;Y ).

Proof:

1. Cε ≥ supX Iε(X;Y ).

Fix input X. It suffices to show the existence of C∼n = (n,Mn) datatransmission code with rate

Iε(X;Y )− γ < 1

nlogMn < Iε(X;Y )− γ

2

and probability of decoding error satisfying

lim supn→∞

Pe( C∼n) ≤ ε

for every γ > 0. (Because if such code exists, then lim supn→∞ Pe( C∼n) ≤ εand lim infn→∞(1/n) logMn ≥ Iε(X;Y )−γ, which implies Cε ≥ Iε(X;Y ).)

From Lemma 5.3, there exists an C∼n = (n,Mn) code whose error probabi-lity satisfies

Pe( C∼n) < Pr

[1

niXnWn(Xn;Y n) <

1

nlogMn +

γ

4

]+ e−nγ/4

≤ Pr

[1

niXnWn(Xn;Y n) <

(Iε(X;Y )− γ

2

)+γ

4

]+ e−nγ/4

≤ Pr

[1

niXnWn(Xn;Y n) < Iε(X;Y )− γ

4

]+ e−nγ/4.

Since

Iε(X;Y ) , sup

R : lim sup

n→∞Pr

[1

niWnWn(Xn;Y n) ≤ R

]≤ ε

,

85

we obtain

lim supn→∞

Pr

[1

niXnWn(Xn;Y n) < Iε(X;Y )− γ

4

]≤ ε.

Hence, the proof of the direct part is completed by noting that

lim supn→∞

Pe( C∼n) ≤ lim supn→∞

Pr

[1

niXnWn(Xn;Y n) < Iε(X;Y )− γ

4

]+ lim sup

n→∞e−nγ/4

= ε.

2. Cε ≤ supX Iε(X;Y ).

Suppose that there exists a sequence of C∼n = (n,Mn) codes with ratestrictly larger than supX Iε(X;Y ) and lim supn→∞ Pe( C∼n) ≤ ε. Let theultimate code rate for this code be supX Iε(X;Y ) + 3ρ for some ρ > 0.Then for sufficiently large n,

1

nlogMn > sup

XIε(X;Y ) + 2ρ.

Since the above inequality holds for every X, it certainly holds if takinginput Xn which places probability mass 1/Mn on each codeword, i.e.,

1

nlogMn > Iε(X; Y ) + 2ρ, (5.2.4)

where Y is the channel output due to channel input X. Then from Corol-lary 5.5, the error probability of the code satisfies

Pe( C∼n) ≥(1− e−nρ

)Pr

[1

niXnWn(Xn; Y n) ≤ 1

nlogMn − ρ

]≥

(1− e−nρ

)Pr

[1

niXnWn(Xn; Y n) ≤ Iε(X; Y ) + ρ

],

where the last inequality follows from (5.2.4), which by taking the limsupof both sides, we have

ε ≥ lim supn→∞

Pe( C∼n) ≥ lim supn→∞

Pr

[1

niXnWn(Xn; Y n) ≤ Iε(X; Y ) + ρ

]> ε,

and a desired contradiction is obtained. 2

86

5.2.3 General Shannon capacity

Theorem 5.10 (general Shannon capacity) The channel capacity C for ar-bitrary channel satisfies

C = supX

I(X;Y ).

Proof: Observe thatC = C0 = lim

ε↓0Cε.

Hence, from Theorem 5.9, we note that for ε ∈ (0, 1),

C ≥ supX

limδ↑ε

Iδ(X;Y ) ≥ supX

I0(X;Y ) = supX

I(X;Y ).

It remains to show that C ≤ supX I(X;Y ).

Suppose that there exists a sequence of C∼n = (n,Mn) codes with rate strictlylarger than supX I(X;Y ) and error probability tends to 0 as n → ∞. Let theultimate code rate for this code be supX I(X;Y ) + 3ρ for some ρ > 0. Thenfor sufficiently large n,

1

nlogMn > sup

XI(X;Y ) + 2ρ.

Since the above inequality holds for every X, it certainly holds if taking inputXn which places probability mass 1/Mn on each codeword, i.e.,

1

nlogMn > I(X; Y ) + 2ρ, (5.2.5)

where Y is the channel output due to channel input X. Then from Corollary5.5, the error probability of the code satisfies

Pe( C∼n) ≥(1− e−nρ

)Pr

[1

niXnWn(Xn; Y n) ≤ 1

nlogMn − ρ

]≥

(1− e−nρ

)Pr

[1

niXnWn(Xn; Y n) ≤ I(X; Y ) + ρ

], (5.2.6)

where the last inequality follows from (5.2.5). Since, by assumption, Pe( C∼n)vanishes as n→∞, but (5.2.6) cannot vanish by definition of I(X; Y ), therebya desired contradiction is obtained. 2

87

5.2.4 Strong capacity

Definition 5.11 (strong capacity) Define the strong converse capacity (orstrong capacity) CSC as the infimum of the rates R such that for all channelblock codes C∼n = (n,Mn) with

lim infn→∞

1

nlogMn ≥ R,

we havelim infn→∞

Pe( C∼n) = 1.

Theorem 5.12 (general strong capacity)

CSC , supXI(X;Y ).

5.2.5 Examples

With these general capacity formulas, we can now compute the channel capac-ity for some of the non-stationary or non-ergodic channels, and analyze theirproperties.

Example 5.13 (Shannon capacity) Let the input and output alphabets be0, 1, and let every output Yi be given by:

Yi = Xi ⊕Ni,

where “⊕” represents modulo-2 addition operation. Assume the input processX and the noise process N are independent.

A general relation between the entropy rate and mutual information rate canbe derived from (2.4.2) and (2.4.4) as:

H(Y )− H(Y |X) ≤ I(X;Y ) ≤ H(Y )− H(Y |X).

Since Nn is completely determined from Y n under the knowledge of Xn,

H(Y |X) = H(N ).

Indeed, this channel is a symmetric channel. Therefore, an uniform input yieldsuniform output (Bernoulli with parameter (1/2)), andH(Y ) = H(Y ) = log(2)nats. We thus have

C = log(2)− H(N ) nats.

We then compute the channel capacity for the next two cases of noises.

88

Case A) If N is a non-stationary binary independent sequence with

PrNi = 1 = pi,

then by the uniform boundedness (in i) of the variance of random variable− logPNi(Ni), namely,

Var[− logPNi(Ni)] ≤ E[(logPNi(Ni))2]

≤ sup0<pi<1

[pi(log pi)

2 + (1− pi)(log(1− pi))2]

≤ log(2),

we have by Chebyshev’s inequality,

Pr

∣∣∣∣∣− 1

nlogPNn(Nn)− 1

n

n∑i=1

H(Ni)

∣∣∣∣∣ > γ

→ 0,

for any γ > 0. Therefore, H(N ) = lim supn→∞(1/n)∑n

i=1 hb(pi). Conse-quently,

C = log(2)− lim supn→∞

1

n

n∑i=1

hb(pi) nats/channel usage.

This result is illustrated in Figures 5.1.

-

· · ·

H(N ) H(N )cluster points

Figure 5.1: The ultimate CDFs of (1/n) logPNn(Nn).

Case B) If N has the same distribution as the source process in Example 4.23,then H(N ) = log(2) nats, which yields zero channel capacity.

Example 5.14 (strong capacity) Continue from Example 5.13. To show theinf-information rate of channel W , we first derive the relation of the CDF be-tween the information density and noise process with respective to the uniform

89

input that maximizes the information rate (In this case, PXn(xn) = PY n(yn) =2−n).

Pr

1

nlog

PY n|Xn(Y n|Xn)

PY n(Y n)≤ θ

= Pr

1

nlogPNn(Nn)− 1

nlogPY n(Y n) ≤ θ

= Pr

1

nlogPNn(Nn) ≤ θ − log(2)

= Pr

− 1

nlogPNn(Nn) ≥ log(2)− θ

= 1− Pr

− 1

nlogPNn(Nn) < log(2)− θ

. (5.2.7)

Case A) From (5.2.7), it is obvious that

CSC = 1− lim infn→∞

1

n

n∑i=1

hb(pi).

Case B) From (5.2.7) and also from Figure 5.2,

CSC = log(2) nats/channel usage.

In Case B of Example 5.14, we have derived the ultimate CDF of the nor-malized information density, which is depicted in Figure 5.2. This limiting CDFis called spectrum of the normalized information density.

In Figure 5.2, it has been stated that the channel capacity is 0 nat/channnelusage, and the strong capacity is log(2) nat/channel usage. Hence, the opera-tional meaning of the two margins has been clearly revealed. Question is “whatis the operational meaning of the function value between 0 and log(2)?”. Theanswer of this question actually follows Definition 5.7 of a new capacity-relatedquantity.

In practice, it may not be easy to design a block code which transmits infor-mation with (asymptotically) no error through a very noisy channel with rateequals to channel capacity. However, if we admit some errors in transmission,such as the error probability is bounded above by 0.001, we may have morechance to come up with a feasible block code.

Example 5.15 (ε-capacity) Continue from case B of Example 5.13. Let thespectrum of the normalized information density be i(θ). Then the ε-capacity ofthis channel is actually the inverse function of i(θ), i.e.

Cε = i−1(θ).

90

-0 log(2) nats

1

Figure 5.2: The ultimate CDF of the normalized information densityfor Example 5.14-Case B).

Note that the Shannon capacity can be written as:

C = limε↓0

Cε;

and in general, the strong capacity satisfies

CSC ≥ limε↑1

Cε.

However, equality of the above inequality holds in this example.

5.3 Structures of good data transmission codes

The channel capacity for discrete memoryless channel is shown to be:

C , maxPX

I(PX , QY |X).

Let PX be the optimizer of the above maximization operation. Then

C , maxPX

I(PX , QY |X) = I(PX , QY |X).

Here, the performance of the code is assumed to be the average error probability,namely

Pe( C∼n) =1

M

M∑i=1

Pe( C∼n|xni ),

91

if the codebook is C∼n , xn1 , xn2 , . . . , xnM. Due to the random coding argument,a deterministic good code with arbitrarily small error probability and rate lessthan channel capacity must exist. Question is what is the relationship betweenthe good code and the optimizer PX? It is widely believed that if the code isgood (with rate close to capacity and low error probability), then the outputstatistics PY n— due to the equally-likely code—must approximate the outputdistribution, denoted by PY n , due to the input distribution achieving the channelcapacity.

This fact is actually reflected in the next theorem.

Theorem 5.16 For any channel W n = (Y n|Xn) with finite input alphabet andcapacity C that satisfies the strong converse (i.e., C = CSC), the followingstatement holds.

Fix γ > 0 and a sequence of C∼n = (n,Mn)∞n=1 block codes with

1

nlogMn ≥ C − γ/2,

and vanishing error probability (i.e., error probability approaches zero as block-length n tends to infinity.) Then,

1

n‖Y n − Y n‖ ≤ γ for all sufficiently large n,

where Y n is the output due to the block code and Yn

is the output due the Xn

that satisfiesI(X

n;Y

n) = max

XnI(Xn;Y n).

To be specific,

PY n(yn) =∑

xn∈ C∼n

PXn(xn)PWn(yn|xn) =∑

xn∈ C∼n

1

Mn

PWn(yn|xn)

andPY n(yn) =

∑xn∈Xn

PXn(xn)PWn(yn|xn).

Note that the above theorem holds for arbitrary channels, not restricted toonly discrete memoryless channels.

One may query that “can a result in the spirit of the above theorem beproved for the input statistics rather than the output statistics?” The answeris negative. Hence, the statement that the statistics of any good code mustapproximate those that maximize the mutual information is erroneously takenfor granted. (However, we do not rule out the possibility of the existence of

92

good codes that approximate those that maximize the mutual information.) To

see this, simply consider the normalized entropy of Xn

versus that of Xn (whichis uniformly distributed over the codewords) for discrete memoryless channels:

1

nH(X

n)− 1

nH(Xn) =

[1

nH(X

n|Y n) +

1

nI(X

n;Y

n)

]− 1

nlog(Mn)

=[H(X|Y ) + I(X;Y )

]− 1

nlog(Mn)

=[H(X|Y ) + C

]− 1

nlog(Mn).

A good code with vanishing error probability exists for (1/n) log(Mn) arbitrarilyclose to C; hence, we can find a good code sequence to satisfy

limn→∞

[1

nH(X

n)− 1

nH(Xn)

]= H(X|Y ).

Since the term H(X|Y ) is in general positive, where a quick example is the BSCwith crossover probability p, which yields H(X|Y ) = −p log(p)−(1−p) log(1−p),the two input distributions does not necessarily resemble to each other.

5.4 Approximations of output statistics: resolvability forchannels

5.4.1 Motivations

The discussion of the previous section somewhat motivates the necessity to finda equally-distributed (over a subset of input alphabet) input distribution thatgenerates the output statistics, which is close to the output due to the inputthat maximizes the mutual information. Since such approximations are usuallyperformed by computers, it may be natural to connect approximations of theinput and output statistics with the concept of resolvability.

5.4.2 Notations and definitions of resolvability for chan-nels

In a data transmission system as shown in Figure 5.3, suppose that the source,channel and output are respectively denoted by

Xn , (X1, . . . , Xn),

W n , (W1, . . . ,Wn),

93

andY n , (Y1, . . . , Yn),

where Wi has distribution PYi|Xi .

To simulate the behavior of the channel, a computer-generated input maybe necessary as shown in Figure 5.4. As stated in Chapter 4, such computer-generated input is based on an algorithm formed by a few basic uniform randomexperiments, which has finite resolution. Our goal is to find a good computer-generated input Xn such that the corresponding output Y n is very close to thetrue output Y n.

-. . . , X3, X2, X1

true source

PY n|Xn

true channel-

. . . , Y3, Y2, Y1

true output

Figure 5.3: The communication system.

-. . . , X3, X2, X1

computer-generatedsource

PY n|Xn

true channel-

. . . , Y3, Y2, Y1

correspondingoutput

Figure 5.4: The simulated communication system.

Definition 5.17 (ε-resolvability for input X and channel W ) Fix ε >0, and suppose that the (true) input random variable and (true) channel statisticsare X and W = (Y |X), respectively.

Then the ε-resolvability Sε(X,W ) for input X and channel W is definedby:

Sε(X,W ) , minR : (∀ γ > 0)(∃ X and N)(∀ n > N)

1

nR(Xn) < R + γ and ‖Y n − Y n‖ < ε

,

where PY n = PXnPWn . (The definitions of resolution R(·) and variational dis-tance ‖(·)− (·)‖ are given by Definitions 4.4 and 4.5.)

Note that if we take the channel W n to be an identity channel for all n,namely X n = Yn and PY n|Xn(yn|xn) is either 1 or 0, then the ε-resolvability forinput X and channel W is reduced to source ε-resolvability for X:

Sε(X,W Identity) = Sε(X).

94

Similar reductions can be applied to all the following definitions.

Definition 5.18 (ε-mean-resolvability for input X and channel W ) Fixε > 0, and suppose that the (true) input random variable and (true) channelstatistics are respectively X and W .

Then the ε-mean-resolvability Sε(X,W ) for input X and channel W isdefined by:

Sε(X,W ) , minR : (∀ γ > 0)(∃ X and N)(∀ n > N)

1

nH(Xn) < R + γ and ‖Y n − Y n‖ < ε

,

where PY n = PXnPWn and PY n = PXnPWn .

Definition 5.19 (resolvability and mean resolvability for input X andchannel W ) The resolvability and mean-resolvability for input X and W aredefined respectively as:

S(X,W ) , supε>0

Sε(X,W ) and S(X,W ) , supε>0

Sε(X,W ).

Definition 5.20 (resolvability and mean resolvability for channel W )The resolvability and mean-resolvability for channel W are defined respectivelyas:

S(W ) , supXS(X,W ),

andS(W ) , sup

XS(X,W ).

5.4.3 Results on resolvability and mean-resolvability forchannels

Theorem 5.21S(W ) = CSC = sup

XI(X;Y ).

It is somewhat a reasonable inference that if no computer algorithms can pro-duce a desired good output statistics under the number of random nats specified,then all codes should be bad codes for this rate.

Theorem 5.22S(W ) = C = sup

XI(X;Y ).

95

It it not yet clear that the operational relation between the resolvability (ormean-resolvability) and capacities for channels. This could be some interestingproblems to think of.

96

Bibliography

[1] H. V. Poor and S. Verdu, “A lower bound on the probability of error inmultihypothesis Testing,” IEEE Trans. on Information Theory, vol. IT–41,no. 6, pp. 1992–1994, Nov. 1995.

[2] S. Verdu and T. S. Han, “A general formula for channel capacity,” IEEETrans. on Information Theory, vol. IT–40, no. 4, pp. 1147–1157, Jul. 1994.

97

Chapter 6

Optimistic Shannon Coding Theoremsfor Arbitrary Single-User Systems

The conventional definitions of the source coding rate and channel capacity re-quire the existence of reliable codes for all sufficiently large blocklengths. Alterna-tively, if it is required that good codes exist for infinitely many blocklengths, thenoptimistic definitions of source coding rate and channel capacity are obtained.

In this chapter, formulas for the optimistic minimum achievable fixed-lengthsource coding rate and the minimum ε-achievable source coding rate for arbi-trary finite-alphabet sources are established. The expressions for the optimisticcapacity and the optimistic ε-capacity of arbitrary single-user channels are alsoprovided. The expressions of the optimistic source coding rate and capacity areexamined for the class of information stable sources and channels, respectively.Finally, examples for the computation of optimistic capacity are presented.

6.1 Motivations

The conventional definition of the minimum achievable fixed-length source cod-ing rate T (X) (or T0(X)) for a source X (cf. Definition 3.2) requires the exis-tence of reliable source codes for all sufficiently large blocklengths. Alternatively,if it is required that reliable codes exist for infinitely many blocklengths, a new,more optimistic definition of source coding rate (denoted by T (X)) is obtained[11]. Similarly, the optimistic capacity C is defined by requiring the existenceof reliable channel codes for infinitely many blocklengths, as opposed to thedefinition of the conventional channel capacity C [12, Definition 1].

This concept of optimistic source coding rate and capacity has recently beeninvestigated by Verdu et.al for arbitrary (not necessarily stationary, ergodic,information stable, etc.) sources and single-user channels [11, 12]. More specifi-cally, they establish an additional operational characterization for the optimistic

98

minimum achievable source coding rate (T (X) for source X) by demonstratingthat for a given channel, the classical statement of the source-channel separationtheorem1 holds for every channel if T (X) = T (X) [11]. In a dual fashion, theyalso show that for channels with C = C, the classical separation theorem holdsfor every source. They also conjecture that T (X) and C do not seem to admita simple expression.

In this chapter, we demonstrate that T (X) and C do indeed have a generalformula. The key to these results is the application of the generalized sup-information rate introduced in Chapter 2 to the existing proofs by Verdu andHan [12] of the direct and converse parts of the conventional coding theorems.

We also provide a general expression for the optimistic minimum ε-achievablesource coding rate and the optimistic ε-capacity.

For the generalized sup/inf-information/entropy rates which will play a keyrole in proving our optimistic coding theorems, readers may refer to Chapter 2.

6.2 Optimistic source coding theorems

In this section, we provide the optimistic source coding theorems. They areshown based on two new bounds due to Han [7] on the error probability ofa source code as a function of its size. Interestingly, these bounds constitutethe natural counterparts of the upper bound provided by Feinstein’s Lemmaand the Verdu-Han lower bound [12] to the error probability of a channel code.Furthermore, we show that for information stable sources, the formula for T (X)reduces to

T (X) = lim infn→∞

1

nH(Xn).

This is in contrast to the expression for T (X), which is known to be

T (X) = lim supn→∞

1

nH(Xn).

The above result leads us to observe that for sources that are both stationary andinformation stable, the classical separation theorem is valid for every channel.

In [11], Vembu et.al characterize the sources for which the classical separationtheorem holds for every channel. They demonstrate that for a given source X,

1By the “classical statement of the source-channel separation theorem,” we mean the fol-lowing. Given a source X with (conventional) source coding rate T (X) and channel Wwith capacity C, then X can be reliably transmitted over W if T (X) < C. Conversely, ifT (X) > C, then X cannot be reliably transmitted over W . By reliable transmissibility ofthe source over the channel, we mean that there exits a sequence of joint source-channel codessuch that the decoding error probability vanishes as the blocklength n→∞ (cf. [11]).

99

the separation theorem holds for every channel if its optimistic minimum achiev-able source coding rate (T (X)) coincides with its conventional (or pessimistic)minimum achievable source coding rate (T (X)); i.e., if T (X) = T (X).

We herein establish a general formula for T (X). We prove that for any sourceX,

T (X) = limδ↑1Hδ(X).

We also provide the general expression for the optimistic minimum ε-achievablesource coding rate. We show these results based on two new bounds due to Han(one upper bound and one lower bound) on the error probability of a sourcecode [7, Chapter 1]. The upper bound (Lemma 3.3) consists of the counterpartof Feinstein’s Lemma for channel codes, while the lower bound (Lemma 3.4)consists of the counterpart of the Verdu-Han lower bound on the error probabilityof a channel code ([12, Theorem 4]). As in the case of the channel coding bounds,both source coding bounds (Lemmas 3.3 and 3.4) hold for arbitrary sources andfor arbitrary fixed blocklength.

Definition 6.1 An (n,M) fixed-length source code for Xn is a collection of Mn-tuples C∼n = cn1 , . . . , cnM. The error probability of the code is

Pe( C∼n) , Pr [Xn 6∈ C∼n] .

Definition 6.2 (optimistic ε-achievable source coding rate) Fix 0 < ε <1. R ≥ 0 is an optimistic ε-achievable rate if, for every γ > 0, there exists asequence of (n,Mn) fixed-length source codes C∼n such that

lim supn→∞

1

nlogMn ≤ R

andlim infn→∞

Pe( C∼n) ≤ ε.

The infimum of all ε-achievable source coding rates for source X is denotedby T ε(X). Also define T (X) , sup0<ε<1 T ε(X) = limε↓0 T ε(X) = T 0(X) asthe optimistic source coding rate.

We can then use Lemmas 3.3 and 3.4 (in a similar fashion to the general sourcecoding theorem) to prove the general optimistic (fixed-length) source codingtheorems.

Theorem 6.3 (optimistic minimum ε-achievable source coding rate for-mula) For any source X,

T ε(X) ≤

limδ↑(1−ε)Hδ(X), for ε ∈ [0, 1);

0, for ε = 1.

100

We conclude this section by examining the expression of T (X) for infor-mation stable sources. It is already known (cf. for example [11]) that for aninformation stable source X,

T (X) = lim supn→∞

1

nH(Xn).

We herein prove a parallel expression for T (X).

Definition 6.4 (information stable sources [11]) A source X is said to beinformation stable if H(Xn) > 0 for n sufficiently large, and hXn(Xn)/H(Xn)converges in probability to one as n→∞, i.e.,

lim supn→∞

Pr

[∣∣∣∣hXn(Xn)

H(Xn)− 1

∣∣∣∣ > γ

]= 0 ∀γ > 0,

where H(Xn) = E[hXn(Xn)] is the entropy of Xn.

Lemma 6.5 Every information source X satisfies

T (X) = lim infn→∞

1

nH(Xn).

Proof:

1. [T (X) ≥ lim infn→∞(1/n)H(Xn)]

Fix ε > 0 arbitrarily small. Using the fact that hXn(Xn) is a non-negativebounded random variable for finite alphabet, we can write the normalized blockentropy as

1

nH(Xn) = E

[1

nhXn(Xn)

]= E

[1

nhXn(Xn) 1

0 ≤ 1

nhXn(Xn) ≤ lim

δ↑1Hδ(X) + ε

]+ E

[1

nhXn(Xn) 1

1

nhXn(Xn) > lim

δ↑1Hδ(X) + ε

].(6.2.1)

From the definition of limε↑1Hδ(X), it directly follows that the first term in theright hand side of (6.2.1) is upper bounded by limδ↑1Hδ(X) + ε, and that theliminf of the second term is zero. Thus

T (X) = limδ↑1Hδ(X) ≥ lim inf

n→∞

1

nH(Xn).

2. [T (X) ≤ lim infn→∞(1/n)H(Xn)]

101

Fix ε > 0. For infinitely many n,

Pr

hXn(Xn)

H(Xn)− 1 > ε

= Pr

1

nhXn(Xn) > (1 + ε)

(1

nH(Xn)

)≥ Pr

1

nhXn(Xn) > (1 + ε)

(lim infn→∞

1

nH(Xn) + ε

).

Since X is information stable, we obtain that

lim infn→∞

Pr

1

nhXn(Xn) > (1 + ε)

(lim infn→∞

1

nH(Xn) + ε

)= 0.

By the definition of limδ↑1Hδ(X), the above implies that

T (X) = limδ↑1Hδ(X) ≤ (1 + ε)

(lim infn→∞

1

nH(Xn) + ε

).

The proof is completed by noting that ε can be made arbitrarily small. 2

It is worth pointing out that if the source X is both information stable andstationary, the above Lemma yields

T (X) = T (X) = limn→∞

1

nH(Xn).

This implies that given a stationary and information stable source X, the clas-sical separation theorem holds for every channel.

6.3 Optimistic channel coding theorems

In this section, we state without proving the general expressions for the opti-mistic ε-capacity2 (Cε) and for the optimistic capacity (C) of arbitrary single-user channels. The proofs of these expressions are straightforward once the rightdefinition (of Iε(X;Y )) is made. They employ Feinstein’s Lemma and the Poor-Verdu bound, and follow the same arguments used in Theorem 5.10 to show thegeneral expressions of the conventional channel capacity

C = supX

I0(X;Y ) = supX

I(X;Y ),

2The authors would like to point out that the expression of Cε was also separately obtainedin [10, Theorem 7].

102

and the conventional ε-capacity

supX

limδ↑ε

Iδ(X;Y ) ≤ Cε ≤ supX

Iε(X;Y ).

We close this section by proving the formula of C for information stable channels.

Definition 6.6 (optimistic ε-achievable rate) Fix 0 < ε < 1. R ≥ 0 is anoptimistic ε-achievable rate if there exists a sequence of C∼n = (n,Mn) channelblock codes such that

lim infn→∞

1

nlogMn ≥ R

andlim infn→∞

Pe( C∼n) ≤ ε.

Definition 6.7 (optimistic ε-capacity Cε) Fix 0 < ε < 1. The supremum ofoptimistic ε-achievable rates is called the optimistic ε-capacity, Cε.

It is straightforward for the definition that Cε is non-decreasing in ε, andC1 = log |X |.

Definition 6.8 (optimistic capacity C) The optimistic channel capacity Cis defined as the supremum of the rates that are ε-achievable for all ε ∈ [0, 1]. Itfollows immediately from the definition that C = inf0≤ε≤1Cε = limε↓0Cε = C0

and that C is the supremum of all the rates R for which there exists a sequenceof C∼n = (n,Mn) channel block codes such that

lim infn→∞

1

nlogMn ≥ R,

andlim infn→∞

Pe( C∼n) = 0.

Theorem 6.9 (optimistic ε-capacity formula) Fix 0 < ε < 1. The opti-mistic ε-capacity Cε satisfies

supX

limδ↑ε

Iδ(X;Y ) ≤ Cε ≤ supXIε(X;Y ). (6.3.1)

Note that actually Cε = supX Iε(X;Y ), except possibly at the points of discon-tinuities of supX Iε(X;Y ) (which are countable).

Theorem 6.10 (optimistic capacity formula) The optimistic capacity Csatisfies

C = supXI0(X;Y ).

103

We next investigate the expression of C for information stable channels.The expression for the capacity of information stable channels is already known(cf. for example [11])

C = lim infn→∞

Cn.

We prove a dual formula for C.

Definition 6.11 (Information stable channels [6, 8]) A channel W is saidto be information stable if there exists an input processX such that 0 < Cn <∞for n sufficiently large, and

lim supn→∞

Pr

[∣∣∣∣iXnWn(Xn;Y n)

nCn− 1

∣∣∣∣ > γ

]= 0

for every γ > 0.

Lemma 6.12 Every information stable channel W satisfies

C = lim supn→∞

supXn

1

nI(Xn;Y n).

Proof:

1. [C ≤ lim supn→∞ supXn(1/n)I(Xn;Y n)]

By using a similar argument as in the proof of [12, Theorem 8, property h)],we have

I0(X;Y ) ≤ lim supn→∞

supXn

1

nI(Xn;Y n).

Hence,

C = supXI0(X;Y ) ≤ lim sup

n→∞supXn

1

nI(Xn;Y n).

2. [C ≥ lim supn→∞ supXn(1/n)I(Xn;Y n)]

Suppose X is the input process that makes the channel information stable.Fix ε > 0. Then for infinitely many n,

PXnWn

[1

niXnWn(Xn;Y n) ≤ (1− ε)(lim sup

n→∞Cn − ε)

]≤ PXnWn

[iXnWn(Xn;Y n)

n< (1− ε)Cn

]

= PXnWn

[iXnWn(Xn;Y n)

nCn− 1 < −ε

].

104

Since the channel is information stable, we get that

lim infn→∞

PXnWn

[1

niXnWn(Xn;Y n) ≤ (1− ε)(lim sup

n→∞Cn − ε)

]= 0.

By the definition of C, the above immediately implies that

C = supXI0(X;Y ) ≥ I0(X;Y ) ≥ (1− ε)(lim sup

n→∞Cn − ε).

Finally, the proof is completed by noting that ε can be made arbitrarily small.2

Observations:

• It is known that for discrete memoryless channels, the optimistic capacity Cis equal to the (conventional) capacity C [12, 5]. The same result holds formodulo−q additive noise channels with stationary ergodic noise. However,in general, C ≥ C since I0(X;Y ) ≥ I(X;Y ) [3, 4].

• Remark that Theorem 11 in [11] holds if, and only if,

supX

I(X;Y ) = supXI0(X;Y ).

Furthermore, note that, if C = C and there exists an input distributionPX that achieves C, then PX also achieves C.

6.4 Examples

We provide four examples to illustrate the computation of C and C. The first twoexamples present information stable channels for which C > C. The third exam-ple shows an information unstable channel for which C = C. These examples in-dicate that information stability is neither necessary nor sufficient to ensure thatC = C or thereby the validity of the classical source-channel separation theorem.The last example illustrates the situation where 0 < C < C < CSC < log2 |Y|,where CSC is the channel strong capacity3. We assume in this section that alllogarithms are in base 2 so that C and C are measured in bits.

3The strong (or strong converse) capacity CSC (cf. [2]) is defined as the infimum of the num-bers R for which for all (n,Mn) codes with (1/n) logMn ≥ R, lim infn→∞ Pe( C∼n) = 1. This def-inition of CSC implies that for any sequence of (n,Mn) codes with lim infn→∞(1/n) logMn >CSC , Pe( C∼n) > 1 − ε for every ε > 0 and for n sufficiently large. It is shown in [2] thatCSC = limε↑1 Cε = supX I(X;Y ).

105

6.4.1 Information stable channels

Example 6.13 Consider a nonstationary channel W such that at odd time in-stances n = 1, 3, · · · , W n is the product of the transition distribution of a binarysymmetric channel with crossover probability 1/8 (BSC(1/8)), and at even timeinstances n = 2, 4, 6, · · · , W n is the product of the distribution of a BSC(1/4). Itcan be easily verified that this channel is information stable. Since the channelis symmetric, a Bernoulli(1/2) input achieves Cn = supXn(1/n)I(Xn;Y n); thus

Cn =

1− hb(1/8), for n odd;

1− hb(1/4), for n even,

where hb(a) , −a log2 a − (1 − a) log2(1 − a) is the binary entropy function.Therefore, C = lim infn→∞Cn = 1 − hb(1/4) and C = lim supn→∞Cn = 1 −hb(1/8) > C.

Example 6.14 Here we use the information stable channel provided in [11,Section III] to show that C > C. Let N be the set of all positive integers.Define the set J as

J , n ∈ N : 22i+1 ≤ n < 22i+2, i = 0, 1, 2, . . .= 2, 3, 8, 9, 10, 11, 12, 13, 14, 15, 32, 33, · · · , 63, 128, 129, · · · , 255, · · · .

Consider the following nonstationary symmetric channel W . At times n ∈ J ,Wn is a BSC(0), whereas at times n 6∈ J , Wn is a BSC(1/2). Put W n =W1 ×W2 × · · · ×Wn. Here again Cn is achieved by a Bernoulli(1/2) input Xn.We then obtain

Cn =1

n

n∑i=1

I(Xi;Yi) =1

n[J(n) · 1 + (n− J(n)) · 0] =

J(n)

n,

where J(n) , |J ∩ 1, 2, · · · , n|. It can be shown that

J(n)

n=

1− 2

3× 2blog2 nc

n+

1

3n, for blog2 nc odd;

2

3× 2blog2 nc

n− 2

3n, for blog2 nc even.

Consequently, C = lim infn→∞Cn = 1/3 and C = lim supn→∞Cn = 2/3.

6.4.2 Information unstable channels

Example 6.15 The Polya-contagion channel: Consider a discrete additive cha-nnel with binary input and output alphabet 0, 1 described by

Yi = Xi ⊕ Zi, i = 1, 2, · · · ,

106

whereXi, Yi and Zi are respectively the i-th input, i-th output and i-th noise, and⊕ represents modulo-2 addition. Suppose that the input process is independentof the noise process. Also assume that the noise sequence Znn≥1 is drawnaccording to the Polya contagion urn scheme [1, 9] as follows: an urn originallycontains R red balls and B black balls with R < B; the noise just make successivedraws from the urn; after each draw, it returns to the urn 1 + ∆ balls of thesame color as was just drawn (∆ > 0). The noise sequence Zi corresponds tothe outcomes of the draws from the Polya urn: Zi = 1 if ith ball drawn is redand Zi = 0, otherwise. Let ρ , R/(R + B) and δ , ∆/(R + B). It is shown in[1] that the noise process Zi is stationary and nonergodic; thus the channel isinformation unstable.

From Lemma 2 and Section IV in [4, Part I], we obtain

1− H1−ε(X) ≤ Cε ≤ 1− limδ↑(1−ε)

Hδ(X),

and1−H1−ε(X) ≤ Cε ≤ 1− lim

δ↑(1−ε)Hδ(X).

It has been shown [1] that −(1/n) logPXn(Xn) converges in distribution to thecontinuous random variable V , hb(U), where U is beta-distributed with pa-rameters (ρ/δ, (1− ρ)/δ), and hb(·) is the binary entropy function. Thus

H1−ε(X) = limδ↑(1−ε)

Hδ(X) =H1−ε(X) = limδ↑(1−ε)

Hδ(X) = F−1V (1− ε),

where FV (a) , PrV ≤ a is the cumulative distribution function of V , andF−1V (·) is its inverse [1]. Consequently,

Cε = Cε = 1− F−1V (1− ε),

andC = C = lim

ε↓0

[1− F−1

V (1− ε)]

= 0.

Example 6.16 Let W1, W2, . . . consist of the channel in Example 6.14, and letW1, W2, . . . consist of the channel in Example 6.15. Define a new channel W asfollows:

W2i = Wi and W2i−1 = Wi for i = 1, 2, · · · .

As in the previous examples, the channel is symmetric, and a Bernoulli(1/2)input maximizes the inf/sup information rates. Therefore for a Bernoulli(1/2)

107

input X, we have

Pr

1

nlog

PWn(Y n|Xn)

PY n(Y n)≤ θ

=

Pr

1

2i

[log

PW i(Y i|X i)

PY i(Y i)+ log

PW i(Y i|X i)

PY i(Y i)

]≤ θ

,

if n = 2i;

Pr

1

2i+ 1

[log

PW i(Y i|X i)

PY i(Y i)+ log

PW i+1(Y i+1|X i+1)

PY i+1(Y i+1)

]≤ θ

,

if n = 2i+ 1;

=

1− Pr−1

ilogPZi(Z

i) < 1− 2θ +1

iJ(i)

,

if n = 2i;

1− Pr− 1

i+ 1logPZi+1(Zi+1) < 1−

(2− 1

i+ 1

)θ +

1

i+ 1J(i)

,

if n = 2i+ 1.

The fact that −(1/i) log[PZi(Zi)] converges in distribution to the continuous ran-

dom variable V , hb(U), where U is beta-distributed with parameters (ρ/δ, (1−ρ)/δ), and the fact that

lim infn→∞

1

nJ(n) =

1

3and lim sup

n→∞

1

nJ(n) =

2

3

imply that

i(θ) , lim infn→∞

Pr

1

nlog

PWn(Y n|Xn)

PY n(Y n)≤ θ

= 1− FV

(5

3− 2θ

),

and

i(θ) , lim supn→∞

Pr

1

nlog

PWn(Y n|Xn)

PY n(Y n)≤ θ

= 1− FV

(4

3− 2θ

).

Consequently,

Cε =5

6− 1

2F−1V (1− ε) and Cε =

2

3− 1

2F−1V (1− ε).

Thus

0 < C =1

6< C =

1

3< CSC =

5

6< log2 |Y| = 1.

108

Bibliography

[1] F. Alajaji and T. Fuja, “A communication channel modeled on contagion,”IEEE Trans. on Information Theory, vol. IT–40, no. 6, pp. 2035–2041,Nov. 1994.

[2] P.-N. Chen and F. Alajaji, “Strong converse, feedback capacity and hypoth-esis testing,” Proc. of CISS, John Hopkins Univ., Baltimore, Mar. 1995.

[3] P.-N. Chen and F. Alajaji, “Generalization of information measures,”Proc. Int. Symp. Inform. Theory & Applications, Victoria, Canada, Septem-ber 1996.

[4] P.-N. Chen and F. Alajaji, “Generalized source coding theorems and hy-pothesis testing,” Journal of the Chinese Institute of Engineering, vol. 21,no. 3, pp. 283-303, May 1998.

[5] I. Csiszar and J. Korner, Information Theory: Coding Theorems for DiscreteMemoryless Systems, Academic, New York, 1981.

[6] R. L. Dobrushin, “General formulation of Shannon’s basic theorems of infor-mation theory,” AMS Translations, vol. 33, pp. 323-438, AMS, Providence,RI, 1963.

[7] T. S. Han, Information-Spectrum Methods in Information Theory, (inJapanese), Baifukan Press, Tokyo, 1998.

[8] M. S. Pinsker, Information and Information Stability of Random Variablesand Processes, Holden-Day, 1964.

[9] G. Polya, “Sur quelques points de la theorie des probabilites,” Ann. Inst.H. Poincarre, vol. 1, pp. 117-161, 1931.

[10] Y. Steinberg, “New converses in the theory of identification via channels,”IEEE Trans. on Information Theory, vol. 44, no. 3, pp. 984–998, May 1998.

109

[11] S. Vembu, S. Verdu and Y. Steinberg, “The source-channel separation theo-rem revisited,” IEEE Trans. on Information Theory, vol. 41, no. 1, pp. 44–54, Jan. 1995.

[12] S. Verdu and T. S. Han, “A general formula for channel capacity,” IEEETrans. on Information Theory, vol. IT–40, no. 4, pp. 1147–1157, Jul. 1994.

110

Chapter 7

Lossy Data Compression

In this chapter, a rate-distortion theorem for arbitrary (not necessarily stationaryor ergodic) discrete-time finite-alphabet sources is shown. The expression of theminimum ε-achievable fixed-length coding rate subject to a fidelity criterion willalso be provided.

7.1 General lossy source compression for block codes

In this section, we consider the problem of source coding with a fidelity criterionfor arbitrary (not necessarily stationary or ergodic) discrete-time finite-alphabetsources. We prove a general rate-distortion theorem by establishing the expres-sion of the minimum ε-achievable block coding rate subject to a fidelity criterion.We will relax all the constraints on the source statistics and distortion measure.In other words, the source is not restricted to DMS, and the distortion measurecan be arbitrary.

Definition 7.1 (lossy compression block code) Given a finite source alpha-bet Z and a finite reproduction alphabet Z, a block code for data compressionof blocklength n and size M is a mapping fn(·) : Zn → Zn that resultsin ‖fn‖ = M codewords of length n, where each codeword is a sequence of nreproducing letters.

Definition 7.2 (distortion measure) A distortion measure ρn(·, ·) is a map-ping

ρn : Zn × Zn → <+ , [0,∞).

We can view the distortion measure as the cost of representing a source n-tuplezn by a reproduction n-tuple fn(zn).

111

Similar to the general results for entropy rate and information rate, the av-erage distortion measure no longer has nice properties, such as convergence-in-probability, in general. Instead, its ultimate probability may span over someregions. Hence, in order to derive the fundamental limit for lossy data compres-sion code rate, a different technique should be applied. This technique, whichemploying the spectrum of the normalized distortion density defined later, is infact the same as that for general lossless source coding theorem and generalchannel coding theorem.

Definition 7.3 (distortion inf-spectrum) Let (Z, Z) and ρn(·, ·)n≥1 begiven. The distortion inf-spectrum λZ,Z(θ) is defined by

λZ,Z(θ) , lim infn→∞

Pr

1

nρn(Zn, Zn) ≤ θ

.

Definition 7.4 (distortion inf-spectrum for lossy compression code f)Let Z and ρn(·, ·)n≥1 be given. Let f(·) , fn(·)∞n=1 denote a sequence of(lossy) data compression codes. The distortion inf-spectrum λZ,f(Z)(θ) for f(·)is defined by

λZ,f(Z)(θ) , lim infn→∞

Pr

1

nρn(Zn, fn(Zn)) ≤ θ

.

Definition 7.5 Fix D > 0 and 1 > ε > 0. R is a ε-achievable data compressionrate at distortion D for a source Z if there exists a sequence of data compressioncodes fn(·) with

lim supn→∞

1

nlog ‖fn‖ ≤ R,

andsup

[θ : λZ,f(Z)(θ) ≤ ε

]≤ D. (7.1.1)

Note that (7.1.1), which can be re-written as:

inf

[θ : lim sup

n→∞Pr

1

nρn(Zn, fn(Zn)) > θ

< 1− ε

]≤ D,

is equivalent to stating that the limsup of the probability of excessive distortion(i.e., distortion larger than D) is smaller than 1− ε.

The infimum ε-achievable data compression rate at distortion D for Z isdenoted by Tε(D,Z).

112

Theorem 7.6 (general data compression theorem) Fix D > 0 and 1 >ε > 0. Let Z and ρn(·, ·)n≥1 be given. Then

Rε(D) ≤ Tε(D,Z) ≤ Rε(D − γ),

for any γ > 0, where

Rε(D) , infPZ|Z : sup

[θ : λZ,Z(θ) ≤ ε

]≤ D

I(Z; Z),

where the infimum is taken over all conditional distributions PZ|Z for which thejoint distribution PZ,Z = PZPZ|Z satisfies the distortion constraint.

Proof:

1. Forward part (achievability): Tε(D,Z) ≤ Rε(D−γ) or Tε(D+γ,Z) ≤ Rε(D).Choose γ > 0. We will prove the existence of a sequence of data compressioncodes with

lim supn→∞

1

nlog ‖fn‖ ≤ Rε(D) + 2γ,

andsup

[θ : λZ,f(Z)(θ) ≤ ε

]≤ D + γ.

step 1: Let PZ|Z be the distribution achieving Rε(D).

step 2: Let R = Rε(D) + 2γ. Choose M = enR codewords of blocklength nindependently according to PZn , where

PZn(zn) =∑zn∈Zn

PZn(zn)PZn|Zn(zn|zn),

and denote the resulting random set by Cn.

step 3: For a given Cn, we denote by A(Cn) the set of sequences zn ∈ Zn suchthat there exists zn ∈ Cn with

1

nρn(zn, zn) ≤ D + γ.

step 4: Claim:lim supn→∞

EZn [PZn(Ac(Cn))] < ε.

The proof of this claim will be provided as a lemma following this theorem.

Therefore there exists (a sequence of) C∗n such that

lim supn→∞

PZn(Ac(C∗n)) < ε.

113

step 5: Define a sequence of codes fnn≥1 by

fn(zn) =

arg min

zn∈C∗nρn(zn, zn), if zn ∈ A(C∗n);

0, otherwise,

where 0 is a fixed default n-tuple in Zn.

Then zn ∈ Zn :

1

nρn(zn, fn(zn)) ≤ D + γ

⊃ A(C∗n),

since (∀ zn ∈ A(C∗n)) there exists zn ∈ C∗n such that (1/n)ρn(zn, zn) ≤ D+γ,which by definition of fn implies that (1/n)ρn(zn, fn(zn)) ≤ D + γ.

step 6: Consequently,

λ(Z,f(Z))(D + γ) = lim infn→∞

PZn

zn ∈ Zn :

1

nρn(zn, f(zn)) ≤ D + γ

≥ lim inf

n→∞PZn(A(C∗n))

= 1− lim supn→∞

PZn(Ac(C∗n))

> ε.

Hence,sup

[θ : λZ,f(Z)(θ) ≤ ε

]≤ D + γ,

where the last step is clearly depicted in Figure 7.1.

This proves the forward part.

2. Converse part: Tε(D,Z) ≥ Rε(D). We show that for any sequence of encodersfn(·)∞n=1, if

sup[θ : λZ,f(Z)(θ) ≤ ε

]≤ D,

then

lim supn→∞

1

nlog ‖fn‖ ≥ Rε(D).

Let

PZn|Zn(zn|zn) ,

1, if zn = fn(zn);0, otherwise.

Then to evaluate the statistical properties of the random sequence

(1/n)ρn(Zn, fn(Zn)∞n=1

114

-

6

D + γ

λZ,f(Z)(D + γ)

λZ,f(Z)(θ)

θsup

[θ : λZ,f(Z)(θ) ≤ ε

]

ε

Figure 7.1: λZ,f(Z)(D + γ) > ε⇒ sup[θ : λZ,f(Z)(θ) ≤ ε] ≤ D + γ.

under distribution PZn is equivalent to evaluating those of the random sequence(1/n)ρn(Zn, Zn)∞n=1 under distribution PZn,Zn . Therefore,

Rε(D) , infPZ|Z : sup[θ : λZ,Z(θ) ≤ ε] ≤ D

I(Z; Z)

≤ I(Z; Z)

≤ H(Z)−H(Z|Z)

≤ H(Z)

≤ lim supn→∞

1

nlog ‖fn‖,

where the second inequality follows from (2.4.3) (with γ ↑ 1 and δ = 0), and thethird inequality follows from the fact thatH(Z|Z) ≥ 0. 2

Lemma 7.7 (cf. Proof of Theorem 7.6)

lim supn→∞

EZn [PZn(Ac(C∗n))] < ε.

Proof:

step 1: LetD(ε) , sup

θ : λZ,Z(θ) ≤ ε

.

115

Define

A(ε)n,γ ,

(zn, zn) :

1

nρn(zn, zn) ≤ D(ε) + γ

and1

niZn,Zn(zn, zn) ≤ I(Z; Z) + γ

.

Since

lim infn→∞

Pr

(D ,

1

nρn(Zn, Zn) ≤ D(ε) + γ

)> ε,

and

lim infn→∞

Pr

(E ,

1

niZn,Zn(Zn; Zn) ≤ I(Z; Z) + γ

)= 1,

we have

lim infn→∞

Pr(A(ε)n,γ) = lim inf

n→∞Pr(D ∩ E)

≥ lim infn→∞

Pr(D) + lim infn→∞

Pr(E)− 1

> ε+ 1− 1 = ε.

step 2: Let K(zn, zn) be the indicator function of A(ε)n,γ:

K(zn, zn) =

1, if (zn, zn) ∈ A(ε)

n,γ;0, otherwise.

step 3: By following a similar argument in step 4 of the achievability part ofTheorem 8.3 of Volume I, we obtain

EZn [PZn(Ac(C∗n))] =∑C∗n

PZn(C∗n)∑

zn 6∈A(C∗n)

PZn(zn)

=∑zn∈Zn

PZn(zn)∑

C∗n:zn 6∈A(C∗n)

PZn(C∗n)

=∑zn∈Zn

PZn(zn)

1−∑zn∈Zn

PZn(yn)K(zn, zn)

M

≤∑zn∈Zn

PZn(zn)(

1− e−n(I(Z;Z)+γ)

×∑zn∈Zn

PZn|Zn(zn|zn)K(zn, zn)

M

≤ 1−∑zn∈Zn

∑zn∈Zn

PZn(zn)PZn|Zn(zn, zn)K(zn, zn)

+ exp−en(R−Rε(D)−γ)

.

116

Therefore,

lim supn→∞

EZn [PZn(An(C∗n))] ≤ 1− lim infn→∞

Pr(A(ε)n,γ)

< 1− ε.

2

For the probability-of-error distortion measure ρn : Zn → Zn, namely,

ρn(zn, zn) =

n, if zn 6= zn;0, otherwise,

we define a data compression code fn : Zn → Zn based on a chosen (asymptotic)lossless fixed-length data compression code book C∼n ⊂ Zn:

fn(zn) =

zn, if zn ∈ C∼n;

0, if zn 6∈ C∼n,

where 0 is some default element in Zn. Then (1/n)ρn(zn, fn(zn)) is either 1 or0 which results in a cumulative distribution function as shown in Figure 7.2.Consequently, for any δ ∈ [0, 1),

Pr

1

nρn(Zn, fn(Zn)) ≤ δ

= Pr Zn = fn(Zn) .

-ccss

0 δ 1 D

1

PrZn = fn(Zn)

Figure 7.2: The CDF of (1/n)ρn(Zn, fn(Zn)) for the probability-of-error distortion measure.

117

By comparing the (asymptotic) lossless and lossy fixed-length compressiontheorems under the probability-of-error distortion measure, we observe that

Rε(δ) = infPZ|Z : sup[θ : λZ,Z(θ) ≤ ε] ≤ δ

I(Z; Z)

=

0, δ ≥ 1;

infPZ|Z : lim inf

n→∞PrZn = Zn > ε

I(Z; Z), δ < 1,

=

0, δ ≥ 1;

infPZ|Z : lim sup

n→∞PrZn 6= Zn ≤ 1− ε

I(Z; Z), δ < 1,

where

λZ,Z(θ) =

lim infn→∞

PrZn = Zn

, 0 ≤ θ < 1;

1, θ ≥ 1.

In particular, in the extreme case where ε goes to one,

H(Z) = infPZ|Z : lim sup

n→∞PrZn 6= Zn = 0

I(Z; Z).

Therefore, in this case, the data compression theorem reduces (as expected) tothe asymptotic lossless fixed-length data compression theorem.

118

Chapter 8

Hypothesis Testing

8.1 Error exponent and divergence

Divergence can be adopted as a measure of how similar two distributions are.This quantity has the operational meaning that it is the exponent of the type IIerror probability for hypothesis testing system of fixed test level. For rigorous-ness, the exponent is first defined below.

Definition 8.1 (exponent) A real number a is said to be the exponent for asequence of non-negative quantities ann≥1 converging to zero, if

a = limn→∞

(− 1

nlog an

).

In operation, exponent is an index of the exponential rate-of-convergence forsequence an. We can say that for any γ > 0,

e−n(a+γ) ≤ an ≤ e−n(a−γ),

as n large enough.

Recall that in proving the channel coding theorem, the probability of decod-ing error for channel block codes can be made arbitrarily close to zero when therate of the codes is less than channel capacity. This result can be mathematicallywritten as:

Pe( C∼∗n)→ 0, as n→∞,

provided R = lim supn→∞(1/n) log ‖ C∼∗n‖ < C, where C∼∗n is the optimal code forblock length n. From the theorem, we only know the decoding error will vanishas block length increases; but, it does not reveal that how fast the decoding errorapproaches zero. In other words, we do not know the rate-of-convergence of the

119

decoding error. Sometimes, this information is very important, especially forone to decide the sufficient block length to achieve some error bound.

The first step of investigating the rate-of-convergence of the decoding erroris to compute its exponent, if the decoding error decays to zero exponentiallyfast (it indeed does for memoryless channels.) This exponent, as a function ofthe rate, is in fact called the channel reliability function, and will be discussedin the next chapter.

For the hypothesis testing problems, the type II error probability of fixed testlevel also decays to zero as the number of observations increases. As it turnsout, its exponent is the divergence of the null hypothesis distribution againstalternative hypothesis distribution.

Lemma 8.2 (Stein’s lemma) For a sequence of i.i.d. observations Xn whichis possibly drawn from either null hypothesis distribution PXn or alternativehypothesis distribution PXn , the type II error satisfies

(∀ ε ∈ (0, 1)) limn→∞

− 1

nlog β∗n(ε) = D(PX‖PX),

where β∗n(ε) = minαn≤ε βn, and αn and βn represent the type I and type II errors,respectively.

Proof: [1. Forward Part]

In the forward part, we prove that there exists an acceptance region for nullhypothesis such that

lim infn→∞

− 1

nlog βn(ε) ≥ D(PX‖PX).

step 1: divergence typical set. For any δ > 0, define divergence typical setas

An(δ) ,

xn :

∣∣∣∣ 1n logPXn(xn)

PXn(xn)−D(PX‖PX)

∣∣∣∣ < δ

.

Note that in divergence typical set,

PXn(xn) ≤ PXn(xn)e−n(D(PX‖PX)−δ).

step 2: computation of type I error. By weak law of large number,

PXn(An(δ))→ 1.

Hence,αn = PXn(Acn(δ)) < ε,

for sufficiently large n.

120

step 3: computation of type II error.

βn(ε) = PXn(An(δ))

=∑

xn∈An(δ)

PXn(xn)

≤∑

xn∈An(δ)

PXn(xn)e−n(D(PX‖PX)−δ)

≤ e−n(D(PX‖PX)−δ)∑

xn∈An(δ)

PXn(xn)

≤ e−n(D(PX‖PX)−δ)(1− αn).

Hence,

− 1

nlog βn(ε) ≥ D(PX‖PX)− δ +

1

nlog(1− αn),

which implies

lim infn→∞

− 1

nlog βn(ε) ≥ D(PX‖PX)− δ.

The above inequality is true for any δ > 0; therefore

lim infn→∞

− 1

nlog βn(ε) ≥ D(PX‖PX).

[2. Converse Part]

In the converse part, we will prove that for any acceptance region for nullhypothesis Bn satisfying the type I error constraint, i.e.,

αn(Bn) = PXn(Bcn) ≤ ε,

its type II error βn(Bn) satisfies

lim supn→∞

− 1

nlog βn(Bn) ≤ D(PX‖PX).

βn(Bn) = PXn(Bn) ≥ PXn(Bn ∩ An(δ))

≥∑

xn∈Bn∩An(δ)

PXn(xn)

≥∑

xn∈Bn∩An(δ)

PXn(xn)e−n(D(PX‖PX)+δ)

= e−n(D(PX‖PX)+δ)PXn(Bn ∩ An(δ))

≥ e−n(D(PX‖PX)+δ) [1− PXn(Bcn)− PXn (Acn(δ))]

≥ e−n(D(PX‖PX)+δ) [1− αn(Bn)− PXn (Acn(δ))]

≥ e−n(D(PX‖PX)+δ) [1− ε− PXn (Acn(δ))] .

121

Hence,

− 1

nlog βn(Bn) ≤ D(PX‖PX) + δ +

1

nlog [1− ε− PXn (Acn(δ))] ,

which implies that

lim supn→∞

− 1

nlog βn(Bn) ≤ D(PX‖PX) + δ.

The above inequality is true for any δ > 0; therefore,

lim supn→∞

− 1

nlog βn(Bn) ≤ D(PX‖PX).

2

8.1.1 Composition of sequence of i.i.d. observations

Stein’s lemma gives the exponent of the type II error probability for fixed testlevel. As a result, this exponent, which is the divergence of null hypothesisdistribution against alternative hypothesis distribution, is independent of thetype I error bound ε for i.i.d. observations.

Specifically under i.i.d. environment, the probability for each sequence of xn

depends only on its composition, which is defined as an |X |-dimensional vector,and is of the form (

#1(xn)

n,#2(xn)

n, . . . ,

#k(xn)

n

),

where X = 1, 2, . . . , k, and #i(xn) is the number of occurrences of symbol iin xn. The probability of xn is therefore can be written as

PXn(xn) = PX(1)#1(xn) × PX(2)#2(xn) × PX(k)#k(xn).

Note that #1(xn) + · · · + #k(xn) = n. Since the composition of a sequencedecides its probability deterministically, all sequences with the same compositionshould have the same statistical property, and hence should be treated the samewhen processing. Instead of manipulating the sequences of observations basedon the typical-set-like concept, we may focus on their compositions. As it turnsout, such approach yields simpler proofs and better geometrical explanations fortheories under i.i.d. environment. (It needs to be pointed out that for cases whencomposition alone can not decide the probability, this viewpoint does not seemto be effective.)

Lemma 8.3 (polynomial bound on number of composition) The num-ber of compositions increases polynomially fast, while the number of possiblesequences increases exponentially fast.

122

Proof: Let Pn denotes the set of all possible compositions for xn ∈ X n. Thensince each numerator of the components in |X |-dimensional composition vectorranges from 0 to n, it is obvious that |Pn| ≤ (n + 1)|X | which increases polyno-mially with n. However, the number of possible sequences is |X |n which is ofexponential order w.r.t. n. 2

Each composition (or |X |-dimensional vector) actually represents a possibleprobability mass distribution. In terms of the composition distribution, we cancompute the exponent of the probability of those observations that belong tothe same composition.

Lemma 8.4 (probability of sequences of the same composition) Theprobability of the sequences of composition C with respect to distribution PXn

satisfies1

(n+ 1)|X |e−nD(PC‖PX) ≤ PXn(C) ≤ e−nD(PC‖PX),

where PC is the composition distribution for composition C, and C (by abusingnotation without ambiguity) is also used to represent the set of all sequences (inX n) of composition C.

Proof: For any sequence xn of composition

C =

(#1(xn)

n,#2(xn)

n, . . . ,

#k(xn)

n

)= (PC(1), PC(2), . . . , PC(k)) ,

where k = |X |,

PXn(xn) =k∏i=1

PX(i)#i(xn)

=k∏x=1

PX(x)n·PC(x)

=k∏x=1

en·PC(x) logPX(x)

= en∑x∈X PC(x) logPX(x)

= e−n(∑x∈X PC(x) log[PC(x)/PX(x)]−PC(x) logPC(x))

= e−n(D(PC‖PX)+H(PC)).

Similarly, we have

PC(xn) = e−n[D(PC‖PC)+H(PC)] = e−nH(PC). (8.1.1)

123

(PC originally represents a one-dimensional distribution over X . Here, withoutambiguity, we re-use the same notation of PC(·) to represent its n-dimensionalextension, i.e., PC(x

n) =∏n

i=1 PC(xi).) We can therefore bound the number ofsequence in C as

1 ≥∑xn∈C

PC(xn)

=∑xn∈C

e−nH(PC)

= |C|e−nH(PC),

which implies |C| ≤ enH(PC). Consequently,

PXn(C) =∑xn∈C

PXn(xn)

=∑xn∈C

e−n(D(PC‖PX)+H(PC))

= |C|e−n(D(PC‖PX)+H(PC))

≤ enH(PC)e−n(D(PC‖PX)+H(PC))

= e−n·D(PC‖PX). (8.1.2)

This ends the proof of the upper bound.

From (8.1.2), we observe that to derive a lower bound of PXn(C) is equiva-lent to find a lower bound of |C|, which requires the next inequality. For anycomposition C ′, and its corresponding composition set C ′,

PC(C)PC(C ′)

=|C||C ′|

∏i∈X PC(i)

n·PC(i)∏i∈X PC(i)

n·PC′ (i)

=

n!∏j∈X (n · PC(j))!

n!∏j∈X (n · PC′(j))!

×∏

i∈X PC(i)n·PC(i)∏

i∈X PC(i)n·PC′ (i)

=∏j∈X

(n · PC′(j))!(n · PC(j))!

×∏

i∈X PC(i)n·PC(i)∏

i∈X PC(i)n·PC′ (i)

≥∏j∈X

(n · PC(j))n·PC′ (j)−n·PC(j)∏i∈X

PC(i)n·PC(i)−n·PC′ (i)

using the inequalitym!

n!≥ nm

nn;

= nn∑j∈X (PC′ (j)−PC(j))

= n0 = 1. (8.1.3)

124

Hence,

1 =∑C′

PC(C ′)

≤∑C′

maxC′′

PC(C ′′)

=∑C′

PC(C) (8.1.4)

≤ (n+ 1)|X |PC(C)= (n+ 1)|X ||C|e−nH(PC), (8.1.5)

where (8.1.4) and (8.1.5) follow from (8.1.3) and (8.1.1), respectively. Therefore,a lower bound of |C| is obtained as:

|C| ≥ 1

(n+ 1)|X |enH(PC).

2

Theorem 8.5 (Sanov’s Theorem) For a set of compositions En and a giventrue distribution PXn , the exponent of PXn(En) is given by

minC∈En

D(PC‖PX),

where PC is the composition distribution for composition C.

Proof:

PXn(En) =∑C∈En

PXn(C)

≤∑C∈En

e−nD(PC‖PX)

≤∑C∈En

maxC∈En

e−nD(PC‖PX)

=∑C∈En

e−n·minC∈En D(PC‖PX)

≤ (n+ 1)|X |e−n·minC∈En D(PC‖PX).

Let C∗ ∈ En satisfying that

D(PC∗‖PX) = minC∈En

D(PC‖PX).

125

PXn(En) =∑C∈En

PXn(C)

≥∑C∈En

1

(n+ 1)|X |e−nD(PC‖PX)

≥ 1

(n+ 1)|X |e−nD(PC∗‖PX).

Therefore,

1

(n+ 1)|X |e−nD(PC∗‖PX) ≤ PXn(En) ≤ (n+ 1)|X |e−n·minC∈En D(PC‖PX),

which implies

limn→∞

− 1

nlogPXn(En) = min

C∈EnD(PC‖PX).

2

The geometrical explanation for Sanov’s theorem is depicted in Figure 8.1.In the figure, it shows that if we adopt the divergence as the measure of distancein distribution space (the outer triangle region in Figure 8.1), then En is just aset of (empirical) distributions. Then for any distribution PX outside En, we cancertainly measure the shortest distance from PX to En in terms of the divergencemeasure. This shortest distance, as it turns out, is indeed the exponent ofPX(En). The following examples show that the viewpoint from Sanov is reallyeffective, especially for i.i.d. observations.

@

@@@@@@@@@@@

En

uPX

uPC1uPC2CC

CC

CC

minC∈En

D(PC‖PX)

Figure 8.1: The geometric meaning for Sanov’s theorem.

Example 8.6 One wants to roughly estimate the probability that the average ofthe throws is greater or equal to 4, when tossing a fair dice n times. He observesthat whether the requirement is satisfied only depends on the compositions of theobservations. Let En be the set of compositions which satisfy the requirement.

126

Also let PX be the probability for tossing a fair dice. Then En can be written as

En =

C :

6∑i=1

iPC(i) ≥ 4

.

To minimize D(PC‖PX) for C ∈ En, we can use the Lagrange multiplier tech-nique (since divergence is convex with respect to the first argument.) with theconstraints on PC being:

6∑i=1

iPC(i) = k and6∑i=1

PC(i) = 1

for k = 4, 5, 6, . . . , n. So it becomes to minimize:

6∑i=1

PC(i) logPC(i)

PX(i)+ λ1

(6∑i=1

iPC(i)− k

)+ λ2

(6∑i=1

PC(i)− 1

).

By taking the derivatives, we found that the minimizer should be of the form

PC(i) =eλ1·i∑6j=1 e

λ1·j,

for λ1 is chosen to satisfy6∑i=1

iPC(i) = k. (8.1.6)

Since the above is true for all k ≥ 4, it suffices to take the smallest one asour solution, i.e., k = 4. Finally, by solving (8.1.6) for k = 4 numerically, theminimizer should be

PC∗ = (0.1031, 0.1227, 0.1461, 0.1740, 0.2072, 0.2468),

and the exponent of the desired probability is D(PC∗‖PX) = 0.0433 nat. Conse-quently,

PXn(En) ≈ e−0.0433n.

8.1.2 Divergence typical set on composition

In Stein’s lemma, one of the most important steps is to define the divergencetypical set, which is

An(δ) ,

xn :

∣∣∣∣ 1n logPXn(xn)

PXn(xn)−D(PX‖PX)

∣∣∣∣ < δ

.

127

One of the key feature of the divergence typical set is that its probability eventu-ally tends to 1 with respect to PXn . In terms of the compositions, we can defineanother divergence typical set with the same feature as

Tn(δ) , xn ∈ X n : D(PCxn‖PX) ≤ δ,

where Cxn represents the composition of xn. The above typical set is re-writtenin terms of composition set as:

Tn(δ) , C : D(PC‖PX) ≤ δ.

PXn(Tn(δ))→ 1 is justified by

1− PXn(Tn(δ)) =∑

C : D(PC‖PX)>δ

PXn(C)

≤∑

C : D(PC‖PX)>δ

e−nD(PC‖PX), from Lemma 8.4.

≤∑

C : D(PC‖PX)>δ

e−nδ

≤ (n+ 1)|X |e−nδ, cf. Lemma 8.3.

8.1.3 Universal source coding on compositions

It is sometimes not possible to know the true source distribution. In such case,instead of finding the best lossless data compression code for a specific sourcedistribution, one may turn to design a code which is good for all possible candi-date distributions (of some specific class.) Here, “good” means that the averagecodeword length achieves the (unknown) source entropy for all sources with thedistributions in the considered class.

To be more precise, let

fn : X n →∞⋃i=1

0, 1i

be a fixed encoding function. Then for any possible source distribution PXn inthe considered class, the resultant average codeword length for this specific codesatisfies

1

n

∑i=1

PXn(xn)`(fn(xn))→ H(X),

as n goes to infinity, where `(·) represents the length of fn(xn). Then fn is saidto be a universal code for the source distributions in the considered class.

128

Example 8.7 (universal encoding using composition) First, binary-indexthe compositions using log2(n + 1)|X | bits, and denote this binary index forcomposition C by a(C). Note that the number of compositions is at most (n +1)|X |.

Let Cxn denote the composition with respect to xn, i.e. xn ∈ Cxn .

For each composition C, we know that the number of sequence xn in C isat most 2n·H(PC) (Here, H(PC) is measured in bits. I.e., the logarithmic basein entropy computation is 2. See the proof of Lemma 8.4). Hence, we canbinary-index the elements in C using n · H(PC) bits. Denote this binary indexfor elements in C by b(C). Define a universal encoding function fn as

fn(xn) = concatenationa(Cxn), b(Cxn).

Then this encoding rule is a universal code for all i.i.d. sources.

Proof: All the logarithmic operations in entropy and divergence are taken to bebase 2 throughout the proof.

For any i.i.d. distribution with marginal PX , its associated average codewordlength is

¯n =

∑xn∈Xn

PXn(xn)`(a(Cxn)) +∑xn∈Xn

PXn(xn)`(b(Cxn))

≤∑xn∈Xn

PXn(xn) · log2(n+ 1)|X | +∑xn∈Xn

PXn(xn) · n ·H(PCxn )

≤ |X | · log2(n+ 1) +∑C

PXn(C) · n ·H(PC).

Hence, the normalized average code length is upper-bounded by

1

n¯n ≤|X | × log2(n+ 1)

n+∑C

PXn(C)H(PC).

Since the first term of the right hand side of the above inequality vanishes as ntends to infinity, only the second term has contribution to the ultimate normal-ized average code length, which in turn can be bounded above using typical set

129

Tn(δ) as:∑C

PXn(C)H(PC)

=∑

C∈Tn(δ)

PXn(C)H(PC) +∑

C6∈Tn(δ)

PXn(C)H(PC)

≤ maxC : D(PC‖PX)≤δ/ log(2)

H(PC) +∑

C : D(PC‖PX)>δ/ log(2)

PXn(C)H(PC)

≤ maxC : D(PC‖PX)≤δ/ log(2)

H(PC) +∑

C : D(PC‖PX)>δ/ log(2)

2−nD(PC‖PX)H(PC),

(From Lemma 8.4)

≤ maxC : D(PC‖PX)≤δ/ log(2)

H(PC) +∑

C : D(PC‖PX)>δ/ log(2)

e−nδH(PC)

≤ maxC : D(PC‖PX)≤δ/ log(2)

H(PC) +∑

C : D(PC‖PX)>δ/ log(2)

e−nδ log2 |X |

≤ maxC : D(PC‖PX)≤δ/ log(2)

H(PC) + (n+ 1)|X |e−nδ log2 |X |,

where the second term of the last step vanishes as n → ∞. (Note that whenbase-2 logarithm is taken in divergence instead of natural logarithm, the range[0, δ] in Tn(δ) should be replaced by [0, δ/ log(2)].) It remains to show that

maxC : D(PC‖PX)≤δ/ log(2)

H(PC) ≤ H(X) + γ(δ),

where γ(δ) only depends on δ, and approaches zero as δ → 0.

Since D(PC‖PX) ≤ δ/ log(2),

‖PC − PX‖ ≤√

2 · log(2) ·D(PC‖PX) ≤√

2δ,

where ‖PC − PX‖ represents the variational distance of PC against PX , whichimplies that for all x ∈ X ,

|PC(x)− PX(x)| ≤∑x∈X

|PC(x)− PX(x)|

= ‖PC − PX‖≤√

2δ.

Fix 0 < ε < minminx : PX(x)>0 PX(x), 2/e2. Choose δ = ε2/8. Then forall x ∈ X , |PC(x)− PX(x)| ≤ ε/2.

130

For those x satisfying PX(x) ≥ ε, the following is true. From mean valuetheorem, namely,

|t log t− s log s| ≤∣∣∣∣ maxu∈[t,s]∪[s,t]

d(u log u)

du

∣∣∣∣ · |t− s|,we have

|PC(x) logPC(x)− PX(x) logPX(x)|

≤∣∣∣∣ maxu∈[PC(x),PX(x)]∪[PX(x),PC(x)]

d(u log u)

du

∣∣∣∣ · |PC(x)− PX(x)|

≤∣∣∣∣ maxu∈[PC(x),PX(x)]∪[PX(x),PC(x)]

d(u log u)

du

∣∣∣∣ ε2≤

∣∣∣∣ maxu∈[ε/2,1]

d(u log u)

du

∣∣∣∣ ε2(Since PX(x) + ε/2 ≥ PC(x) ≥ PX(x)− ε/2 ≥ ε− ε/2 = ε/2

and PX(x) ≥ ε > ε/2.)

2· |1 + log(ε/2)| ,

where the last step follows from ε < 2/e2.

For those x satisfying PX(x) = 0, PC(x) < ε/2. Hence,

|PC(x) logPC(x)− PX(x) logPX(x)| = |PC(x) logPC(x)|≤ −ε

2log

ε

2,

since ε ≤ 2/e2 < 2/e. Consequently,

|H(PC)−H(X)| ≤ 1

log(2)

∑x∈X

|PC(x) logPC(x)− PX(x) logPX(x)|

≤ |X |log(2)

max

−ε

2log

ε

2,ε · |1 + log(ε/2)|

2

≤ |X |

log(2)max

−√δ

2log(2δ),

√2δ ·

∣∣∣∣1 +1

2log(2δ)

∣∣∣∣

, γ(δ).

2

131

8.1.4 Likelihood ratio versus divergence

Recall that the Neyman-Pearson lemma indicates that the optimal test for twohypothesis (H0 : PXn against H1 : PXn) is of the form

PXn(xn)

PXn(xn)>< τ (8.1.7)

for some τ . This is the likelihood ratio test, and the quantity PXn(xn)/PXn(xn)is called the likelihood ratio. If the logarithm operation is taken on both sides of(8.1.7), the test remains intact; and the resultant quantity becomes log-likelihoodratio, which can be re-written in terms of the divergence of the compositiondistribution against the hypothesis distributions. To be precise, let Cxn be thecomposition of xn, and let PCxn be the corresponding composition distribution.Then

logPXn(xn)

PXn(xn)=

n∑i=1

logPX(xi)

PX(xi)

=∑a∈X

[#a(xn)] logPX(a)

PX(a)

=∑a∈X

[nPCxn (a)] logPX(a)

PX(a)

= n ·∑a∈X

PCxn (a) logPX(a)

PCxn (a)

PCxn (a)

PX(a)

= n

[∑a∈X

PCxn (a) logPCxn (a)

PX(a)−∑a∈X

PCxn (a) logPCxn (a)

PX(a)

]= n [D(PCxn‖PX)−D(PCxn‖PX)]

Hence, (8.1.7) is equivalent to

D(PCxn‖PX)−D(PCxn‖PX) ><1

nlog τ. (8.1.8)

This equivalence means that for hypothesis testing under i.i.d. observations,selection of the acceptance region suffices to be made upon compositions insteadof observations. In other words, the optimal decision function can be defined as:

φ(C) =

0, if composition C is classified to belong to null hypothesis

according to (8.1.8);1, otherwise.

132

8.1.5 Exponent of Bayesian cost

Randomization on decision φ(·) can improve the resultant type-II error forNeyman-Pearson criterion of fixed test level; however, it cannot improve Baye-sian cost. The latter statement can be easily justified by noting the contributionto Bayesian cost for randomizing φ(·) as

φ(xn) =

0, with probability η;

1, with probability 1− η;

satisfies

π0ηPXn(xn) + π1(1− η)PXn(xn) ≥ minπ0PXn(xn), π1PXn(xn).

Therefore, assigning xn to either null hypothesis or alternative hypothesis im-proves the Bayesian cost for ε ∈ (0, 1). Since under i.i.d. statistics, all observa-tions with the same composition have the same probability, the above statementis also true for decision function defined based on compositions.

Now suppose the acceptance region for null hypothesis is chosen to be a setof compositions, denoted by

A , C : D(PC‖PX)−D(PC‖PX) > τ ′.

Then by Sanov’s theorem, the exponent of type II error, βn, is

minC∈A

D(PC‖PX).

Similarly, the exponent of type I error, αn, is

minC∈Ac

D(PC‖PX).

Because whether or not an element xn with composition Cxn is in A is deter-mined by the difference of D(PCxn‖PX) and D(PCxn‖PX), it can be expectedthat as n goes to infinity, there exists PX such that the exponents for type Ierror and type II error are respectively D(PX‖PX) and D(PX‖PX) (cf. Figure8.2.) The solution for PX can be solved using Lagrange multiplier technique byre-formulating the problem as minimizing D(PC‖PX) subject to the conditionD(PC‖PX) = D(PC‖PX)− τ ′. By taking the derivative of

D(PX‖PX) + λ[D(PX‖PX)−D(PX‖PX)− τ ′

]+ ν

(∑x∈X

PX(x)− 1

)with respective to each PX(x), we have

logPX(x)

PX(x)+ 1 + λ · log

PX(x)

PX(x)+ ν = 0.

133

Then the optimal PX(x) satisfies:

PX(x) = Pλ(x) ,P λX(x)P 1−λ

X(x)∑

a∈X PλX(a)P 1−λ

X(a)

.

t tPX PXD(PX‖PX)

tPXD(PX‖PX)

D(PC‖PX) = D(PC‖PX)− τ ′

Figure 8.2: The divergence view on hypothesis testing.

The geometrical explanation for Pλ is that it locates on the “straight line”(in the sense of divergence measure) between PX and PX over the probabilityspace. When λ → 0, Pλ → PX ; when λ → 1, Pλ → PX . Usually, Pλ is namedthe tilted or twisted distribution. The value of λ is dependent on τ ′ = (1/n) log τ .

It is known from detection theory that the best τ for Bayes testing is π1/π0,which is fixed. Therefore,

τ ′ = limn→∞

1

nlog

π1

π0

= 0,

which implies that the optimal exponent for Bayes error is the minimum ofD(Pλ‖PX) subject to D(Pλ‖PX) = D(Pλ‖PX), namely the mid-point of the linesegment (PX , PX) on probability space. This quantity is called the Chernoffbound.

134

8.2 Large deviations theory

The large deviations theory basically consider the technique of computing theexponent of an exponentially decayed probability. Such technique is very usefulin understanding the exponential detail of error probabilities. You have alreadyseen a few good examples in hypothesis testing. In this section, we will introduceits fundamental theories and techniques.

8.2.1 Tilted or twisted distribution

Suppose the probability of a set PX(An) decreasing down to zero exponentiallyfact, and its exponent is equal to a > 0. Over the probability space, let P denotethe set of those distributions PX satisfying PX(An) exhibits zero exponent. Thenapplying similar concept as Sanov’s theorem, we can expect that

a = minPX∈PD(PX‖PX).

Now suppose the minimizer of the above function happens at f(PX) = τ for some

constant τ and some differentiable function f(·) (e.g., P , PX : f(PX) = τ.)Then the minimizer should be of the form

(∀ a ∈ X ) PX(a) =PX(a)e

λ∂f(P

X)

∂PX

(a)∑a′∈X

PX(a′)eλ∂f(P

X)

∂PX

(a′)

.

As a result, PX is the resultant distribution from PX exponentially twisted viathe partial derivative of the function f(·). Note that PX is usually written as

P(λ)X since it is generated by twisting PX with twisted factor λ.

8.2.2 Conventional twisted distribution

The conventional definition for twisted distribution is based on the divergencefunction, i.e.,

f(PX) = D(PX‖PX)−D(PX‖PX). (8.2.1)

Since∂D(PX‖PX)

∂PX(a)= log

PX(a)

PX(a)+ 1,

135

the twisted distribution (with respect to the f(·) defined in (8.2.1)) becomes

(∀ a ∈ X ) PX(a) =PX(a)e

λ logPX

(a)

PX (a)∑a′∈X

PX(a′)eλ log

PX

(a′)PX (a′)

=P 1−λX (a)P λ

X(a)∑

a′∈X

P 1−λX (a)P λ

X(a)

Note that for Bayes testing, the probability of the optimal acceptance regions forboth hypotheses evaluated under the optimal twisted distribution (i.e., λ = 1/2)exhibits zero exponent.

8.2.3 Cramer’s theorem

Consider a sequence of i.i.d. random variables, Xn, and suppose that we areinterested in the probability of the set

X1 + · · ·+Xn

n> τ

.

Observe that (X1 + · · ·+Xn)/n can be re-written as∑a∈X

aPC(a).

Therefore, the set considered can be defined through function f(·):

f(PX) ,∑a∈X

aPX(a),

and its partial derivative with respect to PX(a) is a. The resultant twisteddistribution is

(∀ a ∈ X ) P(λ)X (a) =

PX(a)eλa∑a′∈X

PX(a′)eλa′ .

So the exponent of PXn(X1 + · · ·+Xn)/n > τ is

minPX

: D(PX‖PX)>τ

D(PX‖PX) = minP (λ)X : D(P

(λ)X ‖PX)>τ

D(P(λ)X ‖PX).

136

It should be pointed out that∑

a′∈X PX(a′)eλa′

is the moment generatingfunction of PX . The conventional Cramer’s result does not use the divergence.Instead, it introduced the large deviation rate function, defined by

IX(x) , supθ∈<

[θx− logMX(θ)], (8.2.2)

where MX(θ) , E[expθX] is the moment generating function of X. Using hisstatement, the exponent of PXn(X1 + · · · + Xn)/n > τ is respectively lower-and upper bounded by

infx≥τ

IX(x) and infx>τ

IX(x).

An example on how to obtain the exponent bounds is illustrated in the nextsubsection.

8.2.4 Exponent and moment generating function: an ex-ample

In this subsection, we re-derive the exponent of Pr(X1 + · · · + Xn)/n ≥ λusing the moment generating function for i.i.d. random variables Xi∞i=1 withmean E[X] = µ < λ and bounded variance. As a result, its exponent equalsIX(λ) (cf. (8.2.2)).

A) Preliminaries : Observe that since E[X] = µ < λ and E[|X − µ|2] <∞,

Pr

X1 + · · ·+Xn

n≥ λ

→ 0 as n→∞.

Hence, we can compute its rate of convergence (to zero).

B) Upper bound of the probability :

Pr

X1 + · · ·+Xn

n≥ λ

= Pr θ(X1 + · · ·+Xn) ≥ θnλ , for any θ > 0

= Pr exp (θ(X1 + · · ·+Xn)) ≥ exp (θnλ)

≤ E [exp (θ(X1 + · · ·+Xn))]

exp (θnλ)

=En [exp (θX))]

exp (θnλ)

=

(MX(θ)

exp (θλ)

)n.

137

Hence,

lim infn→∞

− 1

nPr

X1 + · · ·+Xn

n> λ

≥ θλ− logMX(θ).

Since the above inequality holds for every θ > 0, we have

lim infn→∞

− 1

nPr

X1 + · · ·+Xn

n> λ

≥ max

θ>0[θλ− logMX(θ)]

= θ∗λ− logMX(θ∗),

where θ∗ > 0 is the optimizer of the maximum operation. (The positivityof θ∗ can be easily verified by the concavity of the function θλ− logMX(θ)in θ, and it derivative at θ = 0 equals (λ−µ) which is strictly greater than0.) Consequently,

lim infn→∞

− 1

nPr

X1 + · · ·+Xn

n> λ

≥ θ∗λ− logMX(θ∗)

= supθ∈<

[θλ− logMX(θ)] = IX(λ).

C) Lower bound of the probability :

Define the twisted distribution of PX as

P(θ)X (x) ,

expθxPX(x)

MX(θ).

Let X(θ) be the random variable having distribution P(θ)X . Then from

i.i.d. property of Xn,

P(θ)Xn(xn) ,

expθ(x1 + · · ·+ xn)PXn(xn)

MnX(θ)

,

which implies

PXn(xn) = MnX(θ) exp−θ(x1 + · · ·+ xn)P (θ)

Xn(xn).

138

Note that X(θ)i ∞i=1 are also i.i.d. random variables. Then

Pr

X1 + · · ·+Xn

n≥ λ

=

∑[xn : x1+···+xn≥nλ]

PXn(xn)

=∑

[xn : x1+···+xn≥nλ]

MnX(θ) exp−θ(x1 + · · ·+ xn)P (θ)

Xn(xn)

= MnX(θ)

∑[xn : x1+···+xn≥nλ]

exp−θ(x1 + · · ·+ xn)P (θ)Xn(xn)

≥ MnX(θ)

∑[xn : n(λ+γ)>x1+···+xn≥nλ]

exp−θ(x1 + · · ·+ xn)P (θ)Xn(xn),

for γ > 0

≥ MnX(θ)

∑[xn : n(λ+γ)>x1+···+xn≥nλ]

exp−θn(λ+ γ)P (θ)Xn(xn), for θ > 0.

= MnX(θ) exp−θn(λ+ γ)

∑[xn : n(λ+γ)>x1+···+xn≥nλ]

P(θ)Xn(xn)

= MnX(θ) exp−θn(λ+ γ)Pr

λ+ γ >

X(θ)1 + · · ·+X

(θ)n

n≥ λ

.

Choosing θ = θ∗ (the one that maximizes θλ− logMX(θ)), and observingthat E[X(θ∗)] = λ, we obtain:

Pr

λ+ γ >

X(θ)1 + · · ·+X

(θ)n

n≥ λ

= Pr

√nγ

Var[X(θ∗)]>

(X(θ)1 − λ) + · · ·+ (X

(θ)n − λ)√

nVar[X(θ∗)]≥ 0

≥ Pr

B >

(X(θ)1 − λ) + · · ·+ (X

(θ)n − λ)√

nVar[X(θ∗)]≥ 0

,

for√n > BVar[X(θ∗)]/γ

→ Φ(B)− Φ(0) as n→∞,

where the last step follows from the fact that

(X(θ)1 − λ) + · · ·+ (X

(θ)n − λ)√

nVar[X(θ∗)]

converges in distribution to standard normal distribution Φ(·). Since B

139

can be made arbitrarily large,

Pr

λ+ γ >

X(θ)1 + · · ·+X

(θ)n

n≥ λ

→ 1

2.

Consequently,

lim supn→∞

− 1

nlogPr

X1 + · · ·+Xn

n≥ λ

≤ θ∗(λ+ γ)− logMX(θ∗)

= supθ∈<

[θλ− logMX(θ)] + θ∗γ = IX(λ) + θ∗γ.

The derivation is completed by noting that θ∗ is fixed, and γ can be chosenarbitrarily small.

8.3 Theories on Large deviations

In this section, we will derive inequalities on the exponent of the probability,PrZn/n ∈ [a, b], which is a slight extension of the Gartner-Ellis theorem.

8.3.1 Extension of Gartner-Ellis upper bounds

Definition 8.8 In this subsection, Zn∞n=1 will denote an infinite sequence ofarbitrary random variables.

Definition 8.9 Define

ϕn(θ) ,1

nlogE [exp θZn] and ϕ(θ) , lim sup

n→∞ϕn(θ).

The sup-large deviation rate function of an arbitrary random sequence Zn∞n=1

is defined asI(x) , sup

θ∈< : ϕ(θ)>−∞[θx− ϕ(θ)] . (8.3.1)

The range of the supremum operation in (8.3.1) is always non-empty sinceϕ(0) = 0, i.e. θ ∈ < : ϕ(θ) > −∞ 6= ∅. Hence, I(x) is always defined.With the above definition, the first extension theorem of Gartner-Ellis can beproposed as follows.

Theorem 8.10 For a, b ∈ < and a ≤ b,

lim supn→∞

1

nlogPr

Znn∈ [a, b]

≤ − inf

x∈[a,b]I(x).

140

Proof: The proof follows directly from Theorem 8.13 by taking h(x) = x, andhence, we omit it. 2

The bound obtained in the above theorem is not in general tight. This canbe easily seen by noting that for an arbitrary random sequence Zn∞n=1, theexponent of Pr Zn/n ≤ b is not necessarily convex in b, and therefore, cannotbe achieved by a convex (sup-)large deviation rate function. The next examplefurther substantiates this argument.

Example 8.11 Suppose that PrZn = 0 = 1− e−2n, and PrZn = −2n =e−2n. Then from Definition 8.9, we have

ϕn(θ) ,1

nlogE

[eθZn

]=

1

nlog[1− e−2n + e−(θ+1)·2·n] ,

and

ϕ(θ) , lim supn→∞

ϕn(θ) =

0, for θ ≥ −1;−2(θ + 1), for θ < −1.

Hence, θ ∈ < : ϕ(θ) > −∞ = < (i.e., the real line) and

I(x) = supθ∈<

[θx− ϕ(θ)]

= supθ∈<

[θx+ 2(θ + 1)1θ < −1)]

=

−x, for − 2 ≤ x ≤ 0;∞, otherwise,

where 1· represents the indicator function of a set. Consequently, by Theorem8.10,

lim supn→∞

1

nlogPr

Znn∈ [a, b]

≤ − inf

x∈[a,b]I(x)

=

0, for 0 ∈ [a, b]b, for b ∈ [−2, 0]−∞, otherwise

.

The exponent of PrZn/n ∈ [a, b] in the above example is indeed given by

limn→∞

1

nlogPZn

Znn∈ [a, b]

= − inf

x∈[a,b]I∗(x),

where

I∗(x) =

2, for x = −2;0, for x = 0;∞, otherwise.

(8.3.2)

141

Thus, the upper bound obtained in Theorem 8.10 is not tight.

As mentioned earlier, the looseness of the upper bound in Theorem 8.10 can-not be improved by simply using a convex sup-large deviation rate function. Notethat the true exponent (cf. (8.3.2)) of the above example is not a convex func-tion. We then observe that the convexity of the sup-large deviation rate functionis simply because it is a pointwise supremum of a collection of affine functions(cf. (8.3.1)). In order to obtain a better bound that achieves a non-convex largedeviation rate, the involvement of non-affine functionals seems necessary. Asa result, a new extension of the Gartner-Ellis theorem is established along thisline.

Before introducing the non-affine extension of Gartner-Ellis upper bound, wedefine the twisted sup-large deviation rate function as follows.

Definition 8.12 Define

ϕn(θ;h) ,1

nlogE

[exp

n · θ · h

(Znn

)]and ϕh(θ) , lim sup

n→∞ϕn(θ;h),

where h(·) is a given real-valued continuous function. The twisted sup-largedeviation rate function of an arbitrary random sequence Zn∞n=1 with respectto a real-valued continuous function h(·) is defined as

Jh(x) , supθ∈< : ϕh(θ)>−∞

[θ · h(x)− ϕh(θ)] . (8.3.3)

Similar to I(x), the range of the supremum operation in (8.3.3) is not empty,and hence, Jh(·) is always defined.

Theorem 8.13 Suppose that h(·) is a real-valued continuous function. Thenfor a, b ∈ < and a ≤ b,

lim supn→∞

1

nlogPr

Znn∈ [a, b]

≤ − inf

x∈[a,b]Jh(x).

Proof: The proof is divided into two parts. Part 1 proves the result under

[a, b] ∩x ∈ < : Jh(x) <∞

6= ∅, (8.3.4)

and Part 2 verifies it under

[a, b] ⊂x ∈ < : Jh(x) =∞

. (8.3.5)

Since either (8.3.4) or (8.3.5) is true, these two parts, together, complete theproof of this theorem.

142

Part 1: Assume [a, b] ∩x ∈ < : Jh(x) <∞

6= ∅.

Define J∗ , infx∈[a,b] Jh(x). By assumption, J∗ <∞. Therefore,

[a, b] ⊂x : Jh(x) > J∗ − ε

,

for any ε > 0, andx ∈ < : Jh(x) > J∗ − ε

=

x ∈ < : sup

θ∈< : ϕh(θ)>−∞[θh(x)− ϕh(θ)] > J∗ − ε

⋃θ∈< : ϕh(θ)>−∞

x ∈ < : [θh(x)− ϕh(θ)] > J∗ − ε .

Observe that ∪θ∈< : ϕh(θ)>−∞ x ∈ < : [θx− ϕh(θ)] > J∗ − ε is a col-lection of (uncountably infinite) open sets that cover [a, b] which is closedand bounded (and hence compact). By the Heine-Borel theorem, we canfind a finite subcover such that

[a, b] ⊂k⋃i=1

x ∈ < : [θix− ϕh(θi)] > J∗ − ε ,

and (∀ 1 ≤ i ≤ k) ϕh(θi) <∞ (otherwise, the set x : [θix−ϕh(θi)] > J∗−ε is empty, and can be removed). Also note (∀ 1 ≤ i ≤ k) ϕh(θi) > −∞.Consequently,

Pr

Znn∈ [a, b]

⊂ Pr

Znn∈

k⋃i=1

x : θih(x)− ϕh(θi)] > J∗ − ε

≤k∑i=1

Pr

h

(Znn

)· θi − ϕh(θi) > J∗ − ε

=k∑i=1

Pr

n · h

(Znn

)· θi > nϕh(θi) + n(J∗ − ε)

≤k∑i=1

exp n[ϕn(θi;h)− ϕh(θi)]− n(J∗ − ε) ,

where the last step follows from Markov’s inequality. Since k is a constantindependent of n, and for each integer i ∈ [1, k],

lim supn→∞

1

nlog (exp n[ϕn(θi;h)− ϕh(θi)]− n(J∗ − ε)) = −(J∗ − ε),

143

we obtain

lim supn→∞

1

nlogPr

Znn∈ [a, b]

≤ −(J∗ − ε).

Since ε is arbitrary, the proof is completed.

Part 2: Assume [a, b] ⊂x ∈ < : Jh(x) =∞

.

Observe that [a, b] ⊂x : Jh(x) > L

for any L > 0. Following the same

procedure as used in Part 1, we obtain

lim supn→∞

1

nlogPr

Znn∈ [a, b]

≤ −L.

Since L can be taken arbitrarily large,

lim supn→∞

1

nlogPr

Znn∈ [a, b]

= −∞ = − inf

x∈[a,b]Jh(x).

2

The proof of the above theorem has implicitly used the condition −∞ <ϕh(θi) < ∞ to guarantee that lim supn→∞[ϕn(θi;h) − ϕh(θi)] = 0 for each inte-ger i ∈ [1, k]. Note that when ϕh(θi) = ∞ (resp. −∞), lim supn→∞[ϕn(θi;h) −ϕh(θi)] = −∞ (resp. ∞). This explains why the range of the supremum opera-tion in (8.3.3) is taken to be θ ∈ < : ϕh(θ) > −∞, instead of the whole realline.

As indicated in Theorem 8.13, a better upper bound can possibly be foundby twisting the large deviation rate function around an appropriate (non-affine)functional on the real line. Such improvement is substantiated in the next ex-ample.

Example 8.14 Let us, again, investigate the Zn∞n=1 defined in Example 8.11.Take

h(x) =1

2(x+ 2)2 − 1.

Then from Definition 8.12, we have

ϕn(θ;h) ,1

nlogE [exp nθh(Zn/n)]

=1

nlog [exp nθ − exp n(θ − 2)+ exp −n(θ + 2)] ,

and

ϕh(θ) , lim supn→∞

ϕn(θ;h) =

−(θ + 2), for θ ≤ −1;θ, for θ > −1.

144

Hence, θ ∈ < : ϕh(θ) > −∞ = < and

Jh(x) , supθ∈<

[θh(x)− ϕh(θ)] =

−1

2(x+ 2)2 + 2, for x ∈ [−4, 0];

∞, otherwise.

Consequently, by Theorem 8.13,

lim supn→∞

1

nlogPr

Znn∈ [a, b]

≤ − inf

x∈[a,b]Jh(x)

=

−min

−(a+ 2)2

2,−(b+ 2)2

2

− 2, for − 4 ≤ a < b ≤ 0;

0, for a > 0 or b < −4;−∞, otherwise.

(8.3.6)

For b ∈ (−2, 0) and a ∈[−2−

√2b− 4, b

), the upper bound attained in the

previous example is strictly less than that given in Example 8.11, and hence,an improvement is obtained. However, for b ∈ (−2, 0) and a < −2 −

√2b− 4,

the upper bound in (8.3.6) is actually looser. Accordingly, we combine the twoupper bounds from Examples 8.11 and 8.14 to get

lim supn→∞

1

nlogPr

Znn∈ [a, b]

≤ −max

inf

x∈[a,b]Jh(x), inf

x∈[a,b]I(x)

=

0, for 0 ∈ [a, b];1

2(b+ 2)2 − 2, for b ∈ [−2, 0];

−∞, otherwise.

A better bound on the exponent of PrZn/n ∈ [a, b] is thus obtained. As aresult, Theorem 8.13 can be further generalized as follows.

Theorem 8.15 For a, b ∈ < and a ≤ b,

lim supn→∞

1

nlogPr

Znn∈ [a, b]

≤ − inf

x∈[a,b]J(x),

where J(x) , suph∈H Jh(x) and H is the set of all real-valued continuous func-tions.

Proof: By re-defining J∗ , infx∈[a,b] J(x) in the proof of Theorem 8.13, andobserving that

[a, b] ⊂ x ∈ < : J(x) > J∗ − ε⊂

⋃h∈H

⋃θ∈< : ϕh(θ)>−∞

x ∈ < : [θh(x)− ϕh(θ)] > J∗ − ε,

145

the theorem holds under [a, b] ∩ x ∈ < : J(x) < ∞ 6= ∅. Similar modifica-tions to the proof of Theorem 8.13 can be applied to the case of [a, b] ⊂ x ∈< : J(x) =∞. 2

Example 8.16 Let us again study the Zn∞n=1 in Example 8.11 (also in Ex-ample 8.14). Suppose c > 1. Take hc(x) = c1(x+ c2)2 − c, where

c1 ,c+√c2 − 1

2and c2 ,

2√c+ 1√

c+ 1 +√c− 1

.

Then from Definition 8.12, we have

ϕn(θ;hc) ,1

nlogE

[exp

nθhc

(Znn

)]=

1

nlog [exp nθ − exp n(θ − 2)+ exp −n(θ + 2)] ,

and

ϕhc(θ) , lim supn→∞

ϕn(θ;hc) =

−(θ + 2), for θ ≤ −1;θ, for θ > −1.

Hence, θ ∈ < : ϕhc(θ) > −∞ = < and

Jhc(x) = supθ∈<

[θhc(x)− ϕhc(θ)]

=

−c1(x+ c2)2 + c+ 1, for x ∈ [−2c2, 0];∞, otherwise.

From Theorem 8.15,

J(x) = suph∈H

Jh(x) ≥ maxlim infc→∞

Jhc(x), I(x) = I∗(x),

where I∗(x) is defined in (8.3.2). Consequently,

lim supn→∞

1

nlogPr

Znn∈ [a, b]

≤ − inf

x∈[a,b]J(x)

≤ − infx∈[a,b]

I∗(x)

=

0, if 0 ∈ [a, b];−2, if − 2 ∈ [a, b] and 0 6∈ [a, b];−∞, otherwise,

and a tight upper bound is finally obtained!

146

Theorem 8.13 gives us the upper bound on the limsup of (1/n) logPrZn/n ∈[a, b]. With the same technique, we can also obtain a parallel theorem for thequantity

lim infn→∞

1

nlogPr

Znn∈ [a, b]

.

Definition 8.17 Define ϕh(θ) , lim infn→∞ ϕn(θ;h), where ϕn(θ;h) was de-

fined in Definition 8.12. The twisted inf-large deviation rate function of an arbi-trary random sequence Zn∞n=1 with respect to a real-valued continuous functionh(·) is defined as

Jh(x) , supθ∈< : ϕ

h(θ)>−∞

[θ · h(x)− ϕ

h(θ)].

Theorem 8.18 For a, b ∈ < and a ≤ b,

lim infn→∞

1

nlogPr

Znn∈ [a, b]

≤ − inf

x∈[a,b]J(x),

where J(x) , suph∈H Jh(x) and H is the set of all real-valued continuous func-tions.

8.3.2 Extension of Gartner-Ellis lower bounds

The tightness of the upper bound given in Theorem 8.13 naturally relies on thevalidity of

lim supn→∞

1

nlogPr

Znn∈ (a, b)

≥ − inf

x∈(a,b)Jh(x), (8.3.7)

which is an extension of the Gartner-Ellis lower bound. The above inequality,however, is not in general true for all choices of a and b (cf. Case A of Exam-ple 8.21). It therefore becomes significant to find those (a, b) within which theextended Gartner-Ellis lower bound holds.

Definition 8.19 Define the sup-Gartner-Ellis set with respect to a real-valuedcontinuous function h(·) as

Dh ,⋃

θ∈< : ϕh(θ)>−∞

D(θ;h)

147

where

D(θ;h) ,

x ∈ < : lim sup

t↓0

ϕh(θ + t)− ϕh(θ)t

≤ h(x) ≤ lim inft↓0

ϕh(θ)− ϕh(θ − t)t

.

Let us briefly remark on the sup-Gartner-Ellis set defined above. It is self-explained in its definition that Dh is always defined for any real-valued functionh(·). Furthermore, it can be derived that the sup-Gartner-Ellis set is reduced to

Dh ,⋃

θ∈< : ϕh(θ)>−∞

x ∈ < : ϕ′h(θ) = h(x) ,

if the derivative ϕ′h(θ) exists for all θ. Observe that the condition h(x) = ϕ′h(θ)is exactly the equation for finding the θ that achieves Jh(x), which is obtainedby taking the derivative of [θh(x) − ϕh(θ)]. This somehow hints that the sup-Gartner-Ellis set is a collection of those points at which the exact sup-largedeviation rate is achievable.

We now state the main theorem in this section.

Theorem 8.20 Suppose that h(·) is a real-valued continuous function. Then if(a, b) ⊂ Dh,

lim supn→∞

1

nlogPr

Znn∈ Jh(a, b)

≥ − inf

x∈(a,b)Jh(x),

whereJh(a, b) , y ∈ < : h(y) = h(x) for some x ∈ (a, b).

Proof: Let Fn(·) denote the distribution function of Zn. Define its extendedtwisted distribution around the real-valued continuous function h(·) as

dF (θ;h)n (·) , exp n θ h(x/n) dFn(·)

E[exp n θ h(Zn/n)]=

exp n θ h(x/n) dFn(·)exp nϕn(θ;h)

.

Let Z(θ:h)n be the random variable having F

(θ;h)n (·) as its probability distribution.

LetJ∗ , inf

x∈(a,b)Jh(x).

Then for any ε > 0, there exists v ∈ (a, b) with Jh(v) ≤ J∗ + ε.

148

Now the continuity of h(·) implies that

B(v, δ) , x ∈ < : |h(x)− h(v)| < δ ⊂ Jh(a, b)

for some δ > 0. Also, (a, b) ⊂ Dh ensures the existence of θ satisfying

lim supt↓0

ϕh(θ + t)− ϕh(θ)t

≤ h(v) ≤ lim inft↓0

ϕh(θ)− ϕh(θ − t)t

,

which in turn guarantees the existence of t = t(δ) > 0 satisfying

ϕh(θ + t)− ϕh(θ)t

≤ h(v) +δ

4and h(v)− δ

4≤ ϕh(θ)− ϕh(θ − t)

t. (8.3.8)

We then derive

Pr

Znn∈ Jh(a, b)

≥ Pr

Znn∈ B(v, δ)

= Pr

∣∣∣∣h(Znn)− h(v)

∣∣∣∣ < δ

=

∫x∈< : |h(x/n)−h(v)|<δ

dFn(x)

=

∫x∈< : |h(x/n)−h(v)|<δ

expnϕn(θ;h)− nθh

(xn

)dF (θ;h)

n (x)

≥ exp nϕn(θ;h)− nθh(v)− n|θ|δ∫x∈< : |h(x/n)−h(v)|<δ

dF (θ;h)n (x)

= exp nϕn(θ;h)− nθh(v)− n|θ|δPr

Z

(θ;h)n

n∈ B(v, δ)

,

which implies

lim supn→∞

1

nlogPr

Znn∈ Jh(a, b)

≥ −[θh(v)− ϕh(θ)]− |θ| δ + lim sup

n→∞

1

nlogPr

Z

(θ;h)n

n∈ B(v, δ)

= −Jh(v)− |θ| δ + lim supn→∞

1

nlogPr

Z

(θ;h)n

n∈ B(v, δ)

≥ −J∗ − ε− |θ| δ + lim supn→∞

1

nlogPr

Z

(θ;h)n

n∈ B(v, δ)

.

149

Since both δ and ε can be made arbitrarily small, it remains to show

lim supn→∞

1

nlogPr

Z

(θ;h)n

n∈ B(v, δ)

= 0. (8.3.9)

To show (8.3.9), we first note that

Pr

h

(Z

(θ;h)n

n

)≥ h(v) + δ

= Pr

en t h

(Z

(θ;h)n /n

)≥ en t h(v)+n t δ

≤ e−n t h(v)−n t δ

∫<en t h(x/n)dF (θ;h)

n

= e−n t h(v)−n t δ∫<en t h(x/n)+n θ h(x/n)−nϕn(θ;h)dFn(x)

= e−nth(v)−ntδ−nϕn(θ;h)+nϕn(θ+t;h).

Similarly,

Pr

h

(Z

(θ;h)n

n

)≤ h(v)− δ

= Pr

e−n t h

(Z

(θ;h)n /n

)≥ e−n t h(v)+n t δ

≤ en t h(v)−n t δ

∫<e−n t h(x/n)dF (θ;h)

n

= en t h(v)−n t δ∫<e−n t h(x/n)+n θ h(x/n)−nϕn(θ;h)dFn(x)

= enth(v)−ntδ−nϕn(θ;h)+nϕn(θ−t;h).

Now by definition of limsup,

ϕn(θ + t;h) ≤ ϕh(θ + t) +tδ

4and ϕn(θ − t;h) ≤ ϕh(θ − t) +

4(8.3.10)

for sufficiently large n; and

ϕn(θ;h) ≥ ϕh(θ)−tδ

4(8.3.11)

150

for infinitely many n. Hence, there exists a subsequence n1, n2, n3, . . . suchthat for all nj, (8.3.10) and (8.3.11) hold. Consequently, for all j,

1

njlogPr

Z

(θ;h)nj

nj6∈ B(v, δ)

≤ 1

njlog(

2×max[e−njth(v)−njtδ−njϕnj (θ;h)+njϕnj (θ+t;h),

enjth(v)−njtδ−njϕnj (θ;h)+njϕnj (θ−t;h)])

=1

njlog 2 + max

[−th(v) + ϕnj(θ + t;h), th(v) + ϕnj(θ − t;h)

]−ϕnj(θ;h)− tδ

≤ 1

njlog 2 + max [−th(v) + ϕh(θ + t), th(v) + ϕh(θ − t)] − ϕh(θ)−

2

=1

njlog 2

+t ·max

[ϕh(θ + t)− ϕh(θ)

t− h(v), h(v)− ϕh(θ)− ϕh(θ − t)

t

]−tδ

2

≤ 1

njlog 2− tδ

4, (8.3.12)

where (8.3.12) follows from (8.3.8). The proof is then completed by obtaining

lim infn→∞

1

nlogPr

Z

(θ;h)n

n6∈ B(v, δ)

≤ −tδ

4,

which immediately guarantees the validity of (8.3.9). 2

Next, we use an example to demonstrate that by choosing the right h(·),we can completely characterize the exact (non-convex) sup-large deviation rateI∗(x) for all x ∈ <.

Example 8.21 Suppose Zn = X1 + · · ·+Xn, where Xini=1 are i.i.d. Gaussianrandom variables with mean 1 and variance 1 if n is even, and with mean −1and variance 1 if n is odd. Then the exact large deviation rate formula I∗(x)that satisfies for all a < b,

− infx∈[a,b]

I∗(x) ≥ lim supn→∞

1

nlogPr

Znn∈ [a, b]

≥ lim sup

n→∞

1

nlogPr

Znn∈ (a, b)

≥ − inf

x∈(a,b)I∗(x)

151

is

I∗(x) =(|x| − 1)2

2. (8.3.13)

Case A: h(x) = x.

For the affine h(·), ϕn(θ) = θ + θ2/2 when n is even, and ϕn(θ) =−θ + θ2/2 when n is odd. Hence, ϕ(θ) = |θ|+ θ2/2, and

Dh =

(⋃θ>0

v ∈ < : v = 1 + θ

)⋃(⋃θ<0

v ∈ < : v = −1 + θ

)⋃v ∈ < : 1 ≤ v ≤ −1

= (1,∞) ∪ (−∞,−1).

Therefore, Theorem 8.20 cannot be applied to any a and b with (a, b) ∩[−1, 1] 6= ∅.By deriving

I(x) = supθ∈<xθ − ϕ(θ) =

(|x| − 1)2

2, for |x| > 1;

0, for |x| ≤ 1,

we obtain for any a ∈ (−∞, 1) ∪ (1,∞),

limε↓0

lim supn→∞

1

nlogPr

Znn∈ (a− ε, a+ ε)

≥ − lim

ε↓0inf

x∈(a−ε,a+ε)I(x) = −(|a| − 1)2

2,

which can be shown tight by Theorem 8.13 (or directly by (8.3.13)). Notethat the above inequality does not hold for any a ∈ (−1, 1). To fill thegap, a different h(·) must be employed.

Case B: h(x) = |x− a| for −1 < a < 1.

152

For n even,

E[enθh(Zn/n)

]= E

[enθ|Zn/n−a|

]=

∫ na

−∞e−θx+nθa 1√

2πne−(x−n)2/(2n)dx+

∫ ∞na

eθx−nθa1√2πn

e−(x−n)2/(2n)dx

= enθ(θ−2+2a)/2

∫ na

−∞

1√2πn

e−[x−n(1−θ)]2/(2n)dx

+ enθ(θ+2−2a)/2

∫ ∞na

1√2πn

e−[x−n(1+θ)]2/(2n)dx

= enθ(θ−2+2a)/2 · Φ((θ + a− 1)

√n)

+ enθ(θ+2−2a)/2 · Φ((θ − a+ 1)

√n),

where Φ(·) represents the unit Gaussian cdf.

Similarly, for n odd,

E[enθh(Zn/n)

]= enθ(θ+2+2a)/2 · Φ

((θ + a+ 1)

√n)

+ enθ(θ−2−2a)/2 · Φ((θ − a− 1)

√n).

Observe that for any b ∈ <,

limn→∞

1

nlog Φ(b

√n) =

0, for b ≥ 0;

−b2

2, for b < 0.

Hence,

ϕh(θ) =

−(|a| − 1)2

2, for θ < |a| − 1;

θ[θ + 2(1− |a|)]2

, for |a| − 1 ≤ θ < 0;

θ[θ + 2(1 + |a|)]2

, for θ ≥ 0.

Therefore,

Dh =

(⋃θ>0

x ∈ < : |x− a| = θ + 1 + |a|

)⋃(⋃

θ<0

x ∈ < : |x− a| = θ + 1− |a|

)= (−∞, a− 1− |a|) ∪ (a− 1 + |a|, a+ 1− |a|) ∪ (a+ 1 + |a|,∞)

153

and

Jh(x) =

(|x− a| − 1 + |a|)2

2, for a− 1 + |a| < x < a+ 1− |a|;

(|x− a| − 1− |a|)2

2, for x > a+ 1 + |a| or x < a− 1− |a|;

0, otherwise.(8.3.14)

We then apply Theorem 8.20 to obtain

limε↓0

lim supn→∞

1

nlogPr

Znn∈ (a− ε, a+ ε)

≥ − lim

ε↓0inf

x∈(a−ε,a+ε)Jh(x)

= − limε↓0

(ε− 1 + |a|)2

2= −(|a| − 1)2

2.

Note that the above lower bound is valid for any a ∈ (−1, 1), and can beshown tight, again, by Theorem 8.13 (or directly by (8.3.13)).

Finally, by combining the results of Cases A) and B), the true large devi-ation rate of Znn≥1 is completely characterized.

2

Remarks.

• One of the problems in applying the extended Gartner-Ellis theorems is thedifficulty in choosing an appropriate real-valued continuous function (not tomention the finding of the optimal one in the sense of Theorem 8.15). Fromthe previous example, we observe that the resultant Jh(x) is in fact equal tothe lower convex contour1 (with respect to h(·)) of miny∈< : h(y)=h(x) I

∗(x).Indeed, if the lower convex contour of miny∈< : h(y)=h(x) I

∗(x) equals I∗(x)for some x lying in the interior of Dh, we can apply Theorems 8.13 and8.20 to establish the large deviation rate at this point. From the aboveexample, we somehow sense that taking h(x) = |x − a| is advantageousin characterizing the large deviation rate at x = a. As a consequenceof such choice of h(·), Jh(x) will shape like the lower convex contour ofminI∗(x − a), I∗(a − x) in h(x) = |x − a|. Hence, if a lies in Dh, Jh(a)can surely be used to characterize the large deviation rate at x = a (as itdoes in Case B of Example 8.21).

1We define that the lower convex contour of a function f(·) with respect to h(·) is thelargest g(·) satisfying that g(h(x)) ≤ f(x) for all x, and for every x, y and for all λ ∈ [0, 1],λg(h(x)) + (1− λ)g(h(y)) ≥ g(λh(x) + (1− λ)h(y) ).

154

• The assumptions required by the conventional Gartner-Ellis lower bound[2, pp. 15] are

1. ϕ(θ) = ϕ(θ) = ϕ(θ) exists;

2. ϕ(θ) is differentiable on its domain; and

3. (a, b) ⊂ x ∈ < : x = ϕ′(θ) for some θ.

The above assumptions are somewhat of limited use for arbitrary randomsequences, since they do not in general hold. For example, the conditionof ϕ(θ) 6= ϕ(θ) is violated in Example 8.21.

• By using the limsup and liminf operators in our extension theorem, thesup-Gartner-Ellis set is always defined without any requirement on thelog-moment generating functions. The sup-Gartner-Ellis set also clearlyindicates the range in which the Gartner-Ellis lower bound holds. In otherwords, Dh is a subset of the union of all (a, b) for which the Gartner-Ellislower bound is valid. This is concluded in the following equation:

Dh ⊂⋃

(a, b) : lim supn→∞

1

nlogPr

[Znn∈ Jh(a, b)

]≥ − inf

x∈(a,b)Jh(x)

.

To verify whether or not the above two sets are equal merits further inves-tigation.

• Modifying the proof of Theorem 8.20, we can also establish a lower boundfor

lim infn→∞

1

nlogPr

Znn∈ Jh(a, b)

.

Definition 8.22 Define the inf-Gartner-Ellis set with respect to a real-valued continuous function h(·) as

Dh ,⋃

θ∈< : ϕh

(θ)>−∞

D(θ;h)

where

D(θ;h) ,

x ∈ < : lim sup

t↓0

ϕh(θ + t)− ϕ

h(θ)

t≤ h(x)

≤ lim inft↓0

ϕh(θ)− ϕ

h(θ − t)

t

.

155

Theorem 8.23 Suppose that h(·) is a real-valued continuous function.Then if (a, b) ⊂ Dh,

lim infn→∞

1

nlogPr

Znn∈ Jh(a, b)

≥ − inf

x∈(a,b)Jh(x).

• One of the important usages of the large deviation rate functions is to findthe Varadhan’s asymptotic integration formula of

limn→∞

1

nlogE [exp θZn]

for a given random sequence Zn∞n=1. To be specific, it is equal [4, Theo-rem 2.1.10] to

limn→∞

1

nlogE [exp θZn] = sup

x∈< : I(x)<∞[θx− I(x)],

if

limL→∞

lim supn→∞

1

nlog

[∫[x∈< : θx≥L]

exp θx dPZn(x)

]= −∞.

The above result can also be extended using the same idea as applied tothe Gartner-Ellis theorem.

Theorem 8.24 If

limL→∞

lim supn→∞

1

nlog

[∫[x∈< : θh(x)≥L]

exp θh(x) dPZn(x)

]= −∞,

then

lim supn→∞

1

nlogE

[exp

nθh

(Znn

)]= supx∈< : Jh(x)<∞

[θh(x)− Jh(x)],

and

lim infn→∞

1

nlog

[exp

nθh

(Znn

)]= supx∈< : Jh(x)<∞

[θh(x)− Jh(x)].

Proof: This can be obtained by modifying the proofs of Lemmas 2.1.7and 2.1.8 in [4]. 2

We close the section by remarking that the result of the above theoremcan be re-formulated as

Jh(x) = supθ∈< : ϕh(θ)>−∞

[θh(x)− ϕh(θ)]

156

andϕh(θ) = sup

x∈< : Jh(x)<∞[θh(x)− Jh(x)],

which is an extension of the Legendre-Fenchel transform pair. Similarconclusion applies to Jh(x) and ϕ

h(θ).

8.3.3 Properties of (twisted) sup- and inf-large deviationrate functions.

Property 8.25 Let I(x) and I(x) be the sup- and inf- large deviation rate func-tions of an infinite sequence of arbitrary random variables Zn∞n=1, respectively.Denote mn = (1/n)E[Zn]. Let m , lim supn→∞mn and m , lim infn→∞mn.Then

1. I(x) and I(x) are both convex.

2. I(x) is continuous over x ∈ < : I(x) <∞. Likewise, I(x) is continuousover x ∈ < : I(x) <∞.

3. I(x) gives its minimum value 0 at m ≤ x ≤ m.

4. I(x) ≥ 0. But I(x) does not necessary give its minimum value at bothx = m and x = m.

Proof:

1. I(x) is the pointwise supremum of a collection of affine functions. Therefore,it is convex. Similar argument can be applied to I(x).

2. A convex function on the real line is continuous everywhere on its domainand hence the property holds.

3.&4. The proofs follow immediately from Property 8.26 by taking h(x) = x.2

Since the twisted sup/inf-large deviation rate functions are not necessarilyconvex, a few properties of sup/inf-large deviation functions do not hold forgeneral twisted functions.

Property 8.26 Suppose that h(·) is a real-valued continuous function. LetJh(x) and Jh(x) be the corresponding twisted sup- and inf- large deviation ratefunctions, respectively. Denote mn(h) , E[h(Zn/n)]. Let

mh , lim supn→∞

mn(h) and mh , lim infn→∞

mn(h).

Then

157

1. Jh(x) ≥ 0, with equality holds if mh ≤ h(x) ≤ mh.

2. Jh(x) ≥ 0, but Jh(x) does not necessary give its minimum value at bothx = mh and x = mh.

Proof:

1. For all x ∈ <,

Jh(x) , supθ∈< : ϕh(θ)>−∞

[θ · h(x)− ϕh(θ)] ≥ 0 · h(x)− ϕh(0) = 0.

By Jensen’s inequality,

exp nϕn(θ;h) = E[exp n · θ · h(Zn/n)]≥ exp n · θ · E[h(Zn/n)]= exp n · θ ·mn(h) ,

which is equivalent toθ ·mn(h) ≤ ϕn(θ;h).

After taking the limsup and liminf of both sides of the above inequalities,we obtain:

• for θ ≥ 0,θmh ≤ ϕh(θ) (8.3.15)

andθ ·mh ≤ ϕ

h(θ) ≤ ϕh(θ); (8.3.16)

• for θ < 0,θmh ≤ ϕh(θ) (8.3.17)

andθ · mh ≤ ϕ

h(θ) ≤ ϕh(θ). (8.3.18)

(8.3.15) and (8.3.18) imply Jh(x) = 0 for those x satisfying h(x) = mh,and (8.3.16) and (8.3.17) imply Jh(x) = 0 for those x satisfying h(x) = mh.For x ∈ x : mh ≤ h(x) ≤ mh,

θ · h(x)− ϕh(θ) ≤ θ · mh − ϕh(θ) ≤ 0 for θ ≥ 0

andθ · h(x)− ϕh(θ) ≤ θ ·mh − ϕh(θ) ≤ 0 for θ < 0.

Hence, by taking the supremum over θ ∈ < : ϕh(θ) > −∞, we obtainthe desired result.

158

2. The non-negativity of Jh(x) can be similarly proved as Jh(x).

From Case A of Example 8.21, we have m = 1, m = −1, and

ϕ(θ) =

−θ +

1

2θ2, for θ ≥ 0

θ +1

2θ2, for θ < 0

Therefore,

I(x) = supθ∈<xθ − ϕ(θ) =

(x+ 1)2

2, for x ≥ 0

(x− 1)2

2, for x < 0,

.

for which I(−1) = I(1) = 2 and minx∈< I(x) = I(0) = 1/2. Conse-quently, Ih(x) neither equals zero nor gives its minimum value at bothx = mh and x = mh. 2

8.4 Probabilitic subexponential behavior

In the previous section, we already demonstrated the usage of large deviationstechniques to compute the exponent of an exponentially decayed probability.However, no effort so far is placed on studying its subexponential behavior. Ob-serve that the two sequences, an = (1/n) exp−2n and bn = (1/

√n) exp−2n,

have the same exponent, but contain different subexponential terms. Such subex-ponential terms actually affect their respective rate of convergence when n is notlarge. Let us begin with the introduction of (a variation of) the Berry-Esseentheorem [5, sec.XVI. 5], which is later applied to evaluate the subexponentialdetail of a desired probability.

8.4.1 Berry-Esseen theorem for compound i.i.d. sequence

The Berry-Esseen theorem [5, sec.XVI. 5] states that the distribution of the sumof independent zero-mean random variables Xini=1, normalized by the standarddeviation of the sum, differs from the Gaussian distribution by at most C rn/s

3n,

where s2n and rn are respectively sums of the marginal variances and the marginal

absolute third moments, and C is an absolute constant. Specifically, for everya ∈ <, ∣∣∣∣Pr 1

sn(X1 + · · ·+Xn) ≤ a

− Φ(a)

∣∣∣∣ ≤ Crns3n

, (8.4.1)

where Φ(·) represents the unit Gaussian cdf. The striking feature of this theoremis that the upper bound depends only on the variance and the absolute third

159

moment, and hence, can provide a good asymptotic estimate based on only thefirst three moments. The absolute constant C is commonly 6 [5, sec. XVI. 5,Thm. 2]. When Xnni=1 are identically distributed, in addition to independent,the absolute constant can be reduced to 3, and has been reported to be improveddown to 2.05 [5, sec. XVI. 5, Thm. 1].

The samples that we concern in this section actually consists of two i.i.d. se-quences (and, is therefore named compound i.i.d. sequence.) Hence, we need tobuild a Berry-Esseen theorem based on compound i.i.d. samples. We begin withthe introduction of the smoothing lemma.

Lemma 8.27 Fix the bandlimited filtering function

vT(x) ,

1− cos(Tx)

πTx2.

For any cumulative distribution function H(·) on the real line <,

supx∈<|∆T (x)| ≥ 1

2η − 6

Tπ√

2πh

(T√

),

where

∆T (t) ,∫ ∞−∞

[H(t− x)− Φ(t− x)]× vT(x)dx, η , sup

x∈<|H(x)− Φ(x)| ,

and

h(u) ,

u

∫ ∞u

1− cos(x)

x2dx

2u+ 1− cos(u)− u

∫ u

0

sin(x)

xdx, if u ≥ 0;

0, otherwise.

Proof: The right-continuity of the cdf H(·) and the continuity of the Gaussianunit cdf Φ(·) together indicate the right-continuity of |H(x)− Φ(x)|, which inturn implies the existence of x0 ∈ < satisfying

either η = |H(x0)− Φ(x0)| or η = limx↑x0|H(x)− Φ(x)| > |H(x0)− Φ(x0)| .

We then distinguish between three cases:

Case A) η = H(x0)− Φ(x0)

Case B) η = Φ(x0)−H(x0)

Case C) η = limx↑x0|H(x)− Φ(x)| > |H(x0)− Φ(x0)| .

160

Case A) η = H(x0)− Φ(x0). In this case, we note that for s > 0,

H(x0 + s)− Φ(x0 + s) ≥ H(x0)−[Φ(x0) +

s√2π

](8.4.2)

= η − s√2π, (8.4.3)

where (8.4.2) follows from supx∈< |Φ′(x)| = 1/√

2π. Observe that (8.4.3)implies

H

(x0 +

√2π

2η − x

)− Φ

(x0 +

√2π

2η − x

)

≥ η − 1√2π

(√2π

2η − x

)=

η

2+

x√2π,

for |x| < η√

2π/2. Together with the fact that H(x) − Φ(x) ≥ −η for allx ∈ <, we obtain

supx∈<|∆T (x)|

≥ ∆T

(x0 +

√2π

)

=

∫ ∞−∞

[H

(x0 +

√2π

2η − x

)− Φ

(x0 +

√2π

2η − x

)]× v

T(x)dx

=

∫[|x|<η

√2π/2]

[H

(x0 +

√2π

2η − x

)− Φ

(x0 +

√2π

2η − x

)]vT(x)dx

+

∫[|x|≥η

√2π/2]

[H

(x0 +

√2π

2η − x

)− Φ

(x0 +

√2π

2η − x

)]vT(x)dx

≥∫

[|x|<η√

2π/2]

2+

x√2π

]× v

T(x)dx+

∫[|x|≥η

√2π/2]

(−η)× vT(x)dx

=

∫[|x|<η

√2π/2]

η

2× v

T(x)dx+

∫[|x|≥η

√2π/2]

(−η)× vT(x)dx, (8.4.4)

where the last equality holds because of the symmetry of the filtering func-

161

tion, vT(·). The quantity of

∫[|x|≥η

√2π/2] vT (x)dx can be derived as follows:∫

[|x|≥η√

2π/2]vT(x)dx = 2

∫ ∞η√

2π/2

vT(x)dx

=2

π

∫ ∞η√

2π/2

1− cos(Tx)

Tx2dx

=2

π

∫ ∞ηT√

2π/2

1− cos(x)

x2dx

=4

ηTπ√

2πh

(T√

).

Continuing from (8.4.4),

supx∈<|∆T (x)|

≥ η

2

[1− 4

ηTπ√

2πh

(T√

)]− η ·

[4

ηTπ√

2πh

(T√

)]

=1

2η − 6

Tπ√

2πh

(T√

).

Case B) η = Φ(x0)−H(x0). Similar to case A), we first derive for s > 0,

Φ(x0 − s)−H(x0 − s) ≥[Φ(x0)− s√

]−H(x0) = η − s√

2π,

and then obtain

H

(x0 −

√2π

2η − x

)− Φ

(x0 −

√2π

2η − x

)

≥ η − 1√2π

(√2π

2η + x

)=

η

2− x√

2π,

for |x| < η√

2π/2. Together with the fact that H(x) − Φ(x) ≥ −η for all

162

x ∈ <, we obtain

supx∈<|∆T (x)|

≥ ∆T

(x0 −

√2π

)

≥∫

[|x|<η√

2π/2]

2− x√

]× v

T(x)dx+

∫[|x|≥η

√2π/2]

(−η)× vT(x)dx

=

∫[|x|<η

√2π/2]

η

2× v

T(x)dx+

∫[|x|≥η

√2π/2]

(−η)× vT(x)dx

=1

2η − 6

Tπ√

2πh

(T√

).

Case C) η = limx↑x0 |H(x)− Φ(x)| > |H(x0)− Φ(x0)|.In this case, we claim that for s > 0, either

H(x0 + s)− Φ(x0 + s) ≥ η − s√2π

orΦ(x0 − s)−H(x0 − s) ≥ η − s√

holds, because if it were not true, thenH(x0 + s)− Φ(x0 + s) < η − s√

2πΦ(x0 − s)−H(x0 − s) < η − s√

lims↓0

H(x0 + s)− Φ(x0) ≤ η

Φ(x0)− lims↓0

H(x0 − s) ≤ η

which immediately gives us H(x0) = Φ(x0) − η, contradicting to thepremise of the current case. Following this claim, the desired result canbe proved using the procedure of either case A) or case B); and hence, weomit it. 2

163

Lemma 8.28 For any cumulative distribution function H(·) with characteristicfunction ϕ

H(ζ),

η ≤ 1

π

∫ T

−T

∣∣∣ϕH (ζ)− e−(1/2)ζ2∣∣∣ dζ|ζ| +

12

Tπ√

2πh

(T√

),

where η and h(·) are defined in Lemma 8.27.

Proof: Observe that ∆T (t) in Lemma 8.27 is nothing but a convolution of vT(·)

and H(x)− Φ(x). The characteristic function (or Fourier transform) of vT(·) is

ωT(ζ) =

1− |ζ|

T, if |ζ| ≤ T ;

0, otherwise.

By the Fourier inversion theorem [5, sec. XV.3],

d (∆T (x))

dx=

1

∫ ∞−∞

e−jζx[ϕH

(ζ)− e−(1/2)ζ2]ωT(ζ)dζ.

Integrating with respect to x, we obtain

∆T (x) =1

∫ T

−Te−jζx

[ϕH

(ζ)− e−(1/2)ζ2]

−jζωT(ζ)dζ.

Accordingly,

supx∈<|∆T (x)| = sup

x∈<

1

∣∣∣∣∣∣∫ T

−Te−jζx

[ϕH

(ζ)− e−(1/2)ζ2]

−jζωT(ζ)dζ

∣∣∣∣∣∣≤ sup

x∈<

1

∫ T

−T

∣∣∣∣∣∣e−jζx[ϕH

(ζ)− e−(1/2)ζ2]

−jζωT(ζ)

∣∣∣∣∣∣ dζ= sup

x∈<

1

∫ T

−T

∣∣∣ϕH (ζ)− e−(1/2)ζ2∣∣∣ · |ωT (ζ)|dζ

|ζ|

≤ supx∈<

1

∫ T

−T

∣∣∣ϕH (ζ)− e−(1/2)ζ2∣∣∣ dζ|ζ|

=1

∫ T

−T

∣∣∣ϕH (ζ)− e−(1/2)ζ2∣∣∣ dζ|ζ| .

Together with the result in Lemma 8.27, we finally have

η ≤ 1

π

∫ T

−T

∣∣∣ϕH (ζ)− e−(1/2)ζ2∣∣∣ dζ|ζ| +

12

Tπ√

2πh

(T√

).

2

164

Theorem 8.29 (Berry-Esseen theorem for compound i.i.d. sequences)Let Yn =

∑ni=1 Xi be the sum of independent random variables, among which

Xidi=1 are identically Gaussian distributed, and Xini=d+1 are identically dis-tributed but not necessarily Gaussian. Denote the mean-variance pair of X1 andXd+1 by (µ, σ2) and (µ, σ2), respectively. Define

ρ , E[|X1 − µ|3

], ρ , E

[|Xd+1 − µ|3

], and s2

n = Var[Yn] = σ2d+ σ2(n− d).

Also denote the cdf of (Yn − E[Yn])/sn by Hn(·). Then for all y ∈ <,

|Hn(y)− Φ(y)| ≤ Cn,d2√π

(n− d− 1)(2(n− d)− 3

√2) ρ

σ2sn,

where Cn,d is the unique positive number satisfying

π

6Cn,d − h(Cn,d) =

√π(2(n− d)− 3

√2)

12(n− d− 1)

×

( √6π

2(3−√

2)3/2

+9

2(11− 6√

2)√n− d

),

provided that n− d ≥ 3.

Proof: From Lemma 8.28,

π ·η ≤∫ T

−T

∣∣∣∣ϕd( ζ

sn

)ϕn−d

sn

)− e−(1/2)ζ2

∣∣∣∣ dζ|ζ|+ 12

T√

2πh

(T√

), (8.4.5)

where ϕ(·) and ϕ(·) are respectively the characteristic functions of (X1−µ) and(Xd+1 − µ). Observe that the integrand satisfies∣∣∣∣ϕd( ζ

sn

)ϕn−d

sn

)− e−(1/2)ζ2

∣∣∣∣=

∣∣∣∣ϕn−d( ζ

sn

)− e−(1/2)(σ2(n−d)/s2n)ζ2

∣∣∣∣ · e−(1/2)(σ2d/s2n)ζ2

≤ (n− d)

∣∣∣∣ϕ( ζ

sn

)− e−(1/2)(σ2/s2n)ζ2

∣∣∣∣ · e−(1/2)(σ2d/s2n)ζ2γn−d−1, (8.4.6)

≤ (n− d)

∣∣∣∣ϕ( ζ

sn

)−(

1− σ2

2s2n

ζ2

)∣∣∣∣+

∣∣∣∣(1− σ2

2s2n

ζ2

)− e−(1/2)(σ2/s2n)ζ2

∣∣∣∣·e−(1/2)(σ2d/s2n)ζ2γn−d−1 (8.4.7)

where the quantity γ in (8.4.6) requires [5, sec. XVI. 5, (5.5)] that∣∣∣∣ϕ( ζ

sn

)∣∣∣∣ ≤ γ and∣∣∣e−(1/2)(σ2/s2n)ζ2

∣∣∣ ≤ γ.

165

By equation (26.5) in [1], we upperbound the first and second terms in the bracesof (8.4.7) respectively by∣∣∣∣ϕ( ζ

sn

)− 1 +

σ2

2s2n

ζ2

∣∣∣∣ ≤ ρ

6s3n

|ζ|3 and

∣∣∣∣1− σ2

2s2n

ζ2 − e−(1/2)(σ2/s2n)ζ2∣∣∣∣ ≤ σ4

8s4n

ζ4.

Continuing the derivation of (8.4.7),∣∣∣∣ϕd( ζ

sn

)ϕn−d

sn

)− e−(1/2)ζ2

∣∣∣∣≤ (n− d)

6s3n

|ζ|3 +σ4

8s4n

ζ4

)· e−(1/2)(σ2d/s2n)ζ2γn−d−1. (8.4.8)

It remains to choose γ that bounds both |ϕ(ζ/sn)| and exp−(1/2) (σ2/s2n) ζ2

from above.

From the elementary property of characteristic functions,∣∣∣∣ϕ( ζ

sn

)∣∣∣∣ ≤ 1− σ2

2s2n

ζ2 +ρ

6s3n

|ζ3|,

ifσ2

2s2n

ζ2 ≤ 1. (8.4.9)

For those ζ ∈ [−T, T ] (which is exactly the range of integration operation in(8.4.5)), we can guarantee the validity of (8.4.9) by defining

T ,σ2snρ

[√2(n− d)− 3

n− d− 1

],

and obtain

σ2

2s2n

ζ2 ≤ σ2

2s2n

T 2 ≤ σ6

2ρ2

[√2(n− d)− 3

n− d− 1

]2

≤ 1

2

[√2(n− d)− 3

n− d− 1

]2

≤ 1,

for n− d ≥ 3. Hence, for |ζ| ≤ T ,∣∣∣∣ϕ( ζ

sn

)∣∣∣∣ ≤ 1 +

(− σ2

2s2n

ζ2 +ρ

6s3n

|ζ3|)

≤ exp

− σ2

2s2n

ζ2 +ρ

6s3n

|ζ3|

≤ exp

− σ2

2s2n

ζ2 +ρ

6s3n

Tζ2

= exp

−(σ2

2s2n

− ρT

6s3n

)ζ2

= exp

−(3−

√2)(n− d)

6(n− d− 1)

σ2

s2n

ζ2

.

166

We can then choose

γ , exp

−(3−

√2)(n− d)

6(n− d− 1)

σ2

s2n

ζ2

.

The above selected γ is apparently an upper bound of exp −(1/2) (σ2/s2n) ζ2.

By taking the chosen γ into (8.4.8), the integration part in (8.4.5) becomes∫ T

−T

∣∣∣∣ϕd( ζ

sn

)ϕ(n−d)

sn

)− e−(1/2)ζ2

∣∣∣∣ dζ|ζ|≤

∫ T

−T(n− d)

6s3n

ζ2 +σ4

8s4n

|ζ|3)

(8.4.10)

× exp

−(

3σ2d+ (3−√

2)σ2(n− d)) ζ2

6s2n

≤∫ T

−T(n− d)

6s3n

ζ2 +σ4

8s4n

|ζ|3)

× exp

−(

(3−√

2)σ2d+ (3−√

2)σ2(n− d)) ζ2

6s2n

=

∫ T

−T(n− d)

6s3n

ζ2 +σ4

8s4n

|ζ|3)· exp

−(3−

√2)

6ζ2

=

∫ ∞−∞

(n− d)

6s3n

ζ2 +σ4

8s4n

|ζ|3)· exp

−(3−

√2)

6ζ2

=(n− d)σ2

s2n

· ρ

σ2sn·

(3√

6(3−√

2)3/2+

9

2(11− 6√

2)· σ

3

ρ· σsn

)

≤ 1

T

(√2(n− d)− 3

n− d− 1

)(3√

6(3−√

2)3/2+

9

2(11− 6√

2)

1√n− d

),(8.4.11)

where the last inequality follows respectively from

• (n− d)σ2 ≤ s2n,

• ρ/(σ2sn) = (1/T )(√

2(n− d)− 3)/(n− d− 1),

• σ3 ≤ ρ,

• and σ/sn ≤ 1/√n− d.

Taking (8.4.11) into (8.4.5), we finally obtain

π · η ≤ 1

T

(√2(n− d)− 3

n− d− 1

)(3√

6(3−√

2)3/2+

9

2(11− 6√

2)

1√n− d

)

+12

T√

2πh

(T√

),

167

-1

-0.5

0

0.5

1

1.5

2

2.5

0 1 2 3 4 5

π

6u− h(u)

u

Figure 8.3: Function of (π/6)u− h(u).

or equivalently,

π

6u− h(u) ≤

√2π[√

2(n− d)− 3]

12(n− d− 1)

( √6π

2(3−√

2)3/2+

9

2(11− 6√

2)√n− d

),

(8.4.12)for u , ηT

√2π/2. Observe that the continuous function (π/6)u− h(u) equals 0

at u = u0 ≈ 2.448, is negative for 0 < u < u0, and is positive for u > u0. Alsoobserve that (π/6)u− h(u) is monotonely increasing (up to infinity) for u > u0

(cf. Figure 8.3). Inequality (8.4.12) is thus equivalent to

u ≤ Cn,d,

where Cn,d (> u0) is defined in the statement of the theorem. The proof iscompleted by

η = u2

T√

2π≤ Cn,d

2

T√

2π= Cn,d

2√π

(n− d− 1)

(2(n− d)− 3√

2)

ρ

σ2sn.

2

8.4.2 Berry-Esseen Theorem with a sample-size depen-dent multiplicative coefficient for i.i.d. sequence

By letting d = 0, the Berry-Esseen inequality for i.i.d. sequences can also bereadily obtained from Theorem 8.29.

168

Corollary 8.30 (Berry-Esseen theorem with a sample-size dependentmultiplicative coefficient for i.i.d. sequence) Let Yn =

∑ni=1Xi be the sum

of independent random variables with common marginal distribution. Denotethe marginal mean and variance by (µ, σ2). Define ρ , E

[|X1 − µ|3

]. Also

denote the cdf of (Yn − nµ)/(√nσ) by Hn(·). Then for all y ∈ <,

|Hn(y)− Φ(y)| ≤ Cn2(n− 1)

√π(2n− 3

√2) ρ

σ3√n,

where Cn is the unique positive solution of

π

6u− h(u) =

√π(2n− 3

√2)

12(n− 1)

( √6π

2(3−√

2)3/2

+9

2(11− 6√

2)√n

),

provided that n ≥ 3.

Let us briefly remark on the previous corollary. We observe from numericals2

that the quantity

Cn2√π

(n− 1)(2n− 3

√2)

is decreasing in n, and ranges from 3.628 to 1.627 (cf. Figure 8.4.) Numericalresult shows that it lies below 2 when n ≥ 9, and is smaller than 1.68 as n ≥ 100.In other words, we can upperbound this quantity by 1.68 as n ≥ 100, andtherefore, establish a better estimate of the Berry-Esseen constant than that in[5, sec. XVI. 5, Thm. 1].

2We can upperbound Cn by the unique positive solution Dn of

π

6u− h(u) =

√π

6

√6π

2(3−√

2)3/2 +

9

2(11− 6√

2)√n

,

which is strictly decreasing in n. Hence,

Cn2√π

(n− 1)(2n− 3

√2) ≤ En , Dn

2√π

(n− 1)(2n− 3

√2) ,

and the right-hand-side of the above inequality is strictly decreasing (since both Dn and(n − 1)/(2n − 3

√2) are decreasing) in n, and ranges from E3 = 4.1911, . . . ,E9 = 2.0363,. . . ,

E100 = 1.6833 to E∞ = 1.6266. If the property of strictly decreasingness is preferred, one canuse Dn instead of Cn in the Berry-Esseen inequality. Note that both Cn and Dn converges to2.8831 . . . as n goes to infinity.

169

0

1

1.68

2

3

3 10 20 30 40 50 75 100 150 200

Cn2√π

(n− 1)

(2n− 3√

2)

n

Figure 8.4: The Berry-Esseen constant as a function of the sample sizen. The sample size n is plotted in log-scale.

8.4.3 Probability bounds using Berry-Esseen inequality

We now investigate an upper probability bound for the sum of a compoundi.i.d. sequence.

Basically, the approach used here is the large deviation technique, which isconsidered the technique of computing the exponent of an exponentially decayedprobability. Other than the large deviations, the previously derived Berry-Esseentheorem is applied to evaluate the subexponential detail of the desired probability.With the help of these two techniques, we can obtain a good estimate of thedestined probability. Some notations used in large deviations are introducedhere for the sake of completeness.

Let Yn =∑n

i=1Xi be the sum of independent random variables. Denote thedistribution of Yn by Fn(·). The twisted distribution with parameter θ corre-sponding to Fn(·) is defined by

dF (θ)n (y) ,

eθy dFn(y)∫Y e

θy dFn(y)=eθy dFn(y)

Mn(θ), (8.4.13)

whereMn(θ) is the moment generating function corresponding to the distribution

Fn(·). Let Y(θ)n be a random variable having F

(θ)n (·) as its probability distribution.

Analogously, we can twist the distribution of Xi with parameter θ, yielding its

170

twisted counterpart X(θ)i . A basic large deviation result is that

Y (θ)n =

n∑i=1

X(θ)i ,

and X(θ)i ni=1 still forms a compound i.i.d. sequence.

Lemma 8.31 (probability upper bound) Let Yn =∑n

i=1Xi be the sum ofindependent random variables, among which Xidi=1 are identically Gaussiandistributed with mean µ > 0 and variance σ2, and Xini=d+1 have commondistribution as minXi, 0. Denote the variance and the absolute third moment

of the twisted random variable X(θ)d+1 by σ2(θ) and ρ(θ), respectively. Also define

s2n(θ) , Var[Y

(θ)n ] and Mn(θ) , E

[eθYn

]. Then

Pr Yn ≤ 0 ≤ B(d, (n− d);µ2, σ2

),

An,dMn(θ), when E[Yn] ≥ 0;1−Bn,dMn(θ), otherwise,

where

An,d , min

(1√

2π|θ|sn(θ)+ Cn,d

4(n− d− 1)ρ(θ)√π[2(n− d)− 3

√2]σ2(θ)sn(θ)

, 1

),

Bn,d ,1

|θ|sn(θ)

× exp

−|θ|sn(θ)Φ−1

(Φ(0) +

4Cn,d(n− d− 1)ρ(θ)√π[2(n− d)− 3

√2]σ2(θ)sn(θ)

+1

|θ|sn(θ)

),

and θ is the unique solution of ∂Mn(θ)/∂θ = 0, provided that n− d ≥ 3. (An,d,Bn,d and θ are all functions of µ2 and σ2. Here, we drop them to simplify thenotations.)

Proof: We derive the upper bound of Pr (Yn ≤ 0) under two different situations:

E[Yn] ≥ 0 and E[Yn] < 0.

Case A) E[Yn] ≥ 0. Denote the cdf of Yn by Fn(·). Choose θ to satisfy∂Mn(θ)/∂θ = 0. By the convexity of the moment generating functionMn(θ) with respect to θ and the nonnegativity of E[Yn] = ∂Mn(θ)/∂θ|θ=0,the θ satisfying ∂Mn(θ)/∂θ = 0 uniquely exists and is non-positive. Denote

171

the distribution of(Y

(θ)n − E[Y

(θ)n ])/sn(θ) by Hn(·). Then from (8.4.13),

we have

Pr (Yn ≤ 0) =

∫ 0

−∞dFn(y)

= Mn(θ)

∫ 0

−∞e− θydF (θ)

n (y)

= Mn(θ)

∫ 0

−∞e− θ sn(θ) ydHn(y). (8.4.14)

Integrating by parts on (8.4.14) with

λ(dy) , −θ sn(θ) e− θ sn(θ) ydy,

and then applying Theorem 8.29 yields∫ 0

−∞e− θ sn(θ) ydHn(y)

=

∫ 0

−∞[Hn(0)−Hn(y)]λ(dy)

≤∫ 0

−∞

[Φ(0)− Φ(y) + Cn,d

4(n− d− 1)ρ(θ)√π(2(n− d)− 3

√2)σ2(θ)sn(θ)

]λ(dy)

=

∫ 0

−∞[Φ(0)− Φ(y)]λ(dy) + Cn,d

4(n− d− 1)ρ(θ)√π(2(n− d)− 3

√2)σ2(θ)sn(θ)

= eθ2 s2n(θ)/2Φ(θ sn(θ)) + Cn,d

4(n− d− 1)ρ(θ)√π(2(n− d)− 3

√2)σ2(θ)sn(θ)

(8.4.15)

≤ 1√2π |θ| sn(θ)

+ Cn,d4(n− d− 1)ρ(θ)

√π(2(n− d)− 3

√2)σ2(θ)sn(θ)

,

where (8.4.15) holds by, again, applying integration by part, and the lastinequality follows from for u ≥ 0,

Φ(−u) =

∫ −u−∞

1√2πe−t

2/2dt

≤∫ −u−∞

1√2π

(1 +

1

t2

)e−t

2/2dt =1√2πu

e−u2/2.

On the other hand,

∫ 0

−∞e− θ sn(θ) ydHn(y) ≤ 1 can be established by ob-

172

serving that

Mn(θ)

∫ 0

−∞e−θ sn(θ) ydHn(y) = PrYn ≤ 0

= PreθYn ≥ 1

≤ E[eθYn ] = Mn(θ).

Finally, Case A concludes to

Pr(Yn ≤ 0)

≤ min

(1√

2π |θ| sn(θ)+

4Cn,d(n− d− 1)ρ(θ)√π(2(n− d)− 3

√2)σ2(θ)sn(θ)

, 1

)Mn(θ).

Case B) E[Yn] < 0.

By following a procedure similar to case A, together with the observationthat θ > 0 for E[Yn] < 0, we obtain

Pr(Yn > 0)

=

∫ ∞0

dFn(y)

= Mn(θ)

∫ ∞0

e−θ sn(θ) ydHn(y)

≥ Mn(θ)

∫ α

0

e−θ sn(θ) ydHn(y), for some α > 0 specified later

≥ Mn(θ)e−θ sn(θ)α

∫ α

0

dHn(y)

= Mn(θ)e−θ sn(θ)α [H(α)−H(0)]

≥ Mn(θ)e−θ sn(θ)α

×

[Φ(α)− Φ(0)− Cn,d

4(n− d− 1)ρ(θ)√π(2(n− d)− 3

√2)σ2(θ)sn(θ)

].

The proof is completed by letting α be the solution of

Φ(α) = Φ(0) + Cn,d4(n− d− 1)ρ(θ)

√π(2(n− d)− 3

√2)σ2(θ)sn(θ)

+1

|θ|sn(θ).

2

The upper bound obtained in Lemma 8.31 in fact depends only upon theratio of (µ/σ) or (1/2)(µ2/σ2). Therefore, we can rephrase Lemma 8.31 as thenext corollary.

173

Corollary 8.32 Let Yn =∑n

i=1Xi be the sum of independent random variables,among which Xidi=1 are identically Gaussian distributed with mean µ > 0and variance σ2, and Xini=d+1 have common distribution as minXi, 0. Let

γ , (1/2)(µ2/σ2). Then

Pr Yn ≤ 0 ≤ B (d, (n− d); γ)

where

B (d, (n− d); γ) ,

An,d(γ)Mn(γ), ifd

n≥ 1−

√4πγ

e−γ −√

4πγΦ(√

2γ);

1−Bn,d(γ)Mn(γ), otherwise,

An,d(γ) , min

(1√

2π∣∣λ−√2γ

∣∣ sn(λ)+

4Cn,d(n− d− 1)ρ(λ)√π[2(n− d)− 3

√2]σ2(λ)sn(λ)

, 1

),

Bn,d(γ) ,1∣∣λ−√2γ∣∣ sn(λ)

exp−∣∣∣λ−√2γ

∣∣∣ sn(λ)

× Φ−1

(1

2+

4Cn,d(n− d− 1)ρ(λ)√π[2(n− d)− 3

√2]σ2(λ)sn(λ)

+1∣∣λ−√2γ∣∣ sn(θ)

),

Mn(λ) , e−nγedλ2/2[Φ (−λ) eλ

2/2 + eγΦ(√

2γ)]n−d

,

σ2(λ) , − d

n− d− nd

(n− d)2λ2 +

n

n− d1

1 +√

2πλeγΦ(√

2γ),

ρ(λ) ,n

(n− d)

λ

[1 +√

2πλeγΦ(√

2γ)]

1− d(n+ d)

(n− d)2λ2

+2

[n2

(n− d)2λ2 + 2

]e−d(2n−d)λ2/(n−d)2

− d

n− d

[n2

(n− d)2λ2 + 3

]√2πλeγΦ(

√2γ)

− 2n

n− d

[n2

(n− d)2λ2 + 3

]√2πλeλ

2/2Φ

(− n

n− dλ

),

s2n(λ) , n

(− d

n− dλ2 +

1

1 +√

2πλeγΦ(√

2γ)

),

and λ is the unique positive solution of

λe(1/2)λ2Φ(−λ) =1√2π

(1− d

n

)− d

neγΦ(

√2γ)λ,

provided that n− d ≥ 3.

174

Proof: The notations used in this proof follows those in Lemma 8.31.

Let λ , (µ/σ) + σθ =√

2γ + σθ. Then the moment generating functions ofX1 and Xd+1 , minX1, 0 are respectively

MX1(λ) = e−γeλ2/2 and MXd+1

(λ) = Φ (−λ) e−γeλ2/2 + Φ

(√2γ).

Therefore, the moment generating function of Yn =∑n

i=1Xi becomes

Mn(λ) = MdX1

(λ)Mn−dXd+1

(λ) = e−nγedλ2/2[Φ (−λ) eλ

2/2 + eγΦ(√

2γ)]n−d

.

(8.4.16)Since λ and θ have a one-to-one correspondence, solving ∂Mn(θ)/∂θ = 0 isequivalent to solving ∂Mn(λ)/∂λ = 0, which from (8.4.16) results in

e(1/2)λ2Φ(−λ) =1√2πλ

(1− d

n

)− d

neγΦ(

√2γ). (8.4.17)

As it turns out, the solution λ of the above equation depends only on γ.

Next we derive

σ2(λ) ,σ2(θ)

σ2

∣∣∣∣θ=(λ−

√2γ)/σ

, ρ(λ) ,ρ(θ)

σ3

∣∣∣∣θ=(λ−

√2γ)/σ

and

s2n(λ) ,

s2n(θ)

σ2

∣∣∣∣θ=(λ−

√2γ)/σ

.

By replacing e(1/2)λ2Φ(−λ) with

1√2πλ

(1− d

n

)− d

neγΦ(

√2γ)

(cf. (8.4.17)), we obtain

σ2(λ) = − d

n− d− nd

(n− d)2λ2 +

n

n− d1

1 +√

2πλeγΦ(√

2γ), (8.4.18)

ρ(λ) =n

(n− d)

λ

[1 +√

2πλeγΦ(√

2γ)]

1− d(n+ d)

(n− d)2λ2

+2

[n2

(n− d)2λ2 + 2

]e−d(2n−d)λ2/(n−d)2

− d

n− d

[n2

(n− d)2λ2 + 3

]√2πλeγΦ(

√2γ)

− 2n

n− d

[n2

(n− d)2λ2 + 3

]√2πλeλ

2/2Φ

(− n

n− dλ

),(8.4.19)

175

and

s2n(λ) = n

(− d

n− dλ2 +

1

1 +√

2πλeγΦ(√

2γ)

). (8.4.20)

Taking (8.4.18), (8.4.19), (8.4.20) and θ = (λ −√

2γ)/σ into An,d and Bn,d inLemma 8.31, we obtain:

An,d = min

(1√

2π∣∣λ−√2γ

∣∣ sn(λ)+ Cn,d

4(n− d− 1)ρ(λ)√π[2(n− d)− 3

√2]σ2(λ)sn(λ)

, 1

)

and

Bn,d =1∣∣λ−√2γ∣∣ sn(λ)

exp−∣∣∣λ−√2γ

∣∣∣ sn(λ)

× Φ−1

(Φ(0) +

4Cn,d(n− d− 1)ρ(λ)√π[2(n− d)− 3

√2]σ2(λ)sn(λ)

+1∣∣λ−√2γ∣∣ sn(θ)

),

which depends only on λ and γ, as desired. Finally, a simple derivation yields

E[Yn] = dE[X1] + (n− d)E[Xd+1]

= σ(d√

2γ + (n− d)[−(1/

√2π)e−γ +

√2γΦ(−

√2γ)]),

and hence,

E[Yn] ≥ 0 ⇔ d

n≥ 1−

√4πγ

e−γ −√

4πγΦ(√

2γ).

2

We close this section by remarking that the parameters in the upper boundsof Corollary 8.32, although they have complex expressions, are easily computer-evaluated, once their formulas are established.

8.5 Generalized Neyman-Pearson Hypothesis Testing

The general expression of the Neyman-Pearson type-II error exponent subject toa constant bound on the type-I error has been proved for arbitrary observations.In this section, we will state the results in terms of the ε-inf/sup-divergencerates.

Theorem 8.33 (Neyman-Pearson type-II error exponent for a fixedtest level [3]) Consider a sequence of random observations which is assumed to

176

have a probability distribution governed by either PX (null hypothesis) or PX

(alternative hypothesis). Then, the type-II error exponent satisfies

limδ↑δ

Dδ(X‖X) ≤ lim supn→∞

− 1

nlog β∗n(ε) ≤ Dε(X‖X)

limδ↑ε

Dδ(X‖X) ≤ lim infn→∞

− 1

nlog β∗n(ε) ≤ Dε(X‖X)

where β∗n(ε) represents the minimum type-II error probability subject to a fixedtype-I error bound ε ∈ [0, 1).

The general formula for Neyman-Pearson type-II error exponent subject to anexponential test level has also been proved in terms of the ε-inf/sup-divergencerates.

Theorem 8.34 (Neyman-Pearson type-II error exponent for an expo-nential test level) Fix s ∈ (0, 1) and ε ∈ [0, 1). It is possible to choose decisionregions for a binary hypothesis testing problem with arbitrary datawords ofblocklength n, (which are governed by either the null hypothesis distributionPX or the alternative hypothesis distribution PX ,) such that

lim infn→∞

− 1

nlog βn ≥ Dε(X

(s)‖X) and lim sup

n→∞− 1

nlogαn ≥ D(1−ε)(X

(s)‖X),

(8.5.1)or

lim infn→∞

− 1

nlog βn ≥ Dε(X

(s)‖X) and lim sup

n→∞− 1

nlogαn ≥ D(1−ε)(X

(s)‖X),

(8.5.2)

where X(s)

exhibits the tilted distributionsP

(s)

Xn

∞n=1

defined by

dP(s)

Xn(xn) ,

1

Ωn(s)exp

s log

dPXn

dPXn

(xn)

dPXn(xn),

and

Ωn(s) ,∫Xn

exp

s log

dPXn

dPXn

(xn)

dPXn(xn).

Here, αn and βn are the type-I and type-II error probabilities, respectively.

Proof: For ease of notation, we use X to represent X(s)

. We only prove (8.5.1);(8.5.2) can be similarly demonstrated.

177

By definition of dP(s)

Xn(·), we have

1

s

[1

ndXn(Xn‖Xn)

]+

1

1− s

[1

ndXn(Xn‖Xn)

]= − 1

s(1− s)

[1

nlog Ωn(s)

].

(8.5.3)Let Ω , lim supn→∞(1/n) log Ωn(s). Then, for any γ > 0, ∃N0 such that ∀ n >N0,

1

nlog Ωn(s) < Ω + γ.

From (8.5.3),

dXn‖Xn(θ)

, lim infn→∞

Pr

1

ndXn(Xn‖Xn) ≤ θ

= lim inf

n→∞Pr

− 1

1− s

[1

ndXn(Xn‖Xn)

]− 1

s(1− s)

[1

nlog Ωn(s)

]≤ θ

s

= lim inf

n→∞Pr

1

ndXn(Xn‖Xn) ≥ −1− s

sθ − 1

s

[1

nlog Ωn(s)

]≤ lim inf

n→∞Pr

1

ndXn(Xn‖Xn) > −1− s

sθ − 1

sΩ− γ

s

= 1− lim sup

n→∞Pr

1

ndXn(Xn‖Xn) ≤ −1− s

sθ − 1

sΩ− γ

s

= 1− dXn‖Xn

(−1− s

sθ − 1

sΩ− γ

s

).

Thus,

Dε(X‖X) , supθ : dXn‖Xn(θ) ≤ ε

≥ sup

θ : 1− dXn‖Xn

(−1− s

sθ − 1

s(Ω + γ)

)< ε

= sup

− 1

1− s(Ω + γ)− s

1− sθ′ : dXn‖Xn(θ′) > 1− ε

= − 1

1− s(Ω + γ)− s

1− sinfθ′ : dXn‖Xn(θ′) > 1− ε

= − 1

1− s(Ω + γ)− s

1− ssupθ′ : dXn‖Xn(θ′) ≤ 1− ε

= − 1

1− s(Ω + γ)− s

1− sD1−ε(X‖X).

Finally, choosing the acceptance region for null hypothesis asxn ∈ X n :

1

nlog

dPXn

dPXn

(xn) ≥ Dε(X‖X)

,

178

we obtain:

βn = PXn

1

nlog

dPXn

dPXn

(Xn) ≥ Dε(X‖X)

≤ exp

−nDε(X‖X)

,

and

αn = PXn

1

nlog

dPXn

dPXn

(Xn) < Dε(X‖X)

≤ PXn

1

nlog

dPXn

dPXn

(Xn) < − 1

1− s(Ω + γ)− s

1− sD1−ε(X‖X)

= PXn

−1

1− s

[1

nlog

dPXn

dPXn

(Xn)

]− 1

s(1− s)

[1

nlog Ωn(s)

]< − Ω + γ

s(1− s)− 1

1− sD1−ε(X‖X)

= PXn

1

nlog

dPXn

dPXn

(Xn) > D1−ε(X‖X) +1

s

[Ω− 1

nlog Ωn(s)

]+γ

s

.

Then, for n > N0,

αn ≤ PXn

1

nlog

dPXn

dPXn

(Xn) > D1−ε(X‖X)

≤ exp

−nD1−ε(X‖X)

.

Consequently,

lim infn→∞

− 1

nlog βn ≥ Dε(X

(s)‖X) and lim sup

n→∞− 1

nlogαn ≥ D(1−ε)(X

(s)‖X).

2

179

Bibliography

[1] P. Billingsley. Probability and Measure, 2nd edition, New York, NY: JohnWiley and Sons, 1995.

[2] James A. Bucklew. Large Deviation Techniques in Decision, Simulation,and Estimation. Wiley, New York, 1990.

[3] P.-N. Chen, “General formulas for the Neyman-Pearson type-II errorexponent subject to fixed and exponential type-I error bound,” IEEETrans. Info. Theory, vol. 42, no. 1, pp. 316–323, November 1993.

[4] Jean-Dominique Deuschel and Daniel W. Stroock. Large Deviations. Aca-demic Press, San Diego, 1989.

[5] W. Feller, An Introduction to Probability Theory and its Applications, 2ndedition, New York, John Wiley and Sons, 1970.

180

Chapter 9

Channel Reliability Function

The channel coding theorem (cf. Chapter 5) shows that block codes with arbi-trarily small probability of block decoding error exist at any code rate smallerthan the channel capacity C. However, the result mentions nothing on the re-lation between the rate of convergence for the error probability and code rate(specifically for rate less than C.) For example, if C = 1 bit per channel usage,and there are two reliable codes respectively with rates 0.5 bit per channel usageand 0.25 bit per channel usage, is it possible for one to adopt the latter code evenif it results in lower information transmission rate? The answer is affirmative ifa higher error exponent is required.

In this chapter, we will first re-prove the channel coding theorem for DMC,using an alternative and more delicate method that describes the dependencebetween block codelength and the probability of decoding error.

9.1 Random-coding exponent

Definition 9.1 (channel block code) An (n,M) code for channel W n withinput alphabet X and output alphabet Y is a pair of mappings

f : 1, 2, · · · ,M → X n and g : Yn → 1, 2, · · · ,M.Its average error probability is given by

Pe(n,M)4=

1

M

M∑m=1

∑yn:g(yn) 6=m

W n(yn|f(m)).

Definition 9.2 (channel reliability function [1]) For any R ≥ 0, define thechannel reliability function E∗(R) for a channel W as the largest scalar β > 0such that there exists a sequence of (n,Mn) codes with

β ≤ lim infn→∞

− 1

nlogPe(n,Mn)

181

and

R ≤ lim infn→∞

1

nlogMn. (9.1.1)

From the above definition, it is obvious that E∗(0) = ∞; hence; in derivingthe bounds on E∗(R), we only need to consider the case of R > 0.

Definition 9.3 (random-coding exponent) The random coding exponentfor DMC with generic distribution PY |X is defined by

Er(R) , sup0≤s≤1

supPX

−sR− log∑y∈Y

(∑x∈X

PX(x)P1/(1+s)Y |X (y|x)

)1+s .

Theorem 9.4 (random coding exponent) For DMC with generic transitionprobability PY |X ,

E∗(R) ≥ Er(R)

for R ≥ 0.

Proof: The theorem holds trivially at R = 0 since E∗(0) = ∞. It remains toprove the theorem under R > 0.

Similar to the proof of channel capacity, the codeword is randomly selectedaccording to some distribution PXn . For fixed R > 0, choose a sequence ofMnn≥1 with R = limn→∞(1/n) logMn.

Step 1: Maximum likelihood decoder. Let c1, . . . , cMn ∈ X n denote theset of n-tuple block codewords selected, and let the decoding partition forsymbol m (namely, the set of channel outputs that classify to m) be

Um , yn : PY n|Xn(yn|cm) > PY n|Xn(yn|cm′), for all m′ 6= m.

Those channel outputs that are on the boundary, i.e.,

for some m and m′, PY n|Xn(yn|cm) = PY n|Xn(yn|cm′),

will be arbitrarily assigned to either m or m′.

Step 2: Property of indicator function for s ≥ 0. Let φm be the indicatorfunction of Um. Then1 for all s ≥ 0,

1− φm(yn) ≤

∑m′ 6=m,1≤m′≤Mn

[PY n|Xn(yn|cm′)PY n|Xn(yn|cm)

]1/(1+s)s

.

1(∑

i a1/(1+s)i

)sis no less than 1 when one of ai is no less than 1.

182

Step 3: Probability of error given codeword cm is transmitted. LetPe|m denote the probability of error given codeword cm is transmitted.Then

Pe|m ≤∑yn 6∈Um

PY n|Xn(yn|cm)

=∑yn∈Yn

PY n|Xn(yn|cm)[1− φm(yn)]

≤∑yn∈Yn

PY n|Xn(yn|cm)

∑m′ 6=m,1≤m′≤Mn

[PY n|Xn(yn|cm′)PY n|Xn(yn|cm)

]1/(1+s)s

=∑yn∈Yn

P1/(1+s)Y n|Xn (yn|cm)

∑m′ 6=m,1≤m′≤Mn

P1/(1+s)Y n|Xn (yn|cm′)

s

.

Step 4: Expectation of Pe|m.

E[Pe|m]

≤ E

[ ∑yn∈Yn

P1/(1+s)Y n|Xn (yn|cm)

∑m′ 6=m,1≤m′≤Mn

P1/(1+s)Y n|Xn (yn|cm′)

s]

=∑yn∈Yn

E

[P

1/(1+s)Y n|Xn (yn|cm)

∑m′ 6=m,1≤m′≤Mn

P1/(1+s)Y n|Xn (yn|cm′)

s]

=∑yn∈Yn

E[P

1/(1+s)Y n|Xn (yn|cm)

]E

[ ∑m′ 6=m,1≤m′≤Mn

P1/(1+s)Y n|Xn (yn|cm′)

s],

where the latter step follows because cm1≤m≤M are independent randomvariables.

Step 5: Jensen’s inequality for 0 ≤ s ≤ 1, and bounds on probabilityof error. By Jensen’s inequality, when 0 ≤ s ≤ 1,

E[ts] ≤ (E[t])s .

183

Therefore,

E[Pe|m]

≤∑yn∈Yn

E[P

1/(1+s)Y n|Xn (yn|cm)

]E

[ ∑m′ 6=m,1≤m′≤Mn

P1/(1+s)Y n|Xn (yn|cm′)

s]

≤∑yn∈Yn

E[P

1/(1+s)Y n|Xn (yn|cm)

](E

[ ∑m′ 6=m,1≤m′≤Mn

P1/(1+s)Y n|Xn (yn|cm′)

])s

=∑yn∈Yn

E[P

1/(1+s)Y n|Xn (yn|cm)

]( ∑m′ 6=m,1≤m′≤Mn

E[P

1/(1+s)Y n|Xn (yn|cm′)

])s

Since the codewords are selected with identical distribution, the expecta-tions

E[P

1/(1+s)Y n|Xn (yn|cm)

]should be the same for each m. Hence,

E[Pe|m]

≤∑yn∈Yn

E[P

1/(1+s)Y n|Xn (yn|cm)

]( ∑m′ 6=m,1≤m′≤Mn

E[P

1/(1+s)Y n|Xn (yn|cm′)

])s

=∑yn∈Yn

E[P

1/(1+s)Y n|Xn (yn|cm)

] ((Mn − 1)E

[P

1/(1+s)Y n|Xn (yn|cm)

])s= (Mn − 1)s

∑yn∈Yn

(E[P

1/(1+s)Y n|Xn (yn|cm)

])1+s

≤ M sn

∑yn∈Yn

(E[P

1/(1+s)Y n|Xn (yn|cm)

])1+s

= M sn

∑yn∈Yn

( ∑xn∈Xn

PXn(xn)P1/(1+s)Y n|Xn (yn|xn)

)1+s

Since the upper bound of E[Pe|m] is no longer dependent on m, the ex-pected average error probability E[Pe] can certainly be bounded by thesame bound, namely,

E[Pe] ≤M sn

∑yn∈Yn

( ∑xn∈Xn

PXn(xn)P1/(1+s)Y n|Xn (yn|xn)

)1+s

, (9.1.2)

which immediately implies the existence of an (n,Mn) code satisfying:

Pe(n,Mn) ≤M sn

∑yn∈Yn

( ∑xn∈Xn

PXn(xn)P1/(1+s)Y n|Xn (yn|xn)

)1+s

.

184

By using the fact that PXn and PY n|Xn are product distributions withidentical marginal, and taking logarithmic operation on both sides of theabove inequality, we obtain:

− 1

nlogPe(n,Mn) ≥ − s

nlogMn − log

∑y∈Y

(∑x∈X

PX(x)P1/(1+s)Y |X (y|x)

)1+s

,

which implies

lim infn→∞

− 1

nlogPe(n,Mn) ≥ −sR− log

∑y∈Y

(∑x∈X

PX(x)P1/(1+s)Y |X (y|x)

)1+s

.

The proof is completed by noting that the lower bound holds for any s ∈[0, 1] and any PX . 2

9.1.1 The properties of random coding exponent

We re-formulate Er(R) to facilitate the understanding of its behavior as follows.

Definition 9.5 (random-coding exponent) The random coding exponentfor DMC with generic distribution PY |X is defined by

Er(R) , sup0≤s≤1

[−sR + E0(s)] ,

where

E0(s) , supPX

− log∑y∈Y

(∑x∈X

PX(x)P1/(1+s)Y |X (y|x)

)1+s .

Based on the reformulation, the properties of Er(R) can be realized via theanalysis of function E0(s).

Lemma 9.6 (properties of Er(R))

1. Er(R) is convex and non-increasing in R; hence, it is a strict decreasingand continuous function of R ∈ [0,∞).

2. Er(C) = 0, where C is channel capacity, and Er(C− δ) > 0 for all 0 < δ <C.

3. There exists Rcr such that for 0 < R < Rcr, the slope of Er(R) is −1.

185

Proof:

1. Er(R) is convex and non-increasing, since it is the supremum of affinefunctions −sR+E0(s) with non-positive slope −s. Hence, Er(R) is strictlydecreasing and continuous for R > 0.

2. −sR + E0(s) equals zero at R = E0(s)/s for 0 < s ≤ 1. Hence by thecontinuity of Er(R), Er(R) = 0 for R = sup0<s≤1E0(s)/s, and Er(R) > 0for R < sup0<s≤1E0(s)/s. The property is then justified by:

sup0<s≤1

1

sE0(s) = sup

0<s≤1supPX

−1

slog∑y∈Y

(∑x∈X

PX(x)P1/(1+s)Y |X (y|x)

)1+s

= supPX

sup0<s≤1

−1

slog∑y∈Y

(∑x∈X

PX(x)P1/(1+s)Y |X (y|x)

)1+s

= supPX

lims↓0

−1

slog∑y∈Y

(∑x∈X

PX(x)P1/(1+s)Y |X (y|x)

)1+s

= supPX

I(X;Y ) = C.

3. The intersection between −sR + E0(s) and −R + E0(1) is

Rs =E0(1)− E0(s)

1− s;

and below Rs, −sR + E0(s) ≤ −R + E0(1). Therefore,

Rcr = inf0≤s≤1

E0(1)− E0(s)

1− s.

So below Rcr, the slope of Er(R) remains −1. 2

Example 9.7 For BSC with crossover probability ε, the random coding expo-nent becomes

Er(R) = max0≤p≤1

max0≤s≤1

−sR− log

[(p(1− ε)1/(1+s) + (1− p)ε1/(1+s)

)(1+s)

+(pε1/(1+s) + (1− p)(1− ε)1/(1+s)

)(1+s)]

,

where (p, 1− p) is the input distribution.

Note that the input distribution achieving Er(R) is uniform, i.e., p = 1/2.Hence,

E0(s) = s log(2)− (1 + s) log(ε1/(1+s) + (1− ε)1/(1+s)

)

186

@@@

@@

@@

1− p

p

1

0

1

0

1− ε

1− ε

ε

ε

Figure 9.1: BSC channel with crossover probability ε and input distri-bution (p, 1− p).

and

Rcr = inf0≤s≤1

E0(1)− E0(s)

1− s= lim

s↑1

∂E0(s)

∂s

= log(2)− log(√

ε+√

1− ε)

+

√ε log(ε) +

√1− ε log(1− ε)

2(√ε+√

1− ε).

The random coding exponent for ε = 0.2 is depicted in Figure 9.2.

9.2 Expurgated exponent

The proof of the random coding exponent is based on the argument of randomcoding, which selects codewords independently according to some input distribu-tion. Since the codeword selection is unbiased, the “good” codewords (i.e., withsmall Pe|m) and “bad” codewords (i.e., with large Pe|m) contribute the same tothe overall average error probability. Therefore, if, to some extent, we can “ex-purgate” the contribution of the “bad” codewords, a better bound on channelreliability function may be found.

In stead of randomly selecting Mn codewords, expurgated approach firstdraws 2Mn codewords to form a codebook C∼2Mn , and sorts these codewordsin ascending order in terms of Pe|m( C∼2Mn), which is the error probability givencodeword m is transmitted. After that, it chooses the first Mn codewords (whosePe|m( C∼2Mn) is smaller) to form a new codebook, C∼Mn . It can be expected thatfor 1 ≤ m ≤Mn and Mn < m′ ≤ 2Mn,

Pe|m( C∼Mn) ≤ Pe|m( C∼2Mn) ≤ Pe|m′( C∼2Mn),

provided that the maximum-likelihood decoder is always employed at the receiverside. Hence, a better codebook is obtained.

187

-1

0

0.1

0 Rcr 0.08 0.12 0.16 0.192745

C

Er(R)

s∗

Figure 9.2: Random coding exponent for BSC with crossover proba-bility 0.2. Also plotted is s∗ = arg sup0≤s≤1[−sR − E0(s)]. Rcr =0.056633.

To further reduce the contribution of the “bad” codewords when comput-ing the expected average error probability under random coding selection, abetter bound in terms of Lyapounov’s inequality2 is introduced. Note that byLyapounov’s inequality,

E1/ρ[P ρe|m( C∼2Mn)

]≤ E

[Pe|m( C∼2Mn)

].

for 0 < ρ ≤ 1.

Similar to the random coding argument, we need to claim the existence ofone good code whose error probability vanishes at least at the rate of expurgatedexponent. Recall that the random coding exponent comes to this conclusion bythe simple fact that E[Pe] ≤ e−nEr(R) (cf. (9.1.2)) implies the existence of codewith Pe ≤ e−nEr(R). However, a different argument is adopted for the existenceof such “good” codes for expurgated exponent.

Lemma 9.8 (existence of “good” code for expurgated exponent) There

2Lyapounov’s inequality:

E1/α[|X|α] ≤ E1/β [|X|β ] for 0 < α ≤ β.

188

exists an (n,Mn) block code C∼Mn with

Pe|m( C∼Mn) ≤ 21/ρE1/ρ[P ρe|m( C∼2Mn)

], (9.2.1)

where C∼2Mn is randomly selected according to some distribution PXn .

Proof: Let the random codebook C∼2Mn be denoted by

C∼2Mn = c1, c2, . . . , c2Mn.

Let φ(·) be the indicator function of the set

t ∈ < : t < 21/ρE1/ρ[P ρe|m( C∼2Mn)],

for some ρ > 0. Hence, φ(Pe|m( C∼2Mn) ) = 1 if

Pe|m( C∼2Mn) < 21/ρE1/ρ[P ρe|m( C∼2Mn)].

By Markov’s inequality,

E

[ ∑1≤m≤2Mn

φ(Pe|m( C∼2Mn))

]=

∑1≤m≤2Mn

E[φ(Pe|m( C∼2Mn))

]= 2MnE

[φ(Pe|m( C∼2Mn))

]= 2MnPr

Pe|m( C∼2Mn) < 21/ρE1/ρ[P ρ

e|m( C∼2Mn)]

= 2MnPrP ρe|m( C∼2Mn) < 2E[P ρ

e|m( C∼2Mn)]

≥ 2Mn

(1−

E[P ρe|m( C∼2Mn)]

2E[P ρe|m( C∼2Mn)]

)= Mn.

Therefore, there exist at lease one codebook such that∑1≤m≤2Mn

φ(Pe|m( C∼2Mn)) ≥Mn.

Hence, by selecting Mn codewords from this codebook with φ(Pe|m( C∼2Mn)) = 1,a new codebook is formed, and it is obvious that (9.2.1) holds for this newcodebook. 2

189

Definition 9.9 (expurgated exponent) The expurgated exponent for DMCwith generic distribution PY |X is defined by

Eex(R) , sups≥1

supPX

[−sR− s log

∑x∈X

∑x′∈X

PX(x)PX(x′)

(∑y∈Y

√PY |X(y|x)PY |X(y|x′)

)1/s .

Theorem 9.10 (expurgated exponent) For DMC with generic transitionprobability PY |X ,

E∗(R) ≥ Eex(R)

for R ≥ 0.

Proof: The theorem holds trivially at R = 0 since E∗(0) = ∞. It remains toprove the theorem under R > 0.

Randomly select 2Mn codewords according to some distribution PXn . Forfixed R > 0, choose a sequence of Mnn≥1 with R = limn→∞(1/n) logMn.

Step 1: Maximum likelihood decoder. Let c1, . . . , c2Mn ∈ X n denote theset of n-tuple block codewords selected, and let the decoding partition forsymbol m (namely, the set of channel outputs that classify to m) be

Um = yn : PY n|Xn(yn|cm) > PY n|Xn(yn|cm′), for all m′ 6= m.

Those channel outputs that are on the boundary, i.e.,

for some m and m′, PY n|Xn(yn|cm) = PY n|Xn(yn|cm′),

will be arbitrarily assigned to either m or m′.

Step 2: Property of indicator function for s = 1. Let φm be the indicatorfunction of Um. Then for all s > 0.

1− φm(yn) ≤

∑m′ 6=m

[PY n|Xn(yn|cm′)PY n|Xn(yn|cm)

]1/(1+s)s

.

(Note that this step is the same as random coding exponent, except onlys = 1 is considered.) By taking s = 1, we have

1− φm(yn) ≤

∑m′ 6=m

[PY n|Xn(yn|cm′)PY n|Xn(yn|cm)

]1/2.

190

Step 3: Probability of error given codeword cm is transmitted. LetPe|m( C∼2Mn) denote the probability of error given codeword cm is transmit-ted. Then

Pe|m( C∼2Mn)

≤∑yn 6∈Um

PY n|Xn(yn|cm)

=∑yn∈Yn

PY n|Xn(yn|cm)[1− φm(yn)]

≤∑yn∈Yn

PY n|Xn(yn|cm)

∑m′ 6=m,1≤m′≤2Mn

[PY n|Xn(yn|cm′)PY n|Xn(yn|cm)

]1/2

≤∑yn∈Yn

PY n|Xn(yn|cm)

∑1≤m′≤2Mn

[PY n|Xn(yn|cm′)PY n|Xn(yn|cm)

]1/2

=∑

1≤m′≤2Mn

∑yn∈Yn

PY n|Xn(yn|cm)

[PY n|Xn(yn|cm′)PY n|Xn(yn|cm)

]1/2

=∑

1≤m′≤2Mn

∑yn∈Yn

√PY n|Xn(yn|cm)PY n|Xn(yn|cm′)

Step 4: Standard inequality for s′ = 1/ρ ≥ 1. It is known that for any0 < ρ = 1/s′ ≤ 1, we have (∑

i

ai

≤∑i

aρi ,

for all non-negative sequence3 ai.

3Proof: Let f(ρ) = (∑i aρi ) /(∑

j aj

)ρ. Then we need to show f(ρ) ≥ 1 when 0 < ρ ≤ 1.

Let pi = ai/(∑k ak), and hence, ai = pi(

∑k ak). Take it to the numerator of f(ρ):

f(ρ) =

∑i pρi (∑k ak)ρ

(∑j aj)

ρ

=∑i

pρi

Now since ∂f(ρ)/∂ρ =∑i log(pi)p

ρi ≤ 0 (which implies f(ρ) is non-increasing in ρ) and

f(1) = 1, we have the desired result. 2

191

Hence,

P ρe|m( C∼2Mn) ≤

( ∑1≤m′≤2Mn

∑yn∈Yn

√PY n|Xn(yn|cm)PY n|Xn(yn|cm′)

≤∑

1≤m′≤2Mn

∑yn∈Yn

√PY n|Xn(yn|cm)PY n|Xn(yn|cm′)

ρ

Step 5: Expectation of P ρe|m( C∼2Mn).

E[P ρe|m( C∼2Mn)]

≤ E

[ ∑1≤m′≤2Mn

∑yn∈Yn

√PY n|Xn(yn|cm)PY n|Xn(yn|cm′)

ρ]

=∑

1≤m′≤2Mn

E

[ ∑yn∈Yn

√PY n|Xn(yn|cm)PY n|Xn(yn|cm′)

ρ]

= 2MnE

[ ∑yn∈Yn

√PY n|Xn(yn|cm)PY n|Xn(yn|cm′)

ρ].

Step 6: Lemma 9.8. From Lemma 9.8, there exists one codebook with sizeMn such that

Pe|m( C∼Mn)

≤ 21/ρE1/ρ[P ρe|m( C∼2Mn)]

≤ 21/ρ

(2MnE

[ ∑yn∈Yn

√PY n|Xn(yn|cm)PY n|Xn(yn|cm′)

ρ])1/ρ

= (4Mn)1/ρE1/ρ

[ ∑yn∈Yn

√PY n|Xn(yn|cm)PY n|Xn(yn|cm′)

ρ]

= (4Mn)s′Es′

∑yn∈Yn

√PY n|Xn(yn|cm)PY n|Xn(yn|cm′)

1/s′

= (4Mn)s′

∑xn∈Xn

∑(x′)n∈Xn

PXn(xn)PXn((x′)n)

×

∑yn∈Yn

√PY n|Xn(yn|xn)PY n|Xn(yn|(x′)n)

1/s′s′

.

192

By using the fact that PXn and PY n|Xn are product distributions with iden-tical marginal, and taking logarithmic operation on both sides of the aboveinequality, followed by taking the liminf operation, the proof is completedby the fact that the lower bound holds for any s′ ≥ 1 and any PX . 2

As a result of the expurgated exponent obtained, it only improves the randomcoding exponent at lower rate. It is even worse than random coding exponentfor rate higher Rcr. One possible reason for its being worse at higher rate maybe due to the replacement of the indicator-function argument

1− φm(yn) ≤

∑m′ 6=m

[PY n|Xn(yn|cm′)PY n|Xn(yn|cm)

]1/(1+s)s

.

by

1− φm(yn) ≤[PY n|Xn(yn|cm′)PY n|Xn(yn|cm)

]1/2

.

Note that

mins>0

∑m′ 6=m

[PY n|Xn(yn|cm′)PY n|Xn(yn|cm)

]1/(1+s)s

≤∑m′ 6=m

[PY n|Xn(yn|cm′)PY n|Xn(yn|cm)

]1/2

.

The improvement of expurgated bound over random coding bound at lowerrate may also reveal the possibility that the contribution of “bad codes” underunbiased random coding argument is larger at lower rate; hence, suppressing theweights from “bad codes” yield better result.

9.2.1 The properties of expurgated exponent

Analysis of Eex(R) is very similar to that of Er(R). We first re-write its formula.

Definition 9.11 (expurgated exponent) The expurgated exponent for DMCwith generic distribution PY |X is defined by

Eex(R) , maxs≥1

[−sR + Ex(s)] ,

where

Ex(s) , maxPX

−s log∑x∈X

∑x′∈X

PX(x)PX(x′)

(∑y∈Y

√PY |X(y|x)PY |X(y|x′)

)1/s .

193

The properties of Eex(R) can be realized via the analysis of function Ex(s)as follows.

Lemma 9.12 (properties of Eex(R))

1. Eex(R) is non-increasing.

2. Eex(R) is convex in R. (Note that the first two properties imply that Eex(R)is strictly decreasing.)

3. There exists Rcr such that for R > Rcr, the slope of Er(R) is −1.

Proof:

1. Again, the slope of Eex(R) is equal to the maximizer s∗ for Eex(R) times−1. Therefore, the slope of Eex(R) is non-positive, which certainly impliesthe desired property.

2. Similar to the proof of the same property for random coding exponent.

3. Similar to the proof of the same property for random coding exponent. 2

Example 9.13 For BSC with crossover probability ε, the expurgated exponentbecomes

Eex(R)

= max1≥p≥0

maxs≥1

−sR− s log

[p2 + 2p(1− p)

(2√ε(1− ε)

)1/s

+ (1− p)2

]where (p, 1 − p) is the input distribution. Note that the input distributionachieving Eex(R) is uniform, i.e., p = 1/2. The expurgated exponent, as well asrandom coding exponent, for ε = 0.2 is depicted in Figures 9.3 and 9.4.

9.3 Partitioning bound: an upper bounds for channel re-liability

In the previous sections, two lower bounds of the channel reliability, randomcoding bound and expurgated bound, are discussed. In this section, we will startto introduce the upper bounds of the channel reliability function for DMC, whichincludes partitioning bound, sphere-packing bound and straight-line bound.

194

0

0.02

0.04

0.06

0.08

0.1

0.12

0 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.192745

C

Figure 9.3: Expurgated exponent (solid line) and random coding ex-ponent (dashed line) for BSC with crossover probability 0.2 (over therange of (0, 0.192745)).

The partitioning bound relies heavily on theories of binary hypothesis testing.This is because the receiver end can be modeled as:

H0 : cm = codeword transmitted

H1 : cm 6= codeword transmitted,

where m is the final decision made by receiver upon receipt of some channeloutputs. The channel decoding error given that codeword m is transmittedbecomes the type II error, which can be computed using the theory of binaryhypothesis testing.

Definition 9.14 (partitioning bound) For a DMC with marginal PY |X ,

Ep(R) , maxPX

minPY |X : I(PX ,PY |X)≤R

D(PY |X‖PY |X |PX).

One way to explain the above definition of partitioning bound is as follows.For a given i.i.d. source with marginal PX , to transmit it into a dummy channelwith rate R and dummy capacity I(PX , PY |X) ≤ R is expected to have poor

performance (note that we hope the code rate to be smaller than capacity).We can expect that those “bad” noise patterns whose composition transitionprobability PY |X satisfies I(PX , PY |X) ≤ R will contaminate the codewords more

195

0.1

0.105

0.11

0 0.001 0.002 0.003 0.004 0.005 0.006

Figure 9.4: Expurgated exponent (solid line) and random coding ex-ponent (dashed line) for BSC with crossover probability 0.2 (over therange of (0, 0.006)).

seriously than other noise patterns. Therefore for any codebook with rate R, theprobability of decoding error is lower bounded by the probability of these “bad”noise patterns, since the “bad” noise patterns anticipate to unrecoverably distortthe codewords. Accordingly, the channel reliability function is upper boundedby the exponent of the probability of these “bad” noise patterns.

Sanov’s theorem already reveal that for any set consisting of all elements withthe same compositions, its exponent is equal to the minimum divergence of thecomposition distribution against the true distribution. This is indeed the casefor the set of “bad” noise patterns. Therefore, the exponent of the probabilityof these “bad” noise patterns is given as:

minPY |X : I(PX ,PY |X)≤R

D(PY |X‖PY |X |PX).

We can then choose the best input source to yield the partitioning exponent.

The partitioning exponent can be easily re-formulated in terms of divergences.This is because the mutual information can actually be written in the form of

196

divergence.

I(PX , PY |X) ,∑x

∑y

PX(x)PY |X(y|x) logPXY (x, y)

PX(x)PY (y)

=∑x

PX(x)∑y

PY |X(y|x) logPY |X(y|x)

PY (y)

=∑x

PX(x)D(PY |X‖PY |X = x)

= D(PY |X‖PY |PX)

Hence, the partitioning exponent can be written as:

Ep(R) , maxPX

minPY |X : D(P

Y |X‖PY |PX)≤RD(PY |X‖PY |X |PX).

In order to apply the theory of hypothesis testing, the partitioning bound needsto be further re-formulated.

Observation 9.15 (partitioning bound) If PX and PY |X are distributions

that achieves Ep(R), and PY (y) =∑

x∈X PX(x)PY |X(y|x), then

Ep(R) , maxPX

minPY |X : D(PY |X‖PY |PX)≤R

D(PY |X‖PY |X |PX).

In addition, the distribution PY |X that achieves Ep(R) is a tilted distributionbetween PY |X and PY , i.e.,

Ep(R) , maxPX

D(PYλ|X‖PY |X |PX),

where PYλ|X is a tilted distribution4 between PY |X and PY , and λ is the solutionof

D(PYλ‖PY |PX) = R.

Theorem 9.16 (partitioning bound) For a DMC with marginal PY |X , forany ε > 0 arbitrarily small,

E(R + ε) ≤ Ep(R).

4

PYλ|X(y|x) ,PλY

(y)P 1−λY |X (y|x)∑

y′∈Y PλY

(y′)P 1−λY |X (y′|x)

.

197

Proof:

Step 1: Hypothesis testing. Let code size Mn = d2enRe ≤ en(R+ε) for n >ln2/ε, and let c1, . . . , cMn be the best codebook. (Note that using acode with smaller size should yield a smaller error probability, and hence,a larger exponent. So, if this larger exponent is upper bounded by Ep(R),the error exponent for code size expR + ε should be bounded above byEp(R).)

Suppose that PX and PY |X are the distributions achieving Ep(R). Define

PY (y) =∑

x∈X PX(x)PY |X(y|x). Then for any (mutually-disjoint) decod-ing partitions at the output site, U1,U2, . . . ,UMn ,∑

1≤m′≤Mn

PY n(Um′) = 1,

which implies the existence of m satisfying

PY n(Um) ≤ 2

Mn

.

(Note that there are at least Mn/2 codewords satisfying this condition.)

Construct the binary hypothesis testing problem of

H0 : PY n(·) against H1 : PY n|Xn(·|cm),

and choose the acceptance region for null hypothesis as U cm (i.e., the ac-ceptance region for alternative hypothesis is Um), the type I error

αn , Pr(Um|H0) = PY n(Um) ≤ 2

Mn

,

and the type II error

βn , Pr(U cm|H1) = PY n|Xn(U cm|cm),

which is exactly, Pe|m, the probability of decoding error given codeword cmis transmitted. We therefore have

Pe|m ≥ minαn≤2/Mn

βn ≥ minαn≤e−nR

βn.

(Since

2

Mn

≤ e−nR)

198

Then from Theorem 8.34, the best exponent 5 for Pe|m is

lim supn→∞

1

n

n∑i=1

D(PYλ,i‖PY |X(·|ai)), (9.3.1)

where ai is the ith component of cm, PYλ,i is the tilted distribution betweenPY |X(·|ai) and PY , and

lim infn→∞

1

n

n∑i=1

D(PYλ,i‖PY ) = R.

If we denote the composition distribution with respect to cm as P (cm), andobserve that the number of different tilted distributions (i.e., PYλ,i) is atmost |X |, the quantity of (9.3.1) can be further upper bounded by

lim supn→∞

1

n

n∑i=1

D(PYλ,i‖PY |X(·|ai))

≤ lim supn→∞

∑x∈X

P (cm)(x)D(PYλ,x‖PY |X(·|x))

≤ maxPX

D(PYλ‖PY |X |PX)

= D(PYλ‖PY |X |PX)

= Ep(R). (9.3.2)

Besides, D(PYλ‖PY |PX) = R.

Step 2: A good code with rate R. Within the Mn ≈ 2enR codewords, thereare at least Mn/2 codewords satisfying

PY n(Um) ≤ 2

Mn

,

and hence, their Pe|m should all satisfy (9.3.2). Consequently, if

B , m : those m satisfying step 1,

5According to Theorem 8.34, we need to consider the quantile function of the CDF of

1

nlog

dPY nλdPY n|X=cm

(Y n).

Since both PY n|X(·|cm) and PY n are product distributions, PY nλ is a product distribution,too. Therefore, by the similar reason as in Example 5.13, the CDF becomes a bunch of unitfunctions (see Figure 2.4), and its largest quantile becomes (9.3.1).

199

then

Pe ,1

Mn

∑1≤m≤Mn

Pe|m

≥ 1

Mn

∑m∈B

Pe|m

≥ 1

Mn

∑m∈B

minm∈B

Pe|m

=1

Mn

|B|minm∈B

Pe|m

≥ 1

2minm∈B

Pe|m

=1

2Pe|m′ , (m′ is the minimizer.)

which implies

E(R + ε) ≤ lim supn→∞

− 1

nlogPe ≤ lim sup

n→∞− 1

nlogPe|m′ ≤ Ep(R).

2

The partitioning upper bound can be re-formulated to shape similar as ran-dom coding lower bound.

Lemma 9.17

Ep(R) = maxPX

maxs≥0

−sR− log∑y∈Y

(∑x∈X

PX(x)PY |X(y|x)1/(1+s)

)1+s

= maxs≥0

[−sR− E0(s)] .

Recall that the random coding exponent is max0<s≤1[−sR−E0(s)], and theoptimizer s∗ is the slope of Er(R) times −1. Hence, the channel reliability E(R)satisfies

max0<s≤1

[−sR− E0(s)] ≤ E(R) ≤ maxs≥0

[−sR− E0(s)] ,

and for optimizer s∗ ∈ (0, 1] (i.e., Rcr ≤ R ≤ C), the upper bound meets thelower bound.

200

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

0.18

0 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.192745

Figure 9.5: Partitioning exponent (thick line), random coding exponent(thin line) and expurgated exponent (thin line) for BSC with crossoverprobability 0.2.

9.4 Sphere-packing exponent: an upper bound of thechannel reliability

9.4.1 Problem of sphere-packing

For a given space A and a given distance measures d(·, ·) for elements in A, aball or sphere centered at a with radius r is defined as

b ∈ A : d(b, a) ≤ r.

The problem of sphere-packing is to find the minimum radius if M spheres needto be packed into the space A. Its dual problem is to find the maximum numberof balls with fixed radius r that can be packed into space A.

9.4.2 Relation of sphere-packing and coding

To find the best codebook which yields minimum decoding error is one of themain research issues in communications. Roughly speaking, if two codewords aresimilar, they should be more vulnerable to noise. Hence, a good codebook shouldbe a set of codewords, which look very different from others. In mathematics,such “codeword resemblance” can be modeled as a distance function. We can

201

then say if the distance between two codewords is large, they are more “different”,and more robust to noise or interference. Accordingly, a good code book becomesa set of codewords whose minimum distance among codewords is largest. Thisis exactly the sphere-packing problem.

Example 9.18 (Hamming distance versus BSC) For BSC, the source al-phabet and output alphabet are both 0, 1n. The Hamming distance betweentwo elements in 0, 1 is given by

dH(x, x) =

0, if x = x;1, if x 6= x.

Its extension definition to n-tuple is

dH(xn, xn) =n∑i=1

dH(xi, xi).

It is known that the best decoding rule is the maximum likelihood ratiodecoder, i.e.,

φ(yn) = m, if PY n|Xn(yn|cm) ≥ PY n|Xn(yn|cm′) for all m′.

Since for BSC with crossover probability ε,

PY n|Xn(yn|cm) = εdH(yn,cm)(1− ε)n−dH(yn,cm)

= (1− ε)n(

ε

1− ε

)dH(yn,cm)

. (9.4.1)

Therefore, the best decoding rule can be re-written as:

φ(yn) = m, if dH(yn, cm) ≥ dH(yn, cm′) for all m′.

As a result, if two codewords are too close in Hamming distance, the number ofbits of outputs that can be used to classify its origin will be less, and therefore,will result in a poor performance.

From the above example, two observations can be made. First, if the dis-tance measure between codewords can be written as a function of the transitionprobability of channel, such as (9.4.1), one may regard the probability of decod-ing error with the distances between codewords. Secondly, the coding problemin some cases can be reduced to a sphere-packing problem. As a consequence,solution of the sphere-packing problem can be used to characterize the error pro-bability of channels. This can be confirmed by the next theorem, which showsthat an upper bound on the channel reliability can be obtained in terms of thelargest minimum radius among M disjoint spheres in a code space.

202

Theorem 9.19 Let µn(·, ·) be the Bhattacharya distance6 between two elementsin X n. Denote by dn,M the largest minimum distance among M selected code-words of length n. (Obviously, the largest minimum radius among M disjointspheres in a code space is half of dn,M .) Then

lim supn→∞

− 1

nlogPe(n,R) ≤ lim sup

n→∞

1

ndn,M=enR .

Proof:

Step 1: Hypothesis testing. For any code

c0, c1, . . . , cM−1

given, we can form a maximum likelihood partitions at the output asU1,U2, . . . ,UM , which is known to be optimal. Let Am,m be the optimalacceptance region for alternative hypothesis under equal prior for

testing H0 : PY n|Xn(·|cm) against H1 : PY n|Xn(·|cm),

and denote by Pe|m the error probability given codeword m is transmitted.Then

Pe|m = PY n|Xn(U cm|cm) ≥ PY n|Xn(Acm,m|cm),

where the superscript “c” represents the set complementary operation(cf. Figure 9.6). Consequently,

Pe|m ≥ PY n|Xn(Acm,m|cm) ≥ exp

−D

(PY nλ ‖PY n|Xn(·|cm)

)+ o(n)7

6The Bhattacharya distance (for channels PY n|Xn) between elements xn and xn is defined

by

µn(xn, xn) , − log∑

yn∈Yn

√PY n|Xn(yn|xn)PY n|Xn(yn|xn).

7This is a special notation, introduced in 1909 by E. Landau. Basically, there are two ofthem: one is little-o notation and the other is big-O notation. They are respectively definedas follows, based on the assumption that g(x) > 0 for all x in some interval containing a.

Definition 9.20 (little-o notation) The notation f(x) = o(g(x)) as x→ a means that

limx→a

f(x)

g(x)= 0.

Definition 9.21 (big-O notation) The notation f(x) = O(g(x)) as x→ a means that

limx→a

∣∣∣∣f(x)

g(x)

∣∣∣∣ <∞.

203

and

Pe|m ≥ PY n|Xn(Am,m|cm) ≥ exp−D

(PY nλ ‖PY n|Xn(·|cm)

)+ o(n)

,

where PY nλ is the tilted distribution between PY n|Xn(·|cm) and PY n|Xn(·|cm)with

D(PY nλ ‖PY n|Xn(·|cm)) = D(PY nλ ‖PY n|Xn(·|cm)) = µn(cm, cm),

and D(·‖·) is the divergence (or Kullback-Leibler divergence). We thushave

Pe|m + Pe|m ≥ 2e−µn(cm,cm)+o(n).

Note that the above inequality holds for any m and m with m 6= m.

Step 2: Largest minimum distance. By the definition of dn,M , there existsan (m,m) pair for the above code such that

µn(cm, cm) ≤ dn,M ,

which implies

(∃ (m,m)) Pe|m + Pe|m ≥ 2e−dn,M+o(n).

Step 3: Probability of error. Suppose we have found the optimal code withsize 2M = en(R+log 2/n), which minimizes the error probability. Index thecodewords in ascending order of Pe|m, namely

Pe|0 ≤ Pe|1 ≤ . . . ≤ Pe|2M−1.

Form two new codebooks as

C∼1 = c0, c1, . . . , cM−1 and C∼2 = cM , . . . , c2M−1.

Then, from steps 1 and 2, there exists at least one pair of codewords(cm, cm) in C∼1 such that

Pe|m + Pe|m ≥ 2e−dn,M+o(n).

Since for all ci in C∼2

Pe|i ≥ maxPe|m, Pe|m,

and hence,Pe|i + Pe|j ≥ Pe|m + Pe|m ≥ 2e−dn,M+o(n)

204

for any ci and cj in C∼2. Accordingly,

Pe(n,R + log(2)/n) =1

8M2

2M−1∑i=0

2M−1∑j=0

(Pe|i + Pe|j)

≥ 1

8M2

2M−1∑i=M

2M−1∑j=M,j 6=i

(Pe|i + Pe|j)

≥ 1

8M2

2M−1∑i=M

2M−1∑j=M,j 6=i

(Pe|m + Pe|m)

≥ 1

8M2

2M−1∑i=M

2M−1∑j=M,j 6=i

2e−dn,M+o(n)

=M − 1

4Me−dn,M+o(n),

which immediately implies the desired result. 2

Since, according the above theorem, the largest minimum distance can beused to formulate an upper bound on channel reliability, this quantity becomesessential. We therefore introduce its general formula in the next subsection.

9.4.3 The largest minimum distance of block codes

The ultimate capabilities and limitations of error correcting codes are quite im-portant, especially for code designers who want to estimate the relative efficacyof the designed code. In fairly general situations, this information is closely re-lated to the largest minimum distance of the codes [16]. One of the examples isthat for binary block code employing the Hamming distance, the error correctingcapability of the code is half of the minimum distance among codewords. Hence,the knowledge of the largest minimum distance can be considered as a referenceof the optimal error correcting capability of codes.

The problem on the largest minimum distance can be described as follows.Over a given code alphabet, and a given measurable function on the “distance”between two code symbols, determine the asymptotic ratio, the largest minimumdistance attainable amongM selected codewords divided by the code blocklengthn, as n tends to infinity, subject to a fixed rate R , log(M)/n.

Research on this problem have been done for years. In the past, only boundson this ratio were established. The most well-known bound on this problemis the Varshamov-Gilbert lower bound, which is usually derived in terms of acombinatorial approximation under the assumption that the code alphabet is

205

finite and the measure on the “distance” between code letters is symmetric [14].If the size of the code alphabet, q, is an even power of a prime, satisfying q ≥ 49,and the distance measure is the Hamming distance, a better lower bound canbe obtained through the construction of the Algebraic-Geometric code [11, 19],of which the idea was first proposed by Goppa. Later, Zinoviev and Litsynproved that a better lower bound than the Varshamov-Gilbert bound is actuallypossible for any q ≥ 46 [21]. Other improvements of the bounds can be found in[9, 15, 20].

In addition to the combinatorial techniques, some researchers also apply theprobabilistic and analytical methodologies to this problem. For example, bymeans of the random coding argument with expurgation, the Varshamov-Gilbertbound in its most general form can be established by simply using the Chebyshevinequality ([3] or cf. Appendix A), and restrictions on the code alphabet (suchas finite, countable, . . .) and the distance measure (such as additive, symmetric,bounded, . . .) are no longer necessary for the validity of its proof.

As shown in the Chapter 5, channels without statistical assumptions suchas memoryless, information stability, stationarity, causality, and ergodicity, . . .,etc. are successfully handled by employing the notions of liminf in probabilityand limsup in probability of the information spectrum. As a consequence, thechannel capacity C is shown to equal the supremum, over all input processes,of the input-output inf-information rate defined as the liminf in probability ofthe normalized information density [12]. More specifically, given a channel W =W n = PY n|Xn∞n=1,

C = supX

sup a ∈ < :

lim supn→∞

Pr

[1

niXnWn(Xn;Y n) < a

]= 0

,

where Xn and Y n are respectively the n-fold input process drawn from

X = Xn = (X(n)1 , . . . , X(n)

n )∞n=1

and the corresponding output process induced by Xn via the channel W n =PY n|Xn , and

1

niXnWn(xn; yn) ,

1

nlog

PY n|Xn(yn|xn)

PY n(yn)

is the normalized information density. If the conventional definition of chan-nel capacity, which requires the existence of reliable block codes for all suffi-ciently large blocklengths, is replaced by that reliable codes exist for infinitelymany blocklengths, a new optimistic definition of capacity C is obtained [12]. Its

206

information-spectrum expression is then given by [8, 7]

C = supX

sup a ∈ < :

lim infn→∞

Pr

[1

niXnWn(Xn;Y n) < a

]= 0

.

Inspired by such probabilistic methodology, together with random-codingscheme with expurgation, a spectrum formula on the largest minimum distanceof deterministic block codes for generalized distance functions8 (not necessarilyadditive, symmetric and bounded) is established in this work. As revealed in theformula, the largest minimum distance is completely determined by the ultimatestatistical characteristics of the normalized distance function evaluated undera properly chosen random-code generating distribution. Interestingly, the newformula has an analogous form to the general information-spectrum expressionsof the channel capacity and the optimistic channel capacity. This somehowconfirms the connection between the problem of designing a reliable code for agiven channel and that of finding a code with sufficiently large distance amongcodewords, if the distance function is properly defined in terms of the channelstatistics.

With the help of the new formula, we characterize a minor class of distancemetrics for which the ultimate largest minimum distance among codewords canbe derived. Although these distance functions may be of secondary interest, itsheds some light on the determination of largest minimum distance for a moregeneral class of distance functions. Discussions on the general properties of thenew formula will follow.

We next derive a general Varshamov-Gilbert lower bound directly from thenew distance-spectrum formula. Some remarks regarding to its properties aregiven. A sufficient condition under which the general Varshamov-Gilbert boundis tight, as well as examples to demonstrate its strict inferiority to the distance-spectrum formula, is also provided.

Finally, we demonstrate that the new formula can be used to derive theknown lower bounds for a few specific block coding schemes of general interests,

8Conventionally, a distance or metric [13][18, pp. 139] should satisfy the properties of i)non-negativity; ii) being zero if, and only if, two points coincide; iii) symmetry; and iv) triangleinequality. The derivation in this paper, however, is applicable to any measurable functiondefined over the code alphabets. Since none of the above four properties are assumed, themeasurable function on the “distance” between two code letters is therefore termed generalizeddistance function. One can, for example, apply our formula to situation where the codealphabet is a distribution space, and the “distance” measure is the Kullback-Leibler divergence.For simplicity, we will abbreviate the generalized distance function simply as the distancefunction in the remaining part of the article.

207

such as constant weight codes and the codes that corrects arbitrary noise. Trans-formation of the asymptotic distance determination problem into an alternativeproblem setting over a graph for a possible improvement of these known boundsare also addressed.

A) Distance-spectrum formula on the largest minimum distance ofblock codes

We first introduce some notations. The n-tuple code alphabet is denoted by X n.For any two elements xn and xn in X n, we use µn(xn, xn) to denote the n-foldmeasure on the “distance” of these two elements. A codebook with block lengthn and size M is represented by

C∼n,M ,c

(n)0 , c

(n)1 , c

(n)2 , . . . , c

(n)M−1

,

where c(n)m , (cm1, cm2, . . . , cmn), and each cmk belongs to X . We define the

minimum distance

dm( C∼n,M) , min0≤m≤M−1

m 6=m

µn

(c

(n)m , c(n)

m

),

and the largest minimum distance

dn,M , maxC∼n,M

min0≤m≤M−1

dm( C∼n,M).

Note that there is no assumption on the code alphabet X and the sequence ofthe functions µn(·, ·)n≥1.

Based on the above definitions, the problem considered in this paper becomesto find the limit, as n→∞, of dn,M/n under a fixed rate R = log(M)/n. Sincethe quantity is investigated as n goes to infinity, it is justified to take M = enR

as integers.

The concept of our method is similar to that of the random coding techniqueemployed in the channel reliability function [2]. Each codeword is assumed to beselected independently of all others from X n through a generic distribution PXn .Then the distance between codewords c

(n)m and c

(n)m becomes a random variable,

and so does dm( C∼n,M). For clarity, we will use Dm to denote the random variable

corresponding to dm( C∼n,M). Also note that DmM−1m=0 are identically distributed.

We therefore have the following lemma.

Lemma 9.22 Fix a triangular-array random process

X =Xn =

(X

(n)1 , . . . , X(n)

n

)∞n=1

.

208

Let Dm = Dm(Xn), 0 ≤ m ≤ (M − 1), be defined for the random codebookof block length n and size M , where each codeword is drawn independentlyaccording to the distribution PXn . Then

1. for any γ > 0, there exists a universal constant α = α(γ) ∈ (0, 1) (in-dependent of blocklength n) and a codebook sequence C∼n,αMn≥1 suchthat

1

nmin

0≤m≤αM−1dm( C∼n,αM)

> inf

a ∈ < : lim sup

n→∞Pr

[1

nDm > a

]= 0

− γ, (9.4.2)

for infinitely many n;

2. for any γ > 0, there exists a universal constant α = α(γ) ∈ (0, 1) (in-dependent of blocklength n) and a codebook sequence C∼n,αMn≥1 suchthat

1

nmin

0≤m≤αM−1dm( C∼n,αM)

> inf

a ∈ < : lim inf

n→∞Pr

[1

nDm > a

]= 0

− γ, (9.4.3)

for sufficiently large n.

Proof: We will only prove (9.4.2). (9.4.3) can be proved by simply following thesame procedure.

Define

LX(R) , inf

a : lim sup

n→∞Pr

[1

nDm > a

]= 0

.

Let 1(A) be the indicator function of a set A, and let

φm , 1(

1

nDm > LX(R)− γ

).

By definition of LX(R),

lim supn→∞

Pr

[1

nDm > LX(R)− γ

]> 0.

Let 2α , lim supn→∞ Pr[(1/n)Dm > LX(R)− γ

]. Then for infinitely many n,

Pr

[1

nDm > LX(R)− γ

]> α.

209

For those n that satisfies the above inequality,

E

[M−1∑m=0

φm

]=

M−1∑m=0

E [φm] > αM,

which implies that among all possible selections, there exist (for infinite manyn) a codebook C∼n,M in which αM codewords satisfy φm = 1, i.e.,

1

ndm( C∼n,M) > LX(R)− γ

for at least αM codewords in the codebook C∼n,M . The collection of these αMcodewords is a desired codebook for the validity of (9.4.2). 2

Our second lemma concerns the spectrum of (1/n)Dm.

Lemma 9.23 Let each codeword be independently selected through the distri-bution PXn . Suppose that Xn is independent of, and has the same distributionas, Xn. Then

Pr

[1

nDm > a

]≥(Pr

1

nµn

(Xn, Xn

)> a

)M.

Proof: Let C(n)m denote the m-th randomly selected codeword. From the defi-

nition of Dm, we have

Pr

1

nDm > a

∣∣∣∣C(n)m

= Pr

min

0≤m≤M−1m 6=m

1

nµn

(C

(n)m ,C(n)

m

)> a

∣∣∣∣∣C(n)m

=∏

0≤m≤M−1m 6=m

Pr

1

nµn

(C

(n)m ,C(n)

m

)> a

∣∣∣∣C(n)m

(9.4.4)

=

(Pr

1

nµn(Xn, Xn) > a

∣∣∣∣Xn

)M−1

,

where (9.4.4) holds because1

nµn

(C

(n)m ,C(n)

m

)0≤m≤M−1

m 6=m

210

is conditionally independent given C(n)m . Hence,

Pr

1

nDm > a

=

∫Xn

(Pr

1

nµn(Xn, xn) > a

)M−1

dPXn(xn)

≥∫Xn

(Pr

1

nµn(Xn, xn) > a

)MdPXn(xn)

= EXn

[(Pr

1

nµn

(Xn, Xn

)> a

∣∣∣∣Xn

)M]

≥ EMXn

[Pr

1

nµn

(Xn, Xn

)> a

∣∣∣∣Xn

](9.4.5)

=

(Pr

1

nµn

(Xn, Xn

)> a

)M,

where (9.4.5) follows from Lyapounov’s inequality [1, page 76], i.e., E1/M [UM ] ≥E[U ] for a non-negative random variable U . 2

We are now ready to prove the main theorem of the paper. For simplicity,throughout the article, Xn and Xn are used specifically to denote two indepen-dent random variables having common distribution PXn .

Theorem 9.24 (distance-spectrum formula)

supX

ΛX(R) ≥ lim supn→∞

dn,Mn≥ sup

XΛX(R + δ) (9.4.6)

and

supX

ΛX(R) ≥ lim infn→∞

dn,Mn≥ sup

XΛX(R + δ) (9.4.7)

for every δ > 0, where

ΛX(R) , inf a ∈ < :

lim supn→∞

(Pr

1

nµn(Xn, Xn) > a

)M= 0

and

ΛX(R) , inf a ∈ < :

lim infn→∞

(Pr

1

nµn(Xn, Xn) > a

)M= 0

.

211

Proof:

1. lower bound. Observe that in Lemma 9.22, the rate only decreases by theamount − log(α)/n when employing a code C∼n,αM . Also note that for anyδ > 0, − log(α)/n < δ for sufficiently large n. These observations, togetherwith Lemma 9.23, imply the validity of the lower bound.

2. upper bound. Again, we will only prove (9.4.6), since (9.4.7) can be provedby simply following the same procedure.

To show that the upper bound of (9.4.6) holds, it suffices to prove theexistence of X such that

ΛX(R) ≥ lim supn→∞

dn,Mn

.

Let Xn be uniformly distributed over one of the optimal codes C∼ ∗n,M . (By“optimal,” we mean that the code has the largest minimum distance amongall codes of the same size.) Define

λn ,1

nmin

0≤m≤M−1dm( C∼ ∗n,M) and λ , lim sup

n→∞λn.

Then for any δ > 0,

λn > λ− δ for infinitely many n. (9.4.8)

For those n satisfying (9.4.8),

Pr

1

nµn

(Xn, Xn

)> λ− δ

≥ Pr

1

nµn

(Xn, Xn

)≥ λn

≥ Pr

Xn 6= Xn

= 1− 1

M,

which implies

lim supn→∞

(Pr

1

nµn(Xn, Xn) > λ− δ

)M≥ lim sup

n→∞

(1− 1

M

)M= lim sup

n→∞

(1− 1

enR

)enR= e−1 > 0.

Consequently, ΛX(R) ≥ λ− δ. Since δ can be made arbitrarily small, theupper bound holds. 2

212

Observe that supX ΛX(R) is non-increasing in R, and hence, the number ofdiscontinuities is countable. This fact implies that

supX

ΛX(R) = limδ↓0

supX

(R + δ)

except for countably many R. Similar argument applies to supX ΛX(R). Wecan then re-phrase the above theorem as appeared in the next corollary.

Corollary 9.25

lim supn→∞

dn,Mn

= supX

ΛX(R)(resp. lim inf

n→∞

dn,Mn

= supX

ΛX(R)

)except possibly at the points of discontinuities of

supX

ΛX(R) (resp. supX

ΛX(R)),

which are countable.

From the above theorem (or corollary), we can characterize the largest mini-mum distance of deterministic block codes in terms of the distance spectrum. Wethus name it distance-spectrum formula. For convenience, ΛX(·) and ΛX(·) willbe respectively called the sup-distance-spectrum function and the inf-distance-spectrum function in the remaining article.

We conclude this section by remarking that the distance-spectrum formulaobtained above indeed have an analogous form to the information-spectrum for-mulas of the channel capacity and the optimistic channel capacity. Furthermore,by taking the distance metric to be the n-fold Bhattacharyya distance [2, Defi-nition 5.8.3], an upper bound on channel reliability [2, Theorem 10.6.1] can beobtained, i.e.,

lim supn→∞

− 1

nlogPe(n,M = enR) ≤

supX

inf

a : lim sup

n→∞

(Pr

[− 1

nlog

∑yn∈Yn

P1/2Y n|Xn(yn|Xn)P

1/2Y n|Xn(yn|Xn) > a

])M= 0

213

and

lim infn→∞

− 1

nlogPe(n,M = enR) ≤

supX

inf

a : lim inf

n→∞

(Pr

[− 1

nlog

∑yn∈Yn

P1/2Y n|Xn(yn|Xn)P

1/2Y n|Xn(yn|Xn) > a

])M= 0

,

where PY n|Xn is the n-dimensional channel transition distribution from codealphabet X n to channel output alphabet Yn, and Pe(n,M) is the average proba-bility of error for optimal channel code of blocklength n and size M . Note thatthe formula of the above channel reliability bound is quite different from thoseformulated in terms of the exponents of the information spectrum (cf. [6, Sec.V] and [17, equation (14)]).

B) Determination of the largest minimum distance for a class of dis-tance functions

In this section, we will present a minor class of distance functions for which theoptimization input X for the distance-spectrum function can be characterized,and thereby, the ultimate largest minimum distance among codewords can beestablished in terms of the distance-spectrum formula.

A simple example for which the largest minimum distance can be derived interms of the new formula is the probability-of-error distortion measure, which isdefined

µn(xn, xn) =

0, if xn = xn;n, if xn 6= xn.

It can be easily shown that

supX

ΛX(R)

≤ inf a ∈ < :

lim supn→∞

supXn

(Pr

1

nµn(Xn, Xn) > a

)M= 0

=

1, for 0 ≤ R < log |X |;0, for R ≥ log |X |,

and the upper bound can be achieved by letting X be uniformly distributed overthe code alphabet. Similarly,

supX

ΛX(R) =

1, for 0 ≤ R < log |X |;0, for R ≥ log |X |.

214

Another example for which the optimizer X of the distance-spectrum func-tion can be characterized is the separable distance function defined below.

Definition 9.26 (separable distance functions)

µn(xn, xn) , fn(|gn(xn)− gn(xn)|),

where fn(·) and gn(·) are real-valued functions.

Next, we derive the basis for finding one of the optimization distributions for

supXn

Pr

1

nµn(Xn, Xn) > a

under separable distance functionals.

Lemma 9.27 Define G(α) , infU Pr∣∣∣U − U ∣∣∣ ≤ α

, where the infimum is

taken over all (U , U) pair having independent and identical distribution on [0, 1].Then for j = 2, 3, 4, . . . and 1/j ≤ α < 1/(j − 1),

G(α) =1

j.

In addition, G(α) is achieved by uniform distribution over0,

1

j − 1,

2

j − 1, . . . ,

j − 2

j − 1, 1

.

Proof:

Pr∣∣∣U − U ∣∣∣ ≤ α

≥ Pr

∣∣∣U − U ∣∣∣ ≤ 1

j

j−2∑i=0

Pr

(U , U

)∈[i

j,i+ 1

j

)2

+Pr

(U , U

)∈[j − 1

j, 1

]2

=

j−2∑i=0

(Pr

U ∈

[i

j,i+ 1

j

))2

+

(Pr

U ∈

[j − 1

j, 1

])2

≥ 1

j.

215

Achievability of G(α) by uniform distribution over0,

1

j − 1,

2

j − 1, . . . ,

j − 2

j − 1, 1

can be easily verified, and hence, we omit here. 2

Lemma 9.28 For 1/j ≤ α < 1/(j − 1),

1

j≤ inf

UnPr∣∣∣Un − Un∣∣∣ ≤ α

≤ 1

j+

1

n+ 0.5,

where the infimum is taken over all (Un, Un) pair having independent and iden-tical distribution on

0,1

n,

2

n, . . . ,

n− 1

n, 1

,

and j = 2, 3, 4, . . . etc.

Proof: The lower bound follows immediately from Lemma 9.27.

To prove the upper bound, let U∗n be uniformly distributed over0,`

n,2`

n, . . . ,

k`

n, 1

,

where ` = bnαc+ 1 and k is an integer satisfying

n

n/(j − 1) + 1− 1 ≤ k <

n

n/(j − 1) + 1.

(Note thatn

j − 1+ 1 ≥

⌊n

j − 1

⌋+ 1 ≥ `.)

Then

infUnPr∣∣∣Un − Un∣∣∣ ≤ α

≤ Pr

∣∣∣U∗n − U∗n∣∣∣ ≤ α

=1

k + 2

≤ 11

1/(j − 1) + 1/n+ 1

=1

j+

(j − 1)2

j2

1

n+ (j − 1)/j

≤ 1

j+

1

n+ 0.5.

216

2

Based on the above Lemmas, we can then proceed to compute the asymptoticlargest minimum distance among codewords of the following examples. It needsto be pointed out that in these examples, our objective is not to attempt tosolve any related problems of practical interests, but simply to demonstrate thecomputation of the distance spectrum function for general readers.

Example 9.29 Assume that the n-tuple code alphabet is 0, 1n. Let the n-folddistance function be defined as

µn(xn, xn) , |u(xn)− u(xn)|

where u(xn) represents the number of 1’s in xn. Then

supX

ΛX(R)

≤ inf

a ∈ < : lim sup

n→∞

supXn

(Pr

1

n

∣∣∣u(Xn)− u(Xn)∣∣∣ > a

)M= 0

= inf

a ∈ < : lim sup

n→∞(1− inf

XnPr

1

n

∣∣∣u(Xn)− u(Xn)∣∣∣ ≤ a

)M= 0

≤ inf

a ∈ < : lim sup

n→∞(1− inf

UPr∣∣∣U − U ∣∣∣ ≤ a

)expnR= 0

= inf

a ∈ < : lim sup

n→∞

(1− 1

d1/ae

)expnR

= 0

= 0.

Hence, the asymptotic largest minimum distance among block codewords is zero.This conclusion is not surprising because the code with nonzero minimal distanceshould contain the codewords of different Hamming weights and the whole num-ber of such words is n+ 1.

Example 9.30 Assume that the code alphabet is binary. Define

µn(xn, xn) , log2 (|gn(xn)− gn(xn)|+ 1) ,

217

wheregn(xn) = xn−1 · 2n−1 + xn−2 · 2n−2 + · · ·+ x1 · 2 + x0.

Then

supX

ΛX(R)

≤ inf

a ∈ < : lim sup

n→∞supXn(

Pr

1

nlog2

(∣∣∣gn(Xn)− gn(Xn)∣∣∣+ 1

)> a

)M= 0

≤ inf

a ∈ < : lim sup

n→∞(1− inf

UPr

∣∣∣U − U ∣∣∣ ≤ 2na − 1

2n − 1

)expnR

= 0

= inf

a ∈ < : lim supn→∞

1− 1⌈2n − 1

2na − 1

expnR

= 0

= 1− R

log 2. (9.4.9)

By taking X∗ to be the one under which UK , gn(Xn)/(2n−1) has the distribu-tion as used in the proof of the upper bound of Lemma 9.28, where K , 2n− 1,

218

we obtain

ΛX∗(R)

≥ inf

a ∈ < : lim sup

n→∞1− 1⌈2n − 1

2na − 1

⌉ − 1

K + 0.5

expnR

= 0

= inf

a ∈ < : lim sup

n→∞1− 1⌈2n − 1

2na − 1

⌉ − 1

2n − 0.5

expnR

= 0

= 1− R

log 2.

This proved the achievability of (9.4.9).

C) General properties of distance-spectrum function

We next address some general functional properties of ΛX(R) and ΛX(R).

Lemma 9.31 1. ΛX(R) and ΛX(R) are non-increasing and right-continuousfunctions of R.

2.

ΛX(R) ≥ D0(X) , lim supn→∞

ess inf1

nµn(Xn, Xn) (9.4.10)

ΛX(R) ≥ D0(X) , lim infn→∞

ess inf1

nµn(Xn, Xn) (9.4.11)

where ess inf represents essential infimum9. In addition, equality holds for(9.4.10) and (9.4.11) respectively when

R > R0(X) , lim supn→∞

− 1

nlogPrXn = Xn (9.4.12)

9For a given random variable Z, its essential infimum is defined as

ess inf Z , supz : Pr[Z ≥ z] = 1.

219

and

R > R0(X) , lim infn→∞

− 1

nlogPrXn = Xn, (9.4.13)

provided that

(∀ xn ∈ X n) minxn∈Xn

µn(xn, xn) = µn(xn, xn) = constant. (9.4.14)

3.ΛX(R) ≤ Dp(X) (9.4.15)

for R > Rp(X); andΛX(R) ≤ Dp(X) (9.4.16)

for R > Rp(X), where

Dp(X) , lim supn→∞

1

nE[µn(Xn, Xn)|µn(Xn, Xn) <∞]

Dp(X) , lim infn→∞

1

nE[µn(Xn, Xn)|µn(Xn, Xn) <∞]

Rp(X) , lim supn→∞

− 1

nlogPrµn(Xn, Xn) <∞

and

Rp(X) , lim infn→∞

− 1

nlogPrµn(Xn, Xn) <∞.

In addition, equality holds for (9.4.15) and (9.4.16) respectively when R =Rp(X) and R = Rp(X), provided that [µn(Xn, Xn)|µn(Xn, Xn) <∞] hasthe large deviation type of behavior, i.e., for all δ > 0,

lim infn→∞

− 1

nlogLn > 0, (9.4.17)

where

Ln , Pr

1

n(Yn − E[Yn|Yn <∞]) ≤ −δ

∣∣∣∣Yn <∞ ,and Yn , µn(Xn, Xn).

4. For 0 < R < Rp(X), ΛX(R) = ∞. Similarly, for 0 < R < Rp(X),ΛX(R) =∞.

Proof: Again, only the proof regarding ΛX(R) will be provided. The propertiesof ΛX(R) can be proved similarly.

1. Property 1 follows by definition.

220

2. (9.4.10) can be proved as follows. Let

en(X) , ess inf1

nµn(Xn, Xn),

and hence, D0(X) = lim supn→∞ en(X). Observe that for any δ > 0 andfor infinitely many n,(

Pr

[1

nµn(Xn, Xn) > D0(X)− 2δ

])M≥

(Pr

[1

nµn(Xn, Xn) > en(X)− δ

])M= 1.

Therefore, ΛX(R) ≥ D0(X)−2δ for arbitrarily small δ > 0. This completesthe proof of (9.4.10).

To prove the equality condition for (9.4.10), it suffices to show that for anyδ > 0,

ΛX(R) ≤ D0(X) + δ (9.4.18)

for

R > lim supn→∞

− 1

nlogPr

Xn 6= Xn

.

By the assumption on the range of R, there exists γ > 0 such that

R > − 1

nlogPr

Xn 6= Xn

+ γ for sufficiently large n. (9.4.19)

Then for those n satisfying en(Xn) ≤ D0(X) + δ/2 and (9.4.19) (of whichthere are sufficiently many)(

Pr

1

nµn(Xn, Xn) > D0(X) + δ

)M≤

(Pr

1

nµn(Xn, Xn) > en(Xn) +

δ

2

)M≤

(PrXn 6= Xn

)M(9.4.20)

=(

1− PrXn = Xn

)M,

where (9.4.20) holds because (9.4.14). Consequently,

lim supn→∞

(Pr

1

nµn(Xn, Xn) > D0(X) + δ

)M= 0

which immediately implies (9.4.18).

221

3. (9.4.15) holds trivially if Dp(X) =∞. Thus, without loss of generality, weassume that Dp(X) < ∞. (9.4.15) can then be proved by observing thatfor any δ > 0 and sufficiently large n,(

Pr

1

nYn > (1 + δ)2 · Dp(X)

)M≤

(Pr

1

nYn > (1 + δ) · 1

nE[Yn|Yn <∞]

)M= (Pr Yn > (1 + δ) · E[Yn|Yn <∞])M

= (Pr Yn =∞+ Pr Yn <∞Pr Yn > (1 + δ) · E[Yn|Yn <∞]|Yn <∞)M

≤(Pr Yn =∞+ Pr Yn <∞

1

1 + δ

)M(9.4.21)

=

(1− δ

1 + δPr Yn <∞

)M,

where (9.4.21) follows from Markov’s inequality. Consequently, for R >Rp(X),

lim supn→∞

(Pr

1

nµn(Xn, Xn) > (1 + δ)2 · Dp(X)

)M= 0.

To prove the equality holds for (9.4.15) at R = Rp(X), it suffices to showthe achievability of ΛX(R) to Dp(X) by R ↓ Rp(X), since ΛX(R) is right-continuous. This can be shown as follows. For any δ > 0, we note from(9.4.17) that there exists γ = γ(δ) such that for sufficiently large n,

Pr

1

nYn −

1

nE[Yn|Yn <∞] ≤ −δ

∣∣∣∣Yn <∞ ≤ e−nγ.

222

Therefore, for infinitely many n,(Pr

1

nYn > Dp(X)− 2δ

)M≥

(Pr

1

nYn >

1

nE[Yn|Yn <∞]− δ

)M= (Pr Yn =∞+ Pr Yn <∞

Pr

1

nYn >

1

nE[Yn|Yn <∞]− δ

∣∣∣∣Yn <∞)M=

(1− Pr

1

nYn −

1

nE[Yn|Yn <∞] ≤ −δ

∣∣∣∣Yn <∞Pr Yn <∞)M

≥(1− e−nγ · Pr Yn <∞

)M.

Accordingly, for Rp(X) < R < Rp(X) + γ,

lim supn→∞

(Pr

1

nYn > Dp(X)− 2δ

)M> 0,

which in turn implies that

ΛX(R) ≥ Dp(X)− 2δ.

This completes the proof of achievability of (9.4.15) by R ↓ Rp(X).

4. This is an immediate consequence of

(∀ L > 0)

(Pr

1

nµn(Xn, Xn) > L

)M≥

(1− Pr

µn(Xn, Xn) <∞

)M.

2

Remarks.

• A weaker condition for (9.4.12) and (9.4.13) is

R > lim supn→∞

1

nlog |Sn| and R > lim inf

n→∞

1

nlog |Sn|,

where Sn , xn ∈ X n : PXn(xn) > 0. This indicates an expected resultthat when the rate is larger than log |X |, the asymptotic largest minimumdistance among codewords remains at its smallest value supX D0(X) (resp.supX D0(X)), which is usually zero.

223

• Based on Lemma 9.31, the general relation between ΛX(R) and the spec-trum of (1/n)µn(Xn, Xn) can be illustrated as in Figure 9.7, which showsthat ΛX(R) lies asymptotically within

ess inf1

nµ(Xn, Xn)

and1

nE[µn(Xn, Xn)|µn(Xn, Xn)]

for Rp(X) < R < R0(X). Similar remarks can be made on ΛX(R). Onthe other hand, the general curve of ΛX(R) (similarly for ΛX(R)) can beplotted as shown in Figure 9.8. To summarize, we remark that under fairlygeneral situations,

ΛX(R)

=∞ for 0 < R < Rp(X);= Dp(X) at R = Rp(X);∈ (D0(X), Dp(X)] for Rp(X) < R < R0(X);= D0(X) for R ≥ R0(X).

• A simple universal upper bound on the largest minimum distance amongblock codewords is the Plotkin bound. Its usual expression is given by [10]for which a straightforward generalization (cf. Appendix B) is

supX

lim supn→∞

1

nE[µn(Xn, Xn)].

Property 3 of Lemma 9.31, however, provides a slightly better form for thegeneral Plotkin bound.

We now, based on Lemma 9.31, calculate the distance-spectrum function ofthe following examples. The first example deals with the case of infinite codealphabet, and the second example derives the distance-spectrum function underunbounded generalized distance measure.

Example 9.32 (continuous code alphabet) Let

X = [0, 1),

and let the marginal distance metric be

µ(x1, x2) = min 1− |x1 − x2|, |x1 − x2| .

Note that the metric is nothing but treating [0, 1) as a circle (0 and 1 areglued together), and then to measure the shorter distance between two posi-tions. Also, the additivity property is assumed for the n-fold distance function,i.e., µn(xn, xn) ,

∑ni=1 µ(xi, xi).

224

Using the product X of uniform distributions over X , the sup-distance-spectrum function becomes

ΛX(R) = inf

a ∈ < : lim sup

n→∞(1− Pr

1

n

n∑i=1

µ(Xi, Xi) ≤ a

)M

= 0

.

By Cramer Theorem [4],

limn→∞

− 1

nPr

1

n

n∑i=1

µ(Xi, Xi) ≤ a

= IX(a),

whereIX(a) , sup

t>0

−ta− logE[e−t·µ(X,X)]

is the large deviation rate function10. Since IX(a) is convex in a, there exists asupporting line to it satisfying

IX(a) = −t∗a− logE[e−t

∗·µ(X,X)],

which implies

a = −s∗IX(a)− s∗ · logE[e−µ(X,X)/s∗

]for s∗ , 1/t∗ > 0; or equivalently, the inverse function of IX(·) is given as

I−1X (R) = −s∗R− s∗ · logE

[e−µ(X,X)/s∗

](9.4.22)

= sups>0

−sR− s · logE

[e−µ(X,X)/s

],

where11 the last step follows from the observation that (9.4.22) is also a support-ing line to the convex I−1

X (R). Consequently,

ΛX(R) = inf a ∈ < : IX(a) < R= sup

s>0

−sR− s · log

[2s ·

(1− e−1/2s

)], (9.4.23)

10We take the range of supremum to be [t > 0] (instead of [t ∈ <] as conventional largedeviation rate function does) since what concerns us here is the exponent of the cumulativeprobability mass.

11One may notice the analog between the expression of the large deviation rate functionIX(a) and that of the error exponent function [2, Thm. 4.6.4] (or the channel reliability expo-nent function [2, Thm. 10.1.5]). Here, we demonstrate in Example 9.32 the basic procedure of

225

which is plotted in Figure 9.9.

Also from Lemma 9.31, we can easily compute the marginal points of thedistance-spectrum function as follows.

R0(X) = − logPrX = X =∞;

D0(X) = ess infµ(X,X) = 0;

Rp(X) = − logPrµ(X,X) <∞ = 0;

Dp(X) = E[µ(X,X)|µ(X,X) <∞] =1

4.

Example 9.33 Under the case that

X = 0, 1

and µn(·, ·) is additive with marginal distance metric µ(0, 0) = µ(1, 1) = 0,µ(0, 1) = 1 and µ(1, 0) = ∞, the sup-distance-spectrum function is obtainedusing the product of uniform (on X ) distributions as:

ΛX(R) = inf a ∈ < : IX(a) < R

= sups>0

−sR− s · log

2 + e−1/s

4

, (9.4.24)

where

IX(a) , supt>0

−ta− logE[e−t·µ(X,X)]

= sup

t>0

−ta− log

2 + e−t

4

.

This curve is plotted in Figure 9.10. It is worth noting that there exists a region

obtaining

inf

a ∈ < : sup

t>0

(−ta− logE

[e−tZ

])< R

= sup

s>0

−sR− s · logE

[e−Z/s

],

for a random variable Z so that readers do not have to refer to literatures regarding to errorexponent function or channel reliability exponent function for the validity of the above equality.This equality will be used later in Examples 9.33 and 9.36, and also, equations (9.4.27) and(9.4.29).

226

that the sup-distance-spectrum function is infinity. This is justified by deriving

R0(X) = − logPrX = X = log 2;

D0(X) = ess infµ(X,X) = 0;

Rp(X) = − logPrµ(X,X) <∞ = log4

3;

Dp(X) = E[µ(X,X)|µ(X,X) <∞] =1

3.

One can draw the same conclusion by simply taking the derivative of (9.4.24)with respect to s, and obtaining that the derivative

−R− log(2 + e−1/s

)− e−1/s

s (2 + e−1/s)+ log(4)

is always positive when R < log(4/3). Therefore, when 0 < R < log(4/3), thedistance-spectrum function is infinity.

From the above two examples, it is natural to question whether the formulaof the largest minimum distance can be simplified to the quantile function12 ofthe large deviation rate function (cf. (9.4.23) and (9.4.24)), especially when thedistance functional is symmetric and additive. Note that the quantile functionof the large deviation rate function is exactly the well-known Varshamov-Gilbertlower bound (cf. the next section). This inquiry then becomes to find the answerof an open question: under what conditions is the Varshamov-Gilbert lower boundtight? Some insight on this inquiry will be discussed in the next section.

D) General Varshamov-Gilbert lower bound

In this section, a general Varshamov-Gilbert lower bound will be derived directlyfrom the distance-spectrum formulas. Conditions under which this lower boundis tight will then be explored.

Lemma 9.34 (large deviation formulas for ΛX(R) and ΛX(R))

ΛX(R) = infa ∈ < : ¯

X(a) < R

andΛX(R) = inf a ∈ < : `X(a) < R

12Note that the usual definition [1, page 190] of the quantile function of a non-decreasingfunction F (·) is defined as: supθ : F (θ) < δ. Here we adopt its dual definition for a non-increasing function I(·) as: infa : I(a) < R. Remark that if F (·) is strictly increasing (resp.I(·) is strictly decreasing), then the quantile is nothing but the inverse of F (·) (resp. I(·)).

227

where ¯X(a) and `X(a) are respectively the sup- and the inf-large deviation

spectrums of (1/n)µn(Xn, Xn), defined as

¯X(a) , lim sup

n→∞− 1

nlogPr

1

nµn(Xn, Xn) ≤ a

and

`X(a) , lim infn→∞

− 1

nlogPr

1

nµn(Xn, Xn) ≤ a

.

Proof: We will only provide the proof regarding ΛX(R). All the properties ofΛX(R) can be proved by following similar arguments.

Define λ , infa ∈ < : ¯

X(a) < R

. Then for any γ > 0,

¯X(λ+ γ) < R (9.4.25)

and¯X(λ− γ) ≥ R. (9.4.26)

Inequality (9.4.25) ensures the existence of δ = δ(γ) > 0 such that for sufficientlylarge n,

Pr

1

nµn(Xn, Xn) ≤ λ+ γ

≥ e−n(R−δ),

which in turn implies

lim supn→∞

(Pr

1

nµn(Xn, Xn) > λ+ γ

)M≤ lim sup

n→∞

(1− e−n(R−δ))enR = 0.

Hence, ΛX(R) ≤ λ + γ. On the other hand, (9.4.26) implies the existence ofsubsequence nj∞j=1 satisfying

limj→∞− 1

njlogPr

1

njµnj(X

nj , Xnj) ≤ λ− γ≥ R,

which in turn implies

lim supn→∞

(Pr

1

nµn(Xn, Xn) > λ− γ

)M≥ lim sup

j→∞

(1− e−njR

)enjR= e−1.

Accordingly, ΛX(R) ≥ λ− γ. Since γ is arbitrary, the lemma therefore holds. 2

228

The above lemma confirms that the distance spectrum function ΛX(·) (resp.ΛX(·)) is exactly the quantile of the sup- (resp. inf-) large deviation spectrumof (1/n)µn(Xn, Xn)n>1 Thus, if the large deviation spectrum is known, so isthe distance spectrum function.

By the generalized Gartner-Ellis upper bound derived in [5, Thm. 2.1], weobtain

¯X(a) ≥ inf

[x≤a]IX(x) = IX(a)

and`X(a) ≥ inf

[x≤a]IX(x) = IX(a),

where the equalities follow from the convexity (and hence, continuity and strictdecreasing) of IX(x) , sup[θ<0][θx − ϕX

(θ)] and IX(x) , sup[θ<0][θx − ϕX(θ)],and

ϕX

(θ) , lim infn→∞

1

nlogE

[eθ·µn(Xn,Xn)

]and

ϕX(θ) , lim supn→∞

1

nlogE

[eθ·µn(Xn,Xn)

].

Based on these observations, the relation between the distance-spectrum expres-sion and the Varshamov-Gilbert bound can be described as follows.

Corollary 9.35

supX

ΛX(R) ≥ supXGX(R) and sup

XΛX(R) ≥ sup

XGX(R)

where

GX(R) , inf a ∈ < : IX(a) < R

= sups>0

[−sR− s · ϕ

X(−1/s)

](9.4.27)

and

GX(R) , infa ∈ < : IX(a) < R

(9.4.28)

= sups>0

[−sR− s · ϕX(−1/s)] . (9.4.29)

Some remarks on the Varshamov-Gilbert bound obtained above are givenbelow.

Remarks.

229

• One can easily see from [2, page 400], where the Varshamov-Gilbert boundis given under Bhattacharyya distance and finite code alphabet, that

supXGX(R)

andsupXGX(R)

are the generalization of the conventional Varshamov-Gilbert bound.

• Since 0 ≤ exp−µn(xn, xn)/s ≤ 1 for s > 0, the function exp−µ(·, ·)/sis always integrable. Hence, (9.4.27) and (9.4.29) can be evaluated underany non-negative measurable function µn(·, ·). In addition, no assumptionon the alphabet space X is needed in deriving the lower bound. Its fullgenerality can be displayed using, again, Examples 2.8 and 9.33, whichresult in exactly the same curves as shown in Figures 9.9 and 9.10.

• Observe that GX(R) and GX(R) are both the pointwise supremum ofa collection of affine functions, and hence, they are both convex, whichimmediately implies their continuity and strict decreasing property on theinterior of their domains, i.e.,

R : GX(R) <∞ and R : GX(R) <∞.

However, as pointed out in [5], ¯X(·) and `X(·) are not necessarily con-

vex, which in turns hints the possibility of yielding non-convex ΛX(·) andΛX(·). This clearly indicates that the Varshamov-Gilbert bound is nottight whenever the asymptotic largest minimum distance among codewordsis non-convex.

An immediate improvement from [5] to the Varshamov-Gilbert boundis to employ the twisted large deviation rate function (instead of IX(·))

JX,h(x) , supθ∈< : ϕX(θ;h)>−∞

[θ · h(x)− ϕX(θ;h)]

and yields a potentially non-convex Varshamov-Gilbert-type bound, whereh(·) is a continuous real-valued function, and

ϕX(θ;h) , lim supn→∞

1

nlogE

[en·θ·h(µn(Xn,Xn)/n)

].

Question of how to find a proper h(·) for such improvement is beyond thescope of this paper and hence is deferred for further study.

• We now demonstrate that ΛX(R) > GX(R) by a simple example.

230

Example 9.36 Assume binary code alphabet X = 0, 1, and n-foldHamming distance µn(xn, xn) =

∑ni=1 µ(xi, xi). Define a measurable func-

tion as follows:

µn(xn, xn) ,

0, if 0 ≤ µn(xn, xn) < αn;αn, if αn ≤ µn(xn, xn) < 2αn;∞, if 2αn ≤ µn(xn, xn),

where 0 < α < 1/2 is a universal constant. Let X be the product ofuniform distributions over X , and let Yi , µ(Xi, Xi) for 1 ≤ i ≤ n.

Then

ϕX

(−1/s)

= lim infn→∞

1

nlogE

[e−µn(Xn,Xn)/s

]= lim inf

n→∞

1

nlog

[Pr

(0 ≤ µn(Xn, Xn)

n< α

)

+Pr

(α ≤ µn(Xn, Xn)

n< 2α

)e−αn/s

]

= lim infn→∞

1

nlog

[Pr

(0 ≤ Y1 + · · ·+ Yn

n< α

)+Pr

(α ≤ Y1 + · · ·+ Yn

n< 2α

)e−αn/s

]= max

−IY (α), −IY (2α)− αn

s

.

where IY (α) , sups>0sα − log[(1 + es)/2] (which is exactly the largedeviation rate function of (1/n)Y n). Hence,

GX(R)

= sups>0

[−sR− s ·max

−IY (α), −IY (2α)− αn

s

]= sup

s>0min s[IY (α)−R], s[IY (2α)−R] + α

=

∞, for 0 ≤ R < IY (2α);α(IY (α)−R)

IY (α)− IY (2α), for IY (2α) ≤ R < IY (α);

0, for R ≥ IY (α).

231

We next derive ΛX(R). Since

Pr

(1

nµn(Xn, Xn) ≤ a

)

=

Pr

(0 ≤ Y1 + · · ·+ Yn

n< α

), 0 ≤ a < α;

Pr

(0 ≤ Y1 + · · ·+ Yn

n< 2α

), α ≤ a < 2α,

we obtain

ΛX(R) =

∞, if 0 ≤ R < IY (2α);α, if IY (2α) ≤ R < IY (α);0, if IY (α) ≤ R.

Consequently, ΛX(R) > GX(R) for IY (2α) < R < IY (α).

• One of the problems that remain open in the combinatorial coding theoryis the tightness of the asymptotic Varshamov-Gilbert bound for the binarycode and the Hamming distance [3, page vii]. As mentioned in Section I, itis already known that the asymptotic Varshamov-Gilbert bound is in gen-eral not tight, e.g., for algebraic-geometric code with large code alphabetsize and Hamming distance. Example 9.36 provides another example toconfirm the untightness of the asymptotic Varshamov-Gilbert bound forsimple binary code and quantized Hamming measure.

By the generalized Gartner-Ellis lower bound derived in [5, Thm. 2.1],we conclude that

¯X(R) = IX(R) (or equivalently ΛX(R) = GX(R))

if [D0(X), Dp(X)

]∈⋃θ<0

[lim sup

t↓0

ϕX

(θ + t)− ϕX

(θ)

t,

lim inft↓0

ϕX

(θ)− ϕX

(θ − t)t

]. (9.4.30)

Note that although (9.4.30) guarantees that ΛX(R) = GX(R), it dose notby any means ensure the tightness of the Varshamov-Gilbert bound. Anadditional assumption needs to be made, which is summarized in the nextobservation.

232

Observation 9.37 If there exists an X such that

supX

ΛX(R) = ΛX(R)

and (9.4.30) holds for X, then the asymptotic Varshamov-Gilbert lowerbound is tight.

Problem in applying the above observation is the difficulty in finding theoptimizing processX. However, it does provide an alternative to prove thetightness of the asymptotic Varshamov-Gilbert bound. Instead of findingan upper bound to meet the lower bound, one could show that (9.4.30)holds for a fairly general class of random-code generating processes, whichthe optimization process surely lies in.

9.4.4 Elias bound: a single-letter upper bound formulaon the largest minimum distance for block codes

As stated in the previous chapter, the number of compositions for i.i.d. sourceis of the polynomial order. Therefore, we can divide any block code C∼ into apolynomial number of constant-composition subcodes. To be more precise

(n+ 1)|X |maxCi| C∼ ∩ Ci| ≥ | C∼| =

∑Ci

| C∼ ∩ Ci| ≥ maxCi| C∼ ∩ Ci|. (9.4.31)

where Ci represents a set of xn with the same composition. As a result of(9.4.31), the rate of code C∼ should be equal to C∼∩C∗ where C∗ is the maximizerof (9.4.31). We can then conclude that any block code should have a constantcomposition subcode with the same rate, and the largest minimum distance forblock codes should be upper bounded by that for constant composition blockcodes.

The idea of the Elias bound for i.i.d. source is to upper bound the largestminimum distance among codewords by the average distance among those code-words on the spherical surface. Note that the minimum distance among code-words within a spherical surface should be smaller than the minimum distanceamong codewords on the spherical surface, which in terms is upper bounded byits corresponding average distance.

By observing that points locate on the spherical surface in general have con-stant joint composition with the center, we can model the set of spherical surfacein terms of joint composition13. To be more precise, the spherical surface cen-tered at xn with radius r can be represented by

Axn,r , xn : d(xn, xn) = r.

13For some distance measures, points on the spherical surface may have more than one joint

233

(Recall that the sphere is modeled as xn : d(xn, xn) ≤ r.)

d(xn, xn) ,n∑`=1

d(x`, x`)

=K∑i=1

K∑j=1

#ij(xn, xn) · d(i, j),

where #ij(xn, xn) represents the number of (x`, x`) pair that is equal to (i, j),and X = 1, 2, . . . , K is the code alphabet. Hence, for any xn in Axn,r, it issufficient for (xn, xn) to have constant joint composition. Therefore, Axn,r canbe re-defined as

Axn,r , xn : #ij(xn, xn) = nij,

where nij1≤i≤K,1≤j≤K is a fixed joint composition.

Example 9.38 Assume a binary alphabet. Suppose n = 4 and the marginalcomposition is #0,#1 = 2, 2. Then the set of all words with such compo-sition is

0011, 0101, 0110, 1010, 1100, 1001

which has 4!/(2!2!) = 6 elements. Choose the joint composition to be

#00,#01,#10,#11 = 1, 1, 1, 1.

Then the set of the spherical surface centered at (0011) is

0101, 1001, 0110, 1010,

which has [2!/(1!1!)][2!/(1!1!)] = 4 elements.

Lemma 9.39 (average number of composition-constant codewords onspherical surface) Fix a composition ni1≤i≤K for code alphabet

1, 2, . . . , K.

compositions with the center. For example, if d(0, 1) = d(0, 2) = 1 for ternary alphabet, thenboth 0011 and 0022 locate on the spherical surface

x4 : d(0000, x4) = 2;

however, they have different joint composition with the center. In this subsection, what weconcern are the general distance measures, and hence, points on spherical surface in generalhave constant joint composition with the centers.

234

The average number of Mn codewords of blocklength n on a spherical surfacedefined over a joint composition nij1≤i≤K,1≤j≤K (with

∑Kj=1 nij =

∑Kj=1 nji =

ni) is equal to

T = Mn

(K∏i=1

ni

)2

n!K∏i=1

K∏j=1

nij!

.

Proof:

Step 1: Number of words. The number of words that have a compositionni1≤i≤K is N , n!/

∏Ki=1 ni!. The number of words on the spherical

surface centered at xn (having a composition ni1≤i≤K)

A(xn) , xn ∈ X n : #ij(xn, xn) = nij

is equal to

J =K∏i=1

ni!K∏j=1

nij!

.

Note that since∑K

i=1 = nj, A(xn) is equivalent to

A(xn) , xn ∈ X n : #ixn = ni and #ij(xn, xn) = nij.

Step 2: Average. Consider all spherical surfaces centered at words with com-position ni1≤i≤K . There are N of them, which we index by r in a fixedorder, i.e., A1,A2, . . . ,Ar, . . . ,AN . Form a code book C∼ with Mn code-words, each has composition ni1≤i≤K . Because each codeword sharesjoint composition nij with J words with composition ni1≤i≤K , it is onJ spherical surfaces among A1, . . . ,AN . In other words, if Tr denote thenumber of codewords in Ar, i.e., Tr , | C∼ ∩ Ar|, then

N∑r=1

Tr = MnJ.

Hence, the average number of codewords on one spherical surface satisfies

T =MnJ

N= Mn

(∏Ki=1 ni

)2

n!∏K

i=1

∏Kj=1 nij!

.

2

235

Observation 9.40 (Reformulating T )

a(n)en(R−I(X;X)) < T < b(n)ε21(n)en(R−I(X;X)).

where PXX is the composition distribution to nij1≤i≤K,1≤j≤K defined in theprevious Lemma, and a(n) and b(n) are both polynomial functions of n.

Proof: We can write T as

T = Mnn!∏K

i=1

∏Kj=1 nij!

∏Ki=1 nin!

∏Ki=1 nin!

.

By the Stirling’s approximation14, we obtain

ε1(n)enH(X,X) <n!∏K

i=1

∏Kj=1 nij!

< ε2(n)enH(X,X),

and

ε1(n)enH(X) <n!∏Ki=1 ni!

< ε2(n)enH(X),

where15 ε1(n) = 6−2Kn(1−2K)/2 and ε2(n) = 6√n, which are both polynomial

functions of n. Together with the fact that H(X) = H(X) and Mn = enR, wefinally obtain

ε1(n)ε22(n)en(R+H(X,X)−H(X)−H(X)) < T < ε2(n)ε2

1(n)en(R+H(X,X)−H(X)−H(X)).

14Stirling’s approximation:

√2nπ

(ne

)n< n! <

√2nπ

(ne

)n(1 +

1

12n− 1

).

15Using a simplified Stirling’s approximation as

√n(ne

)n< n! < 6

√n(ne

)n,

we have√n(ne

)n∏Ki=1

∏Kj=1 6

√nij(nije

)nij < n!∏Ki=1

∏Kj=1 nij !

<6√n(ne

)n∏Ki=1

∏Kj=1

√nij(nije

)nij ⇔√

n∏Ki=1

∏Kj=1 nij

6−2Knn∏Ki=1

∏Kj=1 n

nijij

<n!∏K

i=1

∏Kj=1 nij !

<

√n∏K

i=1

∏Kj=1 nij

6nn∏Ki=1

∏Kj=1 n

nijij

.

Since

n(1−2K)/2 ≤√

n∏Ki=1

∏Kj=1 nij

≤√n,

236

2

This observations tell us that if we choose Mn sufficiently large, say R >I(X; X), then T goes to infinity. We can then conclude that there exists aspherical surface such that Tr = | C∼ ∩ Ar| goes to infinity. Since the minimumdistance among all codewords must be upper bounded by the average distanceamong codewords on spherical surface Ar, an upper bound of the minimumdistance can be established via the next theorem.

Theorem 9.41 For Hamming(-like) distance, which is defined by

d(i, j) =

0, if i = j,A, if i 6= j

,

d(R) ≤ maxPX

minPXX

: PX=PX

=PX ,I(X;X)<R

K∑i=1

K∑j=1

K∑k=1

PXX(i, k)PXX(j, k)

PX(k)d(i, j).

Proof:

Step 1: Notations. Choose a composition-constant code C∼ with rate R andcomposition ni1≤i≤K , and also choose a joint-composition

nij1≤i≤K,1≤j≤K

whose marginal compositions are both ni1≤i≤K , and satisfies I(X; X) <R, where PXX is its joint composition distribution. From Observation 9.40,

the above derivation can be continued as

6−2Kn(1−2K)/2 nn∏Ki=1

∏Kj=1 n

nijij

<n!∏K

i=1

∏Kj=1 nij !

< 6√n

nn∏Ki=1

∏Kj=1 n

nijij

⇔ ε1(n)nn∏K

i=1

∏Kj=1 n

nijij

<n!∏K

i=1

∏Kj=1 nij !

< ε2(n)nn∏K

i=1

∏Kj=1 n

nijij

.

Finally,

nn∏Ki=1

∏Kj=1 n

nijij

= en logn−∑Ki=1

∑Kj=1 nij lognij

= e∑Ki=1

∑Kj=1 nij logn−

∑Ki=1

∑Kj=1 nij lognij

= e−∑Ki=1

∑Kj=1 nij log

nij

n

= e−n∑Ki=1

∑Kj=1 PXX(i,j) logPXX(i,j)

= enH(X,X).

237

if I(X; X) < R, T increases exponentially with respective to n. We canthen find a spherical surface Ar such that the number of codewords in it,namely Tr = | C∼ ∩ Ar|, increases exponentially in n, where the sphericalsurface is defined on the joint composition nij1≤i≤K,1≤j≤K .

Define

dr ,

∑xn∈ C∼∩Ar

∑xn∈ C∼∩Ar,xn 6=xn

d(xn, xn)

Tr(Tr − 1).

Then it is obvious that the minimum distance among codewords is upperbounded by dr.

Step 2: Upper bound of dr.

dr ,

∑xn∈ C∼∩Ar

∑xn∈ C∼∩Ar,xn 6=xn

d(xn, xn)

Tr(Tr − 1)

=

∑xn∈ C∼∩Ar

∑xn∈ C∼∩Ar,xn 6=xn

(K∑i=1

K∑j=1

#ij(xn, xn)d(i, j)

)Tr(Tr − 1)

.

Let 1ij|`(xn, xn) = 1 if x` = i and x` = j; and zero, otherwise. Then

dr =

∑xn∈ C∼∩Ar

∑xn∈ C∼∩Ar,xn 6=xn

(K∑i=1

K∑j=1

[n∑`=1

1ij|`(xn, xn)

]d(i, j)

)Tr(Tr − 1)

=

K∑i=1

K∑j=1

n∑`=1

∑xn∈ C∼∩Ar

∑xn∈ C∼∩Ar,xn 6=xn

1ij|`(xn, xn)

d(i, j)

Tr(Tr − 1)

Define

Ti|` = |xn ∈ C∼ ∩ Ar : x` = i|, and λi|` =Ti|`Tr.

Then∑K

i=1 Ti|` = Tr. (Hence, λi|`Ki=1 is a probability mass function.) Inaddition, suppose νj|` is 1 when the `-th component of the center of thespherical surface Ar is j; and 0, otherwise. Then

n∑`=1

Ti|`νj|` = Trnij,

238

which is equivalent ton∑`=1

λi|`νj|` = nij.

Therefore,

dr =

K∑i=1

K∑j=1

[n∑`=1

Ti|`Tj|`

]d(i, j)

Tr(Tr − 1)

=Tr

Tr − 1

K∑i=1

K∑j=1

[n∑`=1

λi|`λj|`

]d(i, j)

≤ TrTr − 1

maxλ :

∑Ki=1 λi|`=1,

and ∑n`=1 λi|`νj|`=nij

K∑i=1

K∑j=1

[n∑`=1

λi|`λj|`

]d(i, j)

Step 3: Claim. The optimizer of the above maximization is

λ∗i|` =K∑k=1

niknkνk|`.

(The maximizer of the above maximization can actually be solved in termsof Lagrange multiplier techniques. For simplicity, a direct derivation of thismaximizer is provided in the following step.) proof: For any λ,

n∑`=1

λ∗i|`λj|` =n∑`=1

(K∑k=1

niknkνk|`

)λj|`

=K∑k=1

niknk

(n∑`=1

νk|`λj|`

)

=K∑k=1

niknjknk

239

By the fact that

n∑`=1

αi|`αj|` −n∑`=1

α∗i|`α∗j|`

=n∑`=1

(αi|`αj|` − α∗i|`α∗j|`)

=n∑`=1

(αi|`αj|` − αi|`α∗j|` + α∗i|`αj|` − α∗i|`α∗j|`)

=n∑`=1

(αi|` − α∗i|`)(αj|` − α∗j|`)

=n∑`=1

ci|`cj|`,

where ci|` , αi|` − α∗i|`, we obtain

K∑i=1

K∑j=1

[n∑`=1

λi|`λj|`

]d(i, j)−

K∑i=1

K∑j=1

[n∑`=1

λ∗i|`λ∗j|`

]d(i, j)

=K∑i=1

K∑j=1

[n∑`=1

ci|`cj|`

]d(i, j)

=n∑`=1

[K∑i=1

K∑j=1

ci|`cj|`d(i, j)

]

=n∑`=1

A

[K∑i=1

K∑j=1,j 6=i

ci|`cj|`

]

=n∑`=1

A

[K∑i=1

K∑j=1

ci|`cj|` −K∑i=1

c2i|`

]

=n∑`=1

A

[(K∑i=1

ci|`

)(K∑j=1

cj|`

)−

K∑i=1

c2i|`

]

= −n∑`=1

AK∑i=1

c2i|` ≤ 0.

240

Accordingly,

dr ≤Tr

Tr − 1

n∑`=1

K∑i=1

K∑j=1

[K∑k=1

niknjknk

]d(i, j)

=Tr

Tr − 1n

n∑`=1

K∑i=1

K∑j=1

[K∑k=1

PXX(i, k)PXX(j, k)

PX(k)

]d(i, j),

which implies that

drn≤ TrTr − 1

n∑`=1

K∑i=1

K∑j=1

[K∑k=1

PXX(i, k)PXX(j, k)

PX(k)

]d(i, j). (9.4.32)

Step 4: Final step. Since (9.4.32) holds for any joint composition satisfyingthe constraints, and Tr goes to infinity as n goes to infinity, we can takethe minimum of PXX and yield a better bound. However, the marginalcomposition is chosen so that the composition-constant subcode has thesame rate as the optimal code which is unknown to us. Hence, we choosea pessimistic viewpoint to take the maximum over all PX , and completethe proof. 2

The upper bound in the above theorem is called the Elias bound, and willbe denoted by dE(R).

9.4.5 Gilbert bound and Elias bound for Hamming dis-tance and binary alphabet

The (general) Gilbert bound is

G(R) , supX

sups>0

[−sR− s · ϕX(s)] ,

where ϕX(s) , lim supn→∞(1/n) logE[exp−µn(Xn, Xn)/s], and Xn ≡ Xn areindependent. For binary alphabet,

Prµ(X,X) ≤ y =

0, y < 0

Pr(X = X), 0 ≤ y < 11 , y ≥ 1.

Hence,

G(R) = sup0<p<1

sups>0

[−sR− s log

(p2 + (1− p)2 + 2p(1− p)e−1/s

)],

241

where PX(0) = p and PX(1) = 1− p. By taking the derivatives with respectiveto p, the optimizer can be found to be p = 1/2. Therefore,

G(R) = sups>0

[−sR− s log

1 + e−1/s

2

].

The Elias bound for Hamming distance and binary alphabet is equal to

dE(R) , maxPX

minPXX

: PX=PX

=PX ,I(X;X)<R

K∑i=1

K∑j=1

K∑k=1

PXX(i, k)PXX(j, k)

PX(k)d(i, j)

= maxPX

minPXX

: PX=PX

=PX ,I(X;X)<R2

K∑k=1

PXX(0, k)PXX(1, k)

PX(k)

= max0<p<1

min0≤a≤p : a log a

p2+2(p−a) log p−a

p(1−p)+(1−2p+a) log 1−2p+a

(1−p)2=R

2

[a(p− a)

p+

(p− a)(1− 2p+ a)

1− p

],

where PX(0) = p and PXX(0, 0) = a. Since

f(p, a) , 2

[a(p− a)

p+

(p− a)(1− 2p+ a)

1− p

],

is non-increasing with respect to a, by assuming that the solution for

a loga

p2+ 2(p− a) log

p− ap(1− p)

+ (1− 2p+ a) log1− 2p+ a

(1− p)2= R

is a∗p which is a function of p,

dE(R) = max0<p<1

f(p,mina∗p, p).

9.4.6 Bhattacharyya distance and expurgated exponent

The Gilbert bound for additive distance measure µn(xn, xn) =∑n

i=1 µ(xi, xi) canbe written as

G(R) = maxs>0

maxPX

−sR− s log

[∑x∈X

∑x′∈X

PX(x)PX(x′)e−µ(x,x′)/s

]. (9.4.33)

Recall that the expurgated exponent for DMC with generic distribution PY |Xis defined by

Eex(R) , maxs≥1

maxPX

[−sR− s log

∑x∈X

∑x′∈X

PX(x)PX(x′)

242

(∑y∈Y

√PY |X(y|x)PY |X(y|x′)

)1/s ,

which can be re-formulated as:

Eex(R) , maxs≥1

maxPX

−sR− s log

[∑x∈X

∑x′∈X

PX(x)PX(x′)

e−(− log∑y∈Y√PY |X(y|x)PY |X(y|x′))/s

]. (9.4.34)

Comparing (9.4.33) and (9.4.34), we can easily find that the expurgated lowerbound (to reliability function) is nothing but the Gilbert lower bound (to mini-mum distance among codewords) with

µ(x, x′) , − log∑y∈Y

√PY |X(y|x)PY |X(y|x′).

This is called the Bhattacharyya distance for channel PY |X .

Note that the channel reliability function E(R) for DMC satisfies

Eex(R) ≤ E(R) ≤ lim supn→∞

1

ndn,M=en,R ;

also note that

G(R) ≤ 1

ndn,M=en,R

and for optimizer s∗ ≥ 1,Eex(R) = G(R).

We conclude that the channel reliability function for DMC is solved (not only forhigh rate but also for low rate, if the Gilbert bound is tight for Bhattacharyyadistance! This again confirms the significance of the query on when the Gilbertbound being tight!

The final remark in this subsection regarding the resemblance of Bhattachar-ya distance and Hamming distance. The Bhattacharyya distance is in generalnot necessary a Hamming-like distance measure. The channel which results ina Hamming-like Bhattacharyya distance is named equidistance channels. Forexample, any binary input channel is a equidistance channel.

9.5 Straight line bound

It is conjectured that the channel reliability function is convex. Therefore, anynon-convex upper bound can be improved by making it convex. This is the mainidea of the straight line bound.

243

Definition 9.42 (list decoder) A list decoder decodes the outputs of a noisychannel by a list of candidates (possible inputs), and an error occurs only whenthe correct codeword transmitted is not in the list.

Definition 9.43 (maximal error probability for list decoder)

Pe,max(n,M,L) , min C∼ with L candidates for decoder

max1≤m≤M

Pe|m,

where n is the blocklength and M is the code size.

Definition 9.44 (average error probability for list decoder)

Pe(n,M,L) , min C∼ with L candidates for decoder

1

M

∑1≤m≤M

Pe|m,

where n is the blocklength and M is the code size.

Lemma 9.45 (lower bound on average error) For DMC,

Pe(n,M, 1) ≥ Pe(n1,M,L)Pe,max(n2, L+ 1, 1),

where n = n1 + n2.

Proof:

Step 1: Partition of codewords. Partition each codeword into two parts: aprefix with length n1 and a suffix of length n2. Likewise, divide each outputobservations into a prefix part with length n1 and a suffix part with lengthn2.

Step 2: Average error probability. Let

c1, . . . , cM

be the optimal codewords, and U1, . . . ,UM be the corresponding outputpartitions. Denote the prefix parts of the codewords as c1, . . . , cM and thesuffix parts as c1, . . . , cM . Similar notations applies to outputs yn. Then

Pe(n,M, 1)

=1

M

M∑m=1

∑yn 6∈Um

PY n|Xn(yn|cm)

=1

M

M∑m=1

∑yn 6∈Um

PY n1 |Xn1 (yn1|cm)PY n2 |Xn2 (yn2|cm)

=1

M

M∑m=1

∑yn1∈Yn1

∑yn2 6∈Um(yn1 )

PY n1 |Xn1 (yn1|cm)PY n2 |Xn2 (yn2|cm),

244

where

Um(yn1) , yn2 ∈ Yn2 : concatenation of yn1 and yn2 ∈ Um .

Therefore,

Pe(n,M, 1)

=1

M

M∑m=1

∑yn1∈Yn1

PY n1 |Xn1 (yn1|cm)∑

yn2 6∈Um(yn1 )

PY n2 |Xn2 (yn2|cm)

Step 3: |1 ≤ m ≤M : Pe|m( C∼) < Pe,max(n, L+ 1, 1)| ≤ L.

Claim: For any code with size M ≥ L, the number of m that satisfying

Pe|m < Pe,max(n, L+ 1, 1)

is at most L.

proof: Suppose the claim is not true. Then there exist at least L + 1codewords satisfying Pe|m( C∼) < Pe,max(n, L + 1, 1). Form a new code C∼ ′with size L + 1 by these L + 1 codewords. The error probability giveneach transmitted codeword should be upper bounded by its correspondingoriginal Pe|m, which implies that max1≤m≤L+1 Pe|m( C∼ ′) is strictly less thanPe,max(n, L+ 1, 1), contradicted to its definition.

Step 4: Continue from step 2. Treat Um(yn1)Mm=1 as decoding partitionsat the output site for observations of length n2, and denote its resultanterror probability given each transmitted codeword by Pe|m(yn1). Then

Pe(n,M, 1)

=1

M

M∑m=1

∑yn1∈Yn1

PY n1 |Xn1 (yn1|cm)∑

yn2 6∈Um(yn1 )

PY n2 |Xn2 (yn2|cm)

=1

M

M∑m=1

∑yn1∈Yn1

PY n1 |Xn1 (yn1|cm)Pe|m(yn1)

≥ 1

M

∑m∈L

∑yn1∈Yn1

PY n1 |Xn1 (yn1 |cm)Pe,max(n2, L+ 1),

where L is the set of indexes satisfying

Pe|m(yn1) ≥ Pe,max(n2, L+ 1).

Note that |L| ≥M − L+ 1. We finally obtain

Pe(n,M, 1) ≥ Pe,max(n2, L+ 1)1

M

∑m∈L

∑yn1∈Yn1

PY n1 |Xn1 (yn1 |cm).

245

By partitioning Yn1 into |L| mutually disjoint decoding sets, and using(1, 2, . . . ,M − L) plus the codeword itself as a decoding candidate listwhich is at most L,

1

M

∑m∈L

∑yn1∈Yn1

PY n1 |Xn1 (yn1|cm) ≥ Pe(n1,M,L)

Combining this with the previous inequality completes the proof. 2

Theorem 9.46 (straight-line bound)

E(λR1 + (1− λ)R2) ≤ λEp(R1) + (1− λ)Esp(R2).

Proof: By applying the previous lemma with λ = n1/n, log(M/L)/n1 = R1 andlogL/n2 = R2, we can upper bound Pe(n1,M,L) by partitioning exponent, andPe,max(n2, L+ 1, 1) by sphere-packing exponent. 2

246

PYn|Xn(./cm)

PYn|Xn(./cm’)

PYn|Xn(./cm)

PYn|Xn(./cm’)

(a)

(b)

Figure 9.6: (a) The shaded area is U cm; (b) The shaded area is Acm,m′ .

247

ΛX(R) -

ess inf(1/n)µn(Xn, Xn)

(1/n)E[µ(Xn, Xn)|µ(Xn, Xn) <∞]

Probability mass

of (1/n)µn(Xn, Xn)+

Figure 9.7: ΛX(R) asymptotically lies between ess inf(1/n)µ(Xn, Xn)and (1/n)E[µn(Xn, Xn)|µn(Xn, Xn)] for Rp(X) < R < R0(X).

R

ΛX(R)

-

6

∞ cppppppppps

Rp(X)

Dp(X)

cpppsp p p p p p p p p p p

cppps

R0(X)

D0(X)

Figure 9.8: General curve of ΛX(R).

248

0

1/4

0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5R

Figure 9.9: Function of sups>0

−sR− s log

[2s(1− e−1/2s

)].

0

1/3

0 log(4/3) log(2)R

infinite region

Figure 9.10: Function of sups>0

−sR− s log

[(2 + e−1/s

)/4]

.

249

Bibliography

[1] P. Billingsley. Probability and Measure, 2nd edition, New York, NY: JohnWiley and Sons, 1995.

[2] Richard E. Blahut. Principles and Practice of Information Theory. AddisonWesley, Massachusetts, 1988.

[3] V. Blinovsky. Asymptotic Combinatorial Coding Theory. Kluwer Acad. Pub-lishers, 1997.

[4] James A. Bucklew. Large Deviation Techniques in Decision, Simulation,and Estimation. Wiley, New York, 1990.

[5] P.-N. Chen, “Generalization of Gartner-Ellis theorem,” submitted to IEEETrans. Inform. Theory, Feb. 1998.

[6] P.-N. Chen and F. Alajaji, “Generalized source coding theorems and hy-pothesis testing,” Journal of the Chinese Institute of Engineering, vol. 21,no. 3, pp. 283-303, May 1998.

[7] P.-N. Chen and F. Alajaji, “On the optimistic capacity of arbitrary chan-nels,” in proc. of the 1998 IEEE Int. Symp. Inform. Theory, Cambridge,MA, USA, Aug. 16–21, 1998.

[8] P.-N. Chen and F. Alajaji, “Optimistic shannon coding theorems for ar-bitrary single-user systems,” IEEE Trans. Inform. Theory, vol. 45, no. 7,pp. 2623–2629, Nov. 1999.

[9] T. Ericson and V. A. Zinoviev, “An improvement of the Gilbert bound forconstant weight codes,” IEEE Trans. Inform. Theory, vol. IT-33, no. 5,pp. 721–723, Sep. 1987.

[10] R. G. Gallager. Information Theory and Reliable Communications. JohnWiley & Sons, 1968.

[11] G. van der Geer and J. H. van Lint. Introduction to Coding Theory andAlgebraic Geometry. Birkhauser:Basel, 1988.

250

[12] S. Verdu and T. S. Han, “A general formula for channel capacity,” IEEETrans. on Information Theory, vol. IT–40, no. 4, pp. 1147–1157, Jul. 1994.

[13] A. N. Kolmogorov and S. V. Fomin. Introductory Real Analysis. NewYork:Dover Publications, Inc., 1970.

[14] J. H. van Lint, Introduction to Coding Theory. New York:Springer-Verlag,2nd edition, 1992.

[15] S. N. Litsyn and M. A. Tsfasman, “A note on lower bounds,” IEEETrans. on Inform. Theory, vol. IT-32, no. 5, pp. 705–706, Sep. 1986.

[16] J. K. Omura, “On general Gilbert bounds,” IEEE Trans. Inform. Theory,vol. IT-19, no. 5, pp. 661–666, Sep. 1973.

[17] H. V. Poor and S. Verdu, “A lower bound on the probability of error inmultihypothesis testing,” IEEE Trans. on Information Theory, vol. IT–41,no. 6, pp. 1992–1994, Nov. 1995.

[18] H. L. Royden. Real Analysis. New York:Macmillan Publishing Company,3rd edition, 1988.

[19] M. A. Tsfasman and S. G. Vladut. Algebraic-Geometric Codes. Nether-lands:Kluwer Academic Publishers, 1991.

[20] S. G. Vladut, “An exhaustion bound for algebraic-geometric modularcodes,” Probl. Info. Trans., vol. 23, pp. 22–34, 1987.

[21] V. A. Zinoviev and S. N. Litsyn, “Codes that exceed the Gilbert bound,”Problemy Peredachi Informatsii, vol. 21:1, pp. 105–108, 1985.

251

Chapter 10

Information Theory of Networks

In this chapter, we consider the theory regarding to the communications amongmany (more than three) terminals, which is usually named network. Several kindof variations on this research topic are formed. What follow are some of them:

Example 10.1 (multi-access channels) This is a standard problem on dis-tributed source coding (data compression). In this channel, two or more senderssend information to a common receiver. In order to reduce the overall trans-mitted rate on network, data compressions are applied independently amongsenders. As shown in Figure 10.1, we want to find lossless compressors f1, f2,and f3 such that R1 +R2 +R3 is minimized. In this research topic, some otherinteresting problems to study are, for example, “how do the various senderscooperate with each other, i.e., relations of fi’s?” and “What is the relationbetween rates and interferences among senders?”.

Example 10.2 (broadcast channel) Another standard problem on networkinformation theory, which is of great interest in research, is the broadcast chan-nel, such as satellite TV networks, which consists of one sender and many re-ceivers. It is usually assumed that each receiver receives the same broadcastinformation with different noise.

Example 10.3 (distributed detection) This is a variation of the multiaccesschannel in which the information of each sender comes from observing the sametargets. The target could belong to one of several categories. Upon receipt of thecompressed data from these senders, the receiver should decide which categoryis the current observed one under some optimization criterion.

Perhaps the simplest model of the distributed detection consists of only twocategories which usually name null hypothesis and alternative hypothesis. Thecriterion is chosen to be the Bayesian cost or Neyman-Pearson type II errorprobability subject to a fixed bound on the type I error probability. This systemis depicted in Figure 10.2.

252

-

-

-

Inform3

Inform2

Inform1

Terminal3compressor f3

with rate R3

Terminal2compressor f2

with rate R2

Terminal1compressor f1

with rate R1

-

-

-

-

-

-Noiselessnetwork(such as

wired net)

-

Terminal4globallydecom-pressor

-

Inform1,Inform2,Inform3

Figure 10.1: The multi-access channel.

Example 10.4 (some other examples)

1) Relay channel. This channel consists of one source sender, one destinationreceiver, and several intermediate sender-receiver pairs that act as relaysto facilitate the communication between the source sender and destinationreceiver.

2) Interference channel. Several senders and several receivers communicate si-multaneously on common channel, where interference among them couldintroduce degradation on performance.

3) Two-way communication channel. Instead of conventional one-way channel,two terminals can communicate in a two-way fashion (full duplex).

10.1 Lossless data compression over distributed sourcesfor block codes

10.1.1 Full decoding of the original sources

In this section, we will first consider the simplest model of the distributed (cor-related) sources which consists only two random variables, and then, extend theresult to those models consisting of three or more random variables.

253

-

-

-

-

-

-

- Fusion center - D(U1, . . . , Un)

gn

g2

g1

···

Yn

Y2

Y1

Un

U2

U1

Figure 10.2: Distributed detection with n senders. Each observations Yimay come from one of two categories. The final decision D ∈ H0, H1.

Definition 10.5 (independent encoders among distributed sources) T-here are several sources

X1, X2, . . . , Xm

(may or may not be independent) which are respectively obtained by m termi-nals. Before each terminal transmits its local source to the receiver, a blockencoder fi with rate R1 and block length n is applied

fi : X ni → 1, 2, . . . , 2nRi.

It is assumed that there is no conspiracy among block encoders.

Definition 10.6 (global decoder for independently compressed source-s) A global decoder g(·) will recover the original sources after receiving all theindependently compressed sources, i.e.,

g : 1, . . . , 2R1 × · · · × 1, . . . , 2Rm → X n1 × · · · × X n

m.

Definition 10.7 (probability of error) The probability of error is defined as

Pe(n) , Prg(f1(Xn1 ), . . . , fm(Xn

m)) 6= (Xn1 , . . . , X

nm).

Definition 10.8 (achievable rates) A rates (R1, . . . , Rm) is said to be achiev-able if there exists a sequence of block codes such that

lim supn→∞

Pe(n) = 0.

Definition 10.9 (achievable rate region for distributed sources) The a-chievable rate region for distributed sources is the set of all achievable rates.

254

Observation 10.10 The achievable rate region is convex.

Proof: For any two achievable rates (R1, . . . , Rm) and (R1, . . . , Rm), we can findtwo sequences of codes to achieve them. Therefore, by properly randomizationamong these two sequences of codes, we can achieve the rates

λ(R1, . . . , Rm) + (1− λ)(R1, . . . , Rm)

for any λ ∈ [0, 1]. 2

Theorem 10.11 (Slepian-Wolf) For distributed sources consisting of two ran-dom variables X1 and X2, the achievable region is

R1 ≥ H(X1|X2)

R2 ≥ H(X2|X1)

R1 +R2 ≥ H(X1, X2).

Proof:

1. Achievability Part: We need to show that for any (R1, R2) satisfying theconstraint, a sequence of code pairs for X1 and X2 with asymptotically zeroerror probability exists.

Step 1: Random coding. For each source sequence xn1 , randomly assign it anindex in 1, 2, . . . , 2nR1 according to uniform distribution. This index isthe encoding output of f(xn1 ). Let Ui = xn ∈ X n

1 : f1(xn) = i.Similarly, form f2(·) by randomly assigning indexes. Let Vi = xn ∈X n

2 : f1(xn) = i.Upon receipt of two indexes (i, j) from the two encoders, the decoder g(·)is defined by

g(i, j) =

(xn1 , x

n2 ) if |(Ui × Vj) ∩ Fn(δ;Xn

1 , Xn2 )| = 1,

and (xn1 , xn2 ) ∈ (Ui × Vj) ∩ Fn(δ;Xn

1 , Xn2 );

arbitrary otherwise.,

where Fn(δ;Xn1 , X

n2 ) is the weakly δ-joint typical set of Xn

1 and Xn2 .

Step 2: Error probability. The error happens when the source sequence pairs(xn1 , x

n2 )

• (Case E0) (xn1 , xn2 ) 6∈ Fn(δ;Xn

1 , Xn2 );

• (Case E1) (∃ xn1 6= xn1 ) f1(xn1 ) = f1(xn1 ) and (xn1 , xn2 ) ∈ Fn(δ;Xn

1 , Xn2 );

255

• (Case E2) (∃ xn2 6= xn2 ) f2(xn2 ) = f2(xn2 ) and (xn1 , xn2 ) ∈ Fn(δ;Xn

1 , Xn2 );

• (Case E3) (∃ (xn1 , xn2 ) 6= (xn1 , x

n2 )) f1(xn1 ) = f2(xn2 ), f2(xn2 ) = f2(xn2 )

and(xn1 , x

n2 ) ∈ Fn(δ;Xn

1 , Xn2 ).

Hence,

Pe(n) ≤ PrE0 ∪ E1 ∪ E2 ∪ E3≤ Pr(E0) + Pr(E1) + Pr(E2) + Pr(E3)

Hence,

E[Pe(n)] ≤ E[Pr(E0)] + E[Pr(E1)] + E[Pr(E2)] + E[Pr(E3)],

where the expectation is taken over the uniformly random coding scheme.It remains to calculate the error for each event.

• (Case E0) By AEP, E[Pr(E0)]→ 0.

256

• (Case E1)

E[Pr(E1)]

=∑xn1∈Xn1

∑xn2∈Xn2

PXn1X

n2(xn1 , x

n2 )

xn1∈Xn1 : xn1 6=xn1 ,(xn1 ,x

n2 )∈Fn(δ;Xn

1 ,Xn2 )

Pr(f1(xn1 ) = f1(xn1 ))

=∑xn1∈Xn1

∑xn2∈Xn2

PXn1X

n2(xn1 , x

n2 )

xn1∈Xn1 : xn1 6=xn1(xn1 ,x

n2 )∈Fn(δ;Xn

1 ,Xn2 )

2−nR1

(By uniformly random coding scheme.)

≤∑xn1∈Xn1

∑xn2∈Xn2

PXn1X

n2(xn1 , x

n2 )

∑xn1∈Xn1 : (xn1 ,x

n2 )∈Fn(δ;Xn

1 ,Xn2 )

2−nR1

≤∑xn1∈Xn1

∑xn2∈Xn2

PXn1X

n2(xn1 , x

n2 )

∑xn1∈Xn1 :

∣∣∣− 1n

log2 PXn1 |Xn2

(xn1 |xn2 )−H(X1|X2)∣∣∣<2δ

2−nR1

=∑xn1∈Xn1

∑xn2∈Xn2

PXn1X

n2(xn1 , x

n2 )× 2−nR1 ×∣∣∣∣xn1 ∈ X n

1 :

∣∣∣∣− 1

nlog2 PXn

1 |Xn2(xn1 |xn2 )−H(X1|X2)

∣∣∣∣ < 2δ

∣∣∣∣≤

∑xn1∈Xn1

∑xn2∈Xn2

PXn1X

n2(xn1 , x

n2 )2n[H(X1|X2)+2δ]2−nR1

= 2−n[R1−H(X1|X2)−2δ],

Hence, we need R1 > H(X1|X2) to make E[Pr(E1)]→ 0.

• (Case E2) Same as Case E1. we need R2 > H(X2|X1) to makeE[Pr(E2)]→ 0.

• (Case E3) By replacing Prf1(xn1 ) = f1(xn) by

Prf1(xn1 ) = f1(xn), f2(xn2 ) = f2(xn2 )

and using the fact that the number of elements in Fn(δ;Xn1 , X

n2 ) is

less than 2n(H(X1,X2)+δ) in the proof of Case E1, we have R1 + R2 >H(X1, X2) implies E[Pr(E3)]→ 0.

2. Converse Part: We have to prove that if a sequence of code pairs for X1 andX2 has asymptotically zero error probability, then its rate pair (R1, R2) satisfiesthe constraint.

257

Step 1: R1 +R2 ≥ H(X1, X2). Apparently, to losslessly compress sources

(X1, X2)

requires at least H(X1, X2) bits.

Step 2: R1 ≥ H(X1|X2). At the extreme case ofR2 > log2 |X2| (no compressionat all for X2), we can follow the proof of lossless data compression theoremto show that R1 ≥ H(X1|X2).

Step 3: R2 ≥ H(X2|X1). Similar argument as in step 2 can be applied. 2

Corollary 10.12 Given sequences of several (correlated) discrete memorylesssources X1, . . . , Xm which are obtained from different terminals (and are to beencoded independently), the achievable code rate region satisfies∑

i∈I

Ri ≥ H(XI |XL−I),

for any index set I ⊂ L , 1, 2, . . . ,m, where XI represents (Xi1 , Xi2 , . . .) fori1, i2, . . . = I.

10.1.2 Partial decoding of the original sources

In the previous subsection, the receiver intends to fully reconstruct all the orig-inal information transmitted, X1, . . . , Xm. In some situations, the receiver mayonly want to reconstruct part of the original information, say Xi for i ∈ I ⊂1, . . . ,m or XI . Since it is in general assumed that X1, . . . , Xm are dependent,the remaining information, Xi for i 6∈ I, should be helpful in the re-constructionof XI . Accordingly, these remain information are usually named the side infor-mation for lossless data compression.

Due to the different intention of the receiver, the definitions of the decodermapping and error probability should be modified.

Definition 10.13 (reconstructed information and side information) LetL , 1, 2, . . . ,m and I is any proper subset of L. Denote XI as the sources Xi

for i ∈ I, and similar notation is applied to XL−I .

In the data compression with side-information, XI is the information needsto be re-constructed, and XL−I is the side-information.

Definition 10.14 (independent encoders among distributed sources) T-here are several sources X1, X2, . . . , Xm (may or may not be independent) which

258

are respectively obtained by m terminals. Before each terminal transmits itslocal source to the receiver, a block encoder fi with rate Ri and block length nis applied

fi : X ni → 1, 2, . . . , 2nRi.

It is assumed that there is no conspiracy among block encoders.

Definition 10.15 (global decoder for independently compressed sourc-es) A global decoder g(·) will recover the original sources after receiving all theindependently compressed sources, i.e.,

g : 1, . . . , 2R1 × · · · × 1, . . . , 2Rm → X nI .

Definition 10.16 (probability of error) The probability of error is definedas

Pe(n) , Prg(fI(XnI ) 6= (Xn

I ).

Definition 10.17 (achievable rates) A rates

(R1, . . . , Rm)

is said to be achievable if there exists a sequence of block codes such that

lim supn→∞

Pe(n) = 0.

Definition 10.18 (achievable rate region for distributed sources) Theachievable rate region for distributed sources is the set of all achievable rates.

Observation 10.19 The achievable rate region is convex.

Theorem 10.20 For distributed sources with two random variable X1 and X2,let X1 be the re-constructed information and X2 be the side information, theboundary function R1(R2) for the achievable region is

R1(R2) ≥ minZ : X1→X2→Z and I(X2;Z)≤R2

H(X1|Z)

The above result can be re-written as

R1 ≥ H(X1|Z) and R2 ≥ I(X2;Z),

for any X1 → X2 → Z. Actually, Z can be viewed as the coding outputs ofX2, received by the decoder, and is used by the receiver as a side information toreconstruct X1. Hence, I(X2;Z) is the transmission rate from sender X2 to thereceiver. For all f2(X2) = Z that has the same transmission rate I(X2;Z), theone that minimize H(X1|Z) will yield the minimum compression rate for X1.

259

10.2 Distributed detection

Instead of re-construction of the original information, the decoder of a multiplesources system may only want to classify the sources into one of finitely manycategories. This problem is usually named distributed detection.

A distributed or decentralized detection system consists of a number of ob-servers (or data collecting units) and one or more data fusion centers. Eachobserver is coupled with a local data processor and communicates with the datafusion center through network links. The fusion center combines all compressedinformation received from the observers and attempts to classify the source ofthe observations into one of finitely many categories.

Definition 10.21 (distributed system Sn) A distributed detection systemSn, as depicted in Fig. 10.3, consists of n geographically dispersed sensors, noise-less one-way communication links, and a fusion center. Each sensor makes anobservation (denoted by Yi) of a random source, quantizes Yi into an m-ary mes-sage Ui = gi(Yi), and then transmits Ui to the fusion center. Upon receipt of(U1, . . . , Un), the fusion center makes a global decision D (U1, . . . , Un) about thenature of the random source.

-

-

-

-

-

-

- Fusion center - D(U1, . . . , Un)

gn

g2

g1

···

Yn

Y2

Y1

Un

U2

U1

Figure 10.3: Distributed detection in Sn.

The optimal design of Sn entails choosing quantizers g1, . . . , gn and a globaldecision rule D so as to optimize a given performance index. In this section,we consider binary hypothesis testing under the (classical) Neyman-Pearson andBayesian formulations. The first formulation dictates minimization of the type IIerror probability subject to an upper bound on the type I error probability; whilethe second stipulates minimization of the Bayes error probability, computedaccording to the prior probabilities of the two hypotheses.

260

The joint optimization of entities g1, . . . , gn and D in Sn is a hard compu-tational task, except in trivial cases (such as when the observations Yi lie ina set of size no greater than m). The complexity of the problem can only bereduced by introducing additional statistical structure in the observations. Forexample, it has been shown that whenever Y1, . . . , Yn are independent given eachhypothesis, an optimal solution can be found in which g1, . . . , gn are threshold-type functions of the local likelihood ratio (possibly with some randomization forNeyman-Pearson testing). Still, we should note that optimization of g1, . . . , gnover the class of threshold-type likelihood-ratio quantizers is prohibitively com-plex when n is large.

Of equal importance are situations where the statistical model exhibits spatialsymmetry in the form of permutation invariance with respect to the sensors. Anatural question to ask in such cases is

Question: whether a symmetric optimal solution exists in which the quan-tizers gi are identical?

If so, then the optimal system design is considerably simplified. The answer is clearly negative for cases where the sensor observations are highly dependent; as an extreme example, take Y1 = . . . = Yn = Y with probability 1 under each hypothesis, and note that any two identical quantizers are then redundant (the second one conveys no additional information).

The general problem is as follows. System Sn is used for testing H0 : P versus H1 : Q, where P and Q are the one-dimensional marginals of the i.i.d. data Y1, . . . , Yn. As n tends to infinity, both the minimum type II error probability β∗n(α) (as a function of the type I error probability bound α) and the Bayes error probability γ∗n(π) (as a function of the prior probability π of H0) vanish at an exponential rate. It thus becomes legitimate to adopt a measure of asymptotic performance based on the error exponents

e∗NP(α) , lim_{n→∞} −(1/n) log β∗n(α)   and   e∗B(π) , lim_{n→∞} −(1/n) log γ∗n(π).

It was shown by Tsitsiklis [3] that, under certain assumptions on the hypotheses P and Q, it is possible to achieve the same error exponents using identical quantizers. Thus if βn(α), γn(π), eNP(α) and eB(π) are the counterparts of β∗n(α), γ∗n(π), e∗NP(α) and e∗B(π) under the constraint that the quantizers g1, . . . , gn are identical, then

(∀α ∈ (0, 1)) eNP(α) = e∗NP(α)

and

(∀π ∈ (0, 1)) eB(π) = e∗B(π) .


(Of course, for all n, βn(α) ≥ β∗n(α) and γn(π) ≥ γ∗n(π).) This result provides some justification for restricting attention to identical quantizers when designing a system consisting of a large number of sensors.

Here we will focus on two issues. The first issue is the exact asymptotics of the minimum error probabilities achieved by the absolutely optimal and best identical-quantizer systems. Note that equality in the error exponents of γ∗n(π) and γn(π) does not in itself guarantee that for any given n, the values of γ∗n(π) and γn(π) are in any sense close. In particular, the ratio γ∗n(π)/γn(π) may vanish at a subexponential rate, and thus the best identical-quantizer system may be vastly inferior to the absolutely optimal system. (The same argument can be made for β∗n(α)/βn(α).)

From numerical simulations of Bayes testing in Sn, the ratio γ∗n(π)/γn(π) is (apparently) bounded from below by a positive constant which is, in many cases, reasonably close to unity (but not necessarily equal to one; cf. Example 10.32). This simulation result is substantiated by using large deviations techniques to prove that γ∗n(π)/γn(π) is, indeed, always bounded from below (it is, of course, upper bounded by unity). For Neyman-Pearson testing, an additional regularity condition is required in order for the ratio β∗n(α)/βn(α) to be lower-bounded in n. In either case, the optimal system essentially consists of “almost identical” quantizers, and is thus only marginally different from the best identical-quantizer system.

Definition 10.22 (observational model) Each sensor observation Y = Yi takes values in the measurable space (Y, B). The distribution of Y under the null (H0) and alternative (H1) hypotheses is denoted by P and Q, respectively.

Assumption 10.23 P and Q are mutually absolutely continuous, i.e., P ≡ Q.

Under Assumption 10.23, the (pre-quantization) log-likelihood ratio

X(y) , log (dP/dQ)(y)

is well-defined for y ∈ Y and is a.s. finite. (Since P ≡ Q, “almost surely” and “almost everywhere” are understood as under both P and Q.) In Sn, the variable X(Yi) will also be denoted as Xi.

Assumption 10.24 Every measurable m-ary partition of Y contains an atom over which X = log(dP/dQ) is not almost everywhere constant.

The purpose of this assumption is to rule out trivial quantization, i.e., the possibility of quantizing the observations into m levels without losing any information about the log-likelihood ratio.


Definition 10.25 (divergence) The (Kullback-Leibler, informational) divergence, or relative entropy, of P relative to Q is defined by

D(P‖Q) , EP[X] = ∫ log[(dP/dQ)(y)] dP(y).
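
As a minimal computational companion to Definition 10.25, the sketch below evaluates D(P‖Q) for finite pmf's; the helper name kl_divergence is mine, and the sample pair is the ternary (P, Q) that appears later in Example 10.32.

import numpy as np

def kl_divergence(P, Q):
    """D(P||Q) in nats for finite pmf's (P absolutely continuous w.r.t. Q)."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

# The ternary pair used later in Example 10.32:
print(kl_divergence([1/12, 1/4, 2/3], [1/3, 1/3, 1/3]))   # about 0.2747 nats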

Lemma 10.26 On the convex domain consisting of distribution pairs (P,Q) with the property P ≡ Q, the functional D(P‖Q) is nonnegative and jointly convex in the pair (P,Q).

Lemma 10.27 (Neyman-Pearson type II error exponent at fixed test level) The optimal Neyman-Pearson error exponent in testing P versus Q at any level α ∈ (0, 1) based on the i.i.d. observations Y1, . . . , Yn is D(P‖Q).

Definition 10.28 (moment generating function of the log-likelihood ratio) Ψ(θ) is the moment generating function of X under Q:

Ψ(θ) , EQ[exp{θX}] = ∫ [(dP/dQ)(y)]^θ dQ(y).

Lemma 10.29 (concavity of Ψ(θ))

1. For fixed θ ∈ [0, 1], Ψ(θ) is a finite-valued concave functional of the pair (P,Q) with the property P ≡ Q.

2. For fixed (P,Q) with P ≡ Q, Ψ(θ) is finite and convex in θ ∈ [0, 1].

This last property, together with the fact that Ψ(0) = Ψ(1) = 1, guarantees that Ψ(θ) has a minimum value which is less than or equal to unity, achieved by some θ∗ ∈ (0, 1).

Definition 10.30 (Chernoff exponent) We define the Chernoff exponent

ρ(P,Q) , − log Ψ(θ∗) = − log [ min_{θ∈(0,1)} Ψ(θ) ].

Lemma 10.31 The Chernoff exponent coincides with the Bayes error exponent.

Example 10.32 (counterexample to γ∗n(π)/γn(π) → 1) Consider a ternary observation space Y = {a1, a2, a3} with binary quantization. The two hypotheses are assumed equally likely, with


y            a1      a2      a3
P(y)         1/12    1/4     2/3
Q(y)         1/3     1/3     1/3
(dP/dQ)(y)   1/4     3/4     2

There are only two nontrivial deterministic LRQ's: g, which partitions Y into {a1} and {a2, a3}, and ḡ, which partitions Y into {a1, a2} and {a3}. The corresponding output distributions and log-likelihood ratios Xτ(·) and Xτ̄(·) are given by

u         1          2
Pτ(u)     1/12       11/12
Qτ(u)     1/3        2/3
Xτ(u)     −log 4     log(11/8)

and

u         1          2
Pτ̄(u)     1/3        2/3
Qτ̄(u)     2/3        1/3
Xτ̄(u)     −log 2     log 2          (10.2.1)

The corresponding Chernoff exponents satisfy

ρ(Pτ̄, Qτ̄) = (1/2) log(9/8) = 0.0589 > 0.0534 = ρ(Pτ, Qτ),

and thus for n sufficiently large, the best identical-quantizer system Sn employs ḡ on all n sensors (the other choice yields a suboptimal error exponent and thus eventually a higher value of γn(π)).
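
As a quick numerical check of the two exponents quoted above, the following sketch minimizes Ψ(θ) on a grid over (0, 1) for each of the two binary LRQ's; the function and variable names are mine.

import numpy as np

def chernoff_exponent(P, Q, grid=20001):
    """rho(P,Q) = -log min_{0<theta<1} Psi(theta), with
    Psi(theta) = E_Q[exp(theta X)] = sum_u P(u)^theta Q(u)^(1-theta)."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    theta = np.linspace(0.0, 1.0, grid)[1:-1]          # stay inside (0, 1)
    psi = (P[None, :]**theta[:, None] * Q[None, :]**(1 - theta[:, None])).sum(axis=1)
    return float(-np.log(psi.min()))

# Output pmf's of the two binary LRQ's of Example 10.32
print(chernoff_exponent([1/12, 11/12], [1/3, 2/3]))    # g     : about 0.0534
print(chernoff_exponent([1/3, 2/3],    [2/3, 1/3]))    # g-bar : about 0.0589 = (1/2) log(9/8)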

We will now show by contradiction that if S∗n is an absolutely optimal system consisting of deterministic LRQ's, then for all even values of n, at least one of the quantizers in S∗n must be g.

Assume the contrary, i.e., that S∗n is such that gi = ḡ for all i ≤ n. Now consider the problem of optimizing the quantizer gn in Sn subject to the constraint that each of the remaining quantizers g1, . . . , gn−1 equals ḡ. It is known that either gn = g or gn = ḡ is optimal. Our assumption about S∗n then implies that gn = ḡ is, in fact, optimal.

To see why this cannot be so if n is even, consider the Bayes error probability for this problem. Writing u_1^n for (u1, . . . , un), we have

γn(1/2) = (1/2) Σ_{u_1^n ∈ {1,2}^n} [ Pτ̄(u_1^{n−1}) Pgn(un) ] ∧ [ Qτ̄(u_1^{n−1}) Qgn(un) ]

        = (1/2) Σ_{u_1^{n−1} ∈ {1,2}^{n−1}} Σ_{un ∈ {1,2}} [ Pτ̄(u_1^{n−1}) Pgn(un) ] ∧ [ Qτ̄(u_1^{n−1}) Qgn(un) ]

        = Σ_{u_1^{n−1} ∈ {1,2}^{n−1}} [ ( Pτ̄(u_1^{n−1}) + Qτ̄(u_1^{n−1}) ) / 2 ] γ( Pτ̄(u_1^{n−1}) / ( Pτ̄(u_1^{n−1}) + Qτ̄(u_1^{n−1}) ) ),    (10.2.2)


where γ(·) represents the Bayes error probability function of the n-th sensor/quantizer pair (note that in this equation, the argument of γ(·) is just the posterior probability of H0 given u_1^{n−1}).

We note from (10.2.1) that the log-likelihood ratio

log[ Pτ̄(u_1^{n−1}) / Qτ̄(u_1^{n−1}) ] = Xτ̄(u1) + · · · + Xτ̄(un−1)

can also be expressed as (2 l_{n−1}(u) − n + 1)(log 2), where l_{n−1}(u) is the number of 2's in u_1^{n−1}. Now l_{n−1}(U) is a binomial random variable under either hypothesis, and we can rewrite (10.2.2) as

γn(1/2) = Σ_{l=0}^{n−1} (n−1 choose l) [ (2^l + 2^{n−l−1}) / (2 · 3^{n−1}) ] γ( 2^{2l−n+1} / (2^{2l−n+1} + 1) ).    (10.2.3)

The two candidates for γ are γ and γ̄ (the Bayes error functions of g and ḡ), given by

γ(π) = [ (1/12)π ∧ (1/3)(1 − π) ] + [ (11/12)π ∧ (2/3)(1 − π) ],
γ̄(π) = [ (1/3)π ∧ (2/3)(1 − π) ] + [ (2/3)π ∧ (1/3)(1 − π) ],

and shown in Figure 10.4. Note that γ(π) = γ̄(π) for π ≤ 1/3, π = 4/7 and π ≥ 4/5. Thus the critical values of l in (10.2.3) are those for which 2^{2l−n+1}/(2^{2l−n+1} + 1) lies in the union of (1/3, 4/7) and (4/7, 4/5).

If n is odd, then the range of 2l − n + 1 in (10.2.3) comprises the even integers between −n + 1 and n − 1 inclusive. The only critical value of l is (n − 1)/2, for which the posterior probability of H0 is 1/2. Since γ(1/2) = 3/8 > 1/3 = γ̄(1/2), the optimal choice is ḡ.

If n is even, then 2l − n + 1 ranges over all odd integers between −n + 1 and n − 1 inclusive. Here the only critical value of l is n/2, which makes the posterior probability of H0 equal to 2/3. Since γ(2/3) = 5/18 < 1/3 = γ̄(2/3), g is optimal.

We thus obtain the required contradiction, together with the inequality

γ2k(1/2) − γ∗2k(1/2) ≥ (2k−1 choose k) [ (2^k + 2^{k−1}) / (2 · 3^{2k−1}) ] [ γ̄(2/3) − γ(2/3) ] = (1/8) (2k−1 choose k) (2/9)^k.

Stirling’s formula gives (4πk)−1/2 exp−k log 4(1 + o(1)) for the binomial coef-ficient, and thus

lim inf_{k→∞} ( γ2k(1/2) − γ∗2k(1/2) ) √(πk) (9/8)^k ≥ 1/16.    (10.2.4)


Figure 10.4: Bayes error probabilities γ(π) and γ̄(π) associated with g and ḡ (the two curves coincide for π ≤ 1/3, at π = 4/7, and for π ≥ 4/5).

Since (9/8)^k = exp{2k ρ(Pτ̄, Qτ̄)} = exp{2kρ2}, we immediately deduce from (10.2.4) that

lim sup_{k→∞} γ∗2k(1/2) / γ2k(1/2) < 1.

A finer approximation to

γ2k(1/2) = Q{ Xτ̄(U1) + · · · + Xτ̄(U2k) > 0 } + (1/2) Q{ Xτ̄(U1) + · · · + Xτ̄(U2k) = 0 },

using [2, Theorem 1] and Stirling's formula for the first and second summands, respectively, yields

lim_{k→∞} γ2k(1/2) √(πk) (9/8)^k = 3/2.

From (10.2.4) we then obtain

lim sup_{k→∞} γ∗2k(1/2) / γ2k(1/2) ≤ 23/24.
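
The even/odd dichotomy above can be checked numerically by evaluating (10.2.3) with γ taken to be either γ or γ̄; the sketch below does this for n = 2, . . . , 8 and confirms that swapping the n-th sensor from ḡ to g strictly reduces the Bayes error exactly when n is even. The function names are mine.

from math import comb

# Bayes error functions of the two candidate quantizers, as functions of the
# posterior probability pi of H0 (these match the displayed formulas above).
def gamma_g(pi):
    return min(pi/12, (1 - pi)/3) + min(11*pi/12, 2*(1 - pi)/3)

def gamma_gbar(pi):
    return min(pi/3, 2*(1 - pi)/3) + min(2*pi/3, (1 - pi)/3)

def bayes_error(n, gamma_last):
    """Evaluate (10.2.3): sensors 1..n-1 use g-bar, sensor n uses the quantizer
    whose Bayes error function is gamma_last."""
    total = 0.0
    for l in range(n):                                  # l = number of 2's in u_1^{n-1}
        weight = comb(n - 1, l) * (2**l + 2**(n - 1 - l)) / (2 * 3**(n - 1))
        posterior = 2.0**(2*l - n + 1) / (2.0**(2*l - n + 1) + 1.0)
        total += weight * gamma_last(posterior)
    return total

for n in range(2, 9):
    all_gbar = bayes_error(n, gamma_gbar)               # best identical-quantizer system
    one_g    = bayes_error(n, gamma_g)                  # n-th sensor swapped to g
    print(n, round(all_gbar, 6), round(one_g, 6), one_g < all_gbar)   # True iff n is even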


10.2.1 Neyman-Pearson testing in parallel distributed detection

The equality (and finiteness) of e∗NP(α) and eNP(α) was established in Theorem 2 of [3] under the assumption that EP[X^2] < ∞. Actually, the proof only utilized the following weaker condition (with δ = 0) on the post-quantization log-likelihood ratio.

Assumption 10.33 (boundedness assumption) There exists δ ≥ 0 for which

sup_{g∈Gm} EP[|Xg|^{2+δ}] < ∞,    (10.2.5)

where Gm is the set of all possible m-ary quantizers.

Let us briefly examine the above assumption. For an arbitrary randomized quantizer g, let pu = Pg(u) and qu = Qg(u). Then

EP[|Xg|^{2+δ}] = Σ_{u=1}^{m} pu |log(pu/qu)|^{2+δ}.

Our first observation is that the negative part Xg^− of Xg has bounded (2+δ)-th moment under P, and thus Assumption 10.33 is equivalent to

sup_{g} EP[(Xg^+)^{2+δ}] < ∞.

Indeed, we have

EP[|Xg^−|^{2+δ}] = Σ_{u : pu<qu} pu |log(pu/qu)|^{2+δ} = Σ_{u : pu<qu} pu |log(qu/pu)|^{2+δ},

and using the inequality |log x^{1/(2+δ)}| ≤ x^{1/(2+δ)} − 1 for x ≥ 1, we obtain

EP[(Xg^−)^{2+δ}] ≤ (2+δ)^{2+δ} Σ_{u : pu<qu} pu [ (qu/pu)^{1/(2+δ)} − 1 ]^{2+δ} ≤ (2+δ)^{2+δ} Σ_{u=1}^{m} qu ≤ (2+δ)^{2+δ}.

Theorem 10.34 Assumption 10.33 is equivalent to the condition

sup_{τ∈T2} EP[|Xτ|^{2+δ}] < ∞,    (10.2.6)

where T2 is the set of all possible binary log-likelihood ratio quantizers.


Proof: Assumption 10.33 clearly implies (10.2.6). To prove the converse, let τt be the LRP in T2 defined by

τt , ( (−∞, t], (t,∞) ),    (10.2.7)

and let p(t) = P{X > t}, q(t) = Q{X > t}. By (10.2.6), there exists b < ∞ such that for all t ∈ ℝ,

EP[|Xτt|^{2+δ}] = (1 − p(t)) |log[(1 − p(t))/(1 − q(t))]|^{2+δ} + p(t) |log[p(t)/q(t)]|^{2+δ} ≤ b.    (10.2.8)

Consider now an arbitrary deterministic m-ary quantizer g with output pmf's (p1, . . . , pm) and (q1, . . . , qm), and let the maximum of pu|log(pu/qu)|^{2+δ} subject to pu ≥ qu be achieved at u = u∗. Then, from the first observation in this section, it follows that

EP[|Xg|^{2+δ}] ≤ m p∗ |log(p∗/q∗)|^{2+δ} + (2+δ)^{2+δ}.

To see that p∗|log(p∗/q∗)|^{2+δ} is bounded from above if (10.2.8) holds, note that for a given p∗ = P{Y ∈ C∗}, the value of q∗ = Q{Y ∈ C∗} can be bounded from below using the Neyman-Pearson lemma. In particular, there exist t ∈ ℝ and µ ∈ [0, 1] such that

p∗ = µ P{X = t} + p(t),
q∗ ≥ µ Q{X = t} + q(t).

If P{X = t} = 0, then

p∗ |log(p∗/q∗)|^{2+δ} ≤ p(t) |log(p(t)/q(t))|^{2+δ},

where by virtue of (10.2.8), the r.h.s. is upper-bounded by b. Otherwise, p∗|log(p∗/q∗)|^{2+δ} is upper-bounded by a quantity of the form

( p(t) + µ P{X = t} ) |log[ (p(t) + µ P{X = t}) / (q(t) + µ Q{X = t}) ]|^{2+δ},

for some µ ∈ (0, 1]. For µ = 1, this can again be bounded using (10.2.6): take τ′t consisting of the intervals (−∞, t) and [t,∞), or use τt and a simple continuity argument. Then the log-sum inequality can be applied together with the concavity of f(t) = t^{1/(2+δ)} to show that the same bound b is valid for µ ∈ (0, 1).

If g is a randomized quantizer, then the probabilities p∗ and q∗ defined previously will be expressible as Σ_k λk p^{(k)} and Σ_k λk q^{(k)}. Here k ranges over a finite index set, and each pair (p^{(k)}, q^{(k)}) is derived from a deterministic quantizer. Again, using the log-sum inequality and the concavity of f(t) = t^{1/(2+δ)}, one can obtain p∗|log(p∗/q∗)|^{2+δ} ≤ b. □


Observation 10.35 The boundedness assumption is equivalent to

lim sup_{t→∞} EP[|Xτt|^{2+δ}] < ∞,    (10.2.9)

where τt is defined in (10.2.7).

Theorem 10.36 If Assumption 10.33 holds, then for all α ∈ (0, 1),

−(1/n) log β∗n(α) = Dm + O(n^{−1/2})   and   −(1/n) log βn(α) = Dm + O(n^{−1/2}). □

As an immediate corollary, we have e∗NP(α) = eNP(α) = Dm, which is Theorem 2 in [3]. The stated result sharpens this equality by demonstrating that the finite-sample error exponents of the absolutely optimal and best identical-quantizer systems converge at a rate O(n^{−1/2}). It also motivates the following observations.

The first observation concerns the accuracy, or tightness, of the O(n^{−1/2}) convergence factor. Although the upper and lower bounds on β∗n(α) in Theorem 10.36 are based on a suboptimal null acceptance region, examples in the context of centralized testing show that in general, the O(n^{−1/2}) rate cannot be improved on. At the same time, it is rather unlikely that the ratio β∗n(α)/βn(α) decays as fast as exp{−c′√n}, i.e., there probably exists a lower bound which is tighter than what is implied by Theorem 10.36. As we shall see later in Theorem 10.40, under some regularity assumptions on the mean and variance of the post-quantization log-likelihood ratio computed w.r.t. the null distribution, the ratio β∗n(α)/βn(α) is indeed bounded from below.

The second observation is about the composition of an optimal quantizer set (g1, . . . , gn) for Sn. The proof of Theorem 10.36 also yields an upper bound on the number of quantizers that are at least ε-distant from the deterministic LRQ that achieves eNP(α). Specifically, if Kn(ε) is the number of indices i for which

D(Pgi‖Qgi) < Dm − ε

(where ε > 0), then

Kn(ε)/n = O(n^{−1/2}).

Thus in an optimal system, most quantizers will be “essentially identical” to the one that achieves eNP(α). (This conclusion can be significantly strengthened if an additional assumption is made in the case of Neyman-Pearson testing [1].)


In the remainder of this section we discuss the asymptotics of Neyman-Pearson testing in situations where Assumption 10.33 does not hold. By the remark following the proof of Theorem 10.34, this condition is violated if and only if

lim sup_{t→∞} EP[Xτt^2] = ∞,    (10.2.10)

where τt is the binary LRP defined by

τt = ( (−∞, t], (t,∞) ).

We now distinguish between three cases.

Case A. lim sup_{t→∞} EP[Xτt] = ∞.

Case B. 0 < lim sup_{t→∞} EP[Xτt] < ∞.

Case C. lim sup_{t→∞} EP[Xτt] = 0 and lim sup_{t→∞} EP[Xτt^2] = ∞.

Example Let the observation space be the unit interval (0, 1] with its Borel field. For a > 0, define the distributions P and Q by

P{Y ≤ y} = y   and   Q{Y ≤ y} = exp{ ((a+1)/a) (1 − 1/y^a) }.

The pdf of Q is strictly increasing in y, and thus the likelihood ratio (dP/dQ)(y) is strictly decreasing in y. Hence the event {X > t} can also be written as {Y < yt}, where yt → 0 as t → ∞. Using this equivalence, we can examine the limiting behavior of EP[Xτt] and EP[Xτt^2] to obtain:

a. a > 1: lim_{t→∞} EP[Xτt] = lim_{t→∞} EP[Xτt^2] = ∞ (Case A);

b. a = 1: lim_{t→∞} EP[Xτt] = 2, lim_{t→∞} EP[Xτt^2] = ∞ (Case B);

c. 1/2 < a < 1: lim_{t→∞} EP[Xτt] = 0, lim_{t→∞} EP[Xτt^2] = ∞ (Case C);

d. a ≤ 1/2: lim_{t→∞} EP[Xτt^2] < ∞ (Assumption 10.33 is satisfied).

□
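
The four regimes of this example can be checked numerically: since {X > t} = {Y < yt}, the quantities EP[Xτt] and EP[Xτt^2] can be computed directly as functions of yt. The sketch below does this for a few values of yt; the variable names are mine, and the value of a can be changed to visit Cases A, B, C or the bounded case.

import numpy as np

a = 1.0        # a > 1: Case A;  a = 1: Case B;  1/2 < a < 1: Case C;  a <= 1/2: bounded

# Parameterize the threshold by y_t, using {X > t} = {Y < y_t} with y_t -> 0.
for y in [1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6]:
    p = y                                            # p(t) = P{Y < y_t}
    log_q = (a + 1) / a * (1.0 - y**(-a))            # log Q{Y < y_t}
    q = np.exp(log_q)                                # underflows harmlessly to 0 for tiny y
    r0 = np.log((1 - p) / (1 - q))                   # value of X_{tau_t} on {X <= t}
    r1 = np.log(p) - log_q                           # value of X_{tau_t} on {X > t}
    E1 = (1 - p) * r0 + p * r1                       # E_P[X_{tau_t}]
    E2 = (1 - p) * r0**2 + p * r1**2                 # E_P[X_{tau_t}^2]
    print(f"y_t = {y:.0e}:  E_P[X] = {E1:10.4f}   E_P[X^2] = {E2:14.4f}")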

In Case A, the error exponents e∗NP(α) and eNP(α) are both infinite. This result is neither difficult to prove nor surprising, considering that EP[Xg] = D(Pg‖Qg) can be made arbitrarily large by choice of the quantizer g.


Theorem 10.37 (result for Case A) If lim sup_{t→∞} EP[Xτt] = ∞, then for all m ≥ 2 and α ∈ (0, 1),

e∗NP(α) = eNP(α) = ∞. □

We now turn to Case B, which is more interesting. Using the notation p(t) = P{X > t} and q(t) = Q{X > t} introduced earlier, we have

EP[Xτt] = (1 − p(t)) log[(1 − p(t))/(1 − q(t))] + p(t) log[p(t)/q(t)].

The first summand on the r.h.s. clearly tends to zero (as t → ∞, which is understood throughout), hence the lim sup of the second summand p(t) log[p(t)/q(t)] is greater than zero. Since p(t) tends to zero, both log[p(t)/q(t)] and p(t) log^2[p(t)/q(t)] have lim sup equal to infinity. Thus in particular, (10.2.10) always holds in Case B.

A separate argument (which we omit) reveals that the centralized error exponent D(P‖Q) = EP[X] is also infinite in Case B. Yet unlike Case A, the decentralized error exponent e∗NP(α) obtained here is not infinite. Quite surprisingly, if this exponent exists, then it must depend on the test level α. This is stated in the following theorem.

Theorem 10.38 (result for Case B) Consider hypothesis testing with m-ary quantization, where m ≥ 2. If

0 < lim sup_{t→∞} EP[Xτt] < ∞,    (10.2.11)

then there exist:

1. an increasing sequence of integers {nk, k ∈ ℕ} and a function L : (0, 1) → (0,∞) which is monotonically increasing to infinity, such that

   lim inf_{k→∞} −(1/nk) log βnk(α) ≥ L(α) ∨ Dm;

2. a function M : (0, 1) → (0,∞) which is monotonically increasing to infinity and is such that

   lim sup_{n→∞} −(1/n) log β∗n(α) ≤ M(α).


Proof:

(i) Lower bound. As was argued in the proof of Theorem 10.36, an error exponent equal to Dm can be achieved using identical quantizers; hence one part of the bound follows immediately. In what follows, we construct a sequence of identical-quantizer detection schemes with finite-sample error exponent almost always exceeding L(α), where L(α) increases to infinity as α tends to unity.

Let ν , lim sup_{t→∞} EP[Xτt], so that 0 < ν < ∞ by (10.2.11). By subsequence selection, we obtain a sequence of LRP's τtk with the property that νk , EP[Xτk] converges to ν as k → ∞. (We eliminate t from all subscripts to simplify the notation.) Letting pk = p(tk), qk = q(tk), εk = log[(1 − pk)/(1 − qk)] and ζk = log(pk/qk), we can write

νk = (1 − pk)εk + pkζk,

where by the discussion preceding this theorem, pk and εk tend to zero and ζk increases to infinity.

Fix ω > 0 and assume w.l.o.g. that ζk > ζk−1 + (1/ω). Consider a system consisting of nk sensors, where nk = ⌊ωζk⌋, and let each sensor employ the same binary LRQ with LRP τk. This choice is clearly suboptimal (since m ≥ 2), but it suffices for our purposes.

Define the set Ak ⊂ {1, 2}^{nk} by

Ak = { (u1, . . . , unk) : at least one uj equals 2 }.

Recalling that Uj = 2 if, and only if, the observed log-likelihood ratio is larger than tk, we have

Pτ(Ak) = 1 − (1 − pk)^{nk},

and thus

lim_{k→∞} (1 − Pτ(Ak)) = lim_{k→∞} (1 − pk)^{nk} = lim_{k→∞} (1 − pk)^{ωζk} = [ lim_{k→∞} (1 − pk)^{ζk} ]^ω = exp{−νω},

where the last equality follows from the fact that ζk → ∞ and pkζk → ν as k → ∞.

Thus given any η > 0, for all sufficiently large k the set Ak is admissible (albeit not necessarily optimal) as a null acceptance region for testing at level α = exp{−νω + η}. For this value of α, we have

βnk(α) ≤ Qτ(Ak) = 1 − (1 − qk)^{nk} = 1 − (1 − pk exp{−ζk})^{nk}
       = nk pk exp{−ζk} − [nk(nk − 1)/2!] pk^2 exp{−2ζk} + · · · − (−1)^{nk} pk^{nk} exp{−nkζk}
       ≤ nk pk exp{−ζk} + nk^2 pk^2 exp{−2ζk} + · · ·

Summing the geometric series, we obtain

βnk(α) ≤ nk pk exp{−ζk} / (1 − nk pk exp{−ζk}).

The r.h.s. denominator tends to unity because ζk → ∞ and nkpk → ων as k → ∞. Since ζk/nk → 1/ω, we conclude that

lim inf_{k→∞} −(1/nk) log βnk(α) ≥ 1/ω = ν / (log(1/α) + η).

As ω > 0 and η > 0 were chosen arbitrarily, it follows that

lim inf_{k→∞} −(1/nk) log βnk(α) ≥ ν / log(1/α)

for all α ∈ (0, 1). The lower bound in statement (i) of the theorem is obtained by taking L(α) , ν / log(1/α).

(ii) Upper bound. Consider an optimal detection scheme for Sn, with the same setup as in the proof of Theorem 10.36. Recall in particular that the fusion center employs a randomized test with log-likelihood threshold ηn and randomization constant µn.

For θ to be an upper bound on the error exponent of β∗n(α), it suffices that nθ be greater than ηn and such that the events { Σ_{i=1}^n Xgi ≤ nθ } and { Σ_{i=1}^n Xgi ≥ ηn } have significant overlap under Pg. Indeed, if θ > ηn/n is such that for all sufficiently large n,

µn Pg{ Σ_{i=1}^n Xgi = ηn } + Pg{ nθ ≥ Σ_{i=1}^n Xgi > ηn } ≥ ε > 0,    (10.2.12)

then β∗n(α) > ε exp{−nθ}, as required.

The threshold ηn is rather difficult to determine, so we use an indirect method for finding θ. We have

Pg{ Σ_{i=1}^n Xgi > nθ } ≤ Pg{ Σ_{i=1}^n |Xgi| > nθ } ≤ (1/θ) sup_{g∈Gm} EP[|Xg|],

where the last bound follows from the Markov inequality. We claim that the supremum in this relationship is finite. This is because the negative part of Xg has bounded expectation under P (see the discussion following Assumption 10.33), and the proof of Theorem 10.34 can be easily modified to show that ν′ , sup_{g∈Gm} EP[|Xg|] is finite iff ν = lim sup_{t→∞} EP[Xτt] is (which is our current hypothesis). Thus

Pg{ Σ_{i=1}^n Xgi > nθ } ≤ ν′/θ.

Now let ε > 0 and θ = ν′/(1 − α − ε), so that Pg{ Σ_{i=1}^n Xgi > nθ } ≤ 1 − α − ε.

The Neyman-Pearson lemma immediately yields nθ > ηn. Also, using

µn Pg{ Σ_{i=1}^n Xgi = ηn } + Pg{ Σ_{i=1}^n Xgi > ηn } = 1 − α

and a simple contradiction, we obtain (10.2.12). Thus the chosen value of θ is an upper bound on lim sup_n (−1/n) log β∗n(α). Since ε > 0 can be made arbitrarily small, we have that

M(α) , ν′/(1 − α)

is also an upper bound. □

From Theorem 10.38 we conclude that in Case B, the error exponent e∗NP(α) must lie between the bounds L(α) and M(α) whenever it exists as a limit. (Since ν′ ≥ Dm ≥ ν and 1 − α ≤ log(1/α), the inequality M(α) ≥ L(α) is indeed true.) These bounds are shown in Figure 10.5.

Theorem 10.39 (result for Case C) In Case C,

e∗NP(α) = eNP(α) = D(P‖Q).

Theorem 10.40 (result under the boundedness assumption) Let δ ≤ 1 satisfy (10.2.5). If α ≤ 1/2, or if α > 1/2 and the observation space Y is finite, then

β∗n(α)/βn(α) ≥ exp{ −c′(δ, α) n^{(1−δ)/2} }.

In particular, if (10.2.5) holds for δ ≥ 1, then the ratio β∗n(α)/βn(α) is bounded from below.

10.2.2 Bayes testing in parallel distributed detection systems

We now turn to the asymptotic study of optimal Bayes detection in Sn. The prior probabilities of H0 and H1 are denoted by π and 1 − π, respectively; the probability of error of the absolutely optimal system is denoted by γ∗n(π), and the probability of error of the best identical-quantizer system is denoted by γn(π).

Figure 10.5: Upper and lower bounds on e∗NP(α) in Case B, plotted against α ∈ (0, 1); the levels Dm and ν′ are marked on the vertical axis.

In our analysis, we will always assume that Sn employs deterministic m-ary LRQ's represented by LRP's τ1, . . . , τn. This is clearly sufficient because randomization is of no help in Bayes testing.

Theorem 10.41 In Bayes testing with m-ary quantization,

lim inf_{n→∞} γ∗n(π) / γn(π) > 0    (10.2.13)

for all π ∈ (0, 1).

10.3 Capacity region of multiple access channels

Definition 10.42 (discrete memoryless multiple access channel) A discrete memoryless multiple access channel consists of several senders X1, X2, . . . , Xm and one receiver Y, defined over the finite alphabets X1, X2, . . . , Xm and Y, respectively. Also given is the transition probability PY|X1,X2,...,Xm.


For simplicity, we will focus on the system with only two senders. The block code for this simple multiple access channel is defined below.

Definition 10.43 (block code for multiple access channels) A block code (n, M1, M2) for a multiple access channel has block length n, rates R1 = (1/n) log2 M1 and R2 = (1/n) log2 M2, and encoders

f1 : {1, . . . , M1} → X1^n   and   f2 : {1, . . . , M2} → X2^n.

Upon receipt of the channel output, the decoder is a mapping

g : Y^n → {1, . . . , M1} × {1, . . . , M2}.

Theorem 10.44 (capacity region of the memoryless multiple access channel) The capacity region of the memoryless multiple access channel is the closure of the convex hull of the union, over all product input distributions PX1PX2, of the sets

{ (R1, R2) ∈ (ℝ+ ∪ {0})^2 : R1 ≤ I(X1;Y|X2), R2 ≤ I(X2;Y|X1) and R1 + R2 ≤ I(X1, X2;Y) }.

Proof: Again, the achievability part is proved using random coding arguments and a typical-set decoder. The converse part is based on Fano's inequality. □
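
As a concrete (and purely illustrative) instance of Theorem 10.44, the sketch below evaluates the three mutual-information constraints for the noiseless binary adder MAC Y = X1 + X2 with independent, uniform binary inputs; this particular channel and input distribution are my choice, not part of the text.

import numpy as np

def H(p):
    """Entropy in bits of a pmf given as an array of any shape."""
    p = np.asarray(p, float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Joint pmf of (X1, X2, Y) for the binary adder MAC Y = X1 + X2,
# with X1, X2 independent and uniform on {0, 1}.
joint = np.zeros((2, 2, 3))
for x1 in (0, 1):
    for x2 in (0, 1):
        joint[x1, x2, x1 + x2] = 0.25

H_x1x2y = H(joint)
H_x1x2, H_x2y, H_x1y = H(joint.sum(axis=2)), H(joint.sum(axis=0)), H(joint.sum(axis=1))
H_x1, H_x2, H_y = H(joint.sum(axis=(1, 2))), H(joint.sum(axis=(0, 2))), H(joint.sum(axis=(0, 1)))

I1  = (H_x1x2 - H_x2) - (H_x1x2y - H_x2y)     # I(X1;Y|X2) = 1 bit
I2  = (H_x1x2 - H_x1) - (H_x1x2y - H_x1y)     # I(X2;Y|X1) = 1 bit
I12 = H_y + H_x1x2 - H_x1x2y                  # I(X1,X2;Y) = 1.5 bits
print(f"R1 <= {I1:.3f},  R2 <= {I2:.3f},  R1 + R2 <= {I12:.3f}")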

10.4 Degraded broadcast channel

Definition 10.45 (broadcast channel) A broadcast channel consists of one input alphabet X and two (or more) output alphabets Y1 and Y2. The noise is defined by the conditional probability PY1,Y2|X(y1, y2|x).

Example 10.46 Examples of broadcast channels are

• Cable Television (CATV) network;

• Lecturer in classroom;

• Code Division Multiple Access channels.


Definition 10.47 (degraded broadcast channel) A broadcast channel is said to be degraded if

PY1,Y2|X(y1, y2|x) = PY1|X(y1|x)PY2|Y1(y2|y1).

It can be verified that when X → Y1 → Y2 forms a Markov chain, in which case PY2|Y1,X(y2|y1, x) = PY2|Y1(y2|y1), a degraded broadcast channel results. This indicates that the “parallel” broadcast channel degrades to a “serial” broadcast channel, in which the channel output Y2 can obtain information about the channel input X only through the first channel output Y1.

Definition 10.48 (block code for broadcast channel) A block code for a broadcast channel consists of one encoder f(·) and two (or more) decoders gi(·):

f : {1, . . . , 2^{nR1}} × {1, . . . , 2^{nR2}} → X^n,
g1 : Y1^n → {1, . . . , 2^{nR1}},
g2 : Y2^n → {1, . . . , 2^{nR2}}.

Definition 10.49 (error probability) Let the message random variables be W1 ∈ {1, . . . , 2^{nR1}} and W2 ∈ {1, . . . , 2^{nR2}}. Then the probability of error is defined as

Pe , Pr{ W1 ≠ g1(Y1^n) or W2 ≠ g2(Y2^n) },

where Y1^n and Y2^n are the channel outputs resulting from the transmission of f(W1, W2).

Theorem 10.50 (capacity region for the degraded broadcast channel) The capacity region of the memoryless degraded broadcast channel is the convex hull of the closure of

∪_U { (R1, R2) : R1 ≤ I(X;Y1|U) and R2 ≤ I(U;Y2) },

where the union is taken over all U satisfying U → X → (Y1, Y2) with alphabet size |U| ≤ min{|X|, |Y1|, |Y2|}.

Note that U → X → (Y1, Y2) is equivalent to

PU,X,Y1,Y2(u, x, y1, y2) = PU(u)PX|U(x|u)PY1,Y2|X,U(y1, y2|x, u)

= PU(u)PX|U(x|u)PY1,Y2|X(y1, y2|x)

= PU(u)PX|U(x|u)PY1|X(y1|x)PY2|Y1(y2|y1).


Example 10.51 (capacity region for the degraded BSC) Suppose PY1|X and PY2|Y1 are BSCs with crossover probabilities ε1 and ε2, respectively. Then the capacity region can be parameterized through β as

R1 ≤ hb(β × ε1) − hb(ε1),
R2 ≤ 1 − hb(β × (ε1(1 − ε2) + (1 − ε1)ε2)),

where a × b , a(1 − b) + (1 − a)b denotes binary convolution, PX|U(0|1) = PX|U(1|0) = β, and U ∈ {0, 1}.
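
A minimal sketch of the boundary in Example 10.51, assuming the interpretation of × as binary convolution given above; the crossover probabilities ε1 = 0.1 and ε2 = 0.2 are illustrative values of my choosing.

import numpy as np

def hb(p):                                   # binary entropy function (bits)
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p*np.log2(p) - (1 - p)*np.log2(1 - p)

def conv(a, b):                              # binary convolution a x b
    return a*(1 - b) + (1 - a)*b

eps1, eps2 = 0.1, 0.2                        # illustrative crossover probabilities
for beta in np.linspace(0.0, 0.5, 6):
    R1 = hb(conv(beta, eps1)) - hb(eps1)
    R2 = 1 - hb(conv(beta, conv(eps1, eps2)))
    print(f"beta = {beta:.1f}:  R1 <= {R1:.3f},  R2 <= {R2:.3f}")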

Example 10.52 (capacity region for the degraded AWGN channel) The channel is modeled as

Y1 = X + N1   and   Y2 = Y1 + N2,

where the noise powers of N1 and N2 are σ1^2 and σ2^2, respectively. Then the capacity region under input power constraint S satisfies

R1 ≤ (1/2) log2( 1 + αS/σ1^2 ),
R2 ≤ (1/2) log2( 1 + (1 − α)S / (αS + σ1^2 + σ2^2) ),

for any α ∈ [0, 1].
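
The boundary of the region in Example 10.52 can be traced by sweeping α; the power and noise values below are illustrative assumptions.

import numpy as np

S, sigma1_sq, sigma2_sq = 10.0, 1.0, 4.0     # illustrative power and noise values
for alpha in np.linspace(0.0, 1.0, 6):
    R1 = 0.5 * np.log2(1 + alpha*S/sigma1_sq)
    R2 = 0.5 * np.log2(1 + (1 - alpha)*S/(alpha*S + sigma1_sq + sigma2_sq))
    print(f"alpha = {alpha:.1f}:  R1 <= {R1:.3f},  R2 <= {R2:.3f}")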

10.5 Gaussian multiple terminal channels

In the previous chapter, we dealt with the Gaussian channel with possibly several inputs, and obtained that the best power distribution over the inputs should follow the water-filling scheme. In that problem, the encoder is defined as

f : ℝ^m → ℝ,

which means that all the inputs are observed by the same terminal and hence can be utilized in a centralized fashion. However, for Gaussian multiple terminal channels, these inputs are observed in a distributed manner by different terminals, and hence need to be encoded separately. In other words, the encoders now become

f1 : ℝ → ℝ,  f2 : ℝ → ℝ,  . . . ,  fm : ℝ → ℝ.


So we now have m (independent) transmitters and one receiver in the system.

The system can be modeled as

Y = Σ_{i=1}^{m} Xi + N.

Theorem 10.53 (capacity region for the AWGN multiple access channel) Suppose each transmitter has (constant) power constraint Si, and let I denote a subset of {1, 2, . . . , m}. Then the capacity region is

{ (R1, . . . , Rm) : (∀ I)  Σ_{i∈I} Ri ≤ (1/2) log2( 1 + (Σ_{i∈I} Si)/σ^2 ) },

where σ^2 is the noise power of N.
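
A short sketch enumerating the 2^m − 1 constraints of Theorem 10.53 for illustrative powers Si and noise power σ^2 of my choosing.

from itertools import combinations
from math import log2

S = [2.0, 3.0, 5.0]                          # illustrative per-terminal powers S_i
sigma_sq = 1.0                               # noise power of N

m = len(S)
for r in range(1, m + 1):
    for I in combinations(range(m), r):
        bound = 0.5 * log2(1 + sum(S[i] for i in I) / sigma_sq)
        print(f"sum_(i in {I}) R_i <= {bound:.3f}")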

Note that Si/σ^2 is the SNR of terminal i. So if one can afford an unlimited number of terminals, each with fixed transmitting power S, the channel capacity can be made as large as desired. On the other hand, if the total power of the system is fixed, then distributing the power among m terminals can only reduce the capacity, due to the distributed encoding.


Bibliography

[1] P.-N. Chen and A. Papamarcou, “New asymptotic results in parallel distributed detection,” IEEE Trans. Inform. Theory, vol. 39, no. 6, pp. 1847–1863, Nov. 1993.

[2] J. A. Fill and M. J. Wichura, “The convergence rate for the strong law of large numbers: General lattice distributions,” Probab. Th. Rel. Fields, vol. 81, pp. 189–212, 1989.

[3] J. N. Tsitsiklis, “Decentralized detection by a large number of sensors,” Mathematics of Control, Signals and Systems, vol. 1, no. 2, pp. 167–182, 1988.
