Transcript of: Signal Modeling, Statistical Inference and Data Mining in Astrophysics (A6523), Lecture 14
Source: hosting.astro.cornell.edu/~cordes/A6523/A6523_lecture_14_2013.pdf

Page 1:

A6523 Signal Modeling, Statistical Inference and Data Mining in Astrophysics, Spring 2013

Lecture 14: Maximum entropy spectral estimation

Reading: Chapter 13, "Bayesian Revolution in Spectral Analysis"

Page 2:

Maximum Entropy Spectral Estimate

•  The MESE is identical to fitting an autoregressive (AR) model to a time series.
•  The unknowns are P_M and α_j, j = 1, ..., M.
•  We will show this equivalence later.

$$S(f) = \frac{P_M}{\left|1 + \sum_{j=1}^{M} \alpha_j\, e^{2\pi i f j \Delta\tau}\right|^2}$$

$$X_t = \sum_{j=1}^{M} \alpha_j\, X_{t-j} + n_t$$
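As an illustration (not part of the slides), here is a minimal Python sketch that evaluates this spectral form for assumed values of P_M, α_j, and Δτ; the function name and the coefficients are hypothetical:

```python
import numpy as np

def mese_spectrum(P_M, alpha, dtau, freqs):
    """S(f) = P_M / |1 + sum_{j=1}^{M} alpha_j exp(2 pi i f j dtau)|^2."""
    order = np.arange(1, alpha.size + 1)
    phases = np.exp(2j * np.pi * np.outer(freqs, order) * dtau)
    return P_M / np.abs(1.0 + phases @ alpha) ** 2

# Example: an M = 2 model evaluated up to the Nyquist frequency (dtau = 1)
alpha = np.array([-1.2, 0.8])          # hypothetical coefficients
freqs = np.linspace(0.0, 0.5, 512)
S = mese_spectrum(1.0, alpha, 1.0, freqs)
```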

Page 3:

Autoregressive processes (M = 2)

$$x_t = n_t + \sum_{i=1}^{M} a_i\, x_{t-i}$$

n_t = white noise; a_i = AR coefficients.

[Figure from Scargle 1981, ApJS, 45, 1.]

Page 4:

[The top of this journal page contains the end of an unrelated correspondence signed J. J. Freeman, Electron. Div., Naval Res. Lab., Washington, D.C. 20390.]

Alternative Interpretation of Maximum Entropy Spectral Analysis

Abstract: Maximum entropy spectral analysis is a method for the estimation of power spectra with a higher resolution than can be obtained with conventional techniques. This is achieved by extrapolation of the autocorrelation function in such a way that the entropy of the corresponding probability density function is maximized in each step of the extrapolation. This correspondence also gives a simple interpretation of the method without entropy considerations.

I. INTRODUCTION

In the classical approach to calculation of power spectra, the sample autocorrelation function is multiplied by some lag window and then transformed [1]. The window is applied to provide statistical stability and to avoid leakage from other parts of the spectrum. On the other hand, the window limits the spectral resolution.

In the literature, methods are found for the improvement of the spectral resolution through extrapolation of the correlation function beyond the maximum lag [2], [3]. One of these methods, the so-called maximum entropy spectral analysis [3], is discussed in this correspondence. The idea behind the method is to extrapolate the autocorrelation function in such a way that the entropy of the corresponding probability density function is maximized in each step of the extrapolation. It is shown that the method is equivalent to least-squares fitting of a discrete-time all-pole model to the data.

II. MAXIMUM ENTROPY SPECTRAL ANALYSIS [3]

Suppose that the first N + 1 points r(0), r(1), ..., r(N) of the autocorrelation function of a stationary Gaussian process are exactly known. The problem is now to find a value for r(N + 1). Therefore consider

$$R(N+1) = \begin{pmatrix} r(0) & r(1) & \cdots & r(N) & r(N+1) \\ r(1) & r(0) & \cdots & r(N-1) & r(N) \\ \vdots & & & & \vdots \\ r(N+1) & r(N) & \cdots & r(1) & r(0) \end{pmatrix}.$$

The quantities r(0), r(1), ..., r(N), r(N+1) belong to an autocorrelation only if R(N+1) is semipositive definite. It follows that det[R(N+1)] must be nonnegative. det[R(N+1)] is a quadratic function of r(N+1); its second derivative with respect to r(N+1) is −2 det[R(N−1)], which is always negative. This implies that det[R(N+1)] as a function of r(N+1) has a single maximum. Hence the allowable values of r(N+1) are those making det[R(N+1)] equal to zero and all values in between.

The basic idea of maximum entropy spectral analysis is to select the value of r(N+1) that maximizes det[R(N+1)]. The entropy of the (N+2)-dimensional Gaussian probability density function with covariance matrix R(N+1) is equal to [4]

$$\ln\left\{(2\pi e)^{N+2}\,\det[R(N+1)]\right\}^{1/2}.$$

It follows that this particular choice for r(N+1) maximizes this quantity. In order to obtain r(N+2), the value for r(N+1) just found is substituted in R(N+2), and then det[R(N+2)], and therefore the corresponding entropy, is maximized with respect to r(N+2). Substitution of the values of r(N+1) and r(N+2) so obtained in det[R(N+3)] and maximizing yields r(N+3), and so on. Note that in general this procedure is different from maximizing det[R(N+3)] jointly with respect to r(N+1), r(N+2), r(N+3).

For what follows it is important to note that maximizing det[R(N+1)] with respect to r(N+1) is equivalent to the statement

$$\det\begin{pmatrix} r(1) & r(0) & \cdots & r(N-1) \\ r(2) & r(1) & \cdots & r(N-2) \\ \vdots & & & \vdots \\ r(N+1) & r(N) & \cdots & r(1) \end{pmatrix} = 0. \tag{1}$$

III. INTERPRETATION IN TERMS OF ALL-POLE MODELS

Consider the following all-pole model:

$$y(n) + a_1' y(n-1) + \cdots + a_N' y(n-N) = e(n) \tag{2}$$

where e(n) is a zero-mean random variable with E[e(i)e(j)] = 0 for i ≠ j. Since E[e(n)y(n−k)] = 0 for k > 0, multiplication of both sides of (2) with y(n−k) and taking expectations yields

$$r'(k) + a_1' r'(k-1) + \cdots + a_N' r'(k-N) = 0 \qquad \text{for } k > 0$$

with r'(k) = E[y(n)y(n−k)]. It follows that

$$\begin{aligned} r'(1) + a_1' r'(0) + \cdots + a_N' r'(N-1) &= 0 \\ r'(2) + a_1' r'(1) + \cdots + a_N' r'(N-2) &= 0 \\ &\ \,\vdots \\ r'(N+1) + a_1' r'(N) + \cdots + a_N' r'(1) &= 0 \end{aligned} \tag{3}$$

and hence

$$\det\begin{pmatrix} r'(1) & r'(0) & \cdots & r'(N-1) \\ r'(2) & r'(1) & \cdots & r'(N-2) \\ \vdots & & & \vdots \\ r'(N+1) & r'(N) & \cdots & r'(1) \end{pmatrix} = 0.$$

Now, suppose that the first N + 1 exact values r(0), r(1), ..., r(N) of any autocorrelation function r(k) are available. Substitution of the values in the first N equations of (3) and solving yields a set of coefficients a_1, ..., a_N. The value for r(N+1) for a model with these coefficients, however, is then uniquely determined. From (1) it follows that the same solution for r(N+1) would have been obtained from maximum entropy spectral analysis. In the same way it can be proved that all values of the extrapolated autocorrelation function coincide with those corresponding to the model 1, a_1, ..., a_N. So it is concluded that maximum entropy spectral analysis is equivalent to matching the coefficients of an Nth-order model to the N + 1 given points of the autocorrelation function. Once the coefficients of the scheme have been obtained, the computation of the corresponding spectrum is straightforward.

IV. SAMPLE CONSIDERATIONS

In practice only a finite record of the signal y(t) is available and the autocorrelation function cannot be determined exactly. The usual approach to fitting an Nth-order all-pole model to a record of length M is as follows [5]. The quantity

$$\sum_m \left[\, y(m) + \sum_{n=1}^{N} a_n\, y(m-n) \right]^2 \tag{4}$$

Manuscript received March 25, 1970; revised November 23, 1970.

[Lecture annotations: "Proof that maximization of entropy for a Gaussian process is equivalent to extending the ACF using an AR model" (van den Bos 1971); "R(N+1) is the (N+2) × (N+2) covariance matrix that includes the unknown r(N+1) value."]

Page 5:

is minimized with respect to a_1, a_2, ..., a_N. The minimization of (4) yields a set of linear equations equivalent to the first N equations of (3), with the exact autocorrelation functions replaced by their estimates. It is clear that the maximum entropy procedure yields the same results when applied to these estimates. Therefore, the extrapolation beyond the (N+1)th sample autocorrelation point is equivalent to least-squares fitting of an Nth-order all-pole model to the data.

Asymptotic sampling properties of the least-squares coefficient estimates were derived by Mann and Wald [6]. The asymptotic expression for the Nth-order covariance matrix of the estimates [equation garbled in transcription] involves the matrix R_M(N), whose (i, j)th element is given by the sample autocorrelation function r_M(i − j) of y(m), and s², which denotes the minimum value of (4) divided by M − N. Moreover, the coefficient estimates are maximum-likelihood estimates if the e(n) in (2) are Gaussian.

V. CONCLUDING REMARKS

Maximum entropy spectral analysis is equivalent to least-squares fitting of an all-pole model to the available data. The order of the model is determined by the number of available mean-lagged products. This implies the possibility of introduction of poles that were not actually present in the process. On the other hand, if the process contains both poles and zeros, a relatively large number of poles is needed to approximate the true spectrum sufficiently [7]. In this case the number of available mean-lagged products may be too small. Therefore, it seems more adequate to use parametric methods, such as the minimum residual methods developed by Åström and Steiglitz [7], [8], for models with both poles and zeros, or conventional fitting of all-pole models, since these approaches provide in addition tests for the order of the model.

A. VAN DEN BOS
Dep. Appl. Phys.
Delft Technol. Univ.
Lorentzweg 1
Delft, Holland

REFERENCES
[1] R. B. Blackman and J. W. Tukey, The Measurement of Power Spectra. New York: Dover, 1958.
[2] N. R. Zagalsky, "Exact spectral representation of truncated data," Proc. IEEE (Lett.), vol. 55, Jan. 1967, pp. 117-118.
[3] J. P. Burg, "Maximum entropy spectral analysis," presented at the 37th Annu. Meeting Soc. of Exploration Geophysicists, Oklahoma City, Okla., 1967.
[4] C. E. Shannon and W. Weaver, The Mathematical Theory of Communication. Urbana, Ill.: University of Illinois Press, 1949.
[5] M. G. Kendall and A. Stuart, The Advanced Theory of Statistics, vol. 3. London: Griffin, 1966.
[6] H. B. Mann and A. Wald, "On the statistical treatment of linear stochastic difference equations," Econometrica, vol. 11, July/Oct. 1943, pp. 173-219.
[7] S. A. Tretter and K. Steiglitz, "Power-spectrum identification in terms of rational models," IEEE Trans. Automat. Contr. (Short Papers), vol. AC-12, Apr. 1967, pp. 185-188.
[8] K. J. Åström and T. Bohlin, "Numerical identification of linear dynamic systems from normal operating records," Proc. IFAC Symp. Self-Adaptive Control Systems, Sept. 1966, pp. 3.2-1 to 3.2-9.

[The remainder of this journal page contains an unrelated correspondence, "New Algorithm for Stochastic Approximation."]

Page 6:

[The top of this journal page (IEEE Transactions on Information Theory, March 1973, p. 232) contains the conclusion and references of an unrelated correspondence on Fourier versus Karhunen-Loève processing.]

Notes on Maximum-Entropy Processing

J. A. EDWARD AND M. M. FITELSON

Abstract: Maximum-entropy processing is a method for computing the power density spectrum from the first N lags of the autocorrelation function. Unlike the discrete Fourier transform, maximum-entropy processing does not assume that the other lag values are zero. Instead, one mathematically ensures that the fewest possible assumptions about unmeasured data are made by choosing the spectrum that maximizes the entropy for the process. The use of the maximum entropy approach to spectral analysis was introduced by Burg [1]. In this correspondence, the authors derive the maximum-entropy spectrum by obtaining a spectrum that is forced to maximize the entropy of a stationary random process.

I. INTRODUCTION

Maximum entropy processing is a method for computing the power density spectrum from the first N lags of the autocorrelation function. As such, it may be applied to spectral analysis problems normally approached by use of the discrete Fourier transform, such as the computation of frequency power density spectra for stationary time series, or of wavenumber spectra for sonar or seismic arrays. Unlike the discrete Fourier transform, maximum entropy processing does not assume that the other lag values are zero. Instead, one chooses the spectrum that fits the measured data and maximizes the entropy of the process. This means one has mathematically ensured that the fewest assumptions regarding the information content of unmeasured data have been made.

The use of this technique for estimating power spectra was introduced by Burg [1]. His paper did not contain an explicit derivation of this result. However, in 1969, Bernard [2] and, independently, Edward [3] presented similar derivations (neither of which is available in the open literature) of the maximum entropy spectrum. Subsequently, in 1971, Van den Bos [4] presented a derivation based on a least-squares fitting of an all-pole model to the data. In addition, Lacoss [5] presented quantitative results based on maximum-entropy spectrum analysis.

In this paper the authors present a more rigorous version of Edward's derivation [3]. The maximum entropy spectrum is found by obtaining a spectrum that is forced to maximize the entropy for a stationary random process and uniformly spaced data. The only assumptions that are necessary arise from demanding that the maximum-entropy spectrum have the analytical properties of the spectrum of the process in question.

The fit to measured data is obtained by requiring the second-order moments (i.e., the covariance matrix) for the spectrum to agree with estimates based on the measured data. A maximum

Manuscript received March 22, 1972; revised September 1, 1972. The authors are with the General Electric Company, Syracuse, N.Y. 13201.

Page 7:

entropy random process with constraints on only the first and second moments is readily shown to be a Gaussian process. The entropy for a stationary Gaussian process may be shown to be given by

$$H = \frac{1}{\lambda_s}\int \ln f(\lambda)\, d\lambda, \tag{1}$$

where the integral is over the support of f(λ) and λ_s is the measure of the support of f(λ). The maximum entropy estimate of f(λ) is derived in Section II.

II. THE MAXIMUM-ENTROPY SPECTRUM

By definition,

$$f(\lambda) = \int_{-\infty}^{\infty} d\tau\, \exp(-2\pi i \tau\lambda)\, E\big(X(t)X(t+\tau)\big) \equiv \int_{-\infty}^{\infty} d\tau\, \exp(-2\pi i \tau\lambda)\, \rho(\tau), \tag{2}$$

where X(t) is the stationary Gaussian process in question. Suppose one had the covariance matrix for X(t_1), X(t_2), ..., X(t_n); since the t_i are uniformly spaced, it may be assumed that t_i = i. The covariance matrix satisfies

$$R_{ij} = E(X_i X_j) = R_{ji}^{*} = r_{i-j}, \tag{3}$$

where

$$r_k = \int d\lambda\, \exp(2\pi i \lambda k)\, f(\lambda). \tag{4}$$

Note that R is Hermitian and thus has 2n − 1 (n real and n − 1 imaginary) independent components. To find f(λ) so that H is maximized and (4) is satisfied, introduce the quantity

$$L = -\sum_{k=-(n-1)}^{n-1} \Theta_k^{*}\, r_k, \tag{5}$$

where the Lagrange multipliers Θ_k satisfy

$$\Theta_k = \Theta_{-k}^{*}. \tag{6}$$

Then f(λ) is found by demanding that

$$\delta(L + H) = 0, \tag{7}$$

where the variation is with respect to f(λ). Using (1) and (4), the variation in (7) yields a condition on the integrand (8); since δf(λ) is arbitrary, one has

$$f(\lambda) = \frac{1}{\lambda_s}\left[\sum_k \Theta_k^{*}\, \exp(2\pi i \lambda k)\right]^{-1}. \tag{9}$$

Imposing the condition that f(λ) is positive and integrable, it is possible to express f(λ) as

$$f(\lambda) = \frac{1}{|\Lambda(\lambda)|^2}, \tag{10}$$

where

$$\Lambda(\lambda) = \sum_{k=0}^{n-1} \gamma_k^{*}\, \exp(2\pi i \lambda k) \tag{11}$$

and, by Wold's method [6], [7], Λ(λ)^{-1} may be chosen to be analytic in the upper half plane. It now remains to solve for {γ_k : k = 0, ..., n − 1} by satisfying the constraint equations

$$r_k = \int_{-1/2}^{1/2} d\lambda\, f(\lambda)\, \exp(-2\pi i \lambda k), \qquad k = 0, 1, \ldots, n-1, \tag{12}$$

where use has been made of the periodicity of Λ(λ) and λ_s has been taken to be 1. Multiplying both sides of (10) by Λ*(λ) exp(2πiλk) and integrating yields a linear system (13) for the γ_l. Since Λ(λ)^{-1} is analytic in the upper half plane, the integral on the right-hand side of (13) may be performed by integration around a closed contour (Fig. 1: integration contour). Due to the periodicity of Λ(λ), the integrals over the two vertical paths cancel; this fact combined with Cauchy's theorem relates the integral along the real axis to one along a path displaced by iy. Taking y arbitrarily large, the integral reduces to δ_{k0}/γ_0 (15). Combining (15) with (13) yields a linear system which has the solution, in an obvious matrix notation,

$$\vec{\gamma} = \frac{1}{\gamma_0}\, R^{-1}\vec{\delta}. \tag{17}$$

This finally yields

$$f(\lambda) = \frac{(R^{-1})_{11}}{\left|\vec{e}^{\,t}(\lambda)\, R^{-1}\,\vec{\delta}\right|^2}, \tag{18}$$

where ē^t(λ) = [1, ..., exp(−2πi(n−1)λ)]. The right-hand side of (18) is the desired maximum entropy spectrum.

Page 8:

ACKNOWLEDGMENT

The authors would like to thank the referees and Prof. T. Kailath for their helpful suggestions, which improved the presentation of this correspondence.

REFERENCES
[1] J. P. Burg, "Maximum entropy spectral analysis," presented at the 37th Annual Meeting Soc. Explor. Geophys., Oklahoma City, Okla., 1967.
[2] T. E. Bernard, "Analytical studies of techniques for the computation of high-resolution wave-number spectra," Texas Instruments Advan. Array Res., Spec. Rep. 9, 1969.
[3] J. A. Edward, "An approach to passive sonar array processing based on the information in the space/time covariance matrix," General Electric Co., Heavy Mil. Electron. Syst., TIS Rep. R69EMH38, 1969.
[4] A. Van den Bos, "Alternative interpretation of maximum entropy spectral analysis," IEEE Trans. Inform. Theory (Corresp.), vol. IT-17, pp. 493-494, July 1971.
[5] R. T. Lacoss, "Data adaptive spectral analysis methods," Geophysics, vol. 36, pp. 661-675, Aug. 1971.
[6] E. A. Robinson, "Predictive decomposition of time series with applications to seismic exploration," Massachusetts Inst. Technol., GAG Rep. 7, sect. 2.8, pp. 185-186, July 12, 1954.
[7] H. Wold, A Study in the Analysis of Stationary Time Series. Uppsala: Almquist and Wiksells, 1938.

[The remainder of this journal page contains an unrelated correspondence, "Binary Decisions With Stochastic Inputs," by G. Frenkel.]

Page 9:

Maximum Entropy Spectra of Red Noise Processes

1.  Generate a realization of red noise in the spectral domain:
    •  Generate complex white, Gaussian noise
    •  Shape it according to [S(f)]^{1/2}
    •  Inverse FFT to the time domain
2.  Find the best-fit autoregressive model by minimizing the fitting error against the "order" of the model. (AR model fitting is equivalent to maximizing entropy.)
3.  Find the Fourier spectrum of the AR model.
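A minimal Python sketch of these three steps (not from the slides; numpy only). The Yule-Walker solver, the fixed order M = 30, and the spectral index β = 2 are illustrative assumptions; the slides select the order by minimizing the fitting error, and the Burg algorithm mentioned later in the lecture is a common alternative:

```python
import numpy as np

rng = np.random.default_rng(0)

# --- 1. Generate a red-noise realization in the spectral domain ---
N = 4096
beta = 2.0                                   # assumed spectral index: S(f) ~ f^(-beta)
freqs = np.fft.rfftfreq(N, d=1.0)
amp = np.zeros_like(freqs)
amp[1:] = freqs[1:] ** (-beta / 2.0)         # shape by [S(f)]^(1/2); zero the DC bin
white = rng.normal(size=freqs.size) + 1j * rng.normal(size=freqs.size)
x = np.fft.irfft(amp * white, n=N)           # inverse FFT to the time domain

# --- 2. Fit an AR(M) model via the Yule-Walker equations ---
def yule_walker(x, M):
    """Return AR coefficients alpha_j and innovation variance sigma^2,
    from biased sample autocovariances."""
    x = x - x.mean()
    c = np.correlate(x, x, mode="full")[x.size - 1 : x.size + M] / x.size
    R = np.array([[c[abs(i - j)] for j in range(M)] for i in range(M)])
    alpha = np.linalg.solve(R, c[1 : M + 1])
    return alpha, c[0] - alpha @ c[1 : M + 1]

M = 30                                       # order fixed by hand here; see the
alpha, sigma2 = yule_walker(x, M)            # FPE criterion at the end of the lecture

# --- 3. Fourier spectrum of the fitted AR model ---
f = np.linspace(1.0 / N, 0.5, 1000)
ks = np.arange(1, M + 1)
S_ar = sigma2 / np.abs(1.0 - np.exp(-2j * np.pi * np.outer(f, ks)) @ alpha) ** 2
```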

Page 10: [figure page; no extractable text]

Page 11: [Figure: red-noise realization x(t) versus time (steps), and its spectrum versus frequency bin. Black: generated spectrum; blue: periodogram; red: AR spectrum. S(f) ∝ f^{−0.0}.]

Page 12: [Figure: as above, for S(f) ∝ f^{−1.0}.]

Page 13: [Figure: as above, for S(f) ∝ f^{−2.0}.]

Page 14: [Figure: as above, for S(f) ∝ f^{−3.0}.]

Page 15: [Figure: as above, for S(f) ∝ f^{−4.0}.]

Page 16: [Figure: as above, for S(f) ∝ f^{−5.0}.]

Page 17: [Figure: as above, for S(f) ∝ f^{−6.0}.]

Page 18: [Figure montage: the panels of Pages 12-17 combined, for S(f) ∝ f^{−1.0} through f^{−6.0}.]

Pages 19-24: [figure pages; no extractable text]

Page 25:

Maximum Entropy Spectral Estimate

So far we know how to calculate the entropy of a random variable in terms of its PDF. For a univariate Gaussian PDF we have

$$f_X(x) = \left(2\pi\sigma^2\right)^{-1/2} e^{-x^2/2\sigma^2}$$

$$H = -\int dx\, f_X(x)\,\ln f_X(x) = \frac{1}{2}\ln(2\pi\sigma^2) + \frac{\langle X^2\rangle}{2\sigma^2} = \frac{1}{2}\left[\ln(2\pi\sigma^2) + 1\right] = \frac{1}{2}\ln\left(2\pi e\,\sigma^2\right)$$

When we maximize the entropy subject to constraints (from data), we only care about terms in the entropy that depend on relevant parameters. Here the only parameter is σ, so the constant term does not matter. Notice that larger σ implies larger entropy, as we would expect for a measure of uncertainty.

When we maximize entropy, we may as well write

$$H \approx \ln\sigma + \text{constant}.$$
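A quick numerical check of this formula (not in the notes), estimating H = −⟨ln f_X(x)⟩ by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 2.0
x = rng.normal(0.0, sigma, size=200_000)

# H = -<ln f_X(x)>, estimated from samples of the Gaussian PDF
ln_f = -0.5 * np.log(2 * np.pi * sigma**2) - x**2 / (2 * sigma**2)
print(-ln_f.mean())                                # Monte Carlo estimate
print(0.5 * np.log(2 * np.pi * np.e * sigma**2))   # exact: (1/2) ln(2 pi e sigma^2)
```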


Page 26:

Multivariate Case:

Consider a real Gaussian random process x_k, k = 1, ..., N whose correlation function for N lags can be written as an N × N covariance matrix C_x. For the zero-mean case,

$$\mathbf{C}_x = \begin{pmatrix} \langle x_1^2\rangle & \langle x_1 x_2\rangle & \cdots & \langle x_1 x_N\rangle \\ \vdots & \langle x_2^2\rangle & \cdots & \langle x_2 x_N\rangle \\ \vdots & & \ddots & \vdots \\ \langle x_N x_1\rangle & \cdots & \cdots & \langle x_N^2\rangle \end{pmatrix}$$

Since the random process is continuous, we use the integral expression for the relative entropy (dependent on the coordinate system)

$$H = -\int d\vec{x}\, f_{\vec{x}}(\vec{x})\,\ln f_{\vec{x}}(\vec{x})$$

with

$$f_{\vec{x}}(\vec{x}) = \left[(2\pi)^N \det \mathbf{C}_x\right]^{-1/2} \exp\left[-\tfrac{1}{2}\,(\vec{x}-\vec{\mu})^t\, \mathbf{C}_x^{-1}\,(\vec{x}-\vec{\mu})\right]$$

which yields

$$H = \frac{1}{2}\ln\left[(2\pi)^N \det \mathbf{C}_x\right] + \frac{1}{2}\left\langle(\vec{x}-\vec{\mu})^t\,\mathbf{C}_x^{-1}\,(\vec{x}-\vec{\mu})\right\rangle$$

Page 27:

We will

1. ignore the factor (2π)^N because it is constant in C_x;
2. ignore the second term because it is independent of both N and C_x. This is equivalent to the constant term we found for the univariate case.

Example of a bivariate Gaussian:

$$\mathbf{C}_x = \begin{pmatrix} \sigma_1^2 & \sigma_1\sigma_2\rho_{12} \\ \sigma_1\sigma_2\rho_{12} & \sigma_2^2 \end{pmatrix}, \qquad \mathbf{C}_x^{-1} = (\det \mathbf{C}_x)^{-1}\begin{pmatrix} \sigma_2^2 & -\sigma_1\sigma_2\rho_{12} \\ -\sigma_1\sigma_2\rho_{12} & \sigma_1^2 \end{pmatrix}$$

$$\mathbf{C}_x^{-1}\begin{pmatrix} \delta x_1 \\ \delta x_2 \end{pmatrix} = (\det \mathbf{C}_x)^{-1}\begin{pmatrix} \sigma_2^2\,\delta x_1 - \sigma_1\sigma_2\rho_{12}\,\delta x_2 \\ -\sigma_1\sigma_2\rho_{12}\,\delta x_1 + \sigma_1^2\,\delta x_2 \end{pmatrix}$$

$$Q \equiv (\delta x_1\ \ \delta x_2)\; \mathbf{C}_x^{-1} \begin{pmatrix} \delta x_1 \\ \delta x_2 \end{pmatrix}$$

$$\langle Q\rangle = (\det \mathbf{C}_x)^{-1}\left(\sigma_2^2\sigma_1^2 - 2\sigma_1^2\sigma_2^2\rho_{12}^2 + \sigma_1^2\sigma_2^2\right) = \frac{2\,\sigma_1^2\sigma_2^2\,(1-\rho_{12}^2)}{\sigma_1^2\sigma_2^2\,(1-\rho_{12}^2)} = 2$$

Since this is a constant, we will ignore it in taking derivatives of H.

Page 28:

We therefore use an entropy expression

$$H \equiv \frac{1}{2}\ln\left(\det \mathbf{C}_x\right)$$

Unfortunately, as N → ∞, H → ∞, as can be seen for the uncorrelated case where the covariance matrix is diagonal:

$$\Rightarrow\ H = \frac{1}{2}\ln\left(\prod_{j=1}^{N}\sigma_j^2\right) = \frac{1}{2}\sum_{j=1}^{N}\ln\sigma_j^2$$

Define an entropy rate as

$$h = \lim_{N\to\infty}\frac{H}{N+1} = \lim_{N\to\infty}\frac{1}{2}\,\frac{1}{N+1}\ln\left(\det\mathbf{C}_x\right) = \frac{1}{2}\ln\left[\lim_{N\to\infty}\left(\det\mathbf{C}_x\right)^{\frac{1}{N+1}}\right] \tag{1}$$

Page 29:

Entropy in terms of the spectrum:

To get a maximum entropy estimate of a spectrum, we need an expression for the entropy in terms of the spectrum. There is no general relation between the spectrum and the entropy. For Gaussian processes, however, there is a relation. This is appropriate since a Gaussian process is the one with maximum entropy out of all processes with the same variance. The spectrum is the variance per unit frequency, so this conceptual step is important. A relation exists¹ between the determinant of the covariance matrix and the spectrum, which is assumed to be bandlimited in (−f_N, f_N):

$$\lim_{N\to\infty}\left(\det\mathbf{C}_x\right)^{\frac{1}{N+1}} = 2 f_N \exp\left[\frac{1}{2 f_N}\int_{-f_N}^{f_N} df\,\ln S_x(f)\right].$$

The theorem depends on C_x being a Toeplitz matrix [matrix element C_ij depends only on (i − j)], i.e. that the process be WSS.

¹An arcane proof exists in "Prediction-Error Filtering and Maximum-Entropy Spectral Estimation," in Nonlinear Methods of Spectral Analysis, S. Haykin, ed., Springer-Verlag, 1979, Appendix A, pp. 62-67. It is also given by Smylie et al. 1973, Meth. Comp. Phys., 13, 391.
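An illustrative numerical check of this limit (not in the notes): build the Toeplitz covariance from an assumed bandlimited spectrum and compare (det C_x)^{1/(N+1)} with the right-hand side as N grows. The spectrum chosen below is arbitrary:

```python
import numpy as np

f_N = 0.5                                  # Nyquist frequency for unit sample spacing
f = np.linspace(-f_N, f_N, 4000, endpoint=False)
df = f[1] - f[0]
S = 1.0 + 0.8 * np.cos(2 * np.pi * f)      # an arbitrary positive bandlimited spectrum

# Autocovariance at integer lags: C_k = int S(f) cos(2 pi f k) df (real, even)
Nmax = 200
c = np.array([np.sum(S * np.cos(2 * np.pi * f * k)) * df for k in range(Nmax + 1)])

rhs = 2 * f_N * np.exp(np.sum(np.log(S)) * df / (2 * f_N))

for N in (10, 50, 200):
    idx = np.arange(N + 1)
    Cx = c[np.abs(np.subtract.outer(idx, idx))]   # Toeplitz covariance matrix
    sign, logdet = np.linalg.slogdet(Cx)
    print(N, np.exp(logdet / (N + 1)), rhs)       # left side approaches rhs as N grows
```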


Page 30:

Thus

$$h = \lim_{N\to\infty}\frac{1}{2}\ln\left(\det\mathbf{C}_x\right)^{\frac{1}{N+1}} = \frac{1}{2}\ln\left[\lim_{N\to\infty}\left(\det\mathbf{C}_x\right)^{\frac{1}{N+1}}\right] = \frac{1}{2}\ln 2f_N + \frac{1}{4f_N}\int_{-f_N}^{f_N} df\,\ln S_x(f).$$

Ignoring the first, constant term, we have

$$h = \frac{1}{4f_N}\int_{-f_N}^{f_N} df\,\ln S_x(f).$$

Page 31:

Heuristic "derivation" of the entropy rate expression:

Another way of viewing this is as follows. In calculating a power spectrum we are concerned with a second-order moment, by definition. Consequently, we can assume that the random process under consideration is Gaussian because:

1. we are maximizing the entropy (subject to constraints), and
2. given the second moment, the process with largest entropy is a Gaussian random process.

Note that while this assumption is satisfactory for estimating the power spectrum (a second moment), it is not necessarily accurate when we consider the estimation errors of the spectral estimate, which depend on fourth-order statistics. If the central limit theorem can be invoked then, of course, the Gaussian assumption becomes a good one once again.

Imagine that the process under study is constructed by passing white noise through a linear filter whose system function is $\sqrt{S(f)}$:

$$n(t) \longrightarrow \boxed{\sqrt{S_x(f)}} \longrightarrow x(t)$$

Consequently, the Fourier transforms are related as

$$N(f)\,\sqrt{S_x(f)} = X(f)$$

Now N(f) itself is a Gaussian random variable because it is the sum of GRVs. Therefore,

Page 32:

X(f) is a GRV and, viewing it as a one-dimensional random variable, we have that the entropy is

$$H(f) = \frac{1}{2}\ln\left[2\pi e\,\sigma^2(f)\right]$$

but

$$\sigma^2(f) \equiv \left\langle\left[N(f)\,\sqrt{S(f)}\right]^2\right\rangle = S(f)\,S_N$$

Letting the white noise spectrum be S_N = 1, we have

$$\sigma^2(f) = S(f)$$

and

$$H(f) = \frac{1}{2}\ln\left[2\pi e\,S(f)\right]$$

Recall that white noise is uncorrelated between different frequencies:

$$\left\langle N(f)\,N^{*}(f')\right\rangle = S_N\,\delta(f - f').$$

Consequently, the information in different frequencies adds because of statistical independence and, therefore, to get the total entropy we simply integrate (add variances):

$$H = \int df\, H(f) = \frac{1}{2}\int df\,\ln\left[2\pi e\,S(f)\right]$$

Page 33:

Again we ignore additive constants, and we also consider the case where the signal is bandlimited in |f| ≤ f_N and sampled over a time [0, T]. Therefore, the number of degrees of freedom is 2 f_N T.

The entropy per degree of freedom is

$$h = \frac{H}{2 f_N} = \frac{1}{4 f_N}\int_{-f_N}^{f_N} df\,\ln\left[S(f)\right]$$

The derivation of the ME spectrum follows this logical flow:

entropy in terms of the power spectrum S(f)
⇓
maximize H subject to constraints: known covariances (C̄) ⇔ (F.T.) ⇔ S(f)
⇓
ME spectral estimator

van den Bos shows that extrapolating the covariance function to larger lags while maximizing entropy yields a matrix equation that is identical to that obtained by fitting an autoregressive model to the data. This implies that the two procedures are mathematically identical.

Page 34:

Maximum Entropy Spectral Estimator

By maximizing the entropy rate expressed in terms of the spectrum, the spectral estimate can be written as (e.g. Edward & Fitelson 1973)

$$S(f) = \frac{1}{f_N}\,\frac{\alpha_0^2}{\left|\vec{\varepsilon}^{\,t}\,\mathbf{C}^{-1}\,\vec{\delta}\right|^2}$$

where C̄ is the Toeplitz covariance matrix, which applies to WSS processes:

$$\vec{\varepsilon} = \begin{pmatrix} 1 \\ e^{2\pi i f\Delta\tau} \\ \vdots \\ e^{2\pi i f M\Delta\tau} \end{pmatrix}, \qquad \mathbf{C} = \begin{pmatrix} C_{00} & C_{01} & \cdots & C_{0M} \\ \vdots & & & \vdots \\ C_{M0} & \cdots & \cdots & C_{MM} \end{pmatrix}, \qquad \vec{\delta} = \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix}$$

$$\text{Toeplitz} \ \Rightarrow\ \mathbf{C} \equiv \begin{pmatrix} C_0 & C_1 & \cdots & C_M \\ \vdots & C_0 & & \vdots \\ C_M & \cdots & \cdots & C_0 \end{pmatrix}$$

Let

$$\vec{\gamma} \equiv \mathbf{C}^{-1} = \begin{pmatrix} \gamma_{00} & \cdots & \gamma_{0M} \\ \vdots & & \vdots \\ \gamma_{M0} & \cdots & \gamma_{MM} \end{pmatrix}$$

Page 35:

Then

$$\vec{\varepsilon}^{\,t}\,\mathbf{C}^{-1}\,\vec{\delta} = \sum_{j=0}^{M}\gamma_{j0}\, e^{2\pi i f\Delta\tau j}$$

and

$$S(f) = \frac{1}{f_N}\,\frac{\alpha_0^2}{\left|\displaystyle\sum_{j=0}^{M}\gamma_{j0}\, e^{2\pi i f\Delta\tau j}\right|^2}$$

By rewriting the sum and redefining the constants this can be written

$$S(f) = \frac{\alpha_0^2}{f_N}\,\frac{1}{\left|\gamma_{00} + \displaystyle\sum_{j=1}^{M}\gamma_{j0}\, e^{2\pi i f\Delta\tau j}\right|^2} = \frac{\alpha_0^2}{f_N\,\gamma_{00}^2}\,\frac{1}{\left|1 + \displaystyle\sum_{j=1}^{M}\frac{\gamma_{j0}}{\gamma_{00}}\, e^{2\pi i f\Delta\tau j}\right|^2}$$

Page 36:

Thus, the ME spectral estimate can be put into the form

$$S(f) = \frac{P_M}{\left|1 + \displaystyle\sum_{j=1}^{M}\alpha_j\, e^{2\pi i f j\Delta\tau}\right|^2}$$

where P_M is a constant that properly normalizes the spectrum.

This is the same spectrum as for an Mth-order AR process that can be fitted to the data, where the coefficients are determined by least squares.

Spectrum of an AR Process:

Consider the following Mth-order AR process:

$$x_t = \underbrace{a_t}_{\text{white noise}} + \underbrace{\sum_{j=1}^{M}\alpha_j\, x_{t-j}}_{\text{autoregressive part}}$$

A zeroth-order process would be x_t = a_t (i.e. white noise). Scargle would term the above definition a causal AR process. An acausal or two-sided process would allow negative values of j in the sum on the RHS.

The coefficients α_j, j = 1, ..., M are the AR coefficients. In fitting an AR model to the data, one must determine the order M as well as the M coefficients.

Page 37:

Define the DFTs

$$X(f) \equiv \sum_{t=0}^{N-1} x_t\, e^{-2\pi i f t/N}, \qquad A(f) \equiv \sum_{t=0}^{N-1} a_t\, e^{-2\pi i f t/N}.$$

Substituting the definition of the AR process, we have

$$X(f) = \sum_{j=1}^{M}\alpha_j\, X(f)\, e^{-2\pi i f j/N} + A(f),$$

and, solving for X(f),

$$X(f) = \frac{A(f)}{1 - \displaystyle\sum_{j=1}^{M}\alpha_j\, e^{-2\pi i f j/N}}.$$

Page 38:

The spectrum is then

$$S(f) = \frac{|A(f)|^2}{\left|1 - \displaystyle\sum_{j=1}^{M}\alpha_j\, e^{-2\pi i f j/N}\right|^2} \ \propto\ \frac{1}{\left|1 - \displaystyle\sum_{j=1}^{M}\alpha_j\, e^{-2\pi i f j/N}\right|^2}.$$

As is evident, the AR spectrum has the same form as the maximum-entropy spectrum.
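To make the correspondence concrete, here is a small sketch (not from the notes) that simulates an AR(2) process and compares its periodogram with the analytic form above; the coefficients are chosen arbitrarily (and give a stable process):

```python
import numpy as np

rng = np.random.default_rng(2)
alpha = np.array([1.2, -0.8])     # arbitrary coefficients giving a stable AR(2)
N = 8192
a = rng.normal(size=N)            # white-noise input a_t
x = np.zeros(N)
for t in range(2, N):
    x[t] = a[t] + alpha[0] * x[t - 1] + alpha[1] * x[t - 2]

f = np.fft.rfftfreq(N)
P = np.abs(np.fft.rfft(x)) ** 2 / N                       # periodogram
ks = np.arange(1, alpha.size + 1)
S = 1.0 / np.abs(1.0 - np.exp(-2j * np.pi * np.outer(f, ks)) @ alpha) ** 2
# P scatters about S: the periodogram fluctuates around the analytic AR spectrum
```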


Page 39:

Matrix Form of the Fourier-Transform-Based Spectral Estimate:

It is instructive to compare the matrix form of the maximum entropy spectrum with the power spectral estimate defined as the Fourier transform of the autocorrelation function. This is identical to the spectrum found by taking the squared magnitude of the Fourier transform of the time series, and is sometimes called the Bartlett estimate, because the Bartlett lag window is a triangle function.

Let C_{n−n′} be the (n, n′) element of the covariance matrix; then the Bartlett estimate is

$$S_B(f) = N^{-1}\sum_{\ell=-(N-1)}^{N-1} C_\ell \left(1 - \frac{|\ell|}{N}\right) e^{-2\pi i f \ell\,\Delta\tau},$$

which can also be written as

$$S_B(f) = N^{-2}\sum_{\ell}\sum_{\ell'} C_{\ell-\ell'}\, e^{-2\pi i f\Delta\tau\,(\ell-\ell')} = N^{-2}\sum_{\ell}\sum_{\ell'} e^{2\pi i f\Delta\tau\,\ell'}\, C_{\ell-\ell'}\, e^{-2\pi i f\Delta\tau\,\ell},$$

and hence as a matrix product

$$S_B(f) = N^{-2}\,\vec{\varepsilon}^{\,t}\,\mathbf{C}\,\vec{\varepsilon}^{\,*}$$

Page 40:

This can be compared with the ME and high-resolution ("ML") estimates:

$$S_{ME}(f) = \frac{\text{constant}}{\left|\vec{\varepsilon}^{\,t}\,\mathbf{C}^{-1}\,\vec{\delta}\right|^2}, \qquad S_{ML}(f) = \frac{1}{\vec{\varepsilon}^{\,t}\,\mathbf{C}^{-1}\,\vec{\varepsilon}}$$
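A compact sketch (not in the notes) evaluating all three matrix-form estimates from a given Toeplitz covariance; the test covariance, an AR(1)-like exponential, is an assumption for illustration, and the matrix dimension M + 1 plays the role of N in the Bartlett normalization:

```python
import numpy as np

M = 32
# Hypothetical Toeplitz covariance for illustration: C_k = 0.9^|k| (AR(1)-like)
lags = np.abs(np.subtract.outer(np.arange(M + 1), np.arange(M + 1)))
C = 0.9 ** lags
Cinv = np.linalg.inv(C)
delta = np.zeros(M + 1)
delta[0] = 1.0

f = np.linspace(0.0, 0.5, 512)                             # cycles per sample (dtau = 1)
eps = np.exp(2j * np.pi * np.outer(f, np.arange(M + 1)))   # steering vector eps(f)

S_B  = np.real(np.einsum('fi,ij,fj->f', eps, C, eps.conj())) / (M + 1) ** 2
S_ME = 1.0 / np.abs(eps @ (Cinv @ delta)) ** 2             # up to the normalizing constant
S_ML = 1.0 / np.real(np.einsum('fi,ij,fj->f', eps.conj(), Cinv, eps))
```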


Page 41:

On the Applicability of the Maximum Entropy Spectral Estimate:

The spectral estimator:
Makes use of the 2M + 1 values of the covariance function that are known or estimated. There is no choice in M. If ρ_M → 0, then the spectral estimate will reflect that (Jaynes says that the corresponding Lagrange multiplier will be zero).

The AR approach:
M appears to be a parameter that must be chosen according to some criterion.

Reconciliation:
Jaynes is correct so long as the expression used for the entropy is correct. It may not be in some cases. If it is, then simply use the 2M + 1 known values. If the entropy expression is not applicable, then one must view the situation as one where an AR model is being fitted to the data, with M an unknown parameter. The problem then reduces to finding (i) the order of the AR process and (ii) the coefficients.

Page 42:

Example: Suppose one knows the AR process is zeroth order; then α_j = 0 for j ≥ 1.

The MEM spectral estimate is

$$S(f) = P_A = \sigma_a^2/N = \text{constant}$$

Contrast this with a Fourier transform estimate, which would look like [figure: a noisy, fluctuating periodogram] and would become smooth only if one smoothed over all spectral values.

The question (again) is: how does one know or determine the proper AR order to use?

Page 43:

Estimates for AR coefficients

For all the nitty-gritty details of the calculation of AR coefficients, see Ulrych and Bishop, "Maximum Entropy Spectral Analysis and Autoregressive Decomposition," 1975, Rev. Geophys. and Space Phys., 13, 183. There are Fortran listings for the Yule-Walker and Burg algorithms for estimating the coefficients. See also Numerical Recipes.

Two problems remain:

1. How does one calculate the order of the AR model?
2. What are the estimation errors of the spectral estimate?

The order of the AR model can be estimated by looking at the prediction error as a function of order M. With N = the number of data points and M = the order of the AR process (or of a closely related prediction-error filter), evaluate the "final prediction error"

$$(\mathrm{FPE})_M = \underbrace{\frac{N + (M+1)}{N - (M+1)}}_{\text{increases as } M \text{ increases}}\ \underbrace{\sum_t \left(x_t - \hat{x}_t\right)^2}_{\text{decreases}}$$

where x̂_t is the AR-model prediction of x_t. The order M is chosen as the one that minimizes the FPE (the Akaike criterion).
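A minimal sketch of FPE-based order selection (not from the notes); the least-squares AR fitter below is an illustrative stand-in for the Yule-Walker or Burg algorithms referenced above:

```python
import numpy as np

def ar_prediction_error(x, M):
    """Least-squares AR(M) fit; returns the summed squared prediction error."""
    X = np.column_stack([x[M - j : x.size - j] for j in range(1, M + 1)])
    coef, *_ = np.linalg.lstsq(X, x[M:], rcond=None)
    resid = x[M:] - X @ coef
    return np.sum(resid**2)

def fpe(x, M):
    """Final prediction error: [N + (M+1)] / [N - (M+1)] times the fit error."""
    N = x.size
    return (N + M + 1) / (N - M - 1) * ar_prediction_error(x, M)

# Example: the criterion typically recovers M close to 2 for an AR(2) realization
rng = np.random.default_rng(3)
x = np.zeros(2000)
a = rng.normal(size=x.size)
for t in range(2, x.size):
    x[t] = a[t] + 1.2 * x[t - 1] - 0.8 * x[t - 2]

best_M = min(range(1, 21), key=lambda M: fpe(x, M))
```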
