Adaptive Filtering and Change Detection

Fredrik Gustafsson
Copyright © 2000 John Wiley & Sons, Ltd
ISBNs: 0-471-49287-6 (Hardback); 0-470-84161-3 (Electronic)


Adaptive Filtering and Change Detection

Fredrik Gustafsson
Linköping University, Linköping, Sweden

JOHN WILEY & SONS, LTD
Chichester · Weinheim · New York · Brisbane · Singapore · Toronto


Copyright © 2000 by John Wiley & Sons Ltd,
Baffins Lane, Chichester,
West Sussex PO19 1UD, England

National 01243 779777
International (+44) 1243 779777

e-mail (for orders and customer service enquiries): [email protected]

Visit our Home Page on http://www.wiley.co.uk or http://www.wiley.com

All Rights Reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except under the terms of the Copyright, Designs and Patents Act 1988 or under the terms of a licence issued by the Copyright Licensing Agency, 90 Tottenham Court Road, London W1P 9HE, UK, without the permission in writing of the Publisher, with the exception of any material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the publication.

Neither the authors nor John Wiley & Sons Ltd accept any responsibility or liability for loss or damage occasioned to any person or property through using the material, instructions, methods or ideas contained herein, or acting or refraining from acting as a result of such use. The authors and Publisher expressly disclaim all implied warranties, including merchantability or fitness for any particular purpose. There will be no duty on the authors or Publisher to correct any errors or defects in the software.

Designations used by companies to distinguish their products are often claimed as trademarks. In all instances where John Wiley & Sons is aware of a claim, the product names appear in initial capital or all capital letters. Readers, however, should contact the appropriate companies for more complete information regarding trademarks and registration.

Other Wiley Editorial Offices

John Wiley & Sons, Inc., 605 Third Avenue,
New York, NY 10158-0012, USA

Wiley-VCH Verlag GmbH,
Pappelallee 3, D-69469 Weinheim, Germany

Jacaranda Wiley Ltd, 33 Park Road, Milton,
Queensland 4064, Australia

John Wiley & Sons (Canada) Ltd, 22 Worcester Road,
Rexdale, Ontario M9W 1L1, Canada

John Wiley & Sons (Asia) Pte Ltd, 2 Clementi Loop #02-01,
Jin Xing Distripark, Singapore 129809

British Library Cataloguing in Publication Data

A catalogue record for this book is available from the British Library

ISBN 0 471 49287 6

Produced from PostScript files supplied by the authors.
Printed and bound in Great Britain by Antony Rowe, Chippenham, Wilts.
This book is printed on acid-free paper responsibly manufactured from sustainable forestry, in which at least two trees are planted for each one used for paper production.


Contents

Preface ix

Part I: Introduction 1

1 Extended summary 3
   1.1. About the book ............................... 3
   1.2. Adaptive linear filtering .................... 8
   1.3. Change detection ............................ 17
   1.4. Evaluation and formal design ................ 26

2 Applications 31
   2.1. Change in the mean model .................... 32
   2.2. Change in the variance model ................ 35
   2.3. FIR model ................................... 37
   2.4. AR model .................................... 39
   2.5. ARX model ................................... 42
   2.6. Regression model ............................ 46
   2.7. State space model ........................... 49
   2.8. Multiple models ............................. 49
   2.9. Parameterized non-linear models ............. 51

Part II: Signal estimation

3 On-line approaches 57
   3.1. Introduction ................................ 57
   3.2. Filtering approaches ........................ 59
   3.3. Summary of least squares approaches ......... 59
   3.4. Stopping rules and the CUSUM test ........... 63


   3.5. Likelihood based change detection ........... 70
   3.6. Applications ................................ 81
   3.A. Derivations ................................. 84

4 Off-line approaches 89
   4.1. Basics ...................................... 89
   4.2. Segmentation criteria ....................... 91
   4.3. On-line local search for optimum ............ 94
   4.4. Off-line global search for optimum .......... 98
   4.5. Change point estimation .................... 102
   4.6. Applications ............................... 106

Part III: Parameter estimation 111

5 Adaptive filtering
   5.1. Basics ..................................... 114
   5.2. Signal models .............................. 115
   5.3. System identification ...................... 121
   5.4. Adaptive algorithms ........................ 133
   5.5. Performance analysis ....................... 144
   5.6. Whiteness based change detection ........... 148
   5.7. A simulation example ....................... 149
   5.8. Adaptive filters in communication .......... 153
   5.9. Noise cancelation .......................... 167
   5.10. Applications .............................. 173
   5.11. Speech coding in GSM ...................... 185
   5.A. Square root implementation ................. 189
   5.B. Derivations ................................ 190

6 Change detection based on sliding windows 205
   6.1. Basics ..................................... 205
   6.2. Distance measures .......................... 211
   6.3. Likelihood based detection and isolation ... 218
   6.4. Design optimization ........................ 225
   6.5. Applications ............................... 227

7 Change detection based on filter banks 231
   7.1. Basics ..................................... 231
   7.2. Problem setup .............................. 233


   7.3. Statistical criteria ....................... 234
   7.4. Information based criteria ................. 240
   7.5. On-line local search for optimum ........... 242
   7.6. Off-line global search for optimum ......... 245
   7.7. Applications ............................... 246
   7.A. Two inequalities for likelihoods ........... 252
   7.B. The posterior probabilities of a jump sequence ... 256

Part IV: State estimation 261

8 Kalman filtering
   8.1. Basics ..................................... 264
   8.2. State space modeling ....................... 267
   8.3. The Kalman filter .......................... 278
   8.4. Time-invariant signal model ................ 286
   8.5. Smoothing .................................. 290
   8.6. Computational aspects ...................... 295
   8.7. Square root implementation ................. 300
   8.8. Sensor fusion .............................. 306
   8.9. The extended Kalman filter ................. 313
   8.10. Whiteness based change detection using the Kalman filter ... 324
   8.11. Estimation of covariances in state space models ... 326
   8.12. Applications .............................. 327

9 Change detection based on likelihood ratios 343
   9.1. Basics ..................................... 343
   9.2. The likelihood approach .................... 346
   9.3. The GLR test ............................... 349
   9.4. The MLR test ............................... 353
   9.5. Simulation study ........................... 365
   9.A. Derivation of the GLR test ................. 370
   9.B. LS-based derivation of the MLR test ........ 372

10 Change detection based on multiple models 377
   10.1. Basics .................................... 377
   10.2. Examples of applications .................. 378
   10.3. On-line algorithms ........................ 385
   10.4. Off-line algorithms ....................... 391
   10.5. Local pruning in blind equalization ....... 395


   10.A. Posterior distribution .................... 397

11 Change detection based on algebraical consistency tests 403
   11.1. Basics .................................... 403
   11.2. Parity space change detection ............. 407
   11.3. An observer approach ...................... 413
   11.4. An input-output approach .................. 414
   11.5. Applications .............................. 415

Part V: Theory 425

12 Evaluation theory 427
   12.1. Filter evaluation ......................... 427
   12.2. Evaluation of change detectors ............ 439
   12.3. Performance optimization .................. 444

13 Linear estimation 451
   13.1. Projections ............................... 451
   13.2. Conditional expectations .................. 456
   13.3. Wiener filters ............................ 460

A. Signal models and notation 471

B. Fault detection terminology 475

Bibliography 477

Index 493


Preface

This book is rather broad in that it covers many disciplines regarding both mathematical tools (algebra, calculus, statistics) and application areas (airborne, automotive, communication and standard signal processing and automatic control applications). The book covers all the theory an applied engineer or researcher can ask for: from algorithms with complete derivations and their properties to implementation aspects. Special emphasis has been placed on examples, applications with real data and case studies for illustrating the ideas and what can be achieved. There are more than 130 examples, of which at least ten are case studies that are reused on several occasions in the book. The practitioner who wants to get a quick solution to his problem may try the 'student approach' to learning, by studying standard examples and using pattern recognition to match them to the problem at hand.

There is a strong connection to MATLAB™. There is an accompanying toolbox where each algorithm in the book is implemented as one function, each example is one demo, and where algorithm design, tuning, testing and learning are all preferably done in the graphical user interface. A demo version of the toolbox is available for download from http://www.sigmoid.se. The demo toolbox makes it possible to reproduce all examples in the book in a simple way, for instance by typing book('ex1.1'), so all 250 figures or so are completely reproducible. Further, it might be instructive to tune the design parameters and compare different methods. The toolbox works under MATLAB™, but to some extent also under the freeware clone Octave. From the home page, exercises can be downloaded, about half of which concern computer simulations, where the toolbox is useful. Further information can be found on the URLs http://www.wiley.co.uk/commstech/gustafsson.html and http://www.comsys.isy.liu.se/books/adfilt.

It might be interesting to note that the toolbox and its structure came before the first plans of writing a book. The development of the toolbox started during a sabbatical visit at Newcastle University in 1993. The outline and structure of the book have borrowed many features from the toolbox.

This book was originally developed during several courses with major revisions in between them: mini-courses at the Nordic Matlab Conference 1997 (50 participants), a course at SAAB during summer 1998 (25 participants), ABB Corporate Research September 1998 (10 participants), and a graduate course for the graduate school Ecsel at Linköping University 1998 and 1999


(25 participants). Parts of the material have been translated into Swedish for the model-based part of a book on Digital Signal Processing, where about 70 undergraduate students participate each year at Linköping University.

My interest in this area comes from two directions: the theoretical side, beginning with my thesis and studies/lecturing in control theory, signal processing and mathematical statistics; and the practical side, from the applications I have been in contact with. Many of the examples in this book come from academic and professional consulting. A typical example of the former starts with an email request on a particular problem, where my reply is "Give me representative data and a background description, and I'll provide you with a good filter". Many of the examples herein are the result of such informal contacts. Professionally, I have consulted for the automotive and aerospace industries, and for the Swedish defense industry. There are many industries that have raised my interest in this area and fruitfully contributed to a set of benchmark examples. In particular, I would like to mention Volvo Car, SAAB Aircraft, SAAB Dynamics, ABB Corporate Research, Ericsson Radio, Ericsson Components and Ericsson-SAAB Avionics. In addition, a number of companies and helpful contacts are acknowledged at the first appearance of each real-data example. The many industrial contacts, acquired during the supervision of some 50 master's theses, at least half of them in target tracking and navigation, have also been a great source of inspiration.

My most challenging task at the time of finishing this book is to participate in bringing various adaptive filters and change detectors into vehicular systems. For NIRA Dynamics (http://www.niradynamics.se), I have published a number of patents on adaptive filters, Kalman filters and change detection, which are currently in the phase of implementation and evaluation.

Valuable comments and proof reading are gratefully acknowledged to many of the course attendants, my colleagues and co-authors. There are at least 30 people who have contributed to the errata sheets during the years. I have received a substantial number of constructive comments that I believe improved the content. In general, the group of automatic control has a quite general competence area that has helped me a lot, for instance with creating a LaTeX style file that satisfies me. In particular, in alphabetical order, I wish to mention Dr. Niclas Bergman, Dr. Fredrik Gunnarsson, Dr. Johan Hellgren, Rickard Karlsson MSc, Dr. Magnus Larsson, Per-Johan Nordlund MSc, Lic. Jan Palmqvist, Niclas Persson MSc, Dr. Predrag Pucar, Dr. Anders Stenman, Mikael Tapio MSc, Lic. Fredrik Tjärnström and Måns Östring MSc. My co-authors of related articles are also acknowledged; from some of these I have rewritten some material.

Finally, a project of this kind would not be possible without the support of an understanding family. I'm indebted to Lena and to my daughters, Rebecca and Erica. Thanks for letting me take your time!


Index

C criterion, 125
criterion, 125
χ² test, 65, 79, 217
MATLAB™, ix
a posteriori distribution, 97
   changing regression, 235
   state change, 360
a posteriori probability, 92
   state, 399
abrupt changes, 142
Acoustic Echo Cancelation, 115, 167
adaptive control, 5, 42
Adaptive Forgetting through Multiple Models, 390
AEC, 115, 167
AFMM, 390
AIC, 125, 241
AIC, corrected, 125
Air Traffic Control, 271, 328, 330
alarming, 58
algorithm
   AFMM, 390
   blind equalization, 166, 395
      CMA, 165
      decision directed, 165
      modulus restoral, 165
      multiple model, 395
   blind equalization algorithm, 395
   Brandt's GLR, 79
   CUSUM, 66
   CUSUM LS, 68
   CUSUM RLS, 69
   decentralized KF, 312
   extended least squares, 121
   Gauss-Newton, 129
   Gauss-Newton ARMAX, 130
   Gibbs change detection, 394
   Gibbs-Metropolis change detection, 392
   GLR, 350
   GPB, 388
   IMM, 389
   information KF, 309
   KF, 279
   KF parameter estimation, 142
   likelihood signal detection, 75
   LMS, 134
   local search, 94, 244, 386
   MCMC segmentation, 246
   MCMC signal segmentation, 102
   multi-step, 143
   multiple model pruning, 386
   Newton-Raphson, 127
   NLMS, 136
   optimal segmentation, 256
   parameter and variance detection, 224
   parity space detection, 409
   recursive parameter segmentation, 244
   recursive signal segmentation, 94
   RLS, 138, 192
   smoothing KF, 293, 295
   SPRT, 65
   square root KF, 302-305
   stationary KF, 286
   steepest descent, 126


   stochastic gradient, 126
   two-filter MLR, 358
   Viterbi, 161
   Wiener filter, 464
   WLS, 140
AR, 118, 472
arg min, 60
ARL, 29, 440
ARMA, 120
ARMAX, 120
ARX, 119, 473
association, 330
asymptotic local approach, 149, 214
ATC, 271, 328, 330
Auto-Regressive, 118
Auto-Regressive model with eXogenous input, 119
Auto-Regressive Moving Average, 120
Auto-Regressive Moving Average model with eXogenous input, 120
auto-tuning, 431
Average Run Length, 440
average run length function, 29
Bayes' rule, 92
bearing only sensors, 329
bearings only tracking, 334
BER, 154
Bernoulli variables, 235
bias error, 144, 298, 429
BIC, 125, 241
Bierman's UD factorization, 190
Binary Phase Shift Keying, 153
Bit Error Rate, 154
blind equalization, 5, 115, 382, 390
blind equalization algorithm, 166
BPSK, 153
Brandt's GLR method, 212
burn-in time, 438
causal, 427
causal Wiener filter, 463
CE estimate, 399
central fusion, 307
change detection, 58
change in the mean, 34
change in the mean model, 58, 471
change in the variance model, 472
change point estimation, 89, 102
change time, 58
changing regression, 233
   a posteriori probability, 235
   generalized likelihood, 235
   marginalized likelihood, 235
   state space model, 233
clutter, 330
CM, 457
CMA, 165
CMV, 457
compensation, 346
conditional expectation, 62
Conditional Mean, 457
conditional mean estimate, 457
Conditional Minimum Variance, 457
confidence region, 399
Constant Modulus Algorithm, 165
coordinated turns, 271
Correct Past Decisions, 159
CPD, 159
curse of dimensionality, 90, 232
CUSUM, 65
cut off branches, 385
dead-beat observer, 413
decentralized filters, 311
decentralized fusion, 307
decision directed, 165
decision error, 158
decision feedback equalizer, 158
Decision-directed Feedback, 390
decoupling, 405
dedicated observer, 406
Delay For Detection, 440


density function
   linear regression
      off-line, 196
      on-line, 195
design parameters
   local search, 245
detection, 6, 381, 389
Detection-Estimation Algorithm, 391
deterministic disturbance, 404
deterministic least squares, 122
DFD, 440
DGPS, 341
diagnosis, 6, 218, 475
Differential GPS, 341
digital communication, 114
distance function, 403
distance measure, 19, 64, 76, 86, 211, 324
   Kalman filter, 324
divergence, 296
divergence test, 209, 213
double talk, 229
dynamic programming, 90, 233
echo path change, 229
EM, 391
equalization, 114, 153, 382, 390
equalizer, 153
   decision feedback, 158
   linear, 155
   minimum variance, 160
   Viterbi, 160
   zero forcing, 159
estimation, 58
example, see signals
excessive mean square error, 145
Expectation Maximization, 391
exponential forgetting window, 59
Extended Kalman Filters, 316
extended least squares, 121, 129
   medical example, 173
factorial, 85
failure signature matrix, 351
false alarm rate, 28, 440
FAR, 28, 440
far-field scattering, 118
fault decoupling, 405
fault detection, 6, 475
fault isolation, 18
FDI, 6
filtered-input LMS, 168
filtered-X LMS, 168
filtering, 59
Final Prediction Error, 124
Finite Impulse Response, 37, 117, 428
Finite Moving Average, 59
FIR, 37, 117, 428, 472
FMA, 59
forgetting factor, 9, 61, 138
forward dynamic programming, 162
forward-backward, 292
FPE, 124
frequency selective fading, 117
fusion filter, 311
fusion formula, 311
gamma distribution, 85
gamma function, 85, 401
Gauss-Newton algorithm, 129
Gaussian mixture, 379, 383, 438
generalized likelihood
   changing regression, 235, 240
Generalized Likelihood Ratio, 87, 209, 345
generalized observer, 406, 413
Generalized Pseudo Bayes, 391
Generalized Pseudo-Bayesian, 388
Generalized Viterbi Algorithm, 390
Geometric Moving Average, 59
Gibbs change detection algorithm, 394


Gibbs sampler, 392
Gibbs sampling, 438
Gibbs-Metropolis change detection algorithm, 392
GLR, 87, 209, 345
   algorithm, 350
GMA, 59
Godard, 165
GPB, 388
GPS, 337
Hankel matrix, 404
Heisenberg's uncertainty, 140
hidden Markov model, 142
Hilbert space, 453
hyper model, 274
hypothesis test
   χ², 65, 79, 217
   Gaussian, 79
i.i.d., 437
IIR, 428
IMM, 389
Inertial Navigation System, 307
Infinite Impulse Response, 428
Information Criterion A, 125
Information Criterion B, 125
information filter, 312
input estimator, 344
input observer, 344, 406
INS, 307
Inter-Symbol Interference, 153
Interacting Multiple Model, 391
inverse system identification, 114
inverse Wishart, 223
ISI, 153
isolation, 6, 45, 218, 405
iterated Kalman filter, 317
Jensen's inequality, 430
jerk model, 270
jump linear model, 384
jumping regression, 234
k-step ahead prediction, 428
Kalman filter, 11, 15, 62
   iterated, 317
   scale invariance, 400
Kalman smoother, 467
Kullback discrimination information, 212
Kullback divergence, 212
law of total probability, 399
leaky LMS, 136
learning curves, 126
Least Mean Square, 11, 61, 134
least squares, 60
   deterministic, 122
   stochastic, 122
least squares over sliding window, 11
likelihood
   changing regression, 235
   state change, 362
Likelihood Ratio, 75, 208, 345
Lindley's paradox, 239
linear equalizer, 155
linear estimators, 62
linear filter, 427
linear regression, 115, 472
linear regressions, 370
LLR, 208
LMS, 61, 134, 391
   leaky, 136
   sign data, 136
   sign error, 136
   sign-sign, 136
   state space model, 233
local approach, 205, 210, 214
local scattering, 117
local search, 386
   design parameters, 245
log likelihood ratio, 208


long term prediction, 188
loss function, 207
LQG, 6
LR, 75, 208, 345
Luenberger observers, 410
MAP, 92, 235, 457
MAP estimate
   state, 397, 399
Maple, 317, 321
MAPSD, 395
marginalization, 400
marginalized likelihood
   changing regression, 235
Marginalized Likelihood Ratio, 345, 353
Markov chain, 384
Markov Chain Monte Carlo, 101, 438
Markov models, 391
matched filter, 164
Mathematica, 317, 321
matrix inversion lemma, 192
Maximum A Posteriori, 457
maximum a posteriori, 92
Maximum A posteriori Probability, 235
Maximum A Posteriori Sequence Detection, 395
Maximum Generalized Likelihood, 74, 240
Maximum Likelihood, 71, 236
maximum likelihood, 73
Maximum Marginalized Likelihood, 74
MCMC, 90, 101, 233, 438
MD, 209
MDL, 125, 150, 181, 241, 446
MDR, 440
Mean Time between False Alarms, 440
mean time between false alarms, 28
Mean Time to Detection, 440
mean time to detection, 29
measurement update, 385
medical diagnosis, 40
merging, 379, 385
Metropolis step, 392
MGL, 74
MIMO, 121
minimal order residual filter, 420
minimizing argument, 60
Minimum Description Length, 181, 241, 446
minimum description length, 125
minimum mean square error, 122
Minimum Variance, 458
minimum variance equalizer, 160
minimum variance estimator, 62
misadjustment, 145
Missed Detection Rate, 440
missing data, 384
ML, 71, 236
MLR, 345, 353
   two-filter algorithm, 358
MML, 74
mode, 378, 474
model
   AR, 472
   ARMA, 120
   ARMAX, 120
   ARX, 473
   change in the mean, 471
   change in the variance, 472
   FIR, 472
   linear regression, 115, 472
   linear state space, 473
   MIMO, 121
   multi-, 474
   non-linear parametric, 473
   OE, 120
   SISO, 121


   state space with additive changes, 473
model differences, 209
model structure selection, 382
model validation, 77, 205
modulus restoral, 165
Monte Carlo, 429
most probable branch, 386
motion model, 16
MTD, 29, 440
MTFA, 28, 440
multi-input multi-output, 121
multi-rate signal processing, 279
multi-step algorithm, 143
multiple model pruning algorithm, 386
MV, 458
navigation, 6
near-field scattering, 117
Newton-Raphson algorithm, 127
Neyman-Pearson Lemma, 349
NLMS, 136
noise cancelation, 115, 167
non-causal filter, 427
non-causal Wiener filter, 461
non-parametric approach, 58
non-stationary signal, 5
Normalized LMS, 136
nuisance, 70
observability, 297
observer, 15, 287, 406
observer companion form, 272
Ockam's razor, 124
Ockham's razor, 93
Octave, ix
OE, 120
optimal segmentation, 256
optimal simulation, 143
ordered statistics, 105
outlier, 176, 227
outliers, 296, 383
Output Error, 120
parameter covariance
   asymptotic, 199
parameter tracking, 5
parity equation, 405
parity space, 407
Parseval's formula, 160
parsimonious principle, 93, 124
peak error, 430
penalty term, 90, 93, 124, 232, 241
PLS, 125
point mass filter, 73
Predictive Least Squares, 125
primary residual, 215
projection theorem, 282, 452
prune, 385
pruning, 379
pseudo-linear regression, 119
QR factorization, 301
QR-factorization, 189
quasi-score, 215
radar, 329
random walk, 62, 142
Rauch-Tung-Striebel formulas, 293
Rayleigh distribution, 117
Recursive Least Squares, 138, 191
recursive least squares, 11, 61
recursive maximum likelihood, 130
recursive parameter segmentation, 244
recursive signal segmentation, 94
reduced state space estimation, 390
regularization, 139
resampling techniques, 432
residual, 404
residual generation, 18
residual structure, 405
Rice distribution, 117


RLS, 61, 138, 191
   algorithm, 192
   windowed, 193
RMSE, 429
robustness, 406
Root Mean Square Error, 429
run test, 66
Sato, 165
Schwartz criterion, 125, 241
segmentation, 29, 58, 89, 92, 94, 231, 244, 381, 389
selected availability, 341
self-tuning, 137
sensitivity, 406
sensitivity analysis, 299
SER, 155
shortest route problem, 162
sign data algorithm, 136
sign test, 105
sign-error algorithm, 136
sign-sign algorithm, 136
signal
   airbag, 32
   aircraft altitude, 36, 107
   aircraft dynamics, 419
   belching sheep, 45, 227
   DC motor, 42, 49, 173, 327, 417
   earthquake, 40
   econometrics, 35
   EKG, 48, 247
   electronic nose, 52
   friction, 12, 21, 23, 27, 175
   fuel consumption, 9, 20, 26, 81
   human EEG, 39, 173
   NMT sales, 132
   nose, 132
   paper refinery, 33, 82
   path, 46, 249
   photon emission, 34, 106
   rat EEG, 36, 39, 108, 123, 227
   speech, 42, 248
   target tracking, 328
   telephone sales figures, 53
   tracking, 15, 21, 28
   valve stiction, 50
signal estimation, 5
signal processing, 59
Signal-to-Noise Ratio, 155
Single-Input Single-Output, 121
Singular Value Decomposition, 407
SISO, 121
sliding window, 59, 205
smoothing, 68, 360, 427
SNR, 155
spectral analysis, 140
spectral factorization, 290, 463
specular multi-path, 117
spread of the mean, 387, 399
square root, 301
state estimation, 5
state feedback, 6
state space, 14
state space model, 233, 473
   algebraic methods, 404
state space partitioning, 390
static friction, 50
steepest descent algorithm, 126
step size, 62
step-size, 134
stiction, 50
Stirling's formula, 85, 225, 259
stochastic gradient algorithm, 62, 126, 137
stochastic least squares, 122
stopping rule, 19, 63, 442
super formula, 289
surveillance, 5, 58
SVD, 407
switch, 218
Symbol Error Rate, 155
system identification, 114, 128


target tracking, 6
time update, 385
time-invariant filter, 427
toolbox, ix
tracking error, 429
training sequence, 155
transient error, 144
trellis diagram, 162
two-filter smoothing formula, 293
unknown input observer, 277
variance change, 75
variance error, 144, 429
Viterbi algorithm, 161
Viterbi equalizer, 160
voting, 406
Wiener-Hopf equations, 461
Wiener-Hopf equation, 122
Windowed Least Squares, 61, 140
windowing, 59
Wishart distribution, 85, 258
WLS, 140
z-transform, 462
zero forcing equalizer, 159

Page 18: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 18/495

Part I: Introduction


1
Extended summary

1.1. About the book ............................... 3
     1.1.1. Outlook ............................... 3
     1.1.2. Aim ................................... 4
     1.1.3. Background knowledge .................. 6
     1.1.4. Outline and reading advice ............ 7
1.2. Adaptive linear filtering .................... 8
     1.2.1. Signal estimation ..................... 9
     1.2.2. Parameter estimation using adaptive filtering ... 11
     1.2.3. State estimation using Kalman filtering ... 13
1.3. Change detection ............................ 17
     1.3.1. Filters as residual generators ........ 17
     1.3.2. Stopping rules ........................ 18
     1.3.3. One-model approach .................... 19
     1.3.4. Two-model approach .................... 22
     1.3.5. Multi-model approach .................. 23
1.4. Evaluation and formal design ................ 26
     1.4.1. General considerations ................ 26
     1.4.2. Performance measures .................. 28

1.1. About the book

1.1.1. Outlook

The areas of adaptive filtering and change (fault) detection are quite active fields, both in research and applications. Some central keywords of the book are listed in Table 1.1, and the figures, illustrated in Figure 1.1, give an idea of the relative activity in the different areas. For comparison, the two related and well established areas of adaptive control and system identification are included in the table. Such a search gives a quick idea of the size of the areas, but there are of course many shortcomings, and the comparison may be unfair in several instances. Still, it is interesting to see that the theory has reached many successful applications, which is directly reflected in the


Table 1.1. Keywords and number of hits (March 2000) in different databases. For ScienceDirect the maximum number of hits is limited to 2000. On some of the rows, the logical 'or' is used for related keywords like 'adaptive signal processing or adaptive estimation or adaptive filter'.

Keyword                          IEL      ScienceDirect   IBM patent
Adaptive filter/estimation/SP    4661     952             871
Kalman filter                    1921     1642            317
Adaptive equalizer (Eq)          479      74              291
Target tracking                  890      124             402
Fault diagnosis (FDI)            2413     417             74
Adaptive control (AC)            4563     2000            666
(System) Identification (SI)     8894     2000            317
Total number of items            588683   856692          2582588

Figure 1.1. Relative frequency of keywords in different databases.

number of patents. Browsing the titles also indicates that many journal and conference publications concern applications. Figure 1.2 reveals the, perhaps well known, fact that the communication industry is more keen to hold patents (here: equalization). Algorithms aimed at real-time implementation are also, of course, more often subject to patents, compared to, for instance, system identification, which is a part of the design process.

Table 1.2 lists a few books in these areas. It is not meant to be comprehensive, only to show a few important monographs in the respective areas.

1.1.2. Aim

The aim of the book is to provide theory, algorithms and applications of adaptive filters with or without support from change detection algorithms. Applications in these areas can be divided into the following categories:


Figure 1.2. Relative ratio of number of patents found in the IBM database compared to publications in IEL for different keywords.

Table 1.2. Related books.

Keyword                  Books
Adaptive filters         Haykin (1996), Mulgrew and Cowan (1988), Widrow and Stearns (1985), Cowan and Grant (1985)
Kalman filter            Kailath et al. (1998), Minkler and Minkler (1990), Anderson and Moore (1979), Brown and Hwang (1997), Chui and Chen (1987)
Adaptive equalizer       Proakis (1995), Haykin (1994), Gardner (1993), Mulgrew and Cowan (1988)
Target tracking          Bar-Shalom and Fortmann (1988), Bar-Shalom and Li (1993), Blackman (1986)
Fault diagnosis          Basseville and Nikiforov (1993), Gertler (1998), Chen and Patton (1999), Mangoubi (1998)
Adaptive control         Åström and Wittenmark (1989), Goodwin and Sin (1984)
System identification    Ljung (1999), Söderström and Stoica (1989), Johansson (1993)

- Surveillance and parameter tracking. Classical surveillance problems consist in filtering noisy measurements of physical variables such as flows, temperatures, pressures etc., which will be called signal estimation. Model-based approaches, where (time-varying) parameters in a model of a non-stationary signal need to be estimated, is a problem of parameter tracking. Adaptive control belongs to this area. Another example is blind equalization in digital communication.

- State estimation. The Kalman filter provides the best linear state estimate, and change detection support can be used to speed up the


response after disturbances and abrupt state changes. Feedback control using state feedback, such as Linear Quadratic Gaussian (LQG) control, belongs to this area. Navigation and target tracking are two particular application examples.

- Fault detection. Faults can occur in almost all systems. Change detection here has the role of locating the fault occurrence in time and giving a quick alarm. After the alarm, isolation is often needed to locate the faulty component. The combined task of detection and isolation is commonly referred to as diagnosis. Fault detection can be recast to one of parameter or state estimation. Faults in actuators and sensors are most easily detected in a state space context, while system dynamic changes often require parametric models.

These problems are usually treated separately in the literature in the areas of signal processing, mathematical statistics, automatic control, communication systems and quality control. However, the tools for solving these problems have much in common, and the same type of algorithms can be used (C.R. Johnson, 1995). The close links between these areas are clearly under-estimated in the literature.

The main difference of the problem areas above lies in the evaluation criteria. In surveillance, the parameter estimate should be as close as possible to the true value, while in fault detection it is essential to get an alarm from the change detector as soon as possible after the fault, and at the same time generate few false alarms. In fault detection, isolation of the fault is also a main task. The combination of fault detection and isolation is often abbreviated to FDI, and the combined task can be referred to as diagnosis. More terminology used in this area is found in Appendix B.

The design usually consists of the following steps:

1. Modeling the signal or system.

2. Implementing an algorithm.

3. Tuning the algorithm with respect to certain evaluation criteria, either using real or simulated data.

The main focus is on algorithms and their properties, implementation, tuning and evaluation. Modeling is covered only briefly, but the numerous examples should give an idea of the possibilities of model-based signal processing.

1.1.3. Background knowledge

The derivations and analysis can be divided into the following areas, and some prior knowledge, or at least orientation, of these is required:


- Statistical theory: maximum likelihood, conditional distributions etc.

- Calculus: integrations, differentiations, equation solving etc.

- Matrix algebra: projections, subspaces, matrix factorizations etc.

- Signal modeling: transfer functions and state space models.

- Classical filter theory: the use of a low-pass filter for signal conditioning, poles and zeros etc. Transforms and frequency domain interpretations occur, but are relatively rare.

To use the methods, it is essential to understand the model and the statistical approach. These are explained in each chapter in a section called 'Basics'. These sections should provide enough information for understanding and tuning the algorithms. A deeper understanding requires the reader to go through the calculus and matrix algebra in the derivations. The practitioner who is mainly interested in what kind of problems can be addressed is advised to start with the examples and applications sections.

1.1.4. Outline and reading advice

There are certain shortcuts to approaching the book, and advice on how to read the book is appropriate. Chapter 1 is a summary and overview of the book, while Chapter 2 overviews possible applications and reviews the basic mathematical signal models. These first two chapters should serve as an overview of the field, suitable for those who want to know what can be done rather than how it is done. Chapters 3, 5 and 8, the first chapter in each part, are the core chapters of the book, where standard approaches to adaptive filtering are detailed. These can be used independently of the rest of the material. The other chapters start with a section called 'Basics', which can also be considered as essential knowledge. Part V is a somewhat abstract presentation of filter theory in general, without using explicit signal models. It is advisable to check the content at an early stage, but the reader should in no way spend too much time trying to digest all of the details. Instead, browse through and return to the details later. However, the ideas should be familiar before starting with the other parts. The material can be used as follows:

- Chapters 1 and 2 are suitable for people from within industry who want an orientation in what adaptive filtering is, and what change detection can add to performance. An important goal is to understand what kind of practical problems can be solved.

- Chapters 5, 8 and 13 are suitable for an undergraduate course in adaptive filtering.


1.2. Adaptive linear filtering

1.2.1. Signal estimation

The basic signal estimation problem is to estimate the signal part θt in the noisy measurement yt in the model

    yt = θt + et.                                          (1.1)

An example of an adaptive algorithm is

    θ̂t = λt θ̂t−1 + (1 − λt) yt.                           (1.2)

Here λt will be referred to as the forgetting factor. It is a design parameter that affects the tracking speed of the algorithm. As will become clear from the examples to follow, it is a trade-off between tracking speed and noise attenuation. The archetypical example is to use λt = λ, when this also has the interpretation of the pole in a first order low-pass filter. More generally, any (low-pass) filter can be used. If it is known that the signal level has undergone an abrupt change, as might be indicated by a change detection algorithm, then there is a possibility to momentarily forget all old information by setting λt = 0 once. This is an example of decision feedback in an adaptive filter, which will play an important role in change detection. An illustrative surveillance problem is given below.

Example 1.1 Fuel consumption

The following application illustrates the use of change detection for improving signal quality. The data consist of measurements of instantaneous fuel consumption available from the electronic injection system in a Volvo 850 GLT used as a test car. The raw data are pulse lengths of a binary signal, the control signal from the electronic injection system to the cylinders. When this signal is high, fuel is injected with roughly constant flow, so the length of the pulses is a measure of fuel consumption. The measured signal contains a lot of measurement noise and needs some kind of filtering before being displayed to the driver on the dashboard. Intuitively, the actual fuel consumption cannot change arbitrarily fast, and the measured signal must be smoothed by a filter. There are two requirements on the filter:

- Good attenuation of noise is necessary to be able to tune the accelerator during cruising.

- Good tracking ability. Tests show that fuel consumption very often changes abruptly, especially in city traffic.


Figure 1.3. Measurements of fuel consumption and two candidate filters. Data collected by the author in a collaboration with Volvo.

These requirements are contradictory for standard linear filters. The thin lines in Figure 1.3 show measurements of fuel consumption for a test in city traffic. The solid lines show the result of (1.2) for two particular values of the forgetting factor λ. The fast filter follows the abrupt changes well, but attenuates the noise unsatisfactorily, and it is the other way around for the slow filter. The best compromise is probably somewhere in between these filters.

The fundamental trade-off between speed and accuracy is inherent in all linear filters. Change detectors provide a tool to design non-linear filters with better performance for the type of abruptly changing signal in Figure 1.3.

Figure 1.4 shows the raw data, together with a filter implemented by Volvo (not exactly the same filter, but the principal functionality is the same). Volvo uses a quite fast low-pass filter to get good tracking ability, and then quantizes the result to a multiple of 0.3 to attenuate some of the noise. To avoid a rapidly changing value in the monitor, they update the monitored estimate only once a second. However, the quantization introduces a problem when trying to minimize fuel consumption manually, and the response time to changes of one second makes the feedback information to the driver less useful.


Figure 1.4. Measured fuel consumption and a filtered signal similar to Volvo's implemented filter.
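As a minimal illustration of the filter (1.2) and of decision feedback (not a listing from the book or its toolbox; the synthetic data, the change time and all variable names are made up for the sketch), the recursion can be written in a few lines of Python:

    import numpy as np

    def forgetting_filter(y, lam, reset_times=()):
        # Adaptive filter (1.2): theta[t] = lam*theta[t-1] + (1 - lam)*y[t].
        # Setting the forgetting factor to zero at a detected change time
        # momentarily forgets all old information (decision feedback).
        theta = np.zeros(len(y))
        est = y[0]
        for t, yt in enumerate(y):
            lam_t = 0.0 if t in reset_times else lam
            est = lam_t * est + (1 - lam_t) * yt
            theta[t] = est
        return theta

    # Synthetic signal with an abrupt mean change at t = 100.
    rng = np.random.default_rng(0)
    truth = np.where(np.arange(200) < 100, 10.0, 15.0)
    y = truth + rng.normal(0.0, 1.0, 200)

    slow = forgetting_filter(y, lam=0.95)                      # smooth but slow
    fast = forgetting_filter(y, lam=0.70)                      # fast but noisy
    reset = forgetting_filter(y, lam=0.95, reset_times={100})  # slow + decision feedback

The reset variant combines the noise attenuation of the slow filter with fast recovery after the change, which is precisely the role the change detector is meant to play.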

1.2.2. Parameter estimation using adaptive filtering

A quite general parametric model of a linear system is

    yt = G(q; θ)ut + H(q; θ)et.                            (1.3)

Here G(q; θ) and H(q; θ) are two filters expressed in the delay operator q⁻¹, defined by q⁻¹ut = ut−1. The parameters θ in the model are assumed to be time-varying, and are to be estimated recursively. Here and in the sequel, ut denotes measured inputs to the system, if available, and et is an unknown input supposed to be modeled as a stochastic noise disturbance.

The generic form of an adaptive algorithm is

    θ̂t = θ̂t−1 + Kt εt,                                    (1.4)
    εt = yt − ŷt.

The output from the estimated model is compared to the system output, which defines the residual εt. The adaptive filter acts as a system inverse, as depicted in Figure 1.5. One common filter that does this operation for models that are linear in the parameters is the recursive least squares (RLS) filter. Other well known filters are Least Mean Square (LMS) (unnormalized or normalized), the Kalman filter, and least squares over a sliding window. This book focuses on parametric models that are linear in the parameters (not necessarily linear in the measurements). The reason for this is that the statistical approaches become optimal in this case. How to obtain sub-optimal algorithms for the general linear filter model will be discussed.


Figure 1.5. The interplay between the parameter estimator and the system.
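As a concrete instance of the generic algorithm (1.4), the sketch below gives the standard recursive least squares (RLS) recursion with a forgetting factor for a model that is linear in the parameters, yt = φtᵀθt + et. It is a plain textbook form in Python, not the toolbox implementation, and the initialization constants are arbitrary:

    import numpy as np

    def rls(phi, y, lam=0.99, p0=100.0):
        # Recursive least squares with forgetting factor lam.
        # phi: (N, d) regressors, y: (N,) measurements.
        n, d = phi.shape
        theta = np.zeros(d)
        P = p0 * np.eye(d)                    # parameter covariance factor
        trace = np.zeros((n, d))
        for t in range(n):
            ph = phi[t]
            eps = y[t] - ph @ theta           # residual, as in (1.4)
            K = P @ ph / (lam + ph @ P @ ph)  # gain K_t
            theta = theta + K * eps           # parameter update
            P = (P - np.outer(K, ph @ P)) / lam
            trace[t] = theta
        return trace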

Example 1.2 Friction estimation

This example introduces a case study that will be studied in Section 5.10.3. It has been noted (Dieckmann, 1992) that the dependence between the so-called wheel slip and traction force is related to friction. The slip is defined as the relative difference of a driven wheel's circumferential velocity, ωr, and its absolute velocity, v:

    s = (ωr − v)/v,                                        (1.5)

where r is the wheel radius. We also define the normalized traction force, μ (sometimes referred to as the friction coefficient), as the ratio of traction force (Ff) at the wheel and normal force (N) on one driven wheel,

    μ = Ff/N.                                              (1.6)

Here μ is computed from measurements, in this case a fuel injection signal, via an engine model. In the sequel we will consider s and μ as measurements.

Define the slip slope k as

    k = ∂μ/∂s.                                             (1.7)

The hypothesis is that k is related to friction, and the problem is to adaptively estimate it.

The goal is to derive a linear regression model, i.e. a model linear in the parameters. The slip slope k we want to compute is defined in (1.7), which for small μ reads (including an offset term δ)

    μ = k(s − δ),                                          (1.8)

where also δ is unknown. Although this is a model linear in the parameters, there are two good reasons for rewriting it as

    s = (1/k)μ + δ.                                        (1.9)


That is, we consider s to be a function of μ rather than the other way around. The reasons are that s contains more noise than μ, and that the parameter δ is varying much more slowly than k. Both these arguments facilitate a successful filter design. Note that all quantities in the basic equation (1.9) are dimensionless.

We will apply a filter where 1/k and δ are estimated simultaneously. The design goals are

- to get accurate values on k while keeping the possibility to track slow variations in both k and δ, and at the same time

- to detect abrupt changes in k rapidly.

This case study will be approached by a Kalman filter supplemented by a change detection algorithm, detailed in Section 5.10.3.

A notationally simplified model that will be used is given by

    yt = ut θt(1) + θt(2) + et,                            (1.10)

where θt = (θt(1), θt(2))ᵀ is the parameter vector consisting of the inverse slope and the offset, ut is the input to the model (the engine torque) and yt is the measured output (the wheel slip).

An example of measurements of yt, ut from a test drive is shown in Figure 1.6(a). The car enters a low friction area at sample 200, where the slope parameter θt(1) increases. Figure 1.6(b) shows a scatter plot of the measurements together with a least squares fit of straight lines, according to the model (1.10), to the measurements before and after the friction change.

Two adaptive filters were applied, one fast and one slow. Figure 1.7 illustrates the basic limitations of linear filters: the slow filter gives a stable estimate, but is too slow in tracking the friction change, while the fast filter gives an output that is too noisy for monitoring purposes. Even the fast filter has problems in tracking the friction change in this example. The reason is poor excitation, as can be seen in Figure 1.6(b). There is only a small variation in ut after the change (circles), compared to the first 200 samples. A thorough discussion on this matter is found in Section 5.10.3.

1.2.3. State estimation using Kalman filteringFor state estimation, we essentially run a model in parallel with th e systemand compute the residual as the difference. Th e model is specified in s t a t e


Figure 1.6. Measurements of yt and ut from a test drive going from dry asphalt to ice (a). A scatter plot (b) reveals the straight line friction model. The data were collected by the author in collaboration with Volvo.

Figure 1.7. Estimated slope and offset as a function of sample number.

    xt+1 = Axt + But + vt,
    yt   = Cxt + et.                                       (1.11)

Here yt is a measured signal, A, B, C are known matrices and xt is an unknown vector (the state). There are three inputs to the system: the observable (and controllable) ut, the non-observable process noise vt, and the measurement noise et.


Figure 1.8. The interplay between the state estimator and the system.

In a statistical setting, state estimation is achieved via a Kalman filter, while in a deterministic setting, where the noises are neglected, the same device is called an observer. A generic form of a Kalman filter and an observer is

    x̂t+1 = Ax̂t + But + Kt(yt − Cx̂t),                      (1.12)

where Kt is the filter gain specifying the algorithm. The Kalman filter provides the equations for mapping the model onto the gain as a function

    Kt = Kt(A, B, C, Q, R).                                (1.13)

The noise covariance matrices Q, R are generally considered to have some degrees of freedom that act as the design parameters. Figure 1.8 illustrates how the Kalman filter aims at 'inverting' the system.
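In code, the generic filter (1.12), with the gain computed from the model as in (1.13), can be sketched as below. This is the standard predict/update form of the time-varying Kalman filter in Python; the initial values x0 and P0 are design assumptions, and the function is not the book's toolbox code:

    import numpy as np

    def kalman_filter(y, A, B, C, Q, R, u=None, x0=None, P0=None):
        # Kalman filter for x[t+1] = A x[t] + B u[t] + v[t], y[t] = C x[t] + e[t].
        n = A.shape[0]
        x = np.zeros(n) if x0 is None else x0
        P = 10.0 * np.eye(n) if P0 is None else P0
        estimates = []
        for t, yt in enumerate(y):
            # Measurement update: correct with the residual (innovation).
            S = C @ P @ C.T + R
            K = P @ C.T @ np.linalg.inv(S)   # the mapping (1.13): model -> gain
            x = x + K @ (yt - C @ x)
            P = P - K @ C @ P
            estimates.append(x.copy())
            # Time update: propagate the estimate through the model.
            x = A @ x + (0 if u is None else B @ u[t])
            P = A @ P @ A.T + Q
        return np.array(estimates)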

Example 1.3 Target tracking

The typical application of target tracking is to estimate the position of aircraft. A civil application is air traffic control (ATC), where the traffic surveillance system at each airport wants to get the position and predicted positions of all aircraft at a certain distance from the airport. There are plenty of military applications where, for several reasons, the position and predicted position of hostile aircraft are crucial information. There are also a few other applications where the target is not an aircraft.

As an illustration of Kalman filtering, consider the problem of target tracking where a radar measures range R and bearing θ to an aircraft. The measurements are shown in Figure 1.9 as circles. These are noisy, and the purpose of the Kalman filter is firstly to attenuate the noise and secondly to predict future positions of the aircraft. This application is described in Section 8.12.2.

One possible model for these tracking applications is the following state space model:


Figure 1.9. Radar measurements (a) and estimates from a Kalman filter (b) for an aircraft in-flight manoeuvre.

             ( 1  0  T  0 )
    xt+T  =  ( 0  1  0  T ) xt + wt,                       (1.14)
             ( 0  0  1  0 )
             ( 0  0  0  1 )

    yt    =  ( 1  0  0  0 ) xt + et.
             ( 0  1  0  0 )

The state vector used here is x = (x1, x2, v1, v2)ᵀ, where (x1, x2) is the position in 2D and (v1, v2) the corresponding velocity. The state equation is one example of a motion model describing the dynamics of the object to be tracked. More examples are given in Chapter 8.

The measurements are transformed to this coordinate system and the Kalman filter is applied. The resulting estimates are marked with stars. We can note that the tracking is poor and, as will be demonstrated, the filter can be better tuned.
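To make (1.14) concrete, the sketch below builds its matrices for a sample interval T and simulates noisy position measurements. The numerical values are illustrative only, not the tuning discussed in Section 8.12.2, and the commented filter call refers to the Kalman filter sketch given earlier:

    import numpy as np

    T = 1.0                                     # sample interval (illustrative)
    A = np.array([[1, 0, T, 0],
                  [0, 1, 0, T],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)   # state transition in (1.14)
    C = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=float)   # position is measured
    Q = 0.1 * np.eye(4)                         # process noise covariance (design knob)
    R = 100.0 * np.eye(2)                       # measurement noise covariance

    # Straight-line trajectory with noisy position measurements.
    rng = np.random.default_rng(1)
    x = np.array([0.0, 0.0, 10.0, 5.0])         # (x1, x2, v1, v2)
    ys = []
    for _ in range(50):
        x = A @ x
        ys.append(C @ x + rng.normal(0.0, 10.0, 2))

    # est = kalman_filter(np.array(ys), A, np.zeros((4, 1)), C, Q, R)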

To summarize what has been said, conventional adaptive filters have the following well-known shortcoming:

Fundamental limitation of linear adaptive filters
The adaptation gain in a linear adaptive filter is a compromise between noise attenuation and tracking ability.


1.3. Change detection

Algorithmically, all proposed change detectors can be put into one of the following three categories:

- Methods using one filter, where a whiteness test is applied to the residuals from a linear filter.

- Methods using two filters, one slow and one fast, in parallel.

- Methods using multiple filters in parallel, each one matched to a certain assumption on the abrupt changes.

In the following subsections, these will be briefly described. Let us note that the computational complexity of the algorithm is proportional to how many filters are used. Before reviewing these methods, we first need to define what is meant by a residual in this context, and we also need a tool for deciding whether a result is significant or not: a stopping rule.

1.3.1. Filters as residual generators

A good understanding of the Kalman and adaptive filters requires a thorough reading of Chapters 5 and 8. However, as a shortcut to understanding statistical change detection, we only need to know the following property, also illustrated in Figure 1.10.

Residual generation
Under certain model assumptions, the Kalman and adaptive filters take the measured signals and transform them to a sequence of residuals that resemble white noise before the change occurs.

From a change detection point of view, it does not matter which filter we use, and the modeling phase can be seen as a standard task. The filters also compute other statistics that are used by some change detectors, but more on this later.


Figure 1.10. A whitening filter takes the observed input u_t and output y_t and transforms them to a sequence of residuals ε_t.


In a perfect world, the residuals would be zero before a change and non-zero afterwards. Since measurement noise and process disturbances are fundamental problems in the statistical approach to change detection, the actual value of the residuals cannot be predicted. Instead, we have to rely on their average behavior.

If there is no change in the system, and the model is correct, then the residuals are so-called white noise, that is, a sequence of independent stochastic variables with zero mean and known variance.

After the change, either the mean or variance or both change, that is, the residuals become 'large' in some sense. The main problem in statistical change detection is to decide what 'large' is.

Chapter 11 reviews how state space models can be used for filtering (or residual generation, as it will be referred to in this context). The idea is to find a set of residuals that is sensitive to the faults, such that a particular fault will excite different combinations of the residuals. The main approach taken in that chapter is based on parity spaces. The first step is to stack all variables into vectors. The linear signal model can then be expressed in the generic form

$$Y_t = \mathcal{O} x_t + H_u U_t + H_d D_t + H_f F_t,$$

where Y_t is a vector of outputs, U_t is a vector of inputs, D_t the disturbances and F_t the faults. The residual is then defined as a projection

$$r_t = W^T Y_t.$$

With proper design of W, the residual will react to certain faults in specific patterns, making fault isolation possible. A simple example is when the measurement is two-dimensional, and the state disturbance and the fault are both scalar functions of time. Then, under certain conditions, it is possible to linearly transform the measurement to ỹ_t = (d_t, f_t)^T. A projection that keeps only the second component can now be used as the residual to detect faults, and it is said that the disturbance is decoupled.
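A minimal numerical sketch of this decoupling idea follows. The stacked matrices O (state) and H_d (disturbance) below are purely illustrative; the point is that a parity matrix W is any basis of the left null space of [O, H_d], so the residual r = WᵀY is insensitive to both the state and the disturbance.

```python
import numpy as np
from scipy.linalg import null_space

# Illustrative stacked matrices: O maps the state and H_d the disturbance
# into the stacked measurement vector Y (values made up for the demo).
O   = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [1.0, 2.0]])
H_d = np.array([[1.0],
                [1.0],
                [1.0]])

# Parity space: W^T O = 0 and W^T H_d = 0, i.e. the columns of W span the
# null space of [O, H_d]^T, which decouples state and disturbance.
W = null_space(np.hstack((O, H_d)).T)

def residual(Y):
    return W.T @ Y
```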

It should also be noted that the residual is not the only indicator of a change (that is, it is not a sufficient statistic) in all cases. So even though residual based change detection as outlined below is applicable in many cases, there might be improved algorithms. The simplified presentation in this chapter hides the fact that the multi-model approaches below actually use other statistics, but the residual still plays a very important role.

1.3.2. Stopping rules

Many change detection algorithms, among these the algorithms in the classes of one-model and two-model approaches below, can be recast into the problem


of deciding on the following two hypotheses:

$$H_0: \ E(s_t) = 0,$$
$$H_1: \ E(s_t) > 0.$$

A stopping rule is essentially achieved by low-pass filtering s_t and comparing this value to a threshold. Below, two such low-pass filters are given:

The Cumulative SUM (CUSUM) test of Page (1954):

$$g_t = \max(g_{t-1} + s_t - \nu, \ 0), \qquad \text{alarm if } g_t > h.$$

The drift parameter ν influences the low-pass effect, and the threshold h (and also ν) influences the performance of the detector.

The Geometric Moving Average (GMA) test in Roberts (1959):

$$g_t = \lambda g_{t-1} + (1 - \lambda) s_t, \qquad \text{alarm if } g_t > h.$$

Here, the forgetting factor λ is used to tune the low-pass effect, and the threshold h is used to tune the performance of the detector. Using no forgetting at all (λ = 0) corresponds to thresholding directly, which is one option.
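Both stopping rules are only a few lines of code. The sketch below implements them in Python; the convention of resetting the CUSUM statistic after each alarm is an assumption for repeated detection, not stated above.

```python
def cusum(s, drift, threshold):
    """CUSUM test (Page, 1954): g_t = max(g_{t-1} + s_t - nu, 0).
    g is reset after each alarm (an assumed convention)."""
    g, alarms = 0.0, []
    for t, st in enumerate(s):
        g = max(g + st - drift, 0.0)
        if g > threshold:
            alarms.append(t)
            g = 0.0
    return alarms

def gma(s, lam, threshold):
    """GMA test (Roberts, 1959): g_t = lam*g_{t-1} + (1 - lam)*s_t.
    With lam = 0 this reduces to direct thresholding of s_t."""
    g, alarms = 0.0, []
    for t, st in enumerate(s):
        g = lam * g + (1.0 - lam) * st
        if g > threshold:
            alarms.append(t)
    return alarms
```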

1.3.3. One-model approach

Statistical whiteness tests can be used to test if the residuals are white noise, as they should be if there is no change.

Figure 1.11 shows the basic structure, where the filter residuals are transformed to a distance measure that measures the deviation from the no-change hypothesis. The stopping rule decides whether the deviation is significant or not. The most natural distance measures are listed below:

Change in the mean. The residual itself is used in the stopping rule and s_t = ε_t.

Change in variance. The squared residual subtracted by a known residual variance λ is used and s_t = ε_t² − λ.


Figure 1.11. Change detection based on a whiteness test from filter residuals.


Change in correlation. The correlation between the residual and past outputs and/or inputs is used and s_t = ε_t y_{t−k} or s_t = ε_t u_{t−k} for some k.

Change in sign correlation. For instance, one can use the fact that white residuals should change sign every second sample on average and use s_t = sign(ε_t ε_{t−1}). A variant of this sign test is given in Section 4.5.3.
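For concreteness, the four distance measures can be computed from a residual sequence as in the following sketch; lam_var is the assumed known residual variance and k the chosen lag.

```python
import numpy as np

def distance_measures(eps, y, lam_var, k=1):
    """The four distance measures above, computed from residuals eps and
    outputs y. Each array s is then fed to a stopping rule such as cusum."""
    s_mean = eps                           # change in the mean: s_t = eps_t
    s_var  = eps**2 - lam_var              # change in variance: s_t = eps_t^2 - lambda
    s_corr = eps[k:] * y[:-k]              # change in correlation: s_t = eps_t * y_{t-k}
    s_sign = np.sign(eps[1:] * eps[:-1])   # change in sign correlation
    return s_mean, s_var, s_corr, s_sign
```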

Example 1.4 Fuel consumption

To improve on the filter in Example 1.1, the CUSUM test is applied to the residuals of a slow filter, like that in Figure 1.3. For the design parameters h = 5 and ν = 0.5, the response in Figure 1.12 is obtained. The vertical lines illustrate the alarm times of the CUSUM algorithm. The lower plot shows how the test statistic exceeds the threshold level h at each alarm. The adaptive filter in this example computes the mean of the signal from the latest alarm to the current time. With a bad tuning of the CUSUM algorithm, we get either the total mean of the signal if there are no alarms at all, or we get the signal back as the estimate if the CUSUM test gives an alarm at each time instant. These are the two extreme points in the design. Note that nothing worse can happen, so the stability of the filter is not an issue here.


Figure 1.12. Response of an adaptive filter restarted each time a CUSUM test (h = 5, ν = 0.5), fed with the filter residuals, gives an alarm. The lower plot shows the test statistic of the CUSUM test.


To avoid the first situation, where the estimate will converge to the overall signal mean, a better design is to use a slow adaptive filter of the type illustrated in Figure 1.3. To avoid the second degenerate case, an alarm can trigger a fast filter instead of a complete filter restart. That is, the algorithm alternates between slow and fast forgetting instead of complete or no forgetting.

In contrast to the example above, the next example shows a case where the user gets important information from the change detector itself.

Example 1.5 Friction estimation

Figure 1.13 shows how a whiteness test, used for restarting the filter, can improve the filtering result in Example 1.2 quite considerably.

Here the CUSUM test statistic from the residuals is used, and the test statistic g_t is shown in the lower plot in Figure 1.13. Note that the test statistic starts to grow at time 200, but that the conservative threshold level of 3 is not reached until time 210.


Figure 1.13. Estimated friction parameters (upper plot) and test statistics from the CUSUM test (lower plot). Note the improved tracking ability compared to Figure 1.7.

In the following example, change detection is a tool for improved tracking, and the changes themselves do not contain much information for the user.

Example 1.6 Target tracking

This example is a continuation of Example 1.3, and the application of the CUSUM test is analogous to Example 1.5. Figure 1.14(a) shows the estimated



Figure 1.14. Radar measurements and estimates from a Kalman filter with feedback from a CUSUM detector (a), giving alarms at samples 19 and 39. The right plot (b) shows the CUSUM test statistics and the alarm level.

position compared to the true one, while Figure 1.14(b) shows the test statistics from the CUSUM test. We get an alarm in the beginning, telling us that the position prediction errors are much larger than expected. After filter convergence, there are alarms at samples 16, 39 and 64, respectively. These occur in the manoeuvres, and the filter gain is increased momentarily, implying that the estimates come back to the right track almost directly after the alarms.

1.3.4. Two-model approach

A model based on recent data only is compared to a model based on data from a much larger data window. By recent data we often mean data from a sliding window. The basic procedure is illustrated in (1.15) and Figure 1.15.

$$\text{Data}: \ \underbrace{y_1, y_2, \ldots, y_{t-L}, \overbrace{y_{t-L+1}, \ldots, y_t}^{\text{Model } \mathcal{M}_2}}_{\text{Model } \mathcal{M}_1} \qquad (1.15)$$

The model (M2) is based on data from a sliding window of size L and is compared to a model (M1) based on all data or a substantially larger sliding window. If the model based on the larger data window gives larger residuals, then a change is detected. The problem here is to choose a norm that corresponds to a relevant statistical measure. Some norms that have been proposed are:



Figure 1.16. The residual ratio from the fast and slow filters in Example 1.2. In the lower plot, the ratio is low-pass filtered before the magnitude is plotted.

to this knowledge so that it gives white residuals even after the change(s). Such filters are usually called matched filters, because they are matched to specific assumptions on the true system.

The idea in the multi-model approach is to enumerate all conceivable hypotheses about change times and compare the residuals from their matched filters. The filter with the 'smallest' residuals wins, and gives an estimate of the change times that usually is quite accurate. The setup is depicted in Figure 1.17. The formulation is in a sense off-line, since a batch of data is considered, but many proposed algorithms turn out to process data recursively, and they are consequently on-line.

Again, we must decide what is meant by 'small' residuals. By just taking the norm of the residuals, it turns out that we can make the residuals smaller and smaller by increasing the number of hypothesized change times. That is, a penalty on the number of changes must be built in.

There can be a change or no change at each time instant, so there are 2^t possible matched filters at time t. Thus, it is often infeasible to apply all possible matched filters to the data. Much effort has been spent in developing intelligent search schemes that only keep a constant number of filters at each time.

To summarize, the fundamental trade-off inherent in all change detection algorithms is the following:


Figure 1.17. A bank of matched filters, each one based on a particular assumption on the set of change times k^n = {k_i}_{i=1}^n, that are compared in a hypothesis test.

Fundamental limitation of change detection
The design is a compromise between detecting true changes and avoiding false alarms.

On the other hand, adaptive filters complemented with a change detection algorithm, as depicted in Figure 1.18 for the whiteness test principle, offer the following possibility:

Fundamental limitation of change detection for adaptive filtering
The design can be seen as choosing one slow nominal filter and one fast filter used after detections. The design is a compromise between how often the fast filter should be used. A bad design gives either the slow filter (no alarms from the detector) or the fast filter (many false alarms), and that is basically the worst that can happen.


Figure 1.18. Basic feedback structure for adaptive filtering with a change detector. The linear filter switches between a slow and a fast mode depending on the alarm signal.

1.4. Evaluation and formal design

1.4.1. General considerations

Evaluation of the design can be done either on simulated data or on real data, where the change times or true parameters are known. There are completely different evaluation criteria for surveillance and fault detection:

For signal, parameter or state estimation, the main performance measures are tracking ability and variance error in the estimated quantities. For linear adaptive filters there is an inherent trade-off, and the tuning consists in getting a fair compromise between these two measures. Change detectors are non-linear functions of data, and it is in principle possible to get arbitrarily fast tracking and small variance error between the estimated change times, but the change time estimation introduces another kind of variance error. Measures for evaluating change detectors for estimation purposes include the Root Mean Square (RMS) parameter estimation error in simulation studies, and information measures of the obtained model, which work both for simulated and real data.

For fault detection, it is usually important to get the alarms as soon as possible (the delay for detection) while the number of false alarms should be small. The compromise inherent in all change detectors consists in simultaneously minimizing the false alarm rate and the delay for detection. Other measures of interest are change time estimation accuracy, detectability (which faults can be detected) and probability of detection.

The final choice of algorithm should be based on a careful design, taking these considerations into account. These matters are discussed in Chapter 13.

The first example is used to illustrate how the design parameters influence the performance measures.

Example 1.8 Fuel consumption

The interplay of the two design parameters in the CUSUM test is complicated, and there is no obvious simple way to approach the design in Example


Figure 1.19. Examination of the influence of the CUSUM test's design parameters h and ν on the number of alarms (a) and the sum of squared residuals (b).

1.4 other than to use experience and visual examination. If the true fuel consumption was known, formal optimization based design could be used. One examination that can be done is to compare the number of alarms for different design parameters, as done in Figure 1.19(a). We can see from the signal that 10-15 alarms seem to be plausible. We can then take any of the combinations giving this number as candidate designs. Again, visual inspection is the best way to proceed. A formal attempt is to plot the sum of squared residuals for each combination, as shown in Figure 1.19(b). The residual is in many cases, like this one, minimized for any design giving false alarms all the time, like h = 0, ν = 0, and a good design is given by the 'knee', which is a compromise between few alarms and small residuals.

The second example shows a case where joint evaluation of change detection and parameter tracking performance is relevant.


Example 1.9 Friction estimation

Consider the change detector in Example 1.5. To evaluate the overall performance, we would like to have many signals under the same premises. These can be simulated by noting that the residuals are well modeled by Gaussian white noise; see the upper plot in Figure 1.20(a) for one example. Using the 'true' values of the friction parameters, the measured input μ_t and simulated measurement noise, we can simulate, say, 50 realizations of s_t. The mean performance of the filter is shown in Figure 1.20(b). This is the relevant plot for validation for the purpose of friction surveillance. However, if an alarm device for skid is the primary goal, then the delay for detection illustrated in



Figure 1.20. Illustration of residual distribution from real data and distribution of alarm times (a). Averaged parameter tracking (b) and alarm times are evaluated from 100 simulations with an abrupt friction change at sample 200.

the lower plot of Figure 1.20(a) is the relevant evaluation criterion. Here one can make statements like: the mean delay for alarm after hitting an ice spot is less than five samples (1 second).

Finally, the last example illustrates how parameter tracking can be optimized.

Example 1.10 Target tracking

Let us consider Example 1.3 again. It was remarked that the tracking in Figure 1.9 is poor and the filter should be better tuned. The adaptation gain can be optimized with respect to the RMS error and an optimal adaptation gain is obtained, as illustrated in Figure 1.21(a). The optimization criterion typically has, as here, no local minima, but is non-convex, making minimization more difficult. Note that the optimized design in Figure 1.21(b) shows much better tracking compared to Figure 1.9, at the price of somewhat worse noise attenuation.

1.4.2. Performance measures

A general on-line statistical change detector can be seen as a device that takes a sequence of observed variables and at each time makes a binary decision if the system has undergone a change. The following measures are critical:

Mean Time between False Alarms (MTFA). How often do we get alarms when the system has not changed? The reciprocal quantity is called the false alarm rate (FAR).



Figure 1.21. Total RMS error from the Kalman filter as a function of its gain (a). Note that there is a clear minimum. The optimized design is shown in (b).

Mean Time to Detection (MTD). How long do we have to wait after a change until we get the alarm?

Average Run Length function, ARL(θ), which generalizes MTFA and MTD. The ARL function is defined as the mean time between alarms from the change detector as a function of the magnitude of the change. Hence, the MTFA and MTD are special cases of the ARL function. The ARL function answers the question of how long it takes before we get an alarm after a change of size θ.

In practical situations, either MTFA or MTD is fixed, and we optimize the choice of method and design parameters to minimize the other one. For instance, in an airborne navigation system, the MTFA might be specified to one false alarm per 10^5 flight hours, and we want to get the alarms as quickly as possible under these premises.
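These quantities are rarely available in closed form, but they are easy to estimate by Monte Carlo simulation. A sketch for the CUSUM test, assuming unit-variance Gaussian residuals and a mean change of size θ (all settings below are illustrative):

```python
import numpy as np

def first_alarm(s, drift, threshold):
    """Time of the first CUSUM alarm, or len(s) if there is none."""
    g = 0.0
    for t, st in enumerate(s):
        g = max(g + st - drift, 0.0)
        if g > threshold:
            return t
    return len(s)

rng = np.random.default_rng(0)
n_mc, n, theta, h, nu = 1000, 2000, 0.5, 5.0, 0.25   # illustrative settings

# MTFA: mean time to the first (false) alarm on change-free residuals.
mtfa = np.mean([first_alarm(rng.standard_normal(n), nu, h)
                for _ in range(n_mc)])

# MTD: mean delay to the first alarm after a change of size theta, under the
# simplifying assumption that the test statistic is zero at the change time.
mtd = np.mean([first_alarm(rng.standard_normal(n) + theta, nu, h)
               for _ in range(n_mc)])
```

Sweeping theta in the MTD computation traces out an empirical ARL function.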

In off-line applications, we have a batch of data and want to find the time instants for system changes as accurately as possible. This is usually called segmentation. Logical performance measures are:

Estimation accuracy. How accurately can we locate the change times?

The Minimum Description Length (MDL). How much information is needed to store a given signal?

The latter measure is relevant in data compression and communication areas, where disk space or bandwidth is limited. MDL measures the number of


binary digits that are needed to represent the signal. Segmentation is one tool for making this small. For instance, the GSM standard for mobile telephony models the signal as autoregressive models over a certain segmentation of the signal, and then transmits only the model parameters and the residuals. The change times are fixed to every 50 ms. The receiver then recovers the signal from this information.


Applications

2.1. Change in the mean model
    2.1.1. Airbag control
    2.1.2. Paper refinery
    2.1.3. Photon emissions
    2.1.4. Econometrics
2.2. Change in the variance model
    2.2.1. Barometric altitude sensor in aircraft
    2.2.2. Rat EEG
2.3. FIR model
    2.3.1. Ash recycling
2.4. AR model
    2.4.1. Rat EEG
    2.4.2. Human EEG
    2.4.3. Earthquake analysis
    2.4.4. Speech segmentation
2.5. ARX model
    2.5.1. DC motor fault detection
    2.5.2. Belching sheep
2.6. Regression model
    2.6.1. Path segmentation and navigation in cars
    2.6.2. Storing EKG signals
2.7. State space model
    2.7.1. DC motor fault detection
2.8. Multiple models
    2.8.1. Valve stiction
2.9. Parameterized non-linear models
    2.9.1. Electronic nose
    2.9.2. Cell phone sales figures

This chapter provides background information and problem descriptions of the applications treated in this book. Most of the applications include real data, and many of them are used as case studies examined throughout the book with different algorithms. This chapter serves both as a reference chapter


and as a motivation for the area of adaptive filtering and change detection. The applications are divided here according to the model structures that are used. See the model summary in Appendix A for further details. The larger case studies on target tracking, navigation, aircraft control fault detection, equalization and speech coding, which deserve more background information, are not discussed in this chapter.

2.1. Change in the mean model

The fuel consumption application in Examples 1.1, 1.4 and 1.8 is one example of a change in the mean model. Mathematically, the model is defined in equation (A.1) in Appendix A. Here a couple of other examples are given.

2.1.1. Airbag control

Conventional airbags explode when the front of the car is decelerated by a certain amount. In the first generation of airbags, the same pressure was used in all cases, independently of what the driver/passenger was doing, or their weight. In particular, the passenger might be leaning forwards, or may not even be present. The worst cases are when a baby seat is in use, when a child is standing in front of the seat, and when very short persons are driving and sitting close to the steering wheel. One idea to improve the system is to monitor the weight on the seat in order to detect the presence of a passenger


Figure 2.1. Two data sets showing a weight measurement on a car seat when a person enters and leaves the car. Also shown are one on-line and one off-line estimate of the weight as a function of time. Data provided by Autoliv, Linkoping, Sweden.



Figure 2.2. Power signal from a paper refinery and the output of a filter designed by the company. Data provided by Thore Lindgren at Sund's Defibrator AB, Sundsvall, Sweden.

and, in that case, his position, and to then use two or more different explosion pressures, depending on the result. The data in Figure 2.1 show weight measurements when a passenger is entering and shortly afterwards leaving the seat. Two different data sets are shown. Typically, there are certain oscillations after manoeuvres, where the seat can be seen as a damper-spring system. As a change detection problem, this is quite simple, but reliable functionality is still rather important. In Figure 2.1, change times from an algorithm in Chapter 3 are marked.

2.1.2. Paper refinery

Figure 2.2(a) shows process data from a paper refinery (M48 Refiner, Techboard, Wales; the original data have been rescaled). The refinery engine grinds tree fibers for paper production. The interesting signal is a raw engine power signal in kW, which is extremely noisy. It is used to compute the reference value in a feedback control system for quality control, and also to detect engine overload. The requirements on the power signal filter are:

The noise must be considerably attenuated to be useful in the feedback loop.

It is very important to quickly detect abrupt power decreases, to be able to remove the grinding discs quickly and avoid physical disc faults.

That is, both tracking and detection are important, but for two different reasons. An adaptive filter provides some useful information, as seen from the low-pass filtered squared residual in Figure 2.2(b).


There are two segments where the power clearly decreases quickly. Furthermore, there is a starting and stopping transient that should be detected as change times.

The noise level is fairly constant (0.05) during the observed interval.

These are the starting points when the change detector is designed in Section 3.6.2.

2.1.3. Photon emissions

Tracking the brightness changes of galactic and extragalactic objects is an important subject in astronomy. The data examined here are obtained from X-ray and γ-ray observatories. The signal depicted in Figure 2.3 consists of even integers representing the time of arrival of the photons, in units of microseconds, where the fundamental sampling interval of the instrument is 2 microseconds. More details of the application can be found in Scargle (1997). This is a typical queue process where a Poisson process is plausible. A Poisson process can be easily converted to a change in the mean model by computing the time difference between the arrival times. By definition, these differences will be independently exponentially distributed (disregarding quantization errors).
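The conversion is a one-liner; the sketch below uses toy arrival times in place of the real data.

```python
import numpy as np

# Toy arrival times in microseconds (even integers, as in the data set).
arrival_times = np.array([2, 6, 8, 14, 16, 24, 26, 30], dtype=float)

# Inter-arrival times: exponentially distributed for a Poisson process,
# so a change in photon rate shows up as a change in their mean.
y = np.diff(arrival_times)

# The grouped version used in Figure 2.3 (time for every 100th arrival)
# is obtained the same way with a stride: np.diff(arrival_times[::100]).
```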


Figure 2.3. Photon emissions are counted in short time bins. The first plot shows the number of detected photons as a function of time. To recast the problem to a change in the mean, the time for 100 arrivals is computed, which is shown in the lower plot. Data provided by Dr. Jeffrey D. Scargle at NASA.


2.2.1. Barometric altitude sensor in aircraft

A barometric air pressure sensor is used in airborne navigation systems for stabilizing the altitude estimate in the inertial navigation system. The barometric sensor is not very accurate, and gives measurements of height with both a bias and a large variance error. The sensor is particularly sensitive to the so-called transonic passage, that is, when Mach 1 (the speed of sound) is passed. It is a good idea to detect for which velocities the measurements are useful, and perhaps also to try to find a table for mapping velocity to noise variance. Figure 2.5 shows the errors from a calibrated (no bias) barometric sensor, compared to the 'true' values from a GPS system. The lower plot shows low-pass filtered squared errors. The data have been rescaled.

It is desirable here to have a procedure to automatically find the regions where the data have an increased error, and to tabulate a noise variance as a function of velocity. With such a table at hand, the navigation system can weigh the information accordingly.


Figure 2.5. Barometric altitude measurement error for a test flight when passing the speed of sound (sample around 1600). The lower plot shows low-pass filtered errors as a very rough estimate of noise variance. (Data provided by Dr. Jan Palmqvist, SAAB Aircraft.)

2.2.2. Rat EEG

The EEG signal in Figure 2.6 is measured on a rat. The goal is to classify the signal into segments of so-called 'spindles' or background noise. Currently, researchers are using a narrow band filter, and then apply a threshold for the output power of the filter. That method gives


Figure 2.6. EEG for a rat and segmented signal variance. Three distinct areas of brain activity can be distinguished. (Data provided by Dr. Pasi Karjalainen, Dept. of Applied Physics, University of Kuopio, Finland.)

[1096 1543 1887 2265 2980 3455 3832 3934].

The lower plot of Figure 2.6 shows an alternative approach, which segments the noise variance into piecewise constant intervals. The estimated change times are:

[1122 1522 1922 2323 2723 3129 3530].

The 'spindles' can be estimated to three intervals in the signal from this information.

2.3. FIR model

The Finite Impulse Response (FIR) model (see equation (A.4) in Appendix A) is standard in real-time signal processing applications such as communication systems, but it is useful in other applications as well, as the following control oriented example illustrates.

2.3.1. Ash recycling

Wood has become an important fuel in district heating plants, and in other applications. For instance, Swedish district heating plants produce about 500 000 tons of ash each year, and soon there will be a 30 penalty fee for depositing


the ash as waste, which is an economical incentive for recycling. The ash from burnt wood cannot be recycled back to nature directly, mainly because of its volatility.

We examine here a recycling procedure described in Svantesson et al. (2000). By mixing ash with water, a granular material is obtained. In the water mixing process, it is of the utmost importance to get the right mixture. When too much water is added, the mixture becomes useless. The idea is to monitor the mixture's viscosity indirectly, by measuring the power consumed by the electric motor in the mixer. When the dynamics between input water and consumed power changes, it is important to stop adding water immediately.

A simple semi-physical model is that the viscosity of the mixture is proportional to the amount of water, where the initial amount is unknown. That is, the model with water flow as input is

$$y_t = \theta_1 + \theta_2 u_t + e_t,$$

where u_t is the integrated water flow. This is a simple model of the FIR type (see equation (A.4)) with an extra offset, or equivalently a linear regression (equation (A.3)). When the proportional coefficient θ_2 changes, the granulation material is ready.
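A least squares fit of this two-parameter model is immediate; the sketch below uses simulated data in place of the real measurements.

```python
import numpy as np

rng = np.random.default_rng(1)
flow = np.ones(200)                      # water flow input (toy data)
u = np.cumsum(flow)                      # integrated water flow u_t
y = 2.0 + 0.5 * u + 0.1 * rng.standard_normal(len(u))   # simulated power

Phi = np.column_stack((np.ones_like(u), u))        # regressors (1, u_t)
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)    # estimates (theta1, theta2)
```

Monitoring theta[1] recursively over a sliding window, and alarming when it changes, is the detection task described above.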


Figure 2.7. Water flow and power consumed by a mixer for making granules of ash from burnt wood (a). At a certain time instant, the dynamics changes and then the mixture is ready. A model based approach enables the simulation of the output (b), and the change point in the dynamics is clearly visible. (Data provided by Thomas Svantesson, Kalmar University College, Sweden.)


A more precise model that fits the data better would be to include a parameter for modeling that the mixture dries up after a while when no water is added.

2.4. AR model

The AR model (see Appendix A) is very useful for modeling time series, of which some examples are provided in this section.

2.4.1. Rat EEG

The same data as in Section 2.2.2 can be analyzed assuming an AR(2) model. Figure 2.8 shows the estimated parameters from a segmentation algorithm (there seems to be no significant parameter change) and the segmented noise variance. Compared to Figure 2.6, the variance is roughly one half of that here, showing that the model is relevant and that the result should be more accurate. The change times were estimated here as:

[1085 1586 1945 2363 2949 3632 3735].


Figure 2.8. EEG for a rat and segmented noise variance from an AR(2) model.

2.4.2. Human EEG

The data shown in Figure 2.9 are measured from the human occipital area. Before time t_b the lights are turned on in a test room where a test person is


looking at something interesting. The neurons are processing information in the visual cortex, and only noise is seen in the measurements. When the lights are turned off, the visual cortex is at rest. The neuron clusters start a 10 Hz periodic 'rest rhythm'. The delay between t_b and the actual time when the rhythm starts varies strongly. It is believed that the delay correlates with, for example, Alzheimer's disease, and methods for estimating the delay would be useful in medical diagnosis.


Figure 2.9. EEG for a human in a room, where the light is turned off at time 387. After a delay which varies for different test subjects, the EEG changes character. (Data provided by Dr. Pasi Karjalainen, Dept. of Applied Physics, University of Kuopio, Finland.)

2.4.3. Earthquake analysis

Seismological data are collected and analyzed continuously all over the world. One application of the analysis aims to detect and locate earthquakes. Figure 2.10 shows three of, in this case, 16 available signals, where the earthquake starts around sample number 600.

Visual inspection shows that both the energy and the frequency content undergo an abrupt change at the onset time, and smaller but still significant changes can be detected during the quake.

As another example, Figure 2.11 shows the movements during the 1989 earthquake in San Francisco. This data set is available in MATLAB™ as quake. Visually, the onset time is clearly visible as a change in energy and frequency content, which is again a suitable problem for an AR model.



Figure 2.10. Example of logged data from an earthquake.


Figure 2.11. Movements for the 1989 San Francisco earthquake.


2.4.4. Speech segmentation

The speech signal is one of the most classical applications of the AR model. One reason is that it is possible to motivate it from physics; see Example 5.2. The speech signal shown in Figure 2.12, which will be analyzed later on, was recorded inside a car by the French National Agency for Telecommunications, as described in Andre-Obrecht (1988). The goal of segmentation might be speech recognition, where each segment corresponds to one phoneme, or speech coding (compare with Section 5.11).


Figure 2.12. A speech signal and a possible segmentation. (Data provided by Prof. Michele Basseville, IRISA, France, and Prof. Regine Andre-Obrecht, IRIT, France.)

2.5. ARX model

The ARX model (see equation (A.12) in Appendix A) is an extension of the AR model for dynamic systems driven by an input u_t.

2.5.1. DC motor ault detection

An application studied extensively in Part IV, and briefly in Part III, is based on simulated and measured data from a DC motor. A typical application is to use the motor as a servo, which requires an appropriate controller designed to a model of the motor. If the dynamics of the motor change with time, we have an adaptive control problem. In that case, the controller needs to be redesigned, at regular time instants or when needed, based on the updated model. Here we are facing a fundamental isolation problem:


Table 2.1. Data sets for the DC lab motor.

Data set 1: No fault or disturbance.
Data set 2: Several disturbances.
Data set 3: Several system changes.
Data set 4: Several disturbances and one (very late) system change.

The reference signal is generated as a square wave pre-filtered by 1.67/(s + 1.67) (to get rid of an overshoot due to the zeros of the closed loop system) with a small sinusoidal perturbation signal added (0.02 sin(4.5t)).

The closed loop system from r to y is, with T_s = 0.1,

$$y_t = \frac{0.1459q^3 - 0.1137q^2 - 0.08699q + 0.0666}{q^4 - 2.827q^3 + 3.041q^2 - 1.472q + 0.2698}\, r_t. \qquad (2.1)$$

An alternative model is given in Section 2.7.1.

Data consist of y_t from the process under four different experiments, as summarized in Table 2.1.

Disturbances were applied by physically holding the outgoing motor axle. System changes were applied in software by shifting the pole in the DC motor from 3.5 to 2 (by including a block (s + 3.5)/(s + 2) in the regulator).

The transfer function model in (2.1) is well suited for the case of detecting model changes, while the state space model to be defined in (2.2) is better for detecting disturbances.


Figure 2.14. Model residuals, defined as the measurements minus a simulated output. Torque disturbances and dynamical model changes are clearly visible, but they are hard to distinguish.


2.6. Regression model

The general form of the linear regression model (see equation (A.3) in Appendix A) includes FIR, AR and ARX models as special cases, but it also appears in contexts other than modeling dynamical time series. We have already seen the friction estimation application in Examples 1.2, 1.5 and 1.9.

2.6.1. Path segmentation and navigation in cars

This case study will be examined in Section 7.7.3. The data were collected from test drives with a Volvo 850 GLT, using sensor signals from the ABS system.

There are several commercial products providing guidance systems for cars. These require a position estimate of the car, and proposed solutions are based on expensive GPS (Global Positioning System) or a somewhat less expensive gyro. The idea is to compare the estimated position with a digital map. Digital street maps are available for many countries. The map and position estimator, possibly in combination with traffic information transmitted over the FM band or from road side beacons, are then used for guidance.

We demonstrate here how an adaptive filter, in combination with a change detector, can be used to develop an almost free position estimator, where no additional hardware is required. It has a worse accuracy than its alternatives, but on the other hand, it seems to be able to find relative movements on a map, as will be demonstrated.


Figure 2.16. Driven path and possible result of manoeuvre detection. (Data collected by the author in collaboration with Volvo.)


Figure 2.16 shows an estimated path for a car, starting with three sharp turns and including one large roundabout. The velocities of the free-rolling wheels are measured using sensors available in the Anti-lock Braking System (ABS). By comparing the wheel velocities ω_r and ω_l on the right and left side, respectively, the velocity and curve radius R can be computed from expressions involving the wheel base L, the nominal wheel radius r and the relative difference ε in wheel radius on the left and right sides; a sketch is given below. The wheel radius difference ε gives an offset in heading angle, and is thus quite important for how the path looks (though it is not important for segmentation). It is estimated on a long-term basis, and is in this example 2.5 · 10⁻³. The algorithm is implemented on a PC and runs on a Volvo 850 GLT.

The heading angle ψ_t and the global position (X_t, Y_t) as functions of time can then be computed by summing up the heading changes and the velocity contributions (dead reckoning). The sampling interval was chosen as T_s = 1 s. The approach requires that the initial position and heading angle X_0, Y_0, ψ_0 are known.
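The exact formulas were lost in this transcript, but the computation is standard differential odometry. The sketch below is therefore a simplified reconstruction under that assumption, with the ε correction applied to the left wheel as one possible convention.

```python
import numpy as np

def dead_reckoning(omega_r, omega_l, r, L, Ts=1.0,
                   X0=0.0, Y0=0.0, psi0=0.0, eps=0.0):
    """Simplified differential odometry from ABS wheel speeds (arrays).
    r: nominal wheel radius, L: wheel base, eps: relative wheel radius
    difference. A sketch, not the book's exact expressions."""
    v = r * (omega_r + omega_l) / 2.0                     # vehicle speed
    psidot = r * ((1.0 + eps) * omega_l - omega_r) / L    # heading rate
    psi = psi0 + Ts * np.cumsum(psidot)                   # heading angle
    X = X0 + Ts * np.cumsum(v * np.cos(psi))              # global position
    Y = Y0 + Ts * np.cumsum(v * np.sin(psi))
    return X, Y, psi
```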

The path shown in Figure 2.16 fits a street map quite well, but not perfectly. The reason for using segmentation is to use corners, bends and roundabouts for updating the position from the digital map. Any input to the algorithm dependent on the velocity will cause a lot of irrelevant alarms, which is obvious from the velocity plot in Figure 2.17. The ripple on the velocity signal is caused by gear changes. Thus, segmentation using velocity dependent measurements should be avoided. Only the heading angle ψ_t is needed for segmentation.

The model is that the heading angle is piecewise constant or piecewise linear, corresponding to straight paths and bends or roundabouts. The regression model used is thus locally a linear function of time,

$$\psi_t = \theta_1 + \theta_2 t + e_t,$$

with piecewise constant parameters.


Figure 2.17. Velocity in the test drive.

2.6.2. Storing EKG signals

Databases for various medical applications are becoming more and more frequent. One of the biggest is the FBI fingerprint database. For storage efficiency, data should be compressed without losing information. The fingerprint database is compressed by wavelet techniques. The EKG signal examined here will be compressed by polynomial models with piecewise constant parameters. For example, a linear model is

$$y_t = \theta_1 + \theta_2 t + e_t,$$

with parameters that are constant within each segment.

Figure 2.18 shows a part of an EKG signal and a possible segmentation. For evaluation, the following statistics are interesting:

Model: Linear model
Error (%): 0.85
Compression rate (%): 10

The linear model gives a decent error rate and a low compression rate. The compression rate is measured here as the number of parameters (here 2) times the number of segments, compared to the number of data. It says how many real numbers have to be saved, compared to the original data. Details on the implementation are given in Section 7.7.1. There is a design parameter in the algorithm to trade off between the error and compression rates.
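Given a set of change times, the compression itself is segment-wise least squares. A sketch, where the 2-parameters-per-segment bookkeeping follows the definition of the compression rate above:

```python
import numpy as np

def compress_ekg(y, change_times):
    """Piecewise linear compression: fit y_t = theta1 + theta2*t on each
    segment defined by the change times, and report the compression rate
    (2 parameters per segment relative to the number of data)."""
    bounds = [0, *change_times, len(y)]
    params = []
    for a, b in zip(bounds[:-1], bounds[1:]):
        t = np.arange(a, b, dtype=float)
        Phi = np.column_stack((np.ones(b - a), t))   # regressors (1, t)
        theta, *_ = np.linalg.lstsq(Phi, y[a:b], rcond=None)
        params.append(theta)
    rate = 2 * len(params) / len(y)
    return params, rate
```

Storing params instead of y realizes the compression; the error rate follows from the residuals of each fit.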



Figure 2.20. Valve input and output, and estimated valve position. (Data provided byDr. Krister Forsman, ABB Automation.)

2.9. Parameterized non-linear models

Linear models, such as linear regressions and state space models, can in many applications give a satisfactory result. However, in some cases tailored non-linear models with unknown parameters might give a considerable improvement in performance, or may even be the only alternative. We will give some examples of the latter case here.

One application of models of the type (A.17) is innovation processes. Innovation models are common in various disciplines, for instance:

Biology: growth processes in certain species or diseases or bacteria.

Economy: increase and saturation of sales figures.

The two innovation processes used as illustration here are both based on a continuous time state space model of the kind

$$\dot{x}_t = f(x_t; \theta), \qquad g_t = h(x_t; \theta). \qquad (2.3)$$

The problem can be seen as curve fitting to a given differential equation.


2.9.1. Electronic nose

We start with a biological example of the innovation process, presented in Holmberg et al. (1998). The data are taken from an artificial nose, which uses 15 sensors to classify bacteria. For feature extraction, a differential equation model is used to map the time series to a few parameters. The 'nose' works as follows. The bacteria sample is placed in a substrate. To start with, the bacteria consume substrate and they increase in number (the growth phase). After a while, when the substrate is consumed, the bacteria start to die. Let f(t) denote the measurements. One possible model for these is a five-parameter growth-death curve,

with g_t = x in equation (2.3). The physical interpretation of the parameters is as follows:

θ_1: Scaling (amplitude)
θ_2: Growth factor (steepness in descent)
θ_3: Growth initial time (defined by the 50% level)
θ_4: Death factor (steepness in ascent)
θ_5: Death initial time (defined by the 50% level)

The basic signal processing idea of minimizing a least squares fit means


Figure 2.21. Example of least squares fit of parametric differential equation to data.


that

$$V(\theta) = \sum_{t=1}^{N} \left( f_t - g_t(\theta) \right)^2$$

is minimized with respect to θ. The measure of fit

$$\frac{\min_{\theta} V(\theta)}{\sum_{t=1}^{N} f_t^2}$$

can be used to judge the performance. For the measurement responses in Figure 2.21, the fit values are 0.0120, 0.0062 and 0.0004, respectively.
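The differential equation model itself was lost in this transcript, so the sketch below substitutes an assumed product-of-sigmoids curve that merely matches the parameter interpretations in the table, and shows how V(θ) and the measure of fit can be computed with scipy.

```python
import numpy as np
from scipy.optimize import least_squares

def g(theta, t):
    """Assumed growth-death curve: a rising sigmoid (growth) multiplied by a
    falling sigmoid (death). A stand-in for the book's model; only the
    parameter roles match the table above."""
    th1, th2, th3, th4, th5 = theta
    return th1 / (1 + np.exp(-th2 * (t - th3))) / (1 + np.exp(th4 * (t - th5)))

t = np.arange(19, dtype=float)             # 19 time values per sensor signal
f = g([500.0, 1.0, 4.0, 0.8, 12.0], t)     # synthetic 'measurements'

res = least_squares(lambda th: g(th, t) - f, x0=[400.0, 0.5, 3.0, 0.5, 10.0])
fit = np.sum(res.fun**2) / np.sum(f**2)    # measure of fit: min V / sum f_t^2
```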

The classification idea is as follows: estimate the parametric model for each of the 15 sensor signals for each nose sample, each one consisting of 19 time values. Then, use the parameters as features in classification. As described in Holmberg et al. (1998), a classification rate of 76% was obtained using leave-one-out validation, from five different bacteria. Details of the parameter identification are given in Example 5.10.

2.9.2. Cell phone sales figures

Economic innovation models are defined by an innovation that is communicated over certain channels in a social system (Rogers, 1983). The example in Figure 2.22 shows sales figures for telephones in the standards NMT 450 and NMT 900. The latter is a newer standard, so the figures have not reached the saturation level at the final time. A possible application is to use the


Figure 2.22. Sales figures for the telephone standards NMT 450 and NMT 900.


innovation model fitted to the previous standard to predict both the saturation time and the saturation level from the current information.

Many models have been proposed; see Mahajan et al. (1990). Most of them rely on the exponential function for modeling the growth process and the saturation phase. One particular model is the so-called combined model in Wahlbin (1982). In this model, N is the final number of NMT telephone owners, a is a growth parameter and b a saturation parameter. Details on the parameter identification and a fitted curve are presented in Example 5.11.
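Since the exact form of the combined model is not reproduced here, the following sketch fits a Bass-type diffusion curve as a stand-in with the same ingredients: a final level N, a growth parameter a and a saturation parameter b.

```python
import numpy as np
from scipy.optimize import least_squares

def diffusion(params, t):
    """Bass-type cumulative diffusion curve (a stand-in, not Wahlbin's exact
    model): N final number of owners, a growth and b saturation parameter."""
    N, a, b = params
    e = np.exp(-(a + b) * t)
    return N * (1 - e) / (1 + (b / a) * e)

t = np.arange(100, dtype=float)
sales = diffusion([4e5, 0.02, 0.1], t)      # synthetic cumulative sales

res = least_squares(lambda p: diffusion(p, t) - sales,
                    x0=[3e5, 0.01, 0.05], bounds=(1e-6, np.inf))
```

Evaluating the fitted curve beyond the last observed time gives the desired prediction of saturation time and level.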


Part II: Signal estimation


For change detection, this will be labeled as a change in the mean model. The task of determining θ_t from y_t will be referred to as estimation, and change detection or alarming is the task of finding abrupt, or rapid, changes in θ_t, which is assumed to start at time k, referred to as the change time. Surveillance comprises all these aspects, and a typical application is to monitor levels, flows and so on in industrial processes and alarm for abnormal values.

The basic assumptions about model (3.1) in change detection are:

The deterministic component θ_t undergoes an abrupt change at time t = k. Once this change is detected, the procedure starts all over again to detect the next change. The alternative is to consider θ_t as piecewise constant and focus on a sequence of change times k_1, k_2, ..., k_n, as shown in Chapter 4. This sequence is denoted k^n, where both k_i and n are free parameters. The segmentation problem is to find both the number and the locations of the change times in k^n.

In the statistical approaches, it will be assumed that the noise is white and Gaussian, e_t ∈ N(0, R). However, the formulas can be generalized to other distributions, as will be pointed out in Section 3.A.

The change magnitude for a change at time k is defined as ν = θ_{k+1} − θ_k.

Change detection approaches can be divided into hypothesis tests and estimation/information approaches. Algorithms belonging to the class of hypothesis tests can be split into the parts shown in Figure 3.1.

For the change in the mean model, one or more of these blocks become trivial, but the picture is useful to keep in mind for the general model-based case. Estimation and information approaches do everything in one step, and do not suit the framework of Figure 3.1.

The alternative to the non-parametric approach in this chapter is to model the deterministic component of y_t as a parametric model, and this issue will


(a) Change detection based on statistics from one filter.
(b) A stopping rule consists of averaging and thresholding.

Figure 3.1. The steps in change detection based on hypothesis tests. The stopping rule can be seen as an averaging filter and a thresholding decision device.


It must be noted that the signal model (3.1), and thus all methods in this part, are special cases of what will be covered in Part III.

This chapter presents a review of averaging strategies, stopping rules, and change detection ideas. Most of the ideas to follow in subsequent chapters are introduced here.

3.2. Filtering approaches

The standard approach in signal processing for separating the signal θ_t and the noise e_t is by (typically low-pass) filtering:

θ̂_t = H(q) y_t. (3.2)

The filter can be of Finite Impulse Response (FIR) or Infinite Impulse Response (IIR) type, and designed by any standard method (Butterworth, Chebyshev etc.). An alternative interpretation of filters is data windowing,

θ̂_t = Σ_{k=0}^{∞} w_k y_{t−k}, (3.3)

where the weights should satisfy Σ_k w_k = 1. This is equal to the filtering approach if the weights are interpreted as the impulse response of the (low-pass) filter H(q), i.e. w_k = h_k.

An important special case of these is the exponential forgetting window, or Geometric Moving Average (GMA),

w_k = (1 − λ) λ^k, 0 ≤ λ < 1. (3.4)

A natural, and for change detection fundamental, principle is to use a sliding window, defined by

w_k = 1/L for k = 0, 1, ..., L − 1, and w_k = 0 otherwise. (3.5)

A more general approach, which can be labeled Finite Moving Average (FMA), is obtained by using arbitrary weights w_k in the sliding window with the constraint Σ w_k = 1, which is equivalent to a FIR filter.

3.3. Summary of least squares approaches

This section offers a summary of the adaptive filters presented in a more general form in Chapter 5.


A common framework for the most common estimation approaches is to let the signal estimate be the minimizing argument (arg min) of a certain loss function:

θ̂_t = arg min_θ V_t(θ). (3.6)

In the next four subsections, a number of loss functions are given, and the corresponding estimators are derived. For simplicity, the noise variance is assumed to be constant, E e_t² = R. We are interested in the signal estimate θ̂_t, and also its theoretical variance P_t = E(θ̂_t − θ_t)² and its estimate P̂_t. For adaptive methods, the parameter variance is defined under the assumption that the parameter θ_t is time invariant.

3.3.1. Recursive least squares

To start with, the basic idea in least squares is to minimize the sum of squared errors:

V_t(θ) = Σ_{i=1}^{t} (y_i − θ)². (3.7)

The minimizing argument is the sample average,

θ̂_t = (1/t) Σ_{i=1}^{t} y_i, (3.8)

with variance

P_t = R/t, (3.9)

which is estimated by

R̂_t = 1/(t − 1) · V_t(θ̂_t), (3.10)
P̂_t = R̂_t / t. (3.11)

This is an off-line approach which assumes that the parameter is time invariant. If θ_t is time-varying, adaptivity can be obtained by forgetting old measurements, using the following loss function:

V_t(θ) = Σ_{i=1}^{t} λ^{t−i} (y_i − θ)².


Here λ is referred to as the forgetting factor. This formula yields the recursive least squares (RLS) estimate. Note that the estimate θ̂_t is unbiased only if the true parameter is time invariant. A recursive version of the RLS estimate is

θ̂_t = λ θ̂_{t−1} + (1 − λ) y_t
    = θ̂_{t−1} + (1 − λ) ε_t, (3.12)

where ε_t = y_t − θ̂_{t−1} is the prediction error. This latter formulation of RLS will be used frequently in the sequel, and a general derivation is presented in Chapter 5.

3.3.2. The least squares over a sliding window

Computing the least squares loss function over a sliding window of size L gives:

V_t(θ) = Σ_{i=t−L+1}^{t} (y_i − θ)²,
θ̂_t = (1/L) Σ_{i=t−L+1}^{t} y_i = θ̂_{t−1} + (1/L)(y_t − y_{t−L}),
R̂_t = 1/(L − 1) · V_t(θ̂_t),
P̂_t = R̂_t / L.

This approach will be labeled the Windowed Least Squares (WLS) method. Note that a memory of size L is needed in this approach to store old measurements.

3.3.3. Least mean square

In the Least Mean Square (LMS) approach, the objective is to minimize

V(θ) = E(y_t − θ)² (3.13)


by a stochastic gradient algorithm, defined by

θ̂_t = θ̂_{t−1} − (μ/2) · (d/dθ) V(θ) |_{θ = θ̂_{t−1}}. (3.14)

Here, μ is the step size of the algorithm. The expectation in (3.13) cannot be evaluated, so the standard approach is to just ignore it. Differentiation then gives the LMS algorithm:

θ̂_t = θ̂_{t−1} + μ ε_t. (3.15)

That is, for signal estimation, LMS and RLS coincide with μ = 1 − λ. This is not true in the general case in Chapter 5.

3.3.4. The Kalman filter

One further alternative is to explicitly model the parameter time-variations as a so-called random walk,

θ_{t+1} = θ_t + u_t. (3.16)

Let the variance of the noise u_t be Q. Then the Kalman filter, as will be derived in Chapter 13, applies:

θ̂_t = θ̂_{t−1} + (P_{t−1} + Q)/(P_{t−1} + Q + R) · (y_t − θ̂_{t−1}), (3.17)
P_t = P_{t−1} + Q − (P_{t−1} + Q)²/(P_{t−1} + Q + R). (3.18)

If the assumption (3.16) holds, the Kalman filter is the optimal estimator in the following meanings:

• It is the minimum variance estimator if u_t and e_t (and also the initial knowledge of θ_0) are independent and Gaussian. That is, there is no other estimator that gives a smaller variance error Var(θ̂_t − θ_t).

• Among all linear estimators, it gives the minimum variance error independently of the distribution of the noises.

• Since minimum variance is related to the least squares criterion, we also have a least squares optimality under the model assumption (3.16).

• It is the conditional expectation of θ_t, given the observed values of y^t.

This subject will be thoroughly treated in Chapter 8.


Example 3.1 Signal estimation using linear filters

Figure 3.2 shows an example of a signal, and the estimates from RLS, LMS and KF, respectively. The design parameters are λ = 0.9, μ = 0.1 and Q = 0.02, respectively. RLS and LMS are identical, and the Kalman filter is very similar for these settings.
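The example is easy to reproduce. The following minimal Python sketch, not from the book, implements (3.12), (3.15) and (3.17)-(3.18) with the design parameters above; the test signal stands in for the one in Figure 3.2.

```python
import numpy as np

def rls(y, lam=0.9):
    est, out = 0.0, []
    for yt in y:
        est += (1 - lam) * (yt - est)      # eq. (3.12)
        out.append(est)
    return np.array(out)

def lms(y, mu=0.1):
    est, out = 0.0, []
    for yt in y:
        est += mu * (yt - est)             # eq. (3.15)
        out.append(est)
    return np.array(out)

def kalman(y, Q=0.02, R=1.0):
    est, P, out = 0.0, 1.0, []
    for yt in y:
        P += Q                             # time update for the random walk (3.16)
        K = P / (P + R)                    # Kalman gain
        est += K * (yt - est)              # eq. (3.17)
        P = (1 - K) * P                    # eq. (3.18), rewritten with the gain
        out.append(est)
    return np.array(out)

# a signal with an abrupt mean change; with mu = 1 - lam, RLS and LMS coincide
y = np.concatenate([np.zeros(50), np.ones(50)]) + 0.1 * np.random.randn(100)
print(rls(y)[-1], lms(y)[-1], kalman(y)[-1])
```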

3.4. Stopping rules and the CUSUM test

A stopping rule is used in surveillance for giving an alarm when θ_t has exceeded a certain threshold. Often, a stopping rule is used as a part of a change detection algorithm. It can be characterized as follows:

• The definition of a stopping rule here is that, in contrast to change detection, no statistical assumptions on its input are given.

• The change from θ_t = 0 to a positive value may be abrupt, linear or incipient, whereas in change detection the theoretical assumption in the derivation is that the change is abrupt.

• There is prior information on how large the threshold is.

An auxiliary test statistic g_t is introduced, which is used for alarm decisions using a threshold h. The purpose of the stopping rule is to give an alarm when g_t exceeds the threshold.

Figure 3.2. A signal observed with noise. The almost identical lines show the signal estimate from RLS, LMS and KF.


3.4.2. One-sided tests

As shown in Figure 3.3, the stopping rule first averages the inputs s_t to get a test statistic g_t, which is then thresholded. An alarm is given when g_t > h, and the test statistic is reset, g_t = 0. Indeed, any of the averaging strategies discussed in Section 3.2 can be used. In particular, the sliding window and GMA filters can be used.

As an example of a well-known combination, we can take the squared residual as a distance measure from the no-change hypothesis, s_t = ε_t², and average over a sliding window. Then we get a χ² test, where the distribution of g_t is χ²(L) if there is no change. There is more on this issue in subsequent chapters. Particular named algorithms are obtained for the exponential forgetting window and the finite moving average filter. In this way, stopping rules based on FMA or GMA are obtained. The methods so far have been linear in data, or, for the χ² test, quadratic in data. We now turn our attention to a fundamental and historically very important class of non-linear stopping rules. First, the Sequential Probability Ratio Test (SPRT) is given.

Algorithm 3.1 SPRT

g_t = g_{t−1} + s_t − ν, (3.24)
g_t = 0 and k̂ = t, if g_t < a < 0, (3.25)
g_t = 0 and t_a = t and alarm, if g_t > h > 0. (3.26)

Design parameters: drift ν, threshold h and reset level a.
Output: alarm time(s) t_a.

In words, the test statistic g_t sums up its input s_t, with the idea to give an alarm when the sum exceeds a threshold h. With a white noise input, the test statistic will drift away, similar to a random walk. There are two mechanisms to prevent this natural fluctuation. To prevent positive drifts, eventually yielding a false alarm, a small drift term ν is subtracted at each time instant. To prevent a negative drift, which would increase the time to detection after a change, the test statistic is reset to 0 each time it becomes less than a negative constant a.

The level crossing parameter a should be chosen to be small in magnitude, and it has been thoroughly explained why a = 0 is a good choice. This important special case yields the cumulative sum (CUSUM) algorithm.


Algorithm 3.2 CUSUM

g_t = g_{t−1} + s_t − ν, (3.27)
g_t = 0 and k̂ = t, if g_t < 0, (3.28)
g_t = 0 and t_a = t and alarm, if g_t > h > 0. (3.29)

Design parameters: drift ν and threshold h.
Output: alarm time(s) t_a and estimated change time k̂.
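For concreteness, here is a minimal Python sketch of Algorithm 3.2; following Example 3.2 below, the distance measure s_t is taken to be the signal itself, and the test signal and parameter values are illustrative.

```python
import numpy as np

def cusum(s, nu, h):
    """One-sided CUSUM test (Algorithm 3.2); returns alarm times and change-time estimates."""
    g, k_hat = 0.0, 0
    alarms, change_times = [], []
    for t, st in enumerate(s):
        g += st - nu                 # eq. (3.27): integrate the drift-compensated input
        if g < 0:                    # eq. (3.28): reset and move the change-time candidate
            g, k_hat = 0.0, t
        if g > h:                    # eq. (3.29): alarm and restart
            alarms.append(t)
            change_times.append(k_hat)
            g, k_hat = 0.0, t
    return alarms, change_times

# surveillance of a level change, with nu = 0.5 and h = 5 as in Example 3.2
y = np.concatenate([np.zeros(100), np.ones(100)]) + 0.2 * np.random.randn(200)
print(cusum(y, nu=0.5, h=5.0))
```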

Both algorithms were originally derived in the context of quality control (Page, 1954). A more recent reference, with a few variations analyzed, is Malladi and Speyer (1999).

In both SPRT and CUSUM, the alarm time t_a is the primary output, and the drift should be chosen as half of the critical level that must not be exceeded by the physical variable θ_t. A non-standard, but very simple, suggestion for how to estimate the change time is included in the parameter k̂, but remember that the change is not necessarily abrupt when using stopping rules in general. The estimate of the change time is logical for the following reason (although the change location problem does not seem to be dealt with in the literature in this context). When θ_t = 0, the test statistic will be reset to zero at almost every time instant (depending on the noise level and whether a < 0 is used). After a change to θ_t > ν, g_t will start to grow and will not be reset until the alarm comes, in which case k̂ is close to the correct change time. As a rule of thumb, the drift should be chosen as one half of the expected change magnitude.

Robustness and a decreased false alarm rate may be achieved by requiring several g_t > h. In quality control, this is called a run test.

Example 3.2 Surveillance using the CUSUM test

Suppose we want to make surveillance of a signal to detect if its level reaches or exceeds 1. The CUSUM test with ν = 0.5 and h = 5 gives an output illustrated in Figure 3.4. Shortly after the level of the signal exceeds 0.5, the test statistic starts to grow until it reaches the threshold, where we get an alarm. After this, we continuously get alarms for the level crossing. A run test, where five stops in the CUSUM test generate an alarm, would give an alarm at sample 150.


Algorithm 3.3 The CUSUM LS filter

θ̂_t = 1/(t − t₀) Σ_{k=t₀+1}^{t} y_k,
ε_t = y_t − θ̂_{t−1},
g_t^(1) = max(g_{t−1}^(1) + ε_t − ν, 0), alarm if g_t^(1) > h,
g_t^(2) = max(g_{t−1}^(2) − ε_t − ν, 0), alarm if g_t^(2) > h.

After an alarm, reset g_t^(1) = 0, g_t^(2) = 0 and t₀ = t.
Design parameters: ν, h.
Output: θ̂_t.

Since the output from the algorithm is a piecewise constant parameter estimate, the algorithm can be seen as a smoothing filter. Given the detected change times, the computed parameter estimates are the best possible off-line estimates.
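A minimal Python sketch of Algorithm 3.3 as reconstructed above follows; the restart of the mean estimate at the alarm sample is an implementation choice, not prescribed by the book.

```python
import numpy as np

def cusum_ls(y, nu, h):
    """CUSUM LS filter: recursive mean since the last alarm,
    monitored by a two-sided CUSUM on the residuals."""
    t0, g1, g2, est = -1, 0.0, 0.0, 0.0
    theta, alarms = np.zeros(len(y)), []
    for t, yt in enumerate(y):
        n = t - t0                       # number of samples in the current segment
        eps = yt - est                   # residual against the current mean estimate
        est += eps / n                   # recursive least squares mean over the segment
        g1 = max(g1 + eps - nu, 0.0)     # test for a positive change
        g2 = max(g2 - eps - nu, 0.0)     # test for a negative change
        if g1 > h or g2 > h:
            alarms.append(t)
            t0, g1, g2, est = t, 0.0, 0.0, yt   # restart estimation after the alarm
        theta[t] = est
    return theta, alarms
```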

Example 3.3 Surveillance using the CUSUM test

Consider the same signal as in Example 3.2. Algorithm 3.3 gives the signal estimate and the test statistics in Figure 3.5. We only get one alarm, at time 79, and the parameter estimate quickly adapts afterwards.

A variant is obtained by including a forgetting factor in the least squares estimation. Technically, this corresponds to an assumption of a signal that normally changes slowly (caught by the forgetting factor) but sometimes undergoes abrupt changes.


Figure 3.5. A signal and its measurements, with an estimate from the CUSUM filter in Algorithm 3.3. The solid line shows the test statistic g_t^(1) in the CUSUM test.

Algorithm 3.4 The CUSUM RLS filter

The filter statistics are as in Algorithm 3.3, with a forgetting factor in the least squares estimate. After an alarm, reset g_t^(1) = 0, g_t^(2) = 0 and θ̂_t = y_t.

Design parameters: ν, h.
Output: θ̂_t.

The point with the reset θ̂_t = y_t is that the algorithm forgets all old information instantaneously while, at the same time, avoiding bias and a transient. Finally, some general advice for tuning these algorithms, which are defined by combinations of the CUSUM test and adaptive filters, is given.


Tuning of CUSUM filtering algorithms:
• Start with a very large threshold h. Choose ν to one half of the expected change, or adjust ν such that g_t = 0 more than 50% of the time. Then set the threshold so that the required number of false alarms (this can be done automatically) or delay for detection is obtained.
• If faster detection is sought, try to decrease ν.
• If fewer false alarms are wanted, try to increase ν.
• If there is a subset of the change times that does not make sense, try to increase ν.

3.5. Likelihood based change detection

This section provides a compact presentation of the methods derived in Chapters 6 and 10, for the special case of signal estimation.

3.5.1. Likelihood theory

Likelihood is a measure of the likeliness of what we have observed, given the assumptions we have made. In this way, we can compare, on the basis of observed data, different assumptions on the change time. For the model

y_t = θ + e_t, e_t ∈ N(0, R), (3.30)

the likelihood is denoted p(y^t|θ, R) or l_t(θ, R). This should be read as "the likelihood for data y^t given the parameters θ, R". Independence and Gaussianity give

p(y^t|θ, R) = (2πR)^{−t/2} exp(−(1/(2R)) Σ_{i=1}^{t} (y_i − θ)²). (3.31)

The parameters are here nuisance, which means they are irrelevant for change detection. There are two ways to eliminate them if they are unknown: estimation or marginalization.


3.5.2. ML estimation of nuisance parameters

The Maximum Likelihood (ML) estimate of θ (or any parameter) is formally defined as

θ̂_t^{ML} = arg max_θ l_t(θ). (3.32)

Rewrite the exponent of (3.31) as a quadratic form in θ:

Σ_{i=1}^{t} (y_i − θ)² = t(\overline{y²} − 2θȳ + θ²) = t((θ − ȳ)² + \overline{y²} − (ȳ)²), (3.33)

where

1

t .

t

Y = -CYi2=1

ty”=:xY;2=1

denote sample averages of y and y2, respectively. From this, we see that

J M L l- Y = - X Y i

t .2=1

3.34)

which is, of course, a well-known result. Similarly, the joint ML estimate of θ and R is given by

(θ̂_t, R̂_t)^{ML} = arg max_{θ,R} l_t(θ, R)
             = (θ̂_t, arg max_R l_t(θ̂_t, R))
             = (θ̂_t, arg max_R (2πR)^{−t/2} e^{−t(\overline{y²} − (ȳ)²)/(2R)}).

By taking the logarithm, we get

−2 log l_t(θ̂, R) = t log(2π) + t log(R) + (t/R)(\overline{y²} − (ȳ)²).

Setting the derivative with respect to R equal to zero gives

R̂_t^{ML} = \overline{y²} − (ȳ)². (3.35)


Compare to the formula Var(X) = E(X²) − (E(X))². Note, however, that the ML estimate of R is not unbiased, which can be easily checked. An unbiased estimator is obtained by using the normalization factor 1/(t − 1) instead of 1/t in (3.35).

Example 3.4 Likelihood estimation

Let the true values in (3.30) be θ = 2 and R = 1. Figure 3.6 shows how the likelihood (3.31) becomes more and more concentrated around the true values as more observations become available. Note that the distribution is narrower along the θ axis, which indicates that successful estimation of the mean is easier than of the variance. Marginalization in this example can be elegantly performed numerically by projections. Marginalizations with respect to R and θ, respectively, are shown in Figure 3.7. Another twist in this example is to

Figure 3.6. Likelihood of the mean θ and variance R from 1, 5, 20 and 40 observations, respectively.


Figure 3.7. The upper row of plots shows the marginalized likelihood using 40 observations (the marginal distributions for R and θ). The lower row of plots shows how the ML estimates of θ and R converge in time.

study the time response of the maximum likelihood estimator, which is also illustrated in Figure 3.7.

The numerical evaluation is performed using the point-mass filter, see Bergman (1999), where a grid of the parameter space is used.

3.5.3. ML estimation of a single change time

We will now extend ML to change time estimation. Let p(y^t|k) denote the likelihood¹ for the measurements y^t = y_1, y_2, ..., y_t, given the change time k. The change time is then estimated by the maximum likelihood principle:

k̂ = arg max_k p(y^t|k). (3.36)

This is basically an off-line formulation. The convention is that k̂ = t should be interpreted as no change. If the test is repeated for each new measurement, an on-line version is obtained, where there could be efficient ways to compute the likelihood recursively.

If we assume that θ before and after the change are independent, the likelihood p(y^t|k) can be divided into two parts by using the rule P(A, B|C) = P(A|C)P(B|C), which holds if A and B are independent. That is,

p(y^t|k) = p(y^k|k) p(y^t_{k+1}|k) = p(y^k) p(y^t_{k+1}). (3.37)

¹This notation is equivalent to that commonly used for conditional distributions.


Here y^t_{k+1} = y_{k+1}, ..., y_t. The conditioning on a change at time k does not influence the likelihoods on the right-hand side and is omitted. That is, all that is needed is to compute the likelihood for the data in all possible splits of the data into two parts. The number of such splits is t, so the complexity of the algorithm increases with time. A common remedy to this problem is to use the sliding window approach and only consider k ∈ [t − L, t].

Equation (3.37) shows that change detection based on likelihoods breaks down to computing the likelihoods for batches of data, and then combining these using (3.37). Section 3.A contains explicit formulas for how to compute these likelihoods, depending on what is known about the model. Below is a summary of the results for the different cases that need to be distinguished in applications. Here MML refers to Maximum Marginalized Likelihood and MGL refers to Maximum Generalized Likelihood. See Section 3.A or Chapters 7 and 10 for details of how these are defined.

1. The parameter θ is unknown and the noise variance R is known:

−2 log l_t^{MGL}(R) = t log(2πR) + t R̂/R, (3.38)
−2 log l_t^{MML}(R) = (t − 1) log(2πR) + log(t) + t R̂/R. (3.39)

Note that R̂ is a compact and convenient way of writing (3.35), and should not be confused with an estimate of what is assumed to be known.

2. The parameter θ is unknown and the noise variance R is unknown, but known to be constant. The derivation of this practically quite interesting case is postponed to Chapter 7.

3. The parameter θ is unknown and the noise variance R is unknown, and might alter after the change time:

−2 log l_t^{MGL} ≈ t log(2π) + t + t log(R̂), (3.40)
−2 log l_t^{MML} ≈ t log(2π) + (t − 5) + log(t) − (t − 3) log(t − 5) + (t − 3) log(tR̂). (3.41)

4. The parameter θ is known (typically to be zero) and the noise variance R is unknown and abruptly changing:

−2 log l_t^{MGL}(θ) ≈ t log(2π) + t + t log(R̂), (3.42)
−2 log l_t^{MML}(θ) ≈ t log(2π) + (t − 4) − (t − 2) log(t − 4) + (t − 2) log(tR̂). (3.43)


Note that the last case is for detection of variance changes. The likelihoods above can now be combined before and after a possible change, and we get the following algorithm.

Algorithm 3.5 Likelihood based signal change detection

Define the likelihood l_{m:n} = p(y_m, ..., y_n) for a batch of data, where l_t = l_{1:t}. The log likelihood for a change at time k is given by

log l_t(k) = log l_{1:k} + log l_{k+1:t},

where each log likelihood is computed by one of the six alternatives (3.38)-(3.43).

The algorithm is applied to the simplest possible example below.

Example 3.5 Likelihood estimation

Consider the signal

y_t = 0 + e_t for 0 < t ≤ 250, and y_t = 1 + e_t for 250 < t ≤ 500.

The different likelihood functions, as a function of the change time, are illustrated in Figure 3.8 for the cases:

• unknown θ and R,
• R known and θ unknown,
• θ known both before and after the change, while R is unknown.

Note that MGL has problems when the noise variance is unknown.

The example clearly illustrates that marginalization is to be preferred to maximization of nuisance parameters in this example.
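As an illustration of the split-and-combine recipe, here is a minimal Python sketch of Algorithm 3.5 for case 1 above (θ unknown, R known), using expression (3.39) as rendered above; the test signal mirrors Example 3.5 but is otherwise illustrative.

```python
import numpy as np

def neg2_loglik_mml(seg, R):
    """-2 log marginalized likelihood of one batch, eq. (3.39); theta unknown, R known."""
    t = len(seg)
    R_hat = seg.var()                  # ML variance estimate (3.35) of the batch
    return (t - 1) * np.log(2 * np.pi * R) + np.log(t) + t * R_hat / R

def ml_change_time(y, R):
    """Evaluate every split (3.37) of the data and take the best change time."""
    N = len(y)
    cost = np.full(N, np.inf)
    for k in range(2, N - 1):          # keep at least two samples in each batch
        cost[k] = neg2_loglik_mml(y[:k], R) + neg2_loglik_mml(y[k:], R)
    return int(np.argmin(cost)), cost

y = np.concatenate([np.zeros(250), np.ones(250)]) + np.random.randn(500)
k_hat, _ = ml_change_time(y, R=1.0)
print(k_hat)   # should land near 250
```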

3.5.4. Likelihood ratio ideas

In the context of hypothesis testing, the likelihood ratios, rather than the likelihoods, are used. The Likelihood Ratio (LR) test is a multiple hypothesis test, where the different jump hypotheses are compared to the no-jump hypothesis pairwise. In the LR test, the jump magnitude is assumed to be known. The hypotheses under consideration are

H₀: no jump,
H₁(k, ν): a jump of magnitude ν at time k.


Figure 3.8. Maximum generalized and marginalized likelihoods, respectively (MGL above, MML below). Cases: unknown θ and R in (a), unknown θ in (b), and unknown R (θ known both before and after the change) in (c).

The test is as follows. Introduce the log likelihood ratio for the hypotheses as the test statistic:

g_t(k, ν) = 2 log [ p(y^t|H₁(k, ν)) / p(y^t|H₀) ]. (3.44)

The factor 2 is just used for later notational convenience. The notation g_t has been chosen to highlight that this is a distance measure between two hypotheses. We use the convention that H₁(t, ν) = H₀, so that k = t means no jump. Then the LR estimate can be expressed as

k̂^{LR} = arg max_k g_t(k, ν), (3.45)

when ν is known. There are again two possibilities for how to eliminate the unknown nuisance parameter ν. Maximization gives the GLR test, proposed for change detection by Willsky and Jones (1976), and marginalization results in the MLR test (Gustafsson, 1996).

Starting with the likelihood ratio in (3.44), the GLR test is a double maximization over k and ν,

k̂^{GLR} = arg max_k g_t(k, ν̂(k)), where ν̂(k) = arg max_ν g_t(k, ν).

By definition, ν̂(k) is the maximum likelihood estimate of ν, given a jump at time k. The jump candidate in the GLR test is accepted if

g_t(k̂, ν̂(k̂)) > h. (3.46)


The idea in MLR is to assume that ν is a random variable with a certain prior distribution p(ν). The likelihood ratio is then marginalized with respect to this prior.

As a comparison and summary of GLR and MLR, it follows from the derivations in Section 3.A.2 that the two tests are

g_t^{GLR}(k) = ν̂²(k) / (R/(t − k)), alarm if max_k g_t^{GLR}(k) > h, (3.47)
g_t^{MLR}(k) = g_t^{GLR}(k) − log(t − k) + log(2πR), alarm if max_k g_t^{MLR}(k) > 0. (3.48)

Here the prior in MLR is assumed flat, p(ν) = 1. The theoretical distribution of g_t^{GLR}(k) is χ²(1). The threshold can be taken from standard distribution tables. It can be remarked that the distribution is rather robust to the assumed Gaussian distribution. Since (1) ν̂ is an average, (2) averages converge according to the central limit theorem to a Gaussian distribution, and (3) the square of a Gaussian variable is χ²(1) distributed, we can conclude that the test statistic approaches the χ²(1) distribution asymptotically.

GLR and MLR are conceptually different, since they represent two different philosophies.

• GLR is a hypothesis test where g_t(k, ν̂(k)) is always positive. Thus a threshold is needed, and the size of the threshold determines how large a change is needed to get an alarm.

• MLR is an estimation approach, so there is no threshold. The threshold 0 in (3.48) is explicitly written out just to highlight the similarities in implementation (note, however, that the constant can be interpreted as a non-tunable threshold). No change is estimated if g_t(k) < 0 for all k < t, since g_t(t) = 0 and k = t means that no change has occurred yet. Here, the prior on ν can be used to specify what a sufficiently large change is.

More on these issues can be found in Chapters 7 and 10.
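To make the GLR recipe concrete, here is a minimal Python sketch of the scan over change times for the case θ = 0 before the change, using (3.63) and (3.64) as reconstructed in Section 3.A.2; the test signal and the threshold value in the usage line are illustrative.

```python
import numpy as np

def glr_scan(y, R):
    """GLR over all change times, theta = 0 before the change (eqs. (3.62)-(3.64))."""
    t = len(y)
    g = np.zeros(t)                       # g[t-1] = 0 matches the k = t (no jump) convention
    for k in range(t - 1):
        seg = y[k + 1:]                   # samples after a hypothesized change at k
        nu_hat = seg.mean()               # eq. (3.63): ML estimate of the magnitude
        g[k] = len(seg) * nu_hat**2 / R   # eq. (3.64)
    return int(np.argmax(g)), g

y = np.concatenate([np.zeros(40), np.ones(40)]) + 0.5 * np.random.randn(80)
k_hat, g = glr_scan(y, R=0.25)
print(k_hat, g[k_hat])   # accept the jump if g exceeds a chi2(1)-based threshold h
```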

3.5.5. Model validation based on sliding windows

The basic idea in this model validation approach is to compare two models, where one is obtained from a sliding window. Any averaging function can be used, but we will only discuss a rectangular sliding window here. This model is compared to a nominal model obtained from either a longer window or off-line


analysis (system identification or a physical model). Equation (3.49) illustrates two possibilities of how to split the data. The option in (3.49.a) is common in applications, but the one in (3.49.b) is somewhat simpler to analyze, and will be used here:

(a) M₁ is estimated from a sliding window preceding the last L samples, and M₂ from y_{t−L+1}, ..., y_t;
(b) M₁ is estimated from all past data y_1, ..., y_{t−L}, and M₂ from y_{t−L+1}, ..., y_t. (3.49)

The slow filter, which estimates M₁, uses data from a very large sliding window, or even uses all past data (since the last alarm). Then two estimates, θ̂₁ and θ̂₂, with variances P₁ and P₂, are obtained. If there is no abrupt change in the data, these estimates will be consistent. Otherwise, a hypothesis test will reject H₀ and a change is detected.

There are several ways to construct such a hypothesis test. The simplest one is to study the difference

θ̂₁ − θ̂₂ ∈ N(0, P₁ + P₂), under H₀, (3.50)

or, to make the test single sided,

(θ̂₁ − θ̂₂)² / (P₁ + P₂) ∈ χ²(1), under H₀,

from which a standard hypothesis test can be formulated. From (3.49.b), we immediately see that

θ̂₁ = 1/(t − L) Σ_{i=1}^{t−L} y_i,  P₁ = R/(t − L),
θ̂₂ = (1/L) Σ_{i=t−L+1}^{t} y_i,  P₂ = R/L,

so that

P₁ + P₂ = R/(t − L) + R/L = Rt/(L(t − L)) ≈ R/L, (3.51)

if t is large.


Example 3.6 Hypothesis tests: Gaussian and χ²

Suppose we want to design a test with a probability of false alarm of 5%, that is,

P(|θ̂₁ − θ̂₂| / √(P₁ + P₂) > h) = 0.05.

A table over the Gaussian distribution shows that P(X < 1.96) = 0.975, so that P(|X| < 1.96) = 0.95, where X ∈ N(0, 1). That is, the test becomes

|θ̂₁ − θ̂₂| > 1.96 √(P₁ + P₂).

Similarly, using a squared test statistic and a table over the χ²(1) distribution (P(X² < 3.84) = 0.95) gives

(θ̂₁ − θ̂₂)² / (P₁ + P₂) > 3.84 = 1.96²,

which of course is the same test as above.

The GLR test restricted to only considering the change time t − L gives a test statistic of the same form. This version of GLR, where no search for the change time is performed, is commonly referred to as Brandt's GLR (Appel and Brandt, 1983). The sliding window approach is according to (3.49.a).

Algorithm 3.6 Brandt's GLR

Filter statistics:


Test statistics:

g_t = g_{t−1} + s_t,

alarm if g_t > h.

After an alarm, reset g_t = 0 and t₀ = t.
Design parameters: h, L (also R_i if they are known).
Output: alarm time(s) t_a.
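The filter statistics of Algorithm 3.6 are not legible in this transcript, so the following Python sketch assumes the standard two-model log likelihood ratio distance for the change-in-mean model; the window handling and the small variance floor are implementation choices, not the book's.

```python
import numpy as np

def brandts_glr(y, L, h):
    """Sketch of Brandt's GLR: slow model M1 since the last alarm,
    fast model M2 from a sliding window of the last L samples."""
    g, t0 = 0.0, 0
    alarms = []
    for t in range(len(y)):
        if t - t0 < 2 * L:              # need enough data in both windows
            continue
        slow = y[t0:t - L + 1]          # M1, cf. the split in (3.49)
        fast = y[t - L + 1:t + 1]       # M2: the last L samples
        m1, R1 = slow.mean(), slow.var() + 1e-9
        m2, R2 = fast.mean(), fast.var() + 1e-9
        # distance measure: log likelihood of y[t] under M2 minus under M1
        s = 0.5 * (np.log(R1 / R2) + (y[t] - m1) ** 2 / R1 - (y[t] - m2) ** 2 / R2)
        g += s                          # cumulate as in the test statistics above
        if g > h:
            alarms.append(t)
            g, t0 = 0.0, t              # reset after an alarm
    return alarms
```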

The following general design rules for sliding windows apply here:

Design of sliding windows: the window size is coupled to the desired delay for detection. A good starting value is the specified or wanted mean delay for detection. Set the threshold to get the specified false alarm rate. Diagnosis:

• Visually check the variance error in the parameter estimate in the sliding window. If the variance is high, this may lead to many false alarms, and the window size should be increased.
• If the estimated change times look like random numbers, too little information is available, and the window size should be increased.
• If the change times make sense, the mean delay for detection might be improved by decreasing the window.

As a side comment, it is straightforward to make the updates in Algorithm 3.6 truly recursive. For instance, the noise variance estimate can be computed by R̂_t = \overline{y²} − (ȳ)², where both the signal mean and the squared signal mean are updated recursively. More approaches to sliding window change detection will be presented in Chapter 6.


3.6. Applications

Two illustrative on-line applications are given here. Similar applications on the signal estimation problem, which are based on off-line analysis, are given in Section 4.6. In the recent literature, there are applications of change detection in a heat exchanger (Weyer et al., 2000), a Combined Heat and Power (CHP) unit (Thomson et al., 2000) and the growth rate in a slaughter-pig production unit (Nejsum Madsen and Ruby, 2000).

3.6.1. Fuel monitoring

Consider the fuel consumption filter problem discussed in Examples 1.1, 1.4 and 1.8. Here Algorithms 3.2, 3.3 and 3.6 are applied. These algorithms are only capable of following abrupt changes. For incipient changes, the algorithm will give an alarm only after the total change is large or after a long time. In both algorithms, it is advisable to include data forgetting, as done in Algorithm 3.4, in the parameter estimation, to allow for a slow drift in the mean of the signal.

Table 3.1 shows how the above change detection algorithms perform on this signal. Figure 3.9 shows the result using the CUSUM LS Algorithm 3.3 with threshold h = 5 and drift ν = 0.5 but, in contrast to the algorithm, we here use the squared residuals as the distance measure s_t. Compared to the existing filter, the tracking ability has improved and, even more importantly, the accuracy gets better and better in segments with constant fuel consumption.

Figure 3.9. Filtering with the CUSUM LS algorithm.


Table 3.1. Simulation result for fuel consumption.

Method        | Design parameters     |    | MDL  | kFlops
Brandt's GLR  | h = 20, ν = 3, L = 3  |    | 8.40 | 60
CUSUM         | h = 3, ν = 2          |    | 8.39 | 20
LS            | λ = 0.9, quant = 0.3  |    |      | 19
ML            | σ² = 3                | 14 | 8.02 | 256

3.6.2. Paper refinery

Consider now the paper refinery problem from Section 2.1.2, where the power signal to the grinder needs to be filtered. As a preliminary analysis, RLS provides some useful information, as seen from the plots in Figure 3.11.

• There are two segments where the power clearly decreases quickly. Furthermore, there is a starting and a stopping transient that should be detected as change times.

• The noise level is fairly constant (0.05) during the observed interval.

The CUSUM LS approach in Algorithm 3.3 is applied, with h = 10 and ν = 1. Figure 3.12 shows g_t, which behaves well and reaches the threshold quickly after level changes. The resulting power estimates (both recursive and smoothed) are compared to the filter implemented in Sund in the right plots of Figure 3.10.

Figure 3.10. Estimated power signal using the filter currently used in Sund, and using the CUSUM LS filter and smoother, respectively.


3.A. Derivations

3.A.1. Marginalization of likelihoods

The idea in marginalization is to assign a prior to the unknown parameter and eliminate it.

Marginalization of θ

A flat prior p(θ) = 1 is used here for simplicity. Other proper density functions, which integrate to one, can of course be used, but only a few give explicit formulas. The resulting likelihood would only marginally differ from the expressions here. Straightforward integration yields

p(y^t|R) = ∫ p(y^t|θ, R) p(θ) dθ = (2πR)^{−t/2} ∫ e^{−t((θ − ȳ)² + R̂)/(2R)} dθ,

where the quadratic form (3.33) is used. Identification with the Gaussian distribution x ∈ N(ȳ, R/t), which integrates to one, gives

p(y^t|R) = (2πR)^{−(t−1)/2} t^{−1/2} e^{−tR̂/(2R)}. (3.52)

That is,

−2 log l_t(R) = −2 log p(y^t|R) = t R̂/R + (t − 1) log(2πR) + log(t). (3.53)

Marginalization of R

In the same way, conditioning the data distribution on the parameter only gives

p(y^t|θ) = ∫ p(y^t|θ, R) p(R) dR,


where p(R) = 1 has been assumed. Identification with the inverse Wishart distribution (or gamma distribution in this scalar case),

p(R) = (V/2)^{m/2} / Γ(m/2) · R^{−(m+2)/2} e^{−V/(2R)}, (3.54)

gives m = t − 2 and V = Σ(y_i − θ)². The distribution integrates to one, so

p(y^t|θ) = (2π)^{−t/2} Γ((t − 2)/2) (V/2)^{−(t−2)/2}. (3.55)

Hence,

−2 log l_t(θ) = −2 log p(y^t|θ) = t log(2π) − (t − 2) log(2) − 2 log Γ((t − 2)/2) + (t − 2) log(Σ(y_i − θ)²). (3.56)

This result can be somewhat simplified by using Stirling's formula,

Γ(n + 1) ≈ √(2π) n^{n+1/2} e^{−n}. (3.57)

The gamma function is related to the factorial by n! = Γ(n + 1). It follows that the following expression can be used for reasonably large t (t > 30, roughly):

2 log Γ((t − 2)/2) ≈ log(2π) + (t − 3) log((t − 4)/2) − (t − 4)
                  = −(t − 4) + (t − 2) log((t − 4)/2) + O(log(t)).

In the last equality, a term log((t − 4)/2) was added and compensated for in the O(log(t)) term. With this small trick, we get a simplified form of the likelihood as follows:

−2 log l_t(θ) ≈ t log(2π) + (t − 4) − (t − 2) log(t − 4) + (t − 2) log(Σ(y_i − θ)²). (3.58)

Marginalization of both θ and R

When eliminating both θ and R by marginalization, there is the option to start from either (3.55) or (3.52). It turns out that (3.52) gives simpler calculations:

p(y^t) = ∫ p(y^t|R) p(R) dR = ∫ (2πR)^{−(t−1)/2} t^{−1/2} e^{−tR̂/(2R)} dR.


Again, identification with the inverse Wishart distribution (3.54) gives m = t − 3 and V = tR̂, and

p(y^t) = (2π)^{−(t−1)/2} t^{−1/2} Γ((t − 3)/2) (tR̂/2)^{−(t−3)/2}.

Using Stirling's approximation (3.57) finally gives

−2 log p(y^t) ≈ t log(2π) + (t − 5) + log(t) − (t − 3) log(t − 5) + (t − 3) log(tR̂). (3.59)

3.A.2. Likelihood ratio approaches

The use of likelihood ratios is quite widespread in the change detection literature, and its main motivation is the Neyman-Pearson lemma (Lehmann, 1991). For change detection, it says that the likelihood ratio is the optimal test statistic to test H₀ against H₁(k, ν) for a given change ν and a given change time k. It does, however, not say anything about the composite test when all change times k are considered.

We will, without loss of generality, consider the case of known θ_t = 0 before the change and unknown θ_t = ν after the change,

θ_i = 0 if i ≤ k, and θ_i = ν if i > k.

The likelihood ratio

We denote the likelihood ratio by g_t to highlight that it is a distance measure between two hypotheses, which is going to be thresholded. The definition is

g_t(k, ν) = 2 log [ p(y^t|H₁(k, ν)) / p(y^t|H₀) ]. (3.60)

As can be seen from (3.31), all constants and all terms for i ≤ k will cancel, leaving

g_t(k, ν) = (1/R) ( Σ_{i=k+1}^{t} y_i² − Σ_{i=k+1}^{t} (y_i − ν)² )
          = (1/R) Σ_{i=k+1}^{t} (2ν y_i − ν²)
          = (2ν/R) Σ_{i=k+1}^{t} (y_i − ν/2). (3.61)


The generalized likelihood ratio

The Generalized Likelihood Ratio (GLR) is obtained by maximizing over the change magnitude:

g_t^{GLR}(k) = max_ν g_t(k, ν). (3.62)

From (3.61), the maximizing argument is seen to be

ν̂(k) = 1/(t − k) Σ_{i=k+1}^{t} y_i, (3.63)

which of course is the maximum likelihood estimate of the change, conditioned on the change time k. The GLR in (3.61) can then be written

g_t^{GLR}(k) = g_t(k, ν̂(k)) = (t − k) ν̂²(k)/R = ν̂²(k) / (R/(t − k)). (3.64)

The latter form, the squared estimate normalized by the estimation variance, will be extended to more general cases in Chapter 7.


Off-line approaches

4.1. Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.2. Segmentation criteria . . . . . . . . . . . . . . . . . . . . 91
     4.2.1. ML change time sequence estimation . . . . . . . . . 91
     4.2.2. Information based segmentation . . . . . . . . . . . 92
4.3. On-line local search for optimum . . . . . . . . . . . . . 94
     4.3.1. Local tree search . . . . . . . . . . . . . . . . . . 94
     4.3.2. A simulation example . . . . . . . . . . . . . . . . 95
4.4. Off-line global search for optimum . . . . . . . . . . . . 98
     4.4.1. Local minima . . . . . . . . . . . . . . . . . . . . 98
     4.4.2. An MCMC approach . . . . . . . . . . . . . . . . . . 101
4.5. Change point estimation . . . . . . . . . . . . . . . . . . 102
     4.5.1. The Bayesian approach . . . . . . . . . . . . . . . . 103
     4.5.2. The maximum likelihood approach . . . . . . . . . . . 104
     4.5.3. A non-parametric approach . . . . . . . . . . . . . . 104
4.6. Applications . . . . . . . . . . . . . . . . . . . . . . . . 106
     4.6.1. Photon emissions . . . . . . . . . . . . . . . . . . 106
     4.6.2. Altitude sensor quality . . . . . . . . . . . . . . . 107
     4.6.3. Rat EEG . . . . . . . . . . . . . . . . . . . . . . . 108

4.1. Basics

This chapter surveys off-line formulations of single and multiple change point estimation. Although the problem formulation yields algorithms that process data batch-wise, many important algorithms have natural on-line implementations and recursive approximations. This chapter is basically a projection of the more general results in Chapter 7 to the case of signal estimation. There are, however, some dedicated algorithms for estimating one change point off-line that apply to the current case of a scalar signal model. In the literature of mathematical statistics, this area is known as change point estimation.

In segmentation, the goal is to find a sequence k^n = (k_1, k_2, ..., k_n) of time indices, where both the number n and the locations k_i are unknown, such that


the signal can be accurately described as piecewise constant. That is,

y_t = θ_i + e_t, k_{i−1} < t ≤ k_i, i = 1, ..., n, (4.1)

is a good description of the observed signal y_t. The noise variance will be denoted E(e_t²) = R. The standard assumption is that e_t ∈ N(0, R), but there are other possibilities. Equation (4.1) will be the signal model used throughout this chapter, but it should be noted that an important extension, to the case where the parameter is slowly varying within each segment, is possible with minor modifications. However, equation (4.1) illustrates the basic ideas.

One way to guarantee that the best possible solution is found is to consider all possible segmentations k^n, estimate the mean in each segment, and then choose the particular k^n that minimizes an optimality criterion,

k̂^n = arg min_{n ≥ 1, 0 < k_1 < ... < k_n = N} V(k^n). (4.2)

The procedure is illustrated below:

Note that the segmentation k^n has n + 1 degrees of freedom. Two types of optimality criteria have been proposed:

• Statistical criteria: the maximum likelihood or maximum a posteriori estimate of k^n is studied.
• Information based criteria: the information of the data in each segment is V(i) (the sum of squared residuals), and the total information is the sum of these. Since the total information is minimized for the degenerated solution k^n = 1, 2, 3, ..., N, giving V(i) = 0, a penalty term is needed. Similar problems have been studied in the context of model structure selection, and from this literature Akaike's AIC and BIC criteria have been proposed for segmentation.

The real challenge in segmentation is to cope with the curse of dimensionality. The number of segmentations k^n is 2^N (there can be either a change or no change at each time instant). Here, several strategies have been proposed:

• Numerical searches based on dynamic programming or Markov Chain Monte Carlo (MCMC) techniques.
• Recursive local search schemes.


A section is devoted to each of these two approaches. First, a summary of possible loss functions, or segmentation criteria, V(k^n) is given. Chapter 7 is a direct continuation of this chapter for the case of segmentation of parameter vectors.

4.2. Segmentation criteria

This section describes the available statistical and information based optimization criteria.

4.2.1. ML change time sequence estimation

Consider first an off-line problem, where the sequence of change times k^n = k_1, k_2, ..., k_n is estimated from the data sequence y^N. Later, on-line algorithms will be derived from this approach. We will use the likelihood for the data, given the vector of change points, p(y^N|k^n).

Repeatedly using the independence of θ in different segments gives

p(y^N|k^n) = Π_{i=1}^{n} p(y_{k_{i−1}+1:k_i}), (4.4)

using the notation and segmentation in (4.3). The notational convention is that no change is estimated if n = 1 and k_1 = N. The likelihood in each segment is defined exactly as in Section 3.5.3:

• The advantage of dealing with multiple change times is that it provides an elegant approach to the start-up problem when a change is detected. A consequence of this is that very short segments can be found.
• The disadvantage is that the number of likelihoods (4.4) increases exponentially with time, as 2^t.


describe the signal up to a certain accuracy, given that both the parameter vectors and the prediction errors are stored with finite accuracy.

Both AIC and BIC are based on an assumption of a large number of data, and their use in segmentation, where each segment could be quite short, is questioned in Kitagawa and Akaike (1978). Simulations in Djuric (1994) indicate that AIC and BIC tend to over-segment data in a simple example where marginalized ML works fine.

4.3. On-line local search for optimum

Computing the exact likelihood or information-based estimate is computationally intractable because of the exponential complexity. Here a recursive on-line algorithm is presented and illustrated with a simulation example, also used in the following section.

4.3.1. Local tree search

The exponential complexity of the segmentation problem can be illustrated with a tree, as in Figure 4.1. Each branch marked with a '1' corresponds to an abrupt change, a jump, and '0' means no change in the signal level. In a local search, we make a time to time decision of which branches to examine. The other alternative, using global search strategies, examined in the next section, decides which branches to examine on an off-line basis.

The algorithm below explores a finite memory property, which has much in common with the famous Viterbi algorithm in equalization; see Algorithm 5.5 and the references Viterbi (1967) and Forney (1973).

Algorithm 4.1 Recursive signal segmentation

Choose an optimality criterion. The options are likelihoods, a posteriori probabilities or information-based criteria.

Compute the optimality criterion recursively, using a bank of least squares estimators of the signal mean value, where each filter is matched to a particular segmentation.

Use the following rules for maintaining the hypotheses and keeping the number of considered sequences (M) fixed:

a) Let only the most probable sequence split.
b) Cut off the least probable sequence, so that only M ones are left.
c) Assume a minimum segment length: let the most probable sequence split only if it is not too young.


Figure 4.1. The tree of jump sequences. A path labeled 0 corresponds to no jump, while 1 corresponds to a jump.

d) Assure that sequences are not cut off immediately after they are born: cut off the least probable sequences among those that are older than a certain minimum life-length, until only M are left.

The last two restrictions are optional, but might be useful in some cases.
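A compact Python sketch of Algorithm 4.1 follows, assuming the marginalized likelihood (3.39) (with known R) as the optimality criterion and omitting the optional rules (c) and (d); the test data in the usage line mimic the simulation example of Section 4.3.2.

```python
import numpy as np

def seg_cost(n, s, s2, R):
    """-2 log marginalized likelihood of one segment, cf. eq. (3.39)."""
    if n == 0:
        return 0.0
    rss = s2 - s * s / n
    return (n - 1) * np.log(2 * np.pi * R) + np.log(n) + rss / R

def recursive_segmentation(y, R, M=8):
    """Sketch of Algorithm 4.1: a bank of M segmentation hypotheses; the best
    one is allowed to split at every step, the worst ones are cut off."""
    # hypothesis: [change_times, n, s, s2, closed_cost]; n, s, s2 are the
    # sample count, sum and sum of squares of the currently open segment
    hyps = [[[], 0, 0.0, 0.0, 0.0]]
    total = lambda h: h[4] + seg_cost(h[1], h[2], h[3], R)
    for t, yt in enumerate(y):
        best = min(hyps, key=total)
        if best[1] > 0:   # rule (a): only the most probable sequence splits
            hyps.append([best[0] + [t], 0, 0.0, 0.0, total(best)])
        for h in hyps:    # recursive LS update of every open segment
            h[1] += 1
            h[2] += yt
            h[3] += yt * yt
        hyps = sorted(hyps, key=total)[:M]   # rule (b): keep the M best
    return min(hyps, key=total)[0]

# three abrupt changes of increasing magnitude, as in Section 4.3.2
y = np.concatenate([np.zeros(10), 2 * np.ones(10), 5 * np.ones(10), 9 * np.ones(10)])
print(recursive_segmentation(y + np.random.randn(40), R=1.0, M=8))
```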

The most important design parameter is the number of filters. It will be shown in a more general framework in Appendix 7.A that the exact ML estimate is computed using M = N. That is, the algorithm has quadratic, rather than exponential, complexity in the data size, which would be the consequence of any straightforward approach. It should be noted, as argued in Chapter 7, that all design parameters in the algorithm can be assigned good default values a priori, given only the signal model.

4.3.2. A simulation example

Consider the change in the mean signal in Figure 4.2. There are three abrupt changes, of magnitudes 2, 3 and 4, respectively. The noise is white and Gaussian with variance 1. The local search algorithm is applied to minimize the max-


Figure 4.2. A change in the mean signal with three abrupt changes of increasing magnitude.

imum likelihood criterion with known noise variance in (4.10). The design parameters are M parallel filters with life-length 4 and minimum segment length 0. The plots in Figure 4.3 show the cases M = 5 and M = 8, respectively. The plot mimics Figure 4.1, but is 'upside down'. Each line represents one hypothesis and shows how the number of change points evolves for that hypothesis. Branches that are cut off have an open end, and the most likely branch at each time instant is marked with a circle. This branch splits into a new change hypothesis. The upper plot shows that, in the beginning, there is one filter that performs best, and the other filters are used to evaluate change points at each time instant. After having lived for three samples without becoming the most likely, they are cut off and a new one is started. At time 22 one filter reacts, and at time 23 the correct change time is found. After the last change, it takes three samples until the correct hypothesis becomes the most likely.

Unfortunately, the first change is not found with M = 5. It takes more than 3 samples to prove a small change in the mean. Therefore, M is increased to 8 in the lower plot. Apparently, six samples are needed to prove a change at time 10.

Finally, we compute the exact ML estimate using M = 40 = N (using a result from Appendix 7.A). Figure 4.4 shows how the hypotheses examine all branches that need to be considered.


Figure 4.3. Evolution of change hypotheses for a local search with M = 5 and M = 8, respectively. A small offset is added to the number of change points for each hypothesis. Each line corresponds to one hypothesis, and the one marked with 'o' is the most likely at this time. By increasing the number of parallel filters, the search becomes more efficient, and the first and smallest change at sample 10 can be found.

Figure 4.4. Evolution of change hypotheses for a local search with M = 40. A small offset is added to the number of change points for each hypothesis.


4.4. Off-line global search for optimum

There are essentially two possibilities for finding the optimum of a loss function

max_{k^n} V(k^n)

from N measurements:

• Gradient based methods, where small changes in the current estimate k^n are evaluated.
• MCMC based methods, where random change points are computed.

4.4.1. Local minima

We discuss here why any global and numerical algorithm will suffer from localminima.

Example 4.1 Local minimum for one change point

The signal in Figure 4.5 has a change at the end of the data sequence. Assume that we model this noise-free signal as a change in the mean model with Gaussian noise. Assume we start with n = 1 and want to find the best possible change point. If the initial estimate of k is small, the likelihood in Figure 4.5 shows that we will converge to k = 0, which means no change.

This example shows that we must do an exhaustive search for the change point. However, this might not improve the likelihood if there are two change points, as the following example demonstrates.

Example 4.2 Local minimum for two change points

The signal in Figure 4.6 is constant and zero, except for a small segment in the middle of the data record. The global minimum of the likelihood is attained for k^n = (100, 110). However, there is no single change point which will improve on the null hypothesis, since −log p(k) > −log p(k = 0) for all single k.

Such an example might motivate an approach where a complete search over one and two change points is performed. This will work in most cases but, as the last example shows, it is not guaranteed to work. Furthermore, the computational complexity will be cubic in time; there are N(N − 1)/2 pairs (k_1, k_2) to evaluate (plus N single change points), and each sequence requires a filter to be run, which in itself is linear in time. The local search in Algorithm 4.1 will find the optimum with N²/2 filter recursions (there are on average N/2 filters, each requiring N recursions).


Figure 4.5. Upper plot: signal. Lower plot: negative log likelihood −log p(k) with global minimum at k = 100 but local minimum at k = 0.

Figure 4.6. Upper plot: signal. Lower plot: negative log likelihood −log p(k) with global minimum at k^n = (100, 110) but local minimum at k = 0. No improvement for any single change point k = 1, 2, ..., N.


Example 4.3 Global search using one and two change points

Consider a change in the mean signal with two abrupt changes, as shown in Figure 4.7. The likelihood as a function of the two change times shows that the larger change at time 20 is more significant (the likelihood is narrower in one direction). Note that the likelihood is a symmetric function in k_1 and k_2.

In the example above, a two-dimensional exhaustive search finds the true maximum of the likelihood function. However, this does not imply that a complete search over all combinations of one or two changes would find the true change times generally.

Example 4.4 Counterexample of convergence of global search

Consider a signal with only two non-zero elements, +A and −A, respectively, surrounded by M zeros on each side; see Figure 4.8 for an example. Assume the following exponential distribution for the noise:

p(e) ∝ e^{−√|e|}.

Then the negative log likelihood for no change point is

−log p(y|k^n = 0) = C + 2√A.

As usual, k^n = 0 is the notation for no change. The best way to place two change points is to pick up one of the spikes, and leave the other one as 'noise'.

Figure 4.7. A signal with two changes with different amplitudes (a), and the likelihood as a function of k_1 and k_2 (b).


Figure 4.8. The signal in the counterexample with M = 10.

The estimated mean in the segment that is left with a spike is then A/M. The residuals in two of the segments are identically zero, so their likelihoods vanish (except for a constant). The likelihood for the whole data set is thus essentially the likelihood for the segment with a spike. For instance, we have

−log p(y|k^n = (M, M + 1)) ≈ 3C + √A + (M − 1)√(A/M),

which for large M is larger than the negative log likelihood given no change at all. Thus, we will never find the global optimum by trying all combinations of one and two change points. In this example, we have to make a complete search for three change points.

4.4.2. An MCMC approach

Markov Chain Monte Carlo (MCMC) approaches are surveyed in Chapter 12 and applied to particular models here and in Chapters 7 and 10. The MCMC algorithm proposed in Fitzgerald et al. (1994) for signal estimation is a combination of Gibbs sampling and the Metropolis algorithm. The algorithm below is based solely on the knowledge of the likelihood function for data, given a certain segmentation.

It is the last step in the algorithm, where a random rejection sampling is

applied, which defines the Metropolis algorithm: the candidate will be rejectedwith large probability if its value is unlikely.


Algorithm 4.2 MCMC signal segmentation

Decide the number of changes n; then:

1. Iterate Monte Carlo run i.

2. Iterate the Gibbs sampler over the components j of k^n, drawing a random number from

q(k_j | k^n except k_j).

Denote the new candidate sequence kbar^n. The proposal distribution q may be taken as flat, or Gaussian centered around the previous estimate.

3. The candidate kbar^n is accepted with probability

min(1, p(y | kbar^n) / p(y | k^n)).

That is, the candidate sequence is always accepted if it increases the likelihood.

After the burn-in time (convergence), the distribution of change times can be computed by Monte Carlo techniques.

Note that there are no design parameters at all, except the number of changes n, and that one has to decide what the burn-in time is.
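To make the steps concrete, here is a minimal sketch of Algorithm 4.2 in Python for a changing-mean signal with Gaussian noise of unit variance, so that the negative log likelihood of a segmentation is, up to a constant, half the residual energy around the segment means. The flat proposal, the function names and all numerical values are choices made for this illustration only.

```python
import numpy as np

def neg_log_lik(y, cps):
    """-log p(y | k^n) for a piecewise constant mean with Gaussian noise
    of variance 1, up to an additive constant."""
    bounds = [0] + sorted(cps) + [len(y)]
    V = 0.0
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        seg = y[lo:hi]
        V += 0.5 * np.sum((seg - seg.mean()) ** 2)
    return V

def mcmc_segmentation(y, n, n_iter=500, rng=np.random.default_rng(0)):
    N = len(y)
    k = sorted(rng.choice(np.arange(1, N), size=n, replace=False))
    trace = []
    for _ in range(n_iter):                 # Monte Carlo runs (step 1)
        for j in range(n):                  # Gibbs sweep over components (step 2)
            cand = k.copy()
            cand[j] = rng.integers(1, N)    # flat proposal for k_j
            if len(set(cand)) < n:
                continue
            # Metropolis acceptance (step 3): always accept if likelihood grows
            ratio = np.exp(neg_log_lik(y, k) - neg_log_lik(y, cand))
            if rng.uniform() < min(1.0, ratio):
                k = sorted(cand)
        trace.append(list(k))
    return np.array(trace)

# Two-change example of the same kind as in Figure 4.7
rng = np.random.default_rng(1)
y = np.concatenate([np.zeros(10), 5 * np.ones(10), 2 * np.ones(10)])
y += rng.standard_normal(len(y))
trace = mcmc_segmentation(y, n=2)
print(trace[100:].mean(axis=0))   # averages after a burn-in of 100 iterations
```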

Example 4.5 MCMC search for two change points

Consider again the signal in Figure 4.7. Algorithm 4.2 provides a histogram over estimated change times, as shown in Figure 4.9. The left plot shows the considered jump sequence in each MC iteration. The burn-in time is about 60 iterations here. After this, the change at time 20 is very significant, while the smaller change at time 10 is sometimes estimated as 9.

4.5. Change point estimation

The detection problem is often recognized as change point estimation in the statistical literature. The assumption is that the mean of a white stochastic process changes at time k under H_1(k):

y_t = θ_0 + e_t, t = 1, 2, ..., k,
y_t = θ_1 + e_t, t = k+1, ..., N.


Figure 4.9. Result of the MCMC Algorithm 4.2. The left plot (a) shows the examined jump sequence in each iteration; the best encountered sequence is marked with dashed lines. The right plot (b) shows a histogram over all considered jump sequences. The burn-in time is not excluded in the histogram.

The following summary is based on the survey paper Sen and Srivastava (1975), where different procedures to test H0 against H1 are described. The methods are off-line, and only one change point may exist in the data. The Bayesian and likelihood methods are closely related to the already described algorithms. However, the non-parametric approaches below are interesting and unique for this problem.

The following sub-problems are considered:

P1: θ1 > θ0, where θ0 is unknown.
P2: θ1 > θ0, where θ0 = 0 is known.
P3: θ1 ≠ θ0, where θ0 is unknown.
P4: θ1 ≠ θ0, where θ0 = 0 is known.

4.5.1. The Bayesian approach

A Bayesian approach, where the prior probabilities of all hypotheses are the same, gives test statistics of the form

g = Σ_{t=2}^{N} (t-1)(y_t - ȳ)   for P1,
g = Σ_{t=2}^{N} (t-1) y_t   for P2,


which can equivalently be written, after changing the order of summation, as double sums Σ_{k=1}^{N-1} Σ_{t=k+1}^{N} (·) over all hypothesized change times. Here ȳ is the sample mean of y_t. If g > h, where h is a pre-specified threshold, one possible estimate of the jump time (change point) is given by

khat = arg max_{1<k<N} (1/(N-k)) Σ_{t=k+1}^{N} (y_t - ȳ)

for P3, and similarly for P4.

4.5.2. The maximum likelihood approach

Using the ML method, the test statistics are as follows:

P1: g^{ML} = max_k (ȳ_{k+1,N} - ȳ_{1,k}) / sqrt(k^{-1} + (N-k)^{-1}),
P3: g^{ML} = max_k (ȳ_{k+1,N} - ȳ_{1,k})^2 / (k^{-1} + (N-k)^{-1}),

where ȳ_{m,n} = (1/(n-m+1)) Σ_{t=m}^{n} y_t. For P2 and P4, ȳ_{1,k} is replaced by the known mean θ0 = 0. If H1 is decided, the jump time estimate is given by replacing max_k by arg max_k.

4.5.3. A non-parametric approach

The Bayesian and maximum likelihood approaches presumed a Gaussian distribution for the noise. Non-parametric tests for the first problem, assuming only whiteness, are based on a decision rule of the form

g = max_{1<k<N} s_k^{(i)} > h,


where the distance measure s_k^{(i)} is one of the following:

s_k^{(1)} = Σ_{t=k+1}^{N} sign(y_t - med(y)),

s_k^{(2)} = Σ_{t=k+1}^{N} Σ_{m=1}^{k} sign(y_t - y_m).

Here, med denotes the median and sign is the sign function. The first method is a variant of a sign test, while the other one is an example of ordered statistics. These are a kind of whiteness test, based on the idea that under H0, y_t is larger than its mean with probability 50%. Determining the expectation and variance of s_k^{(i)} is a standard probability theory problem. For instance, the distribution of s_k^{(1)} is hyper-geometric under H0. Again, if H1 is decided, the jump time estimate is given by the maximizing argument.

Example 4.6 Change point estimation

To get a feeling for the different test statistics, we compare the statistics for the formulation P1. Two signals are simulated: one which is zero, and one which undergoes an abrupt change in the middle of the data (k = 50) from zero to one. Both signals have white Gaussian measurement noise with variance 1 added. Figure 4.10 shows that all methods give a rather clear peak for the abruptly changing data (though the Bayesian statistic seems to have

Figure 4.10. Comparison of four change point estimation algorithms (Bayes, maximum likelihood, non-parametric I and II) for two signals: one with an abrupt change at time 50, and one without.


some probability of over-estimating the change time), and that there should be no problem in designing a good threshold.
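For illustration, the following sketch computes three P1-type test statistics on simulated data as in Example 4.6. The normalizations follow the reconstructed formulas above and should be read as one reasonable choice, not the only one; all function names are ours.

```python
import numpy as np

def bayes_stat(y):
    """Tail sums of y_t - mean(y), normalized by 1/(N-k), for k = 1..N-1."""
    N = len(y)
    d = y - y.mean()
    tail = np.cumsum(d[::-1])[::-1]        # tail[k] = sum_{t=k+1}^{N} d_t
    return np.abs(tail[1:] / (N - np.arange(1, N)))

def ml_stat(y):
    """(mean after k - mean before k) / sqrt(1/k + 1/(N-k))."""
    N = len(y)
    g = np.zeros(N - 1)
    for k in range(1, N):
        g[k - 1] = (y[k:].mean() - y[:k].mean()) / np.sqrt(1.0 / k + 1.0 / (N - k))
    return g

def sign_stat(y):
    """Sign-test statistic: tail sums of sign(y_t - med(y))."""
    s = np.sign(y - np.median(y))
    return np.cumsum(s[::-1])[::-1][1:]

rng = np.random.default_rng(0)
y = np.concatenate([np.zeros(50), np.ones(50)]) + rng.standard_normal(100)
for name, g in [("Bayes", bayes_stat(y)), ("ML", ml_stat(y)), ("sign", sign_stat(y))]:
    print(name, "k_hat =", np.argmax(g) + 1)   # all peak near k = 50
```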

The explicit formulas given in this section, using the maximum likelihood and Bayesian approaches, should in certain problem formulations come out as special cases of the more general formulas in Section 4.2, which can be verified by the reader.

4.6. Applications

4.6.1. Photon emissions

The signal model for photon emissions in Section 2.1.3 is a changing mean in noise,

y_t = θ_t + e_t.

Here E(e_t) = 0, so e_t is white noise. There is no problem in modifying the likelihood based algorithms with respect to any distribution of the noise. The exponential distribution has the nice property of offering explicit and compact expressions for the likelihood after marginalization (ML) or maximization (MGL). The standard algorithm assumes Gaussian noise, but can still be used with a good result, though. This is an illustration of the robustness of the algorithms with respect to an incorrectly modeled noise distribution. To save time, and to improve the model, the time for 100 arrivals is used as the measurement y_t in the model. A processed signal sample is, thus, a sum of

Figure 4.11. The noise variance in the photon data is estimated by applying a forgetting factor low-pass filter with λ = 0.85 to the measured arrival rates. The stationary values indicate a variance of slightly less than 1.


Figure 4.12. In (a), the time differences between 100 arrivals are shown together with a segmentation. The quantity is a scaled version of the mean time between arrivals (the intensity). In (b), the recursively estimated and segmented values of the intensity parameter are shown, with a low-pass filtered version of the arrival times.

100 exponentially distributed variables, which can be well approximated by a Gaussian variable according to the central limit theorem. To design the detector, a preliminary data analysis is performed. An adaptive filter with forgetting factor λ = 0.85 is applied to y_t. The squared residuals from this filter are filtered with the same filter to get an idea of the measurement noise variance R. The result is shown in Figure 4.11. Clearly, there are non-stationarities in the signal that should be discarded at this step, and the stationary values seem to be slightly smaller than 1. Now, the filter bank ML algorithm is applied with M = 3 parallel filters and an assumed measurement noise variance of R = 1. Figure 4.12 shows that a quite realistic result is obtained, and this algorithm might be used in real-time automatic surveillance.

4.6.2. Altitude sensor quality

The altitude estimate in an aircraft is computed mainly by dead reckoning using the inertial navigation system. To avoid drift in the altitude, the estimator is stabilized by air pressure sensors in Pitot tubes as a barometric altitude sensor. This gets increased variance when Mach 1 is passed, as explained in Section 2.2.1. One problem is to detect the critical regions of variance increase from measured data. Figure 4.13 shows the errors from a calibrated (no bias) barometric sensor, and low-pass filtered squared errors. Here a forgetting factor algorithm has been used with λ = 0.97.

Figure 4.13 also shows the same low-pass filtered variance estimate (in logarithmic scale) and the result from the ML variance segmentation algorithm.


Part III: Parameter estimation



Adaptive filtering

5.1. Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.2. Signal models . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.2.1. Linear regression models . . . . . . . . . . . . . . . . . . 115
5.2.2. Pseudo-linear regression models . . . . . . . . . . . . . . 119
5.3. System identification . . . . . . . . . . . . . . . . . . . . . 121
5.3.1. Stochastic and deterministic least squares . . . . . . . . . 121
5.3.2. Model structure selection . . . . . . . . . . . . . . . . . . 124
5.3.3. Steepest descent minimization . . . . . . . . . . . . . . . 126
5.3.4. Newton-Raphson minimization . . . . . . . . . . . . . . . 127
5.3.5. Gauss-Newton minimization . . . . . . . . . . . . . . . . 128
5.4. Adaptive algorithms . . . . . . . . . . . . . . . . . . . . . 133
5.4.1. LMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
5.4.2. RLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
5.4.3. Kalman filter . . . . . . . . . . . . . . . . . . . . . . . . 142
5.4.4. Connections and optimal simulation . . . . . . . . . . . . 143
5.5. Performance analysis . . . . . . . . . . . . . . . . . . . . . 144
5.5.1. LMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
5.5.2. RLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
5.5.3. Algorithm optimization . . . . . . . . . . . . . . . . . . . 147
5.6. Whiteness based change detection . . . . . . . . . . . . . . 148
5.7. A simulation example . . . . . . . . . . . . . . . . . . . . 149
5.7.1. Time-invariant AR model . . . . . . . . . . . . . . . . . . 150
5.7.2. Abruptly changing AR model . . . . . . . . . . . . . . . . 150
5.7.3. Time-varying AR model . . . . . . . . . . . . . . . . . . 151
5.8. Adaptive filters in communication . . . . . . . . . . . . . . 153
5.8.1. Linear equalization . . . . . . . . . . . . . . . . . . . . . 155
5.8.2. Decision feedback equalization . . . . . . . . . . . . . . . 158
5.8.3. Equalization using the Viterbi algorithm . . . . . . . . . 160
5.8.4. Channel estimation in equalization . . . . . . . . . . . . . 163
5.8.5. Blind equalization . . . . . . . . . . . . . . . . . . . . . . 165
5.9. Noise cancelation . . . . . . . . . . . . . . . . . . . . . . . 167
5.9.1. Feed-forward dynamics . . . . . . . . . . . . . . . . . . . 167
5.9.2. Feedback dynamics . . . . . . . . . . . . . . . . . . . . . 171



5.10. Applications . . . . . . . . . . . . . . . . . . . . . . . . . 173
5.10.1. Human EEG . . . . . . . . . . . . . . . . . . . . . . . . 173
5.10.2. DC motor . . . . . . . . . . . . . . . . . . . . . . . . . 173
5.10.3. Friction estimation . . . . . . . . . . . . . . . . . . . . . 175
5.11. Speech coding in GSM . . . . . . . . . . . . . . . . . . . . 185
5.A. Square root implementation . . . . . . . . . . . . . . . . . 189
5.B. Derivations . . . . . . . . . . . . . . . . . . . . . . . . . . 190
5.B.1. Derivation of LS algorithms . . . . . . . . . . . . . . . . 191
5.B.2. Comparing on-line and off-line expressions . . . . . . . . 193
5.B.3. Asymptotic expressions . . . . . . . . . . . . . . . . . . . 199
5.B.4. Derivation of marginalization . . . . . . . . . . . . . . . 200

5.1. Basics

The signal model in this chapter is, in its most general form,

y_t = G(q; θ_t) u_t + H(q; θ_t) e_t, (5.1)

where G and H are linear filters, which can also be written in polynomial form. The noise is here assumed white with variance λ, and will sometimes be restricted to be Gaussian. Time-variability is modeled by the time-varying parameters θ_t.

The adaptive filtering problem is to estimate these parameters by an adaptive filter,

θ̂_{t+1} = θ̂_t + K_t ε_t,

where ε_t is an application dependent error from the model.

We point out particular cases of (5.1) of special interest, but first, three archetypical applications will be presented:

Consider first Figure 5.1(a). The main approach to system identification is to run a model in parallel with the true system, and the goal is to get F(θ) ≈ G. See Section 5.3.

The radio channel in a digital communication system is well described by a filter G(q; θ). An important problem is to find an inverse filter, and this problem is depicted in Figure 5.1(b). This is also known as the inverse system identification problem. It is often necessary to include an overall delay D. The goal is to get F(θ)G ≈ q^{-D}. In equalization, a feed-forward signal q^{-D}u_t, called the training sequence, is available during


a learning phase. In blind equalization, no training signal is available. The delay, as well as the order of the equalizer, are design parameters. Both equalization and blind equalization are treated in Section 5.8.

The noise cancelation, or Acoustic Echo Cancelation (AEC), problem in Figure 5.1(c) is to remove the noise component in y = s + v by making use of an external sensor measuring the disturbance u in v = Gu. The goal is to get F(θ) ≈ G, so that ŝ ≈ s. This problem is identical to system identification, which can be realized by redrawing the block diagram. However, there are some particular twists unique to noise cancelation. See Section 5.9.

Literature

There are many books covering the area of adaptive filtering. Among those most cited, we mention Alexander (1986), Bellanger (1988), Benveniste et al. (1987b), Cowan and Grant (1985), Goodwin and Sin (1984), Hayes (1996), Haykin (1996), C.R. Johnson (1988), Ljung and Soderstrom (1983), Mulgrew and Cowan (1988), Treichler et al. (1987), Widrow and Stearns (1985) and Young (1984).

Survey papers of general interest are Glentis et al. (1999), Sayed and Kailath (1994) and Shynk (1989). Concerning the applications, system identification is described in Johansson (1993), Ljung (1999) and Soderstrom and Stoica (1989), equalization in the books Gardner (1993), Haykin (1994), Proakis (1995) and the survey paper Treichler et al. (1996), and finally acoustic echo cancelation in the survey papers Breining et al. (1999), Elliott and Nelson (1993) and Youhong and Morris (1999).

5.2. Signal models

In Chapter 2, we have seen a number of applications that can be recast to estimating the parameters in linear regression models. This section summarizes more systematically the different special cases of linear regressions and possible extensions.

5.2.1. Linear regression models

We here point out some common special cases of the general filter structure (5.1) that can be modeled as linear regression models, characterized by a regression vector φ_t and a parameter vector θ. The linear regression is defined by

y_t = φ_t^T θ_t + e_t.


Figure 5.1. Adaptive filtering applications: (a) system identification, where the goal is to get a perfect system model F(θ) = G; (b) equalization, where the goal is to get a perfect channel inverse F(θ) = G^{-1}, in which case the transmitted information is perfectly recovered; (c) noise cancelation, where the goal is to get a perfect model F(θ) = G of the acoustic path from disturbance to listener.


The measurement y_t is assumed to be scalar, and a possible extension to the multi-variable case is given at the end of this section.

The most common model in communication applications is the Finite Impulse Response (FIR) model:

y_t = b_1 u_{t-1} + ... + b_n u_{t-n} + e_t,
φ_t = (u_{t-1}, ..., u_{t-n})^T,
θ = (b_1, b_2, ..., b_n)^T.

To explicitly include the model order, FIR(n) is a standard shorthand notation. It is natural to use this model for communication channels, where echoes give rise to the dynamics. It is also the dominating model structure in real-time signal processing applications, such as equalization and noise canceling.

Example 5.1 Multi-path fading

In mobile communications, multi-path fading is caused by reflections, or echoes, in the environment. This specular multi-path is illustrated in Figure 5.2. Depending upon where the reflections occur, we get different phenomena:

Local scattering occurs near the receiver. Here the difference in arrival time of the different rays is less than the symbol period, which means that no dynamic model can describe the phenomenon in discrete time. Instead, the envelope of the received signal is modeled as a stochastic variable with a Rayleigh distribution or a Rice distribution. The former distribution arises when the receiver is completely shielded from the transmitter, while the latter includes the effect of a stronger direct ray. The dynamical effects of this 'channel' are much faster than the symbol frequency and imply a distortion of the waveform. This phenomenon is called frequency selective fading.

Near-field scattering occurs at an intermediate distance between the transmitter and receiver. Here the difference in arrival time of the different rays is larger than the symbol period, so a discrete time echo model can be used to model the dynamic behaviour. First, in continuous time, the scattering can be modeled as

y(t) = Σ_k h_k u(t - τ_k),


Figure 5.2. Multi-path fading is caused by reflections in the environment.

where τ_k are real-valued time delays rather than multiples of a sample interval. This becomes a FIR model after sampling to discrete time.

Far-field scattering occurs close to the transmitter. The received rays can be treated as just one ray.

Good surveys of multi-path fading are Ahlin and Zander (1998) and Sklar (1997), while identification of such a radio channel is described in Newson and Mulgrew (1994).

For modeling time series, an Auto-Regressive (AR) model is often used:

y_t = -a_1 y_{t-1} - ... - a_n y_{t-n} + e_t, (5.9)
φ_t = (-y_{t-1}, ..., -y_{t-n})^T, (5.10)
θ = (a_1, a_2, ..., a_n)^T. (5.11)

AR(n) is a shorthand notation. This is a flexible structure for many real-world signals like speech signals (Section 2.4.4), seismical data (Section 2.4.3) and biological data (Sections 2.4.1 and 2.4.2). One particular application is to use the model for spectral analysis as an alternative to transform based methods.

Example 5.2 Speech modeling

Speech is generated in three different ways. Voiced sound, like all vowels and 'm', originates in the vocal cords. In signal processing terms, the vocal cords generate pulses which are modulated in the throat and mouth. Unvoiced sound, like 's' and 'v', is a modulated air stream, where the air pressure from the lungs can be modeled as white noise. Implosive sound, like 'k' and 'b', is generated by building up an air pressure which is suddenly released.

In all three cases, the human vocal system can be modeled as a series of cylinders and an excitation source (the 'noise' e_t) which is either a pulse train, white noise or a pulse. Each cylinder can be represented by a second order AR model, which leads to a physical motivation of why AR models are suitable for speech analysis and modeling. Time-variability in the parameters is explained by the fact that the speaker is continuously changing the geometry of the vocal tract.

In control and adaptive control, where there is a known control signal u_t available and where e_t represents measurement noise, the Auto-Regressive model with eXogenous input (ARX) is common:

y_t = -a_1 y_{t-1} - ... - a_{n_a} y_{t-n_a} + b_1 u_{t-1} + ... + b_{n_b} u_{t-n_b} + e_t, (5.12)
φ_t = (-y_{t-1}, ..., -y_{t-n_a}, u_{t-1}, ..., u_{t-n_b})^T, (5.13)
θ = (a_1, ..., a_{n_a}, b_1, ..., b_{n_b})^T.

ARX(n_a, n_b, n_k) is a compact shorthand notation, where n_k denotes the delay. This structure does not follow in a straightforward way from physical modeling, but is rather a rich structure whose main advantage is that there are simple estimation algorithms for it.

5.2.2. Pseudo-linear regression models

In system modeling, physical arguments often lead to the deterministic signal part of the measurements being expressed as a linear filter G(q; θ) acting on the input. The main difference between commonly used model structures is how and where the noise enters the system. Possible model structures, that do not exactly fit the linear regression framework, are ARMA, OE and ARMAX models. These can be expressed as a pseudo-linear regression, where the regressor φ_t(θ) depends on the parameter.


The AR model has certain shortcomings for some other real-world signals that are less resonant. Then the Auto-Regressive Moving Average (ARMA) model might be better suited:

y_t + a_1 y_{t-1} + ... + a_{n_a} y_{t-n_a} = e_t + c_1 e_{t-1} + ... + c_{n_c} e_{t-n_c}. (5.16)

The Output Error (OE) model, which is of the Infinite Impulse Response (IIR) type, is defined as additive noise on the signal part:

y_t = G(q; θ) u_t + e_t, (5.19)
H(q; θ) = 1, (5.20)
φ_t(θ) = (-y_{t-1} + e_{t-1}, ..., -y_{t-n_f} + e_{t-n_f}, u_{t-1}, ..., u_{t-n_b})^T, (5.21)
θ = (f_1, f_2, ..., f_{n_f}, b_1, b_2, ..., b_{n_b})^T. (5.22)

Note that the regressor contains the noise-free output, which can be written y_t - e_t. That is, the noise never enters the dynamics. The OE models follow naturally from physical modeling of systems, assuming only measurement noise as the stochastic disturbance.

For modeling systems where the measurement noise is not white, but still more correlated than that described by an ARX model, an Auto-Regressive Moving Average model with eXogenous input (ARMAX) is often used:

y_t + a_1 y_{t-1} + ... + a_{n_a} y_{t-n_a} = b_1 u_{t-1} + ... + b_{n_b} u_{t-n_b} + e_t + c_1 e_{t-1} + ... + c_{n_c} e_{t-n_c}, (5.23)
φ_t(θ) = (-y_{t-1}, ..., -y_{t-n_a}, u_{t-1}, ..., u_{t-n_b}, e_{t-1}, ..., e_{t-n_c})^T. (5.24)

This model has found a standard application in adaptive control.

The common theme in ARMA, ARMAX and OE models is that they can be written as a pseudo-linear regression

y_t = φ_t^T(θ) θ + e_t, (5.27)

where the regressor depends on the true parameters. The parameter dependence comes from the fact that the regressor is a function of the noise. For an


ARMA model, the regressor in (5.17) contains e_t, which can be computed as

e_t = (A(q) / C(q)) y_t,

and similarly for ARMAX and OE models.

The natural approximation is to just plug in the latest possible estimate of the noise. That is, replace e_t with the residuals ε_t,

e_t ≈ ε_t = (A(q; θ̂) / C(q; θ̂)) y_t.

This is the approach in the extended least squares algorithm described in the next section. The adaptive algorithms and change detectors developed in the sequel are mainly discussed with respect to linear regression models. However, they can be applied to OE, ARMA and ARMAX as well, with the approximation that the noise e_t is replaced by the residuals.

Multi-Input Multi-Output (MIMO) models are usually considered to be built up as n_y × n_u independent models, where n_y = dim(y) and n_u = dim(u), one from each input to each output. MIMO adaptive filters can thus be considered as a two-dimensional array of Single-Input Single-Output (SISO) adaptive filters.

5.3. System identification

This section overviews and gives some examples of optimization algorithms used in system identification in general. As it turns out, these algorithms are fundamental for the understanding and derivation of adaptive algorithms as well.

5.3.1. Stochastic and deterministic least squares

The algorithms will be derived from a minimization problem. Let

ε_t(θ) = y_t - ŷ_t = y_t - φ_t^T θ. (5.28)

Least squares optimization aims at minimizing a quadratic loss function V(θ),

θ̂ = arg min_θ V(θ).

The (generally unsolvable) adaptive filtering problem can be stated as minimizing the loss function

V(θ_t) = E ε_t^2(θ_t). (5.29)


The normal equations are found by differentiation, and the minimizing argument is thus

θ̂_t = (Σ_{k=1}^{t} φ_k φ_k^T)^{-1} Σ_{k=1}^{t} φ_k y_k. (5.33)

It is here assumed that the parameters are time-invariant, so the question is how to generalize the estimate to the time-varying case.

Example 5.4 Deterministic least squares solution for FIR model

For a second order FIR model, (5.33) becomes

θ̂_t = [ R̂_uu(0)  R̂_uu(1) ; R̂_uu(1)  R̂_uu(0) ]^{-1} [ R̂_yu(1) ; R̂_yu(2) ],

where the estimated covariances are defined as

R̂_uu(k) = (1/t) Σ_{s=1}^{t} u_s u_{s-k},   R̂_yu(k) = (1/t) Σ_{s=1}^{t} y_s u_{s-k}.

Note the similarity between stochastic and deterministic least squares. In the limit t → ∞, we have convergence θ̂_t → θ* under mild conditions.
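As a quick numerical illustration of (5.33), the sketch below estimates a second order FIR model by solving the normal equations on simulated data; the data generation is made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500
u = rng.standard_normal(N)
b1, b2 = 0.7, -0.3                      # true FIR(2) parameters
y = np.zeros(N)
y[2:] = b1 * u[1:-1] + b2 * u[:-2]      # y_t = b1*u_{t-1} + b2*u_{t-2}
y += 0.1 * rng.standard_normal(N)       # measurement noise

# Regression form: rows of Phi are phi_t^T = (u_{t-1}, u_{t-2})
Phi = np.column_stack([u[1:-1], u[:-2]])
Y = y[2:]

# Least squares estimate (5.33): solve the normal equations
theta = np.linalg.solve(Phi.T @ Phi, Phi.T @ Y)
print(theta)    # close to (0.7, -0.3)
```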

Example 5.5 AR estimation for rat EEG

Consider the rat EEG in Section 2.4.1, also shown in Figure 5.3. The least squares parameter estimate for an AR(2) model is

θ̂ = (-0.85, 0.40)^T,

corresponding to two complex conjugated poles in 0.43 ± 0.47i. The least squares loss function is V(θ̂) = 1.91, which can be interpreted as the energy in the model noise e_t. This figure should be compared to the energy in the signal itself, that is, the loss function without a model, V(0) = 3.60. This means that the model can explain roughly half of the energy in the signal.

We can evaluate the least squares estimate at any time. Figure 5.3 shows how the estimate converges. This plot must not be confused with the adaptive


Figure 5.3. Rat EEG and estimated parameters of an AR(2) model for each time instant.

algorithms later on, since there is no forgetting of old information here. If we try a higher order model, say AR(4), the loss function only decreases to V(θ̂) = 1.86. This means that it hardly pays off to use higher order models in this example.

5.3.2. Model structure selection

The last comment in Example 5.5 generalizes to an important problem: which is the best model order for a given signal? One of the most important conclusions from signal modeling, also valid for change detection and segmentation, is that the more free parameters in the model, the better the fit. In the example above, the loss function decreases when going from AR(2) to AR(4), but not significantly. That is the engineering problem: increase the model order until the loss function does not decrease significantly.

There are several formal attempts to get an objective measure of fit. All these can be interpreted as the least squares loss function plus a penalty term that penalizes the number of parameters. This is in accordance with the parsimonious principle (Ockham's razor). We have encountered this problem in Chapter 4, and a few penalty terms were listed in Section 4.2.2. These and some more approaches are summarized below, where d denotes the model order:

Akaike's Final Prediction Error (FPE) (Akaike, 1971; Davisson, 1965):

d̂ = arg min_d V_N(d) (1 + d/N) / (1 - d/N).


Akaike's Information Criterion A (AIC) (Akaike, 1969):

d̂ = arg min_d log(V_N(d)) + 2d/N.

This is asymptotically the same as FPE, which is easily realized by taking the logarithm of FPE.

The asymptotically equivalent criteria Akaike's Information Criterion B (BIC) (Akaike, 1977), Rissanen's minimum description length (MDL) approach (Rissanen, 1989), see Section 12.3.1, and the Schwartz criterion (Schwartz, 1978):

d̂ = arg min_d log(V_N(d)) + 2d log(N)/N.

Mallow's C_p criterion (Mallows, 1973) is

d̂ = arg min_d V_N(d)/R + 2d - N,

which assumes known noise variance R.

For time series with few data points, say 10-20, the aforementioned approaches do not work very well, since they are based on asymptotic arguments. In the field of econometrics, refined criteria have appeared. The corrected AIC (Hurvich and Tsai, 1989) is

d̂ = arg min_d log(V_N(d)) + 2d/N + 2(d+1)(d+2)/(N-d-2).

The criterion of Hannan and Quinn (1979):

d̂ = arg min_d log(V_N(d)) + 2d log(log(N))/N.

FPE and AIC tend to over-estimate the model order, while BIC and MDL are consistent. That is, if we simulate a model and then try to find its model order, BIC will find it, when the number of data N tends to infinity, with probability one. The criterion of Hannan and Quinn is also consistent. A somewhat different approach, yielding a consistent estimator of d, is based on the Predictive Least Squares (PLS) criterion (Rissanen, 1986). Here the unnormalized sum of squared prediction residuals is used:

V_N^{PLS}(d) = Σ_{t=m}^{N} (y_t - φ_t^T θ̂_{t-1})^2,

where m is a design parameter to exclude the transient. Compare this to the standard loss function, where the final estimate is used. Using (5.96), that sum of squared residuals can be shown to be a smaller number than PLS suggests. This difference makes PLS parsimonious. Consistency and asymptotic equality with BIC are proven in Wei (1992).
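The sketch below illustrates model order selection on a simulated AR(2) signal, computing V_N(d), AIC and BIC for a range of orders d; the plain batch least squares AR fit is our choice for the illustration.

```python
import numpy as np

def fit_ar(y, d):
    """Batch least squares AR(d) fit; returns theta and loss V = mean(eps^2)."""
    Phi = np.column_stack([-y[d - k - 1:len(y) - k - 1] for k in range(d)])
    Y = y[d:]
    theta = np.linalg.solve(Phi.T @ Phi, Phi.T @ Y)
    V = np.mean((Y - Phi @ theta) ** 2)
    return theta, V

rng = np.random.default_rng(0)
N = 400
y = np.zeros(N)
for t in range(2, N):   # simulate AR(2): y_t = 0.8 y_{t-1} - 0.16 y_{t-2} + e_t
    y[t] = 0.8 * y[t - 1] - 0.16 * y[t - 2] + rng.standard_normal()

for d in range(1, 6):
    _, V = fit_ar(y, d)
    aic = np.log(V) + 2 * d / N
    bic = np.log(V) + 2 * d * np.log(N) / N
    print(f"d={d}: V={V:.3f} AIC={aic:.3f} BIC={bic:.3f}")
# AIC/BIC are typically minimized at (or near) the true order d = 2
```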

5.3.3. Steepest descent minimization

The steepest descent algorithm is defined by

θ̂^{(i+1)} = θ̂^{(i)} - μ dV(θ̂^{(i)})/dθ. (5.34)

Hence, the estimate is modified in the direction of the negative gradient. In case the gradient is approximated using measurements, the algorithm is called a stochastic gradient algorithm.

Example 5.6 The steepest descent algorithm

Consider the loss function

V(x) = x_1^2 + x_1 x_2 + x_2^2.

The steepest descent algorithm in (5.34) becomes (replace θ by x)

x^{(i+1)} = x^{(i)} - μ (2x_1^{(i)} + x_2^{(i)}, x_1^{(i)} + 2x_2^{(i)})^T.

The left plot in Figure 5.4 shows the convergence (or learning curves) for different initializations, with μ = 0.03 and 100 iterations. A stochastic version is obtained by adding noise with variance 10 to the gradient, as illustrated in the right plot.

This example illustrates how the algorithm follows the gradient down to the minimum.
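A minimal implementation of Example 5.6, including the stochastic variant with a noisy gradient:

```python
import numpy as np

def grad(x):
    # gradient of V(x) = x1^2 + x1*x2 + x2^2
    return np.array([2 * x[0] + x[1], x[0] + 2 * x[1]])

def steepest_descent(x0, mu=0.03, n_iter=100, noise_std=0.0,
                     rng=np.random.default_rng(0)):
    x = np.array(x0, dtype=float)
    path = [x.copy()]
    for _ in range(n_iter):
        g = grad(x) + noise_std * rng.standard_normal(2)
        x -= mu * g                     # step along the negative gradient
        path.append(x.copy())
    return np.array(path)

print(steepest_descent([1.0, 1.5])[-1])                         # deterministic
print(steepest_descent([1.0, 1.5], noise_std=np.sqrt(10))[-1])  # stochastic
```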


Figure 5.4. Deterministic (left) and stochastic (right) steepest descent algorithms.

5.3.4. Newton-Raphson minimization

The Newton-Raphson algorithm,

θ̂^{(i+1)} = θ̂^{(i)} - μ [d²V(θ̂^{(i)})/dθ²]^{-1} dV(θ̂^{(i)})/dθ, (5.35)

usually has superior convergence properties compared to the steepest descent algorithm. The price is a computationally much more demanding algorithm, where the Hessian needs to be computed and also inverted. Sufficiently close to the minimum, all loss functions are approximately quadratic, and there the Newton-Raphson algorithm steps straight towards the minimum, as illustrated by the example below.

Example 5.7 The Newton-Raphson algorithm

Consider the application of the Newton-Raphson algorithm to Example 5.6 under the same premises. Figure 5.5 shows that the algorithm now finds the closest way to the minimum.

It is interesting to compare how the Hessian modifies the step-size. Newton-Raphson takes more equidistant steps, while the gradient algorithm takes huge steps where the gradient is large.
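The corresponding Newton-Raphson iteration for the same quadratic loss is sketched below; since the Hessian is constant here, one step with μ = 1 lands on the minimum.

```python
import numpy as np

H = np.array([[2.0, 1.0],   # Hessian of V(x) = x1^2 + x1*x2 + x2^2
              [1.0, 2.0]])

def newton_raphson(x0, mu=1.0, n_iter=5):
    x = np.array(x0, dtype=float)
    for _ in range(n_iter):
        g = np.array([2 * x[0] + x[1], x[0] + 2 * x[1]])
        x -= mu * np.linalg.solve(H, g)   # Newton step: H^{-1} * gradient
    return x

print(newton_raphson([1.0, 1.5]))   # converges to the minimum at the origin
```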

Models linear in the parameters (linear regressions) give a quadratic least squares loss function, which implies that convergence can be obtained in one


Figure 5.5. Deterministic (left) and stochastic (right) Newton-Raphson algorithms.

iteration by Newton-Raphson using μ = 1. On the other hand, model structures corresponding to pseudo-linear regressions can have loss functions with local minima, in which case initialization becomes an important matter.

Example 5.8 Newton-Raphson with local minima

The function

f(x) = x^5 - 6x^4 + 6x^3 + 20x^2 - 38x + 20

has a local and a global minimum, as the plot in Figure 5.6 shows. A few iterations of the Newton-Raphson algorithm (5.35) for the initializations x_0 = 0 and x_0 = 4 are illustrated in the plot by circles and stars, respectively.

5.3.5. Gauss-Newton minimization

Hitherto, the discussion holds for general optimization problems. Now, the algorithms will be applied to model estimation, or system identification. Notationally, we can merge stochastic and deterministic least squares by using

V(θ) = E ε_t^2(θ) = E (y_t - φ_t^T(θ)θ)^2, (5.36)

where E should be interpreted either as the expectation operator in (5.30), or as a computable approximation of it such as the averaging sum in (5.32), respectively.


Figure 5.6. The Newton-Raphson algorithm applied to a function with several minima, f(x) = x^5 - 6x^4 + 6x^3 + 20x^2 - 38x + 20, for x_0 = 0 (o) and x_0 = 4 (x).

For generality, we will consider the pseudo-linear regression case. The gradient and the Hessian are

dV(θ)/dθ = -2 E ψ_t(θ) ε_t(θ),
d²V(θ)/dθ² = 2 E ψ_t(θ) ψ_t^T(θ) - 2 E [dψ_t(θ)/dθ] ε_t(θ) ≈ 2 E ψ_t(θ) ψ_t^T(θ).

The last approximation gives the Gauss-Newton algorithm. The approximation is motivated as follows: first, close to the minimum the gradient should be uncorrelated with the residuals and not point in any particular direction, so the expectation of the neglected term should be zero. Secondly, the residuals should, with any weighting function, average to something very small compared to the other term, which is a quadratic form.

The gradient ψ_t(θ) = -dε_t(θ)/dθ depends upon the model. One approximation for pseudo-linear models is to use ψ_t(θ) ≈ φ_t(θ), which gives the extended least squares algorithm. The approximation is to neglect the term involving dφ_t(θ)/dθ


in the gradient. A related, and in many situations better, algorithm is the recursive maximum likelihood method given below without comments.

Algorithm 5.1 Gauss-Newton for ARMAX models

Consider the ARMAX model

A(q) y_t = B(q) u_t + C(q) e_t,

where the C(q) polynomial is assumed to be monic, with c_0 = 1. The Gauss-Newton algorithm updates

θ̂_t = θ̂_{t-1} + K_t ε_t.

The extended least squares algorithm uses

ψ_t = φ_t = (-y_{t-1}, ..., -y_{t-n_a}, u_{t-1}, ..., u_{t-n_b}, ε_{t-1}, ..., ε_{t-n_c})^T,
ε_t = y_t - φ_t^T θ̂_{t-1},

while the recursive maximum likelihood method uses the regressor filtered through the current noise model,

ψ_t = φ_t / C(q; θ̂_{t-1}).
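A compact sketch of extended least squares for an ARMA(1,1) model is given below. The RLS-type gain used for K_t is one standard choice of Gauss-Newton direction, not necessarily the exact gain of Algorithm 5.1.

```python
import numpy as np

def els_arma11(y, lam=1.0):
    """Recursive extended LS for y_t + a*y_{t-1} = e_t + c*e_{t-1}.
    theta = (a, c); the unknown noise e_t is replaced by residuals."""
    theta = np.zeros(2)
    P = 100 * np.eye(2)
    eps_prev = 0.0
    for t in range(1, len(y)):
        phi = np.array([-y[t - 1], eps_prev])   # pseudo-linear regressor
        eps = y[t] - phi @ theta                # prediction error
        K = P @ phi / (lam + phi @ P @ phi)
        theta = theta + K * eps
        P = (P - np.outer(K, phi @ P)) / lam
        eps_prev = y[t] - phi @ theta           # a posteriori residual
    return theta

rng = np.random.default_rng(0)
N, a, c = 2000, -0.5, 0.5
e = rng.standard_normal(N)
y = np.zeros(N)
for t in range(1, N):
    y[t] = -a * y[t - 1] + e[t] + c * e[t - 1]
print(els_arma11(y))    # approaches (a, c) = (-0.5, 0.5)
```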

Some practical implementation steps are given below:

A stopping criterion is needed to abort the iterations. Usually, this decision involves checking the relative change in the objective function V and the size of the gradient.

The step size μ is equal to unity in the original algorithm. Sometimes this is too large a step. For objective functions whose values decrease in the direction of the gradient for a short while, and then start to increase, a shorter step size is needed. One approach is to always test whether V decreases before updating the parameters. If not, the step size is halved, and the procedure is repeated. Another approach is to always optimize the step size. This is a scalar optimization which can be done relatively efficiently, and the gain can be a considerable reduction in the number of iterations.


Figure 5.7. 10 iterations of Gauss-Newton for an ARMA(1,1) model with θ_0 = (-0.5, 0.5) (a). The logarithm of the least squares loss function with the GN iterations marked as a path (b).

A thorough treatment of such algorithms is given in Ljung and Soderstrom (1983) and Ljung (1999).

A revealing example, with only one local minimum but with a non-quadratic loss function, is presented below.

Example 5.9 Gauss-Newton optimization of an ARMA(1,1) model

The ARMA(1,1) model below is simulated using Gaussian noise of length N = 200:

y(t) - 0.5 y(t-1) = e(t) + 0.5 e(t-1).

The Gauss-Newton iterations starting at the origin are illustrated both as an iteration plot and in the level curves of the loss function in Figure 5.7. The loss function is also illustrated as a mesh plot in Figure 5.8, which shows that there is one global minimum, and that the loss function has a quadratic behavior locally. Note that any ARMA model can be restricted to be stable and minimum phase, which implies that the intervals [-1, 1] for the parameters cover all possible ARMA(1,1) models. The non-quadratic form far from the optimum explains why the first few iterations of Gauss-Newton are sensitive to noise. In this example, the algorithm never reaches the optimum, which is due to the finite data length. The final estimation error decreases with simulation length.

To end this section, two quite general system identification examples are given, where the problem is to adjust the parameters of given ordinary differential equations to measured data.


Figure 5.8. The logarithm of the least squares loss function for the ARMA(1,1) model.

Example 5.10 Gauss-Newton identification of electronic nose signals

Consider the electronic nose for classifying bacteria in Section 2.9.1. We illustrate here how standard routines for Gauss-Newton minimization can be applied to general signal processing problems. Recall the signal model for the sensor signals given in predictor form. A main feature of many standard packages is that we do not have to compute the gradient ψ_t; the algorithm can compute it numerically. The result of an estimated model for three sensor signals was shown in Figure 2.21.

Example 5.11 Gauss-Newton identification of cell phone sales

Consider the sales figures for the cellular phones NMT in Section 2.9.2, where a non-linear differential equation is used as a signal model. This differential equation can be solved analytically and then discretized to suit the discrete time measurements. However, in many cases, there is either no analytical solution or it is very hard to find. Then one can


Figure 5.9. Convergence of signal modeling of the NMT 450 sales figures (thick solid line) using the Gauss-Newton algorithm. The initial estimate is the thin solid line, and the dashed thin lines show how each iteration improves the result. The thick dashed line shows the final model.

use a numerical simulation tool to compute the mapping from parameters to predictions ŷ_t(θ), and then proceed as in Example 5.10.

Figure 5.9 shows how the initial estimate successively converges to a curve very close to the measurements. A possible problem is local minima. In this example, we have to start with a θ very close to the best values to get convergence. Here we used θ_0 = (-0.0001, 0.1, max(y)), using the fact that the stationary solution must have y_t → θ_3, and then some trial and error for varying θ_1, θ_2. The final parameter estimate is θ̂ = (0.0011, 0.0444, 1.075)^T.

5.4. Adaptive algorithms

System identification by using off-line optimization performs an iterative minimization of the type

θ̂^{(i+1)} = θ̂^{(i)} + μ K^{(i)}(y_1, ..., y_N),

initiated at θ̂^{(0)}. Here K^{(i)} is either the gradient of the loss function Σ_{t=1}^{N} ε_t^2(θ), or the inverse Hessian times the gradient. As a starting point for motivating the adaptive algorithms, we can think of them as an iterative minimization where one


new data point is included in each iteration. That is, we let N = i above. However, the algorithms will neither become recursive nor truly adaptive by this method (since they will probably converge to a kind of overall average).

A better try is to use the previous estimate as the starting point in a new minimization. Taking the limited information in each measurement into account, it is logical to only make one iteration per measurement. That is, a generic adaptive algorithm derived from an off-line method can be written

θ̂_t = θ̂_{t-1} + μ K_t ε_t,
θ̂_0 = θ̂^{(0)}.

Here, only K_t needs to be specified.

5.4.1. LMS

The idea in the Least Mean Square (LMS) algorithm is to apply the steepest descent algorithm (5.34) to (5.29). Using (5.28), this gives the following algorithm.

Algorithm 5.2 LMS

For general linear regression models y_t = φ_t^T θ + e_t, the LMS algorithm updates the parameter estimate by the recursion

θ̂_t = θ̂_{t-1} + μ φ_t (y_t - φ_t^T θ̂_{t-1}). (5.37)

The design parameter μ is a user chosen step-size. A good starting value of the step size is μ = 0.01/Std(y).
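A direct implementation of (5.37) is sketched below, applied to AR(2) data of the kind used in the following example; the simulation setup is our own, with the sign convention of (5.9).

```python
import numpy as np

def lms(y, phi, mu=0.01):
    """LMS for a linear regression y_t = phi_t^T theta + e_t, as in (5.37)."""
    n = phi.shape[1]
    theta = np.zeros(n)
    trace = np.zeros((len(y), n))
    for t in range(len(y)):
        eps = y[t] - phi[t] @ theta
        theta = theta + mu * phi[t] * eps
        trace[t] = theta
    return trace

# Simulate y_t = 0.8 y_{t-1} - 0.16 y_{t-2} + e_t (two poles in 0.4)
rng = np.random.default_rng(0)
N = 1000
y = np.zeros(N)
for t in range(2, N):
    y[t] = 0.8 * y[t - 1] - 0.16 * y[t - 2] + rng.standard_normal()

phi = np.column_stack([-y[1:-1], -y[:-2]])   # phi_t = (-y_{t-1}, -y_{t-2})^T
trace = lms(y[2:], phi)
print(trace[-1])   # estimate of (a_1, a_2) = (-0.8, 0.16) in the (5.9) convention
```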

The algorithm is applied to data from a simulated model in the following example.

Example 5.12 Adaptive filtering with LMS

Consider an AR(2) model

y_t = -a_1 y_{t-1} - a_2 y_{t-2} + e_t,


Figure 5.10. Convergence of log(V_t) to the global optimum using LMS for an AR(2) model.

Figure 5.11. Convergence of LMS for an AR(2) model, averaged over 25 Monte Carlo simulations and illustrated as a time plot (a) and in the loss function's level curves (b).

simulated with a_1 = 0.8 and a_2 = 0.16 (two poles in 0.4). Figure 5.10 shows the logarithm of the least squares loss function as a function of the parameters. The LMS algorithm with a step size μ = 0.01 is applied to 1000 data items, and the parameter estimates are averaged over 25 Monte Carlo simulations.

Figure 5.11(a) shows the parameter convergence as a function of time, and Figure 5.11(b) the convergence in the level curves of the loss function.


There are certain variants of LMS. The Normalized LMS (NLMS) is

θ̂_t = θ̂_{t-1} + μ_t φ_t (y_t - φ_t^T θ̂_{t-1}), (5.38)

where

μ_t = μ / (α + φ_t^T φ_t), (5.39)

and α is a small number close to the machine precision. The main advantage of NLMS is that it gives simpler design rules and stabilizes the algorithm in case of energy increases in φ_t. The choice μ = 0.01 should always give a stable algorithm, independent of model structure and parameter scalings. An interpretation of NLMS is that it uses the a posteriori residual in LMS:

θ̂_t = θ̂_{t-1} + μ φ_t (y_t - φ_t^T θ̂_t). (5.40)

This formulation is implicit, since the new parameter estimate is found on both sides. Other proposed variants of LMS include:

- The leaky LMS algorithm regularizes the solution towards zero, in order to avoid instability in case of poor excitation:

θ̂_t = (1 - γ) θ̂_{t-1} + μ φ_t (y_t - φ_t^T θ̂_{t-1}). (5.41)

Here 0 < γ << 1 forces unexcited modes to approach zero.

- The sign-error algorithm, where the residual is replaced by its sign, sign(ε_t). The idea is to choose the step-size as a power of two, μ = 2^{-k}, so that the multiplications in 2^{-k} sign(ε_t) can be implemented as data shifts. In a DSP application, only additions are needed for the implementation. An interesting interpretation is that this is the stochastic gradient algorithm for the loss function V(θ) = E|ε_t|. This is a criterion that is more robust to outliers.

- The sign data algorithm, where φ_t is replaced by sign(φ_t) (component-wise sign), is another way to avoid multiplications. However, the gradient is now changed and the convergence properties are influenced.

- The sign-sign algorithm:

θ̂_t = θ̂_{t-1} + μ sign(φ_t) sign(y_t - φ_t^T θ̂_{t-1}). (5.42)

This algorithm is extremely simple to implement in hardware, which makes it interesting in practical situations where speed and hardware resources are critical parameters. For example, it is part of the CCITT standard for the 32 kbps modulation scheme ADPCM (adaptive differential pulse code modulation).


Figure 5.12. Averaging of a stochastic gradient algorithm implies asymptotically the same covariance matrix as for LS.

- Variable step size algorithms. Choices based on μ(t) = μ/t are logical approximations of the LS solution. In the case of time-invariant parameters, the LMS estimate will then converge. This type of algorithm is sometimes referred to as a stochastic gradient algorithm.

- Filtered regressors are often used in noise cancelation applications.

- Many modifications of the basic algorithm have been suggested to gain computational efficiency (Chen et al., 1999).

There are some interesting recent contributions to stochastic gradient algorithms (Kushner and Yin, 1997). One is based on averaging theory. First, choose the step size of LMS as

μ_t = μ t^{-γ}, γ < 1.

The step size decays more slowly than for a stochastic gradient algorithm, where γ = 1. Denote the output of the LMS filter θ̂_t. Secondly, the output vector is averaged,

θ̄_t = (1/t) Σ_{k=1}^{t} θ̂_k.

The series of linear filters is illustrated in Figure 5.12. It has been shown (Kushner and Yang, 1995; Polyak and Juditsky, 1992) that this procedure is asymptotically efficient, in that the covariance matrix will approach that of the LS estimate as t goes to infinity. The advantage is that the complexity is O(dt) rather than O(d²t). An application of a similar idea of series connection of two linear filters is given in Wigren (1998). Tracking properties of such algorithms are examined in Ljung (1994).

One approach to self-tuning is to update the step-size of LMS. The result is two cross-coupled LMS algorithms relying on a kind of certainty equivalence; each algorithm relies on the fact that the other one is working.


5.4.2. RLS

Figure 5.13. Convergence of RLS (λ = 0.99) for an AR(2) model, averaged over 25 Monte Carlo simulations and illustrated as a time plot (a) and in the loss function (b).

25 Monte Carlo simulations. Figure 5.13(a) shows the parameter convergence as a function of time, and Figure 5.13(b) the convergence in the level curves of the loss function. To slow down the transient, the initial P_0 is chosen as 0.1·I_2. With a larger value of P_0, we will get convergence in the mean after essentially two samples. It can be noted that a very large value, say P_0 = 100·I_2, essentially gives us NLMS (just simplify (5.45)) for a few recursions, until P_t becomes small.

Compared to LMS, RLS gives parameter convergence in the parameter plane as a straighter line, rather than a steepest descent curve. The reason for not being completely straight is the incorrect initialization P_0.

As for LMS, there will be practical problems when the signals are not exciting. The covariance matrix will become almost singular, and the parameter estimate may diverge. The solution is regularization, where the inverse covariance matrix is increased by a small scalar times the unit matrix:

R_t^δ = R_t + δI.

Note that this R_t^δ is not the same as the measurement covariance. Another problem is due to energy changes in the regressor. Speech signals modeled as AR models have this behavior. When the energy decreases in silent periods, it takes a long time for the matrix R_t^δ in (5.47) to adapt. One solution is to use the WLS estimator below.
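For reference, a minimal RLS implementation with forgetting factor λ in the standard gain/covariance form, with the large initialization P_0 discussed above:

```python
import numpy as np

def rls(y, phi, lam=0.99, P0=100.0):
    """RLS with forgetting factor for y_t = phi_t^T theta + e_t."""
    n = phi.shape[1]
    theta = np.zeros(n)
    P = P0 * np.eye(n)
    for t in range(len(y)):
        f = phi[t]
        eps = y[t] - f @ theta                  # prediction error
        K = P @ f / (lam + f @ P @ f)           # gain
        theta = theta + K * eps
        P = (P - np.outer(K, f @ P)) / lam      # covariance update
    return theta

# AR(2) example data, as in Example 5.13
rng = np.random.default_rng(0)
N = 1000
y = np.zeros(N)
for t in range(2, N):
    y[t] = 0.8 * y[t - 1] - 0.16 * y[t - 2] + rng.standard_normal()
phi = np.column_stack([-y[1:-1], -y[:-2]])
print(rls(y[2:], phi))
```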


In RLS, the expectation in (5.30) is approximated by

Ê ε_t^2(θ) = (1 - λ) Σ_{k=-∞}^{t} λ^{t-k} ε_k^2(θ). (5.48)

Another idea is to use a finite window instead of an exponential one,

Ê ε_t^2(θ) = (1/L) Σ_{k=t-L+1}^{t} ε_k^2(θ). (5.49)

This leads to Windowed Least Squares (WLS), which is derived in Section 5.B; see Lemma 5.1. Basically, WLS applies two updates for each new sample, so the complexity increases by a factor of two. A memory of the last L measurements is another drawback.

Example 5.14 Time-frequency analysis

Adaptive filters can be used to analyze the frequency content of a signal as a function of time, in contrast to spectral analysis, which is a batch method. Consider the chirp signal, which is often used as a benchmark example:

y_t = sin(2π t²/N), t = 0, 1, 2, ..., N.

Defining the momentaneous frequency as ω = d arg(y_t)/dt, the Fourier transform of a small neighborhood of t is concentrated around 4πt/N. Due to aliasing, a sampled version of the signal with sample interval 1 will have a folded frequency response, with a maximum frequency of π. Thus, the theoretical transforms, assuming continuous time and discrete time measurements, respectively, are shown in Figure 5.14. A non-parametric method based on FFT spectral analysis of data over a sliding window is shown in Figure 5.15(a). As the window size L increases, the frequency resolution increases at the cost of decreased time resolution. This well-known trade-off is related to Heisenberg's uncertainty principle: the product of time and frequency resolution is constant.

The parametric alternative is to use an AR model, which has the capability of obtaining better frequency resolution. An AR(2) model is adaptively estimated with WLS and L = 20, and Figure 5.15(b) shows the result. The frequency resolution is in theory infinite. The practical limitation comes from the variance error in the parameter estimate. There is a time versus frequency resolution trade-off for this parametric method as well. The larger the time window L, the better the parameter estimate and thus the frequency estimate.
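A sketch of the parametric idea: a sliding-window least squares AR(2) fit, reading the momentaneous frequency off the pole angle. The plain rectangular window is a simplification of WLS made for the illustration.

```python
import numpy as np

def ar2_freq(y, L=20):
    """Track the dominant frequency of y via sliding-window AR(2) fits."""
    freqs = []
    for t in range(L, len(y)):
        w = y[t - L:t]
        Phi = np.column_stack([-w[1:-1], -w[:-2]])
        theta, *_ = np.linalg.lstsq(Phi, w[2:], rcond=None)
        poles = np.roots([1.0, theta[0], theta[1]])   # roots of A(z)
        freqs.append(np.abs(np.angle(poles[0])))      # pole angle ~ frequency
    return np.array(freqs)

N = 2000
t = np.arange(N)
y = np.sin(2 * np.pi * t**2 / N)
y += 0.01 * np.random.default_rng(0).standard_normal(N)
w = ar2_freq(y)
print(w[::200])   # grows roughly linearly with time, like 4*pi*t/N (folded at pi)
```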


Figure 5.14. Time-frequency content (upper plot) of a chirp signal (lower plot).

Figure 5.15. Time-frequency content of a chirp signal: non-parametric (left) and parametric (right) methods, where the latter uses an AR(2) model and WLS with L = 20.

The larger the AR model, the more frequency components can be estimated. That is, the model order is another critical design parameter. We also know that the more parameters, the larger the uncertainty in the estimate, and this implies another trade-off.


5.4.3. Kalman filter

If the linear regression model is interpreted as the measurement equation in a state space model,

θ_{t+1} = θ_t + v_t,  Cov(v_t) = Q_t,
y_t = φ_t^T θ_t + e_t,  Cov(e_t) = R_t, (5.50)

then the Kalman filter (see Chapters 13 and 8) applies.

Algorithm 5.4 Kalman filter for linear regressions

For general linear regression models y_t = φ_t^T θ_t + e_t, the Kalman filter updates the parameter estimate by the recursion

θ̂_t = θ̂_{t-1} + K_t (y_t - φ_t^T θ̂_{t-1}), (5.51)
K_t = P_{t-1} φ_t / (R_t + φ_t^T P_{t-1} φ_t), (5.52)
P_t = P_{t-1} - P_{t-1} φ_t φ_t^T P_{t-1} / (R_t + φ_t^T P_{t-1} φ_t) + Q_t. (5.53)

The design parameters are Q_t and R_t. Without loss of generality, R_t can be taken as 1 in the case of scalar measurements.
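A direct implementation of (5.51)-(5.53) is sketched below, tracking a scalar parameter that changes abruptly halfway through the data; the values of Q, R and the data are illustrative choices.

```python
import numpy as np

def kalman_regression(y, phi, Q, R=1.0, P0=100.0):
    """Kalman filter for y_t = phi_t^T theta_t + e_t with random walk theta."""
    n = phi.shape[1]
    theta = np.zeros(n)
    P = P0 * np.eye(n)
    est = np.zeros((len(y), n))
    for t in range(len(y)):
        f = phi[t]
        S = R + f @ P @ f                            # innovation variance
        K = P @ f / S                                # Kalman gain (5.52)
        theta = theta + K * (y[t] - f @ theta)       # measurement update (5.51)
        P = P - np.outer(K, f @ P) + Q * np.eye(n)   # covariance + time update (5.53)
        est[t] = theta
    return est

rng = np.random.default_rng(0)
N = 400
theta_true = np.where(np.arange(N) < N // 2, 1.0, 2.0)   # abrupt change
phi = rng.standard_normal((N, 1))
y = phi[:, 0] * theta_true + 0.5 * rng.standard_normal(N)
est = kalman_regression(y, phi, Q=0.01)
print(est[N // 2 - 5:N // 2 + 20, 0])   # re-adapts after the change
```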

There are different possibilities for how to interpret the physics behind the state noise v_t:

- A random walk model, where v_t is white noise.

- Abrupt changes, where v_t = 0 with probability 1 - q, and v_t = v with probability q, where Cov(v) = (1/q) Q_t.

- Hidden Markov models, where θ switches between a finite number of values. Here one has to specify a transition matrix, with probabilities for going from one vector to another. An example is speech recognition, where each phoneme has its own a priori known parameter vector, and the transition matrix can be constructed by studying the language to see how frequent different transitions are.

These three assumptions are all, in a way, equivalent up to second order statistics, since Cov(v_t) = Q_t in all cases. The Kalman filter is the best possible conditional linear filter, but there might be better non-linear algorithms for


5.5. Performance analysis

The error sources for filters in general, and adaptive filters in particular, are:

• Transient error caused by incorrect initialization of the algorithms. For LMS and NLMS it depends on the initial parameter value, and for RLS and KF it also depends on the initial covariance matrix P_0. By making this very large, the transient can be reduced to a few samples.

• Variance error, caused by noise and disturbances. In simulation studies, this term can be reduced by Monte Carlo simulations.

• Tracking errors due to parameter changes.

• Bias error caused by a model that is not rich enough to describe the true signal. Generally, we will denote by θ* the best possible parameter value within the considered model structure, and by θ° the true parameters, when available.

A standard design consists of the following steps:

Design of adaptive filters

1. Model structure selection from off-line experiments (for instance using BIC type criteria) or prior knowledge, to reduce the bias error.

2. Include prior knowledge of the initial values θ_0 and P_0, or decide what a sufficiently large P_0 is from knowledge of typical parameter sizes, to minimize the transient error.

3. Tune the filter to trade off tracking errors against variance errors.

4. Compare different algorithms with respect to performance, real-time requirements and implementation complexity.

We first define a formal performance measure. Let

V(θ̂) = E[(y_t − φ_t^T θ̂_t)²] = V_min + V_ex,   (5.54)

V_min = E[e_t²],   (5.55)

where (5.55) holds


assuming that the true system belongs to the model class. V_ex is the excess mean square error. Define the misadjustment

M(t) = V_ex(t) / V_min → M,   t → ∞.   (5.56)

The assumptions used, for example, in Widrow and Stearns (1985), Bellanger (1988), Gunnarsson and Ljung (1989) and Gunnarsson (1991) are the following:

• θ° exists (no bias error).
• Q = E[v_t v_t^T], where the parameter changes v_t form a quasi-stationary process.
• Z = E[φ_t φ_t^T], where φ_t is a quasi-stationary process.
• e_t, v_t, φ_t are independent white processes.

That is, the true system can be exactly described by the modeled linear regression, the regressors are quasi-stationary and the parameter variations are a random walk. We will study the parameter estimation error θ̃_t = θ̂_t − θ_t.

5.5.1. LMS

The parameter error for LMS is

θ̃_t = (I − μφ_tφ_t^T)(θ̃_{t-1} − v_t) + μφ_t e_t.

Transient and stability for LMS

As a simple analysis of the transient, take the SVD of Z = E[φ_tφ_t^T] = U D U^T and assume time-invariant true parameters θ_t° = θ°. The matrix D is diagonal and contains the singular values σ_i of Z in descending order, and U satisfies U^T U = I. Let us also assume that θ̂_0 = 0. Then the mean parameter error evolves as E[θ̃_t] = (I − μZ)^t θ̃_0 = −U(I − μD)^t U^T θ°, so the i-th transformed component decays as (1 − μσ_i)^t.


That is, LMS is stable only if

μ ≤ 2/σ_1.

More formally, the analysis shows that we get convergence in the mean if and only if μ < 2/σ_1.

We note that the transient decays as (1 − μσ_i)^t. If we choose μ = 1/σ_1, so that the first component converges directly, then the convergence rate is

(1 − σ_n/σ_1)^t.

That is, the possible convergence rate depends on the condition number of the matrix Z. If possible, the signals in the regressors should be pre-whitened to give cond(Z) = 1.

A practically computable bound on μ can be found by the following relations:

μ < 2/σ_1 ≤ 2n / E(φ_t^T φ_t).

Here the expectation is simple to approximate in an off-line study, or to make adaptive by exponential forgetting. Note the similarity to NLMS. It should be mentioned that μ in practice should be chosen to be about 100 times smaller than this value, to ensure stability.

Misadjustment for LMS

The stationary misadjustment for LMS can be shown to equal:

M = μ tr(Z)/2 + tr(Q)/(2μR),   (5.57)

where the first term is the variance error and the second term is the tracking error.

• The stationary misadjustment M splits into variance and tracking parts. The variance error is proportional to the adaptation gain μ, while the tracking error is inversely proportional to the gain.

• The tracking error is proportional to the signal-to-noise ratio tr(Q)/R.

The optimal step size is

μ* = √( tr(Q) / (R tr(Z)) ).


5.5.2. RLS

The dynamics for the RLS parameter error is

Misadjustment and transient for RLS

As for LMS, the stationary misadjustment M splits into variance and tracking parts. The transient can be expressed in misadjustment as a function of time, M(t), for constant Q. The following results can be shown to hold (see the references in the beginning of the section):

M(t) = n/(t+1) + n(1 − λ)/2 + tr(ZQ)/(2(1 − λ)R),   (5.58)

where the terms are the transient error, the variance error and the tracking error, respectively.

• Transient and variance errors are proportional to the number of parameters n.

• As for LMS, the stationary misadjustment M splits into variance and tracking parts. Again, as for LMS, the variance error is proportional to the adaptation gain 1 − λ, while the tracking error is inversely proportional to the gain.

• As for LMS, the tracking error is proportional to the signal-to-noise ratio tr(ZQ)/R.

By minimizing (5.58) w.r.t. λ, the optimal step size 1 − λ is found to be

(1 − λ)* = √( tr(ZQ) / (nR) ).

A refined analysis of the transient term in both RLS and LMS (with variants) is given in Eweda (1999).

5.5.3. Algorithm optimization

Note that the stationary expressions make it possible to optimize the design parameter to get the best possible trade-off between tracking and variance errors, as a function of the true time variability and covariances. For instance, we might ask which algorithm performs best for a certain Q and Z, in terms of


excess mean square error. Optimization of the step size μ and the forgetting factor λ in the expressions for M in NLMS and RLS gives

M*_NLMS = √( tr(Z) tr(Q) ),   M*_RLS = √( n tr(ZQ) ).

The trace operator can be rewritten as the sum of eigenvalues, tr(Q) = Σ_i σ_i(Q). If we put Z = Q, we get

M*_NLMS = Σ_i σ_i(Q) ≤ √( n Σ_i σ_i²(Q) ) = M*_RLS,

with equality only if σ_i = σ_j for all i, j. As another example, take Q = Z^{-1}, for which

M*_RLS = n ≤ √( Σ_i σ_i(Z) · Σ_i σ_i^{-1}(Z) ) = M*_NLMS,

with equality only if σ_i = σ_j for all i, j. That is, if Z = Q then NLMS performs best, and if Z = Q^{-1} then RLS is better; we have thus by examples proven that no algorithm is generally better than the other one, see also Eleftheriou and Falconer (1986) and Benveniste et al. (1987b).
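The two cases are easy to verify numerically; the following sketch uses an arbitrary positive definite Z (all numbers are purely illustrative):

```python
import numpy as np

# Compare M*_NLMS = sqrt(tr Z tr Q) and M*_RLS = sqrt(n tr(ZQ))
# for the two cases Z = Q and Q = Z^{-1}.
rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
Z = A @ A.T + n * np.eye(n)          # a random positive definite Z

for label, Q in (("Z = Q", Z), ("Q = Z^-1", np.linalg.inv(Z))):
    M_nlms = np.sqrt(np.trace(Z) * np.trace(Q))
    M_rls = np.sqrt(n * np.trace(Z @ Q))
    print(f"{label}: M*_NLMS = {M_nlms:.2f}, M*_RLS = {M_rls:.2f}")
# With Z = Q, NLMS gives the smaller misadjustment; with Q = Z^-1, RLS does.
```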

5.6. Whiteness based change detection

The basic idea is to feed the residuals from the adaptive filter to a change detector, and use its alarm as feedback information to the adaptive filter, see Figure 5.16. Here the detector is any scalar alarm device from Chapter 3, using a transformation s_t = f(ε_t) and a stopping rule from Section 3.4. There are a few alternatives for how to compute a test statistic s_t, which is zero mean when there is no change, and non-zero mean after the change. First, note that if all noises are Gaussian, and if the true system is time-invariant and belongs to the modeled linear regression, then

ε_t ∈ N(0, S_t),   S_t = R_t + φ_t^T P_t φ_t.   (5.59)

Figure 5.16. Change detection as a whiteness residual test, using e.g. the CUSUM test, for an arbitrary adaptive filter, where the alarm feedback controls the adaptation gain.


The normalized residual

s_t = ε_t / √S_t ∈ N(0, 1)   (5.60)

is then a suitable candidate for change detection.

The main alternative is to use the squared residual

s_t = ε_t^T S_t^{-1} ε_t ∈ χ²(n_y),   (5.61)

where n_y is the dimension of the residual.

Another idea is to check if there is a systematic drift in the parameter updates:

s_t = Δθ̂_t = K_t ε_t.   (5.62)

Here the test statistic is vector valued.

Parallel update steps Δθ̂_t = K_t ε_t for the parameters in an adaptive algorithm are an indication that a systematic drift has started. It is proposed in Hägglund (1983) to use

s_t = (Δθ̂_t)^T Δθ̂_{t-1} = ε_{t-1} K_{t-1}^T K_t ε_t   (5.63)

as the residual. A certain filtering approach of the updates was also proposed.

The first approach is the original CUSUM test in Page (1954). The second one is usually labeled as just the χ² test, since the test statistic is χ² distributed under certain assumptions, while a variant of the third approach is called the asymptotic local approach in Benveniste et al. (1987a).

After an alarm, the basic action is to increase the gain in the filter momentarily. For LMS and KF, we can use a scalar factor α and set μ_t = αμ_t and Q_t = αQ_t, respectively. For RLS, we can use a small forgetting factor, for instance λ_t = 0.
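A minimal sketch of such a detector, here a two-sided CUSUM test applied to a statistic such as the normalized residual (5.60); the threshold h, the drift ν and the gain action are design choices:

```python
import numpy as np

def cusum_two_sided(s, h, nu):
    """Two-sided CUSUM stopping rule applied to a test statistic s_t that
    is zero mean before the change, e.g. the normalized residual (5.60).
    Returns the first alarm time, or None. h: threshold, nu: drift."""
    g_pos = g_neg = 0.0
    for t, st in enumerate(s):
        g_pos = max(0.0, g_pos + st - nu)   # guards against positive mean shift
        g_neg = max(0.0, g_neg - st - nu)   # guards against negative mean shift
        if g_pos > h or g_neg > h:
            return t
    return None

# On an alarm, the adaptation gain would be increased momentarily,
# e.g. Q_t -> alpha * Q_t with alpha > 1, and the test statistics reset.
```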

Applications of this idea are presented in Section 5.10; see also, for instance, Medvedev (1996).

5.7. A simulation example

The signal in this section will be a first order AR model, y_t + a_t y_{t-1} = e_t.


The parameter is estimated by LS, RLS, LMS and LS with a change detector, respectively. The latter will be referred to as Detection LS, and the detector is the two-sided CUSUM test with the residuals as input. For each method and design parameter, the loss function and code length are evaluated on all but the 20 first data samples, to avoid possible influence of transients and initialization. The RLS and LMS algorithms are standard, and the design parameters are the forgetting factor λ and the step size μ, respectively.

5.7.1. Time-invariant AR model

Consider first the case of a constant AR parameter a = −0.5. Figure 5.17 shows MDL, as described in Section 12.3.1, and the loss function V, as a function of the design parameter, together with the parameter tracking, where the true parameter value is indicated by a dotted line. Table 5.1 summarizes the optimal design parameters and code lengths according to the MDL measure for this particular example.

Note that the optimal design parameter in RLS corresponds to the LS solution and that the step size of LMS is very small (compared to the ones to follow). All methods have approximately the same code length, which is logical.

5.7.2. Abruptly changing AR model

Consider the piecewise constant AR parameter

a_t = −0.5 if t ≤ 100,
a_t = 0.5 if t > 100.

Table 5.1. Optimal code length and design parameters for RLS, LMS and whiteness detection LS, respectively, for constant parameters in the simulation model. For comparison, the LS result is shown.

Method         Optimal par.   MDL    V
LS             -              1.08   1.11
RLS            λ = 1          1.11   1.12
LMS            μ = 0.002      1.13   1.13
Detection LS   ν = 1.1        1.11   1.08


Table 5.2. Optimal code length and design parameters for RLS, LMS and whiteness detection LS, respectively, for abruptly changing parameters in the simulation model. For comparison, the LS result is shown.

Method         Optimal par.   MDL   V
LS             -
RLS            λ = 0.9
LMS            μ = 0.07
Detection LS

Figure 5.18 shows MDL and the loss function V as a function of the design parameter, and the parameter tracking, where the true parameter value is indicated by a dotted line. Table 5.2 summarizes the optimal design parameters and code lengths for this particular example. Clearly, an adaptive algorithm is here much better than a fixed estimate. The updates Δθ̂_t are of much smaller magnitude than the residuals. That is, for coding purposes it is more efficient to transmit the small parameter updates than the much larger residuals for a given numerical accuracy. This is exactly what MDL measures.

5.7.3. Time-varying AR model

The simulation setup is exactly as before, but the parameter vector is linearly changing from −0.5 to 0.5 over 100 samples. Figure 5.19 and Table 5.3 summarize the result. As before, the difference between the adaptive algorithms is insignificant and there is no clear winner. The choice of adaptation mechanism is arbitrary for this signal.

Table 5.3. Optimal code length and design parameters for RLS, LMS and whiteness detection LS, respectively, for slowly varying parameters in the simulation model. For comparison, the LS result is shown.

Method         Optimal par.   MDL    V
LS             -              1.36   1.39
RLS            λ = 0.92       1.15   1.15
LMS            μ = 0.05       1.16   1.16
Detection LS   ν = 0.46       1.24   1.13


Figure 5.17. MDL and V as a function of design parameter and parameter tracking for RLS (a), LMS (b) and detection LS (c), respectively, for constant parameters in the simulation model.

Figure 5.18. MDL and V as a function of design parameter and parameter tracking for RLS (a), LMS (b) and detection LS (c), respectively, for abruptly changing parameters in the simulation model.

Figure 5.19. MDL and V as a function of design parameter and parameter tracking for RLS (a), LMS (b) and detection LS (c), respectively, for slowly varying parameters.


5.8. Adaptive filters in communication

Figure 5.20 illustrates the equalization problem. The transmitted signal u_t is distorted in a channel, and the receiver measures its output y_t with additive noise e_t. The distortion implies that the output may be far away from the transmitted signal, and is thus not useful directly. The received output is passed to a filter (equalizer), with the objective to essentially invert the channel to recover u_t.

In general, it is hard to implement the inversion without accepting a delay D of the overall system. For example, since the channel might encompass time delays, a delayless inversion would involve predictions of u_t. Furthermore, the performance may be improved by accepting additional time delay. The delay D can be determined beforehand based on expected time delays and filter order of the channel. An alternative approach is to include the parameter in the channel inversion optimization.

The input signal belongs to a finite alphabet in digital communication. For the discussion and most of the examples, we will assume a binary input (in modulation theory called BPSK, Binary Phase Shift Keying),

u_t = ±1.

In this case, it is natural to estimate the transmitted signal with the sign of the received signal, û_{t-D} = sign(y_t).

If the time constant of the channel dynamics is longer than the length of one symbol period, the symbols will reach the receiver with overlap. The phenomenon is called Inter-Symbol Interference (ISI), and it implies that û_{t-D} = sign(y_t) ≠ u_{t-D}. If incorrect decisions occur too frequently, the need for equalization arises.

Example 5.15 Inter-symbol interference

As a standard example in this section, the following channel will be used:

B(q) = 1 − 2.2q^{-1} + 1.17q^{-2},
u_t = ±1.

Figure 5.20. The equalization principle.


Figure 5.21. The transmitted signal (first plot) is distorted by the channel, and the received signal (second plot) does not always have the correct sign, implying decision errors for û_t = sign(y_t) (third plot).

N = 50 binary inputs are simulated according to the first plot in Figure 5.21. The output y_t after the channel dynamics is shown in the second plot. If a decision is taken directly, û_t = sign(y_t), without equalization, a number of errors will occur according to the last plot, which shows u_t − sign(y_t). Similar results are obtained when plotting u_{t-D} − sign(y_t) for D = 1 and D = 2.

Example 5.16 ISI for 16-QAM

The ISI is best illustrated when complex modulation is used. Consider, for instance, 16-QAM (Quadrature Amplitude Modulation), where the transmitted symbol can take on one of 16 different values, represented by four equidistant levels of both the real and imaginary part. The received signal distribution can then look like the left plot in Figure 5.22. After successful equalization, 16 well distinguished clusters appear, as illustrated in the right plot of Figure 5.22. One interpretation of the so called open-eye condition is that it is satisfied when all clusters are well separated.

For evaluation, the Bit Error Rate (BER) is commonly used, and it is defined as

BER = (number of non-zero (u_t − û_t)) / N,   (5.64)


Figure 5.22. Signal constellation for 16-QAM for the received signal (left) and equalized signal (right).

where trivial phase shifts (sign change) of the estimate should be discarded. It is naturally extended to Symbol Error Rate (SER) for general input alphabets (not only binary signals).

Another important parameter for algorithm evaluation is the Signal-to-Noise Ratio (SNR):

SNR = Var(B(q)u_t) / Var(e_t).   (5.65)

A typical evaluation shows BER as a function of SNR. For high SNR one should have BER = 0 for an algorithm to be meaningful, and conversely BER will always approach 0.5 when SNR decreases.

Now, when the most fundamental quantities are defined, we will survey the most common approaches. The algorithms belong to one of two classes:

• Equalizers that assume a known model for the channel. The equalizer is either a linear filter, or two filters where one is in a feedback loop, or a filter bank with some discrete logics.

• Blind equalizers that simultaneously estimate the transmitted signal and the channel parameters, which may even be time-varying.

5.8.1. Linear equalization

Consider the linear equalizer structure in Figure 5.23. The linear filter tries to invert the channel dynamics, and the decision device is a static mapping, working according to the nearest neighbor principle.

The underlying assumption is that the transmission protocol is such that a training sequence, known to the receiver, is transmitted regularly. This sequence is used to estimate the inverse channel dynamics C(q) according to the least squares principle. The dominating model structure for both channel


Figure 5.23. A linear equalizer consists of a linear filter C(q) followed by a decision device (û_{t-D} = sign(z_t) for a binary signal). The equalizer is computed from knowledge of a training sequence of u_t.

and linear equalizer is FIR filters. A few attempts exist at using other, more flexible structures (Grohan and Marcos, 1996). The FIR model for the channel is motivated by physical reasons; the signal is subject to multi-path fading or echoes, which implies delayed and scaled versions of the signal at the receiver. The FIR model for equalizer structures, where the equalizer consists of a linear filter in series with the channel, is motivated by practical reasons. The FIR model of the channel is, most likely, non-minimum phase, so the natural AR model for inverting the channel would be unstable.

An equalizer of order n, C_n, is to be estimated from L training symbols u_t, aiming at a total time delay of D. Introduce the loss function

V_L(C_n, D) = Σ_t (u_{t-D} − C_n(q) y_t)².

The least squares estimate of the equalizer is now

Ĉ_n(D) = arg min over C_n of V_L(C_n, D).

The designer of the communication system has three degrees of freedom. The first, and most important, choice for performance and spectrum efficiency is the length of the training sequence, L. This has to be fixed at an early design phase when the protocol, and for commercial systems the standard, is decided upon. Then, the order n and delay D have to be chosen. This can be done by comparing the loss function in the three-dimensional discrete space L, n, D. All this has to be done together with the design of channel coding (error correcting codes that accept a certain amount of bit errors). The example below shows a smaller study.

Example 5.17 Linear equalization: structure

As a continuation of Example 5.15, let the channel be

B(q) = 1 − 2.2q^{-1} + 1.17q^{-2},
L = 25,
u_t = ±1,


Figure 5.24. The loss function V_200(C_n, D) as a function of the time delay D = 1, 2, ..., n for n = 5, 7, 9, respectively (a). In (b), the impulse response of the channel, equalizer and combined channel-equalizer is shown, respectively.

where the transmitted signal is white. Compute the linear equalizer C(q) = c_0 + c_1 q^{-1} + ... + c_n q^{-n} and its loss function V_L(C_n, D) as a function of the delay D. Figure 5.24(a) shows the result for n = 5, 7, 9, respectively. For each n, there is a clear minimum for D ≈ n/2.

The more taps n in the equalizer, the better the result. The choice is a trade-off between complexity and performance. Since BER is a monotone function of the loss function, the larger n, the better the performance. In this case, a reasonable compromise is n = 7 and D = 4.
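A minimal sketch of this least squares design (the function interface and data handling are ours):

```python
import numpy as np

def ls_equalizer(y, u, n, D):
    """Fit c = (c_0, ..., c_n) minimizing the training loss
    sum_t (u_{t-D} - sum_k c_k y_{t-k})^2."""
    rows, targets = [], []
    for t in range(max(n, D), len(y)):
        rows.append(y[t - np.arange(n + 1)])   # regressor (y_t, ..., y_{t-n})
        targets.append(u[t - D])
    c, *_ = np.linalg.lstsq(np.asarray(rows), np.asarray(targets), rcond=None)
    return c
```

Sweeping n and D and evaluating the loss for each combination reproduces the kind of study shown in Figure 5.24(a).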

Suppose now that the structure of the equalizer (L, n, D) is fixed. What is the performance under different circumstances?

Example 5.18 Linear equalization: SNR vs. BER

A standard plot when designing a communication system is to plot BER as a function of SNR. For that reason, we add channel noise to the simulation setup in Example 5.17:

y_t = B(q)u_t + e_t,   Var(e_t) = σ²,
B(q) = 1 − 2.2q^{-1} + 1.17q^{-2},
N = 200,
L = 25.


Figure 5.25. BER (Bit Error Rate) as a function of signal-to-noise ratio (SNR) for a linear equalizer.

SNR is now a function of the noise variance σ². Let the training sequence consist of the L = 25 first symbols in the total sequence of length N = 200. Suppose we have chosen n = 8 and D = 4, and we use 100 Monte Carlo simulations to evaluate the performance. Figure 5.25 shows BER as a function of SNR under these premises.

If we can accept a 5% error rate, then we must design the transmission link so the SNR is larger than 100, which is quite a high value.

For future comparison, we note that this particular implementation uses 4 · 10^4 floating point operations, which includes the design of the equalizer.

5.8.2. Decision feedback equalization

Figure 5.26 shows the structure of a decision feedback equalizer. The upper part is identical to a linear equalizer with a linear feed-forward filter, followed by a decision device. The difference lies in the feedback path from the non-linear decisions.

One fundamental problem with linear equalizers of FIR type is the many taps that are needed to approximate a zero close to the unit circle in the channel. For instance, to equalize 1 + 0.9q^{-1}, the inverse filter 1 − 0.9q^{-1} + 0.9²q^{-2} − 0.9³q^{-3} + ... is needed. With the extra degree of freedom we now have, these zeros can be put in the feedback path, where no inversion is needed. In theory, D(q) = B(q) and C(q) = 1 would be a perfect equalizer. However, if the noise induces a decision error, then there might be a recovery problem


Figure 5.26. The decision feedback equalizer has one feedforward filter and a decision device (for binary input a sign function, or relay), exactly as the linear equalizer, and one feedback filter from past decisions.

for the DFE equalizer. Here is the fundamental design trade-off: split the dynamics of the channel between C(q) and D(q) so that few taps and robustness to decision errors are achieved.

In the design, we assume that the channel is known. In practice, it is estimated from a training sequence. To analyse and design a non-linear system is generally very difficult, and so is this problem. A simplifying assumption, which dominates the design described in the literature, is the one of so called Correct Past Decisions (CPD). The assumption implies that we can take the input to the feedback filter from the true input, and we get the block diagram in Figure 5.27. The assumption is sound when the SNR is high, so a DFE can only be assumed to work properly in such systems.

We have from Figure 5.27 that, if there are no decision errors, then

u_{t-D} − z_t = (q^{-D} − B(q)C(q) − D(q)) u_t − C(q) e_t.

For the CPD assumption to hold, the estimation errors must be small. That is, choose C(q) and D(q) to minimize

E(u_{t-D} − z_t)².

There are two principles described in the literature:

• The zero forcing equalizer. Neglect the noise in the design (as is quite common in equalization design) and choose C(q) and D(q) so that

q^{-D} − B(q)C(q) − D(q) = 0.


Figure 5.27. Equivalent DFE under the correct past decisions (CPD) assumption, which implies that the input to the feedback filter can be taken from the real input.

• The minimum variance equalizer. Here C(q) and D(q) are chosen to minimize the variance of the estimation error,

E(u_{t-D} − z_t)² = (1/2π) ∫ ( |e^{-iωD} − B(e^{iω})C(e^{iω}) − D(e^{iω})|² Φ_u(e^{iω}) + |C(e^{iω})|² Φ_e(e^{iω}) ) dω.

Here we have used Parseval's formula and an independence assumption between u and e.

In both cases, a constraint of the type c_0 = 1 is needed to avoid the trivial minimum for C(q) = 0, in case the block diagram in Figure 5.27 does not hold.

The advantage of DFE is a possible considerable performance gain at the cost of an only slightly more complex algorithm, compared to a linear equalizer. Its applicability is limited to cases with high SNR.

As a final remark, the introduction of equalization in very fast modems introduces a new kind of implementation problem. The basic reason is that the data rate comes close to the clock frequency in the computer, and the feedback path computations introduce a significant time delay in the feedback loop. This means that the DFE approach collapses, since no feedback delay can be accepted.

5.8.3. Equalization using the Viterbi algorithm

The structure of the Viterbi equalizer is depicted in Figure 5.28. The idea in Viterbi equalization is to enumerate all possible input sequences, and the


Figure 5.28. Equalization principle.

one that in a simulation using B̂(q) produces the output most similar to the measured output is taken as the estimate.

Technically, the similarity measure is the maximum likelihood criterion, assuming Gaussian noise. This is quite similar to taking the sum of squared residuals Σ_t (y_t − ŷ_t(û))² for each possible sequence û. Luckily, not all sequences have to be considered. It turns out that only S^{n_b} sequences have to be examined, where S is the number of symbols in the finite alphabet, and n_b is the channel order.

There is an optimal search algorithm for a finite memory channel (where FIR is a special case), namely the Viterbi algorithm.

Algorithm 5.5 Viterbi

Consider a channel described by y_t = f(u^t, e_t), where the input u_t belongs to a finite alphabet of size S and e_t is white noise. If the channel has a finite memory n_b, so y_t = f(u^t_{t-n_b+1}, e_t), the search for the ML estimate of u_t can be restricted to S^{n_b} sequences. These sequences are an enumeration of all possible sequences in the interval [t − n_b + 1, t]. For the preceding inputs, only the most likely one has to be considered.

Derivation: an induction proof of the algorithm is given here. From Bayes' rule we have

p(y^t | u^t) = p(y_t | y^{t-1}, u^t) p(y^{t-1} | u^t) = p(y_t | u^t_{t-n_b+1}) p(y^{t-1} | u^{t-1}).

In the second equality the finite memory property is used. Suppose that we have computed the most likely sequence at time t − 1 for each sequence


u^{t-1}_{t-n_b+1}, so we have p(y^{t-1} | u^{t-1}_{t-n_b+1}, u^{t-n_b}) as a function of u^{t-1}_{t-n_b+1}. From the expression above, it follows that, independently of what value y_t has, the most likely sequence at time t can be found just by maximizing over u^t_{t-n_b+1}, since the maximization over u^{t-n_b} has already been done. □

The proof above is non-standard. Standard references like Viterbi (1967), Forney (1973) and Haykin (1988) identify the problem with a shortest route problem in a trellis diagram and apply forward dynamic programming. The proof here is much more compact.

In other words, the Viterbi algorithm 'inverts' the channel model B(q) without applying B^{-1}(q), which is likely to be unstable. This works due to the finite alphabet of the input sequence.

1. Enumerate all possible input sequences within the sliding window of size n (the channel model length). Append these candidates to the estimated sequence û^{t-n}.

2. Filter all candidate sequences with B̂(q).

3. Estimate the input sequence by the best possible candidate sequence, that is, by the candidate û minimizing the sum of squared residuals Σ_t (y_t − ŷ_t(û))².
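A compact sketch of Algorithm 5.5 for a known FIR channel (the interface and state bookkeeping are ours; Gaussian noise is assumed, so the path metric is the accumulated squared residual):

```python
import numpy as np
from itertools import product

def viterbi_equalize(y, b, symbols=(-1.0, 1.0)):
    """ML sequence estimate of u over a known FIR channel b = (b_0, ..., b_nb),
    nb >= 1. States are the nb most recent inputs."""
    nb = len(b) - 1
    states = list(product(symbols, repeat=nb))       # S^{nb} states
    cost = {s: 0.0 for s in states}
    back = []
    for yt in y:
        new_cost, pointer = {}, {}
        for s in states:                             # s = (u_{t-1}, ..., u_{t-nb})
            for u in symbols:
                # predicted output for input u with memory s
                yhat = b[0] * u + sum(bk * uk for bk, uk in zip(b[1:], s))
                ns = (u,) + s[:-1]                   # new state after input u
                c = cost[s] + (yt - yhat) ** 2
                if ns not in new_cost or c < new_cost[ns]:
                    new_cost[ns], pointer[ns] = c, s
        cost = new_cost
        back.append(pointer)
    s = min(cost, key=cost.get)                      # best final state
    u_hat = []
    for pointer in reversed(back):                   # backtrack the best path
        u_hat.append(s[0])
        s = pointer[s]
    return u_hat[::-1]
```

In the noise-free setting of Example 5.15, viterbi_equalize(y, [1.0, -2.2, 1.17]) should recover the transmitted sequence without the decision errors of Figure 5.21.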

Example 5.19 BER vs. SNR evaluation

Consider the BER vs. SNR plot in Example 5.18, obtained using a training sequence of length L = 25. Exactly the same simulation setup is used here for the Viterbi algorithm.

Figure 5.29(a) shows the plot of BER vs. SNR for L = 25. The performance is improved, and we can in principle tolerate 100 times more noise energy. The number of flops is here in the order of 2 · 10^5, so five times more computations are needed compared to the linear equalizer.

Alternatively, we can use the performance gain to save bandwidth by decreasing the length of the training sequence. Figure 5.29(b) plots SNR vs. BER for different lengths of the training sequence. As can be seen, a rather short training sequence, say 7, gives good performance. We can thus 'buy' bandwidth by using a more powerful signal processor.


Figure 5.31. Structure of a blind equalizer, which adapts itself from knowledge of y_t and z_t only.

5.8.5. Blind equalization

The best known application of blind deconvolution is to remove the distortion caused by the channel in digital communication systems. The problem also occurs in seismological and underwater acoustics applications. Figure 5.31 shows a block diagram for the adaptive blind equalizer approach. The channel is as usual modeled as a FIR filter,

B(q) = b_0 + b_1 q^{-1} + ... + b_{n_b} q^{-n_b},

and the same model structure is used for the blind equalizer,

C(q) = c_0 + c_1 q^{-1} + ... + c_n q^{-n}.

The impulse response of the combined channel and equalizer, assuming FIR models for both, is

h_k = (b ∗ c)_k,

where ∗ denotes convolution. Again, the best one can hope for is h_k ≈ m δ_{k-D}, where D is an unknown time delay, and m with |m| = 1 is an unknown modulus. For instance, it is impossible to estimate the sign of the channel. The modulus and delay do not matter for the performance, and can be ignored in applications.

Assume a binary signal u_t (BPSK). For this special case, the two most popular loss functions defining the adaptive algorithm are given by:

V(θ) = (1/2) E[(1 − z_t²)²]   modulus restoral (Godard)   (5.68)

V(θ) = (1/2) E[(sign(z_t) − z_t)²]   decision directed (Sato)   (5.69)

The modulus restoral algorithm also goes under the name Constant Modulus Algorithm (CMA). Note the convention that decision directed means that the decision is used in a parameter update equation, whereas decision feedback means that the decisions are used in a linear feedback path. The steepest descent algorithm applied to these two loss functions gives the following two algorithms, differing only in the definition of the residual:


Algorithm 5.6 Blind equalization

A stochastic gradient algorithm for minimizing one of (5.68) or (5.69) in the case of binary input is given by

φ_t = (y_{t-1}, y_{t-2}, ..., y_{t-n})^T   (5.70)

z_t = φ_t^T θ̂_{t-1}   (5.71)

ε_t = z_t(1 − z_t²)   (Godard)   (5.72)

ε_t = sign(z_t) − z_t   (Sato)   (5.73)

θ̂_t = θ̂_{t-1} + μ φ_t ε_t   (5.74)

û_{t-D} = sign(z_t).   (5.75)

The extension to other input alphabets than ±1 is trivial for the decision directed approach. Modulus restoral can be generalized to any phase shift coding.

Consider the case of input alphabet u_t = ±1. For successful demodulation, and assuming no measurement noise, it is enough that the largest component of h_k is larger than the sum of the other components. Then the decoder û_{t-D} = sign(z_t) would give zero error. This condition can be expressed as m_t > 0, where

m_t = 1 − (Σ_{k≠D} |h_k|) / |h_D|,   D = arg max_k |h_k|.

If the equalizer is a perfect inverse of the channel (which is impossible for FIR channel and equalizer), then m_t = 1. The standard definition of a so called open-eye condition corresponds to m_t > 0, when perfect reconstruction is possible, if there is no noise. The larger m_t, the larger noise can be tolerated.

Example 5.21 Blind equalization

Look at the plot in Figure 5.32 for an example of how a blind equalizer improves the open-eye measure with time. In this example, the channel B(q) = 0.3q^{-1} + q^{-2} + 0.3q^{-3} is simulated using an input sequence taken from the alphabet {−1, 1}. The initial equalizer parameters are quite critical for the performance. They should at least satisfy the open-eye condition m_t > 0 (here m_0 ≈ 0.5). The algorithms are initiated with C(q) = 0 − 0.1q^{-1} + q^{-2} − 0.1q^{-3} + 0q^{-4}. At the end, the equalizer is a good approximation of the channel impulse response, as seen from Figure 5.33.
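A minimal sketch of Algorithm 5.6 on this setup (the step size, the seed, and the sign conventions of the reconstructed channel and initialization are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
N, mu = 5000, 0.01                               # step size is an assumption
u = rng.choice([-1.0, 1.0], size=N)
y = np.convolve(u, [0.0, 0.3, 1.0, 0.3])[:N]     # channel B(q) of the example

theta = np.array([0.0, -0.1, 1.0, -0.1, 0.0])    # initial equalizer C(q)
for t in range(4, N):
    phi = y[t - np.arange(5)]                    # (y_t, ..., y_{t-4})
    z = phi @ theta
    eps = np.sign(z) - z                         # Sato residual (5.73);
    # Godard/CMA would instead use eps = z * (1 - z**2), cf. (5.72)
    theta = theta + mu * phi * eps
```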


Figure 5.32. Open-eye measure (a) and residual sequences (b) for Sato's and Godard's algorithms, as a function of the number of samples.

5.9. Noise cancelation

Noise cancelation, comprising Acoustic Echo Cancelation (AEC) as one application, has found many applications in various areas, and typical examples include:

• Loudspeaker telephone systems in cars, conference rooms or hands-free systems connected to a computer, where the feedback path from speaker to microphone should be suppressed.

• Echo path cancelation in telephone networks (Homer et al., 1995).

The standard formulation of noise cancelation as shown in Figure 5.1 falls into the class of general adaptive filtering problems. However, in practice there are certain complications worth special attention, discussed below.

5.9.1. Feed-forward dynamics

In many cases, the dynamics between speaker and microphone are not negligible. Then, the block diagram must be modified by the feed-forward dynamics H(q) in Figure 5.34(a). This case also includes the situation when the transfer function of the speaker itself is significant. The signal estimate at the microphone is

ŝ_t = s_t + (G(q) − F(q; θ)H(q)) u_t.   (5.76)

For perfect noise cancelation, we would like the adaptive filter to mimic G(q)/H(q). We make two immediate reflections:


Figure 5.33. Impulse response of channel, equalizer and combined channel-equalizer. The overall response is close to an impulse function with a delay of D = 4 samples.

1. A necessary requirement for successful cancelation is that there must be a causal and stable realization of the adaptive filter F(q; θ) for approximating G(q)/H(q). This is guaranteed if H(q) is minimum phase. If H(q) contains a time delay, then G(q) must also contain a time delay.

2. Least squares minimization of the residuals ε_t in Figure 5.34(a) will generally not give the answer F(q; θ) ≈ G(q)/H(q). The reason is that the gradient of (5.76) is not the usual one, and may point in any direction.

However, with a little modification of the block diagram we can return to the standard problem. A very simple algebraic manipulation of (5.76) yields

ŝ_t = s_t + (G(q)/H(q) − F(q; θ)) ū_t,   ū_t = H(q)u_t.

This means that we should pre-filter the input to get back to the standard least squares framework of system identification. The improved alternative structure with pre-filter is illustrated in Figure 5.34(b) and (c).

For a FIR structure on F(q; θ), this means that we should pre-filter the regressors in a linear regression. This input pre-filtering should not be confused with the pre-filtering often done in system identification (Ljung, 1999), in which case the identified system does not change. The pre-filtering is the background for the so called filtered-input LMS, or filtered-X LMS, algorithm. The algorithm includes two phases.
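A minimal sketch of filtered-input LMS, assuming the two phases amount to (i) obtaining the impulse response of H(q) and (ii) running standard LMS on regressors built from the pre-filtered input ū_t = H(q)u_t (function name and interface are ours):

```python
import numpy as np

def filtered_x_lms(u, d, h, n, mu):
    """Filtered-input (filtered-X) LMS sketch. u: input signal,
    d: desired/microphone signal, h: assumed impulse response of H(q),
    n: number of FIR taps in F(q; theta), mu: step size."""
    u_bar = np.convolve(u, h)[:len(u)]   # pre-filtered input H(q)u
    theta = np.zeros(n)
    eps = np.zeros(len(u))
    for t in range(n, len(u)):
        phi = u_bar[t - np.arange(n)]    # (u_bar_t, ..., u_bar_{t-n+1})
        eps[t] = d[t] - phi @ theta      # residual
        theta = theta + mu * phi * eps[t]
    return theta, eps
```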


Figure 5.34. Feed-forward dynamics in noise cancelation. (a) Feed-forward dynamics in noise cancelling implies a convergence problem. (b) Alternative structure with pre-filtering. (c) Equivalent alternative structure.


5.9.2. Feedback dynamics

Acoustic feedback is a major problem in many sound applications. The most well-known example is in presentations and other performances when the sound amplification is too high, where the feedback can be a painful experience. The hearing aid application below is another example.

The setup is as in Figure 5.35. The amplifier has a gain G(q), while the acoustic feedback block is included in Figure 5.35 to model the dynamics H(q) between speaker and microphone. In contrast to noise cancelation, there is only one microphone here. To counteract the acoustic feedback, an internal and artificial feedback path is created in the adaptive filter F(q; θ).

Figure 5.35. Feedback dynamics in noise cancelation: the speaker signal goes to the microphone.

The least squares goal of minimizing the residual ε_t from the filter input ỹ_t (the speaker signal) means that the filter minimizes

ε_t = u_t − F(q; θ)ỹ_t + H(q)ỹ_t.

When F is of FIR type, this works fine. Otherwise there will be a bias in the parameters when the input signal u_t is colored, see Hellgren (1999), Siqueira and Alwan (1998) and Siqueira and Alwan (1999). This is a well-known fact in closed loop system identification, see Ljung (1999) and Forssell (1999): colored noise implies bias in the identified system. Here we should interpret H as the system to be identified, G as the known feedback and u as the system disturbance.

Example 5.23 Hearing aids: adaptive noise cancelation

Hearing aids are used to amplify external sound by a possibly frequency shaped gain. Modern hearing aids implement a filter customized for the user. However, acoustic feedback caused by leakage from the speaker to the microphone is very annoying to hearing-aid users. In principle, if the feedback


Figure 5.36. An 'in-the-ear' hearing aid with a built-in adaptive filter. (Reproduced by permission of Oticon.)

transfer function was known, it could be compensated for in the hardware. One problem here is the time variability of the dynamics, caused by a change in interference characteristics. Possible causes are hugs, or objects like a telephone coming close to the ear. The solution is to implement an adaptive filter in the hearing aid.

There are a number of different companies producing such devices. Figure 5.36 shows a commercial product from Oticon, which is the result of a joint research project (Hellgren, 1999). Here it should be noted that the feedback path differs quite a lot between the types of hearing aids. The 'behind-the-ear' model has the worst feedback path. The 'in-the-ear' models attenuate the direct path by their construction. For ventilation, there must always be a small air channel from the inner to the outer ear, and this causes a feedback path.


5.10. Applications

5.10.1. Human EEG

The data described in Section 2.4.2 are measurements from the human occipital area. The experiment aims at testing the reaction delay for health diagnosis. Before time t_b = 387 the lights are on in a test room and the test person is looking at something interesting. The neurons in the brain are processing information in the visual cortex, and only noise is seen in the measurements. When the lights are turned off, the visual cortex has nothing to do. The neuron clusters start a 10 Hz periodical 'rest rhythm'. The delay between t_b and the actual time when the rhythm starts varies strongly between healthy people and, for instance, those with Alzheimer's disease.

Figure 5.37 shows a quite precise location of the change point. The delay for detection can thus be used as one feature in health diagnosis. That is, medical diagnosis can be, at least partly, automated.

Figure 5.37. Human EEG, where the brain is excited at sample 387.

5.10.2. DC motor

Consider the DC motor lab experiment described in Section 2.5.1. Here we apply the RLS algorithm with a whiteness residual test, with the goal to detect system changes while being insensitive to disturbances. The squared residuals are fed to the CUSUM test with h = 7 and ν = 0.1. The physical model structure from (2.1) is an ARX(4,4,1) model (with appropriate noise assumptions), which is also used here. All test cases in Table 2.1 are considered.


Figure 5.38. Recursive parameter estimation with change detection of a DC motor. Nominal system (a), with torque disturbances (b), with change in dynamics (c), and with both disturbance and change (d).

From Figure 5.38 we can conclude the following:

• Not even in the nominal case do the parameters converge to the vicinity of the values of the theoretical expressions in (2.1). The explanation is that the closed loop system is over-modeled, so there are poles and zeros that cancel and thus introduce ambiguity. An ARX(2,2,1) model gives very similar test statistics but better parameter plots.

• It is virtually impossible to tune the CUSUM test to give an alarm for system changes while not giving alarms for the non-modeled disturbances.


5.10.3. Friction estimation

This section continues the case study discussed in Chapter 1 in the series of Examples 1.2, 1.5 and 1.9. The relations of interest are

s_m = ω_d / ω_n,
μ_m = μ + e_μ,
s_m = μ/k + δ + e.

N is a computable normal force on the tire, δ is an unknown offset in slip that depends on a small difference in wheel radii (which will be referred to as slip offset), and e is an error that comes from measurement errors in wheel velocities. As before, s_t is the measured wheel slip, and μ_t is a measured normalized traction force. Collecting the equations gives a linear regression model

s_m = ( μ_m   1 ) ( 1/k ; δ ) ,   (5.78)

with y_t = s_m, φ_t^T = (μ_m, 1) and θ_t = (1/k, δ)^T.

The theoretical relation between slip and traction force is shown in Figure 5.39. The classical tire theory does not explain the difference in initial slope, so empirical evidence is included in this schematic picture. During normal driving, the normalized traction force is in the order of 0.1. That is, measurements are normally quite close to the origin.

Figure 5.40 summarizes the estimation problem. Two estimation approaches are used, for different purposes.

• The least squares method is used for off-line analysis of data, for testing the data quality, debugging of pre-calculations, and most importantly for constructing the mapping between slip slope and friction. Here one single value of the slip slope is estimated for each batch of data, which is known to correspond to a uniform and known friction.

• The Kalman filter is used for on-line implementation. The main advantage of the Kalman filter over RLS and LMS is that it can be assigned different tracking speeds for the slip slope and the slip offset.

Least squares estimation

The least squares estimate is as usual computed by

θ̂_N = ( Σ_{t=1}^{N} φ_t φ_t^T )^{-1} Σ_{t=1}^{N} φ_t y_t.


Figure 5.39. Theoretical and empirical relation between wheel slip and traction force.

Figure 5.40. Filter for friction estimation.

There are some challenging practical problems worth mentioning:

1. Outliers. Some measurements just do not make sense to the model. There may be many reasons for this, and one should be careful which measurements are put into the filter.

2. The variation in normalized traction force determines the stochastic uncertainty in the estimates, as shown by the following result for the parameter covariance P_N = Cov(θ̂_N) from N data, assuming constant parameters:

P_N = σ² ( Σ_{t=1}^{N} φ_t φ_t^T )^{-1}.   (5.79)

That is, if the variation in normalized traction force is small during the time constant N of the filter, the parameter uncertainty will be large.

3. A well-known problem in least squares theory occurs in the case of errors in the regression vector. A general treatment of this matter, usually referred to as errors in variables or the total least squares problem,


is given in van Huffel and Vandewalle (1991). In our case, we have measurement and computation errors in the normalized traction force μ_t.

Assume that

μ_t^m = μ_t + v_t^μ,   Var(v_t^μ) = λ,   (5.80)

is used in φ_t. Here, Var(v_t^μ) means the variance of the error in the measurement μ_t^m. Some straightforward calculations show that the noise in μ_t^m leads to a positive bias in the slip slope,

k̂ ≈ k ( 1 + λ/Ξ(μ) ).   (5.81)

Here, Ξ(μ) is the variation of the normalized traction force, defined as

Ξ(μ) = (1/N) Σ_{t=1}^{N} ( μ_t − (1/N) Σ_{l=1}^{N} μ_l )².   (5.82)

This variation is identical to how one estimates the variance of a stochastic variable. Normally, the bias is small because Ξ(μ) ≫ λ. That is, the variation in normalized traction force is much larger than its measurement error, as illustrated by the sketch below.
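A small numerical sketch of the bias mechanism (5.80)-(5.81), with purely hypothetical numbers:

```python
import numpy as np

# Noise on the measured traction force shrinks the regression slope 1/k and
# hence inflates the estimated slip slope k, unless Xi(mu) >> lambda.
rng = np.random.default_rng(3)
N, k, delta = 10000, 40.0, 0.005
mu = 0.1 + 0.02 * rng.standard_normal(N)          # Xi(mu) ~ 4e-4
s = mu / k + delta + 3e-4 * rng.standard_normal(N)

for lam in (0.0, 4e-4):                           # measurement noise variance
    mu_m = mu + np.sqrt(lam) * rng.standard_normal(N)
    Phi = np.column_stack((mu_m, np.ones(N)))
    (slope, offset), *_ = np.linalg.lstsq(Phi, s, rcond=None)
    print(f"lambda={lam}: k_hat = {1 / slope:.1f}")
# lambda = 0 gives k_hat near 40; lambda = Xi(mu) roughly doubles it,
# in agreement with (5.81).
```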

An example of a least squares fit of a straight line to a set of data, when the friction undergoes an abrupt change at a known time instant, is shown in Figure 5.41.

The Kalman filter

We will here allow time variability in the parameters, k_t and δ_t. In this application, the Kalman filter is the most logical linear adaptive filter, because it is easily tuned to track parameters with different speeds. Equation (1.9) is extended to a state space model where the parameter values vary like a random walk:

θ_{t+1} = θ_t + v_t
y_t = φ_t^T θ_t + e_t,   (5.83)


Figure 5.41. Least squares estimation of friction model parameters to a set of data where the friction changes at a known time instant, when the least squares estimation is restarted.

where

θ_t = ( 1/k_t , δ_t )^T.

Here v_t and e_t are considered as independent white noise processes. The basic problems of lack of excitation, outliers and bias in the slip slope during cruising are the same as for least squares estimation, but explicit expressions are harder to find. A good rule of thumb is to relate the time constant of the filter to the time window N in the least squares method, which can lead to useful insights.

The filter originally presented in Gustafsson (1997b) was extended to a dynamic model in Yi et al. (1999). The idea here is to include a complete model of the driveline in the Kalman filter, instead of using the static mapping from engine torque to μ_t, as above.

Practical experience from test trials

This section summarizes results in Gustafsson (1997b) and Gustafsson (1998). The filter has been running for more than 10000 km, and almost 1000 documented tests have been stored. The overall result is that the linear regression


model and Kalman filter are fully validated. The noise e_t in (5.83) has been analyzed by studying the residuals from the filter, and can be well approximated by white Gaussian noise with variance 10^{-7}. The offset δ is slowly time-varying and (almost) surface independent. Typical values of the slip slope k are also known, and most importantly, they are found to differ for different surfaces. The theoretical result that filter performance depends on excitation in μ_t is also validated, and it has also turned out that measurements for small slip and μ (μ_t < 0.05) values are not reliable and should be discarded.

A simulation study

The general knowledge described in Section 5.10.3 makes it possible to perform realistic Monte Carlo simulations for studying the filter response, which is very hard to do for real data. What is important here is to use real measured μ_t, below denoted μ_t^m. The slip data is simulated from the slip model:

s_t = μ_t^m / k_t + δ_t + e_t.   (5.84)

The noise e_t is simulated according to a distribution estimated from field tests, while real data are used in μ_t^m. In the simulation,

δ_t = 0.005, k_t = 40,   t < 200,
δ_t = 0.005, k_t = 30,   t ≥ 200.

This corresponds to a change from asphalt to snow for a particular tire. The slip offset is given a high, but still normal, value. The noise has variance 10^{-7}, which implies that the signal is of the order 8 times larger than the noise. The used normalized traction force μ_t and one realization of s_t are shown in Figure 5.42.

As in the real-time implementation, the sampling interval is chosen as 0.2 s. Note that the change time at 200 is carefully chosen to give the worst possible condition in terms of excitation, since the variation in normalized traction force is very small after this time, see Section 5.10.3.

First, the response of the Kalman filter presented in the previous section is examined. The design parameters are chosen as

Q_t = ( 0.3  0 ; 0  0 ).

This implies that the slip slope is estimated adaptively, while the offset has no adaptation mechanism. In practice, the (2,2) element of Q is assigned


Figure 5.42. The measured normalized traction force (lower plot) and one realization of a simulated s_t (upper plot). Data collected by the author in collaboration with Volvo.

a small value. The results from the filter using the data in Figure 5.42 are shown in Figure 5.43. It is evident that the convergence time of 50 samples, corresponding to 10 seconds, is too long.

In the real-time implementation, the Kalman filter is supplemented by a CUSUM detector. For the evaluation purpose in this section, we use an RLS filter without adaptation (λ = 1). That is, we are not able to follow slow variations in the parameters, but can expect excellent performance for piecewise constant parameters (see Algorithm 3.3). The LS filter (see Algorithm 5.7) residuals are fed into a two-sided CUSUM detector, and the LS filter is restarted after an alarm. The design parameters in the CUSUM test are a threshold h = 3 · 10^{-3} and a drift ν = 5 · 10^{-4}. The drift parameter effectively makes changes smaller than 5% impossible to detect. After a detected change, the RLS filter is restarted.

The performance measures are:

• Delay for detection (DFD).

• Missed detection rate (MDR).

• Mean time to detection (MTD).

• The sum of squared prediction errors from each filter,

V_N = Σ_{t=t_0}^{N} ε_t².


0.05- - Trueparameters

0,045stimated parameters, slow filterstimated parameters, fast filter

0.04 ~~~~~~~~~~~~~

Figure 5.43. Illustration of parameter tracking after an abrupt friction change at sample 200, using the data in Figure 5.42 and a Kalman filter. The plot shows the true parameters together with the estimated parameters from a slow and a fast filter.

Here a delayed start index is used in the sum to rule out effects from the transient.

As argued in Gustafsson (1997a), the Minimum Description Length (MDL) can be interpreted as a norm for each method.

The root mean square parameter error (PE).

Algorithmic simplicity, short computation time and design complexity. This kind of algorithm is implemented in C or even assembler, and the hardware is shared with other functionalities. There should preferably be no design parameters if the algorithm has to be re-designed, for instance for another car model.

The parameter norm PE is perhaps a naive alternative, since the size of the parameter errors is not always a good measure of performance. Anyway, it is a measure of discrepancy in the parameter plots to follow.

Figures 5.44 and 5.45 show the parameter estimates from these filters averaged over 100 realizations and the histogram of alarm times, respectively. The main problem with the CUSUM test is the transient behavior after an alarm. In this implementation, a new identification is initiated after each alarm, but there might be other (non-recursive) alternatives that are better.


Figure 5.44. Illustration of parameter tracking after an abrupt friction change at sample 200, using the data in Figure 5.42 and the CUSUM LS filter.


Figure 5.45. Histogram over delay for detection in number of samples (T_s = 0.2 s) for the CUSUM LS filter over 100 realizations. The same estimates as in Figure 5.44, averaged over 100 realizations, are also shown.

Table 5.4. Comparison of some norms for the examined filters. The smaller the norm, the better the performance.

| Method   | MTD  | MDR | FAR | V_N            | MDL           | PE     | Time |
| CUSUM LS | 14.7 | 0   | 0   | 1.41 · 10^{-7} | 3.0 · 10^{-7} | 0.0058 | 1.11 |
| RLS      | -    | -   | -   | 1.5 · 10^{-7}  | 1.1 · 10^{-7} | 0.0043 | 1    |


Figure 5.46. Illustration of tracking ability in the estimated slip slope as a function of time. After 8 s an asphalt road changes to a short gravel path that ends at 16 s, so the true slip slope is expected to have abrupt changes at these time instants.

and then back to asphalt is considered. Figure 5.46 shows the estimated slip slope as a function of time in one of the tests, where the gravel path starts after 8 seconds and ends after 16 seconds. Gravel roads have a much smaller slip slope than asphalt roads.

Note that the Kalman filter first starts to adapt to a smaller slip slope at t = 8 s, but after three samples (0.6 s) something happens. This is where the CUSUM detector signals an alarm, and we have convergence after one more sample. Similar behavior is noted at the end of the gravel path. The Kalman filter first increases k slightly, and after some samples it speeds up dramatically and then takes a couple of seconds to converge to a stable value.

Final remarks

A summary of the project is shown in Figure 5.47. This application study shows the similarity between the use of adaptive (linear) filters and change detectors for solving parameter estimation problems. A linear adaptive filter can be tailored to the prior information of parameter changes, but the performance was still not good enough. A non-linear adaptive filter, as a combination of a linear adaptive filter and a change detector, was able to meet the requirements of noise attenuation and fast tracking. As a by-product, the skid alarm functionality was achieved, which is a relative measure of abrupt friction changes. It turned out that tire changes and wear of the tires made it necessary to calibrate the algorithm. If an absolute measure of friction is needed, an automatic calibration routine is needed. However, as a skid alarm unit, it works without calibration.


Figure 5.47. Overview of the friction estimation algorithm.

5.11. Speech coding in GSM

Figure 5.48 shows an overview of the signal processing in a second generation digital mobile telephone system. In this section, we will present the ideas in speech coding. Equalization is dealt with in Example 5.20.

Generally, signal modeling consists of a filter and an input model. For the GSM speech coder, the filter is an estimated AR(8) model (parameterized with a lattice structure) and the input is a pulse train. There are many tricks to get an efficient quantization and robustness to bit errors, which will be briefly mentioned. However, the main purpose is to present the adaptive estimator for coding. Decoding is essentially the inverse operation.

The algorithm outlined below is rather compact, and the goal is to get a flavour of the complexity of the chosen approach.

Filter model

Denote the sampled (sampling frequency 8192 Hz) speech signal by s_t. The main steps in adaptive filter estimation are:

1. Pre-filter with

   s̃_t = (1 − 0.9 q^{−1}) s_t.

This high pass filter highlights the subjectively important high frequencies around 3 kHz.

2. Segment the speech in fixed segments of 160 samples, corresponding to 20 ms of speech. Each segment is treated independently from now on.

3. Apply the Hamming window,

   s_t^w = c w_t s̃_t,


Figure 5.48. Overview of a digital mobile telephone system, e.g. GSM. The transmitter chain consists of speech encoder, bit rearrangement, channel coding, interleaving, encryption, burst arrangement and modulation; the receiver runs the corresponding blocks (demodulation, channel equalization, decryption, de-interleaving, channel decoding, bit rearrangement, speech decoding) in reverse order. The underlined blocks are described in detail in this book.

to avoid any influence from the signal tails. The constant c = 1.59 makes the energy invariant to the windowing, that is, (c²/160) Σ_{t=1}^{160} w_t² = 1.
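As a concrete illustration of steps 1-3, the following MATLAB sketch forms one windowed segment; the speech vector s is assumed given, and the Hamming window is written out explicitly to keep the sketch self-contained.

    % Steps 1-3: pre-filter, segmentation and energy-invariant windowing.
    stilde = filter([1 -0.9], 1, s);              % step 1: high pass pre-filter
    seg    = stilde(1:160);                       % step 2: one 20 ms segment
    w      = 0.54 - 0.46*cos(2*pi*(0:159)'/159);  % Hamming window
    c      = 1.59;                                % energy-invariant scaling
    sw     = c * w .* seg(:);                     % step 3: windowed segment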

Estimation of AR(8) model

4. The AR(8) model is determined by first estimating the covariance function

   R(k) = Σ_{t=1}^{160} s_t^w s_{t−k}^w,   for k = 0, 1, 2, ..., 8.

5. The AR coefficients could be estimated from the Wiener-Hopf equations. For numerical purposes, another parameterization using reflection coefficients ρ_1, ρ_2, ..., ρ_8 is used. The algorithm is called a Schur recursion.

6. The reflection coefficients are not suitable to quantize, since the resolution needs to be better when their numerical values are close to ±1. Instead, the so-called logarithmic area ratio

   LAR_i = log((1 + ρ_i) / (1 − ρ_i))

is used.
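The following sketch carries out steps 4-6 on the windowed segment sw from above. Note that the Levinson-Durbin recursion is used here as an equivalent way of obtaining the reflection coefficients; the standard uses the Schur recursion, which yields the same coefficients with fewer operations, and sign conventions may differ.

    % Steps 4-6: covariances, reflection coefficients and LAR parameters.
    R = zeros(9,1);
    for k = 0:8
      R(k+1) = sw(k+1:160)' * sw(1:160-k);   % R(k), k = 0,...,8
    end
    a = 1; E = R(1); rho = zeros(8,1);
    for i = 1:8                              % Levinson-Durbin recursion
      rho(i) = -(a' * R(i+1:-1:2)) / E;      % reflection coefficient
      a = [a; 0] + rho(i)*[0; flipud(a)];    % updated AR polynomial
      E = E * (1 - rho(i)^2);                % prediction error energy
    end
    LAR = log((1 + rho) ./ (1 - rho));       % logarithmic area ratios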


The decoder interpolates the LAR parameters to get a softer transition between the segments.

Input model

The input is essentially modeled as a pulse train, motivated by the vocal chords, which generate pulses that are modulated in the throat and mouth. The coder works as follows:

7. Decode the filter model, that is, invert steps 4-6. This gives the quantized AR(8) filter, and filtering the windowed speech through it gives the residual e_t. This is essentially the prediction error, but it also contains quantization effects.

8. Split each segment into four subsegments of length 40. These are treated independently for each input model. Initialize a buffer with the 128 latest prediction errors.

9. Estimate the covariances

   R_k = Σ_{t=1}^{40} e_t e_{t−k},   for k = 41, 42, ..., 128,

where e_{t−k} is taken from the buffer.

10. Compute

   D = arg max_{41 ≤ k ≤ 128} R_k,
   G = max_{41 ≤ k ≤ 128} R_k,

where D corresponds to the pulse interval and G is the amplification of the pulse size compared to the previous subsegment. G is quantized with 2 bits and D with 7 bits (D ∈ [1, 128]). A sketch of this search is given after the list.


11. Each pulse size is saved. To get efficient quantization, a long term prediction filter models the pulse size variations,

   ẽ_t = e_t − G e_{t−D}.

Since D > 40, the last term is taken from the buffer. This can be seen as an AR(1) model with sampling interval D units and a_1 = G. Note that the overall filter operation is now a cascade of two filters on two different time scales.

12. Decimate the sequence ẽ_t by a factor of 3. That is, low-pass filter with cut-off frequency ω_c = 4096/3 Hz and pick out every third sample. This gives 13 samples.

13. Depending on at which sample (1, 2 or 3) the decimation starts, there are three possible sequences. This offset is chosen to maximize the energy in the pulse sequence. The decoder constructs the pulse train by using this offset and inserting two zeros between each two pulses.

14. Normalize the pulses for a more efficient quantization.

Because of the structure, the input model is sometimes referred to as the long term prediction and the filter as the short term prediction.
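The lag and gain search of steps 9-10, and the long term prediction of step 11, can be sketched as follows. Here e holds the 40 residuals of the current subsegment and ebuf the 128 latest prediction errors, oldest first; this buffering convention, like the omitted gain normalization, is an assumption of the sketch.

    % Steps 9-11: long term predictor lag D, gain G and prediction.
    Rk = zeros(128,1);
    for k = 41:128
      Rk(k) = e(1:40)' * ebuf(129-k:168-k);  % cross covariance at lag k
    end
    [G, D] = max(Rk(41:128));                % G and D as defined in step 10
    D = D + 40;                              % pulse interval, 41 <= D <= 128
    etilde = e(1:40) - G*ebuf(129-D:168-D);  % step 11: LTP residual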

Code length

All in all, each segment of 160 samples (20 ms) requires 260 bits.


Channel coding

The 260 bits of each segment are sorted by priority of importance.

| 50 most important | 132 important | 78 less important |

To the most important, three parity bits are added to detect bit errors. To these 53 bits and the next 132, a cyclic convolution code is applied (denoted CC(2,1,5), which is a fifth order FIR filter). This gives 2 × 189 = 378 bits. The less important bits are sent as they are.

Thus, in total 456 bits are sent for each segment. These are transmitted in four independent bursts, each consisting of 116 bits.

They are complemented by a training sequence of 26 bits used for equalization (see Example 5.20), and three zeros at each side. This results in a so-called normal burst, constructed as

| 3 zeros | 58 | 26 training bits | 58 | 3 zeros |

The code rate is thus

   (8192/160 · 4 · 148) / (8192 · 8) = 0.46,

so the coded speech signal requires 46% of the bandwidth of what an uncoded signal would require.

5.A. Square root implementation

To motivate the problem, consider the linear regression problem in matrix notation, Y = Φ^T θ + E. One way to derive the LS estimate is to solve the over-determined system of equations

   Y = Φ^T θ.   (5.85)

Multiplying both sides with Φ and solving for θ gives

   Φ Y = Φ Φ^T θ   ⇒   θ̂ = (Φ Φ^T)^{−1} Φ Y.

It is the inversion of Φ Φ^T which may be numerically ill-conditioned, and the problem occurs already when the matrix product is formed. One remedy is to apply the QR factorization to the regression matrix, which yields

   Φ^T = Q R,


where Q is a square orthonormal matrix (Q^T Q = Q Q^T = I) and R is an upper triangular matrix. In MATLAB™, it is computed by [Q,R] = qr(Phi').

Now, multiply (5.85) with Q^{−1} = Q^T:

   Q^T Y = Q^T Φ^T θ = R θ.

The solution is now computed by solving the triangular system of equations given by the first d = dim(θ) rows of R θ̂ = Q^T Y, which has good numerical properties. Technically, the condition number of Φ Φ^T is the square of the condition number of R, which follows from

   Φ Φ^T = R^T Q^T Q R = R^T R.

It should be noted that MATLAB™'s backslash operator performs a QR factorization in the call thhat = Phi'\Y. As a further advantage, the minimizing loss function is easily computed as

   V(θ̂) = (Y − Φ^T θ̂)^T (Y − Φ^T θ̂),

which equals the squared norm of the last N − d elements of Q^T Y.
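The difference between the two solutions is easily illustrated in MATLAB; the sizes and data below are simulated for illustration only.

    % LS via normal equations versus QR for the model Y = Phi'*theta + E.
    d = 3; N = 100;
    Phi   = randn(d, N);               % regression matrix
    theta = randn(d, 1);
    Y     = Phi'*theta + 0.1*randn(N,1);
    th1 = (Phi*Phi') \ (Phi*Y);        % normal equations (may be ill-conditioned)
    [Q, R] = qr(Phi');                 % Phi' = Q*R
    z   = Q'*Y;
    th2 = R(1:d,1:d) \ z(1:d);         % triangular solve, numerically sound
    V   = norm(z(d+1:end))^2;          % minimizing loss V(thetahat)
    th3 = Phi' \ Y;                    % backslash does the QR internally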

Square root implementations of recursive algorithms also exist. For RLS, Bierman's UD factorization can be used. See, for instance, Friedlander (1982), Ljung and Söderström (1983), Sayed and Kailath (1994), Shynk (1989), Slock (1993) and Strobach (1991) for other fast implementations.

5.B. Derivations

In this section, different aspects of the least squares solution will be highlighted. The importance of these results will become clear in later chapters, in particular for likelihood-based change detection. First, a recursive version of weighted least squares is derived, together with the LS estimate over a sliding window (WLS). Then it is shown how certain off-line expressions that will naturally appear in change detection algorithms can be updated on-line. Some asymptotic approximations are derived, which are useful when implementing change detectors. Finally, marginalization corresponding to the computations in Chapter 3 for linear regression models is given. Here the parameter vector θ generalizes the mean in a changing mean model.


5.B.1. Derivation of LS algorithms

We now derive a recursive implementation of the weighted LS algorithms. In contrast to common practice in the literature when RLS is cited, no forgetting factor will be used here.

Weighted least squares in recursive form

The weighted multi-variable recursive least squares algorithm without forgetting factor (see Ljung and Söderström (1983)) minimizes the loss function

   V_t(θ) = Σ_{k=1}^{t} (y_k − φ_k^T θ)^T R_k^{−1} (y_k − φ_k^T θ)   (5.86)

with respect to θ. Here R_k should be interpreted as the measurement variance, which in this appendix can be a matrix corresponding to a vector-valued measurement. Differentiation of (5.86) gives

   V_t'(θ) = 2 Σ_{k=1}^{t} (−φ_k R_k^{−1} y_k + φ_k R_k^{−1} φ_k^T θ).   (5.87)

Equating (5.87) to zero gives the least squares estimate:

   θ̂_t = ( Σ_{k=1}^{t} φ_k R_k^{−1} φ_k^T )^{−1} ( Σ_{k=1}^{t} φ_k R_k^{−1} y_k ).   (5.88)

Introducing R̄_t = Σ_{k=1}^{t} φ_k R_k^{−1} φ_k^T, the estimate can be updated as

   θ̂_t = θ̂_{t−1} + R̄_t^{−1} φ_t R_t^{−1} (y_t − φ_t^T θ̂_{t−1}).   (5.89)

This is the recursive update for the RLS estimate, together with the obvious update of R̄_t:

   R̄_t = R̄_{t−1} + φ_t R_t^{−1} φ_t^T.   (5.90)


It is convenient to introduce P_t = (R̄_t)^{−1}. The reasons are to avoid the inversion of R̄_t, and so that P_t can be interpreted as the covariance matrix of θ̂. The matrix inversion lemma

   (A + B C D)^{−1} = A^{−1} − A^{−1} B (C^{−1} + D A^{−1} B)^{−1} D A^{−1}   (5.91)

applied to (5.90) gives

   P_t = (P_{t−1}^{−1} + φ_t R_t^{−1} φ_t^T)^{−1}
       = P_{t−1} − P_{t−1} φ_t (φ_t^T P_{t−1} φ_t + R_t)^{−1} φ_t^T P_{t−1}.   (5.92)

With this update formula for P_t, we can rewrite (R̄_t)^{−1} φ_t in (5.89) in the following manner:

   (R̄_t)^{−1} φ_t R_t^{−1} = P_t φ_t R_t^{−1}
       = (P_{t−1} − P_{t−1} φ_t (φ_t^T P_{t−1} φ_t + R_t)^{−1} φ_t^T P_{t−1}) φ_t R_t^{−1}
       = P_{t−1} φ_t (φ_t^T P_{t−1} φ_t + R_t)^{−1}.   (5.93)

We summarize the recursive implementation of the LS algorithm:

Algorithm 5.7 Weighted LS in recursive form

Consider the linear regression (5.3) and the loss function (5.86). Assume that the LS estimate at time t − 1 is θ̂_{t−1} with covariance matrix P_{t−1}. Then a new measurement gives the update formulas:

   θ̂_t = θ̂_{t−1} + P_{t−1} φ_t (φ_t^T P_{t−1} φ_t + R_t)^{−1} (y_t − φ_t^T θ̂_{t−1})   (5.94)
   P_t = P_{t−1} − P_{t−1} φ_t (φ_t^T P_{t−1} φ_t + R_t)^{−1} φ_t^T P_{t−1}.   (5.95)

Here the initial conditions are given by the prior, θ ∈ N(θ_0, P_0). The a posteriori distribution of the parameter vector is

   θ_t ∈ N(θ̂_t, P_t).

The interpretation of P_t as a covariance matrix, and the distribution of θ_t, follows from the Kalman filter. Note that the matrix that has to be inverted above is usually of lower dimension than R̄_t (if dim y < dim θ).

Windowed LS

In some algorithms, the parameter distribution given the measurements in a sliding window is needed. This can be derived in the following way. The trick is to re-use old measurements with negative variance.


The density function for the measurements can equivalently be computed from the residuals. With a little abuse of notation, p(y^N) = p(ε^N). The sum of squared on-line residuals relates to the sum of squared off-line residuals as:

   Σ_{t=1}^{N} (y_t − φ_t^T θ̂_{t−1})^T (φ_t^T P_{t−1} φ_t + R_t)^{−1} (y_t − φ_t^T θ̂_{t−1})
   = Σ_{t=1}^{N} (y_t − φ_t^T θ̂_N)^T R_t^{−1} (y_t − φ_t^T θ̂_N) + (θ̂_N − θ_0)^T P_0^{−1} (θ̂_N − θ_0).   (5.96)

The on-line residual covariances relate to the a posteriori parameter covariance as:

   Σ_{t=1}^{N} log det(φ_t^T P_{t−1} φ_t + R_t) = Σ_{t=1}^{N} log det R_t + log det P_0 − log det P_N.   (5.97)

Together, this means that the likelihood given all available measurements can be computed from off-line statistics as:

   p(y^N) = p(y^N | θ̂_N) p_θ(θ̂_N) (2π)^{d/2} (det P_N)^{1/2},   (5.98)

which holds for both a Gaussian prior p_θ(x) and a flat non-informative one, p_θ(x) = 1.

The matrix notation

   Y = Φ^T θ + E   (5.99)

will be convenient for collecting N measurements in (5.3) into one equation. Here

   Y = (y_1^T, ..., y_N^T)^T,   Φ = (φ_1, ..., φ_N),   E = (e_1^T, ..., e_N^T)^T,   Λ = Cov(E) = diag(R_1, ..., R_N).


Thus, Λ is a block diagonal matrix. The off-line version of RLS in Algorithm 5.7 can then be written

   θ̂_N = (R̄_N + R̄_0)^{−1} (f_N + f_0)
   P_N = (R̄_N + R̄_0)^{−1},   (5.100)

where

   R̄_N = Σ_{t=1}^{N} φ_t R_t^{−1} φ_t^T,   f_N = Σ_{t=1}^{N} φ_t R_t^{−1} y_t,

or, in matrix notation,

   R̄_N = Φ Λ^{−1} Φ^T,   f_N = Φ Λ^{−1} Y.

Here θ_0 and R̄_0 are given by the prior, with f_0 = R̄_0 θ_0. With a Gaussian prior we have that θ ∈ N(θ_0, (R̄_0)^{−1}), and a flat non-informative prior is achieved by letting R̄_0 = 0.

Note that a flat prior cannot be used on-line (at least not directly).

First, an on-line expression for the density function of the data is given. This is essentially a sum of squared prediction errors. Then, an off-line expression is given, which is interpreted as the sum of squared errors one would have obtained if the final parameter estimate had been used as the true one from the beginning.

Lemma 5.2 (On-line expression for the density function)
The density function of the sequence y^N = {y_k}_{k=1}^{N} from (5.3) is computed recursively by

   log p(y^N) = log p(y^{N−1}) − (n_y/2) log 2π − (1/2) log det(φ_N^T P_{N−1} φ_N + R_N)
                − (1/2) ε_N^T (φ_N^T P_{N−1} φ_N + R_N)^{−1} ε_N
              = log p(ε^N),

where ε_N = y_N − φ_N^T θ̂_{N−1} is the prediction error.


By completing the squares, the exponent can be rewritten as (where θ̂ = θ̂_N)

   p(y^N) ∝ exp( −(1/2) [ (Y − Φ^T θ̂_N)^T Λ^{−1} (Y − Φ^T θ̂_N) + (θ̂_N − θ_0)^T R̄_0 (θ̂_N − θ_0) ] )
            × ∫ (2π)^{−d/2} (det(R̄_0 + R̄_N))^{1/2} exp( −(1/2) (θ − θ̂_N)^T (R̄_0 + R̄_N) (θ − θ̂_N) ) dθ
            × (2π)^{d/2} (det(R̄_0 + R̄_N))^{−1/2}
          = p(y^N | θ̂_N) p_θ(θ̂_N) (det P_N)^{1/2} (2π)^{d/2}.

The case of a non-informative prior, p_θ(θ) = 1 in (5.103), is treated by letting R̄_0 = 0 in the exponential expressions, which gives

   p(y^N) = p(y^N | θ̂_N) ∫ exp( −(1/2) (θ − θ̂_N)^T R̄_N (θ − θ̂_N) ) dθ
          = p(y^N | θ̂_N) (det P_N)^{1/2} (2π)^{d/2}.   □

It is interesting to note that the density function can be computed by just inserting the final parameter estimate into the prior densities p(y^N | θ) and


p_θ(θ), apart from a data independent correcting factor. The prediction errors in the on-line expression are essentially replaced by the smoothing errors.

Lemma 5.4
Consider the covariance matrix P_t given by the RLS Algorithm 5.7. The following relation holds:

   Σ_{t=1}^{N} log det(φ_t^T P_{t−1} φ_t + R_t) = Σ_{t=1}^{N} log det R_t + log det P_0 − log det P_N.

Proof: By using the alternative update formula for the covariance matrix, P_t^{−1} = P_{t−1}^{−1} + φ_t R_t^{−1} φ_t^T, and the equality det(I + AB) = det(I + BA), we have

   Σ_{t=1}^{N} log det(φ_t^T P_{t−1} φ_t + R_t)
   = Σ_{t=1}^{N} ( log det R_t + log det(I + R_t^{−1} φ_t^T P_{t−1} φ_t) )
   = Σ_{t=1}^{N} ( log det R_t + log det(I + φ_t R_t^{−1} φ_t^T P_{t−1}) )
   = Σ_{t=1}^{N} ( log det R_t − log det P_{t−1}^{−1} + log det(P_{t−1}^{−1} + φ_t R_t^{−1} φ_t^T) )
   = Σ_{t=1}^{N} ( log det R_t − log det P_{t−1}^{−1} + log det P_t^{−1} )
   = Σ_{t=1}^{N} log det R_t + log det P_0 − log det P_N,

and the off-line expression follows. In the last equality we used that most of the terms in the telescoping sum cancel. □

Actually, the third equality (5.98) we want to show follows from the first two, (5.97) and (5.96), as can be seen by writing out the density functions in


Lemmas 5.2 and 5.3. Thus, we do not have to prove all of them. Since the first (5.97) and the third (5.98) are proved, the second one (5.96) follows. A direct proof of the second equality is found in Ljung and Söderström (1983), p. 434.

5.B.3. Asymptotic expressions

The next lemma explains the behavior of the other model dependent factor, log det P_N (besides the smoothing errors), in Lemma 5.3.

Lemma 5.5 (Asymptotic parameter covariance)
Assume that the regressors satisfy the condition that

   Q̄ = lim_{N→∞} (1/N) Σ_{t=1}^{N} φ_t R_t^{−1} φ_t^T

exists and is non-singular. Then

   (log det P_N) / (log N) → −d,   N → ∞,

with probability one. Thus, for large N, log det P_N approximately equals −d log N.


Proof: We have from the definition of R̄_N that

   −log det P_N = log det(R̄_0 + R̄_N)
                = log det(N I_d) + log det( (R̄_0 + Σ_{t=1}^{N} φ_t R_t^{−1} φ_t^T) / N ).

Here the first term equals d log N and the second one tends to log det Q̄, and the result follows. □

5.B.4. Derivation of marginalization

Throughout this section, assume the linear regression model

   y_t = φ_t^T θ + e_t,

where {e_t} is Gaussian distributed with zero mean and variance λ, and where θ is a stochastic variable of dimension d. The regression vector φ_t is known at time t − 1.

This means that the analysis is conditioned on the model. The fact that all the PDFs are conditioned on a particular model structure will be suppressed.

The simplest case is the joint likelihood for the parameter vector and the noise variance.

Theorem 5.6 (No marginalization)
For given λ and θ, the PDF of y^N is

   p(y^N | θ, λ) = (2πλ)^{−N/2} exp( −V_N(θ) / (2λ) ),   (5.104)

where V_N(θ) = Σ_{t=1}^{N} (y_t − φ_t^T θ)².

Proof: The proof is trivial when {φ_t} is a known sequence. In the general case, when φ_t may contain past disturbances (through feedback), Bayes' rule gives

   p(y^N | θ, λ) = Π_{t=1}^{N} p(y_t | y^{t−1}, θ, λ),

and the theorem follows by noticing that

   p(y_t | y^{t−1}, θ, λ) = N(φ_t^T θ, λ),

since φ_t contains past information only.


Next, continue with the likelihood involving only the noise variance. Since

   p(y^N | λ) = ∫ p(y^N | θ, λ) p(θ | λ) dθ,

the previous theorem can be used.

Theorem 5.7 (Marginalization of θ)
The PDF of y^N conditioned on the noise variance λ, when the prior for θ is as in (5.105), is given by (5.106), where θ̂_N is given by (5.107).


Using this together with (5.104) gives (5.109), and thus (5.110), which proves the theorem. □

The likelihood involving the parameter vector θ is given by

   p(y^N | θ) = ∫ p(y^N | θ, λ) p(λ) dλ.

Computing this integral gives the next theorem.

Theorem 5.8 (Marginalization of λ)
The PDF of y^N conditioned on θ, when λ is W^{−1}(m, α), is given by (5.111).


Proof: In order to prove (5.111), write the marginal likelihood as the integral of p(y^N | θ, λ) p(λ) over λ. Then (5.111) follows by identifying the integral with a multiple of an integral of a W^{−1}(V_N(θ) + α, N + m + d) PDF. □

The distribution of the data conditioned only on the model structure is given by Theorem 5.9.

Theorem 5.9 (Marginalization of θ and λ)
The PDF of y^N, when λ is W^{−1}(m, α) and θ is distributed as in (5.105), follows by combining (5.115) and (5.116) in the proof below, where V_N(θ̂) and θ̂ are defined by (5.112) and (5.107), respectively.

Remark 5.10
Remember that the theorem gives the posterior distribution of the data conditioned on a certain model structure, so p(y^N | M) is more exact.

Proof: From (5.106) and the definition of the Wishart distribution (3.54), equations (5.115) and (5.116) follow,


which proves the theorem. □


Change detection based on sliding windows

6.1. Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
6.2. Distance measures . . . . . . . . . . . . . . . . . . . . . . 211
6.2.1. Prediction error . . . . . . . . . . . . . . . . . . . . . . . . 211
6.2.2. Generalized likelihood ratio . . . . . . . . . . . . . . . . . 212
6.2.3. Information based norms . . . . . . . . . . . . . . . . . . 212
6.2.4. The divergence test . . . . . . . . . . . . . . . . . . . . . 212
6.2.5. The asymptotic local approach . . . . . . . . . . . . . . . 214
6.2.6. General parallel filters . . . . . . . . . . . . . . . . . . . . 217
6.3. Likelihood based detection and isolation . . . . . . . . 218
6.3.1. Diagnosis . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
6.3.2. A general approach . . . . . . . . . . . . . . . . . . . . . . 221
6.3.3. Diagnosis of parameter and variance changes . . . . . . . 221
6.4. Design optimization . . . . . . . . . . . . . . . . . . . . . 225
6.5. Applications . . . . . . . . . . . . . . . . . . . . . . . . . 227
6.5.1. Rat EEG . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
6.5.2. Belching sheep . . . . . . . . . . . . . . . . . . . . . . . . 227
6.5.3. Application to digital communication . . . . . . . . . . . 229

6.1. Basics

Model validation is the problem of deciding whether observed data are consistent with a nominal model. Change detection based on model validation aims at applying a consistency test in one of the following ways:

The data are taken from a sliding window. This is the typical application of model validation.

The data are taken from an increasing window. This is one way to motivate the local approach. The detector becomes more sensitive when the data size increases, by looking for smaller and smaller changes.


The nominal model will be represented by the parameter vector θ_0. This may be obtained in one of the following ways:

θ_0 is recursively identified from past data, except for the ones in the sliding window. This will be our typical case.

θ_0 corresponds to a nominal model, obtained from physical modeling or system identification.

The standard setup is illustrated in (6.1), where a model (θ̂) based on data from a sliding window of size L is compared to a model (θ̂_0) based on all past data or a substantially larger sliding window. Let us denote the vector of L measurements in the sliding window by

   Y = (y_{t−L+1}^T, ..., y_{t−1}^T, y_t^T)^T.

Note the convention that Y is a column vector of dimension L n_y (n_y = dim(y)). We will, as usual in this part, assume scalar measurements. In a linear regression model, Y can be written

   Y = Φ^T θ + E,

where E = (e_{t−L+1}^T, ..., e_{t−1}^T, e_t^T)^T is the vector of noise components and the regression matrix is Φ = (φ_{t−L+1}, ..., φ_{t−1}, φ_t). The noise variance is E(e_t²) = λ.

We want to test the following hypotheses:

H_0: The parameter vectors are the same, θ = θ_0. That is, the model is validated.

H_1: The parameter vector θ is significantly different from θ_0, and the null hypothesis can be rejected.

We will argue, from the examples below, that all plausible change detection tests can be expressed in one of the following ways:

1. The parameter estimation error is θ̃ = M Y − θ_0, where M is a matrix to be specified. Standard statistical tests can be applied from the Gaussian assumption on the noise:

   θ̃ = M Y − θ_0 ∈ N(0, P), under H_0.

Both M and P are provided by the method.


2. The test statistic is the norm of the simulation error, which is denoted a loss function V, in resemblance with the weighted least squares loss function:

   V = ||Y − Y_0||_Q² ≜ (Y − Y_0)^T Q (Y − Y_0).

The above generalizations hold for our standard models only when the noise variance is known. The case of unknown or changing variance is treated in later sections, and leads to the same kind of projection interpretations, but with non-linear transformations (logarithms).

In the examples below, there is a certain geometric interpretation in that Q turns out to be a projection matrix, i.e., Q² = Q. The figure below (adopted from Section 13.1), which illustrates the geometrical properties of the least squares solution (Y = Y_0 + E below), is useful in the following calculations.

Example 6.1 Linear regression and parameter norm

Standard calculations give

   θ̂ = (Φ Φ^T)^{−1} Φ Y ∈ N(θ_0, λ (Φ Φ^T)^{−1}), under H_0.

The Gaussian distribution requires that the noise is Gaussian. Otherwise, the distribution is only asymptotically Gaussian. A logical test statistic is


   ||θ̂ − θ_0||²_{P^{−1}} ≜ (θ̂ − θ_0)^T P^{−1} (θ̂ − θ_0),

which is, as indicated, χ² distributed with d = dim(θ) degrees of freedom. A standard table can be used to design a threshold, so the test becomes

   ||θ̂ − θ_0||²_{P^{−1}} ≷ h.

The alternative formulation is derived by using a simulated signal Y_0 = Φ^T θ_0, which means we can write θ̂ − θ_0 = (Φ Φ^T)^{−1} Φ (Y − Y_0) and P = λ (Φ Φ^T)^{−1}, so

   ||θ̂ − θ_0||²_{P^{−1}} = λ^{−1} (Y − Y_0)^T Q (Y − Y_0).

Note that Q = Φ^T (Φ Φ^T)^{−1} Φ is a projection matrix, Q² = Q. Here we have assumed persistent excitation, so that Φ Φ^T is invertible.

It should be noted that the alternative norm

   ||θ̂ − θ_0||²,

without weighting with the inverse covariance matrix, is rather naive. One example is an AR model where the last coefficient is very small (it is the product of all poles, which are all less than one), so a change in it will not affect this norm very much, although the poles of the model can change dramatically.

Example 6.2 Linear regression and GLR

The Likelihood Ratio (LR) for testing the hypothesis of a new parameter vector against the nominal one is

   LR = p(Y | θ) / p(Y | θ_0).

Assuming Gaussian noise with constant variance, we get the log likelihood ratio (LLR).


Replacing the unknown parameter vector by its most likely value (the maximum likelihood estimate θ̂), we get the Generalized Likelihood Ratio (GLR):

   V_GLR ≜ 2λ log GLR
   = ||Y − Φ^T θ_0||² − ||Y − Φ^T θ̂||²
   = ||Y − Y_0||² − ||Y − Ŷ||²
   = Y^T Y − 2Y^T Y_0 + Y_0^T Y_0 − Y^T Y + 2Y^T Ŷ − Ŷ^T Ŷ
   = −2Y^T Y_0 + Y_0^T Y_0 + 2Y^T Ŷ − Ŷ^T Ŷ
   = −2Y^T Q Y_0 + Y_0^T Q Y_0 + 2Y^T Q Y − Y^T Q Y
   = −2Y^T Q Y_0 + Y_0^T Q Y_0 + Y^T Q Y
   = (Y − Y_0)^T Q (Y − Y_0)
   = ||Y − Y_0||²_Q,

where Ŷ = Q Y and Q Y_0 = Y_0 have been used. This idea of combining GLR and a sliding window was proposed in Appel and Brandt (1983).

Example 6.3 Linear regression and model differences

A loss function, not very common in the literature, is the sum of model differences rather than prediction errors. The loss function based on model differences (MD) is

   V_MD ≜ ||Φ^T θ_0 − Φ^T θ̂||²
        = ||Y_0 − Ŷ||²
        = ||Q Y_0 − Q Y||²
        = ||Y − Y_0||²_Q,

which is again the same norm.

Example 6.4 Linear regression and divergence test

The divergence test was proposed in Basseville and Benveniste (1983b), and is reviewed in Section 6.2. Assuming constant noise variance, it gives

   V_DIV ≜ ||Y − Φ^T θ_0||² − (Y − Φ^T θ_0)^T (Y − Φ^T θ̂)
         = ||Y − Y_0||² − (Y − Y_0)^T (Y − Q Y)
         = (Y − Y_0)^T (Q Y − Y_0)
         = (Y − Y_0)^T Q (Y − Y_0)
         = ||Y − Y_0||²_Q,

where again Q Y_0 = Y_0 has been used.


Again, the same distance measure is obtained.

Example 6.5 Linear regression and the local approach

The test statistic in the local approach, reviewed in Section 6.2.5, is asymptotically Gaussian; see Section 6.2.5 for details. Since a test statistic does not lose information during linear transformation, we can equivalently take a linearly transformed statistic that is AsN(θ, P), and we are essentially back to Example 6.1.

To summarize, the test statistic is (asymptotically) the same for all of the linear regression examples above, and can be written as the two-norm of the projection Q(Y − Y_0),

   ||Y − Y_0||²_Q ∈ χ²(d),

or as the parameter estimation error,

   θ̂ − θ_0 ∈ N(0, P).

These methods coincide for Gaussian noise with known variance; otherwise they are generally different.

There are similar methods for state space models. A possible approach is shown in the example below.

Example 6.6 State space model with additive changes

State space models are discussed in the next part, but this example defines a parameter estimation problem in a state space framework:

   x_{t+1} = A_t x_t + B_{u,t} u_t + B_{v,t} v_t + B_{θ,t} θ   (6.2)
   y_t = C_t x_t + e_t + D_{u,t} u_t + D_{θ,t} θ.   (6.3)

The Kalman filter applied to an augmented state space model


gives a parameter estimator which can be expanded to a linear function of data, where the parameter estimate after L measurements can be written

   θ̂ = L_y Y + L_u U ∈ N(0, P_{θ̂}), under H_0.

Here the Kalman filter quantities have been split accordingly.

In a general and somewhat abstract way, the idea of a consistency test is to compute a residual vector as a linear transformation of a batch of data, for instance taken from a sliding window, ε = A_i Y + b_i. The transformation matrices depend on the approach. The norm of the residual can be taken as the distance measure

   g = ||A_i Y + b_i||

between the hypotheses H_1 and H_0 (no change/fault). The statistical approach in this chapter decides if the size of the distance measure is statistically significant, and this test is repeated at each time instant. This can be compared with the approach in Chapter 11, where algebraic projections are used to decide significance in a non-probabilistic framework.

6.2. Distance measures

We here review some proposed distance functions. In contrast to the examples in Section 6.1, the possibility of a changing noise variance is included.

6.2.1. Prediction error

A test statistic proposed in Segen and Sanderson (1980) is based on the prediction error. Here λ is the nominal variance of the noise before the change. This statistic is small if no jump occurs and starts to grow after a jump.


6.2.2. Generalized likelihood ratio

In Basseville and Benveniste (1983c), two different test statistics for the case of two different models are given. A straightforward extension of the generalized likelihood ratio test in Example 6.2 leads to the test statistic (6.7), which was at the same time proposed in Appel and Brandt (1983), and will in the sequel be referred to as Brandt's GLR method.

6.2.3. Information based norms

To measure the distance between two models, any norm can be used, and we will here outline some general statistical information based approaches; see Kumamaru et al. (1989) for details and a number of alternatives. First, the Kullback discrimination information between two probability density functions p_1 and p_2 is defined as

   I(1, 2) = ∫ p_1(x) log( p_1(x) / p_2(x) ) dx ≥ 0,

with equality only if p_1(x) = p_2(x). In the special case of Gaussian distributions we are focusing on, p_i(x) = N(θ̂_i, P_i), the Kullback information has a closed form expression.

The Kullback information is not a norm and thus not suitable as a distance measure, simply because it is not symmetric, I(1, 2) ≠ I(2, 1). However, this minor problem is easily resolved, and the Kullback divergence is defined as

   V(1, 2) = I(1, 2) + I(2, 1) ≥ 0.

6.2.4. The divergence test

From the Kullback divergence, the divergence test can be derived, and it is an extension of the ideas leading to (6.6). It equals

   ||Y − Φ^T θ_0||² − (Y − Φ^T θ_0)^T (Y − Φ^T θ̂),

up to scaling with the noise variance.


Table 6.1. Estimated change times for different methods.

| Signal   | Method            | Estimated change times |
| Filtered | Divergence        | 445 645 1550 1800 2151 2797 3626 |
| Filtered | Brandt's GLR (16) | 445 645 1550 1800 2151 2797 3626 |
| Filtered | Brandt's GLR (2)  | 445 645 1550 1750 2151 2797 3400 ... |
| Noisy    | Divergence        | 451 611 1450 1900 2125 ... |
| Noisy    | Brandt's GLR      | 451 611 1450 1900 2125 ... |
| Noisy    | Brandt's GLR      | 593 1450 2125 ... |

The corresponding algorithm will be called the divergence test. Both these statistics start to grow when a jump has occurred, and again the task of the stopping rule is to decide whether the growth is significant. Some other proposed distance measures, in the context of speech processing, are listed in de Souza and Thomson (1982).

These two statistics are evaluated on a number of real speech data sets in Andre-Obrecht (1988) for the growing window approach. A similar investigation with the same data is found in Example 6.7 below.

Example 6.7 Speech segmentation

To illustrate an application where the divergence and GLR tests have been applied, a speech recognition system for use in cars is studied. The first task of this system, which is the target of our example, is to segment the signal.

The speech signal under consideration was recorded inside a car by the French National Agency for Telecommunications, as described by Andre-Obrecht (1988). The sampling frequency is 12.8 kHz, and a part of the signal is shown in Figure 6.1, together with a high-pass filtered version with cut-off frequency 150 Hz; the resolution is 16 bits.

Two segmentation methods were applied and tuned to these signals in Andre-Obrecht (1988). The methods are the divergence test and Brandt's GLR algorithm. The sliding window size is L = 160, the threshold h = 40 and the drift parameter ν = 0.2. For the pre-filtered signal, a simple detector for finding voiced and unvoiced parts of the speech is used as a first step. In the case of unvoiced speech, the design parameters are changed to h = 80 and ν = 0.8. A summary of the results is given in Table 6.1, and is also found in Basseville and Nikiforov (1993), for the same part of the signal as considered here. In the cited reference, see Figure 11.14 for the divergence test and Figures 11.18 and 11.20 for Brandt's GLR test. A comparison to a filter bank approach is given in Section 7.7.2.


Figure 6.1. A speech signal recorded in a car (upper plot) and a high-pass filtered version (lower plot).

6.2.5. The asymptotic local approach

The asymptotic local approach was proposed in Benveniste et al. (1987a) as a means for monitoring any adaptive parameter estimation algorithm for abrupt parameter changes. The method is revisited and generalized to non-linear systems in Zhang et al. (1994).

The size of the data record L will be kept as an index in this section. The hypothesis test is

   H_0: no change,
   H_1: change, θ = θ_0 + ν/√L.   (6.10)

The assumed alternative hypothesis may look strange at first glance. Why should the change for any physical reason become smaller as time evolves? There is no reason. The correct interpretation is that the hypothesis makes use of the fact that the test can be made more sensitive when the number of data increases. In this way, an estimate of the change magnitude will have a covariance of constant size, rather than decreasing like 1/L. Other approaches described in Section 6.1 implicitly have this property, since the covariance matrix P decays like one over L. The main advantages of this hypothesis test are the following:

The asymptotic local approach, which is standa rd in statistics, can beapplied. Thus, asymptotic analysis is facilitated. Note, however, from


Example 6.8 Asymptotic local approach for linear regression model

Consider as a special case the linear regression model, for which these definitions become quite intuitive. For a linear regression model with the standard least squares statistics

   θ̂_L = R̄_L^{−1} f_L,   P_L = λ R̄_L^{−1},   θ̂_L ∈ AsN(θ_0, P_L),

the data, primary residual and improved residual (quasi-score) are defined accordingly. Using the least squares statistics above, we can rewrite the definition of η_L(θ_0), and it follows that the asymptotic distribution is Gaussian. Note that the scaled covariance matrix L P_L tends to a constant matrix C(θ_0) whenever the elements in the regressor are quasi-stationary.


The last remark is one of the key points in the asymptotic approach. The scaling of the change makes the covariance matrix independent of the sliding window size, and thus the algorithm has constant sensitivity.

The primary residual K(Z_k, θ_0) resembles the update step in an adaptive algorithm such as RLS or LMS. One interpretation is that, under the no change hypothesis, K(Z_k, θ_0) plays the role of the parameter update in the adaptive algorithm. The detection algorithm proposed in Hägglund (1983) is related to this approach; see also (5.63).

Assume now that η ∈ AsN(Mν, Σ), where we have dropped indices for simplicity. A standard Gaussian hypothesis test of H_0: ν = 0 can be used in the case that η is scalar (see Example 3.6). How can we obtain a hypothesis test in the case that η is not scalar? If M is a square matrix, a χ² test is readily obtained by noting that

   M^{−1} η ∈ AsN(ν, M^{−1} Σ M^{−T})
   (M^{−1} Σ M^{−T})^{−1/2} M^{−1} η ∈ AsN( (M^{−1} Σ M^{−T})^{−1/2} ν, I_{n_ν} ) ≜ w
   w^T w ∈ Asχ²(n_ν),

where the last distribution holds when ν = 0. A hypothesis test threshold is taken from the χ² distribution. The difficulty occurs when M is a thin matrix, when a projection is needed. Introduce the test statistic

   w = (M^T Σ^{−1} M)^{−1/2} M^T Σ^{−1} η.

We have

   E(w) = (M^T Σ^{−1} M)^{−1/2} M^T Σ^{−1} M ν = (M^T Σ^{−1} M)^{1/2} ν,
   Cov(w) = (M^T Σ^{−1} M)^{−1/2} M^T Σ^{−1} M (M^T Σ^{−1} M)^{−1/2} = I_{n_ν}.

We have now verified that the test statistic satisfies w^T w ∈ χ²(n_ν) under H_0, so again a standard test can be applied.
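In MATLAB, the projected statistic is computed as below; eta, M and Sigma are assumed given, and sqrtm denotes the matrix square root.

    % chi-square test for eta ~ AsN(M*nu, Sigma) with a thin matrix M.
    A = M' * (Sigma \ M);                 % M' Sigma^{-1} M
    w = sqrtm(A) \ (M' * (Sigma \ eta));  % w = A^{-1/2} M' Sigma^{-1} eta
    T = w' * w;                           % chi2(n_nu) under H0; compare to threshold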

6.2.6. General parallel filters

The idea of parallel filtering is more general than doing model validation on a batch of data. Figure 6.2 illustrates how two adaptive linear filters with different adaptation rates are run in parallel. For example, we can take one RLS filter with forgetting factor 0.999 as the slow filter, and one LS estimator over a sliding window of size 20 as the fast filter. The task of the slow filter


Figure 6.2. Two parallel filters: one slow, to get good noise attenuation, and one fast, to get fast tracking.

is to produce low variance state estimates in normal mode, and the fast filter takes over after abrupt or fast changes. The switch makes the decision. Using the guidelines from Section 6.1, we can test the size of

   (θ̂_1 − θ̂_0)^T (P_1)^{−1} (θ̂_1 − θ̂_0).

The idea is that P_1 is a valid measure of the uncertainty in the fast estimate θ̂_1, while the covariance P_0 of the slow filter is negligible.

6.3. Likelihood based detection and isolation

6.3.1. Diagnosis

Diagnosis is the combined task of detection and isolation. Isolation is the problem of deciding what has happened after the change detector has noticed that something has happened.

Let the parameter vector be divided into two parts,

   Y = Φ^T θ = ( (Φ^a)^T (Φ^b)^T ) ( θ^a ; θ^b ) = (Φ^a)^T θ^a + (Φ^b)^T θ^b.

Here the parameters might have been reordered such that the important ones to diagnose come first. The difference between change detection and isolation can be stated as:

   Detection:  H_0: θ = θ_0        H_1: θ ≠ θ_0
   Isolation:  H_0^a: θ^a = θ_0^a   H_1^a: θ^a ≠ θ_0^a

Note that there might be several isolation tests. For instance, we may want to test H_1^b at the same time, or another partitioning of the parameter vector. There are two ways to treat the part θ^b of the parameter vector not included in the test:


θ^b = θ_0^b; the fault in θ^a does not influence the other elements in the parameter vector.

θ^b is a nuisance, and its value before and after a fault in θ^a is unknown and irrelevant.

The notation below is that Y denotes the vector of measurements and Y_0 = Φ^T θ_0 is the result of a simulation of the nominal model. The first alternative gives a GLR statistic V^1 (where θ^a = θ_0^a + Δθ^a is used), and the second alternative gives a statistic V^2.

We now make some comments on these results:

For detection, we use the test statistic V = (Y − Y_0)^T Q (Y − Y_0). This test is sensitive to all changes in θ.


For isolation, we compute either V^1 or V^2 (depending upon the philosophy used) for different sub-vectors of θ, θ^a and θ^b being two possibilities corresponding to different errors. The result of isolation is that the sub-vector with smallest loss V^i has changed.

The spaces {x : Q^a x = x} and {x : Q^b x = x} can be interpreted as subspaces of the {x : Q x = x} subspace of R^L.

There is no simple way to compute Q^b from Q.

Q − Q^b is no projection matrix.

It can be shown that Q ≤ Q^a + Q^b, with equality only when Φ Φ^T is block diagonal, which happens when Φ^a (Φ^b)^T = 0. This means that the second alternative gives a smaller test statistic.

The geometrical interpretation is as follows. V^1 is the part of the residual energy V = ||Y − Y_0||²_Q that belongs to the subspace generated by the projection Q^a. Similarly, V^2 is the residual energy V subtracted by the part of it that belongs to the subspace generated by the projection Q^b. The measures V^1 and V^2 are equal only if these subspaces (Q^a and Q^b) are orthogonal.

Example 6.9 Diagnosis of sensor faults

Suppose that we measure signals from two sensors, where each one can be subject to an offset (fault). After removing the known signal part, a simple model of the offset problem is

   y_t = ( θ^{(1)} ; θ^{(2)} ) + e_t.

That is, the measurements here take the role of residuals in the general case. Suppose also that the nominal model has no sensor offset, so θ_0 = 0 and Y_0 = 0. Consider a change detection and isolation algorithm using a sliding window of size L. First, for detection, the following distance measure should be used:

   V = Σ_{i=t−L+1}^{t} Σ_{j=1}^{2} (y_i^{(j)})².


Secondly, for fault isolation, we compute the following two measures for approach 1:

   Q^a = diag(1, 0, 1, 0, ..., 1, 0)  ⇒  V^a = Y^T Q^a Y = Σ_{i=t−L+1}^{t} (y_i^{(1)})²,
   Q^b = diag(0, 1, 0, 1, ..., 0, 1)  ⇒  V^b = Y^T Q^b Y = Σ_{i=t−L+1}^{t} (y_i^{(2)})².

Note that Y is a column vector, and the projection Q^a picks out every second entry in Y, that is, every sensor 1 sample, and similarly for Q^b. As a remark, approach 2 will coincide with approach 1 in this example, since Q = Q^a + Q^b.

To simplify isolation, we can make a table with possible faults and their influence on the distance measures:

   | Fault    | V^a | V^b |
   | none     | 0   | 0   |
   | sensor 1 | 1   | 0   |
   | sensor 2 | 0   | 1   |

Here 1 should be read 'large' and 0 means 'small' in some measure. Compare with Figure 11.1 and the approaches in Chapter 11. A simulation sketch is given below.

6.3.2. A general approach

General approaches to detection and isolation using likelihood-based methods can be derived from the formulas in Section 5.B. Consider a general expression for the likelihood of a hypothesis,

   p(Y | H_i) = p(Y | θ = θ_i, λ = λ_i).

Several hypotheses may be considered simultaneously, each one considering changes in different subsets of the parameters, which can be divided into subsets as (θ^a, θ^b, λ). Nuisance parameters can be treated as described in Section 6.3.1, by nullifying or estimating them. A third alternative is marginalization. The likelihoods are computed exactly as described in Section 5.B.4. This section outlines how this can be achieved for the particular case of isolating parametric and variance changes. See also the application in Section 6.5.3.

6.3.3. Diagnosis of parameter and variance changes

The following signal model can be used in order to determine whether an abrupt change in the parameters or noise variance has occurred at time t_0:


6.3 Likelihoodased detectionnd isolation 223

probabilities qi for each hypothesis can be incorporated . The exact a posterioriprobabilities

are derived below.Assuming that H i , = 0,1 ,2 , is Bernoulli distributed with probability qi ,

i.e.

does not hold, with probability 1 - i

holds, with probability qi,(6.16)

logp(Hi) is given by

logp(Hz) = log ( q t ( 1 q p + n 1 - 2 )

= 2 log(qJ + (no + 721 - 2) og(1 - qz) , = 0,1,2. (6.17)

Consider model (6.11), where e ∈ N(0, λ). For marginalization purposes, the prior distribution on λ can be taken as inverse Wishart. The inverse Wishart distribution has two parameters, m and α, and is denoted by W^{−1}(m, α). Its probability density function is given by

   p(λ) = ( (α/2)^{m/2} / Γ(m/2) ) λ^{−(m+2)/2} e^{−α/(2λ)}.   (6.18)

The expected mean value of λ is

   E(λ) = α / (m − 2),   (6.19)

and the variance is given by

   Var(λ) = 2α² / ( (m − 2)² (m − 4) ).   (6.20)

The mean value (6.19) and noise variance (6.20) are design parameters. From these, the Wishart parameters m and α can be computed.


Algorithm 6.1 Diagnosis of parameter and variance changes

Consider the signal model (6.11) and the hypotheses given in (6.12). Let the prior for λ be as in (6.18) and the prior for the parameter vector be θ ∈ N(0, P_0). With the loss function (6.14) and standard least squares estimation, the a posteriori probabilities are approximately given by

   l_0 ≈ ... + log det(P_t^{−1} + P_0^{−1}) + 2 log(q_0),   (6.21)
   l_1 ≈ ... + log det P_0 − log det P_1 + 2 log(q_1),   (6.22)
   l_2 ≈ (n_0 − 2 + m) log(...) + (n_1 − 2 + m) log(...) − 2 log det P_0 + 2 log(q_2).   (6.23)

Derivation: Using the same type of calculations as in Section 5.B.4, the a posteriori probabilities (6.24)-(6.26) can be derived. They are the sum of the negative log likelihood and the prior in (6.17).


By using

   D_i = log det P_0 − log det P_i + Σ_{t=1}^{n_0+n_1} log det R_t,   (6.27)

removing terms that are small for large t and L (t ≫ L), removing constants that are equal in Equations (6.24), (6.25) and (6.26), and using Stirling's formula, Γ(n + 1) ≈ √(2π) n^{n+1/2} e^{−n}, the approximate formulas are obtained.

6.4. Design optimization

It is well known that low order models are usually preferred to the full model in change detection. An example is speech signals, where a high order AR model is used for modeling, but a second order AR model is sufficient for segmentation; see Section 11.1.3 in Basseville and Nikiforov (1993), and Example 6.7. One advantage of using low order models is of course lower complexity. Another heuristic argument is that, since the variance of the model is proportional to the model order, a change can be detected faster with a constant significance level for a low order model. The price paid is that certain changes are not visible in the model anymore, and are thus not detectable. A good example is a FIR model for the impulse response of an IIR filter; changes in the impulse response beyond the FIR truncation level are not detectable.

Model order selection for change detection
The smaller the model order, the easier it is to detect changes. A reduced order model implies that there are certain subspaces of the parameter vector that are not detectable. By proper model reduction, these subspaces can be designed so that they would not be detectable in any model order, due to a poor signal-to-noise ratio for that change.

It is customary to fix the significance level, that is, the probability of false alarms, and try to maximize the power of the test or, equivalently, minimize


the average delay for detection. The power is of course a function of the true change, and using low order models there are always some changes that cannot be detected. However, for large enough model orders the variance contribution will inevitably outweigh the contribution from the system change. These ideas were first presented in Ninness and Goodwin (1991) and Ninness and Goodwin (February, 1992), and then refined in Gustafsson and Ninness (1995).

There is a close link between this work and the identification of transfer functions, where the trade-off between bias and variance is well known; see, for instance, Ljung (1985). In the cited work, an expression for the variance term is derived which is asymptotic in model order and number of data. Asymptotically, the variance term is proportional to the model order, and since the bias is decreasing in the model order, this shows that the model order cannot be increased indefinitely. That is, a finite model order gives the smallest overall error in the estimate, although the true system might be infinitely dimensional.

Example 6.70 Under-modeling of an FlR system

In Gustafsson and Ninness (1995), an asymptotic analysis is presented for the case of $\theta$ being the impulse response of a system, using an FIR model. The data are generated by the so-called 'Åström' system under the no-change hypothesis

$$y_t = \frac{q^{-1} + 0.5q^{-2}}{1 - 1.5q^{-1} + 0.7q^{-2}}\, u_t + e_t.$$

The change is a shift in the phase angle of the complex pole pair from 0.46 to 0.4. The corresponding impulse responses are plotted in Figure 6.4(a). Both the input and the noise are white Gaussian noise with variance 1 and 0.1, respectively. The number of data in the sliding window is $L = 50$.

Figure 6.4(b) shows a Monte Carlo simulation for 1000 noise realizations. A $\chi^2$ test was designed to give the desired confidence level. The upper plot shows the chosen and obtained confidence level. The lower plot shows the asymptotic power function (which can be pre-computed) and the result from the Monte Carlo simulation. Qualitatively, they are very similar, with local maxima and minima where expected and a large increase between model orders 4 and 8. The power from the Monte Carlo simulation is, however, much smaller, which depends on a crude approximation of a $\chi^2$ distribution that probably could be refined.
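The flavor of this experiment can be reproduced with a short simulation. The sketch below is our own (the book's $\chi^2$ test design is in the cited reference; here only the norm of the FIR parameter change is printed): it simulates the Åström system before and after the pole-angle change and fits FIR models of increasing order by least squares on windows of length $L = 50$. The helper names `simulate` and `fir_fit` are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(a1, a2, N):
    """Simulate y_t = (q^-1 + 0.5 q^-2)/(1 - a1 q^-1 + a2 q^-2) u_t + e_t."""
    u = rng.standard_normal(N)                    # white input, variance 1
    e = np.sqrt(0.1) * rng.standard_normal(N)     # output noise, variance 0.1
    x = np.zeros(N)
    for t in range(2, N):
        x[t] = a1 * x[t - 1] - a2 * x[t - 2] + u[t - 1] + 0.5 * u[t - 2]
    return u, x + e

def fir_fit(u, y, order):
    """Least squares FIR estimate from input-output data."""
    Phi = np.array([u[t - order:t][::-1] for t in range(order, len(y))])
    th, *_ = np.linalg.lstsq(Phi, y[order:], rcond=None)
    return th

L, r = 50, np.sqrt(0.7)                           # window length, pole radius
for order in [2, 4, 8, 16, 32]:
    u0, y0 = simulate(2 * r * np.cos(0.46), 0.7, L + order)  # nominal system
    u1, y1 = simulate(2 * r * np.cos(0.40), 0.7, L + order)  # changed system
    d = fir_fit(u0, y0, order) - fir_fit(u1, y1, order)
    print(f"order {order:2d}: ||theta_before - theta_after|| = "
          f"{np.linalg.norm(d):.3f}")
```

A single realization is noisy; averaging the statistic over many runs, as in the Monte Carlo study behind the figure, is what reveals the systematic dependence on model order.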



Figure 6.4. Impulse response before and after a change for the system under consideration (a). Significance level and power from asymptotic expression and simulation as a function of the number of parameters in the model (b).

6.5. Applications

6.5.1. Rat EEG

Algorithm 6.1 is applied here to the rat EEG described in Section 2.4.1, using an AR(2) model. The size of the sliding window is chosen as $L = n_1 = 100$, and both types of parameter and variance changes are considered. The result is

| Change times       | 1085 | 1586 | 1945 | 2363 | 2949 | 3632 | 3735 |
| Winning hypothesis |    2 |    2 |    2 |    2 |    2 |    2 |    1 |

That is, the first six changes are due to variance changes, and the last one is a change in dynamics, which is minor, as seen from the parameter plot in Figure 6.5.

6.5.2. Belching sheep

In developing medicine for asthma, sheep are used to study its effects, as described in Section 2.5.2. The dynamics between air volume and pressure is believed to depend on the medicine's effect. The dynamics can accurately be adaptively modeled by an ARX model. The main problem is the so-called outliers: segments with bad data, caused by belches. One should therefore detect the belch segments, and remove them before modeling. We have here taken a batch of data (shown in Figure 6.6), estimated an ARX(5,5,0) model on all data and applied Algorithm 6.1 as a variance change detector.



Figure 6.5. EEG for a rat and segmented noise variance from an AR(2) model.

Figure 6.6. Upper plot: pressure and air flow. Lower plot: filtered ARX model residuals, low-pass filtered, with segmented noise variance.


That is, only hypothesis $H_2$ is considered. The result is illustrated in the lower plot in Figure 6.6. The solid line is a low-pass filtered version of $\varepsilon_t^2$, and the dashed line the detected belch segments, where the variance is much higher.
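A minimal sketch of such a variance change detector (ours, on synthetic residuals; the filter constant and threshold are arbitrary choices, not the book's design) squares the residuals, low-pass filters them, and flags segments where the filtered level is far above nominal, mimicking the lower plot of Figure 6.6:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic "ARX residuals": unit variance, with two belch-like
# segments where the standard deviation is ten times larger.
eps = rng.standard_normal(8000)
eps[2000:2400] *= 10.0
eps[5500:6000] *= 10.0

# First order low-pass filter of the squared residuals.
alpha, acc = 0.99, 1.0
level = np.empty_like(eps)
for t, e in enumerate(eps):
    acc = alpha * acc + (1 - alpha) * e * e
    level[t] = acc

# Flag samples where the filtered variance is well above nominal.
belch = level > 5.0
boundaries = np.flatnonzero(np.diff(belch.astype(int)))
print("detected segment boundaries near:", boundaries)
```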

6.5.3. Application to digital communication

In this section, it is demonstrated how Algorithm 6.1 can be applied to detect the occurrence of double talk and abrupt changes in the echo path in a communication system.

In a telephone system, it is important to detect a change in the echo path quickly, but not confuse it with double talk, since the echo canceler should react differently to these two phenomena. Figure 6.7 illustrates the echo path of a telephone system. The task of the hybrid is to convert the 4-wire to the 2-wire connection, which is used by all end users. This device is not perfect, but a part of the far-end talker's signal is reflected and transmitted back again. The total echo path is commonly modeled by an FIR model. The adaptive FIR filter estimates the echo path and subtracts a prediction of the echo. Usually, the adaptation rate is low, but on two events it should be modified:

After an echo path change, caused by a completely different physical connection being used, the adaptation rate should be increased.

During double talk, which means that the near-end listener starts to talk simultaneously, the adaptation rate should be decreased (the residuals are large, implying fast adaptation) and must not be increased.

Figure 6.7. The communication system with an echo return path (from Carlemalm and Gustafsson (1998)).

That is, correct isolation is far more important than detection. It is better to do nothing than to change the adaptation rate in the wrong direction. This conclusion is generalized below.

Fundamental adaptive filtering problem
Disturbances and system changes must be isolated. An alarm caused by a system change requires that the adaptation rate should be increased, while an alarm caused by a disturbance (false alarm) implies that the adaptivity should be frozen.

The impulse response of a telephone channel is short, about 4 ms long (i.e. 32 samples with the standard 8 kHz sampling frequency), but since the delay can be large, the FIR filter is often of a length between 128 and 512.

Algorithm 6.1 can be applied directly to the current problem. Hypothesis $H_1$ corresponds to an echo path change and $H_2$ to double talk. The details of the design and a successful numerical evaluation are presented in Carlemalm and Gustafsson (1998).
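To make the isolation logic concrete, here is a small sketch (our illustration, not the design in Carlemalm and Gustafsson (1998)) of how such a detector's decision could gate the step size of an NLMS echo canceler: an H1 alarm (echo path change) raises the adaptation rate, an H2 alarm (double talk) freezes it. The step sizes, the toy echo path and the `detect` stub are placeholder assumptions:

```python
import numpy as np

def nlms_step(w, phi, y, mu):
    """One NLMS update of the FIR echo path estimate w."""
    e = y - w @ phi                              # residual after cancellation
    w = w + mu * e * phi / (phi @ phi + 1e-8)
    return w, e

def detect(residuals):
    """Placeholder for Algorithm 6.1: return 'H0', 'H1' or 'H2'."""
    return "H0"

MU_SLOW, MU_FAST = 0.01, 0.5
rng = np.random.default_rng(2)
n = 32                                  # echo path length (4 ms at 8 kHz)
w, mu = np.zeros(n), MU_SLOW
u = rng.standard_normal(10000)          # far-end signal (white here)
window = []
for t in range(n, len(u)):
    phi = u[t - n:t][::-1]
    y = 0.5 * u[t - 1] - 0.2 * u[t - 5]  # toy echo path
    w, e = nlms_step(w, phi, y, mu)
    window.append(e)
    if len(window) == 100:               # decide on a sliding window
        h = detect(window)
        # Key point from the text: double talk (H2) must freeze, never speed up.
        mu = MU_FAST if h == "H1" else (MU_SLOW if h == "H0" else 0.0)
        window = []
```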


Change detection based on filter banks

7.1. Basics
7.2. Problem setup
7.2.1. The changing regression model
7.2.2. Notation
7.3. Statistical criteria
7.3.1. The MML estimator
7.3.2. The a posteriori probabilities
7.3.3. On the choice of priors
7.4. Information based criteria
7.4.1. The MGL estimator
7.4.2. MGL with penalty term
7.4.3. Relation to MML
7.5. On-line local search for optimum
7.5.1. Local tree search
7.5.2. Design parameters
7.6. Off-line global search for optimum
7.7. Applications
7.7.1. Storing EKG signals
7.7.2. Speech segmentation
7.7.3. Segmentation of a car's driven path
7.A. Two inequalities for likelihoods
7.A.1. The first inequality
7.A.2. The second inequality
7.A.3. The exact pruning algorithm
7.B. The posterior probabilities of a jump sequence
7.B.1. Main theorems

7.1. Basics

Let us start by considering change detection in linear regressions as an off-line problem, which will be referred to as segmentation. The goal is to find a sequence of time indices $k^n = (k_1, k_2, \ldots, k_n)$, where both the number $n$ and the locations $k_i$ are unknown, such that a linear regression model with piecewise constant parameters,

$$y_t = \varphi_t^T \theta(i) + e_t, \quad\text{when } k_{i-1} < t \le k_i, \qquad (7.1)$$

is a good description of the observed signal $y_t$. In this chapter, the measurements may be vector valued, the nominal covariance matrix of the noise is $R_t$, and $\lambda(i)$ is a possibly unknown scaling, which is piecewise constant.

One way to guarantee that the best possible solution is found is to consider all possible segmentations $k^n$, estimate one linear regression model in each segment, and then choose the particular $k^n$ that minimizes an optimality criterion,

$$\hat k^n = \arg\min_{n \ge 1,\; 0 < k_1 < \cdots < k_n = N} V(k^n).$$

The procedure, and the statistics that turn out to be sufficient, as defined in (7.6)-(7.8), are summarized below:

What is needed from each data segment are the sufficient statistics $V$ (the sum of squared residuals), $D$ ($-\log\det$ of the covariance matrix) and the number of data $N$ in each segment, as defined in equations (7.6), (7.7) and (7.8). The segmentation $k^n$ has $n-1$ degrees of freedom.
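To make the combinatorial setup tangible, the brute-force sketch below (ours; it is infeasible beyond toy sizes, as discussed next) enumerates segmentations of a short changing-mean signal ($\varphi_t = 1$), computes $V(i)$ per segment, and scores each candidate with a simple BIC-style penalized fit standing in for the criteria derived later in the chapter:

```python
import itertools
import numpy as np

def segment_V(y, bounds):
    """Sum of squared residuals V(i) for each segment of a changing mean."""
    edges = [0, *bounds, len(y)]
    return [np.sum((y[a:b] - y[a:b].mean()) ** 2)
            for a, b in zip(edges[:-1], edges[1:])]

rng = np.random.default_rng(3)
y = np.concatenate([rng.normal(0, 1, 20), rng.normal(4, 1, 20)])
N = len(y)

best = None
for n_changes in range(0, 3):                        # cap n for tractability
    for bounds in itertools.combinations(range(2, N - 1), n_changes):
        V = sum(segment_V(y, bounds))
        # Penalized fit: data term plus a cost per extra segment.
        score = N * np.log(V / N) + 2 * (n_changes + 1) * np.log(N)
        if best is None or score < best[0]:
            best = (score, bounds)
print("best change times:", best[1])
```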

Two types of optimality criteria have been proposed:

Statistical criterion: the maximum likelihood or maximum a posteriori estimate of $k^n$ is studied.

Information based criterion: the information of data in each segment is $V(i)$ (the sum of squared residuals), and the total information is the sum of these. Since the total information is minimized for the degenerate solution $k^N = (1, 2, 3, \ldots, N)$, giving $V(i) = 0$, a penalty term is needed. Similar problems have been studied in the context of model structure selection, and from this literature Akaike's AIC and BIC criteria have been proposed for segmentation.

The real challenge in segmentation is to cope with the curse of dimensionality. The number of segmentations $k^n$ is $2^N$ (there can be either a change or no change at each time instant). Here, several strategies have been proposed:


Numerical searches based on dynamic programming or MCMC techniques.

Recursive local search schemes.

The main part of this chapter is devoted to the second approach, which provides a solution to the on-line problem of adaptive filtering.

7.2. Problem setup

7.2.1. The changing regression model

The segmentation model is based on a linear regression with piecewise constant parameters,

$$y_t = \varphi_t^T \theta(i) + e_t, \quad\text{when } k_{i-1} < t \le k_i. \qquad (7.2)$$

Here $\theta(i)$ is the $d$-dimensional parameter vector in segment $i$, $\varphi_t$ is the regressor and $k_i$ denotes the change times. The measurement vector is assumed to have dimension $p$. The noise $e_t$ in (7.2) is assumed to be Gaussian with variance $\lambda(i)R_t$, where $\lambda(i)$ is a possibly segment dependent scaling of the noise. We will assume $R_t$ to be known and treat the scaling as a possibly unknown parameter.

The problem is now to estimate the number of segments $n$ and the sequence of change times, denoted $k^n = (k_1, k_2, \ldots, k_n)$. Note that both the number $n$ and the positions of the change times $k_i$ are considered unknown.

Two important special cases of (7.2) are a changing mean model, where $\varphi_t = 1$, and an auto-regression, where $\varphi_t = (-y_{t-1}, \ldots, -y_{t-d})^T$.

For the analysis in Section 7.A, and for defining the prior on each segmentation, the following equivalent state space model turns out to be more convenient:

$$\theta_{t+1} = (1 - \delta_t)\theta_t + \delta_t u_t, \qquad y_t = \varphi_t^T \theta_t + e_t. \qquad (7.3)$$

Here $\delta_t$ is a binary variable, which equals one when the parameter vector changes and is zero otherwise, and $u_t$ is a sequence of unknown parameter vectors. Putting $\delta_t = 0$ into (7.3) gives a standard regression model with constant parameters, but when $\delta_t = 1$ the model is assigned a completely new parameter vector $u_t$, taken at random. Thus, models (7.3) and (7.2) are equivalent. For convenience, it is assumed that $k_0 = 0$ and $\delta_0 = 1$, so the first segment begins at time 1. The segmentation problem can be formulated as estimating the number of jumps $n$ and the jump instants $k^n$, or alternatively the jump parameter sequence $\delta^N = (\delta_1, \ldots, \delta_N)$.
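A quick way to internalize model (7.3) is to simulate it. The toy sketch below is ours (the regressor, noise level and jump probability are arbitrary choices): it draws Bernoulli jump indicators and replaces the parameter vector at each jump:

```python
import numpy as np

rng = np.random.default_rng(4)
N, q, d = 200, 0.02, 2                  # samples, jump probability, dim(theta)

theta = rng.standard_normal(d)           # initial parameter vector
y, jumps = np.empty(N), []
for t in range(N):
    if rng.random() < q:                 # delta_t = 1: draw a brand new theta
        theta = 3 * rng.standard_normal(d)
        jumps.append(t)
    phi = np.array([1.0, np.sin(0.1 * t)])   # a simple known regressor
    y[t] = phi @ theta + 0.1 * rng.standard_normal()
print("true jump times:", jumps)
```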


The models (7.3) and (7.2) will be referred to as changing regressions, because they change between different regression models. The most important feature of the changing regression model is that the jumps divide the measurements into a number of independent segments. This follows since the parameter vectors in the different segments are independent; they are two different samples of the stochastic process $\{u_t\}$.

A related model, studied in Andersson (1985), is a jumping regression model. The difference to the approach herein is that the changes are added to the parameter vector. In (7.3), this would mean that the parameter variation model is $\theta_{t+1} = \theta_t + \delta_t u_t$. We then lose the property of independent segments, and the optimal algorithms proposed here are only sub-optimal.

7.2.2. Notation

Given a segmentation $k^n$, it will be useful to introduce the compact notation $Y(i)$ for the measurements in the $i$th segment, that is, $Y(i) = (y_{k_{i-1}+1}, \ldots, y_{k_i}) = y_{k_{i-1}+1}^{k_i}$.

The least squares estimate and its covariance matrix for the $i$th segment are denoted

$$\hat\theta(i) = \left(\sum_{t=k_{i-1}+1}^{k_i} \varphi_t R_t^{-1}\varphi_t^T\right)^{-1}\sum_{t=k_{i-1}+1}^{k_i} \varphi_t R_t^{-1} y_t, \qquad (7.4)$$

$$P(i) = \left(\sum_{t=k_{i-1}+1}^{k_i} \varphi_t R_t^{-1}\varphi_t^T\right)^{-1}. \qquad (7.5)$$

Although these are off-line expressions, $\hat\theta(i)$ and $P(i)$ can of course be computed recursively using the Recursive Least Squares (RLS) scheme.

Finally, the following quantities will be shown to represent sufficient statistics in each segment:

$$V(i) = \sum_{t=k_{i-1}+1}^{k_i} \left(y_t - \varphi_t^T\hat\theta(i)\right)^T R_t^{-1}\left(y_t - \varphi_t^T\hat\theta(i)\right), \qquad (7.6)$$

$$D(i) = -\log\det P(i), \qquad (7.7)$$

$$N(i) = k_i - k_{i-1}. \qquad (7.8)$$

7.3. Statistical criteria

7.3.1. The MML estimator

Let $k^n, \theta^n, \lambda^n$ denote the sets of jump times, parameter vectors and noise scalings, respectively, needed in the signal model (7.2). The likelihood for data


$y^N$ given all parameters is denoted $p(y^N|k^n, \theta^n, \lambda^n)$. We will assume independent Gaussian noise distributions, so

$$p(e^N) = \prod_{i=1}^{n}\prod_{t=k_{i-1}+1}^{k_i} (2\pi\lambda(i))^{-p/2}(\det R_t)^{-1/2}\exp\!\left(-\frac{e_t^T R_t^{-1} e_t}{2\lambda(i)}\right).$$

Then we have

$$-2\log p(y^N|k^n, \theta^n, \lambda^n) = Np\log(2\pi) + \sum_{t=1}^{N}\log\det R_t + \sum_{i=1}^{n}\left(N(i)\,p\log\lambda(i) + \frac{1}{\lambda(i)}\sum_{t=k_{i-1}+1}^{k_i} e_t^T R_t^{-1} e_t\right). \qquad (7.9)$$

Here and in the sequel, p is the dimension of the measurement vector yt.

There are two ways of eliminating the nuisance parameters $\theta^n, \lambda^n$, leading to the marginalized and generalized likelihoods, respectively. The latter is the standard approach, where the nuisance parameters are removed by minimization of (7.9). A relation between these is given in Section 7.4. See Wald (1947) for a discussion on generalized and marginalized (or weighted) likelihoods.

We next investigate the use of the marginalized likelihood, where (7.9) is integrated with respect to a prior distribution of the nuisance parameters. The likelihood given only $k^n$ is then given by

$$p(y^N|k^n) = \int\!\!\int p(y^N|k^n, \theta^n, \lambda^n)\, p(\theta^n|\lambda^n)\, p(\lambda^n)\, d\theta^n\, d\lambda^n. \qquad (7.10)$$

In this expression, the prior for $\theta$, $p(\theta^n|\lambda^n)$, is technically a function of the noise variance scaling $\lambda$, but is usually chosen as an independent function. The maximum likelihood estimator is given by maximization of $p(y^N|k^n)$. Finally, the a posteriori probabilities can be computed from Bayes' law,

$$p(k^n|y^N) = \frac{p(y^N|k^n)\,p(k^n)}{p(y^N)}, \qquad (7.11)$$

where $p(y^N)$ is just a constant, and the Maximum A Posteriori Probability (MAP) estimate is given by maximization. In this way, a prior on the segmentation can be included. In the sequel, only the more general MAP estimator is considered.

The prior $p(k^n) = p(k^n|n)p(n)$ or, equivalently, $p(\delta^N)$ on the segmentation is a user's choice (in fact the only one). A natural and powerful possibility is to use $p(\delta^N)$ and assume a fixed probability $q$ of a jump at each new time instant. That is, consider the jump sequence $\delta^N$ as independent Bernoulli variables, $\delta_t \in \mathrm{Be}(q)$, which means

$$\delta_t = \begin{cases} 0 & \text{with probability } 1-q \\ 1 & \text{with probability } q. \end{cases}$$


It might be useful in some applications to tune the jump probability $q$ above, because it controls the number of jumps estimated. Since there is a one-to-one correspondence between $k^n$ and $\delta^N$, both priors are given by

$$p(k^n) = p(\delta^N) = q^n (1-q)^{N-n}. \qquad (7.12)$$

A $q$ less than 0.5 penalizes a large number of segments. A non-informative prior $p(k^n) = 0.5^N$ is obtained with $q = 0.5$. In this case, the MAP estimator equals the Maximum Likelihood (ML) estimator, which follows from (7.11).

7.3.2. The a posteriori probabilities

In Appendix 7.B, the a posteriori probabilities are derived in three theorems for the three different cases of treating the measurement covariance: completely known, known except for a constant scaling, and finally known with an unknown changing scaling. The case of a completely unknown covariance matrix is not solved in the literature. These are generalizations and extensions of results for changing mean models ($\varphi_t = 1$) presented in Chapter 3; see also Smith (1975) and Lee and Heghinian (1978). Appendix 7.B also contains a discussion and motivation of the particular prior distributions used in marginalization. The different steps in the MAP estimator can be summarized as follows; see also (7.16).

Filter bank segmentation

Examine every possible segmentation, parameterized in the number of jumps $n$ and jump times $k^n$, separately.

For each segmentation, compute the best model in each segment, parameterized in the least squares estimates $\hat\theta(i)$ and their covariance matrices $P(i)$.

Compute the sum of squared prediction errors $V(i)$ and $D(i) = -\log\det P(i)$ in each segment.

The MAP estimate of the model structure, for the three different assumptions on the noise scaling (known $\lambda(i)$, unknown but constant $\lambda(i) = \lambda$, and finally unknown and changing $\lambda(i)$), is given in equations (7.13), (7.14) and (7.15), respectively.


$$\hat k^n = \arg\min_{k^n,\,n} \sum_{i=1}^{n}\big(D(i) + V(i)\big) + 2n\log\frac{1-q}{q}, \qquad (7.13)$$

$$\hat k^n = \arg\min_{k^n,\,n} \sum_{i=1}^{n} D(i) + (Np - nd - 2)\log\frac{\sum_{i=1}^{n} V(i)}{Np - nd - 4} + 2n\log\frac{1-q}{q}, \qquad (7.14)$$

$$\hat k^n = \arg\min_{k^n,\,n} \sum_{i=1}^{n}\left(D(i) + (N(i)p - d - 2)\log\frac{V(i)}{N(i)p - d - 4}\right) + 2n\log\frac{1-q}{q}. \qquad (7.15)$$

The last two a posteriori probabilities are only approximate, since Stirling's formula has been used to eliminate gamma functions; the exact expressions are found in Appendix 7.B. Equation (7.16) defines the statistics involved.

| Data | $y_1,\ldots,y_{k_1}$ | $y_{k_1+1},\ldots,y_{k_2}$ | $\ldots$ | $y_{k_{n-1}+1},\ldots,y_{k_n}$ |
| Segmentation | Segment 1 | Segment 2 | $\ldots$ | Segment $n$ |
| LS estimates | $\hat\theta(1), P(1)$ | $\hat\theta(2), P(2)$ | $\ldots$ | $\hat\theta(n), P(n)$ |
| Statistics | $V(1), D(1)$ | $V(2), D(2)$ | $\ldots$ | $V(n), D(n)$ |

(7.16)

The required steps in computing the MAP estimated segmentation are as follows. First, every possible segmentation of the data is examined separately. For each segmentation, one model for every segment is estimated and the test statistics are computed. Finally, one of equations (7.13)-(7.15) is evaluated.

In all cases, constants in the a posteriori probabilities are omitted. The difference between the three approaches is thus basically only how to treat the sum of squared prediction errors. A prior probability $q$ causes a penalty term increasing linearly in $n$ for $q < 0.5$. As noted before, $q = 0.5$ corresponds to ML estimation.

The derivations of (7.13) to (7.15) are valid only if all terms are well-defined. The condition is that $P(i)$ has full rank for all $i$, and that the denominator under $V(i)$ is positive. That is, $Np - nd - 4 > 0$ in (7.14) and $N(i)p - d - 4 > 0$ in (7.15). The segments must therefore be forced to be long enough.
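To see how these formulas are used in practice, here is a small scoring sketch (ours) in the spirit of (7.15) for the scalar changing-mean case ($p = 1$, $\varphi_t = 1$, $R_t = 1$, so $P(i) = 1/N(i)$ and $D(i) = \log N(i)$); segments violating the length condition are simply given infinite cost:

```python
import numpy as np

def map_score(y, bounds, q=0.5, d=1, p=1):
    """Score a segmentation with change times `bounds`, in the spirit of
    (7.15): sum over segments of D(i) + (N(i)p-d-2) log(V(i)/(N(i)p-d-4)),
    plus the prior penalty 2 n log((1-q)/q)."""
    score = 2 * len(bounds) * np.log((1 - q) / q)
    edges = [0, *bounds, len(y)]
    for a, b in zip(edges[:-1], edges[1:]):
        seg = y[a:b]
        Ni = len(seg)
        if Ni * p - d - 4 <= 0:
            return np.inf                  # segment too short: undefined
        V = np.sum((seg - seg.mean()) ** 2)
        D = np.log(Ni)                     # -log det P(i) with P(i) = 1/N(i)
        score += D + (Ni * p - d - 2) * np.log(V / (Ni * p - d - 4))
    return score

rng = np.random.default_rng(5)
y = np.concatenate([rng.normal(0, 1, 60), rng.normal(3, 2, 60)])
for bounds in [(), (40,), (60,), (60, 90)]:
    print(bounds, f"{map_score(y, bounds, q=0.1):.1f}")
```

On this toy signal, the candidate with a single change at the true time should receive the lowest score; the per-segment noise scaling is what lets the second segment's larger variance be absorbed rather than flagged repeatedly.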

7.3.3. On the choice of priors

The Gaussian assumption on the noise is a standard one, partly because it gives analytical expressions and partly because it has proven to work well in practice. Other alternatives are rarely seen. The Laplacian distribution is shown in Wu and Fitzgerald (1995) to also give an analytical solution in the case of unknown mean models. It was there found to be less sensitive to large measurement errors.

The standard approach used here for marginalization is to consider both a Gaussian and a non-informative prior in parallel. We often give priority to a non-informative prior on $\theta$, using a flat density function, in our aim to have as few non-intuitive design parameters as possible. That is, $p(\theta^n|\lambda^n) = C$ is an arbitrary constant in (7.10). The use of non-informative priors, and especially improper ones, is sometimes criticized; see Aitken (1991) for an interesting discussion. Specifically, here the flat prior introduces an arbitrary term $n\log C$ in the log likelihood. The idea of using a flat prior, or non-informative prior, in marginalization is perhaps best explained by an example.

Example 7.1 Marginalized likelihood for variance estimation

Suppose we have $t$ observations from a Gaussian distribution, $y_t \in N(\mu, \lambda)$; thus the likelihood $p(y^t|\mu, \lambda)$ is Gaussian. We want to compute the likelihood conditioned on $\lambda$ only, using marginalization: $p(y^t|\lambda) = \int p(y^t|\mu, \lambda)\,p(\mu)\,d\mu$.

Two alternative priors are a Gaussian, $\mu \in N(\mu_0, P_0)$, and a flat prior, $p(\mu) = C$. In both cases, we end up with an inverse Wishart density function (3.54). With the flat prior, the maximum is attained at

$$\hat\lambda = \frac{1}{t-1}\sum_{k=1}^{t}(y_k - \bar y)^2,$$

where $\bar y$ is the sample average. Note the scaling factor $1/(t-1)$, which makes the estimate unbiased. The joint likelihood estimate of both mean and variance gives a variance estimator with scaling factor $1/t$. The prior thus induces a bias in the estimate.

Thus, a flat prior eliminates the bias induced by the prior. We remark that the likelihood, interpreted as a conditional density function, is proper, and it does not depend upon the constant $C$.
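A quick numerical illustration of this bias effect (our sketch): the joint ML estimate scales the squared deviations by $1/t$ and is biased low, while the marginalized estimate with a flat prior on the mean scales by $1/(t-1)$ and is unbiased on average:

```python
import numpy as np

rng = np.random.default_rng(6)
t, runs, lam = 10, 100000, 4.0

ml, marg = [], []
for _ in range(runs):
    y = rng.normal(0.0, np.sqrt(lam), t)
    s = np.sum((y - y.mean()) ** 2)
    ml.append(s / t)            # joint ML estimate of the variance
    marg.append(s / (t - 1))    # marginalized (flat prior on the mean)
print(f"true {lam}, joint ML {np.mean(ml):.3f}, marginalized {np.mean(marg):.3f}")
```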

The use of a flat prior can be motivated as follows:

The data dependent terms in the log likelihood increase like $\log N$. That is, whatever the choice of $C$, the prior dependent term will be insignificant for a large amount of data.

The choice $C \approx 1$ can be shown to give approximately the same likelihood as a proper informative Gaussian prior would give if the true parameters were known and used in the prior; see Gustafsson (1996), where an example is given.

More precisely, with the prior $N(\theta_0, P_0)$, where $\theta_0$ is the true value of $\theta(i)$, the constant should be chosen as $C = \det P_0$. The uncertainty about $\theta_0$ reflected in $P_0$ should be much larger than the data information in $P(i)$ if one wants the data to speak for themselves. Still, the choice of $P_0$ is ambiguous: the larger its value, the higher the penalty on a large number of segments. This is exactly Lindley's paradox (Lindley, 1957):

Lindley's paradox
The more non-informative the prior, the more the null hypothesis is favored.

Thus, the prior should be chosen to be as informative as possible without interfering with the data. For auto-regressions and other regressions where the parameters are scaled to be around or less than 1, the choice $P_0 = I$ is appropriate. Since the true value $\theta_0$ is not known, this discussion seems to validate the use of a flat prior with the choice $C = 1$, which has also been confirmed to work well by simulations. An unknown noise variance is assigned a flat prior as well, with the same pragmatic motivation.

Example 7.2 Lindley's paradox

Consider the hypothesis test

$$H_0: y \in N(0, 1), \qquad H_1: y \in N(\theta, 1),$$

and assume that the prior on $\theta$ is $N(\theta_0, P_0)$. Equation (5.98) gives the marginalized likelihood for scalar measurements. Here we have $N = 1$, $P = (P_0^{-1} + 1)^{-1} \to 1$ and $\hat\theta_1 = Py \to y$ as $P_0 \to \infty$. The likelihood ratio, which contains the factor $\sqrt{P/P_0}$, then tends to zero as $P_0 \to \infty$,


since the whole expression behaves like $1/\sqrt{P_0}$. This fact is not influenced by the number of data, by what the true mean is, or by what $\theta_0$ is. That is, the more non-informative the prior, the more $H_0$ is favored.
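The paradox is easy to reproduce numerically (our sketch, taking $\theta_0 = 0$): under $H_1$ the marginal density of $y$ is $N(0, 1 + P_0)$, so the Bayes factor against $H_0$ eventually shrinks like $1/\sqrt{P_0}$, no matter how 'significant' the observation looks:

```python
import math

y = 3.0                                 # a seemingly significant observation
for P0 in [1e2, 1e4, 1e6, 1e8]:
    # Marginal density of y: N(0, 1 + P0) under H1, N(0, 1) under H0.
    p1 = math.exp(-y**2 / (2 * (1 + P0))) / math.sqrt(2 * math.pi * (1 + P0))
    p0 = math.exp(-y**2 / 2) / math.sqrt(2 * math.pi)
    print(f"P0 = {P0:9.0f}   Bayes factor p1/p0 = {p1 / p0:.4f}")
```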

7.4. Information based criteria

The information based approach of this section can be called a penalized Maximum Generalized Likelihood (MGL) approach.

7.4.1. The MGL estimator

It is straightforward to show that the minimum of (7.9) with respect to $\theta^n$, assuming a known $\lambda(i)$, is

$$\mathrm{MGL}(k^n) = \min_{\theta^n} -2\log p(y^N|k^n, \theta^n, \lambda^n) = Np\log(2\pi) + \sum_{t=1}^{N}\log\det R_t + \sum_{i=1}^{n}\left(\frac{V(i)}{\lambda(i)} + N(i)\,p\log\lambda(i)\right). \qquad (7.17)$$

Minimizing the right-hand side of (7.17) with respect to a constant unknown noise scaling $\lambda(i) = \lambda$ gives

$$\mathrm{MGL}(k^n) = Np\log(2\pi) + \sum_{t=1}^{N}\log\det R_t + Np\log\frac{\sum_{i=1}^{n} V(i)}{Np} + Np, \qquad (7.18)$$

and finally, for a changing noise scaling,

$$\mathrm{MGL}(k^n) = \min_{\theta^n, \lambda^n} -2\log p(y^N|k^n, \theta^n, \lambda^n) = Np\log(2\pi) + \sum_{t=1}^{N}\log\det R_t + \sum_{i=1}^{n} N(i)\,p\log\frac{V(i)}{N(i)\,p} + Np. \qquad (7.19)$$


In summary, the counterparts to the MML estimates (7.13)-(7.15) are given by the MGL estimates

$$\hat k^n = \arg\min_{k^n,\,n} \sum_{i=1}^{n} \frac{V(i)}{\lambda(i)}, \qquad (7.20)$$

$$\hat k^n = \arg\min_{k^n,\,n} Np\log\frac{\sum_{i=1}^{n} V(i)}{Np}, \qquad (7.21)$$

$$\hat k^n = \arg\min_{k^n,\,n} \sum_{i=1}^{n} N(i)\,p\log\frac{V(i)}{N(i)\,p}. \qquad (7.22)$$

It is easily realized that these generalized likelihoods cannot directly be used for estimating $k^n$, where $n$ is unknown, because:

for any given segmentation, inclusion of one more change time will strictly increase the generalized likelihood ($\sum V(i)$ decreases), and

the generalized likelihoods (7.18) and (7.19) can be made arbitrarily large, since $\sum V(i) \to 0$ and $V(i) \to 0$, respectively, if there are enough segments. Note that $n = N$ and $k_i = i$ is one permissible solution.

That is, the parsimonious principle is not fulfilled: there is no trade-off between model complexity and data fit.

7.4.2. MGL with penalty term

An attempt to satisfy the parsimonious principle is to add a penalty term to the generalized likelihoods (7.17)-(7.19). A general form of suggested penalty terms is $n(d+1)\gamma(N)$, which is proportional to the number of parameters used to describe the signal (here the change time itself is counted as one parameter). Penalty terms occurring in model order selection problems can be used in this application as well, like Akaike's AIC (Akaike, 1969) or the equivalent criteria: Akaike's BIC (Akaike, 1977), Rissanen's Minimum Description Length (MDL) approach (Rissanen, 1989) and the Schwartz criterion (Schwartz, 1978). The penalty term in AIC is $2n(d+1)$ and in BIC $n(d+1)\log N$.

AIC is proposed in Kitagawa and Akaike (1978) for auto-regressive models with a changing noise variance (one more parameter per segment), leading to

$$\hat k^n = \arg\min_{k^n,\,n} \sum_{i=1}^{n} N(i)\log\frac{V(i)}{N(i)} + 2n(d+2). \qquad (7.23)$$


and BIC is suggested in Yao (1988) for a changing mean model ($\varphi_t = 1$) and unknown constant noise variance:

$$\hat k^n = \arg\min_{k^n,\,n} Np\log\frac{\sum_{i=1}^{n} V(i)}{Np} + n(d+1)\log N. \qquad (7.24)$$

Both (7.23) and (7.24) attain their global minimum for the degenerate solution $n = N$ and $k_i = i$. This is solved in Yao (1988) by assuming that an upper bound on $n$ is known, but it is not commented upon in Kitagawa and Akaike (1978).

The MDL theory provides a nice interpretation of the segmentation problem: choose the segments such that the fewest possible data bits are used to describe the signal up to a certain accuracy, given that both the parameter vectors and the prediction errors are stored with finite accuracy.

Both AIC and BIC are based on an assumption of a large number of data, and their use in segmentation, where each segment could be quite short, is questioned in Kitagawa and Akaike (1978). Simulations in Djuric (1994) indicate that AIC and BIC tend to over-segment data in a simple example where marginalized ML works fine.

7.4.3. Relation to MML

A comparison of the generalized likelihoods (7.17)-(7.19) with t he marginal-

ized likelihoods (7.13)-(7.15) (assuming q 1/2), shows that he penaltyterm introduced by marginalization is ' D ( i ) in all cases. It is thereforeinteresting to study thi s term in more detail.

Lemma 5.5 shows that

$$\frac{\log\det P_N}{\log N} \to -d, \quad N \to \infty.$$

This implies that $\sum_{i=1}^{n} D(i) = -\sum_{i=1}^{n}\log\det P(i) \approx \sum_{i=1}^{n} d\log N(i)$. The penalty term in MML is thus asymptotically of the same form as BIC. If the segments are roughly of the same length, then $\sum_{i=1}^{n} D(i) \approx nd\log(N/n)$. Note, however, that the behavior for short segments might improve using MML.

The BIC criterion in the context of model order selection is known to be a consistent estimate of the model order. It is shown in Yao (1988) that BIC is a weakly consistent estimate of the number of change times in the segmentation of changing mean models. The asymptotic link with BIC supports the use of marginalized likelihoods.

7.5. On-line local search for optimum

Computing the exact likelihood or information based estimate is computationally intractable because of the exponential complexity. This section reviews local search techniques, while the next section comments on numerical methods.

Figure 7.1. The tree of jump sequences. A path marked 0 corresponds to no jump, while 1 in the $\delta$-parameterization of the jump sequence corresponds to a jump.

7.5.1. Local tree search

In Section 7.A, an exact pruning possibility, quadratic in time complexity, is described. Here a natural recursive (linear in time) approximate algorithm will be given. The complexity of the problem can be compared to the growing tree in Figure 7.1. The algorithm will use terminology from this analogy, like cutting, pruning and merging branches. Generally, the global maximum can be found only by searching through the whole tree. However, the following arguments indicate heuristically how the complexity can be decreased dramatically.

At time $t$, every branch splits into two branches, where one corresponds to a jump. Past data contain no information about what happens after a jump. Therefore, only one sequence among all those with a jump at a given time instant has to be considered, i.e. the most likely one. This is the point of the first step, after which only one new branch in the tree is started at each time instant. That is, there are only $N$ branches left. This exploitation of a finite memory property has much in common with the famous Viterbi algorithm in equalization; see Algorithm 5.5 or the articles Viterbi (1967) and Forney (1973).


It seems to be a waste of computational power to keep updating probabilities for sequences which have been unlikely for a long time. However, one still cannot be sure that one of them will not start to grow and become the MAP estimate. The solution offered in Section 7.A is to compute a common upper bound on the a posteriori probabilities. If this bound does not exceed the MAP estimate's probability, which is normally the case, one can be sure that the true MAP estimate is found. The approximation in the following algorithm is to simply reject these sequences.

The following algorithm is a straightforward extension of Algorithm 4.1.

Algorithm 7.1 Recursive parameter segmentation

Choose an optimality criterion. The options are the a posteriori probabilities as in Theorem 7.3, 7.4 or 7.5, or the information criteria AIC (7.23) or BIC (7.24).

Compute recursively the optimality criterion using a bank of least squares estimators, each one matched to a particular segmentation.

Use the following rules for maintaining the hypotheses and keeping the number of considered sequences ($M$) fixed:

(a) Let only the most probable sequence split.

(b) Cut off the least probable sequence, so only $M$ are left.

(c) Assume a minimum segment length: let the most probable sequence split only if it is not too young. A suitable default value is 0.

(d) Ensure that sequences are not cut off immediately after they are born: cut off the least probable sequences among those that are older than a certain minimum lifelength, until only $M$ are left. This should mostly be chosen as large as possible.

The last two restrictions are important for performance. A tuning rule in simulations is to simulate the signal without noise when tuning the local search parameters.
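The skeleton below (our simplification of Algorithm 7.1, not the book's code) implements the filter bank with rules (a)-(c) for a scalar changing mean with known unit noise variance, scored as in (7.13) with $q = 0.01$; rule (d), the guaranteed lifelength, is omitted for brevity. Each hypothesis only needs the running sums of its current segment:

```python
import numpy as np

LOG_PRIOR = 2 * np.log((1 - 0.01) / 0.01)   # jump penalty, q = 0.01
M = 5                                        # number of parallel hypotheses

def seg_cost(s, s2, n):
    """D(i) + V(i) for a scalar mean segment: V = s2 - s^2/n and
    D = -log det P(i) = log n, since P(i) = 1/n when R_t = 1."""
    return np.log(n) + s2 - s * s / n

# Hypothesis: (closed cost, sum y, sum y^2, segment length, change times).
hyps = [(0.0, 0.0, 0.0, 0, [])]
rng = np.random.default_rng(7)
y = np.concatenate([rng.normal(0, 1, 100), rng.normal(5, 1, 100)])

for t, yt in enumerate(y):
    new = [(c, s + yt, s2 + yt * yt, n + 1, ks) for c, s, s2, n, ks in hyps]
    # Rule (a): let only the most probable hypothesis split (jump at t),
    # rule (c): only if its current segment is mature enough.
    total = lambda h: h[0] + seg_cost(h[1], h[2], h[3])
    c, s, s2, n, ks = min(new, key=total)
    if n > 5:
        closed = c + seg_cost(s - yt, s2 - yt * yt, n - 1) + LOG_PRIOR
        new.append((closed, yt, yt * yt, 1, ks + [t]))
    # Rule (b): keep only the M most probable hypotheses.
    new.sort(key=total)
    hyps = new[:M]

print("estimated change times:", hyps[0][4])
```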

The output of the algorithm at time $t$ is the parameter estimate of the most probable sequence, or possibly a weighted sum of all estimates. However, it should be pointed out that the fixed interval smoothing estimate is readily available by back-tracking the history of the most probable sequence, which can be realized from (7.16). Algorithm 7.1 is similar to the one proposed in Andersson (1985). However, that algorithm is ad hoc, and works only for the case of known noise.

Section 4.3 contains some illustrative examples, while Section 7.7 uses the algorithm in a number of applications.


7.5.2. Design parameters

It has already been noted that the segments have to be longer than a minimum segment length, otherwise the derivations in Appendix 7.B of (7.14) and (7.15) are not valid. Consider Theorem 7.5. Since there is a term $\Gamma((N(i)p - d - 2)/2)$ and the gamma function $\Gamma(z)$ has poles for $z = 0, -1, -2, \ldots$, the segment lengths must be larger than $(2 + d)/p$. This is intuitively logical, since $d$ data points are required to estimate $\theta$ and two more to estimate $\lambda$.

That it could be wise to use a minimum lifelength of the sequences can be motivated as follows. Suppose the model structure of the regression is a third order model. Then at least three measurements are needed to estimate the parameters, and more are needed to judge the fit of the model to data. That is, only after at least four samples can something intelligent be said about the data fit. Thus, the choice of a minimum lifelength is related to the identifiability of the model, and should be chosen larger than $\dim(\theta) + 2$.

It is interesting to point out the possibility of forcing the algorithm to give the exact MAP estimate by setting the minimum lifelength and the number of sequences to $N$. In this way, only the first rule is actually performed (which is the first step in Algorithm 7.3). The MAP estimate is, in this way, found in quadratic time.

Finally, the jump probability q is used to tune the number of segments.

7.6. Off-line global search for optimum

Numerical approximations that have been suggested include dynamic programming (Djuric, 1992), batch-wise processing where only a small number of jump times is considered (Kitagawa and Akaike, 1978), and MCMC methods, but it is fairly easy to construct examples where these approaches have shortcomings, as demonstrated in Section 4.4.

Algorithm 4.2 for signal estimation is straightforward to generalize to the parameter estimation problem. This more general form is given in Fitzgerald et al. (1994), and is a combination of Gibbs sampling and the Metropolis algorithm.


Algorithm 7.2 MCMC segmentation

Decide the number of changes $n$ and choose which likelihood to use. The options are the a posteriori probabilities in Theorems 7.3, 7.4 or 7.5 with $q = 0.5$:

1. Iterate Monte Carlo run $i$.

2. Iterate the Gibbs sampler over components $j$ in $k^n$, where a random number is drawn from $p(k_j|k_1, \ldots, k_{j-1}, k_{j+1}, \ldots, k_n)$. Denote the new candidate sequence $\bar k^n$. The distribution may be taken as flat, or Gaussian centered around the previous estimate.

3. The candidate $\bar k^n$ is accepted if the likelihood increases, $p(\bar k^n|y^N) > p(k^n|y^N)$. Otherwise, the candidate is accepted (the Metropolis step) only if a random number from a uniform distribution is less than the likelihood ratio $p(\bar k^n|y^N)/p(k^n|y^N)$.

After the burn-in (convergence) time, the distribution of change times can be computed by Monte Carlo techniques.

The last step of random rejection sampling defines the Metropolis algorithm. Here, the candidate will be rejected with large probability if its value is unlikely.

We refer to Section 4.4 for illustrative examples, and to Section 7.7.3 for an application.
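A minimal rendering of the scheme (our sketch: a single change time, changing-mean signal, a flat proposal over admissible positions, and a simple profile likelihood standing in for the exact theorems):

```python
import numpy as np

rng = np.random.default_rng(8)
y = np.concatenate([rng.normal(0, 1, 80), rng.normal(2, 1, 120)])
N = len(y)

def loglik(k):
    """Profile log likelihood of a single change at k (Gaussian segments)."""
    return -0.5 * sum(len(seg) * np.log(np.var(seg) + 1e-12)
                      for seg in (y[:k], y[k:]))

k, samples = N // 2, []
for it in range(2000):
    k_new = int(rng.integers(5, N - 5))  # flat proposal for the change time
    # Metropolis step: always accept improvements, otherwise accept
    # with probability equal to the likelihood ratio.
    if loglik(k_new) - loglik(k) > np.log(rng.random()):
        k = k_new
    samples.append(k)

vals, counts = np.unique(samples[500:], return_counts=True)
print("posterior mode of change time:", vals[np.argmax(counts)])
```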

7.7. Applications

The first application uses segmentation as a means for signal compression, modeling an EKG signal as a piecewise constant polynomial. In the second application, the proposed method is compared to existing segmentation methods. In an attempt to be as fair as possible, we first choose a test signal that has been examined before in the literature; in this way, it is clear that the algorithms under comparison are tuned as well as possible. The last application concerns real time estimation for navigation in a car.


Figure 7.2. An EKG signal and a piecewise constant linear model (a) and quadratic model (b), respectively.

7.7.1. Storing EKG signals

The EKG compression problem defined in Section 2.6.2 is here approached by segmentation. Algorithm 7.1 is used with 10 parallel filters and fixed noise variance $\sigma^2 = 0.01$. The assumption of fixed variance gives us a tool to control the accuracy of the compression, and to trade it off against the compression rate. Figure 7.2 shows the EKG signal and a possible segmentation. For evaluation, the following statistics are interesting:

| Model type           | Linear | Quadratic |
| Regression error     | 0.032  |           |
| Number of parameters |        |           |
| Compression rate (%) | 10     |           |

With this algorithm, the linear model gives far less error at almost the same compression rate. The numerical resolution is the reason for the poor performance of the quadratic model, which includes the linear one as a special case. If a lower value of $\sigma^2$ is supplied, then the performance will degrade substantially. The remedy seems to be another basis for the quadratic polynomial.


Table 7.1. Estimated change times for different methods.

| Method | n | Estimated change times |
| Noisy signal: | | |
| Divergence | 6 | 451 1450 1900 2125 2830 3626 |
| Brandt's GLR | 6 | 451 1450 1900 2125 2830 3626 |
| Brandt's GLR | 5 | 593 1450 2125 2830 3626 |
| Approx ML | 7 | 451 593 1608 2116 2741 2822 3626 |
| Pre-filtered signal: | | |
| Divergence | 7 | 445 455 1550 1800 2151 2797 3626 |
| Brandt's GLR | 7 | 445 455 1550 1800 2151 2797 3626 |
| Brandt's GLR | 8 | 445 455 1550 1750 2151 2797 3400 3626 |
| Approx ML | 6 | 445 626 1609 2151 2797 3627 |

7.7.2. Speech segmentation

The speech signal¹ under consideration was recorded inside a car by the French National Agency for Telecommunications, as described in Andre-Obrecht (1988). This example is a continuation of Example 6.7, and the performance of the filter bank will be compared to the consistency tests examined in Chapter 6.

To get a direct comparison with the segmentation result in Basseville and Nikiforov (1993), a second order AR model is used. The approximate ML estimate is derived ($q = 1/2$), using 10 parallel filters where each new segment has a guaranteed lifelength of seven. The following should be stressed:

The resemblance with the results of Brandt's GLR test and the divergence test presented in Basseville and Nikiforov (1993) is striking.

No tuning parameters are involved (although $q \ne 1/2$ can be used to influence the number of segments, if the result is not satisfactory). This should be compared with the tricky choice of threshold, window size and drift parameter in the divergence test and Brandt's GLR test, which, furthermore, should be different for voiced and unvoiced zones. Presumably, a considerable tuning effort was required in Andre-Obrecht (1988) to obtain a result similar to that which the proposed method gave in a first try, using default parameters tuned on simple test signals.

The drawback compared to the two previously mentioned methods is a somewhat higher computational complexity. Using the same implementation of the required RLS filters, the number of floating point operations for AR(2) models was $1.6 \cdot 10^6$ for Brandt's GLR test and $5.8 \cdot 10^6$ for the approximate ML method.

¹The author would like to thank Michele Basseville and Regine Andre-Obrecht for sharing the speech signals in this application.


The design parameters of the search scheme are not very critical. There is a certain lower bound where the performance drastically deteriorates, but there is no trade-off, as is common for design parameters.

With the chosen search strategy, the algorithm is recursive and the estimated change points are delivered with a time delay of 10 samples. This is much faster than the other methods, due to their sliding window of width 160. For instance, the change at time 2741 for the noisy signal, where the noise variance increases by a factor of 3 (see below), is only 80 samples away from a more significant change, and cannot be distinguished with the chosen sliding window.

Much of the power of the algorithm is due to the model with changing noise variance. A speech signal has very large variations in the driving noise. For these two signals, the sequences of noise variances are estimated to

$$10^5 \times (0.035,\ 0.13,\ 1.6,\ 0.37,\ 1.3,\ 0.058,\ 1.7)$$

and

$$10^5 \times (0.038,\ 0.11,\ 1.6,\ 0.38,\ 1.6,\ 0.54,\ 0.055,\ 1.8),$$

respectively. Note that the noise variance differs by as much as a factor of 50. No algorithm based on a fixed noise variance can handle that.

Therefore, the proposed algorithm seems to be an efficient tool for getting a quick and reliable result. The lack of design parameters makes it very suitable for general purpose software implementations.

7.7.3. Segmentation of a car's driven path

We will here study the case described in Section 2.6.1.

Signal model

The model is that the heading angle is piecewise constant or piecewise linear, corresponding to straight paths and bends or roundabouts. The changing regression model is here

$$\theta_{t+1} = (1 - \delta_t)\theta_t + \delta_t u_t, \qquad y_t = \varphi_t^T\theta_t + e_t, \qquad e_t \in N(0, \lambda).$$

The approximate MAP estimate of the change times can now be computed in real time (the sampling interval is 100 times larger than needed for the computations). The number of parallel filters is 10, the minimum allowed segment length is 0, and each new jump hypothesis is guaranteed to survive at least six samples. The prior probability of a jump is 0.05 at each time.


Local search

Segmentation with a fixed accuracy of the model, using a fixed noise variance $\lambda = 0.05$, gives the result in Figure 7.3. Figure 7.3 also shows the segmentation where the noise variance is unknown and changing over the segments. In both cases, the roundabout is perfectly modeled by one segment and the bends are detected. The seemingly bad performance after the first turn is actually a proof of the power of this approach. Little data are available, and there is no good model for them, so why waste segments? It is more logical to tolerate larger errors here and just use one model. This proves that the adaptive noise variance works for this application as well. In any case, the model with fixed noise scaling seems to be the most appropriate for this application. The main reason is to exclude small, though significant, changes, like lane changes on highways.

Optimal search

The optimal segmentation using Algorithm 7.3 gives almost the same segmentation. The estimated change time sequences are

$$\hat k_{\mathrm{rec,fix}} = (20,\ 46,\ 83,\ 111,\ 130,\ 173),$$
$$\hat k_{\mathrm{rec,marg}} = (18,\ 42,\ 66,\ 79,\ 89,\ 121,\ 136,\ 165,\ 173,\ 180),$$
$$\hat k_{\mathrm{opt,marg}} = (18,\ 42,\ 65,\ 79,\ 88,\ 110,\ 133,\ 162,\ 173,\ 180),$$

respectively.

Robustness to design parameters

The robustness with respect to the design parameters of the approximation is as follows:

The exact MAP estimate is almost identical to the approximation. The number of segments is correct, but three jumps differ slightly.

A number of different values of the jump probability were examined. Any value between $q = 0.001$ and $q = 0.1$ gives the same number of segments. A $q$ between 0.1 and 0.5 gives one more segment, just at the entrance to the roundabout.

A smaller number of filters than $M = 10$ gives one or two more segments.

That is, reasonable performance is obtained for almost any choice of design parameters.


Figure 7.3. The path with estimated change points, and the actual and segmented heading angle. First row for fixed noise variance 0.05, second row for marginalized noise variance, and finally, optimal ML estimate for marginalized noise variance.


Figure 7.4. Result of the MCMC Algorithm 7.2. The left plot shows the jump sequence examined in each iteration. The encountered sequence with the overall largest likelihood is marked with dashed lines. The right plot shows a histogram over all considered jump sequences. The burn-in time is not excluded.

MCMC search

Figure 7.4 shows the result of the MCMC algorithm in Algorithm 7.2. The algorithm was initiated at the estimated change times from Algorithm 7.1. We note that two change times at the entrance to the roundabout are merged into one after 25 iterations. The histogram shows that some of the changes are easier to locate in time than others. This follows intuitively when studying the path: for example, the first two turns are sharper than the following ones.

7.A. Two inequalities for likelihoods

In this section, two inequalities for the likelihoods will be derived. They hold for both generalized and marginalized likelihoods, although they will be applied to the latter only.

7.A.1. The first inequality

The following theorems consider the $\delta$-parameterized changing regression model (7.3).

Theorem 7.1
Consider the problem of either maximum likelihood segmentation using (7.13) or (7.15) with $q = 0.5$, or MAP segmentation using (7.13) or (7.15), or information based segmentation using (7.20), (7.21) or (7.22) with an arbitrary penalty term.


Remarks:

Similar to the Viterbi algorithm 5.5, there is a finite memory property after a change, which implies that we can take the most likely sequence before the change and dispense with all other possibilities.

For MAP and ML segmentation, the a posteriori probabilities below are given in Theorems 7.3 and 7.5.

Note that the case (7.14) is not covered by the theorem. The reason is that the segments are not independent under the assumption of a constant and unknown noise scaling.

The theorem implies that, conditioned on a change at time $t_0$, the MAP sequence at time $t$, $\delta_{\mathrm{MAP}}^{t}$, must begin with the MAP sequence at time $t_0$, $\delta_{\mathrm{MAP}}^{t_0}$.

Proof: Given a jump at $t_0$, the a posteriori probability factorizes. The second equality in the factorization is Bayes' law, and the last one follows since the supposed jump implies that the measurements before the jump are uncorrelated with the jump sequence after the jump, and vice versa. We have also used causality at several instances. The theorem now follows from the fact that the first factor is always less than or equal to $p(\delta_{\mathrm{MAP}}^{t_0}|y^{t_0})$. □

The reason that the case of unknown and constant noise scaling does not work can be realized from the last sentence of the proof: all measurements contain information about $\lambda$, so measurements from all segments influence the model and $\delta_t$.

The Viterbi algorithm proposed in Viterbi (1967) is a powerful tool in equalization; see Forney (1973). There is a close connection between this step and the Viterbi algorithm; compare with the derivation of the Viterbi Algorithm 5.5.


Theorem 7.1 leaves the following $t+1$ candidates for the MAP estimate of the jump sequence at time $t$:

$$\delta(0) = (0, 0, \ldots, 0), \qquad \delta(k) = (\delta_{\mathrm{MAP}}^{k-1}, 1, 0, \ldots, 0), \quad k = 1, 2, \ldots, t. \qquad (7.27)$$

At time $t$, $t$ filters are updated, which should be compared to $2^t$ for a straightforward implementation. The total number of updates sums up to $t^2/2$.

7.A.2. The second inequality

Consider the two jump sequences defined in (7.28). The second step intuitively works as follows. Given a jump at time $t_0$ or $t_0+1$, the measurements before $t_0$, $y^{t_0-1}$, are independent of the jump sequence after $t_0+1$, and vice versa. Thus, it is only the measurement $y_{t_0}$ that contains any information about which jump sequence is the correct one. If this measurement were not available, then these two sequences would be the same and indistinguishable. One then compensates for the 'deleted' measurement $y_{t_0}$ according to the worst case, and gets an upper bound on the probabilities.

Theorem 7.2
Consider the problem of either maximum likelihood segmentation using (7.13) or (7.15) with $q = 0.5$, or MAP segmentation using (7.13) or (7.15), or information based segmentation using (7.20), (7.21) or (7.22) with an arbitrary penalty term. Consider the particular sequences $p(\delta(t_0)|y^t)$ and $p(\delta(t_0-1)|y^t)$ in (7.28). These are both bounded from above by a common bound involving

$$\gamma_{\max} = (2\pi)^{-p/2}\left(\det R_{t_0}\right)^{-1/2}.$$

Proof: The a posteriori probability $p(\delta(t_0-1)|y^t)$ is expanded by using Bayes' law and the fact that the supposed jump at $t_0$ divides the measurements into two independent parts:

Expanding $p(\delta(t_0)|y^t)$ in a similar way gives

Note that the last factor of each expansion is the same, and the first is known at time $t_0$. The point is that $p(y_{t_0}|\,\cdot\,)$, whatever the conditioning, is bounded from above by $\gamma_{\max} = \gamma(0, R_{t_0})$, and the theorem follows. □

Note that all probabilities needed are already available. Also note that merged sequences may be merged again, which implies that one upper bound is in common for more than two sequences. An interesting question is how powerful this second inequality is. It was shown in Gustafsson (1992) that this second inequality reduces the complexity from $O(t^2)$ to $O(t\sqrt{t})$ if there is just one segment in the data, and to $O(nt)$ if the number of segments $n$ is large. In simulated examples, the number of sequences is very close to $nt$. In Section 7.7, the complexity is reduced by a factor of four in an application with real data and imperfect modeling.

7.A.3. The exact pruning algorithm

The two inequalities lead to the following algorithm, which finds the exact MAP estimate with less than $t$ filters.

Algorithm 7.3 Optimal segmentation

Start from the a posteriori probabilities given in Theorem 7.3 or Theorem 7.5 for the sequences under consideration at time $t$:

1. At time $t+1$, let only the most likely sequence jump.

2. Decide whether two or more sequences should be merged. If so, compute a common upper bound on their a posteriori probabilities, and consider in the sequel these merged branches as just one.

If the most probable sequence has larger probability than all upper bounds, then the MAP estimate is found. Otherwise, restart the algorithm, or backtrack the history of that upper bound and be more restrictive in merging.

Note that the case of unknown constant noise scaling does not apply here. The first step is the most powerful one: it is trivial to implement and makes it possible to compute the exact ML and MAP estimates for real signals. It is also very useful for evaluating the accuracy of low-complexity approximations.

The first steps in Algorithms 7.3 and 7.1 are the same. The second step in Algorithm 7.1 corresponds to the merging step in Algorithm 7.3: instead of computing an upper bound, the unlikely sequences are just cut off.

7.B. The posterior probabilities of a jump sequence

7.B.1. Main theoremsTheorem 7.3 Segmentation w it h known noise variance)Consider the changing regression model 7.3). The a posteriori probability ofk n for a known noise scalings An and a Aat prior on On is given by

n

-210gp(knly ,X ) = c + 2nlog + C ( D ( i )+V ( i ) )n l - q (7.29)

4 i = l

if P ( i ) s non-singular for all i

Proof: Without loss of generality, it can be assumed that $\lambda(i) = 1$, since it can be included in $R_t$. Let $Y(i)$ denote the measurements $y_{k_{i-1}+1}^{k_i}$ in segment $i$. Bayes' law implies the relations $p(A|B) = p(B|A)p(A)/p(B)$ and $p(A_1, A_2, \ldots, A_n) = p(A_1)\,p(A_2|A_1)\cdots p(A_n|A_1, A_2, \ldots, A_{n-1})$, and this yields


Since the model parameters are independent between the segments, the measurements from one segment are independent of the other segments,

$$p(y^N|k^n) = \prod_{i=1}^{n} p(Y(i)). \qquad (7.30)$$

The law of total probability givesr

As mentioned, a flat and non-informative prior on B is assumed, th at is p ( B )-1, so

To simplify the expressions somewhat, a constant Rt = R is assumed. It isnow straightforward to complete the squares for B, similarly as done to prove(5.103), n order to rewrite the integrand as a Gaussian density function whichintegrates to one. The result is the counterpart to equat ion (5.98) for a non-informative prior, and the remaining factor is

P ( Y ( i ) )=PT N(i)P/P(det R)-N( i ) /2X( i ) - -N( i )p /2(2~)d /22/de t X(i)p( i )

where V ( i ) nd D ( i ) are as defined in (7.6) and (7.7), espectively. Taking

the logarithm, using (7.12) nd collecting all terms that do not depend on thejump sequence in the onstant C the result follows. 0

Theorem 7.4 Segmentation with constant noise variance)The a posteriori probability of k n in the changing regression model (7.3) foran unknown but constant noise covariance scaling X i ) X, with a non-informative prior on both B i ) and X, is given by

n

-210gp(knlyN) c + 2nlog - q +C D ( i ) (7.32)Q i = l

Page 270: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 270/495

258 Chanae detection based on filteranks

i f P ( i ) s non-singular forall i and ' V ( i ) 0.

Proof: The proof sta rts from equation 7.31),and the conditioning on X is writ-ten ou t explicitely: p ( k n l y N )= p ( k n l y N , X )in (7.30) and p ( Y N )= p ( Y N I X )in (7.31). The law of total probability gives

Here X is separated from the definitions (7.6) and (7 .7) .To solve the integral, the inverse Wishart ProbabilityDensityFunction

(PDF)

is utilized, where is the gamma-function. A PDF integrates to one, so

and result follows. 0

Th e o r e m 7.5 (Segmentation with changing noise variance)T h e a posteriori probabil i ty o f kn in the changing regression model 7.3) foran unknown chan ging noise scal ing X i ) , with a Aat prior on both O ( i ) andX i ) , is given by

- 2 logp(kn1yN) c + 2n log - q

4(7.33)

D i )+

( N ( i ) p - d - 2) log - V ( i ) - 2 log l?1

2i=l

i f P ( i ) s non -singular and V ( i ) 0 fo r all i .

Page 271: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 271/495

7.B Theosterior mobabilities of a iumD seauence 259

Proof: Th e proof is almost identical to th e one of Theorem 7.4. Th e differenceis that the re is one integral for each segment, and again they are solved byidentification with the inverse Wishart PDF.

and result follows. 0

The results can be omewhat simplified by using r ( n + l ) M &nn+lj2ePn,

which is Stirling's formula. It follows th at th e ollowing expression can be usedfor reasonably large segment lengths (> 30 roughly):

log r M og(27r) + ( N ( i ) p d 4)

N ( i ) p d 4

2N ( i ) p - d - 4

2

- N ( i ) p- d - 5) log

M N ( i ) p d 2 ) log l

and similarly for F( Np-Fd-2 .This relation is used in (7.13)-(7.15).

Page 272: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 272/495

Part W State estimation

Adaptive Filtering and Change DetectionFredrik Gustafsson

Copyright © 2000 John Wiley & Sons, LtdISBNs: 0-471-49287-6 (Hardback); 0-470-84161-3 (Electronic)

Page 273: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 273/495

Kalman filtering8.1. Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2648.2. State space modeling . . . . . . . . . . . . . . . . . . . . 267

8.2.1. Sampling ormula . . . . . . . . . . . . . . . . . . . . . . . 2688.2.2. Physicalmodeling . . . . . . . . . . . . . . . . . . . . . . 2688.2.3. Using known transfer unctions . . . . . . . . . . . . . . . 2728.2.4. Modeling tricks . . . . . . . . . . . . . . . . . . . . . . . . 274

8.3. The Kalman filter . . . . . . . . . . . . . . . . . . . . . . 2788.3.1. Basic formulas . . . . . . . . . . . . . . . . . . . . . . . . 2788.3.2. Numerical xamples . . . . . . . . . . . . . . . . . . . . . 2808.3.3. Optimality roperties . . . . . . . . . . . . . . . . . . . . 284

8.4. Time-invariant signal mode l . . . . . . . . . . . . . . . . 868.4.1. Error sources

. . . . . . . . . . . . . . . . . . . . . . . . .287

8.4.2. Observer . . . . . . . . . . . . . . . . . . . . . . . . . . . 2888.4.3. Frequency response . . . . . . . . . . . . . . . . . . . . . . 2898.4.4. Spectral factorization . . . . . . . . . . . . . . . . . . . . 289

8 5 Smoothing . . . . . . . . . . . . . . . . . . . . . . . . . . 2908.5.1. Fixed-lag smoothing . . . . . . . . . . . . . . . . . . . . . 2908.5.2. Fixed-interval smoothing . . . . . . . . . . . . . . . . . . 292

8.6. Computational aspec ts . . . . . . . . . . . . . . . . . . . 958.6.1. Divergence . . . . . . . . . . . . . . . . . . . . . . . . . . 2968.6.2. Cross orrelated noise . . . . . . . . . . . . . . . . . . . . 2978.6.3. Bias error . . . . . . . . . . . . . . . . . . . . . . . . . . . 2988.6.4. Sequential processing . . . . . . . . . . . . . . . . . . . . . 299

8.7. Square root mplementation . . . . . . . . . . . . . . . . 008.7.1. Time nd measurement updates . . . . . . . . . . . . . . 3018.7.2. Kalman predictor . . . . . . . . . . . . . . . . . . . . . . . 3048.7.3. Kalman filter . . . . . . . . . . . . . . . . . . . . . . . . . 304

8.8. Sensor usion . . . . . . . . . . . . . . . . . . . . . . . . . 3068.8.1. The information ilter . . . . . . . . . . . . . . . . . . . . 3088.8.2. Centralized fusion . . . . . . . . . . . . . . . . . . . . . . 3108.8.3. The general fusion formula . . . . . . . . . . . . . . . . . 3108.8.4. Decentralized fusion . . . . . . . . . . . . . . . . . . . . . 311

8.9. The extend ed Kalman filter . . . . . . . . . . . . . . . . 13

Adaptive Filtering and Change DetectionFredrik Gustafsson

Copyright © 2000 John Wiley & Sons, LtdISBNs: 0-471-49287-6 (Hardback); 0-470-84161-3 (Electronic)

Page 274: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 274/495

264 Kalman filterina

8.9 .1. Measurement update . . . . . . . . . . . . . . . . . . . . . 3138.9.2.Time pdate . . . . . . . . . . . . . . . . . . . . . . . . . 3148.9 .3 . Linearization error . . . . . . . . . . . . . . . . . . . . . . 318

8.9.4.Discretization of state noise . . . . . . . . . . . . . . . . . 3218.10. Whiteness based change detection using th e Kalmanfilter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324

8.11. Estimation of covariances in stat e space models . . . . 268.12. Applications . . . . . . . . . . . . . . . . . . . . . . . . . 327

8.12.1. DC motor . . . . . . . . . . . . . . . . . . . . . . . . . . . 3278.12.2. Target racking . . . . . . . . . . . . . . . . . . . . . . . . 3288.12.3. GPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337

8.1. Basics

The goal in this section is to explain the fundamenta ls of Kalman filter theoryby a few illustrative examples.

The Kalman filter requires a state space model for describing the signaldynamics. To describe its role, we need a concrete example, so let us return tothe targe t tracking example from Chapter 1. Assume that we want a modelwith the states z1 = X , x 2 = Y , x 3 = X och x4 = Y . This is the simplestpossible case of sta te vector used in practice. Before we derive a model in thenext section, a few remarks will be given on what role th e different terms inthe model has.

Example 8.7 Target tracking: function of Kalman filter

Figure 8.l(a) illustrates how the Kalman filter makes use of the model.Suppose the object in a target tracking problem (exactly the same reasoningholds for navigation as well) is located at the origin with velocity vector (1 , l ) .For simplicity, assume that we have an es tim ate of the state vector which

coincides with the true value

&l0 = (O,O, 1 , l ) = xi.

In practice, there will of course be some error or uncertainty n th e esti-mate. The uncertainty can be described by a confidence interval, which inthe Kalman filter approach is always shaped as an ellipsoid. In th e figure, theuncertainty is larger in the longitudinal direction th an in the latera l direction.

Given the initial state, future ositions can be predicted by just integratingthe velocity vector (like in dead reckoning). This simulation yields a straightline for the position. With a more complex state space model with more tates,more complex manoeuvres can be simulated. In the state noise description,

Page 275: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 275/495

8.1 Basics 265

0 2 3 4 5

(a)

-1 -I0 0.5 1 1.5 2 2.5 3 3.5

(b)

Figure 8.1. Th e plot in (a ) shows how t he model predicts future positions of the object,given the st at e vector at time 0, an d how the unce rtainty region increases in time. Th e plotin (b) shows how th e filter takes care of th e first measurement by correcting he sta te vectorand decreasing the uncertainty region.

we can model what kind of manoeuvres can be performed by the aircra ft, forinstance, 1 m/s2 in longitudinal direction and 3 m/s2 in lateral direction. Thepossible and unpredictable manoeuvres in the future imply th at th e uncer-tainty grows in time, just like illustrated by the ellipsoids in Figure 8.l (a) .Note th at the object makes a very sha rp manoeuvre that makes the actua lpath xt going outside the confidence interval.

Figure 8 .l (b ) shows how the Kalman filter acts when the first measurementx1 becomes available.

1. First, the estimate is corrected towards the mea surement. The velocitystate component is adapted likewise.

2. Then the uncertainty is decreased accordingly.

The index rules can now be summarized as:

0 For the true trajectory xt, a single index is used as usual. Time is indeedcontinuous here.

0 For a simulation sta rting at time 0, as shown in Figure S. l (a ) , the se-quence xol0, 1l0, ~ 3 1 0 , . . is used. Th e rule is th at th e first ndex istime and the second one indicates when the simulation is started.

0 When the Kalman filter upda tes the est imate, the second index is in-creased one unit. The Kalman filter is alternating between a one timestep simulation, and updating the sta te ector, which yields the sequence

x O I O , x l l O , x l l l x211, x212, 2312, *

Page 276: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 276/495

266 Kalman filterina

0 Hats are used on all computed quantities (simulated or estimated), forinstance Q t .

The state space model used in this chapter is

Here yt is a measured signal, A , B , C, D are known matrices and xt an unknownstate vector. There are three inputs to the system: the observable (and con-

trollable) ut, the non-observable process noise vt and the measurement noiseet. The Kalman filter is given by

where the update gain Kt is computed by the th e Kalman filter equations asa function

Figure 8.2 summarizes th e signal flow for t he signal model and the Kalmanfilter.

Noises e t ,w t o u tp u t yt Y t E t M etInput ut P t M 2 talman filtertate x t U tSystem

c c

c c

Figure 8.2. Definition of signals for th e signal model and Ka lman filter.

Example 8.2 Target tracking: function of state space model

The role of the state space model in Example 8.1 is summarized a s follows:

0 The deterministic model (the A matrix ) describes how t o s imulate thestate vector. In the example, th e velocity sta te is used to find futurepositions.

Page 277: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 277/495

8.2 State space modeling 267

The stochastic term in the s ta te equation, &ut , decides how the confi-dence interval grows ( the size of the random walk). One extreme case isQ + CO when anything can happen in one time step and all ld informa-

tion must be considered as useless, and the ellipsoid becomes infinitelylarge. The other extreme case is Q + 0, which corresponds to the ellip-soid not growing at all.

The deterministic part of t he measurement equation yt = Cxt tells theKalman filter in what direction one measurement should affect the esti-mate.

The covariance R of the measurement noise et describes how reliable themeasurement nformation is. Again, the extreme points are important

to understand. First, R = 0 says that the measurement is exact, and inthe example this would imply that illl would coincide with x1, and thecorresponding ellipsoid must break down to a point. Secondly, R + m

means tha t the measurement is useless and should be discarded.

Much of the advantage of using th e Kalman filter compared to more ad hocfilters (low-pass) is tha t the design is moved from an abstract pole-zero place-ment to a more concrete level of model design.

Literature

The standa rd reference for all computational aspects of the Kalman filterhas for a long time been Anderson and Moore (1979), but from now on itis likely tha t the complete reference will be Kailath et al. (1998). This is athorough work covering everything related to st at e space estimation. Thesetwo references are a bit weak regarding applications. A suitable book withnavigation applications is Minkler and Minkler (1990). Other monographs areBrown an d Hwang (1997) and Chui and Chen (1987).

8.2. State space modeling

Designing a good Kalman filter is indeed more a modeling than a filter designtask. All that is needed to derive a good Kalman filter in an application is tounderstand the modeling and a few tuning issues. The rest is implementationproblems that can be hidden in good software. That is, the first sub-goal isto derive a state space model

Page 278: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 278/495

268 Kalman filterina

Th e model is completely specified by th e matrices A , B = [B,, B,], C, D , ,R .System modeling is a subject covered by many textbooks; see for instanceLjung and Glad (1996). The purpose here is to give a few but representative

examples to cover the applications in this part . Since many models are derivedin continuous time, but we are concerned with discrete time filtering, we startby reviewing some sampling formulas.

8.2.1. Sampling ormula

Integration over one sample period T of the continuous time model

gives

X t + T =xt + ( A C x ,+B:u7) d r

Assuming that the deterministic input s constant during the ampling interval,which is the case in for instance computer controlled sys tems, the solution canbe written as the discrete time state space model

xt+T = Axt +&ut7

where

A =e ACT

The same sampling formulas can be used for th e stochastic input ut in (8.4),although it is seldom the case tha t ut is constant during the ampling intervals.Other alternatives are surveyed in Section 8.9. There are three main ways tocompute the matrix xponential: using th e Laplace transform, eries expansionor algebraic equations. See, for example, Astrom and Wittenmark (1984) fordetails. Standa rd computer packages for control theory can be used, like thefunction c2d in Control System Toolbox in MATLABTM .

8.2.2. Physical modeling

Basic relations from different disciplines in form of differential equations canbe used to build up a state space model. The example below illustrates manyaspects of how th e signal processing problem can be seen as a modeling one,rather than a pure filter design. Much can be gained in performance by de-signing the model appropriately.

Page 279: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 279/495

8.2 State m a c e modelina 269

Figure 8.3. Tracking coordinates.

Example 8.3 Target tracking: modeling

Consider an aircraft moving in two dimensions as in Figure 8.3. UsingNewton’s law F = m u separately in two dimensions gives

.. 1 F1xt =-m

..2 F 2xt =-.m

Here Fi are forces acting on the aircraft, normally due to the pilot’s manoeu-vres, which are unknown to th e tracker. From now on, t he right-hand sidewill be considered as random components W:. In vector form, we get

The standard form of a state space model only has first order derivatives in

the left-hand side. We therefore introduce dummy states, which here has thephysical interpretation of velocities:

2; =X;

xt” =xt

xt” =wt

4

1

- 4 -w2xt - t‘

Here the noise is denoted w i ather than v so as not to confuse it with thevelocities vi = x i i = 1’2 . Assuming that the position is measurable, we geta continuous time sta te space model for the state vector X = (xl, 2, ’, w 2 ) instandard form:

Page 280: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 280/495

270 Kalman filterina

0 0 1 00

= A C z t + B 3 u t =

(.0 0 1 )G+

(;)W t ,

0 0 0 0

1 0 0 0Yt = (o 1 o) xt +et .

Here superscript c indicates continuous time. The sample formula with sam-pling interval T gives

There are two possibilities for the choice of Q. The simplest is a diagonalQ = 412. An aircraft can accelerate much faster orthogonally to th e velocityvector than parallel to it. Th at s, the forces during turns are much larger tha n

during acceleration and retardation. This implies tha t we would like to havedifferent adapta tion gains in these two directions. The only problem is th atthe aircraft has its own coordinate system which is not suitable for tracking.However, we can use the heading (velocity direction) estimate

and assume th at th e velocity changes mainly orthogonally to h. It can beshown that the appropr iate state dependent covariance matrix is

Q = ( qw sin2(h) + qv cos2(h) qv - ,, sin(h) cos(h)qv - qw)sin(h) cos(h) qwcos2 ( h ) + qv sin2 h ) ) (8.8)

Here qv is the force variance along the velocity vector and qw is perpendicularto it , where qw >> qv.

It has been pointed out by many author s tha t dur ing manoeuvres, bet-ter tracking can be achieved by a so-called j e r k model. Here two more ac-celeration states are included in th e model, a nd the state noise is now the

jerk rather han acceleration. The state space model for the state vectorIC = (X , x 2 , , w 2 U ' , u2) T is:

Page 281: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 281/495

8.2 State ma ce modelina 271

~ 1 O T 0 0 0O l O T O OO O l O T OO O O l O T0 0 0 0 1 0

\ 0 0 0 0 0 1

xt +(i3 / 3 0

‘0 0

0 0 W t.

1 00 11

wt .

Again, the covariance matrix in (8.8) is a reasonable option.

Another strategy is a compromise between the acceleration and jerk mod-

els, inspired by the physical constraints of the motion of th e aircraft. Since theacceleration is mainly orthogonal to th e velocity vector, one more sta te can beintroduced for this acceleration. The acceleration orthogonal to th e velocity isthe tu rn rate W , and the state equations or the sta te vector ( x l , x 2 ,l , v 2 ,w ) ~is given by:

xt =ut

xt =ut

ut = - W t V t

ut =W&

W,=o.

- 1 1

- 2 2

- 1 2

- 2 1

(8.11)

The main advantage of this model is th a t circular paths correspond to con-stant turn rate, so we have a parameter estimation problem rather than atracking problem in these segments. Pa ths consisting of straight lines andcircle segments are called coordinated turns in the lite rature of Air Tr u f i cControl (ATC).

One further alternative is to use velocities in polar coordinates rather thanCartesian. With the state vector ( x l , x 2 ,, , w ) ~ ,he state dynamics become:

Page 282: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 282/495

272 Kalman filterina

a =ut cos(ht)

at =ut sin(ht)vt =o (8.12)

ht =W

W,=o.

Both (8.11) and (8.12) are non-linear models, and sampling and filtering arenot straightforward. Section 8.9 is devoted to this problem, and several solu-tions to this problem will be described.

As a summary, there are plenty f options for choosing the model, and eachone corresponds to one unique Kalman filter. The choice is a compromisebetween algorithm complexity and performance, but it is not at all certainthat a more complex model gives better accuracy.

8.2.3. Using known ransfer unctions

There is a standard transformation to go from a given transfer function to astate space model which works both for continuous and discrete time models.The observer companion form for a transfer function

b1sn-' + +b,-ls + b,G(s) =

sn +u1sn-'+ +U,-lS +U,

is

Here the input ut can be replaced by s ta te noise wt, fault f t or disturbances

d t .It should be mentioned tha t the stat e space model is not unique. There

are many other transformations.

Page 283: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 283/495

8.2 State m a c e modelina 273

Example 8 4 DC motor: model

Consider a sampled s ta te space model of a DC motor with continuous imetransfer function

1 1G(s) =

s(s + 1) S 2 +S'

--

The continuous time state space model is in so called observer canonical form,

2t = ( 0 o) xt + 3t-1 1

yt = (1 0) xt + et.

A bet ter choice of s ta te variables for physical interpretations is x1 being theangle and x2 he angular velocity of the motor. The derivation of th e cor-responding state space model is straightforward, and can be found n anytextbook in control theory. Sampling with sample interval T, = 0.4 S gives

Finally, we revisit the communication channel modeling problem from Chapter5 in two examples leading to st at e pace models and Kalman filter approaches.

Example 8.5 Fading communication channels

Measurements of the FIR coefficients bz, i = 1,2, . . . , b , in a fading com-munication channel can be used to es timate a model of the parameter vari-ability. The spectral content is well modeled by a low order AR model. One

possible realization of channel time variations will be shown in Figure 10.6(a).Assume tha t we have estimated t he model bl = -uibl-, - bd-2+vi for eachcoefficient bi. The corresponding sta te space model is

= (- - )X: + 3;

Ai

b: = (1 0) X:.

An application on how to utilize the predictive ability of such a model is givenin Example 8.6.

Page 284: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 284/495

274 Kalman filterina

Example 8.6 Equalization using hyper models

A so called hyper model of parameter variations can be used to improvetracking ability. Consider the digital communication channel in Example 8.5.Equalization incorporating this knowledge of parameter variation can use thefollowing state space model (assuming n b = 2, i.e. an FIR (2) channel model):

yt = ( u t o ut-1 0) xt +et .

See Lindbom (1995) for extensive heory on this matte r, and Davis et al.(1997) for one application.

8.2.4. Modeling tricks

A very useful trick in Kalman filtering s to augment th e state vector with someauxiliary states x a , nd then to pply the Kalman filter to th e ugmented statespace model. We will denote the augmented state vector 3 = (xT, x ~ ) ~ ) ~ .tshould be noted here th at th e Kalman filter is the optimal estima tor of anylinear combination of th e st at e vector. That is, given the assumptions, there isno better way to estimate the state vector xt than to estimate the augmentedsta te. This section lists some impor tan t cases where this trick is useful.

Colored state noise

Assume the state noise s colored, so we can write ut = H ( q ) f & , where iswhite noise. Let a state space realization of H ( q ) be

=A"X; +BVV,

ut =cv.;.The augmented state space model is

yt =(C, 0 ~ t et.

The relation is easily verified by expanding each row above separately.

Page 285: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 285/495

8.2 State ma ce modelina 275

Colored measurement noise

Assume the measurement noise is colored, so we can write et = H ( q ) E t ,whereet is white noise. We can apply t he same trick as in th e previous case, but inth e case of Markov noise

there is a way to avoid an increased state dimension.The obvious approach is to pre-filter the measurements, jjt = H-l (q )y t ,

where H ( q ) is a stable minimum phase spec tral factor of t he measurementnoise, so that the measurement noise becomes white. The ac tual way of im-plementing this in state space form is as follows. Modify th e measurements

as

Thus, we havea

standard state space model again (though the time index ofthe measurement is non-standard), now with correlated state and measure-ment noises.

Sensor offset or trend

Most sensors have an unknown offset. This can be solved by off-line cali-bration, but a more general alternative is given below. Mathematically, thismeans th a t we must estimate th e mean of the measurements on-line. Th estraightforward solution is to modify the state and measurement equation asfollows:

+ l = (0 I) (2) (0U t (8.14)

(8.15)

Here it should be remarked tha t in off-line situations, the mean can be esti-mated directly from the measurements and removed before the Kalman filteris applied. If there is an input U , th e de terministic part of y should be sub-tracted before computing the mean. According to least squares theory, thesolution will be the same.

Page 286: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 286/495

276 Kalman filterina

In the same way, drifts are modeled as

z t + 1 = ( 0 I 0 )

(%)9)ut

A 0 0

0 0 1

yt = (c I t I > (%)et.

An alternative is to include the drift in the offset state as follows:

(8.16)

(8.17)

z t + l = 0 0 I

(8.18)

(8.19)

The advantage of this latter alternative is that the state space model is time-invariant if the original model is time-invariant.

Sensor faults

Th e sensor offset and drift might be due to a sensor fau lt. One model of asensor fault is as a sudden ffset or dri ft. The sta te pace model correspondingto this is

zt+l = (0 I 0) (%)( ut + d t - k ($) (8.20)

A 0 0

0 0 1

B, t f-

(8.21)

where k denotes the time the fault occurs.

Actuator faults

In the same way as a sensor fault, a sudden offset in an actu ator is modeledas

Z t S 1 =Atxt +&$Ut +& ut +d t - k B u , t f u (8 .22)yt =Ctxt +et. (8.23)

Page 287: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 287/495

8.2 State ma ce modelina 277

If th e offset las ts, rather than being jus t a pulse, more states x f , t can beincluded for keeping the fault f u in a memory

zt+l = ( I ) (G) ri.t> t + ut +bt-k y)f u (8.24)At Bu, t

(8.25)

State disturbances

The third kind of additive change in state pace model, beside sensor and actu-ator fault is a state disturbance. Disturbances and, for example, manoeuvres

in target tracking are well described by

xt+1 =Atxt +&$Ut +B?J,tvt+b t - k B f f (8.26)

yt =Ctxt +et. (8.27)

Parametric state space models

If unknown system parameters 8 enter the state space model linearly, we havea model like

This is like an hybrid of a linear regression and a st ate space model. Usingthe sta te vector zT = (xT, T ) , we get

(8.31)

The Kalman filter applies, yielding a combined state and parameter estima tor.If the parameter vector is time-varying, it can be interpreted as an unknowninput. Variants of the Kalman filter as such a kind of an unknown nputobserver are given in for instance Keller and Darouach (1998, 1999).

SmoothingThe method for fixed-lag smoothing given in Section 8.5 is a good example ofstate augmentation.

Page 288: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 288/495

278 Kalman filterina

8.3. The Kalman ilter

Sections 13.1 and 13.2 contain two derivations of the Kalman filter. I t is

advisable at this stage to study, or recapitulate, the erivation that suits one'sbackground knowledge the best.

8.3.1. Basic formulas

A general time-varying s ta te space model for the Kalman filter is

Th e dimensions of the matrices will be denoted n,, nu, n,, n g , espectively.The time upd ate and measurement upda te in the Kalman filter are given by:

Comments:

0 It is convenient to define and compute three auxiliary quantities:

E t =Yt - CtQ- l (8.38)

St =CtPtlt-lCT +Rt (8.39)

Kt =Ptlt-lCF(CtPtlt-lCF +Rt)-' = Ptlt-lC,TSC1. (8.40)

The interpretations are that t is the covariance matrix of the innovation

E t , and Kt is the Kalman gain.

0 With the auxiliary quantities above, equation (8.37) can be written morecompactly as Ptlt = PtltP1 - KtStKF.

0 It is suitable to introduce a dummy variable L = Ptlt-lCF to save

computations. Then we compute St = CtL + Rt, Kt = LS; andptlt = - K ~ L ~ .

0 Th e indexing ule is th a t is th e projection of xt onto the spacespanned by the measurements y1, y2, . . . ,yk , or in t he statistical frame-work the conditional expectation of xt, given the set of measurements

Y1, Y2, ,Yk.

Page 289: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 289/495

8 . 3The Kalman filter 279

0 P = E(z - k ) ( z - k )T is th e covariance matrix for th e st at e estimate.

0 The update equation for P is called the discrete Riccati equation. Even

if the state space model is time-invariant, the Kalman filter will be time-varying due to the transient caused by unknown initial conditions. Th efilter will, however, converge to a time-invariant one. More on this inSection 8.4.

0 The Kalman gain and covariance matrix do not depend upon data, andcan be pre-computed. This is in a way counter intuitive, because theactual filter can diverge without any noticeable sign except for very largeinnovations. Th e explanation is th e prior belief in th e model. More on

this in Section 8.6.1.

0 As for adaptive filters, it is very illuminating to analyze th e estimationerror in term s of bias, variance and tracking error. This will be done inSections 8.6.3 for bias error, and Section 8.4.1 for the other two errors,respectively.

0 Th e signal model (8.32) is quite general in th at all matrices, includingthe dimensions of the input and output ectors, may change in time. For

instance, it is possible to have a time-varying number of sensors, enablinga straightforward solution to the so-called mu lti-rate signal processingproblem.

0 The covariance matrix of the stochastic contribution to the state equa-tion is B,,tQtBCt. It is implicitly assumed tha t th e factorization is doneso that Qt is non-singular. (This may imply that i ts dimension is time-varying as well.) In many cases, this term can be replaced by a (singular)covariance matrix Qt = Bv,tQtBct of dimension n, X n,. However, cer-tain algorithms require a non-singular Qt.

There are several ways to rewrite the basic recursions. Th e most common onesare summarized in the algorithm below.

Algorithm 8.1 Kalman filter

The Kalman filter in its filter form is defined by the recursion

Page 290: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 290/495

280 Kalman filterina

The one-step ahead predictor form of the Kalman filter is defined by therecursion

kt+llt = A t Q - l + Bu,tut +A t K t ( y t- C t Q - 1 )

= ( A t - AtKtCt) i t l t - l +Bu,tut +AtKtYt. (8.42)

In both cases, the covariance matrix given by t he recursions (8.35) and (8.37)is needed.

8.3.2. Numerical examples

The first numerical example will illustrate the underlying projections n atwo-dimensional case.

Example 8 7 Kalman filter: a numerical example

A two-dimensional example is very suitable for graphical illustration. Themodel under consideration is

X t + l = )xt , X0 = (;)-1

yt = (1 0) xt +et.

The state equat ion describes a rotation of th e st at e vector 90 to the left.

The absence of state noise facilitates illustration in that the re is no randomwalk in the s ta te process. The measurement equation says th a t only the firststate component is observed. Figure 8.4(a) shows the first th ree sta te vec-tors X O , I , z2, and the corresponding measurements, which are the projectionon the horizontal axis. Th e measurement noise is assumed negligible. Alter-natively, we may assume that the realization of the measurement noise withcovariance matrix R is eo = e1 = 0.

The Kalman filter is initialized with i o l l = (0, O ) T, and covariance matrix

POI-1 = a l .Simple and easily checked calculations give

aK1 = 1)

a + R 0

Page 291: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 291/495

8.3 he Kalman filter 281

J I

Figure 8.4. True states and estimated estimates from the Kalman filter's two first timerecursions. In (a), the measurements are included , and in (b) the covariance ellipses aremarked.

Note that & < 1. For large (U,r small R , it holds that

Figure 8.4(b) illustra tes the change in shape in the covariance matrix aftereach upda te. For each measurement, one dimension is compressed, and thetime update is due to the lack of state noise just a rotation.

We make the following observations:

0 The time upda te corresponds to a 90 rotation, just in the same way asfor the true state.

Page 292: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 292/495

282 Kalman filterina

0 A model error, when the tru e and modeled A matrices are not the same,implies that the rotations are not exactly the same.

0 The measurement update is according to th e projection theorem a pro-

jection of the true state onto the measurement. See th e derivation nSection 13.1. The plane II s interpreted here as th e last measurement.It is only the init ial uncertainty reflected in P th at prevents the estimatefrom coinciding exactly with the projection.

0 It is the rotation of the st at e vector th at makes th e s ta te vector observ-able With A = I, he true state vector would be constant and x2 wouldnot be possible to es timate. This can be verified with th e observabilitycriterion to be defined in Section 8.6.1

Th e second example is scalar, and used for illustrating how the Kalmanfilter equations look like in the simplest possible case.

Example 8.8 Kalman filter: a scalar example

It is often illustrative to rewrite complicated expressions as th e simplestpossible special case at hand. For a sta te space model, this corresponds toth at all variables are scalars:

xt+l =axt +butyt =cxt +et

E(& =4

E(et) =T-.

The sta te equation can be interpreted as an AR( 1) process < l ) , whichis by the measurement equation observed in white noise. Figure 8.5 shows anexample wit h

a = 0 . 9 , b = 0 . 1 , c = 3 , q = 1 , r = l .

The Kalman filter divided into residual generation, measurement upda te andtime update, with the corresponding covariance matrices, is given by

(8.43)

Page 293: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 293/495

8.3 The Kalman filter 283

State

0.8

0.6

0.4

0.2

1 , 1 1 1 1 Noise-free xtSimulated xt

01 I0 5 10 15 20

Time [samples]

Figure 8.5. Simulation of a scalar state space model. For reference, a stat e space simulationwithout state noise is shown, as well as th e noise-free outp ut.

Note that the time upda te s the stand ard AR predictor. It might be instruc-tive to try to igure out how the struc ture relates to th e general linear filteringformulas in Sections 13.1.3 and 13.2.1, which is a suitable exercise.

Figure 8.6 shows an example for th e realization in Figure 8.5. Note howthe measurement update improves the prediction from the time upd ate byusing the information in the current measurement. Note also tha t althoughwe are using the best possible filter, we cannot expect to get a perfect estimateof the state.

The dummy variables' P$, and St have th e interpretation of covari-ance matrices. We can therefore plot confidence intervals for the estimates inwhich the true sta te s expected to be in. In this scalar example, the confidenceinterval for the filtered estimate is Q t f i . Since all noises are Gaussianin this example, this corresponds to a 95% confidence interval, which is shownin Figure 8.6(b) . We see that the true state is within this interval for all 20samples (which is better than the average behavior where one sample can beexpected to fall outside the interval).

Example 8.9 Kalman filter: DC motorConsider the state space model of a DC motor in Example 8.4. A sim-

ulation of this model is shown in Figure 8.7. It is assumed here th at only

Page 294: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 294/495

Page 295: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 295/495

8.3 he Kalman filter 285

18

16 - rueposition

True state and measurements

easurement y

- - True elocity

14

12-

10-

8 -

-

6 -

4 -

2

0-

- - - C - - - - ---.-2

0 10 20 30 40 50 60 70

Figure 8.7. Simulation of a DC motor.

Position

D

Velocity3

- Estimate as diff y)True velocitv

A 10 20 30 40 50 60 70 ,b -3b I 20 30 40 50 60 70 d0

( 4 (b)

Figure 8.8. Kalman filter estimate of angle (left plot) and angular velocity (right plot).

or in the minimum variance sense, or the maximum a posteriori sense,cf. Section 13.2 .

0 Independently of the distribution of ZO, ut, t , assuming t he specified co-variance matrices reflect the tr ue second order moments, th e Kalmanfilter is the best possible linear filter. What is meant by ‘best’ is definedas above, but only in the unconditional meaning. Tha t is, PtltP1 sthe unconditionalcovariance matrix of the MSE. As a counter example,where there are better conditional filters th an th e Kalman filter, is whenthe state or measurement noise have frequent outliers. Since the noise is

Page 296: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 296/495

286 Kalman filterina

still being white, the Kalman filter is the best linear filter, but then ap-propriate change detectors in Chapters 9 and 10 often work much bette ras state estimators.

0 All system matrices may depend on data, and he nterpretation ofand Pt+llt as conditional estimate and covariance, respectively,

still holds. An important case is the linear regression model, where theregression vector Ct contains old measurements.

8.4. Time-invariant signal model

We will here study the special case of a time-invariant state space model

in more detail. The reason for this special attention is th a t we can obtainsharper results for the properties and gain more insight into the Kalman ilter.

Assume the Kalman filter is applied at time t o for th e first time. Th eKalman gain Kt will then approach the steady state Kalman gain K whent o + -cc or when t + 00. These nterpretations are equivalent. In hesame way, Pt + P . The existence of such a limit assumes asymptotic time-invariance and stabili ty IX(A)I < 1 of th e signal model (8.44). To actuallyprove convergence is quite complicated, and Kalman himself s ta ted it to behis main contribution to th e Kalman filter theory. A full proof is found inAnderson and Moore (1979) and Kailath et al. (1998).

Algorithm 8.2 Stationary Kalman filter

The stationary Kalman filter in predictor form is given by

The stationary covariance matrix, P , is found by solving the non-linearequation system (8.47), which is referred to as the sta tionar y iccati equation.Several approaches exist, the simplest one being o iterate the Riccati equationuntil convergence, which will always work due to stabili ty of the filter. Solving

Page 297: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 297/495

8.4 Time-invariantianall 287

the stationary Riccati equation is the coun terpa rt to spect ral factorization inWiener filtering, cf. Section 13.3.7.

It can be shown from the convergence theory t hat th e Kalman filter is

stable,IX(A - AKC)I < 1. (8.49)

It is interesting to note t hat stability is ensured even if the signal model inunstable. Th at is, the Kalman filter is stable even if th e system is not. Theconditions for stability are that unstable modes in A are excited by the noiseBvut an d that these modes are observable. In standard terms (Kaila th, 1980),these two conditions can be expressed as follows: 1) the pair [A,C ] is de-tectable, and 2) the pair [A,B,Q1/2] s stabilizable. Here Q1j2 denotes any

matrix such that Q1/2(Q1/2)T Q (a square root); see Section 8.7.

8.4.1. Error sources

The time-invariant predictor form is, subst itu ting the measurement equation(8.45) into (8.46),

= ( A - AKC)k'tlt-l +AKCxt +AKet. (8.50)

In the scalar case, this is an exponential filter with two inputs. The forgetting

factor is then 1 K C , and the adaptation gain is KC. A small K hus impliesslow forgetting.

Thestimation rror = xt - is from (8.50) using the signalmodel, given by

Zt+llt = ( A - AKC)PtI't-l +AKet +B,vt.

Thus, the transient and variance errors are (see Section 8.6.3 for a procedurehow to measure the bias error):

ti E t l l t = ( A - AKC)tzo +C ( A AKC)"' (AKek +B,vt) .- = /

variance error

There are now several philosophies for state estimation:

0 The stationary Kalman filter aims at minimizing the variance error, as-suming the initial error has faded away.

0 An observerminimizes the transient, assuming no noise, or assuming de-terministic disturbances Bvut which cannot be characterized in stochas-tic framework (a deterministic disturbance gives a perturbation on thestate that can be interpreted as a transient).

Page 298: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 298/495

288 Kalman filterina

0 Th e time-varying Kalman filter minimizes the sum of transient and vari-ance errors.

8.4.2. Observer

We will assume here th at th er e is no process noise. To proceed as for adaptivefilters in Section 5.5, we need a further assumption. Assume all eigenvaluesof A - A K C are distinct. Then we can factorize T- l D T = A - A K C , whereD is a diagonal matr ix with diagonal elements being th e eigenvalues. In t hegeneral case with multiple eigenvalues, we have to deal with so-called Jordanforms (see Kailath (1980)). The transient error can then be written

The transient will decay to zero only if all eigenvalues are str ictly less thanone, IX(A - AKC)I < 1. The ra te of decay depends upon the largest eigen-value max IX(A - AKC)I. Th e idea in observer design is to choose K suchthat the eigenvalues get pre-assigned values. This is possible if the pair [ A ,C ]

is observable (Kai lath, 1980). The only thing ha t prevents us from mak-ing the eigenvalues arbitrarily small is th at a disturbance can imply a hugecontribution to th e error (though it decays quickly after the disturbance hasdisappeared). Assume a step disturbance entering at time 0, and that thetransient error is negligible at this time. Then

t t - l

A fast decay of the trans ient requires tha t the gain K is large, so the directterm AK will be large as well, even if D = 0.

Consider the case of a scalar output. We then have n degrees of freedomin designing K . Since there are n eigenvalues in D , and these can be chosenarbitrarily, we have no degrees of freedom to shape T . This is well known fromcontrol theory, since the zeros of the observer are fixed, and the poles can bearbitrarily chosen. Th e special case of D = 0 corresponds to a deadbeatobserver (Kailath , 1980). The compromise offered in the Kalman filter isto choose the eigenvalues so th at th e transient error and variance error arebalanced.

Page 299: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 299/495

8.4 Time-invariantianall 289

8.4.3. Frequency esponse

The assumption on stationarity allows us t o define th e frequency response ofthe Kalman filter. Taking th e x-transform of (8.46) gives

That is, the transfer functions of the Kalman filter when t + 00 or t o + -00

are

Here Hy(eiUTsis a nz X ny matrix of transfer functions (or column ector wheny is scalar), so there is one scalar transfer function from each measurement toeach state, and similarly for Hu(e iWTs) .Each transfer function has the samen, poles, and generally n, - 1 zeros (there is at least one time delay).

We now analyze the transfer function t o the measurement y instead of tothe state X. Assume here that, without loss of generality, there is no input.Since Y = CX we immediately get

~ $ ~ ; e d = c ( ~ ~ w T ~ IA + A K C ) - ~ A K . (8.53)

Equation (8.53) gives the closed loop transfer function in the input-outputrelation Y ( x )= H K F ( x ) Y ( x )of theKalman filter. That is, quation(8.53) gives the transfer function from y to ij in Figure 8.9. An open loopequivalent from innovation E to y is readily obtained in t he same way. We get

closed loop

as the transfer functions of the model and Kalman filter, respectively. SeeFigure 8.9.

8.4.4. Spectral actorization

An interesting side effect of these calculations, is that a spectral factoriza-tion can be obtained automatically. The calculations below are based on the

super formula: if Y ( x )= S ( z ) U ( z ) ,hen the spe ctrum is given by Qgy(z) =lS(x)12@uu(x)= S ( X ) S ( X - ~ ) Q . , , ( X )n the unit circle. For matrix valued trans-fer functions, this formula reads Qyg(x)= S ( z ) Q U u ( z ) S T ( x - ' ) .

Page 300: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 300/495

290 Kalman filterina

Figure 8.9. Interplay of model and Kalm an filter transfer functions.

It follows from Figure 8.9 that the ou tput power spec trum is

QYy(z)=R +C ( x I - A)-lB,QB,T(~I - A)-TCT (8.54)

=R + Hmo d e l (2 )QHmo d e l (8.55)

The spectral fac tori zatio nproblem is now to find a transfer function th at givesthe same spectrum QyU(z) or a white noise input with Q U U ( z ) 1. Tha t is,find S ( z )such that QyU(z) s ( ~ ) S ~ ( z - ~ ) .he result is not immediate from(8.55) because it is the sum of two terms.

Alternatively, we can deduce from Figure 8.9 th at E = y - H K F E ,so

y = ( I +H K F ) E ,ndQgy(z) = I +H K F ( Z ) ) ( R +CPCT)(I + (8.56)

Note that R +CPCT is the covariance matrix of the innovation. That is, thenon-minimum phase stable spectral factor of the output spec tru m is given byS ( z )= ( I +H K F ( z ) ) ( RCPCT)1/2 the square root is an ordinary matrixone). This expression can be used to compute th e Wiener filter for vectorvalued processes, a task that can hardly be done without computing the sta -tionary Kalman filter. The derivation s based on Anderson and Moore (1979),p. 86.

8.5. Smoothing

8.5.1. Fixed-lag smoothing

Fixed-lag smoothing is to est imate the sta te vector xt from measurementsy(s), S 5 t + m. If m is chosen in th e order of two, say, time constants ofthe system, almost all of th e information in th e da ta is utilized. The trick t oderive the optimal filter is to a ugment the sta te vector with delayed states,

x ; = x t - i , i = l , ... m + l . (8.57)

Page 301: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 301/495

8.5 Smoothina 291

The augmented state vector is (cf. Section 8.2.4)

zt = (8.58)

For simplicity, we will drop the tim e indices on the s tate matrices in thissubsection. Th e full signal model is

where

A 0 0 . . . 0 0'I 0 0 * * * 0 00 I 0 . . . 0 00 0 I * * * 0 0

. . . . . ... . .0 0 0 :.. I 0,

Bv, t = i i)

(8.59)

(8.60)

(8.61)

(8.62)

Estimate kt-mlt, kt-m+llt,. . . ,Qt by applying the Kalman ilter on (8.59)-(8.60). The es timate of xtPm given observations of ys, 0 5 S 5 t is given byxt+llt. This is recognized as the last sub-vector of the Kalman filter prediction

X t + l l t .

To simplify the equations somewhat, let us write out the Kalman redictorformulas

A m+1

A-

Next, we aim at simplifying (8.63)-(8.65). I t follows from (8.62) th at he

innovation covariance can be simplified to it s usual low dimensional form

Page 302: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 302/495

292 Kalman filterina

Split Ptlt-l into n X n blocks

-

Ptlt-1 =

which gives

(8.67)

(8.68)

To compute the Kalman gain, we need to know Pi>’, = 0 , . . . ,m + l . Theseare given by a subst itution of (8.67) in the Riccati equation (8.65)

P:: O = P:i;U_l(A - AKtC)T, i = 0,. . . ,m , (8.69)

where Kt = I?:. Tha t is, to compute the s tate update we only need t o knowthe quantities rom the standard Kalman redictor and the original st ate spacematrices of size n X n , and the covariance matrices updated in (8.69). Tocompute the diagonal matrices P:i)te_l, = 0 , . . . ,m + 1, corresponding to the

smoothed estimate’s covariance matrix, a substi tution of (8.67) in the Riccatiequation (8.65) gives

(8.70)

oloit-l is found from the Riccati equation for the original signal model.

8.5.2. Fixed-interval smoothing

In this off-line filtering situation, we have access to all N measurements andwant to find the best possible state estimate Q N . One rather naive wayis to use the fixed-lag smoother from the previous section, let m = N andintroduce N fictive measurements (e.g. zeros) with nfinite variance. Thenwe apply the fixed-lag smoother and at time t = 2N we have all smoothedestimates available. Th e reason for introducing information less measurementsis solely to being able to ru n th e alman filter until all useful information in hemeasurements have propagated down to the ast sub-vector of th e augmentedstate vector in (8.58).

A better way is to use so-called forward-backward ilters. We will studytwo different approaches.

Page 303: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 303/495

8.5 Smoothina 293

Rauch-Tung-Striebel formulas

Algorithm 8.3 gives the Rauch-Tung-Striebel formulasfor fixed-interval smooth-ing, which are given without proof. The notation i i t l means as usual theestimate of xt given measurements up to time N

Algorithm 8.3 Fixed nterval smoothing: Rauch-Tung-Striebel

Available observations are yt, t = 0,. . . ,N Run the standard Kalman fil-ter and store both the time and measurement updates, 2tlt,itlt-l ,Ptlt, PtltP1.Apply the following time recursion backwards in time:

The covariance matrix of t he estimation error PtlN is

This algorithm is quite simple to implement, but it requires that all stateand covariance matrices, both for the prediction and filter errors, are stored.That also means that we must explicitely apply both time and measurementupdates.

Two-filter smoothing formulas

We will here derive a two-filter smoothing formula,which will turn out to beuseful for change detection.

Denote the conditional expectations of past and future data

respectively. Here Z F and P$ are the estimates from the Kalman filter (these

will not be given explicitly). The index F is introduced t o stress tha t thefilter runs forwards in time, in contrast to th e filter running backwards intime, yielding k i t+ , and P +l, to be introduced. Qu ite logically, 2i t+l s theestimate of xt based on measurements yt+l , yt+2,. . . ,y ~ , hich is up to timet + 1 backwards in time.

t l t

Page 304: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 304/495

294 Kalman filterina

The smoothed estimate is, given these distributions, a standard fusionproblem, whose solution will be derived in Section 8.8.3. The result is

There is an elegant way to use the Kalman filter backwards on da ta to omputeO +, and P$+l. What is needed is a so-called backward model.

The following lemma gives the desired backwards Markovian model that

is sample path equivalent to (8.32). Th e Markov property implies tha t th enoise process { u t } ,and the final value of th e st at e vector X N , re independent.Not only are the first and second order statistics equal for these two models,but they are ndeed sample path equivalent, since they both produce th e samestate and outpu t vectors.

Lemma 8.1

T he fol lowing model is sam ple pa th equivalent to8.32) and is i ts correspond-ing backwaxd Ma rkovian model

(8.72)

Here

where IIt = E [ x t x T ] i s t h e a priori covaxiance matrixof the s ta te vec tor,computed recursively by + l = A&A? + Qt. The las t equ al i t ies of (8.73)and (8.75) hold i f At is invertible.

Proof: Seeerghese andailath (1979). 0

Note that no inversion of the state noise covariance matrix is needed, sowe do not have to deal with the factorized form here.

The Kalman filter applied to th e model (8.72) in reverse time provides th equantities sought:

Page 305: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 305/495

8.6 Comw tational asDects 295

Th e backward model can be simplified by assuming no initial knowledgeof the tate, and hus I I o = Pllo = 001, rmore formally, = 0. Then

IIt1 = 0 for all t , and the lat ter expressions assuming invertible At give

A, =At1

Q , = A t l Q t A t T.

These last two formulas are quite intuitive, and the result f a straightforwardinversion of the state equation as xt = A-lxt+l - A-lvt .

Algorithm 8.4 Fixed nterval smoothing: forward-backward filtering

Consider the state space model

1. Run the standard Kalman filter forwards n time and store the filterquantities Q t , P+.

2. Com pute the backward model

and run the standard Kalman ilter backwards in time and store both the'predictions' ktl t+l , Ptlt+l.

3. Merge the information from the forward and backward filters

8.6. Computational aspects

First, some brief comments on how to improve the numerical properties aregiven before we examine major issues:

Page 306: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 306/495

296 Kalman filterina

In case the measurement noise covariance is singular, the filter may benumerically unstable, although the solution to the estimation problemis well defined.

- One solution to this is to replace the inverse of St in Kt with apseudo-inverse.

- Another solution is regularization. Tha t is to add a small identitymatrix to Rt. This is a common suggestion, especially when themeasurement is scalar.

- A singular $R$ means that one or more linear combinations of the state vector can be exactly computed from one measurement. A third solution is to compute a reduced order filter (Anderson and Moore, 1979). This is quite similar to the Luenberger observer (Kailath, 1980; Luenberger, 1966). The idea is to transform the state and observation vector to $\bar{x}, \bar{y}$ such that the exactly known part of $\bar{y}$ is a sub-vector of $\bar{x}$ which does not need to be estimated.

• Sometimes we have linearly dependent measurements, which give cause to a rank deficient $C_t$. Then there is the possibility to reduce the complexity of the filter, by replacing the observation vector with an equivalent observation, see Murdin (1998), obtained by solving the least squares problem $y_t = C_t x_t + e_t$. For instance, when there are more measurements than states we get the least squares (snapshot) estimate

$$\bar{y}_t = (C_t^T C_t)^{-1} C_t^T y_t = x_t + (C_t^T C_t)^{-1} C_t^T e_t \triangleq x_t + \bar{e}_t.$$

Then

$$\bar{y}_t = I x_t + \bar{e}_t, \qquad \bar{e}_t \in \mathrm{N}\left(0,\ (C_t^T C_t)^{-1} C_t^T R_t C_t (C_t^T C_t)^{-1}\right)$$

can be fed to the Kalman filter without loss of performance (a small code illustration follows after this list).

• Outliers should be removed from the filter. Here prior knowledge and data pre-processing can be used, or impulsive disturbances can be incorporated into the model (Niedzwiecki and Cisowski, 1996; Settineri et al., 1996).

• All numerical problems are significantly decreased by using square root algorithms. Section 8.7 is devoted to this issue.
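As a small code illustration of the equivalent observation above (a sketch with illustrative names, assuming $C$ has full column rank):

    % Replace (y, C, R) by (ybar, I, Rbar) in the measurement update
    % when there are more measurements than states.
    ybar = (C'*C)\(C'*y);               % least squares (snapshot) estimate
    Rbar = ((C'*C)\(C'*R*C))/(C'*C);    % (C'C)^-1 C'RC (C'C)^-1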

8.6.1. Divergence

Divergence is the case when the covariance matrix does not reflect the true uncertainty in the state estimate. The simplest way to diagnose divergence


is to compare the innovation variance $S_t$ with an estimate formed from the Kalman filter innovations, e.g.

$$\hat{S}_t = \frac{1}{t+1} \sum_{k=0}^{t} \varepsilon_k \varepsilon_k^T.$$

Other whiteness-based innovation tests apply, see Section 8.10. Another indication of divergence is that the filter gain $K_t$ tends to zero. Possible causes of divergence are:

• Poor excitation or signal-to-noise ratio. We have seen the importance of excitation in adaptive filtering (when $A_t = I$ and $C_t = \varphi_t^T$). Basically, we want the input signals to make $C_t$ point in many different directions, in the sense that

$$\sum_{k=t-L+1}^{t} C_k^T C_k$$

should be invertible for small $L$. This expression is a measure of the information in the measurements, as will become clear in the information filter, see Section 8.8.1. For a time-invariant signal model, we must require observability, that is

$$\begin{pmatrix} C \\ CA \\ CA^2 \\ \vdots \\ CA^{n-1} \end{pmatrix}$$

has full column rank; see Kailath (1980). To get good numerical properties, we should also require that the condition number is not too large (a quick numerical check is sketched after this list).

• A small offset in combination with unstable or marginally unstable signal models is another case where divergence may occur. Compare this with navigation applications, where the position estimate may drift away, without any possibility of detecting this.

• Bias errors; see Section 8.6.3.
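The observability condition above is easy to check numerically; a sketch with illustrative names, assuming A and C are given:

    % Build the observability matrix and inspect rank and condition number.
    n = size(A,1);
    O = zeros(0,n);
    for k = 0:n-1
      O = [O; C*A^k];      % stack C, CA, ..., CA^(n-1)
    end
    fprintf('rank %d of %d, condition number %g\n', rank(O), n, cond(O));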

8.6.2. Cross correlated noise

If there is a correlation between the noise processes, $\mathrm{E}(v_t e_t^T) = M_t$, so that


one trick to get back on track is to replace the state noise with

(8.76)

(8.77)

Equation (8.76) gives

The Kalman filter applies to this model. The last output dependent term in the state equation should be interpreted as an extra input.

8.6.3. Bias error

A bias error due to modeling errors can be analyzed by distinguishing the true system and the design model, denoted by super-indices $o$ and $d$, respectively. The state space model for the true system and the Kalman filter for the design model is

Let $\bar{x}$ denote the augmented state vector with covariance matrix $\bar{P}$. This is a state space model (without measurements), so the covariance matrix time update can be applied,

$$\bar{P}_{t+1} = \bar{A}_t \bar{P}_t \bar{A}_t^T + \begin{pmatrix} B_{v,t}^o & 0 \\ 0 & A_t^d K_t^d \end{pmatrix} \begin{pmatrix} Q_t^o & 0 \\ 0 & R_t^o \end{pmatrix} \begin{pmatrix} B_{v,t}^o & 0 \\ 0 & A_t^d K_t^d \end{pmatrix}^T.$$

The total error (including both bias and variance) can be expressed as

(8.81)


and the error correlation matrix (note that $P$ is no covariance matrix when there is a bias error) is

This matrix is a measure of the performance (hence the superscript $p$) of the mean square error. This should not be confused with the optimal performance $P^o$ and the measure that comes out from the Kalman filter, $P^d$. Note that $P^p > P^o$, and the difference can be interpreted as the bias. Note also that although the Kalman filter is stable, $P^p$ may not even be bounded. This is the case in navigation using an inertial navigation system, where the position estimate will drift away.

The main use of (8.80) is a simple test procedure for evaluating the influence of parameter variations or faults in the system matrices, and measurement errors or faults in the input signal. This kind of sensitivity analysis should always be performed before implementation. That is, vary each single partially unknown parameter, and compare $P^p$ to $P^o$. Sensitivity analysis can here be interpreted as a Taylor expansion

$$P^p \approx P^o + \frac{dP^o}{d\theta} \Delta\theta.$$

The more sensitive $P$ is to a certain parameter, the more important it is to model it correctly.

8.6.4. Sequential processing

For large measurement vectors, the main computational burden lies in the matrix inversion of $S_t$. This can be avoided by sequential processing, in which the measurements are processed one at a time by several consecutive measurement updates. Another advantage, except for computation time, is when a real-time requirement occasionally makes it impossible to complete the measurement update. Then we can interrupt the measurement updates and only lose some of the information, instead of all of it.

Assume that $R_t$ is block diagonal, $R_t = \mathrm{diag}(R_t^1, \dots, R_t^m)$, which is often the case in practice. Otherwise, the measurements can be transformed as outlined below. Assume also that the measurement vector is partitioned in the same way, $y_t = ((y_t^1)^T, \dots, (y_t^m)^T)^T$ and $C_t = ((C_t^1)^T, \dots, (C_t^m)^T)^T$. Then we apply, for each $i = 0, \dots, m-1$, the measurement updates


with $\hat{x}^0_{t|t} = \hat{x}_{t|t-1}$ and $P^0_{t|t} = P_{t|t-1}$. Then we let $\hat{x}_{t|t} = \hat{x}^m_{t|t}$ and $P_{t|t} = P^m_{t|t}$.
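In code, the loop is short; a sketch for the case of scalar blocks (illustrative names; x and P enter holding $\hat{x}_{t|t-1}$ and $P_{t|t-1}$, and the MATLAB loop index runs 1 to m rather than 0 to m-1):

    % Sequential processing: one row of C at a time, R assumed diagonal.
    for i = 1:m
      Ci = C(i,:); Ri = R(i,i);
      Si = Ci*P*Ci' + Ri;           % scalar innovation covariance
      Ki = P*Ci'/Si;
      x = x + Ki*(y(i) - Ci*x);     % i:th partial measurement update
      P = P - Ki*Si*Ki';
    end                             % afterwards, x = x(t|t), P = P(t|t)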

If $R_t$ is not block-diagonal, or if the blocks are large, the measurement vector can easily be transformed to have a diagonal covariance matrix. That is, let

$$\bar{y}_t = R_t^{-T/2} y_t = R_t^{-T/2} C_t x_t + \bar{e}_t, \qquad \mathrm{Cov}(\bar{e}_t) = I.$$

Here $R_t^{1/2}$ denotes the square root; see Section 8.7. For time-varying $R_t$ this may be of little help, since the factorization requires as many operations as the matrix inversion we wanted to avoid. However, for time invariant $R_t$ this is the method to use.

8.7. Square root implementation

Square root algorithms are motivated by numerical problems when updating the covariance matrix $P$ that often occur in practice:

• $P$ is not symmetric. This is easily checked, and one remedy is to replace it by $0.5(P + P^T)$.

• $P$ is not positive definite. This is not as easy to check as symmetry, and there is no good remedy. One computationally demanding solution might be to compute the SVD of $P = U D U^T$ after each update, and to replace negative values in the singular values in $D$ with zero or a small positive number.

• Due to large differences in the scalings of the states, there might be numerical problems in representing $P$. Assume, for instance, that $n_x = 2$ and that the states are rescaled

$$\bar{x} = \begin{pmatrix} 10^{10} & 0 \\ 0 & 1 \end{pmatrix} x \quad \Rightarrow \quad \bar{P} = \begin{pmatrix} 10^{10} & 0 \\ 0 & 1 \end{pmatrix} P \begin{pmatrix} 10^{10} & 0 \\ 0 & 1 \end{pmatrix},$$

and there will almost surely be numerical difficulties in representing the covariance matrix, while there is no similar problem for the state. The solution is a thoughtful scaling of measurements and states from the beginning.

• Due to a numerically sensitive state space model (e.g. an almost singular $R$ matrix), there might be numerical problems in computing $P$.


The square root implementation resolves all these problems. We define the square root as any matrix $P^{1/2}$ of the same dimension as $P$, satisfying $P = P^{1/2} P^{T/2}$. The reason for the first requirement is to avoid solutions of the kind $\sqrt{1} = (1/\sqrt{2},\ 1/\sqrt{2})$. Note that sqrtm in MATLAB™ defines a square root without transpose as $P = AA$. This definition is equivalent to the one used here, since $A$ is symmetric for symmetric $P$.

The idea of updating a square root is very old, and fundamental contributions have been made in Bierman (1977), Park and Kailath (1995) and Potter (1963). The theory now seems to have reached a very mature and unified form, as described in Kailath et al. (1998).

8.7.1. Time and measurement updates

The idea is best described by studying the time update,

$$P_{t+1|t} = A_t P_{t|t} A_t^T + B_{v,t} Q_t B_{v,t}^T.$$

A first attempt of factorization,

$$P_{t+1|t}^{1/2} = \left(A_t P_{t|t}^{1/2} \quad B_{v,t} Q_t^{1/2}\right), \qquad (8.85)$$

fails on the condition that a square root must be quadratic. We can, however, apply a QR factorization to each factor in (8.85). A QR factorization is defined as

$$X = QR, \qquad (8.86)$$

where $Q$ is a unitary matrix such that $Q^T Q = I$ and $R$ is an upper triangular matrix. The $R$ and $Q$ here should not be confused with $R_t$ and $Q_t$ in the state space model.

Example 8.10 QR factorization

The MATLAB™ function qr efficiently computes the factorization $X = QR$.
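For instance, applied to an arbitrary 3 × 2 matrix (illustrative numbers, not those of the original example):

    X = [1 2; 3 4; 5 6];
    [Q, R] = qr(X);
    norm(Q'*Q - eye(3))    % Q is unitary: close to zero
    norm(Q*R - X)          % X = QR: close to zero
    % R is upper triangular, with zeros below the main diagonal.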

Applying QR factorization to the transpose of the first factor in (8.85) gives

$$\left(A_t P_{t|t}^{1/2} \quad B_{v,t} Q_t^{1/2}\right)^T = QR \quad \Rightarrow \quad \left(A_t P_{t|t}^{1/2} \quad B_{v,t} Q_t^{1/2}\right) = R^T Q^T. \qquad (8.87)$$


That is, the time update can be written

$$P_{t+1|t} = R^T Q^T Q R = R^T R. \qquad (8.88)$$

It is clear that $Q$ is here instrumental, and does not have to be saved after the factorization is completed. Here $R$ consists of a quadratic part (actually triangular) and a part with only zeros. We can identify the square root as the first part of $R$,

$$R^T = \left(P_{t+1|t}^{1/2} \quad 0\right). \qquad (8.89)$$

To summarize, the time update is as follows.

Algorithm 8.5 Square root algorithm, time update

1. Form the matrix on the left hand side of (8.87). This involves computing one new square root of $Q_t$, which can be done by the QR factorization ($R$ is taken as the square root) or sqrtm in MATLAB™.

2. Apply the QR factorization in (8.87).

3. Identify the square root of $P_{t+1|t}$ as in (8.89).

More compactly, the relations are

$$\left(A_t P_{t|t}^{1/2} \quad B_{v,t} Q_t^{1/2}\right)^T = Q \begin{pmatrix} P_{t+1|t}^{T/2} \\ 0 \end{pmatrix}. \qquad (8.90)$$
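A minimal MATLAB sketch of this time update (illustrative names; Psq and Qsq are assumed square roots of $P_{t|t}$ and $Q_t$):

    n = size(A,1);
    [~, Rfac] = qr([A*Psq, Bv*Qsq]');  % QR of the transposed factor in (8.87)
    Psq_pred = Rfac(1:n, 1:n)';        % P(t+1|t)^(1/2), identified as in (8.89)
    % Check: Psq_pred*Psq_pred' recovers A*P*A' + Bv*Q*Bv'.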

The measurement update can be treated analogously. However, we will give a somewhat more complex form that also provides the Kalman gain and the innovation covariance matrix. Apply the QR factorization to the matrix

$$\begin{pmatrix} R_t^{1/2} & C_t P_{t|t-1}^{1/2} \\ 0 & P_{t|t-1}^{1/2} \end{pmatrix}^T = QR, \qquad (8.91)$$

where

$$R^T = \begin{pmatrix} X & 0 \\ Y & Z \end{pmatrix}. \qquad (8.92)$$


The Kalman filter interpretations of the matrices $X, Y, Z$ can be found by squaring the QR factorization,

$$R^T Q^T Q R = R^T R,$$

from which we can identify

$$X = S_t^{1/2}, \qquad Y = K_t S_t^{T/2}, \qquad Z = P_{t|t}^{1/2}. \qquad (8.93)$$

This gives the algorithm below.

Algorithm 8.6 Square root algorithm, measurement update

1. Form the matrix on the left-hand side of (8.91).

2. Apply the QR factorization in (8.91).

3. Identify $R^T$ with (8.92), where $X = S_t^{1/2}$, $Y = K_t S_t^{T/2}$ and $Z = P_{t|t}^{1/2}$.

All in one equation:

(8.94)

Note that $Y$ can be interpreted as the Kalman gain on a normalized innovation $S_t^{-1/2}\varepsilon_t \in \mathrm{N}(0, I)$. That is, we have to multiply either the gain $Y$ or the innovation by the inverse of $X$.

Remarks:


• The only non-trivial part of these algorithms is to come up with the matrix to be factorized in the first place. Then, in all cases here it is trivial to verify that it works.

• There are many ways to factorize a matrix to get a square root. The QR factorization is recommended here for several reasons. First, there are many efficient implementations available, e.g. MATLAB™'s qr. Secondly, this gives the unique triangular square root, which is useful partly because it is easily checked that it is positive definite and partly because inversion is simple (this is needed in the state update).
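A corresponding minimal sketch of the measurement update in Algorithm 8.6 (illustrative names; Rsq and Psq are assumed square roots of $R_t$ and $P_{t|t-1}$):

    ny = size(C,1); n = size(Psq,1);
    pre = [Rsq, C*Psq; zeros(n,ny), Psq];   % pre-array as in (8.91)
    [~, Rfac] = qr(pre');
    post = Rfac';                           % lower triangular post-array
    X = post(1:ny, 1:ny);                   % S(t)^(1/2)
    Y = post(ny+1:ny+n, 1:ny);              % K(t)*S(t)^(T/2), normalized gain
    Z = post(ny+1:ny+n, ny+1:ny+n);         % P(t|t)^(1/2)
    x = x + (Y/X)*(y - C*x);                % Y*inv(X) = Kalman gain K(t)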

8.7.2. Kalman predictor

The predictor form, eliminating the measurement update, can be derived using the matrix (8.95) below. The derivation is done by squaring up in the same way as before, and identifying the blocks.

Algorithm 8.7 Square root Kalman predictor

Apply the QR factorization

$$\begin{pmatrix} R_t^{1/2} & C_t P_{t|t-1}^{1/2} & 0 \\ 0 & A_t P_{t|t-1}^{1/2} & B_{v,t} Q_t^{1/2} \end{pmatrix}^T = Q \begin{pmatrix} S_t^{T/2} & (A_t K_t S_t^{1/2})^T \\ 0 & P_{t+1|t}^{T/2} \\ 0 & 0 \end{pmatrix}. \qquad (8.95)$$

The state update is computed by

$$\hat{x}_{t+1|t} = A_t \hat{x}_{t|t-1} + B_{u,t} u_t + (A_t K_t S_t^{1/2}) S_t^{-1/2} (y_t - C_t \hat{x}_{t|t-1}). \qquad (8.96)$$

We need to multiply the gain vector $A_t K_t S_t^{1/2}$ in the lower left corner with the inverse of the upper left corner element $S_t^{1/2}$. Here the triangular structure can be utilized for matrix inversion. However, this matrix inversion can be avoided by factorizing a larger matrix; see Kailath et al. (1998) for details.

8.7.3. Kalman filter

Similarly, a square root algorithm for the Kalman filter is given below.


Algorithm 8.8 Square root Kalman filter

Apply the QR factorization

$$\begin{pmatrix} R_t^{1/2} & C_t P_{t|t-1}^{1/2} \\ 0 & P_{t|t-1}^{1/2} \end{pmatrix}^T = Q \begin{pmatrix} S_t^{T/2} & (K_t S_t^{T/2})^T \\ 0 & P_{t|t}^{T/2} \end{pmatrix}. \qquad (8.97)$$

The state update is computed by

$$\hat{x}_{t|t} = \hat{x}_{t|t-1} + Y X^{-1} (y_t - C_t \hat{x}_{t|t-1}),$$

with $X$ and $Y$ as in (8.92).

Example 8.11 DC motor: square root filtering

Consider the DC motor in Examples 8.4 and 8.9, where

$$A = \begin{pmatrix} 1 & 0.3297 \\ 0 & 0.6703 \end{pmatrix}, \quad B = \begin{pmatrix} 0.0703 \\ 0.3297 \end{pmatrix}, \quad C = \begin{pmatrix} 1 & 0 \end{pmatrix}.$$

Assume that we initially have

The left-hand side (LHS) required for the QR factorization in the filter formulation is formed as in (8.97), and the QR factorization delivers the sought triangular factor.


From this matrix, we identify the Kalman filter quantities, among them

$$S_1^{1/2} = -1.2815.$$

Note that the QR factorization in this example gives negative signs of the square roots of $S_t$ and $K_t$. The usual Kalman filter quantities are recovered as

$$P_{1|1} = \begin{pmatrix} 0.3911 & 0.1015 \\ 0.1015 & 0.1891 \end{pmatrix}, \qquad K_1 = \begin{pmatrix} 0.3911 \\ 0.1015 \end{pmatrix}, \qquad S_1 = 1.6423.$$
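The flavour of this example can be reproduced with the DC motor matrices; since the initial values are not restated here, the sketch below assumes $P_{1|0} = I$ and $R = 1$ for illustration, and compares the squared factor with the standard measurement update:

    A = [1 0.3297; 0 0.6703]; C = [1 0]; R = 1;   % DC motor model
    P10 = eye(2); Psq = chol(P10)';               % assumed P(1|0) and its root
    pre = [sqrt(R), C*Psq; zeros(2,1), Psq];      % pre-array as in (8.97)
    [~, Rfac] = qr(pre'); post = Rfac';
    S1 = post(1,1)^2                              % innovation variance
    P11 = post(2:3,2:3)*post(2:3,2:3)'            % filtered covariance
    % Standard form for comparison:
    S = C*P10*C' + R; K = P10*C'/S; P11_std = P10 - K*S*K'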

8.8. Sensor fusion

Sensor fusion is the problem of how to merge information from different sensors. One example is when we have sensor redundancy, so we could solve the filtering problem with either of the sensors. In more complicated cases, each sensor provides unique information. An interesting question is posed in Blair and Bar-Shalom (1996): "Does more data always mean better estimates?". The answer should be yes in most cases.

Example 8.12 Fuel consumption: sensor fusion

Consider the fuel consumption problem treated in Section 3.6.1. Here we used a sensor that provides a measurement of instantaneous fuel consumption. It is quite plausible that this signal has a small offset which is impossible to estimate from just this sensor.

A related parameter that the driver wishes to monitor is the amount of fuel left in the tank. Here a sensor in the tank is used to measure fuel level. Because of slosh excited by accelerations, this sensor has low frequency measurement noise (or disturbance), so a filter with very small gain is currently used by car manufacturers.

That is, we have two sensors:


• One sensor measures the derivative with high accuracy, but with a small offset.

• One sensor measures the absolute value with large uncertainty.

A natural idea, which is probably not used in today's systems, is to merge the information into one filter. The rate sensor can be used to get a quicker response to fuel level changes (for instance, after refueling), and more importantly, the long term information about level changes can be used to estimate the offset.

A similar sensor fusion problem as in Example 8.12 is found in navigation systems, where one sensor is accurate for low frequencies (an integrating sensor) and the other for high frequencies.

Example 8.13 Navigation: sensor fusion

The Inertial Navigation System (INS) provides very accurate measurements of accelerations (second derivative rather than first derivative as in Example 8.12). The problem is possible offsets in accelerometers and gyros. These must be estimated, otherwise the position will drift away.

As a complement, low quality measurements such as the so-called baro-altitude are used. This sensor uses only the barometric pressure, compares it to the ground level, and computes an altitude. Of course, this sensor is unreliable as the only sensor for altitude, but in combination with the INS it can be used to stabilize the drift in altitude.

An important issue is whether the fusion should be made at a central computer or in a distributed fashion. Central fusion means that we have access to all measurements when deriving the optimal filter; see Figure 8.10. In contrast, in decentralized fusion a filter is applied to each measurement, and the global fusion process has access only to the estimates and their error covariances. Decentralized filtering has certain advantages in fault detection, in that the different sub-systems can apply a 'voting' strategy for fault detection. More on this is given in Chapter 11. An obvious disadvantage is increased signaling, since the state, and in particular its covariance matrix, is often of much larger dimension than the measurement vector. One might argue that the global processor can process the measurements in a decentralized fashion to get the advantages of both alternatives. However, many sensor systems have built-in filters.


Figure 8.10. The concepts of centralized and decentralized filtering.

8.8.1. The information filter

To understand the relation between centralized and de-centralized filtering, the information filter formulation of the Kalman filter is useful.

Linear regression models are special cases of the signal model (8.32) with $A_t = I$ and $C_t = \varphi_t^T$. When deriving recursive versions of the least squares solution, we started with formulas like (see Section 5.B)

$$\hat{\theta}_t = R_t^{-1} f_t, \qquad R_t = \sum_{k=1}^{t} \varphi_k \varphi_k^T, \qquad f_t = \sum_{k=1}^{t} \varphi_k y_k.$$

After noticing that the matrix inversion requires many operations, we used the matrix inversion lemma to derive Kalman filter like equations, where no matrix inversion is needed (for scalar $y$). Here we will go the other way around. We have a Kalman filter, and we are attracted by the simplicity of the time update of $P^{-1} = R_t$ above. To keep the efficiency of avoiding a matrix inversion, let us introduce the transformed state vector

$$a_{t|t} = P_{t|t}^{-1} \hat{x}_{t|t}.$$

Here, $a$ can be interpreted as the vector $f_t$ in least squares above. That is, the state estimate is never actually computed.


Algorithm 8.9 The information filter

Introduce two auxiliary matrices to simplify notation

The equations for the covariance update follow from the matrix inversion lemma applied to the Kalman filter equations. Note that the measurement update here is trivial, while the main computational burden comes from the time update. This is in contrast to the Kalman filter.

Comments:

• $C_t^T R_t^{-1} C_t$ is the information in a new measurement, and $C_t^T R_t^{-1} y_t$ is a sufficient statistic for updating the estimate. The interpretation is that $P_{t|t}^{-1}$ contains the information contained in all past measurements. The time update implies a forgetting of information.

• Note that this is one occasion where it is important to factorize the state noise covariance, so that $Q_t$ is non-singular. In most other applications, the term $B_{v,t} v_t$ can be replaced with $\bar{v}_t$ with (singular) covariance matrix $\bar{Q}_t = B_{v,t} Q_t B_{v,t}^T$.

Main advantages of the information filter:

• Very vague prior knowledge of the state can now be expressed as $P_{0|-1}^{-1} = 0$.

• The Kalman filter requires an $n_y \times n_y$ matrix to be inverted ($n_y = \dim y$), while the information filter requires an $n_v \times n_v$ matrix to be inverted. In navigation applications, the number of state noise inputs may be quite small (3, say), while the number of outputs can be large.


• A large measurement vector with block diagonal covariance $R_t$ may be processed sequentially in the Kalman filter, as shown in Section 8.6.4, so that the matrices to be inverted are considerably smaller than $n_y$. There is a corresponding result for the information filter. If $Q_t$ is block diagonal, the time update can be processed sequentially; see Anderson and Moore (1979) (pp. 146-147).

8.8.2. Centralized fusion

Much of the beauty of the Kalman filter theory is the powerful state space model. A large set of measurements is simply collected in one measurement equation,

$$y_t = \begin{pmatrix} y_t^1 \\ \vdots \\ y_t^m \end{pmatrix} = \begin{pmatrix} C_t^1 \\ \vdots \\ C_t^m \end{pmatrix} x_t + e_t.$$

The only problem that might occur is that the sensors are working in different state coordinates. One example is target tracking, where the sensors might give positions in Cartesian or polar coordinates, or perhaps only bearing. This is indeed an extended Kalman filtering problem; see Section 8.9.

8.8.3. The general fusion formula

If there are two independent state estimates $\hat{x}^1$ and $\hat{x}^2$, with covariance matrices $P^1$ and $P^2$ respectively, fusion of these pieces of information is straightforward:

$$\hat{x} = P \left( (P^1)^{-1} \hat{x}^1 + (P^2)^{-1} \hat{x}^2 \right) \qquad (8.100)$$

$$P = \left( (P^1)^{-1} + (P^2)^{-1} \right)^{-1}. \qquad (8.101)$$

It is not difficult to show that this is the minimum variance estimator. The Kalman filter formulation of this fusion is somewhat awkwardly described by the state space model

$$x_{t+1} = v_t, \qquad \mathrm{Cov}(v_t) = \infty \cdot I,$$

$$\begin{pmatrix} \hat{x}^1 \\ \hat{x}^2 \end{pmatrix} = \begin{pmatrix} I \\ I \end{pmatrix} x_t + \begin{pmatrix} e^1 \\ e^2 \end{pmatrix}, \qquad \mathrm{Cov}(e^i) = P^i.$$

To verify it, use the information filter. The infinite state variance gives $P_{t+1|t}^{-1} = 0$ for all $t$. The measurement update becomes

$$P_{t|t}^{-1} = 0 + C^T R^{-1} C = (P^1)^{-1} + (P^2)^{-1}.$$
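In code, the fusion formula is a two-liner; a sketch (illustrative function name, assuming the two estimates are independent):

    function [x, P] = fuse(x1, P1, x2, P2)
      % General fusion formula (8.100)-(8.101) for two independent estimates.
      P = inv(inv(P1) + inv(P2));
      x = P*(P1\x1 + P2\x2);
    end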


8.8.4. Decentralized fusion

Suppose we get estimates from two Kalman filters working with the same state vector. The total state space model for the decentralized filters is

The diagonal forms of $A$, $B$ and $C$ imply:

1. The updates of the estimates can be done separately, and hence decentralized.

2. The state estimates will be independent.

Thus, because of independence the general fusion formula (8.100) applies. The state space model for the decentralized filters and the fusion filter is now

$$\begin{pmatrix} x^1 \\ x^2 \\ x^3 \end{pmatrix}_{t+1} = \begin{pmatrix} A_t & 0 & 0 \\ 0 & A_t & 0 \\ 0 & 0 & A_t \end{pmatrix} \begin{pmatrix} x^1 \\ x^2 \\ x^3 \end{pmatrix}_t + \begin{pmatrix} v^1 \\ v^2 \\ v^3 \end{pmatrix}_t,$$

$$\mathrm{Cov}(v_t^1) = Q, \qquad \mathrm{Cov}(v_t^2) = Q, \qquad \mathrm{Cov}(v_t^3) = \infty \cdot I.$$

That is, the Kalman filter applied to this state space model provides the fused state estimate as the third sub-vector of the augmented state vector.

The only problem in this formulation is that we have lost information. The decentralized filters do not use the fact that they are estimating the same system, which has only one state noise. It can be shown that the covariance matrix will become too small, compared to that which the Kalman filter provides for the model

(8.103)

To obtain the optimal state estimate, given the measurements and the state space model, we need to invert the Kalman filter and recover the raw


information in the measurements. One way to do this is to ask all decentralized filters for the Kalman filter transfer functions $\hat{x}_{t|t} = H_t(q) y_t$, and then apply inverse filtering. It should be noted that the inverse Kalman filter is likely to be high-pass, and thus amplifying numerical errors. Perhaps a better alternative is as follows. As indicated, it is the information in the measurements, not the measurements themselves, that is needed. Sufficient statistics are provided by the information filter; see Section 8.8.1. We recapitulate the measurement update for convenience:

$$a_{t|t} = a_{t|t-1} + C_t^T R_t^{-1} y_t,$$

where $a_{t|t} = P_{t|t}^{-1} \hat{x}_{t|t}$. Since the measurement error covariance in (8.103) is block diagonal, the measurement update can be split into two parts

$$a_{t|t} = a_{t|t-1} + (C_t^1)^T (R_t^1)^{-1} y_t^1 + (C_t^2)^T (R_t^2)^{-1} y_t^2,$$

$$P_{t|t}^{-1} = P_{t|t-1}^{-1} + (C_t^1)^T (R_t^1)^{-1} C_t^1 + (C_t^2)^T (R_t^2)^{-1} C_t^2.$$

Each Kalman filter, interpreted as an information filter, has computed a measurement update

$$a_{t|t}^i = a_{t|t-1}^i + (C_t^i)^T (R_t^i)^{-1} y_t^i,$$

$$(P_{t|t}^i)^{-1} = (P_{t|t-1}^i)^{-1} + (C_t^i)^T (R_t^i)^{-1} C_t^i.$$

By backward computations, we can now recover the information in each measurement, expressed below in the available Kalman filter quantities:

$$(C_t^i)^T (R_t^i)^{-1} y_t^i = (P_{t|t}^i)^{-1} \hat{x}_{t|t}^i - (P_{t|t-1}^i)^{-1} \hat{x}_{t|t-1}^i,$$

$$(C_t^i)^T (R_t^i)^{-1} C_t^i = (P_{t|t}^i)^{-1} - (P_{t|t-1}^i)^{-1}.$$

The findings are summarized and generalized to several sensors in the algorithm below.

Algorithm 8.10 Decentralized filtering

Given the filter and prediction quantities from $m$ Kalman filters working on the same state space model, the optimal sensor fusion is given by the usual time update and the following measurement update:

$$P_{t|t}^{-1} = P_{t|t-1}^{-1} + \sum_{i=1}^{m} \left( (P_{t|t}^i)^{-1} - (P_{t|t-1}^i)^{-1} \right),$$

$$\hat{x}_{t|t} = P_{t|t} \left( P_{t|t-1}^{-1} \hat{x}_{t|t-1} + \sum_{i=1}^{m} \left( (P_{t|t}^i)^{-1} \hat{x}_{t|t}^i - (P_{t|t-1}^i)^{-1} \hat{x}_{t|t-1}^i \right) \right).$$
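The measurement update above translates directly into code; a sketch with illustrative names, where each local filter i reports its filtered and predicted quantities in cell arrays xf, Pf, xp, Pp, and x_pred, Pinv_pred hold the global prediction in information form:

    Iy = Pinv_pred*x_pred;                  % global prediction, info form
    Ix = Pinv_pred;
    for i = 1:m
      Iy = Iy + Pf{i}\xf{i} - Pp{i}\xp{i};  % information in measurement i
      Ix = Ix + inv(Pf{i}) - inv(Pp{i});
    end
    P = inv(Ix); x = P*Iy;                  % fused global estimate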


The price paid for the decentralized structure is heavy signaling and many matrix inversions. Also for this approach, there might be numerical problems, since differences of terms of the same order are to be computed. These differences correspond to a high-pass filter.

8.9. The extended Kalman filter

A typical non-linear state space model with discrete time observations is:

$$\dot{x}(t) = f(x(t)) + v(t), \qquad (8.104)$$

$$y_{kT} = h(x_{kT}) + e_{kT}. \qquad (8.105)$$

Here and in the sequel, $T$ will denote the sampling interval when used as an argument, while it means matrix transpose when used as a superscript. The measurement and time update are treated separately. An initialization procedure is outlined in the simulation section.

8.9.1. Measurement update

Linearization

The most straightforward approach to linearize (8.105) is by using a Taylor expansion:

$$y_{kT} \approx h(\hat{x}_{kT|kT-T}) + H_{kT} (x_{kT} - \hat{x}_{kT|kT-T}) + e_{kT}, \qquad H_{kT} = h_x'(\hat{x}_{kT|kT-T}).$$

With a translation of the observation, we are approximately back to the linear case,

$$\bar{y}_{kT} = y_{kT} - h(\hat{x}_{kT|kT-T}) + H_{kT} \hat{x}_{kT|kT-T} = H_{kT} x_{kT} + e_{kT}, \qquad (8.106)$$

and the Kalman filter measurement update can be applied to (8.106).
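The resulting update is the ordinary Kalman filter measurement update with $C_t$ replaced by $H_{kT}$, and the innovation formed with $h$ itself; a sketch assuming function handles h and hjac for $h$ and its Jacobian:

    H = hjac(x);                  % H(kT) evaluated at x(kT|kT-T)
    S = H*P*H' + R;
    K = P*H'/S;
    x = x + K*(y - h(x));         % the innovation uses h itself
    P = P - K*S*K';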

Non-linear transformation

An alternative to making (8.105) linear is to apply a non-linear transformation to the observation. For instance, if $n_y = n_x$ one can try $\bar{y} = h^{-1}(y)$, which gives $\bar{H}_{kT} = I$. The problem is now to transform the measurement covariance $R$ to $\bar{R}$. This can only be done approximately, and Gaussianity is not preserved. This is illustrated with an example.


Example 8.14 Target tracking: output transformation

In radar applications, range $r$ and bearing $\beta$ to the object are measured, while these co-ordinates are not suitable to keep in the state vector. The uncertainty of range and bearing measurements is contained in the variances $\sigma_r^2$ and $\sigma_\beta^2$. The measurements in polar coordinates can be transformed into Cartesian position by

$$\begin{pmatrix} x^{(1)} \\ x^{(2)} \end{pmatrix} = \begin{pmatrix} r \cos(\beta) \\ r \sin(\beta) \end{pmatrix}.$$

The covariance matrix $\bar{R}$ of the measurement error is neatly expressed in Li and Bar-Shalom (1993) as

$$\bar{R} = \frac{\sigma_r^2 - r^2 \sigma_\beta^2}{2} \begin{pmatrix} b + \cos(2\beta) & \sin(2\beta) \\ \sin(2\beta) & b - \cos(2\beta) \end{pmatrix}, \qquad (8.107)$$

$$b = \frac{\sigma_r^2 + r^2 \sigma_\beta^2}{\sigma_r^2 - r^2 \sigma_\beta^2}. \qquad (8.108)$$

Indeed, this is an approximation, but it is accurate for $r \sigma_\beta / \sigma_r < 0.4$, which is normally the case.
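A sketch of the transformation, computing the Cartesian position and a transformed covariance by the standard linearization, which is consistent with the structure of (8.107) (illustrative function name; sr2 and sb2 denote $\sigma_r^2$ and $\sigma_\beta^2$):

    function [z, Rc] = pol2cart_meas(r, beta, sr2, sb2)
      z = [r*cos(beta); r*sin(beta)];      % Cartesian position
      J = [cos(beta), -r*sin(beta);        % Jacobian of the transformation
           sin(beta),  r*cos(beta)];
      Rc = J*diag([sr2, sb2])*J';          % linearized covariance
    end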

8.9.2. Time update

There are basically two alternatives in passing from a continuous non-linear model to a discrete linear one: first linearizing and then discretizing; or the other way around. An important aspect is to quantify the linearization error, which can be included in the bias error (model mismatch). The conversion of state noise from continuous to discrete time is mostly a philosophical problem, since the alternatives at hand will give about the same performance. In this section, the discrete time noise will be characterized by its assumed covariance matrix $\bar{Q}$, and Section 8.9.4 will discuss different alternatives for how to compute it from $Q$.

Discretized linearization

A Taylor expansion of each component of the state in (8.104) around the current estimate $\hat{x}$ yields

$$\dot{x}^{(i)} = f^{(i)}(\hat{x}) + (f^{(i)})_x'(\hat{x}) (x - \hat{x}) + \frac{1}{2} (x - \hat{x})^T (f^{(i)})_x''(\xi) (x - \hat{x}) + v^{(i)}, \qquad (8.109)$$


where $\xi$ is a point in the neighborhood of $x$ and $\hat{x}$. Here $f_x'$ denotes the derivative of the $i$th row of $f$ with respect to the state vector $x$, and $f_x''$ similarly the second derivative. We will not investigate transformations of the state noise yet. It will be assumed that the discrete time counterpart has a covariance matrix $\bar{Q}$.

Example 8.15 Target tracking: coordinated turns

In Example 8.3, the dynamics for the state vector $(x^{(1)}, x^{(2)}, v^{(1)}, v^{(2)}, \omega)^T$ is given by:

(8.110)

This choice of state is natural when considering so called coordinated turns, where the turn rate $\omega$ is more or less piecewise constant. The derivative in the Taylor expansion (8.109) is

$$f_x'(x) = \begin{pmatrix} 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & -\omega & -v^{(2)} \\ 0 & 0 & \omega & 0 & v^{(1)} \\ 0 & 0 & 0 & 0 & 0 \end{pmatrix}. \qquad (8.111)$$

This is the solution of (8.109) found by integration from $t$ to $t+T$, and assuming $x_t = \hat{x}$. With the state vector $(x^{(1)}, x^{(2)}, v, h, \omega)^T$, the state dynamics become:

(8.112)


The derivative in the Taylor expansion (8.109) is now

$$f_x'(x) = \begin{pmatrix} 0 & 0 & \cos(h) & -v \sin(h) & 0 \\ 0 & 0 & \sin(h) & v \cos(h) & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 \end{pmatrix}. \qquad (8.113)$$

From standard theory on sampled systems (see for instance Kailath (1980)), the continuous time system (8.109), neglecting the second order rest term, has the discrete time counterpart

(8.114)

The time discrete Kalman filter can now be applied to this and (8.105) to provide a version of one of the well known Extended Kalman Filters (here called EKF 1), with time update:

(8.115)

(8.116)

This will be referred to as the discretized linearization approach.

Linearized discretization

A different and more accurate approach is to first discretize (8.104). In some rare cases, of which tracking with constant turn rate is one example, the state space model can be discretized exactly by solving the sampling formula

$$x_{t+T} = x_t + \int_t^{t+T} f(x(\tau)) \, d\tau \qquad (8.117)$$

analytically. The solution can be written

$$x_{t+T} = g(x_t). \qquad (8.118)$$

If this is not possible to do analytically, we can get a good approximation of an exact sampling by numerical integration. This can be implemented using the discretized linearization approach in the following way. Suppose we have a function [x,P]=tu(x,P,T) for the time update. Then the fast sampling method, in MATLAB™ formalism,


    for i=1:M
      [x,P]=tu(x,P,T/M);
    end

provides a more accurate time update for the state. This is known as the iterated Kalman filter. In the limit, we can expect that the resulting state update converges to $g(x)$. An advantage with numerical integration is that it provides a natural approximation of the state noise.

Example 8.16 Target tracking: coordinated turns and exact sampling

Consider the tracking example with $x_t = (x_t^{(1)}, x_t^{(2)}, v_t, h_t, \omega_t)^T$. The analytical solution of (8.117) using (8.112) is

$$\begin{aligned} x_{t+T}^{(1)} &= x_t^{(1)} + \frac{v_t}{\omega_t} \left( \sin(h_t + \omega_t T) - \sin(h_t) \right) \\ x_{t+T}^{(2)} &= x_t^{(2)} - \frac{v_t}{\omega_t} \left( \cos(h_t + \omega_t T) - \cos(h_t) \right) \\ v_{t+T} &= v_t \\ h_{t+T} &= h_t + \omega_t T \\ \omega_{t+T} &= \omega_t. \end{aligned} \qquad (8.119)$$

The alternate state coordinates $x = (x_t^{(1)}, x_t^{(2)}, v_t^{(1)}, v_t^{(2)}, \omega_t)^T$ give, with (8.110) in (8.117),

$$\begin{aligned} x_{t+T}^{(1)} &= x_t^{(1)} + \frac{v_t^{(1)}}{\omega_t} \sin(\omega_t T) - \frac{v_t^{(2)}}{\omega_t} \left(1 - \cos(\omega_t T)\right) \\ x_{t+T}^{(2)} &= x_t^{(2)} + \frac{v_t^{(1)}}{\omega_t} \left(1 - \cos(\omega_t T)\right) + \frac{v_t^{(2)}}{\omega_t} \sin(\omega_t T) \\ v_{t+T}^{(1)} &= v_t^{(1)} \cos(\omega_t T) - v_t^{(2)} \sin(\omega_t T) \\ v_{t+T}^{(2)} &= v_t^{(1)} \sin(\omega_t T) + v_t^{(2)} \cos(\omega_t T) \\ \omega_{t+T} &= \omega_t. \end{aligned} \qquad (8.120)$$

These calculations are quite straightforward to compute using symbolic computation programs, such as Maple™ or Mathematica™.
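The exact sampling formula (8.120) for the Cartesian velocity coordinates is also easy to code directly; a sketch (illustrative function name, valid for $\omega \ne 0$):

    function x = ct_update(x, T)
      [x1, x2, v1, v2, w] = deal(x(1), x(2), x(3), x(4), x(5));
      swt = sin(w*T); cwt = cos(w*T);
      x = [x1 + (v1*swt - v2*(1-cwt))/w;   % position per (8.120)
           x2 + (v1*(1-cwt) + v2*swt)/w;
           v1*cwt - v2*swt;                % velocity rotated by wT
           v1*swt + v2*cwt;
           w];                             % constant turn rate
    end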

A component-wise Taylor expansion around a point $\hat{x}$ gives

$$x_{t+T}^{(i)} = g^{(i)}(\hat{x}) + (g^{(i)})'(\hat{x}) (x_t - \hat{x}) + \frac{1}{2} (x_t - \hat{x})^T (g^{(i)})''(\xi) (x_t - \hat{x}). \qquad (8.121)$$


Applying the time discrete Kalman filter to (8.121) and (8.106), and neglecting the second order Taylor term in the covariance matrix update, we get the second version of the extended Kalman filter (EKF 2), with time update:

This will be referred to as the linearized discretization approach, because we have linearized the exact time update of the state for updating its covariance matrix.

One further alternative is to include the second order Taylor term in the state time update. Assuming a Gaussian distribution of $\tilde{x}_t$, a second order covariance update is also possible (see equation (3-33) in Bar-Shalom and Fortmann (1988)). The component-wise time update is given by (EKF 3):

(8.124)

(8.125)

(8.126)

8.9.3. Linearization error

The error we make when going from (8.109) to (8.114), or from (8.118) to (8.121), depends only upon the size of $(x - \hat{x})^T (f^{(i)})''(\xi) (x - \hat{x})$ and, for the discrete case, $(x - \hat{x})^T (g^{(i)})''(\xi) (x - \hat{x})$, respectively. This error propagates to the covariance matrix update, and for discretized linearization also to the state update. This observation implies that we can use the same analysis for linearized discretization and discretized linearization. The Hessian $(f^{(i)})''$ of component $i$ of $f$ will be used for notational convenience, but the same expressions hold for $(g^{(i)})''$. Let $\tilde{x}_{t+T} = x_{t+T} - \hat{x}_{t+T}$ denote the state prediction error. Then we have

Care must be taken when comparing different state vectors, because $\|x - \hat{x}\|^2$ is not an invariant measure of state error in different coordinate systems.

Example 8.17 Target tracking: best state coordinates

In target tracking with coordinated turns, we want to determine which of the following two coordinate systems is likely to give the smallest linearization


error:

$$x_{cv} = (x^{(1)}, x^{(2)}, v^{(1)}, v^{(2)}, \omega)^T \quad \text{or} \quad x_{pv} = (x^{(1)}, x^{(2)}, v, h, \omega)^T.$$

For notational simplicity, we will drop the time index in this example. For these two cases, the state dynamics are given by

$$f_{cv}(x) = (v^{(1)}, v^{(2)}, -\omega v^{(2)}, \omega v^{(1)}, 0)^T$$

$$f_{pv}(x) = (v \cos(h), v \sin(h), 0, \omega, 0)^T,$$

respectively.

Suppose the initial state error is expressed in $\|\tilde{x}_{cv}\|^2 = (\tilde{x}^{(1)})^2 + (\tilde{x}^{(2)})^2 + (\tilde{v}^{(1)})^2 + (\tilde{v}^{(2)})^2 + \tilde{\omega}^2$. The problem is that the error in heading angle is numerically much smaller than in velocity. We need to find the scaling matrix $J$ such that $\|\tilde{x}_{pv}\|_J^2$ is a comparable error. Using $v^{(1)} = v \cos(h)$ and $v^{(2)} = v \sin(h)$, we have

$$(\tilde{v}^{(1)})^2 + (\tilde{v}^{(2)})^2 \approx (v - \hat{v})^2 + v^2 (h - \hat{h})^2.$$

The approximation is accurate for angular errors less than, say, 40° and relative velocity errors less than, say, 10%. Thus, a weighting matrix

$$J_{pv} = \mathrm{diag}(1, 1, 1, v^2, 1)$$

should be used in (8.127) for polar velocity, and the identity matrix $J_{cv}$ for Cartesian velocity.

The error in linearizing state variable i is


The total error is then given by the sum over the state components $i = 1, \dots, n$. (8.129)

The above calculations show that if we start with the same initial error in two different state coordinate systems, (8.128) quantifies the linearization error, which can be upper bounded by taking the Frobenius norm instead of the two-norm, on how large the error in the time update of the extended Kalman filter can be.

The remaining task is to give explicit and exact expressions of the rest terms in the Taylor expansion. The following results are taken from Gustafsson and Isaksson (1996). For discretized linearization, the Frobenius norm of the Hessian for the state transition function $f(x)$ is

$$\sum_{i=1}^{5} \|(f_{cv}^{(i)})''\|_F^2$$

for coordinates $(x^{(1)}, x^{(2)}, v^{(1)}, v^{(2)}, \omega)^T$, and

$$\sum_{i=1}^{5} \|(f_{pv}^{(i)})''\|_F^2$$

for coordinates $(x^{(1)}, x^{(2)}, v, h, \omega)^T$. Here we have scaled the result with $T^2$ to get the integrated error during one sampling period, so as to be able to compare it with the results below. For linearized discretization, the corresponding norms are

$$\sum_{i=1}^{5} \|(g_{cv}^{(i)})''\|_F^2 \approx 4T^2 + T^4 (1 + v^2) + \mathcal{O}((\omega T)^6) \qquad (8.130)$$


and

$$\sum_{i=1}^{5} \|(g_{pv}^{(i)})''\|_F^2 \approx 4T^2 + T^4 + \mathcal{O}((\omega T)^6). \qquad (8.131)$$

The approximation here is that $\omega^2$ is neglected compared to $4/T^2$. The proof is straightforward, but the use of symbolic computation programs such as Maple or Mathematica is recommended.

Note first that linearized discretization gives an additive extra term that grows with the sampling interval, but vanishes for small sampling intervals, compared to discretized linearization. Note also that

That is, the continuous time result is consistent with the discrete time result. We can thus conclude the following:

• The formulas imply an upper bound on the rest term that is neglected in the EKF. Other weighting matrices can be used in the derivation to obtain explicit expressions for how e.g. an initial position error influences the time update. This case, where $J = \mathrm{diag}(1, 1, 0, 0, 0)$, is particularly interesting, because now both weighting matrices are exact, and no approximation will be involved.

• The Frobenius error for Cartesian velocity has an extra term $v^2 T^4$ compared to the one for polar velocity. For $T = 3$ and $\omega = 0$ this implies twice as large a bound. For $\omega = 0$, the bounds converge as $T \to \infty$. We can summarize the results in the plot in Figure 8.11, illustrating the ratio of (8.130) and (8.131) for $\omega = 0$ and three different velocities. As noted above, the asymptotes are 4 and 1, respectively. Note the huge peak around $T \approx 0.1$. For $T = 5$, Cartesian velocity is only 50% worse.

8.9.4. Discretization of state noise

The state noise in (8.104) has hitherto been neglected. In this section, we discuss different ideas on how to define the state noise $\bar{Q}$ in the EKF.


Figure 8.11. Ratio of the upper bounds (8.130) and (8.131) versus sampling interval. Cartesian velocity has a larger upper bound for all sampling intervals.

There are basically five different alternatives for computing a time discrete state noise covariance from a time continuous one. Using one of the alternate models

we might try

(8.132)

(8.133)

(8.134)

(8.135)

(8.136)

All expressions are normalized with $T$, so that one and the same $Q$ can be used for all of the sampling intervals.

These methods correspond to more or less ad hoc assumptions on the state noise for modeling the manoeuvres:

a. $v_t$ is continuous white noise with variance $Q$.


Figure 8.12. Examples of assumptions on the state noise. The arrow denotes impulses in continuous time and pulses in discrete time.

b. $v_t = v_k$ is a stochastic variable which is constant in each sample interval, with variance $Q/T$. That is, each manoeuvre is distributed over the whole sample interval.

c. $v_t$ is a sequence of Dirac impulses active immediately after a sample is taken. Loosely speaking, we assume $\dot{x} = f(x) + \sum_k v_k \delta(t - kT)$, where $v_k$ is discrete white noise with variance $TQ$.

d. $v_t$ is white noise such that its total influence during one sample interval is $TQ$.

e. $v_t$ is a discrete white noise sequence with variance $TQ$. That is, we assume that all manoeuvres occur suddenly immediately after sample time, so $x_{t+T} = g(x_t + v_t)$.

The first two approaches require a linear time invariant model for the state noise propagation to be exact. Figure 8.12 shows examples of noise realizations corresponding to assumptions a-c.

It is impossible to say a priori which assumption is the most logical one. Instead, one should investigate the alternatives (8.132)-(8.136) by Monte Carlo simulations and determine the importance of this choice for a certain trajectory.
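For a linear model $\dot{x} = Ax + B_v v$, one common way to realize assumption a numerically is the matrix exponential (Van Loan) method; this sketch, with illustrative names and without the normalization conventions above, computes $\int_0^T e^{A\tau} B_v Q B_v^T e^{A^T \tau}\, d\tau$:

    n = size(A,1);
    M = [-A, Bv*Q*Bv'; zeros(n), A']*T;
    Phi = expm(M);
    Ad = Phi(n+1:2*n, n+1:2*n)';    % discrete time A = expm(A*T)
    Qd = Ad*Phi(1:n, n+1:2*n);      % integrated state noise covariance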


8.10. Whiteness based change detection using the Kalman filter

The simplest form of change detection for state space models is to apply one of the distance measures and stopping rules suggested in Chapter 3 to the normalized innovation, which takes the place as a signal. The principle is illustrated in Figure 8.13.

Figure 8.13. Change detection as a whiteness innovation test (here CUSUM) for the Kalman filter, where the alarm feedback controls the adaptation gain.

The distance measure $s_t$ is one of the following:

• Normalized innovations

$$s_t = S_t^{-1/2} \varepsilon_t. \qquad (8.137)$$

For vector valued measurements and residuals, we use the sum

$$s_t = \frac{1}{\sqrt{n_y}} \mathbf{1}_{n_y}^T S_t^{-1/2} \varepsilon_t, \qquad (8.138)$$

which will be $\mathrm{N}(0, 1)$ distributed under $H_0$. Here $\mathbf{1}_{n_y}$ is a vector with $n_y$ unit elements.

• The squared normalized innovations

$$s_t = \varepsilon_t^T S_t^{-1} \varepsilon_t - n_y. \qquad (8.139)$$

The interpretation here is that we compare the observed error sizes with the predicted ones ($S_t$) from prior information only. A result along these lines is presented in Bakhache and Nikiforov (1999) for GPS navigation. This idea is used explicitly for comparing confidence intervals of predictions and prior model knowledge in Zolghadri (1996) to detect filter divergence. Generally, an important role of the whiteness test is to monitor divergence, and this might be a stand-alone application.

The advantage of (8.139) is that changes which increase the variance without affecting the mean can be detected as well. The disadvantage of (8.139)


Figure 8.14. Different ways to monitor the innovations from a Kalman filter. First, the two innovations are shown. Second, the normalized square (8.139) (without subtraction of $n_y$). Third, the test statistics in the CUSUM test using the sum of innovations as input.

is the sensitivity to noise scalings. It is well known that the state estimates are not sensitive to scalings, but the covariances are. The usual requirement that the distance measure $s_t$ is zero mean during non-faulty operation is thus sensitive to scalings.

After an alarm, an appropriate action is to increase the process noise covariance matrix momentarily, for instance by multiplying it with a scalar factor $q$.
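A sketch of the loop in Figure 8.13 for scalar measurements (kalman_step is an assumed helper returning the innovation $\varepsilon_t$ and its variance $S_t$; h and nu are the CUSUM threshold and drift from Chapter 3):

    g1 = 0; g2 = 0;
    for t = 1:N
      [eps_t, S_t] = kalman_step(y(t));    % one Kalman filter recursion
      s = eps_t/sqrt(S_t);                 % normalized innovation (8.137)
      g1 = max(0, g1 + s - nu);            % CUSUM, positive mean change
      g2 = max(0, g2 - s - nu);            % CUSUM, negative mean change
      if g1 > h || g2 > h
        g1 = 0; g2 = 0;                    % alarm and restart the test
        % appropriate action: momentarily scale up Q in the Kalman filter
      end
    end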

Example 8.18 Target tracking: whiteness test

Consider the target tracking example. Figure 8.15(a) shows the trajectory and estimate from a Kalman filter. Figure 8.14 visualizes the size of the innovations in different ways. Clearly, the means of the innovations (upper plot) are indicators of manoeuvre detection, at least after the transient has faded away around sample 25. The normalized versions from (8.139) also become large in the manoeuvre. The test statistics from the CUSUM test using the innovations as input are shown in the lower plot. It is quite easy here to tune the threshold (and the drift) to get reliable change detection. Manoeuvre detection enables improved tracking, as seen in Figure 8.15(b).


Figure 8.15. (a) Tracking using a Kalman filter, and (b) where a residual whiteness test is used. The CUSUM parameters are a threshold of 5 and a drift of 0.25. The test statistic is shown as the last subplot in Figure 8.14.

8.11. Estimation of covariances in state space models

A successful design hinges on good tuning of the process noise covariance matrix $Q$. Several attempts have been presented to estimate $Q$ (Bohlin, 1976; Gutman and Velger, 1990; Isaksson, 1988; Valappil and Georgakis, 2000; Waller and Saxén, 2000), but as we shall see, there is a fundamental observability problem, at least if both $Q$ and $R$ are to be estimated.

A state space model corresponds to a filter (or differential equation). For a model with scalar process and measurement noise, we can formulate an estimation problem from

$$x_{t+1} = A_t x_t + B_{v,t} \sigma_v v_t$$
$$y_t = C_t x_t + \sigma_e e_t,$$

where $v_t$ and $e_t$ are independent white noises with unit variances. Form the vectors $Y = (y_1, y_2, \dots, y_N)^T$, $V = (v_1, v_2, \dots, v_N)^T$, $E = (e_1, e_2, \dots, e_N)^T$ and use the impulse response matrices of the state space model (or filter model):

$$Y = H_v V + H_e E.$$


Assuming Gaussian noise, the distribution for the measurements is given by

$$Y \in \mathrm{N}\Big(0,\ \underbrace{H_v H_v^T \sigma_v^2 + H_e H_e^T \sigma_e^2}_{P(\sigma_v, \sigma_e)}\Big).$$

The maximum likelihood estimate of the parameters is thus

$$(\hat{\sigma}_v, \hat{\sigma}_e) = \arg \min_{\sigma_v, \sigma_e} Y^T P^{-1}(\sigma_v, \sigma_e) Y + \log \det P(\sigma_v, \sigma_e).$$

For Kalman filtering, only the ratio $\sigma_v / \sigma_e$ is of importance, and we can use, for instance, $\sigma_e = 1$.
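A minimal sketch of evaluating this criterion on a grid (illustrative names; the model matrices A, Bv, C, the data vector Y and the grid svgrid are assumed given, with $\sigma_e = 1$ fixed):

    N = length(Y);
    Hv = zeros(N,N);                          % impulse response from v to y
    for t = 1:N
      for k = 1:t-1
        Hv(t,k) = C*A^(t-k-1)*Bv;             % y(t) depends on v(k), k < t
      end
    end
    Lval = zeros(size(svgrid));
    for i = 1:length(svgrid)
      Pm = Hv*Hv'*svgrid(i)^2 + eye(N);       % P(sigma_v, 1); He*He' = I
      Lval(i) = Y'*(Pm\Y) + 2*sum(log(diag(chol(Pm))));  % log det via chol
    end
    [~, imin] = min(Lval);                    % ML estimate on the grid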

Example 8.19 Covariance estimation in state space models

Consider a one-dimensional motion model

The true parameters in the simulation are $N = 36$, $\sigma_v = 1$ and $\sigma_e = 3$.

Figure 8.16(a) shows that the likelihood gets a well specified minimum. Note, however, that the likelihood will not become convex (compare to Figure 1.21(a)), so the most efficient standard tools for optimization cannot be used. Here we used a global search with a point grid approach.

For change detection, we need good values of both $\sigma_v$ and $\sigma_e$. The joint likelihood for data given both $\sigma_v, \sigma_e$ is illustrated in Figure 8.16(b). Note here that the influence of $\sigma_v$ is minor, and optimization is not well conditioned. This means that the estimates are far from the true values in both approaches.

As a final remark, the numerical evaluation of the likelihood given here is awkward for a large number of data, since large impulse response matrices of dimension $N \times N$ have to be formed, and inversions and determinants of their squares have to be computed. More efficient implementations can be derived in the frequency domain by applying Parseval's formula to the likelihood.

8.12. Applications

8.12.1. DC motor

Consider the DC motor lab experiment described in Sections 2.5.1 and 2.7.1. It was examined with respect to system changes in Section 5.10.2. Here we


Figure 8.16. Likelihood for $\sigma_v$ when $\sigma_e = 1$ is assumed (a), and likelihood for $\sigma_v, \sigma_e$ (b).

apply the Kalman filter with a whiteness residual test, where the goal is to detect the torque disturbances while being insensitive to system changes. The test cases in Table 2.1 are considered. The squared residuals are fed to the CUSUM test with $h = 10$ and $\nu = 2$. The state space model is given in (2.2).

From Figure 8.17 we conclude the following:

• The design of the CUSUM test is much simpler here than in the parametric case. The basic reason is that state changes are simpler to detect than parameter changes.

• We get alarms at all faults, both system changes and disturbances.

• It does not seem possible to solve the fault isolation problem reliably with this method.

8.12.2. Target tracking

A typical application of target tracking is to estimate the position of aircraft. A civil application is Air Traffic Control (ATC), where the air traffic controllers at each airport want to get the position and predicted positions of all aircraft at a certain distance from the airport. There are plenty of military applications where, for several reasons, the position and predicted position of hostile aircraft are crucial information. There are also a few other applications where the target is not an aircraft. See, for instance, the highway surveillance problem below.


Figure 8.17. Simulation error using the state space model in (2.2) and test statistics from the CUSUM test. Nominal system (a), with torque disturbances (b), with change in dynamics (c) and with both disturbance and change (d).

The classical sensor for target tracking is radar, which at regular sweeps measures range and bearing to the object. Measurements from one flight test are shown in Figure 8.18.

Sensors

The most characteristic features of such a target tracking problem are the available sensors and the dynamical model of the target. Possible sensors include:

• A radar where bearing and range to the object are measured.

• Bearing only sensors, including Infra-Red (IR) and radar warning systems. The advantage of these in military applications is that they are


Figure 8.18. Measurements and trajectory from dead reckoning.

passive and do not reveal the position of the tracker.

• The range only information in a GPS should also be mentioned here, although the tracker and target are the same objects in navigation applications as in a GPS.

• Camera and computer vision algorithms.

• Tracking information from other trackers.

Possible models are listed in the ATC application below.

Air Traffic Control (ATC)

There are many sub-problems in target tracking before the Kalman filter can be applied:

• First, one Kalman filter is needed for each aircraft.

• The aircraft is navigating in open-loop in altitude, because altitude is not included in the state vector. The aircraft are supposed to follow their given altitude reference, where a large navigation uncertainty is assumed.

• The association problem: each measurement is to be associated with an aircraft trajectory. If this is not working, the Kalman filters are fed with incorrect measurements, belonging to another aircraft.

• The radar detects false echoes that are referred to as clutter. These are a problem for the association algorithm.


See Bar-Shalom and Fortmann (1988) and Blackman (1986) for more information. The application of the IMM algorithm from Chapter 10 to ATC is reported in Yeddanapudi et al. (1997).

Choice of state vector and model

The most common model for target tracking is by far a motion model of the kinds listed below. The alternative is to use the aircraft flight dynamics, where the pilot input is included in $u_t$, which is considered as being too computationally intensive. There are a few investigations of this approach; see, for instance, Koifman and Bar-Itzhack (1999).

Proposed dynamical models for the target that will be examined here include:

1. A four state linear model with $x = (x^{(1)}, x^{(2)}, v^{(1)}, v^{(2)})^T$.

2. A four state linear model with $x = (x^{(1)}, x^{(2)}, v^{(1)}, v^{(2)})^T$ and a time-varying and state dependent covariance matrix $Q(x_t)$ which is matched to the assumption that longitudinal acceleration is only, say, 1% of the lateral acceleration (Bauschlicher et al., 1989; Efe and Atherton, 1998).

3. A six state linear model with $x = (x^{(1)}, x^{(2)}, v^{(1)}, v^{(2)}, a^{(1)}, a^{(2)})^T$.

4. A six state linear model with $x = (x^{(1)}, x^{(2)}, v^{(1)}, v^{(2)}, a^{(1)}, a^{(2)})^T$ and a time-varying and state dependent covariance matrix $Q(x_t)$.

5. A five state non-linear model with $x = (x^{(1)}, x^{(2)}, v^{(1)}, v^{(2)}, \omega)^T$, together with an extended Kalman filter from Section 8.9.

In detail, the coordinated turn assumption in 2 and 4 implies that $Q$ should be computed by using $q_v = 0.01 q_w$ below:

$$h = \arctan\left(x^{(4)} / x^{(3)}\right)$$

$$Q = \begin{pmatrix} q_v \cos^2(h) + q_w \sin^2(h) & (q_v - q_w) \sin(h) \cos(h) \\ (q_v - q_w) \sin(h) \cos(h) & q_v \sin^2(h) + q_w \cos^2(h) \end{pmatrix}$$

The expressions for $B_v$ hold for the assumption of piecewise constant noise between the sampling instants. Compare to equations (8.132)-(8.136).


Figure 8.19. Target trajectory and radar measurements.

Example 8.20 Target tracking: choice of state vector

We will compare the five different motion models above on the simulated target in Figure 8.19. The difference in performance is not dramatic by only studying the Mean Square Error (MSE), as seen in Figure 8.20(a). In fact, it is hard to visually see any systematic difference. A very careful study of the plots and general experience in tuning this kind of filters gives the following well-known rules of thumb:

• During a straight course, a four state linear model works fine.

• During manoeuvres (as coordinated turns), the six state linear model works better than using four states.

• The five state non-linear model has the potential of tracking coordinated turns particularly well.

The total RMSE and the corresponding adaptation gain $q$ used to scale $Q$ is given below:

Method | 1   | 2   | 3   | 4   | 5
MSE    | 243 | 330 | 241 | 301 | 260
q      | 10  | 100 | 0.1 | 1   | 0.0001

The adaptation gains are not optimized. Several authors have proposed the use of two parallel filters, one with four and one with six states, and


Figure 8.20. Performance of different state models for tracking the object in Figure 8.19. In (a) the mean square error for the different methods are plotted (separated by an offset), and in (b) the velocity estimate is compared to the true value.

amanoeuvre detection algorithm to switch between them. The urn ratemodel with five states is often used for ATC. Note however, that according tothe examples in Section 8.9, there are at least four different alternatives forchoosing state coordinates and sampling for th e turn ra te model.

An intuitive explanation to th e rules of thumb above can be derived fromgeneral adaptive filtering properties:

Use as parsimonious a model as possible Th at is, four states during astraight course and more states otherwise. Using too many states resultsin an increased variance error.

With sufficiently many s tates, a ll manoeuvres can be seen as piecewise

constant states , and we are facing a segmentation problem rather tha na tracking problem. Th e five st at e model models coordinated turns as apiecewise constant turn rate W .

Highway traffic surveillance

The traffic surveillance system described here aims at estimating the positionof all cars using da ta from a video camera. A snapshot from the camera isshown in Figure 8.21. As an application in itself, camera stabilization can bedone with the use of Kalman filters (Kasprzak et al., 1994).

Figure 8.22(a) shows the measured position from one track, together withKalman filtered estimate using a four state linear model. Figure 8.22(b) shows

Page 344: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 344/495

334 Kalman filterina

Figure 8.21. A helicopter hovering over a highway measures the position of all cars forsurveillance purposes. Data and pictu re provided by Cent re for Traffic Simulation Research,Royal Institute of Technology, Stockholm.

the time difference of consecutive samples in th e east-west direction beforeand after iltering. Clearly, the filtered estimates have much less noise. It

is interesting to note t ha t the uncertainty caused by the bridge can also beattenuated by the filter.

The histogram of the estimation errors n Figure 8.23, where the transientbefore sample number 20 is excluded, shows a quite typical distribution forapplications. Basically, the estimation errors are Gaussian, but the tail s areheavier. This can be explained by da ta outliers. Another more reasonableexplanation is that the estimation error is the sum of the measurement errorand manoeuvres, and the latt er gives rise to outliers. The conclusion fromthis example, and many others, is tha t th e measurement errors can in manyapplications be considered as Gaussian. This is important for the optimalityproperty of the Kalman filter. It is also an important conclusion for th e changedetection algorithms to follow, since Gaussianity is a cornerstone of many ofthese algorithms.

Bearings only target tracking

The use of passive bearings only sensors is so import ant tha t it has becomeits own research area (Farina, 1998). Th e basic problem is identifiability. Ateach measurement time, the targe t can be anywhere along a line. Th at is,knowledge of maximal velocity and acceleration is necessary information fortracking t o be able to exclude impossible or unrealistic trajectories. It turns

Page 345: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 345/495

8.12 Amlications 335

Car position from computer vision system

1

-2 ~

-3

-4~-

Dlfferenllaled OSltlOn ~nwest-east data2 - riginal data

Kalman iltered data

20 400 80 100 120 140 11Time [samples]

(b)

Figure 8.22. (a) Position from computer vision system and Kalman filter estimate (withan offset 10 added to north-south direction); (b) differentiated data in east-west directionshows th at th e noise level is significantly decreased after filtering.

Residual

Figure 8.23. Distribution of the estimation errors in Figure 8.22 after sample number 20.

out that even then identifiability is not guaranteed. I t is helpful if the ownplatform is moving. There are papers discussing how th e platform should bemoving to optimize tracking properties.

It is well known that for a single sensor, the use of spherical state coor-dinates has an important advantage: th e observable subspace is decoupledfrom the non-observable. The disadvantage is tha t th e motion model must benon-linear. We point here at a special application where two or more bearingonly sensors are exchanging information. Since there is no problem with ob-servability here, we keep the Cartesian coordinates. If they are synchronized,

Page 346: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 346/495

336 Kalman filterina

X 10

0 0 2 0 4 0 6 0 8 1 l 2 1 4 1 68 2

X 10

X 10

12

10

8

6 -

4 -

2 -

Estimated position

(a ' 0

0 0 2 0 4 0 6 0 8 l l 2 I 4 l 6 l 8

Figure 8.24. Bearing measurements and target trajectory (a) and tracking performance (b).

then the best approach is probably to transform the angle measurements t oone Cartesian measurement. If they are not synchronized, then one has to usean extended Kalman filter. Th e measurement equation is then linearized asfollows:

Here R is the estimated range to th e target and C th e estimated derivative ofthe arctangent function. Th e sensor location at sampling time t is denoted Zt.

Example 8.21 Bearings only tracking

Figure 8.24(a) shows a trajectory (going from left t o r ight) and how fivesensors measure the bearing (angle) to the objec t. The sensors are not syn-chronized at all in this simulation. Figure 8.24(b) shows th e Kalman filterestimated position.

Page 347: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 347/495

8.12 Amlications 337

Figure 8.25. GPS satellites in their orbits around the ear th.

8.1 2.3. GPS

Satellite based navigation systems have a long history. I t all started with the

TRANSIT project 1959-1964 which included seven satellites. The GPS ystemused today was already nitiated 1973 and completed 1994. The AmericanGPS and Russian alternative GLONASS both consist of 24 satellites. Th eGPS satellites cover six planes, each separated 55' at a distance of 22200 kmfrom the earth , as illustrated in Figure 8.25. The orbit time is 12 hours.

The key to th e accuracy is a very accu rate atom clock in each satellite,which has a drift of less than 10-13. Th e clocks are, furthermore, calibra tedevery 12 hours when they pass th e USA. Th e satellites transmit a message(10.23 MHz) at regular and known time instants. The message includes an

identity number and the position of the satellite.Figure 8.26 illustrates the principles in two dimensions. The result in (a)

would be obtained if we knew the time exactly, and there is one unique pointwhere all circles intersect. If the clock is behind true time, we will computetoo short a distance to th e satellites, an d we get the plot in (b) . Conversely,plot (c) is obtained if the clock is ahead of tru e time. I t is obvious tha t bytrial and error, we can find the correct time and the 2D position when threesatellites are available.

If there are measurement errors and other uncertainties, we get more re-alistically thicker circles as in Figure 8.26(d), and the intersection becomesdiffuse. That is, the position estimate is inaccurate.

The GPS receiver has a clock, but it is not of the same accuracy as the

Page 348: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 348/495

338 Kalman filterina

Figure 8.26. Computation of position in 2D from distance computations in case of a perfectclock (a) and where the clock bias gives rise to a common distance error to all satellites asin (b) and (c) , and finally, in th e case of measurement uncertainty (d).

satellites. We therefore have to include its error, th e clock bias, into the statevector, whose elements of primary interest are:

d l East-west osition [m]d 2 ) North-south position [m]z ( ~ ) Up-down position [m]z ( ~ ) Clock bias times velocity of light [m]

Note the scaling of the clock bias to improve the numerical properties (allsta tes in the same order of magnitude).

Page 349: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 349/495

8.12 Amlications 339

Non-linear least squares pproach

The time delay At of the transmitted signal to each satellite j is computed,which can be expressed as

where R(j ) s the distance to and X ( j ) he position of satellite j , respectively,and c is th e velocity of light. We have four unknowns, so we need at least fourequations. Four visible satellites give a non-linear equation system and morethan four gives an over-determined equation system. These can be solved byiterative methods, but that is not how GPS systems are implemented usually.

Linearized least squares approach

Instead, we use a previous position estimate and linearize the equations. Wehave

Th e linearized measurement equation is thu s

The first part of the regression vector C ), which will be used in a Kalmanfilter, is given by

Page 350: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 350/495

340 Kalman filterina

Collecting all satellite measurements ( 5 4) n one vector jjt and the regressionvectors in a matrix C ( x ) gives th e least squares solution

Zt = ( C ( L l ) T C ( & l ) ) C(&l)TYt,

where 2t-1 is the estimate from th e previous measurements. Note th at a very

accurate clock compensation is obtained as a by-product in gt 4)

This principle is used in some GPS algorithms. The advantage of thissnapshot method is tha t no dynamical model of th e GPS movement is needed,so that no prior information about the application where it is used is needed.The drawback is that no information is obtained when less than four satellitesare visible.

Linearized Kalman filter approach

The other alternative s to use an extended Kalman filter. The following statespace model can be used:

xt+l = A x t +

A =

‘ l O O O T , 0 0 0O l O O O T , 0 00 0 1 0 0 O T , O

0 0 0 1 0 0 O T ,0 0 0 0 1 0 0 00 0 0 0 0 1 0 00 0 0 0 0 0 1 0

\ o o o o o 0 0 1

The introduced extra states ared 5 ) East-west velocity [m/s]z ( ~ ) North-south velocity [m/s]z ( ~ ) Up-down velocity [m/s]d 8 ) Clock drif t imes velocity of light [m/s]

Here the movement of the GPS platform is modeled as velocity randomwalk, and the clock drift is explicitely modeled as random walk.

Th e use of a whiteness-based change detector in conjunction with he EKFis reported in Bakhache and Nikiforov (1999). An orbit tracking filter similarto this problem is found in Chaer et al. (1998). Positioning in GSM cellularphone systems using range information of th e same kind as in the GPS isreported in Pent et al. (1997).

Page 351: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 351/495

8.12 Amlications 341

Performance

Th e errors included in the noise et and their magnitude in range error arelisted below:

0 Satellite trajectory error (1 m).

0 Satellite clock drift (3 m).

0 Selected availability (SA) (60 m). Up to the year of 2000, the US militaryadded a disturbance term to the clock of each satellite. This ter m wastransmitted separately and coded so it was not available to civil users.The disturbance was generated as filtered white noise, where the timevariations were designed so it takes hours of measurements before thevariance of average becomes significantly smaller. One solution to th athas been developed to circumvent his problem is to use DigerentialG P S ( D G P S ) .Here fixed transmitters in base stations on the groundsend out their estimate of the clock bias te rm , which can be estimatedvery accurately due to the position of the base station being knownexactly. Such base stations will probably be used even in the fu tu re toimprove positioning accuracy.

0 Refraction in the troposphere 2 m) and onosphere (2 m) , an d eflections

(1 m).

0 Measurement noise in the receiver (2 m).

All in all, these error sources ad d up to horizontal error of 10 m (100 m withthe SA code). Using differential GPS, the accuracy is a couple of meters. ATCsystems usually have at least one base sta tion at each airport. It is probablya good guess th at th e next generation mobile phones will have built-in G PS,where the opera tors’ base stations act as GPS base stations as well.

Long term averaging can bring down the accuracy t o decimeters, and byusing phase information of th e carrier between two measurement points wecome down to accuracy in te rms of millimeters.

Page 352: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 352/495

Change detection based onlikelihood ratios

9.1.Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3439.2. Th e likelihood approa ch . . . . . . . . . . . . . . . . . . 46

9.2.1. Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 346

9.2.2. Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . 347

9.2.3. Likelihood ratio . . . . . . . . . . . . . . . . . . . . . . . . 348

9.3. The GLR test . . . . . . . . . . . . . . . . . . . . . . . . 349

9.4. The MLR test . . . . . . . . . . . . . . . . . . . . . . . . 353

9.4.1. Relation between GLR and MLR . . . . . . . . . . . . . . 353

9.4.2. A two-filter implementation . . . . . . . . . . . . . . . . . 355

9.4.3. Marginalization of th e noise evel . . . . . . . . . . . . . . 361

9.4.4. State and variance ump . . . . . . . . . . . . . . . . . . . 363

9.4.5. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 364

9.5. Simulation study . . . . . . . . . . . . . . . . . . . . . . . 365

9.5.1.A Monte Carlo imulation . . . . . . . . . . . . . . . . . . 365

9.5.2. Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . 369

9.A. Derivation of the GLR test . . . . . . . . . . . . . . . . 70

9 . A . l . Regression model for the ump . . . . . . . . . . . . . . . 370

9 .A.2 . The GLR est . . . . . . . . . . . . . . . . . . . . . . . . . 3719.B. LS-based derivation of the MLR test . . . . . . . . . . 72

9.1. Basics

This chapter is devoted to the problem of detecting additive abrupt changesin linear sta te space models . Sensor and actua tor faults as a sudden offsetor drift can all be modeled as additive changes . In addition. disturbances aretraditionally modeled as additive sta te changes . Th e likelihood ratio formula-tion provides a general framework for detecting such changes. an d to isolatethe fault/disturbance

Adaptive Filtering and Change DetectionFredrik Gustafsson

Copyright © 2000 John Wiley & Sons, LtdISBNs: 0-471-49287-6 (Hardback); 0-470-84161-3 (Electronic)

Page 353: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 353/495

Page 354: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 354/495

9.1 Basics 345

High gain (Q: = 6t--kaI)I4

U t 21, P1Y t Filter

N o gain (Q: = 0)Hyp. test

P , PW

t

Y t FilterP O , PO

Figure 9.1. Two parallel filters. One is based on the hypothesis no change ( H o )and the

other on a change at time k (H1 k ) ) . By including more hypothesis for k , a filter bank isobtained.

Here we have split the Kalman filter quantities as

so the covariance matrix of th e change (fault component) is P/'. Note that

K 0 before the change. The following alternatives directly appear:Kalman filter-based adaptive filtering, where the st ate noise covariance

is used to track 19.

Whiteness-based residual test, where the Kalman filter innovations are

used and the sta te covariance block Q, is momentarily increased whena change is detected.

A parallel filter structure as in Figure 9.1. Th e hypothesis test can beaccomplished by one of the distance measures in Chapter 6.

This chapter is devoted t o customized approaches for detecting and ex-plicitely estimating the change v. The approach is based on Likelihood Ratio( L R )tests using Generalized Likelihood Ratio ( G L R )or Marginalized Likeli-hood Ratio ( M L R ) .The derivation of LR is straightforward from 9.3), but thespecial structure of the sta te space model can be used to derive lower orderfilters. Th e basic idea is th at th e residuals from a Kalman filter, assuming nochange, can be expressed as a linear regression.

Page 355: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 355/495

346 Chanae detection based on likelihood atios

Linear regression formulationThe nominal Kalman ilter, assuming no ab rupt change, is applied,and the additive change is expressed as a linear egression with th e

innovations as measurements with the following notation:

Kalman filter Q - l , E t

Auxiliary recursion pt, ptResidual regression E t p?. etCompensation xt X t l t - l +P P .

The third equation indicates that we can use RLS to estimate th e changev ,

nd the fourth equation shows how to solve the compensationproblem afterdetection of change and es timation (isolation) of v.Chapter 10 gives an alterna tive approach to this roblem, where the change

is not explicitely parameterized.

9.2. The ikelihood approach

Some modifications of the Kalman ilter equations are given, and the ikelihoodratio is defined for the problem at hand.

9 2 1 Notation

The Kalman filter equations for a change v E N(0, Pv) at a given time k followsdirectly from 8.34)- 8.37) by considering v as an extra state oise componentV t 6t-,p, with Qt 6t-kPv.

Th e addressed problem is to modify these equations to the case where k andv are unknown. The change instant k is of primary interest, but good st at eestimates may also be desired.

Page 356: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 356/495

9.2 The likelihood amro ach 347

In GLR, v is an unknown constan t, while it is considered as a stochasticvariable in the MLR test. To start wi th, the change will be assumed t o haveaGaussian prior. Later on, anon-informative prior will be used which s

sometimes called a prior of ignorance; see Lehmann (1991). Th is prior ischaracterized by a constant density function, p(v) = C.

Example 9.1 Modeling a chang e in the mean

We can use Eq. (9.1) to detect ab rupt changes in the mean of a sequence ofstochastic variables by letting At 1, C, 1, Qt 0, Bu,t 0. Furthermore,if the mean before the change is supposed to be 0, a case often consideredin the literature (see Basseville and Nikiforov (1993)), we have zo 0 and

no = 0.

It is worth mentioning, th at parametric models from Part I11 can fit thisframework as well.

Example 9.2 Modeling a chang e in an ARX model

By letting At = I and Ct = (yt-1, yt-2, . . . ,ut-1, ut-2, . . . ), a special caseof equation (9.1) is obtained. We then have a linear regression description ofan ARX model, where xt is the (time-varying) parameter vector and C, theregressors. In this way, we can detect ab rupt changes in the transfer functionof ARX models. Note tha t the change occurs in th e dynamicsof the systemin this case, and not in the system’s state.

9.2.2. likelihood

The likelihood for the measurements up to time given the change v at timek is denoted p ( y N I k ,v). The same notation s used for the conditional densityfunction for y N , given k ,v. For simplicity, k N is agreed to mean no change.There are two principally different possibilities to estimate th e change time L:

Joint ML estimate of k and v,

Here arg m a ~ k ~ [ ~ , ~ l , ~ p ( y ~ I k ,) means the maximizing argumen ts of the

likelihood p ( y N I k , ) where k is restricted to [ l , NI .

Page 357: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 357/495

348 Chanae detection based on likelihood atios

The ML estimate of just k using marginalization of the conditional den-sity function p ( y N I k ,):

The likelihood for data given just k in (9.6) is the sta rting point in this ap-proach.

A tool in the derivations is the so-called flat prio r, of the form p(v) C,which is not a proper density function. See Section 7.3.3 for a discussion and

two examples for the parametric case, whose conclusions are applicable hereas well.

9.2.3. likelihood ratio

In the context of hypothesis testing, the likelihood rati os rather than the ikeli-hoods are used. The LR test is a multiple hypotheses test, where the differentchange hypotheses are compared to th e no change hypothesis pairwise. Inthe LR test, the change magnitude is assumed to be known. The hypotheses

under consideration are

H0 : no hange

H l ( k ,v a change of magnitude v at time k .

The test is as follows. Introduce the log likelihood ratio for the hypotheses asthe test statistic:

The factor 2 is just for notational convenience. We use the convention thatH 1(N , v Ho, so again, k N means no change. Then the LR estimate canbe expressed as

when v is known. Exactly as in (9.5) and (9.7), we have two possibilities ofhow to eliminate the unknown nuisance parameter v. Double maximizationgives the GLR test, proposed for change detection in Willsky and Jones (1976),and marginalization the MLR test, proposed in Gustafsson (1996).

Page 358: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 358/495

9.3 The GLR test 349

9.3. The GLR test

Why not jus t use the augmen ted stat e pace model (9.3) and the Kalman ilter

equations in (9.4)? It would be straightforward to evaluate the likelihood ratiosin (9.8) for each possible k . The answer is as follows:

The GLR algorithm is mainly a computational tool that splitsthe Kalman filter for th e full order model (9.3) into a low orderKalman filter (which is perhaps already designed and running)and a cascade coupled filter bank with least squares filters. I

The GLR test proposed in Willsky and Jones (1976) utilizes this approach.GLR’s general applicability has contributed to it now being a sta ndard toolin change detection. As summarized in Kerr (1987), GLR has an appealinganalytic framework, is widely understood by many researchers and is readilyapplicable to systems already utilizing a Kalman filter. Another advantagewith G LR is th at it partially solves th e isolation problem in fault detection,i.e. to locate t he physical cause of th e change. In Kerr (1987), a number ofdrawbacks with GLR is pointed out as well. Among these, we mention prob-lems with choosing decision thresholds, and for some applications an untenable

computational burden.The use of likelihood ratios n hypothesis esting is motivated by the

Neyrnan-Pearson Lem ma;see, for instance, Theorem 3.1 in Lehmann (1991).In the application considered here, it says th at the likelihood ratio is the op-timal test statistic when the change magnitude is known and just one changetime is considered. This is not the case here, but a sub-optimal extension isimmediate: the test is computed for each possible change time, or a restrictionto a sliding window, and if several tests indicate a change the most significantis taken as the estimated change time. In GLR, the actua l hange in the sta te

of a linear system is estimated from da ta an d then sed in th e likelihood ratio.Starting with th e likelihood ratio in (9.8), the GLR

mization over k and v ,

test is a double maxi-

where D k ) is the maximum likelihood estimate of v , given a change at timek . The change candidate i n the GLR test is accepted if

Z~ i,fi i)) h. (9.10)

Page 359: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 359/495

350 Chanae detection basednikelihoodatios

The threshold h characterizes a hypothesis test and distinguishes the GLRtest from the ML method (9.5). Note tha t (9.5) is a special case of (9.10),where h = 0. If the zero-change hypothesis is rejected, the state estimate can

easily be compensated for the detected change.The idea in the implementation of GLR in Willsky and Jones (1976) is to

make the dependence on v explicit. This task is solved in Appendix 9.A. Th ekey point is tha t th e innovations from th e Kalman filter (9.4) with k N canbe expressed as a linear regression in v ,

where E t ( k )are the innovations from the Kalman filter if v and k were known.

Here and in the sequel, non-indexed quantities as E t are the outpu t fromthe nominal Kalman filter, assuming no change. The GLR algorithm can beimplemented as follows.

Algorithm 9 7 GLR

Given the signal model (9.1):

Calculate the nnovations from the Kalman filter (9.4) assuming no change.

Com pute the regressors cpt(k)using

initialized by zeros at time t = k ; see Lemma 9.7. Here p t is n, X 1 andis n, X n,.

Com pute the linear regression quantities

for each k , 1 5 k 5 t .

At time t N , the test stat istic is given by

Page 360: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 360/495

9.3 The GLR test 351

A change candidate is given by k = arg max 1 ~ ( k ,( k ) ) . It is acceptedif Z~ i,fi i))s greater than some threshold h (otherwise k N ) andthe corresponding estimate of th e change magnitude is given by C N ~ )

n,l(i)fN(i).

We now make some comments on the algorithm:

It can be shown that the test statistic ZN(L,. L)) under th e null hypoth-esis is x 2 distributed. Thus, given the confidence level on the test , thethreshold h can be found from standard statistica l tables. Note t ha t th isis a multiple hypothesis test performed for each k = 1 ,2 , . . . ,N 1, sonothing can be said about the total confidence level.

The regressor p t k ) is called a failure signature matrix in Willsky andJones (1976).

Th e regressors are pre-computable. Furthermore, if the system and theKalman filter are time-invariant, t he regressor is only a function of t ,which simplifies the calculations.

Th e formulation n Algorithm 9.1 is off-line. Since the es t statisticinvolves a matrix inversion of R N , a more efficient on-line method is asfollows. From (9.34) and (9.37) we get

W f ? k ) C t k ) ,

where t is used as time index instead of N . The Recursive Least SquaresRLS) scheme (see Algorithm 5.3), can now be used to upda te C t ( k )

recursively, eliminating the ma trix inversion of Rt(k) . Thus, the bestimplementation requires t parallel RLS schemes and one Kalman filter.

Th e choice of threshold is difficult. I t depends not only upon the system'ssignal-to-noise ratio, but also on the actual noise levels, as will be pointed outin Section 9.4.3.

Example 9.3 DC motor: the GLR test

Consider the DC motor in Example 8.4. Assume impulsive additive statechanges at times 60, 80, 100 and 120. Fi rs t he angle is increased by iveunits, and then decreased again. Then the same fault is simulated on angularvelocity. That is,

= 3 v2 = (;5) , v 3 = 3 v 4 = (_os).

Page 361: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 361/495

352 Chanae detection based on likelihood atios

Test statistic and threshold30

20

1 0

-

1 I 1

‘0 d o G , i kl;6G\a, 60

“ 50 2000 80 100 120 140 160

KF estimateGLR estimate

Q

‘0 2000 80 100 120 140 160

Figure 9.2. Test statistic max-L<k<t l t ( k ) for change detection on a simulated DC motor(first) and state estimates from GLR test and the Kalma n filter (second and third). Note,in particu lar, the improved angle tracking of GLR.

Figure 9.2(a) shows how th e maximum value maxt-L<k<tZt(k) of the es tstatistics evolves in time, and how it exceeds the threshold level h = 10 fourtimes. Th e delay for detection is three samples for angular change and fivesamples for velocity change.

The GLR state estimate adapts o the tru e sta te as hown in Figure 9.2(b).The Kalman filter also comes back t o the tru e stat e, bu t much more slowly.The change identification is not very reliable:

Compared to th e simulated changes, these look ike random numbers. Th eexplanation is that detection is so fast th at there are too few da ta for faultestimation. To get good isolations, we have to wait and get considerably moredata. The incorrect compensation explains the short transients we can see inthe angular velocity estimate.

Navigation examples and references to such are presented in Kerr (1987).As a non-standard application, GLR is applied to noise suppression in imageprocessing in Hong and Brzakovic (1980).

Page 362: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 362/495

9.4 The MLRtest 353

9.4. The MLR test

Another alternative is to consider the change magnitude as a stochastic nui-

sance parameter. This is then eliminated not by estimation, but by marginal-ixation. Marginalization is wellknown in estimation theory, and is also used inother detection problems; see, for instance, Wald (1950). Th e resulting testwill be called the Marginalixed Likelihood Ratio ( M L R )test . The MLR testapplies to all cases where GLR does, bu t we point out three advantages withusing the former:

Tuning. Unlike GLR, there is no sensitive threshold to choose in MLR.One interpretation is that a reasonable threshold in GLR is chosen au-

tomatically.

Ro bus tne ss to modeling errors.The performance of GLR deteriorates inth e case of incorrectly chosen noise variances. The noise level in MLR isallowed to be considered as another unknown nuisance parameter. Thisapproach increases the robustness of MLR.

Complexity.GLR requires a linearly increasing number f parallel filters.An approximation involving a sliding window technique is proposed in

Willsky and Jones (1976) to ob ta in constant number of filters, typicallyequivalent to 10-20 parallel fi lters. For off-line processing, the MLR testcan be computed exactly from only two filters. This implementation isof particularly great impact in the design step. Here the false alarm rate ,robustness properties and detectability of different changes can be eval-uated quickly using Monte-Carlo simulations. In fact, the computatio nof one single exact GLR test for a realistic dat a size (> 1000) is alreadyfar from inter-active.

9.4.1. Relation between GLR and MLR

In Appendix 9.B the MLR test is derived using the quantities from the GLRtest in Algorithm 9.1. This derivation gives a nice relationship between GLRand MLR. In fact, they coincide for a certain choice of threshold.

Lemma 9.1I f (9.1) is time invaxiant and v is unknown, then the GLR test in Algorithm9.1 gives the same estimated change time as the MLR test in Theorem 9.8 as

N k + and k + i f he threshold is chosen as

h = p log(27~) + log det RN k ) y ( i . )

Page 363: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 363/495

354 Chanae detection based on likelihood atios

whe n the p rio r o f t he jum p i sE N(vo, P,), and

h = log det & ( k )

for a flat prior. HereR N ( ~ )liml\r-k+oo,k+ooRN(k), and & ( k ) is definedin Algor i thm9.1.

Proof: In the MLR test a change k is detected if Z N ~ )> ZN(N) 0 and inthe GLR if 1 ~ ( k ,k)) > h. From Theorem 9.8 we have Z N ~ )= Z ~ l c ,v k)) +2 logp,(i/) -log det R N ( ~ )plog(27r). Lemma 9.9 shows tha t & ( k ) convergesas N + 00 and so does log det R N ( ~ ) . ince (9.1) is restricted t o be timeinvariant the terms of & ( k ) th at depend on th e system matrices and theKalman ga in are the same independen tly of k as k + 00 according to (9.28).

Note that (3.48) follows with p 1 and R k ) Cf lTRP11 ( tk ) R *

The threshold is thus automatically included in the MLR test. If we wantMLR to mimic a GLR test, we can of course include an external thresholdh M L R = h G L R + 2logp,(i.) plog(27r) ogdet & ( k ) . In th at case, weaccept the change hypothesis only if Z N ~ )> h M L R . The external thresholdcan also be included in an ad hoc manner to tune th e false alarm rate versus

probability of correct detection.We now make a new derivation of the MLR test in a direct way using a

linearly increasing number of Kalman filters. This derivation enables first theefficient implementation in the Section 9.4.2, an d secondly, th e elimination ofnoise scalings in Section 9.4.3. Since the magnitudes of the likelihoods tu rnout to be f completely different orders, the og likelihood will be used in orderto avoid possible numerical problems.

Theorem 9.2

Consider the signal model(9.1)’ where th e covar iance m atr ixof th e Gaussiandis t r ibu ted jump magni tude i sP,. For each k 1 ,2 , . . . , , upda te the k ’ thKalman f i l te r in (9.4). T h e log l ike lihood, condi tioned on a ju m p a t t im ek ,can be recursively computedby

logp(yt1k) = logp(yt-lIk) og 27r2

(9.11)

12

ogdet St (k ) - ~ ~ k ) S ~ ~ k ) ~ t k ) .

Proof: By Bayes’ rule we have

Page 364: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 364/495

Page 365: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 365/495

356 Chanae detection based on likelihood atios

Theorem 9.3Consider the signal model (9.1) for the case o f an inver tibleAt. T h e l ike li -hood for the measurements condi tioned on a ju m p a t t im el and the last Lmeasurem ents, can be computed by two Kalman f i l ters as follows. First , thelikelihoods axe separated,

T h e l ikelihoods involved are computed by

Here ? X p P ) s the Gaussian probabi l i ty densi ty funct ion. The quant i t iesctplnd PGpl axe given b y the K alma n f i l te rapplied to the forward mo deland P +, and P$+l axe given by the K alman f i l te r appl ied on the backw ardmode l (9.12). Theuant i t ies and P N used forormalizationregiven b y the Kalm an f i l te r appl ied on the?-l rwaxd model init ia ted at t imet = N L + 1 with PN-L+~IN-L = I I N - L + ~ .

Proof: Bayes' law gives

(9.17)

(9.18)

Page 366: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 366/495

9.4 The MLRtest 357

The fact that the jump at time k does not affect the measurements beforetime k (by causality) is used in th e last equality, so p(y 1k) p(&>. Here, theinfinite variance jump makes th e measurements after th e ju mp ndependent of

those before.Th e likelihood for a set yk can be expanded either forwards or backwards

using Bayes' chain rule:

t=m

Now p ( y N I k N ) and p(yk) are computed using the forward recursion (9.21),and since xt is Gaussian, it follows immediately tha t ytlyt-' is Gaussian withmean Ct2 l and covariance C -lCF + Rt, and (9.14) follows.

Also, p(y -,+,lk N ) s computed in th e same way; th e difference is th at

the Kalman filter is initiated at time N L + 1. Finally, p(yr<Lly -L+l, k )is computed using (9.22) where ytlygl is Gaussian with mean Cti?:t+l and

covariance CtP +lCr Rt and (9.16) follows. 0

As can be seen, all th a t is needed to compute he likelihoods are oneKalman filter running backwards in time, one running forwards in time, andone processing the normalizing data at the end. The resulting algorithm isas follows, where th e log likelihoods are used because of possible numericalproblems caused by very large differences in the magnitude of the likelihoods.The notation introduced here will be used in the sequel.

Page 367: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 367/495

Page 368: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 368/495

9.4 The MLRtest 359

Backward information filter or t = N , N 1, . . ,N L + 1:

Backward Kalman filter for t = N L , N L 1, . . , l :

PN-LIN-L+lfrom backward information filter

Z N - L I N - L S l =P-LIN-L+laN-LIN-L+lB - B

V B ( N L + 1) =oD B ( N L + 1) =o

E t =yt ctqt,, Du,tUt

S,B =CtP$+lCT +Rt

%lit t q t 1- B A-1-B + A , Ptlt+1Ct(St ) E t Bu,t-lUt-l

B B -1 B

=At' .,p,+, P$+l cT(s,B)lctP$+l) A t T

+V B ( t ) V B @ 1) + (E;)T(S:)-l&FD B ( t ) D B ( t+ 1) + logdet SF

Compute the likelihood ratios

ZN(k)= V F ( N ) V N ( N ) V F ( k ) V B ( k + 1) +D F ( N ) D N ( N ) D F ( k ) DB k + 1)

Page 369: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 369/495

360 Chanae detection based on likelihood atios

We make some remarks on the algorithm:

The normalization filter and the backward information filter play a mi-nor role for the likelihood, and might be omi tted in an approximatealgorithm.

The jump magnitude, conditioned on th e ju mp time, can be estimatedfrom the available information using the signal model (9.1) an d the ilterstate estimates:

This is an over-determined system of equations.

The a posteriori probability of k is easily computed by using Bayes' law.Assuming p ( k ) C,

This means tha t the a posteriori probability of a wrong decision can becomputed as 1 ( k l y N ) .

The relation to fixed-interval smoothing is as follows. The smoothedestimates under the no jump hypothesis can be computed by

Here 2 and P$ are the filtered est imates from the orward filter (these

are not given explicitly above).

If the d ata a re collected in batches, t he two-filter algorithm can be ap-plied after each batch saving computation time.

It should be stressed for the the orem that it s necessary for this two-filter im-plementation that the jump is considered as stochastic with infinite variance,which implies the important separability possibility (9.13). If not, the theoremwill provide a sub-optimal algorithm with good properties in practice.

A elated two-filter idea is found n Niedzwiecki (1991), where a sub-optimal two-filter detection algorithm is proposed for detecting changes inthe parameters of a finite impulse response model.

Page 370: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 370/495

9.4 The MLRtest 361

9.4.3. Marginalization of the noise level

Introduction

Knowledge of th e covariance matrices is crucial for th e performance of model-based detectors. The amount of prior information can be substantially relaxed,by eliminating unknown scalings from th e covariance matrices in th e st atespace model (9.1):

Here the matrices without a bar are chosen by t he user, and those with barare the ‘true’ ones or at least give good performance. This means th a t onechooses the tracking ability of the filter, which is known to be insensitiveto scalings; see Anderson and Moore (1979). The estimator then estimatesthe ac tual level, which is decisive for the likelihoods. The assumption (9.23)implies Pt XPt and 3 AS,, and from (9.36) it follows that

For the GLR test, this implies th a t if all covariance matrices are scaled afactor X, then the optimal threshold should be scaled a factor X as well. Thus,it is the ratio between the noise variance a nd the threshold t ha t determinesthe detect ion ability in GLR. In this sense, th e problem formulation is over-parameterized, since both X and the threshold have to be chosen by the user.

Equation (9.23) is an interesting assumption from the practical point ofview. Scaling does not influence the filtered sta te est imates. It is relativelyeasy for the user to tune the trackin g bility, bu t the sizes of t he covariances areharder to judge. The robustness t o unknown scalings is one of t he advantageswith the MLR test, as will be shown.

Another point is that a changing measurement noise variance is known tocause problems to many proposed detection algorithms. This is trea ted hereby allowing the noise covariance scaling to be abrupt ly changing.

A summary of the MLRs is given in Section 9.4.5.

State ump

In this section, the two filter detector in Theorem 9.3 will be derived for theunknown scaling assumption in (9.23). If At is not invertible as assumed inTheorem 9.3, the direct implementation of MLR in Theorem 9.2 can also bemodified in the same way. The following theorem is th e counterpart to th etwo filter detection method in Algorithm 9.2, for the case of an unknown X.

Page 371: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 371/495

362 Chanae detection based on likelihood atios

Theorem 9.4Consider the signal model (9.1) in the case of an unknown noise variance,and suppose that (9.23) holds. With notation as in the two filter detector inAlgorithm 9.2, the log likelihood ratios are given by

Z N ( k )= ( N p 2 ) (log(VF(N) V N ( N ) )

og(VF(k) + VB(k + 1 ) ) )

+ D F ( N ) D N ( N ) D F ( k ) DB k + 1)

with a flat prior on X.

Proof: The flat prior on X here means p(Xly;-,+,) = C , corresponding tobeing completely unknown. By marginalization, if k < N L we get

N - L NP(Y IYN-L+17 k ,

= Jum P(Y I Y N - L + ~ ,) P ( X I Y; - L + ~ ) ~ X- L N

v)VF(k) + v ( k ) ) - W .

The gamma function is defined by r ( a ) = O0 ePtta- 'dt . The last equalityfollows by recognizing the integrand as a density function, namely the inverseWishart distribution

(9.24)

Page 372: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 372/495

9.4 The MLRtest 363

which integrates t o one. In th e same way, we have for k = N

and result follows. 0

Here we note that the a posteriori distribution for X, given th e j um p in-stant k , is W- ' ( N p , V F ( k )+VB (k)), where W-' denote s the inverse Wishartdistribution (9.24).

9.4.4. State and variance jumpIn this section, the likelihood is given for the case when the noise variance isdifferent before and after the jump. This esult is of great practical relevance,since variance changes are very common in real signals.

Theorem 9.5Consider the same detection problem as in Theorem 9.4, but with a noisevariance changing at the jump ins tan t,

= {X 1 , i f t I kX2 , f k < t 5 N.

With notation as in Algorithm 9.2, the log likelihood ratios for 211, < k <N 2/p are given by

Z N ~ )= ( N p 2) log(VF(N) VN (N ))

( k p 2) log VF(k) (Np kp 2) log VB(k + 1 ) )

+ D F ( N ) DN(N) D F ( k ) D B ( k+ 1)

if the prior on X is Aat.

Proof: Th e proof resembles th a t of Theorem 9.4. Th e difference is th at th eintegral over X is split into two integrals,

P(Y I Y N - L + ~ , ) =P Yk)p YF+I+IYNNL+, k )- L N

= 1 (YkI~1)P(X1IYNN-~+')dX1

. S p y F + l l Y L + l ,L, X2)P(X2IYL+l)dX2. (9.25)

Eachntegral is evaluated xactly as inheorem 9.4. 0

Page 373: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 373/495

364 Chanae detection based on likelihood atios

Remark 9.6For this particulax prior onX, the integrals in 9.25), and th us also th e like-lihood, are not defined fork p < 2 and N p p < 2. Th is is logical, becausetoo little data are available t o evaluate t h e noise variance.

In this case the a posteriori distribution for XI given the jump instant L,is W - ' ( k p , V F ( k ) )nd for X2 it is W- l ( ( N ) p ,VB(k)).

9.4.5. Summary

We can conveniently summarize th e resu lts n he hree different cases asfollows: The MLR s in Theorems 9.4, 9.5 and Algorithm 9.2 are given by

in the cases known lambda (i), unknown constant lambda (ii) and unknownchanging lambda (iii), respectively.

The model and data dependent quantities V(k) and D ( k ) are all givenby th e two filter Algorithm 9.2. The decision for which likelihood is t obe used can be deferred until after these quantities are computed. Inparticular, all three possibilities can be examined without much ex tracomputations.

The dominating terms are VF(k) and VB(k). When the noise is un-known, ( V F ( k )+ VB(k))/X is essentially replaced by Nlog(VF(k) +V B ( k ) )nd k log V F ( k )+ N ) og VB (k), respectively. This leads toa more cautious estimator that diminishes the influence of the innova-tions.

The term D F ( N ) D N ( N ) D F ( k ) - D B ( k + l ) appears in all likelihoodratios. It is positive and does not vary much for different jump instantsk = 1 , 2 , . . . ,N 1. This term corresponds to th e hreshold in the GLRtest.

These methods are compared in the following section.

Page 374: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 374/495

9.5 Simulation studv 365

9.5. Simulation study

The simulation study investigates performance, robustness and sensitivity for

GLR and the three different MLR variants for a first order motion model.Th e applicability of GLR is wellknown, as mentioned in the introduction.The MLR uses the same model and, thus, can be pplied to th e ame problemsas GLR. Therefore, the purpose of t he current section is t o show what can begained in robustness and computational burden y using MLR instead of GLRillustrated by a quite simple example. A quite short data length will be used,which allows us t o compute the exact GLR test.

A sampled double integrator will be examined in this comparative studyof the different methods. For instance, it can be thought of as a model for the

position of an object influenced by a random force. The sta te space model forsample interval 1 is

where

The jump change corresponds to a sudden force disturbance, caused by a ma-noeuvre, for example. All filters are initialized assuming illo N(0,lOOOXI)and the number of measurements is N . Default values are given by

v =(5, 1 0 y

k =25

N =50

X = l

R = l .

The default values on X and R are used to compute the change detectors. T hedetectors are kept the same in all cases, and the data generation is varied inorder t o examine the robustness properties.

9 5 1 A Monte Carlo simulation

Th e following detection methods are compared:

Page 375: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 375/495

366 Chanae detection based on likelihood atios

GLR in Algorithm 9.1 and using a sliding window of size 10 (referred t oas GLR( 10)).

MLR in Algorithm 9.2 for an assumed known scaling X 1 (MLR l ) ,Theorem 9.4 for unknown scaling (MLR 2) and Theorem 9.5 for unknownand changing scaling (MLR 3), respectively.

For the two-filter methods MLR 1-3, five ex tra da ta po ints were simulatedand used for initialization. Table 9.1 shows the ala rm rates for no jump andjump, respectively, while Table 9.2 shows the estimated jump time n the casesof a jump. The left column indicates what has been changed from the perfectmodeling case.

We note t hat t he sliding window approximation of GLR is indistinguish-able from th e exact implementation. The ala rm rates of GLR for th e perfectmodeling case are slightly larger th an for MLR with the chosen threshold.Taking this fact into account, there is no significant difference between GLRand MLR 1. MLR 2 and 3 give somewhat larger false alarm rates and smallerdetection probabilities in th e perfect modeling case. This is no surprise as lessprior information is used, which becomes apparent in these short da ta sets.The cases where X < 1 or R < 1, both implying that the measurement noisevariance is smaller than expected, cause no problems for GLR and MLR 1.

Note, however, how MLR 2 and 3 take advantage of this situation, and herecan detect smaller changes th an can GLR and MLR. In th e case X = 0.01 andv [2; 41 the probability of detection is 50%, while MLR 1 only detects 2% ofthese changes.

Th e real problem is, of course, th e cases where th e measurement noisevariance is larger tha n modeled. Here the false alarm rate for GLR and MLR1 is close to one. On the othe r hand, MLR 2 has a very small and MLR 3 afairly small false alarm rate. Of course, it becomes harder t o detect the fixedsize change tha t is hidden in large noise.

The cases of a suddenly increasing noise scaling is excellently handled byMLR 2 and 3. Th e former gives no alarm, because this kind of change is notincluded in the model, and the latt er qui te correctly estimates a change attime 25.

We can also illustrate the difference in performance by plotting the averagelog likelihood ratios 210gp(yNIk, ’ /p(yNIk = N ) and 210gp(yNIk)/p(yNIk =N ) , espectively, as a function of change time k 1 ,2 , . . . N . This is done inFigure 9.3 for GLR and MLR 1,2,3. A change is detected if the peak value ofthe log likelihood ratio is larger than zero for MLR and larger than h 6 forGLR. Remember that the GLR log likelihood ratio is always positive.

Th e first plot shows the perfect modeling case, an d the peak values arewell above the respective thresholds. Note that MLR 1 and GLR are very

Page 376: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 376/495

9.5 Simulation studv 367

Table 9.1. Alarm rate for 1000 Monte Carlo simulations for different cases of modelingerrors. In each case, a sta te change [5;10] and no change are compared. Sensitivity toincorrect assu mpt ions on the noise variances is investigated, where X deno tes the scaling of

Q and R.Case

l.011= 100l.15100 times increase in X, changel.14100 times increase in Xl.14 0.9910 times increase in X, change

0.99 0.10 0.9710 times increase in X0.97.91.961erfect modeling, change at t = 100.91.88.941erfect modeling, change at t = 400.04.01.02 0.10 0.10erfect modeling, change [l;2 ]0.95 0.92 0.97.99 0.99erfect modeling, change0.04.01 0.01.08 0.08erfect modeling

MLR 3LR 2 MLRGLR(10)LR

X = 100, change 1 1 1 1 l I 0.01 I lX = 10 1 l I 0.01 I 0.43

similar in their shape except for a constant offset as stated already in Lemma9.1.

Th e second plot illustra tes what happens after an abrupt change in noisescaling. GLR and MLR 1 become arge for all t > 25 and he estimatedchange times are distributed over the interval [25,50]. MLR 2, which assumesan unknown and constant scaling, handles this case excellently without anypeak, while MLR 3 quite correctly has a peak at t 25 where a change inscaling is accurately estimated.

Page 377: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 377/495

368 Chanae detection based on likelihood atios

Table 9.2. Estimated change time for 1000 Monte Carlo simulations for different cases of

modeling errors.

1

lo; 5 10 1'5 20 25 30 35 40 45 d0

Jumpl ime

( 4 (b)

Figure 9.3. Log likelihood ratios for GLR (solid) and MLR 1 2 3 (dashed, dashed-dottedand dotted, respect ively) averaged over 100 realizations. Plot for perfect modeling and achange (a), and plot for the case of no ch ange but the noise variance changes abruptly from1 to 100 at time 25 (b).

Page 378: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 378/495

Page 379: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 379/495

370 Chanae detection based on likelihood atios

9.A. Derivation of the GLR test

9.A.1. Regression model for the jump

First, process the measurements through a Kalman filter acting under thehypothesis of no jump. With notation as in (9.4), denote its state estimate,gain, prediction error and prediction error covariance by

Next, suppose that ther e was a jump v at time k . Due to the linear model,the dependence of v on the state estimates and innovations will be linear aswell. That is, we can postu late the following model:

(9.26)

(9.27)

Here ? t l t (k ) and ~ t ( k )re he quantit ies one would have obtained from aKalman filter applied under the assumption that we have a jump at time k .These Kalman filters will, however, not be used explicitly, and th at is the keypoint. Again, we follow th e convention that k is equal to th e final time, heret , means no jump. We have th e following update formulas for the n, X n,matrix , u t (k )and the n, X ny matrix cpt(k), first given in Willsky and Jones(1976).

Lemma 9.7Consider he res iduals rom he Kalma n i l te r appl ied o he model 9.11,assuming no jum p. The re la t ion be tween the res iduals f rom a Kalm an f i l te rcondit ioned on a jum p of m ag ni t ud ev a t t ime k , a nd f r om a Ka lma n fi l te rcondi tioned on no ju m p i sgiven by the l inear regression

E t @ ) = E t + cpF(k)Be,tv,

where ~ t ( k )s a white noise sequence with variancet. Here cpt(k) is computedrecursively by

(9.28)

Pt+ lW =AtPt W + Kt+lcpF+,(k), (9.29)wit h th e initial conditionsp k ( k ) 0 and c p k ( k ) 0. Here Kt i s t he Ka lma ngain a t t ime t .

Page 380: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 380/495

9.A Derivation of the GLR test 371

Proof: Th e result is proved by induction. The nitial condition is a directconsequence of th e signal model (9.1). The induct ion step follows from theKalman equations, (9.26) and (9.27). Firs t we have th e relation z t k ) =

xt +nkk AiB0,tv. This and equations (9.26) and (9.27) give

v t+ 1Be , t v =, +l k ) & t + lT

= G + 1 ( Z t + l ( k ) Q + l ) C t + l A t (?tl t(k) %It)

and

P t + l ( k ) B e , t v= & + l l t + l W t + l l t + l

=At ( Q t ( k ) Q t ) +Kt+l ( E t + l + cpt+1Be,tv t + l )

=Atpt(k)Be,tv + Kt+lcpF+lBe,tv.

T

Since this holds for all Be,tv, the result follows y induction. 0

The detection problem is thus moved from a state space to a linear regres-sion framework.

9.A.2. The GLR testWe begin by deriving the classical GLR test, first given in Willsky and Jones(1976), where the jump magnitude is considered as deterministic. Lemma 5.2gives

That is, the measurements and the esiduals have th e same probability densityfunction. Given a jump v at time k , th e residuals get a bias,

Thus, the test statis tic (9.8) can be written as

Introduce the well known compact quantities of the LS estimator

(9.31)

(9.32)

(9.33)

Page 381: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 381/495

372 Chanae detection based on likelihood atios

9.35)

N

t = k + 1

E t p 3 + w st E t p m w ) )-1 9.36)

= f ;F k ) R ~ ) - l k ) ~ iv k ) , 9.37)

where the second equality follows from 9.30) and the Gaussian probabilitydensity function. The third equality follows from straightforward calculationsusing 9.34), 9.32) and 9.33) . Thi s simple expression for th e test statistic isan appealing property of the GLR test.

9.B. LS-based derivation of the MLR test

The idea here is to consider the jump magnitude as a stochastic variable. In this way, the maximization over ν in l_N(k, ν) in (9.8) can be avoided. Instead, the jump magnitude is eliminated by integration,

p(ε) = ∫ p(ε|ν) p_ν(ν) dν.

In the context of likelihood ratio tests, the possibility of integrating out the nuisance parameter ν is also discussed in Wald (1950). Here, there are two choices. Either ν is assumed to have a Gaussian prior, or it is considered to have infinite variance, so the (improper) prior is constant. The log likelihood ratio is given in the following theorem.

Theorem 9.8
Consider the GLR test in Algorithm 9.1 as an estimator of a change of magnitude ν at time k, k = 1, 2, ..., N − 1, for the signal model (9.1), given the measurements y^N. The LR, corresponding to the estimator of k alone, is given by

2 log ( p(y^N|k) / p(y^N|k = N) ) = l_N(k, ν̂_N(k)) − log det R̄_N(k) + C_prior(k),


where l_N(k, ν̂_N(k)) = f_N^T (R̄_N)^{−1} f_N is given in Algorithm 9.1. Here C_prior(k) is a prior dependent constant. It equals a constant given in (9.38) if the prior is chosen to be non-informative, and

C_prior(k) = 2 log p_ν(ν̂) + log 2π   (9.39)

if the prior is chosen to be Gaussian, ν ∈ N(ν_0, P_ν). Here R̄_N(k) and ν̂_N(k) are given by (9.33) and (9.34), respectively.

Proof: We begin with the Gaussian case, where the jump magnitude is integrated out; in the second equality the off-line expressions (5.96) and (5.97) are used.

The case of a non-informative prior (P_ν^{−1} → 0) is proved as above. Again, (5.96) and (5.97) are used:

Σ_{t=k+1}^N ( ε_t^T S_t^{−1} ε_t − (ε_t − φ̄_t^T(k) ν̂_N(k))^T S_t^{−1} (ε_t − φ̄_t^T(k) ν̂_N(k)) ) + log det P_N(k)
= l_N(k, ν̂_N(k)) − log det R̄_N(k).

Here the fact P_N(k) = (R̄_N)^{−1}(k) is used. □

The conclusion is that the ML estimates of (k, ν) and of k alone are closely related. In fact, the likelihood ratios are asymptotically equivalent except for a constant. This constant can be interpreted as different thresholds, as done in Lemma 9.1.


In this constant, the term (ν̂_N − ν_0)^T P_ν^{−1} (ν̂_N − ν_0) is negligible if the prior uncertainty P_ν is large or, since ν̂_N(k) converges to ν_0, if N is large. Here ν̂_N(k) → ν_0 as N − k → ∞, because the Kalman filter eventually tracks the abrupt change in the state vector. As demonstrated in the simulation section, the term log det R̄_N(k) does not alter the likelihood significantly. In fact, log det R̄_N(k) is asymptotically constant, and this is formally proved in the following lemma.

Lemma 9.9
Let φ_t(k) be as given in Lemma 9.7. Then, if the signal model (9.1) is stable and time-invariant,

R̄_N(k) = Σ_{t=k+1}^N φ̄_t(k) S_t^{−1} φ̄_t^T(k)

converges as N − k tends to infinity.

Proof: Rewriting (9.29) using (9.28) gives

μ_{t+1}(k) = (A − K C A) μ_t(k) + K C A^{t−k+1},

where K is the stationary Kalman gain in the measurement update. Now, by assumption, λ_1 = max_i |λ_i(A)| < 1, so Kalman filter theory gives that λ_2 = max_i |λ_i(A − AKC)| < 1 as well; see Anderson and Moore (1979), p. 77 (A − AKC has the same non-zero eigenvalues as the matrix A − KCA appearing above). Thus, K C A^{t−k} is bounded, and ‖μ_t(k)‖ ≤ C_1 λ^{t−k}. This implies that

‖φ_t(k)‖ ≤ C_2 λ^{t−k},

where λ = max(λ_1, λ_2). Now, if l > m,

‖R̄_l(k) − R̄_m(k)‖ ≤ C_2^2 Σ_{t=m+1}^l λ^{2(t−k)} ≤ C_2^2 λ^{2(m+1−k)} / (1 − λ^2) → 0, as m, l → ∞.

Thus, we have proved that R̄_N(k) is a Cauchy sequence, and since it belongs to a complete space, it converges. □


We have now proved that the log likelihoods for the two variants of parameter vectors, (k, ν) and k, are approximately equal for k = 1, 2, ..., N − 1, except for an unknown constant. Thus,

l_N(k, ν̂_N(k)) ≈ l_N(k) + C.

Therefore, they are likely to give the same ML estimate. Note, however, that this result does not hold for k = N (that is, no jump), since l_N(N, ν̂_N(N)) = l_N(N) = 0. To get equivalence for this case as well, the threshold in the GLR test has to be chosen as this unknown constant C.


10

Change detection based on multiple models

10.1. Basics . . . . . . . . . . . . . . . . . . . . . . . . . . 377
10.2. Examples of applications . . . . . . . . . . . . . . . . 378
10.3. On-line algorithms . . . . . . . . . . . . . . . . . . . 385
   10.3.1. General ideas . . . . . . . . . . . . . . . . . . . 385
   10.3.2. Pruning algorithms . . . . . . . . . . . . . . . . 386
   10.3.3. Merging strategies . . . . . . . . . . . . . . . . 387
   10.3.4. A literature survey . . . . . . . . . . . . . . . . 389
10.4. Off-line algorithms . . . . . . . . . . . . . . . . . . . 391
   10.4.1. The EM algorithm . . . . . . . . . . . . . . . . . 391
   10.4.2. MCMC algorithms . . . . . . . . . . . . . . . . . . 392
10.5. Local pruning in blind equalization . . . . . . . . . . 395
   10.5.1. Algorithm . . . . . . . . . . . . . . . . . . . . . 395
10.A. Posterior distribution . . . . . . . . . . . . . . . . . 397
   10.A.1. Posterior distribution of the continuous state . . 398
   10.A.2. Unknown noise level . . . . . . . . . . . . . . . . 400

10.1. Basics

This chapter addresses the most general problem formulation of detection in linear systems. Basically, all problem formulations that have been discussed so far are included in the framework considered. The main purpose is to survey multiple model algorithms, and a secondary purpose is to overview and compare the state of the art in different application areas for reducing complexity, where similar algorithms have been developed independently.

The goal is to detect abrupt changes in the state space model

x_{t+1} = A_t(δ) x_t + B_{u,t}(δ) u_t + B_{v,t}(δ) v_t   (10.1)
y_t = C_t(δ) x_t + e_t
v_t ∈ N(m_{v,t}(δ), Q_t(δ))
e_t ∈ N(m_{e,t}(δ), R_t(δ)).


Here δ_t is a discrete parameter representing the mode of the system (linearized mode, faulty mode etc.), and it takes on one of S different values (mostly we have the case S = 2). This model incorporates all previously discussed problems in this book, and is therefore the most general formulation of the estimation and detection problem. Section 10.2 gives a number of applications, including change detection and segmentation, but also model structure selection, blind and standard equalization, missing data and outliers. The common theme in these examples is that there is an unknown discrete parameter, the mode, in a linear system.

One natural strategy for choosing δ is the following:

- For each possible δ, filter the data through a Kalman filter for the (conditionally) known state space model (10.1).
- Choose the particular value of δ whose Kalman filter gives the smallest prediction errors.

In fact, this is basically how the MAP estimator

δ̂^{MAP} = arg max_δ p(δ|y^N)   (10.2)

works, as will be proven in Theorem 10.1. The structure is illustrated in Figure 10.1.

The key tool in this chapter is a repeated application of Bayes' law to compute a posteriori probabilities:

p(δ^t|y^t) ∝ p(y_t|y^{t−1}, δ^t) p(δ^t|y^{t−1}),   (10.3)
p(y_t|y^{t−1}, δ^t) = N(y_t; C_t(δ_t) x̂_{t|t−1}(δ^t), R_t(δ_t) + C_t(δ_t) P_{t|t−1}(δ^t) C_t^T(δ_t)).

A proof is given in Section 10.A. The latter equation is recursive and suitable for implementation. This recursion immediately leads to a multiple model algorithm summarized in Table 10.1. This table also serves as a summary of the chapter.
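In words, (10.3) weights each conditional Kalman filter by the Gaussian likelihood of its innovation. A minimal MATLAB sketch of one measurement update of the mode probabilities, assuming a bank of conditional filters whose one-step output predictions yhat{i} and innovation covariances Sig{i} are available; the names are illustrative.

% One measurement update of the mode probabilities p(delta^t | y^t).
% y: current measurement; w: weights p(delta^t | y^(t-1)) for the M branches.
M = numel(w);
for i = 1:M
  e    = y - yhat{i};                     % innovation of branch i
  ny   = length(e);
  w(i) = w(i) * exp(-0.5*(e'*(Sig{i}\e))) / sqrt((2*pi)^ny * det(Sig{i}));
end
w = w / sum(w);                           % normalize to probabilities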

10.2. Examples of applications

A classical signal processing problem is to find a sinusoid in noise, where the phase, amplitude and frequency may change in time. Multiple model approaches are found in Caciotta and Carbone (1996) and Spanjaard and White


Table 10.1. A generic multiple model algorithm.

1. Kalman filtering: conditioned on a particular sequence δ^t, the state estimation problem in (10.1) is solved by a Kalman filter. This will be called the conditional Kalman filter, and its outputs are x̂_{t|t}(δ^t), P_{t|t}(δ^t), ε_t(δ^t) and S_t(δ^t).

2. Mode evaluation: for each sequence, we can compute, up to an unknown scaling factor, the posterior probability given the measurements, using (10.3).

3. Distribution: at time t, there are S^t different sequences δ^t, which will be labeled δ^t(i), i = 1, 2, ..., S^t. It follows from the theorem of total probability that the exact posterior density of the state vector is

p(x_t|y^t) = Σ_{i=1}^{S^t} p(δ^t(i)|y^t) N(x̂_{t|t}(δ^t(i)), P_{t|t}(δ^t(i))).

This distribution is a Gaussian mixture with S^t modes.

4. Pruning and merging (on-line): for on-line applications, there are two approaches to approximate the Gaussian mixture, both aiming at removing modes so that only a fixed number of modes in the Gaussian mixture are kept. The exponential growth can be interpreted as a tree, and the approximation strategies are merging and pruning. Pruning is simply to cut off modes in the mixture with low probability. In merging, two or more modes are replaced by one new Gaussian distribution.

5. Numerical search (off-line): for off-line analysis, there are numerical approaches based on the EM algorithm or MCMC methods. We will detail some suggestions for how to generate sequences of δ_t which will theoretically belong to the true posterior distribution.


Figure 10.1. The multiple model approach.

(1995). In Daumer and Falk (1998), multiple models are used to find the change points in biomedical time series. In Caputi (1995), the multiple model is used to model the input to a linear system as a switching Gaussian process. Actuator and sensor faults are modeled by multiple models in Maybeck and Hanlon (1995). Wheaton and Maybeck (1995) used the multiple model approach for acceleration modeling in target tracking, and Yeddanapudi et al. (1997) applied the framework to target tracking in ATC. These are just a few examples; more references can be found in Section 10.3.4. Below, important special cases of the general model are listed as examples. It should be stressed that the general algorithm and its approximations can be applied to all of them.

Example 10.1 Detection in changing mean model

Consider the case of an unknown constant in white noise. Suppose that we want to test the hypothesis that the 'constant' has been changed at some unknown time instant. We can then model the signal by

y_t = θ_1 + θ_2 σ(t − δ + 1) + e_t,

where σ(t) is the step function. If all possible change instants are to be considered, the variable δ takes its value from the set {1, 2, ..., t − 1, t}, where δ = t


should be interpreted as no change (yet). This example can be interpreted as a special case of (10.1), where

δ ∈ {1, 2, ..., t}, x_t = (θ_1, θ_2)^T, A_t(δ) = I, C_t(δ) = (1, σ(t − δ + 1)), Q_t(δ) = 0, R_t(δ) = λ.

The detection problem is to estimate δ.

Example 10.2 Segmentation in changing mean model

Suppose in Example 10.1 that there can be arbitrarily many changes in the mean. The model used can be extended by including more step functions, but such a description would be rather inconvenient. A better alternative to model the signal is

θ_{t+1} = θ_t + δ_t v_t
y_t = θ_t + e_t
δ_t ∈ {0, 1}.

Here the changes are modeled as the noise v_t, and the discrete parameter δ_t is 1 if a change occurs at time t and 0 otherwise. Obviously, this is a special case of (10.1), where the discrete variable is δ^N = (δ_1, δ_2, ..., δ_N) and

δ^N ∈ {0, 1}^N, x_t = θ_t, A_t(δ) = 1, C_t = 1, Q_t(δ) = δ_t Q_t, R_t = λ.

Here {0, 1}^N denotes all possible sequences of zeros and ones of length N. The problem of estimating the sequence δ^N is called segmentation.

Example 10.3 Model structure selection

Suppose that there are two possible model structures for describing a measured signal, namely two auto-regressions with one or two parameters,

δ = 1:  y_t = −a_1 y_{t−1} + e_t
δ = 2:  y_t = −a_1 y_{t−1} − a_2 y_{t−2} + e_t.

Here, e_t is white Gaussian noise with variance λ. We want to determine from a given data set which model is the most suitable. One solution is to refer to the general problem with discrete parameters in (10.1). Here we can take

A_t(δ) = I, Q_t(δ) = 0, R_t(δ) = λ


and

x_t = (a_1, a_2)^T, C_t(1) = (−y_{t−1}, 0), C_t(2) = (−y_{t−1}, −y_{t−2}).

The problem of estimating δ is called model structure selection.

Example 10.4 Equalization

A typical digital communication problem is to estimate a binary signal, u_t, transmitted through a channel with a known characteristic and measured at the output. A simple example is a two-tap channel,

y_t = b_1 u_t + b_2 u_{t−1} + e_t, u_t ∈ {−1, +1}.

We refer to the problem of estimating the input sequence with a known channel as equalization.

Example 10.5 Blind equalization

Consider again the communication problem in Example 10.4, but assume now that both the channel model and the binary signal are unknown a priori. We can try to estimate the channel parameters as well by using the model

x_t = (b_1, b_2)^T, x_{t+1} = x_t,
y_t = (u_t(δ), u_{t−1}(δ)) x_t + e_t,

where the regressor of hypothesized inputs plays the role of C_t(δ). The problem of estimating the input sequence with an unknown channel is called blind equalization.


Example 10.6 Outliers

In practice it is not uncommon that some of the measurements are much worse than the others. These are usually called outliers. See Huber (1981) for a thorough treatment of this problem. Consider, for instance, a state space model

x_{t+1} = A_t x_t + v_t
y_t = C_t x_t + e_t,   (10.4)

where some of the measurements are known to be bad. One possible approach to this problem is to model the measurement noise as a Gaussian mixture,

e_t ∼ Σ_{i=1}^M α_i N(μ_i, Q_i),

where Σ_{i=1}^M α_i = 1. With this notation we mean that the density function for e_t is

p(e_t) = Σ_{i=1}^M α_i N(e_t; μ_i, Q_i).

In this way, any density function can be approximated arbitrarily well, including heavy-tailed distributions describing the outliers. To put a Gaussian mixture in our framework, express it as

e_t ∈ N(μ_1, Q_1) with probability α_1,
e_t ∈ N(μ_2, Q_2) with probability α_2,
...
e_t ∈ N(μ_M, Q_M) with probability α_M.

Hence, the noise distribution can be written e_t(δ_t) ∈ N(μ_{δ_t}, Q_{δ_t}), where δ_t ∈ {1, 2, ..., M} and the prior is chosen as p(δ_t = i) = α_i.

The simplest way to describe possible outliers is to take μ_1 = μ_2 = 0, Q_1 equal to the nominal noise variance, Q_2 as much larger than Q_1, and α_2 = 1 − α_1 equal to a small number. This models the fraction α_2 of all measurements as outliers with a very large variance. The Kalman filter will then ignore these measurements, and the a posteriori probabilities are almost unchanged.
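As an illustration, the two-component outlier model is easy to simulate. A minimal MATLAB sketch with illustrative numbers (nominal variance Q_1 = 1, outlier variance Q_2 = 100, outlier fraction α_2 = 0.05):

% Simulate measurement noise from the two-component outlier mixture.
Nsim   = 1000;
alpha2 = 0.05;                            % fraction of outliers
Q1 = 1; Q2 = 100;                         % nominal and outlier variances
delta  = rand(Nsim,1) < alpha2;           % true marks an outlier (mode 2)
e = sqrt(Q1)*randn(Nsim,1).*(~delta) + sqrt(Q2)*randn(Nsim,1).*delta;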


Example 10.7 Missing data

In some applications it frequently happens that measurements are missing, typically due to sensor failure. A suitable model for this situation is

x_{t+1} = A_t x_t + v_t
y_t = (1 − δ_t) C_t x_t + e_t.   (10.5)

This model is used in Lainiotis (1971). The model (10.5) corresponds to the choices

C_t(δ) = (1 − δ_t) C_t, δ_t ∈ {0, 1},

in the general formulation (10.1). For a thorough treatment of missing data, see Tanaka and Katayama (1990) and Parzen (1984).

Example 10.8 Markov models

Consider again the case of missing data, modeled by (10.5). In applications, one can expect that a very low fraction, say p_11, of the data is missing. On the other hand, if one measurement is missing, there is a fairly high probability, say p_22, that the next one is missing as well. This is nothing but a prior assumption on δ, corresponding to a Markov chain. Such a state space model is commonly referred to as a jump linear model. A Markov chain is completely specified by its transition probabilities

p_ij = p(δ_{t+1} = i | δ_t = j),

and the initial probabilities p(δ_0 = i) = p_i. Here we must have p_12 = 1 − p_22 and p_21 = 1 − p_11. In our framework, this is only a recursive description of the prior probability of each sequence,

p(δ^N) = p(δ_0) ∏_{t=0}^{N−1} p(δ_{t+1} | δ_t).

For outliers, and especially missing data, the assumption of an underlying Markov chain is particularly logical. It is used, for instance, in MacGarty (1975).


10.3. On-line algorithms

10.3.1. General deas

Interpret the exponentially increasing number of discrete sequences δ^t as a growing tree, as illustrated in Figure 10.2. It is inevitable that we either prune or merge this tree.

In this section, we examine how one can discard elements in S by cutting off branches in the tree, and lump sequences into subsets of S by merging branches.

Thus, the basic possibilities for pruning the tree are to cut off branches and to merge two or more branches into one. That is, two state sequences are merged and in the following treated as just one. There is also a timing question: at what instant in the time recursion should the pruning be performed? To understand this, the main steps in updating the a posteriori probabilities can be divided into a time update and a measurement update as follows:


Figure 10.2. A growing tree of discrete state sequences (leaves labeled i = 1, ..., 8). In GPB(2) the sequences (1,5), (2,6), (3,7) and (4,8), respectively, are merged. In GPB(1) the sequences (1,3,5,7) and (2,4,6,8), respectively, are merged.


Time update:

x̂_{t+1|t}(δ^t), P_{t+1|t}(δ^t) from the conditional Kalman filters,   (10.6)
p(δ^{t+1} | y^t) = p(δ_{t+1} | δ^t) p(δ^t | y^t).   (10.7)

Measurement update:

p(δ^{t+1} | y^{t+1}) ∝ p(y_{t+1} | y^t, δ^{t+1}) p(δ^{t+1} | y^t),   (10.8)
x̂_{t+1|t+1}(δ^{t+1}), P_{t+1|t+1}(δ^{t+1}) from the conditional Kalman filters.   (10.9)

Here, the splitting of each branch into S branches is performed in (10.7). We define the most probable branch as the sequence δ^t with the largest a posteriori probability p(δ^t | y^t) in (10.8).

A survey of proposed search strategies in the different applications in Section 10.2 is presented in Section 10.3.4.

10.3.2. Pruning algorithms

First, a quite general pruning algorithm is given.

Algorithm 10.1 Multiple model pruning

1. Compute recursively the conditional Kalman filter for a bank of M sequences δ^t(i) = (δ_1(i), δ_2(i), ..., δ_t(i))^T, i = 1, 2, ..., M.

2. After the measurement update at time t, prune all but the M/S most probable branches δ^t(i).

3. At time t + 1: let the M/S considered branches split into S · M/S = M branches, δ^{t+1}(j) = (δ^t(i), δ_{t+1}), for all δ^t(i) and δ_{t+1}. Update their a posteriori probabilities according to Theorem 10.1. A sketch of this prune-and-split step is given below.
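A minimal MATLAB sketch of the prune-and-split step in Algorithm 10.1, assuming M is divisible by S and that pdelta(s) = p(δ_{t+1} = s) is the mode prior; all names are illustrative.

% Prune to the M/S most probable branches, then split each into S branches.
[~, idx] = sort(w, 'descend');
keep     = idx(1:M/S);                    % surviving branches
w        = w(keep);  branches = branches(keep);
neww = []; newbranches = {};
for i = 1:numel(branches)
  for s = 1:S
    newbranches{end+1} = [branches{i} s];
    neww(end+1)        = w(i) * pdelta(s);
  end
end
w = neww / sum(neww);  branches = newbranches;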

For change detection purposes, where δ_t = 0 is the normal outcome and δ_t ≠ 0 corresponds to different fault modes, we can save a lot of filters in the filter bank by using a local search scheme similar to that in Algorithm 7.1.

Algorithm 10.2 Local pruning for multiple models

1. Compute recursively the conditional Kalman filter for a bank of M hypotheses δ^t(i) = (δ_1(i), δ_2(i), ..., δ_t(i))^T, i = 1, 2, ..., M.

2. After the measurement update at time t, prune the S − 1 least probable branches δ^t(i).


3. At time t + 1: let only the most probable branch split into S branches, δ^{t+1}(j) = (δ^t(i), δ_{t+1}).

4. Update their posterior probabilities according to Theorem 10.1.

Some restrictions on the rules above can sometimes be useful:

- Assume a minimum segment length: let the most probable sequence split only if it is not too young.
- Assure that sequences are not cut off immediately after they are born: cut off the least probable sequences among those that are older than a certain minimum life-length, until only M ones are left.

10.3.3. Merging strategies

A general merging formula

The exact posterior density of the state vector is a mixture of S^t Gaussian distributions. The key point in merging is to replace, or approximate, a number of Gaussian distributions by one single Gaussian distribution in such a way that the first and second moments are matched. That is, a sum of L Gaussian distributions

p(x) = Σ_{i=1}^L α(i) N(x̂(i), P(i))

is approximated by

p(x) = α N(x̂, P),

where

α = Σ_{i=1}^L α(i),
x̂ = (1/α) Σ_{i=1}^L α(i) x̂(i),
P = (1/α) Σ_{i=1}^L α(i) ( P(i) + (x̂(i) − x̂)(x̂(i) − x̂)^T ).

The second term in P is the spread of the mean (see (10.21)). It is easy to verify that the expectation and covariance are unchanged under the distribution approximation. When merging, all discrete information of the history is lost. That is, merging is less useful for fault detection and isolation than pruning.
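A minimal MATLAB sketch of this moment matching merge, for a mixture given by a weight vector alpha and cell arrays xhat and P; the names are illustrative.

% Merge a Gaussian mixture sum_i alpha(i)*N(xhat{i},P{i}) into one Gaussian
% alpha*N(xbar,Pbar) with the same first and second moments.
function [a, xbar, Pbar] = merge_gaussians(alpha, xhat, P)
  a    = sum(alpha);
  xbar = zeros(size(xhat{1}));
  for i = 1:numel(alpha)
    xbar = xbar + (alpha(i)/a) * xhat{i};
  end
  Pbar = zeros(size(P{1}));
  for i = 1:numel(alpha)
    d    = xhat{i} - xbar;
    Pbar = Pbar + (alpha(i)/a) * (P{i} + d*d');   % spread of the mean term
  end
end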


The GPB algorithm

The idea of the Generalized Pseudo-Bayesian (GPB) approach is to merge the mixture after the measurement update.

Algorithm 10.3 GPB

The mode parameter δ_t is an independent sequence with S outcomes, used to switch modes in a linear state space model. Decide on the sliding window memory L. Represent the posterior distribution of the state at time t with a Gaussian mixture of M = S^{L−1} distributions,

p(x_t|y^t) = Σ_{i=1}^M α(i) N(x̂_{t|t}(i), P_{t|t}(i)).

Repeat the following recursion:

1. Let these split into S^L sequences by considering all S new branches at time t + 1.

2. For each i, apply the conditional Kalman filter measurement and time update, giving x̂_{t+1|t}(i), x̂_{t+1|t+1}(i), P_{t+1|t}(i), P_{t+1|t+1}(i), ε_{t+1}(i) and S_{t+1}(i).

3. Time update the weight factors α(i) according to

α(i) := α(i) p(δ_{t+1}(i)).

4. Measurement update the weight factors α(i) according to

α(i) := α(i) N(ε_{t+1}(i); 0, S_{t+1}(i)).

5. Merge the S sequences corresponding to the same history up to time t − L. This requires S^{L−1} separate merging steps using the formula

α = Σ_{i=1}^S α(i),
x̂ = (1/α) Σ_{i=1}^S α(i) x̂(i),
P = (1/α) Σ_{i=1}^S α(i) ( P(i) + (x̂(i) − x̂)(x̂(i) − x̂)^T ).


The hypotheses that are merged are identical up to time t − L. That is, we do a complete search in a sliding window of size L. In the extreme case of L = 0, all hypotheses are merged at the end of the measurement update. This leaves us with S time and measurement updates. Figure 10.2 illustrates how the memory L influences the search strategy.

Note that we prefer to call α_i weight factors rather than posterior probabilities, as in Theorem 10.1. First, we do not bother to compute the appropriate scaling factors (which are never needed), and secondly, these are probabilities of merged sequences that are not easy to interpret afterwards.

The IMM algorithm

The IMM algorithm is very similar to GPB. The only difference is that merging is applied after the time update of the weights, rather than after the measurement update. In this way, a lot of time updates are omitted, which usually do not contribute to performance. Computationally, IMM should be seen as an improvement over GPB.

Algorithm 10.4 IMM

As the GPB Algorithm 10.3, but change the order of steps 4 and 5.

For target tracking, IMM has become a standard method (Bar-Shalom and Fortmann, 1988; Bar-Shalom and Li, 1993). Here there is an ambiguity in how the mode parameter should be utilized in the model. A survey of alternatives is given in Efe and Atherton (1998).

10.3.4. A literature survey

Detection

In detection, the number of branches increases linearly in time: one branch for each possible time instant for the change. An approximation suggested in Willsky and Jones (1976) is to restrict the jump to a window, so that only jumps in the, say, L last time instants are considered. This is an example of a global search, and it would have been the optimal thing to do if there really was a finite memory in the process, so that new measurements contain no information about possible jumps L time instants ago. This leaves L branches in the tree.

Segmentation

A common approach to segmentation is to apply a recursive detection method, which is restarted each time a jump is decided. This is clearly also a sort of global search.


A pruning strategy is proposed in Andersson (1985). The method is called Adaptive Forgetting through Multiple Models (AFMM), and is basically a variant of Algorithm 10.2.

Equalization

For equalization there is an optimal search algorithm for a finite impulse response channel, namely the Viterbi Algorithm 5.5. The Viterbi algorithm is, in its simplicity, indeed the most powerful result for search strategies. The assumption of finite memory is, however, not very often satisfied. Equalization of FIR channels is one exception. In our terminology, one can say that the Viterbi algorithm uses a sliding window, where all possible sequences are examined.

Despite the optimality and finite dimensionality of the Viterbi algorithm, the memory, and accordingly also the number of branches, is sometimes too high. Therefore, a number of approximate search algorithms have been suggested. The simplest example is Decision-directed Feedback (DF), where only the most probable branch is saved at each time instant.

An example of a global search is Reduced State Space Estimation (RSSE), proposed in Eyuboglu and Qureshi (1988). Similar algorithms are independently developed in Duel-Hallen and Heegard (1989) and Chevillat and Eleftheriou (1989). Here, the possible state sequences are merged to classes of sequences. One example is when the size of the optimal Viterbi window is decreased to less than L. A different and more complicated merging scheme, called State Space Partitioning (SSP), appears in Larsson (1991).

A pruning approach is used in Aulin (1991), and it is there called Search Algorithm (SA). Apparently, the same algorithm is used in Anderson and Mohan (1984), where it is called the M-algorithm. In both algorithms, the M locally best sequences survive.

In Aulin (1991) and Seshadri and Sundberg (1989), a search algorithm is proposed, called the HA(B, L) and generalized Viterbi algorithm (GVA), respectively. Here, the B most probable sequences preceding all combinations of sequences in a sliding window of size L are saved, making a total of B·S^L branches. The HA(B, L) thus contains DF, RSSE, SA and even the Viterbi algorithm as special cases.

Blind equalization

In blind equalization, the approach of examining each input sequence in a tree structure is quite new. However, the DF algorithm, see Sato (1975), can be considered as a local search where only the most probable branch is saved. In Sato (1975), an approximation is proposed, where the possible inputs are


merged into two classes: one for positive and one for negative values of the input. The most probable branch defined in these two classes is saved. This algorithm is, however, not an approximation of the optimal algorithm, but rather of a suboptimal one, where the LMS (Least Mean Squares, see Ljung and Soderstrom (1983)) algorithm is used for updating the parameter estimate.

Markov models

The search strategy problem is perhaps best developed in the context of Markov models; see the excellent survey Tugnait (1982), and also Blom and Bar-Shalom (1988).

The earliest reference on this subject is Ackerson and Fu (1970). In their global algorithm, the Gaussian mixture at time t (remember that the posteriori distribution from a Gaussian prior is a Gaussian mixture) is approximated by one Gaussian distribution. That is, all branches are merged into one. This approach is also used in Wernersson (1975), Kazakov (1979) and Segal (1979).

An extension of this algorithm is given in Jaffer and Gupta (1971). Here, all possible sequences over a sliding window are considered, and the preceding sequences are merged by using one Gaussian distribution just as above. Two special cases of this algorithm are given in Bruckner et al. (1973) and Chang and Athans (1978), where the window size is one. The difference is just the model complexity. The most general algorithm in Jaffer and Gupta (1971) is the Generalized Pseudo Bayes (GPB), which got this name in Blom and Bar-Shalom (1988); see Algorithm 10.3. The Interacting Multiple Model (IMM) algorithm was proposed in Blom and Bar-Shalom (1988). As stated before, the difference between GPB and IMM is the timing of when the merging is performed. Merging in (10.7) gives the IMM, and after (10.9) the GPB algorithm.

An unconventional approach to global search is presented in Akashi and Kumamoto (1977). Here, the small number of considered sequences are chosen at random. This was before genetic algorithms and MCMC methods became popular.

A pruning scheme, given in Tugnait (1979), is the Detection-Estimation Algorithm (DEA). Here, the M most likely sequences are saved.

10.4. Off-line algorithms

10.4.1. The EM algorithm

The Expectation Maximization (EM) algorithm (see Baum et al. (1970)) alternates between estimating the state vector by a conditional mean, given a sequence δ^N, and maximizing the posterior probability p(δ^N|x^N). Application to state space models and some recursive implementations are surveyed


in Krishnamurthy and Moore (1993). The MCMC Algorithm 10.6 can be interpreted as a stochastic version of EM.

10.4.2. MCMC algorithms

MCMC algorithms are off-line, though related simulation-based methods have been proposed to find recursive implementations (Bergman, 1999; de Freitas et al., 2000; Doucet, 1998).

Algorithm 7.2, proposed in Fitzgerald et al. (1994) for segmentation, can be generalized to the general detection problem. It is here formulated for the case of binary δ_t, which can be represented by the n change times k_1, k_2, ..., k_n.

Algorithm 10.5 Gibbs-Metropolis MCMC detection

Decide the number of changes n. Then iterate the following Monte Carlo run:

1. Iterate the Gibbs sampler over the components j of k^n, where a random number from

k̄_j ∼ q(k_j | k_1, k_2, ..., k_{j−1}, k_{j+1}, ..., k_n)

is taken. Denote the new candidate sequence k̄^n. The proposal distribution may be taken as flat, or Gaussian centered around k_j. If independent jump instants are assumed, this task simplifies to taking random numbers k̄_j ∼ q(k_j).

2. Run the conditional Kalman filter using the sequence k̄^n, and save the innovations ε_t and their covariances S_t.

3. The candidate k̄^n is accepted with probability

min(1, p(y^N | k̄^n) / p(y^N | k^n)).

That is, if the likelihood increases we always keep the new candidate. Otherwise we keep it with a certain probability which depends on its likeliness. This random rejection step is the Metropolis step.

After the burn-in (convergence) time, the distribution of change times can be computed by Monte Carlo techniques.
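A minimal MATLAB sketch of one Gibbs-Metropolis iteration for one component, assuming a helper loglik(k) that runs the conditional Kalman filter and returns log p(y^N | k); both the helper and all names are illustrative assumptions.

% One Metropolis step for component j of the change time sequence k.
j        = randi(n);
kcand    = k;
kcand(j) = k(j) + round(5*randn);         % Gaussian random walk proposal
if kcand(j) >= 1 && kcand(j) <= N
  if rand < min(1, exp(loglik(kcand) - loglik(k)))
    k = kcand;                            % accept the candidate
  end
end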

Example 10.9 Gibbs-Metropolis change detection

Consider the tracking example described in Section 8.12.2. Figure 10.3(a) shows the trajectory and the result from the Kalman filter. The jump hypothesis is that the state covariance is ten times larger, Q(1) = 10·Q(0).



Figure 10.3. Trajectory and filtered position estimate from the Kalman filter (a) and the Gibbs-Metropolis algorithm with n = 5 (b).

Figure 10.4. Convergence of the Gibbs sequence of change times. To the left for n = 2 and to the right for n = 5. For each iteration, there are n sub-iterations, in which each change time in the current sequence k^n is replaced by a random one. The accepted sequences k^n are marked with x.

The proposal distribution used in the Gibbs sampler is Gaussian, N(0, 5). The jump sequences as a function of iterations and sub-iterations of the Gibbs sampler are shown in Figure 10.4. The iteration scheme converges to two change points, at 20 and 36. In the case of overestimating the number of change points, the superfluous ones are placed at one border of the data sequence (here at the end). The improvement in tracking performance is shown in Figure 10.3(b). The distribution we would like to have in practice is the one from Monte Carlo simulations. Figure 10.5 shows the result from a Kalman filter whiteness test and a Kalman filter bank. See Bergman and Gustafsson (1999) for more information.



Figure 10.5. Histograms of estimated change times from 100 Monte Carlo simulations, using a Kalman filter whiteness test and a Kalman filter bank.

We can also generalize the MCMC approach given in Chapter 4.

Algorithm 10.6 MCMC change detection

Assume Gaussian noise in the state space model (10.1). Denote the sequence of mode parameters δ^N = (δ_1, ..., δ_N)^T, and the stacked vector of states x^N = (x_1^T, ..., x_N^T)^T. The Gibbs sequence of change times is generated by alternately taking random samples from

(x^N)^{(i+1)} ∼ p(x^N | y^N, (δ^N)^{(i)}),
(δ^N)^{(i+1)} ∼ p(δ^N | y^N, (x^N)^{(i+1)}).

The first distribution is given by the conditional Kalman smoother. The second distribution is evaluated from the smoothed state increments x_{t+1} − A_t x_t, where the state noise is recovered using the Moore-Penrose pseudo-inverse of its gain matrix.

The interpretation of the last step is the following: if the smoothed sequence x_t is changing rapidly (large derivative), then a change is associated with that time instant with high probability.


A computational advantage of assuming independence is that the random δ_t variables can be generated independently. An alternative is to assume a hidden Markov model for the sequence δ^N.

10.5. Local pruning in blind equalization

As an application, we will study blind equalization. The algorithm is a straightforward application of Algorithm 10.1. It was proposed in Gustafsson and Wahlberg (1995), and a similar algorithm is called MAPSD (Maximum A Posteriori Sequence Detection) (Giridhar et al., 1996).

10.5.1. Algorithm

A time-varying ARX model is used for the channel:

y_t = φ_t^T(δ) θ_t + e_t,

where the regressor φ_t(δ) contains past outputs and the unknown inputs, and θ_t is the time-varying channel parameter vector. The unknown input δ_t belongs to a finite alphabet of size S. With mainly notational changes, this model can encompass multi-variable and complex channels as well. The pruning Algorithm 10.1 now becomes Algorithm 10.7.

Algorithm 10.7 Blind equalization

Assume there are M sequences δ^{t−1}(i) given at time t − 1, and that their relative a posteriori probabilities p(δ^{t−1}(i)|y^{t−1}) have been computed. At time t, compute the following:

1. Evaluation: update the relative a posteriori probabilities p(δ^t(i)|y^t) according to Theorem 10.1.


Here θ̂_t(i) = θ̂_t(δ^t(i)) and P_t(i) are conditional on the input sequence δ^t(i). This gives S·M sequences, by considering all S expansions of each sequence at time t.

2. Pruning: keep only the M most probable sequences δ^t(i).

3. Estimation: update the conditional parameter estimates θ̂_t(i) and their covariances P_t(i) by the Kalman filter.

4. Repeat from step 1.

We conclude with a numerical evaluation.

Example 10.10 Blind equalization using multiple models

In this example we will examine how Algorithm 10.7 performs in the case of a Rayleigh fading communication channel. Rayleigh fading is an important problem in mobile communication. The motion of the receiver causes a time-varying channel characteristic. The Rayleigh fading channel is simulated using the following premises: the frequency of the carrier wave is 900 MHz, and the baseband sampling frequency is 25 kHz. The receiver is moving with the velocity 83 km/h, so the maximum Doppler frequency can be shown to be approximately 70 Hz. A channel with two time-varying taps, corresponding to this maximum Doppler frequency, will be used.¹ An example of a tap is shown in Figure 10.6. For more details and a thorough treatment of fading in mobile communication, see Lee (1982).

In Figure 10.6(a), a typical parameter convergence is shown. The true FIR parameter values are here compared to the least squares estimates conditioned on the estimated input sequence at time t. The convergence to the true parameter settings is quite fast (only a few samples are needed), and the tracking ability very good.

To test the sensitivity to noise, the measurement noise variance is varied over a wide range. Figure 10.6(b) shows the Bit Error Rate (BER) as a function of

¹The taps are simulated by filtering white Gaussian noise with unit variance by a second order resonance filter, with the resonance frequency equal to 70/25000, followed by a seventh order Butterworth low-pass filter with cut-off frequency π/2 · 70/25000.



Figure 10.6. Example of estimated and true parameters in a Rayleigh fading channel (a). Bit error rate as a function of SNR with 64 parallel filters, using a Monte Carlo average over 10 simulations (b).

Signal-to-Noise Ratio (SNR). Rather surprisingly, the mean performance is not much worse compared to the Viterbi algorithm in Example 5.19. Compared to the result in Figure 5.29, the parameters are here rapidly changing in time, and there is no training sequence available.

10.A. Posterior distribution

Theorem 10.1 (MAP estimate of δ^N)
Consider the signal model (10.1). The MAP estimate of δ^N is given by minimizing its negative a posteriori probability,

−2 log p(δ^N | y^N) = −2 log p(δ^N) + 2 log p(y^N) + Np log 2π
+ Σ_{t=1}^N { log det S_t(δ^N) + ε_t^T(δ^N) S_t^{−1}(δ^N) ε_t(δ^N) }.   (10.18)


Corollary 10.2
Under the same assumptions and notation as in Theorem 10.1, the a posteriori distribution of the continuous state vector is a Gaussian mixture,

p(x_t | y^N) = Σ_{i=1}^{S^N} p(δ^N(i) | y^N) N(x_t; x̂_{t|N}(δ^N(i)), P_{t|N}(δ^N(i))).

Here S^N is the finite number of different δ^N's, arbitrarily enumerated by the index i, and N(x; x̂, P) is the Gaussian density function. If t < N, x̂_{t|N} and P_{t|N} denote the smoothed state estimate and its covariance matrix, respectively. For t = N and t = N + 1, we take the Kalman filter and predictor quantities, respectively.

Proof: The law of total probability gives, since the different δ's are mutually exclusive,

p(x_t | y^N) = Σ_δ p(x_t | y^N, δ) p(δ | y^N).

For Gaussian noise, p(x_t | y^N, δ) is Gaussian with mean and covariance as given by the filter. □

The MAP estimator of x_t is arg max_{x_t} p(x_t | y^N). Another possible estimate is the conditional expectation, which coincides with the minimum variance estimate (see Section 13.2), and it is given by

x̂_t^{MV} = Σ_i p(δ^N(i) | y^N) x̂_{t|N}(δ^N(i)).

It is often interesting to compute or plot confidence regions for the state. This is quite complicated for Gaussian mixtures. What one can do is compute the conditional covariance matrix P_{t|N} for x_t, given the measurements y^N. This can then be used for giving approximate confidence regions for the conditional mean. Using the spread of the mean formula

Cov(x) = E_k Cov(x | k) + Cov_k E(x | k),

where index k means with respect to the probability distribution for k, the conditional covariance matrix follows as

P_{t|N} = Σ_i p(δ^N(i) | y^N) ( P_{t|N}(δ^N(i)) + (x̂_{t|N}(δ^N(i)) − x̂_t^{MV})(x̂_{t|N}(δ^N(i)) − x̂_t^{MV})^T ).


Thus, the a posteriori covariance matrix for x_t, given measurements y^N, is given by three terms. The first is a weighted mean of the covariance matrices for each δ. The last two take the variation in the estimates themselves into consideration. If the estimate of x_t is approximately the same for all δ, the first term is dominating; otherwise, the variations in the estimates might make the covariance matrices negligible.

10.A.2. Unknown noise level

It is well known that the properties of the Kalman filter are scale invariant with respect to the noise covariance matrices Q and R. Suppose that all prior covariance matrices are scaled by a factor λ. Marking the new quantities with bars, we have

Q̄_t = λ Q_t, R̄_t = λ R_t, P̄_0 = λ P_0.   (10.22)

It is easily checked from the Kalman filter equations that the estimates are still the same. The only difference is that all covariance matrices are scaled by a factor λ,

P̄_{t|t} = λ P_{t|t}, S̄_t = λ S_t.

Thus, for pure filtering the actual level of the covariance matrices does not need to be known.

However, from equation (10.18) in Theorem 10.1, it can easily be checked that the scaling changes the a posteriori probability of δ as

−2 log p(δ | y^N) = −2 log p(δ | y^N, λ = 1) + Np log λ + (1/λ − 1) Σ_{t=1}^N ε_t^T(δ) S_t^{−1}(δ) ε_t(δ),

and this can have quite serious effects. Consider, for instance, the case when (10.1) describes a single output linear regression. Then, λ scales the measurement noise variance. Over-estimating it gives non-informative results, since the influence of the prediction errors on (10.18) is almost negligible in this case. The MAP estimator then chooses the δ which gives the smallest prediction error covariance matrices, which is independent of the prediction errors.

Thus, there is a need for robustifying the MAP estimate with respect to such unknown levels. This can be done by considering λ as a stochastic variable and using marginalization.


Theorem 10.3
Consider the model (10.1) under the same assumptions as in Theorem 10.1, but with an unknown level λ in (10.22). Then, if λ is considered as a stochastic variable with no prior information, the a posteriori probability for δ^N is given by

−2 log p(δ^N | y^N) = C − 2 log p(δ^N) + Σ_{t=1}^N log det S_t
+ (Np − 2) log Σ_{t=1}^N ε_t^T(δ^N) S_t^{−1}(δ^N) ε_t(δ^N),   (10.23)

where C is a δ-independent constant, whose expression involves the gamma function Γ(n).

Proof: The proof is a minor modification of Theorem 10.1, and is quite similar to that of Theorem 7.4. □

At this stage, it might be interesting to compare the expressions (10.23) and (10.18) for known and stochastic noise level, respectively. Let

D_N = Σ_{t=1}^N log det S_t and V_N = Σ_{t=1}^N ε_t^T(δ^N) S_t^{−1}(δ^N) ε_t(δ^N).

Clearly, the estimator here minimizes D_N + (Np − 2) log V_N rather than D_N + V_N:

δ̂ = arg min_δ D_N + V_N, if λ is known,
δ̂ = arg min_δ D_N + (Np − 2) log V_N, if λ is stochastic.

Exactly the same statistics are involved, the difference being only how the normalized sum of residuals appears.


11

Change detection based on algebraical consistency tests

11.1. Basics . . . . . . . . . . . . . . . . . . . . . . . . . . 403
11.2. Parity space change detection . . . . . . . . . . . . . . 407
   11.2.1. Algorithm derivation . . . . . . . . . . . . . . . 407
   11.2.2. Some rank results . . . . . . . . . . . . . . . . . 409
   11.2.3. Sensitivity and robustness . . . . . . . . . . . . 411
   11.2.4. Uncontrollable fault states . . . . . . . . . . . . 412
   11.2.5. Open problems . . . . . . . . . . . . . . . . . . . 413
11.3. An observer approach . . . . . . . . . . . . . . . . . . 413
11.4. An input-output approach . . . . . . . . . . . . . . . . 414
11.5. Applications . . . . . . . . . . . . . . . . . . . . . . 415
   11.5.1. Simulated DC motor . . . . . . . . . . . . . . . . 415
   11.5.2. DC motor . . . . . . . . . . . . . . . . . . . . . 417
   11.5.3. Vertical aircraft dynamics . . . . . . . . . . . . 419

11.1. Basics

Consider a batch of data over a sliding window, collected in a measurement vector Y and input vector U. As in Chapter 6, the idea of a consistency test is to apply a linear transformation to a batch of data, A_i Y + B_i U + c_i. The matrices A_i, B_i and vector c_i are chosen so that the norm of the linear transformation is small when there is no change/fault according to hypothesis H_i, and large when fault H_i has appeared. The approach in this chapter measures the size of

‖A_i Y + B_i U + c_i‖

as a distance function in an algebraic meaning (in contrast to the statistical meaning in Chapter 6). The distance measure becomes exactly 'zero' in the non-faulty case, and any deviation from zero is explained by modeling errors and unmodeled disturbances rather than noise.


That is, we will study a noise free state space model

x_{t+1} = A_t x_t + B_{u,t} u_t + B_{d,t} d_t + B_{f,t} f_t   (11.1)
y_t = C_t x_t + D_{u,t} u_t + D_{d,t} d_t + D_{f,t} f_t.   (11.2)

The differences to, for example, the state space model (9.1) are the following:

- The measurement noise is removed.
- The state noise is replaced by a deterministic disturbance d_t, which may have a direct term to y_t.
- The state change may have a dynamic profile f_t. This is more general than the step change in (9.1), which is a special case with f_t = σ_{t−k} ν.
- There is an ambiguity over how to split the influence from one fault between B_f, D_f and f_t. The convention here is that the fault direction is included in B_f, D_f (which are typically time-invariant). That is, B_f^i f_t^i is the influence of fault i, and the scalar fault profile f_t^i is the time-varying size of the fault.

The algebraic approach uses the batch model

Y_t = O x_{t−L+1} + H_u U_t + H_d D_t + H_f F_t,   (11.3)

where Y_t = (y_{t−L+1}^T, ..., y_t^T)^T, with U_t, D_t and F_t stacked in the same way, and where

O = ( C^T, (CA)^T, ..., (CA^{L−1})^T )^T,

H_u = ( D_u          0      ...   0
        C B_u        D_u    ...   0
        ...
        C A^{L−2} B_u  ...  C B_u  D_u ).   (11.4)

For simplicity, time-invariant matrices are assumed here. The Hankel matrix H is defined identically for all three input signals u, d and f (the subscript defines which one is meant).
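For a time-invariant model, O and the Hankel matrices are easy to construct. A minimal MATLAB sketch, where (A, B, C, D) are the gains for one of the input signals u, d or f; illustrative, not from the book.

% Build the observability matrix O and the Hankel matrix H for window L.
function [O, H] = batch_matrices(A, B, C, D, L)
  [ny, nx] = size(C);  nu = size(B, 2);
  O = zeros(L*ny, nx);  H = zeros(L*ny, L*nu);
  Ai = eye(nx);                           % holds A^(i-1)
  for i = 1:L
    O((i-1)*ny+1:i*ny, :) = C * Ai;
    Ai = A * Ai;
  end
  for i = 1:L                             % block row
    for j = 1:i                           % block column
      if i == j
        blk = D;
      else
        blk = C * A^(i-j-1) * B;
      end
      H((i-1)*ny+1:i*ny, (j-1)*nu+1:j*nu) = blk;
    end
  end
end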

The idea now is to compute a linear combination of data, which is usually referred to as a residual,

r_t ≜ w^T (Y_t − H_u U_t) = w^T (O x_{t−L+1} + H_d D_t + H_f F_t).   (11.5)


This equation assumes a stable system (Kinnaert et al., 1995). The equation r_t = 0 is called a parity equation. For the residual to be useful, we have to impose the following constraints on the choice of the vector or matrix w:

1. Insensitive to the value of the state x_t and disturbances:

w^T (O, H_d) = 0.   (11.6)

That is, r_t = w^T(Y_t − H_u U_t) = w^T(O x_{t−L+1} + H_d D_t) = 0 when there is no fault. The columns of w are therefore vectors in the (left) null space of the matrix (O, H_d). This can be referred to as decoupling of the residuals from the state vector and disturbances.

2. Sensitive to faults:

w^T H_f ≠ 0.   (11.7)

Together with condition 1, this implies r_t = w^T(Y_t − H_u U_t) = w^T H_f F_t ≠ 0 whenever F_t ≠ 0.

3. For isolation, we would like the residual to react differently to the different faults. That is, the residual vectors from different faults f^1, f^2, ..., f^{n_f} should form a certain pattern, called a residual structure R. There are two possible approaches:

a) Transformation of the residuals,

r_t^i = T w^T H_f F_t^i = R^i.   (11.8)

This design assumes stationarity in the fault, that is, its magnitude is constant within the sliding window. This implies that there will be a transient in the residual of length L. See Table 11.1 for two common examples of structures R.

b) For fault decoupling, a slight modification of the null space above is needed. Let H_f^i be the fault matrix from fault f^i. There are now n_f such matrices. Replace (11.6) and (11.7) with the iterative scheme

w_i^T (O, H_d, H_f^1, ..., H_f^{i−1}, H_f^{i+1}, ..., H_f^{n_f}) = 0,
w_i^T H_f^i ≠ 0.

In this way, the transients in, for instance, Figure 11.4 will disappear. A risk with the previous design is that the time profile f_t^i might excite other residuals, causing incorrect isolation, a risk which is eliminated here. On the other hand, it should be remarked that detection should be done faster than isolation, which should be done after transient effects have passed away.


Table 11.1. The residual vector r_t = w^T(Y_t − H_u U_t) is one of the columns of the residual structure matrix R, of which two common examples are given (here for n_f = 3):

       (1 0 0)            (0 1 1)
  R =  (0 1 0)       R =  (1 0 1)
       (0 0 1)            (1 1 0)

In the left structure, fault i excites residual i only; in the right, fault i excites all residuals except the i:th.

The derivation based on the three first conditions is given in Section 11.2. The main disadvantage of this approach is sensitivity and robustness. That is, the residuals become quite noisy even for small levels of the measurement noise, or when the model used in the design deviates from the true system. This problem is not well treated in the literature, and there are no design rules to be found. However, it should be clear from the previously described design approaches that it is the window size L which is the main design parameter to trade off sensitivity (and robustness) against decreased detection performance.

Equation (11.5) can be expressed in filter form as

r_t = A(q) y_t + B(q) u_t,

where A(q) and B(q) are polynomials of order L. There are approaches, described in Section 11.4, that design these FIR filters in the frequency domain. There is also a close link to observer design, presented in Section 11.3. Basically, (11.5) is a dead-beat observer of the faults. Other observer designs correspond to pole placements different from the origin, which can be achieved by filtering the residuals in (11.5) by an observer polynomial C(q).

Another approach is also based on observers. The idea is to run a bank of filters, each one using only one output. This is called a dedicated observer in Clark (1979). The observer outputs are then compared, and by a simple voting strategy faulty sensors are detected. The residual structure is the left one in Table 11.1. A variant of this is to include all but one of the measurements in the observer. This is an efficient solution for, for example, navigation systems, since the recovery after a detected change is simple: use the output from the observer not using the faulty sensor. A fault in one sensor will then affect all but one observer, and voting can be applied according to the right-hand structure in Table 11.1. An extension of this idea is to design observers that use all but a subset of inputs, corresponding to the hypothesized faulty actuators. This is called an unknown input observer (Wünnenberg, 1990). A further alternative is the generalized observer, see Patton (1994) and Wünnenberg (1990), which is outlined in Section 11.3. Finally, it will be argued that all observer


and frequency domain approaches are equivalent to, or special cases of, the parity space approach detailed here.

Literature

For ten years, the collection by Patton et al. (1989) was the main reference in this area. Now, there are three single-authored monographs: Gertler (1998), Chen and Patton (1999) and Mangoubi (1998). There are also several survey papers, of which we mention Isermann and Ballé (1997), Isermann (1997) and Gertler (1997).

The approaches to design suitable residuals are: parity space design (Chow and Willsky, 1984; Ding et al., 1999; Gertler, 1997), unknown input observer (Hong et al., 1997; Hou and Patton, 1998; Wang and Daley, 1996; Wünnenberg, 1990; Yang and Saif, 1998) and design in the frequency domain (Frank and Ding, 1994b; Sauter and Hamelin, 1999). A completely different approach is based on reasoning and computer science, and examples here are Årzén (1996), Blanke et al. (1997), Larsson (1994) and Larsson (1999). In the latter approach, Boolean logics and object-orientation are keywords.

A logical approach to merge the deterministic modeling of this chapter with the stochastic models used by the Kalman filter appears in Keller (1999).

11.2. Parity space change detection

11.2.1. Algorithm derivation

The three conditions (11.6)-(11.8) are used to derive Algorithm 11.1 below. Condition (11.6) implies that w belongs to the null space of (O, H_d). This can be computed by a Singular Value Decomposition (SVD):

(O, H_d) = U D V^T.

These matrices are written in standard SVD notation, and should not be confused with U_t and D_u, etc., in the model (11.3). Here D is a diagonal matrix with elements equal to the singular values of (O, H_d), and the left singular vectors are the columns of U. The null space N of (O, H_d) is spanned by the last columns of U, corresponding to the singular values that are zero. In the following, we will use the same notation for the null space N as for a basis represented by the rows of a matrix N. In MATLAB notation, we can take

[U,D,V] = svd([O Hd]);
n = rank(D);
N = U(:,n+1:end)';


The MATLAB function null computes the null space directly, and gives a slightly different basis. Condition (11.6) is satisfied for any linear combination of N,

w^T = T N,

where T is an arbitrary (square or thick) matrix. To satisfy condition (11.7), we just have to check that no rows of N are orthogonal to H_f. If there are such rows, they are deleted, and we save the remaining rows in N. In MATLAB notation, we take

ind = [];
for i = 1:size(N,1)
  a = N(i,:)*Hf;
  if all(a == 0)
    ind = [ind i];
  end
end
N(ind,:) = [];

That is, the number of rows in N say n s , determines how many residualsthat can be computed, which is an upper bound on the number of faults th atcan be detected. This step can be skipped if the isolation design described

next is included.The last thing to do is t o choose T to facilitate isolation. Assume there

are n f 5 nN faults, in directions f , 2 , . . . , f . Isolation design is done byfirst choosing a residual structure R. The two popular choices in Table 11.1are

R = eye(nf);
R = ones(nf) - eye(nf);

The transformation matrix is then chosen as the solution to the equation system

$$T N H_f \left(1_L \otimes \left(f^1\ f^2\ \cdots\ f^{n_f}\right)\right) = R,$$
$$W^T = T N.$$

Here $1_L$ is a vector of $L$ ones and $\otimes$ denotes the Kronecker product. That is, $1_L \otimes f^i$ is another way of writing $F_t^i$ when the fault magnitude is constant during the sliding window. In MATLAB notation, for $n_f = 3$ this is done by

T = R/(N*Hf*kron(ones(L,1),[f1 f2 f3]));


To summarize, we have the following algorithm:

Algorithm 11.1 Parity space change detection

Given: a state space model (11.1).
Design parameters: the sliding window size $L$ and the residual structure $R$.
Compute recursively:

1. The data vectors $Y_t$ and $U_t$ in (11.3), and the model matrices $\mathcal{O}$, $H_d$, $H_f$ in (11.4).

2. The null space $N$ of $(\mathcal{O}\ H_d)$, spanned by the last columns of $U$ corresponding to zero singular values. In MATLAB formalism, the transformation matrix giving residual structure $R$ is computed by:

[U,D,V] = svd([O Hd]);
n = rank(D);
N = U(:,n+1:end)';
T = R/(N*Hf*kron(ones(L,1),[f1 f2 f3]));
W = (T*N)';


3. Compute the residual $r = W^T(Y_t - H_u U_t)$:

r = W'*(Y - Hu*U);

4. Change detection if $r^T r > 0$, or $r^T r > h$ considering model uncertainties.

5. Change isolation: fault $i$ in direction $f^i$, where $i = \arg\max_i r^T R_i$, and $R_i$ denotes column $i$ of $R$.

[dum,i] = max(r'*R);

It should be noted that the residual structure $R$ is no real design parameter, but rather a tool for interpreting and illustrating the result.
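As a concrete illustration, the following minimal MATLAB sketch carries out steps 1-3 of Algorithm 11.1 for a second-order model with both states measured (the DC motor values of Section 11.5.1), $L = 2$, no disturbance ($n_d = 0$) and state faults ($B_f = I$). The stacked-matrix construction and the transposes are our reading of the algorithm, not verbatim book code.

A  = [1 0.3297; 0 0.6703];  Bu = [0.0703; 0.3297];
C  = eye(2);                Du = zeros(2,1);
L  = 2;
O  = [C; C*A];                           % extended observability matrix
Hu = [Du zeros(2,1); C*Bu Du];           % stacked input matrix
Hf = [zeros(2,4); C*eye(2) zeros(2,2)];  % stacked fault matrix, Bf = I, Df = 0
[U,D,V] = svd(O);                        % (O Hd) reduces to O when nd = 0
n  = rank(D);
N  = U(:,n+1:end)';                      % rows of N span the left null space
% Unstructured residual from stacked data Yt (4x1) and Ut (2x1):
% r = N*(Yt - Hu*Ut);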

11.2.2. Some rank results

Rank of null space

When does a parity equation exist? That is, when is the size of the null space $N$ different from zero? We can do some quick calculations to find the rank of $N$. We have

$$\mathrm{rank}(N) = n_y L - \mathrm{rank}(\mathcal{O}) - \mathrm{rank}(H_d), \qquad (11.9)$$
$$\mathrm{rank}(\mathcal{O}) = n_x, \qquad (11.10)$$
$$\mathrm{rank}(H_d) = n_d L, \qquad (11.11)$$
$$\Rightarrow\ \mathrm{rank}(N) = L(n_y - n_d) - n_x. \qquad (11.12)$$


It is assumed that the ranks are determined by the column spaces, which is the case if the number of outputs is larger than the number of disturbances (otherwise condition (11.6) can never be satisfied). In (11.9) it is assumed that the column spaces of $\mathcal{O}$ and $H_d$ do not overlap; otherwise the rank of $N$ will be larger. That is, a lower bound on the rank of the null space is $\mathrm{rank}(N) \ge L(n_y - n_d) - n_x$.

The calculations above give a condition for detectability. For isolability, compute $N_i$ for each fault $f^i$. Isolability is implied by the two conditions $N_i \neq 0$ for all $i$, and that $N_i$ is not parallel to $N_j$ for all $i \neq j$.
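A quick numerical sanity check of (11.12) is straightforward; the dimensions below are illustrative and not tied to any particular example in the text:

ny = 2; nd = 0; nx = 2; L = 2;
rankN = L*(ny - nd) - nx;      % expected null space dimension, (11.12)
parity_exists = rankN > 0;     % parity equations exist when this is true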

Requirement for isolation

Let

$$\bar H_f = H_f \left(1_L \otimes \left(f^1\ f^2\ \cdots\ f^{n_f}\right)\right).$$

Now we can write the residuals as

$$r_t = T N \bar H_f\, \bar f_t.$$

The faults can be isolated using the residuals if and only if the matrix $N \bar H_f$ is of rank $n_f$,

$$\mathrm{rank}(N \bar H_f) = n_f.$$

In the case of the same number of residuals as faults, we simply take $\hat{\bar f}_t = (T N \bar H_f)^{-1} r_t$. The design of the transformation matrix to get the required fault structure then also has a unique solution,

$$T = (N \bar H_f)^{-1} R.$$

It should be remarked, however, that the residual structure is cosmetics, which may be useful for monitoring purposes only. The information is available in the residuals with or without the transformation.

Minimal order residual filters

We discuss why window sizes larger than $L = n_x$ do not need to be considered. The simplest explanation is that a state observer does not need to be of a higher dimension (the state comprises all of the information about the system, thus also detection and isolation information). Now, so-called Luenberger observers can be used to further decrease the order of the observer. The idea here is that the known part of the state from each measurement can be updated exactly.


A similar result exists here as well, of course. A simple derivation goes as follows. Take a QR factorization of the basis for the null space,

$$N = QR.$$

It is clear that by using $T = Q^T$ we get the residuals

$$r_t = T N (Y_t - H_u U_t) = Q^T Q R (Y_t - H_u U_t) = R (Y_t - H_u U_t).$$

The matrix R looks like the following:

(Schematic of the upper triangular matrix $R$, with column blocks of dimensions $n_d$ and $n_x$ indicated.)

The matrix has zeros below the diagonal, and the numbers indicate dimensions. We just need $n_f$ residuals for isolation, which can be taken as the last $n_f$ rows of $r_t$ above. When forming these $n_f$ residuals, a number of elements in the last rows of $R$ are zero. Using the geometry in the figure above, the filter order is given by

$$\frac{L n_y - \left(L(n_y - n_d) - n_x - n_f\right)}{n_y} = \frac{L n_d + n_x + n_f}{n_y}. \qquad (11.13)$$

This number must be rounded upwards to get an integer number of measurement vectors. See Section 11.5.3 for an example.
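A sketch of this construction in MATLAB, assuming the null space basis N (rows spanning the left null space) and the number of faults nf are available from the design above:

[Q,R] = qr(N);                 % N = Q*R, Q orthogonal, R upper triangular
T     = Q';                    % with this T, the residuals are r = R*(Yt - Hu*Ut)
Wmin  = R(end-nf+1:end,:)';    % last nf rows give the minimal order residuals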

11.2.3. Sensitivity and robustness

The design methods presented here are based on purely algebraic relations. It turns out, from examples, that the residual filters are extremely sensitive to measurement noise and lack robustness to modeling errors. Consider, for example, the case of measurement noise only, and the case of no fault:

$$r_t = W^T (Y_t - H_u U_t) = W^T E_t.$$

Here $E_t$ is the stacked vector of measurement noises, which has covariance matrix

$$\mathrm{Cov}(E_t) = I_L \otimes R.$$


Here $\otimes$ denotes the Kronecker product. The covariance matrix of the residuals is given by

$$\mathrm{Cov}(r_t) = W^T \mathrm{Cov}(E_t) W = W^T (I_L \otimes R) W. \qquad (11.14)$$

The reason this might blow up is that the components of $W$ might become several orders of magnitude larger than one, thus magnifying the noise. See Section 11.5.3 for an example. In that section, examples of both measurement noise and modeling error are given, which show that a small measurement noise or a small system change can give very 'noisy' residuals, due to large elements in $W$.
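Both (11.14) and the dynamic range of $W$ are easily evaluated numerically; in the sketch below, W, the measurement noise covariance Rn and the window size L are assumed to be given from the design:

covr = W'*kron(eye(L), Rn)*W;  % Cov(r) = W'(I_L kron Rn)W, cf. (11.14)
wnz  = abs(nonzeros(W));       % magnitudes of the filter coefficients
amp  = max(wnz)/min(wnz);      % dynamic range; large values indicate noise amplification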

11.2.4. Uncontrollable fault states

A generalization of this approach to the case where the state space model has states ('fault states') that are not controllable from the input $u_t$ and disturbance $d_t$ is presented in Nyberg and Nielsen (1997). Suppose the state space model can be written with a partitioned state $x = (x^1, x^2)$, where $x^2$ is not controllable from the input and disturbance. Now we can split

$$\mathcal{O} = (\mathcal{O}^1\ \ \mathcal{O}^2)$$

in an obvious manner. The steps (11.6) and (11.7) can now be replaced by

$$W^T (\mathcal{O}^1\ \ H_d) = 0,$$
$$W^T (\mathcal{O}^2\ \ H_f) \neq 0.$$

The reason is that we do not need to decouple the part of the state vector which is not excited by $u$, $d$. In this way, the number of candidate residuals, that is, the number of rows in $W^T$, is much larger than for the original design method. First, the null space of $(\mathcal{O}^1\ H_d)$ is larger than that of $(\mathcal{O}^1\ \mathcal{O}^2\ H_d)$, and secondly, it is less likely that the null space is orthogonal to $(\mathcal{O}^2\ H_f)$ than to $H_f$ in the original design. Intuitively, the 'fault states' $x^2$ are unaffected by input and disturbance decoupling and are observable from the output, which facilitates detection and isolation.


11.2.5. Open problems

The freedoms in the design are the sliding window size $L$ and the transformation matrix $T$ in $W^T = T N$. The trade-off in $L$ in statistical methods does not appear in this noiseless setting. However, as soon as a small measurement noise is introduced, longer window sizes seem to be preferred from a noise rejection point of view.

• In some applications, it might be interesting to force certain columns in $W^T$ to zero, and in that way remove the influence from specific sensors or actuators. This is the idea of unknown input observers and dedicated observers.

• From a measurement noise sensitivity viewpoint, a good design gives elements in $W$ of the same order. The examples in the next section show that the elements of $W$ might differ several orders of magnitude from a straightforward design. When $L$ is further increased, the average size of the elements will decrease, and the central limit theorem indicates that the noise attenuation will improve.

11.3. An observer approach

Another approach to residual generation is based on observers. The following facts are important here:

• We know that the state of a system by definition contains all information about the system, and thus also the faults.

• There is no better (linear) filter than the observer for computing the states.

• The observer does not have to be of higher order than the system order $n_x$. The so-called dead-beat observer can be written as

$$\hat x_t = C_y(q)\, y_t + C_u(q)\, u_t,$$

where $C_y(q)$ and $C_u(q)$ are polynomials of order $n_x$. An arbitrary observer polynomial can be introduced to attenuate disturbances.

This line of argument indicates that any required residual can be computed by linear combinations of the states estimated by a dead-beat observer,

$$r_t = L \hat x_t = A(q)\, y_t + B(q)\, u_t.$$

That is, the residual can be generated by an FIR filter of order $n_x$. This also indicates that the largest sliding window which needs to be considered


is $L = n_x$. This fact is formally proved in Nyberg and Nielsen (1997). The dead-beat observer is sensitive to disturbances. The linear combination $L$ can be designed to decouple the disturbances, and we end up at essentially the same residual as from a design using parity spaces in Section 11.2. More specifically, let

$$r_t = L \hat x_t$$

be our candidate residual. Clearly, this is zero when there is no disturbance or fault and when the initial transient has passed away. If the observer states were replaced by the true states, then we would have $r_t = L x_t$. In that case, we could choose $L$ such that the disturbance is decoupled, by requiring that $L$ annihilates the disturbance directions, $L(B_d\ \ A B_d\ \cdots) = 0$. However, the observer dynamics must be included, and the design becomes a bit involved. The bottom line is that the residuals can be expressed as a function of the observer state estimates, which are given by a filter of order $n_x$.

11.4. An input-output approach

An input-output approach (see Frisk and Nyberg (1999)) to residual generation is as follows. Let the undisturbed fault-free system be

$$y_t = G(q)\, u_t.$$

A residual generator may be taken as

$$r_t = M(q) \begin{pmatrix} y_t \\ u_t \end{pmatrix},$$

which should be zero when no fault or disturbance is present. $M(q)$ must then belong to the left null space of $\left(G^T(q)\ \ I\right)^T$. The dimension of this null space, and thus the dimension of $r_t$, is $n_y$ if there are no disturbances ($n_d = 0$), and less otherwise.

The order of the residual filter is $L$, so this design should give exactly the same degrees of freedom as using the parity space.


11.5. Applications

One of the most important applications for fault diagnosis is automotive engines (Dinca et al., 1999; Nyberg, 1999; Soliman et al., 1999). An application to an unmanned underwater vehicle is presented in Alessandri et al. (1999). Here we will consider systems well approximated with a linear model, in contrast to an engine, for instance. This enables a straightforward design and facilitates evaluation.

11.5.1. Simulated DC motor

Consider a sampled state space model of a DC motor with continuous time transfer function

$$G(s) = \frac{1}{s(s+1)},$$

sampled with a sample interval $T = 0.4\,\mathrm{s}$. This is the same example used throughout Chapter 8, and the fault detection setup is the same as in Example 9.3.

The state space matrices, with $x^1$ being the angle and $x^2$ the angular velocity, are

$$A = \begin{pmatrix} 1 & 0.3297 \\ 0 & 0.6703 \end{pmatrix}, \quad B_u = \begin{pmatrix} 0.0703 \\ 0.3297 \end{pmatrix}, \quad B_d = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \quad B_f = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix},$$
$$C = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad D_u = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \quad D_d = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \quad D_f = \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}.$$

It is assumed that both $x^1$ and $x^2$ are measured. Here we have assumed that a fault enters as either an angular or a velocity change in the model. The matrices in the sliding window model (11.3) become, for $L = 2$:

$$(\mathcal{O}\ \ H_u) = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 1 & 0.3297 & 0.0703 & 0 \\ 0 & 0.6703 & 0.3297 & 0 \end{pmatrix}, \qquad N = \begin{pmatrix} -0.6930 & -0.1901 & 0.6930 & -0.0572 \\ -0.0405 & 0.5466 & 0.0405 & -0.8354 \end{pmatrix}.$$

State faults

A simulation study is performed, where first an angle fault is simulated, followed by the second kind of fault in angular velocity. The residuals using $W^T = N$ are shown in the upper plot in Figure 11.1.


Figure 11.1. Residuals for sliding window $L = 2$. The upper plot shows unstructured residuals and the lower plot structured residuals according to the left table in Table 11.1.

The null space is transformed to get structured residuals according to the pattern in Table 11.1. That is, a fault in angle will show up as a non-zero residual $r_t^1$, but the second residual $r_t^2$ remains zero. The data projection matrix becomes

$$W^T = \begin{pmatrix} -1 & -0.3297 & 1 & 0 \\ 0 & -0.6703 & 0 & 1 \end{pmatrix}.$$

One question is what happens if the sliding window size is increased. Figure 11.2 shows the structured and unstructured residuals. One conclusion is that the null space increases linearly as $L n_y$, but the rank of $H_f$ also increases as $L n_y$, so the number of residual candidates does not change. That is, a larger window size than $L = n_x$ does not help us in the diagnosis.

Actuator and sensor faults

A more realistic fault model is to assume actuator and (for instance angular) sensor offsets. This means that we should use

$$B_f = (B_u\ \ 0), \qquad D_f = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}.$$

In this way, $f_t = (a_t, 0)^T$ means an actuator offset of size $a_t$, and $f_t = (0, a_t)^T$ means an angular sensor offset. However, these two faults are not possible to


Figure 11.2. Residuals for sliding window $L = 3$ and $L = 4$, respectively. Upper plots show unstructured residuals and lower plots structured residuals according to the left table in Table 11.1.

isolate, though they are possible to detect. The only matrix that is influenced by the changed fault assumptions is

$$H_f = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0.0703 & 0 & 0 & 1 \\ 0.3297 & 0 & 0 & 0 \end{pmatrix}.$$

Here we get a problem, because the second fault (columns 2 and 4 in $H_f$) is parallel with the first column in $\mathcal{O}$. That is, an angular disturbance cannot be distinguished from the influence of the initial filter states on the angle.

Increasing the window to L = 3 does not help:

$$H_f = \begin{pmatrix} 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0.0703 & 0 & 0 & 1 & 0 & 0 \\ 0.3297 & 0 & 0 & 0 & 0 & 0 \\ 0.1790 & 0 & 0.0703 & 0 & 0 & 1 \\ 0.2210 & 0 & 0.3297 & 0 & 0 & 0 \end{pmatrix}.$$

11.5.2. DC motor

Consider the DC motor lab experiment described in Sections 2.5.1 and 2.7.1. It was examined with respect to system changes in Section 5.10.2, and a statistical approach to disturbance detection was presented in Section 8.12.1. Here we apply the parity space residual generator to detect the torque disturbances


while being insensitive to system changes. The test cases in Table 2.1 are considered. The state space model is given in (2.2).

From Figure 11.3, we conclude the following:

• The residual gets a significant injection at the time of system changes and disturbances.

• These times coincide with the alarms from the Kalman filter residual whiteness test. The latter method seems to be easier to use and more robust, due to the clear peaks of the test statistics.

• It does not seem possible to solve the fault isolation problem reliably with this method. See Gustafsson and Graebe (1998) for a change detection method that works for this example.


Figure 11.3. Simulation error using the state space model in (2.2), test statistics from the CUSUM whiteness test of the Kalman filter for comparison, and parity space residuals. Nominal system (a), with torque disturbances (b), with change in dynamics (c) and both disturbance and change (d).


11.5.3. Vertical aircraft dynamics

We investigate here fault detection in the vertical-plane dynamics of an F-16 aircraft. The F-16 aircraft is, due to accurate public models, used in many simulation studies. For instance, Eide and Maybeck (1995, 1996) use a full scale non-linear model for fault detection. Related studies are Mehra et al. (1995), where fault detection is compared to a Kalman filter, and Maybeck and Hanlon (1995).

The dynamics can be described by a transfer function $y_t = G(q) u_t$ or by a state space model. These models are useful for different purposes, just as for the DC motor in Sections 2.5.1 and 2.7.1. One difference here is that the dynamics generally depend upon the operating point.

The states, inputs and outputs for this application are:

States:                       Inputs:                           Outputs:
x1: altitude [m]              u1: spoiler angle [0.1 deg]       y1: relative altitude [m]
x2: forward speed [m/s]       u2: forward acceleration [m/s^2]  y2: forward speed [m/s]
x3: pitch angle [deg]         u3: elevator angle [deg]          y3: pitch angle [deg]
x4: pitch rate [deg/s]
x5: vertical speed [deg/s]

The numerical values below are taken from Maciejowski (1989) (given in continuous time), sampled with 10 Hz:

$$A = \begin{pmatrix} 1 & 0.0014 & 0.1133 & 0.0004 & -0.0997 \\ 0 & 0.9945 & -0.0171 & -0.0005 & 0.0070 \\ 0 & 0.0003 & 1.0000 & 0.0957 & -0.0049 \\ 0 & 0.0061 & -0.0000 & 0.9130 & -0.0966 \\ 0 & -0.0286 & 0.0002 & 0.1004 & 0.9879 \end{pmatrix},$$

$$B_u = \begin{pmatrix} -0.0078 & 0.0000 & 0.0003 \\ -0.0115 & 0.0997 & 0.0000 \\ 0.0212 & 0.0000 & -0.0081 \\ 0.4150 & 0.0003 & -0.1589 \\ 0.1794 & -0.0014 & -0.0158 \end{pmatrix}, \qquad B_d = \begin{pmatrix} 0 \\ 1 \\ 0 \\ 0 \\ 0 \end{pmatrix}, \qquad B_f = \left(B_u\ \ 0_{5\times 3}\right),$$

$$C = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \end{pmatrix}, \qquad D_f = \begin{pmatrix} 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{pmatrix}.$$

The disturbance is assumed to act as an additive term to the forward speed.


Fault detection

The fault model is one component $f^i$ for each possible actuator and sensor fault. In other words, in the input-output domain we have

$$y_t = G(q)\left(u_t + \begin{pmatrix} f_t^1 \\ f_t^2 \\ f_t^3 \end{pmatrix}\right) + \begin{pmatrix} f_t^4 \\ f_t^5 \\ f_t^6 \end{pmatrix}.$$

Using $L = 5$ gives a null space $N$ being a $6 \times 15$ matrix. A QR factorization $N = QR$ gives an upper triangular matrix $R$, whose last row has nine zeros and six non-zero elements,

$$R^{(15)} = \left(0_{1\times 9},\ -0.0497,\ 0.7035,\ 0.0032,\ 0.0497,\ 0.7072,\ 0.033\right).$$

Since the orthogonal matrix $Q$ cannot make any non-zero vector zero, we can use $W^T = R$. More specifically, we can take the last row of $R$ to compute a residual useful for change detection. This residual will only use the current and past measurements of $y$, $u$, so the minimal order residual filter is thus of order two.

Isolation

Suppose we want to detect and isolate faults in $f^1$ and $f^6$. We use the structural matrices

The projection matrix $W^T$ becomes a $2 \times 15$ numerical matrix whose elements span several orders of magnitude.

The largest element is 132 (compare with (11.14)). One can also note that the ratio of the largest and smallest elements is

$$\frac{\max |W|}{\min |W|} = 8023.$$


Thus, the filter coefficients in $W$ have a large dynamic range, which might indicate numerical difficulties.

A simulation gives the residuals in Figure 11.4. The first plot shows unstructured residuals and the second the structured ones. The transient caused by the filter looks nastier here, largely due to the higher order filters.

Isolation using minimum order filters

The transient problem can be improved on by using minimum order residual filters. Using the QR factorization of $N$ above, and only using the last two rows of $R$, we get

a new $2 \times 15$ projection matrix $W^T$, now with zeros in the leading columns so that only the most recent measurements enter the residuals.

Here the largest element is 593, and the ratio of the largest and smallest elements is 183666. That is, the price paid for the minimal filter is much larger elements in $W$, and thus, according to (11.14), a much larger residual variance. This explains the sensitivity to noise that will be pointed out below.

The order of the filter is 3, since the last 7 components of $Y_t$ correspond to three measurements (here only the third component of $y_{t-2}$ is used). The result is consistent with the formula (11.13),

$$\frac{L n_d + n_x + n_f}{n_y} = \frac{0 + 5 + 2}{3} = \frac{7}{3},$$

rounded upwards to 3.

The third plot in Figure 11.4 shows how the transients now only last for two samples, which will allow quicker diagnosis.

Sensitivity to noisy measurements

Suppose we add very small Gaussian noise to the measurements,

$$\bar y_t = y_t + e_t, \qquad \mathrm{Cov}(e_t) = R.$$

The residuals in Figure 11.4 are now shown in Figure 11.5 for $R = 10^{-6} I$. This level of the noise gives an SNR of $5 \cdot 10^6$. We see that the residual filter is very sensitive to noise, especially the minimal order filter.

Exactly as for other change detection approaches based on sliding windows, the noise attenuation improves when the sliding window increases, at the cost of a longer delay for detection and isolation. Figure 11.5 also shows the residuals in the case when the window length is increased to $L = 10$.


Figure 11.4. Residuals for sliding window $L = 5$ for the aircraft model. The upper plot shows unstructured residuals, and the middle plot structured residuals according to the left table in Table 11.1. The lower plot shows minimal order residuals.


Figure 11.5. Illustration of the sensitivity to measurement noise. Residuals for sliding window $L = 5$ and $L = 10$, respectively, for the aircraft model. The upper plots show unstructured residuals, and the middle plots structured residuals according to the left table in Table 11.1. The lower plots show minimal order residuals.


Figure 11.6. Illustration of the robustness to model errors. A design based on the correct model used in the simulation gives structured residuals as in the first subplot. The other two show the structured residuals when the A(1,2) element is increased 1%, for sliding window sizes $L = 5$ and $L = 10$, respectively.

Robustness to incorrect model

The noise sensitivity indicates that there might be robustness problems as well. This is illustrated by changing one of the parameters in the example by 1%. This is done by letting $A(1,2) = 1.15$ in the design instead of $A(1,2) = 1.1332$ as used when simulating the data. The result in Figure 11.6 shows that small model errors can lead to 'noisy' residuals, and decision making becomes hazardous. Again, improved performance is obtained by increasing the window size, at the cost of an increased delay for detection.


Part V: Theory


12. Evaluation theory

12.1. Filter evaluation ................................ 427
  12.1.1. Filter definitions ............................ 427
  12.1.2. Performance measures ......................... 428
  12.1.3. Monte Carlo simulations ...................... 429
  12.1.4. Bootstrap .................................... 432
  12.1.5. MCMC and Gibbs sampler ....................... 437
12.2. Evaluation of change detectors ................... 439
  12.2.1. Basics ....................................... 439
  12.2.2. The ARL function ............................. 441
12.3. Performance optimization ......................... 444
  12.3.1. The MDL criterion ............................ 445
  12.3.2. Auto-tuning and optimization ................. 450

12.1. Filter evaluation

12.1.1. Filter definitions

Consider a general linear filter (or estimator)

$$\hat x_t = \sum_{i=t_1}^{t_2} a_{t,i}\, z_{t-i}, \qquad (12.1)$$

as illustrated in Figure 12.1. Typically, the estimated quantity $x_t$ is either the parameter vector $\theta$ in a parametric model, or the state in a state space model. It might also be the signal component $s_t$ of the measurement $y_t = s_t + v_t$. The measurements $z_t$ consist of the measured outputs $y_t$ and, when appropriate, the inputs $u_t$.

The basic definitions that will be used for a linear filter are:

• A time-invariant filter has $a_{t,i} = a_i$ for all $t$ and $i$.

• A non-causal filter has $t_1 < 0$. This is used in smoothing. If $t_1 \ge 0$, the filter is causal.


Figure 12.1. An estimator takes the observed signal $z_t$ and transforms it to estimates $\hat x_t$.

• A causal filter is IIR (Infinite Impulse Response) if $t_2 = \infty$, otherwise it is FIR (Finite Impulse Response).

• $\hat x_t$ is a $k$-step ahead prediction if $t_1 = k > 0$.

Most filters use all past data, so $t_2 = \infty$, and the notation is sometimes useful for highlighting that the estimator is a predictor ($t_1 > 0$), a smoother ($t_1 < 0$) or a filter ($t_1 = 0$).

12.1.2. Performance measures

The estimator gives a so-called point estimate of $x$. For evaluation, we are interested in the variability between different realizations:

• One such measure is the covariance matrix

$$P_t = \mathrm{E}\left(\hat x_t - x_t^0\right)\left(\hat x_t - x_t^0\right)^T. \qquad (12.2)$$

Here and in the sequel, the super-index $0$ means the true value. Such a covariance matrix is provided by the Kalman filter, for instance, but it reflects the true variability only if the underlying signal model is correct, including its stochastic assumptions.

• A scalar measure of performance is often preferable when evaluating different filters, such as the square root of the mean value of the norm of the estimation error,

$$\sqrt{\mathrm{E}\,\|\hat x_t - x_t^0\|_2^2} = \sqrt{\mathrm{tr}\, P_t}, \qquad (12.3)$$

where the subindex 2 stands for the 2-norm, and tr for trace, which is the sum of the diagonal elements of the matrix argument. This is a measure of the length of the estimation error. One can think of it as the standard deviation of the estimation error.

• Sometimes the second order properties are not enough, and the complete Probability Density Function (PDF) needs to be estimated.


• Confidence intervals for the parameters and hypothesis tests are other related applications of variability measures.

Monte Carlo simulations offer a means for estimating these measures. Before proceeding, let us consider the causes of variability in more detail:

• Variance error is caused by the variability of different noise realizations, which gives a variability in $\hat x_t - \mathrm{E}(\hat x_t)$.

• Bias and tracking error are caused by an error in the signal model (bias) and the inability of the filter to track fast variations in $x_t^0$ (tracking error), respectively. The combined effect is a deviation in $x_t^0 - \mathrm{E}(\hat x_t)$. Determination of the bias error is in general a difficult task and will not be discussed further in this section. We will assume that there is no bias, only tracking error.

Hence, the covariance matrix can be seen as a sum of two terms,

$$\mathrm{E}\|\hat x_t - x_t^0\|_2^2 = \underbrace{\mathrm{E}\|\hat x_t - \mathrm{E}(\hat x_t)\|_2^2}_{\text{variance error}} + \underbrace{\|\mathrm{E}(\hat x_t) - x_t^0\|_2^2}_{\text{bias and tracking error}}. \qquad (12.4)$$

12.1.3. Monte Carlo simulations

Suppose we can generate $M$ realizations of the data $z_t$, by means of simulation or data acquisition under the same conditions. Denote them by $z_t^{(j)}$, $j = 1, 2, \dots, M$. Apply the same estimator (12.1) to all of them, giving estimates $\hat x_t^{(j)}$.

From these $M$ sets of estimates, we can estimate all kinds of statistics. For instance, if the dimension of $x$ is only one, we can make a histogram over $\hat x_t$, which graphically approximates the PDF.

The scalar measure (12.3) is estimated by the Root Mean Square Error (RMSE)

$$\mathrm{RMSE}(t) = \sqrt{\frac{1}{M} \sum_{j=1}^{M} \left\|\hat x_t^{(j)} - x_t^0\right\|_2^2}. \qquad (12.6)$$

This equation comprises both tracking and variance errors. If the true value $x_t^0$ is not known for some reason, or if one wants to measure only the variance error, $x_t^0$ can be replaced by its Monte Carlo mean, while at the same time


changing $M$ to $M-1$ in the normalization (to get an unbiased estimate of the standard deviation), and we get

$$\mathrm{RMSE}(t) = \sqrt{\frac{1}{M-1} \sum_{j=1}^{M} \Big\|\hat x_t^{(j)} - \frac{1}{M}\sum_{k=1}^{M} \hat x_t^{(k)}\Big\|_2^2}. \qquad (12.7)$$

Equation (12.6) is an estimate of the standard deviation of the estimation error norm at each time instant. A scalar measure for the whole data sequence is

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{t=1}^{N} \mathrm{RMSE}^2(t)}. \qquad (12.8)$$

Note that the time average is placed inside the square root, in order to get an unbiased estimate of the standard deviation. Some authors propose to time average the RMSE(t), but this is not an estimate of the standard deviation of the error anymore.¹
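A self-contained MATLAB sketch of the Monte Carlo RMSE computation in (12.6) and (12.8), using a toy problem (a running-mean estimator of a constant) purely for illustration:

N = 100; M = 250; x0 = 1;              % data length, MC runs, true value
err = zeros(N,M);
for j = 1:M
  y = x0 + randn(N,1);                 % one measurement realization
  xhat = cumsum(y)./(1:N)';            % running-mean estimate of x0
  err(:,j) = xhat - x0;
end
RMSEt = sqrt(mean(err.^2,2));          % RMSE(t) as in (12.6)
RMSE  = sqrt(mean(RMSEt.^2));          % time average inside the root, (12.8)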

Example 12.1 Signal estimation

The first subplot in Figure 12.2 shows a signal, which includes a ramp change, and a noisy measurement $y_t = s_t + e_t$ of it. To recover the signal from the measurements, a low-pass filter of Butterworth type is applied. Using many different realizations of the measurements, the Monte Carlo mean is illustrated in the last two subplots, where an estimated confidence bound is also marked. This bound is the $\pm\sigma$ level, where $\sigma$ is replaced by RMSE(t).

In some applications, including safety critical ones, it is the peak error that is crucial,

$$\mathrm{PEAK}(t) = \max_j \left\|\hat x_t^{(j)} - x_t^0\right\|_2. \qquad (12.9)$$

From this, a scalar peak measure can be defined as the total peak (max over time) or a time RMSE (time average peak value). Note that this measure is non-decreasing with the number of Monte Carlo runs, so the absolute value should be interpreted with a little care.

¹ According to Jensen's inequality, $\mathrm{E}\sqrt{X} \le \sqrt{\mathrm{E}\,X}$, so this incorrect procedure would produce an under-biased estimate of the standard deviation. In general, standard deviations should not be averaged. Compare the outcomes from mean(std(randn(10,10000))) with sqrt(mean(std(randn(10,10000)).^2)).



Figure 12.2. The first plot shows the signal (dashed line) and the measurements (solid line) from one realization. The second and third plots show the Monte Carlo mean estimate and the one-sigma confidence interval using the Monte Carlo standard deviation, for 50 and 1000 Monte Carlo runs, respectively.

The scalar performance measures can be used for auto-tuning. Suppose the filter is parameterized in a scalar design parameter. As argued in Chapter 1, all linear filters have such a scalar to trade off variance and tracking errors. Then we can optimize the filter design with respect to this measure, for instance by using a line search. The procedure can be generalized to non-scalar design parameters, at the cost of using more sophisticated and computer intensive optimization routines.

Example 12.2 Target tracking

Consider the target tracking example in Example 1.1. For a particular filter (better tuned than the one in Example 1.1), the filter estimates lie on top of the measurements, so Figure 12.3(a) is not very practical for filter evaluation, because it is hard to see any error at all.

The RMSE position error in Figure 12.3(b) is a much better tool for evaluation. Here we can clearly see the three important phases in adaptive filtering: the transient (for sample numbers 1-7), the variance error (8-15 and 27-32) and the tracking error (16-26 and 35-45).

All in all, as long as it is possible to generate several data sets under the same premises, Monte Carlo techniques offer a solution to filter evaluation and design. However, this might not be possible, for instance when collecting



Figure 12.3. Radar trajectory (circles) and filter estimated positions (crosses) in (a). The RMSE position error in (b).

data during one single test trial, where it is either impossible to repeat the same experiment, or too expensive. One can then try resampling techniques, as described in the next subsections.

• Bootstrap offers a way to reorder the filter residuals and make artificial simulations that are similar to Monte Carlo simulations. The procedure includes iterative filtering and simulation.

• Gibbs resampling is useful when certain marginal distributions can be formulated mathematically. The solution consists in iterative filtering and random number generation.

12.1.4. Bootstrap

Static case

As a non-dynamic example of the bootstrap technique, consider the case of estimating distribution parameters in a sequence of independent identically distributed (i.i.d.) stochastic variables,

$$y_t \sim p_\theta(y),$$

where $\theta$ are parameters in the distribution. For instance, consider the location $\theta$ and scale $\sigma_y$ parameters in a scalar Gaussian distribution. We know from Section 3.5.2 how to estimate them, and there are also quite simple theoretical expressions for the uncertainty in the estimates. In more general cases, it might be difficult, if not impossible, to compute the sample variance of the point


estimate, since most known results are asymptotic. As detailed in Section 12.1.3, the standard approach in a simulation environment is to perform Monte Carlo simulations. However, real data sets are often impossible or too costly to reproduce under identical conditions, and there could be too few data points for the asymptotic expressions to hold.

The key idea in bootstrap is to produce new artificial data sets by picking samples at random from the set $y^N$ with replacement. That is, from the measured values 4, 2, 3, 5, 1 we might generate 2, 2, 5, 1, 5. Denote the new data set by $(y^N)^{(i)}$. Each new data set is used to compute a point estimate $\hat\theta^{(i)}$. Finally, these estimates are treated as independent outcomes from Monte Carlo simulations, and we can obtain different variability measures such as standard deviation and confidence intervals.
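A minimal MATLAB sketch of this resampling step, under the same Gaussian setup as Example 12.3 below (the variable names are ours):

y = 10 + 5*randn(10,1);            % original sample from N(10,25)
B = 1000;                          % number of bootstrap data sets
thetab = zeros(B,1);
for i = 1:B
  yb = y(ceil(10*rand(10,1)));     % draw 10 indices with replacement
  thetab(i) = mean(yb);            % point estimate for this bootstrap set
end
Pboot = var(thetab);               % bootstrap estimate of Var(theta-hat)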

Example 12.3 Bootstrap

Following the example in Zoubir and Boashash (1998), we generate 10 samples from $\mathrm{N}(10, 25)$. We want to estimate the mean $\theta$ and its variance $P$ in the unknown distribution for the measurements $y_t$. The point estimate of $\theta$ is

$$\hat\theta = \frac{1}{10} \sum_{t=1}^{10} y_t,$$

and a point estimate of its variance $P = \mathrm{Var}(\hat\theta)$ is

$$\hat P = \frac{1}{10} \cdot \frac{1}{10} \sum_{t=1}^{10} \left(y_t - \hat\theta\right)^2.$$

We can repeat this experiment a number of times (here 20) in a Monte Carlo simulation, and compute the variability of $\hat\theta$ in the different experiments as the Monte Carlo estimate of the variance, or we can use the 10 data we have and use bootstrap to generate Monte Carlo like data. Note that the theoretical value is known in this simulation to be $25/10 = 2.5$. The result is as follows:

Statistics   Point estimate   Monte Carlo   Bootstrap   Theoretical
E(θ̂)         9.2527           10.0082       9.3104      10
Var(θ̂)       2.338            2.1314        2.3253      2.5

The result is encouraging, in that a good estimate of variability is obtained (pretend that the theoretical value was not available). A new realization of random numbers gives a much worse measure of variability:


(Corresponding table for the new realization: the legible entries are 10.8114, 9.9806 and 8.7932, against the theoretical value 10; the column assignment and the variance row are not recoverable from the transcript.)

Finally, one might ask what the performance is on average. Twenty realizations of the tables above are generated, and the mean values of variability are:

Statistics   Point estimate   Monte Carlo   Bootstrap   Theoretical
Var(θ̂)       2.65             2.53          2.40        2.5

In conclusion, bootstrap offers a good alternative to Monte Carlo simulations or analytical point estimates.

From the example, we can conclude the following:

• The bootstrap result is as good as the natural point estimate of variability.

• The result very much depends upon the realization.

That is:

Bootstrap cannot create more information than is contained in the measurements, but it can compute variability measures numerically, which are as good as analytically derived point estimates.

There are certain applications where there is no analytical expression for the point estimate of variability. As a simple example, the variance of the sample median is very hard to compute analytically, and thus a point estimator of the median variance is also hard to find.

Dynamic time invariant case

For more realistic signal processing applications, we first start by outlining the time-invariant case. Suppose that the measured data are generated by a dynamical model parametrized by a parameter $\theta$,

$$y_t = f(y_{t-1}, y_{t-2}, \dots, u_{t-1}, \dots, e_t;\ \theta).$$

The bootstrap idea is now as follows:


1. Compute a point estimate $\hat\theta$ from the original data set $\{y_t, u_t\}_{t=1}^N$.

2. Apply the inverse model (filter) and get the noise (or, more precisely, residual) sequence $\hat e^N$.

3. Generate bootstrap noise sequences $(\hat e^N)^{(i)}$ by picking random samples from $\hat e^N$ with replacement.

4. Simulate the estimated system with these artificial noise sequences:

$$y_t^{(i)} = f\left(y_{t-1}^{(i)}, y_{t-2}^{(i)}, \dots, u_{t-1}, \dots, \hat e_t^{(i)};\ \hat\theta\right).$$

5. Compute point estimates $\hat\theta^{(i)}$, and treat them as independent outcomes from Monte Carlo simulations.

Example 12.4 Bootstrap

Consider the auto-regressive (AR(1)) model

$$y_t + \theta\, y_{t-1} = e_t, \qquad \theta = 0.7.$$

$N = 10$ samples are simulated. From these, the ML estimate of the parameter $\theta$ in the AR model $y_t + \theta y_{t-1} = e_t$ is computed. (In MATLAB this is done by -y(1:N-1)\y(2:N);.) Then, compute the residual sequence

$$\hat e_t = y_t + \hat\theta\, y_{t-1}, \qquad t = 2, \dots, N.$$

Generate bootstrap sequences $(\hat e^N)^{(i)}$ and simulate new data by

$$y_t^{(i)} + \hat\theta\, y_{t-1}^{(i)} = \hat e_t^{(i)}.$$

Finally, estimate $\theta$ for each sequence $(y^N)^{(i)}$.
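A sketch of the whole procedure in MATLAB, assuming the true parameter $\theta = 0.7$ and the sign convention $y_t + \theta y_{t-1} = e_t$ used above:

N = 10; B = 1000; theta0 = 0.7;
y = filter(1,[1 theta0],randn(N,1));     % simulate y_t + theta0*y_{t-1} = e_t
th = -(y(1:N-1)\y(2:N));                 % ML/LS point estimate
res = y(2:N) + th*y(1:N-1);              % residual sequence
thb = zeros(B,1);
for i = 1:B
  eb = res(ceil((N-1)*rand(N-1,1)));     % resample residuals with replacement
  yb = filter(1,[1 th],[y(1); eb]);      % simulate with the estimated model
  thb(i) = -(yb(1:N-1)\yb(2:N));         % bootstrap estimate
end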

A histogram over 1000 bootstrap estimates is shown in Figure 12.4. For comparison, the Monte Carlo estimates are used to approximate the true PDF of the estimate. Note that the PDF of the estimate is well predicted using only 10 samples and bootstrap techniques.

The table below summarizes the accuracy of the point estimates for the different methods:

Statistics   Point estimate   Monte Carlo   Bootstrap   Theoretical
E(θ̂)         0.57             0.60          0.57        0.7
Std(θ̂)       0.32             0.27          0.30        0.23



Figure 12.4. Histograms for point estimates of an AR parameter using (a) Monte Carlo simulations and (b) bootstrap.

As in the previous example, we can average over, say, 20 tables to average out the effect of the short data realization of only 10 samples on which the bootstrap estimate is based. The table below shows that bootstrap gives almost as reliable an estimate as a Monte Carlo simulation:

Statistics   Point estimate   Monte Carlo   Bootstrap   Theoretical
Std(θ̂)       0.2              0.28          0.31        0.2258

The example shows a case where the theoretical variance and standard deviation can be computed with little effort. However, for higher order AR models, finite data covariance expressions are hard to compute analytically, and one has to rely on asymptotic expressions or resampling techniques such as bootstrap.

Dynamical time-varying case

A quite challenging problem for adaptive filtering is to compute the RMSE as a function of time. As a general problem formulation, consider a state space model

$$x_{t+1} = A_t x_t + B_t v_t, \qquad y_t = C_t x_t + e_t,$$

where $P_t = \mathrm{Cov}(\hat x_t)$ is sought. Other measures such as RMSE can be expressed in terms of $P_t$. For many applications, it is quite obvious that the covariance matrix $P_t$ delivered by the Kalman filter is not reliable. Consider the target


tracking example: the Kalman filter innovations are certainly not white after the manoeuvres. A natural generalization of the bootstrap principle is as follows:

1. Inverse filter the state space model to get an estimate of the two noise sequences $\{\hat v_t\}$ and $\{\hat e_t\}$. This includes running the Kalman filter to get the filtered estimates, and then computing the residuals. The estimate of $v_t$ might be in the least squares sense if $B_v$ is not full rank.

2. If an estimate of the variance contribution to the RMSE(t) is to be found, resample only the measurement noise sequence $\{\hat e_t\}^{(i)}$. Otherwise, resample both sequences. However, here one has to be careful: if the sequence $\{\hat v_t\}$ is not independent and identically distributed (i.i.d.), which is probably the case when using real data, then the bootstrap idea does not apply.

3. Simulate the system for each set of bootstrap sequences.

4. Apply the Kalman filter and treat the state estimates as Monte Carlo outcomes.

Literature

An introduction to the mathematical aspects of bootstrap can be found in Politis (1998), and a survey of signal processing applications in Zoubir and Boashash (1998). An overview for system identification, and an application to uncertainty estimation, is presented in Tjärnström (2000).

12.1.5. MCMC and Gibbs sampler

As a quite general change detection and Kalman filter example, consider the state space model

$$x_{t+1} = A x_t + B_u u_t + \sum_j \delta_{t-k_j}\, B_f f_j,$$
$$y_t = C x_t + D u_t + e_t,$$

where $f_j$ is a fault, assumed to have a known (Gaussian) distribution, occurring at time $k_j$. Let $K$ be the random vector of change times, $X$ the random


matrix of state vectors, and $Y$ the vector of measurements. The change detection problem can be formulated as computing the marginal distribution

$$p(K \mid Y) = \int p(X, K \mid Y)\, dX. \qquad (12.10)$$

The problem is that the integral in (12.10) is usually quite hard to evaluate. The idea in Markov Chain Monte Carlo (MCMC) and Gibbs sampling is to generate sequences $X^{(i)}, K^{(i)}$ that asymptotically will have the marginal distribution (12.10). The Gibbs sequence is generated by alternating between taking random samples from the following two distributions:

$$X^{(i+1)} \sim p\left(X \mid K^{(i)}, Y\right), \qquad (12.11)$$
$$K^{(i+1)} \sim p\left(K \mid X^{(i+1)}, Y\right). \qquad (12.12)$$

Comments on these steps:

• The distribution (12.11) is Gaussian, with mean and covariance computed by the Kalman smoother. That is, here we run the Kalman smoother and then simulate random numbers from the corresponding multi-variate normal distribution.

• From the Markovian property of the state, the distribution (12.12) follows by noting that the probability for a change at each time is given by comparing the difference $x_{t+1} - A x_t - B_u u_t$ with a Gaussian mixture (a linear combination of several Gaussian distributions).

The recursion has two phases. First, it converges from the initial uncertainty in $X^{(0)}$ and $K^{(0)}$. This is called the burn-in time. Then, we continue to iterate until enough samples from the marginal distribution (12.10) have been obtained.

In some cases, we can draw samples from the multi-variate distribution of the unknowns $X$ and $K$ directly. Then we have MCMC. Otherwise, one can draw samples from scalar distributions, using one direction after another and using Bayes' law. Then we have Gibbs sampling.

the unknowns X and K directly. Then we have MCMC. Otherwise, one candraw samples from scalar distributions using one direction after each otherand using Bayes' law. Then we have Gibbs sampling.

That is, the basic idea in Gibbs sampling is to generate random numbers from all kinds of marginal distributions in an iterative manner, and wait until the samples have converged to the true distribution. The drawback is that analytical expressions for all marginal distributions must be known, and that these change from application to application.
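To make the mechanics concrete, here is a toy Gibbs sampler in MATLAB for a bivariate Gaussian with correlation rho, alternating between the two scalar conditionals; it illustrates the alternation and the burn-in, not the change detection problem above:

rho = 0.8; Niter = 2000; burnin = 200;
a = 0; b = 0; S = zeros(Niter,2);
for i = 1:Niter
  a = rho*b + sqrt(1-rho^2)*randn;   % sample a given b
  b = rho*a + sqrt(1-rho^2)*randn;   % sample b given a
  S(i,:) = [a b];
end
S = S(burnin+1:end,:);               % discard the burn-in samples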

The theoretical question is why and how the samples converge to the marginal distribution. Consider the case of Kalman filtering, where we want


Figure 12.5. A change detector takes the observed signal $z_t$ and delivers an alarm signal $a_t$, which is one at time $t_a$ and zero otherwise, and possibly makes a decision about the change time $\hat k$.

to compute certain statistics of the states $X$. The transition from one iteration to the next one can be written

Expressed in the initial state estimate, the recursion can be written

This is a Markov model, where the integral kernel $h(x^{(i+1)}, x^{(i)})$ acts like the transition matrix for state space models. One stationary point of this recursion is the marginal distribution of $X$, and it can be shown that the recursion will always converge to this point, if the marginal distribution exists.

Literature

The book by Gilks et al. (1996) covers MCMC. An introduction to Gibbs sampling can be found in Casella and George (1992) and Smith and Roberts (1993), and applications to Kalman filtering problems in Shephard and Pitt (1997) and Carter and Kohn (1994).

12.2. Evaluation of change detectors

12.2.1. Basics

A change detector can be seen as a device that takes an observed signal $z_t$ and delivers an alarm signal $a_t$, as illustrated in Figure 12.5. It can also be seen as a device that takes a sequence of data and delivers a sequence of alarm times $t_a$ and estimated change times $\hat k_n$.

On-line performance measures are:


• Mean Time between False Alarms (MTFA),

$$\mathrm{MTFA} = \mathrm{E}(t_a - t_0 \mid \text{no change}), \qquad (12.13)$$

where $t_0$ is the starting time for the algorithm. How often do we get alarms when the system has not changed? Related to the MTFA is the false alarm rate (FAR), defined as 1/MTFA.

• Mean Time to Detection (MTD),

$$\mathrm{MTD} = \mathrm{E}(t_a - k \mid \text{a given change at time } k). \qquad (12.14)$$

How long do we have to wait after a change until we get the alarm? Another name for MTD is Delay For Detection (DFD).

• Missed Detection Rate (MDR). What is the probability of not receiving an alarm when there has been a change? Note that in practice, a large $t_a - k$ can be confused with a missed detection in combination with a false alarm.

• The Average Run Length function, ARL(θ),

$$\mathrm{ARL}(\theta) = \mathrm{E}(t_a - k \mid \text{a change of magnitude } \theta \text{ at time } k), \qquad (12.15)$$

a function that generalizes MTFA and MTD. How long does it take before we get an alarm after a change of size $\theta$? A very large value of the ARL function could be interpreted as a missed detection being quite likely.

In practical situations, either the MTFA or the MTD is fixed, and we optimize the choice of method and design parameters to minimize the other one.

In off-line applications (signal segmentation), logical performance measures are:

• Estimation accuracy. How accurately can we locate the change times? This relates to minimizing the absolute value of the MTD. Note that the time to detection can be negative if a non-causal change detector is used.

• The Minimum Description Length (MDL); see Section 12.3.1. How much information is needed to store a given signal?

The latter measure is relevant in data compression and communication areas, where disk space or bandwidth is limited. MDL measures the number of binary digits that are needed to represent the signal, and segmentation is one approach for making this small.



Figure 12.6. One realization of the change in the mean signal (a), and a histogram over alarm times from the CUSUM algorithm for 250 Monte Carlo simulations (b).

Example 12.5 Signal estimation

Consider the changing mean in noise signal in Figure 12.6. The two-sided version of the CUSUM Algorithm 3.2 with threshold $h = 3$ and drift $\nu = 0.5$ is applied to 250 realizations of the signal. The alarm times are shown in the histogram in Figure 12.6.

From the empirical alarm times, we can compute the following statistics, under the assumption that alarms come within 10 samples after the true change; otherwise an alarm is interpreted as a false alarm:

(Table of estimated MTD, MDR and FAR values for the two changes; the numerical entries are not legible in the transcript.)

As can be expected, the MTD and MDR are larger for the smaller first change.

Here we have compensated the missed detection rate by the estimated false alarm rate. (Otherwise, the missed detection rate would become negative after the second jump, since more than 250 alarms are obtained in the interval [40,50].) The delay for detection is, however, not compensated.

12.2.2. The ARL function

The ARL function can be evaluated from Monte Carlo simulations, or other resampling techniques. In some simple cases, it can also be computed numerically without the need for simulations.


The ARL function for a stopping rule (or rather alarm rule) for detecting a change $\theta$ in the mean of a signal is defined as

$$\mathrm{ARL} = \mathrm{E}(t_a \mid \text{a change of magnitude } \theta \text{ at time } t = 0), \qquad (12.16)$$

where $t_a$ is the alarm time (or stopping time).

In this section, we demonstrate how to analyze and design CUSUM detectors. Assume we observe a signal $x_t$ which is $\mathrm{N}(0, \sigma^2)$ before the change and $\mathrm{N}(\theta, \sigma^2)$ after the change. We apply the CUSUM algorithm with threshold $h$ and drift $\nu$:

$$g_t = g_{t-1} + x_t - \nu, \qquad (12.17)$$
$$g_t = 0, \quad \text{if } g_t < 0, \qquad (12.18)$$
$$g_t = 0,\ t_a = t,\ \text{and alarm if } g_t > h > 0. \qquad (12.19)$$

A successful design requires $\theta > \nu$.
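A direct MATLAB transcription of (12.17)-(12.19), with an illustrative signal containing a unit mean change at $t = 50$:

x = [randn(1,50), randn(1,50)+1];    % change of magnitude 1 at t = 50
h = 3; v = 0.5;                      % threshold and drift
g = 0; ta = [];
for t = 1:length(x)
  g = g + x(t) - v;                  % (12.17)
  if g < 0, g = 0; end               % (12.18)
  if g > h, ta = [ta t]; g = 0; end  % (12.19): alarm and restart
end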

There are two design parameters in the CUSUM test: the threshold $h$ and the drift parameter $\nu$. If $\sigma$ is the standard deviation of the noise, then it can be shown that the functional form of the ARL function is

$$\mathrm{ARL}(\theta; h, \nu) = f\left(\frac{h}{\sigma}, \frac{\theta - \nu}{\sigma}\right). \qquad (12.20)$$

That is, it is a function of two arguments.

The exact value of the ARL function is given by a so-called Fredholm integral equation of the second kind, which must be solved by a numerical algorithm; see de Bruyn (1968) or Section 5.2.2 in Basseville and Nikiforov (1993). Let $\mu = \theta - \nu$. A direct approximation suggested by Wald is

$$\mathrm{ARL} = \frac{e^{-2h\mu/\sigma^2} - 1 + 2h\mu/\sigma^2}{2\mu^2/\sigma^2}, \qquad (12.21)$$

and another approximation suggested by Siegmund is

$$\mathrm{ARL} = \frac{e^{-2(h/\sigma + 1.166)\mu/\sigma} - 1 + 2(h/\sigma + 1.166)\mu/\sigma}{2\mu^2/\sigma^2}. \qquad (12.22)$$

The quality of these approximations is investigated in the following example.

Example 12.6 ARL

Assume that $\sigma = 1$ and $h = 3$. The mean time between false alarms is $f(h, -\nu)$ and the mean time for detection is $f(h, \theta - \nu)$; see (12.20). Sample values of $f$ are:



Figure 12.7. Distribution of false alarm times (a) and delay for detection (b), respectively, from Monte Carlo simulations with $h = 3$ and $\nu = 0.5$.

μ = θ − ν   Monte Carlo   Siegmund   Wald   Theoretical
−0.5        119.964       118.6      32.2   127.70
0           19.400        17.4       9.0    16.762
0.5         6.462         6.4        4.1    6.70
1           3.760         3.7        2.5    3.90
1.5         2.656         2.6        1.8    2.70
2           2.174         2.0        1.4    2.10

This is a subset of Table 5.1 in Basseville and Nikiforov (1993). See also Figures 12.7 and 12.8.

The mean times do not say anything about the distribution of the run length, which can be quite unsymmetric. Monte Carlo simulations can be used for further analysis of the run length function. It can be seen that the distribution of false alarms is basically binomial, and that a rough estimate of the delay for detection is $h/(\theta - \nu)$.

What is really needed in applications is to determine $h$ from a specified

FAR or MTD. That is, to solve the equation

$$\mathrm{ARL}(0; h, \nu) = \frac{1}{\mathrm{FAR}}$$

or

$$\mathrm{ARL}(\theta; h, \nu) = \mathrm{MTD}(\theta)$$


Figure 12.8. ARL function as a function of $\mu = \theta - \nu$ for threshold $h = 3$.

with respect to $h$. Since the ARL function is a monotonously increasing function of $h$, this should not pose any problems. Next, the drift $\nu$ can be optimized to minimize the MTD if the FAR is to be held constant. That is, a very systematic design is possible if the ARL function is computable, and if there is prior knowledge of the change magnitude $\theta$. To formalize this procedure of minimizing the MTD for a given FAR, consider the ARL function in Figure 12.9. First we take $\theta = 0$ (no change), and find the level curve where ARL = FAR. This gives the threshold as a function of the drift, $h = h(\nu)$. Then we evaluate the ARL function for the change $\theta$ for which the MTD is to be minimized. The minimum of the function $\mathrm{ARL}(\theta; h(\nu), \nu)$ defines the optimal drift, and then also the threshold $h(\nu)$.

The ARL function can only be computed analytically for very simple cases, but the approach based on Monte Carlo simulations is always applicable.

12.3. Performance optimization

For each problem at hand, there are a number of choices to make:

1. Adaptive filtering with or without change detection.

2. The specific algorithm for filtering and possibly change detector

3. Th e design parameters in the algor ithm chosen.

Page 450: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 450/495

12 .3 PerformanceDtimization 445

h 1 -1 e-v

Figure 12.9. ARL function for different approximations as a function of p = 6’ andthreshold h. Siegmund’s approximation is used to compute log,,(ARL).

We describe here in general terms methods for facilitating the design process.One particular goal is a more objective design than just using trial and errorcombined with visual inspection.

12.3.1. The MD1 criterion

In this section, we consider the problem of transmitting or storing a signal asefficiently as possible. By efficiently, we mean using as few binary digits aspossible. As a tool for this , we can use a mathematical model of the signal,which is known to the receiver or the device that reads the stored information.In the following, we will refer only to the transmission problem. Th e pointwith the mathematica l model is th at we do not transmit the signal tself,

bu t r ather the residuals from the model whose size is of considerably smallermagnitude if th e model is good, and thus fewer bits are required for atta iningthe specified accuracy a t the receiver. The prize we have to pay for this isthat the param eters in the model need t o be transmitted as well. Th at is, wehave a compromise between sending as small residuals as possible and usingas few parameters as possible. An implicit trade-off is the choice of how manydecimals th at are needed when transmitting the parameters. This last trade-off is, however, signal independent, and can be optimized for each problemclass leading to an optimal code length. The presentation here is based on

Gustafsson (1997a), which s an at temp t to a generalization of Rissanen’swork on this subject for model order selection..

More specifically, the problem can be stated as choosing a regression vector

Page 451: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 451/495

446 Evaluation heorv

(model structure) vt and parameter vectors Bt (constant, piecewise constantor time-varying) in the signal model

Yt =

P34e t ,

where et is the residual.The problem classes we will examine are listed below:

Time-invariant models where different model structures can be com-pared.

Time-varying models where different model structures, recursive identi-fication methods and their design parameters can be compared.

Piecewise constant models, where different model structures , change de-tection algorithms to find the change points and their design parameterscan be compared.

Besides the residuals th at always have to be transmitted, the first model re-quires a parameter vector to be transmitted . The second model transmits thetime update of the parameter vector, whose size should increase as the timevariations in the model increase. The third model requires the change pointsto be transmitted together with a parameter vector for each segment. Clearly,

the use of too many change points should be penalized.The first approach has been thoroughly examined by Rissanen; see, for in-stance, Rissanen (1978, 1982). He developed th e Minimum Desc r ip t ion Length( M D L )criterion, which is a direct measure of t he number of bits that areneeded to represent a given signal as a function of the number of parameters,the number of da ta an d the size of the residuals. We will extend t he MDLcriterion for the latt er two cases. Th e point here is th a t we get a n answerto not only what the most appropr iate model structure is, but also when itpays off to use recursive identification and change detection. Another point

is th at the design variables can be optimized automatically as well, which isimportant since this is often a difficult tuning issue.

We would like to point out the following:

There is no assumption tha t the re is a true system which has constant,time-varying or piecewise constant parameter; rather we are looking foran algorithm that is able to describe dat a as well as possible.

Th e generalized MDL can be used for standa rd sys tem identificationproblems, just like MDL is often used for choosing the model structure.For instance, by taking a typical realization of a time-varying system weget a suggestion on which recursive identification algorithm to apply andhow the design parameters should be chosen.

Page 452: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 452/495

12 .3 PerformanceDtimization 447

As will be shown, the result of minimizing the description length yieldsa stochastic optimality in the maximum likelihood meaning as well.

Time invariant parametersThe summary of MDL below essentially follows the introduction section ofRissanen (1982). Assume the measured signal y t is modeled in a parametricfamily with measurement noise 02 et 8) denote the code ength forthe signal y N = (91, y2, ...,Y N ) using a model with a &dimensional parametervector 8. Here = ~ 1 ,.., E N ) denotes th e set of prediction errors from themodel. In a linear regression framework, we have

E t =Yt

but other model structures are of course possible.Generally, we have

L(EN, 8) = ogp(EN, e ,

where 8) is the joint istribution of da ta nd he parameters. Thisexpression is optimized over th e precision of the value of 8, so th a t each elementin the parameter vector can be represented by an integer, say n. The codelength of this integer can be expressed as og(p(n)) for a suitable choice ofdensity function. Rissanen now proposes a non-informative prior for integersas

p (n ) - logC( ),

wherelog* ( n ) = log n + og log n + log log log + .

where the sum is terminated at the first negative term. It can be shown th atthis is a proper distribution, since its sum is finite.

With this prior, the optimal code length can be written

(12.23)

where only the fastest growing penalty term is included. Here C ( k ) is thevolume of the unit sphere in R k , and ll13llp = B T F 1 1 3 . For linear regressionswith Gaussian noise we have

1

0 2

N

L(EN, ) = c + log 11811$ + dlog N (12.24)t=l

Page 453: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 453/495

448 Evaluation theorv

The most common reference to MDL only includes the first two terms, whichare also scaled by 1/N. However, to be able to compare different assumptionson parameter variations we keep the th ird te rm for later use.

Piecewise constant parameters

As a motivation for this approach, consider the following example.

Example 72.7 GSM speech coding

The GSM standa rd for mobile telephony says th at th e signal is segmentedin batches of 160 samples, in each segment an eighth order AR model is esti-

mated, and the parameter values (in fact non-linear transformation of reflec-tion coefficients) and prediction errors (or ra ther a model of the predictionerrors) are transmitted to th e receiver.

This is an adequate coding , since typical speech signals are short-timestationary. Note that the segmentation is fixed before-hand and known to th ereceiver in GSM.

We consider segmentation in a somewhat wider context, where also thetime points defining the segmentation are kept as parameters. Th at is, theinformation needed to transm it comprises the residuals, the par ameter vectorin each segment and he change points. Related segmentation approachesare given in Kitagawa and Akaike (1978) and Djuric (1992), where th e BICcriterion Akaike (1977) and Schwartz (1978) is used. Since BIC is th e sameas MDL if only the fastest growing penalty term is included, the criteria theypresent will give almost identical result as th e MDL.

If we consider the change points as fixed, th e MDL theory immediatelygives

because with a given segmentation we are facing n independent coding prob-lems. Note that the number of parameters are n d , so d in the MDL criterionis essentially replaced by nd. The last term is still negligible if N >> n.

Th e remaining question is what the cost for coding the integers kn is. Onecan argue that hese integers are also parameters leading to the se of n ( d + 1)in MDL, as done in Kitagawa and Akaike (1978). Or one can argue thatcode length of integers is negligible compared to the real-valued parameters,leading to MDL with kn parameters as used in Djuric (1992). However, th e

Page 454: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 454/495

12 .3 PerformanceDtimization 449

description length of these integers is straightforward to compute. Bayes' lawgives th at

p ( & N , , k ) = p ( & N , lk )p (k ) .

Th e code length should thus be increased by og(p(k )). The most reason-able prior now is a flat one for each k . Th at is,

where we have assumed that th e number of d a ta is much larger than thenumber of segments. This prior corresponds t o th e code length

L ( k )= nlog N

That is, the MDL penalty term should in fact be n k + 1) og(N)/N,

10 2

N

L(EN, , ) M -c 2 t ) + d + 1) n og N . (12.25)t = l

Time varying parameters

Here we consider adaptive algorithms tha t can be written as

For linear regression, the update can be written

which comprises RLS, LMS and the Kalman filter as special cases.As a first try, one can argue that the parameter upda te AQt is a sequenceof real numbers just like th e residuals and th at th e MDL criterion should be

(12.26)

where Pa is the covariance matrix of the update A . This criterion exhibitsthe basic requirements of a penalty term linear in the number of parameters.Clearly, there is a trade-off between making the residuals small (requiring largeupdates if th e underlying dynamics are rapid ly time-varying) and making theupdates small.

Page 455: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 455/495

450 Evaluation heorv

12.3.2. Auto tuning and optimization

Once we have decided upon the evaluation measure, auto-tuning of the designparameters is a straightforward optimization problem. For a simple algorithmwith only one design parameter, a line-search is enough. More interestingly,we can compare completely different algorithms like linear adaptive filters withchange detection schemes. See Section 5.7 for an example of these ideas.

Page 456: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 456/495

13Linear estimation

13.1. Projections . . . . . . . . . . . . . . . . . . . . . . . . . . 451

13.1.1.Linear lgebra . . . . . . . . . . . . . . . . . . . . . . . . 45113.1.2.Functional analysis . . . . . . . . . . . . . . . . . . . . . . 45313.1.3.Linear estimation . . . . . . . . . . . . . . . . . . . . . . . 45413.1.4.Example: derivation of the Kalman filter . . . . . . . . . 454

13.2. Conditional expectations . . . . . . . . . . . . . . . . . . 56

13.2.1.Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45613.2.2.Alternative optimality nterpretations . . . . . . . . . . . 45713.2.3.Derivation of marginal distribution . . . . . . . . . . . . . 45813.2.4.Example: derivation of the Kalman filter . . . . . . . . . 459

13.3. Wiener filters . . . . . . . . . . . . . . . . . . . . . . . . . 460

13.3.1.Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 460

13.3.2.The non-causal Wiener ilter . . . . . . . . . . . . . . . . 46213.3.3.The causal Wiener filter . . . . . . . . . . . . . . . . . . . 46313.3.4.Wiener ignal predictor . . . . . . . . . . . . . . . . . . . 46413.3.5.An algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 46413.3.6.Wiener measurement predictor . . . . . . . . . . . . . . . 46613.3.7.The stationary Kalman smoother as a Wiener filter . . . . 46713.3.8.A numerical example . . . . . . . . . . . . . . . . . . . . . 468

1 3.1. Projections

The purpose of this section is to get a geometric understanding of linear es-timation . Firs t. we outline how projections are computed in linear algebrafor finite dimensional vectors . Functional analysis generalizes this procedureto some infinite-dimensional spaces (so-called Hilbert spaces). and finally. wepoint ou t tha t linear estimation is a special case of a n infinite-dimensionalspace . As an example. we derive th e Kalman filter .

13.1.1. lin ear algebra

The theory presented here can be found in any textbook in linear algebraSuppose that X, are two vectors in Rm We need th e following definitions:

Adaptive Filtering and Change DetectionFredrik Gustafsson

Copyright © 2000 John Wiley & Sons, LtdISBNs: 0-471-49287-6 (Hardback); 0-470-84161-3 (Electronic)

Page 457: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 457/495

452 Linear

The scalar product is defined by (X, y) = Czl xiyi. Th e scalar productis a linear operation in data y.

Length is defined by the Euclidean norm llxll =

d m .Orthogonality of z and y is defined by (X, y) = 0:

LyTh e projection z p of z on y is defined by

Note th at z p X is orthogonal to y, (xp X , y) = 0. This is the projectiontheorem, graphically illustrated below:

X

The fundamental idea in linear estimation is to project the qua ntit y to beestimated onto a plane, spanned by the measurements IIg. The projectionz p E I I y, or estimate 2 , of z on a plane I tg is defined by (xP X , yi) = 0 forall yi spanning the plane

II,: X

I Xp I

We distinguish two different cases for how to compute xp:

1. Suppose ( ~ 1 , 2 ,.., E N ) is an orthogonal basis for IIg. Th at is, ~ i , j ) 0for all i j and span(e1, 2 ,.., E N ) = IIg. Later on, E t will be interpreted

Page 458: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 458/495

13.1 Proiections 453

as the innovations, or prediciton errors. Th e projection is computed by

Note that the coefficients f i can be interpreted as a filter. The projection

theorem ( z P X , j ) 0 for all j now follows, since xP, ~ ) X, ~ ) .

2. Suppose that the vectors ( y l ,y2, ...,Y N ) are linearly independent, but notnecessarily orthogonal, and span the plane IIy. Then, Gram-Schmidtorthogonlization gives an orthogonal basis by the following recursion,initiated with €1 = y1,

and we are back in case 1 above:

Y1 = E 1

13 1 2 Functional analysis

A nice fact in functional analysis s that the geometric relations in he previous

section can be generalized from vectors in E m to infinite dimensional spaces,which (although a bit sloppily) can be denoted Em. This holds for so-calledHalbert spaces, which are defined by th e existence of a scalar product with

1. z, z) 0 for all IC 0. That is, there is a length measure, or norm,

that can be defined as l lz l l ( x , x ) ~ / ~ .

From these propert ies, one can prove the triangle inequality ( x + Y, X +y) l j2 I( I C , I C ) / ~( y, y ) I I 2 and Schwartz inequality I(x,y)l 5 l l z l l . Ilyll. See, forinstance, Kreyszig (1978) for more details.

Page 459: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 459/495

454 Linear

13.1.3. linear estimation

In linear estimation, the elements z and y are stochastic variables, or vectorsof stochastic variables. It can easily be checked that the covariance defines ascalar product (here assuming zero mean),

which satisfies the three postulates for a Hilbert space.A linear filter tha t is optimal in he sense of minimizing the 2-norm implied

by the scalar product, can be recursively implemented as a recursive Gram-Schmidt orthogonalization and a projection. For scalar y and vector valued z,the recursion becomes

Remarks:

This is not a recursive algorithm in the sense that the number of com-putations and memory is limited in each time step. Further application-specific simplifications are needed to achieve this.

To get expressions for the expectations, a signal model is needed. Basi-cally, this model is the only difference between different algorithms.

13.1.4. Example: derivation of the Kalman filter

As an illustration of how to use projections, an inductive derivation of theKalman filter will be given for the state space model, with scalar yt,

1. Let the filter be initialized by olo ith an auxiliary matrix Polo.

2. Suppose that the projection at time t on the observations of ys up to

time t is Z t l t , and assume that the matr ix Ptlt is the covariance matrixof the estimation error, Ptlt = E(ZtltZ ).

Page 460: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 460/495

13.1 Proiections 455

3. Time update. Define the linear projection operator by

Then

=AProj(stIyt) +Proj(B,vtIyt) = A2 .

O

Define the estimation error as

which gives

Measurement update. Recall the projection figure

X

I Xp I

and the projection formula for an orthogonal basis

Page 461: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 461/495

456 Linear

Th e correlation between xt and E t is examined separately, using (accord-ing to the projection theorem) E(2tlt-lZtlt-l) = 0 and ~t = yt-C2tlt-l =

CQ-1 + et :

Here we assume that xt is un-correlated with et. We also need

Th e measurement update of the covariance mat rix is similar. All to-gether, this gives

The induction is completed.

13 2 Conditional expectations

In this section, we use arguments and results from mathematical statistics.Stochastic variables (scalar or vector valued) are denoted by capital letters, todistinguish them from the observations. This overview is basically taken from

Anderson and Moore (1979).

13 2 1 Basics

Suppose the vectors X and Y are simultaneously Gaussian distributed

Then, the conditional distribution for X , given the observed Y = y, is Gaus-

sian distributed:

Page 462: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 462/495

13 .2 ConditionalxDectations 457

This follows directly from Bayes’ rule

by rather ted ious computations. The complete derivat ion is given in Section13.2.3.

The Conditional Mean (CM) estimator seen as a stochastic variable canbe denoted

while the conditional m ean estimate, given the observed y, is

= E(XIY = y) = px + PxyP;;(y p y .

Note that the estimate is a linear function of y (or rather, affine).

13 2 2 Alternative optimality nterpretations

The M a x i m u m A Pos te r io r i (MAP)estimator, which maximizes the Probabil-i ty Density FunctionPDF) with respect t o X , oincides with the CM estimato r

for Gaussian distributions.Another possible estimate is given by the Condit ional Minimum Variance

principle ( C M V )

2c MV( y) = argminE(1IX ~(y)11~IY y).4Y )

It is fairly easy to see th at the CMV estimate also coincides with the CMestimate:

minimum variance

This expression is minimized for x (y ) = 2(y), and the minimum variance isthe remaining two terms.

Page 463: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 463/495

458 Linear

The closely related (unconditional) Minimum Variance principle ( M V )de-fines an estimator (no te the difference between estimator and estimate here):

X M V ( Y ) arg min EyEx(llX Z(Y)l121Y).- W )

Here we explicitely marked which variable th e expectation operates on. Now,the CM estimate minimizes the second expectation for all values on Y . Thus,the weighted version, defined by th e expectation with respect t o Y must beminimized by the CM estimator for each Y = y. That is, as an estimator, theunconditional MV and CM also coincide.

13 2 3 Derivation of marginal distribution

Start with the easily checked formula

P

(13.3)

and Bayes' rule

From (13.3)we get

and the ratio of determinants can be simplified. We note th a t th e new Gaus-sian distribution must have P,, - PxgP 'Pyx s covariance matrix .

Page 464: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 464/495

13.2 Conditionalxpectations 459

where

2 = P x + Pzy P;; (Y Py .

From this, we can conclude that

1P X l Y X, =

det(Pzz PzyP j1Pyz)1/2

which is a Gaussian distribution with mean and covariance as given in (13.2).

13 2 4 Example: derivation of the Kalman filter

As an illustration of conditional expectation, an inductive derivation of theKalman filter will be given, for the st at e space model

~ t + 1 =Axt + ut, ut E N(O, Q )yt =Cxt +et, et E N(O,R)

Page 465: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 465/495

460 Linear

Induction implies that Q, given yt, is normally distributed.

13.3. Wiener iltersThe derivation and interpretations of the Wiener filter follows Hayes (1996).

13.3.1. Basics

Consider the signal model

yt = st + et. (13.4)

The fundamental signal processing problem is to separate the signal st fromthe noise et using the measurements yt. Th e signal model used in Wiener's ap-proach is to assume th at th e second order properties of all signals are known.When s t and et are independent, sufficient knowledge is contained in the cor-relations coefficients

r s s W = E h - k )

r e e ( k ) =E(ete;-k),

and similarly for a possible correlation r s e ( k ) . Here we have assumed thatthe signals might be complex valued and vector valued, so * denotes complexconjugate transpose. The correlation coefficients (or covariance matrices) may

Page 466: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 466/495

13.3

in tu rn be defined by parametric signal models. For example, for a stat e spacemodel, the Wiener filter provides a solution to the stationary Kalman filter,as will be shown in Section 13.3.7.

The non-causal Wiener f i l teris defined by

(13.5)i = - w

In the next subsection, we study causal and predictive Wiener filters, butthe principle is the same. Th e underlying idea is t o minimize a least squarescriterion,

h =a rgminV(h ) = argminE(Et)2 = argminE(st ( h ) ) 2 (13.6)h h h

CO

= argminE(st y *h ) t ) 2 = argminE(st c iyt-i)2, (13.7)h h

i z - 0 0

where the residual = st t and the least squares cost V ( h ) re defined in astandard manner. Straightforward differentiation and equating t o zero gives

(13.8)

This is the projection theorem, see Section 13.1. Using the definition of cor-relation coefficients gives

COc iryg(k = r S g ( k ) , m < I < m . (13.9)i = - a

These are the Wiener-H opf equations, which are fundamental for Wiener fil-tering. There are several special cases of th e Wiener-Hopf equations, basicallycorresponding to different summation indices and intervals for k .

The FIR Wiener filter H ( q ) = h0 +h l q - l + ... +h,-lq-(n-l) correspondsto

n - lc irgg(k = r s y ( k ) , k = 0 , 1 ,...,n 113.10)i O

The causal (IIR) Wiener filter H ( q ) = h0 + h1q-l + ... corresponds t o

CO

Page 467: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 467/495

462 Linear

The one-step ahead predictive (IIR) Wiener filter H ( q ) = h1q-l +h2qP2 + ... corresponds to

CO

C h i r y y ( k = rsy(L), 1 5 L < Co. (13.12)i=l

The FIR Wiener filter is a special case of the linear regression frameworkstudied in Part 111, and the non-causal, causal and predictive Wiener filters rederived in the next two subsections. The example in Section 13.3.8summarizesthe performance for a particular example.

An expression for the estimation error variance is easy to derive from theprojection theorem (second equal ity):

Var(st t = E(st )2

= E(st ) S t

(13.13)

This expression holds for all cases, the only difference being the summationinterval.

13 3 2 The non-causal Wiener filter

To get an easily computable expression for the non-causal Wiener filter, write(13.9) as a convolution ( ryy h) (L) = rsy L) . The Fourier transform of aconvolution is a multiplication, and the orrelation coefficients become spectra ldensities, H ( e i w ) Q y y ( e i w ) Q s y e i w ) . Thus, the Wiener filter is

H e Z W ). Q s y e i w )

Q yy eiw ) '

or in the z domain

(13.14)

(13.15)

Here the x-transform is defined as F x ) = C f k x P k , so that stability of causalfilters corresponds to IzI < 1. This is a filter where the poles occur in pairsreflected in th e unit circle. It s implementation requries either a factorizationor partial fraction decomposition, and backward filtering of the unstable part.

Page 468: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 468/495

13 .3 Wiener 463

Figure 13.1. The causal Wiener filter H z ) = G + ( z ) F ( z ) an be seen as cascade of awhitening filter F ( z ) and a non-causal Wiener filter G+(.) with white noise input E t .

13.3.3. The causal Wiener ilter

Th e causal Wzenerfi lteris defined as in (13.6), with the restriction th at h k = 0for k < 0 so that future measurements are not used when forming d t . Theimmediate idea of trunca tin g the non-causal Wiener filter for k < 0 does notwork. The reason is tha t th e information in future measurements can be par-tially recovered from past measurements due to signal correlation. However,the op tim al solution comes close to this argumentation, when interpreting apart of the causal Wiener filter as a whitening filter. Th e basic idea is th atthe causal Wiener filter s th e causal part of th e non-causal Wiener filter f themeasurements are white noise

Therefore, consider the filter structure depicted in Figure 13.1. If yt has arational spectral density, spectral factorization provides th e sought whiteningfilter,

where Q(z ) is a monic ( q ( 0 ) = l ) ,For real valued signals, it holds onwritten Q g g z )= a i Q ( z ) Q ( l / ~ ) .given as

(13.16)

stable, minimum phase and causal filter.the unit circle that the spectrum can bestable and causal whitening filter is then

(13.17)

Now the correlation function of white noise is r E E k )= B k , so th e Wiener-Hopfequation (13.9) becomes

(13.18)

where {g} denotes t he impulse response of the white noise Wiener filter inFigure 13.1. Let us define th e causal part of a sequence { x ~ } ~ ~n the zdomain as [ X ( x ) ] + . hen, in the X domain (13.18) can be written as

Page 469: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 469/495

464 Linear

It remains to express the spectral density for the correlation stet* in terms ofthe signals in (13.4). Since E; = F * ( l / x * ) , th e cross spectrum becomes

To summarize, the causal Wiener filter is

It is well worth noting th at the non-causal Wienersimilar way:

(13.20)

(13.21)t

filter can be written in a

(13.22)

Th at is, both the causal and non-causal Wiener filters can be interpreted as acascade of a whitening filter and a second filter giving the Wiener solution forthe whitened signal. Th e second filter's impulse response is simply truncatedwhen the causal filter is sought.

Finally, to actually compute the causal part of a filter which has poles bo thinside and outside the unit circle, a partial fraction decomposition is needed,

where th e fraction corresponding to the causal part has all poles inside theunit circle and contains the direct term, while the fraction with poles outsidethe unit circle is discarded.

13.3.4. Wiener signal predictor

The Wiene r m-step signal predictor is easily derived from the causal Wienerfilter above. Th e simplest derivation is to truncate the impulse response ofthe causal Wiener filter for a whitened input at another time instant. Figure

13.2(c) gives an elegant presentation and relation to the causal Wiener filter.The same ine of arguments hold for th e Wiener fixed-lag smoother as well;

just use a negative value of the prediction horizon m.

13.3.5. Analgorithm

The general algorithm below computes the Wiener filter for bo th cases ofsmoothing and prediction.

Algorithm 73.7 Causal, predictive and smoothing Wiener filterGiven signal and noise spectrum. The prediction horizon is m, th at is, mea-surements up to time t - m are used. For fixed-lag smoothing, m is negative.

Page 470: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 470/495

13.3 Wiener 465

(a) Non-causal Wiener filter

(b) Causal Wiener filter

(c) Wiener signal predictor

Figure 13.2. The non-causal, causal and predictive Wiener filters interpreted as a cascadeof a whitening filter and a Wiener filter with white noise inp ut. The filter Q z ) is given byspectral factorization e Y Y ( z ) a ; ( z ) * ( i / z * ) .

1. Compute the spectral factorization Qyy(x)= oiQ(z)Q*(l/z*) .

2. Com pute the part ial fraction expansion

G q ( z ) G - 7 2

3. The causal part is given by

and the Wiener filter is

The partial raction expansion is conveniently done in MATLABTMby using

the residue function. To get the correct result, a small trick is needed. Factorout z from the left hand side and write z g* 1,2*)

there is no z to factor out, include one in the denominator. Here B+, k , B- are

zm--l@Js v ( 2 ) = B + ( Z ) + k + B-(2) . If

Page 471: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 471/495

466 Linear

the outpu ts from residue and B + ( z ) contains all fractions with poles insidethe unit circle, and the o ther way around for B - ( x ) . By this trick, the directterm is ensured to be contained in G + ( z )= z ( B + ( z )+ k ) .

13.3.6. Wiener measurement predictor

The problem of predicting th e measuremen t ra ther than the signal turns outto be somewhat simpler. The assumption is th at we have a sequence of mea-surements yt th at we would like to predict. Note th at we temporarily leavethe standard signal estimation model, in that there is no signal st here.

The k-step ahead Wiener predictor of the measurement is most easily de-rived by reconsidering the signal model yt = st +et and interpre ting the ignal

as st=

yt+k. Then the measurement predictor pops ou t as a special case ofthe causal Wiener filter.The cross spectrum of the measurement and signal is QsY(z) = zkQYY(z).

Th e Wiener predictor is then a special case of the causal Wiener filter, andthe solution is

Gt+k = H k ( q ) Y t

1 Xk@YY 4[Q*(l /z*)] + .(13.23)

As before, (X) is given by spectral factorization Qyy(x) = ~ i Q ( z ) Q * ( l / z * ) .Note th at , in this formulation there is no signal s t , just the measured signalyt. If, however, we would like to predict th e signal component of (13.4), thenthe filter becomes

+k = H k ( x ) y t

(13.24)

which is a completely different filter. The one-step ahead Wiener predictor oryt becomes particularly simple, when substitu ting the spectra l factorizationfor QYy(x) n (13.23):

+l =Hl(x)Yt

Since (X) is monic, we get the causal part as

(13.25)

Page 472: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 472/495

13 .3 Wiener 467

That is, the Wiener predictor is

H&) = X 1

Q t , )(13.26)

Example 13.1 AR predictor

Consider an AR process

with signal spectrum

@YY(4 =

:

A(x)A*(l /x*)

The one step ahead predictor is (of course)

H'(x) = ~ ( 1 A ( x ) ) = -a1 a2q-l ... a,q -, l.

13 3 7 The stationary Kalman smoother as a Wiener filter

The stationary Kalman smoother in Chapter 8 must be identical to the non-causal Wiener filter, since bo th minimize 2-norm errors. Th e la tter s, however,much simpler to derive, and sometimes also to compute. Consider t he sta tespace model,

yt = Cxt +et.vS t

Assume that ut and et are ndependent scalar white noises. The spectraldensity for the signal st is computed by

The required two spectral densities are

Page 473: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 473/495

468 Linear

The non-causal Wiener filter is thus

I C z 1 - A)-1B,12 a:

I C z 1 - A)-lB,I2 a,2 + a:'

H ( z )=

The main compu tationa l work is a spectral factorization of the denominatorof H ( z ) .This should be compared to solving an algebraic Riccati equation toget the stationary Kalman filter.

Of course, the stationa ry Kalman ilter and predictor can also be computedas Wiener filters.

13.3.8. A numerical example

Com pute the non-causal Wiener filter and one-step ahead predictor for th eAR(1 ) process

st 0 . 8 ~ ~ ~ 10.6vt

Y t = S t + e t ,

with unit variance of et and u t , respectively. On the unit circle, th e signalspectrum is

0.62

0.360.452@'S'S(') = 11 0.82-1 = (1 0 . 8 ~ - ~ ) ( 10.82) 2 O.~) X 1.25)'

and the other spectra are QSy(z) = QSs(z) nd Q g g z ) = Qss(z) +Q e e z ) , incest and et are assumed independent:

-0.452 z2 2.52 + 1Q Y Y ( 4 = l =

2 0.8)(2 1.25) 2 0 . 8 ) ( z 1.25).

The non-causal Wiener filter is

Q'SY 4 -0.450.45H ( 2 ) = = zQYy(2) x2 2.52 + 1 2 2) (2 0 .5 )

= 2

-0.30.32.32 +-2 - 2- 0 . 5 ' -GP(.) G+(.)

which can be implemented as a double recursion

Page 474: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 474/495

13 .3 Wiener 469

As an a lternative, we can split the par tial fractions as follows:

Y ( 4 -0.452 -0.452 -0.6 -0.15H ( 2 ) =

Q Y y ( 2 ) .z2 - 2.52 + 1 2 2)(2 0.5) z 2 z 0.5'-

vH -+

which can be implemented as a double recursion

Bt =B + 2

S =0.58:-, 0.15yt-l

2 =0.58;+1 + 0.3yt.

Both alternatives lead to th e same result for i t , but the most relevant one hereis to include the direct term in the forward filter, which is the first alterna-tive. Now we tu rn to th e one-step ahead ( m = 1) prediction of S t . Spectralfactorization and partial fraction decomposition give

Q Y Y M =

2 - 2- 0 . 5v- 1.25 z 0.8.-

Q*P / z 1 Q .)

21 Q ' s Y ( 4 -0.452 0.320.752

Q*(l /x*) ='(X 0.8) x 2) X 0.8 X 2

+(.)P(.)

+-.

Here the causal part of th e Wiener filter G ( z ) n Figure 13.1 is denoted G + ( z ) ,and the anti-causal part G- 2). The Wiener predictor is

Note that the poles are the same (0.5) for all filters.

summarizes the least squares theoretical loss function for different filters:Filter operation Loss function

12o filter. B t = utNumerical loss

An interesting question is how much is gained by filtering. The table below

One-stepheadrediction 0.375 . 0A2 +0.62 0.6Non-causal Wiener filter

0.40480 ) c , '=- )irst order FIR filter

0.37500 ) c, =,- )ausal Wiener filter0.3a:1h 0)1

The table gives an idea of how fast t he information decays in the measure-

ments. As seen, a first order FIR filter is considerably better than just takingBt = yt, and only a minor improvement is obtained by increasing the numberof parameters.

Page 475: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 475/495

Signal models and notation

eneral

Th e signals and notation below are common for the whole book.

NameN

Ltn ,Yt

U t

etet

ka

nk nSt

xt

t

VariableNumber of da ta off-line)

Number of d a ta in sliding windowTime index in un its of sample intervals)Dimension of the vectorMeasurementInput known)Measurement noiseParameter vectorStateChange time

Number of change timesVector of change timesDistance measure in stopping ruleTest statistic in stopping rule

Signal estimation models

The change in the mean mo el is

y t + et, Var et) ,

DimensionScalar

ScalarScalarScalar

ny12nYno d

nXScalar

ScalarnScalarScalar

Adaptive Filtering and Change DetectionFredrik Gustafsson

Copyright © 2000 John Wiley & Sons, LtdISBNs: 0-471-49287-6 (Hardback); 0-470-84161-3 (Electronic)

Page 476: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 476/495

472 Sianalodels and notation

and the change in the variance model is

In change detection, one or both of Qt and are assumed piecewise constan t,with change times k l IQ . ,kn kn . Sometimes it is convenient to introducethe star t and end of a data set as ko 0 and kn N leaving n - degreesof freedom for the change times.

Parametric models

Name Covariance matriximensionariable

et

styesidual yt pTet-1t

pt Ptlte, darameter estimatet

Rt or o2eeasurement noise

Th e generic signal model has t he form of a linear regression:

Y t = Pt Qt + et.

In adaptive filtering, t is varying for all time indexes, while in change detectionit is assumed piecewise constant, or at least slowly varying in between thechange times.

Important special cases are the F I R model

R model

Page 477: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 477/495

473

and RX model

A general parametric, dynamic and deterministic model for curve fitting canbe written as

State space models

VariableStateSta te noise unknown stoch. input)State disturbance unknown det. input )Sta te fault unknown input)Measurement noiseSta te space matricesFiltered state estimatePredicted state estimateFiltered output estimate

Predicted output estimateInnovation or residual yt -

A.17)

A.18)

The standard form of the state space model for linear estimation is

Additive st ate and sensor changes are caught in th e model

Page 478: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 478/495

474 Sianalodels and notation

where U is the additive fault magni tude. A completely deterministic statespace model is

where f t is the additive time-varying fault profile. All in all, th e most generallinear model is

Multi-model approaches can in their most general form be written as

where is a discrete and finite mode parameter.

Page 479: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 479/495

ault detection terminology

The following list of terminology is adopted from Isermann and Balle 1997)’and reflects the notation used in the field of fault detection and diagnosis Ithas been promoted at IFAC workshop ‘SAFEPROCESS’ to ac t like a unifyingterminology in this field. Reproduced by permission of Prof. Rolf Isermann,IFAC SAFEPROCESS Technical Committee Chair 1991-1994.)

States and Signals

Fault

Failure

Malfunct ion

Error

DisturbancePerturbation

Residual

S y m p t o m

Unpermitted deviation of at least one characteristicproperty or parameter of the system from acceptableusual stan dard condition.Permanent interruption of a systems ability to perform a

required function under specified operating conditions.Intermittant irregularity in fulfilment of a systems desiredfunction.Deviation between a measured or computed value of anoutput variable) an d the tru e, specified or theoreticallycorrect value.An unknown and uncontrolled) input acting on a system.An input acting on a system which results in a temporarydeparture from current state.Fault indicator, based on deviation between measurmentsand model-equation-based computations.Change of an observable quantity from normal behaviour.

Adaptive Filtering and Change DetectionFredrik Gustafsson

Copyright © 2000 John Wiley & Sons, LtdISBNs: 0-471-49287-6 (Hardback); 0-470-84161-3 (Electronic)

Page 480: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 480/495

476 Fault dete ction terminoloav

unctions

Fault detection

Fault isolation

Fault identij ication

Fault diagnosis

Monitoring

Supervision

Protection

Models

Determination of faults present in a system and timeof detection.

Determination of kind, location and time ofdetection of a fault. Follows fault detection.Determination of the size and time-variant behaviourof a fau lt. Follows fault isolation.Determination of kind, size, location and time of afault. Follows fault detection. Includes faultisolation and identification.

continuous real time task of determining theconditions of a physical system, by recordinginformation recognising and indicating anomalies ofthe behaviour.Monitoring a physical system and taking appropriateactions to maintain the opera tion in the case offaults.Means by which a potentially dangerous behaviourof the system is suppressed if possible, or means bywhich the consequences of a dangerous behaviour are

avoided.

Quantitative model

Qualitative model

Diagnositic model

Analytical redundancy

Use of stat ic and dynamic relations amongsystem variables and parameters in order todescribe systems behaviour in quantitativemathematical terms.Use of stat ic and dynamic relations amongsystem variables and parameters in order todescribe systems behaviour in qualitative termssuch as causalities or if-then rules.

set of static or dynamic relations which linkspecific input variables he symptoms ospecific ou tput variables th e faults.Use of two or more, but not necessarily identicalways to determine a variable where one way usesa mathematical process model in analytical form.

Page 481: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 481/495

ibliography

G.A. Ackerson and K.S. Fu On state estimation in switching environments. IEEE

Transactions on Automatic Control , 15:10, 1970.

L. Ahlin and J. Zander. Principles of wirelesscommunications. Studentlitteratur,Lund, Sweden, second edition, 1998. ISBN 91-44-34551-8.

M. Aitken. Posterior Bayes factors. Journal of the Royal Statistical Society, SeriesB 53:111-142,1991.

H.Akaike. Fitting autoregressive models or prediction. Anna ls of In st i tu te orStatistical Mathematics, 21:243-247, 1969.

H. Akaike. Information theory and an extension of the maximum likelihood principle.In 2nd International Symposium on Information Theory, Tsahkadsor, SSR 1971.

H. Akaike. On entropy maximization principle. In Sy mp osi um on AppZications ofStat ist ics, 1977.

H. Akashi and H. Kumamoto. Random sampling approach to state estimation inswitching environment. Automat ion, 13:429, 1977.

A. Alessandri, M . Caccia, and G. Veruggio. Fault detection of actuator faults inunmanned underwater vehicles. Control Engineering Practice, 7 3):357-368, 1999.

S.T. Alexander. Adaptive ignal rocessing.Theory nd pplications. Springer-Verlag, New York, 1986.

B.D.O. Anderson and J.B. Moore. Optimal filtering.Prentice Hall, Englewood Cliffs,

NJ., 1979.J.B. Anderson and S. Mohan. Sequential coding algorithms: A survey and costanalysis. IEEE Transactions on Communications, 32:1689-1696, 1984.

P. Andersson. Adaptive forgetting in recursive identification through multiple mod-els. Intern ation al Journ al of Control, 42 5):1175-1193,1985.

R. Andre-Obrecht. A new statistical approach for automatic segmentation f contin-uous speech signals. IEEE Transactions on Acoustics, Speech and Signal Processing,36:29-40, 1988.

U. Appel and A.V. Brandt. Adaptive sequential segmentation of piecewise stationarytime series. Information Sciences, 29 1):27-56, 1983.

K.J. h t r o m and B. Wittenmark. Co mp ute r controlled syste ms.Prentice Hall, 1984.

Adaptive Filtering and Change DetectionFredrik Gustafsson

Copyright © 2000 John Wiley & Sons, LtdISBNs: 0-471-49287-6 (Hardback); 0-470-84161-3 (Electronic)

Page 482: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 482/495

478 Bibliography

K.J. h t r o m and B. Wittenmark. Adaptive control. Addison-Wesley, 1989.

T. Aulin. Asymptotically optimal oint channel equalization, demodulation andchannel decoding with minimal complexity and delay. In 1991 Tirrenia International

Workshop on Digital Comm unications, pages 1-1 1991.B. Bakhache and I. Nikiforov. Reliable detection of faults in navigation systems. InConference of Decision and Control, Phoenix, AZ, 1999.

Y. Bar-Shalom and T. Fortmann. TrackingandData Association, volume 179 ofMathematics in Science and Engineering. Academic Press, 1988.

Y. Bar-Shalom and X.R. Li. Estimationand racking:principles, echniques,andsoftware. Artech House, 1993.

M. Basseville and A. Benveniste. Design and comparative study of some sequential

jumpdetect ion algorithms for digital signals. IEEE TransactionsonAcoustics,Speech and Signal Processing, 31:521-535, 1983a.

M. Basseville and A. Benveniste. Sequential detection of abrupt changes in spectralcharacteristics of digital signals. IEEE Transactionson nformationTheory, 29:709-724, 198313.

M. Basseville and A. Benveniste. Sequential segmentation of nonstationary digitalsignals. Information Sciences, 29:57-73, 1983c.

M. Basseville and I.V. Nikiforov. De tec tio n of abrupt changes: theory and applica-tion. Information and system science series. Prentice Hall, Englewood Cliffs, NJ.1993.

L.E. Baum, T. Petrie, G.Soules, and N. Weiss. A maximization technique occuringin the statistical analysis of probabilistic functions of Markov chains. The Ann als ofMathematical Statistics, 41:164-171, 1970.

J. Bauschlicher, R. Asher, and D. Dayton. A comparison of Markov and constantturn ra te models in an adaptive Kalman filter tracker. In Proceedings of the IEEE

NationalAerospaceandElectronicsConference,NAECON, 1989 pages 116-123,1989.

M. Bellanger. Adaptivedigital iltersandsignalanalysis. Marcel Dekker, Boston,1988.

A. Benveniste, M. Basseville, and B.V. Moustakides. The asymptotic local approachto change detection and model validation. IEEE Transactions on Automatic Control ,32:583-592, 1987a.

A. Benveniste, M. Metivier, and P. Priourret. Algorithmes adaptifs et approximationsstochastiques. Masson, Paris, 1987b.

N Bergman. Recursive Bayesian Estimation: Navigation and Tracking Applications.Dissertation nr. 579, Linkoping University, Sweden, 1999.

N. Bergman and F. Gustafsson. Three statistical batch algorithms for trackingmanoeuvring targets. In Proceedings 5th European Control Conference , Karlsruhe,Germany, 1999.

Page 483: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 483/495

Bibliography 479

G.J. Bierman. Factorization methods f o r discrete sequential estimation. AcademicPress, New York, 1977.

S S Blackman. Multiple-targetrackingwith adar pp lications. Artech House,

Norwood, MA, 1986.W.D. Blair an d T. Bar-Shalom. Tracking maneuvering argets with multiple sensors:does more da ta always mean better estimates? IEEE Transactions on Aerospaceand Electronic Systems, 32 1):450456, 1996.

M. Blanke, R. Izadi-Zamanabadi, S.A. Bogh, and C.P. Lunau. Fault-tolerant controlsystems a holistic view. Control Engineering Practice, 5 5):693-702, 1997.

H.A.P. Blom and Y. Bar-Shalom. Interacting multiple model algorithm for systemswith markovian switching coefficients. IEEE Transactions on Automatic Control ,8 :

780-783, 1988.T. Bohlin. Four cases of identification of changing systems. In Mehra and Lainiotis,editors, System identification: advances and case studies.Academic Press, 1976.

C. Breining, P. Dreiscitel, E. Hansler, A. Mader, B. Nitsch, H. Puder, T. Schertler,G. Schmidt, and J . Tilp. Acoustic echo control. An application of very-high-orderadaptive filters. IE E E Signal Processing Magazine, 16 4):42-69, 1999.

R.G. Brown and P.Y.C. Hwang. Introduction to random signals and applied K al ma nfiltering. John Wiley Sons, 3rd edition, 1997.

J.M. Bruckner, R.W. Scott, and R.G. Rea. Analysis of multimode systems. IEEETransactions on Aerospace and Electronic Systems, 9:883, 1973.

M. Caciotta and P. Carbone. Estimation of non-stationary sinewave amplitude viacompetitive Kalman filtering. In IEEE Instrumentation and Measurement Technol-ogy Conference, 1996. IMTC-96.Con feren ce Proceeedings. Qua lityMeasurements:The Indispensable Bridge between Theo ry and Reality, pages 470-473, 1996.

M.J. Caputi. A necessary condition for effective performance of the multiple modeladaptive estimator. IEEE Transactions on Aerospace and Electronic Systems,31 3):1132-1139,1995.

C. Carlemalm and F. Gustafsson. On detection and discrimination of double talkand echo path changes in a telephone channel. In A. Prochazka, J. Uhlir, P.J.W.Rayner, and N.G. Kingsbury, editors, Signal analysis and prediction, Applied andNumerical Harmonic Analysis. Birkhauser Boston, 1998.

C.K. Carter and R. Kohn. On Gibbs sampling for state space models. Biometr ika ,81~541-553, 1994.

B. Casella and E.I. George. Explaining the Gibbs sampler. The Amer ican Sta t i s t i -cian, 46 3):167-174, 1992.

W.S. Chaer, R.H. Bishop, and J. Ghosh. Hierarchical adaptive Kalman filtering forinterplanetary orbit determination. IEEE Transactions on Aerospace and ElectronicSystems, 34 3):883-896, 1998.

Page 484: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 484/495

48 Bibliography

C.G. Chang and M. Athans. State estimation for discrete systems with switchingparameters. IEEE Transactions on Aerospace and Electronic Systems, 14:418, 1978.

J . Chen and R.J. Patton. Rob ust model-based fau lt diagnosis for dynamic systems.

Kluwer Academic Publishers, 1999.Sau-Gee Chen, Yung-An K m , and Ching-Yeu Chen. On heproperties of thereduction-by-composition LMS algorithm. IEEE Transactions on Circuits and Sys-t ems 11: Analog and Digital Signal Processing, 46 11):1440-1445, 1999.

P.R. Chevillat and E. Eleftheriou. Decoding of trellis-encoded signals in the presenceof intersymbol interference and noise. IEEE Transactions on Communications, 37:669-676, 1989.

E.Y. Chow and A.S. Willsky. Analytical redundancy and the design of robust failuredetection systems. IEEE Transactions on Automatic Control , 29 7):603-614, 1984.

C.K.Chui and G. Chen. Kalman ilteringwithreal-timeapplications. Springer-Verlag, 1987.

R.N. Clark. The dedicated observer approach t o instrument fault. In Proc. CDCpages 237-241. IEEE, 1979.

C. F. N. Cowan and P. M. Grant. Adaptive filters. Prentice-Hall, Englewood Cliffs,NJ , 1985.

J r C.R. Johnson. Lectures on adaptive parameter estimation. Prentice-Hall, Engle-wood Cliffs, N J 1988.

Jr. C.R. Johnson. On he interaction of adaptive filtering, identification, and control.IE E E Signal Processing Magazine, 12 2):22-37, 1995.

M. Daumera and M. Falka. On-line change-point detection for sta te space models)using multi-process Kalman filters. Linea r Algebra and its Applications, 284 1-3):1255135,1998.

L.M. Davis, .B. Collings, an d R.J. Evans. Identification of time-varying inearchannels. In 1997 IEEE Internatio nal Conferen ce on Aco ustics, Speech, and SignalProcessing, 1997. IC AS SP -97 ,pages 3921-3924, 1997.

L.D. Davisson. The prediction error of stationary Gaussian time series of unknowncovariance. IEEE Transac t ions on In fo rmat ion Theory, 11:527-532, 1965.

C.S. Van Dobben deBruyn. Cumulat ivesum ests: heory andpractice. Hafner,New York, 1968.

J.F.G. de Freitas, A. Doucet, and N.J. Gordon, editors. SequentialMonteCarlomethods in practice. Springer-Verlag, 2000.

P. de Souza and P. Thomson. LPC distance measures and statistical ests withparticular reference to the ikelihood ratio. IEEE Transactions on Acoustics, Speech

and Signal Processing, 30:304-315, 1982.T . Dieckmann. Assessment of road grip by way of measured wheel variables. InProceedings of F I S I TA , London, June 1992.

Page 485: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 485/495

Bibliography 481

L. Dinca, T. Aldemir, and G. Rizzoni. A model-based probabilistic approach forfault detection and identification with application to the diagnosis of automotiveengines. IEEE Transactions on Automatic Control,44 11):2200-2205, 1999.

X. Ding and P.M. Frank. Fault detection via factorization approach. SystemsControl Letters, 14 5):431-436, 1990.

X. Ding, L. Guo, and T. Jeinsch. A characterization of parity space and its appli-cation to robust fault detection. IEEE Transactions on Automatic Control,44 2):337-343, 1999.

P. Djuric. Segmentation of nonstationary signals. In Proceedings of ICASSP 92volume V, pages 161-164, 1992.

P. Djuric. A MAP solution to off-line segmentation of signals. In Proceedings o

ICASSP 94, volume IV, pages 505-508, 1994.A. Doucet. On sequential simulation-based methods for Bayesian filtering. TechnicalReport TR 310, Signal Processing Group, University of Cambridge, 1998.

A.Duel-Hallen and C.Heegard. Delayed decision-feedback sequence estimation.IEE E Transactions on Com munications,37:428-436, 1989.

M. Efe and D.P. Atherton. Maneuvering target tracking with an adaptive Kalmanfilter. In Proceedings of the 37th IE EE Conference on Decision and Control, 1998.

P. Eide and P. Maybeck. Evaluation f a multiple-model failure detection system for

the F-16 in a full-scale nonlinear simulation. In Proceedings of the IEEE NationalAerospace and Electronics Conference (N AE CON),pages 531-536, 1995.

P. Eide and P. Maybeck. An MMAE failure detection system for the F-16. IEEETransactions on Aerospace and Electronic Systems,32 3):1125-1136, 1996.

E. Eleftheriou and D. Falconer. Tracking properties and steady-state performanceof rls adaptive filter algorithms. IEEE Transactions on Acoustics, Speechand SignalProcessing, ASSP-34:1097-1110, 1986.

S.J. Elliott and P.A. Nelson. Active noise control. EEE Signal Processing Magazine,10 4), 1993.

E. Eweda. Transient performance degradation of the LMS, RLS, sign, signed regres-sor, and sign-sign algorithms with da ta correlation. IEEE Transactions o n Circuitsand Systems 11: Analog and Digital Signal Processing, 46 8):1055-1062, 1999.

M.V. Eyuboglu and S.U. Qureshi. Reduced-state sequence estimation with set par-titioning and decision feedback. IEEE Transactions on Communications,36:13-20,1988.

A. Farina. Target tracking with bearings-only measurements. Signal Processing, 781):61-78, 1998.

W.J. Fitzgerald, J.J.K. Ruanaidh, and J.A. Yates. Generalised changepoint detec-tion. Technical Report CUED/F-INFENG/TR 187, Department of Engineering,University of Cambridge, England, 1994.

Page 486: Adaptive Filtering and Change Dectection

7/25/2019 Adaptive Filtering and Change Dectection

http://slidepdf.com/reader/full/adaptive-filtering-and-change-dectection 486/495

482 Bibliography

G.D. Forney. The Viterbi algorithm. Proceedings of t he IE EE , 61:268-278, 1973.

G.D. Forney. Minimal bases of rational vector spaces, with applications to multi-variable linear systems. SIA M Journ al of Control ,13 3):493-520, May 1975.

U. Forssell. Closed-loop dentification:Methods,Theory, ndApplications. PhDthesis, Linkoping University, Sweden, 1999.

P.M. Frank. Fault diagnosis in dynamic systems using analytical and knowledge-based redundancy - a survey and some new results. Automatica, 26(3):459-474, 1990.

P.M. Frank and X. Ding. Frequency domain approach to optimally robust residual generation and evaluation for model-based fault diagnosis. Automatica, 30(5):789-804, 1994a.

P.M. Frank and X. Ding. Frequency domain approach to optimally robust residual generation and evaluation for model-based fault diagnosis. Automatica, 30(5):789-804, 1994b.

B. Friedlander. Lattice filters for adaptive processing. Proceedings of the IEEE, 70(8):829-867, August 1982.

E. Frisk and M. Nyberg. A minimal polynomial basis solution to residual generation for fault diagnosis in linear systems. In IFAC 99, Beijing, China, 1999.

W.A. Gardner. Cyclostationarity in communications and signal processing. IEEE Press, 1993.

J. Gertler. Analytical redundancy methods in fault detection and isolation; survey and synthesis. In IFAC Fault Detection, Supervision and Safety for Technical Processes, pages 9-21, Baden-Baden, Germany, 1991.

J. Gertler. Fault detection and isolation using parity relations. Control Engineering Practice, 5(5):653-661, 1997.

J. Gertler. Fault detection and diagnosis in engineering systems. Marcel Dekker, 1998.

J. Gertler and R. Monajemy. Generating directional residuals with dynamic parity relations. Automatica, 31(4):627-635, 1995.

W. Gilks, S. Richardson, and D. Spiegelhalter. Markov chain Monte Carlo in practice. Chapman & Hall, 1996.

K. Giridhar, J.J. Shynk, R.A. Iltis, and A. Mathur. Adaptive MAPSD algorithms for symbol and timing recovery of mobile radio TDMA signals. IEEE Transactions on Communications, 44(8):976-987, 1996.

G.-O. Glentis, K. Berberidis, and S. Theodoridis. Efficient least squares adaptive algorithms for FIR transversal filtering. IEEE Signal Processing Magazine, 16(4):13-41, 1999.

G.C. Goodwin and K.S. Sin. Adaptive filtering, prediction and control. Prentice-Hall, Englewood Cliffs, NJ, 1984.

P. Grohan and S. Marcos. Structures and performances of several adaptive Kalman equalizers. In IEEE Digital Signal Processing Workshop Proceedings, 1996, pages 454-457, 1996.

S. Gunnarsson. Frequency domain accuracy of recursively identified ARX models. International Journal of Control, 54:465-480, 1991.

S. Gunnarsson and L. Ljung. Frequency domain tracking characteristics of adaptive algorithms. IEEE Transactions on Acoustics, Speech and Signal Processing, 37:1072-1089, 1989.

F. Gustafsson. Estimation of discrete parameters in linear systems. Dissertations no. 271, Linkoping University, Sweden, 1992.

F. Gustafsson. The marginalized likelihood ratio test for detecting abrupt changes. IEEE Transactions on Automatic Control, 41(1):66-78, 1996.

F. Gustafsson. A generalization of MDL for choosing adaptation mechanism and design parameters in identification. In SYSID'97, Japan, volume 2, pages 487-492, 1997a.

F. Gustafsson. Slip-based estimation of tire road friction. Automatica, 33(6):1087-1099, 1997b.

F. Gustafsson. Estimation and change detection of tire road friction using the wheel slip. IEEE Control Systems Magazine, 18(4):42-49, 1998.

F. Gustafsson and S.F. Graebe. Closed loop performance monitoring in the presence of system changes and disturbances. Automatica, 34(11):1311-1326, 1998.

F. Gustafsson, S. Gunnarsson, and L. Ljung. On time-frequency resolution of signal properties using parametric techniques. IEEE Transactions on Signal Processing, 45(4):1025-1035, 1997.

F. Gustafsson and H. Hjalmarsson. 21 ML estimators for model selection. Automatica, 31(10):1377-1392, 1995.

F. Gustafsson and A. Isaksson. Best choice of coordinate system for tracking coordinated turns. Submitted to IEEE Transactions on Automatic Control, 1996.

F. Gustafsson and B.M. Ninness. Asymptotic power and the benefit of under-modeling in change detection. In Proceedings of the 1995 European Control Conference, Rome, pages 1237-1242, 1995.

F. Gustafsson and B. Wahlberg. Blind equalization by direct examination of the input sequences. IEEE Transactions on Communications, 43(7):2213-2222, 1995.

P.-O. Gutman and M. Velger. Tracking targets using adaptive Kalman filtering. IEEE Transactions on Aerospace and Electronic Systems, 26(5):691-699, 1990.

E.J. Hannan and B.G. Quinn. The determination of the order of an autoregression. Journal of the Royal Statistical Society, Series B, 41:190-195, 1979.

M.H. Hayes. Statistical Digital Signal Processing and Modeling. John Wiley & Sons, Inc., 1996.

S. Haykin. Digital communication. John Wiley & Sons, Inc., 1988.

S. Haykin, editor. Blind deconvolution. Prentice-Hall, 1994.

S. Haykin. Adaptive filter theory. Information and System Sciences. Prentice-Hall, 3rd edition, 1996.

S. Haykin, A.H. Sayed, J. Zeidler, P. Yee, and P. Wei. Tracking of linear time-variant systems. In IEEE Military Communications Conference, 1995. MILCOM 95, Conference Record, pages 602-606, 1995.

J. Hellgren. Compensation for hearing loss and cancelation of acoustic feedback in digital hearing aids. PhD thesis, Linkoping University, April 1999.

T. Hagglund. New estimation techniques for adaptive control. PhD thesis, Lund Institute of Technology, 1983.

M. Holmberg, F. Gustafsson, E.G. Hornsten, F. Winquist, L.E. Nilsson, L. Ljung, and I. Lundstrom. Feature extraction from sensor data on bacterial growth. Biotechnology Techniques, 12(4):319-324, 1998.

J. Homer, I. Mareels, and R. Bitmead. Analysis and control of the signal dependent performance of adaptive echo cancellers in 4-wire loop telephony. IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, 42(6):377-392, 1995.

L. Hong and D. Brzakovic. GLR-based adaptive Kalman filtering in noise removal. In IEEE International Conference on Systems Engineering, 1990, pages 236-239, 1990.

Wang Hong, Zhen J. Huang, and S. Daley. On the use of adaptive updating rules for actuator and sensor fault diagnosis. Automatica, 33(2):217-225, 1997.

M. Hou and R.J. Patton. Input observability and input reconstruction. Automatica, 34(6):789-794, 1998.

P.J. Huber. Robust statistics. John Wiley & Sons, 1981.

C.M. Hurvich and C. Tsai. Regression and time series model selection in small samples. Biometrika, 76:297-307, 1989.

A. Isaksson. On system identification in one and two dimensions with signal processing applications. PhD thesis, Linkoping University, Department of Electrical Engineering, 1988.

R. Isermann. Supervision, fault-detection and fault-diagnosis methods - an introduction. Control Engineering Practice, 5(5):639-652, 1997.

R. Isermann and P. Balle. Trends in the application of model-based fault detection and diagnosis of technical processes. Control Engineering Practice, 5:707-719, 1997.

A.G. Jaffer and S.C. Gupta. On estimation of discrete processes under multiplicative and additive noise conditions. Information Sciences, 3:267, 1971.

R. Johansson. System modeling and identification. Prentice-Hall, 1993.

T. Kailath. Linear systems. Prentice-Hall, Englewood Cliffs, NJ, 1980.

T. Kailath, A.H. Sayed, and B. Hassibi. State space estimation theory. Course compendium at Stanford University. To be published, 1998.

W. Kasprzak, H. Niemann, and D. Wetzel. Adaptive estimation procedures for dynamic road scene analysis. In IEEE International Conference on Image Processing, 1994. Proceedings. ICIP-94, pages 563-567, 1994.

I.E. Kazakov. Estimation and identification in systems with random structure. Automatika i Telemekanika, 8:59, 1979.

J.Y. Keller. Fault isolation filter design for linear stochastic systems. Automatica, 35(10):1701-1706, 1999.

J.Y. Keller and M. Darouach. Reduced-order Kalman filter with unknown inputs. Automatica, 34(11):1463-1468, 1998.

J.Y. Keller and M. Darouach. Two-stage Kalman estimator with unknown exogenous inputs. Automatica, 35(2):339-342, 1999.

T. Kerr. Decentralized filtering and redundancy management for multisensor navigation. IEEE Transactions on Aerospace and Electronic Systems, 23:83-119, 1987.

M. Kinnaert, R. Hanus, and P. Arte. Fault detection and isolation for unstable linear systems. IEEE Transactions on Automatic Control, 40(4):740-742, 1995.

G. Kitagawa and H. Akaike. A procedure for the modeling of nonstationary time series. Annals of the Institute of Statistical Mathematics, 30:351-360, 1978.

M. Koifman and I.Y. Bar-Itzhack. Inertial navigation system aided by aircraft dynamics. IEEE Transactions on Control Systems Technology, 7(4):487-493, 1999.

E. Kreyszig. Introductory functional analysis with applications. John Wiley & Sons, Inc., 1978.

V. Krishnamurthy and J.B. Moore. On-line estimation of hidden Markov model parameters based on the Kullback-Leibler information measure. IEEE Transactions on Signal Processing, pages 2557-2573, 1993.

K. Kumamaru, S. Sagara, and T. Soderstrom. Some statistical methods for fault diagnosis for dynamical systems. In R. Patton, P. Frank, and R. Clark, editors, Fault diagnosis in dynamic systems - Theory and application, pages 439-476. Prentice Hall International, London, UK, 1989.

H.J. Kushner and J. Yang. Analysis of adaptive step-size SA algorithms for parameter tracking. IEEE Transactions on Automatic Control, 40:1403-1410, 1995.

H.J. Kushner and G. Yin. Stochastic approximation algorithms and applications. Springer-Verlag, New York, 1997.

D.G. Lainiotis. Joint detection, estimation, and system identification. Information and Control, 19:75-92, 1971.

J.E. Larsson. Diagnostic reasoning strategies for means-end models. Automatica, 30(5):775-787, 1994.

M. Larsson. Behavioral and Structural Model Based Approaches to Discrete Diagnosis. PhD thesis, Linkopings Universitet, November 1999. Linkoping Studies in Science and Technology. Thesis No 608.

T. Larsson. A State-Space Partitioning Approach to Trellis Decoding. PhD thesis, Chalmers University of Technology, Gothenburg, Sweden, 1991.

A.F.S. Lee and S.M. Heghinian. A shift of the mean level in a sequence of independent normal variables - a Bayesian approach. Technometrics, 19:503-506, 1978.

W.C.Y. Lee. Mobile Communications Engineering. McGraw-Hill, 1982.

E.L. Lehmann. Testing statistical hypothesis. Statistical/Probability series. Wadsworth & Brooks/Cole, 1991.

X.R. Li and Y. Bar-Shalom. Design of an interacting multiple model algorithm for air traffic control tracking. IEEE Transactions on Control Systems Technology, 1:186-194, 1993.

L. Lindbom. A Wiener Filtering Approach to the Design of Tracking Algorithms With Applications in Mobile Radio Communications. PhD thesis, Uppsala University, November 1995.

D.V. Lindley. A statistical paradox. Biometrika, 44:187-192, 1957.

L. Ljung. Asymptotic variance expressions for identified black-box transfer function models. IEEE Transactions on Automatic Control, 30:834-844, 1985.

L. Ljung. Aspects on accelerated convergence in stochastic approximation schemes. In CDC'94, pages 1649-1652, Lake Buena Vista, Florida, 1994.

L. Ljung. System identification, theory for the user. Prentice Hall, Englewood Cliffs, NJ, 1999.

L. Ljung and T. Glad. Modeling and simulation. Prentice-Hall, 1996.

L. Ljung and S. Gunnarsson. Adaptation and tracking in system identification - a survey. Automatica, 26:7-21, 1990.

L. Ljung and T. Soderstrom. Theory and practice of recursive identification. MIT Press, Cambridge, MA, 1983.

D.G. Luenberger. Observers for multivariable systems. IEEE Transactions on Automatic Control, 11:190-197, 1966.

T.P. MacGarty. Bayesian outlier rejection and state estimation. IEEE Transactions on Automatic Control, 20:682, 1975.

J.M. Maciejowski. Multivariable feedback design. Addison Wesley, 1989.

J.F. Magni and P. Mouyon. On residual generation by observer and parity space approaches. IEEE Transactions on Automatic Control, 39(2):441-447, 1994.

Mahajan, Muller, and Bass. New product diffusion models in marketing: a review and directions for research. Journal of Marketing, 54, 1990.

D.P. Malladi and J.L. Speyer. A generalized Shiryayev sequential probability ratio test for change detection and isolation. IEEE Transactions on Automatic Control, 44(8):1522-1534, 1999.

C.L. Mallows. Some comments on Cp. Technometrics, 15:661-675, 1973.

R.S. Mangoubi. Robust estimation and failure detection. Springer-Verlag, 1998.

M.A. Massoumnia, G.C. Verghese, and A.S. Willsky. Failure detection and identification. IEEE Transactions on Automatic Control, AC-34(3):316-321, 1989.

P.S. Maybeck and P.D. Hanlon. Performance enhancement of a multiple model adaptive estimator. IEEE Transactions on Aerospace and Electronic Systems, 31(4):1240-1254, 1995.

A. Medvedev. Continuous least-squares observers with applications. IEEE Transactions on Automatic Control, 41(10):1530-1537, 1996.

R. Mehra, S. Seereeram, D. Bayard, and F. Hadaegh. Adaptive Kalman filtering, failure detection and identification for spacecraft attitude estimation. In Proceedings of the 4th IEEE Conference on Control Applications, pages 176-181, 1995.

G. Minkler and J. Minkler. Theory and application of Kalman filtering. Magellan Book Company, 1990.

L.A. Mironovskii. Functional diagnosis of linear dynamic systems. Automation and Remote Control, pages 1198-1205, 1980.

B. Mulgrew and C.F.N. Cowan. Adaptive filters and equalisers. Kluwer Academic Publishers, Norwell, MA, 1988.

D. Murdin. Data fusion and fault detection in decentralized navigation systems. Master Thesis LiTH-ISY-EX-1920, Department of Electrical Engineering, Linkoping University, S-581 83 Linkoping, Sweden, 1998.

T. Nejsum Madsen and V. Ruby. An application for early detection of growth rate changes in the slaughter-pig production unit. Computers and Electronics in Agriculture, 25(3):261-270, 2000.

P. Newson and B. Mulgrew. Model based adaptive channel identification algorithms for digital mobile radio. In IEEE International Conference on Communications (ICC), pages 1531-1535, 1994.

M. Niedzwiecki. Identification of time-varying systems with abrupt parameter changes. In 9th IFAC/IFORS Symposium on Identification and System Parameter Identification, volume 11, pages 1718-1723, Budapest, 1991. IFAC.

M. Niedzwiecki and K. Cisowski. Adaptive scheme for elimination of broadband noise and impulsive disturbances from AR and ARMA signals. IEEE Transactions on Signal Processing, 44(3):528-537, 1996.

R. Nikoukhah. Innovations generation in the presence of unknown inputs: Application to robust failure detection. Automatica, 30(12):1851-1867, 1994.

B.M. Ninness and G.C. Goodwin. Robust fault detection based on low order models. Proceedings of IFAC Safeprocess 91 Symposium, 1991.

B.M. Ninness and G.C. Goodwin. Improving the power of fault testing using reduced order models. In Proceedings of the IEEE Conference on Control and Instrumentation, Singapore, February 1992.

M. Nyberg. Automatic design of diagnosis systems with application to an automotive engine. Control Engineering Practice, 7(8):993-1005, 1999.

M. Nyberg and L. Nielsen. Model based diagnosis for the air intake system of the SI-engine. SAE Paper, 970209, 1997.

E.S. Page. Continuous inspection schemes. Biometrika, 41:100-115, 1954.

PooGyeon Park and T. Kailath. New square-root algorithms for Kalman filtering. IEEE Transactions on Automatic Control, 40(5):895-899, 1995.

E. Parzen, editor. Time series analysis of irregularly observed data, volume 25 of Lecture Notes in Statistics. Springer-Verlag, 1984.

R. Patton, P. Frank, and R. Clark. Fault diagnosis in dynamic systems. Prentice Hall, 1989.

R.J. Patton. Robust model-based fault diagnosis: the state of the art. In IFAC Fault Detection, Supervision and Safety for Technical Processes, pages 1-24, Espoo, Finland, 1994.

M. Pent, M.A. Spirito, and E. Turco. Method for positioning GSM mobile stations using absolute time delay measurements. Electronics Letters, 33(24):2019-2020, 1997.

D.N. Politis. Computer-intensive methods in statistical analysis. IEEE Signal Processing Magazine, 15(1):39-55, 1998.

B.T. Polyak and A.B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal of Control and Optimization, 30:838-855, 1992.

J.E. Potter. New statistical formulas. Technical report, Memo 40, Instrumentation Laboratory, MIT, 1963.

J.G. Proakis. Digital communications. McGraw-Hill, 3rd edition, 1995.

J. Rissanen. Modeling by shortest data description. Automatica, 14:465-471, 1978.

J. Rissanen. Estimation of structure by minimum description length. Circuits, Systems and Signal Processing, 1:395-406, 1982.

J. Rissanen. Stochastic complexity and modeling. The Annals of Statistics, 14:1080-1100, 1986.

J. Rissanen. Stochastic complexity in statistical inquiry. World Scientific, Singapore, 1989.

S.W. Roberts. Control charts based on geometric moving averages. Technometrics, 8:411-430, 1959.

E.M. Rogers. Diffusion of innovations. New York, 1983.

K.-E. Arzen. Integrated control and diagnosis of sequential processes. Control Engineering Practice, 4(9):1277-1286, 1996.

Y. Sato. A method of self-recovering equalization for multilevel amplitude-modulation systems. IEEE Transactions on Communications, 23:679-682, 1975.

D. Sauter and F. Hamelin. Frequency-domain optimization for robust fault detection and isolation in dynamic systems. IEEE Transactions on Automatic Control, 44(4):878-882, 1999.

A.H. Sayed and T. Kailath. A state-space approach to adaptive RLS filtering. IEEE Signal Processing Magazine, 11(3):18-60, 1994.

J. Scargle. Wavelet methods in astronomical time series analysis. In Applications of time series analysis in astronomy and meteorology, page 226. Chapman & Hall, 1997.

G. Schwartz. Estimating the dimension of a model. Annals of Statistics, 6:461-464, 1978.

L.S. Segal. Partitioning algorithms with applications to economics. PhD thesis, State University of New York at Buffalo, 1979.

J. Segen and A.C. Sanderson. Detecting changes in time-series. IEEE Transactions on Information Theory, 26:249-255, 1980.

A. Sen and M.S. Srivastava. On tests for detecting change in the mean. Annals of Statistics, 3:98-108, 1975.

N. Seshadri and C.-E.W. Sundberg. Generalized Viterbi algorithms for error detection with convolutional codes. In IEEE GLOBECOM 89 Conf. Record, pages 1534-1537, 1989.

R. Settineri, M. Najim, and D. Ottaviani. Order statistic fast Kalman filter. In IEEE International Symposium on Circuits and Systems, 1996. ISCAS 96, pages 116-119, 1996.

N. Shephard and M.K. Pitt. Likelihood analysis of non-Gaussian measurement time series. Biometrika, 84:653-667, 1997.

J.J. Shynk. Adaptive IIR filtering. IEEE Signal Processing Magazine, 6(2):4-21, 1989.

M.G. Siqueira and A.A. Alwan. Steady-state analysis of continuous adaptation systems for hearing aids with a delayed cancellation path. In Proceedings of the 32nd Asilomar Conference on Signals, Systems, and Computers, pages 518-522, 1998.

M.G. Siqueira and A.A. Alwan. Bias analysis in continuous adaptation systems for hearing aids. In Proceedings of ICASSP 99, pages 925-928, 1999.

B. Sklar. Rayleigh fading channels in mobile digital communication systems. IEEE Communications Magazine, 35(7), 1997.

D.T.M. Slock. On the convergence behavior of the LMS and the normalized LMS algorithms. IEEE Transactions on Signal Processing, 41(9):2811-2825, 1993.

A.F.M. Smith. A Bayesian approach to inference about a change-point in a sequence of random variables. Biometrika, 62:407-416, 1975.

A.F.M. Smith and G.O. Roberts. Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods. Journal of the Royal Statistical Society, Series B, 55(1):3-23, 1993.

T. Soderstrom and P. Stoica. System identification. Prentice Hall, New York, 1989.

A. Soliman, G. Rizzoni, and Y.W. Kim. Diagnosis of an automotive emission control system using fuzzy inference. Control Engineering Practice, 7(2):209-216, 1999.

J.M. Spanjaard and L.B. White. Adaptive period estimation of a class of periodic random processes. In International Conference on Acoustics, Speech, and Signal Processing, 1995. ICASSP-95, pages 1792-1795, 1995.

P. Strobach. New forms of Levinson and Schur algorithms. IEEE Signal Processing Magazine, 8(1):12-36, 1991.

T. Svantesson, A. Lauber, and G. Olsson. Viscosity model uncertainties in an ash stabilization batch mixing process. In IEEE Instrumentation and Measurement Technology Conference, Baltimore, MD, 2000.

M. Tanaka and T. Katayama. Robust identification and smoothing for linear systems with outliers and missing data. In Proceedings of 11th IFAC World Congress, pages 160-165, Tallinn, 1990. IFAC.

M. Thomson, P.M. Twigg, B.A. Majeed, and N. Ruck. Statistical process control based fault detection of CHP units. Control Engineering Practice, 8(1):13-20, 2000.

N.F. Thornhill and T. Hagglund. Detection and diagnosis of oscillation in control loops. Control Engineering Practice, 5(10):1343-1354, 1997.

F. Tjarnstrom. Quality estimation of approximate models. Technical report, Department of Electrical Engineering, Licentiate thesis no. 810, Linkoping University, 2000.

J.R. Treichler, C.R. Johnson, Jr., and M.G. Larimore. Theory and design of adaptive filters. John Wiley & Sons, New York, 1987.

J.R. Treichler, I. Fijalkow, and C.R. Johnson, Jr. Fractionally spaced equalizers. IEEE Signal Processing Magazine, 13(3):65-81, 1996.

J.K. Tugnait. A detection-estimation scheme for state estimation in switching environment. Automatica, 15:477, 1979.

J.K. Tugnait. Detection and estimation for abruptly changing systems. Automatica, 18:607-615, 1982.

J. Valappil and C. Georgakis. Systematic estimation of state noise statistics for extended Kalman filters. AIChE Journal, 46(2):292-308, 2000.

S. van Huffel and J. Vandewalle. The total least squares problem. SIAM, Philadelphia, 1991.

G. Verghese and T. Kailath. A further note on backwards Markovian models. IEEE Transactions on Information Theory, 25:121-124, 1979.

N. Viswanadham, J.H. Taylor, and E.C. Luce. A frequency-domain approach to failure detection and isolation with application to GE-21 turbine engine control systems. Control Theory and Advanced Technology, 3(1):45-72, March 1987.

A.J. Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13:260-269, 1967.

Wahlbin. A new descriptive model of interfirm diffusion of new techniques. In EAARM XI Annual Conference, pages 1-2 43-49, Antwerpen, 1982.

A. Wald. Sequential analysis. John Wiley & Sons, New York, 1947.

A. Wald. Statistical decision functions. John Wiley & Sons, New York, 1950.

M. Waller and H. Saxen. Estimating the degree of time variance in a parametric model. Automatica, 36(4):619-625, 2000.

H. Wang and S. Daley. Actuator fault diagnosis: an adaptive observer-based technique. IEEE Transactions on Automatic Control, 41(7):1073-1078, 1996.