Research Collection

Doctoral Thesis

Input Estimation and Dynamical System Identification: New Algorithms and Results

Author(s): Bruderer, Lukas

Publication Date: 2015

Permanent Link: https://doi.org/10.3929/ethz-a-010538752

Rights / License: In Copyright - Non-Commercial Use Permitted

This page was generated automatically upon download from the ETH Zurich Research Collection. For more information please consult the Terms of use.

ETH Library


Series in Signal and Information Processing

Volume 27

Hartung-Gorre

Konstanz

Diss. ETH No. 22575

Input Estimation and Dynamical System Identification: New Algorithms and Results

A dissertation submitted to ETH Zurich for the degree of Doctor of Sciences

presented by

Lukas Bruderer
Dipl. El.-Ing. ETH
born on January 24, 1985
citizen of Speicher, AR

accepted on the recommendation of
Prof. Dr. Hans-Andrea Loeliger, examiner
Prof. Dr. Bernard Fleury, co-examiner

2015


Series in Signal and Information Processing Vol. 27

Editor: Hans-Andrea Loeliger

Bibliographic Information published by Die Deutsche Nationalbibliothek

Die Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie;

detailed bibliographic data is available on the internet at http://dnb.d-nb.de.

Copyright © 2015 by Lukas Bruderer

First Edition 2015

HARTUNG-GORRE VERLAG KONSTANZ

ISSN 1616-671X

ISBN-10: 3-86628-533-7

ISBN-13: 978-3-86628-533-0


Acknowledgments

I remember well the first discussion with my supervisor Hans-Andrea Loeliger before starting my PhD studies. It was hard not to be caught by his enthusiasm while he sketched a number of possible directions of research. When, on the road towards this thesis, new paths emerged - some of them off the beaten track - Andi was always supportive and swiftly brought up those "right" questions that would lead from one thing to another. Later I was consistently amazed by his ability to express our work and conceptual ideas in a clear and well-understandable way.

Next, I would like to thank Bernard Fleury for agreeing to co-examine my thesis. I recall a discussion with him in 2013 at a workshop in Aalborg, Denmark, that sparked my curiosity in a Bayesian approach to sparse recovery. In hindsight, it marked the beginning of fascinating work that is now a substantial part of this thesis.

A portion of this thesis was developed and inspired by the CTI¹ project "Adaptive Filterung von Zerspankraftsignalen", a cooperation of ETH Zurich and Kistler AG. I acknowledge the experimental data that I have received from Daniel Spescha, Josef Stirnimann, and Friedrich Kuster from Inspire AG and the Institute of Machine Tools and Manufacturing. Working on this interdisciplinary project was a unique opportunity to put oneself in the end user's position and learn a lot about sensor technology. I benefited a lot from thought-provoking meetings with Manuel Blatter, Thomas Wuhrmann, and Bernhard Bill from Kistler and the aforementioned colleagues from Inspire.

The colleagues at the lab contributed greatly to making the PhD studies such a positive experience. I am grateful for the many valuable discussions - on research topics or off-topic - some in front of whiteboards and some just next to the ISI espresso machine. At ISI, it was never hard to gain support for a project or join forces. In particular, it was inspiring to collaborate with Christoph Reller, Jonas Biveroni, Georg Wilckens, Sarah Neff, Nour Zalmaï, and Federico Wadehn. It is a pleasure to acknowledge Federico, who provided valuable comments on this document. Last but not least, I am highly indebted to Hampus Malmberg. The results in his master's thesis and the variety of open questions he elaborated and discussed contributed significantly to this thesis.

Apart from my fellow PhD students, I would also like to thank Rita Hildebrand for saving me worries about all sorts of administrative matters and Patrik Strebel for usually having a solution ready before the issue had even been raised.

Special thanks go to my semester thesis and master's thesis students - in particular, Filipe Barata, Ismail Celebi, and Christian Käslin for their contributions to the CTI project - and to all the other students for the diverse and very positive experiences and lessons that I could gather as a supervisor.

My deepest thanks go to my girlfriend Lisa for her relentless support, her considerate nature, and her patience during all stages of my PhD. You always managed to pronounce precisely what I needed to hear to chart my own path.

A special thanks goes to my parents, who supported me throughout my studies. Even though I was often away and airheaded, they were always there for me.

¹ Commission for Technology and Innovation

Page 6: Rights / License: Research Collection In Copyright - Non … · 2020-04-23 · iv cussions-onresearchtopicsoroff-topic-someinfrontofwhiteboards and some just next to the ISI espresso

Abstract

Recovery of signals from distorted or noisy observations has been a long-standing research problem with a wide variety of practical applications. We advocate approaching these types of problems by interpreting them as input estimation in finite-order linear state-space models. Among other applications, the input signal may represent a physical quantity and the state-space model a sensor yielding corrupted readings.

In this thesis, we provide new estimation algorithms and theoretical results for different classes of input signals: continuous-time input signals and weakly sparse input signals. The method for the latter is obtained by specializing a more general framework for inference with sparse priors and sparse signal recovery, which, in contrast to standard methods, amounts to iterations of Gaussian message passing. The applicability of input estimation is extended to complex models, which generally are computationally more demanding and may be prone to numerical instability, by introducing new numerically robust computation methods expressed as Gaussian message passing in factor graphs.

In practical applications, a signal model may not necessarily be available a priori. As a consequence, in addition to input estimation, estimation of the state-space model itself must also be addressed. To this end, we introduce a variational statistical framework to retrieve convenient state-space models for input estimation and present a joint input and model estimation algorithm for weakly sparse input signals.

The proposed methods are substantiated with two real-world application examples. First, we consider impaired mechanical sensor measurements in machining processes and show that input estimation with suitable model identification can result in more accurate measurements when strong resonances distort the sensor readings. Second, we show that our simultaneous weakly-sparse input estimation and model estimation method is capable of identifying individual heart beats from ballistocardiographic measurements, a method used to measure cardiac output non-invasively.

Keywords: Gaussian Message Passing; State-Space Models; Factor Graphs; Sparse Estimation; System Identification.

Page 8: Rights / License: Research Collection In Copyright - Non … · 2020-04-23 · iv cussions-onresearchtopicsoroff-topic-someinfrontofwhiteboards and some just next to the ISI espresso

Kurzfassung

The estimation of signals from distorted or noisy observations is a long-standing research problem with a wide variety of practically relevant applications. To approach this type of problem, we interpret it as input estimation in finite-order linear state-space models. A concrete application example for our approach are physical measurands - the input signals - and the corresponding sensors, which are represented by state-space models.

In this dissertation, we introduce new estimation methods and both theoretical and experimental results for different classes of input signals: continuous-time input signals and weakly sparse input signals. The estimation method for the latter is a special case of an inference method for the recovery of sparse signals which, in contrast to other standard methods, reduces to iterative Gaussian message passing. Thanks to the introduction of numerically robust Gaussian message-passing procedures, the applicability of the input estimation methods can be guaranteed for more complex models, which have a higher model order and are therefore typically computationally more demanding and prone to numerical errors.

In practice, a signal model is usually not available a priori, which is why, in addition to input estimation, the identification of effective state-space models is also necessary. To this end, we present a variational statistical model that guarantees state-space models suitable for input estimation, and we present a model estimation algorithm for weakly sparse input signals.

The applicability of the new methods is substantiated with two real-world application examples. First, we consider the equalization of dynamic force measurements during milling processes and show that input estimation with suitable model identification leads to more accurate measurements. Second, we apply the simultaneous estimation of weakly sparse input signals and state-space models to ballistocardiographic measurements, a non-invasive method for measuring cardiac output. We show that the introduced method enables the extraction of heart beats and thus the computation of relevant physiological parameters.

Keywords: Gaussian Message Passing; State-Space Models; Sparse Signal Estimation; Factor Graphs; Model Estimation.

Page 10: Rights / License: Research Collection In Copyright - Non … · 2020-04-23 · iv cussions-onresearchtopicsoroff-topic-someinfrontofwhiteboards and some just next to the ISI espresso

Contents

Abstract v

Kurzfassung vii

1 Introduction 1

1.1 Contributions and Overview . . . . . . . . . . . . . . . . . 2

1.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.2.1 Input Estimation . . . . . . . . . . . . . . . . . . . 4

1.2.2 System Identification . . . . . . . . . . . . . . . . . 5

2 Preliminaries 7

2.1 Notation and Definitions . . . . . . . . . . . . . . . . . . . 7

2.2 Expectation-Maximization Method . . . . . . . . . . . . . 10

2.2.1 Convergence Properties . . . . . . . . . . . . . . . 10

2.2.2 Gradient-Based Likelihood Optimization . . . . . . 12

2.2.3 Expectation-Maximization Acceleration Methods . 12

2.2.4 Expectation-Maximization-based Estimation of State Space Models . . . . . . . . . . . . . 13

2.3 Wiener filter . . . . . . . . . . . . . . . . . . . . . . . . . 13

ix


3 Numerically Stable Gaussian Message-Passing 17

3.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . 17

3.2 Decimation of Low-Pass-Type Systems . . . . . . . . . . . 18

3.3 State-Space Reparametrization . . . . . . . . . . . . . . . 20

3.3.1 Numerical Robustness . . . . . . . . . . . . . . . . 22

3.3.2 State-Space-Model Identification . . . . . . . . . . 25

3.4 Square-Root Message Passing . . . . . . . . . . . . . . . . 25

3.4.1 Gaussian Square-Root Message Passing . . . . . . 26

3.4.2 Computational Optimization . . . . . . . . . . . . 32

3.4.3 Expectation-Maximization Updates . . . . . . . . 32

3.4.4 Experimental Results . . . . . . . . . . . . . . . . 35

3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . 38

4 Gaussian Message Passing with Dual-Precisions 39

4.1 Dual-Precision . . . . . . . . . . . . . . . . . . . . . . . . 39

4.2 Message Passing Tables . . . . . . . . . . . . . . . . . . . 41

4.3 Algorithms Based on W and Wµ . . . . . . . . . . . . . . 41

4.3.1 Smoothing in SSMs . . . . . . . . . . . . . . . . . 41

4.3.2 Steady-State Smoothing . . . . . . . . . . . . . . . 45

4.3.3 E-Step in SSMs . . . . . . . . . . . . . . . . . . . . 47

4.3.4 Continuous-Time Smoothing . . . . . . . . . . . . 48

4.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . 49

5 Input Signal Estimation 51

5.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . 51

5.2 Computation and Continuity of u(t) . . . . . . . . . . . . 53

5.3 Wiener Filter Perspective . . . . . . . . . . . . . . . . . . 54

5.4 Postfiltering . . . . . . . . . . . . . . . . . . . . . . . . . . 55

5.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . 57


5.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . 58

6 Input Estimation for Force Sensors 61

6.1 Problem Setup and Modeling . . . . . . . . . . . . . . . . 62

6.2 Model-Based Input Estimation . . . . . . . . . . . . . . . 64

6.2.1 Model-Based Input Estimation . . . . . . . . . . . 64

6.2.2 Input estimator implementation . . . . . . . . . . 66

6.3 Sensor Model Identification . . . . . . . . . . . . . . . . . 67

6.3.1 Tuning of Identification Parameters . . . . . . . . 71

6.3.2 Implementation . . . . . . . . . . . . . . . . . . . . 72

6.3.3 Improvements of Model-Identification Algorithm . 72

6.4 Frequency-Based MMSE Filters . . . . . . . . . . . . . . . 76

6.4.1 Frequency-Based Filtering . . . . . . . . . . . . . . 76

6.4.2 Frequency Response Function Estimation . . . . . 78

6.4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . 80

6.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

6.5.1 System Identification Results . . . . . . . . . . . . 82

6.5.2 Force Estimation Performance . . . . . . . . . . . 85

6.5.3 Convergence Properties . . . . . . . . . . . . . . . 86

6.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . 89

7 Sparse Bayesian Learning in Factor Graphs 91

7.1 Variational Prior Representation . . . . . . . . . . . . . . 91

7.1.1 Multi-Dimensional Features . . . . . . . . . . . . . 93

7.2 Sparse Bayesian Learning in Factor Graphs . . . . . . . . 93

7.3 Multiplier Optimization . . . . . . . . . . . . . . . . . . . 96

7.4 Fast Sparse Bayesian Learning . . . . . . . . . . . . . . . 99

7.4.1 Multi-Dimensional Fast Sparse Bayesian Learning 102

7.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . 108


8 Sparse Input Estimation in State Space Models 111

8.1 Sparse Input Estimation . . . . . . . . . . . . . . . . . . . 111

8.1.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . 113

8.1.2 Simulation Results . . . . . . . . . . . . . . . . . . 113

8.2 Blind Deconvolution . . . . . . . . . . . . . . . . . . . . . 116

8.2.1 Type-I Estimators vs. Type-II Estimators . . . . . 116

8.2.2 Algorithm . . . . . . . . . . . . . . . . . . . . . . . 118

8.2.3 Simulation Results . . . . . . . . . . . . . . . . . . 120

8.2.4 Heart Beat Detection for Ballistocardiography . . 122

8.3 Conclusions and Outlook . . . . . . . . . . . . . . . . . . 125

A Algorithm Statements 127

A.1 Square-Root Smoothing . . . . . . . . . . . . . . . . . . . 127

A.2 Continuous-Time Posterior Computation . . . . . . . . . . 127

A.3 System Identification in Chapter 6 . . . . . . . . . . . . . 128

A.4 Sparse Input Estimation . . . . . . . . . . . . . . . . . . . 130

B Proofs 133

B.1 Proofs for Chapter 3 . . . . . . . . . . . . . . . . . . . . . 135

B.2 Proofs for Chapter 4 . . . . . . . . . . . . . . . . . . . . . 136

B.3 Proofs for Chapter 5 . . . . . . . . . . . . . . . . . . . . . 137

B.4 Proofs for Chapter 6 . . . . . . . . . . . . . . . . . . . . . 140

B.5 Proofs for Chapter 7 . . . . . . . . . . . . . . . . . . . . . 142

C Spline Prior 147

C.1 Relation to Splines . . . . . . . . . . . . . . . . . . . . . . 148

D Additional Material on Dynamometer Filtering 151

D.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . 151

D.2 Measured Frequency Responses . . . . . . . . . . . . . . . 152


E Learnings from Implementing Gaussian Message Passing Algorithms 155

F Multi-Mass Resonator Model 159

Bibliography 163

About the Author 171


Chapter 1

Introduction

A fundamental and thoroughly investigated problem in signal processing is the estimation or equalization of signals given disturbed or distorted measurements. We focus on a related class of problems where the measurement processes or sensors can be modeled with finite-order linear state-space models (SSMs) and promote an SSM-based approach to estimation.

SSMs are powerful tools that have been an essential part of many practical and theoretical developments over the last decades. In an SSM, all information is contained in a finite-dimensional state variable. The state is usually not directly observable, but it evolves in time according to a transition law. In practical applications, SSMs provide flexibility with respect to measurement or sampling schemes and often permit relatively painless handling of complex, even mildly non-linear, models.

The estimation problems we consider can be expressed as state or input estimation in SSMs, where the SSM need not be restricted to merely being an accurate model of the measurement device. With our approach, equalization of an observed signal is treated as input estimation in an SSM. The input estimation method, however, gives access to a wider range of interesting problems, some of which are shown in this thesis.

When confronted with actual tasks and real-world applications, some questions arise. The most apparent one is how to obtain an SSM when none is given a priori. This task is known as system identification. As a wide choice of system identification methods is available, the main challenge is commonly to select the method that best suits the task for which the model is needed.

Once a model is set, we take a Bayesian stance and attempt to estimate the inputs to the model that best explain the present observations. One challenge faced in doing so is the high computational requirements of common methods, often along with numerical stability problems.

In many practical applications, no prior knowledge about the actual input signal is available, or biasing the estimates is not desired. Thus, input estimates based on the weakest possible assumptions are sought. For discrete-time input signals, well-known estimators exist. Continuous-time input estimation, on the other hand, is not well understood. How can we obtain estimators, and do these estimators behave according to our expectations?

From an algorithmic point of view, linear SSMs with Gaussian disturbances share an intimate relationship with Gaussian inference methods, notably Gaussian message passing. These methods distinguish themselves by the possibility to decompose the computations into simple local computations and, under suitable circumstances, by yielding an exact solution in a fixed number of steps. When dealing with non-Gaussian statistical models, which allow us to introduce useful types of prior knowledge about the inputs, we seek Gaussian message passing algorithms to perform the computations (approximately).

In this thesis, we will touch on many of these points and aspects, which lead to new algorithms and results. Our contributions are summarized next.

1.1 Contributions and Overview

In Chapter 2 we introduce the notation and necessary background onexpectation maximization (EM).

Input Estimation

• We present a model-based filter for estimation of undistorted measurements in an industrial application (Section 6.2.1) and a three-dimensional Wiener filter (Section 6.4). The estimation performance is evaluated with extensive measurement data, and trade-offs in terms of estimation performance and implementation are discussed for our model-based filter approach.


• We demonstrate the use of sparsifying (“compressible”) priors withvariational representations for input signals of state space modelssuch that the resulting estimation algorithm amounts to Gaussianmessage passing.

• We devise a Gaussian message-passing scheme for inference in SSMs and for EM that uses square-root Kalman filtering methods and offers attractive numerical stability properties (Section 3.4).

• A state-space basis construction method is presented that, typi-cally, improves numerical robustness of Gaussian message passingand is readily applicable (Section 3.3).

• Extending previous message passing schemes and ideas, we pro-pose a new pertinent and very efficient Gaussian message passingscheme in SSMs (Chapter 4) that does not require any matrix in-version, if the SSM’s output dimension is 1.

• Several new theoretical and experimental results on a continuous-time white Gaussian noise estimator, proposed in previous work,are presented (Chapter 5).

System Identification

• We show a new numerically more stable Gaussian message passingscheme for SSMs that uses matrix square-roots of the covariances.The improved numerical properties are substantiated with experi-mental results (Section 3.4.3).

• Based on the proposed inversion-free message passing method, EM computations are derived that are free of matrix inversions. The latter property is essential to improve numerical stability and ensures reduced computational overhead (Section 4.3.3).

• A result (Theorem 6.1) indicates that maximum-likelihood estimation is well suited to identify systems that are subsequently employed for input estimation. Using a non-probabilistic view, insight on parameter selection for the proposed identification method is given (Section 6.3.1).

• We derive an algorithm for blind system identification (Section 8.2), where the actual computations amount to an iterative application of Gaussian message passing. We then demonstrate and compare its performance on a synthetic problem and show its effectiveness in finding heart beat times in ballistocardiographic measurements, a challenging real-world physiological signal-processing problem.

1.2 Related Work

1.2.1 Input Estimation

Input estimation problems are usually addressed by beginning with additional assumptions on the unknown, potentially random, input signal. A classical approach is to assume that the desired signal is low-pass filtered Gaussian noise and to estimate it either by a Wiener filter or by a Kalman smoother [38]. These methods extend to much more general Gaussian priors, such as splines [73], but consequently also lead to more complex estimators. Joint state and input estimation for discrete-time linear SSMs has been analyzed in [26], which essentially corresponds to discrete-time input estimation using a non-informative prior. Estimators for continuous-time input estimation were presented in [9], with a smooth prior on the input process, and in [10], with a non-informative white Gaussian noise prior.

The methods we propose for estimating weakly sparse or compressible signals fit into the broader class of sparse Bayesian learning (SBL) methods, which were initiated with the seminal paper [67] and have since found applications in signal processing [23] and communications [65]. However, it seems that this powerful class of models has (to the best of our knowledge) never been used in the context of linear dynamic models.

Approaches related to (weakly) sparse input estimation assume the input to be sparse with respect to some basis, as, e.g., in [71], and then either solve the combinatorial problem with exact sparsity measures or a more tractable problem using sparsity-promoting regularizers. For the input estimation problem in SSMs, the application of non-Gaussian priors, in particular sparsity-promoting priors, has not received a lot of attention in the past. Related approaches essentially use heavy-tailed priors on the inputs of discrete-time SSMs to devise (statistically) robust Kalman filters [19], but do not try to recover the input signals.


A different Bayesian approach to the estimation of sparse processes observed through a finite-order model is given in [3] for continuous-time sparse signals. The presented algorithms, based on spline theory, however, are only applicable to very simple SSMs (i.e., SSMs of order at most 1).

Blind deconvolution problems have been solved with a variety of methods. Many related methods express blind deconvolution as an optimization problem with sparsity-promoting regularizers, e.g., blind source separation [22, 85] and dictionary learning [1]. The Bayesian approach to such problems was first introduced in image processing [21] and formulated with mean-field variational SBL in [4], where a scale mixture representation of the LASSO is used to blindly estimate image blurs modeled as two-dimensional finite impulse-response (FIR) filters. The Bayesian approach provides not only an estimate of the sparse variables, but also reliability information about the posterior (e.g., the posterior variance), which can be essential for blind deconvolution [41].

1.2.2 System Identification

A wide variety of methods for the identification of SSMs have been developed. The most widely used ones are:

• Subspace system identification methods [68] seek an SSM that approximately spans the subspace generated by the given input-output measurements. The principal quantity is the Hankel matrix of the identification data. These methods can easily handle additional complications such as non-zero initialization (transients), but unfortunately lack a simple error criterion or probabilistic interpretation.

• Prediction-error methods [43] are often the preferred method for system identification of SSMs. Prediction-error methods minimize the prediction error in an SSM in an online fashion. Asymptotically, these methods minimize the absolute model error weighted by the relative spectral magnitude of the input signal. Unfortunately, minimization of the error criterion for SSMs requires working with the innovation form of an SSM, which is not the same SSM as the ones considered here.

While under certain circumstances prediction-error methods asymptotically approximate the maximum-likelihood (ML) estimators, direct ML-based estimation does not seem very popular in system identification. In the case of the general stochastic SSMs that we consider, however, prediction-error methods cannot be used, and the complex ML problem can often only be solved approximately (though not asymptotically) with EM-based algorithms. We present an overview of prior work on EM in SSM identification in Section 2.2.


Chapter 2

Preliminaries

2.1 Notation and Definitions

We write matrices in boldface (e.g., A) and vectors in italic boldface (e.g., a). We will usually use uppercase for random variables and matrices. For a matrix A, we write A^T, A^{-1}, det(A), and Tr(A) for its transpose, its inverse, its determinant, and its trace (the sum of its diagonal elements). The Kronecker product of two matrices A and B is denoted A ⊗ B. Superscripts in square brackets, e.g., x^[t], denote the value of the variable x in iteration t, typically of an iterative algorithm. A normal distribution with mean m and variance σ² is denoted by N(m, σ²). The probability density function (pdf) of a normal distribution is denoted by N(x|m, σ²).

Definition 2.1: Linear discrete-time SSM
A linear discrete-time SSM of order d with states x_k ∈ R^d, inputs u_k ∈ R^m, and outputs y_k ∈ R^n satisfies the following equations for all k ∈ Z:

    x_{k+1} = \mathbf{A}_k x_k + \mathbf{B}_k u_k   (2.1)
    y_k = \mathbf{C}_k x_k.   (2.2)

An SSM is specified by its parameters: the state-transition matrix A_k ∈ R^{d×d} and the matrices B_k ∈ R^{d×m} and C_k ∈ R^{n×d}. If A_k = A, B_k = B, and C_k = C for all k ∈ Z, the SSM is time-invariant.

Often, Gauss-Markov statistical models will be used.


Definition 2.2: Linear stochastic SSMs
We define a linear stochastic SSM of order d as the stochastic process generated by

    X_{k+1} = \mathbf{A}_k X_k + \mathbf{B}_k U_k   (2.3)
    Y_k = \mathbf{C}_k X_k + N_k,   (2.4)

i.e., a d-th order linear discrete-time SSM with parameters A_k, B_k, and C_k, with independent Gaussian random variables U_k ∼ N(u_k, V_{U_k}) and N_k ∼ N(0, V_{N_k}) for all k ∈ Z. The model is time-invariant if the SSM is time-invariant and V_{U_k} = V_U and V_{N_k} = V_N. Notice that this definition does not require the mean of U_k to be time-invariant.
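To make Definition 2.2 concrete, the following Python sketch simulates a time-invariant single-input single-output instance. All numerical values (the second-order A with a double pole at 0.9 and the noise variances) are illustrative choices of ours, not parameters from this thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative time-invariant SISO SSM of order d = 2.
A = np.array([[1.8, -0.81],
              [1.0,  0.0]])   # state-transition matrix (both poles at 0.9)
b = np.array([1.0, 0.0])      # input vector
c = np.array([0.0, 1.0])      # output vector
sigma2_U, sigma2_N = 1.0, 0.01  # input and observation noise variances

K = 200
x = np.zeros(2)
y = np.empty(K)
for k in range(K):
    u = rng.normal(0.0, np.sqrt(sigma2_U))   # U_k ~ N(0, sigma2_U)
    n = rng.normal(0.0, np.sqrt(sigma2_N))   # N_k ~ N(0, sigma2_N)
    y[k] = c @ x + n                         # (2.4): Y_k = c^T X_k + N_k
    x = A @ x + b * u                        # (2.3): X_{k+1} = A X_k + b U_k
```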

The stochastic SSM is linear and Gaussian. Stochastic processes with the same distribution as X_k or Y_k in (2.3) belong to the class of Gauss-Markov random processes (i.e., Gaussian random processes with Markov properties). We refer the reader to [58] for an interesting treatment of a more general case.

The models defined in Definitions 2.1 and 2.2 are multiple-input multiple-output models. In many instances, we consider single-input single-output SSMs: vectors b_k and c_k replace B_k and C_k at the SSMs' inputs and outputs, and the covariances are replaced by scalar variances σ²_{U_k} and σ²_{N_k} accordingly.

A factor graph representation of the stochastic SSM from Definition 2.2 is shown in Figure 2.1. Factor graphs are graphical models that represent function factorizations. In Forney factor graphs, nodes of the graph denote factors, whereas edges correspond to the variables. We refer the reader to [40, 44] for a general introduction to factor graphs and to [59, Section 1.5.2] for a description of the notation used in this thesis. When there are multiple edges that represent one state X_k in the factor graph of an SSM, we distinguish between the edge variable to the left of an "="-factor and the one to the right by denoting them X_k and X′_k, respectively.

Stochastic SSMs will be used as probabilistic models throughout this thesis, and, in this case, the factor graph representation provides a concise and versatile tool to facilitate and describe algorithms that compute various statistical quantities. In particular, we will be interested in marginal probability densities. To this end, we will make heavy use of standard rules and relations from [47].

Experimental results are generally compared with respect to their normalized mean-squared error (NMSE).

Page 24: Rights / License: Research Collection In Copyright - Non … · 2020-04-23 · iv cussions-onresearchtopicsoroff-topic-someinfrontofwhiteboards and some just next to the ISI espresso

2.1 Notation and Definitions 9


Figure 2.1: Factor graph representation of a stochastic SSM.

The NMSE of an estimate \hat{w} with respect to the true (vector) value w is

    \mathrm{NMSE} \triangleq \frac{\|\hat{w} - w\|^2}{\|w\|^2}.   (2.5)

When comparing different estimation methods, another metric is the MSE improvement factor with respect to a baseline estimate, e.g., the unprocessed observations y. Using the definitions from (2.5), the factor is given by

    \Delta\mathrm{MSE} \triangleq \frac{\|\hat{w} - w\|^2}{\|y - w\|^2}.   (2.6)

Note that NMSE and ΔMSE do not depend on the absolute magnitude of the signals.
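A direct transcription of (2.5) and (2.6) in Python (a small utility sketch; the function names are ours):

```python
import numpy as np

def nmse(w_hat, w):
    # (2.5): squared estimation error normalized by the signal energy.
    return np.sum((w_hat - w) ** 2) / np.sum(w ** 2)

def delta_mse(w_hat, w, y):
    # (2.6): MSE improvement factor relative to a baseline estimate,
    # e.g., the unprocessed observations y.
    return np.sum((w_hat - w) ** 2) / np.sum((y - w) ** 2)
```

Scaling w, ŵ, and y by a common factor leaves both quantities unchanged, which is the scale invariance noted above.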


2.2 Expectation-Maximization Method

Expectation maximization (EM) [6, 17] is a widely used iterative technique for parameter estimation. It may be seen as an instance of majorization minimization (MM) techniques [35, 82], a class of optimization methods. MM methods minimize (maximize) a function by alternately constructing an upper-bounding (lower-bounding) surrogate function to the objective and then minimizing (maximizing) the surrogate function.

More specifically, suppose we wish to compute

    \hat{\theta} \triangleq \operatorname{argmax}_{\theta} f(\theta),   (2.7)

with a positive, not necessarily convex or unimodal, function f(θ). A function g(θ|θ̃) is said to minorize f(θ) if i) f(θ̃) = g(θ̃|θ̃) and ii) f(θ) ≥ g(θ|θ̃).

In particular, the EM technique applies when f(θ) = log l(θ) is a log-likelihood and there exist latent or missing variables x and a probability measure p(x|θ) such that

    l(\theta) = \int p(x|\theta)\,\mathrm{d}x.

EM at iteration k with current estimate θ^[k] then proceeds by constructing a lower bound Q(θ|θ^[k]) to f(θ) (expectation step or E-step),

    Q(\theta|\theta^{[k]}) = \int p(x|\theta^{[k]}) \log p(x|\theta)\,\mathrm{d}x = \mathrm{E}_{p(x|\theta^{[k]})}\!\left[\log p(x|\theta)\right],

followed by maximizing the surrogate function (maximization step or M-step) to obtain the new estimate θ^[k+1]:

    \theta^{[k+1]} = \operatorname{argmax}_{\theta} Q(\theta|\theta^{[k]}).
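As a minimal, self-contained illustration of the E- and M-steps above (a toy problem of our choosing, not one of the SSM algorithms developed later), consider EM for the means of a two-component Gaussian mixture with known weight and variance; the latent variable x is the component label:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: mixture of N(-2, 1) (weight 0.4) and N(3, 1) (weight 0.6).
labels = rng.random(500) < 0.4
data = np.where(labels, rng.normal(-2.0, 1.0, 500), rng.normal(3.0, 1.0, 500))

w, sigma = 0.4, 1.0            # known mixture weight and standard deviation
theta = np.array([-1.0, 1.0])  # initial estimates of the two means

for _ in range(50):
    # E-step: posterior responsibility of component 0 given theta^[k].
    p0 = w * np.exp(-0.5 * ((data - theta[0]) / sigma) ** 2)
    p1 = (1 - w) * np.exp(-0.5 * ((data - theta[1]) / sigma) ** 2)
    r0 = p0 / (p0 + p1)
    # M-step: maximizing Q(theta | theta^[k]) yields weighted sample means.
    theta = np.array([np.sum(r0 * data) / np.sum(r0),
                      np.sum((1 - r0) * data) / np.sum(1 - r0)])
```

Each iteration provably does not decrease the likelihood, which is the monotonicity property (2.8) discussed next.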

2.2.1 Convergence Properties

The main property of MM-type algorithms, hence also of the EM algorithm, is that the sequential estimates force f(θ^[k]) to increase monotonically, i.e.,

    f(\theta^{[k]}) \le f(\theta^{[\ell]}),   (2.8)

for all k ≤ ℓ. In other words, an MM step (EM step) is guaranteed to at least achieve the same objective value (likelihood) as the current point. While this usually implies convergence to a stationary point, stronger global convergence properties, such as convergence to a local maximum, can in general not be established [81]¹.

Local convergence of EM algorithms, on the other hand, is characterized in [17] in the general case and in [51, 83] for specific problems. Let M(θ) represent the implicit mapping such that θ^[k+1] = M(θ^[k]) for any k ∈ Z. If the sequence of parameters θ^[k] converges to² θ̂, then M(θ̂) = θ̂, and by a Taylor expansion of M(θ) in the neighborhood of θ̂ it follows that

    \theta^{[k+1]} - \hat{\theta} = \nabla M(\hat{\theta}) \left( \theta^{[k]} - \hat{\theta} \right).   (2.9)

Thus, EM algorithms [17] and also MM algorithms [35] converge with linear rate (as do, e.g., gradient methods). The rate is given by the spectral radius of ∇M(θ̂) (the largest eigenvalue of the matrix ∇M(θ̂)). For EM, the rate matrix ∇M(θ̂) can be expressed as [17]

    \nabla M(\hat{\theta}) = \mathbf{I} - \left( \frac{\partial^2 Q(\theta|\hat{\theta})}{\partial \theta^2} \bigg|_{\theta=\hat{\theta}} \right)^{\!-1} \frac{\mathrm{d}^2 \ell(\hat{\theta})}{\mathrm{d}\theta^2},   (2.10)

with ℓ(θ) ≜ log l(θ), and an analogous expression holds for MM algorithms [82]. When (2.7) is an ML estimation problem, the quotient in (2.10) can be given a statistical interpretation: ∂²Q(θ|θ̂)/∂θ²|_{θ=θ̂} corresponds to the Fisher information³ on θ given the missing variables x (complete Fisher information) and, similarly, d²ℓ(θ̂)/dθ² is seen as the Fisher information of the ML problem (2.7) (incomplete Fisher information). Therefore, the convergence rate of EM depends on the quotient of the complete Fisher information and the incomplete Fisher information.

¹A typical counterexample to guaranteed convergence to local maxima are saddle points, where EM iterations can get trapped.
²Given that the mapping M(θ) is continuous.
³The Fisher information, given as

    I(\theta) = \mathrm{E}_{X|\theta}\!\left[ \left( \frac{\partial}{\partial \theta} \log f(x|\theta) \right)^{\!2} \right] = -\mathrm{E}_{X|\theta}\!\left[ \frac{\partial^2}{\partial \theta^2} \log f(x|\theta) \right],

is a measure of the information that a random variable X conveys about a parameter θ [39]. The second equality holds under some mild regularity conditions on the likelihood [39].


2.2.2 Gradient-Based Likelihood Optimization

A well-known problem with EM-based approximation of an ML estimate is that the estimates can converge very slowly (e.g., [42, 61, 83]). To overcome this limitation, various acceleration strategies have been proposed (see [51] for an overview). We point out two simple acceleration methods, which are easily applicable in a wide variety of situations. These observations and many acceleration strategies carry over to general MM-based methods [82].

2.2.3 Expectation-Maximization Acceleration Methods

The simplest acceleration method is step doubling [82]. This method replaces the next EM estimate θ^[k+1] by the estimate

    (1 - \eta)\,\theta^{[k]} + \eta\,\theta^{[k+1]},

with η > 1.

Another strategy is to use the gradient of the surrogate function Q(θ|θ^[k]) and combine it with powerful first-order optimization techniques such as conjugate gradient or quasi-Newton methods [11]. When Q(θ|θ^[k]) is differentiable in θ⁴, it turns out that its gradient corresponds to the gradient of the log-likelihood [17]:

    \nabla_{\theta} Q(\theta|\theta^{[k]}) \big|_{\theta=\theta^{[k]}} = \nabla \ell(\theta^{[k]}).   (2.11)

This implies that if a computationally feasible EM algorithm can be devised for a specific application, gradient-based optimization of the likelihood function itself, i.e., of the problem in (2.7), is readily available. Note that these schemes might exhibit faster convergence than EM, but generally they are also first-order methods. More importantly, gradient-based schemes usually do not provide monotonicity guarantees like (2.8) of the EM algorithm.

An additional idea is to devise methods that alternately use EM steps and gradient steps, switching between the schemes according to a fixed schedule (e.g., as EM tends to converge slowly in later iterations, switching to a gradient-based method might yield faster convergence) or according to an adaptive switching criterion based on the current estimates and surrogate functions. In fact, good results have been obtained with such a scheme in learning Gaussian mixtures [61].

⁴This, for instance, always applies when the likelihood pertains to the exponential family [81].
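The step-doubling rule is a one-line wrapper around any existing EM update. The sketch below is our own illustration (`em_update` and `loglik` are assumed callables supplied by the application); it also shows a safeguarded variant that falls back to the plain EM step whenever the doubled step would decrease the likelihood, thereby retaining the monotonicity property (2.8):

```python
def step_doubled(theta, em_update, eta=2.0):
    # Step doubling: extrapolate along the EM step direction;
    # eta = 1 recovers the plain EM update.
    return (1.0 - eta) * theta + eta * em_update(theta)

def safeguarded_step(theta, em_update, loglik, eta=2.0):
    # Accept the extrapolated step only if it does not decrease the
    # likelihood; otherwise fall back to the plain (monotone) EM step.
    theta_fast = step_doubled(theta, em_update, eta)
    return theta_fast if loglik(theta_fast) >= loglik(theta) else em_update(theta)
```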


or an adaptive switching criterion based on the current estimates andsurrogate functions. In fact, good results have been obtained with sucha scheme in learning Gaussian mixtures [61].

2.2.4 Expectation-Maximization-based Estimation of State Space Models

ML estimation of linear SSM parameters is a well-known application of the EM algorithm: it was first presented in [63] in the context of time-series models, corresponding to SSMs with known output matrix C, while a comprehensive treatment of the general discrete-time SSM (i.e., multi-dimensional inputs and multi-dimensional outputs) is presented in [24, 25]. A message-passing view on EM, with applications in the estimation of SSMs, is given in [16]. In this thesis, we heavily rely on concepts and ideas from the latter work.

Given observations y = y₁, y₂, ... and an order d, we are interested in iteratively approximating the ML estimate of the parameters of a single-input single-output stochastic SSM. To this end, we focus on the autoregressive-form parameterization of the SSM with 2d free parameters (see [16]): there are d parameters that define a, the first row of A, and d parameters for the vector c (the vector b is fixed). The ML parameter estimation problem then translates into

    \hat{a}, \hat{c} \triangleq \operatorname{argmax}_{a, c}\; p(y \,|\, a, c;\, \sigma_N^2, \sigma_U^2).   (2.12)

We refer the reader to Algorithm A.3 in Appendix A.3 (cf. Steps 3) and 4)) for a standard implementation of the EM method applied to (2.12).

We will typically take the noise variances as known or fixed a priori (see, e.g., Section 6.3) and not consider additional estimation schemes. In many cases, ML estimation of the noise variances can be added to EM-based methods as in, e.g., [16].

2.3 Wiener filter

A Wiener filter is the linear minimum mean-squared error (LMMSE) estimator of a wide-sense stationary random process; see, e.g., [38]. In the following, we focus on d-dimensional observations (Y_k ∈ R^d for all k) and d-dimensional target signals and their LMMSE estimates (U_k, Û_k ∈ R^d), i.e., we present the more general multi-dimensional Wiener filter. The Wiener filter is chosen such that for all k the estimates Û_k, given by

    \hat{u}_k = \sum_{l=-s}^{-t} \mathbf{G}_l\, y_{k-l},

where the matrices G_l are the filter coefficient (matrices) of the Wiener filter, minimize the conditional squared error

    \mathrm{E}\!\left[ \| U_k - \hat{U}_k \|^2 \,\middle|\, Y_{k+t}, \ldots, Y_{k+s} \right],

for −∞ < t < s < ∞. One approach to obtain the coefficients G_l is by means of the orthogonality principle [38], which in the multi-dimensional case is expressed as

    \mathrm{E}\!\left[ (U_k - \hat{U}_k)\, Y_l^{\mathrm T} \right] = \mathbf{0}   (2.13)

and must hold for all l ∈ [k + t, k + s]. To proceed, we specialize to non-causal IIR-type filters (t → −∞, s → ∞); thus (2.13) must be satisfied for all l ∈ Z.

Remark 2.1
When the filter is of FIR-type (i.e., t and s are finite), the computations necessary to obtain the (matrix) filter coefficients are slightly different from the IIR case: the orthogonality principle leads to a (block) Toeplitz linear system of equations, which has to be solved for the filter coefficients. An efficient algorithm to compute the filter coefficients is the (block) Levinson-Durbin algorithm (see, e.g., [34]).

Let S^{(UU)}(e^{jθ}) be the power spectral density of U_k, given by the (matrix) discrete-time Fourier transform (DTFT) of the autocorrelation function

    R^{(UU)}_l = \mathrm{E}\!\left[ U_k U_{k-l}^{\mathrm T} \right].

We now assume that

    Y_k = \sum_{l=0}^{\infty} \mathbf{H}_l\, U_{k-l} + N_k,

with N_k ∈ R^d (possibly correlated) random vectors with mean 0 and finite autocorrelation R^{(NN)}_l, i.e., the observations y_k are linearly distorted and noisy versions of the original signal. Let S^{(NN)}(e^{jθ}) be the power spectral density of N_k. Basic manipulations yield the expression

    S^{(YY)}(e^{j\theta}) = \mathbf{H}(e^{j\theta})\, S^{(UU)}(e^{j\theta})\, \mathbf{H}^{\mathrm H}(e^{j\theta}) + S^{(NN)}(e^{j\theta})

for the power spectral density of Y_k and likewise

    S^{(UY)}(e^{j\theta}) = S^{(UU)}(e^{j\theta})\, \mathbf{H}^{\mathrm H}(e^{j\theta}),

representing the cross-spectral density of U_k and Y_k. From (2.13), a computational method to obtain the Wiener filter can now be found by solving the linear equations

    \mathbf{G}(e^{j\theta})\, S^{(YY)}(e^{j\theta}) - S^{(UY)}(e^{j\theta}) = \mathbf{0},

which must hold at all frequencies θ ∈ [0, 2π], where G(e^{jθ}) is the DTFT of the Wiener filter coefficients G_l. The DTFT of the non-causal Wiener filter then follows as

    \mathbf{G}(e^{j\theta}) = S^{(UY)}(e^{j\theta}) \left( S^{(YY)}(e^{j\theta}) \right)^{-1}
                            = S^{(UU)}(e^{j\theta})\, \mathbf{H}^{\mathrm H}(e^{j\theta}) \left( \mathbf{H}(e^{j\theta})\, S^{(UU)}(e^{j\theta})\, \mathbf{H}^{\mathrm H}(e^{j\theta}) + S^{(NN)}(e^{j\theta}) \right)^{-1}.   (2.14)

If for all k the random vectors U_k and N_k are iid and spatially white with variances σ²_u and σ²_n, respectively, (2.14) simplifies to

    \mathbf{G}(e^{j\theta}) = \sigma_u^2\, \mathbf{H}^{\mathrm H}(e^{j\theta}) \left( \sigma_u^2\, \mathbf{H}(e^{j\theta})\, \mathbf{H}^{\mathrm H}(e^{j\theta}) + \sigma_n^2 \mathbf{I} \right)^{-1}.
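For the scalar, spatially white special case at the end of (2.14), the frequency response can be evaluated on an FFT grid and applied by pointwise multiplication. The sketch below is our own illustration; it approximates the ideal non-causal IIR filter by circular convolution, which is reasonable when the signal is much longer than the effective filter length:

```python
import numpy as np

def noncausal_wiener(y, h, sigma2_u, sigma2_n):
    """Scalar special case of (2.14), applied circularly via the FFT.

    y: observed signal; h: impulse response of the distortion;
    U_k and N_k are assumed white with variances sigma2_u and sigma2_n.
    """
    n = len(y)
    H = np.fft.fft(h, n)  # H(e^{j theta}) evaluated on the FFT grid
    G = sigma2_u * np.conj(H) / (sigma2_u * np.abs(H) ** 2 + sigma2_n)
    return np.real(np.fft.ifft(G * np.fft.fft(y)))
```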


Chapter 3

Numerically Stable Gaussian Message-Passing

In practice, the application of Gaussian message-passing algorithms may be limited by numerical stability problems. While these issues are relatively certain to appear for large, complex problems, in some cases moderately sized problems are affected as well.

In this chapter, we consider methods applicable when standard Gaussian message passing implementations experience numerical instabilities. After outlining the principle used to assess numerical stability (Section 3.1), we establish that fast-sampled SSMs may cause poor numerical properties of message-passing methods (Section 3.2). Next, we explore SSM reparametrizations and show how this idea can be exploited to improve the robustness of model-based methods (Section 3.3). Finally, a novel message-passing scheme, called square-root message passing, is introduced (Section 3.4). Throughout, we focus on time-invariant single-input single-output models, but several underlying principles carry over to more general SSM classes.

3.1 Methodology

Assessing the numerical properties of inference or model estimation in linear SSMs (i.e., estimation of the state-transition matrix A, input matrix B, and output matrix C representing a fixed-order SSM) is a very complicated task due to the recursive type of processing and the non-linear operations used. As a result, prior work on the numerical stability of these algorithms is limited: there exists some analysis on the numerical properties of (forward-only) Kalman filtering variants [53, 69], but no relevant results seem to be known for Kalman smoothing, let alone EM-based parameter estimation. In addition, known error bounds tend to be too loose.

Our goal is not to provide a formal analysis of the numerical stability properties of the message-passing algorithms relevant in this work, but to devise practical solutions that extend their applicability.

In Kalman filtering, and more generally in message passing, inaccuracies caused by the finite-precision implementation of the computations i) come directly from an operation or ii) propagate and get amplified by the recursive algorithms. For Kalman filtering, it was shown in [69] that error propagation is mitigated as long as the covariances are kept symmetric. In contrast, we are interested in the numerical errors caused by each message update.¹ In Gaussian message passing algorithms, updates boil down to basic matrix computations. While vector and matrix multiplications and additions are numerically stable, i.e., do not cause losses in significant digits, matrix inversions can be problematic.

Tight bounds on (relative) numerical errors for matrix inversions, as well as for other complex matrix operations, can be given in terms of the condition number of the matrix, which is the ratio of the largest singular value to the smallest one:

    \kappa(\mathbf{A}) \triangleq \|\mathbf{A}\|_2\, \|\mathbf{A}^{-1}\|_2 = \frac{\sigma_{\max}(\mathbf{A})}{\sigma_{\min}(\mathbf{A})}.
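In code, κ(A) is available directly; a minimal check (numpy's `cond` uses the 2-norm by default, matching the definition above):

```python
import numpy as np

def cond2(A):
    # Ratio of the largest to the smallest singular value of A.
    s = np.linalg.svd(A, compute_uv=False)
    return s[0] / s[-1]

A = np.array([[1.0, 0.0], [0.0, 1e-6]])
assert np.isclose(cond2(A), np.linalg.cond(A))  # both give 1e6
```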

In the following, we concentrate on the condition numbers of the forward covariance matrix and the backward covariance matrix, as these are central in marginalization and estimation (see, e.g., the EM update rules in Algorithm A.3).

¹Another source of error encountered in message-passing computations on SSMs lies in the solutions of discrete algebraic Riccati equations (DAREs). We elaborate on numerical issues with these methods in Appendix E.

3.2 Decimation of Low-Pass-Type Systems

A common observation made with many sensors is that their frequency transfer characteristic exhibits a low-pass behavior. This behavior can be very pronounced because of high sampling rates compared to the characteristic time constants of the physical sensor, and as a result useful information is only present in a narrow low-frequency band. An application where this issue occurs is discussed in Chapter 6.



Figure 3.1: Condition numbers of \overrightarrow{V}_{X_\infty} for random 8th-order systems, plotted against the mean value of their poles' frequencies (0 to π) on a logarithmic scale (10⁰ to 10¹⁸). (a) General random systems. (b) Resonant multi-mass systems (cf. Appendix F).

When considering SSM-based algorithms, slowly varying band-limited systems may pose various challenges. In particular, for estimation and system identification, numerical issues may arise due to poorly conditioned covariances and precisions in Gaussian message passing. Note that poorly conditioned covariances on the one hand reduce the numerical precision of message updates that involve matrix inversions (e.g., the computation of marginals in message passing for SSMs), and on the other hand have larger backward errors in the matrices themselves, as mentioned in [32]. We next illustrate with a numerical example how this issue occurs for random SSMs and is particularly strong for the multi-mass systems defined in Appendix F.

Example 3.1
We generate 500 realizations of two types of 8th-order discrete-time SISO SSMs and compute the condition number of the steady-state state covariance matrix \overrightarrow{V}_{X_\infty} by solving the corresponding DARE [59]. For clarity of exposition, we measure the low-pass characteristic of an SSM by the mean value of its poles' frequencies. The results are shown in Figure 3.1. On the left, condition numbers of random systems generated by the Matlab command drss² are shown. The right figure presents random resonant multi-mass systems (cf. Appendix F), whose realizations are created by choosing random masses and randomly perturbing the damping coefficients of the multi-mass system models.

In Figure 3.1, it can be seen that low-frequency poles lead to larger κ(\overrightarrow{V}_{X_\infty}). While for general systems the correlation is visible, for the special class of resonant multi-mass systems, which feature a highly attenuated spectrum for frequencies above the resonant frequency and also close pole-zero pairs, this phenomenon is much more pronounced. In line with observations made with actual multi-mass system measurements (see Chapter 6), it can be concluded from the example that model-based message-passing algorithms for highly oversampled systems are often numerically unstable.

²The command drss is a standard routine that is widely accepted to evaluate system identification methods. It generates random SISO systems of a given order.

A simple but effective remedy for numerically unstable message-passing methods is to reduce the sampling frequency. In fact, for the SSM realizations in the previous example that exhibit a low-pass characteristic and thus poorly conditioned covariances, the pole frequencies would increase and the condition numbers would consequently decrease. In practice, this approach can be implemented by decimation of the signals before processing with message-passing based methods.

When system identification is performed, care must be taken not to affect the model estimates through the low-pass filtering prior to downsampling. However, since such signals have, by definition, a low-pass characteristic, this is usually not an issue. Model identification on decimated signals followed by normal-rate filtering with a fixed model was adopted in the methods described in Chapter 6 and showed very promising results.
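As an illustration of this remedy (a sketch; the decimation factor q and the FIR anti-aliasing choice are illustrative, not prescribed by this chapter), scipy's standard decimation routine low-pass filters and downsamples in one step:

```python
import numpy as np
from scipy.signal import decimate

def decimate_for_identification(y, q):
    # Anti-alias low-pass filter and downsample by the integer factor q
    # before message-passing-based identification; zero_phase=True avoids
    # introducing phase distortion into the subsequent model estimate.
    return decimate(np.asarray(y, dtype=float), q, ftype="fir", zero_phase=True)

# Example: reduce an oversampled sensor signal by a factor of 8.
y = np.random.default_rng(0).standard_normal(4096)
y_dec = decimate_for_identification(y, 8)  # len(y_dec) == 512
```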

3.3 State-Space Reparametrization

Let T ∈ R^{n×n} be an invertible matrix and assume an SSM of order n. A reparametrization of the state with T, or equivalently of the SSM, maps the state vectors X_k onto new state vectors

    \tilde{X}_k = \mathbf{T} X_k.   (3.1)


Figure 3.2: Factor graph of the SSM, where the states are transformed with \tilde{X}_k = \mathbf{T} X_k. For the sake of graphical presentation we define \mathbf{S} \triangleq \mathbf{T}^{-1}.

The SSM parameters are then adapted such that the input-output behavior of the SSM is preserved:

    \tilde{\mathbf{A}} = \mathbf{T} \mathbf{A} \mathbf{T}^{-1}, \qquad \tilde{\mathbf{B}} = \mathbf{T} \mathbf{B}, \qquad \tilde{\mathbf{C}} = \mathbf{C} \mathbf{T}^{-1}.   (3.2)

Reparameterizations can be shown graphically as well: they correspond to introducing the deterministic factors T and T^{-1} at each time step and then moving these factors in the graph. The result is demonstrated in Figure 3.2.

From the graphical representation in Figure 3.2, it is evident how Gaussian messages are transformed when reparametrizing an SSM. In particular, we see that the covariance matrices and precision matrices of the different parametrizations (transformed by (3.1)) are related by

    \tilde{\mathbf{V}}_{\tilde{X}_k} = \mathbf{T} \mathbf{V}_{X_k} \mathbf{T}^{\mathrm T}, \qquad \tilde{\mathbf{W}}_{\tilde{X}_k} = \mathbf{T}^{-\mathrm T} \mathbf{W}_{X_k} \mathbf{T}^{-1}.   (3.3)
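The transformations (3.2) and (3.3) are one-liners in code; the following sketch (our own helper functions, with hypothetical names) makes the bookkeeping explicit:

```python
import numpy as np

def reparametrize_ssm(A, B, C, T):
    # (3.2): new SSM parameters with identical input-output behavior.
    T_inv = np.linalg.inv(T)
    return T @ A @ T_inv, T @ B, C @ T_inv

def transform_messages(V, W, T):
    # (3.3): how a covariance V and a precision W transform under (3.1).
    T_inv = np.linalg.inv(T)
    return T @ V @ T.T, T_inv.T @ W @ T_inv
```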


3.3.1 Numerical Robustness

One cause of numerically poor estimation in SSMs is states that have widely different dynamic ranges and amplitudes. This might then lead to poor condition numbers of the covariance matrix or the precision matrix. The effect is illustrated with a very simple example below.

Example 3.2
Consider a simple SSM of order 2 with parameters

    \mathbf{A} = \begin{pmatrix} \rho & 0 \\ 0 & -\rho \end{pmatrix}, \qquad \mathbf{b} = \begin{pmatrix} 1 \\ \varepsilon \end{pmatrix}, \qquad \mathbf{c} = \begin{pmatrix} 1 & 1 \end{pmatrix}.

We are interested in the steady-state covariance matrix and its condition number, given that observations of the output are corrupted by Gaussian noise with variance σ²_N and the SSM is driven by Gaussian noise with variance σ²_U. The steady-state covariance matrix for ρ = 0.99, ε = 0.01, σ²_N = 0.01, and σ²_U = 1 is

    \overrightarrow{\mathbf{V}}_X = \begin{pmatrix} 1.01 & 0.01 \\ 0.01 & 0.0002 \end{pmatrix},

and its condition number is already 1.56 · 10⁴. At the root of this relatively high condition number lie the different scales of the two states.

To approach this issue, we re-parameterize the states by x′ = Tx with

    \mathbf{T} = \operatorname{diag}(1, 100)

and the SSM according to (3.2); the condition number of \overrightarrow{\mathbf{V}}_X is subsequently reduced to 8.5.
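The rescaling step of Example 3.2 is easy to reproduce numerically. In the sketch below we start from the rounded covariance printed above, so the computed condition numbers only approximate the values 1.56 · 10⁴ and 8.5 reported for the unrounded matrix:

```python
import numpy as np

# Rounded steady-state covariance from Example 3.2.
V = np.array([[1.01,   0.01],
              [0.01, 0.0002]])
print(np.linalg.cond(V))   # on the order of 1e4: the states differ in scale

# Rescale the second state by 100, i.e., T = diag(1, 100); the covariance
# transforms according to (3.3).
T = np.diag([1.0, 100.0])
V_tilde = T @ V @ T.T
print(np.linalg.cond(V_tilde))  # reduced by roughly three orders of magnitude
```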

Another issue that arises when choosing a specific parametrization is the sensitivity of the SSM's properties to small changes in the coefficient values. The AR form, where $A^{\mathsf T}$ is also known as a companion matrix [28], is particularly susceptible to this issue (see [70, Example 7.4]). Nevertheless, the AR form is often the only possible parametrization choice³ for EM-based parameter identification. In Appendix E, we therefore provide a measure to assess this sensitivity in particular situations. Experimental observations regarding the AR form are given in Section 6.5.

³ EM-based system identification for other parametrizations (e.g., full system matrix [16] or block diagonal form [59]) requires state noise with full-rank covariance matrices. For single-input systems of order larger than 1 that are subject to input noise only, the state noise covariance matrices are not full rank.


Sensitivity of the parametrization can also be a relevant factor for the condition number of the DARE [32], potentially resulting in less accurate solutions.
When the SSM parametrization can be chosen freely, a natural question is which representation yields the most stable numerical message-passing algorithms. In the following, we focus on the condition number of covariance matrices and precision matrices as a metric for the numerical robustness of message-passing computations. Inspired by Example 3.2 and by balanced reduction of high-order state-space models to low-order models, see e.g. [60], we propose the following method to improve the conditioning of SSM representations and hence make message-passing implementations more robust.

Algorithm 3.1

1) Compute the steady-state covariance matrix $\overrightarrow{V}_{X_\infty}$ and precision matrix $\overleftarrow{W}_{X_\infty}$ for the current state-space parametrization.

2) Obtain a transformation $T$ that simultaneously diagonalizes $\overrightarrow{V}_{X_\infty}$ and $\overleftarrow{W}_{X_\infty}$ and also attempts to "balance" the eigenvalues of the two matrices.

3) Transform the SSM using $T$ and $T^{-1}$ and perform all computations in the transformed SSM.

In the second step, a transformation matrix $T$ is sought that simultaneously diagonalizes $\overrightarrow{V}_{X_\infty}$ and $\overleftarrow{W}_{X_\infty}$ while also improving their respective condition numbers by making large diagonal values smaller and small ones larger. As shown in [28, p. 500], we can always find a transformation such that two symmetric positive semi-definite matrices are both diagonalized. Once both matrices are diagonal, we are still free to choose $d$ parameters to make the $2d$ diagonal values of the two matrices more even. In our case, we obtain such a transformation from the following steps.


Algorithm 3.2: Simultaneous Diagonalization of Steady-State Matrices

1) Compute the eigenvalue decomposition of $\overrightarrow{V}_{X_\infty}$, i.e.,

$Q \Lambda Q^{\mathsf T} \leftarrow \overrightarrow{V}_{X_\infty}.$

2) Perform the eigenvalue decomposition

$D \Xi D^{\mathsf T} \leftarrow \Lambda^{1/2} Q^{\mathsf T} \overleftarrow{W}_{X_\infty} Q \Lambda^{1/2},$

where $\Lambda = \Lambda^{1/2}\Lambda^{1/2}$.

3) Obtain $T$ from

$T \leftarrow \Xi^{1/4} D^{\mathsf T} \Lambda^{-1/2} Q^{\mathsf T}$

and $T^{-1}$ from

$T^{-1} \leftarrow Q \Lambda^{1/2} D \Xi^{-1/4}.$
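A direct implementation of Algorithm 3.2 with standard eigenvalue routines is given in the following NumPy sketch; the positive definite matrices V and W below are random stand-ins for the steady-state quantities $\overrightarrow{V}_{X_\infty}$ and $\overleftarrow{W}_{X_\infty}$.

```python
import numpy as np

def balancing_transform(V, W):
    """Algorithm 3.2: return T with T V T^T = T^{-T} W T^{-1} = Xi^{1/2}."""
    lam, Q = np.linalg.eigh(V)                      # step 1: V = Q diag(lam) Q^T
    L = np.sqrt(lam)
    M = (L[:, None] * (Q.T @ W @ Q)) * L[None, :]   # Lambda^{1/2} Q^T W Q Lambda^{1/2}
    xi, D = np.linalg.eigh(M)                       # step 2: M = D diag(xi) D^T
    T = (xi[:, None] ** 0.25) * (D.T / L[None, :]) @ Q.T   # step 3
    Tinv = Q @ (L[:, None] * D) / (xi[None, :] ** 0.25)
    return T, Tinv, np.sqrt(xi)

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)); V = A @ A.T + np.eye(4)   # stand-in for V
B = rng.standard_normal((4, 4)); W = B @ B.T + np.eye(4)   # stand-in for W
T, Tinv, xi_sqrt = balancing_transform(V, W)
print(np.allclose(T @ V @ T.T, np.diag(xi_sqrt)))          # True
print(np.allclose(Tinv.T @ W @ Tinv, np.diag(xi_sqrt)))    # True
```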

By inspection it is seen that the computed $T$ transforms the SSM such that the steady-state matrices (cf. (3.3)) become diagonal:

$T \overrightarrow{V}_{X_\infty} T^{\mathsf T} = \Xi^{1/2},$
$T^{-\mathsf T} \overleftarrow{W}_{X_\infty} T^{-1} = \Xi^{1/2}.$

Now, observe that

$T^{-\mathsf T} \overleftarrow{W}_{X_\infty} T^{-1}\, T \overrightarrow{V}_{X_\infty} T^{\mathsf T} = T^{-\mathsf T} \overleftarrow{W}_{X_\infty} \overrightarrow{V}_{X_\infty} T^{\mathsf T} = \Xi.$  (3.4)

Typically, the eigenvalues of $\overleftarrow{W}_{X_\infty}\overrightarrow{V}_{X_\infty}$ are much closer to 1 than those of the steady-state matrices $\overrightarrow{V}_{X_\infty}$ and $\overleftarrow{W}_{X_\infty}$. The similarity transformation in (3.4) then implies that the diagonal entries of $\Xi^{1/2}$ are more even and hence, the transformed steady-state matrices will have a better condition number⁴.

⁴ The product in (3.4) also shows another point: no matter what $T$ we select, the eigenvalues of the product are always the same and, in some sense, distributing the product evenly among the two matrices is a reasonable choice.


3.3.2 State-Space-Model Identification

When considering estimation of SSM parameters under reparametrizations, a natural question that arises is whether the convergence properties of the EM algorithm can be improved by a suitable state transformation. This question is answered negatively:

Proposition 3.1
Let $\ell(\theta) = \log p(y|\theta)$ be the likelihood of a linear time-invariant SSM, where $\theta$ is defined as

$\theta = \begin{pmatrix} \mathrm{vec}(A) \\ \mathrm{vec}(B) \\ \mathrm{vec}(C) \end{pmatrix}.$

Let $\theta^\star$ be a stationary point of the likelihood, i.e., $\nabla\ell(\theta^\star) = 0$. Assume that an EM-estimate $\theta$ converges to $\theta^\star$. Its local convergence rate $\rho_{\theta^\star}$ (cf. Section 2.2) is invariant to the parametrization of the SSM.

The proof is provided in Appendix B on page 135.

Remark 3.1
From the proof of Proposition 3.1, it is easily seen that for non-linearly parametrized SSMs⁵ the local convergence rate is also invariant to changes of the SSM form. In the proof, the transformation $T$ is simply replaced by the respective Jacobian matrix.

3.4 Square-Root Message Passing

Computation of numerical values for the mean, the covariance, and the precision matrices is key in Gaussian message-passing implementations. The numerical stability of these computations is determined by the largest

⁵ A simple, yet relevant example is a second-order SSM with transition matrix parametrized by $\rho$ and $\omega$,

$A = \rho \begin{pmatrix} \cos(\omega) & -\sin(\omega) \\ \sin(\omega) & \cos(\omega) \end{pmatrix}.$

Obviously, $\omega$ is non-linearly related to the entries of $A$. For more examples of non-linear parametrizations refer to e.g. [59, Section 3.2.2].


condition numbers of the matrices involved in the message-passing calculations, and by asymmetric errors⁶ in the numerical values of symmetric matrices [69,74]. It is possible to ensure that second-order variables stay strictly symmetric by utilizing "square roots" instead of full matrices. Assume that $S$ is a square root of a symmetric positive definite matrix $M = S^{\mathsf T}S$. Apart from the symmetry that is preserved, the condition number of the factor is reduced to

$\kappa(S) = \|S\|_2\,\|S^{-1}\|_2 = \sqrt{\|M\|_2\,\|M^{-1}\|_2} = \sqrt{\kappa(M)},$

and the use of numerically stable orthogonal transformations in each recursion step improves reliability for poorly conditioned systems.
We present here a square-root technique for Gaussian message passing based on square-root Kalman filtering techniques from [38]. The workhorse will be the numerically robust (Householder) QR decomposition⁷ [28]. Eventually, we extend the square-root approach to the computation of EM messages for learning linear SSMs.

3.4.1 Gaussian Square-Root Message Passing

Let the factors of covariance matrices $V \in \mathbb{R}^{d\times d}$ and precision matrices $W \in \mathbb{R}^{d\times d}$ be defined as

$V \triangleq C^{\mathsf T}C, \quad W \triangleq S^{\mathsf T}S.$  (3.5)

A special case is where $C$ and $S$ are upper triangular matrices and correspond to Cholesky factors. Since covariance matrices are by definition symmetric and positive definite, there always exists a unique decomposition into Cholesky factors. When the matrices are merely positive semi-definite, a Cholesky decomposition still exists, but it is not unique. Note that in general the factor messages need not be upper triangular, but they are always equal to the Cholesky factor up to left-multiplication by an orthogonal matrix.

⁶ The numerical value of a symmetric matrix $M$ is affected by asymmetric errors if $M - M^{\mathsf T} \neq 0$.

⁷ The QR decomposition factors any matrix $M \in \mathbb{R}^{n\times m}$ into $M = QR$, the product of an orthogonal matrix $Q \in \mathbb{R}^{n\times n}$ and an upper triangular matrix $R \in \mathbb{R}^{n\times m}$.


Akin to standard Gaussian message passing, scaled normal densities will be parametrized either by the matrix-vector pair $(C, Cm)$ or, equivalently, by the pair $(S, Sm)$. If the factors have full rank, recovering the mean vector $m$ from either $Cm$ or $Sm$ can be done in a computationally efficient and numerically stable fashion (e.g., by back substitution) [28].
To take full advantage of the superior numerical properties of square-root matrices, it is necessary to avoid squaring of factors and to use rotation-based algorithms such as the QR decomposition [28]. Let us show this by means of the following example.

Example 3.3
Consider the update of the precision factor $\overrightarrow{S}_Z$ and the corresponding vector $\overrightarrow{S}_Z\overrightarrow{m}_Z$ for the case of an "="-factor, and look at the QR decomposition of the matrix

$\begin{pmatrix} \overrightarrow{S}_X & \overrightarrow{S}_X\overrightarrow{m}_X \\ \overrightarrow{S}_Y & \overrightarrow{S}_Y\overrightarrow{m}_Y \end{pmatrix} = Q \begin{pmatrix} \overrightarrow{S}_Z & \overrightarrow{S}_Z\overrightarrow{m}_Z \\ 0 & \times \end{pmatrix},$  (3.6)

where $\times$ stands for an arbitrary vector. Multiplying both sides of (3.6) with their respective transposes yields

$\begin{pmatrix} \overrightarrow{S}_X^{\mathsf T}\overrightarrow{S}_X + \overrightarrow{S}_Y^{\mathsf T}\overrightarrow{S}_Y & \overrightarrow{S}_X^{\mathsf T}\overrightarrow{S}_X\overrightarrow{m}_X + \overrightarrow{S}_Y^{\mathsf T}\overrightarrow{S}_Y\overrightarrow{m}_Y \\ \times & \times \end{pmatrix} = \begin{pmatrix} \overrightarrow{S}_Z^{\mathsf T}\overrightarrow{S}_Z & \overrightarrow{S}_Z^{\mathsf T}\overrightarrow{S}_Z\overrightarrow{m}_Z \\ \times & \times \end{pmatrix}.$  (3.7)

We recognize with (3.5) that the first entry on the left-hand side corresponds to the well-known precision matrix update. This implies that $\overrightarrow{S}_Z^{\mathsf T}\overrightarrow{S}_Z = \overrightarrow{W}_Z$, and $\overrightarrow{S}_Z$ is the Cholesky factor of the updated precision matrix. An analogous conclusion can be drawn for $\overrightarrow{S}_Z\overrightarrow{m}_Z$.
In the following Tables 3.1 to 3.3, the orthogonal matrix $Q$ in the update equations is decomposed into two $2d \times d$ blocks $Q_1$ and $Q_2$ and applied from the left to the equations. Two obvious conclusions can be drawn. First, the updates for the factor $\overrightarrow{S}$ and the vector $\overrightarrow{S}\overrightarrow{m}$ may be split into two distinct computations. The update rule corresponding to (3.6) is then

$\begin{pmatrix} \overrightarrow{S}_Z \\ 0 \end{pmatrix} = \begin{pmatrix} Q_1 & Q_2 \end{pmatrix}^{\mathsf T} \begin{pmatrix} \overrightarrow{S}_X \\ \overrightarrow{S}_Y \end{pmatrix},$

$\overrightarrow{S}_Z\overrightarrow{m}_Z = Q_1^{\mathsf T} \begin{pmatrix} \overrightarrow{S}_X\overrightarrow{m}_X \\ \overrightarrow{S}_Y\overrightarrow{m}_Y \end{pmatrix}.$

Second, computational overhead can be reduced, as the orthogonal matrix $Q_1$ remains constant in steady-state scenarios and thus does not have to be recalculated for every input vector. Given that QR decompositions are expensive matrix operations, the savings can be substantial.
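The update of this example is easily checked numerically. The following NumPy sketch triangularizes the stacked factors with a QR decomposition and compares the result with the squared update $\overrightarrow{W}_Z = \overrightarrow{W}_X + \overrightarrow{W}_Y$; the upper-triangular test factors are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3
SX = np.triu(rng.standard_normal((d, d))) + 2 * np.eye(d)  # precision factor of X
SY = np.triu(rng.standard_normal((d, d))) + 2 * np.eye(d)  # precision factor of Y
mX, mY = rng.standard_normal(d), rng.standard_normal(d)

# Stack as on the left-hand side of (3.6) and triangularize.
stacked = np.vstack([np.column_stack([SX, SX @ mX]),
                     np.column_stack([SY, SY @ mY])])
R = np.linalg.qr(stacked, mode='r')
SZ, SZmZ = R[:d, :d], R[:d, d]

# Compare with the squared updates W_Z = W_X + W_Y etc.
print(np.allclose(SZ.T @ SZ, SX.T @ SX + SY.T @ SY))                   # True
print(np.allclose(SZ.T @ SZmZ, SX.T @ SX @ mX + SY.T @ SY @ mY))       # True
```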

The mean can be propagated in standard form or in a novel adapted form. As it happens, all updates performed on the standard mean vectors in the square-root updates below involve only "square-root" quantities; hence, numerical accuracy is not an issue when using the standard mean vector alongside factor messages for the covariance and precision matrices. In our proposed method, we use the mean $m$ together with the covariance factor $C$, and the vector $Sm$ in conjunction with the propagation of the precision factor $S$, which has the appealing property that it does not change for factor updates.
In Tables 3.1 to 3.3 we provide the square-root message updates commonly used in Gaussian message passing for SSMs. The updates are expressed using block matrices. Matrices used to transform the messages and obtain the output message are denoted by $Q$. As noted above, these matrices and/or the respective output messages shall be computed with numerically stable algorithms (e.g., QR decomposition or Householder transformations) to benefit from the improved numerical properties of square-root algorithms. If not stated otherwise, the factor messages are $d \times d$ matrices and the vectors have dimension $d$. All expressions can be proven analogously to (3.6) and (3.7), i.e., by "squaring" both sides of the relations and recalling the corresponding Gaussian message-passing rule.
Note that in (I.3)–(I.5) the matrix $A$ need not be triangular. Consequently, the resulting factor $\overrightarrow{C}_Y$ will in general not be upper triangular either.
A practical smoothing algorithm with square-root message passing that combines updates from Tables 3.1 to 3.3 is shown in Appendix A.1 on page 127.


Node: marginal density $X \sim \mathcal{N}(m, V)$. Assume a proper (i.e., invertible $V$) density:

$\overrightarrow{C}_X^{\mathsf T}\overrightarrow{C}_X = V, \quad \overrightarrow{m}_X = m,$  (I.1)
$\overrightarrow{S}_X^{\mathsf T}\overrightarrow{S}_X = V^{-1}, \quad \overrightarrow{S}_X\overrightarrow{m}_X = V^{-1/2}m.$  (I.2)

Node: matrix multiplication $Y = AX$, for arbitrary $A \in \mathbb{R}^{n\times m}$. Forward:

$\overrightarrow{C}_Y = \overrightarrow{C}_X A^{\mathsf T},$  (I.3)
$\overrightarrow{m}_Y = A\overrightarrow{m}_X.$  (I.4)

Backward:

$\overleftarrow{S}_X = \overleftarrow{S}_Y A,$  (I.5)
$\overleftarrow{S}_X\overleftarrow{m}_X = \overleftarrow{S}_Y\overleftarrow{m}_Y.$  (I.6)

Node: "="-factor with incoming $X$, $Y$ and outgoing $Z$:

$\begin{pmatrix} \overrightarrow{S}_Z \\ 0 \end{pmatrix} = \begin{pmatrix} Q_1 & Q_2 \end{pmatrix}^{\mathsf T} \begin{pmatrix} \overrightarrow{S}_X \\ \overrightarrow{S}_Y \end{pmatrix},$  (I.7)

$\overrightarrow{S}_Z\overrightarrow{m}_Z = Q_1^{\mathsf T} \begin{pmatrix} \overrightarrow{S}_X\overrightarrow{m}_X \\ \overrightarrow{S}_Y\overrightarrow{m}_Y \end{pmatrix},$  (I.8)

with $Q_1 \in \mathbb{R}^{2d\times d}$.

Node: "+"-factor with incoming $X$, $Y$ and outgoing $Z$:

$\begin{pmatrix} \overrightarrow{C}_Z \\ 0 \end{pmatrix} = \begin{pmatrix} Q_1 & Q_2 \end{pmatrix}^{\mathsf T} \begin{pmatrix} \overrightarrow{C}_X \\ \overrightarrow{C}_Y \end{pmatrix},$  (I.9)

$\overrightarrow{m}_Z = \overrightarrow{m}_X + \overrightarrow{m}_Y.$  (I.10)

For $\overleftarrow{C}_X$, replace $\overrightarrow{C}_X$ by $\overleftarrow{C}_Z$ in (I.9), and in (I.10) reverse the sign of $\overrightarrow{m}_Y$ and replace $\overrightarrow{m}_X$ by $\overleftarrow{m}_Z$.

Table 3.1: Standard message updates in square-root form.


Node: composite "="-factor with observation branch $Y$ (incoming $X$, outgoing $Z$). If $I + \overrightarrow{C}_X\overleftarrow{S}_Y^{\mathsf T}$ is nonsingular:

$\begin{pmatrix} R & G \\ 0 & \overrightarrow{C}_Z \end{pmatrix} = \begin{pmatrix} Q_1 & Q_2 \end{pmatrix}^{\mathsf T} \begin{pmatrix} I & 0 \\ \overrightarrow{C}_X\overleftarrow{S}_Y^{\mathsf T} & \overrightarrow{C}_X \end{pmatrix},$  (II.1)

$\overrightarrow{m}_Z = \overrightarrow{m}_X + G^{\mathsf T}R^{-1}\left(\overleftarrow{S}_Y\overleftarrow{m}_Y - \overleftarrow{S}_Y\overrightarrow{m}_X\right),$  (II.2)

where $R$ is upper triangular.

Node: "+"-factor with incoming $X$, $Y$ and outgoing $Z$, in terms of precision factors:

$\begin{pmatrix} \times & \times \\ 0 & \overrightarrow{S}_Z \end{pmatrix} = \begin{pmatrix} Q_1 & Q_2 \end{pmatrix}^{\mathsf T} \begin{pmatrix} \overrightarrow{S}_X & 0 \\ \overrightarrow{S}_Y & \overrightarrow{S}_Y \end{pmatrix},$  (II.3)

$\overrightarrow{S}_Z\overrightarrow{m}_Z = Q_1^{\mathsf T} \begin{pmatrix} \overrightarrow{S}_X\overrightarrow{m}_X \\ \overrightarrow{S}_Y\overrightarrow{m}_Y \end{pmatrix},$  (II.4)

with $Q_1 \in \mathbb{R}^{2d\times d}$. For $\overleftarrow{S}_X$, replace $\overrightarrow{S}_X$ by $\overleftarrow{S}_Z$ in (II.3), and in (II.4) add a minus sign in front of $\overrightarrow{S}_Y\overrightarrow{m}_Y$ and replace $\overrightarrow{S}_X\overrightarrow{m}_X$ by $\overleftarrow{S}_Z\overleftarrow{m}_Z$.

Table 3.2: Additional square-root message-passing updates.


Node: composite "="-block with matrix $A$ (incoming $X$, branch $Y$, outgoing $Z$):

$\begin{pmatrix} R & G \\ 0 & \overrightarrow{C}_Z \end{pmatrix} = Q^{\mathsf T} \begin{pmatrix} \overrightarrow{C}_Y & 0 \\ \overrightarrow{C}_X A^{\mathsf T} & \overrightarrow{C}_X \end{pmatrix}.$  (III.1)

If $\overrightarrow{C}_Y + \overrightarrow{C}_X A^{\mathsf T}$ is nonsingular:

$\overrightarrow{m}_Z = \overrightarrow{m}_X + G^{\mathsf T}R^{-\mathsf T}\left(\overrightarrow{m}_Y - A^{\mathsf T}\overrightarrow{m}_X\right).$  (III.2)

For $\overleftarrow{C}_X$, replace $\overrightarrow{C}_X$ by $\overleftarrow{C}_Z$ in (III.1) and $\overrightarrow{m}_X$ by $\overleftarrow{m}_Z$ in (III.2).

Node: composite "+"-block with matrix $A$. Suppose $A \in \mathbb{R}^{n\times m}$; then

$\begin{pmatrix} R & G \\ 0 & \overrightarrow{S}_Z \end{pmatrix} = \begin{pmatrix} Q_1 & Q_2 \end{pmatrix}^{\mathsf T} \begin{pmatrix} \overrightarrow{S}_Y & 0 \\ \overrightarrow{S}_X A & \overrightarrow{S}_X \end{pmatrix},$  (III.3)

with $Q_2 \in \mathbb{R}^{(n+m)\times n}$, and

$\overrightarrow{S}_Z\overrightarrow{m}_Z = Q_2^{\mathsf T} \begin{pmatrix} \overrightarrow{S}_X\overrightarrow{m}_X \\ \overrightarrow{S}_Y\overrightarrow{m}_Y \end{pmatrix},$  (III.4)

or, if $\overrightarrow{S}_Y + \overrightarrow{S}_X A$ is nonsingular:

$\overrightarrow{S}_Z\overrightarrow{m}_Z = \overrightarrow{S}_X\overrightarrow{m}_X + G^{\mathsf T}R^{-\mathsf T}\left(\overrightarrow{m}_Y - A^{\mathsf T}\overrightarrow{m}_X\right).$  (III.5)

For $\overleftarrow{S}_X$, replace $\overrightarrow{S}_X$ by $\overleftarrow{S}_Z$ in (III.3) and, in addition to reversing the sign of $\overrightarrow{m}_Y$, replace $\overrightarrow{S}_X\overrightarrow{m}_X$ by $\overleftarrow{S}_Z\overleftarrow{m}_Z$ in (III.4), or $\overrightarrow{m}_X$ by $\overleftarrow{m}_Z$ in (III.5).

Table 3.3: Square-root message-passing updates for composite blocks.


3.4.2 Computational Optimization

In this part we present ways to lower the computational complexity of square-root message passing. In all updates with $d$-dimensional messages, the matrix $Q$ can be computed in $4d^3 + O(d^2)$ flops⁸ from the QR decomposition of the block matrices by employing a Householder QR decomposition⁹. When using the alternative update (3.6) (e.g., when the update with the respective factors is performed just once), the orthogonal matrix $Q$ is not needed, and computational complexity may be saved by employing a Householder triangularization with complexity $\frac{4}{3}d^3 + O(d^2)$ flops [28].
Another improvement over the straightforward application of the square-root message update rules is to lower the number of required QR decompositions by propagating non-upper-triangular matrices. It is easily seen that the ideas behind the square-root technique also apply when the factors are not upper triangular (see Tables 3.1 to 3.3).
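In numerical libraries, the triangular factor can usually be requested without accumulating $Q$, which corresponds to the Householder triangularization mentioned above. A small NumPy illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((40, 20))     # stacked block matrix (stand-in)

R = np.linalg.qr(M, mode='r')         # triangular factor only, Q not formed
Q, R_full = np.linalg.qr(M)           # full (reduced) decomposition
print(np.allclose(np.abs(R), np.abs(R_full)))  # identical up to row signs
```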

3.4.3 Expectation-Maximization Updates

We first note that many EM algorithms to estimate, e.g., the SSM parameters $A$, $b$, and $c$ may be expressed in terms of Gaussian messages [16]. Hence, the idea of square-root factors can be extended to EM algorithms; this idea is pursued in the following section.
Two steps, viz. the E-step and the M-step, are performed in EM algorithms. In the M-step, factors of the EM messages are propagated through "=" nodes by (I.7) and (I.8). In the E-step, marginal Gaussian probability densities in an SSM must be computed first. To this end, a square-root message-passing based smoothing algorithm, e.g. Algorithm A.1, can be employed to yield the marginals. Second, squaring of the marginal factors must also be avoided in the calculation of (2.2) to carry the favorable numerical properties over to the EM algorithm. We illustrate this concept with the computation of the square-root EM message for $A$ in autoregressive form.

⁸ One flop corresponds to one of the following floating-point operations: multiplication, division, addition, or subtraction.

⁹ The modified Gram–Schmidt-based QR decomposition can also be used to obtain the upper-triangular factor in the QR decomposition. Even though it is more efficient in terms of flops, its numerical stability guarantees are weaker than those of the Householder QR [28]. It is therefore not well suited for the problem at hand.


[Figure 3.3: Factor graph illustrating square-root message passing and EM algorithms in Example 3.4. The matrix $A$ is in AR form and the input noise is one-dimensional with $e_1 = [1, 0, \ldots, 0]^{\mathsf T}$.]

Example 3.4
Consider the computation of the EM messages for $a$ in Figure 3.3 with square-root messages, which is essential for EM-based estimation of the AR-form matrix $A$. We assume that $\overrightarrow{C}_X$ and $\overleftarrow{S}_X$ are given. The goal is then to compute $\overleftarrow{S}_a$ and $\overleftarrow{W}_a\overleftarrow{m}_a$: the factor and the mean of the EM message for $a$.
Given the square-root factor of the marginal probability over $X$ and $Y$, observe that the precision of the EM message with standard Gaussian marginals [16, Equations (III.7)] can be expressed as

$\overleftarrow{W}_a = \sigma^{-2}\left(V_X + m_X m_X^{\mathsf T}\right) = \sigma^{-2}\begin{pmatrix} C_X \\ m_X^{\mathsf T} \end{pmatrix}^{\mathsf T}\begin{pmatrix} C_X \\ m_X^{\mathsf T} \end{pmatrix},$

where $V_X$ and $m_X$ are the covariance matrix and the mean of the variable $X$ in Figure 3.3. The square-root EM factor $\overleftarrow{S}_{a_k}$ is thus obtained in upper-triangular form:

$\begin{pmatrix} \overleftarrow{S}_{a_k} \\ 0 \end{pmatrix} = \begin{pmatrix} Q_1 & Q_2 \end{pmatrix}^{\mathsf T}\begin{pmatrix} C_{X_k} \\ m_{X_k}^{\mathsf T} \end{pmatrix}.$

Computations can be saved by propagating non-upper-triangular factors, i.e., by simply setting


$\overleftarrow{S}_{a_k} = \begin{pmatrix} C_{X_k} \\ m_{X_k}^{\mathsf T} \end{pmatrix}.$

The square-root factor of the marginal probability over

$Z \triangleq \begin{pmatrix} X \\ Y \end{pmatrix}$

can be computed by (II.1) using the forward message

$\begin{pmatrix} \overrightarrow{C}_Z \\ 0 \end{pmatrix} = \begin{pmatrix} Q_1 & Q_2 \end{pmatrix}^{\mathsf T}\begin{pmatrix} \overrightarrow{C}_X\,(I\;\;A^{\mathsf T}) \\ (0\;\;\sigma e_1^{\mathsf T}) \end{pmatrix}$

with $e_1 = [1, 0, \ldots, 0]^{\mathsf T}$, and the backward message

$\overleftarrow{S}_Z = \begin{pmatrix} 0 & \overleftarrow{S}_Y \end{pmatrix}.$

The resulting square-root factor has the form

$C_Z = \begin{pmatrix} C_X & \Gamma_{XY} \\ 0 & R \end{pmatrix},$  (3.8)

with $R$ upper-triangular. By squaring $C_Z$ it becomes evident that the top left block of the matrix in (3.8) is the desired factor of the marginal probability over $X$, and that

$V_{XY} = C_X^{\mathsf T}\Gamma_{XY}.$  (3.9)

The marginal mean follows from (II.2) analogously to the computation of the factor $C_Z$, so

$\begin{pmatrix} m_X \\ m_Y \end{pmatrix} = \begin{pmatrix} I \\ A \end{pmatrix}\overrightarrow{m}_X + G^{\mathsf T}R^{-1}\left(\overleftarrow{S}_Y\overleftarrow{m}_Y - \overleftarrow{S}_Y A\overrightarrow{m}_X\right)$  (3.10)

with the matrices $G$ and $R$ given as byproducts of the computation of $C_Z$ in (II.1).

From [16, Equations (III.8)], the EM-message vector $\overleftarrow{W}_{a_k}\overleftarrow{m}_{a_k}$ is

$\overleftarrow{W}_{a_k}\overleftarrow{m}_{a_k} = \sigma^{-2}\left(C_X^{\mathsf T}\Gamma_{XY} + m_X m_Y^{\mathsf T}\right)e_1,$


[Figure 3.4: Factor graph of the prior $\mathcal{N}(m, V)$ and two steps of an RLS problem, with observation factors $c_k$ and noise factors $\mathcal{N}(y_k, \sigma^2)$ at each step.]

where $m_X$ and $m_Y$ are given by (3.10). In practical implementations, computations are performed from right to left, and

$\overrightarrow{V}_{X'_{k-1}}^{-1} = \overrightarrow{C}_{X'_{k-1}}^{-1}\overrightarrow{C}_{X'_{k-1}}^{-\mathsf T}$

is solved by two applications of back substitution [28].

3.4.4 Experimental Results

We apply the proposed square-root message-passing rules to a poorly conditioned recursive least-squares (RLS) problem and show the benefits in terms of numerical accuracy by comparison with standard RLS (message passing) and two versions of the full least-squares problem.
Consider an unknown random variable $X \in \mathbb{R}^d$ with prior probability $\mathcal{N}(m, V)$ and noisy projections $y_k = c_k X + Z_k$, $Z_k \sim \mathcal{N}(0, \sigma^2)$, for $k = 1, \ldots, N$, as depicted in the factor graph in Figure 3.4.
Using standard Gaussian message passing, the least-squares estimate of $X$ can be computed for all $k = 1, \ldots, N$ by means of the forward recursion

$\overrightarrow{W}_X \leftarrow \overrightarrow{W}_X + \frac{1}{\sigma^2}c_k^{\mathsf T}c_k,$  (3.11)

$\overrightarrow{W}_X\overrightarrow{m}_X \leftarrow \overrightarrow{W}_X\overrightarrow{m}_X + \frac{y_k}{\sigma^2}c_k^{\mathsf T},$  (3.12)

where $\overrightarrow{W}_X$ and $\overrightarrow{W}_X\overrightarrow{m}_X$ are initialized to $V^{-1}$ and $V^{-1}m$, respectively.


[Figure 3.5: Relative estimation error (NMSE) of various RLS implementations ($\overrightarrow{S}$ and $\overrightarrow{S}\overrightarrow{m}$; $\overrightarrow{W}$ and $\overrightarrow{W}\overrightarrow{m}$; direct QR solver; direct matrix inversion) as a function of the signal length $N$. Results were generated by Monte-Carlo simulation with RLS of order $n = 20$ and $\sigma^2 = 10^{-6}$. The observation vectors $c$ were randomly generated and, in a second step, the resulting random RLS problem's condition number was set to $10^5$.]

The final Gaussian estimate is computed as the solution of

$\overrightarrow{W}_X\,\hat{x}_{\mathrm{GMP}} = \overrightarrow{W}_X\overrightarrow{m}_X.$  (3.13)

Square-root Gaussian message passing can be implemented with an analogous forward recursion: according to (I.2) and (I.7), the message updates at the $k$th equality factor can be expressed as

$\begin{pmatrix} \overrightarrow{S}_X & \overrightarrow{S}_X\overrightarrow{m}_X \\ 0 & \times \end{pmatrix} \leftarrow Q^{\mathsf T}\underbrace{\begin{pmatrix} \overrightarrow{S}_X & \overrightarrow{S}_X\overrightarrow{m}_X \\ \frac{1}{\sigma}c_k & \frac{y_k}{\sigma} \end{pmatrix}}_{S},$  (3.14)


with the initial messages

$\overrightarrow{S}_X \leftarrow \mathrm{chol}(V^{-1}),$  (3.15)
$\overrightarrow{S}_X\overrightarrow{m}_X \leftarrow \overrightarrow{S}_X m.$  (3.16)

The final estimate $\hat{x}_{\mathrm{SQ}}$ is given analogously to (3.13) by

$\overrightarrow{S}_X\,\hat{x}_{\mathrm{SQ}} = \overrightarrow{S}_X\overrightarrow{m}_X.$  (3.17)

To demonstrate the effectiveness of square-root message passing in addressing numerical issues, we implemented the computation of $\hat{x}_{\mathrm{GMP}}$ and $\hat{x}_{\mathrm{SQ}}$ with double-precision floating-point arithmetic. Both recursions exhibit an $O(d^2)$ computational complexity per iteration. In (3.14) this complexity is achieved by utilizing Householder triangularizations and taking advantage of the block structure of $S$. To compute the solution of the linear system of equations in (3.13), a Cholesky-based solver is employed, while (3.17) is solved by back substitution.
In addition, we compare both methods with the numerical values obtained by solving the full least-squares system with

$C \triangleq [c_1^{\mathsf T}, \ldots, c_N^{\mathsf T}]^{\mathsf T}$  (3.18)

and all $y_1, \ldots, y_N$ stacked in a vector $y$. Two versions of the least-squares method were used to match the respective message-passing algorithms: a squared approach with

$\left(C^{\mathsf T}C + \sigma^2 I\right)\hat{x}_{\mathrm{INV}} = C^{\mathsf T}y,$

involving the solution of a $d \times d$ linear system of equations, and one with the QR decomposition applied directly to $C$.
We perform a Monte-Carlo simulation with 500 simulation runs, random $X \in \mathbb{R}^{20}$, and a poorly conditioned matrix $C$. To obtain such a random matrix $C$, iid random Gaussian vectors are drawn and stacked as in (3.18); in a second step, all singular values of this matrix are scaled such that the ratio between the maximum value and the minimum value, i.e., the condition number, is $10^5$. The NMSE of the estimates of the different algorithms with respect to the true value is plotted for various RLS lengths $N$ in Figure 3.5.
In Figure 3.5, note that both approaches based on the "squared" matrices, i.e., the estimates $\hat{x}_{\mathrm{GMP}}$ and $\hat{x}_{\mathrm{INV}}$, exhibit an error floor and are affected by a numerical error on the order of $10^{-1}$. On the other hand, the


proposed square-root message-passing algorithm and its counterpart, the QR-based least-squares solver, have virtually the same numerical accuracy (the two curves lie on top of each other in Figure 3.5) and no error floor in the evaluated range.
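The experiment is straightforward to reproduce. The following NumPy sketch is a simplified single-run variant of the above Monte-Carlo simulation (with an assumed broad prior $\mathcal{N}(0, v_0 I)$ rather than the general prior $\mathcal{N}(m, V)$); it runs the squared recursion (3.11)–(3.12) and the square-root recursion (3.14) side by side and compares both with a directly computed QR-based reference solution.

```python
import numpy as np

rng = np.random.default_rng(3)
d, N, sigma2 = 20, 500, 1e-6
C = rng.standard_normal((N, d))
U, s, Vt = np.linalg.svd(C, full_matrices=False)
C = U @ np.diag(np.geomspace(1.0, 1e-5, d)) @ Vt      # condition number 1e5
x_true = rng.standard_normal(d)
y = C @ x_true + np.sqrt(sigma2) * rng.standard_normal(N)

v0 = 1e6                                       # assumed broad prior N(0, v0*I)
W, Wm = np.eye(d) / v0, np.zeros(d)            # squared recursion state
S, Sm = np.eye(d) / np.sqrt(v0), np.zeros(d)   # square-root recursion state
for ck, yk in zip(C, y):
    W += np.outer(ck, ck) / sigma2             # (3.11)
    Wm += yk * ck / sigma2                     # (3.12)
    stacked = np.vstack([np.column_stack([S, Sm]),
                         np.append(ck, yk) / np.sqrt(sigma2)])
    R = np.linalg.qr(stacked, mode='r')        # (3.14)
    S, Sm = R[:d, :d], R[:d, d]

x_gmp = np.linalg.solve(W, Wm)                 # (3.13)
x_sq = np.linalg.solve(S, Sm)                  # (3.17)

# High-accuracy reference: QR/lstsq on the stacked least-squares system.
A_aug = np.vstack([C / np.sqrt(sigma2), np.eye(d) / np.sqrt(v0)])
b_aug = np.append(y / np.sqrt(sigma2), np.zeros(d))
x_ref = np.linalg.lstsq(A_aug, b_aug, rcond=None)[0]
print(np.linalg.norm(x_gmp - x_ref))   # error floor of the squared form
print(np.linalg.norm(x_sq - x_ref))    # typically several orders smaller
```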

3.5 Conclusion

We have proposed methods that improve the numerical stability of Gaussian message passing in model-based signal processing. The methods differ primarily in their scope of application, the most specific one being decimation of SSMs, which is applicable to strongly bandlimited SSMs when results at lower time resolution are acceptable (cf. the model-based filter in Chapter 6). When the parametrization of the SSM is unrestricted, e.g., in input estimation and smoothing, balanced state-space realizations may be applied. However, in many system-identification tasks, numerical stability is an issue but reparametrization is not a viable option, since the state-space realization is fixed (cf. the results in Section 6.5). One approach to improve numerical stability in these cases, and in a wide range of applications, is to tackle the problem at its origin: the representation of the Gaussian pdfs themselves. In square-root message passing, numerical accuracy is directly improved by propagating square-root decompositions of the matrices. We have tabulated important computation rules for the square-root message representation and have outlined novel algorithms for EM methods that are representable as Gaussian message-passing algorithms. Another pertinent Gaussian message representation that improves the numerical robustness of inference methods for a large class of SSMs is presented in Chapter 4.


Chapter 4

Gaussian Message Passing with Dual Precisions

Matrix inversions, especially for large matrices, are undesirable for reasons of computational complexity and their potentially poor numerical properties (cf. Section 3.1). There exist Kalman filtering techniques, i.e., the Modified Bryson–Frazier (MBF) smoother [7], and Gaussian message-passing schemes [47, Algorithm E] that avoid matrix inversions when computing the marginal probabilities of the SSM's state variables. In this chapter, we complement and formalize the latter approaches and present a pertinent Gaussian message-passing variant that in many cases yields matrix-inversion-free smoothing-type algorithms. Eventually, we derive a novel, efficient, and matrix-inversion-free algorithm to compute marginal probabilities for the inputs of SSMs with uncorrelated observation noise.
In the beginning, we derive message-passing rules for factor nodes commonly used in linear Gaussian factor graphs (Section 4.1 and Section 4.2). Using these expressions, we derive efficient versions of relevant smoothing algorithms that utilize Gaussian message passing (Section 4.3).

4.1 Dual-Precision

In [47] the authors present a forward-backward message-passing scheme for SSMs that does not require matrix inversions (Algorithm E). It performs a forward (in time) recursion using Gaussian messages represented


as $\overrightarrow{V}$ and $\overrightarrow{m}$, followed by a backward recursion based on $\overleftarrow{W}$ and $\overleftarrow{W}\overleftarrow{m}$. In a third step, marginal posterior probability densities are computed with the variables $V$, $W$, and $m$ by alternating the message updates

$\tilde{W}_{X_k} = \overleftarrow{W}_{X_k} - \overleftarrow{W}_{X_k} V_{X_k} \overleftarrow{W}_{X_k},$  (4.1)
$\tilde{W}_{X'_{k-1}} = A^{\mathsf T}\tilde{W}_{X_k}A,$
$V_{X_k} = \overrightarrow{V}_{X'_k} - \overrightarrow{V}_{X'_k}\tilde{W}_{X'_k}\overrightarrow{V}_{X'_k},$  (4.2)
$m_{X_k} = \overrightarrow{m}_{X'_k} - \overrightarrow{V}_{X'_k}\tilde{W}_{X'_k}\overrightarrow{m}_{X'_k} + V_{X'_k}\overleftarrow{W}_{X'_k}\overleftarrow{m}_{X'_k},$  (4.3)

where $X_k$ and $X'_k$ denote the $k$th state before an "="-factor and after it.
In addition to the auxiliary variable

$\tilde{W}_X \triangleq \left(\overrightarrow{V}_X + \overleftarrow{V}_X\right)^{-1},$

called the dual precision in [47], we also define the new auxiliary variable

$\tilde{W}_X\tilde{\mu}_X \triangleq \tilde{W}_X\left(\overrightarrow{m}_X - \overleftarrow{m}_X\right).$  (4.4)

As is evident from Table 4.1, $\tilde{W}\tilde{\mu}$ mirrors distinctive properties of $\tilde{W}$ (e.g., invariance at "+"-factors).
We can now express (4.3) as follows:

$m_X = V_X\left(\overrightarrow{W}_X\overrightarrow{m}_X + \overleftarrow{W}_X\overleftarrow{m}_X\right)$
$\phantom{m_X} = \left(\overrightarrow{V}_X - \overrightarrow{V}_X\tilde{W}_X\overrightarrow{V}_X\right)\overrightarrow{W}_X\overrightarrow{m}_X + \overrightarrow{V}_X\overrightarrow{W}_X V_X\overleftarrow{W}_X\overleftarrow{m}_X$
$\phantom{m_X} = \overrightarrow{m}_X - \overrightarrow{V}_X\underbrace{\tilde{W}_X(\overrightarrow{m}_X - \overleftarrow{m}_X)}_{\tilde{W}_X\tilde{\mu}_X},$  (4.5)

where $\tilde{W}_X = \overrightarrow{W}_X V_X\overleftarrow{W}_X$ was used in the last step.

The variables $\tilde{W}$ and $\tilde{W}\tilde{\mu}$ exhibit a few remarkable properties. The marginal covariance and the marginal mean can be retrieved by means of (4.2) and (4.5) without matrix inversions or having to solve a linear system of equations. Furthermore, both quantities can be propagated backwards through factor nodes in the same way as $\overleftarrow{W}$ and $\overleftarrow{W}\overleftarrow{m}$, and they are invariant at "+"-factors [47].
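The identities used in this derivation are easily verified numerically. The following NumPy sketch draws a random pair of proper forward and backward messages and checks $\tilde{W}_X = \overrightarrow{W}_X V_X \overleftarrow{W}_X$ together with (4.2) and (4.5).

```python
import numpy as np

rng = np.random.default_rng(4)
d = 4
Af = rng.standard_normal((d, d)); Vf = Af @ Af.T + np.eye(d)  # forward V
Ab = rng.standard_normal((d, d)); Vb = Ab @ Ab.T + np.eye(d)  # backward V
mf, mb = rng.standard_normal(d), rng.standard_normal(d)

Wf, Wb = np.linalg.inv(Vf), np.linalg.inv(Vb)
V = np.linalg.inv(Wf + Wb)                 # marginal covariance
m = V @ (Wf @ mf + Wb @ mb)                # marginal mean
Wt = np.linalg.inv(Vf + Vb)                # dual precision

print(np.allclose(Wt, Wf @ V @ Wb))                  # identity used in (4.5)
print(np.allclose(V, Vf - Vf @ Wt @ Vf))             # (4.2)
print(np.allclose(m, mf - Vf @ (Wt @ (mf - mb))))    # (4.5)
```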


4.2 Message Passing Tables

We recognize that in Algorithm E of [47], the backward recursion is only used in (4.1) and (4.3). All other computations depend on quantities from the forward recursion and on $\overleftarrow{W}$ and $\overleftarrow{W}\overleftarrow{m}$. Hence, an alternative relation to (4.1) and (4.3) is sought, making the backward recursion with $\overleftarrow{W}$ and $\overleftarrow{W}\overleftarrow{m}$ obsolete and yielding significant computational savings. This relation can be found by observing that the marginals are invariant at equality factors (cf. Eq. (II.5) and (II.6) in [47]). Combining (4.2) and (4.1) then results in the relation (IV.7) for $\tilde{W}$; for $\tilde{W}\tilde{\mu}$, an analogous computation results in the expression (IV.8). Note that rule (IV.7) requires no matrix inversions when the variable $Y$ has dimension one, such as, e.g., for one-dimensional observations in SSMs, and it only depends on the messages $\overrightarrow{V}$ and the related variable $G$ computed during the forward recursion. Additional relations are listed in Table 4.1.
In Table 4.1, we synthesize relevant and useful expressions for $\tilde{W}$ and $\tilde{W}\tilde{\mu}$ from [47] and provide the new expression for "="-factors.

4.3 Algorithms Based on $\tilde{W}$ and $\tilde{W}\tilde{\mu}$

We will apply the message-passing rules in Table 4.1 to two common tasks in SSM-based processing: smoothing in SSMs and computation of the expectation step in EM-based estimation of the SSM parameters. Along the way, the benefits of these algorithms will be highlighted.

4.3.1 Smoothing in SSMs

We are now ready to devise a smoothing scheme based on the auxiliary quantities $\tilde{W}$ and $\tilde{W}\tilde{\mu}$. Consider a length-$K$ vector of scalar observations $y_1, \ldots, y_K$ and a time-invariant¹ $n$th-order SSM. A factor graph representing time step $k$ is shown in Figure 4.1. First, a forward pass with $\overrightarrow{V}$ and $\overrightarrow{m}$ is performed, i.e., the covariance Kalman filter. For each message update at a composite equality block, the auxiliary matrices $F$ and $G$ defined in (IV.9) and (IV.10) are utilized to compute the next message

¹ All algorithms extend to time-variant scenarios. This property is only used to keep the notation simple.


Node: matrix multiplication $Y = AX$ with $A \in \mathbb{R}^{n\times m}$.
Backward:

$\tilde{W}_X = A^{\mathsf T}\tilde{W}_Y A$  (IV.1)
$\tilde{W}_X\tilde{\mu}_X = A^{\mathsf T}\tilde{W}_Y\tilde{\mu}_Y$  (IV.2)

Forward:

$\tilde{W}_Y = (A^{\mathsf T})^{\dagger}\tilde{W}_X A^{\dagger}$  (IV.3)
$\tilde{W}_Y\tilde{\mu}_Y = (A^{\mathsf T})^{\dagger}\tilde{W}_X\tilde{\mu}_X$  (IV.4)

where $A^{\dagger}$ denotes the pseudo-inverse of $A$, as $A$ need not have full rank for the relations to hold.

Node: "+"-factor with incoming $X$, $Y$ and outgoing $Z$:

$\tilde{W}_X = \tilde{W}_Y = \tilde{W}_Z$  (IV.5)
$\tilde{W}_X\tilde{\mu}_X = \tilde{W}_Y\tilde{\mu}_Y = \tilde{W}_Z\tilde{\mu}_Z$  (IV.6)

Node: composite "="-factor with branch through $A$ (incoming $X$, branch $Y$, outgoing $Z$):

$\tilde{W}_X = F\tilde{W}_Z F^{\mathsf T} + A^{\mathsf T}GA$  (IV.7)
$\tilde{W}_X\tilde{\mu}_X = F\tilde{W}_Z\tilde{\mu}_Z + A^{\mathsf T}G\left(A\overrightarrow{m}_X - \overleftarrow{m}_Y\right)$  (IV.8)

$F \triangleq I - A^{\mathsf T}GA\overrightarrow{V}_X$  (IV.9)
$G \triangleq \left(A\overrightarrow{V}_X A^{\mathsf T} + \overleftarrow{V}_Y\right)^{-1}$  (IV.10)

Analogous relations to compute $\tilde{W}_Z$ and $\tilde{W}_Z\tilde{\mu}_Z$ are obtained by replacing $\overrightarrow{V}_X$ by $\overleftarrow{V}_Z$ and $\overrightarrow{m}_X$ by $\overleftarrow{m}_Z$.

Table 4.1: Message-passing rules for the Gaussian message representation with $\tilde{W}$ and $\tilde{W}\tilde{\mu}$, extending results in [47].


[Figure 4.1: One time step of a time-invariant SSM with (possibly multi-dimensional) input vectors $U_{k-1} \sim \mathcal{N}(0, V_U)$ and one-dimensional observations $y_k$ with noise $\mathcal{N}(0, \sigma_N^2)$. The parameters of the SSM are the state-transition matrix $A$, the input matrix $B$, and the output vector $c$; the important variables are also denoted.]

and stored as intermediate values along with the incoming messages (i.e., $\overrightarrow{V}$ and $\overrightarrow{m}$). Next, a backward recursion using the relations from Table 4.1 and the (auxiliary) variables stored in the forward pass is performed. Assuming an uninformative final state, $\tilde{W}$ and $\tilde{W}\tilde{\mu}$ are initialized at observation $y_N$ as

$\tilde{W}_{X_N} = \frac{1}{\sigma^2}c^{\mathsf T}c,$
$\tilde{W}_{X_N}\tilde{\mu}_{X_N} = \frac{1}{\sigma^2}c^{\mathsf T}\left(c\overrightarrow{m}_{X_N} - y_N\right).$

Marginal probability densities follow directly from (4.2) and (4.5). If $X_k$ denotes the state variable left of the $k$th observation, then

$V_{X_k} = \overrightarrow{V}_{X_k} - \overrightarrow{V}_{X_k}\tilde{W}_{X_k}\overrightarrow{V}_{X_k},$
$m_{X_k} = \overrightarrow{m}_{X_k} - \overrightarrow{V}_{X_k}\tilde{W}_{X_k}\tilde{\mu}_{X_k}.$


Regarding the marginal of the input $U_k$: by the invariance of $\tilde{W}$ and $\tilde{W}\tilde{\mu}$ at "+"-factors and using (IV.1) and (IV.2), we readily obtain

$V_{U_k} = V_U - V_U B^{\mathsf T}\tilde{W}_{X_k}BV_U,$
$m_{U_k} = m_U - V_U B^{\mathsf T}\tilde{W}_{X_k}\tilde{\mu}_{X_k}.$

Marginals for $Y_k$ can be computed from $V_{X_k}$ and $m_{X_k}$ by multiplication with $c$.

Algorithm 4.1: Dual-Precision Smoothing
Assume an SSM with inputs $u_1, \ldots, u_K$ and observations $y_1, \ldots, y_K$, and denote by $\overrightarrow{\mu}_{\mathrm{init}}$ and $\overleftarrow{\mu}_{\mathrm{init}}$ the initial forward and backward messages. Here $X_k$ denotes the state edge before the "="-factor of observation $y_k$ and $X'_k$ the edge after it.

1) Initialize the forward message: $\overrightarrow{V}_{X_1} \leftarrow V_{\mathrm{init}}$ and $\overrightarrow{m}_{X_1} \leftarrow \overrightarrow{m}_{\mathrm{init}}$.

2) Sweep recursively through $k \in [1, \ldots, K-1]$ and compute the following message updates (the intermediate results $F_k$, $G_k$, and $\overrightarrow{m}_{X_k}$ are only required for smoothing):

i) Update the messages at observation $y_k$:

$G_k \leftarrow \left(c\overrightarrow{V}_{X_k}c^{\mathsf T} + \sigma_N^2\right)^{-1},$
$F_k \leftarrow I - \overrightarrow{V}_{X_k}c^{\mathsf T}G_k c,$
$\overrightarrow{V}_{X'_k} \leftarrow F_k\overrightarrow{V}_{X_k},$
$\overrightarrow{m}_{X'_k} \leftarrow F_k\overrightarrow{m}_{X_k} + G_k\overrightarrow{V}_{X_k}c^{\mathsf T}y_k.$

ii) Perform the update through the $A$-factor and the "+"-factor:

$\overrightarrow{V}_{X_{k+1}} \leftarrow A\overrightarrow{V}_{X'_k}A^{\mathsf T} + BV_UB^{\mathsf T},$
$\overrightarrow{m}_{X_{k+1}} \leftarrow A\overrightarrow{m}_{X'_k}.$

3) If the SSM ends with an open edge after $k = K$, initialize the backward quantities at observation $y_K$ as

$\tilde{W}_{X_K} \leftarrow \frac{1}{\sigma_N^2}c^{\mathsf T}c,$
$\tilde{W}_{X_K}\tilde{\mu}_{X_K} \leftarrow \frac{1}{\sigma_N^2}c^{\mathsf T}\left(c\overrightarrow{m}_{X_K} - y_K\right);$

otherwise compute $\tilde{W}_{X_K}$ and $\tilde{W}_{X_K}\tilde{\mu}_{X_K}$ from the backward messages using (4.1) and (4.4).

4) Perform a backward (in time) sweep through $k \in [K-1, \ldots, 1]$ and compute the following message updates:

i) Pass the messages through the $A$-factor (they are invariant at the "+"-factor):

$\tilde{W}_{X'_k} \leftarrow A^{\mathsf T}\tilde{W}_{X_{k+1}}A,$
$\tilde{W}_{X'_k}\tilde{\mu}_{X'_k} \leftarrow A^{\mathsf T}\tilde{W}_{X_{k+1}}\tilde{\mu}_{X_{k+1}}.$

ii) Use (IV.7) and (IV.9) for the update at the "="-factor:

$\tilde{W}_{X_k} \leftarrow F_k^{\mathsf T}\tilde{W}_{X'_k}F_k + c^{\mathsf T}G_k c,$  (4.6)
$\tilde{W}_{X_k}\tilde{\mu}_{X_k} \leftarrow F_k^{\mathsf T}\tilde{W}_{X'_k}\tilde{\mu}_{X'_k} + c^{\mathsf T}G_k\left(c\overrightarrow{m}_{X_k} - y_k\right).$

5) Compute the posterior probability of $X_k$ for any $k \in [1, K]$ with

$V_{X_k} = \overrightarrow{V}_{X_k} - \overrightarrow{V}_{X_k}\tilde{W}_{X_k}\overrightarrow{V}_{X_k},$
$m_{X_k} = \overrightarrow{m}_{X_k} - \overrightarrow{V}_{X_k}\tilde{W}_{X_k}\tilde{\mu}_{X_k}.$
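A compact reference implementation of Algorithm 4.1 is sketched below, under simplifying assumptions: a zero-mean initial state, a neutral backward initialization $\tilde{W}_{X'_K} = 0$ after the last observation (so that $y_K$ enters through the "="-update), and randomly drawn model matrices. The marginals are checked against a brute-force batch posterior over the initial state and all inputs.

```python
import numpy as np

rng = np.random.default_rng(5)
n, m, K = 3, 1, 40
A = 0.9 * np.linalg.qr(rng.standard_normal((n, n)))[0]  # stable transition
B = rng.standard_normal((n, m))
c = rng.standard_normal(n)
sig2_u, sig2_n = 1.0, 0.1
V0 = np.eye(n)                        # prior covariance of the initial state

# simulate observations
x = rng.standard_normal(n)            # x_1 ~ N(0, V0)
ys = np.empty(K)
for k in range(K):
    ys[k] = c @ x + np.sqrt(sig2_n) * rng.standard_normal()
    x = A @ x + B @ (np.sqrt(sig2_u) * rng.standard_normal(m))

# forward sweep (steps 1-2): store G_k, F_k and the pre-update messages
Vf, mf = V0.copy(), np.zeros(n)
Gs, Fs, ms, Vs = [], [], [], []
for yk in ys:
    G = 1.0 / (c @ Vf @ c + sig2_n)                   # scalar G_k
    F = np.eye(n) - G * np.outer(Vf @ c, c)           # F_k
    Gs.append(G); Fs.append(F); ms.append(mf); Vs.append(Vf)
    m_filt = F @ mf + G * (Vf @ c) * yk               # filtered mean
    V_filt = F @ Vf
    mf, Vf = A @ m_filt, A @ V_filt @ A.T + sig2_u * (B @ B.T)

# backward sweep (steps 3-5) with the dual precision
Wt, Wtmu = np.zeros((n, n)), np.zeros(n)              # neutral init after y_K
m_marg, V_marg = [None] * K, [None] * K
for k in reversed(range(K)):
    G, F, mp, Vp = Gs[k], Fs[k], ms[k], Vs[k]
    Wt = F.T @ Wt @ F + G * np.outer(c, c)            # (4.6)
    Wtmu = F.T @ Wtmu + G * c * (c @ mp - ys[k])
    V_marg[k] = Vp - Vp @ Wt @ Vp                     # (4.2)
    m_marg[k] = mp - Vp @ Wtmu                        # (4.5)
    Wt, Wtmu = A.T @ Wt @ A, A.T @ Wtmu               # (IV.1), (IV.2)

# brute-force check via the joint over z = [x_1, u_1, ..., u_{K-1}]
dim = n + (K - 1) * m
Mk = np.zeros((n, dim)); Mk[:, :n] = np.eye(n)        # maps z -> x_k
H, Ms = np.zeros((K, dim)), []
for k in range(K):
    Ms.append(Mk.copy())
    H[k] = c @ Mk
    if k < K - 1:
        Mk = A @ Mk
        Mk[:, n + k * m: n + (k + 1) * m] += B
P0inv = np.diag(np.r_[np.ones(n), np.full((K - 1) * m, 1 / sig2_u)])
Sig = np.linalg.inv(P0inv + H.T @ H / sig2_n)
z = Sig @ (H.T @ ys) / sig2_n
k = K // 2
print(np.allclose(m_marg[k], Ms[k] @ z))              # True
print(np.allclose(V_marg[k], Ms[k] @ Sig @ Ms[k].T))  # True
```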

4.3.2 Steady-State Smoothing

Let the SSM be time-invariant and observable, and consider the processing of long signals. Then, since the SSM is observable, the covariance matrix $\overrightarrow{V}_{X_k}$ converges for large $k$ to the solution $\overrightarrow{V}_{X_\infty}$ of the DARE [38]

$\overrightarrow{V}_{X_\infty} = A\overrightarrow{V}_{X_\infty}A^{\mathsf T} - A\overrightarrow{V}_{X_\infty}c^{\mathsf T}\left(c\overrightarrow{V}_{X_\infty}c^{\mathsf T} + \sigma^2\right)^{-1}c\overrightarrow{V}_{X_\infty}A^{\mathsf T} + BV_UB^{\mathsf T}.$  (4.7)

The corresponding steady state of the dual precision is stated in the following proposition, which is proven in Appendix B.2 on page 136.

Proposition 4.1: Steady-State Solution for $\tilde{W}_{X_\infty}$
Let the SSM be time-invariant and observable. If the forward message passing is initialized with the steady-state matrix $\overrightarrow{V}_{X_\infty}$, then the auxiliary matrix $\tilde{W}_X$ converges to $\tilde{W}_{X_\infty}$, which can be obtained from the solution of the discrete Lyapunov equation

$A_{\mathrm{lp}}\tilde{W}_{X_\infty}A_{\mathrm{lp}}^{\mathsf T} - \tilde{W}_{X_\infty} + Q_{\mathrm{lp}} = 0$  (4.8)


Per sample for smoothing:

RTS [38]: 4 matrix multiplications, 1 matrix inversion; stores $\overrightarrow{V}_{X_k}$, $\overrightarrow{V}_{X'_k}$, $\overrightarrow{m}_{X'_k}$.
Two-filter smoothing [38]: 6 matrix multiplications, 1 matrix inversion; stores $\overrightarrow{V}_{X_k}$, $\overrightarrow{m}_{X_k}$.
Algorithm E [47]: 10 matrix multiplications, no matrix inversion; stores $\overrightarrow{V}_{X_k}$, $\overrightarrow{m}_{X_k}$, $\overleftarrow{W}_{X'_k}$, $\overleftarrow{W}_{X_k}\overleftarrow{m}_{X_k}$.
$\tilde{W}$ and $\tilde{W}\tilde{\mu}$ (proposed): 4 matrix multiplications, no matrix inversion; stores $F_k$, $G_k$, $\overrightarrow{m}_{X_k}$.

Table 4.2: Comparison of the computational and memory requirements of various Kalman smoothing algorithms and of the proposed method. All figures are stated per time step. Operation counts are restricted to the smoothing step, as all algorithms use the same filtering recursion. Storage refers to all variables that need to be saved between the forward and backward sweeps.

with the variables

$A_{\mathrm{lp}} \triangleq A^{\mathsf T}\left(I - c^{\mathsf T}G_\infty c\overrightarrow{V}_{X_\infty}\right),$
$Q_{\mathrm{lp}} \triangleq A^{\mathsf T}c^{\mathsf T}G_\infty cA,$
$G_\infty \triangleq \left(c\overrightarrow{V}_{X_\infty}c^{\mathsf T} + \sigma_N^2\right)^{-1}.$

Proposition 4.1 provides us with a linear equation² for computing the steady-state $\tilde{W}_{X_\infty}$, and hence all matrix quantities in our smoothing scheme (Algorithm 4.1), offline, which substantially reduces the computational complexity of online processing. In addition to the reduced computational overhead for SSM smoothing, utilizing steady-state quantities also has appealing numerical properties: the solutions of (4.7) and (4.8) can be computed with numerically stable solvers, yielding more accurate results than message-passing iterations.
Departing from our initial assumptions and considering general multi-input multi-output SSMs, we see that the steady-state specialization of

² Since the Lyapunov equation is linear, its solution methods have better numerical properties than those for the DARE.


our smoothing algorithm also delivers the marginal probabilities without any matrix inversions, as the matrix $G$ can be obtained as a byproduct of solving the DARE (4.7).
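With standard solvers, the offline computation amounts to one DARE solve and one discrete Lyapunov solve. The following SciPy sketch (with a randomly drawn model as a stand-in) computes $\overrightarrow{V}_{X_\infty}$ and $\tilde{W}_{X_\infty}$ and verifies that $\tilde{W}_{X_\infty}$ is a fixed point of the backward recursion.

```python
import numpy as np
from scipy.linalg import solve_discrete_are, solve_discrete_lyapunov

rng = np.random.default_rng(6)
n = 3
A = 0.9 * np.linalg.qr(rng.standard_normal((n, n)))[0]  # stable stand-in model
B = rng.standard_normal((n, 1))
c = rng.standard_normal((1, n))
sig2_n, VU = 0.1, np.eye(1)

# steady-state covariance from the DARE (4.7)
V_inf = solve_discrete_are(A.T, c.T, B @ VU @ B.T, np.array([[sig2_n]]))
G_inf = np.linalg.inv(c @ V_inf @ c.T + sig2_n)          # 1x1 matrix

A_lp = A.T @ (np.eye(n) - c.T @ G_inf @ c @ V_inf)
Q_lp = A.T @ c.T @ G_inf @ c @ A
W_inf = solve_discrete_lyapunov(A_lp, Q_lp)              # (4.8)

# fixed-point check: one backward step leaves W_inf unchanged
F = np.eye(n) - V_inf @ c.T @ G_inf @ c
W_eq = F.T @ W_inf @ F + c.T @ G_inf @ c                 # "="-update (4.6)
print(np.allclose(W_inf, A.T @ W_eq @ A))                # True
```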

4.3.3 E-Step in SSMs

Consider the estimation of the SSM parameters (i.e., $A$, $b$, $c$) and of the noise variances with the EM algorithm, with the variables given in the SSM shown in Figure 4.1. From inspection of the E-steps (cf. [16]) we recognize that the marginal quantities $V_{X_k}$ and $m_{X_k}$ for all $k$ are sufficient to compute the update of $c$ and $\sigma_n^2$; similarly, given all $V_{U_k}$ and $m_{U_k}$, the E-step for $b$ and $\sigma_u^2$ is readily computed. All these quantities can be obtained in a computationally efficient manner from the algorithm in Section 4.3.1.
The estimation of $A$ also relies on $V_{X_k}$ and $m_{X_k}$, but additionally requires the computation of the cross-covariance of $X'_{k-1}$ and $X_k$, i.e.,

$V_{X'_{k-1}X_k^{\mathsf T}} \triangleq \mathrm{E}\left[\left(X'_{k-1} - m_{X'_{k-1}}\right)\left(X_k - m_{X_k}\right)^{\mathsf T}\right],$

while, of course, we have

$m_{X'_{k-1}} = m_{X_{k-1}}.$

Again, the desired relation should only depend on variables that are computed during a $\overrightarrow{V}$ and $\overrightarrow{m}$ sweep through the SSM and on the dual precision, and it should avoid matrix inversions. It turns out that $V_{X'_{k-1}X_k^{\mathsf T}}$ may be computed as

$V_{X'_{k-1}X_k^{\mathsf T}} = \overrightarrow{V}_{X'_{k-1}}A^{\mathsf T}\left(I - \tilde{W}_{X_k}\overrightarrow{V}_{X_k}\right).$  (4.9)

The proof of (4.9) is given in Appendix B.2 on page 136.

Remark 4.1
Observe that no assumptions on any matrix were made in the proof of (4.9). We conclude that (4.9) also holds when, e.g., $\overleftarrow{W}_{X_k}$ or $A$ do not have full rank. Deriving (4.9) directly from the relation given in [16, Equation (IV.6)] is cumbersome without assumptions on the ranks of the matrices involved.


4.3.4 Continuous-Time Smoothing

Consider the continuous-time SSM

$\mathrm{d}X(t) = AX(t)\,\mathrm{d}t + bU(t)\,\mathrm{d}t,$
$Y(t) = cX(t) + N(t),$

with observations $y_k = Y(t_k)$ at discrete moments $t_k$ for $k \in [1, K]$. Posterior probabilities of $X(t)$ at any $t_k \le t \le t_{k+1}$ may be computed by relying just on $\overrightarrow{V}$, $\overrightarrow{m}$, $\tilde{W}$, and $\tilde{W}\tilde{\mu}$. Apart from the advantages put forward for smoothing in discrete-time SSMs, this message-passing scheme is particularly attractive for continuous-time SSMs, as closed-form continuous-time input-estimation formulas can be obtained (see the interpolation formula in Section 5.2).
To treat the most general case, i.e., smoothing, forward message passing with $\overrightarrow{V}_{X(t)}$ and $\overrightarrow{m}_{X(t)}$ is done as in [10, (II.1) and (II.2)]. For the backward sweep, use (IV.1) and (IV.2) with $\mathrm{e}^{A^{\mathsf T}\Delta}$ instead of $A$, while the "="-factor updates (IV.7) and (IV.8) do not change. Posterior probabilities of $X(t)$ follow for any $t_k$ as in step 5) of Algorithm 4.1, and for any $t_k \le t \le t_{k+1}$ by first computing $\overrightarrow{V}_{X(t)}$ with [10, (I.2)] and then applying (4.2), whereas for $m_{X(t)}$ the formula

$m_{X(t)} = \mathrm{e}^{A(t-t_k)}\left(m_{X(t_k)} - \sigma_U^2\left(\int_0^{t-t_k}\mathrm{e}^{A\tau}bb^{\mathsf T}\mathrm{e}^{A^{\mathsf T}\tau}\,\mathrm{d}\tau\right)\tilde{W}_{X(t_k)}\tilde{\mu}_{X(t_k)}\right)$  (4.10)

may be used. If $A$ is diagonalizable, the integral term can be expressed in closed form as done in [10, Section IV.B]. Note that (4.10) yields a closed-form expression for $m_{X(t)}$, which might be used as an interpolation formula (cf. Theorem 5.1).
A complete algorithm statement is given in the Appendix in Section A.2. Note also that, in a stationary situation with constant intervals $t_{k+1} - t_k$, the covariance matrices $\overrightarrow{V}_{X(t_k)}$ and $\overleftarrow{V}_{X(t_k)}$ (and thus $\tilde{W}(t_k)$) do not depend on $k$ and may be computed offline, as is usual in Kalman filtering.


4.4 Conclusion

We have presented robust and computationally efficient smoothing-type algorithms for SSMs based on the parametrization of Gaussian densities with dual precisions. While the standard Kalman smoother with dual precisions formalizes the Modified Bryson–Frazier smoother, the fixed-interval input estimation algorithms that derive from dual-precision message passing are novel. We expand further on dual-precision-based input estimation in Chapter 5 and Chapter 8.
We expect that these message-passing algorithms might motivate the development of further dual-precision Gaussian message-passing algorithms beyond Kalman smoothing and input estimation.


Chapter 5

Input Signal Estimation

In this chapter, we report several new theoretical and experimental results on the continuous-time input estimator proposed in [9,10], which we present in Section 5.1. In the first part we show, in particular, that the continuous-time estimate is smooth (i.e., continuous and infinitely often differentiable) between sampling times. In fact, the smoothness is obvious from a new expression for the estimate between sampling times that is attractive also for practical computations. We also state and prove a condition on the SSM that is necessary and sufficient for input estimates that are continuous everywhere. In addition, we give a Wiener-filter version of the present estimator that further illustrates the nature of this estimate. In the experimental part (Section 5.5), we report not only simulation results, but also continuous-time estimation results using the dynamometer measurements presented in more detail in Chapter 6.

5.1 Preliminaries

Let $U(t)$ and $Y(t)$ be the real-valued continuous-time input signal and output signal, respectively, of some sensor. We are given noisy samples

$Y_k \triangleq Y(t_k) + Z_k$  (5.1)

of $Y(t)$ at discrete moments $t_k$, $k \in \mathbb{Z}$ (with $t_k < t_{k+1}$), where the $Z_k$ (the measurement noise) are iid Gaussian random variables that are independent of $U(t)$ and $Y(t)$. From these samples, we wish to estimate $U(t)$.


We will assume that the sensor is given by a finite-dimensional stable linear (continuous-time) SSM with state $X \in \mathbb{R}^d$ evolving according to

$\mathrm{d}X(t) = AX(t)\,\mathrm{d}t + bU(t)\,\mathrm{d}t$  (5.2)

with $A \in \mathbb{R}^{d\times d}$ and $b \in \mathbb{R}^{d\times 1}$, and with

$Y(t) = cX(t)$  (5.3)

with $c \in \mathbb{R}^{1\times d}$. A number of generalizations of this setting will be indicated at the end of Section 5.3.
Problems of this sort are usually addressed by beginning with additional assumptions on $U(t)$. However, in many practical applications the actual sensor input signal is, strictly speaking, neither bandlimited nor sparse: better sensors might reveal ever more details in the signal. Nonetheless, we need to cope with the given sensor as well as we can.
A new approach to such estimation problems was proposed (among other things) by Bolliger et al. in [10]. In this approach, $U(t)$ is modeled as white Gaussian noise (WGN), not because the unknown true input signal is expected to resemble WGN, but to avoid unwarranted assumptions on its spectrum. It is shown in [10] that modeling $U(t)$ as WGN leads to a practical estimator that is easily computed on the basis of forward-backward Kalman filtering/smoothing.
The definition of the estimate $\hat{u}(t)$ from [10] can be paraphrased as follows. For $\Delta > 0$, let

$U(t, \Delta) \triangleq \frac{1}{\Delta}\int_{t-\Delta}^{t} U(\tau)\,\mathrm{d}\tau.$  (5.4)

If $U(t)$ is a continuous signal, then $\lim_{\Delta\to 0} U(t,\Delta) = U(t)$. Assume now that $U(t)$ is white Gaussian noise. Then, for fixed $t$, $U(t,\Delta)$ is a well-defined zero-mean Gaussian random variable with variance $\sigma_U^2/\Delta$ for some constant $\sigma_U^2 > 0$. The MAP/MMSE/LMMSE estimate of $U(t,\Delta)$ from the observations $Y_k = y_k$ for all $k$ is

$\hat{u}(t,\Delta) = \mathrm{E}\left[U(t,\Delta)\,\middle|\,y_1, y_2, \ldots\right],$  (5.5)

and $\hat{u}(t)$ is defined as

$\hat{u}(t) \triangleq \lim_{\Delta\to 0}\hat{u}(t,\Delta).$  (5.6)

The limit can be shown to exist, and the practical computation of $\hat{u}(t)$ will be presented next.


[Figure 5.1: Factor graph for interpolation with $t' = t + \Delta > t$: the message $\overrightarrow{\mu}_{X(t)}$ passes through the matrix factor $\mathrm{e}^{A\Delta}$ and a "+"-factor with noise $\mathcal{N}(0, V_\Delta)$ toward $X(t')$ with incoming message $\overleftarrow{\mu}_{X(t')}$.]

5.2 Computation and Continuity of $\hat{u}(t)$

The estimate (5.6), first given in [10], is paraphrased here using the messages from Chapter 4:

$\hat{u}(t) = \sigma_U^2 b^{\mathsf T}\tilde{W}_{X(t)}\tilde{\mu}_{X(t)}.$  (5.7)

The computation of the posterior probabilities $p\left(x(t)\,\middle|\,y_0, y_1, \ldots\right)$ may be done with Gaussian message passing as in Section 4.3.4 (cf. Algorithm A.2). For numerical computation of the input estimate $\hat{u}(t)$, it may be preferable to use the formula (5.7) only for $t \in \{t_k\}$, and to use the new interpolation formula (5.8) for intermediate moments $t$.

Theorem 5.1: Interpolation
Assume that both $t$ and $t' = t + \Delta$ lie between adjacent sampling times $t_k$ and $t_{k+1}$, i.e., $t_k < t \le t_{k+1}$ and $t_k < t' \le t_{k+1}$. Then

$\hat{u}(t') = \sigma_U^2 b^{\mathsf T}\mathrm{e}^{-A^{\mathsf T}\Delta}\,\tilde{W}_{X(t)}\tilde{\mu}_{X(t)}.$  (5.8)

Note that $\Delta$ may be negative. It is obvious that (5.8) is a smooth function of $\Delta$, which proves that $\hat{u}(t)$ is smooth between sampling times. The proofs of the two theorems in this section use factor graphs and are presented in Appendix B.3 on page 137.

Theorem 5.2: Continuity at sampling times
If $c^{\mathsf T}b = 0$, then $\hat{u}(t)$ as in (5.6) is continuous also for $t \in \{t_k\}$.

Conversely, if $c^{\mathsf T}b \neq 0$, then $\hat{u}(t)$ is generically not continuous for $t \in \{t_k\}$, as is evident from many examples.
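Evaluating (5.8) on a grid of offsets $\Delta$ requires only a matrix exponential once $\tilde{W}_{X(t)}\tilde{\mu}_{X(t)}$ is available. A small sketch follows; all numerical values below are placeholders and not taken from the models of this chapter.

```python
import numpy as np
from scipy.linalg import expm

A = np.array([[0.0, 1.0], [-4.0, -0.4]])   # placeholder 2nd-order sensor model
b = np.array([0.0, 1.0])
sig2_u = 1.0
Wtmu = np.array([0.3, -0.1])               # placeholder for Wtilde*mu at time t

def u_hat(delta):
    """Input estimate at t + delta within the same sampling interval, cf. (5.8)."""
    return sig2_u * b @ expm(-A.T * delta) @ Wtmu

print([u_hat(d) for d in np.linspace(-0.02, 0.02, 5)])  # smooth in delta
```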


5.3 Wiener Filter Perspective

In a stationary situation with $t_k = kT$ for a fixed sampling interval $T > 0$, the MAP/MMSE/LMMSE estimate (5.6) can also be obtained via a version of a Wiener filter [38] as follows. Let $G(\omega)$ be the frequency response of the sensor, i.e., the Fourier transform of the impulse response of the system given by (5.2) and (5.3). Let $\bar{z}$ denote the complex conjugate of $z \in \mathbb{C}$. We assume a slightly more general setting in the next theorem and introduce an input-noise shaping filter $N(\omega)$.

Theorem 5.3
Under the stated assumptions,

$\hat{u}(t) = T\sum_{k=-\infty}^{\infty} Y_k\, h(t - kT),$  (5.9)

where $h(t)$ is given by its Fourier transform

$H(\omega) = \frac{\overline{G(\omega)}}{\displaystyle\sum_{k\in\mathbb{Z}}\left|G\!\left(\omega + k\tfrac{2\pi}{T}\right)\right|^2 + \frac{\sigma_N^2}{\sigma_U^2}}.$  (5.10)

Let $\hat{n}(t)$ be the input estimate, as described above, but with the input process colored, i.e., filtered with $N(\omega)$. The filter $h(t)$ in

$\hat{n}(t) = T\sum_{k=-\infty}^{\infty} Y_k\, h(t - kT)$  (5.11)

is then given by

$H(\omega) = \frac{|N(\omega)|^2\,\overline{G(\omega)}}{\displaystyle\sum_{k\in\mathbb{Z}}\left|N\!\left(\omega + k\tfrac{2\pi}{T}\right)G\!\left(\omega + k\tfrac{2\pi}{T}\right)\right|^2 + \frac{\sigma_N^2}{\sigma_U^2}}.$  (5.12)

Mixed discrete/continuous-time Wiener filters as in (5.9) are not usually covered in textbooks, and we have not yet been able to find filters similar to (5.10) in the literature. In any case, a proof based on the orthogonality principle of LMMSE estimation [38] is given in Appendix B.3 on page 137.
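For a rational $G(\omega)$, the filter (5.10) is straightforward to evaluate numerically by truncating the aliasing sum, since $|G(\omega)|^2$ decays rapidly for stable low-pass sensor models. A sketch with an assumed second-order sensor model:

```python
import numpy as np

T = 1.0                    # sampling interval
nsr = 1e-4                 # noise-to-signal ratio sigma_N^2 / sigma_U^2

def G(w):
    """Assumed 2nd-order low-pass frequency response (an example, not a thesis model)."""
    return 1.0 / (1.0 + 0.2j * w - (w / 4.0) ** 2)

def H(w, n_alias=50):
    """Evaluate (5.10) with the aliasing sum truncated to |k| <= n_alias."""
    k = np.arange(-n_alias, n_alias + 1)
    denom = np.sum(np.abs(G(w + 2 * np.pi * k / T)) ** 2) + nsr
    return np.conj(G(w)) / denom

print([H(wi) for wi in np.linspace(0, np.pi / T, 5)])
```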


Theorem 5.3 can be used as an alternative to message passing in the SSM followed by (5.8) for numerically computing the continuous-time input estimate.
A main point of Theorem 5.3 is that it further illuminates the nature of the estimate (5.6). For example, consider the two amplitude responses $|G(\omega)|$ in Figure 5.2. The dashed lines in this figure show the aliasing term $|G(\omega + \frac{2\pi}{T})|$. As long as such aliased parts of $G(\omega)$ remain substantially below the noise-to-signal ratio $\sigma_N^2/\sigma_U^2$, the aliasing does not materially affect the estimate $\hat{u}(t)$.
It should be noted, however, that the Kalman-filter approach (5.7) is more general than the Wiener filter of Theorem 5.3. In particular, the Kalman-filter approach also works for nonuniform sampling, and it generalizes easily to time-varying systems (e.g., unstable systems under digital control as in [46]) and to mildly nonlinear systems (via local linearisation as in extended Kalman filtering [38]).

T )|. As long as such aliased parts of G(ω) remain sub-stantially below the noise-to-signal ratio σ2

N/σ2U , the aliasing does not

materially affect the estimate u(t).It should be noted, however, that the Kalman-filter approach (5.7) ismore general than the Wiener filter of Theorem 5.3. In particular, theKalman-filter approach works also for nonuniform sampling, and it gen-eralizes easily to time-varying systems (e.g., unstable systems under dig-ital control as in [46]) and to mildly nonlinear systems (via local lineari-sation as in extended Kalman filtering [38]).

5.4 Postfiltering

While the estimate (5.6) is piecewise continuous (and smooth between sampling times), the user of the sensor may sometimes prefer a smoother-looking estimate of $U(t)$. In this case, the estimate (5.6) may simply be passed through a suitable low-pass filter (preferably with linear phase response).
Such postfiltering is similar, in effect, to estimating $U(t)$ under the traditional assumption that it is bandlimited. However, the results of the two approaches are not identical, as is easily seen from (5.10). Moreover, in a Kalman filter setting (as in Section 5.2), the traditional assumption requires an SSM of the noise-shaping filter to be included in the Kalman filter, which increases its complexity; by contrast, postfiltering the estimate (5.7) does not affect the Kalman filtering at all.
It should be noted that postfiltering (5.6) is, in principle, a continuous-time operation. In the following we propose an approach to obtain analytic expressions for the postfiltered estimate and to compute them numerically.


Theorem 5.4: Analytical Computation of Postfiltering
Consider the augmented continuous-time SSM with state $[X^{\mathsf T}, W^{\mathsf T}]^{\mathsf T}$, with additional state $W \in \mathbb{R}^p$ and state-space parameters

$\bar{A} \triangleq \begin{pmatrix} A & 0 \\ 0 & F \end{pmatrix}, \quad \bar{b} \triangleq \begin{pmatrix} b \\ e \end{pmatrix},$  (5.13)

$\bar{c} \triangleq \begin{pmatrix} c & 0 \end{pmatrix},$  (5.14)

with $F \in \mathbb{R}^{p\times p}$ and $e \in \mathbb{R}^{p\times 1}$. Then the signal

$\hat{v}(t) = \begin{pmatrix} 0 & d \end{pmatrix}\begin{pmatrix} X(t) \\ W(t) \end{pmatrix},$

with $d \in \mathbb{R}^{1\times p}$, corresponds to the continuous-time estimate (5.6) with postfiltering by a filter $P(\omega)$ given by

$P(\omega) = d\left(j\omega I - F\right)^{-1}e.$  (5.15)

Theorem 5.4 can be seen from the factor graph representation of continuous-time SSMs as follows: the factor graph of the augmented SSM (5.13) can be split into two graphs that are only connected at the input edge $U(t)$, and the graph representing the postfilter is purely deterministic. Trivially, the deterministic factors in the postfiltering graph are fulfilled for any $U(t)$. Hence, the augmented model yields the same estimate $\hat{u}(t)$, and (5.15) follows from the Fourier transform of an SSM.
The benefit of (5.13) is that analytical expressions for postfiltering with any continuous-time filter¹ of finite degree and with relative degree at least 1 follow easily from (5.7) or (5.8). Furthermore, observe that $\mathrm{e}^{\bar{A}t}$ is a block-diagonal matrix for all $t \in \mathbb{R}$, thus reducing the necessary computational effort.
A second approach computes the input estimate in the reduced (non-augmented) system and uses (5.8) to obtain $\hat{u}(t)$. The postfilter is then expressed as a continuous-time SSM in autoregressive form, and using vectorization operations the postfiltering result can be expressed in closed form between two sampling instants.
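Building the augmented model (5.13)–(5.14) is a matter of stacking block matrices; the sketch below combines a placeholder second-order sensor model with an assumed first-order low-pass postfilter.

```python
import numpy as np
from scipy.linalg import block_diag

A = np.array([[0.0, 1.0], [-4.0, -0.4]])   # placeholder sensor model
b = np.array([0.0, 1.0])
c = np.array([1.0, 0.0])
# assumed 1st-order low-pass postfilter P(w) = d (jw I - F)^{-1} e
F = np.array([[-10.0]]); e = np.array([1.0]); d = np.array([10.0])

A_aug = block_diag(A, F)                   # (5.13), block diagonal
b_aug = np.concatenate([b, e])
c_aug = np.concatenate([c, np.zeros(1)])   # (5.14)
read_out = np.concatenate([np.zeros(2), d])  # v(t) = [0 d] * augmented state
print(A_aug, b_aug, c_aug, read_out, sep="\n")
```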


[Figure 5.2: The amplitude responses (magnitude in dB over the normalized frequency $f/f_s$ with $f_s \triangleq 1/T$) of two sensors of order 2 and 4, respectively. The dashed lines show the aliasing.]

5.5 Experimental Results

Figure 5.2 shows the amplitude responses of two different sensor models, one of order $n = 2$ and the other of order $n = 4$, both of which originate from fitting low-order models to the measured impulse responses of real-world sensors in an industrial setting. The 4th-order model turns out to satisfy $c^{\mathsf T}b = 0$, while $c^{\mathsf T}b \neq 0$ for the 2nd-order model. Figures 5.3–5.5 show simulation results with these models for a high signal-to-noise ratio $\sigma_U^2/\sigma_N^2$. Note that the input signal in these simulations has effective discontinuities, which is not uncommon for real signals (e.g., forces when moving objects collide). Note also that the input signal is nonnegative, of which the estimator is ignorant.
Figure 5.5 shows the input estimate for the 2nd-order model in microscopic detail. It is apparent that the estimated signal is not continuous at the sampling moments, which is due to the fact that $c^{\mathsf T}b \neq 0$ in this case (cf. Theorem 5.2). By contrast, the estimate in Figure 5.3 is continuous everywhere as a consequence of Theorems 5.1 and 5.2.
Figure 5.6 illustrates the use of the 4th-order model with measured real-

¹ These types of filters can be represented by a continuous-time SSM by using the transfer function to obtain the autoregressive-form representation of the SSM.


world data. In this case, the true input signal is not known, but a better (more expensive) reference sensor provides a good guess of it. Moreover, the actual sensor dynamics has slightly changed during operation, while the estimation uses the unchanged 4th-order model. The estimate is smoothed by postfiltering as in Section 5.4. Due to the uncertainties of the situation, it is difficult to assess the quality of the estimate in absolute terms, but it can certainly be concluded that the estimator works very well in practice.
(The estimated signal in Figure 5.6 is sometimes clearly negative. This seems to be due to a bias in the measurement set-up, not to a problem with the estimator.)

5.6 Conclusion

We have presented a number of new theoretical and experimental results on the (continuous-time) input-signal estimator that was proposed in [10]. In particular, we have established continuity (or piecewise continuity if $c^{\mathsf T}b \neq 0$) of the estimate. A key step leading to this new insight has been an interpolation formula for the continuous-time input-signal estimate. To obtain these results, the dual-precision message representation has been leveraged, from which the results follow. We have complemented the Kalman-filter perspective with a Wiener-filter perspective, which may prove useful in the analysis of the proposed input estimator. The practicality of the proposed estimator was confirmed with experimental results, and we have elaborated on postfiltering options and their effects on the estimate.



[Figure: estimated input signal, actual input signal, and observed output signal vs. time in samples; amplitudes in arbitrary units.]

Figure 5.3: Input signal estimation (simulation) for the 4th-order model of Figure 5.2 with σ²_U/σ²_N = −41 dB.


Figure 5.4: Input signal estimation (simulation) for the 2nd-order model of Figure 5.2 with σ²_U/σ²_N = −31 dB.



Figure 5.5: Close-up of Figure 5.4 around a jump of the input signal.


Figure 5.6: Input signal estimation from real-world measured data using the 4th-order model of Figure 5.2 and postfiltering as in Section 5.4.


Chapter 6

Input Estimation for Force Sensors

Unwanted vibrations during machining are a well-known fact and cause a number of problems, ranging from increased tool wear and tool breakage to poor surface finish and inferior product quality. As a consequence, for a wide range of applications it is essential to monitor forces accurately [55]. However, structural sensor modes can distort these measurements. Oftentimes, even the whole sensor-machine mechanical system (see Figure 6.1a) mechanically interferes with the sensor and causes new modes to appear. Especially in high-speed machining processes, such distorted sensor readings limit applications.

A popular method to remove these disturbances is low-pass filtering. On the other hand, this simple method also removes relevant dynamical information when excitation frequencies are high.

In [54, 55] a continuous-time Kalman filter is applied to compensate for unwanted resonating modes in dynamometer measurements. The filter is static (i.e., non-adaptive) and derived through mode matching in the spectrum. A different approach is shown in [8, 27, 50], where first a rational finite-order transfer function is fitted to a sensor's frequency response. Filtering of the measurements utilizes the inverted transfer function and is realized directly in the frequency domain or with IIR filters.

The aforementioned methods derive relatively high-order filters from a frequency-domain view of the filtering problem. We propose a different approach by using a model-based filter of low order to compensate for unwanted dynamics. An automatic model-identification scheme ensures that the filter adapts to new sensor settings; model identification and filtering are derived in the time domain.


[Figure 6.1: (a) Schematic of the mechanical setup (experiments): machine, sliding table, dynamometer, reference dynamometer, workpiece. (b) Multi-mass model: masses m1 and m2, springs k1 and k2, damping elements ξ1 and ξ2.]

6.1 Problem Setup and Modeling

The machining setup under consideration is schematically shown in Figure 6.1a. A few physical phenomena lead to corrupted dynamometer sensor readings. Firstly, the physical sensor itself suffers, independently of the overall machining setup, from resonant modes at certain characteristic frequencies. Additionally, in the present machining setup, the sensor, the workpiece, and components of the machine form a complex coupled mechanical system: the machine-workpiece loop. Mechanical coupling in the machine-workpiece loop causes distorted sensor readings through inertial forces. Both effects severely limit the use of estimation methods solely based on a-priori known sensor transfer characteristics, as in [54, 55], for practical dynamometer filtering tasks, and they necessitate identification and compensation methods that account for vibrations that are not necessarily due to the sensor itself.

We approach this problem from a model-based standpoint; it relies on the hypothesis that a low-order sensor model captures the salient characteristics of the real sensor and thereby allows us to obtain accurate force estimates through model-based signal-processing methods.


Figure 6.1: Schematic of the presented dynamometer signal processing setup.

Our model-based view on the problem is rooted in multi-mass swinger models (see also Appendix F), as shown in Figure 6.1b, where the lighter mass m2 with spring k2 and damping element ξ2 represents the sensor fixed to a heavy machine with mass m1. This simple physical model captures many phenomena observed in experimental measurements of dynamometer sensors (see, e.g., Figure 6.7). First, mechanical coupling will introduce resonating poles far below the sensor's lowest resonance frequency. Secondly, these additional modes form resonance/anti-resonance pairs: a complex-valued pole pair and a close complex-valued zero pair at a higher frequency. Thirdly, since these pole-zero pairs are caused by the machine itself, they might be very susceptible to small changes in the setup, e.g., starting the machining process.

We conclude that the main factor that impairs the force estimates is not noise but model uncertainties (cf. Section 6.5). These distortions will


appear in the low-frequency range and, in combination with the limited bandwidth of actual force signals, will thus be highly relevant for good performance.

Our model-based approach tries to capture the most prominent components of this mechanical system, which is assumed to be linear, by identifying a low-order discrete-time SSM based on identification measurements. Estimation of this SSM is done by means of the EM algorithm. In this case, these measurements are impulse-response measurements, but the presented solution does not rely on this property. Once an SSM is identified, input estimation from sampled sensor measurements yields an estimate of the actual force.

In order to validate the proposed methods, an additional, more accurate sensor reference is employed during experiments. A schematic of the whole setting is shown in Figure 6.1.

6.2 Model-Based Input Estimation

The aim of the presented input estimation method is estimation of the machining forces acting on the workpiece by means of dynamometer measurements. We assume here that the workpiece-machine loop is represented well by a (low-order) SSM. Since sampling rates are much higher than the frequency range of interest (i.e., the dynamometer resonance frequencies and the spectrum of the force signal), identification and consequently input estimation are performed by means of discrete-time SSMs. We also focus on one-dimensional force signals (i.e., 1-axis measurements). Extensions for processing multi-dimensional force measurements are discussed and presented in Section 6.4.

6.2.1 Model-Based Input Estimation

For any K ∈ ℕ, let u1, ..., uK ∈ ℝ denote the unknown samples of the applied force and y1, ..., yK ∈ ℝ the scalar sensor measurements from the force sensor sampled with period Ts. The goal is to find an estimate Ûk of Uk using SSMs. An MMSE/LMMSE/MAP input estimation method is used, which requires two modeling assumptions: First, the force signal resembles (statistically) a correlated Gaussian random process, and second, the input-relevant dynamics of the sensor are captured well by a low-order SSM.



Figure 6.2: Stochastic model used to develop a model-based filter for estimation of the random input Uk given the measurements Yk.

Experimental results confirm that both assumptions hold rather well. The resulting model is shown in Figure 6.2. A continuous-time prior models the force U(t) acting on the sensor. Since model identification yields discrete-time sensor models, input estimation with the estimator from Chapter 5 estimates Uk = U(tk), i.e., the input signal at discrete times t1, ..., tK. Finally, sensor measurements Yk corresponding to sensor readings are presumed to be corrupted by white Gaussian noise with variance σ²_N.

The SSM representation of Figure 6.2 is obtained by concatenating an SSM with parameters A⁽ᵖ⁾, B⁽ᵖ⁾, and c⁽ᵖ⁾, encoding prior assumptions on the actual force, with the SSM of the sensor model. An additional offset state accounts for large random changes¹ in the input signal Uk, as they are typical during machining. Denote the SSM parameters of the sensor model as A⁽ˢ⁾, b⁽ˢ⁾, and c⁽ˢ⁾. The complete SSM is then:

\[
X_{k+1} =
\begin{pmatrix}
A^{(p)} & 0 & 0 \\
0 & 1 & 0 \\
b^{(s)} c^{(p)} & b^{(s)} & A^{(s)}
\end{pmatrix}
X_k +
\begin{pmatrix}
B^{(p)} & 0 \\
0 & \varepsilon \\
0 & 0
\end{pmatrix}
Z_k
\tag{6.1}
\]
\[
Y_k = \begin{pmatrix} 0 & 0 & c^{(s)} \end{pmatrix} X_k + N_k
\tag{6.2}
\]

¹Such jumps in the signal may be detected from Ûk using a glue factor as described in [59]. However, we do not consider this problem further.


and the entries of the state vector Xk are defined as follows

\[
X_k =
\begin{pmatrix}
X^{(U)}_k \\
\bar{X}^{(U)}_k \\
X^{(s)}_k
\end{pmatrix}.
\tag{6.3}
\]

The states of the spline prior X^(U)_k (cf. Appendix C) and the input-signal offset X̄^(U)_k are driven by white Gaussian noise Zk with variance σ²_U, where ε is small to allow the offset state to slowly change over time. An estimate of the input is eventually obtained from the a-posteriori mean value, which is easily seen from (6.3) to be given by
\[
\hat{U}_k = \underbrace{\begin{pmatrix} c^{(p)} & 1 & 0 \end{pmatrix}}_{\triangleq\, \Pi} X_k.
\]
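To make this construction concrete, the following sketch assembles the composite matrices of (6.1)–(6.3) and the input-extraction vector Π from given spline-prior and sensor SSM parameters. It is a minimal illustration, assuming real-valued NumPy arrays of compatible dimensions; the function and variable names are ours, not part of the thesis.

```python
import numpy as np

def stack_ssm(A_p, B_p, c_p, A_s, b_s, c_s, eps):
    """Assemble the composite SSM (6.1)-(6.2): spline-prior model
    (A_p, B_p, c_p), offset state with drive strength eps, and sensor
    model (A_s, b_s, c_s)."""
    n_p, n_s = A_p.shape[0], A_s.shape[0]
    n = n_p + 1 + n_s
    A = np.zeros((n, n))
    A[:n_p, :n_p] = A_p                      # spline-prior dynamics
    A[n_p, n_p] = 1.0                        # offset state (slow random walk)
    A[n_p + 1:, :n_p] = np.outer(b_s, c_p)   # sensor driven by prior output
    A[n_p + 1:, n_p] = b_s                   # ... and by the offset state
    A[n_p + 1:, n_p + 1:] = A_s              # sensor dynamics
    m = B_p.shape[1]
    B = np.zeros((n, m + 1))
    B[:n_p, :m] = B_p                        # noise driving the spline prior
    B[n_p, m] = eps                          # slow drift of the offset
    c_out = np.concatenate([np.zeros(n_p + 1), c_s])   # output map (6.2)
    Pi = np.concatenate([c_p, [1.0], np.zeros(n_s)])   # u_hat_k = Pi @ m_Xk
    return A, B, c_out, Pi
```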

6.2.2 Input estimator implementation

The input estimate, i.e., the posterior mean E[Uk | y1, ..., yK], requires the marginals² p(uk | y1, ..., yK). Efficient computation of these marginals is obtained by implementing the smoothing algorithm from Section 4.3.1 using the SSM (6.1). Specifically, a forward recursion based on the messages →V and →m, followed by a backward recursion to compute W̃ and W̃µ, is executed. Eventually, the force estimate û1, ..., ûK is computed from
\[
\hat{u}_k \triangleq \Pi\, m_{X_k}.
\]

Since signals are typically long (K ≫ 1) and the physical model can be assumed to be stationary, computational complexity can be reduced significantly by using the steady-state smoothing method presented in Section 4.3.2.

²Our implementation does batch processing. Online processing algorithms are derived likewise, as shown, e.g., in [47]. The main idea is to use a sliding-window approach and introduce a processing delay. The length of the processing delay can be used to trade estimation performance (i.e., smoothing) against computational complexity (sliding-window length).
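As a minimal stand-in for the Gaussian message-passing smoother of Section 4.3.1, the following sketch computes the posterior state means with a standard Kalman filter followed by an RTS smoothing pass and then extracts ûk = Π m_{X_k}. It assumes scalar observations, unit-variance white noise Zk, and omits the steady-state speed-up of Section 4.3.2; it is an illustration under these assumptions, not the thesis implementation.

```python
import numpy as np

def smooth_input(A, B, c, sigma2_N, y, Pi, m0, V0):
    """Posterior input means u_hat_k = Pi @ E[X_k | y_1..y_K] for the
    SSM (6.1)-(6.2), via Kalman filtering plus RTS smoothing."""
    K, n = len(y), A.shape[0]
    Q = B @ B.T                                  # process-noise covariance
    mf, Vf = np.zeros((K, n)), np.zeros((K, n, n))
    mp, Vp = np.zeros((K, n)), np.zeros((K, n, n))
    m, V = m0, V0
    for k in range(K):                           # forward pass
        mp[k], Vp[k] = m, V                      # prediction for step k
        g = V @ c / (c @ V @ c + sigma2_N)       # Kalman gain
        m = m + g * (y[k] - c @ m)               # measurement update
        V = V - np.outer(g, c @ V)
        mf[k], Vf[k] = m, V
        m, V = A @ m, A @ V @ A.T + Q            # time update
    ms = mf.copy()
    for k in range(K - 2, -1, -1):               # backward (RTS) pass
        G = Vf[k] @ A.T @ np.linalg.inv(Vp[k + 1])
        ms[k] = mf[k] + G @ (ms[k + 1] - mp[k + 1])
    return ms @ Pi                               # u_hat_k = Pi @ m_Xk
```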


Remark 6.1
Utilizing steady-state message passing, the smoother is, in fact, the sum of a causal IIR filter and an anti-causal IIR filter. The order of both filters is the order of the employed SSM.

To improve numerical stability of the implementation, the SSM (6.1) may be expressed in another state-space basis X̃k = TXk. It follows that the input-extraction matrix Π must be transformed to ΠT⁻¹. In the present implementation, this additional step was not necessary.

Parameters The proposed model offers two parameters to adapt input signal estimates to varying filtering settings. First, the order of the spline prior may be increased (decreased) to get smoother (faster-varying) estimates. The second parameter is the ratio λ ≜ σ²_U/σ²_N. By inspection of the variational representation of the input estimator [10, Theorem 2], it is evident that the input estimate ûk only depends on this ratio, as opposed to the absolute values of the variances σ²_U or σ²_N. In practical applications, this ratio may be used to adjust the resulting input estimates to the given sensor measurement quality.

6.3 Sensor Model Identification

As outlined in Section 6.1, the goal is to infer a dynamometer model from L input-output measurements u⁽¹⁾, ..., u⁽ᴸ⁾ and y⁽¹⁾, ..., y⁽ᴸ⁾. In the following we consider the case L = 1, i.e., only one input and output signal set is available, and consequently drop the corresponding notation. The generalization to L > 1 is deferred to the end of the description.

We seek an identification scheme that focuses on the final usage of the system model: in this case, input estimation. To the best of our knowledge, there are no results on identification methods that are suited for input estimation tasks. We present a novel variational statistical model that prefers system models with better (expected) input estimation performance and is therefore better suited for the given task than standard methods as in, e.g., [43].

To elaborate on this question, consider a stationary setting. Then input


estimation as in Section 6.2.1 with noise-to-input-power ratio
\[
\lambda \triangleq \sigma_N^2 / \sigma_U^2
\]
corresponds, according to Proposition 5.3, to a Wiener filter
\[
G(e^{j\theta}) \triangleq \frac{\overline{H(e^{j\theta})}}{|H(e^{j\theta})|^2 + \lambda},
\]
with H(e^{jθ}) the frequency response of the sensor model. Inevitably, the identified sensor model will only recover a corrupted version of the actual discrete-time sensor frequency response H(e^{jθ}) to a sinusoid at circular frequency θ; thus
\[
\hat{H}(e^{j\theta}) = H(e^{j\theta}) + \varepsilon(e^{j\theta}).
\tag{6.4}
\]

Application of this filter, with H replaced by the estimated model Ĥ, then results in
\[
\begin{aligned}
G(e^{j\theta})\,H(e^{j\theta})\,U(e^{j\theta})
&= \frac{\overline{\hat{H}(e^{j\theta})}\, H(e^{j\theta})\, U(e^{j\theta})}{|\hat{H}(e^{j\theta})|^2 + \lambda} \\
&= \frac{|H(e^{j\theta})|^2 \Bigl( 1 + \overline{\varepsilon(e^{j\theta})}/\overline{H(e^{j\theta})} \Bigr) U(e^{j\theta})}
       {|H(e^{j\theta})|^2 \bigl| 1 + \varepsilon(e^{j\theta})/H(e^{j\theta}) \bigr|^2 + \lambda} \\
&\approx \underbrace{\frac{U(e^{j\theta})}{1 + \lambda / |H(e^{j\theta})|^2}}_{\text{MMSE estimate}}
\;\underbrace{\left( 1 - \frac{\varepsilon(e^{j\theta})}{H(e^{j\theta})} \right)}_{\text{model error}},
\end{aligned}
\tag{6.5}
\]
where the last approximation holds when ε(e^{jθ}) ≪ 1. Hence, a relative modeling error ε(e^{jθ})/Ĥ(e^{jθ}) ≈ ε(e^{jθ})/H(e^{jθ}) results in an (absolute) estimation error of the same magnitude. We conclude that, for the task at hand, system identification should try to minimize the relative model error.

A variety of system identification methods might be assessed based on this finding: subspace methods, prediction-error methods, and ML estimators [43, 70]. Prediction-error methods [43] minimize a (weighted)



Figure 6.3: First segments of the factor graph corresponding to the probability density used in Theorem 6.1.

absolute model error, which might be very different from the relative model error. Subspace methods lack a characterization by an error criterion. Maximum-likelihood methods, on the other hand, naturally offer the required properties, as demonstrated by the following theorem, which is proven in Appendix B.

Theorem 6.1: ML Cost Function
Consider a linear single-input single-output SSM as shown in Figure 6.3 and define its K-point discrete Fourier transform (DFT) as
\[
S[k] = c^{\mathsf{T}} \bigl( e^{-j 2\pi k/K} I - A \bigr)^{-1} b.
\]
Also, let u1, ..., uK and U[1], ..., U[K] be the input signal and its DFT, and likewise y1, ..., yK and Y[1], ..., Y[K] the output signal and its DFT. If K ≫ 1, the (scaled) log-likelihood L(A, b, c) ≜ −2 log p(y, u | A, b, c) fulfills
\[
L(A, b, c) \approx \sum_{k=1}^{K} \left[ \log\bigl( |S[k]|^2 + \lambda \bigr)
  + \frac{1}{\sigma_U^2} \, \frac{\bigl| Y[k] - S[k]\,U[k] \bigr|^2}{\lambda + |S[k]|^2} \right] + C,
\tag{6.6}
\]
where C is an (approximately) constant term independent of the parameters A, b, and c.


[Figure: a hammer impulse plus modeled input noise N(0, σ²_U) drives the sensor/workpiece model f; additive measurement noise N(0, σ²_N) yields the impulse-response measurements Yk.]

Figure 6.4: Schematic of the variational statistical model proposed for model estimation from impulse-response measurements. The filled black boxes represent given input measurements and output measurements. The modeled identification input signal is denoted Nk.

Now assume that Y[k] = H[k]U[k] + N[k], where N[k] is noise, for DFT bins k ∈ [1, K]. We now illustrate that the ML estimator's behavior is similar to the desired one (6.5) when the signal is long enough for Theorem 6.1 to (approximately) hold. To simplify the exposition, we also assume that σ²_U ≫ 1. Then, as the second term in (6.6) is weighted much more than the log term, the log term is ignored. The maximum-likelihood estimator of the SSM parameters in Figure 6.3 thus minimizes

\[
\min_{S[1], \dots, S[K]} \; \sum_{k=1}^{K} \frac{\bigl| \varepsilon[k]\,U[k] + N[k] \bigr|^2}{\lambda + |S[k]|^2}
\tag{6.7}
\]

with ε[k] as in (6.4), dependent on S[k]. From the last expression it is evident that the squared error in the numerator is weighted by 1/|S[k]|² and hence like the relative model error in (6.5). In addition, observe that for k where λ ≫ |S[k]|² holds, the MMSE estimate in (6.5) is shrunk towards 0 and the ML estimator limits the weighting factor at 1/λ. This makes sense, as the estimate S[k] ought not to fit a noisy estimate for DFT bins where the estimate will barely influence the MMSE estimate.


In conclusion, when the goal is to identify a sensor model that is afterwards used for input estimation, maximum-likelihood estimators are a suitable choice. Henceforth, we seek to compute (approximately) the maximum-likelihood estimate from the variational statistical model depicted in Figure 6.4 with parameters σ²_U and λ.
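For illustration, the approximate cost (6.6) can be evaluated directly from the DFTs of an input-output record. The following sketch (with our own helper names) does so for given SSM parameters, dropping the constant C:

```python
import numpy as np

def freq_response(A, b, c, K):
    """S[k] = c^T (e^{-j 2 pi k / K} I - A)^{-1} b for k = 0..K-1."""
    n = A.shape[0]
    z = np.exp(-2j * np.pi * np.arange(K) / K)
    return np.array([c @ np.linalg.solve(zk * np.eye(n) - A, b) for zk in z])

def approx_ml_cost(A, b, c, U, Y, sigma2_U, lam):
    """Approximate (scaled) negative log-likelihood (6.6), up to the
    constant C, from the K-point DFTs U, Y of the input/output record."""
    S = freq_response(A, b, c, len(U))
    w = lam + np.abs(S) ** 2
    return np.sum(np.log(w) + np.abs(Y - S * U) ** 2 / (sigma2_U * w))
```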

6.3.1 Tuning of Identification Parameters

In the present maximum-likelihood problem (Figure 6.4), we consider the noise variances as algorithmic tuning parameters³. From Theorem 6.1 we make the following observations that guide parameter choices:

• Tuning of σ²_U: Varying σ²_U while holding λ fixed changes the relative weight of the logarithmic term with respect to the model estimation error. The logarithmic term in the optimization objective encourages that small S[k] are driven towards zero, i.e., shrinkage of the smaller terms. The parameter σ²_U may also be used to adjust characteristics of the EM algorithm employed to compute the ML estimate. Specifically, for 1 ≪ σ²_U (observations are on the order of 1) the EM updates become numerically more robust, since in the EM message the influence of the covariance term decreases (cf. (A.1) and (A.3)) and covariances are more (numerically) error-prone than vectors (cf. Section 3.1). Additionally, the choice 1 ≪ σ²_U has been observed to increase convergence rates of EM.

• Tuning of λ: As seen from (6.6), the ML objective essentially caps the weight once the attenuation of the SSM falls below λ. The parameter may thus be tuned specifically to prescribe when the ML estimate must pay less attention to data fit.

³This is in contrast to deterministic values or random variables that may be inferred from the probability model.


6.3.2 Implementation

Before applying an EM algorithm to compute an approximate solution to the ML problem, the measurement data is appropriately scaled, and only the signal window where there is significant power in the sensor measurements is used. We apply the EM algorithm, as outlined in [16], to our model (Figure 6.4) and use the AR form of the SSM. The algorithm is stated in Appendix A.3 on page 128.

6.3.3 Improvements of Model-Identification Algorithm

In the following, we particularize our model-identification method in Algorithm A.3 and present various extensions that proved useful in the current application.

Initialization

Since the EM iteratively optimizes a non-convex objective function, results can vary depending on the initial parameters that the algorithm starts with. One possibility to initialize the EM algorithm is to obtain a first estimate with another system identification method. As suggested in [75], subspace system identification methods are well suited, as they do not require initial estimates themselves. When identifying SSMs for dynamometer sensors, another approach, which is employed in our final implementation, is to leverage prior knowledge of the model characteristics (e.g., the sensor's data sheet); given the frequencies of the resonance/anti-resonance pairs and the resonance itself, an initial SSM is generated as sketched in Appendix F. The mode frequency is set according to the dynamometer's data sheet, and the resonance/anti-resonance pairs are equally spaced in the frequency range up to the resonance mode.

Multiple Independent Measurements

Multiple input-output measurements of the sensor u⁽¹⁾, ..., u⁽ᴸ⁾ and y⁽¹⁾, ..., y⁽ᴸ⁾ can serve to improve the quality of the identified model and make the method more robust in practical applications.



Figure 6.5: Factor graph of an SSM with two independent sets of input and output observations. The factor node f_S denotes a hard constraint that is equal to 1 iff the two state variables and the input and output variables fulfill (2.1) and (2.2). The dotted part of the graph is relevant in ML-based estimation. In EM-based estimation, only EM messages are propagated on the dotted edges; the two SSMs are independent otherwise.

Consider the factor graph in Figure 6.5, which represents an SSM given two sets of input and output observations. The node f_S represents a hard constraint ensuring that the two state variables X⁽ℓ⁾_{k−1} and X⁽ℓ⁾_k and the input and output variables fulfill the SSM equations (2.1) and (2.2). The two SSMs, which are assumed to be parametrized in AR form, are only connected through the parameters A and c, which of course may be seen from the conditional independence of the likelihoods

\[
p\bigl( y^{(1)}, \dots, y^{(L)}, u^{(1)}, \dots, u^{(L)} \,\big|\, A, c \bigr)
= \prod_{\ell=1}^{L} p\bigl( y^{(\ell)}, u^{(\ell)} \,\big|\, A, c \bigr).
\tag{6.8}
\]

EM-based system identification in the joint multiple-measurements factor graph, as in, e.g., Figure 6.5, or equivalently with the joint likelihood (6.8), can be accommodated in Algorithm A.3 on page 128 with the following two


adaptations:

• Computation of the marginal probability densities using Gaussian message passing in SSMs is done separately on each of the L input u⁽ℓ⁾ and output y⁽ℓ⁾ datasets. On each SSM factor graph and for each time step, the EM messages ←µ⁽ℓ⁾_A(A) and ←µ⁽ℓ⁾_c(c) are calculated independently. These are steps 2) and 3) in Algorithm A.3.

• The EM messages are then joined through equality constraints (see, e.g., the dotted part of the graph in Figure 6.5). Correspondingly, in Algorithm A.3, the EM messages from all SSMs must be combined prior to step 4).

We recognize that the EM-based system identification method for multiple observations averages the estimated parameters over multiple measurement sets and is generally different from averaging all datasets before system identification. The proposed approach is preferable to the latter, which is very common in system identification, as it is an (approximate) algorithm to find the joint ML estimate (cf. (6.8)).
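Combining the per-dataset Gaussian EM messages through an equality constraint follows the standard rule for multiplying Gaussian messages: precisions add, and means are precision-weighted. A minimal sketch, assuming each message is represented by a precision matrix W⁽ℓ⁾ and a mean vector m⁽ℓ⁾ (our own names):

```python
import numpy as np

def combine_em_messages(Ws, ms):
    """Join Gaussian EM messages through an equality constraint.
    Ws: list of precision matrices W^(l); ms: list of mean vectors m^(l).
    Returns the joint precision and the precision-weighted mean."""
    W = sum(Ws)                                   # precisions add
    xi = sum(Wl @ ml for Wl, ml in zip(Ws, ms))   # combined precision-mean
    return W, np.linalg.solve(W, xi)
```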

Decimation

As seen in Section 3.2, decimating signals before using Gaussian message-passing-based methods can be a simple way to improve numerical stability when a slowly-varying band-limited SSM is used. Implementing decimation of the input-output measurements before application of the EM algorithm is straightforward. As the estimated SSM is then defined for a lower rate, it must be up-sampled after estimation. Let n be the decimation factor. We propose to up-sample the SSM by computing A^(1/n) and, with the up-sampled A and the original measurement data, to re-estimate c while keeping A fixed. Before decimation, it is advantageous to lowpass filter the data in order to reduce noise. However, applying a lowpass filter may lead to identification of a differently shaped SSM. To overcome this drawback, we propose system identification with the n cosets of the input and output signals resulting from decimation by a factor n. Specifically, define

\[
u^{(\ell,k)} = \bigl( u^{(\ell)}_{k},\, u^{(\ell)}_{n+k},\, u^{(\ell)}_{2n+k},\, \dots \bigr)
\]


for any measurement set ℓ and k ∈ [1, n]. The cosets are defined analogously for y. EM-based estimation as in the case of multiple measurements, where each coset represents a measurement, can now be used. Analogously to the factor graph shown for the case of two sets in Figure 6.5, the (decimated) SSMs for each coset independently compute marginal probabilities and EM messages for A^(1/n) and c. The EM messages of each SSM with dataset u^(ℓ,k) and y^(ℓ,k) are then combined in the M step by means of the equality constraints. When the EM-based method terminates, the joint low-rate estimates of the SSM parameters are up-sampled to obtain normal-rate model estimates.
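A sketch of the two mechanical steps, splitting a signal into its decimation cosets and up-sampling the identified transition matrix via A^(1/n), is shown below. It relies on scipy.linalg.fractional_matrix_power; the fractional power can be complex for some spectra, which is handled here by a simple projection (our choice, not prescribed in the text).

```python
import numpy as np
from scipy.linalg import fractional_matrix_power

def cosets(x, n):
    """The n decimation cosets of a signal: (x_k, x_{n+k}, x_{2n+k}, ...)."""
    return [x[k::n] for k in range(n)]

def upsample_A(A_low, n):
    """Up-sample the low-rate transition matrix by computing A^(1/n)."""
    A = fractional_matrix_power(A_low, 1.0 / n)
    return np.real_if_close(A)   # drop negligible imaginary parts if present
```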

Constrained Model Identification

Assume that constraints on the sensor model are given that can be expressed as linear functions of the SSM parameters A or c. In fact, for dynamometer sensors, it is known that the steady-state gain must be 1, i.e., the sensors are calibrated. It was observed that identification measurements are often corrupted⁴ by constant offsets and gains unequal to 1. A linear constraint can be directly integrated into the EM recursions for model identification. The key observation is that in step 4) of Algorithm A.3 (i.e., where all Gaussian EM messages are combined with equality factors), a linear constraint on the parameter corresponds to a noise-free observation factor. Specifically, consider a unit-gain constraint on the estimated SSM. Given A in AR form, where a is the first row of A, the constraint is

\[
c^{\mathsf{T}} \mathbf{1} = 1 - a\,\mathbf{1}.
\tag{6.9}
\]

This can readily be seen from the transfer function derived from an SSM in AR form

\[
S(z) = \frac{c_1 z^{n-1} + c_2 z^{n-2} + \dots + c_n}{z^n - a_1 z^{n-1} - \dots - a_n};
\]
with the steady-state gain given by S(z)|_{z=1} = 1, expression (6.9) follows. From standard Gaussian message-passing rules as in [47], it can be seen

⁴These effects can most likely be attributed to drift effects in the sensor's amplifier.


that (A.5) is replaced with
\[
\lambda = \frac{1 - \mathbf{1}^{\mathsf{T}}\bigl( a^{[j]} + m_c \bigr)}{\mathbf{1}^{\mathsf{T}} W_c^{-1} \mathbf{1}}
\qquad \text{and} \qquad
c^{[j+1]} = W_c^{-1}\bigl( W_c m_c + \lambda \mathbf{1} \bigr),
\]
with 𝟙 a column vector with all entries equal to 1.
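The constrained M-step update for c is a small computation; the following sketch implements the two expressions above (our own function and variable names), treating a^[j] and m_c as vectors of equal length:

```python
import numpy as np

def constrained_c_update(a_j, m_c, W_c):
    """Unit-gain-constrained update for c (cf. (6.9)): maximize the
    Gaussian EM message with mean m_c and precision W_c subject to
    1^T c = 1 - 1^T a^[j], via the Lagrange multiplier lam."""
    ones = np.ones_like(m_c)
    Winv1 = np.linalg.solve(W_c, ones)
    lam = (1.0 - ones @ (a_j + m_c)) / (ones @ Winv1)
    return m_c + lam * Winv1      # = W_c^{-1} (W_c m_c + lam * 1)
```

One can check that the returned c satisfies 𝟙ᵀc = 𝟙ᵀm_c + λ𝟙ᵀW_c⁻¹𝟙 = 1 − 𝟙ᵀa^[j], i.e., the constraint holds exactly.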

6.4 Frequency-Based MMSE Filters

The resonating modes that occur in actual dynamometer sensors often exert dynamic forces in multiple spatial directions. To account for the spatial characteristics of resonating modes and, hence, to improve their suppression, a frequency-based MMSE filter is presented next. Let the actual force signal, the measurements, and the estimate be denoted as

\[
U(t) = \begin{pmatrix} U^{(x)}(t) \\ U^{(y)}(t) \\ U^{(z)}(t) \end{pmatrix},
\qquad
Y(t) = \begin{pmatrix} Y^{(x)}(t) \\ Y^{(y)}(t) \\ Y^{(z)}(t) \end{pmatrix},
\qquad
\hat{U}(t) = \begin{pmatrix} \hat{U}^{(x)}(t) \\ \hat{U}^{(y)}(t) \\ \hat{U}^{(z)}(t) \end{pmatrix}.
\tag{6.10}
\]

A consequence of these phenomena is crosstalk in the 3-dimensional sensor measurements: e.g., a dynamic force u^(x)(t) in just one spatial direction acting on the sensor will be recorded on the other channels Y^(y) and/or Y^(z) of the sensor as well. There are two ways to approach this issue: considering the interference from other channels as noise and suppressing it, or joint estimation of all 3-dimensional channels. Single-channel Wiener filters were presented in Section 5.3 for LMMSE input estimation in a wide-sense stationary setting. These filters are appropriate if we assume that there is no interference between channels. Otherwise, Wiener filters for (correlated) multi-channel signals are necessary. Both approaches, i.e., independent single-channel Wiener filters and a multi-channel Wiener filter, will be shown below and compared in terms of performance on experimental data.

6.4.1 Frequency-Based Filtering

To develop the frequency-based MMSE filter, we adopt assumptions different from those of Section 6.2.1. Specifically, we assume that the sensor is linear and time-invariant and that noise and external disturbances are stationary.


Then the sensor, sampled with sampling period T, can be modeled with the discrete-time Fourier transform (DTFT) at frequency θ = 2πTf as
\[
\underbrace{\begin{pmatrix} Y^{(x)}(e^{j\theta}) \\ Y^{(y)}(e^{j\theta}) \\ Y^{(z)}(e^{j\theta}) \end{pmatrix}}_{Y(e^{j\theta})}
=
\underbrace{\begin{pmatrix}
H^{(xx)}(e^{j\theta}) & H^{(xy)}(e^{j\theta}) & H^{(xz)}(e^{j\theta}) \\
H^{(yx)}(e^{j\theta}) & H^{(yy)}(e^{j\theta}) & H^{(yz)}(e^{j\theta}) \\
H^{(zx)}(e^{j\theta}) & H^{(zy)}(e^{j\theta}) & H^{(zz)}(e^{j\theta})
\end{pmatrix}}_{H(e^{j\theta})}
\underbrace{\begin{pmatrix} U^{(x)}(e^{j\theta}) \\ U^{(y)}(e^{j\theta}) \\ U^{(z)}(e^{j\theta}) \end{pmatrix}}_{U(e^{j\theta})}.
\tag{6.11}
\]

Here H^(ab)(e^{jθ}) represents the DTFT of the impulse response from an input in direction a ∈ {x, y, z} measured at the sensor's output b ∈ {x, y, z}. Furthermore, let the input force U be modeled as a wide-sense stationary white Gaussian⁵ noise process with power spectral density S_{UUᵀ}(e^{jθ}) ≡ σ²_u I (i.e., we refrain from making any assumptions on the input force). The observed sensor measurements Y are corrupted by wide-sense stationary noise N with power spectral density σ²_n(e^{jθ})I. As described in Section 2.3, the multi-dimensional Wiener filter for the problem at hand is

\[
\begin{aligned}
G(e^{j\theta}) &= S_{UY^{\mathsf{T}}}(e^{j\theta})\, S_{YY^{\mathsf{T}}}^{-1}(e^{j\theta}) \\
&= \sigma_u^2\, H(e^{j\theta})^{\mathsf{H}} \bigl( \sigma_u^2\, H(e^{j\theta}) H(e^{j\theta})^{\mathsf{H}} + \sigma_n^2(e^{j\theta})\, I \bigr)^{-1}.
\end{aligned}
\tag{6.12}
\]

Under certain circumstances (e.g., inaccurate sensor model estimates or limited computational resources), it might be desirable to make independent MMSE estimates instead of one joint estimate of all three signal dimensions. To this end, consider the scalar Wiener filter corresponding to (6.12) for a scalar signal, i.e.,

\[
G^{(a)}(e^{j\theta}) = \frac{S_{Y^{(a)}U^{(a)}}(e^{j\theta})}{S_{Y^{(a)}Y^{(a)}}(e^{j\theta})}
= \frac{\sigma_u^2\, \overline{H^{(aa)}(e^{j\theta})}}{\sigma_u^2\, h^{(a)}(e^{j\theta})\, h^{(a)}(e^{j\theta})^{\mathsf{H}} + \sigma_n^2(e^{j\theta})}
\tag{6.13}
\]

⁵If the Gaussian assumption does not hold, all filters reduce to the corresponding LMMSE filters [39].



for a ∈ {x, y, z}, with H̄^(aa)(e^{jθ}) the complex conjugate of H^(aa)(e^{jθ}) and h^(a)(e^{jθ}) the row a of H(e^{jθ}). This interference-suppression filter will be denoted by K^(a)(e^{jθ}). Most of the following treatment applies to this filter as well as the MMSE filter; however, we will only refer to the MMSE filter. Of course, ignoring all interference, one obtains the well-known scalar Wiener filter for all channels a ∈ {x, y, z} and all frequencies:

\[
L^{(a)}(e^{j\theta}) = \frac{\sigma_u^2\, \overline{H^{(aa)}(e^{j\theta})}}{\sigma_u^2\, |H^{(aa)}(e^{j\theta})|^2 + \sigma_n^2(e^{j\theta})}.
\tag{6.14}
\]
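As an illustration of (6.12) and (6.14), the following sketch computes the joint filter and the crosstalk-ignoring scalar filters for every DFT bin. It assumes an array H of per-bin 3×3 frequency responses and, for simplicity, a scalar noise level σ²_n; the interference-suppression filter (6.13) would be formed analogously from the rows of H.

```python
import numpy as np

def wiener_filters(H, sigma2_u, sigma2_n):
    """Per-DFT-bin MMSE filters: joint 3x3 filter (6.12) and the scalar
    crosstalk-ignoring filter (6.14). H has shape (K, 3, 3), complex."""
    K = H.shape[0]
    I = np.eye(3)
    G = np.zeros_like(H)                        # joint filter, (K, 3, 3)
    L = np.zeros((K, 3), dtype=complex)         # scalar filters per channel
    for n in range(K):
        Hn = H[n]
        G[n] = sigma2_u * Hn.conj().T @ np.linalg.inv(
            sigma2_u * Hn @ Hn.conj().T + sigma2_n * I)          # (6.12)
        d = np.diag(Hn)                         # diagonal entries H^(aa)
        L[n] = sigma2_u * d.conj() / (sigma2_u * np.abs(d) ** 2
                                      + sigma2_n)                # (6.14)
    return G, L
```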

6.4.2 Frequency Response Function Estimation

In order to use the derived filters in a practical dynamometer filtering application, it is first necessary to estimate the multichannel frequency-domain model H(e^{jθ}) for a sensor. To this end, we assume that L ≥ 1 impulse-response measurements are available, as in the identification setting outlined in Section 6.1.

Sensor transfer function identification To apply the frequency-based system identification method proposed next, the identification measurements are chosen long enough such that the output signals of the sensor are zero outside the given window of duration T_W. Note that the SSM-based identification method in Section 6.3 did not have this prerequisite. Let K = T_W/T and define⁶
\[
\mathcal{U} = \{ U_1[\cdot], U_2[\cdot], \dots, U_L[\cdot] \},
\qquad
\mathcal{Y} = \{ Y_1[\cdot], Y_2[\cdot], \dots, Y_L[\cdot] \},
\]
the sets of the K-point discrete Fourier transforms (DFTs) of the input data sets and the output data sets.

⁶Note that a different notation for the multiple measurement sets is used than in Section 6.3, to avoid confusion with the spatial indices x, y, and z.


Under the assumption that the measurements 𝒴 are corrupted by white Gaussian noise with variance σ²_E, and recalling well-known properties of the DFT (i.e., white noise remains uncorrelated after taking a DFT), the identification problem setup is
\[
Y_\ell[n] = H[n]\, U_\ell[n] + E_\ell[n],
\]
for all U_ℓ ∈ 𝒰, Y_ℓ ∈ 𝒴, and n ∈ [1, ..., K]. Since the problem is linear Gaussian, the ML estimate of the unknown frequency response H[·] is the solution of 3K independent least-squares problems (one per output channel and DFT bin). To this end, define for all K DFT points
\[
h^{(a)}[n] = \bigl[ H^{(ax)}[n],\, H^{(ay)}[n],\, H^{(az)}[n] \bigr]^{\mathsf{T}},
\]
i.e., the transfer functions with output channel a, where a ∈ {x, y, z}. The least-squares problems that give the ML estimate of the unknown sensor model can then be written more compactly as

\[
\hat{h}^{(a)}[n] \triangleq \operatorname*{argmin}_{h^{(a)}[n]}
\left\|
\begin{pmatrix} Y_1^{(a)}[n] \\ \vdots \\ Y_L^{(a)}[n] \end{pmatrix}
-
\begin{pmatrix}
U_1^{(x)}[n] & U_1^{(y)}[n] & U_1^{(z)}[n] \\
\vdots & \vdots & \vdots \\
U_L^{(x)}[n] & U_L^{(y)}[n] & U_L^{(z)}[n]
\end{pmatrix}
h^{(a)}[n]
\right\|^2
\tag{6.15}
\]

for all a ∈ {x, y, z}. The least-squares solution is [45]
\[
\hat{h}^{(a)}[n] = \Bigl( \sum_{\ell=1}^{L} u_\ell[n]^{\mathsf{H}} u_\ell[n] \Bigr)^{-1}
\Bigl( \sum_{\ell'=1}^{L} u_{\ell'}[n]^{\mathsf{H}}\, y^{(a)}_{\ell'}[n] \Bigr),
\]
where u_ℓ[n] denotes the ℓth row of the matrix in (6.15).

Incidentally, this is equivalent to the (estimate of the) cross-spectral density of the input and output signals divided by the estimated power spectral density of the input signal:
\[
\hat{H}^{(ab)}[n]
= \frac{\frac{1}{L} \sum_{\ell=1}^{L} Y^{(a)}_\ell[n]\, \overline{U^{(b)}_\ell[n]}}
       {\frac{1}{L} \sum_{\ell=1}^{L} \bigl| U^{(b)}_\ell[n] \bigr|^2}
= \frac{\frac{1}{L} \sum_{\ell=1}^{L} S_{Y^{(a)}_\ell U^{(b)}_\ell}[n]}
       {\frac{1}{L} \sum_{\ell=1}^{L} S_{U^{(b)}_\ell U^{(b)}_\ell}[n]}.
\tag{6.16}
\]
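The per-bin least-squares estimate (6.15) maps directly onto a linear solver. A minimal sketch, assuming the DFTs are stored as complex arrays of shape (L, K, 3):

```python
import numpy as np

def estimate_frf(U_dfts, Y_dfts):
    """ML estimate (6.15) of the 3x3 FRF per DFT bin from L measurement
    sets. U_dfts, Y_dfts: arrays of shape (L, K, 3) holding the DFTs of
    the 3-axis input and output records."""
    L, K, _ = U_dfts.shape
    H = np.zeros((K, 3, 3), dtype=complex)
    for n in range(K):
        Un = U_dfts[:, n, :]                    # L x 3 regression matrix
        for a in range(3):
            ya = Y_dfts[:, n, a]                # L outputs of channel a
            h, *_ = np.linalg.lstsq(Un, ya, rcond=None)
            H[n, a, :] = h                      # row a of H[n]
    return H
```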

Corrections of the Frequency-Response Function In the application at hand, the sensor characteristic at very low frequencies is important for filtering performance, because typical input force signals display


very slow and large signal components (e.g., offsets). Corrupted estimates of the frequency-response function (FRF), caused by non-idealities in the sensor-amplifier chain or by noise, are remedied by linearly interpolating the first 10 FRF bins and setting the first bin to 1. Further improvement in performance is observed when the resulting Wiener filter has a steady-state (i.e., n = 0) gain of 1. We therefore rescale (6.12) according to

\[
G[n] = \Bigl( I + \frac{\sigma_n^2(0)}{\sigma_u^2} \bigl( H[0]^{\mathsf{H}} H[0] \bigr)^{-1} \Bigr)\,
H[n]^{\mathsf{H}} \Bigl( H[n] H[n]^{\mathsf{H}} + \frac{\sigma_n^2[n]}{\sigma_u^2}\, I \Bigr)^{-1}
\tag{6.17}
\]
and also (6.13)–(6.14) in an analogous way.
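A sketch of the low-frequency FRF correction described above is given below. The DC values of the crosstalk terms are assumed to vanish, which is our assumption rather than stated in the text.

```python
import numpy as np

def correct_low_frequency(H, n_bins=10):
    """Linearly interpolate the first n_bins bins of the estimated FRF,
    anchoring bin 0 at the calibrated gain: 1 on the diagonal and
    (assumed) 0 for the crosstalk terms. H: complex array (K, 3, 3)."""
    Hc = H.copy()
    for a in range(3):
        for b in range(3):
            dc = 1.0 if a == b else 0.0
            # straight line from the DC anchor to the first trusted bin
            Hc[:n_bins + 1, a, b] = np.linspace(dc, H[n_bins, a, b],
                                                n_bins + 1)
    return Hc
```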

6.4.3 Results

The three filters (6.12)–(6.14) are evaluated with measurements from [EXPA] (cf. Appendix D). The frequency-domain model is estimated from 3 impulse-response measurements per spatial axis (x, y, and z). Since it is assumed that identification measurements are zero outside of the measurement window, the DFT of the zero-padded impulse responses is used to approximate the DTFT. The zero-padding length is chosen such that the impulse responses, their DFT transforms, and the Wiener filter all have the same number of DFT bins as the corresponding signal segment⁷. The parameter σ²_n/σ²_u was determined in a coarse sweep over the given data and is set to 0.1. Finally, a low-pass filter (6th-order Butterworth filter with cut-off frequency 3500 Hz) is applied.

The results are summarized in Figure 6.6, with the relative MSE improvement (2.6) taken with respect to the unfiltered sensor measurements. The reference dynamometer measurements were filtered with method (6.12) (the multi-dimensional Wiener filter) and subsequently low-pass filtered before being used as the true input signal. The box plots depict the lower quartile, median, and upper quartile of the data; the whiskers mark the smallest and largest samples in the set. From the results shown in Figure 6.6, three observations can be made:

⁷In practice, a fixed-length FIR-type Wiener filter, as noted in Remark 6.2, would be computed from identification measurements to make the application more flexible. Since performance will be at most as good as in the IIR Wiener filter case, we chose the latter filter for presentation here.


[Figure: box plots of ∆MSE (in dB) for the three filters, shown separately for (a) the x axis, (b) the y axis, and (c) the z axis.]

Figure 6.6: Distribution of MSE improvement (in dB) for a set of 23 measurements taken during milling, obtained by means of different frequency-domain filters.


• For measurements from the x-axis channel and the y-axis channel, filtering crosstalk with either G(e^{jθ}) or K(e^{jθ}) improves the quality of the estimate only marginally.

• The estimation quality of signals measured on the z channel can be improved by taking interference from the other channels into account. However, median improvements are again very small, as mostly a few signals can be estimated much better by taking the crosstalk into account.

• Interestingly, merely suppressing the interference with the filter K(e^{jθ}) leads to considerably better results than using the joint estimator G(e^{jθ}) for the z-axis measurements.

6.5 Results

The MSE improvement factor (2.6) with respect to the unfiltered estimate, the sensor outputs yk, will be used to compare input estimation performance across the various methods. Specifically, for estimate ûk and true input uk:

\[
\Delta_{\mathrm{MSE}} \triangleq \frac{\sum_{k=1}^{K} (\hat{u}_k - u_k)^2}{\sum_{k=1}^{K} (y_k - u_k)^2}.
\tag{6.18}
\]
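In code, (6.18) is a one-liner; the improvement figures reported in the tables below presumably correspond to the sign-flipped decibel value of this ratio (an assumption about the reporting convention, since a ratio below 1 means improvement):

```python
import numpy as np

def delta_mse_db(u_hat, u, y):
    """MSE ratio (6.18) in dB; negative values indicate improvement over
    the raw sensor output y relative to the true input u."""
    ratio = np.sum((u_hat - u) ** 2) / np.sum((y - u) ** 2)
    return 10.0 * np.log10(ratio)
```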

We show two different scenarios: First, a setting where the relevant sensor characteristics are captured accurately with SSMs; secondly, we analyze an opposite scenario, where high-order techniques (i.e., a high-order Wiener filter) outperform the other methods. Details on the experimental setup are given in Appendix D.

6.5.1 System Identification Results

Single-channel measurements (channel x) are filtered with the proposed model-based input estimation filter using a third-order spline prior (cf. Appendix C), which provides the flexibility to model continuous-time signals with adjustable degrees of smoothness⁸. The noise variance ratio for the input estimation algorithm is set to λ = 10⁻³ to account for modeling errors, as the measurements are nearly noise-free.

⁸As an aside, the states in X^(U)_k represent estimates of derivatives of U (using the SSM basis in (C.1)). These estimates might be of interest in certain applications, e.g., peak detection.


The models are estimated via different system identification methods: standard EM with model order⁹ 6; EM using a decimated (by a factor 2) model (cf. Section 6.3.3) with orders 6 and 8; and the output-error identification method [43] with order 8. The latter method essentially finds the most likely SSM with deterministic inputs and noise at the output (i.e., the model in Figure 6.3 with σ²_U = 0 and σ²_N > 0). The output-error method is forced to fit the measurements well in the frequency range from 0 Hz to 3000 Hz by standard techniques [43]. The EM-based system identification methods employed the parameters σ²_U = 10⁻⁵ and λ = 100.

The FRFs derived from the estimated SSMs are shown in Figure 6.7 (the 6th-order decimated EM estimate is not shown). The largest discrepancy between the estimated FRFs is seen around 600 Hz and above 2000 Hz. Clearly, the 6th-order EM estimate cannot model the small mode at 600 Hz, but otherwise follows the 8th-order decimated EM estimate rather closely. Surprisingly, the output-error estimate also does not recover the mode at 600 Hz. Because output-error methods do not minimize error criteria related to the relative model error, the penalty incurred by not modeling the mode around 600 Hz is small (around −5 dB). The 8th-order decimated EM, which minimizes (6.6), on the other hand, puts more weight on fitting errors when the magnitude is small and therefore penalizes the estimate more (under the assumption that the parameter λ is chosen small enough, see Section 6.3.1).

Let us assess how these differences eventually affect input estimation performance. Results for all the signals in [EXPB] are presented in Table 6.1, together with the (single-input single-output) Wiener filter (cf. Section 6.4) for comparison. To assess model estimation quality, the MSE figures were obtained by first applying a low-pass filter with cut-off frequency 2500 Hz to the validation signal and the estimates. We observe that the decimated estimate with order 6 and standard EM identify the SSM equally well.

The additional mode at 600 Hz appears to have a large impact on input estimation performance. This is seen from the performance gains of the 8th-order model compared to all other methods. The reasons are that the machining forces have large spectral components in this frequency

⁹Higher model orders cannot be computed in a numerically stable way with standard message passing.


[Figure: magnitude (dB) and phase (deg) of the measured frequency response and of the 8th-order decimated EM, 8th-order output-error, and 6th-order EM estimates, for f from 0 to 3000 Hz.]

Figure 6.7: Empirical frequency response from measurements [EXPB] and frequency responses derived from various identification methods. Also marked with dashed lines are the inferred pole frequencies of the dynamometer-machine system.


                                    ∆MSE in dB
                      Model order     i      ii     iii
Standard EM                 6        9.4     8.7    10.6
Decimated EM                6       10.8     9.6    11.4
                            8       14.8    14.6    15.4
Output-error method         8       10.6     9.4    11.4
Wiener filter             n.a.       8.4     8.2     8.9

Table 6.1: Improvement in MSE compared to the unprocessed sensor output after applying various filter methods. Figures were computed with experimental data from [EXPB], which contains three signal sets i, ii, and iii.

band, and that, when the magnitude of the transfer function is small, model errors have a greater effect on input estimates than in the opposite case. Performance of the Wiener-filter estimate is, interestingly, slightly lower than for the model-based methods. This effect has also been observed with other datasets and is due to discrepancies between the system's characteristics during the identification measurements and the characteristics during machining and processing. The model-based estimate is more robust to these kinds of changes, as it tries to infer a low-order model that still fits well even after small changes to the FRF.

6.5.2 Force Estimation Performance

In Table 6.1 it was already observed that the presented input estimation method can improve performance considerably compared to spectral inverse methods such as the Wiener filter. Stability of the sensor's characteristic plays a key role in determining estimation performance. While the proposed model-based approach appears more robust than the Wiener filter, the prerequisite is that the relevant sensor characteristics are represented well by the model. Often a simple SSM of order 2 limits performance, as in Table 6.1, where models of order at least 8 are necessary to achieve the best performance. Let us analyze this last requirement. The same algorithm and settings


                                  ∆MSE in dB
                      Model order    Avg.   Min.   Max.
Standard EM                 6        9.9     7     13.3
Decimated EM                6        9       4.7   12.5
Wiener filter             n.a.      11       5.9   18.8

Table 6.2: Improvement in MSE compared to the unprocessed sensor output after applying various filter methods. Figures were computed with experimental data from [EXPA] and averaged over 17 signals.

as above are applied to data from another experiment ([EXPA], cf. Appendix D). The results are shown in Table 6.2. For this experimental data, the sensor's transfer characteristics cannot be captured well by a low-order SSM (see Figure D.1), and the Wiener filter outperforms both low-order methods.

6.5.3 Convergence Properties

The EM algorithm is well known to exhibit slow convergence under certain circumstances (cf. Section 2.2). Convergence results for the proposed EM method using data from [EXPB] are shown in Figure 6.9, namely the convergence of the estimated frequencies of the three largest modes towards the inferred pole frequencies marked in Figure 6.7. For the subsequent analysis, a conjugate-gradient-based optimizer of the log-likelihood was implemented (cf. Section 2.2.2); in each conjugate gradient step, the optimal step size is found by line-search methods. Observe that the strength of a mode mainly determines the convergence speed of the corresponding estimated frequency (the resonance mode is located at around 1800 Hz and a strong resonance/anti-resonance pair at around 200 Hz). While convergence is fast for the strongest modes, estimation of weaker poles requires many EM steps. Empirically, we observed "zig-zag" steps as illustrated in Figure 6.8. This effect arises because the EM updates for A and c factorize into independent updates (cf. Section 2.2.4) with Xk as hidden variables; the EM is


Figure 6.8: Schematic illustration of a typical issue encountered when estimating models with multiple resonance/anti-resonance modes: a poorly conditioned likelihood (i.e., a high condition number of the local Hessian of the likelihood) and decoupled optimization steps on c and A (responsible for "zig-zag" steps) cause very slow convergence of the algorithm (black path). The unfavorable conditioning of the problem manifests itself graphically in long and thin level sets of the likelihood.

not able to make joint updates, which would correspond to diagonal moves in Figure 6.8. This issue is more prominent for poorly conditioned problems, because their likelihood (locally) exhibits long and thin level sets, as in Figure 6.8. But this decoupling is not the only issue, as the conjugate gradient optimizer, which is known to overcome such "zig-zag" behavior [11], does not converge to the difficult-to-reach mode either. The other issue appears to be the high condition number of the Hessian of the log-likelihood, which corresponds approximately to W in the EM [48]. It can be shown that the convergence rate of first-order methods, such as conjugate gradient, is proportional to that condition number; the EM method, however, is known to (partly) overcome this limitation [83]. In fact, when initializing EM with either the EM solution of A or of c and using real measurement data, convergence occurs in roughly up to 10 iterations.


[Figure: estimated first, second, and third pole frequencies (Hz) versus iteration, for standard EM, decimated EM, and conjugate gradient.]

Figure 6.9: Pole-frequency estimates versus iteration for simulated 6th-order identification data. The true frequencies are marked with thin lines.


6.6 Conclusion

We have presented two solutions to improve force measurements from dynamometer sensors during machining processes. First, we have introduced a model-based approach combining the discrete-time input estimation techniques proposed by Bolliger et al. with input-signal priors. We have presented a probabilistic variational model for system identification, the pivotal step in this application, and have shown that ML system identification with this model renders better model-based estimators than general state-of-the-art system identification techniques. To overcome numerical issues that severely limit the applicability of EM-based estimation methods, a novel constrained EM algorithm in a decimated SSM was proposed. Secondly, frequency-domain approaches that take into account the three-dimensional nature of the sensor readings have been devised.

Practicality of the methods has been corroborated with experimental results on real machining measurement data. Superior performance of the variational probabilistic model for system identification has also been confirmed by real-world data.

While the model-based method assumes low-order systems and estimators, the frequency-domain approach implicitly relies on high-order system models. As we have demonstrated with different experimental data sets, both approaches have a distinct raison d'être: In settings where sensors have simple transfer functions, the model-based estimator has outperformed the Wiener filter and, in addition, compensation performs well on both stationary and non-stationary signals due to the time-domain processing. Otherwise, the Wiener filter has exhibited superior performance and is able to jointly process multi-dimensional signals.

Two interesting future directions arise: On one hand, the positive results of the variational probabilistic model encourage considering other model-based estimation tasks and developing techniques that do not necessarily render the most accurate model estimates, but rather the model that yields the best performance on the designated task. On the other hand, prior knowledge of the cutting forces may be used to develop more reliable, maybe even semi-blind, model identification methods.


Chapter 7

Sparse Bayesian Learning in Factor Graphs

In this chapter, we show probabilistic models and algorithms that combine sparse methods with Gaussian message passing. A special focus is on the Bayesian approach, which allows us to quantify "confidence" and "goodness-of-fit" of models or estimates. These measures prove to be essential for approaching certain blind estimation problems, e.g., blind deconvolution [41]. Concepts from this chapter are specialized to linear SSMs in Chapter 8.
We first introduce sparsity-promoting priors with variational representations (Section 7.1) and then integrate these priors into general factor graphs that express Gaussian functions and show estimation methods (Section 7.2). Eventually, we focus on fast methods: we show that a highly efficient method to recover sparse estimates may be derived directly from message passing and the tools of Section 7.2, and we present a new efficient analogue of this algorithm for multi-dimensional sparse features (Section 7.4).

7.1 Variational Prior Representation

Heavy tails of the pdf and non-differentiability at zero are commonly considered to induce sparsity [31, 37]. The latter is typically responsible for recovering exact zeros when used as a prior in an appropriate estimator (e.g., a MAP estimator) [66]. However, in practical applications, signals are typically not exactly sparse (i.e., with zeros), but only approximately so. This notion has been formalized and particularized to compressible¹ priors [13, 31]. Roughly, a sequence of iid random variables is compressible if (infinitely long) realizations are always approximated well (in quadratic distance) by a few large entries.
The following class of symmetric probability densities eventually allows us to obtain tractable (approximate) statistical models for compressible priors:

¹This term should not be confused with the information-theoretic concept of compression, which relates to the entropy of a probability density.

Definition 7.1: Super-Gaussian Probabilities [52]
A symmetric pdf p(x) is strongly super-Gaussian if p(√x) is log-concave on (0, ∞).

Observe that the strongly super-Gaussian property implies heavy tails, which are key to weakly-sparse realizations [31] of the corresponding probability distribution. In addition, strongly super-Gaussian pdfs always admit a specific kind of variational representation [52]: each strongly super-Gaussian pdf p(x) ≡ e^{−g(x²)} may be represented as

    p(x) = sup_{γ>0} N(x|0, γ) φ(γ⁻¹),    (7.1)

with N(x|0, γ) = (2πγ)^{−1/2} e^{−x²/(2γ)},

    φ(α) = √(2π/α) e^{g*(α/2)},

and g* the concave conjugate of g (see, e.g., [11]). In the sequel we define

    f(γ) ≜ φ(γ⁻¹).    (7.2)
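The representation (7.1)–(7.2) is easy to verify numerically. The following sketch checks it for the (unnormalized) Laplace pdf p(x) = e^{−λ|x|}, i.e., g(u) = λ√u, for which the concave conjugate gives f(γ) ∝ √γ e^{−λ²γ/2} (a minimal sketch; the grid, constants, and tolerance are arbitrary choices):

    import numpy as np

    # Variational check of (7.1) for the unnormalized Laplace pdf exp(-lam*|x|):
    # pointwise sup over gamma of N(x|0,gamma)*f(gamma), f derived via duality.
    lam = 2.0
    x = np.linspace(-3, 3, 601)
    gammas = np.logspace(-4, 2, 2000)          # grid approximating the sup

    def gauss(x, gamma):
        return np.exp(-x**2 / (2 * gamma)) / np.sqrt(2 * np.pi * gamma)

    f = np.sqrt(2 * np.pi * gammas) * np.exp(-lam**2 * gammas / 2)
    vals = gauss(x[:, None], gammas[None, :]) * f[None, :]
    p_var = vals.max(axis=1)                   # sup over gamma for every x

    assert np.allclose(p_var, np.exp(-lam * np.abs(x)), atol=1e-3)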

Example 7.1: Variational representation of the Student's t distribution
The Student's t distribution with ν > 0 degrees of freedom has pdf

    p(x) ∝ (1 + x²/ν)^{−(ν+1)/2},    (7.3)

which is easily seen to be strongly super-Gaussian. From the concave conjugate of (7.3), the variational representation follows:

    p(x) = sup_{γ>0} K_ν N(x|0, γ) γ^{−ν/2} e^{−ν/(2γ)},

with a constant K_ν ≜ 2π (2ν/(ν+1))^{ν+1} e^{ν+1}. The weight on γ thus corresponds to an inverse Gamma pdf.

The Student's t distribution is a member of a larger class of distributions that are provably compressible with suitable parameters (ν < 3 for Student's t) [31]. Compressibility means that iid realizations almost surely follow a power-law decay, which is one way to formalize the property "weakly sparse". Using such distributions may lead to more consistent Bayesian interpretations (e.g., of the a-posteriori mean) of the statistical model [30], which is of special interest when model parameters must also be estimated, as, e.g., for the proposed blind input estimation method or for experimental design tasks [37]. This generative view differs from the common "optimization view", in which sparsifying regularizers are interpreted as maximum a-posteriori estimates with a given prior (e.g., the LASSO problem is commonly linked to maximum a-posteriori estimation with an iid Laplacian prior).

7.1.1 Multi-Dimensional Features

Akin to the group Lasso in [84], the representation of priors in Figure 7.1 a) may be extended to multi-dimensional features U ∈ R^d with d > 1 by adapting (7.1) to

    p(u) = sup_{γ>0} N(u|0, γI) f(γ).    (7.4)

The results are analogous to the scalar case. The prior defined by (7.4) is useful for describing vectorial features; particular application examples include multi-input linear SSMs and glue factor² learning [59].

²The glue factor is a special factor augmenting factor graphs of SSMs. Among its many applications, its ability to localize pulses is particularly interesting in this context.

7.2 Sparse Bayesian Learning in Factor Graphs

Assume that U = U_1, …, U_L ∈ R are iid with super-Gaussian prior probability density p(u) = ∏_{k=1}^L p(u_k), and that the Gaussian³ likelihood function p(y|u) admits a representation as a factor graph [40]. Also let p(u_k|γ_k) = N(u_k|0, γ_k) for all k ∈ [1, L], and assume that the variational weights f(γ_k) are continuous. In later parts, we will particularize p(y|u) such that it corresponds to a linear SSM, but most of the current exposition extends beyond that.

³Non-Gaussian likelihoods may be treated by linearization (e.g., EKF) or by local variational approximation [78] (e.g., the sigmoid in classification).

Let û_k denote an estimate of u_k that follows from a standard Bayesian technique. One such technique is the MAP estimate

    û = argmax_u p(y|u) ∏_{k=1}^L p(u_k),    (7.5)

whose use is called a Type-I method [79]. Now, using the variational representation for all p(u_k), an interesting expression for the MAP estimate emerges:

    û_I ≜ argmax_u p(y|u) ∏_{k=1}^L max_{γ_k} p(u_k|γ_k) f(γ_k)
        = argmax_u max_γ p(y|u) p(u|γ) ∏_{k=1}^L f(γ_k).    (7.6)

The maximization in (7.6) can readily be seen as a (non-trivial) instance of the idea of opening and closing boxes [40]. Indeed, as illustrated in Figure 7.1, using the variational representation of the sparse priors and the graphical model representation of p(y|u), Type-I methods are potentially equivalent to max-product message passing in the joint graph of Figure 7.1.
As an alternative to MAP estimation, finding the posterior distribution p(u|y), which captures the inferred characteristics of u, is generally intractable. However, the factor graph in Figure 7.1 a) represents, by definition, a (scaled) Gaussian distribution for fixed γ. Assume that all γ_k can be set to fixed values such that the resulting Gaussian distribution approximates the intractable one well. One criterion to choose such an approximate distribution is to maximize the evidence p(y) [67, 80], since

    p(y) = ∫ p(y|u) p(u) du
         = ∫ p(y|u) ∏_{k=1}^L max_{γ_k} p(u_k|γ_k) f(γ_k) du
         ≥ max_{γ_1,…,γ_L} ∫ p(y|u) ∏_{k=1}^L p(u_k|γ_k) f(γ_k) du,    (7.7)


where the last expression is akin to an ML problem over γ with latent variables u. When the approximation captures much of the evidence, i.e., is close to p(y), the maximizing (Gaussian) distribution may be used as a tractable surrogate for p(u|y). This method, corresponding to sum-product message passing on all edges in the graph except γ, is known as a Type-II method [12]. Let us define

    γ_II ≜ argmax_{γ_1,…,γ_L} ∫ p(y|u) ∏_{k=1}^L p(u_k|γ_k) f(γ_k) du,    (7.8)

and, akin to (7.6), the mode, or equally the mean,

    û_II ≜ argmax_u p(u|y, γ_II).    (7.9)

For underdetermined sparse recovery, it was argued in [80] that the bound (7.7) is in fact sufficiently tight to obtain sparse û_II.
The relation between Type-I and Type-II estimates for a linear model

    y = Θu + n,    (7.10)

where Θ ∈ R^{m×L}, u ∈ R^L, y ∈ R^m, and n ∼ N(0, σ²I), can be uncovered through a variational characterization of the estimates [79]. This variational expression can, alternatively to the derivation in [79], be readily obtained with Lemma B.2: observing that the logarithm of the integral over u in (7.8) can be identified with Lemma B.2, we have

    γ_II, û_II = argmax_{γ,u} log|W_Y|/2 + q(y, u) + Σ_{k=1}^L ( log p(u_k|γ_k) + log f(γ_k) )
              = argmin_{γ,u} log|ΘΓΘᵀ + σ²I| + σ⁻²‖y − Θu‖² + Σ_{k=1}^L ( u_k²/γ_k − 2 log f(γ_k) ),    (7.11)

where Γ ≜ diag(γ) and, by (7.10), q(y, u) is the quadratic form corresponding to the Gaussian likelihood p(y|u) ∝ e^{q(y,u)}, or to the factor graph representation thereof.
Since we are interested in a-posteriori statistics rather than MAP point estimates, the subsequent exposition focuses on Type-II estimation and its computation methods.


[Figure 7.1: Part a) is a schematic of the "open box" principle: a strongly super-Gaussian pdf p(u) is expanded into a maximization factor. Part b) shows the convex-type representation (max) of the Laplace pdf; in this case f(γ) = λγ.]

It is noteworthy that computing Type-I estimates is generally more straightforward than computing Type-II estimates. Since (7.6) is a (nonlinear) optimization problem, a variety of optimization methods are applicable. One example is alternating minimization, which corresponds to iteratively reweighted least-squares approaches [14] when f(γ) ≡ 1. Similarly, using the strongly super-Gaussian prior p(x) = e^{−|x|^τ} results in the weighting used in the sparse recovery algorithm from [15]. With an additional regularization term that inhibits entries of γ from going to 0 too fast, these methods deliver very good performance in recovering sparse solutions. However, the additional damping term is a parameter that needs to be chosen carefully in practical applications [76], and the objective has been reported to have many local minima, in which optimization algorithms can get stuck [62].
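To make (7.11) concrete for the linear model (7.10): after u is maximized (equivalently, integrated) out, the γ-part of the Type-II objective is −2 log p(y|γ) = log|Σ_y| + yᵀΣ_y⁻¹y up to an additive constant, with Σ_y = σ²I + ΘΓΘᵀ. A minimal sketch, assuming f(γ) ≡ 1 (the function name is illustrative):

    import numpy as np

    # Negative, rescaled Type-II objective -2*log p(y|gamma) for y = Theta@u + n,
    # up to an additive constant; Sigma_y = sigma2*I + Theta@diag(gamma)@Theta.T.
    def neg_log_evidence(gamma, Theta, y, sigma2):
        Sigma_y = sigma2 * np.eye(len(y)) + (Theta * gamma) @ Theta.T
        sign, logdet = np.linalg.slogdet(Sigma_y)
        return logdet + float(y @ np.linalg.solve(Sigma_y, y))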

7.3 Multiplier Optimization

In the following, and if not stated otherwise, we assume that f(γ) ≡ 1. As seen above, to get sparse estimates, the likelihood must be maximized with respect to the multiplication factor⁴ γ. To this end, different update rules are given in Table 7.1. To distinguish them from other maximum-likelihood methods, these updates will be denoted multiplier update steps. Because the update rules in Table 7.1 boil down to local computations (i.e., message passing), one may envision schemes for much more complex models by building upon Gaussian message passing relations and multiplier optimization rules. The derivations of the expressions in Table 7.1 are deferred to Appendix B.5 on page 142.
A few observations are due. Firstly, when there are multiple (connected) features U, the EM-based and MacKay updates can be executed independently, i.e., in parallel on every edge γ. In contrast, the marginal likelihood maximization update (V.6) is derived under the assumption that all multipliers γ, except the updated one, are fixed. Secondly, when f(γ) is not constant, a closed-form expression for f(γ) given a super-Gaussian prior pdf p(x) may be found through convex duality (e.g., the Student's t distribution in Example 7.1). However, for general p(x), and hence general f(γ), only the EM multiplier update yields closed-form updates; the other multiplier updates can usually not be expressed in closed form.

⁴The multiplier ξ only enters the likelihood through its square. We therefore use γ = ξ² throughout the presentation and in the factor graph representation.

Convergence   Consider (7.5) with an uninformative weight f(γ) ≡ 1, and let

    ℓ(γ) ≜ log ∫ p(y|u) ∏_{k=1}^L N(u_k|0, γ_k) du

be the log-likelihood. Convergence to a local maximum⁵ is guaranteed for (V.6). The MacKay update rule, which is a gradient method, does not necessarily converge, and the EM update may converge to saddle points. Contrary to (V.6), the EM update cannot reintroduce features that have been turned off, i.e., whose γ was set to 0.
A further connection between the update expressions, which eventually sheds some light on their convergence properties, is obtained with (2.11) and (B.23): the gradient of the log-likelihood is easily seen to be

    dℓ(γ)/dγ = W_X − (W_X μ_X)².

⁵As the optimization problem over multiple γ is non-convex, this is as good as we can hope for.


Node: X = ξZ with Z ∼ N(0, 1), multiplier edge γ = ξ², and max-factor weight f(γ) (cf. Figure 7.1).

• EM (M step):
      γ ← m_X² + V_X    (V.1)
  or, equivalently,
      γ ← γ − γ² (W_X − (W_X μ_X)²).    (V.2)
  If f(γ) is not necessarily equal to 1 everywhere, but is chosen such that the max box corresponds to p(x), then [52]:
      γ ← ( −(1/x) d log p(x)/dx )^{−1/2}, evaluated at x = √(m_X² + V_X).    (V.3)

• MacKay update [67]:
      γ ← m_X² / (1 − V_X/γ)    (V.4)
  or, equivalently,
      γ ← ( (W_X μ_X)² / W_X ) γ.    (V.5)

• Marginal likelihood maximization update:
      γ ← argmax_γ p(y|γ),
  i.e.,
      γ ← ←m_X² − ←V_X  if ←m_X² ≥ ←V_X,  and γ ← 0 otherwise.    (V.6)

Table 7.1: Overview of multiplier update rules for feature selection.


Looking at (V.2), we recognize that the EM update is in fact a gradient step with variable step size γ²,

    γ ← γ − γ² dℓ(γ)/dγ,

on the log-likelihood ℓ(γ). Since the step size goes to zero faster than the estimate γ, EM updates evidently converge slowly for small γ, i.e., when a feature is almost pruned from the model. It is evident from (V.5) that this effect does not occur for MacKay updates. It appears that this behavior close to the zeros, together with specific initial values for γ, has strongly contributed to experimental findings and to the general belief that convergence of EM is "slow".

Accelerated Updates   Convergence of the EM and MacKay updates can be accelerated with a simple idea that was used in [64] for variational parameter estimation and in [56] for EM updates with a Gamma prior. The messages ←V_X and ←m_X are held constant, as they are independent of γ, and the updates on γ are treated as fixed-point equations: the updates are iterated several times. When doing so, the fixed point of the update equations corresponds to (V.6). This was also recognized in [56, 64].
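The following sketch illustrates the scalar updates (V.1), (V.4), and (V.6) and the acceleration just described, written in terms of an assumed incoming message (m_b, V_b) = (←m_X, ←V_X) held fixed; iterating either update then converges to the fixed point (V.6):

    import numpy as np

    # Posterior moments of X from the prior N(0, gamma) and the incoming message.
    def posterior(gamma, m_b, V_b):
        V = 1.0 / (1.0 / gamma + 1.0 / V_b)
        m = V * (m_b / V_b)
        return m, V

    def em_update(gamma, m_b, V_b):            # (V.1)
        m, V = posterior(gamma, m_b, V_b)
        return m**2 + V

    def mackay_update(gamma, m_b, V_b):        # (V.4)
        m, V = posterior(gamma, m_b, V_b)
        return m**2 / (1.0 - V / gamma)

    def ml_update(m_b, V_b):                   # (V.6), fixed point of the above
        return max(m_b**2 - V_b, 0.0)

    gamma, m_b, V_b = 1.0, 2.0, 0.5
    for _ in range(200):                       # accelerated (iterated) EM update
        gamma = em_update(gamma, m_b, V_b)
    print(gamma, ml_update(m_b, V_b))          # both approximately 3.5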

7.4 Fast Sparse Bayesian Learning

The fast sparse Bayesian learning (fast SBL) algorithm presented in [20] combines low computational requirements with good marginal likelihood optimization results. The appeal of this method lies in its closed-form update computations and in large computational savings due to the relatively low dimensions of the involved matrices.
The algorithm optimizes the marginal likelihood of a single feature U_n at a time. As shown in [20], the maximization can be expressed in closed form and exhibits a threshold behavior, removing features from the model in a single iteration. Various extensions of this algorithm have been developed: in [5], a hierarchical prior is added to the features' variances (corresponding to priors on γ_n in Figure 7.2), which results in adaptive feature estimation thresholds (cf. (7.13)), while in [64] variational mean-field methods are applied.


[Figure 7.2: Factor graph of a general SBL setup expressed as a recursive least-squares graph: features U_1, …, U_N, each with prior N(0, 1) scaled by a multiplier γ_n with weight f, are mapped by θ_1, …, θ_N and summed into the M-dimensional observation Y = y with additive noise N(0, σ_N²I).]

We show that the fast SBL algorithm follows from a recursive least-squares decomposition of our general likelihood (cf. Section 7.2) in combination with W-based message representations. Specifically, consider the factor graph depicted in Figure 7.2, which has been decomposed into its features U_1, …, U_N, and where Y is an M-dimensional observation vector.
Initially, all features are pruned from the model, i.e., γ_n = 0 for all n ∈ [1, N]. The pivotal quantities, the messages W_Y and W_Y μ_Y, are initialized as

    W_Y = (→V_Y + ←V_Y)⁻¹ = σ_N⁻² I,    W_Y μ_Y = W_Y (←m_Y − →m_Y) = σ_N⁻² y.    (7.12)

Now, a single feature U_n is updated at a time such that the marginal likelihood p(y|γ_{\n}, γ_n) is maximized, where γ_{\n} denotes the vector of all γ_k except γ_n. The maximizing value is

    γ_n = ←m_{U_n}² − ←V_{U_n}  if ←m_{U_n}² ≥ ←V_{U_n},  and γ_n = 0 otherwise.    (7.13)

Between updates (7.13), the messages can be updated efficiently by message passing. The complete iterative algorithm then follows from the rules presented in Chapter 4:


Algorithm 7.1: Fast Feature Selection
Initialize W_Y as in (7.12) and γ_n = 0 for all n ∈ [1, N]. Then the following steps are iterated until convergence of γ.

1) Select a feature U_n for updating according to a scheduling scheme. Commonly, the next feature is selected in a round-robin fashion or such that the marginal log-likelihood increment is maximized (greedy scheme). In the latter case, additional computations are necessary, as W_{U_i} and W_{U_i}μ_{U_i} must be evaluated for all i ∈ [1, N], from which the marginal log-likelihood increments are easily obtained.

2) Compute ←V_{U_n} and ←m_{U_n} from W_{U_n} = θ_nᵀ W_Y θ_n and W_{U_n}μ_{U_n} = θ_nᵀ W_Y μ_Y, which follow from (4.1) and (4.4): since ←V_{U_n} + →V_{U_n} = 1/W_{U_n} and ←m_{U_n} − →m_{U_n} = W_{U_n}μ_{U_n}/W_{U_n},

    ←V_{U_n} = 1/(θ_nᵀ W_Y θ_n) − γ_n,    ←m_{U_n} = W_{U_n}μ_{U_n} / (θ_nᵀ W_Y θ_n).    (7.14)

3) Set Δγ_n = γ_n − γ_n^new, where γ_n^new is given by (7.13), and then update γ_n ← γ_n^new.

4) Update W_Y and W_Y μ_Y. To this end, consider the change of the message that enters the "Σ" node. Straightforward use of the matrix inversion lemma [57] yields a rank-1 update for W_Y:

    (W_Y⁻¹ − Δγ_n θ_n θ_nᵀ)⁻¹ = W_Y − (W_Y θ_n θ_nᵀ W_Y) / (θ_nᵀ W_Y θ_n − Δγ_n⁻¹),

while W_Y μ_Y is then obtained by multiplying W_Y with y, since →m_{U_n} = 0 for all n ∈ [1, N].

Remark 7.1
It is easily seen that the variables W_{U_n} and W_{U_n}μ_{U_n} correspond to the variables S_n and Q_n used in the original fast SBL algorithm in [20].
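A minimal sketch of Algorithm 7.1 for the linear observation model of Figure 7.2, with round-robin scheduling, f(γ) ≡ 1, and a fixed number of sweeps in place of a convergence test (all names illustrative):

    import numpy as np

    def fast_sbl(Theta, y, sigma2, n_sweeps=50):
        M, N = Theta.shape
        gamma = np.zeros(N)
        W = np.eye(M) / sigma2                    # W_Y with all features pruned, (7.12)
        for _ in range(n_sweeps):
            for n in range(N):
                th = Theta[:, n]
                W_th = W @ th
                WU = float(th @ W_th)             # W_{U_n}
                WUmu = float(th @ (W @ y))        # W_{U_n} mu_{U_n}, since W_Y mu_Y = W_Y y
                V_b = 1.0 / WU - gamma[n]         # backward message, (7.14)
                m_b = WUmu / WU
                new = m_b**2 - V_b if m_b**2 >= V_b else 0.0   # threshold (7.13)
                d = new - gamma[n]
                if d != 0.0:
                    # rank-1 (Sherman-Morrison) update of W_Y, step 4)
                    W = W - np.outer(W_th, W_th) / (1.0 / d + WU)
                    gamma[n] = new
        u_hat = gamma * (Theta.T @ (W @ y))       # posterior mean of u given gamma
        return gamma, u_hat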

Implementation Aspects   The graph is cycle-free and linear Gaussian, so all computed marginals are exact. Step 2) of Algorithm 7.1 requires O(3M²) multiplications (3 matrix-vector products of dimension M), and updating W_Y in step 4) requires O(M²) multiplications (1 vector outer product of dimension M). For very large M and very sparse features, the computational complexity may be reduced by using V_X instead of W_Y, as rows and columns of V_X corresponding to indices with γ = 0 are zero as well. Thus, all matrix operations can be reduced to L-dimensional operations, with L the number of non-zero features; the variables W_X can then be recovered with (4.1). Of course, the scheduling scheme in step 1) might then be significantly more complex. Complexity can be further reduced by using FFT techniques.

Noise Estimation   When the noise variance σ_N² is unknown, EM-based estimation steps or marginal likelihood updates can be integrated into Algorithm 7.1 to concurrently estimate the noise variance [20, 67]. The necessary marginal parameters are readily obtained from W_Y by (4.2) and (4.3):

    V_Y = σ_N² I − σ_N⁴ W_Y,    m_Y = y − σ_N² W_Y y.

Note that with the current parameterization of the sparse prior, an update of σ_N² requires an expensive matrix inversion. Integrating σ_N² into the prior, as suggested in [36], resolves this issue.

7.4.1 Multi-Dimensional Fast Sparse Bayesian Learning

Now consider N multi-dimensional features U_n ∈ R^d with d > 1, as shown in Figure 7.3. We denote the d columns of the dictionary that correspond to feature U_n by Θ_n.

[Figure 7.3: To accommodate multi-dimensional features, the inputs in Figure 7.2 are replaced by the displayed vector feature U_n, with prior N(0, I), multiplier γ_n, and mapping Θ_n, while the rest of the factor graph remains unchanged.]

Our goal is to extend the advantages of fast SBL to the estimation of the, presumably, sparse multi-dimensional features. To this end, we recognize that most steps of the fast feature selection Algorithm 7.1 are easily extended to multi-dimensional U_n by using standard message passing rules. The difficulty lies in the adaptation of the marginal likelihood maximization: for d > 1, using W_{U_n} = (γ_n²I + ←V_{U_n})⁻¹ and W_{U_n}μ_{U_n} = W_{U_n}←m_{U_n}, the marginal log-likelihood (with respect to γ_n) is seen from (B.24) to be

    2 log p(y|γ_{\n}, γ_n) ∝ −log|γ_n²I + ←V_{U_n}| − ←m_{U_n}ᵀ (γ_n²I + ←V_{U_n})⁻¹ ←m_{U_n},    (7.15)

where log|γ_n²I + ←V_{U_n}| is concave and the quadratic term is convex in θ_n ≜ γ_n².

Two important observations can be made. First, the log-likelihood (7.15) is non-convex, because it is a sum of a concave and a convex term; moreover, a closed-form solution as in (7.13) is not readily available. Second, consider the eigenvalue decomposition

    ←V_{U_n} = D diag(Λ) Dᵀ,    (7.16)

where Λ ≜ (λ_1, …, λ_d), and define m̃ ≜ Dᵀ←m_{U_n}. Then (7.15) can be decomposed into a sum of d scalar problems as in (7.13):

    2 log p(y|γ_{\n}, γ_n²) ∝ −Σ_{l=1}^d ( log(γ_n² + λ_l) + m̃_l²/(γ_n² + λ_l) ).    (7.17)

Observe that each term in (7.17) is unimodal. Hence, we can conclude that if m̃_l² ≤ λ_l for all l ∈ [1, d], then θ_n = γ_n² = 0 is a maximizer of (7.15), and marginal log-likelihood optimization leads to thresholding of γ_n (i.e., γ_n = 0). The sparsifying characteristic of this method thus carries over to the multi-dimensional case.
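The decomposition (7.16)–(7.17) is easily verified numerically; a minimal sketch with an arbitrary positive definite ←V_{U_n} (all values illustrative):

    import numpy as np

    # Check that (7.15) equals the scalar decomposition (7.17) after rotating
    # the incoming message into the eigenbasis of V_b = <-V_{U_n}.
    rng = np.random.default_rng(1)
    d, theta = 4, 0.7                             # theta = gamma_n^2
    B = rng.normal(size=(d, d))
    V_b = B @ B.T + np.eye(d)
    m_b = rng.normal(size=d)

    lam, D = np.linalg.eigh(V_b)                  # (7.16)
    m_t = D.T @ m_b
    lhs = -(np.linalg.slogdet(theta * np.eye(d) + V_b)[1]
            + m_b @ np.linalg.solve(theta * np.eye(d) + V_b, m_b))
    rhs = -np.sum(np.log(theta + lam) + m_t**2 / (theta + lam))   # (7.17)
    assert np.isclose(lhs, rhs)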


Computational Aspects   One approach to maximize the marginal likelihood in (7.15) is to use the EM or MacKay updates from Table 7.1 for multi-dimensional features, combined with the iteration technique mentioned in Section 7.3. However, such algorithms often exhibit two drawbacks relevant to fast feature selection algorithms:

• Once a feature is pruned from the model, i.e., γ = 0, it is permanently removed from the model.

• Convergence of the iterative schemes can be very slow. As a consequence, many inner loops are necessary in step 3) when extending Algorithm 7.1 to multi-dimensional features.

We seek an (approximate) method to solve (7.13) that can reintroduce pruned features into the model and exhibits fast convergence. To this end, we propose to adapt Algorithm 7.1 to the following scheme:

Algorithm 7.2: Fast Multi-Dimensional Feature Selection
Initialize all γ_1^{[0]}, …, γ_N^{[0]} to 0. Then perform the following steps until convergence of all γ_k.

1) Select a feature U_n as in step 1) of Algorithm 7.1. All γ_k^{[t]} with k ≠ n are left unchanged.

2) Compute the messages (cf. Table 4.1):

    W_{U_n} ← Θ_nᵀ W_Y Θ_n,    (7.18)
    W_{U_n}μ_{U_n} ← Θ_nᵀ W_Y μ_Y.    (7.19)

If (γ_n^{[t]})² = 0, test in step 3) whether the feature should be introduced into the model; if (γ_n^{[t]})² > 0, re-estimate γ_n² in step 4).

3) First evaluate

    ς_n ← ‖W_{U_n}μ_{U_n}‖² − Tr W_{U_n}.    (7.20)

If ς_n > 0, the feature U_n is added to the model and γ_n is re-estimated (see next step). Otherwise, if ς_n ≤ 0, the feature is not added: set γ_n^{[t+1]} ← 0 and skip back to step 1).


4) Initialize

    G ← W_{U_n},

and, if more than one optimization iteration below is performed, also compute

    ←V_{U_n} = (Θ_nᵀ W_Y Θ_n)⁻¹ − (γ_n^{[t]})² I,    ←m_{U_n} = (Θ_nᵀ W_Y Θ_n)⁻¹ W_{U_n}μ_{U_n};

then γ_n is estimated by iterating the following two steps a fixed number of times:

i)
    (γ_n^{[t]})² ← max( Tr((←m_{U_n}←m_{U_n}ᵀ − ←V_{U_n}) G²) / Tr G², 0 )    (7.21)

ii)
    G ← (←V_{U_n} + (γ_n^{[t]})² I)⁻¹

5) Update W_Y and W_Y μ_Y analogously to Algorithm 7.1. Specifically, for W_Y:

    H ← (Δγ_n⁻¹ I + Θ_nᵀ W_Y Θ_n)⁻¹,
    W_Y ← W_Y − W_Y Θ_n H Θ_nᵀ W_Y,

where Δγ_n = (γ_n^{[t+1]})² − (γ_n^{[t]})².

Before presenting performance results for this algorithm, we provide insight into (7.20) and (7.21). First note that the computation of ς_n in (7.20) is very lightweight and thus allows ruling out features with few computations, akin to (7.13). Furthermore, we have

    γ_n² = 0 ⇒ ς_n ≤ 0,    (7.22)

i.e., ς_n ≤ 0 is a necessary condition for removing a feature. The empirical results shown below indicate that it is highly effective as well. The proof of (7.22) is presented in Appendix B.5 on page 145.


The iterative optimization in (7.21) is a fixed-point iteration obtained by setting the gradient of the marginal log-likelihood (B.25), with respect to θ_n ≜ γ_n², to zero:

    −2 (d/dθ_n) log p(y|γ_{\n}, θ_n) = 0
    −Tr(W_{U_n}) + Tr(W_{U_n}² ←m_{U_n}←m_{U_n}ᵀ) = 0
    Tr((θ_n I + ←V_{U_n}) W_{U_n}²) = Tr(W_{U_n}² ←m_{U_n}←m_{U_n}ᵀ)
    θ_n Tr(W_{U_n}²) = Tr(W_{U_n}² (←m_{U_n}←m_{U_n}ᵀ − ←V_{U_n}))
    γ_n² = Tr(W_{U_n}² (←m_{U_n}←m_{U_n}ᵀ − ←V_{U_n})) / Tr(W_{U_n}²).

Since θ_n = γ_n² is non-negative, negative values are projected back onto zero. Empirical evidence shows that the proposed fixed-point iteration converges very quickly, also to the value 0, where a feature may subsequently be pruned from the model.
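A sketch of the pruning test (7.20) and the fixed-point iteration (7.21) for a single d-dimensional feature, written in terms of the incoming message (←m_{U_n}, ←V_{U_n}) and the current θ = γ_n²; the function name and defaults are illustrative:

    import numpy as np

    def update_multiplier(m_b, V_b, theta0=0.0, n_iter=2):
        d = len(m_b)
        W = np.linalg.inv(V_b + theta0 * np.eye(d))        # W_{U_n}
        zeta = float((W @ m_b) @ (W @ m_b)) - np.trace(W)  # test (7.20)
        if theta0 == 0.0 and zeta <= 0.0:
            return 0.0                                     # feature stays pruned
        theta = theta0
        outer = np.outer(m_b, m_b) - V_b
        for _ in range(n_iter):                            # fixed-point steps (7.21)
            G = np.linalg.inv(V_b + theta * np.eye(d))
            G2 = G @ G
            theta = max(float(np.trace(outer @ G2)) / float(np.trace(G2)), 0.0)
        return theta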

Performance   We show simulations comparing the performance of the proposed multi-dimensional extension of the fast feature selection algorithm with the exact algorithm, which uses a line search to maximize the marginal log-likelihood (step 3) of Algorithm 7.1) in each update, and with the oracle estimator, i.e., the MMSE estimator of the features with known nonzero entries. For all methods, the same scheduling scheme and convergence criterion have been used: the algorithms iterate through all feature groups in a round-robin fashion and are stopped once the difference of the estimated Γ between two loops (through all feature groups) is less than 0.1% and the number of active (γ_n² > 0) features has not changed.
To generate data for the estimation of sparse multi-dimensional features, we construct a basis Φ ∈ R^{M×N} by drawing MN samples from a zero-mean normal distribution with variance 1/M (unit-energy basis vectors). The feature vectors X ∈ R^N have a fixed number K of nonzero sub-vectors, constructed by randomly setting d-dimensional sub-vectors of X to 1. The measurements are corrupted by Gaussian noise with variance σ_N². Simulation results for N = 200, M = 100, K = 5, and d = 5 are shown in Figure 7.4.


[Figure 7.4: Results of the sparse multi-dimensional feature estimation algorithms (1 fixed-point iteration, 2 fixed-point iterations, exact maximization, oracle), averaged over 100 basis, coefficient, and noise realizations. Panel (a) shows the NMSE versus SNR; panel (b) shows the average number of maximization steps until convergence at different SNR points.]


The SNR is defined in the standard way as

    SNR ≜ Kd/(Mσ_N²),

and the NMSE as in (2.5), based on the MMSE estimate ŵ. The proposed algorithm is executed with a fixed number of iterations of step (7.21). We observe that one iteration yields nearly optimal performance, while merely two iterations suffice to perform as well as the exact algorithm. The former setting is particularly interesting, as it is free of complex matrix operations such as matrix inversions or eigendecompositions, and thus closest in spirit to the standard (scalar) fast SBL algorithm. The gap in error performance between the fast feature selection algorithms and the oracle estimator is constant over the considered SNRs and can be attributed to suboptimal solutions, due to the non-convex nature of the full log-likelihood.
The benefit of criterion (7.20) is evident from the bottom plot in Figure 7.4: the proposed method prunes many features from the model without having to perform costly optimization of the marginal likelihood. Significant computational savings can be expected in practical applications.
Finally, it remains to check whether the reduction in maximization iterations comes at the expense of a higher number of active basis elements during processing and at convergence. Simulations show that the active set sizes are almost the same for all algorithms.
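For reference, the synthetic setup just described may be sketched as follows (the seed and the SNR value are arbitrary):

    import numpy as np

    # Group-sparse test problem: unit-energy dictionary, K active d-dim groups.
    rng = np.random.default_rng(0)
    M, N, K, d = 100, 200, 5, 5
    Phi = rng.normal(0.0, 1.0 / np.sqrt(M), size=(M, N))   # column energy ~ 1
    x = np.zeros(N)
    groups = rng.choice(N // d, size=K, replace=False)
    for g in groups:
        x[g * d:(g + 1) * d] = 1.0
    snr_db = 20.0
    sigma2 = K * d / (M * 10 ** (snr_db / 10))             # from SNR = Kd/(M*sigma2)
    y = Phi @ x + rng.normal(0.0, np.sqrt(sigma2), size=M)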

7.5 Conclusion

Variance estimation of zero-mean Gaussian random variables in probability models can be leveraged to infer sparsifying and/or heavy-tailed priors in linear Gaussian models, essentially enabling sparse recovery of the respective signals. We have elaborated on the application of this concept to message passing algorithms and have formalized it for factor graphs representing linear Gaussian probability models and for Gaussian message-passing methods. The sparse estimation approach with linear Gaussian models goes as follows:

1. Priors or weights on the variances are chosen according to a desiredheavy-tailed or sparsifying distribution.


2. MAP/ML estimates of the multiplication factors, viz. the variances of the variational priors, are computed.

3. The variances are fixed to their ML/MAP estimates, and standard inference methods for linear Gaussian models can be employed.

We have presented various options to implement the second step above and discussed computational trade-offs.
In the second part, we have presented a novel algorithm for fast recovery of multi-dimensional (group-sparse) inputs, based on a dual-precision formulation of fast SBL algorithms and a new (approximate) marginal likelihood criterion. Experimental results have corroborated that significant computational savings compared to standard marginal-likelihood-based SBL algorithms are achievable without compromising estimation performance.


Chapter 8

Sparse Input Estimation in State Space Models

In this chapter, we concentrate on sparse-input estimation in SSMs and, in a second step, develop a joint input and model estimation method. To this end, the general sparse estimation framework introduced in Chapter 7 is specialized to SSMs, and novel methods for sparse input estimation in SSMs (Section 8.1) are conceived. With a focus on the large variety of practical applications where neither the signal model nor the input signal is known a-priori, we present a blind deconvolution method that assumes sparse unknown input signals. This estimation method, discussed in Section 8.2, simultaneously infers an SSM representation and an input signal. Its effectiveness and estimation performance are substantiated by means of experimental results from a real-world application.

8.1 Sparse Input Estimation

Let us particularize the general setting from Section 7.2: consider a single-input single-output linear SSM (cf. (2.3)) with the observed signal y_1, …, y_L ∈ R. The sparse input-estimation problem assumes that the SSM is driven by a weakly sparse signal U_0, …, U_{L−1} ∈ R, iid distributed with a compressible prior p(u_k) according to Section 7.1. Our goal is to estimate the input. The complete model results in the factor graph shown in Figure 8.1, which is indeed a special case of our general framework introduced in Section 7.2.


[Figure 8.1: A factor graph representation of our sparse-input SSM: the state recursion X_k = A X'_{k−1} + b U_{k−1}, where the input U_{k−1} is a standard normal N(0, 1) scaled through a max box, and the observation y_k = c X'_k plus noise N(0, σ_N²). Here p(y|u) is decomposed into the factors defined by the SSM of Section 8.1.]

Another way to generalize the statistical model in Figure 8.1 is to use inputs and outputs with different rates. When the input is sampled at a higher rate than the observations, we obtain a super-resolution problem. On the other hand, when the output is available with shorter periods relative to the input process, the setting essentially encompasses multi-hypothesis testing problems or multiple glue factor learning. The subsequent treatment readily extends to the aforementioned cases.


8.1.1 Algorithm

The proposed algorithm relies on the factor graph depicted in Figure 8.1 and uses W and Wμ to efficiently compute the posterior statistics. Once the posterior statistics are computed, an update of the max box is performed according to an update rule from Table 7.1. A complete algorithm statement is provided in Section A.4.
A number of key features make the proposed Gaussian message passing scheme, based on dual-precision messages, a particularly efficient choice for sparse input estimation and a highly efficient algorithm in general. First, it is free of computationally costly matrix inversions. Second, the marginal statistics of U_k for any k ∈ [1, L] are simple projections of the dual-precision quantities W_{X_k} and W_{X_k}μ_{X_k} used in Kalman smoothing:

    V_{U_k} = σ²γ_k − (σ²γ_k)² bᵀ W_{X_k} b,
    m_{U_k} = −σ²γ_k bᵀ W_{X_k} μ_{X_k}.

In contrast to standard sparse recovery methods, and owing to the local character of the message passing algorithm, the proposed method scales linearly with the signal length L.
With respect to the numerical properties of the algorithm, as discussed in Chapter 3, the inversion-free computations make sparse input estimation applicable to complex high-order state-space models.
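For concreteness, the following simplified sketch implements the overall scheme with a standard RTS smoother standing in for the thesis's inversion-free dual-precision recursions, and applies the EM update (V.1) to every input variance; A, b, c, y, and the noise variance are assumed given, and all names are illustrative:

    import numpy as np

    def sparse_input_em(A, b, c, y, sigma2_n, n_em=10):
        n, L = A.shape[0], len(y)
        gamma = np.ones(L)                        # input variances (multipliers)
        for _ in range(n_em):
            # forward pass (Kalman filter) with input variance gamma[k] at step k
            mf = [np.zeros(n)]; Vf = [1e3 * np.eye(n)]
            mp, Vp = [], []
            for k in range(L):
                mp.append(A @ mf[k])
                Vp.append(A @ Vf[k] @ A.T + gamma[k] * np.outer(b, b))
                s = float(c @ Vp[k] @ c) + sigma2_n
                K = (Vp[k] @ c) / s
                mf.append(mp[k] + K * (y[k] - float(c @ mp[k])))
                Vf.append(Vp[k] - np.outer(K, Vp[k] @ c))
            # backward pass (RTS) with EM update of gamma from the U_k posterior
            ms, Vs = mf[L], Vf[L]
            for k in range(L - 1, -1, -1):
                Vp_inv = np.linalg.inv(Vp[k])
                J = gamma[k] * (b @ Vp_inv)       # links U_k to X_{k+1}
                mU = float(J @ (ms - mp[k]))
                VU = gamma[k] - float(J @ Vp[k] @ J) + float(J @ Vs @ J)
                gamma[k] = mU**2 + VU             # EM update (V.1)
                G = Vf[k] @ A.T @ Vp_inv          # smoother gain
                ms = mf[k] + G @ (ms - mp[k])
                Vs = Vf[k] + G @ (Vs - Vp[k]) @ G.T
        return gamma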

8.1.2 Simulation Results

The favorable recovery performance of the proposed sparse input estimator is shown with a synthetic example. An exactly sparse signal is generated, passed through a highly resonating filter of order 12, and corrupted by white Gaussian noise. Refer to Figure 8.2 and Figure 8.3 for the original and observed signals at SNRs of 30 dB and 10 dB, respectively.
In both examples, the Kalman-smoother-based scheme described in Section 8.1.1 was used, with an iid Student's t prior (ν = 10⁻⁴) on the sparse input signal samples. The variances, viz. multiplication factors, were obtained with the EM-based method from Table 7.1 in 10 iterations.


[Figure 8.2: Input estimation with an SSM and a sparsity-promoting iid prior on simulated data with an SNR of 30 dB: the observed signal y (top) and the true input U, the estimate Û, and the LASSO-based estimate (bottom).]


[Figure 8.3: Input estimation with an SSM and a sparsity-promoting iid prior on simulated data with an SNR of 10 dB: the observed signal y (top) and the true input U with the estimate Û (bottom).]


The LASSO estimate¹ [33] of the sparse input signal is also depicted for comparison. We observe that, for this example, the LASSO estimate does not work well due to the strong coherence of the dictionary, which originates in the slowly-decaying character of the impulse responses. Our proposed Bayesian Type-II estimator appears to cope with coherence much better, which confirms observations made in [77]. Apart from potentially inferior estimation performance for sparse input estimation compared to the proposed Kalman-filter-based scheme, LASSO-based methods are computationally more demanding and do not scale linearly in the signal length.

¹The regularization weight is set such that the estimation error is minimal.

8.2 Blind Deconvolution

In practical applications, in addition to the unknown input signal, the underlying model, i.e., the SSM representation, might also be unknown a-priori. Sensible estimates may still emerge by taking advantage of the (weak) sparsity assumption imposed on the input signal to eliminate the inherent ambiguity between the unknown input signal and the unknown SSM representation. In general, methods that simultaneously estimate a random process and a dynamical system are termed blind deconvolution schemes.
In order to derive a blind scheme, let us write (7.11) as a joint minimization problem:

    argmin_{γ,H} −2 log p(y|γ, H)
      = argmin_{γ,H} min_u log|HΓHᵀ + σ²I| + σ⁻²‖y − Hu‖² + Σ_{k=0}^L ( u_k²/γ_k − log f(γ_k) ).    (8.1)

To avoid an obvious scaling ambiguity, we impose that H is normalized, i.e., that the corresponding impulse response h = h_1, h_2, … of the SSM has energy 1.

8.2.1 Type-I Estimators vs. Type-II Estimators

While (8.1) is a Type-II estimate of the compressible input U_k, one may wonder whether Type-I estimators such as LASSO, which correspond to the similar optimization problem

    argmin_{u,H} −2 log p(y, u|H) = argmin_{u,γ,H} σ⁻²‖y − Hu‖² + Σ_{k=0}^L ( u_k²/γ_k + log f(γ_k) ),    (8.2)

may also prove useful for blind deconvolution. The answer is that the Type-II estimator should always be preferred in this case.
To elaborate on this statement, observe that the sole difference between (8.2) and (8.1) is the log-determinant term. The term penalizes large γ values. It also regularizes the structure of the SSM represented by H in a highly desirable way. If the SSM is trivial, i.e., A = 0, b = [1, 0, …, 0]ᵀ, and c = [1, 0, …, 0], then H is the identity operator. This choice of SSM is always penalized by the log-determinant term: from Hadamard's inequality we observe that

    log|HΓHᵀ + σ²I| ≤ Σ_{k=0}^L log([HΓHᵀ]_{k,k} + σ²),    (8.3)

where equality holds when HΓHᵀ is a diagonal matrix. Since, typically, √γ_k and u_k will be similar to y_k (in both (8.2) and (8.1), due to the quadratic term), the diagonal entries

    [HΓHᵀ]_{n,n} = Σ_{k=0}^{n−1} h_k² γ_{n−k}

can be seen to be roughly equal to y_n². Looking at (8.3), we then observe that the right-hand side corresponds to the trivial SSM estimate H. We can thus conclude that the log-determinant term will always penalize² the trivial SSM compared to other SSMs, the penalty acting as a driver towards parsimonious SSMs.

²The penalties added by the log-determinant term are commonly large. This may be seen, for example, when the impulse responses approximately form an orthogonal basis (i.e., time-shifted versions of the impulse response are mutually orthonormal) and, for simplicity, the number of observations equals the number of inputs. After eigenvalue decomposition, the log-determinant then corresponds to Σ_{k=0}^L log(γ_k + σ²), and as σ² is small and many γ_k → 0, the term can take large negative values. For H = I, the log-determinant will be approximately Σ_{k=0}^L log(y_k² + σ²) and thus much larger.


Experimentally, it was observed that LASSO-based blind deconvolution tends to converge to trivial SSM estimates and thus gives meaningless results.

8.2.2 Algorithm

The objective (8.1) may be conveniently solved by coordinate descent in H and γ, while the minimization over u corresponds to the computation of the Gaussian messages in the graph shown in Figure 8.1. Both descent steps are derived from the corresponding EM update. Specifically, the proposed blind deconvolution method follows from the EM-based multiplier optimization rules (cf. Table 7.1) for the local EM boxes in Figure 8.1 and from EM-based identification of the SSM in AR form. In particular, we alternate between i) sparse input estimation using the current SSM estimate and ii) a system identification step, similar to Algorithm A.4, to estimate A and c.
The computational complexity of both maximization steps is similar, as both necessitate a full Gaussian message passing smoothing step in order to obtain the marginal statistics of U_k or X_k.
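Schematically, the alternating scheme may be sketched as the following outer loop, where input_step stands in for a sparse input estimator (e.g., the sparse_input_em sketch of Section 8.1.1) and ident_step for one EM system identification update of A and c; both helpers, the function name, and the iteration count are illustrative:

    def blind_deconvolution(A, b, c, y, input_step, ident_step, n_iters=20):
        # Alternate i) sparse input estimation for the current SSM estimate with
        # ii) an EM identification step for (A, c), cf. Section 8.2.2.
        for _ in range(n_iters):
            gamma = input_step(A, b, c, y)          # step i)
            A, c = ident_step(A, b, c, y, gamma)    # step ii)
        return A, c, gamma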

Initialization

The input estimate in the first iteration can be considered proportional to the energy in y, weighted by the spectrum of the initial SSM. With no prior knowledge on the model or on u, an instantaneous energy detector is a sensible initial choice. To this end, the initial SSM is set to

    A = 0,    (8.4)

b to the all-ones vector, and c drawn randomly and scaled such that cb = 1.
This type of initialization may also be argued for by considering the EM updates for A and c. We focus on c, but the analogous applies to A. The M-step for c, when estimated from multiple observations y_k with k ∈ [1, K],

    ĉ = ( Σ_{k=1}^K ←W_{c_k} )⁻¹ Σ_{k=1}^K ←W_{c_k} ←m_{c_k} = W_c⁻¹ W_c m_c,    (8.5)


can be interpreted as a weighted sum of local estimates W_c^{(k)} and W_c^{(k)} m_c^{(k)} for every k. The weights W_c^{(k)} are computed as

    W_c^{(k)} = V_{X_k} + m_{X_k} m_{X_k}ᵀ.

Returning to the initialization, specifically the first system identification update, we recognize that V_{X_k} only depends on γ_k, and m_{X_k} m_{X_k}ᵀ on y_k² and γ_k. In the first iteration, with the initialization A = 0, the weight W_c^{(k)} is proportional to the instantaneous observed energy. Hence, samples with high instantaneous energy, which would be expected to contain most of the signal information, also get the largest weight on their local estimate once the update is computed as in (8.5).

Energy-Constrained Updates

Since the EM algorithm maximizes the likelihood stepwise, the unspecific input estimates in the first iterations imply a large ambiguity in the SSM. Commonly, this results in system estimates whose gain is far from 1.
With a non-trivial prior on γ, scaling of the dictionary or of the sparse signal affects our estimation algorithms in two relevant ways: the Bayesian model might not describe the observations well enough, or the methods used to optimize the likelihood, such as EM or gradient updates, will exhibit poor convergence properties.
To prevent this effect, a constraint is imposed on the EM step for c. Under the assumption that the SSM is stable, the energy of the SSM's impulse response h = h_1, h_2, … is forced to 1. This is, of course, equivalent to a norm constraint on the dictionary columns. Consider the following expression for the energy of the impulse response of our SSM:

    ‖h‖² = Σ_{n=1}^∞ h_n² = Σ_{n=0}^∞ c Aⁿ b bᵀ (Aᵀ)ⁿ cᵀ = c ( Σ_{n=0}^∞ Aⁿ b bᵀ (Aᵀ)ⁿ ) cᵀ ≜ c C(A, bbᵀ) cᵀ,    (8.6)


where we used h_{n+1} = c Aⁿ b, and where C(A, bbᵀ) is known as the controllability Gramian [38] and can be obtained from the solution of a Lyapunov equation with A and bbᵀ. When A is (approximately) constant during an EM update of the SSM, a quadratic constraint on c can be conceived from (8.6) and added to the (t+1)-th M-step for c:

    min_c c W_c cᵀ − 2 c W_c m_c    s.t.  c C(A^{[t]}, bbᵀ) cᵀ = 1.

This optimization problem is a quadratically constrained quadratic program and can be solved with Newton's method [29]. Since the current value c^{[t]} will typically be close to the optimal value, we choose to linearize the quadratic constraint around the current value, which results in a quadratic programming problem with equality constraints; this can be solved in closed form using an augmented linear system of equations.

8.2.3 Simulation Results

To evaluate the performance of the blind deconvolution algorithm, we create a highly resonating 4th-order system³. An input signal u ∈ {−1, 0, 1}^N of length N = 300 is randomly generated such that only 8 components are nonzero. Furthermore, the observations y are corrupted by noise such that the SNR is 30 dB.
The algorithm is initialized according to Section 8.2.2, and after 20 EM iterations the final result, shown in Figure 8.4, is obtained⁴. The time shift is largely due to the ambiguity in the choice of the observation vector c relative to the estimated signal U; i.e., there are often multiple combinations of c and U that describe the observations well. This ambiguity cannot be circumvented without enforcing more constraints on c, U, or both. Furthermore, in many applications a small constant time shift is negligible.

³The SSM has poles with absolute value ρ = 0.99 and phase randomly chosen from [0, π), while the zeros are drawn completely randomly.
⁴The sign of the input estimate, which is obviously non-identifiable, was always adapted such that it matches the true signal best.


[Figure 8.4: Blind input estimation example using a 4th-order SSM and simulated data y of length 300 with an SNR of 30 dB: the measurements y (top) and the true input U with the estimate Û (bottom).]


[Figure 8.5: A BCG measurement and an ECG measurement recorded synchronously, each with a length of 1000 samples.]

8.2.4 Heart Beat Detection for Ballistocardiography

Ballistocardiography (BCG), the measurement of the forces exerted on the body by heart contraction and the subsequent blood ejection, is a method for non-obstructive monitoring of heart conditions [2]. Certain relevant physiological parameters (e.g., heart-rate variability) require accurate heart-beat time stamps; this means that individual heart beats must be detected, which is considerably more difficult than detecting strictly periodic pulses. The task is severely complicated by the typical characteristics of BCG measurements: the beats cause large oscillations in the mechanical measurement system, with the result that individual beats cannot be discerned anymore. A typical measurement series is depicted in Figure 8.5, where the BCG signal is plotted alongside an electrocardiography (ECG) measurement recorded synchronously⁵. An additional complication in this application are movements of the patient or test subject, which change the shape of the signals and limit the applicability of signal-template or pattern-matching techniques.
We applied the proposed blind input estimation method to BCG measurements with the goal of detecting single heart beats. To this end, a 4th-order single-input single-output SSM is employed, where both SSM parameters (the A matrix and the c vector) are estimated from the data, initialized as described in Section 8.2.2. The weakly-sparse input process is modeled with a Student's t prior with ν = 1. The length of the signal was 3000, and the algorithm was run for 100 iterations.
The final input estimates are thresholded to read off the peaks' time stamps; when two detected peaks are very close together (less than 20 samples or 100 ms), the smaller peak is discarded (see the sketch below). To assess the performance of our proposed method, we compare it with a likelihood-based filter from [72]. The likelihood-based filter utilizes a 16th-order SSM, which must first be estimated on a much longer signal window using the ECG signal as a proxy for the unknown input; strictly speaking, that likelihood filter is therefore not a blind estimator. In addition, its SSM employs two-dimensional outputs, as two-channel BCG measurements are available.
The thresholded γ, the corresponding ECG measurement, and the result of the likelihood-based filter⁶ are shown in Figure 8.6. If two peaks of the estimated γ are less than 20 samples (100 ms) apart, the larger peak is selected and marked by a circle. Our input estimate has been shifted, because the electrical excitation of the heart and the physical effects of the blood flow are shifted relative to each other, and because blind input estimation estimates the input signal only up to a time shift.

⁵The BCG and ECG measurements were recorded by Daniel Waltisberger from the electronics laboratory (IFE) at ETH. The provided BCG signal was decimated from 1000 Hz to 200 Hz, to reduce noise, but primarily to reduce computational costs. A 10th-order, zero-phase bandpass filter (0.5 Hz–30 Hz) was combined with a second-order IIR notch filter at 50 Hz. The lower cut-off was required to filter respiratory movements, the upper cut-off attenuates noise, and the notch removes the 50 Hz mains.

6The likelihood-based filter processes data in a window-based fashion. The lengthof a window is adjusted such that it contains one heart beat with high probability. Thelargest likelihood peak in each window corresponds to the estimated beat position.
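The peak read-off may be sketched as follows; detect_beats and the threshold are illustrative, and 20 samples correspond to 100 ms at the 200 Hz sampling rate:

    import numpy as np

    # Threshold the estimated input variances gamma and, among peaks closer
    # than min_dist samples, keep only the larger one.
    def detect_beats(gamma, thresh, min_dist=20):
        idx = np.flatnonzero(gamma > thresh)
        idx = idx[np.argsort(-gamma[idx])]        # strongest peaks first
        kept = []
        for i in idx:
            if all(abs(i - j) >= min_dist for j in kept):
                kept.append(i)
        return np.sort(np.array(kept, dtype=int))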


[Figure 8.6: Blind pulse (heart beat) detection from a ballistocardiographic signal. The top plot shows the estimated time-varying input variance γ and the detected heart beats. The ECG signal (middle) serves for validation (i.e., provides the ground truth). The bottom plot shows the result of the (non-blind) heart-beat detection method of [72].]


It is evident that blind input estimation detects individual heart beats well and yields performance comparable to the non-blind likelihood filter. Interestingly, when multiple close inputs are estimated, the likelihood filter suffers a drop in the overall likelihood level (see, e.g., around sampling index 1900), which indicates that the model SSM does not describe the data well around that peak.

8.3 Conclusions and Outlook

We have developed an efficient and robust method to estimate weakly sparse (input) signals, based on zero-mean normal inputs with time-varying variance and (iterative) Gaussian message passing. The approach was then extended to the case where the linear system is unknown and must be estimated as well. The practicality of the proposed approach has been demonstrated with a real-world example.
Our algorithm has three appealing properties that set it apart from other common sparse recovery methods. Firstly, no sparsity-controlling parameter needs to be set to obtain good estimation performance, unlike, for instance, the regularization parameter in LASSO methods; moreover, it has been observed that even when sparsity is low, our method converges stably towards standard MMSE input estimation. Secondly, input estimation performs well in models with slowly decaying impulse responses; typically, this kind of sparse recovery problem is accompanied by highly coherent dictionaries. Thirdly, the proposed blind input and model estimation method appears to rule out trivial solutions (e.g., a one-to-one pass-through SSM); in contrast to standard sparse methods for blind estimation in, e.g., image processing, our scheme requires only mild additional constraints on the estimated system model.
Generalizations of the proposed weakly-sparse input estimation method to multi-dimensional observations or non-stationary conditions (e.g., time-varying SSMs) are immediate. The ease with which these cases are covered comes from the flexibility of the SSM-based approach. Another advantage of SSM-based Gaussian message passing is the ease of adapting fixed-interval algorithms to recursive (online) settings and vice versa; recursive algorithms for sparse input estimation and blind estimation are thus a natural research direction, for which the idea of recursive Kalman smoothing may be convenient.
Another interesting extension of the weakly-sparse input signal prior are multi-dimensional inputs based on group-sparse features (cf. Section 7.1.1). For instance, blind system identification for an application as in Chapter 6, where distinct hammer hits on a sensor (a sparse input) with varying spatial direction are used for model identification, might become practical.
Moreover, distinct sampling rates for inputs and outputs open a wide range of options, from super-resolution to multiple-hypothesis-testing types of problems (see also Section 8.1). In particular, combining weakly sparse input estimation with continuous-time input estimation or (glue-factor-based) pulse population learning might lead to interesting novel Bayesian methods. The former would enable the recovery of impulses in a continuous-time input signal that lie off the sampling grid; in the latter case, our blind scheme might be adapted to localize pulses prior to learning.


Appendix A

Algorithm Statements

A.1 Square-Root Smoothing

Algorithm A.1: Smoothing with $\overrightarrow{C}$ and $\overleftarrow{S}$

Given a length-$K$ SSM and initial messages $\overrightarrow{\mu}_\mathrm{init}$ and $\overleftarrow{\mu}_\mathrm{init}$.

1) Initialize the forward factor $\overrightarrow{C}_{X_1}$ with the Cholesky factor of the covariance of $\overrightarrow{\mu}_\mathrm{init}$ and $\overrightarrow{m}_{X_1}$ with $\overrightarrow{m}_\mathrm{init}$.

2) Perform a forward sweep through the graph using, in a general SSM, (I.3) and (I.4), followed by (I.9) (with $\overrightarrow{C}_Y = B\,\mathrm{chol}(V_U)$) and (I.10), and (III.1) (with $\overrightarrow{C}_Y = \mathrm{chol}(V_N)$) and (III.2).

3) If an initial message is given, the backward factor $\overleftarrow{S}_{X_K}$ is initialized analogously to 1). Otherwise initialize it to $0$.

4) Perform a backward sweep through the graph using the messages $\overleftarrow{S}_{X_k}$ and $\overleftarrow{S}_{X_k}\overleftarrow{m}_{X_k}$. The tabulated rules are applied in a dual fashion to 2).

5) Compute the marginal covariance factor and mean by using (II.1) and (II.2) with $\overrightarrow{C}_{X_k}$ and $\overleftarrow{S}_{X_k}$. Eventually, obtain $V_X$ by squaring the covariance factor (a numerical sketch of the square-root idea follows below).
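The core numerical benefit of propagating Cholesky factors instead of covariances is that updates can be carried out with orthogonal transformations, so the result is symmetric and positive semidefinite by construction. The following Python sketch illustrates this for a single generic time update; it is not one of the tabulated rules above, and the function name and interface are hypothetical.

import numpy as np

def sqrt_predict(A, C_x, C_q):
    """One square-root time update: returns a triangular factor C' with
    C' C'^T = A (C_x C_x^T) A^T + C_q C_q^T, computed via QR so that the
    covariance itself is never formed explicitly."""
    n = C_x.shape[0]
    M = np.hstack([A @ C_x, C_q])        # n x (n + q) stacked factors
    # QR of M^T gives M^T = Q R, hence M M^T = R^T R.
    R = np.linalg.qr(M.T, mode="r")
    return R[:n, :n].T                   # triangular factor (up to signs)

Since $MM^{\mathsf{T}}$ equals the predicted covariance, the triangular factor $R^{\mathsf{T}}$ plays the role of the forward factor without any explicit squaring.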

A.2 Continuous-Time Posterior Computation

Given a continuous-time SSM

$$dX(t) = AX(t)\,dt + bU(t)\,dt,$$


$$Y(t) = cX(t) + N(t),$$

with observations $y_k = Y(t_k)$ at discrete moments $t_k$ for $k \in [1, K]$, the following algorithm leads to an estimate of the posterior probability for $X(t)$ at any $t_k \le t \le t_{k+1}$.

Algorithm A.2: Continuous-Time Smoothing

1) Initialization of the forward message analogously to step 1) in Algorithm 4.1.

2) Perform forward message passing with $\overrightarrow{V}_X(t)$ and $\overrightarrow{m}_X(t)$ according to [10, (I.1) and (I.2)]. Note that messages need only be computed at the observations $t_k$.

3) Initialization of the backward message as in step 3) of Algorithm 4.1.

4) Backward recursion with messages $W_X(t)$ and $W_X(t)\mu_X(t)$ computed at $t_k$ for all $k \in [1, K]$, using (IV.1) and (IV.2) with $e^{A(t_{k+1}-t_k)}$ instead of $A$, and updates (IV.7) and (IV.8).

5) Computation of the posterior probability of $X(t)$ at any $t_k$ as in step 5) of Algorithm 4.1, and at any $t_k \le t \le t_{k+1}$ by computing first $\overrightarrow{V}_X(t)$ with [10, (I.2)] and then, with $A_t \triangleq e^{A(t_k - t)}$ and $\bar{A}_t \triangleq e^{A(t - t_k)}$:

$$V_X(t) = \overrightarrow{V}_X(t) - \overrightarrow{V}_X(t)\,A_t^{\mathsf{T}} W_X(t_k) A_t\,\overrightarrow{V}_X(t),$$

$$m_X(t) = \bar{A}_t\,\overrightarrow{m}_X(t_k) - \overrightarrow{V}_X(t)\,A_t^{\mathsf{T}} W_X(t_k)\mu_X(t_k)$$
$$\phantom{m_X(t)} = \bar{A}_t\,m_X(t_k) + \sigma_U^2 \int_0^{t-t_k} e^{A\tau}\, b\, b^{\mathsf{T}} e^{A^{\mathsf{T}}(\tau - t + t_k)}\, d\tau\; W_X(t_k)\mu_X(t_k).$$
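The interpolation formulas of step 5) translate directly into code. The following Python sketch (assuming NumPy and SciPy; the function name and calling convention are hypothetical) evaluates the posterior mean and covariance at an off-grid time $t = t_k + \Delta$ from quantities already computed at $t_k$:

import numpy as np
from scipy.linalg import expm

def interpolate_posterior(A, V_fwd_t, m_fwd_tk, W_tk, Wmu_tk, dt):
    # A_t = e^{A(t_k - t)} maps time t back to t_k; Abar = e^{A(t - t_k)}.
    At = expm(-A * dt)
    Abar = expm(A * dt)
    m_fwd_t = Abar @ m_fwd_tk              # forward mean at t, cf. [10, (I.1)]
    V = V_fwd_t - V_fwd_t @ At.T @ W_tk @ At @ V_fwd_t
    m = m_fwd_t - V_fwd_t @ At.T @ Wmu_tk
    return m, V

Here V_fwd_t is the forward covariance already propagated to $t$ with [10, (I.2)], and W_tk, Wmu_tk are the dual messages $W_X(t_k)$ and $W_X(t_k)\mu_X(t_k)$ from the backward recursion.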

A.3 System Identification in Chapter 6

Algorithm A.3: Model Identification using EM

1) Given the order of the sensor model and the parameters $\sigma_u^2$ and $\sigma_n^2$, initialize $A^{[0]}$ and $c^{[0]}$ with an AR-form SSM generated according to the physical model in Appendix F. Fix $b$.


2) At iteration $j$, perform dual-precision-based forward-backward message passing (see Section 4.3.2) in the factor graph of the SSM with parameters fixed to the current estimates $A^{[j]}$ and $c^{[j]}$ to obtain the Gaussian messages $V_{X_k}$ and $m_{X_k}$ for the posterior probability over $X_k$. In addition, compute the likelihood according to (see Figure 4.1 for notation)

$$L^{[j]} \triangleq \log p(y_1, \ldots, y_K \,|\, A^{[j]}, c^{[j]}) = -\frac{1}{2}\sum_{k=1}^{K}\left[\log\!\left(2\pi\big(\sigma_N^2 + c^{[j]}\overrightarrow{V}_{X_k}c^{[j],\mathsf{T}}\big)\right) + \frac{\big(y_k - c^{[j]}\overrightarrow{m}_{X_k}\big)^2}{\sigma_N^2 + c^{[j]}\overrightarrow{V}_{X_k}c^{[j],\mathsf{T}}}\right].$$

3) Compute the EM messages according to Section 4.3.3 and [16]. It turns out that $A$ and $c$ can be handled independently (their EM messages factorize) and both can be expressed as Gaussian messages. Specifically, let

$$\eta_{\theta_k}(\theta) \propto e^{-\theta^{\mathsf{T}}\overleftarrow{W}_{\theta_k}\theta + 2\theta^{\mathsf{T}}\overleftarrow{W}_{\theta_k}\overleftarrow{m}_{\theta_k}}$$

be a Gaussian EM message, where $\theta_k$ represents either the parameters of $A$ or of $c$ at time step $k$ in the factor graph. The specific parameters $\overleftarrow{W}_{a_k}$ and $\overleftarrow{W}_{a_k}\overleftarrow{m}_{a_k}$ for parameter $A$ are computed as in [16, (III.7) and (III.8)]

$$\overleftarrow{W}_{a_k} = V_{X_k} + m_{X_k}m_{X_k}^{\mathsf{T}}, \qquad \text{(A.1)}$$
$$\overleftarrow{W}_{a_k}\overleftarrow{m}_{a_k} = V_{X_k [X'_{k+1}]_1} + m_{X_k}\, m_{[X'_{k+1}]_1}^{\mathsf{T}}, \qquad \text{(A.2)}$$

where $[X'_{k+1}]_1$ is the first entry of the state variable $X'_{k+1}$. Then $\overleftarrow{W}_{c_k}$ and $\overleftarrow{W}_{c_k}\overleftarrow{m}_{c_k}$, the parameters for $c$, are evaluated according to [16, (III.1) and (III.2)]

$$\overleftarrow{W}_{c_k} = V_{X_k} + m_{X_k}m_{X_k}^{\mathsf{T}}, \qquad \text{(A.3)}$$
$$\overleftarrow{W}_{c_k}\overleftarrow{m}_{c_k} = y_k\, m_{X_k}. \qquad \text{(A.4)}$$

4) A new estimate $c^{[j+1]}$ is computed from all EM messages $\eta_{c_k}(c)$ by

$$c^{[j+1]} = \Bigg(\underbrace{\sum_{k=1}^{K}\overleftarrow{W}_{c_k}}_{\triangleq\, W_c}\Bigg)^{-1} \underbrace{\sum_{k=1}^{K}\overleftarrow{W}_{c_k}\overleftarrow{m}_{c_k}}_{\triangleq\, W_c m_c} \qquad \text{(A.5)}$$


and likewise for $A^{[j+1]}$.

5) If the termination criterion (i.e., the maximum number of iterations) is met or the estimates have converged (i.e., sufficiently small relative change in the log-likelihood), terminate the algorithm. Otherwise, repeat steps 2)-4).
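The M-step (A.5) is a plain weighted least-squares combination of the per-sample EM messages. A minimal Python sketch, with hypothetical names and list-based inputs, could look as follows:

import numpy as np

def m_step(W_list, Wm_list):
    """Combine Gaussian EM messages (W_k, W_k m_k) into a new parameter
    estimate, cf. (A.5); used identically for c^{[j+1]} and A^{[j+1]}."""
    W = sum(W_list)                    # sum_k W_k
    Wm = sum(Wm_list)                  # sum_k W_k m_k
    return np.linalg.solve(W, Wm)      # (sum_k W_k)^{-1} sum_k W_k m_k

Solving the linear system instead of explicitly forming the inverse is the numerically preferable way to evaluate (A.5).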

A.4 Sparse Input Estimation

Under the assumptions stated in Section 8.1, the following algorithm estimates a compressible input to an SSM.

Algorithm A.4: Sparse Input Estimation

1) Initialize the forward message $\overrightarrow{V}_{X_1} \leftarrow V_\mathrm{init}$, $\overrightarrow{m}_{X_1} \leftarrow 0$, where $V_\mathrm{init}$ is the solution of a discrete algebraic Riccati equation (DARE) with $\sigma_U^2 = \min_k \gamma_k\sigma_N^2$.

2) Recursively pass through $k \in [1, \ldots, K-1]$ and update messages and intermediate results:

i)
$$\overrightarrow{V}_{X'_{k+1}} \leftarrow A\overrightarrow{V}_{X_k}A^{\mathsf{T}} + \sigma_N^2\, BB^{\mathsf{T}},$$
$$\overrightarrow{m}_{X'_{k+1}} \leftarrow A\overrightarrow{m}_{X_k}.$$

ii)
$$G_{k+1} \leftarrow \big(c\overrightarrow{V}_{X'_{k+1}}c^{\mathsf{T}} + \sigma_N^2\big)^{-1},$$
$$F_{k+1} \leftarrow \big(I - \overrightarrow{V}_{X'_{k+1}}c^{\mathsf{T}}G_{k+1}c\big),$$
$$\overrightarrow{V}_{X_{k+1}} \leftarrow F_{k+1}\overrightarrow{V}_{X'_{k+1}},$$
$$\overrightarrow{m}_{X_{k+1}} \leftarrow F_{k+1}\overrightarrow{m}_{X'_{k+1}} + G_{k+1}\overrightarrow{V}_{X'_{k+1}}c^{\mathsf{T}}y_{k+1}.$$

3) Initialize the auxiliary messages
$$W_{X_K} \leftarrow \frac{1}{\sigma^2}\,c^{\mathsf{T}}c, \qquad W_{X_K}\mu_{X_K} \leftarrow \frac{1}{\sigma^2}\,c^{\mathsf{T}}.$$


4) Perform a backward (in time) sweep through $k \in [K-1, \ldots, 1]$ using the intermediate computation results from 2):

i)
$$W_{X'_k} \leftarrow A^{\mathsf{T}}W_{X_{k+1}}A,$$
$$W_{X'_k}\mu_{X'_k} \leftarrow A^{\mathsf{T}}W_{X_{k+1}}\mu_{X_{k+1}}.$$

ii)
$$W_{X_k} \leftarrow F_k^{\mathsf{T}}W_{X'_k}F_k + c^{\mathsf{T}}G_k c,$$
$$W_{X_k}\mu_{X_k} \leftarrow F_k^{\mathsf{T}}W_{X'_k}\mu_{X'_k} - c^{\mathsf{T}}G_k\big(c\overrightarrow{m}_{X'_k} - y_k\big).$$

iii) Estimate the input:
$$V_{U_k} \leftarrow \sigma^2\gamma_k - (\sigma^2\gamma_k)^2\, b^{\mathsf{T}}W_{X_k}b,$$
$$m_{U_k} \leftarrow -\sigma^2\gamma_k\, b^{\mathsf{T}}W_{X_k}\mu_{X_k}.$$

iv) Based on an update from Table 7.1, compute a new value for $\gamma_k$. Note that the previous step may be adapted to
$$W_{U_k} \leftarrow b^{\mathsf{T}}W_{X'_{k+1}}b, \qquad W_{U_k}\mu_{U_k} \leftarrow b^{\mathsf{T}}W_{X'_{k+1}}\mu_{X'_{k+1}}.$$
In particular, if the EM update is used with weight function $f(\gamma)$:
$$\gamma_k \leftarrow f'\big(m_{U_k}^2 + V_{U_k}\big).$$
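As an illustration of how the two sweeps and the $\gamma$ re-estimation interlock, the following Python sketch implements one simplified variant of Algorithm A.4 for scalar input and output. It is a sketch under several assumptions: the initialization uses the identity instead of the DARE solution, the input-noise term in step 2 i) uses the per-sample variance $\sigma^2\gamma_k$, and the plain EM rule $\gamma_k \leftarrow m_{U_k}^2 + V_{U_k}$ (cf. (V.1) with $f \equiv 1$) is hard-coded; all names are hypothetical.

import numpy as np

def sparse_input_sweep(A, b, c, y, gamma, sigma2, n_iter=10):
    n, K = len(b), len(y)
    for _ in range(n_iter):
        Vp = np.zeros((K, n, n)); mp = np.zeros((K, n))
        F = np.zeros((K, n, n)); G = np.zeros(K)
        V, m = np.eye(n), np.zeros(n)                 # simplified V_init
        for k in range(K):                            # forward sweep (step 2)
            if k > 0:
                V = A @ V @ A.T + sigma2 * gamma[k] * np.outer(b, b)
                m = A @ m
            Vp[k], mp[k] = V, m
            G[k] = 1.0 / (c @ V @ c + sigma2)
            F[k] = np.eye(n) - G[k] * np.outer(V @ c, c)
            m = F[k] @ m + G[k] * (V @ c) * y[k]
            V = F[k] @ V
        W = np.outer(c, c) / sigma2                   # step 3 initialization
        Wmu = c / sigma2
        for k in range(K - 2, -1, -1):                # backward sweep (step 4)
            Wq = A.T @ W @ A                          # W_{X'_k}
            Wqmu = A.T @ Wmu
            W = F[k].T @ Wq @ F[k] + G[k] * np.outer(c, c)
            Wmu = F[k].T @ Wqmu - G[k] * (c @ mp[k] - y[k]) * c
            q = sigma2 * gamma[k]
            VU = q - q**2 * (b @ W @ b)               # step 4 iii)
            mU = -q * (b @ Wmu)
            gamma[k] = mU**2 + VU                     # EM update, step 4 iv)
    return gamma

Large entries of the returned gamma indicate the positions of the detected (weakly sparse) input pulses.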


Appendix B

Proofs

Lemma B.1: Scaled Gaussian Max/Int Lemma
Consider the function $f(x,y)$ represented by Figure B.1, with variables $x \in \mathbb{R}^n$ and $y \in \mathbb{R}^m$, and where $V_U$ and $V_N$ are invertible matrices. Then it holds that
$$\int f(x,y)\,dx = |W_X|^{-1/2}\,\max_x f(x,y). \qquad \text{(B.1)}$$

Lemma B.2: Scaled Gaussian Max/Int Lemma with W
In addition to the conditions stated in Lemma B.1, let $q(x,y)$ be the quadratic function defined by
$$q(x,y) = \log\frac{f(x,y)}{f(m_X, m_Y)}.$$

Figure B.1: Factor graph used to define a Gaussian Max/Int lemma with scaling; the node $\Theta$ connects the prior $\mathcal{N}(m_X, V_U)$ on edge $X$ with the factor $\mathcal{N}(m_Y, V_N)$.


Then it follows that
$$\log\int f(x,y)\,dx = \log|W_Y|^{1/2} + \max_x q(x,y). \qquad \text{(B.2)}$$
Note that this lemma holds for arbitrary dimensions $n$ and $m$.

Proof of Lemma B.1 and Lemma B.2: First we recognize that for all $x$ and $y$ it holds that
$$f(x,y) = \mathcal{N}(m_X, V_X)\,\mathcal{N}(m_Y, V_Y) = |V_U|^{-1/2}|V_N|^{-1/2}\,e^{-q(x,y)}, \qquad \text{(B.3)}$$
where $q(x,y)$ is a quadratic function. Now recall the Gaussian Max/Int theorem and apply (B.3) to [47, Equation (138)]:
$$\begin{aligned}
\int f(x,y)\,dx &= |V_U|^{-1/2}|V_N|^{-1/2}\,e^{-\min_x q(x,y)}\int e^{-x^{\mathsf{T}}W_X x}\,dx\\
&= |V_U|^{-1/2}|V_N|^{-1/2}\,e^{-\min_x q(x,y)}\,|W_X|^{-1/2} \qquad \text{(B.4)}\\
&= \big|V_U\Theta^{\mathsf{T}}V_N^{-1}\Theta + I\big|^{-1/2}\,|V_N|^{-1/2}\,e^{-\min_x q(x,y)}\\
&= \big|V_N^{-1}\Theta V_U\Theta^{\mathsf{T}} + I\big|^{-1/2}\,|V_N|^{-1/2}\,e^{-\min_x q(x,y)}\\
&= \big|\Theta V_U\Theta^{\mathsf{T}} + V_N\big|^{-1/2}\,e^{-\min_x q(x,y)}, \qquad \text{(B.5)}
\end{aligned}$$
where we first used the identity (from message passing)
$$W_X = \overleftarrow{W}_X + \overrightarrow{W}_X = \Theta^{\mathsf{T}}V_N^{-1}\Theta + V_U^{-1}$$
and subsequently applied determinant identities. Identity (B.1) follows from (B.4) and (B.3). Then, identifying in (B.5) the term
$$\Theta V_U\Theta^{\mathsf{T}} + V_N = \overrightarrow{V}_Y + \overleftarrow{V}_Y = W_Y^{-1}$$
shows (B.2). $\Box$


B.1 Proofs for Chapter 3

Proof of Proposition 3.1: The proof consists of two steps: obtaining the Jacobi matrix of an SSM reparametrization with respect to $\theta$, and then expressing reparametrizations of (2.10) as a matrix similarity transform. Given that the rate matrices in two parameterizations are similar, it follows that their eigenvalues are the same (see, e.g., [34]) and therefore that the local rate of convergence $\rho_\theta$ is invariant to the chosen SSM representation.
Assume that $T \in \mathbb{R}^{n\times n}$ is a transformation matrix for the state $x \in \mathbb{R}^n$ in a time-invariant SSM with $A$, $B$, and $C$. Using vectorization properties (cf., e.g., [34]), in the transformed SSM $x' \triangleq Tx$ the parameter vector $\theta$ is linearly related to the transformed parameter vector $\theta'$ as well:

$$\underbrace{\begin{pmatrix}\operatorname{vec}(A)\\ \operatorname{vec}(B)\\ \operatorname{vec}(C)\end{pmatrix}}_{\theta} = \underbrace{\begin{pmatrix}T^{\mathsf{T}}\otimes T^{-1} & & \\ & I_d\otimes T^{-1} & \\ & & T^{\mathsf{T}}\otimes I_m\end{pmatrix}}_{\triangleq\,\Pi}\,\underbrace{\begin{pmatrix}\operatorname{vec}(A')\\ \operatorname{vec}(B')\\ \operatorname{vec}(C')\end{pmatrix}}_{\theta'},$$

where $d$ and $m$ are the column dimension of $B$ and the row dimension of $C$, respectively. The local convergence rate is given in (2.10) and based on the parametrization $\theta'$. Applying the chain rule and using the Jacobian $\frac{d\theta}{d\theta'} = \Pi$, we can establish the similarity of the rate matrices for different parameterizations:

$$\begin{aligned}
\nabla M(\theta') &= I - \left(\frac{\partial^2 Q(\theta'|\theta')}{\partial\theta'^2}\Bigg|_{\theta'=\hat\theta'}\right)^{-1}\frac{d^2\ell(\theta')}{d\theta'^2}\\
&= I - \left(\Pi^{\mathsf{T}}\,\frac{\partial^2 Q(\theta|\theta)}{\partial\theta^2}\Bigg|_{\theta=\hat\theta}\,\Pi\right)^{-1}\Pi^{\mathsf{T}}\,\frac{d^2\ell(\theta)}{d\theta^2}\,\Pi\\
&= \Pi^{-1}\left(I - \left(\frac{\partial^2 Q(\theta|\theta)}{\partial\theta^2}\Bigg|_{\theta=\hat\theta}\right)^{-1}\frac{d^2\ell(\theta)}{d\theta^2}\right)\Pi\\
&= \Pi^{-1}\,\nabla M(\theta)\,\Pi. \qquad\Box
\end{aligned}$$


B.2 Proofs for Chapter 4

Proof of Proposition 4.1: From (IV.7) and (IV.1), the aggregated one-time-step update for $W_{X_k}$ is
$$W_{X_{k-1}} = A^{\mathsf{T}}\big(I - c^{\mathsf{T}}Gc\overrightarrow{V}_X\big)W_{X_k}\big(I - c^{\mathsf{T}}Gc\overrightarrow{V}_X\big)^{\mathsf{T}}A + A^{\mathsf{T}}c^{\mathsf{T}}G_k cA. \qquad \text{(B.6)}$$

We recognize that if $W_{X_k}$ converges to a steady-state matrix, i.e., $W_{X_k} \to W_{X_\infty}$, then (B.6) is a discrete Lyapunov equation with matrices $A_\mathrm{lp}$ and $Q_\mathrm{lp}$. According to [38], $W_{X_k}$ converges if $A_\mathrm{lp}$ is asymptotically stable, i.e., if all its eigenvalues lie inside the unit circle. In fact, this property can be shown as follows:
$$\begin{aligned}
\det(A_\mathrm{lp} - \lambda I) &= \det\!\big(A^{\mathsf{T}}(I - c^{\mathsf{T}}G_\infty c\overrightarrow{V}_X) - \lambda I\big)\\
&= \det\!\big((I - c^{\mathsf{T}}G_\infty c\overrightarrow{V}_X)^{\mathsf{T}}A - \lambda I\big)\\
&= \det\!\big(A(I - \underbrace{\overrightarrow{V}_X c^{\mathsf{T}}G_\infty}_{\triangleq\,K}\,c) - \lambda I\big),
\end{aligned}$$
and since the system is observable, the matrix $K$, which is recognized as the Kalman gain of the system, stabilizes the closed loop [38]. Thus, the matrix $A(I - Kc)$ has all poles inside the unit circle (irrespective of $A$ being asymptotically stable or not). $\Box$

Proof of (4.9): The proof uses the first part of the proof in [16, Appendix E]. Starting from [16, Equation (163)], the following steps show the desired equation. Using first (IV.1) and (4.2), we get
$$V_{(Y\,X)} = \overrightarrow{V}_{(Y\,X)} - \overrightarrow{V}_{(Y\,X)}\begin{pmatrix}I\\0\end{pmatrix}W_Y\begin{pmatrix}I & 0\end{pmatrix}\overrightarrow{V}_{(Y\,X)}.$$
Then, inserting expression [16, Equation (163)] and simplifying, we identify the lower left corner of $V_{(Y\,X)}$ as $V_{X'_{k-1}X_k^{\mathsf{T}}}$. $\Box$


Figure B.2: Factor graph for interpolation with $t' = t + \Delta > t$: the edge $X(t)$ with incoming message $\overrightarrow{\mu}_X(t)$ is connected through the matrix factor $e^{A\Delta}$ and an added noise term $\mathcal{N}(0, V_\Delta)$ to the edge $X(t')$ with incoming message $\overleftarrow{\mu}_X(t')$.

B.3 Proofs for Chapter 5

Proof of Theorem 5.1: From (5.7), we have
$$\hat u(t') = \sigma_U^2\, b^{\mathsf{T}}W_X(t')\mu_X(t'). \qquad \text{(B.7)}$$

Now consider Figure B.2, which shows a factor graph of the joint probability density of the relevant variables (cf. [10]). Using Tables 2 and 3 of [47] (specifically, equations (II.9), (II.10), (II.12), (III.2), and (III.8) from these tables), we have, first,

$$\overrightarrow{m}_X(t') = e^{A\Delta}\,\overrightarrow{m}_X(t), \qquad \text{(B.8)}$$

then $\overleftarrow{m}_X(t) = e^{-A\Delta}\,\overleftarrow{m}_X(t')$ and thus

$$\overleftarrow{m}_X(t') = e^{A\Delta}\,\overleftarrow{m}_X(t), \qquad \text{(B.9)}$$

and finally $W_X(t) = e^{A^{\mathsf{T}}\Delta}\,W_X(t')\,e^{A\Delta}$ and thus
$$W_X(t') = e^{-A^{\mathsf{T}}\Delta}\,W_X(t)\,e^{-A\Delta}. \qquad \text{(B.10)}$$

Inserting these expressions into (B.7) yields (5.8). $\Box$

Proof of Theorem 5.2: We have to show that
$$\lim_{\Delta\to 0}\hat u(t_k, \Delta) = \lim_{\Delta\to 0}\hat u(t_k + \Delta, \Delta) \qquad \text{(B.11)}$$
if $c^{\mathsf{T}}b = 0$. The relevant part of the factor graph for the left-hand side of (B.11) is shown in Figure B.3 (top), and the relevant part of the factor graph for the right-hand side of (B.11) is shown in Figure B.3 (bottom).

Figure B.3: Factor graph segments (a)-(c) used for the proof of Theorem 5.2.

The former represents the equations
$$\tilde X(t_k) = X(t_k) + b\Delta\, U(t_k, \Delta) \qquad \text{(B.12)}$$
and
$$Y(t_k) = c^{\mathsf{T}}\tilde X(t_k) \qquad \text{(B.13)}$$
$$\phantom{Y(t_k)} = c^{\mathsf{T}}X(t_k) + c^{\mathsf{T}}b\Delta\, U(t_k, \Delta). \qquad \text{(B.14)}$$

If $c^{\mathsf{T}}b = 0$, then (B.14) reduces to $Y(t_k) = c^{\mathsf{T}}X(t_k)$, in which case Figure B.3 (top) is equivalent to Figure B.3 (middle). But for $\Delta \to 0$, Figure B.3 (bottom) also turns smoothly into Figure B.3 (middle). $\Box$

Proof of Theorem 5.3: We now prove a more general result. Let $Y^{(c)}(\omega) \equiv G(\omega)N(\omega)U(\omega)$ and
$$Y(\omega) = \sum_{k=-\infty}^{\infty} Y^{(c)}\!\Big(\omega + k\,\frac{2\pi}{T}\Big).$$

We wish to construct a linear estimator $H(\omega)$ that minimizes the expected squared error over all frequencies,
$$\mathrm{E}\big[\,|H(\omega)Y(\omega) - N(\omega)U(\omega)|^2\,\big].$$

Equivalently, by the orthogonality principle, for all $\omega \in \mathbb{R}$,
$$\begin{aligned}
0 &\overset{!}{=} \mathrm{E}\big[\,\overline{Y(\omega)}\,\big(H(\omega)Y(\omega) - N(\omega)U(\omega)\big)\big]\\
&= H(\omega)\,\mathrm{E}\big[|Y(\omega)|^2\big] - \mathrm{E}\big[\,\overline{Y(\omega)}\,N(\omega)U(\omega)\big]\\
&= H(\omega)\left(\sum_{k\in\mathbb{Z}}\sigma_U^2\,\Big|N\Big(\omega + k\,\frac{2\pi}{T}\Big)G\Big(\omega + k\,\frac{2\pi}{T}\Big)\Big|^2 + \sigma_N^2\right) - \sigma_U^2\,|N(\omega)|^2\,\overline{G(\omega)},
\end{aligned}$$

where we used the definitions of $Y(\omega)$ and $Y^{(c)}(\omega)$ and the white-noise property $\mathrm{E}\big[|U(\omega)|^2\big] \equiv \sigma_U^2$. Equation (5.10) now follows from the last expression, and (5.9) is a consequence of the Poisson summation formula applied to $H(\omega)Y(\omega)$. $\Box$


Figure B.4: General factor graph describing an SSM realization of the variational statistical model used for system identification: a chain over $X_1, \ldots, X_K$ with input factors $\mathcal{N}(0, \sigma_U^2)$ and output factors $\mathcal{N}(y_k, \sigma_N^2)$, together with the aggregated factor $\Theta$ between $\mathcal{N}(0, V_U)$ and $\mathcal{N}(m_Y, V_N)$. The factor graph is used in the proof of Theorem 6.1.

B.4 Proofs for Chapter 6

Proof of Theorem 6.1: First observe from Figure B.4 that any SSM corresponds to the factor graph from Lemma B.2. After applying the lemma, the (scaled) log-likelihood $L(A,b,c) \triangleq -2\log p(y,u\,|\,A,b,c)$ is seen to be
$$L(A,b,c) = \min_{x_1,\ldots,x_{K+1}} f(A,b,c) + g(x,A,b,c) \qquad \text{(B.15)}$$
subject to the state sequence fulfilling (2.1), with
$$f(A,b,c) \triangleq \log\big|\theta V_{X_1}\theta^{\mathsf{T}} + \sigma_U^2\,\Theta\Theta^{\mathsf{T}} + \sigma_N^2 I_K\big|,$$
$$g(x,A,b,c) \triangleq \frac{1}{\sigma_N^2}\sum_{k=1}^{K}(y_k - cx_k)^2 + \frac{1}{\sigma_U^2\|b\|^2}\sum_{k=1}^{K}\|x_{k+1} - Ax_k - bu_k\|^2,$$
where $\Theta \in \mathbb{R}^{K\times K}$ is the Toeplitz matrix constructed from the impulse response of the SSM, i.e.,
$$[\Theta]_{i,j} = \begin{cases} cA^{(i-j-1)}b & \text{if } i-j > 0,\\ 0 & \text{otherwise,}\end{cases} \qquad \text{(B.16)}$$


and $\theta \in \mathbb{R}^{K\times d}$ is given by
$$[\theta]_i = cA^i, \qquad \text{(B.17)}$$
and $V_{X_1}$ is the initial covariance, while the initial mean is assumed to be $0$.
We first consider the second term $g(x,A,b,c)$. With the substitution
$$e_k \triangleq \frac{1}{\|b\|}\,b^{\mathsf{T}}\big(x_{k+1} - Ax_k - bu_k\big)$$

for all $k \in [1,K]$, the minimization (B.15), under consideration of the constraints (2.1) on the state, becomes
$$\begin{aligned}
g(x,A,b,c) &= \min_{e_1,\ldots,e_K}\,\frac{1}{\sigma_N^2}\sum_{k=1}^{K}\big(y_k - [\Theta(u+e)]_k\big)^2 + \frac{1}{\sigma_U^2}\sum_{k=1}^{K}e_k^2\\
&= \frac{1}{\sigma_U^2}\,\min_{e}\,\lambda\,\|y - \Theta u - \Theta e\|^2 + \|e\|^2, \qquad \text{(B.18)}
\end{aligned}$$

where the signals $u_k$ and $e_k$ were stacked to obtain vectors. Note that $e$ can be interpreted as an input-side innovations vector. The minimization (B.18) is a standard regularized least-squares problem (e.g., [39]), and with the addition of the matrix inversion lemma we obtain
$$\begin{aligned}
\sigma_U^2\,g(A,b,c) &= \lambda\,\|y - \Theta u\|^2 - (y - \Theta u)^{\mathsf{T}}\Theta\big(\lambda I + \Theta^{\mathsf{T}}\Theta\big)^{-1}\Theta^{\mathsf{T}}(y - \Theta u)\\
&= (y - \Theta u)^{\mathsf{T}}\big(\lambda I + \Theta^{\mathsf{T}}\Theta\big)^{-1}(y - \Theta u).
\end{aligned}$$

For long sequences, i.e., $K \gg 1$, the eigenvalues and eigenvectors of the Toeplitz matrix $\Theta$ converge to the DFT values of the impulse response and the DFT matrix columns [34], i.e.,
$$\Theta \approx DSD^{\mathsf{H}}, \qquad \text{(B.19)}$$
where $S$ is a diagonal matrix with $S_{i,i} = S[i]$, and $D$ is the $K\times K$ DFT matrix. Defining the DFTs $U[k]$ and $Y[k]$ of the input and output signals and $S^2 = SS^{\mathsf{H}}$, the log-likelihood converges to


$$\begin{aligned}
\sigma_U^2\,g(x,A,b,c) &\approx \big(y - DSD^{\mathsf{H}}u\big)^{\mathsf{H}}\big(\lambda I + DS^2D^{\mathsf{H}}\big)^{-1}\big(y - DSD^{\mathsf{H}}u\big)\\
&= \big(D^{\mathsf{H}}y - SD^{\mathsf{H}}u\big)^{\mathsf{H}}\big(\lambda I + S^2\big)^{-1}\big(D^{\mathsf{H}}y - SD^{\mathsf{H}}u\big)\\
&= \sum_{k=1}^{K}\frac{\big|Y[k] - S[k]U[k]\big|^2}{\lambda + |S[k]|^2}. \qquad \text{(B.20)}
\end{aligned}$$

For the first term $f(A,b,c)$ we also invoke (B.19):
$$\begin{aligned}
f(A,b,c) &\approx \log\big|sV_{X_1}s^{\mathsf{H}} + \sigma_U^2 S^2 + \sigma_N^2 I_K\big|\\
&= \log|V_{X_1}| + \log\big|\sigma_U^2 S^2 + \sigma_N^2 I_K\big| + \log\big|V_{X_1}^{-1} + (S^{-1}s)^{\mathsf{H}}(S^{-1}s)/(\sigma_U^2 + \sigma_N^2)\big|\\
&\approx \sum_{k=1}^{K}\log\big(\sigma_U^2|S_{k,k}|^2 + \sigma_N^2\big) + \mathrm{const.},
\end{aligned}$$

where we defined $s = D^{\mathsf{H}}\theta$, the $d$ DFTs of the impulse responses from each of the states to the output, and used the matrix inversion lemma in the second step. In the last step we consider $S^{-1}s$ approximately independent of $A$, $b$, $c$. The final approximation is thus

$$L(A,b,c) \approx \sum_{k=1}^{K}\left[\log\big(\sigma_U^2(|S_{k,k}|^2 + \lambda)\big) + \frac{1}{\sigma_U^2}\,\frac{\big|Y[k] - S[k]U[k]\big|^2}{\lambda + |S[k]|^2}\right]. \qquad\Box$$

B.5 Proofs for Chapter 7

Remark B.1: In all the following proofs, we consider $d$ sparse priors on $X \in \mathbb{R}^d$ and a general Gaussian likelihood $p(y|x)$. We also define
$$L(\xi) \triangleq \log\int p(y|x)\prod_{k=1}^{d}p(x_k|\xi_k^2)\,dx, \qquad \ell(\gamma) \triangleq \log\int p(y|x)\prod_{k=1}^{d}p(x_k|\gamma_k)\,dx.$$


Proof of (V.1), (V.2) and (V.3): To derive the EM update of $\xi^{[t]}$, we use the max box in Figure 7.1 as the $p_\mathrm{AR}$-box, i.e., we define $X$ as the hidden variable. Other choices are also possible and have been explored. The advantage of this choice of $p_\mathrm{AR}$ is that, when there is more than one prior, the joint EM message factorizes and each $\xi^{[t]}$ may be optimized independently in the M-step.
Let $\eta(\xi^{[t]})$ be the EM message on $\xi^{[t]}$, which follows from

$$\begin{aligned}
\eta(\xi) &= \mathrm{E}_{p_\mathrm{AR}}\left[-\log\sqrt{2\pi\xi^2} - \frac{x^2}{2\xi^2}\right] \qquad \text{(B.21)}\\
&= -\frac{1}{2}\log(2\pi\xi^2) - \frac{m_X^2 + V_X}{2\xi^2}. \qquad \text{(B.22)}
\end{aligned}$$

When the weights $f(\xi) \equiv 1$, a new estimate $\xi^{[t+1]}$ is obtained by maximizing $\eta(\xi)$:
$$\xi^{[t+1]} \leftarrow \operatorname*{argmax}_{\xi}\, -\log\big(\sqrt{2\pi\xi^2}\big) - \frac{m_X^2 + V_X}{2\xi^2} = \sqrt{m_X^2 + V_X},$$
which is (V.1). The alternative update (V.2) follows by applying (4.2) and (4.3) from Table 4.1.

which is (V.1). The alternative update (V.2) follows by applying (4.2)and (4.3) from Table 4.1.For a general f(ξ), the M-step is formally

ξ[t+1] ← argmaxξ− log

√2πξ2 − m2

X + VX2ξ2 + log f(ξ).

Recall (7.2) and then take the derivative with respect to $\xi$, which results in
$$\frac{dg^\star(\alpha)}{d\alpha}\Bigg|_{\alpha = \xi^{-2}} = m_X^2 + V_X,$$

and by the properties of convex conjugates (see, e.g., [11, Section 3.3.2]), the derivative of the conjugate $g^\star(\alpha)$ is the inverse map of the derivative of $g(z)$. Thus, recalling the definition of $g(z) = -\log p(\sqrt{z})$ in (7.1), the update is

$$\xi^{[t+1]} \leftarrow \left(\frac{dg(z)}{dz}\right)^{-1/2}\Bigg|_{z = m_X^2 + V_X} = \left(-\frac{1}{x}\,\frac{d\log p(x)}{dx}\right)^{-1/2}\Bigg|_{x = \sqrt{m_X^2 + V_X}}. \qquad\Box$$

Proof of (V.4) and (V.5): MacKay updates are based on a gradient step on $\gamma$ [49]. From (2.11) and the fact that the EM update factorizes, it is clear that the gradient for $\gamma$ also decomposes into $d$ scalar problems. Let us thus pick one $\gamma$ and use (B.22) and (2.11):
$$\frac{d\ell(\gamma)}{d\gamma} = \frac{d\eta(\gamma)}{d\gamma} = -\frac{1}{2}\left(\frac{1}{\gamma} - \frac{m_X^2 + V_X}{\gamma^2}\right). \qquad \text{(B.23)}$$
Now, with the current estimate $\gamma^{[t]}$, we set the gradient to $0$ and substitute $V_X^{[t]} = V_X/\gamma^{[t]}$:
$$\gamma\big(1 - V_X^{[t]}\big) = m_X^2,$$
from which (V.4) follows. Applying the definitions of $W_X$ and $W_X\mu_X$ to the right-hand side of (V.4) yields (V.5). $\Box$

Proof of (V.6): Using scale factors [59], the marginal likelihood evaluated on edge $X_n$ corresponds to
$$p(y\,|\,\xi_{\backslash i}, \xi_n) = \overrightarrow{\beta}_{X_n}\overleftarrow{\beta}_{X_n}\sqrt{\frac{W_{X_n}}{2\pi}}\; e^{-\frac{(W_{X_n}\mu_{X_n})^2}{2W_{X_n}}} \propto \sqrt{W_{X_n}}\; e^{-\frac{(W_{X_n}\mu_{X_n})^2}{2W_{X_n}}}, \qquad \text{(B.24)}$$

where the second relation is a proportionality with respect to $\xi_n$. Making the dependence on $\xi_n$ explicit ($W_{X_n}(\xi_n)$ and $W_{X_n}\mu_{X_n}(\xi_n) = W_{X_n}(\xi_n)\overleftarrow{m}_{X_n}$), taking the derivative with respect to $\xi_n^2$ and considering stationary points of the (scaled) log-marginal probability, i.e., $\xi_n$ that fulfill
$$\begin{aligned}
0 &= \frac{d}{d\xi_n^2}\log p(y\,|\,\xi_{\backslash i},\xi_n)\\
0 &= \frac{d}{d\xi_n^2}\Big(\log W_{X_n}(\xi_n) - \overleftarrow{m}_{X_n}^2\, W_{X_n}(\xi_n)\Big)\\
\frac{W'_{X_n}(\xi_n)}{W_{X_n}(\xi_n)} &= \overleftarrow{m}_{X_n}^2\, W'_{X_n}(\xi_n)\\
\frac{1}{W_{X_n}(\xi_n)} &= \overleftarrow{m}_{X_n}^2\\
\xi_n^2 &= \overleftarrow{m}_{X_n}^2 - \overleftarrow{V}_{X_n}\\
\gamma_n &= \sqrt{\overleftarrow{m}_{X_n}^2 - \overleftarrow{V}_{X_n}},
\end{aligned}$$
where $W'_{X_n}(\xi_n)$ denotes the derivative with respect to $\xi_n^2$ and, finally, we used $W'_{X_n}(\xi_n) > 0$. It can be seen that the derivative of the log-marginal probability is always negative for $\xi_n^2 > \min(\overleftarrow{m}_{X_n}^2 - \overleftarrow{V}_{X_n}, 0)$ and non-negative otherwise. $\Box$

Proof of (7.22): We first note that for $\gamma_n = 0$, trivially, $W_{U_n} = \overleftarrow{V}_{U_n}^{-1}$ and $W_{U_n}\mu_{U_n} = \overleftarrow{V}_{U_n}^{-1}\overleftarrow{m}_{U_n}$. Next, recalling the eigendecomposition (7.16) and applying it to (7.20), we obtain
$$\begin{aligned}
\varsigma_n &= -\operatorname{Tr}\overleftarrow{V}_{U_n}^{-1} + \overleftarrow{m}_{U_n}^{\mathsf{T}}\big(D\Lambda D^{\mathsf{T}}\big)^{-2}\overleftarrow{m}_{U_n}\\
&= -\sum_{l=1}^{d}\left(\frac{1}{\lambda_l} - \frac{m_l^2}{\lambda_l^2}\right)\\
&= -\sum_{l=1}^{d}\frac{d}{d\theta_n}\left(\log(\theta_n + \lambda_l) + \frac{m_l^2}{\theta_n + \lambda_l}\right)\Bigg|_{\theta_n = 0}\\
&= -2\,\frac{d}{d\theta_n}\log p(y\,|\,\gamma_{\backslash i},\theta_n)\Bigg|_{\theta_n = 0}. \qquad \text{(B.25)}
\end{aligned}$$
Now, if $\gamma_n = 0$, local optimality requires that the derivative at zero be non-positive, from which (7.22) follows. $\Box$


Appendix C

Spline Prior

Prior knowledge on real signals is often limited to the fact that the signal and some of its derivatives are continuous with bounded energy. In this case, signals may be modeled with a continuous-time Gaussian stochastic process $X^{(n)}(t)$, with $n \in \mathbb{N}$ arbitrary, that is generated by (scaled) $n$-fold integration of white Gaussian noise with variance $\sigma_W^2$, as depicted in Figure C.1 (with a parameter $T$ that will be explained below). This process may also be expressed with the continuous-time SSM

W asdepicted in Figure C.1 (with a parameter T that will be explained below).This process may also be expressed with a continuous-time SSM

dX = A(c)X + bdW, X(0) = X0, (C.1)X(n)(t) = cX,

(C.2)

with parameters

A(c)i,j =

{1T , if i = j − 1

0, otherwise,

b = [1/√T , 0, . . . , 0]T, c = [0, . . . , 0, 1].

The initial state is important when we want the prior to be able to modelsignals with large offsets.The parameter T is introduced to make the prior behave the same acrossdifferent time-scales. To explain this, consider two signals with differenttime scales T1 � T2 both modeled with the spline prior. The scalingeffectuates that both random signals will have the same average energyon their respective scale. Put it differently, at the fine resolution of T1,


Figure C.1: $n$-fold integration of white noise: WGN is scaled by $1/\sqrt{T}$ and passed through a chain of integrators, each scaled by $1/T$, producing $X_1(t), X_2(t), \ldots, X_n(t)$.

realizations of the random process will look the same as realizations of the other signal at $T_2$. This is necessary to ensure consistent characteristics of the prior throughout different operating regimes (e.g., sampling frequencies). For example, a large sensor will display resonances at low frequencies, and the desired estimated signal's frequencies will be low as well. Nevertheless, the proposed spline prior remains scale-free in the classical sense, as its distribution is homogeneous with respect to time (that is, mean and variance are homogeneous functions in the variable $t$; a homogeneous function obeys $f(\alpha t) = c_\alpha f(t)$).
When $X^{(n)}(t)$ is eventually used together with a discrete-time system, a reasonable choice of $T$ is of course the sampling period, and the equivalent discrete-time SSM with time steps $T$ is

$$X_{k+1} = A_T X_k + U_k, \qquad U_k \overset{\mathrm{iid}}{\sim} \mathcal{N}(0, V_U), \qquad X_0 = X(0), \qquad \text{(C.3)}$$
$$X^{(n)}_k = [0, \ldots, 0, 1]\,X_k,$$
with
$$[A_T]_{i,j} = \begin{cases}\frac{1}{(i-j)!}, & \text{if } i \ge j,\\ 0, & \text{otherwise,}\end{cases} \qquad [V_U]_{i,j} = \frac{\sigma_W^2}{|i-j|!\,(i-j+1)},$$

which is readily obtained using [9, Section 2.4] and the nilpotent property of $A$. It is apparent that the noise covariance $V_U$ is independent of $T$, and thus the prior defined by (C.3) over $X_n(t)$ indeed exhibits the same "behaviour" at different time scales, i.e., different sampling rates.
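A small Python sketch makes the construction of (C.3) concrete. The transition matrix follows the closed form above; to stay agnostic about the exact closed form of $V_U$, the sketch computes the input covariance by numerically integrating the Gramian $\sigma_W^2\int_0^1 e^{A\tau}bb^{\mathsf{T}}e^{A^{\mathsf{T}}\tau}\,d\tau$ (with $T$ normalized to $1$); the function name and interface are hypothetical.

import numpy as np
from scipy.linalg import expm

def spline_prior_ssm(n, sigma2_W, n_grid=1000):
    # Continuous-time integrator chain: A[i, j] = 1 for i = j + 1 (0-based).
    A = np.diag(np.ones(n - 1), -1)
    b = np.zeros(n); b[0] = 1.0
    # V_U = sigma_W^2 * int_0^1 e^{A tau} b b^T e^{A^T tau} d tau (trapezoid).
    taus = np.linspace(0.0, 1.0, n_grid)
    fs = np.array([np.outer(expm(A * t) @ b, expm(A * t) @ b) for t in taus])
    V_U = sigma2_W * np.trapz(fs, taus, axis=0)
    # A_T = e^A has entries 1/(i-j)! on and below the diagonal.
    A_T = expm(A)
    return A_T, V_U

Because $A$ is nilpotent, expm(A) is exact up to rounding, and the returned $V_U$ does not depend on the time scale, in line with the scale-free behaviour discussed above.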

C.1 Relation to Splines

The presented $n$th-order spline prior is related to $n$th-order polynomial spline smoothing and spline interpolation [73]. A correspondence can



be established by considering MMSE/MAP estimation of $X_n(t)$ from a set of observations $y_1, \ldots, y_K$ at times $T_{\min} \le t_1 \le \ldots \le t_K \le T_{\max}$, corrupted by Gaussian noise with variance $\sigma_N^2$. Recall [10, Theorem 2]; then, using a non-informative initial state (i.e., $V \to \infty$), it follows that the estimate $\hat x_n(t)$, for $t \in [T_{\min}, T_{\max}]$, minimizes
$$\hat x(t) = \operatorname*{argmin}_{x(t)}\;\frac{1}{\sigma_N^2}\sum_{k=1}^{K}\|y_k - x(t_k)\|^2 + \frac{\sqrt{T}^{\,2n-1}}{\sigma_W^2}\int_{T_{\min}}^{T_{\max}}\left(\frac{d^n x(t)}{dt^n}\right)^2 dt. \qquad \text{(C.4)}$$

(C.4)

Using Theorem 5.1 and the observation that in $e^{A^{(c)}t}$ monomials of maximum order $n$ appear, we can conclude that $\hat x(t)$ is an $n$th-order polynomial. In addition, a straightforward but relevant observation is that $\lambda \triangleq \sqrt{T}^{\,2n-1}\sigma_N^2/\sigma_W^2$ trades data fit for smoothness, and when $\lambda \to 0$, (C.4) reduces to an interpolation spline. Only this quantity $\lambda$, the ratio, is relevant for this prior. Observe that our proposed scaling enforces more smoothness for larger $T$ (e.g., sampling time) by making $\lambda$ larger.
A related version of objective (C.4) was first used for spline smoothing in [73]. Similar connections between estimation in SSMs and spline smoothing have been pointed out by various authors [18, 59].


Appendix D

Additional Material on Dynamometer Filtering

D.1 Experimental Setup

The proposed methods are evaluated on experimental cutting data sets. Raw dynamometer identification and machining data were recorded by Daniel Spescha at the Institute of Machine Tools and Manufacturing at ETH Zurich. Relevant details on the experimental setups evaluated in this thesis are given in Table D.1. The presented results are based on data from two experiments. Standard equipment was used in all experiments. The dynamometer sensor used to estimate the applied force and the reference sensor were the same in both experiments, but mounted on different machines. The mounting is standard usage and consisted in stacking and fixing a metal plate, the dynamometer sensor, the reference sensor, and the workpiece on top of each other. With both dynamometers, forces in x-direction, y-direction, and z-direction were recorded.
The measurements employed for identification were recorded by solicitation of the workpiece with an impulse hammer prior to the machining process. During the identification measurements, the machine was turned off and the system was in resting condition. Solicitation with the impulse hammer was performed by the operator, and the hammer also recorded the applied force.
Experimental machining data was produced by processing a workpiece at different cutting speeds and feed rates. Machining was done in x-direction and y-direction. One machining test signal thus corresponds to


Parameter                  Experiment A [EXPA]    Experiment B [EXPB]
Sampling frequency [Hz]    20480                  16384
Identification sets        3                      4
Cutting sets               53                     3
Signal length              8192                   8192

Experimental setup
Sensor                     Kistler 9255B          Kistler 9255B
Reference sensor           Kistler MiniDyn        Kistler MiniDyn
Machine                    Mikron VC1000          Fehlmann Picomax 825

Table D.1: Specific details on the experiments performed at Inspire. This data was used to evaluate the performance of the presented signal processing algorithms.

a complete cut through the workpiece with constant cutting parameters.

D.2 Measured Frequency Responses

In Figure D.1, the FRFs of the dynamometer sensor of setting [EXPA] are shown. The FRF estimates $H^{(xx)}$, $H^{(xy)}$, and $H^{(xz)}$ are obtained by the least-squares method (6.16) with parameters $L = 3$ and $K = 8192$.
The reference sensor's frequency response is sufficiently flat up to frequencies of 1 kHz, and for higher frequencies of interest, the frequency response can be equalized reliably. The reference sensor does not display resonant modes below frequencies of 5 kHz.


Figure D.1: Measured frequency responses at output channel x obtained in experiment A (see Table D.1): magnitude $|H|$ in dB and phase $\angle H$ in degrees of $H^{(xx)}$, $H^{(xy)}$, and $H^{(xz)}$ over frequencies $f$ from 0 to 3500 Hz.


Appendix E

Learnings from Implementing Gaussian Message Passing Algorithms

Gaussian message passing algorithms with high-dimensional multivariate Gaussian messages may become numerically unstable. Once these problems occur, it is usually difficult to locate the root cause of the errors. In the following, we present a few error metrics that are useful "diagnostic" tools in these situations. For approaches to overcome sources of numerical errors, after they have been identified, see Section 3. All quantities used in the following are finite-precision approximations to the exact quantities.
It was shown in [69] that, to first order, a numerical implementation of a Kalman filter is exact (in the mean vector and the covariance matrix) if the numerical errors caused by finite-precision updates are symmetric. This type of error can be avoided by carefully designing message update computations such that they are symmetric (see, e.g., (4.6) in Algorithm 4.1). Obviously, explicitly symmetrizing the covariances, i.e., $(\overrightarrow{V} + \overrightarrow{V}^{\mathsf{T}})/2$, is a simple alternative.


Backward Error of Steady-State Messages

When utilizing steady-state covariance matrices or precision matrices (cf. Section 6.2.2), the numerical precision of these matrices is important for the overall numerical stability of algorithms based on marginal probabilities (e.g., EM). The precision of the steady-state matrices can be assessed by estimating backward errors of the discrete algebraic Riccati equation. In words, backward errors quantify the amount by which the coefficients of a system of equations have to be modified to obtain the finite-precision solution, under the assumption that the system of equations is solved exactly.

Assume that $\overrightarrow{V}_X$ is the finite-precision solution of the corresponding discrete algebraic Riccati equation with standard SSM notation. We then use the (Frobenius) norm of the relative backward error
$$\frac{\Big\|A\Big(\overrightarrow{V}_X - \overrightarrow{V}_X C^{\mathsf{T}}\big(C\overrightarrow{V}_X C^{\mathsf{T}} + V_n\big)^{-1}C\overrightarrow{V}_X\Big)A^{\mathsf{T}} + BV_uB^{\mathsf{T}} - \overrightarrow{V}_X\Big\|_F}{\big\|\overrightarrow{V}_X\big\|_F} \qquad \text{(E.1)}$$
as a metric to assess the loss in precision due to $\overrightarrow{V}_X$. An analogous reasoning is used for the precision matrix.
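The metric (E.1) is cheap to evaluate. A minimal Python sketch (assuming NumPy/SciPy; the function name and interface are hypothetical) is:

import numpy as np
from scipy.linalg import solve_discrete_are

def dare_backward_error(A, B, C, Vu, Vn, V):
    """Relative backward error (E.1) of a candidate steady-state forward
    covariance V for the SSM (A, B, C) with covariances Vu (input), Vn (noise)."""
    S = C @ V @ C.T + Vn
    res = A @ (V - V @ C.T @ np.linalg.solve(S, C @ V)) @ A.T + B @ Vu @ B.T - V
    return np.linalg.norm(res, "fro") / np.linalg.norm(V, "fro")

For reference, a candidate V can be obtained from the dual control-form Riccati equation, e.g., V = solve_discrete_are(A.T, C.T, B @ Vu @ B.T, Vn).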

Sensitivity of Autoregressive Parameterization

The autoregressive parameterization of an SSM, where $A^{\mathsf{T}}$ is also known as a companion matrix [28], is a numerically sensitive representation of the system dynamics [70]. When we are just interested in marginalization on the factor graph and can assume that the coefficients of $A$ are not affected by rounding errors, high sensitivities are of no concern. In parameter estimation settings, however, the sensitivity is highly relevant; amongst many effects, a high sensitivity can make the discrete algebraic Riccati equation ill-conditioned [32], and it can make recursive EM-based estimation of $A$ unstable.
Let us quantify the sensitivity of an $n$th-order SSM in AR form with coefficients $a = [a_1, \ldots, a_n]$ (first row of $A$) in terms of the change $\Delta p_j$ of pole $p_j$ (the eigenvalue of $A$) due to a perturbation $\delta a$ on the coefficients.


To first order, it can be shown that
$$\Delta p_j = -\frac{1}{a'(p_j)}\,\delta a, \qquad \text{(E.2)}$$
where $a'(x)$ is the first derivative of the polynomial
$$a(x) \triangleq x^n - a_1x^{n-1} - \ldots - a_n. \qquad \text{(E.3)}$$

We thus propose to measure the sensitivity as
$$\max_{p\in\{p_1,\ldots,p_n\}}\frac{1}{|a'(p)|}. \qquad \text{(E.4)}$$
It was experimentally observed that large values are linked to stability problems.
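The measure (E.4) only requires the roots and the derivative of the polynomial (E.3). A short Python sketch (hypothetical function name):

import numpy as np

def ar_sensitivity(a):
    """Sensitivity measure (E.4) for AR coefficients a = [a1, ..., an],
    where a(x) = x^n - a1 x^{n-1} - ... - an."""
    coeffs = np.concatenate(([1.0], -np.asarray(a, dtype=float)))
    poles = np.roots(coeffs)
    dcoeffs = np.polyder(coeffs)
    return max(1.0 / abs(np.polyval(dcoeffs, p)) for p in poles)

Closely clustered poles make $|a'(p)|$ small, so this measure grows exactly in the regimes where the AR form is known to be fragile.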

Expectation-Maximization Updates

The log-likelihood is, of course, the most important diagnostic tool for EM algorithms and should be implemented first. Given an SSM with variables as in (2.3) in Section 2.1, the likelihood is readily obtainable from forward message passing.
We have also found that consistency equations, in a wider sense backward errors, are well suited to detect numerical inaccuracies. For EM-based estimation of the state-transition matrix $A$, one possible such equation is constructed from the posterior means around a "+"-factor. For an SSM (2.3) with one-dimensional input this is
$$\sum_k \big\|m_{X_{k+1}} - m_{X'_k} - b\,m_{U_k}\big\|^2 \big/ \|b\,m_{U_k}\|^2,$$

which would be zero with exact arithmetic. In addition, empirical tests suggest that this relative error is of the same order as the actual relative error due to finite-precision arithmetic.
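A direct Python transcription of this check (hypothetical names; mX holds the posterior state means, mXp the means of $X'_k$, mU the input means):

import numpy as np

def plus_factor_consistency(mX, mXp, mU, b):
    """Relative residual of the posterior means around the "+"-factor;
    zero under exact arithmetic."""
    return sum(
        np.sum((mX[k + 1] - mXp[k] - b * mU[k]) ** 2) / np.sum((b * mU[k]) ** 2)
        for k in range(len(mU))
    )

Monitoring this quantity per EM iteration gives an early warning before instabilities become visible in the log-likelihood.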


Appendix F

Multi-Mass Resonator Model

Subsequently, we sketch the reasoning and observations behind our resonant system model. The implementation details are standard.
A simple example of a mechanical system with two masses is shown in Figure F.1. The coupling between the masses is modeled with a spring and a damping element. Let $H(f)$ be the FRF from $F_2$ to $x_2$. The resulting magnitude and phase response of $H(f)$ are shown in Figure F.2. In this simple mechanical model, parameters were chosen such that $m_1$ may represent a typical machine and mass $m_2$ the small lightweight sensor on top of it. Specifically, the following assumptions lead to the demonstrated model:

1) the lower mass m1 is considerably larger than the top mass,

2) the spring constants are similar, and

3) damping is assumed to be much larger for m1 than for m2.

SSMs are generated from general $N$-mass dynamical systems by series concatenation of 2nd-order SSMs, each representing one mass-spring-damper system. The first element in the series is the mass connected to the steady ground ($m_1$ in Figure F.1). This model allows constructing SSMs with reasonable structure that serve as initial estimates for system identification in Chapter 6.
In addition, insights that hold for all realistic dynamometer sensor settings can be gained from the multi-mass models.


Figure F.1: Mechanical model of a simple multi-mass swinger example: mass $m_1$ is attached to the ground via spring $k_1$ and damper $\xi_1$, mass $m_2$ is attached to $m_1$ via spring $k_2$ and damper $\xi_2$; displacements $x_1, x_2$ and forces $F_1, F_2$ act on the respective masses.

Observe that the FRF in Figure F.2 includes i) a large resonance (at $f = 0.5$) and ii) a resonance/anti-resonance mode (at $f = 0.23$), which are akin to what is seen in the measured FRFs in Chapter 6. In Figure F.2, the FRF of $m_2$ only (i.e., the isolated sensor) is shown as well. Comparing the two FRFs, it is evident that the coupling with $m_1$ causes a shift in the resonance frequency of $m_2$.


Figure F.2: Magnitude and phase of the transfer function $H(f)$ of $x_2$ with input $F_2$, for the full system and for the case with $m_1$ fixed. The masses are $m_1 = 25\,m_2$, the spring constants are $k_1 = 10$ and $k_2 = 2$, while the first mass is damped with coefficient $0.5$ and $m_2$ is damped with coefficient $0.01$. Forces measured at $m_2$ are proportional to $x_2$.
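The two-mass example can be reproduced with a few lines of Python. The sketch below (hypothetical function name) builds the continuous-time state space with state $[x_1, \dot x_1, x_2, \dot x_2]$ from the equations of motion of Figure F.1 and evaluates $H(f)$ from $F_2$ to $x_2$:

import numpy as np

def two_mass_frf(m1, m2, k1, k2, xi1, xi2, freqs):
    # Equations of motion: m1*x1'' = -k1*x1 - xi1*x1' + k2*(x2-x1) + xi2*(x2'-x1')
    #                      m2*x2'' = -k2*(x2-x1) - xi2*(x2'-x1') + F2
    A = np.array([
        [0.0, 1.0, 0.0, 0.0],
        [-(k1 + k2) / m1, -(xi1 + xi2) / m1, k2 / m1, xi2 / m1],
        [0.0, 0.0, 0.0, 1.0],
        [k2 / m2, xi2 / m2, -k2 / m2, -xi2 / m2],
    ])
    b = np.array([0.0, 0.0, 0.0, 1.0 / m2])   # force F2 enters at m2
    c = np.array([0.0, 0.0, 1.0, 0.0])        # output is displacement x2
    I = np.eye(4)
    return np.array([c @ np.linalg.solve(2j * np.pi * f * I - A, b) for f in freqs])

With the parameter values quoted in the caption of Figure F.2, this reproduces the resonance and the resonance/anti-resonance pattern discussed above.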


Bibliography

[1] M. Aharon, M. Elad, and A. Bruckstein, "K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation," IEEE Trans. on Signal Processing, vol. 54, no. 11, pp. 4311–4322, 2006.

[2] J. Alihanka, K. Vaahtoranta, and I. Saarikivi, "A new method for long-term monitoring of the ballistocardiogram, heart rate, and respiration," Am. J. Physiol., vol. 240, no. 5, pp. R384–92, 1981.

[3] A. Amini, U. Kamilov, E. Bostan, and M. Unser, "Bayesian estimation for continuous-time sparse stochastic processes," IEEE Trans. on Signal Processing, vol. 61, no. 4, pp. 907–920, Feb. 2013.

[4] S. D. Babacan, R. Molina, M. N. Do, and A. K. Katsaggelos, "Bayesian blind deconvolution with general sparse image priors," in ECCV. Springer, 2012, pp. 341–355.

[5] S. D. Babacan, R. Molina, and A. K. Katsaggelos, "Bayesian compressive sensing using Laplace priors," IEEE Trans. on Image Processing, vol. 19, no. 1, pp. 53–63, 2010.

[6] L. E. Baum, T. Petrie, G. Soules, and N. Weiss, "A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains," The Annals of Mathematical Statistics, pp. 164–171, 1970.

[7] G. Bierman, "Factorization methods for discrete sequential estimation," Mathematics in Science and Engineering, vol. 128, 1977.

[8] D. Biermann, R. Hense, and R. Surmann, "Korrektur gemessener Zerspankräfte beim Fräsen - Inverse Filterung von Kraftmessungen als Werkzeug beim Ermitteln von Zerspankraftparameter," wt Werkstattstechnik online, vol. 102, no. 11, pp. 789–794, 2012.

[9] L. Bolliger, "Digital estimation of continuous-time signals using factor graphs," Ph.D. dissertation, ETH - Swiss Federal Institute of Technology, 2012.

[10] L. Bolliger, H.-A. Loeliger, and C. Vogel, "LMMSE estimation and interpolation of continuous-time signals from discrete-time samples using factor graphs," arXiv preprint abs/1301.4793, 2013.

[11] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2009.

[12] G. Casella and R. L. Berger, Statistical Inference. Duxbury, Pacific Grove, CA, 2001, vol. 2.

[13] V. Cevher, "Learning with compressible priors," in Advances in Neural Information Processing Systems, 2009, pp. 261–269.

[14] R. Chartrand and W. Yin, "Iteratively reweighted algorithms for compressive sensing," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2008, pp. 3869–3872.

[15] I. Daubechies, R. DeVore, M. Fornasier, and C. S. Güntürk, "Iteratively reweighted least squares minimization for sparse recovery," Communications on Pure and Applied Mathematics, vol. 63, no. 1, pp. 1–38, 2010.

[16] J. Dauwels, A. Eckford, S. Korl, and H.-A. Loeliger, "Expectation maximization as message passing - part I: Principles and Gaussian messages," arXiv preprint abs/0910.2832, 2009.

[17] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B (Methodological), pp. 1–38, 1977.

[18] J. Durbin and S. J. Koopman, Time Series Analysis by State Space Methods. Oxford University Press, 2012, no. 38.

[19] S. Farahmand, G. B. Giannakis, and D. Angelosante, "Doubly robust smoothing of dynamical processes via outlier sparsity constraints," IEEE Trans. on Signal Processing, vol. 59, no. 10, pp. 4529–4543, 2011.


[20] A. C. Faul and M. E. Tipping, "Fast marginal likelihood maximisation for sparse Bayesian models," in Proc. of the Ninth International Workshop on Artificial Intelligence and Statistics, Key West, FL, Jan. 3–6, 2003.

[21] R. Fergus, B. Singh, A. Hertzmann, S. T. Roweis, and W. T. Freeman, "Removing camera shake from a single photograph," in ACM Trans. on Graphics (TOG), vol. 25, no. 3. ACM, 2006, pp. 787–794.

[22] C. Févotte and S. Godsill, "A Bayesian approach for blind separation of sparse sources," IEEE Trans. on Audio, Speech, and Language Processing, vol. 14, no. 6, pp. 2174–2188, Nov. 2006.

[23] M. A. Figueiredo, "Adaptive sparseness for supervised learning," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 25, no. 9, pp. 1150–1159, Sept. 2003.

[24] Z. Ghahramani and G. E. Hinton, "Parameter estimation for linear dynamical systems," Technical Report CRG-TR-96-2, University of Toronto, Dept. of Computer Science, 1996.

[25] S. Gibson and B. Ninness, "Robust maximum-likelihood estimation of multivariable dynamic systems," Automatica, vol. 41, no. 10, pp. 1667–1682, 2005.

[26] S. Gillijns and B. De Moor, "Unbiased minimum-variance input and state estimation for linear discrete-time systems," Automatica, vol. 43, no. 1, pp. 111–116, 2007.

[27] F. Girardin, D. Remond, and J.-F. Rigal, "High Frequency Correction of Dynamometer for Cutting Force Observation in Milling," Journal of Manufacturing Science and Engineering, vol. 132, 2010.

[28] G. H. Golub and C. F. Van Loan, Matrix Computations. Johns Hopkins University Press, 2012, vol. 3.

[29] G. Golub and U. von Matt, "Quadratically constrained least squares and quadratic problems," Numerische Mathematik, vol. 59, no. 1, pp. 561–580, 1991.

[30] R. Gribonval, "Should penalized least squares regression be interpreted as maximum a posteriori estimation?" IEEE Trans. on Signal Processing, vol. 59, no. 5, pp. 2405–2410, 2011.


[31] R. Gribonval, V. Cevher, and M. E. Davies, "Compressible distributions for high-dimensional statistics," IEEE Trans. on Information Theory, vol. 58, no. 8, pp. 5016–5034, 2012.

[32] T. Gudmundsson, C. Kenney, and A. Laub, "Scaling of the discrete-time algebraic Riccati equation to enhance stability of the Schur solution method," IEEE Trans. on Automatic Control, vol. 37, no. 4, pp. 513–518, Apr. 1992.

[33] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning. Springer, 2009, vol. 2, no. 1.

[34] R. A. Horn and C. R. Johnson, Matrix Analysis. Cambridge University Press, 2012.

[35] D. R. Hunter and K. Lange, "A tutorial on MM algorithms," The American Statistician, vol. 58, no. 1, pp. 30–37, 2004.

[36] S. Ji, D. Dunson, and L. Carin, "Multitask compressive sensing," IEEE Trans. on Signal Processing, vol. 57, no. 1, pp. 92–106, 2009.

[37] S. Ji, Y. Xue, and L. Carin, "Bayesian compressive sensing," IEEE Trans. on Signal Processing, vol. 56, no. 6, pp. 2346–2356, 2008.

[38] T. Kailath, B. Hassibi, and A. H. Sayed, Linear Estimation. Prentice-Hall, 2000.

[39] S. M. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory. Upper Saddle River, NJ: Prentice-Hall, 1993.

[40] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger, "Factor graphs and the sum-product algorithm," IEEE Trans. on Information Theory, vol. 47, no. 2, pp. 498–519, 2001.

[41] A. Levin, Y. Weiss, F. Durand, and W. T. Freeman, "Understanding and evaluating blind deconvolution algorithms," in IEEE Conf. on Computer Vision and Pattern Recognition, 2009, pp. 1964–1971.

[42] A. Lewandowski, C. Liu, S. Vander Wiel et al., "Parameter expansion and efficient inference," Statistical Science, vol. 25, no. 4, pp. 533–544, 2010.

[43] L. Ljung, System Identification: Theory for the User. Pearson Education, 1998.


[44] H.-A. Loeliger, "An introduction to factor graphs," IEEE Signal Processing Magazine, vol. 21, no. 1, pp. 28–41, Jan. 2004.

[45] ——, Signal and Information Processing: Modeling, Filtering, Learning. ISI, 2013.

[46] H.-A. Loeliger, L. Bolliger, G. Wilckens, and J. Biveroni, "Analog-to-digital conversion using unstable filters," in Information Theory and Applications Workshop, 2011, pp. 1–4.

[47] H.-A. Loeliger, J. Dauwels, J. Hu, S. Korl, L. Ping, and F. Kschischang, "The factor graph approach to model-based signal processing," Proc. of the IEEE, vol. 95, no. 6, pp. 1295–1322, June 2007.

[48] T. A. Louis, "Finding the observed information matrix when using the EM algorithm," Journal of the Royal Statistical Society, Series B (Methodological), pp. 226–233, 1982.

[49] D. J. MacKay, "Bayesian methods for backpropagation networks," in Models of Neural Networks III. Springer, 1996, pp. 211–254.

[50] M. Magnevall, M. Lundblad, K. Ahlin, and G. Broman, "High Frequency Measurements of Cutting Forces in Milling by Inverse Filtering," Machining Science and Technology, vol. 16, no. 4, pp. 487–500, 2012.

[51] G. McLachlan and T. Krishnan, The EM Algorithm and Extensions. John Wiley & Sons, 2007, vol. 382.

[52] J. A. Palmer, D. P. Wipf, K. Kreutz-Delgado, and B. D. Rao, "Variational EM algorithms for non-Gaussian latent variable models," in Advances in Neural Information Processing Systems 18. MIT Press, 2006, pp. 1059–1066.

[53] P. G. Park and T. Kailath, "New square-root smoothing algorithms," IEEE Trans. on Automatic Control, vol. 41, no. 5, pp. 727–732, May 1996.

[54] S. S. Park and Y. Altintas, "Dynamic compensation of spindle integrated force sensors with Kalman filter," Journal of Dynamic Systems, Measurement, and Control, vol. 126, no. 3, pp. 443–452, 2004.

[55] S. Park and Y. Altintas, "Dynamic Compensation of Spindle Integrated Force Sensors With Kalman Filter," Journal of Dynamic Systems, Measurement, and Control, vol. 126, 2004.


[56] N. Pedersen, C. Manchón, M. Badiu, D. Shutin, and B. Fleury, "Sparse estimation using Bayesian hierarchical prior modeling for real and complex linear models," Signal Processing, 2015.

[57] K. B. Petersen and M. S. Pedersen, "The matrix cookbook," Tech. Rep., Nov. 2012.

[58] C. Rasmussen and C. Williams, Gaussian Processes for Machine Learning. MIT Press, 2006.

[59] C. Reller, "State-space methods in statistical signal processing," Ph.D. dissertation, ETH - Swiss Federal Institute of Technology, 2013.

[60] M. Safonov and R. Chiang, "A Schur method for balanced-truncation model reduction," IEEE Trans. on Automatic Control, vol. 34, no. 7, pp. 729–733, 1989.

[61] R. Salakhutdinov, S. Roweis, and Z. Ghahramani, "Optimization with EM and expectation-conjugate-gradient," in ICML, 2003, pp. 672–679.

[62] M. Seeger and D. P. Wipf, "Variational Bayesian inference techniques," IEEE Signal Processing Magazine, vol. 27, no. 6, pp. 81–91, 2010.

[63] R. H. Shumway and D. S. Stoffer, "An approach to time series smoothing and forecasting using the EM algorithm," Journal of Time Series Analysis, vol. 3, no. 4, pp. 253–264, 1982.

[64] D. Shutin, T. Buchgraber, S. R. Kulkarni, and H. V. Poor, "Fast variational sparse Bayesian learning with automatic relevance determination for superimposed signals," IEEE Trans. on Signal Processing, vol. 59, no. 12, pp. 6257–6261, 2011.

[65] D. Shutin and B. H. Fleury, "Sparse variational Bayesian SAGE algorithm with application to the estimation of multipath wireless channels," IEEE Trans. on Signal Processing, vol. 59, no. 8, pp. 3609–3623, 2011.

[66] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society, Series B, vol. 58, pp. 267–288, 1996.


[67] M. E. Tipping, "The relevance vector machine," in Advances in Neural Information Processing Systems, vol. 12. MIT Press, 2000, pp. 652–658.

[68] P. Van Overschee and B. De Moor, Subspace Identification for Linear Systems: Theory, Implementation, Applications. Kluwer Academic Publishers, 1996.

[69] M. Verhaegen and P. Van Dooren, "Numerical aspects of different Kalman filter implementations," IEEE Trans. on Automatic Control, vol. 31, no. 10, pp. 907–917, 1986.

[70] M. Verhaegen and V. Verdult, Filtering and System Identification: A Least Squares Approach. Cambridge University Press, 2007.

[71] M. Vetterli, P. Marziliano, and T. Blu, "Sampling signals with finite rate of innovation," IEEE Trans. on Signal Processing, vol. 50, no. 6, pp. 1417–1428, 2002.

[72] F. Wadehn, L. Bruderer, D. Waltisberg, T. Keresztfalvi, and H.-A. Loeliger, "Sparse-input detection algorithm with applications in electrocardiography and ballistocardiography," International Conf. on Bio-inspired Systems and Signal Processing, 2015.

[73] G. Wahba, Spline Models for Observational Data. SIAM, 1990, vol. 59.

[74] G. Wilckens, "A new perspective on analog-to-digital conversion of continuous-time signals," Ph.D. dissertation, ETH - Swiss Federal Institute of Technology, 2013.

[75] A. Wills and B. Ninness, "On gradient-based search for multivariable system estimates," IEEE Trans. on Automatic Control, vol. 53, no. 1, pp. 298–306, 2008.

[76] D. Wipf and S. Nagarajan, "Iterative reweighted $\ell_1$ and $\ell_2$ methods for finding sparse solutions," IEEE Journal of Selected Topics in Signal Processing, vol. 4, no. 2, pp. 317–329, 2010.

[77] D. P. Wipf, "Sparse estimation with structured dictionaries," in Advances in Neural Information Processing Systems, 2011, pp. 2016–2024.

[78] D. P. Wipf and S. S. Nagarajan, "A new view of automatic relevance determination," in Advances in Neural Information Processing Systems, 2008, pp. 1625–1632.


[79] D. P. Wipf, B. D. Rao, and S. Nagarajan, "Latent variable Bayesian models for promoting sparsity," IEEE Trans. on Information Theory, vol. 57, no. 9, pp. 6236–6255, 2011.

[80] D. Wipf and B. Rao, "Sparse Bayesian learning for basis selection," IEEE Trans. on Signal Processing, vol. 52, no. 8, pp. 2153–2164, Aug. 2004.

[81] C. J. Wu, "On the convergence properties of the EM algorithm," The Annals of Statistics, pp. 95–103, 1983.

[82] T. T. Wu and K. Lange, "The MM alternative to EM," Statistical Science, vol. 25, no. 4, pp. 492–505, 2010.

[83] L. Xu and M. I. Jordan, "On convergence properties of the EM algorithm for Gaussian mixtures," Neural Computation, vol. 8, no. 1, pp. 129–151, 1996.

[84] M. Yuan and Y. Lin, "Model selection and estimation in regression with grouped variables," Journal of the Royal Statistical Society, Series B (Statistical Methodology), vol. 68, no. 1, pp. 49–67, 2006.

[85] M. Zibulevsky and B. Pearlmutter, "Blind source separation by sparse decomposition in a signal dictionary," Neural Computation, vol. 13, no. 4, pp. 863–882, 2001.


About the Author

Lukas Bruderer was born in Switzerland in 1985. He passed the Matura in Trogen AR and received his Dipl. El.-Ing. (MSc ETH EEIT) degree from ETH Zurich, Switzerland, in 2009.
In 2007, he was first an exchange student (exchange studies scholarship) and later a visiting scholar at Northwestern University, Chicago IL, USA. During his studies he also worked as an intern for Huber+Suhner AG.
From 2009 to 2011, he was a research assistant with the Integrated Systems Laboratory (IIS) at ETH Zurich. In 2010 he was also with Disney Research. Since 2011, he has been with the Signal and Information Processing Laboratory (ISI) at ETH Zurich.



Series in Signal and Information Processing
edited by Hans-Andrea Loeliger

Vol. 1: Hanspeter Schmid, Single-Amplifier Biquadratic MOSFET-C Filters. ISBN 3-89649-616-6
Vol. 2: Felix Lustenberger, On the Design of Analog VLSI Iterative Decoders. ISBN 3-89649-622-0
Vol. 3: Peter Theodor Wellig, Zerlegung von Langzeit-Elektromyogrammen zur Prävention von arbeitsbedingten Muskelschäden. ISBN 3-89649-623-9
Vol. 4: Thomas P. von Hoff, On the Convergence of Blind Source Separation and Deconvolution. ISBN 3-89649-624-7
Vol. 5: Markus Erne, Signal Adaptive Audio Coding using Wavelets and Rate Optimization. ISBN 3-89649-625-5
Vol. 6: Marcel Joho, A Systematic Approach to Adaptive Algorithms for Multichannel System Identification, Inverse Modeling, and Blind Identification. ISBN 3-89649-632-8
Vol. 7: Heinz Mathis, Nonlinear Functions for Blind Separation and Equalization. ISBN 3-89649-728-6
Vol. 8: Daniel Lippuner, Model-Based Step-Size Control for Adaptive Filters. ISBN 3-89649-755-3
Vol. 9: Ralf Kretzschmar, A Survey of Neural Network Classifiers for Local Wind Prediction. ISBN 3-89649-798-7
Vol. 10: Dieter M. Arnold, Computing Information Rates of Finite State Models with Application to Magnetic Recording. ISBN 3-89649-852-5
Vol. 11: Pascal O. Vontobel, Algebraic Coding for Iterative Decoding. ISBN 3-89649-865-7
Vol. 12: Qun Gao, Fingerprint Verification using Cellular Neural Networks. ISBN 3-89649-894-0
Vol. 13: Patrick P. Merkli, Message-Passing Algorithms and Analog Electronic Circuits. ISBN 3-89649-987-4
Vol. 14: Markus Hofbauer, Optimal Linear Separation and Deconvolution of Acoustical Convolutive Mixtures. ISBN 3-89649-996-3
Vol. 15: Sascha Korl, A Factor Graph Approach to Signal Modelling, System Identification and Filtering. ISBN 3-86628-032-7
Vol. 16: Matthias Frey, On Analog Decoders and Digitally Corrected Converters. ISBN 3-86628-074-2
Vol. 17: Justin Dauwels, On Graphical Models for Communications and Machine Learning: Algorithms, Bounds, and Analog Implementation. ISBN 3-86628-080-7
Vol. 18: Volker Maximillian Koch, A Factor Graph Approach to Model-Based Signal Separation. ISBN 3-86628-140-4
Vol. 19: Junli Hu, On Gaussian Approximations in Message Passing Algorithms with Application to Equalization. ISBN 3-86628-212-5
Vol. 20: Maja Ostojic, Multitree Search Decoding of Linear Codes. ISBN 3-86628-363-6
Vol. 21: Murti V.R.S. Devarakonda, Joint Matched Filtering, Decoding, and Timing Synchronization. ISBN 3-86628-417-9
Vol. 22: Lukas Bolliger, Digital Estimation of Continuous-Time Signals Using Factor Graphs. ISBN 3-86628-432-2
Vol. 23: Christoph Reller, State-Space Methods in Statistical Signal Processing: New Ideas and Applications. ISBN 3-86628-447-0
Vol. 24: Jonas Biveroni, On A/D Converters with Low-Precision Analog Circuits and Digital Post-Correction. ISBN 3-86628-452-7
Vol. 25: Georg Wilckens, A New Perspective on Analog-to-Digital Conversion of Continuous-Time Signals. ISBN 3-86628-469-1
Vol. 26: Jiun-Hung Yu, A Partial-Inverse Approach to Decoding Reed-Solomon Codes and Polynomial Remainder Codes. ISBN 3-86628-527-2

Hartung-Gorre Verlag Konstanz http://www.hartung-gorre.de