
Multi Sensor Data Fusion

Hugh Durrant-Whyte
Australian Centre for Field Robotics
The University of Sydney NSW 2006, Australia

[email protected]

January 22, 2001

Version 1.2

© Hugh Durrant-Whyte 2001



Contents

1 Introduction
   1.1 Data Fusion Methods
   1.2 Bibliography

2 Probabilistic Data Fusion
   2.1 Probabilistic Models
   2.2 Probabilistic Methods
      2.2.1 Bayes Theorem
      2.2.2 Data Fusion using Bayes Theorem
      2.2.3 Recursive Bayes Updating
      2.2.4 Data Dependency and Bayes Networks
      2.2.5 Distributed Data Fusion with Bayes Theorem
      2.2.6 Data Fusion with Log-Likelihoods
   2.3 Information Measures
      2.3.1 Entropic Information
      2.3.2 Conditional Entropy
      2.3.3 Mutual Information
      2.3.4 Fisher Information
      2.3.5 The Relation between Shannon and Fisher Measures
   2.4 Alternatives to Probability
      2.4.1 Interval Calculus
      2.4.2 Fuzzy Logic
      2.4.3 Evidential Reasoning

3 Multi-Sensor Estimation
   3.1 The Kalman Filter
      3.1.1 State and Sensor Models
      3.1.2 The Kalman Filter Algorithm
      3.1.3 The Innovation
      3.1.4 Understanding the Kalman Filter
      3.1.5 Steady-State Filters
      3.1.6 Asynchronous, Delayed and Asequent Observations
      3.1.7 The Extended Kalman Filter
   3.2 The Multi-Sensor Kalman Filter
      3.2.1 Observation Models
      3.2.2 The Group-Sensor Method
      3.2.3 The Sequential-Sensor Method
      3.2.4 The Inverse-Covariance Form
      3.2.5 Track-to-Track Fusion
   3.3 Non-linear Data Fusion Methods
      3.3.1 Likelihood Estimation Methods
      3.3.2 The Particle Filter
      3.3.3 The Sum-of-Gaussians Method
      3.3.4 The Distribution Approximation Filter (DAF)
   3.4 Multi-Sensor Data Association
      3.4.1 The Nearest-Neighbour Standard Filter
      3.4.2 The Probabilistic Data Association Filter
      3.4.3 The Multiple Hypothesis Tracking (MHT) Filter
      3.4.4 Data Association in Track-to-Track Fusion
      3.4.5 Maintaining and Managing Track Files

4 Distributed and Decentralised Data Fusion Systems
   4.1 Data Fusion Architectures
      4.1.1 Hierarchical Data Fusion Architectures
      4.1.2 Distributed Data Fusion Architectures
      4.1.3 Decentralised Data Fusion Architectures
   4.2 Decentralised Estimation
      4.2.1 The Information Filter
      4.2.2 The Information Filter and Bayes Theorem
      4.2.3 The Information Filter in Multi-Sensor Estimation
      4.2.4 The Hierarchical Information Filter
      4.2.5 The Decentralised Information Filter
   4.3 Decentralised Multi-Target Tracking
      4.3.1 Decentralised Data Association
      4.3.2 Decentralised Identification and Bayes Theorem
   4.4 Communication in Decentralised Sensing Systems
      4.4.1 Fully Connected and Broadcast Sensor Networks
      4.4.2 Identification of Redundant Information in Sensor Networks
      4.4.3 Bayesian Communication in Sensor Networks
      4.4.4 The Channel Filter
      4.4.5 Delayed, Asequent and Burst Communication
      4.4.6 Management of Channel Operations
      4.4.7 Communication Topologies
   4.5 Advanced Methods in Decentralised Data Fusion

5 Making Decisions
   5.1 Actions, Loss and Utility
      5.1.1 Utility or Loss
      5.1.2 Expected Utility or Loss
      5.1.3 Bayesian Decision Theory
   5.2 Decision Making with Multiple Information Sources
      5.2.1 The Super Bayesian
      5.2.2 Multiple Bayesians
      5.2.3 Nash Solutions


1 Introduction

Data fusion is the process of combining information from a number of different sources to provide a robust and complete description of an environment or process of interest. Data fusion is of special significance in any application where large amounts of data must be combined, fused and distilled to obtain information of appropriate quality and integrity on which decisions can be made. Data fusion finds application in many military systems, in civilian surveillance and monitoring tasks, in process control and in information systems. Data fusion methods are particularly important in the drive toward autonomous systems in all these applications. In principle, automated data fusion processes allow essential measurements and information to be combined to provide knowledge of sufficient richness and integrity that decisions may be formulated and executed autonomously.

This course provides a practical introduction to data fusion methods. It aims to introduce key concepts in multi-sensor modeling, estimation, and fusion. The focus of the course is on mathematical methods in probabilistic and estimation-theoretic data fusion. Associated with this course is a series of computer-based laboratories which provide the opportunity to implement and evaluate multi-sensor tracking, estimation and identification algorithms.

Data fusion is often (somewhat arbitrarily) divided into a hierarchy of four processes. Levels 1 and 2 of this hierarchy are concerned with the formation of track, identity, or estimate information and the fusion of this information from several sources. Level 1 and 2 fusion is thus generally concerned with numerical information and numerical fusion methods (such as probability theory or Kalman filtering). Levels 3 and 4 of the data fusion process are concerned with the extraction of "knowledge" or decisional information. Very often this includes qualitative reporting, or secondary sources of information or knowledge from human operators or other sources. Level 3 and 4 fusion is thus concerned with the extraction of high-level knowledge (situation awareness, for example) from low-level fusion processes, the incorporation of human judgment, and the formulation of decisions and actions. This hierarchy is not, by any means, the only way of considering the general data fusion problem. It is perhaps appropriate for many military data fusion scenarios, but is singularly inappropriate for many autonomous systems or information fusion problems. The imposition of a "hierarchical" structure on the problem at the outset can also serve to mislead the study of distributed, decentralised and network-centric data fusion structures. Nevertheless, the separate identification of numerical problems (tracking, identification and estimation) from decisional and qualitative problems (situation awareness, qualitative reporting and threat assessment) is of practical value.

This course focuses mainly on level 1-2 type data fusion problems. These are concerned with the fusion of information from sensors and other sources to arrive at an estimate of the location and identity of objects in an environment. This encompasses both the direct fusion of sensor information and the indirect fusion of estimates obtained from local fusion centers. The primary methods in level 1-2 fusion are probabilistic. These include multi-target tracking, track-to-track fusion, and distributed data fusion methods. Level 3-4 data fusion problems are considered in less detail. These involve the modeling of qualitative information sources, the use of non-probabilistic methods in describing uncertainty, and general decision making processes. Level 3-4 data fusion, obviously, builds on level 1-2 methods.

1.1 Data Fusion Methods

In any data fusion problem, there is an environment, process or quantity whose true value, situation or state is unknown. It would be unreasonable to expect that there is some single source of perfect and complete knowledge about the problem of interest, and so information must be obtained indirectly from sources which provide imperfect and incomplete knowledge, using these to infer the information needed. There may well be many sources of information that could be used to help obtain the required knowledge: some effect characteristic of a particular state may be observed, prior beliefs about possible states may be available, or knowledge about certain constraints and relations may exist. To use the available information to its maximum effect it is important to describe precisely the way in which this information relates to the underlying state of the world.

In sensor data fusion problems these concepts of 'world', 'state', 'information' and 'observation' can be made precise:

• The quantity of interest will be denoted by x. This quantity may describe an environment, a process, a statement or a single number. Employing the terminology of decision theory, the quantity x will be called the state of nature or simply 'the state'. The state of nature completely describes the world of interest. The state of nature can take on a variety of values, all of which are contained in the set X of possible states of nature: x ∈ X. The current state of the world is simply a single element of this set. An environment model comprises this set of possible states together with any knowledge of how elements of this set are related.

• To obtain information about the state of nature, an experiment or observation is conducted, yielding a quantity z. The observation can take on a variety of values, all of which are contained in the sample space Z. The information obtained by observation is a single realization z ∈ Z from this set. If the information we obtain through observation is to be of any value, we must know how this information depends on the true state of nature. That is, for each specific state of nature x ∈ X, an observation model is required that describes what observations we will make: z = z(x) ∈ Z. The observation model describes precisely the sensing process.

• Given the information obtained through observation z, the goal of the data fusion process is to infer an underlying state of nature x. To do this we describe a function or decision rule δ which maps observations to states of nature: δ(z) → x ∈ X. The decision model must incorporate information about the nature of the observation process, prior beliefs about the world, and a measure of the value placed on accuracy and error in the state of the world. The decision rule is essential in data fusion problems; it is the function that takes in all information and produces a single decision and resulting action.


These three elements (the state, the observation model and the decision rule) are the essential components of the data fusion problem.

The most important problem in data fusion is the development of appropriate models of uncertainty associated with both the state and the observation process. This course begins in Section 2 by describing key methods for representing and reasoning about uncertainty. The focus is on the use of probabilistic and information-theoretic methods for sensor modeling and for data fusion. Probability theory provides an extensive and unified set of methods for describing and manipulating uncertainty. It is demonstrated how basic probability models can be used to fuse information, to describe different data fusion architectures, and to manage sensors and sensor information. Many of the data fusion methods described in the remainder of this course are based on these probabilistic methods. However, probabilistic modeling methods do have limitations. Alternative methods for describing uncertainty, including fuzzy logic, evidential reasoning and qualitative reasoning methods, are briefly considered. In certain circumstances one of these alternative techniques may provide a better method of describing uncertainty than probabilistic models.

Estimation is the single most important problem in sensor data fusion. Fundamentally, an estimator is a decision rule which takes as an argument a sequence of observations and whose action is to compute a value for the parameter or state of interest. Almost all data fusion problems involve this estimation process: we obtain a number of observations from a group of sensors and, using this information, we wish to find some estimate of the true state of the environment we are observing. In particular, estimation encompasses problems such as multi-target tracking. Section 3 of this course focuses on the multi-sensor estimation problem. The Kalman filter serves as the basis for most multi-sensor estimation problems. The multi-sensor Kalman filter is introduced and various formulations of it are discussed. The problem of asynchronous and delayed data is described and solved. Alternatives to the Kalman filter, such as likelihood filters, are discussed. The problem of data association (relating observations to tracks) is introduced. The three key methods for data association, namely the nearest neighbour filter, probabilistic data association, and the multiple hypothesis and track splitting filters, are described. Finally, the track-to-track fusion algorithm is described as a method for combining estimates generated by local estimators.

An increasingly important facet of data fusion is the design of system architecture. As the complexity of data fusion systems increases, and as more sensors and more information sources are incorporated, so the need for distributed data fusion architectures increases. While many block-diagrams of distributed data fusion systems exist, there are few mathematical methods for manipulating and fusing information in these architectures. Section 4 describes the use of information-theoretic estimation methods in distributed and decentralised data fusion systems. It is demonstrated how multiple-target tracking and identification problems can be effectively decentralised amongst a network of sensor nodes. A detailed study of the required communication algorithms is performed. This focuses on issues of delayed, intermittent, and limited-bandwidth communications. Advanced areas of interest, including sensor management, model distribution and organisation control, are also briefly described.

The final component in data fusion is the need to make decisions on the basis of information obtained from fusion. These decisions encompass the allocation and management of sensing resources, the identification of specific situations, and the control or deployment of sensing platforms. Section 5 briefly introduces key elements of decision theory methods. These include ideas of utility, risk, and robustness. Some applications in sensor management are described.

1.2 Bibliography

Data fusion is a generally poorly defined area of study because data fusion methods are based on techniques established in many other diverse fields. As a consequence, many data fusion books tend to be either quite broad summaries of methods or more qualitative discussions of architectures and principles for specific applications. Some of the key recommended books are described here:

Blackman: The recent book by Blackman and Popoli [9] is probably the most comprehensive book on data fusion methods. It covers both level 1-2 multi-target tracking and identification problems as well as level 3-4 methods in situation assessment and sensor management. Notably, it covers a number of current military systems in some detail and develops a number of specific examples of multi-sensor systems.

Waltz and Llinas: The book by Waltz and Llinas [43] has become something of a classic in the field. It was one of the first books that attempted to define and establish the field of data fusion. The book consists mainly of qualitative descriptions of various data fusion systems (with an emphasis on military systems). Some quantitative methods are introduced, but they are described only at a superficial level. However, the general overview provided is still of significant value.

Bar-Shalom: Yaakov Bar-Shalom has written and edited a number of books on data fusion and tracking. Of note are his classic book [7] on tracking and data association, and his two edited volumes on data fusion [5, 6]. Together these are reasonably definitive in terms of current tracking and data association methods.

Artech House series: Artech House (www.artechhouse.com) is a specialist publishing house in the radar, data fusion and military systems area. It publishes by far the most important and advanced technical books in the data fusion area.

Additional references are provided at the end of these notes.


2 Probabilistic Data Fusion

Uncertainty lies at the heart of all descriptions of the sensing and data fusion process. An explicit measure of this uncertainty must be provided to enable sensory information to be fused in an efficient and predictable manner. Although there are many methods of representing uncertainty, almost all of the theoretical developments in this course are based on the use of probabilistic models. There is a huge wealth of experience and methods associated with the development of probabilistic models of environments and information. Probabilistic models provide a powerful and consistent means of describing uncertainty in a broad range of situations and lead naturally into ideas of information fusion and decision making.

It could be, and indeed often is, argued that probability is the only rational way to model uncertainty. However, the practical realities of a problem often suggest alternative uncertainty modeling methods. Probabilistic models are good at many things, but it is often the case that such a model cannot capture all the information needed to define and describe the operation of a sensing and data fusion system. For example, it would be difficult to imagine using probabilistic models alone to capture the uncertainty and error introduced by the dependence of, say, an active electro-magnetic sensor's information on beam-width and radiation frequency effects. Alternative methods of modeling uncertainty are briefly described in Section 2.4.

Even with these limitations, probabilistic modeling techniques play an essential role in developing data fusion methods. Almost all conventional data fusion algorithms have probabilistic models as a central component in their development. It would be impossible to talk about data fusion in any coherent manner without first obtaining a thorough understanding of probabilistic modeling methods.

This section begins by briefly introducing the essential elements of probabilistic modeling: probability densities, conditional densities, and Bayes theorem. With these basic elements, the basic data fusion problem is described and it is shown how information can be combined in a probabilistic manner. From this, various data fusion architectures are described, together with how they are implemented mathematically. The ideas of information and entropy are also introduced as a natural way of describing the flow of 'information' in different data fusion architectures. These tools, together with the Kalman filter, are the underlying basis for all the data fusion methods developed in this course.

2.1 Probabilistic Models

In the following, familiarity with essential probability theory is assumed, and some simple notation and rules to be used throughout this course are introduced. A probability density function (pdf) Py(·) is defined on a random variable y, generally written as Py(y), or simply P(y) when the dependent variable is obvious. The random variable may be a scalar or vector quantity, and may be either discrete or continuous in measure.

The pdf is considered as a (probabilistic) model of the quantity y (observation or state). The pdf P(y) is considered valid if:


1. It is non-negative: P(y) ≥ 0 for all y, and

2. It sums (integrates) to a total probability of 1:

$$\int_{\mathbf{y}} P(\mathbf{y})\, d\mathbf{y} = 1.$$

The joint distribution Pxy(x, y) is defined in a similar manner. Integrating the pdf Pxy(x, y) over the variable x gives the marginal pdf Py(y) as

$$P_{\mathbf{y}}(\mathbf{y}) = \int_{\mathbf{x}} P_{\mathbf{xy}}(\mathbf{x}, \mathbf{y})\, d\mathbf{x}, \tag{1}$$

and similarly integrating over y gives the marginal pdf Px(x). The joint pdf over n variables, P(x1, · · · , xn), may also be defined with analogous properties to the joint pdf of two variables.

The conditional pdf P(x | y) is defined by

$$P(\mathbf{x} \mid \mathbf{y}) = \frac{P(\mathbf{x}, \mathbf{y})}{P(\mathbf{y})}, \tag{2}$$

and has the usual properties of a pdf, with x the dependent variable given that y takes on specific fixed values. The conditional pdf P(y | x) is similarly defined.

The chain-rule of conditional distributions can be used to expand a joint pdf in terms of conditional and marginal distributions. From Equation 2,

$$P(\mathbf{x}, \mathbf{y}) = P(\mathbf{x} \mid \mathbf{y}) P(\mathbf{y}). \tag{3}$$

The chain-rule can be extended to any number of variables in the following form:

$$P(\mathbf{x}_1, \cdots, \mathbf{x}_n) = P(\mathbf{x}_1 \mid \mathbf{x}_2, \cdots, \mathbf{x}_n) \cdots P(\mathbf{x}_{n-1} \mid \mathbf{x}_n) P(\mathbf{x}_n), \tag{4}$$

where the expansion may be taken in any convenient order. Substitution of Equation 3 into Equation 1 gives an expression for the marginal distribution of one variable in terms of the marginal distribution of a second variable as

$$P_{\mathbf{y}}(\mathbf{y}) = \int_{\mathbf{x}} P_{\mathbf{y}\mid\mathbf{x}}(\mathbf{y} \mid \mathbf{x}) P_{\mathbf{x}}(\mathbf{x})\, d\mathbf{x}. \tag{5}$$

This important equation is known as the total probability theorem. It states that the total probability in a state y can be obtained by considering the ways in which y can occur given that the state x takes a specific value (this is encoded in the conditional density P(y | x)), weighted by the probability that each of these values of x is true (encoded in Px(x)).

If it happens that knowledge of the value of y does not give us any more information about the value of x, then x and y are said to be independent:

$$P(\mathbf{x} \mid \mathbf{y}) = P(\mathbf{x}). \tag{6}$$


With Equation 6 substituted into Equation 3,

$$P(\mathbf{x}, \mathbf{y}) = P(\mathbf{x}) P(\mathbf{y}). \tag{7}$$

A weaker form of independence can be defined through the important idea of conditional independence. Given three random variables x, y and z, the conditional distribution of x given both y and z is defined as P(x | y, z). If knowledge of the value of z makes the value of x independent of the value of y, then

$$P(\mathbf{x} \mid \mathbf{y}, \mathbf{z}) = P(\mathbf{x} \mid \mathbf{z}). \tag{8}$$

This may be the case, for example, if z indirectly contains all the information contributed by y to the value of x. Conditional independence can be exploited in a number of different ways. In particular, applying the chain-rule to the joint probability density function on three random variables x, y, and z,

$$P(\mathbf{x}, \mathbf{y}, \mathbf{z}) = P(\mathbf{x}, \mathbf{y} \mid \mathbf{z}) P(\mathbf{z}) = P(\mathbf{x} \mid \mathbf{y}, \mathbf{z}) P(\mathbf{y} \mid \mathbf{z}) P(\mathbf{z}), \tag{9}$$

together with the conditional independence result of Equation 8, gives the intuitive result

$$P(\mathbf{x}, \mathbf{y} \mid \mathbf{z}) = P(\mathbf{x} \mid \mathbf{z}) P(\mathbf{y} \mid \mathbf{z}). \tag{10}$$

That is, if x is independent of y given knowledge of z, then the joint probability density function of x and y conditioned on z is simply the product of the marginal distributions of x and y, each conditioned on z, analogously to Equation 7.

The idea of conditional independence underlies many of the data fusion algorithms developed in this course. Consider the state of a system x and two observations of this state, z1 and z2. It should be clear that the two observations are not independent,

$$P(\mathbf{z}_1, \mathbf{z}_2) \neq P(\mathbf{z}_1) P(\mathbf{z}_2),$$

as they must both depend on the common state x. Indeed, if the two observations were independent (were unrelated to each other), there would be little point fusing the information they contain! Conversely, it is quite reasonable to assume that the only thing the two observations have in common is the underlying state, and so the observations are independent once the state is known; that is, the observations are conditionally independent given the state:

$$P(\mathbf{z}_1, \mathbf{z}_2 \mid \mathbf{x}) = P(\mathbf{z}_1 \mid \mathbf{x}) P(\mathbf{z}_2 \mid \mathbf{x}).$$

Indeed, for the purposes of data fusion, this would not be a bad definition of the state: simply what the two information sources have in common.
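The distinction between unconditional and conditional independence can be made concrete numerically. The following short sketch (Python with NumPy; the probability tables are invented for illustration and are not part of the original notes) builds a joint distribution from conditionally independent sensor models and checks both properties:

```python
import numpy as np

# Hypothetical two-state model with two sensors; all numbers are invented.
P_x = np.array([0.7, 0.3])                 # prior P(x)
P_z1_gx = np.array([[0.9, 0.1],            # row i: P(z1 | x = i)
                    [0.2, 0.8]])
P_z2_gx = np.array([[0.6, 0.4],            # row i: P(z2 | x = i)
                    [0.3, 0.7]])

# Joint P(x, z1, z2) built from the conditional independence assumption.
joint = P_x[:, None, None] * P_z1_gx[:, :, None] * P_z2_gx[:, None, :]

# Marginally, the observations are NOT independent (they share the state x):
P_z1z2 = joint.sum(axis=0)
P_z1, P_z2 = P_z1z2.sum(axis=1), P_z1z2.sum(axis=0)
print(np.allclose(P_z1z2, np.outer(P_z1, P_z2)))                  # False

# But conditioned on each state they are independent, by construction:
for x in range(2):
    P_cond = joint[x] / joint[x].sum()                            # P(z1, z2 | x)
    print(np.allclose(P_cond, np.outer(P_z1_gx[x], P_z2_gx[x])))  # True
```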


2.2 Probabilistic Methods

2.2.1 Bayes Theorem

Bayes theorem is arguably the most important result in the study of probabilistic models. Consider two random variables x and z on which is defined a joint probability density function P(x, z). The chain-rule of conditional probabilities can be used to expand this density function in two ways:

$$P(\mathbf{x}, \mathbf{z}) = P(\mathbf{x} \mid \mathbf{z}) P(\mathbf{z}) = P(\mathbf{z} \mid \mathbf{x}) P(\mathbf{x}). \tag{11}$$

Rearranging in terms of one of the conditional densities, Bayes theorem is obtained:

$$P(\mathbf{x} \mid \mathbf{z}) = \frac{P(\mathbf{z} \mid \mathbf{x}) P(\mathbf{x})}{P(\mathbf{z})}. \tag{12}$$

The value of this result lies in the interpretation of the probability density functions P(x | z), P(z | x), and P(x). Suppose it is necessary to determine the various likelihoods of different values of an unknown state of nature x ∈ X. There may be prior beliefs about what values of x might be expected, encoded in the form of relative likelihoods in the prior probability density function P(x). To obtain more information about the state x, an observation z ∈ Z is made. The observations made are modeled as a conditional probability density function P(z | x) which describes, for each fixed state of nature x ∈ X, the likelihood that the observation z ∈ Z will be made: the probability of z given x. The new likelihoods associated with the state of nature x must now be computed from the original prior information and the information gained by observation. This is encoded in the posterior distribution P(x | z), which describes the likelihoods associated with x given the observation z. The marginal distribution P(z) simply serves to normalize the posterior. The value of Bayes theorem is now clear: it provides a direct means of combining observed information with prior beliefs about the state of the world. Unsurprisingly, Bayes theorem lies at the heart of many data fusion algorithms.

The conditional distribution P(z | x) serves the role of a sensor model. This distribution can be thought of in two ways. First, in building a sensor model, the distribution is constructed by fixing the value¹ of x = x and then asking what pdf in the variable z results. Thus, in this case, P(z | x) is considered as a distribution on z. For example, if we know the true range to a target (x), then P(z | x) is the distribution on the actual observation of this range. Conversely, once the sensor model exists, observations (numbers, not distributions) are made and z = z is fixed. From this, however, we want to infer the state x. Thus the distribution P(z | x) is now considered as a distribution in x. In this latter case, the distribution is known as the Likelihood Function, and the dependence on x is made clear by writing Λ(x) = P(z | x).

¹Random variables are denoted by a boldface font; specific values taken by the random variable are denoted by normal fonts.


In a practical implementation of Equation 12, P(z | x) is constructed as a function of both variables (or a matrix in discrete form). For each fixed value of x, a distribution in z is defined; therefore, as x varies, a family of distributions in z is created. The following two examples, one in continuous variables and one in discrete variables, make these ideas clear.

Example 1

Consider a continuous-valued state x, the range to a target for example, and an observation z of this state. A commonly used model for such an observation is one in which the observation made of the true state is Normally (Gaussian) distributed with mean x and variance σ²_z:

$$P(z \mid x) = \frac{1}{\sqrt{2\pi\sigma_z^2}} \exp\left( -\frac{1}{2} \frac{(z - x)^2}{\sigma_z^2} \right). \tag{13}$$

It should be clear that this is a simple function of both z and x. If we know the true value of the state, x = x, then the distribution is a function of z only, describing the probability of observing a particular value of range (Normally distributed around the true range x with variance σ²_z). Conversely, if we make a specific observation, z = z, then the distribution is a function of x only, describing the probability of the true range value (Normally distributed around the range observation z with variance σ²_z). In this case, the distribution is the Likelihood Function.

Now assume that we have some prior belief about the true state x encoded in a Gaussian prior as

$$P(x) = \frac{1}{\sqrt{2\pi\sigma_x^2}} \exp\left( -\frac{1}{2} \frac{(x - x_p)^2}{\sigma_x^2} \right).$$

Note that this is a function of the single variable x (with x_p fixed). Bayes theorem can be directly applied to combine this prior information with information from a sensor, modeled by Equation 13. First, an observation (technically, an experimental realisation), z, is made and instantiated in Equation 13. Then the prior and sensor model are multiplied together to produce a posterior distribution (which is a function of x only, and is sometimes referred to as the posterior likelihood) as

$$P(x \mid z) = C \frac{1}{\sqrt{2\pi\sigma_z^2}} \exp\left( -\frac{1}{2} \frac{(x - z)^2}{\sigma_z^2} \right) \cdot \frac{1}{\sqrt{2\pi\sigma_x^2}} \exp\left( -\frac{1}{2} \frac{(x - x_p)^2}{\sigma_x^2} \right) \tag{14}$$

$$= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2} \frac{(x - \bar{x})^2}{\sigma^2} \right), \tag{15}$$

where C is a constant independent of x chosen to ensure that the posterior is appropriately normalized, and x̄ and σ² are given by

$$\bar{x} = \frac{\sigma_x^2}{\sigma_x^2 + \sigma_z^2}\, z + \frac{\sigma_z^2}{\sigma_x^2 + \sigma_z^2}\, x_p,$$

and

$$\sigma^2 = \frac{\sigma_z^2 \sigma_x^2}{\sigma_z^2 + \sigma_x^2} = \left( \frac{1}{\sigma_z^2} + \frac{1}{\sigma_x^2} \right)^{-1}.$$

Thus the posterior is also Gaussian, with a mean that is the weighted average of the means of the original prior and likelihood, and with a variance equal to the parallel combination of the original variances.
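The weighted-mean and parallel-variance combination above is easy to verify numerically. A minimal sketch follows (Python; the function name fuse_gaussian and the sample numbers are invented for illustration, not part of the original notes):

```python
def fuse_gaussian(x_p, var_x, z, var_z):
    """Fuse a Gaussian prior N(x_p, var_x) with a Gaussian observation z
    of variance var_z; returns the posterior mean and variance of
    Equations 14-15 (weighted mean, parallel combination of variances)."""
    mean = (var_x / (var_x + var_z)) * z + (var_z / (var_x + var_z)) * x_p
    var = (var_z * var_x) / (var_z + var_x)   # == 1/(1/var_z + 1/var_x)
    return mean, var

# Prior range estimate 100 (variance 25), observation 104 (variance 16):
print(fuse_gaussian(100.0, 25.0, 104.0, 16.0))   # (~102.4, ~9.8)
```

Note that the posterior mean is pulled toward the observation in proportion to the prior variance, and the posterior variance is smaller than either input variance.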

Example 2

Consider a simple example of the application of Bayes theorem to estimating a discrete parameter on the basis of one observation and some prior information. The environment of interest is modeled by a single state x which can take on one of three values:

x1: x is a type 1 target.
x2: x is a type 2 target.
x3: No visible target.

A single sensor observes x and returns three possible values:

z1: Observation of a type 1 target.
z2: Observation of a type 2 target.
z3: No target observed.

The sensor model is described by the likelihood matrix P1(z | x):

      z1     z2     z3
x1    0.45   0.45   0.1
x2    0.45   0.45   0.1
x3    0.1    0.1    0.8

Note that this likelihood matrix is a function of both x and z. For a fixed value of the true state, it describes the probability of a particular observation being made (the rows of the matrix). When a specific observation is made, it describes a probability distribution over the values of true state (the columns) and is then the Likelihood Function Λ(x).

The posterior distribution of the true state x after making an observation z = z_i is given by

$$P(x \mid z_i) = \alpha\, P_1(z_i \mid x) P(x),$$

where α is simply a normalizing constant set by requiring the sum, over x, of posterior probabilities to be equal to 1.

In the first instance we will assume that we do not have any prior information about the possible likelihood of target types 1 and 2, and so we set the prior probability vector to P(x) = (0.333, 0.333, 0.333). If we now observe z = z1, then clearly the posterior distribution will be given by P(x | z1) = (0.45, 0.45, 0.1) (i.e. the first column of the likelihood matrix above: the likelihood function given that z1 has been observed).

If we subsequently use this posterior distribution as the prior for a second observation, P(x) = (0.45, 0.45, 0.1), and we again make the observation z = z1, then the new posterior distribution will be given by

$$P(x \mid z_1) = \alpha P_1(z_1 \mid x) P(x) = \alpha \times (0.45, 0.45, 0.1) \otimes (0.45, 0.45, 0.1) = (0.488, 0.488, 0.024)$$

(where the notation ⊗ denotes an element-wise product).

Notice that the result of this integration process is to increase the probability of both type 1 and type 2 targets at the expense of the no-target hypothesis. Clearly, although this sensor is good at detecting targets, it is not good at distinguishing between targets of different types.
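The update in this example is straightforward to reproduce numerically. The following minimal sketch (Python with NumPy; it is not part of the original notes, and the names P1, bayes_update and prior are illustrative only) encodes the likelihood matrix above and applies the normalised product twice:

```python
import numpy as np

# Likelihood matrix P1(z | x): rows are states x1..x3, columns observations z1..z3.
P1 = np.array([[0.45, 0.45, 0.10],
               [0.45, 0.45, 0.10],
               [0.10, 0.10, 0.80]])

def bayes_update(prior, likelihood_matrix, z_index):
    """Posterior over states after observing z_index (0-based)."""
    posterior = likelihood_matrix[:, z_index] * prior
    return posterior / posterior.sum()        # normalisation is the constant alpha

prior = np.full(3, 1.0 / 3.0)                 # uniform prior
post1 = bayes_update(prior, P1, 0)            # observe z1 -> (0.45, 0.45, 0.10)
post2 = bayes_update(post1, P1, 0)            # observe z1 again -> (0.488, 0.488, 0.024)
print(post1, post2)
```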

2.2.2 Data Fusion using Bayes Theorem

It is possible to apply Bayes theorem directly to the integration of observations from several different sources. Consider the set of observations

$$\mathbf{Z}^n = \{ \mathbf{z}_1 \in \mathcal{Z}_1, \cdots, \mathbf{z}_n \in \mathcal{Z}_n \}.$$

It is desired to use this information to construct a posterior distribution P(x | Zⁿ) describing the relative likelihoods of the various values of the state of interest x ∈ X given the information obtained. In principle, Bayes theorem can be directly employed to compute this distribution function from

$$P(\mathbf{x} \mid \mathbf{Z}^n) = \frac{P(\mathbf{Z}^n \mid \mathbf{x}) P(\mathbf{x})}{P(\mathbf{Z}^n)} = \frac{P(\mathbf{z}_1, \cdots, \mathbf{z}_n \mid \mathbf{x}) P(\mathbf{x})}{P(\mathbf{z}_1, \cdots, \mathbf{z}_n)}. \tag{16}$$

In practice it would be difficult to do this because it requires that the joint distribution P(z1, · · · , zn | x) is known completely; that is, the joint distribution of all possible combinations of observations conditioned on the underlying state². However, it is usually quite reasonable to assume that, given the true state x ∈ X, the information obtained from the ith information source is independent of the information obtained from other sources. The validity of this assumption is discussed below. With this assumption, Equation 8 implies that

$$P(\mathbf{z}_i \mid \mathbf{x}, \mathbf{z}_1, \cdots, \mathbf{z}_{i-1}, \mathbf{z}_{i+1}, \cdots, \mathbf{z}_n) = P(\mathbf{z}_i \mid \mathbf{x}), \tag{17}$$

and from Equation 10 this gives

$$P(\mathbf{z}_1, \cdots, \mathbf{z}_n \mid \mathbf{x}) = P(\mathbf{z}_1 \mid \mathbf{x}) \cdots P(\mathbf{z}_n \mid \mathbf{x}) = \prod_{i=1}^{n} P(\mathbf{z}_i \mid \mathbf{x}). \tag{18}$$

²In Example 2 above, this would require the construction of a likelihood matrix of size mⁿ, where m is the number of possible outcomes for each observation and n is the number of observations made.


Substituting this back into Equation 16 gives

$$P(\mathbf{x} \mid \mathbf{Z}^n) = [P(\mathbf{Z}^n)]^{-1} P(\mathbf{x}) \prod_{i=1}^{n} P(\mathbf{z}_i \mid \mathbf{x}). \tag{19}$$

Thus the updated likelihood in the state, the posterior distribution on x, is simply proportional to the product of the prior likelihood and the individual likelihoods from each information source. The marginal distribution P(Zⁿ) simply acts as a normalising constant. Equation 19 provides a simple and direct mechanism for computing the relative likelihood of different values of a state from any number of observations or other pieces of information.

Equation 19 is known as the independent likelihood pool [8]. In practice, the conditional probabilities P(zi | x) are stored a priori as functions of both zi and x. When an observation sequence Zⁿ = {z1, z2, · · · , zn} is made, the observed values are instantiated in this probability distribution and likelihood functions Λi(x) are constructed, which are functions only of the unknown state x. The product of these likelihood functions with the prior information P(x), appropriately normalised, provides a posterior distribution P(x | Zⁿ), which is a function of x only for the specific observation sequence {z1, z2, · · · , zn}. Figure 1 shows the structure of the independent likelihood pool in a centralised architecture.

The effectiveness of Equation 19 relies crucially on the assumption that the information obtained from different information sources is independent when conditioned on the true underlying state of the world; this is the assumption defined in Equation 17. It is right to question whether this assumption is reasonable. It is clearly unreasonable to state that the information obtained is unconditionally independent,

$$P(\mathbf{z}_1, \cdots, \mathbf{z}_n) = P(\mathbf{z}_1) \cdots P(\mathbf{z}_n), \tag{20}$$

because each piece of information depends on a common underlying state x ∈ X. If the information obtained were independent of this state, and therefore unconditionally independent of other information sources, there would be little value in using it to improve knowledge of the state. It is precisely because the information obtained is unconditionally dependent on the underlying state that it has value as an information source. Conversely, it is generally quite reasonable to assume that the underlying state is the only thing the information sources have in common, and so once the state has been specified it is correspondingly reasonable to assume that the information gathered is conditionally independent given this state. There are sometimes exceptions to this general rule, particularly when the action of sensing has a non-trivial effect on the environment. More complex dependencies are considered in Section 2.2.4.

Figure 1: The centralised implementation of the independent likelihood pool as a method for combining information from a number of sources. The central processor maintains a model of each sensor i in terms of a conditional probability distribution Pi(zi | x), together with any prior probabilistic knowledge P(x). On arrival of a measurement set Zn, each sensor model is instantiated with the associated observation to form a likelihood Λi(x). The normalised product of these yields the posterior distribution P(x | Zn).

Example 3

Consider again Example 2 of the discrete observation of target type. A second sensor is obtained which makes the same three observations as the first sensor, but whose likelihood matrix P2(z2 | x) is described by:

      z1     z2     z3
x1    0.45   0.1    0.45
x2    0.1    0.45   0.45
x3    0.45   0.45   0.1

Whereas the first sensor was good at detecting targets but not at distinguishing between different target types, this second sensor has poor overall detection probabilities but good target discrimination capabilities. So, for example, with a uniform prior, if we observe z = z1 with this second sensor, the posterior distribution on possible true states will be given by P(x | z1) = (0.45, 0.1, 0.45) (i.e. the first column of the likelihood matrix).

It clearly makes sense to combine the information from both sensors to provide a system with both good detection and good discrimination capabilities. From Equation 19, the product of the two likelihood functions gives an overall likelihood function for the combined system as P12(z1, z2 | x) = P1(z1 | x)P2(z2 | x). Thus if we observe z1 = z1 using the first sensor and z2 = z1 with the second sensor (assuming a uniform prior), then the posterior likelihood in x is given by

$$P(x \mid z_1, z_1) = \alpha P_{12}(z_1, z_1 \mid x) = \alpha P_1(z_1 \mid x) P_2(z_1 \mid x) = \alpha \times (0.45, 0.45, 0.1) \otimes (0.45, 0.1, 0.45) = (0.6924, 0.1538, 0.1538).$$

Comparing this to taking two observations of z1 with sensor 1 (in which the resulting posterior was (0.488, 0.488, 0.024)), it can be seen that sensor 2 adds substantial target discrimination power at the cost of a slight loss of detection performance for the same number of observations.

Repeating this calculation for each z1, z2 observation pair results in the combined likelihood matrix

z1 = z1:
      z2 = z1   z2 = z2   z2 = z3
x1    0.6924    0.1538    0.4880
x2    0.1538    0.6924    0.4880
x3    0.1538    0.1538    0.0240

z1 = z2:
      z2 = z1   z2 = z2   z2 = z3
x1    0.6924    0.1538    0.4880
x2    0.1538    0.6924    0.4880
x3    0.1538    0.1538    0.0240

z1 = z3:
      z2 = z1   z2 = z2   z2 = z3
x1    0.1084    0.0241    0.2647
x2    0.0241    0.1084    0.2647
x3    0.8675    0.8675    0.4706

The combined sensor provides substantial improvements in overall system performance³. If, for example, we observe target 1 with the first sensor (the array block z1 = z1) and again observe target 1 with the second sensor (the first column of this block), then the posterior distribution over the three hypotheses is

$$P(x \mid z_1, z_2) = (0.692, 0.154, 0.154),$$

and so target 1 is clearly the most probable target. If however we observe a type 2 target with the second sensor after having observed a type 1 target with the first sensor, a similar calculation gives the posterior as (0.154, 0.692, 0.154); that is, target type 2 has high probability. This is because although sensor 1 observed a type 1 target, the likelihood function for sensor 1 tells us that it is poor at distinguishing between target types, and so the sensor 2 information is used for this purpose. If now we observe no target with sensor 2, having detected target type 1 with the first sensor, the posterior given both observations is (0.488, 0.488, 0.024). That is, we still believe that there is a target (because we know sensor 1 is much better at target detection than sensor 2), but we have no idea whether it is of type 1 or type 2, as sensor 2 has been unable to make a valid detection. The analysis for sensor 1 detecting target 2 is identical to that for detection of target 1. Finally, if sensor 1 gets no detection but sensor 2 detects target type 1, then the posterior likelihood is (0.108, 0.024, 0.868). That is, we still believe there is no target because we know sensor 1 is better at providing this information (and, perversely, sensor 2 confirms this even though it has detected target type 1).

³Note that summing over any column still comes to 1. In practical implementations, it is often sufficient to encode relative likelihoods of different events and to normalize only when computing the posterior distribution.

Practically, the joint likelihood matrix is never constructed. It is easy to see why here: with n = 2 sensors, m = 3 possible observations per sensor and k = 3 possible outcomes, the joint likelihood matrix already has k × mⁿ = 27 entries. Rather, a likelihood matrix is constructed for each sensor and these are combined only when instantiated with an observation. Storage then reduces to n arrays of dimension k × m, at the cost of a k-dimensional vector multiply of the instantiated likelihood functions. This is clearly a major saving in storage and complexity, and underlines the importance of the conditional independence assumption in reducing computational complexity.
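The two-sensor fusion above can be checked directly. This is a minimal sketch (Python; the names are illustrative and the code is not part of the original notes) that stores one likelihood matrix per sensor and combines the instantiated likelihood functions by element-wise product, as Equation 19 prescribes:

```python
import numpy as np

# Likelihood matrices Pi(z | x): rows are states x1..x3, columns observations z1..z3.
P1 = np.array([[0.45, 0.45, 0.10],
               [0.45, 0.45, 0.10],
               [0.10, 0.10, 0.80]])
P2 = np.array([[0.45, 0.10, 0.45],
               [0.10, 0.45, 0.45],
               [0.45, 0.45, 0.10]])

def fuse(prior, observations):
    """observations: list of (likelihood_matrix, z_index) pairs.

    Implements the independent likelihood pool of Equation 19: the posterior
    is the normalised product of the prior and each instantiated likelihood."""
    posterior = prior.copy()
    for P, z in observations:
        posterior *= P[:, z]          # instantiate the likelihood function
    return posterior / posterior.sum()

uniform = np.full(3, 1.0 / 3.0)
print(fuse(uniform, [(P1, 0), (P2, 0)]))   # both report z1 -> (0.692, 0.154, 0.154)
print(fuse(uniform, [(P1, 0), (P2, 2)]))   # sensor 2 sees nothing -> (0.488, 0.488, 0.024)
```

Only the two k × m arrays are ever stored; the 27-entry joint likelihood matrix is never formed.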

2.2.3 Recursive Bayes Updating

The integration of information using Equation 19 would, in principle, require that all past information be remembered and, on arrival of new information in the form P(zk | x), that the total likelihood be recomputed on the basis of all information gathered up to this time. However, Bayes theorem, in the form of Equation 19, also lends itself to the incremental or recursive addition of new information in determining a revised posterior distribution on the state. With Zᵏ = {zk, Zᵏ⁻¹},

$$P(\mathbf{x}, \mathbf{Z}^k) = P(\mathbf{x} \mid \mathbf{Z}^k) P(\mathbf{Z}^k) \tag{21}$$

and, expanding the joint distribution in a second way,

$$P(\mathbf{x}, \mathbf{Z}^k) = P(\mathbf{z}_k, \mathbf{Z}^{k-1} \mid \mathbf{x}) P(\mathbf{x}) = P(\mathbf{z}_k \mid \mathbf{x}) P(\mathbf{Z}^{k-1} \mid \mathbf{x}) P(\mathbf{x}), \tag{22}$$


where conditional independence of the observation sequence has been assumed. Equating both sides of this expansion gives

$$P(\mathbf{x} \mid \mathbf{Z}^k) P(\mathbf{Z}^k) = P(\mathbf{z}_k \mid \mathbf{x}) P(\mathbf{Z}^{k-1} \mid \mathbf{x}) P(\mathbf{x}) \tag{23}$$

$$= P(\mathbf{z}_k \mid \mathbf{x}) P(\mathbf{x} \mid \mathbf{Z}^{k-1}) P(\mathbf{Z}^{k-1}). \tag{24}$$

Noting that P(Zᵏ)/P(Zᵏ⁻¹) = P(zk | Zᵏ⁻¹) and rearranging gives

$$P(\mathbf{x} \mid \mathbf{Z}^k) = \frac{P(\mathbf{z}_k \mid \mathbf{x}) P(\mathbf{x} \mid \mathbf{Z}^{k-1})}{P(\mathbf{z}_k \mid \mathbf{Z}^{k-1})}. \tag{25}$$

The advantage of Equation 25 is that we need to compute and store only the posterior P(x | Zᵏ⁻¹), which contains a complete summary of all past information. When the next piece of information P(zk | x) arrives, the previous posterior takes on the role of the current prior and the product of the two becomes, when normalised, the new posterior. Equation 25 thus provides a significant improvement in computational and memory requirements over Equation 19.
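Equation 25 implies that the running posterior is the only quantity that needs to be stored. As a minimal illustrative sketch (Python; the class name and numbers are invented, not part of the original notes), a discrete recursive estimator can be written as:

```python
import numpy as np

class RecursiveBayes:
    """Discrete recursive Bayes estimator implementing Equation 25.

    Only the current posterior P(x | Z^k) is stored; each new likelihood
    P(z_k | x) multiplies it, and re-normalisation supplies the denominator
    P(z_k | Z^{k-1})."""
    def __init__(self, prior):
        self.posterior = np.asarray(prior, dtype=float)

    def update(self, likelihood):
        self.posterior *= likelihood
        self.posterior /= self.posterior.sum()
        return self.posterior

est = RecursiveBayes([1/3, 1/3, 1/3])
est.update([0.45, 0.45, 0.10])          # first z1 from the sensor of Example 2
print(est.update([0.45, 0.45, 0.10]))   # second z1 -> (0.488, 0.488, 0.024)
```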

Example 4

An important example of the recursive application of Bayes theorem is the calculation of the posterior distribution of a scalar x under the assumption that the likelihood function for the observations given the true state is Gaussian with known variance σ²:

$$P(z_k \mid x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2} \frac{(z_k - x)^2}{\sigma^2} \right).$$

If we assume that the posterior distribution in x after taking the first k − 1 observations is also Gaussian with mean x̄ₖ₋₁ and variance σ²ₖ₋₁,

$$P(x \mid \mathbf{Z}^{k-1}) = \frac{1}{\sqrt{2\pi\sigma_{k-1}^2}} \exp\left( -\frac{1}{2} \frac{(\bar{x}_{k-1} - x)^2}{\sigma_{k-1}^2} \right),$$

then the posterior distribution in x after the first k observations is given by

$$P(x \mid \mathbf{Z}^k) = K \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2} \frac{(z_k - x)^2}{\sigma^2} \right) \cdot \frac{1}{\sqrt{2\pi\sigma_{k-1}^2}} \exp\left( -\frac{1}{2} \frac{(\bar{x}_{k-1} - x)^2}{\sigma_{k-1}^2} \right) \tag{26}$$

$$= \frac{1}{\sqrt{2\pi\sigma_k^2}} \exp\left( -\frac{1}{2} \frac{(\bar{x}_k - x)^2}{\sigma_k^2} \right), \tag{27}$$

where K is a constant independent of x chosen to ensure that the posterior is appropriately normalized, and x̄ₖ and σ²ₖ are given by

$$\bar{x}_k = \frac{\sigma_{k-1}^2}{\sigma_{k-1}^2 + \sigma^2}\, z_k + \frac{\sigma^2}{\sigma_{k-1}^2 + \sigma^2}\, \bar{x}_{k-1}, \tag{28}$$

and

$$\sigma_k^2 = \frac{\sigma^2 \sigma_{k-1}^2}{\sigma^2 + \sigma_{k-1}^2}. \tag{29}$$

The most important point to note about the posterior is that it too is Gaussian: the product of two Gaussian distributions is itself Gaussian. Distributions with this self-reproducing property are known as conjugate distributions. Given this property, it is clearly not necessary to go through the process of multiplying distributions together; it is sufficient simply to compute the new mean and variance recursively from Equations 28 and 29, as these completely characterize the associated Gaussian distribution. With these simple recursion equations, any number of observations may be fused in a simple manner.
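Because of this conjugacy, the whole recursion reduces to updating two numbers per observation. The following sketch (Python; illustrative names and numbers, not part of the original notes) implements Equations 28 and 29 over a stream of scalar observations:

```python
def gaussian_recursion(x0, var0, obs, obs_var):
    """Recursively fuse scalar observations into a Gaussian posterior.

    x0, var0: mean and variance of the initial (prior) Gaussian.
    obs: iterable of observations z_k, each with variance obs_var.
    Implements Equation 28 (mean) and Equation 29 (variance)."""
    x, var = x0, var0
    for z in obs:
        x = (var / (var + obs_var)) * z + (obs_var / (var + obs_var)) * x
        var = (obs_var * var) / (obs_var + var)
    return x, var

# Prior N(0, 10); five noisy observations of a true value near 1.
print(gaussian_recursion(0.0, 10.0, [1.1, 0.9, 1.2, 0.8, 1.0], obs_var=1.0))
```

The variance shrinks monotonically with each observation, while the mean converges toward the observation average, exactly as the parallel-combination formula predicts.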

Example 5

Quite general prior and likelihood distributions can be handled by direct application of Bayes Theorem. Consider the problem in which we are required to determine the location of a target in a defined area. Figure 2(a) shows a general prior probability distribution defined on the search area. The distribution is simply defined as a probability value P(xi, yj) at a grid point at location (xi, yj) of area dxi dyj. The only constraints placed on this distribution are that

$$P(x_i, y_j) > 0, \quad \forall\, x_i, y_j,$$

and that

$$\sum_i \sum_j P(x_i, y_j)\, dx_i\, dy_j = 1$$

(which can easily be satisfied by appropriate normalisation).

A sensor (sensor 1) now takes observations of the target from a location x = 15, y = 0 km. The likelihood function generated from this sensor following an observation z1 is shown in Figure 2(b). This likelihood P1(z1 | xi, yj) again consists of a general location probability defined on the (xi, yj) grid. The likelihood shows that the bearing resolution of the sensor is high, whereas it has almost no range accuracy (the likelihood is long and thin, with probability mass concentrated on a line running from sensor to target). The posterior distribution having made this first observation is shown in Figure 2(c) and is computed from the point-wise product of prior and likelihood,

$$P(x_i, y_j \mid z_1) = \alpha \times P_1(z_1 \mid x_i, y_j) \otimes P(x_i, y_j),$$

where α is simply a normalising constant. It can be seen that the distribution defining target location is now approximately constrained to a line along the detected bearing. Figure 2(d) shows the posterior P(xi, yj | z1, z2) following a second observation z2 by the same sensor. This is again computed by point-wise multiplication of the likelihood P1(z2 | xi, yj) with the new prior (the posterior from the previous observation, P(xi, yj | z1)). It can be seen that there is little improvement in location density following this second observation; this is to be expected as there is still little range information available.


[Figure 2 shows six surface plots over a 0-50 km grid in X Range (km) and Y Range (km), with panel titles: (a) Prior Location Density; (b) Location likelihood from Sensor 1; (c) Posterior location density after one observation from sensor 1; (d) Posterior location density after two observations from sensor 1; (e) Location likelihood from Sensor 2; (f) Posterior location density following update from sensor 2.]

Figure 2: Generalised Bayes Theorem. The figures show plots of two-dimensional distribution functions defined on a grid of x and y points: (a) prior distribution; (b) likelihood function for first sensor; (c) posterior after one application of Bayes Theorem; (d) posterior after two applications; (e) likelihood function for second sensor; (f) final posterior.


A second sensor (sensor 2) now takes observations of the target from a location x = 50, y = 20. Figure 2(e) shows the target likelihood P2(z3 | xi, yj) following an observation z3 by this sensor. It can be seen that this sensor (like sensor 1) has high bearing resolution, but almost no range resolution. However, because the sensor is located at a different site, we would expect that the combination of bearing information from the two sensors would provide accurate location data. Indeed, following point-wise multiplication of the second sensor likelihood with the new prior (the posterior P(xi, yj | z1, z2) from the previous two observations of sensor 1), we obtain the posterior P(xi, yj | z1, z2, z3) shown in Figure 2(f), which shows all probability mass highly concentrated around a single target location.
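The grid-based fusion of this example is simply repeated point-wise multiplication and normalisation over a two-dimensional array. The sketch below (Python; the Gaussian bearing likelihood is an invented stand-in for the real sensor models, though the sensor sites are taken from the example) reproduces the qualitative behaviour of Figure 2: two bearing-only likelihoods from different sites intersect to localise the target.

```python
import numpy as np

# 50 km x 50 km search area on a 1 km grid.
xs, ys = np.meshgrid(np.arange(50.0), np.arange(50.0), indexing='ij')

def bearing_likelihood(sx, sy, z_bearing, sigma=0.02):
    """Toy bearing-only likelihood: high bearing resolution, no range resolution."""
    bearing = np.arctan2(ys - sy, xs - sx)
    lik = np.exp(-0.5 * ((bearing - z_bearing) / sigma) ** 2)
    return lik / lik.sum()

posterior = np.full_like(xs, 1.0 / xs.size)        # uniform prior
true_x, true_y = 25.0, 30.0
for (sx, sy) in [(15.0, 0.0), (50.0, 20.0)]:       # sensor sites of Example 5
    z = np.arctan2(true_y - sy, true_x - sx)       # noiseless bearing 'observation'
    posterior *= bearing_likelihood(sx, sy, z)     # point-wise product
    posterior /= posterior.sum()                   # normalise

# The posterior mass concentrates near the true location:
idx = np.unravel_index(posterior.argmax(), posterior.shape)
print(idx)   # approximately (25, 30)
```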

2.2.4 Data Dependency and Bayes Networks

Bayes Theorem provides a general method for reasoning about probabilities and fusing information. As has been seen, in complex problems involving many sources of information, it is usual to assume conditional independence of observation information (that is, the observations are conditionally independent given the underlying state). This is necessary to avoid the problem of specifying very large numbers of likelihoods associated with each permutation of observations: if there are p possible outcomes for each observation, then the specification of the joint likelihood P(z1, · · · , zn | x) requires the specification of pⁿ probabilities, whereas if conditional independence is assumed, we need only specify n likelihoods Pi(zi | x), or a total of p × n probabilities.

In many situations it is not possible simply to assume conditional independence of observations; indeed, there is often a clear dependency between sources of information (when each source depends on some underlying assumption or report from a third source, for example). In such cases it is easy to be swamped by the complexity of the various dependencies between information reports. However, even in such cases, it is still possible to make good use of the rules of conditioning to exploit any independence that may exist. Recall that in general, without any assumptions of independence, the joint distribution on any set of random variables xi may be expanded using the chain-rule of conditional probabilities as

$$P(\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_n) = P(\mathbf{x}_1 \mid \mathbf{x}_2, \cdots, \mathbf{x}_n) \cdots P(\mathbf{x}_{n-1} \mid \mathbf{x}_n) P(\mathbf{x}_n),$$

where the expansion may be taken in any convenient order. If all variables depend on all other variables, then there is little to be gained by this expansion. If however a variable xj, for example, depends directly on only one or a few other variables xh, then the appropriate expansion can substantially reduce the number of probabilities that need be stored,

$$P(\mathbf{x}_j, \mathbf{x}_h, \cdots, \mathbf{x}_n) = P(\mathbf{x}_j \mid \mathbf{x}_h) P(\mathbf{x}_h, \cdots, \mathbf{x}_n).$$

Such direct conditional independencies are common, if only because the acquisition of information is causal. For example, information about a state xk+1 at time k + 1 will depend on all information obtained up to time k + 1. However, if all past information up to time k is summarised in the variable xk, then the new variable xk+1 becomes conditionally independent of all past information (up to k) once xk is specified. Essentially, xk tells us all we need to know about past observations, and so re-stating these observations adds no more knowledge; this is the well-known Markov property.

The issue of dependency between variables is often best expressed graphically, in terms of a network describing relationships. Here, the relations between different states are explicitly represented in a graph structure, with the vertices of the graph representing individual states and their probability distributions, and the edges of the graph representing conditional probabilities of adjacent nodes. In the discrete case, the graph requires the specification of a conditional probability matrix of dimension ni × nj for each edge relating two states xi and xj, which can take on respectively ni and nj distinct values. From this a detailed computational scheme can be developed for propagating probabilistic information through the graph. These networks are referred to as Bayes Networks, Belief Networks or Probabilistic Networks. The landmark book by Pearl [35] describes the construction and use of these dependency networks in probabilistic reasoning. A detailed discussion of Bayes networks is beyond the scope of this course. The following example provides a brief introduction.

Example 6

[Figure 3 shows a tree: observations z1 and z2 feed the node P(x1 | z1, z2); observation z3 feeds P(x2 | z3); x1 and x2 combine in P(y1 | x1, x2); observation z4 feeds P(y2 | z4); and y1 and y2 combine in the root P(t | y1, y2).]

Figure 3: A Simple Bayesian Network (see text for details).

It is required to determine a target identity t. Target identity probabilities are obtained from two independent sources y1 and y2. The first source bases target reporting on two sensor sources x1 and x2, the first of these employing a pair of dependent observations z1 and z2, and the second a single observation z3. The second source y2 bases its report directly on an observation z4. The dependency structure is shown in the belief network of Figure 3. The conditional probabilities are specified at each of the nodes in the network (tree) structure. It is also possible to specify these on the arcs and to use the nodes as places where total probabilities (posteriors) are computed.

The structure shown in the figure simply exposes the dependencies between variables and is a convenient way of describing, computationally, the role of each likelihood. The tree structure shown is common in causal processes. It is sometimes the case that reports are based on a common report or observation, in which case the tree becomes a general network structure. These cases are difficult to deal with other than by explicitly computing the resulting dependence between report variables.
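For a tree such as that of Figure 3, probabilities can be propagated from the leaves to the root by successive conditioning and marginalisation. The following sketch (Python; binary nodes are assumed and every probability table is invented purely to illustrate the mechanics, so none of these numbers come from the notes) shows the propagation for the x, y and t layers:

```python
import numpy as np

# Sensor-level posteriors, already instantiated with the observations z1..z4
# (invented probability vectors over hypothetical binary nodes).
p_x1 = np.array([0.7, 0.3])      # P(x1 | z1, z2)
p_x2 = np.array([0.6, 0.4])      # P(x2 | z3)
p_y2 = np.array([0.55, 0.45])    # P(y2 | z4)

# Conditional table P(y1 | x1, x2), indexed [y1, x1, x2].
P_y1 = np.array([[[0.9, 0.5], [0.6, 0.2]],
                 [[0.1, 0.5], [0.4, 0.8]]])
# Marginalise out x1, x2 (valid because the network is a tree, so x1 and x2
# are conditionally independent given their own observations):
p_y1 = np.einsum('yab,a,b->y', P_y1, p_x1, p_x2)

# Conditional table P(t | y1, y2), indexed [t, y1, y2].
P_t = np.array([[[0.95, 0.7], [0.6, 0.1]],
                [[0.05, 0.3], [0.4, 0.9]]])
p_t = np.einsum('tab,a,b->t', P_t, p_y1, p_y2)
print(p_t / p_t.sum())           # posterior over target identity t
```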

2.2.5 Distributed Data Fusion with Bayes Theorem


Figure 4: The distributed implementation of the independent likelihood pool. Each sensor maintains its own model in the form of a conditional probability distribution Pi(zi | x). On arrival of a measurement zi, the sensor model is instantiated with the associated observation to form a likelihood Λi(x). This is transmitted to a central fusion centre where the normalised product of likelihoods and prior, P(x | Z^n) = C P(x) Π_{i=1..n} Λi(x), yields the posterior distribution.

Providing the essential rules of conditional probabilities and Bayes theorem are followed, it is not difficult to develop methods for distributing the data fusion problem. Figure 4 shows one such distributed architecture. In this case, the sensor models, in the form of likelihood functions, are maintained locally at each sensor site. When an observation is made, these likelihoods are instantiated to provide a likelihood function Λi(x) describing a probability distribution over the true state of the world. Importantly, it is this likelihood that is transmitted to the fusion centre. Thus, the sensor talks to the centre in terms of the underlying state of the world, and not in terms of the raw observation (as is the case in Figure 1). This has the advantage that each sensor is 'modular' and talks in a common 'state' language. However, it has the disadvantage that a complete likelihood, rather than a single observation, must be communicated. The central fusion centre simply computes the normalised product of the communicated likelihoods and the prior to yield a posterior distribution.


Figure 5: A distributed implementation of the independent opinion pool in which each sensor maintains both its own model and also computes a local posterior. The complete posterior is made available to all sensors and so they become, in some sense, autonomous. The figure shows Bayes theorem in a recursive form.

A second approach to distributed data fusion using Bayes theorem is shown in Figure 5. In this case, each sensor computes a likelihood but then combines this, locally, with the prior from the previous time-step, producing a local posterior distribution on the state. This is then communicated to the central fusion centre. The fusion centre recovers the new observation information by dividing each posterior by the communicated (global) prior, and then takes a normalised product to produce a new global posterior. This posterior is then communicated back to the sensors and the cycle repeats in recursive form. The advantage of this structure is that global information, in state form, is made available at each local sensor site. This makes each sensor, in some sense, autonomous. The disadvantage of this architecture is the need to communicate state distributions both to and from the central fusion centre. Distributed data fusion methods will be discussed in detail later in this course.


Figure 6: A Bayesian formulation of the classical feedback aided navigation system, fusing rate data with external observation information (see text for details).

Almost any data fusion structure is possible as long as the essential rules of conditional probability and Bayes theorem are followed. Figure 6 shows a Bayes filter in feedback form. The filter outputs state predictions in the form of a pdf P(x⁻k | Z^{k−1}) of the state x⁻k at time k given all observations Z^{k−1} up to time k − 1. This is essentially a prior distribution and provides a distribution on the state at time k prior to an observation being made at this time (and thus is marked with a '−' superscript to denote 'before update'). The prior is fed back and multiplied by the likelihood Λ(x) associated with an incoming observation. This multiplication essentially implements Bayes theorem to produce the posterior P(x⁺k | Z^k), where the superscript '+' denotes 'after observation'. The likelihood itself is generated from the sensor model P(z | x) in the normal manner. The posterior is fed back and multiplied by another distribution P(x⁻ | x⁺) which essentially predicts the future state on the basis of the current state. Multiplying this distribution by the fed-back posterior generates a new prior. The cycle, after a delay, then repeats. This structure is a Bayesian implementation of the classical feedback aided navigation system. The state prediction distribution P(x⁻ | x⁺) has the same role as an inertial system, using integrated rate information to generate predictions of state. The feedback loop is corrected by reference to some external observation (the aiding sensor) modeled by the likelihood function P(z | x). Thus rate and external information are fused in a predictor-corrector arrangement.

2.2.6 Data Fusion with Log-Likelihoods

In data fusion problems and in Bayes networks, it is often easier to work with the log of a probability rather than the probability itself. These 'log-likelihoods' are more convenient computationally than probabilities, as additions and subtractions rather than multiplications and divisions are employed in fusing probability information. Further, log-likelihoods are also more closely related to formal definitions of information.


The log-likelihood and conditional log-likelihood are defined as

l(x) = log P(x), l(x | y) = log P(x | y). (30)

It is clear that the log-likelihood is always less than or equal to zero, l(x) ≤ 0, and is only equal to zero when all probability mass is assigned to a single value of x. The log-likelihood itself can sometimes be a useful and efficient means of implementing probability calculations. For example, taking logs of both sides of Equation 12, we can rewrite Bayes theorem in terms of log-likelihoods as

l(x | z) = l(z | x) + l(x) − l(z). (31)

Example 7

Consider again the two-sensor discrete target identification example (Example 3). The log-likelihood matrix for the first sensor (using natural logs) is

        z1      z2      z3
x1   −0.799  −0.799  −2.303
x2   −0.799  −0.799  −2.303
x3   −2.303  −2.303  −0.223

and for the second

        z1      z2      z3
x1   −0.799  −2.303  −0.799
x2   −2.303  −0.799  −0.799
x3   −0.799  −0.799  −2.303

The posterior log-likelihood (given a uniform prior) following observation of target 1 by sensor 1 and target 1 by sensor 2 is the sum of the first columns of each of the log-likelihood matrices:

l(x | z1, z1) = l1(z1 | x) + l2(z1 | x) + C
= (−0.7985, −0.7985, −2.3026) + (−0.7985, −2.3026, −0.7985) + C
= (−1.5970, −3.1011, −3.1011) + C
= (−0.3680, −1.8721, −1.8721),

where the constant C = 1.229 is found through normalisation (which in this case requires that the anti-logs sum to one). The first thing to note is that the computation is obviously simpler than obtaining products of probability distributions, particularly as the dimension of the state vector increases. The second is that the normalising constant is additive. Thus normalisation need only occur if probabilities are required; otherwise, the relative magnitudes of the log-likelihoods are sufficient to indicate relative likelihoods (the closer the log-likelihood is to zero, the nearer the probability is to one).
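The arithmetic of Example 7 can be checked directly in a few lines (the probability matrices are the anti-logs of the log-likelihood matrices above):

    import numpy as np

    # P(z | x) for each sensor: rows are states x1..x3, columns observations z1..z3.
    P1 = np.array([[0.45, 0.45, 0.10],
                   [0.45, 0.45, 0.10],
                   [0.10, 0.10, 0.80]])
    P2 = np.array([[0.45, 0.10, 0.45],
                   [0.10, 0.45, 0.45],
                   [0.45, 0.45, 0.10]])

    # Both sensors observe z1: sum the first columns of the log-likelihood matrices.
    l = np.log(P1[:, 0]) + np.log(P2[:, 0])   # (-1.597, -3.101, -3.101)

    # Normalisation is additive in log space, needed only if probabilities are required.
    C = -np.log(np.exp(l).sum())              # 1.229
    print(l + C)                              # (-0.368, -1.872, -1.872)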


Example 8

A second interesting example of log-likelihoods is the case where prior and sensor likelihoods are Gaussian, as in Example 4. Taking logs of Equation 27 gives

l(x | Z^k) = −(1/2)(x̄k − x)²/σ²k
= −(1/2)(zk − x)²/σ² − (1/2)(x̄k−1 − x)²/σ²k−1 + C. (32)

As before, completing squares gives

x̄k = [σ²k−1/(σ²k−1 + σ²)] zk + [σ²/(σ²k−1 + σ²)] x̄k−1,

and

σ²k = σ² σ²k−1/(σ² + σ²k−1).

Thus the log-likelihood is quadratic in x: for each value of x, a log-likelihood is specified as −(1/2)(x̄k − x)²/σ²k, modulo the addition of a constant C.


Figure 7: A log-likelihood implementation of a fully centralised data fusion architecture.

Log-likelihoods are a convenient way of implementing distributed data fusion architectures. Figure 7 shows a log-likelihood implementation of a fully centralised data fusion architecture (equivalent to Figure 1), in which observations are transmitted directly to a central fusion centre which maintains the sensor likelihood functions. Fusion of information is simply a matter of summing log-likelihoods. Figure 8 shows a log-likelihood implementation of the independent likelihood pool architecture (equivalent to Figure 4). In this case, each sensor maintains its own model and communicates log-likelihoods to a central processor; the fusion of log-likelihoods is again simply a summation. Figure 9 shows a log-likelihood implementation of the independent opinion pool (equivalent to Figure 5), in which each local sensor maintains a posterior likelihood which is communicated to the central fusion centre.


Figure 8: A log-likelihood implementation of the independent likelihood pool architecture.


Figure 9: A log-likelihood implementation of the independent opinion pool architecture.


In all cases, the log-likelihood implementation involves only simple addition and subtraction in the form of classical feedback loops. These can be easily manipulated to yield a wide variety of different architectures.

2.3 Information Measures

Probabilities and log-likelihoods are defined on states or observations. It is often valuable to also measure the amount of information contained in a given probability distribution. Formally, information is a measure of the compactness of a distribution: if a probability distribution is spread evenly across many states, then its information content is low; conversely, if a probability distribution is highly peaked on a few states, then its information content is high. Information is thus a function of the distribution, rather than of the underlying state. Information measures play an important role in designing and managing data fusion systems. Two probabilistic measures of information are of particular value in data fusion problems: the Shannon information (or entropy) and the Fisher information.

2.3.1 Entropic Information

The entropy or Shannon information⁴ H_P(x) associated with a probability distribution P(x), defined on a random variable x, is the expected value of minus the log-likelihood. For continuous-valued random variables this is given by (see [34], Chapter 15)

H_P(x) = −E{log P(x)} = −∫_{−∞}^{+∞} P(x) log P(x) dx, (33)

and for discrete random variables

H_P(x) = −E{log P(x)} = −Σ_{x∈X} P(x) log P(x). (34)

Note that, following convention, we have used x as an argument for H_P(·) even though the integral or sum is taken over values of x, so H_P(·) is not strictly a function of x but is rather a function of the distribution P(·).

The entropy H_P(·) measures the compactness of a density P(·). It achieves a minimum of zero when all probability mass is assigned to a single value of x; this agrees with an intuitive notion of a 'most informative' distribution. Conversely, when probability mass is uniformly distributed over states, the entropy of the distribution is a maximum; this too agrees with the idea of least informative (maximum entropy) distributions. Maximum entropy distributions are often used as prior distributions when no useful prior information is available. For example, if the random variable x can take on at most n discrete values in the set X, then the least informative (maximum entropy) distribution on x is the one which assigns a uniform probability 1/n to each value. This distribution clearly has an entropy of log n. When x is continuous-valued, the least informative distribution is also uniform. However, if the range of x is infinite, the distribution is technically not well defined, as probability mass must be assigned equally over an infinite range. If the information is to be used as a prior in Bayes rule, this technical issue is often not a problem, as P(x) can be set to any convenient constant value over the whole range of x (P(x) = 1 for all x, for example) without affecting the values computed for the posterior P(x | z).⁵

⁴ Properly, information should be defined as the negative of entropy; when entropy is a minimum, information is a maximum. As is usual, we shall ignore this distinction and usually talk about entropy minimization when we really mean information maximisation.

⁵ A distribution such as this, which clearly violates the constraint that ∫ P(x) dx = 1, is termed an improper distribution, or in this case an improper prior.

It can be shown that, up to a constant factor and under quite general conditions of preference ordering and preference boundedness, this definition of entropy is the only reasonable definition of 'informativeness'. An excellent proof of this quite remarkable result (first shown by Shannon) can be found in [14]. The implications of this in data fusion problems are many. In particular, it argues that entropy is a uniquely appropriate measure for evaluating and modeling information sources described by probabilistic models. Such ideas will be examined in later sections on system performance measures, organization and management.

2.3.2 Conditional Entropy

The basic idea of entropy can logically be extended to conditional entropy; for continuous-valued random variables

H_P(x | zj) = −E{log P(x | zj)} = −∫_{−∞}^{+∞} P(x | zj) log P(x | zj) dx, (35)

and for discrete random variables

H_P(x | zj) = −E{log P(x | zj)} = −Σ_x P(x | zj) log P(x | zj). (36)

This should be interpreted as the information (entropy) about the state x contained in the distribution P(· | zj) following an observation zj. Note that H_P(x | z) is still a function of z, and thus depends on the observation made.

The mean conditional entropy, H̄(x | z), taken over all possible values of z, is given by

H̄(x | z) = E{H(x | z)}
= ∫_{−∞}^{+∞} P(z) H(x | z) dz
= −∫_{−∞}^{+∞} ∫_{−∞}^{+∞} P(z) P(x | z) log P(x | z) dx dz
= −∫_{−∞}^{+∞} ∫_{−∞}^{+∞} P(x, z) log P(x | z) dx dz, (37)


for continuous random variables, and

H̄(x | z) = Σ_z H(x | z) P(z) = −Σ_z Σ_x P(x, z) log P(x | z), (38)

for discrete random variables. Note that H̄(x | z) is not a function of either x or z. It is essentially a measure of the information that will be obtained (on average) by making an observation, before the value of the observation is known.

Example 9

Recall Example 3 of two sensors observing a discrete state: type 1 target, type 2 target, and no target. Consider first sensor 1. Assuming a uniform prior distribution, the information (entropy) obtained about the state, given that an observation z = z1 has been made (first column of the likelihood matrix for sensor 1), is simply given by (using natural logs)

H_P1(x | z1) = −Σ_{i=1..3} P1(xi | z1) log P1(xi | z1) = 0.9489. (39)

For the first sensor, the conditional entropy for each possible observation is

H_P1(x | z) = −Σ_{i=1..3} P1(xi | z) log P1(xi | z) = (0.9489, 0.9489, 0.6390). (40)

Thus, observing either z1 or z2 is equally informative (indeed, each provides exactly the same information, as sensor 1 cannot distinguish between targets), but observing z3 (no target) is most informative because probability mass is then relatively concentrated on the no-target state. Successive observation of z1 or z2 with sensor 1 yields a posterior density on x of (0.5, 0.5, 0.0), which has an information value (entropy) of log 2 = 0.6931. Successive observation of z3 yields a posterior of (0.0, 0.0, 1.0), which has an entropy of zero (log 1, the minimum entropy).

Similar calculations for the likelihood matrix of sensor 2 give the conditional entropy for each observation as H_P2(x | z) = (0.948, 0.948, 0.948); that is, any observation is equally informative. This is because, in the likelihood function for sensor 2, the relative distribution of probability mass is the same for any observation, even though the mass itself is placed on different states. Successive observation of any of z1, z2 or z3 results in probability being assigned to only two of the three states, and thus the posterior entropy achieves a minimum of log 2 = 0.6931. Note that, as with sensor 1 observing z1 or z2, the entropy achieves a lower bound not equal to zero.

To find the mean conditional entropy, the joint distribution P(x, z) is required. This is computed from

P(x, z) = P(z | x) P(x),

with P(z | x) the likelihood matrix of sensor 1 and P(x) = (1/3, 1/3, 1/3)ᵀ, giving

P(x, z) =
[ 0.1500  0.1500  0.0333 ]
[ 0.1500  0.1500  0.0333 ]
[ 0.0333  0.0333  0.2667 ]

(Note the sum of the elements is equal to 1.) Substituting these values into Equation 38 and summing over both x and z gives the mean conditional entropy as 0.8456.
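These numbers are easily verified (a short numpy sketch using the sensor 1 likelihood matrix and uniform prior of the example):

    import numpy as np

    P_z_given_x = np.array([[0.45, 0.45, 0.10],
                            [0.45, 0.45, 0.10],
                            [0.10, 0.10, 0.80]])
    prior = np.array([1/3, 1/3, 1/3])

    joint = P_z_given_x * prior[:, None]   # P(x, z) = P(z | x) P(x)
    pz = joint.sum(axis=0)                 # total probability of each observation
    post = joint / pz                      # P(x | z), one column per observation

    H_cond = -(post * np.log(post)).sum(axis=0)
    print(H_cond)                          # (0.9489, 0.9489, 0.6390), Equation 40
    print((pz * H_cond).sum())             # mean conditional entropy: 0.8456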

Example 10

It is of considerable future interest to provide a measure for the entropy of a Gaussian (normal) distribution. The pdf for an n-dimensional Gaussian is given by

P(x) = N(x̄, P) = |2πP|^{−1/2} exp[−(1/2)(x − x̄)ᵀ P⁻¹ (x − x̄)], (41)

where x̄ is the mean of the distribution and P the covariance, and where | · | denotes the matrix determinant. The entropy for this distribution is obtained as follows:

H_P(x) = −E{log P(x)}
= (1/2) E{(x − x̄)ᵀ P⁻¹ (x − x̄)} + (1/2) log[(2π)^n |P|]
= (1/2) E{Σ_{ij} (xi − x̄i) P⁻¹_{ij} (xj − x̄j)} + (1/2) log[(2π)^n |P|]
= (1/2) Σ_{ij} E{(xj − x̄j)(xi − x̄i)} P⁻¹_{ij} + (1/2) log[(2π)^n |P|]
= (1/2) Σ_j Σ_i P_{ji} P⁻¹_{ij} + (1/2) log[(2π)^n |P|]
= (1/2) Σ_j (P P⁻¹)_{jj} + (1/2) log[(2π)^n |P|]
= (1/2) Σ_j 1_{jj} + (1/2) log[(2π)^n |P|]
= n/2 + (1/2) log[(2π)^n |P|]
= (1/2) log[(2πe)^n |P|]. (42)

Thus the entropy of a Gaussian distribution is determined only by the state vector length n and the covariance P. The entropy is proportional to the log of the determinant of the covariance. The determinant of a matrix is a volume measure (recall that the determinant is the product of the eigenvalues of a matrix, and the eigenvalues define axis lengths in n-space). Consequently, the entropy is a measure of the volume enclosed by the covariance matrix and thus of the compactness of the probability distribution.

If the Gaussian is scalar with variance σ², then the entropy is simply given by

H(x) = log(σ √(2πe)).

For two random variables x and z which are jointly Gaussian, with correlation coefficient ρxz = σxz / √(σ²xx σ²zz), the conditional entropy is given by

H(x | z) = log(σxx √(2πe (1 − ρ²xz))).

Thus when the variables are uncorrelated, ρxz = 0, the conditional entropy is just the entropy in P(x), H(x | z) = H(x), as z provides no additional information about x. Conversely, as the variables become highly correlated, ρxz → 1, the conditional entropy goes to zero, as complete knowledge of z implies complete knowledge of x.

2.3.3 Mutual Information

With these definitions of entropy and conditional entropy, it is possible to write an 'information form' of Bayes theorem. Taking expectations of Equation 31 with respect to both the state x and the observation z gives (we will now drop the suffix P when the context of the distribution is obvious)

H(x | z) = H(z | x) + H(x) − H(z). (43)

Simply, this describes the change in entropy, or information, following an observation from a sensor modeled by the likelihood P(z | x).

Being able to describe changes in entropy leads naturally to an important question: what is the most informative observation I can make? This question may be answered through the idea of mutual information.

The mutual information I(x, z) obtained about a random variable x with respect to a second random variable z is now defined as

I(x, z) = E{ log [P(x, z) / (P(x) P(z))] }
= E{ log [P(x | z) / P(x)] }
= E{ log [P(z | x) / P(z)] }. (44)

Mutual information is an a priori measure of the information to be gained through observation. It is a function of the ratio of the density P(x | z) following an observation to the prior density P(x). The expectation is taken over both z and x, so the mutual information gives an average measure of the gain to be expected before making the observation. If the underlying probability distributions are continuous, then

I(x, z) = ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} P(x, z) log [P(x, z) / (P(x) P(z))] dx dz, (45)

and for discrete distributions

I(x, z) = Σ_{z∈Z} Σ_{x∈X} P(x, z) log [P(x, z) / (P(x) P(z))]. (46)

Noting that

P(x, z) / (P(x) P(z)) = P(x | z) / P(x), (47)

if x and z are independent then the expressions in Equation 47 become equal to one and (taking logs) the mutual information becomes equal to zero. This is logical; if knowledge of the state is independent of the observation, the information to be gained by taking an observation (the mutual information) is zero. Conversely, as x becomes more dependent on z, P(x | z) becomes more peaked or compact relative to the prior distribution P(x) and so the mutual information increases. Note that mutual information is never negative (it is not possible, on average, to lose information by taking observations).

Equation 44 can be written in terms of the component entropies as

I(x, z) = H(x) +H(z)−H(x, z)

= H(x)−H(x | z)= H(z)−H(z | x) (48)

Equation 48 measures the ‘compression’ of the probability mass caused by an observation.Mutual information provides an average measure of how much more information we wouldhave about the random variable x if the value of z where known. Most importantly mutualinformation provides a pre-experimental measure of the usefulness of obtaining information(through observation) about the value of z.

Example 11

Continuing the two-sensor target identification example (Example 9), it is interesting to see what value mutual information has in deciding which sensor should be used for correctly identifying a target (this is a sensor management problem).

First assume that the prior information on x is uniform so that

P (x) = [1/3, 1/3, 1/3]T .

Consider taking an observation with sensor 1. The total probability of observation P(z) can be found by summing the joint probability P(x, z) (computed in Example 9) to obtain P(z) = [1/3, 1/3, 1/3]. The mutual information I(x, z) on using sensor 1 to take an observation is then

I(x, z) = −Σ_j P(zj) log P(zj) − ( −Σ_j Σ_i P(xi, zj) log P(zj | xi) )
= 1.0986 − 0.8456
= 0.2530.

Suppose we now go ahead and take an observation with sensor 1, and the result is z1, an observation of target 1. The prior is now updated as (Example 2)

P(x) = P1(x | z1) = [0.45, 0.45, 0.1]ᵀ.

The mutual information gain from using sensor 1 given this new prior is I(x, z) = 0.1133. Note that this is smaller than the mutual information obtained with a uniform prior. This is because we now have observation information, and so the value of obtaining more information is less. Indeed, mutual information shows that the more we observe, the less value there is in taking a new observation. In the limit, if we repeatedly observe z1 with sensor 1, the prior becomes equal to [0.5, 0.5, 0.0] (Example 2). In this case the predicted mutual information gain is zero; that is, there is no value in using sensor 1, yet again, to obtain information.

Similarly, again assuming a uniform prior, the mutual information gain predicted for using sensor 2 is 0.1497. This is lower than for sensor 1, whose no-target likelihood is more highly peaked. If we observe target 1 with sensor 2, the updated prior becomes P(x) = [0.45, 0.1, 0.45]ᵀ. Using this prior, the mutual information gain for using sensor 2 again is only 0.1352; smaller, but not as small as the corresponding mutual information gain from sensor 1 after its first measurement. Continual observation of target 1 with sensor 2 results in the posterior becoming equal to [0.5, 0.0, 0.5]ᵀ and, if this is substituted into the equation for mutual information gain, we obtain a predicted mutual information gain of 0.1205, even after taking an infinite number of measurements. The reason for this is subtle: as sensor 2 cannot distinguish targets, any observation other than that of target 1 will still provide additional information, and as mutual information is an average measure it still predicts, on the average, an increase in information (even though further observation of target 1 will not provide any more information). Continual observation of target 1 with sensor 2 would never happen in practice. As the likelihood suggests, if the true state were target 1, then sensor 2 would be as likely not to detect a target at all as to report target 1. Indeed, if we mix detections as the likelihood suggests, the posterior tends to [1, 0, 0]ᵀ or [0, 1, 0]ᵀ, which then implies zero mutual information gain from sensor 2.

Now, suppose sensor 1 is used to obtain the first measurement and target 1 is detected, so that the new prior is [0.45, 0.45, 0.1]ᵀ. Which of sensors 1 or 2 should be used to make the second observation? With this prior, the predicted mutual information gain from using sensor 1 is 0.1133, and from using sensor 2 is 0.1352. This (logically) tells us to use sensor 2. Sensor 2 is now equally as likely to detect the correct target as to return a no-detect. Sensor 2 is employed and a no-detect is returned, yielding a posterior [0.4880, 0.4880, 0.0241]ᵀ. With this as a prior, mutual information will tell us to continue with sensor 2 until we get a return for either target 1 or target 2. If this (target 2, say) happens next time round, then the posterior will be [0.1748, 0.7864, 0.0388]ᵀ; we now have, for the first time, a target preference (target 2), and sensors 1 and 2 can both be used to refine this estimate. With this prior, the mutual information for sensor 1 is 0.0491 and for sensor 2 is 0.0851; thus sensor 2 will be used on the fourth sensing action.

This process of predicting information gain, making a decision on sensing action, and then sensing is an efficient and effective way of managing sensing resources and of determining optimal sensing policies. However, it should be noted that this is an average policy. As we saw with the behaviour of sensor 2, it is sometimes not the most logical policy if we value things in a manner other than on the average. This issue will be dealt with in more detail in the section on decision making.
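The sensor-selection arithmetic of Example 11 can be reproduced with a short sketch, computing I(x, z) = H(z) − H(z | x) from the likelihood matrices used throughout these examples:

    import numpy as np

    def entropy(p):
        p = p[p > 0]
        return -(p * np.log(p)).sum()

    def mutual_information(prior, L):
        # I(x, z) = H(z) - H(z | x), with L[i, j] = P(z_j | x_i).
        pz = prior @ L
        H_z_given_x = sum(prior[i] * entropy(L[i]) for i in range(len(prior)))
        return entropy(pz) - H_z_given_x

    P1 = np.array([[0.45, 0.45, 0.10],
                   [0.45, 0.45, 0.10],
                   [0.10, 0.10, 0.80]])
    P2 = np.array([[0.45, 0.10, 0.45],
                   [0.10, 0.45, 0.45],
                   [0.45, 0.45, 0.10]])

    uniform = np.array([1/3, 1/3, 1/3])
    print(mutual_information(uniform, P1))    # 0.2530
    print(mutual_information(uniform, P2))    # 0.1497

    after_z1 = np.array([0.45, 0.45, 0.10])   # prior after sensor 1 reports z1
    print(mutual_information(after_z1, P1))   # 0.1133
    print(mutual_information(after_z1, P2))   # 0.1352 -> choose sensor 2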

Example 12

A second interesting application of mutual information gain is Example 5. In this example, general probability distributions defining target location are defined on a location grid. Entropy measures can be obtained for distributions on this grid from

H(x, y) = −E{log P(x, y)}
= −∫∫ P(x, y) log P(x, y) dx dy
≈ −Σ_i Σ_j P(xi, yj) log P(xi, yj) δxi δyj. (49)

The total entropy associated with the prior shown in Figure 2(a), computed with Equation 49, is H(x) = 12.80. The total entropy associated with the sensor 1 likelihood shown in Figure 2(b) is H(z | x) = 4.2578, and the entropy associated with the resulting posterior of Figure 2(c) is H(x | z) = 3.8961. The predicted mutual information gain from this observation is therefore I(x, z) = 8.9039 (a big gain). However, the predicted mutual information gain from a second observation using sensor 1 is only 0.5225. This is reflected in the slight change in posterior from Figure 2(c) to Figure 2(d). Conversely, the predicted mutual information gain from sensor 2, with the likelihood shown in Figure 2(e), is 2.8598 (a relatively large gain). This is reflected in the posterior shown in Figure 2(f), which has an entropy of only 0.5138.

This example shows that it is straightforward to apply principles of entropy to any type of probabilistic information, both to measure the compactness of information and to predict information gain using mutual information.

2.3.4 Fisher Information

A second measure of information commonly used in probabilistic modeling and estimation is the Fisher information. Unlike Shannon information, Fisher information may only be defined on continuous distributions. The Fisher information J(x) is defined as the negative of the second derivative of the log-likelihood⁶,

J(x) = −(d²/dx²) log P(x). (50)

In general, if x is a vector, then J(x) will be a matrix, usually called the Fisher information matrix. The Fisher information describes the information content about the values of x contained in the distribution P(x). The Fisher information measures the surface of a bounding region containing probability mass. Thus, like entropy, it measures the compactness of a density function. However, entropy measures a volume and is thus a single number, whereas Fisher information is a series of numbers (and generally a matrix) measuring the axes of the bounding surface.

The Fisher information is particularly useful in the estimation of continuous-valued quantities. If P(x) describes all the (probabilistic) information we have about the quantity x, then the smallest variance that we can have on an estimate of the true value of x is known as the Cramer-Rao lower bound, and is equal to the inverse of the Fisher information, J⁻¹(x). An estimator that achieves this lower bound is termed 'efficient'.

Example 13

The simplest example of Fisher information is the information associated with a vector x known to be Gaussian distributed with mean x̄ and covariance P:

P(x) = N(x̄, P) = |2πP|^{−1/2} exp( −(1/2)(x − x̄)ᵀ P⁻¹ (x − x̄) ). (51)

Taking logs of this distribution and differentiating twice with respect to x gives J(x) = P⁻¹. That is, the Fisher information is simply the inverse covariance. This agrees with intuition; Gaussian distributions are often drawn as a series of ellipsoids containing probability mass. P⁻¹ describes the surface of this ellipsoid, the square roots of its eigenvalues being the dimensions of each axis of the ellipsoid.

The Fisher information plays an important role in multi-sensor estimation problems. In conventional estimation problems, when only a single source of information is used to obtain information about an unknown state, it is common to talk of an estimate of this state together with some associated uncertainty (normally a variance). However, in multi-sensor estimation problems it is difficult to describe any (statistical) relations there may be between the different estimates produced by different combinations of sensors. This problem can only be overcome by dealing directly with the likelihood functions associated with the observations themselves and by explicitly accounting for any dependencies between different estimates. The Fisher information provides a direct means of accounting for these dependencies, as it makes explicit the information available in the likelihood function. We will return to this again in the chapter on multi-sensor estimation.

⁶ The first derivative of the log-likelihood is called the score function.


2.3.5 The Relation between Shannon and Fisher Measures

The question arises: if entropy is considered to be the only reasonable measure of information content in a distribution, why consider Fisher information at all, as it must, by definition, be an 'unreasonable' measure of information? Fortunately, there is a sensible resolution of this problem, in that for continuous variables Fisher information and Shannon information are indeed related through the log-likelihood function. Broadly, entropy is related to the volume of a set (formally a 'typical set') containing a specified probability mass. Fisher information is related to the surface area of this typical set. Thus maximization of Fisher information is equivalent to minimization of entropy. A detailed development of this relation, using the Asymptotic Equipartition Property (AEP), is given in Cover [16].

Example 14

If the pdf associated with a random variable x is known to be Gaussian, with mean x̄ and covariance P (as in Equation 51), then an explicit relation between Fisher and Shannon information can be obtained, and indeed is given by Equation 42 (Example 10) as

H(x) = (1/2) log[(2πe)^n |P|].

This clearly shows the relation between the Fisher surface measure of information, P⁻¹, and the entropic volume measure, through the determinant of P.

2.4 Alternatives to Probability

The representation of uncertainty is so important to the problem of information fusion that a number of alternative modeling techniques have been proposed to deal with perceived limitations in probabilistic methods⁷.

There are four main perceived limitations of probabilistic modeling techniques:

1. Complexity: the need to specify a large number of probabilities to be able to apply probabilistic reasoning methods correctly.

2. Inconsistency: the difficulties involved in specifying a consistent set of beliefs in terms of probability and using these to obtain consistent deductions about states of interest.

3. Precision of models: the need to be precise in the specification of probabilities for quantities about which little is known.

4. Uncertainty about uncertainty: the difficulty in assigning probability in the face of uncertainty, or ignorance about the source of information.

⁷ In the literature on the subject, there appears to be no middle ground; you are either a religious Bayesian or a rabid non-conformist.


There are three main techniques put forward to address these issues: interval calculus, fuzzy logic, and the theory of evidence (Dempster-Shafer methods). We briefly discuss each of these in turn.

2.4.1 Interval Calculus

The representation of uncertainty using an interval to bound true parameter values has a number of potential advantages over probabilistic techniques. In particular, intervals provide a good measure of uncertainty in situations where there is a lack of probabilistic information, but in which sensor and parameter error is known to be bounded. In interval techniques, the uncertainty in a parameter x is simply described by a statement that the true value of the state x is known to be bounded from below by a and from above by b: x ∈ [a, b]. It is important that no additional probabilistic structure is implied; in particular, the statement x ∈ [a, b] does not necessarily imply that x is equally probable (uniformly distributed) over the interval [a, b].

There are a number of simple and basic rules for the manipulation of interval errors. These are described in detail in the book by Moore [30] (whose analysis was originally aimed at understanding limited-precision computer arithmetic). Briefly, with a, b, c, d ∈ ℝ, addition, subtraction, multiplication and division are defined by the following algebraic relations:

[a, b] + [c, d] = [a + c, b + d] (52)
[a, b] − [c, d] = [a − d, b − c] (53)
[a, b] × [c, d] = [min(ac, ad, bc, bd), max(ac, ad, bc, bd)] (54)
[a, b] / [c, d] = [a, b] × [1/d, 1/c], 0 ∉ [c, d]. (55)

Interval addition and multiplication are both associative and commutative; with intervals A, B, C ⊂ ℝ,

A + (B + C) = (A + B) + C, A × (B × C) = (A × B) × C, (56)
A + B = B + A, A × B = B × A. (57)

The distributive law does not always hold; however, a weaker law of subdistributivity can be applied on the basis that an interval is simply a set of real numbers:

A × (B + C) ⊆ A × B + A × C, (58)

with equality when A = [a, a] = a (a single number), or when B × C > 0 (the intervals B and C contain numbers of the same sign).

Interval arithmetic admits an obvious metric distance measure,

d([a, b], [c, d]) = max(|a − c|, |b − d|). (59)

It follows that

d(A, B) = d(B, A), d(A, B) = 0 iff A = B, d(A, B) + d(B, C) ≥ d(A, C). (60)
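These rules translate directly into a small interval class (a sketch; the class and the example intervals are assumptions for illustration):

    class Interval:
        # Closed interval [a, b] with the arithmetic of Equations 52-55.
        def __init__(self, a, b):
            self.a, self.b = a, b

        def __add__(self, o):
            return Interval(self.a + o.a, self.b + o.b)

        def __sub__(self, o):
            return Interval(self.a - o.b, self.b - o.a)

        def __mul__(self, o):
            p = (self.a * o.a, self.a * o.b, self.b * o.a, self.b * o.b)
            return Interval(min(p), max(p))

        def __truediv__(self, o):
            assert not (o.a <= 0 <= o.b), "divisor must not contain zero"
            return self * Interval(1 / o.b, 1 / o.a)

        def __repr__(self):
            return f"[{self.a}, {self.b}]"

    def dist(A, B):
        # The metric of Equation 59.
        return max(abs(A.a - B.a), abs(A.b - B.b))

    A, B = Interval(1, 2), Interval(3, 5)
    print(A + B, A - B, A * B, B / A, dist(A, B))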


Matrix arithmetic using intervals is also possible, but substantially more complex, particularly when matrix inversion is required.

Interval calculus methods are sometimes used for detection. However, they are not generally used in data fusion problems as: (i) it is difficult to get results that converge to anything of value (the technique is too pessimistic); and (ii) it is hard to encode the dependencies between variables which are at the core of many data fusion problems.

2.4.2 Fuzzy Logic

Fuzzy logic has found widespread popularity as a method for representing uncertainty, particularly in applications such as supervisory control and high-level data fusion tasks. It is often claimed that fuzzy logic provides an ideal tool for inexact reasoning, particularly in rule-based systems. Certainly, fuzzy logic has had some notable success in practical application. However, the jury is still out on whether fuzzy logic offers more or less than conventional probabilistic methods.

A great deal has been written about fuzzy sets and fuzzy logic (see for example [17] and the discussion in [9], Chapter 11). Here we briefly describe the main definitions and operations without any attempt to consider the more advanced features of fuzzy logic methods.

Consider a universal set consisting of the elements x: X = {x}. Consider a proper subset A ⊆ X such that

A = {x | x has some specific property}.

In conventional logic systems, we can define a membership function µA(x) which reports if a specific element x ∈ X is a member of this set:

µA(x) = 1 if x ∈ A, µA(x) = 0 if x ∉ A.

For example, X may be the set of all aircraft and A the set of all supersonic aircraft. In the fuzzy logic literature, this is known as a 'crisp' set.

In contrast, a fuzzy set is one in which there is a degree of membership, ranging between 0 and 1. A fuzzy membership function µA(x) then defines the degree of membership of an element x ∈ X in the set A. For example, if X is again the set of all aircraft, A may be the set of all 'fast' aircraft. Then the fuzzy membership function µA(x) assigns a value between 0 and 1 indicating the degree of membership of every aircraft x to this set. Formally,

µA : X → [0, 1].

Figure 10 shows an instance of such a fuzzy membership function. Clearly, despite the 'fuzzy' nature of the set, it must still be quantified in some form.

Composition rules for fuzzy sets follow the composition processes for normal crisp sets,specifically


Figure 10: Example fuzzy membership function

AND is implemented as a minimum:

µ_{A∩B}(x) = min[µA(x), µB(x)].

OR is implemented as a maximum:

µ_{A∪B}(x) = max[µA(x), µB(x)].

NOT is implemented as a complement:

µ_Ā(x) = 1 − µA(x).

The normal properties associated with binary logic now hold: commutativity, associativity, idempotence, distributivity, De Morgan's law and absorption. The only exception is that the law of the excluded middle is no longer true:

A ∪ Ā ≠ X, A ∩ Ā ≠ ∅.

Together, these definitions and laws provide a systematic means of reasoning about inexact values.
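A small sketch makes the operations concrete; the membership function for the 'fast aircraft' set is an assumed piecewise-linear ramp in the spirit of Figure 10:

    def mu_fast(speed):
        # Assumed membership of the 'fast aircraft' set: a linear ramp from
        # no membership at 100 m/s to full membership at 300 m/s.
        return min(1.0, max(0.0, (speed - 100.0) / 200.0))

    def mu_and(a, b):   # intersection: minimum
        return min(a, b)

    def mu_or(a, b):    # union: maximum
        return max(a, b)

    def mu_not(a):      # complement
        return 1.0 - a

    fast = mu_fast(250.0)                 # 0.75
    print(mu_and(fast, mu_not(fast)))     # 0.25: A intersect not-A is not empty
    print(mu_or(fast, mu_not(fast)))      # 0.75: A union not-A is not everything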

The relationship between fuzzy set theory and probability is still hotly debated. It should however be noted that the minimum and maximum operations above are highly suggestive of the results provided by probability theory when the two sets are either wholly dependent or wholly independent.


2.4.3 Evidential Reasoning

Evidential reasoning (often called the Dempster-Shafer theory of evidence, after the originators of these ideas) has seen intermittent success, particularly in automated reasoning applications.

Evidential reasoning is qualitatively different from either probabilistic methods or fuzzy set theory in the following sense. Consider a universal set X. In probability theory or fuzzy set theory, a belief mass may be placed on any element xi ∈ X and indeed on any subset A ⊆ X. In evidential reasoning, belief mass can be placed not only on elements and sets, but also on sets of sets. Specifically, while the domain of probabilistic methods is all possible subsets, the domain of evidential reasoning is the domain of all sets of subsets.

Example 15

Consider the set X = {rain, no rain}. The set is complete and mutually exclusive (at least about the event rain). In probability theory, by way of a forecast, we might assign a probability to each possible event; for example, P(rain) = 0.3, and thus P(no rain) = 0.7.

In evidential reasoning, we construct the set of all subsets, otherwise called the power set, as

2^X = { {rain, no rain}, {rain}, {no rain}, ∅ },

and belief mass is assigned to all elements of this set as

m({rain, no rain}) = 0.5
m({rain}) = 0.3
m({no rain}) = 0.2
m(∅) = 0.0

(traditionally, the empty set is assigned a belief mass of zero for normalisation purposes). The interpretation of this is that there is a 30% chance of rain, a 20% chance of no rain, and a 50% chance of either rain or no rain. In effect, the measure placed on the set containing both rain and no rain is a measure of ignorance, or of an inability to distinguish between the two alternatives.

Evidential reasoning thus provides a method of capturing ignorance, or an inability to distinguish between alternatives. In probability theory, this would be dealt with in a very different manner, by assigning an equal or uniform probability to each alternative. Yet stating that there is a 50% chance of rain is clearly not the same as saying that it is unknown whether it will rain or not.

The use of the power set as the 'frame of discernment' allows a far richer representation of beliefs. However, this comes at the cost of a substantial increase in complexity. If there are n elements in the original set X, then there will be 2^n possible subsets (hence the name power set) on which a belief mass may be assigned. For large n, this is clearly intractable. Further, when the set is continuous, the set of all subsets is not even measurable.

Given a frame of discernment, two mass assignments m(A) and m(B) can be combined to yield a third mass assignment, m(C) = m(C | A, B), using Dempster's rule of combination

m(C | A, B) = [1/(1 − κ)] Σ_{i,j : Ai ∩ Bj = C} m(Ai) m(Bj), (61)

where κ is a measure of inconsistency between the two mass assignments m(A) and m(B), defined as the mass product from the sets Ai and Bj which have zero intersection:

κ = Σ_{i,j : Ai ∩ Bj = ∅} m(Ai) m(Bj). (62)

Note that the magnitude of κ is a measure of the degree to which the two mass distributions are in conflict. Since mass distribution can 'move freely' between singletons of a set, two sets are not conflicting unless they have no common elements.
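Dempster's rule is only a few lines of code when mass assignments are stored against sets (a sketch; the second mass assignment below is an assumed report, not taken from the text):

    def dempster_combine(m1, m2):
        # Combine two mass assignments (dicts mapping frozenset -> mass)
        # using Equations 61 and 62.
        combined, kappa = {}, 0.0
        for A, mA in m1.items():
            for B, mB in m2.items():
                C = A & B
                if C:
                    combined[C] = combined.get(C, 0.0) + mA * mB
                else:
                    kappa += mA * mB          # conflicting (disjoint) mass
        return {C: m / (1.0 - kappa) for C, m in combined.items()}, kappa

    RAIN, DRY = frozenset({"rain"}), frozenset({"no rain"})
    EITHER = RAIN | DRY

    m1 = {EITHER: 0.5, RAIN: 0.3, DRY: 0.2}   # the forecast of Example 15
    m2 = {EITHER: 0.4, RAIN: 0.5, DRY: 0.1}   # an assumed second report
    fused, conflict = dempster_combine(m1, m2)
    print(fused)      # {either: 0.230, rain: 0.598, no rain: 0.172}
    print(conflict)   # kappa = 0.13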

In addition to basic mass assignments, evidential reasoning introduces the concepts of support Spt(C) and plausibility Pls(C) of any proposition C ⊆ X as

Spt(C | A, B) = [1/(1 − κ)] Σ_{i,j : Ai ∩ Bj ⊆ C} m(Ai) m(Bj)
= Σ_{D : D ⊆ C} m(D | A, B), (63)

and

Pls(C | A, B) = [1/(1 − κ)] Σ_{i,j : (Ai ∩ Bj) ∩ C ≠ ∅} m(Ai) m(Bj)
= Σ_{D : D ∩ C ≠ ∅} m(D | A, B)
= 1 − Spt(C̄ | A, B), (64)

where C̄ is the complement of C.

The support for a proposition C is the evidence directly assigned to the proposition C or to any subset of C (that is, to any proposition which implies C). The plausibility of a proposition C is the sum of all the mass assigned to propositions which have a non-null intersection with C (that is, to all propositions which do not contradict C).

Example 16

(After [9], p. 511.) Spt(EFA) is the sum of masses assigned directly to the proposition that an aircraft is any variant of the EFA. In contrast, the sum for Pls(EFA) would include mass assigned to any proposition that does not contradict the EFA. The plausibility sum would include the mass of propositions such as aircraft, fixed-wing aircraft, fighter or friend. It would not include the mass of propositions such as foe bomber or helicopter.


There is no doubt that evidential reasoning will find a role in advanced data fusion systems, particularly in areas such as attribute fusion and situation assessment. However, it should be noted that the role played by probability theory in this is still generally unresolved (see also [11, 13, 43]).


3 Multi-Sensor Estimation

Estimation is the single most important problem in sensor data fusion. Fundamentally, an estimator is a decision rule which takes as an argument a sequence of observations and whose action is to compute a value for the parameter or state of interest. Almost all data fusion problems involve this estimation process: we obtain a number of observations from a group of sensors and, using this information, we wish to find some estimate of the true state of the environment we are observing. Estimation encompasses all important aspects of the data fusion problem. Sensor models are required to understand what information is provided, environment models are required to relate observations made to the parameters and states to be estimated, and some concept of information value is needed to judge the performance of the estimator. Defining and solving an estimation problem is almost always the key to a successful data fusion system.

This section begins with a brief summary of the Kalman filter algorithm. The intention is to introduce notation and key data fusion concepts; prior familiarity with the basic Kalman filter algorithm is assumed (see either [18] or the numerous excellent books on Kalman filtering [12, 7, 28]). The multi-sensor Kalman filter is then discussed. Three main algorithms are considered: the group-sensor method, the sequential-sensor method and the inverse-covariance form. The track-to-track fusion algorithm is also described. The problems of multiple-target tracking and data association are then described, and the three most important algorithms for data association are introduced. Finally, alternative estimation methods are discussed, in particular maximum likelihood filters and various probability-distribution oriented methods. Subsequent sections consider the distributed Kalman filter and different data fusion architectures.

3.1 The Kalman Filter

The Kalman filter is a recursive linear estimator which successively calculates an estimate for a continuous-valued state, that evolves over time, on the basis of periodic observations of this state. The Kalman filter employs an explicit statistical model of how the parameter of interest x(t) evolves over time and an explicit statistical model of how the observations z(t) that are made are related to this parameter. The gains employed in a Kalman filter are chosen to ensure that, with certain assumptions about the observation and process models used, the resulting estimate x̂(t) minimises the mean-squared error

L(t) = ∫_{−∞}^{+∞} (x(t) − x̂(t))ᵀ (x(t) − x̂(t)) P(x(t) | Zᵗ) dx. (65)

Differentiation of Equation 65 with respect to x̂(t), setting the result equal to zero, gives

x̂(t) = ∫_{−∞}^{+∞} x(t) P(x(t) | Zᵗ) dx, (66)

which is simply the conditional mean x̂(t) = E{x(t) | Zᵗ}. The Kalman filter, and indeed any mean-squared-error estimator, computes an estimate which is the conditional mean: an average, rather than a most likely value.


The Kalman filter has a number of features which make it ideally suited to dealing with complex multi-sensor estimation and data fusion problems. In particular, the explicit description of process and observations allows a wide variety of different sensor models to be incorporated within the basic algorithm. In addition, the consistent use of statistical measures of uncertainty makes it possible to quantitatively evaluate the role each sensor plays in overall system performance. Further, the linear recursive nature of the algorithm ensures that its application is simple and efficient. For these reasons, the Kalman filter has found widespread application in many different data fusion problems [38, 5, 7, 28].

3.1.1 State and Sensor Models

The starting point for the Kalman filter algorithm is to define a model for the states to be estimated in the standard state-space form,

ẋ(t) = F(t)x(t) + B(t)u(t) + G(t)v(t), (67)

where

x(t) ∈ ℝ^n is the state vector of interest,
u(t) ∈ ℝ^s is a known control input,
v(t) ∈ ℝ^q is a random variable describing uncertainty in the evolution of the state,
F(t) is the n × n state (model) matrix,
B(t) is the n × s input matrix, and
G(t) is the n × q noise matrix.

An observation (output) model is also defined in standard state-space form,

z(t) = H(t)x(t) + D(t)w(t), (68)

where

z(t) ∈ ℝ^m is the observation vector,
w(t) ∈ ℝ^r is a random variable describing uncertainty in the observation,
H(t) is the m × n observation (model) matrix, and
D(t) is the m × r observation noise matrix.

These equations define the evolution of a continuous-time system with continuous observations being made of the state. However, the Kalman filter is almost always implemented in discrete time. It is straightforward to obtain a discrete-time version of Equations 67 and 68.


First, a discrete-time set t = {t0, t1, ..., tk, ...} is defined. Equation 68 can be written in discrete time as

z(tk) = H(tk)x(tk) + D(tk)w(tk), ∀ tk ∈ t, (69)

where z(tk), x(tk) and w(tk) are the discrete-time observation, state and noise vectors respectively, and H(tk) and D(tk) are the observation and noise models evaluated at the discrete time instant tk. The discrete-time form of the state equation requires integration of Equation 67 over the interval (tk−1, tk) as

x(tk) = Φ(tk, tk−1) x(tk−1) + ∫_{tk−1}^{tk} Φ(tk, τ) B(τ) u(τ) dτ + ∫_{tk−1}^{tk} Φ(tk, τ) G(τ) v(τ) dτ, (70)

where Φ(·, ·) is the state transition matrix satisfying the matrix differential equation

d/dt Φ(tk, tk−1) = F(tk) Φ(tk, tk−1), Φ(tk−1, tk−1) = 1. (71)

The state transition matrix has three important properties that should be noted:

1. It is uniquely defined for all t, t0 in [0, ∞).

2. (The semi-group property) Φ(t3, t1) = Φ(t3, t2) Φ(t2, t1).

3. Φ(tk, tk−1) is non-singular and Φ⁻¹(tk, tk−1) = Φ(tk−1, tk).

When F(t) = F is a constant matrix, the state transition matrix is given by

Φ(tk, tk−1) = Φ(tk − tk−1) = exp{F (tk − tk−1)}, (72)

which clearly satisfies these three properties. If u(t) = u(tk) and v(t) = v(tk) remain approximately constant over the interval (tk−1, tk), then the following discrete-time models can be defined:

F(tk) = Φ(tk, tk−1),
B(tk) = ∫_{tk−1}^{tk} Φ(tk, τ) B(τ) dτ,
G(tk) = ∫_{tk−1}^{tk} Φ(tk, τ) G(τ) dτ. (73)

Equation 70 can then be written in a discrete-time form equivalent to Equation 67 as

x(tk) = F(tk) x(tk−1) + B(tk) u(tk) + G(tk) v(tk). (74)

The accuracy of this model could be improved by taking mean values for both u(t) and v(t) over the sampling interval.
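For a time-invariant F, the discretisation of Equations 72 and 73 can be checked numerically; a sketch using scipy's matrix exponential (the constant-velocity F and the crude quadrature below are assumptions for illustration):

    import numpy as np
    from scipy.linalg import expm

    dt = 0.1
    F_cont = np.array([[0.0, 1.0],
                       [0.0, 0.0]])          # constant-velocity model (Example 17)
    B_cont = np.array([[0.0], [1.0]])

    # State transition matrix Phi(dt) = exp(F dt), Equation 72.
    Phi = expm(F_cont * dt)
    print(Phi)                               # [[1, dt], [0, 1]]

    # B(k) from Equation 73 by simple numerical quadrature.
    taus = np.linspace(0.0, dt, 1001)
    B_disc = sum(expm(F_cont * (dt - t)) @ B_cont for t in taus) * (taus[1] - taus[0])
    print(B_disc)                            # approximately [dt^2 / 2, dt]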


In almost all cases the time interval ∆t(k) = tk − tk−1 between successive samples of the state remains constant. In this case it is common practice to drop the time argument and simply index variables by the sample number, so that Equation 74 is written as

x(k) = F(k)x(k − 1) + B(k)u(k) + G(k)v(k), (75)

and Equation 69 as

z(k) = H(k)x(k) + D(k)w(k). (76)

Equations 75 and 76 are the model forms that will be used throughout this section unless discussion of asynchronous data is relevant.

A basic assumption in the derivation of the Kalman filter is that the random sequences v(k) and w(k) describing process and observation noise are all Gaussian, temporally uncorrelated and zero-mean,

E{v(k)} = E{w(k)} = 0, ∀ k, (77)

with known covariance

E{v(i)vᵀ(j)} = δij Q(i), E{w(i)wᵀ(j)} = δij R(i). (78)

It is also generally assumed that the process and observation noises are uncorrelated:

E{v(i)wᵀ(j)} = 0, ∀ i, j. (79)

These assumptions are effectively equivalent to a Markov property requiring observations and successive states to be conditionally independent. If the sequences v(k) and w(k) are temporally correlated, a shaping filter can be used to 'whiten' the observations, again making the assumptions required for the Kalman filter valid [28]. If the process and observation noise sequences are correlated, then this correlation can also be accounted for in the Kalman filter algorithm [2]. If a sequence is not Gaussian, but is symmetric with finite moments, then the Kalman filter will still produce good estimates. If, however, the sequence has a distribution which is skewed or otherwise 'pathological', the results produced by the Kalman filter will be misleading and there will be a good case for using a more sophisticated Bayesian filter [40]. Problems in which the process and observation models are non-linear are dealt with in Section 3.1.7.

We will use the following standard example of constant-velocity particle motion as the basis for many of the subsequent examples on tracking and data fusion:

Example 17

Consider the linear continuous-time model for the motion of a particle moving with approximately constant velocity:

d/dt [ x(t) ]   [ 0  1 ] [ x(t) ]   [  0   ]
     [ ẋ(t) ] = [ 0  0 ] [ ẋ(t) ] + [ v(t) ].


Figure 11: Plot of true target position and observations made of target position for the constant-velocity target model described in Example 17, with σq = 0.01 m/s and σr = 0.1 m.

In this case the state transition matrix from Equation 72 over a time interval ∆T is given by

Φ(∆T) = [ 1  ∆T ]
        [ 0   1 ].

On the assumption that the process noise is white and uncorrelated, with E{v(t)} = 0 and E{v(t)v(τ)} = σ²q(t) δ(t − τ), the equivalent discrete-time noise process is given by

v(k) = ∫₀^{∆T} [τ, 1]ᵀ v(k∆T + τ) dτ = [∆T²/2, ∆T]ᵀ v(k).

With the definitions given in Equation 73, the equivalent discrete-time model is given by

[ x(k) ]   [ 1  ∆T ] [ x(k−1) ]   [ ∆T²/2 ]
[ ẋ(k) ] = [ 0   1 ] [ ẋ(k−1) ] + [ ∆T    ] v(k).

If observations are made at each time step k of the location of the particle, the observation model will be of the form

zx(k) = [ 1  0 ] [ x(k), ẋ(k) ]ᵀ + w(k), E{w²(k)} = σ²r.


Figure 11 shows a typical constant-velocity target motion generated according to these models. The target velocity executes a random walk, and observations are randomly dispersed around the true target position.

These equations can trivially be extended to two (and three) dimensions, giving a two-dimensional model in the form:

[ x(k) ]   [ 1  ∆T  0   0 ] [ x(k−1) ]   [ ∆T²/2     0   ]
[ ẋ(k) ]   [ 0   1  0   0 ] [ ẋ(k−1) ]   [ ∆T        0   ] [ vx(k) ]
[ y(k) ] = [ 0   0  1  ∆T ] [ y(k−1) ] + [ 0      ∆T²/2  ] [ vy(k) ], (80)
[ ẏ(k) ]   [ 0   0  0   1 ] [ ẏ(k−1) ]   [ 0        ∆T   ]

and

[ zx(k) ]   [ 1  0  0  0 ]
[ zy(k) ] = [ 0  0  1  0 ] [ x(k), ẋ(k), y(k), ẏ(k) ]ᵀ + [ wx(k), wy(k) ]ᵀ, (81)

with

R(k) = E{w(k)wᵀ(k)} = [ σ²rx    0   ]
                      [   0   σ²ry  ].

This motion model is widely used in target tracking problems (in practice as well as in theory) as it is simple and linear, and admits easy solution of multiple-target problems. Figure 12 shows a typical x−y target plot generated using this model. Target velocity and heading can be deduced from the magnitude and orientation of the estimated velocity vector [ẋ(k), ẏ(k)]ᵀ (Figure 12 (c) and (d)). The plots show that this simple model is capable of accommodating many reasonable motion and maneuver models likely to be encountered in typical target tracking problems. It should however be noted that this track is equivalent to running independent models for both x and y, as shown in Figure 11. In reality, motions in the x and y directions will be physically coupled, and therefore correlated, by the heading of the target. This (potentially valuable) information is lost in this target model.

As this example shows, it is always best to start with a continuous-time model for the state and then construct a discrete model, rather than stating the discrete model at the outset. This allows for correct identification of noise transfers and noise correlations.
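A short simulation of the discrete-time model of Equations 75 and 76 for this example (the initial state is an assumption; the noise strengths are those quoted with Figure 11):

    import numpy as np

    rng = np.random.default_rng(0)
    dt, steps = 1.0, 100
    sigma_q, sigma_r = 0.01, 0.1             # process / observation noise strengths

    F = np.array([[1.0, dt], [0.0, 1.0]])
    G = np.array([dt ** 2 / 2, dt])
    H = np.array([1.0, 0.0])

    x = np.array([0.0, 0.2])                 # assumed initial position and velocity
    track, obs = [], []
    for _ in range(steps):
        x = F @ x + G * rng.normal(0.0, sigma_q)        # state evolution
        track.append(x[0])
        obs.append(H @ x + rng.normal(0.0, sigma_r))    # position observation
    print(track[-1], obs[-1])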

3.1.2 The Kalman Filter Algorithm

The Kalman filter algorithm produces estimates that minimise mean-squared estimation error conditioned on a given observation sequence:

\[
x(i \mid j) = \arg \min_{x(i \mid j) \in \mathbb{R}^n}
E\left\{ (x(i) - x(i \mid j))(x(i) - x(i \mid j))^T \mid z(1), \cdots, z(j) \right\}. \tag{82}
\]

As has been previously demonstrated (Equation 66), the estimate obtained is simply the expected value of the state at time i conditioned on the observations up to time j. The



Figure 12: True target track for the uncoupled linear target model of Example 17: (a) x−y track; (b) Detail of track and corresponding observations; (c) Deduced target velocity; (d) Deduced target heading.


estimate is thus defined as the conditional mean

\[
x(i \mid j) = E\{ x(i) \mid z(1), \cdots, z(j) \} = E\{ x(i) \mid Z^j \}. \tag{83}
\]

The estimate variance is defined as the mean squared error in this estimate,

\[
P(i \mid j) = E\{ (x(i) - x(i \mid j))(x(i) - x(i \mid j))^T \mid Z^j \}. \tag{84}
\]

The estimate of the state at a time k given all information up to time k will be written as x(k | k). The estimate of the state at a time k given only information up to time k−1 is called a one-step-ahead prediction (or just a prediction) and is written as x(k | k−1).

The Kalman filter algorithm is now stated without proof. Detailed derivations can be found in many books on the subject, [28, 7] for example (see also [18]). The state is assumed to evolve in time according to Equation 75. Observations of this state are made at regular time intervals according to Equation 76. The assumptions about the noise processes entering the system, as described by Equations 77, 78 and 79, are assumed true. It is also assumed that an estimate x(k−1 | k−1) of the state x(k−1) at time k−1, based on all observations made up to and including time k−1, is available, and that this estimate is equal to the conditional mean of the true state x(k−1) conditioned on these observations. The conditional variance P(k−1 | k−1) in this estimate is also assumed known. The Kalman filter then proceeds recursively in two stages:

Prediction: A prediction x(k | k−1) of the state at time k and its covariance P(k | k−1) is computed according to

\[
x(k \mid k-1) = F(k) x(k-1 \mid k-1) + B(k) u(k) \tag{85}
\]
\[
P(k \mid k-1) = F(k) P(k-1 \mid k-1) F^T(k) + G(k) Q(k) G^T(k). \tag{86}
\]

Update: At time k an observation z(k) is made and the updated estimate x(k | k) of the state x(k), together with the updated estimate covariance P(k | k), is computed from the state prediction and observation according to

\[
x(k \mid k) = x(k \mid k-1) + W(k) \left( z(k) - H(k) x(k \mid k-1) \right) \tag{87}
\]
\[
P(k \mid k) = (1 - W(k)H(k)) P(k \mid k-1) (1 - W(k)H(k))^T + W(k) R(k) W^T(k) \tag{88}
\]

where the gain matrix W(k) is given by

\[
W(k) = P(k \mid k-1) H^T(k) \left[ H(k) P(k \mid k-1) H^T(k) + R(k) \right]^{-1}. \tag{89}
\]

The Kalman filter is recursive or cyclic (see Figure 13). We start with an estimate, generate a prediction, make an observation, then update the prediction to an estimate. The filter makes explicit use of the process model in generating a prediction and the observation model in generating an update.



Figure 13: Block diagram of the Kalman filter cycle (after Bar-Shalom and Fortmann 1988 [7])

The update stage of the filter is clearly linear, with a weight W(k) being associated with the observation z(k) and a weight 1 − W(k)H(k) being associated with the prediction x(k | k−1). The Kalman filter also provides a propagation equation for the covariance in the prediction and estimate.
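The prediction and update stages translate directly into code. The following sketch (illustrative only, assuming numpy; the Joseph form of Equation 88 is used for the covariance update) implements one cycle of Equations 85–89:

```python
import numpy as np

def kf_predict(x_est, P_est, F, Q, B=None, u=None):
    """Prediction stage (Equations 85 and 86, with G(k)Q(k)G^T(k)
    already folded into Q)."""
    x_pred = F @ x_est
    if B is not None and u is not None:
        x_pred = x_pred + B @ u
    P_pred = F @ P_est @ F.T + Q
    return x_pred, P_pred

def kf_update(x_pred, P_pred, z, H, R):
    """Update stage (Equations 87-89)."""
    S = H @ P_pred @ H.T + R                       # innovation covariance
    W = P_pred @ H.T @ np.linalg.inv(S)            # gain matrix (Equation 89)
    x_est = x_pred + W @ (z - H @ x_pred)          # Equation 87
    I_WH = np.eye(len(x_pred)) - W @ H
    P_est = I_WH @ P_pred @ I_WH.T + W @ R @ W.T   # Joseph form (Equation 88)
    return x_est, P_est
```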

Example 18
It is straightforward to build a state estimator for the target and observation model of Example 17. The Kalman filter Equations 87–89 are implemented with F(k) and Q(k) as defined in Equation 80 and with H(k) and R(k) as defined in Equation 81. Initial conditions for x(0 | 0) are determined from the first few observations, with P(0 | 0) = 10Q(k) (see [18] for details). The results of the filter implementation are shown in Figure 14. The Figure shows results for the x component of the track. It should be noted that the estimate always lies between the observation and prediction (it is a weighted sum of these terms). Note also that the errors (between true and estimated) in velocity


and heading are zero mean and white, indicating a good match between model and estimator.

3.1.3 The Innovation

A prediction can be made as to what observation will be made at a time k, based on the observations that have been made up to time k−1, by simply taking expectations of the observation Equation 76 conditioned on previous observations:

\[
z(k \mid k-1) = E\{ z(k) \mid Z^{k-1} \}
= E\{ H(k)x(k) + w(k) \mid Z^{k-1} \}
= H(k) x(k \mid k-1). \tag{90}
\]

The difference between the observation z(k) and the predicted observation H(k)x(k | k−1) is termed the innovation or residual ν(k):

\[
\nu(k) = z(k) - H(k) x(k \mid k-1). \tag{91}
\]

The innovation is an important measure of the deviation between the filter estimates and the observation sequence. Indeed, because the 'true' states are not usually available for comparison with the estimated states, the innovation is often the only measure of how well the estimator is performing. The innovation is particularly important in data association.

The most important property of innovations is that they form an orthogonal, uncorrelated, white sequence,

\[
E\{ \nu(k) \mid Z^{k-1} \} = 0, \qquad E\{ \nu(i) \nu^T(j) \} = S(i) \delta_{ij}, \tag{92}
\]

where

\[
S(k) = H(k) P(k \mid k-1) H^T(k) + R(k) \tag{93}
\]

is the innovation covariance. This can be exploited in monitoring filter performance. The innovation and the innovation covariance can be used to express an alternative, simpler form of the update equations:

\[
x(k \mid k) = x(k \mid k-1) + W(k) \nu(k) \tag{94}
\]
\[
P(k \mid k) = P(k \mid k-1) - W(k) S(k) W^T(k) \tag{95}
\]

and, from Equation 89,

\[
W(k) = P(k \mid k-1) H^T(k) S^{-1}(k). \tag{96}
\]

This is the preferred form of the update equations in the remainder of this course.

Example 19
Continuing Example 18, the innovation and innovation covariance can be calculated from Equations 91 and 93. These are shown in Figure 15(a). The most important points to note are that the innovation sequence is zero mean and white, and that approximately



Figure 14: Estimated target track for the linear uncoupled target model of Examples 17 and 18: (a) Estimate and prediction of x−y track; (b) Detail of estimated track and corresponding observations. Note the estimate always lies between the observation and prediction; (c) Error in estimated target velocity; (d) Error in estimated target heading. Note both heading and velocity errors are zero mean and white.


65% of all innovations lie within the 'one-sigma' standard deviation bounds. Figure 15(b) shows the standard deviation (estimated error) in the state prediction and state estimate (for the position state). It is clear that the standard deviations, and therefore covariances, rapidly converge to constant values. The innovation covariance and gain matrix will also therefore converge to constant values.
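As a practical aside (added here, not in the original text), the one-sigma consistency of an innovation sequence is easy to check numerically. A minimal sketch, assuming scalar innovations and numpy:

```python
import numpy as np

def innovation_consistency(innovations, innovation_vars):
    """Fraction of scalar innovations within their one-sigma bounds.

    For a well-matched filter this should be roughly 0.65, as noted
    in Example 19."""
    nu = np.asarray(innovations)
    sigma = np.sqrt(np.asarray(innovation_vars))
    return np.mean(np.abs(nu) <= sigma)
```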

3.1.4 Understanding the Kalman Filter

There are three different, but related, ways of thinking about the Kalman filter algorithm: as a simple weighted average; as a conditional mean computed from knowledge of the correlation between observation and state; and as an orthogonal decomposition and projection operation. Each of these interpretations is exploited at different points in this course.

The interpretation of the Kalman filter as a weighted average follows directly from Equation 87 as

\[
x(k \mid k) = [1 - W(k)H(k)]\, x(k \mid k-1) + W(k) z(k). \tag{97}
\]

It is interesting to explicitly determine the weights, 1 − W(k)H(k) and W(k), associated with the prediction and observation respectively. Pre-multiplying Equation 87 by H(k), substituting for the gain matrix and simplifying, gives

\[
H(k) x(k \mid k) = \left[ (H(k) P(k \mid k-1) H^T(k))^{-1} + R^{-1}(k) \right]^{-1}
\times \left[ (H(k) P(k \mid k-1) H^T(k))^{-1} H(k) x(k \mid k-1) + R^{-1}(k) z(k) \right].
\]

This is just a weighted sum of the predicted observation H(k)x(k | k−1) and the actual observation z(k), in which (H(k)P(k | k−1)H^T(k))^{−1} is the inverse of the prediction covariance projected into observation space and is the 'confidence' in the predicted observation, and R^{−1}(k) is the inverse observation noise covariance and is the 'confidence' in the observation itself. The sum is normalised by the total confidence [(H(k)P(k | k−1)H^T(k))^{−1} + R^{−1}(k)]^{−1} and the result is a new estimate for the state projected into the observation space as H(k)x(k | k). The covariance in this projected estimate is given simply by the normalisation factor of the weighted sum:

\[
H(k) P(k \mid k) H^T(k) = \left[ (H(k) P(k \mid k-1) H^T(k))^{-1} + R^{-1}(k) \right]^{-1}. \tag{98}
\]

This averaging process works by first projecting the state vector into observation space as H(k)x(k | k−1), with corresponding covariance H(k)P(k | k−1)H^T(k), where the observation z(k) and its corresponding covariance R(k) are directly available, and then computing an updated estimate H(k)x(k | k) as a weighted sum of these, again in observation space. Clearly, this interpretation of the Kalman filter holds only for those states which can be written as a linear combination of the observation vector.

The Kalman filter is also able to provide estimates of states which are not a linear combination of the observation vector. It does this by employing the cross correlation



Figure 15: Track errors for the linear uncoupled target model of Examples 17 and 18: (a) Innovation and innovation variance in the x estimate; (b) Estimated error (standard deviation) in the x estimate and x prediction.


between observed and unobserved states to find that part of the observation information which is correlated with the unobserved states. In this sense, the Kalman filter is sometimes referred to as a linear observer. This interpretation of the Kalman filter algorithm is most easily appreciated by considering the estimator as derived from the conditional mean, computed directly from Bayes theorem [14, 28]. In this case the estimator of Equation 87 may be written as

\[
x(k \mid k) = x(k \mid k-1) + P_{xz} P_{zz}^{-1} \left[ z(k) - H(k) x(k \mid k-1) \right] \tag{99}
\]

with Equation 88 being written as

\[
P(k \mid k) = P_{xx} - P_{xz} P_{zz}^{-1} P_{xz}^T \tag{100}
\]

where

\[
P_{xz} = E\{ (x(k) - x(k \mid k-1))(z(k) - H(k) x(k \mid k-1))^T \mid Z^{k-1} \}
\]
\[
P_{zz} = E\{ (z(k) - H(k) x(k \mid k-1))(z(k) - H(k) x(k \mid k-1))^T \mid Z^{k-1} \}
\]
\[
P_{xx} = P(k \mid k-1)
\]

are respectively the cross correlation between prediction and observation, the observation covariance, and the prediction covariance. Equation 99 shows how new observations z(k) contribute to the updated estimate x(k | k). First, the difference between the observation and the predicted observation is computed. Then the term P_{xz}, by definition, tells us how the states are correlated with the observations made, and finally P_{zz}^{−1} is used to 'normalise' this correlation by the confidence we have in the new observation. This shows that, provided we have knowledge of how the states are correlated to the observations made, we may use the observation information to estimate the state even in cases where the state is not a linear combination of the observation vector.

A third method of interpreting the Kalman filter is as a geometric operation that projects the true state x(k) onto the subspace spanned by the observation sequence Z^k. Figure 16 shows this graphically using a multi-dimensional space representation. The space spanned by the set Z^k is shown as a plane embedded in the overall state space, Z^{k−1} as a line embedded in the Z^k plane, and the new observation z(k) as a line so arranged that Z^{k−1} and z(k) together generate Z^k. The innovation ν(k) is described as a line orthogonal to Z^{k−1}, also embedded in Z^k. The prediction x(k | k−1) lies in the space Z^{k−1}, W(k)ν(k) lies in the space orthogonal to Z^{k−1}, and the estimate x(k | k) lies in the space Z^k and is again the projection of x(k) into this space. Figure 16 makes clear the role of W(k) and [1 − W(k)H(k)] as complementary projections. The estimate can be thought of either as the orthogonal sum of the prediction x(k | k−1) and the weighted innovation W(k)ν(k), or as the sum of the parallel vectors [1 − W(k)H(k)]x(k | k−1) and W(k)z(k).

3.1.5 Steady-State Filters

When filtering is synchronous, and the state and observation matrices are time-invariant, the state and observation models may be written as

\[
x(k) = F x(k-1) + G u(k) + v, \tag{101}
\]



Figure 16: A projection diagram for the Kalman filter. The complete space is the true state space X^k. The sub-space spanned by the first k observations is shown as the ellipse Z^k. The space Z^{k−1} spanned by the first k−1 observations is shown as a line sub-space of the space Z^k. The true state x(k) lies in the space X^k; the estimate x(k | k) lies in the space Z^k (because x(k | k) is a linear combination of observations up to time k); the prediction x(k | k−1) lies in the space Z^{k−1}. The estimate is a perpendicular projection of the true state onto the space Z^k. By construction, the innovation space ν(k) is perpendicular to the space Z^{k−1}. The new estimate x(k | k) is the sum of: 1) the projection of x(k) onto Z^{k−1} along z(k), given by [1 − W(k)H(k)]x(k | k−1); and 2) the projection of x(k) onto z(k) along Z^{k−1}, given by W(k)z(k). For comparison, the estimate that would result based only on the observation z(k) is x(k | z(k)), the projection of x(k) onto z(k). Diagrammatic method from [41] (see also [14])


\[
z(k) = H x(k) + w, \tag{102}
\]

with

\[
E\{ v v^T \} = Q, \qquad E\{ w w^T \} = R. \tag{103}
\]

The calculations for the state estimate covariances and filter gains do not depend on the values of the observations made. Consequently, if F, H, Q, and R are time invariant, then the innovation covariance S will be constant and the covariance and gain matrices will also tend to constant steady-state values:

\[
P(k \mid k) \to P_\infty^+, \qquad P(k \mid k-1) \to P_\infty^-, \qquad W(k) \to W_\infty. \tag{104}
\]

This convergence is often rapid, as is graphically demonstrated in Figure 15. If the covariances and gain matrix all tend to a constant steady-state value after only a few time-steps, it seems sensible that the computationally intensive process of computing these values on-line be avoided by simply inserting the constant values into the filter from the start. In effect, this means that a constant gain matrix is used in the computation of the estimates

\[
x(k \mid k) = x(k \mid k-1) + W \left[ z(k) - H x(k \mid k-1) \right]. \tag{105}
\]

The steady-state value of the gain matrix W is most easily computed by simply simulating the covariance calculation off-line (which does not require that any observations be made).

In the case of a 'constant velocity' model, the gain matrix can be described by two dimensionless coefficients α and β:

\[
x(k \mid k) = x(k \mid k-1) + \begin{bmatrix} \alpha \\ \beta / T \end{bmatrix}
\left[ z(k) - H x(k \mid k-1) \right]. \tag{106}
\]

This is known as the α–β filter. In the case of a 'constant acceleration' model, the gain matrix can be described by three dimensionless coefficients α, β and γ:

\[
x(k \mid k) = x(k \mid k-1) + \begin{bmatrix} \alpha \\ \beta / T \\ \gamma / T^2 \end{bmatrix}
\left[ z(k) - H x(k \mid k-1) \right]. \tag{107}
\]

This is known as the α–β–γ filter. These constant parameters are usually obtained through simulation and a degree of judgement.

The steady-state filter provides a substantial reduction in the amount of on-line computation that must be performed by the filter. In many situations this may be invaluable, particularly in multi-target tracking applications where many thousands of filters must be employed simultaneously. Providing that the assumption of constant system and noise models applies, the error incurred by using a steady-state filter is insignificant. In practice, steady-state filters are used even in situations where the basic assumptions of linear time-invariance and constant noise injection are violated.



Figure 17: Plot of the difference between x position estimates computed using a full gain history and a steady-state gain for the constant velocity target model described in Example 17.

Example 20
Consider again the Example 18 filter for constant-velocity particle motion. With process and observation noise standard deviations given by σq = 0.01 and σr = 0.1 respectively, and with a constant sample period of one second, the steady-state gain matrix is found to be W = [0.4673, 0.1460]^T. The equivalent steady-state filter (with T = 1) will therefore have parameters α = 0.4673 and β = 0.1460. Figure 17 shows the error between estimates produced by the constant-velocity filter (with full covariance and gain calculation) and the estimates produced by the equivalent α–β filter. After an initially large error, within only a few time-steps the error between the two filters becomes negligible. The large error at the start is caused by initialisation with the steady-state values of gain.
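The off-line gain computation can be illustrated with a short sketch (added here, not from the original text): the covariance and gain recursion is simply iterated until it converges, with no observations required. With a discrete noise model like the one sketched earlier for Example 17, the resulting gain should approach values close to those quoted in Example 20 (the exact figures depend on the discrete process noise model used).

```python
import numpy as np

def steady_state_gain(F, H, Q, R, iters=200):
    """Iterate the covariance/gain recursion off-line until it settles."""
    P = np.eye(F.shape[0])                 # arbitrary initial covariance
    W = None
    for _ in range(iters):
        P_pred = F @ P @ F.T + Q           # prediction covariance
        S = H @ P_pred @ H.T + R           # innovation covariance
        W = P_pred @ H.T @ np.linalg.inv(S)
        P = P_pred - W @ S @ W.T           # updated covariance
    return W

# Steady-state gain for the constant-velocity model of Example 20.
F, G, H, Q, R = constant_velocity_model(dT=1.0, sigma_q=0.01, sigma_r=0.1)
W_inf = steady_state_gain(F, H, Q, R)      # approximately [alpha, beta/T]^T
```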

3.1.6 Asynchronous, Delayed and Asequent Observations

In data fusion systems the observation process is almost never synchronous. This is because information must come from many different sensors, which will each have different sample rates and latencies associated with the acquisition and communication of observations.

Generally, three timing issues of increasing complexity must be addressed:

• Asynchronous data: Information which arrives at random intervals but in a timely manner.


• Delayed data: Information that arrives both at random intervals and late, after an estimate has already been constructed.

• Asequent data: Information that arrives randomly, late and out of time sequence (temporal order).

The asynchronous data problem is relatively easy to accommodate in the existing Kalman filter framework. For asynchronous observations, the general discrete-time state and observation models of Equations 74 and 69 apply. With these definitions, the Kalman filter for asynchronous observations remains essentially the same as for the synchronous case. It is assumed that an estimate of the state x(t_{k−1} | t_{k−1}) at time t_{k−1}, which includes all observations made up to this time, exists. At a later time t_k > t_{k−1} an observation is made according to Equation 69. The first step is simply to generate a prediction of the state at the time the observation was made according to

\[
x(t_k \mid t_{k-1}) = F(t_k) x(t_{k-1} \mid t_{k-1}) + B(t_k) u(t_k), \tag{108}
\]

together with an associated prediction covariance

\[
P(t_k \mid t_{k-1}) = F(t_k) P(t_{k-1} \mid t_{k-1}) F^T(t_k) + G(t_k) Q(t_k) G^T(t_k). \tag{109}
\]

An innovation and innovation covariance are then computed according to

\[
\nu(t_k) = z(t_k) - H(t_k) x(t_k \mid t_{k-1}), \tag{110}
\]
\[
S(t_k) = H(t_k) P(t_k \mid t_{k-1}) H^T(t_k) + R(t_k), \tag{111}
\]

from which an updated state estimate and associated covariance can be found from

\[
x(t_k \mid t_k) = x(t_k \mid t_{k-1}) + W(t_k) \nu(t_k), \tag{112}
\]
\[
P(t_k \mid t_k) = P(t_k \mid t_{k-1}) - W(t_k) S(t_k) W^T(t_k), \tag{113}
\]

where

\[
W(t_k) = P(t_k \mid t_{k-1}) H^T(t_k) S^{-1}(t_k) \tag{114}
\]

is the associated gain matrix. The main point to note here is that the injected process noise variance Q(t_k) and all the model matrices (F(t_k), B(t_k), G(t_k), and H(t_k)) are functions of the time interval t_k − t_{k−1}. Consequently, the computed variances P(t_k | t_k), P(t_k | t_{k−1}), S(t_k), and the gain matrix W(t_k) will not remain constant and must be recomputed at every time step.

Example 21
Consider again the example of the constant-velocity particle motion described by the continuous-time process model

\[
\begin{bmatrix} \dot{x}(t) \\ \ddot{x}(t) \end{bmatrix} =
\begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}
\begin{bmatrix} x(t) \\ \dot{x}(t) \end{bmatrix} +
\begin{bmatrix} 0 \\ v(t) \end{bmatrix}.
\]



Figure 18: Estimated target track for the linear constant-velocity particle model with asynchronous observations and updates. A time-section is shown of: (a) True state and asynchronous observations; (b) State predictions and estimates, together with observations; (c) Innovations and innovation standard deviations; (d) State prediction and state estimate standard deviations (estimated errors).


We assume observations of the position of the particle are made asynchronously at times t_k according to

\[
z_x(t_k) = \begin{bmatrix} 1 & 0 \end{bmatrix}
\begin{bmatrix} x(t_k) \\ \dot{x}(t_k) \end{bmatrix} + w(t_k),
\qquad E\{ w^2(t_k) \} = \sigma_r^2.
\]

We define δt_k = t_k − t_{k−1} as the time difference between any two successive observations made at times t_{k−1} and t_k. With the definitions given by Equation 73, the discrete-time process model becomes

\[
\begin{bmatrix} x(t_k) \\ \dot{x}(t_k) \end{bmatrix} =
\begin{bmatrix} 1 & \delta t_k \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} x(t_{k-1}) \\ \dot{x}(t_{k-1}) \end{bmatrix} +
\begin{bmatrix} \delta t_k^2/2 \\ \delta t_k \end{bmatrix} v(t_k),
\]

with

\[
Q(t_k) = E\{ v(t_k) v^T(t_k) \} =
\int_0^{\delta t_k} \begin{bmatrix} \tau \\ 1 \end{bmatrix}
\begin{bmatrix} \tau & 1 \end{bmatrix} q \, d\tau =
\begin{bmatrix} \delta t_k^3/3 & \delta t_k^2/2 \\ \delta t_k^2/2 & \delta t_k \end{bmatrix} q.
\]

The implementation of an asynchronous filter looks similar to the implementation of a synchronous filter, except that the matrices defined in Equation 73 must be recomputed at every time-step. However, the variances computed by the asynchronous filter look quite different from those computed in the synchronous case, even if they have the same average value. Figure 18 shows the results of an asynchronous constant-velocity filter. The process and observation noise standard deviations have been set, as before, to σq = 0.01 and σr = 0.1. Observations are generated with random time intervals uniformly distributed between 0 and 2 seconds (an average of 1 second). Figure 18(d) shows the computed position estimate and prediction standard deviations. Clearly, these covariances are not constant: unsurprisingly, the variance in the position estimate grows in periods where there is a long time interval between observations and reduces when there is a small interval between observations. Figure 18(c) shows the filter innovations and associated computed standard deviations. Again, these are not constant; however, they do still satisfy criteria for whiteness, with 95% of innovations falling within their corresponding ±2σ bounds.
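As an illustrative helper (added here, not in the original text), the interval-dependent matrices of this example can be rebuilt for each new observation interval δt_k:

```python
import numpy as np

def asynchronous_model(dt, q):
    """Interval-dependent constant-velocity model of Example 21.

    dt is the interval to the new observation; q is the continuous
    process noise strength."""
    F = np.array([[1.0, dt],
                  [0.0, 1.0]])                  # F(t_k)
    Q = q * np.array([[dt**3 / 3, dt**2 / 2],
                      [dt**2 / 2, dt]])         # Q(t_k) from the integral above
    H = np.array([[1.0, 0.0]])                  # position observation
    return F, Q, H
```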

One way to deal with delayed observations is to maintain two estimates: one associated with the true time at which the last observation was obtained, and a second, a prediction based on this last estimate, which describes the state at the current time. When a new, delayed, observation is made, the current prediction is discarded and a new prediction up to the time of this delayed observation is computed, based on the estimate at the time of the previous delayed observation. This is then combined with the new observation to produce a new estimate, which is itself predicted forward to the current time. Let t_c be the current time at which a new delayed observation is made available, let t_p be the previous time a delayed observation was made available, and let t_c > t_k > t_{k−1} and t_c > t_p > t_{k−1}. We assume that we already have an estimate x(t_{k−1} | t_{k−1}) and an estimate (strictly a prediction) x(t_p | t_{k−1}), and that we acquire an observation z(t_k) taken at time t_k. We start by simply discarding the estimate x(t_p | t_{k−1}), and generating a new prediction x(t_k | t_{k−1}) and prediction covariance P(t_k | t_{k−1}) from Equations 108 and 109. These, together with


the delayed observation, are used to compute a new estimate x(t_k | t_k), at the time the delayed observation was made, from Equations 112, 113, and 114. This estimate is then predicted forward to the current time to produce an estimate x(t_c | t_k) and its associated covariance P(t_c | t_k) according to

\[
x(t_c \mid t_k) = F(t_c) x(t_k \mid t_k) + B(t_c) u(t_c), \tag{115}
\]
\[
P(t_c \mid t_k) = F(t_c) P(t_k \mid t_k) F^T(t_c) + G(t_c) Q(t_c) G^T(t_c). \tag{116}
\]

Both estimates x(t_k | t_k) and x(t_c | t_k), together with their associated covariances, should be maintained for the next observations.

It should be clear that if the observations are delayed, then the estimate provided at the current time will not be as good as the estimates obtained when the observations are available immediately. This is to be expected, because the additional prediction required injects additional process noise into the state estimate.

It is also useful to note that the same techniques can be used to produce estimates for any time in the future, as well as for the current time. This is sometimes useful in providing advance predictions that are to be used to synchronize with incoming observations.
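A minimal sketch of this delayed-observation bookkeeping (illustrative only; it reuses the hypothetical asynchronous_model helper above and assumes scalar position observations):

```python
import numpy as np

def handle_delayed_observation(x_last, P_last, t_last, z, t_obs, t_now, q, sigma_r):
    """Fuse an observation taken at t_obs <= t_now into the estimate held at
    t_last, then predict the result forward to the current time t_now.

    Returns the estimate at the observation time (kept for the next delayed
    observation) and the prediction at the current time."""
    # Predict from the last estimate to the time the observation was taken.
    F, Q, H = asynchronous_model(t_obs - t_last, q)
    x_pred = F @ x_last
    P_pred = F @ P_last @ F.T + Q
    # Update with the delayed observation (Equations 112-114).
    R = np.array([[sigma_r**2]])
    S = H @ P_pred @ H.T + R
    W = P_pred @ H.T @ np.linalg.inv(S)
    x_obs = x_pred + W @ (z - H @ x_pred)
    P_obs = P_pred - W @ S @ W.T
    # Predict forward to the current time (Equations 115-116).
    Fc, Qc, _ = asynchronous_model(t_now - t_obs, q)
    x_now = Fc @ x_obs
    P_now = Fc @ P_obs @ Fc.T + Qc
    return (x_obs, P_obs), (x_now, P_now)
```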

Asequent data occurs when the observations made are delayed in such a way that they arrive at the filter for processing out of time-order. Although this does not often happen in single-sensor systems, it is a common problem in multi-sensor systems where the pre-processing and communication delays may differ substantially between different sensors. The essential problem here is that the gain matrix with and without the delayed observation will be different, and the previous estimate, corresponding to the time at which the delayed observation was taken, cannot easily be 'unwrapped' from the current estimate. With the tools we have so far developed, there is no simple way of dealing with this problem other than by remembering past estimates and recomputing the current estimate every time an out-of-order observation is obtained. However, a solution to this problem is possible using the inverse-covariance filter which we will introduce later in this chapter.

3.1.7 The Extended Kalman Filter

The extended Kalman filter (EKF) is a form of the Kalman filter that can be employed when the state model and/or the observation model are non-linear. The EKF is briefly described in this section.

The state models considered by the EKF are described in state-space notation by a first-order non-linear vector differential equation or state model of the form

\[
\dot{x}(t) = f[x(t), u(t), v(t), t], \tag{117}
\]

where

x(t) ∈ ℝⁿ is the state of interest,

u(t) ∈ ℝʳ is a known control input,

f[·, ·, ·] is a mapping of state and control input to state 'velocity', and

v(t) is a random vector describing both dynamic driving noise and uncertainties in the state model itself (v(t) is often assumed additive).

The observation models considered by the EKF are described in state-space notation by a non-linear vector function of the form

\[
z(t) = h[x(t), u(t), w(t), t] \tag{118}
\]

where

z(t) ∈ ℝᵐ is the observation made at time t,

h[·, ·, ·] is a mapping of state and control input to observations, and

w(t) is a random vector describing both measurement corruption noise and uncertainties in the measurement model itself (w(t) is often assumed additive).

The EKF, like the Kalman filter, is almost always implemented in discrete time. To do this, a discrete form of Equations 117 and 118 is required. First, a discrete-time set t = {t₀, t₁, ··· , t_k, ···} is defined. Equation 118 can then be written in discrete time as

\[
z(t_k) = h[x(t_k), u(t_k), w(t_k), t_k], \qquad \forall t_k \in t \tag{119}
\]

where z(t_k), x(t_k) and w(t_k) are the discrete-time observation, state and noise vectors evaluated at the discrete time instant t_k. The discrete-time form of the state equation requires integration of Equation 117 over the interval (t_{k−1}, t_k) as

\[
x(t_k) = x(t_{k-1}) + \int_{t_{k-1}}^{t_k} f[x(\tau), u(\tau), v(\tau), \tau] \, d\tau. \tag{120}
\]

In practice, this integration is normally solved using a simple Euler (backward difference) approximation as

\[
x(t_k) = x(t_{k-1}) + \Delta T_k \, f[x(t_{k-1}), u(t_{k-1}), v(t_{k-1}), t_{k-1}], \tag{121}
\]

where ∆T_k = t_k − t_{k−1}. As with the Kalman filter, when the sample interval is constant, time is indexed by k and Equations 121 and 119 are written as

\[
x(k) = x(k-1) + \Delta T \, f[x(k-1), u(k-1), v(k-1), k], \tag{122}
\]

and

\[
z(k) = h[x(k), u(k), w(k), k]. \tag{123}
\]

To apply the Kalman filter algorithm to estimation problems characterised by non-linear state and observation models, perturbation methods are used to linearise the true non-linear system models around some nominal state trajectory, yielding a model which is itself linear in the error.


Given a non-linear system model in the form of Equation 117, a nominal state trajectory is described using the same process model (assuming v(t) is zero mean),

\[
\dot{x}^n(t) = f[x^n(t), u^n(t), t]. \tag{124}
\]

Then, Equation 117 is expanded about this nominal trajectory as a Taylor series:

\[
\dot{x}(t) = f[x^n(t), u^n(t), t]
+ \nabla f_x(t)\,\delta x(t) + O[\delta x(t)^2]
+ \nabla f_u(t)\,\delta u(t) + O[\delta u(t)^2]
+ \nabla f_v(t)\,v(t), \tag{125}
\]

where

\[
\nabla f_x(t) = \left. \frac{\partial f}{\partial x} \right|_{\substack{x(t)=x^n(t) \\ u(t)=u^n(t)}}, \qquad
\nabla f_u(t) = \left. \frac{\partial f}{\partial u} \right|_{\substack{x(t)=x^n(t) \\ u(t)=u^n(t)}}, \qquad
\nabla f_v(t) = \left. \frac{\partial f}{\partial v} \right|_{\substack{x(t)=x^n(t) \\ u(t)=u^n(t)}} \tag{126}
\]

and

\[
\delta x(t) \doteq x(t) - x^n(t), \qquad \delta u(t) \doteq u(t) - u^n(t). \tag{127}
\]

Subtracting Equation 124 from Equation 125 provides a linear error model in the form

\[
\delta \dot{x}(t) = \nabla f_x(t)\,\delta x(t) + \nabla f_u(t)\,\delta u(t) + \nabla f_v(t)\,v(t). \tag{128}
\]

Identifying F(t) = ∇f_x(t), B(t) = ∇f_u(t), and G(t) = ∇f_v(t), Equation 128 is now in the same form as Equation 67 and may be solved for the perturbed state vector δx(t) in closed form through Equation 70, yielding a linear discrete-time equation for error propagation. Clearly this approximation is only valid when terms of second order and higher are small enough to be neglected; that is, when the true and nominal trajectories are close and f(·) is suitably smooth. With judicious design of the estimation algorithm this can be achieved surprisingly often. It is also possible to retain or approximate higher-order terms from Equation 125 and so improve the validity of the approximation.

Similar arguments can be applied to linearise Equation 68 to provide an observation error equation in the form

\[
\delta z(t) = \nabla h_x(t)\,\delta x(t) + \nabla h_w(t)\,w(t). \tag{129}
\]

Identifying H(t) = ∇h_x(t) and D(t) = ∇h_w(t), Equation 129 is now in the same form as Equation 68.

The discrete-time extended Kalman filter algorithm can now be stated. With appropriate identification of discrete-time states and observations, the state model is written as

\[
x(k) = f(x(k-1), u(k), v(k), k), \tag{130}
\]

and the observation model as

\[
z(k) = h(x(k), w(k)). \tag{131}
\]


Like the Kalman filter, it is assumed that the noises v(k) and w(k) are all Gaussian, temporally uncorrelated and zero mean with known variance, as defined in Equations 77–79. The EKF aims to minimise mean-squared error and therefore to compute an approximation to the conditional mean. It is assumed, therefore, that an estimate of the state at time k−1 is available which is approximately equal to the conditional mean,

\[
x(k-1 \mid k-1) \approx E\{ x(k-1) \mid Z^{k-1} \}. \tag{132}
\]

The extended Kalman filter algorithm will now be stated without proof. Detailed derivations may be found in any number of books on the subject. The principal stages in the derivation of the EKF follow directly from those of the linear Kalman filter, with the additional step that the process and observation models are linearised as a Taylor series about the estimate and prediction respectively. The algorithm has two stages:

Prediction: A prediction x(k | k−1) of the state at time k and its covariance P(k | k−1) is computed according to

\[
x(k \mid k-1) = f(x(k-1 \mid k-1), u(k)) \tag{133}
\]
\[
P(k \mid k-1) = \nabla f_x(k) P(k-1 \mid k-1) \nabla^T f_x(k) + \nabla f_v(k) Q(k) \nabla^T f_v(k) \tag{134}
\]

Update: At time k an observation z(k) is made and the updated estimate x(k | k) of the state x(k), together with the updated estimate covariance P(k | k), is computed from the state prediction and observation according to

\[
x(k \mid k) = x(k \mid k-1) + W(k) \left[ z(k) - h(x(k \mid k-1)) \right] \tag{135}
\]
\[
P(k \mid k) = P(k \mid k-1) - W(k) S(k) W^T(k) \tag{136}
\]

where

\[
W(k) = P(k \mid k-1) \nabla^T h_x(k) S^{-1}(k) \tag{137}
\]

and

\[
S(k) = \nabla h_x(k) P(k \mid k-1) \nabla^T h_x(k) + \nabla h_w(k) R(k) \nabla^T h_w(k), \tag{138}
\]

and where the Jacobian ∇f·(k) is evaluated at x(k−1) = x(k−1 | k−1) and ∇h·(k) is evaluated at x(k) = x(k | k−1).
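To make the two stages concrete, here is an illustrative Python sketch (added here, not the author's code) of one EKF cycle; the caller supplies the non-linear models f and h together with functions returning their Jacobians, and the noise Jacobians are assumed to be identity (additive noise) for brevity:

```python
import numpy as np

def ekf_cycle(x_est, P_est, u, z, f, h, jac_fx, jac_hx, Q, R):
    """One EKF prediction-update cycle (Equations 133-138)."""
    # Prediction (Equations 133-134); Jacobian evaluated at the estimate.
    Fx = jac_fx(x_est, u)
    x_pred = f(x_est, u)
    P_pred = Fx @ P_est @ Fx.T + Q
    # Update (Equations 135-138); Jacobian evaluated at the prediction.
    Hx = jac_hx(x_pred)
    S = Hx @ P_pred @ Hx.T + R
    W = P_pred @ Hx.T @ np.linalg.inv(S)
    x_new = x_pred + W @ (z - h(x_pred))
    P_new = P_pred - W @ S @ W.T
    return x_new, P_new
```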

A comparison of Equations 85–96 with Equations 133–138 makes it clear that the extended Kalman filter algorithm is very similar to the linear Kalman filter algorithm, with the substitutions F(k) → ∇f_x(k) and H(k) → ∇h_x(k) being made in the equations for the variance and gain propagation. Thus, the extended Kalman filter is, in effect, a linear estimator for a state error which is described by a linear equation and which is being observed according to a linear equation of the form of Equation 76.

The extended Kalman filter works in much the same way as the linear Kalman filter, with some notable caveats:


• The Jacobians ∇f_x(k) and ∇h_x(k) are typically not constant, being functions of both state and timestep. This means that, unlike the linear filter, the covariances and gain matrix must be computed on-line as estimates and predictions are made available, and will not in general tend to constant values. This significantly increases the amount of computation which must be performed on-line by the algorithm.

• As the linearised model is derived by perturbing the true state and observation models around a predicted or nominal trajectory, great care must be taken to ensure that these predictions are always 'close enough' to the true state that second-order terms in the linearisation are indeed insignificant. If the nominal trajectory is too far away from the true trajectory then the true covariance will be much larger than the estimated covariance and the filter will become poorly matched. In extreme cases the filter may also become unstable.

• The extended Kalman filter employs a linearised model which must be computed from an approximate knowledge of the state. Unlike the linear algorithm, this means that the filter must be accurately initialised at the start of operation to ensure that the linearised models obtained are valid. If this is not done, the estimates computed by the filter will simply be meaningless.

3.2 The Multi-Sensor Kalman Filter

Many of the techniques developed for single-sensor Kalman filters can be applied directly to multi-sensor estimation and tracking problems. In principle, a group of sensors can be considered as a single sensor with a large and possibly complex observation model. In this case the Kalman filter algorithm is directly applicable to the multi-sensor estimation problem. However, as will be seen, this approach is practically limited to relatively small numbers of sensors.

A second approach is to consider each observation made by each sensor as a separate and independent realization, made according to a specific observation model, which can be incorporated into the estimate in a sequential manner. Again, single-sensor estimation techniques, applied sequentially, can be applied to this formulation of the multi-sensor estimation problem. However, as will be seen, this approach requires that a new prediction and gain matrix be calculated for each observation from each sensor at every time-step, and so is computationally very expensive.

A third approach is to explicitly derive equations for integrating multiple observations made at the same time into a common state estimate. Starting from the formulation of the multi-sensor Kalman filter algorithm, employing a single model for a group of sensors, a set of recursive equations for integrating individual sensor observations can be derived. As will be seen, these equations are more naturally expressed in terms of 'information' rather than state and covariance.

The systems considered to this point are all 'centralized': the observations made by sensors are reported back to a central processing unit in a raw form, where they are processed by a single algorithm in much the same way as in single-sensor systems. It is also


possible to formulate the multi-sensor estimation problem in terms of a number of local sensor filters, each generating state estimates, which are subsequently communicated in processed form back to a central fusion centre. This distributed processing structure has a number of advantages in terms of modularity of the resulting architecture. However, the algorithms required to fuse estimate or track information at the central site can be quite complex.

In this section, the multi-sensor estimation problem is first defined in terms of a set of observation models and a single common process model. Each of the three centralised processing algorithms described above is then developed and compared, deriving appropriate equations and developing simple examples. The following section will then show how these multi-sensor estimation algorithms are applied in simple tracking problems.

Example 22
This example introduces a fundamental tracking problem which is further developed in the remainder of this section. The problem consists of tracking a number of targets from a number of tracking stations. The simulated target models and observations are non-linear, while the tracking algorithms used at the sensor sites are linear. This is fairly typical of actual tracking situations.

The targets to be tracked are modeled as 2-dimensional platforms with controllable heading and velocity. The continuous model is given by

\[
\begin{bmatrix} \dot{x}(t) \\ \dot{y}(t) \\ \dot{\phi}(t) \end{bmatrix} =
\begin{bmatrix} V(t)\cos(\phi + \gamma) \\ V(t)\sin(\phi + \gamma) \\ \frac{V(t)}{\kappa}\sin(\gamma) \end{bmatrix}
\]

where (x(t), y(t)) is the target position, φ(t) is the target orientation, V(t) and γ(t) are the platform velocity and heading, and κ is a constant minimum instantaneous turn radius for the target. Then x(t) = [x(t), y(t), φ(t)]^T is defined as the state of the target and u(t) = [V(t), γ(t)]^T as the target input.

This model can be converted into a discrete-time model using a constant (Euler) integration rule over the (asynchronous) sample interval ∆T_k = t_k − t_{k−1} as

\[
\begin{bmatrix} x(k) \\ y(k) \\ \phi(k) \end{bmatrix} =
\begin{bmatrix} x(k-1) + \Delta T\, V(k)\cos(\phi(k-1) + \gamma(k)) \\
y(k-1) + \Delta T\, V(k)\sin(\phi(k-1) + \gamma(k)) \\
\phi(k-1) + \Delta T\, \frac{V(k)}{\kappa}\sin(\gamma(k)) \end{bmatrix}
\]

which is in the standard form x(k) = f(x(k−1), u(k)). For the purposes of simulation, the model is assumed driven by Brownian models for velocity and heading,

\[
\dot{V}(t) = v_V(t), \qquad \dot{\gamma}(t) = v_\gamma(t),
\]

where v(t) = [v_V(t), v_γ(t)]^T is a zero-mean white driving sequence with strength

\[
E\{ v(t) v^T(t) \} = \begin{bmatrix} \sigma_v^2 & 0 \\ 0 & \sigma_\gamma^2 \end{bmatrix}.
\]


The discrete-time model for this is simply

\[
V(k) = V(k-1) + \Delta T_k\, v_V(k), \qquad \gamma(k) = \gamma(k-1) + \Delta T_k\, v_\gamma(k).
\]

A group of typical trajectories generated by this model is shown in Figure 19(a). The targets are observed using sensors (tracking stations) i = 1, ··· , N whose locations and pointing angles T_i(t) = [X_i(t), Y_i(t), ψ_i(t)]^T are known. This does not preclude the sensors from actually being in motion. Each sensor site i is assumed to make range and bearing observations to the jth target as

\[
\begin{bmatrix} z^r_{ij}(k) \\ z^\theta_{ij}(k) \end{bmatrix} =
\begin{bmatrix} \sqrt{(x_j(k) - X_i(k))^2 + (y_j(k) - Y_i(k))^2} \\
\arctan\left( \frac{y_j(k) - Y_i(k)}{x_j(k) - X_i(k)} \right) - \psi_i(k) \end{bmatrix} +
\begin{bmatrix} w^r_{ij}(k) \\ w^\theta_{ij}(k) \end{bmatrix},
\]

where the random vector w_{ij}(k) = [w^r_{ij}(k), w^θ_{ij}(k)]^T describes the noise in the observation process due to both modeling errors and uncertainty in observation. Observation noise errors are taken to be zero mean and white with constant variance

\[
E\{ w_{ij}(k) w_{ij}^T(k) \} = \begin{bmatrix} \sigma_r^2 & 0 \\ 0 & \sigma_\theta^2 \end{bmatrix}.
\]

A typical set of observations generated by this model for a track is shown in Figure 19(b).
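An illustrative simulation of this target and observation model (a sketch under the stated assumptions; all parameter values are arbitrary):

```python
import numpy as np

def simulate_target(steps, dT, V0, kappa, sigma_v, sigma_g, seed=1):
    """Simulate the 2-D platform of Example 22 with Brownian V and gamma."""
    rng = np.random.default_rng(seed)
    x = np.zeros(3)                       # state [x, y, phi]
    V, gamma = V0, 0.0
    track = []
    for _ in range(steps):
        V += dT * sigma_v * rng.standard_normal()
        gamma += dT * sigma_g * rng.standard_normal()
        x = x + dT * np.array([V * np.cos(x[2] + gamma),
                               V * np.sin(x[2] + gamma),
                               (V / kappa) * np.sin(gamma)])
        track.append(x.copy())
    return np.array(track)

def range_bearing(target_xy, station, sigma_r, sigma_theta, rng):
    """Noisy range-bearing observation from station T_i = [X_i, Y_i, psi_i]."""
    dx, dy = target_xy[0] - station[0], target_xy[1] - station[1]
    return np.array([np.hypot(dx, dy) + sigma_r * rng.standard_normal(),
                     np.arctan2(dy, dx) - station[2]
                     + sigma_theta * rng.standard_normal()])
```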

3.2.1 Observation Models

Figure 20 shows the centralised data fusion architecture developed in the following three sections. A common model of the true state is provided in the usual linear discrete-time form

\[
x(k) = F(k) x(k-1) + G(k) u(k) + v(k), \tag{139}
\]

where x(·) is the state of interest, F(k) is the state transition matrix, G(k) the control input model, u(k) the control input vector, and v(k) a random vector describing model and process uncertainty, assumed zero mean and temporally uncorrelated:

\[
E\{ v(k) \} = 0, \ \forall k, \qquad E\{ v(i) v^T(j) \} = \delta_{ij} Q(i). \tag{140}
\]

It is important to emphasise that, because all sensors are observing the same state (there would be little point in the data fusion problem otherwise), this process model must be common to all sensors.

Observations of the state of this system are made synchronously by a number of different sensors according to a set of linear observation models of the form

\[
z_s(k) = H_s(k) x(k) + w_s(k), \qquad s = 1, \cdots, S, \tag{141}
\]



Figure 19: Typical tracks and observations generated by the 'standard' multi-target tracking example of Example 22: (a) true x–y tracks; (b) detail of track observations.



Figure 20: The structure of a centralised data fusion architecture. Each sensor has a different observation model but observes a common process. Observations are sent in unprocessed form to a central fusion centre that combines observations to provide an estimate of the common state.

where z_s(k) is the observation made at time k by sensor s of a common state x(k) according to the observation model H_s(k) in additive noise w_s(k). The case in which the observations made by the sensors are asynchronous, in which different sensors make observations at different rates, can be dealt with by using the observation model H_s(t_k).

It is assumed that the observation noises w_s(k) are all zero mean, uncorrelated between sensors and also temporally uncorrelated:

\[
E\{ w_s(k) \} = 0, \quad s = 1, \cdots, S, \ \forall k, \qquad
E\{ w_p(i) w_q^T(j) \} = \delta_{ij} \delta_{pq} R_p(i). \tag{142}
\]

It is also assumed that the process and observation noises are uncorrelated:

\[
E\{ v(i) w_s^T(j) \} = 0, \qquad \forall i, j, s. \tag{143}
\]

This assumption is not absolutely necessary. It is relatively simple, but algebraically complex, to include a term accounting for correlated process and observation errors (see [12] Chapter 7, or [21]).


3.2.2 The Group-Sensor Method

The simplest way of dealing with a multi-sensor estimation problem is to combine all observations and observation models into a single composite 'group sensor' and then to deal with the estimation problem using an identical algorithm to that employed in single-sensor systems. Defining a composite observation vector by

\[
z(k) \doteq \left[ z_1^T(k), \cdots, z_S^T(k) \right]^T, \tag{144}
\]

a composite observation model by

\[
H(k) \doteq \left[ H_1^T(k), \cdots, H_S^T(k) \right]^T, \tag{145}
\]

and

\[
w(k) \doteq \left[ w_1^T(k), \cdots, w_S^T(k) \right]^T, \tag{146}
\]

where from Equation 142

\[
R(k) = E\{ w(k) w^T(k) \}
= E\left\{ \left[ w_1^T(k), \cdots, w_S^T(k) \right]^T \left[ w_1^T(k), \cdots, w_S^T(k) \right] \right\}
= \mathrm{blockdiag}\{ R_1(k), \cdots, R_S(k) \}, \tag{147}
\]

the observation noise covariance is block-diagonal, with blocks equal to the individual sensor observation noise covariance matrices. The set of observation equations defined by Equation 141 may now be rewritten as a single 'group' observation model in standard form:

\[
z(k) = H(k) x(k) + w(k). \tag{148}
\]

With a process model described by Equation 139 and a group observation model defined by Equation 148, estimates of state can in principle be computed using the standard Kalman filter algorithm given by Equations 94, 95, and 96.

Example 23
Consider again the tracking of a particle moving with constant velocity, with process model as defined in Example 17. Suppose we have two sensors, the first observing the position and the second observing the velocity of the particle. The two observation models will be

\[
z_1 = \begin{bmatrix} 1 & 0 \end{bmatrix}
\begin{bmatrix} x(k) \\ \dot{x}(k) \end{bmatrix} + w_1(k),
\qquad E\{ w_1^2(k) \} = \sigma_{r_1}^2,
\]

and

\[
z_2 = \begin{bmatrix} 0 & 1 \end{bmatrix}
\begin{bmatrix} x(k) \\ \dot{x}(k) \end{bmatrix} + w_2(k),
\qquad E\{ w_2^2(k) \} = \sigma_{r_2}^2.
\]

The composite group-sensor model is simply given by

\[
\begin{bmatrix} z_1 \\ z_2 \end{bmatrix} =
\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} x(k) \\ \dot{x}(k) \end{bmatrix} +
\begin{bmatrix} w_1 \\ w_2 \end{bmatrix},
\qquad
E\left\{ \begin{bmatrix} w_1 \\ w_2 \end{bmatrix}
\begin{bmatrix} w_1 & w_2 \end{bmatrix} \right\} =
\begin{bmatrix} \sigma_{r_1}^2 & 0 \\ 0 & \sigma_{r_2}^2 \end{bmatrix}.
\]


We could add a third sensor making additional measurements of position according to the observation model

\[
z_3 = \begin{bmatrix} 1 & 0 \end{bmatrix}
\begin{bmatrix} x(k) \\ \dot{x}(k) \end{bmatrix} + w_3(k),
\qquad E\{ w_3^2(k) \} = \sigma_{r_3}^2,
\]

in which case the new composite group-sensor model will be given by

\[
\begin{bmatrix} z_1 \\ z_2 \\ z_3 \end{bmatrix} =
\begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 0 \end{bmatrix}
\begin{bmatrix} x(k) \\ \dot{x}(k) \end{bmatrix} +
\begin{bmatrix} w_1 \\ w_2 \\ w_3 \end{bmatrix},
\qquad
E\left\{ \begin{bmatrix} w_1 \\ w_2 \\ w_3 \end{bmatrix}
\begin{bmatrix} w_1 & w_2 & w_3 \end{bmatrix} \right\} =
\begin{bmatrix} \sigma_{r_1}^2 & 0 & 0 \\ 0 & \sigma_{r_2}^2 & 0 \\ 0 & 0 & \sigma_{r_3}^2 \end{bmatrix}.
\]

It should be clear that there is no objection in principle to incorporating as many sensors as desired in this formulation of the multi-sensor estimation problem.
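A small illustrative helper (assumed here, not from the text) that stacks per-sensor observations and models into the group-sensor form of Equations 144–148, ready for the standard update of Equations 94–96:

```python
import numpy as np

def group_sensor(observations, H_list, R_list):
    """Stack per-sensor observations and models into a single group sensor."""
    z = np.concatenate(observations)          # composite observation (Eq. 144)
    H = np.vstack(H_list)                     # composite model (Eq. 145)
    m = sum(R.shape[0] for R in R_list)
    R = np.zeros((m, m))                      # block-diagonal noise (Eq. 147)
    i = 0
    for Rs in R_list:
        n = Rs.shape[0]
        R[i:i + n, i:i + n] = Rs
        i += n
    return z, H, R
```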

The prediction phase of the multi-sensor Kalman filter algorithm makes no reference to the observations that are made, and so is identical in every respect to the prediction phase of the single-sensor filter. However, the update phase of the cycle, which incorporates measurement information from sensors, will clearly be affected by an increase in the number of sensors. Specifically, if we have a state vector x(k) of dimension n and S sensors, each with an observation vector z_s(k), s = 1, ··· , S, of dimension m_s, together with an observation model H_s(k) of dimension m_s × n, then the group observation vector z(k) will have dimension m = Σ_{s=1}^S m_s and the group observation model H(k) will have dimension m × n. The consequence of this lies in Equations 94, 95 and 96. Clearly the group-sensor innovation ν(k) = [ν_1^T(k), ··· , ν_S^T(k)]^T will now have dimension m, and the group-sensor innovation covariance matrix S(k) will have dimension m × m. As the number of sensors incorporated in the group-sensor model increases, so does the dimension of the innovation vector and innovation covariance matrix. This is a problem because the inverse of the innovation covariance is required to compute the group-sensor gain matrix W(k) in Equation 96, and it is well known that the cost of calculating a matrix inverse grows rapidly with its dimension.

For a small number of sensors, and so for a relatively small innovation dimension, the group-sensor approach to multi-sensor estimation may be the most practical to implement. However, as the number of sensors increases, the value of this monolithic approach to the data fusion problem becomes limited.

Example 24
It is common practice to use linear models to track targets that are clearly not linear, particularly in multiple-target tracking problems. Consider again the tracks and observations generated by Example 22. The range and bearing observation vector z_{ij}(k) = [z^r_{ij}(k), z^θ_{ij}(k)]^T can be converted into an equivalent observation vector in absolute cartesian coordinates as

\[
\begin{bmatrix} z^x_{ij}(k) \\ z^y_{ij}(k) \end{bmatrix} =
\begin{bmatrix} X_i(k) + z^r_{ij}(k) \cos z^\theta_{ij}(k) \\
Y_i(k) + z^r_{ij}(k) \sin z^\theta_{ij}(k) \end{bmatrix}.
\]

The observation covariance matrix can also be converted into absolute cartesian coordinates using the relationship

\[
R^{xy}(k) = \begin{bmatrix} \sigma_{xx}^2 & \sigma_{xy} \\ \sigma_{xy} & \sigma_{yy}^2 \end{bmatrix} =
\mathrm{Rot}(z^\theta_{ij}(k))
\begin{bmatrix} \sigma_r^2 & 0 \\ 0 & (z^r_{ij}(k))^2 \sigma_\theta^2 \end{bmatrix}
\mathrm{Rot}^T(z^\theta_{ij}(k))
\]

where Rot(θ) is the rotation matrix

\[
\mathrm{Rot}(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}.
\]

Note now that the observation variance is strongly range dependent and is aligned with the sensor bore-sight.

Once the observation vector has been converted into a global cartesian coordinate frame, a linear filter can be used to estimate a linear target track. The two-dimensional particle model of Example 17, with the process model defined in Equation 80 and an observation model (now not constant) defined in Equation 81, can be used in the filter defined by Equations 87–89 in Example 18.

Figure 21 shows the results of tracking four targets from two (stationary) sensor sites. First note that the state variances and innovation variances are not constant, because the observation variances are strongly dependent on the relative position of observer and target. The asynchronous nature of the observations also contributes to this non-constancy. The variance values in Figure 21(c) and (d) can be seen to rise and fall as the target shown comes closer to, and then moves further away from, the tracking sites.

The algorithm implemented here is equivalent to the group-sensor method, but the results will also be the same for the sequential and inverse-covariance algorithms.
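An illustrative routine (a sketch, not the author's code) implementing the polar-to-cartesian observation and covariance conversion of Example 24; note that since the observed bearing z^θ has the station pointing angle ψ_i subtracted, ψ_i is assumed here to be added back before rotating into the global frame:

```python
import numpy as np

def polar_to_cartesian(z_r, z_theta, station, sigma_r, sigma_theta):
    """Convert a range-bearing observation and covariance to cartesian.

    station = [X_i, Y_i, psi_i]."""
    bearing = z_theta + station[2]                 # global bearing angle
    z_xy = np.array([station[0] + z_r * np.cos(bearing),
                     station[1] + z_r * np.sin(bearing)])
    c, s = np.cos(bearing), np.sin(bearing)
    Rot = np.array([[c, -s], [s, c]])
    R_polar = np.diag([sigma_r**2, (z_r * sigma_theta)**2])
    R_xy = Rot @ R_polar @ Rot.T                   # range-aligned covariance
    return z_xy, R_xy
```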

3.2.3 The Sequential-Sensor Method

A second approach to the multi-sensor estimation problem is to consider each sensor observation as an independent, sequential update to the state estimate, and for each observation to compute an appropriate prediction and gain matrix. In contrast to the group-sensor approach, in which a single sensor model was constructed by combining all sensor models, the sequential-update approach considers each sensor model individually, one by one. This means that the dimension of the innovation and innovation covariance matrix at each update stage remains the same size as their single-sensor equivalents, at the cost of computing a new gain matrix for each observation from each sensor.

The description of the state to be estimated is assumed in the form of Equation 139. Observations are made of this common state by S sensors according to Equation 141. It



Figure 21: A multiple-sensor multiple-target tracking example with four targets and two tracking stations: (a) True state and asynchronous observations; (b) Detail of true states, track estimates, and observations; (c) Innovations and innovation standard deviations for a particular track and tracking station; (d) Track position estimates from all sites together with standard deviations (estimated errors).


is assumed that every sensor takes an observation synchronously at every time step. The case of asynchronous observations is straightforward in this sequential-sensor algorithm.

It is assumed that it is possible to consider the observations made by the sensors at any one time-step in a specific but otherwise arbitrary order, so that the observation z_p(k), 1 ≤ p ≤ S, from the pth sensor will be processed before the observation z_{p+1}(k) from the (p+1)th. The set of observations made by the first p sensors at time k is denoted by (calligraphic notation is used to denote sets across sensors, compared to bold-face notation for sets associated with a single sensor)

\[
\mathcal{Z}^p(k) \doteq \{ z_1(k), \cdots, z_p(k) \}, \tag{149}
\]

so that at every time-step the observation set Z^S(k) is available to construct a new state estimate. The set of all observations made by the pth sensor up to time k will be denoted by

\[
\mathbf{Z}_p^k \doteq \{ z_p(1), \cdots, z_p(k) \}, \tag{150}
\]

and the set of all observations made by the first p sensors up to time k by

\[
\mathcal{Z}_p^k = \{ \mathbf{Z}_1^k, \cdots, \mathbf{Z}_p^k \}, \qquad p \le S. \tag{151}
\]

Of particular interest is the set of observations consisting of all observations made by the first p sensors up to a time i and all observations made by the first q sensors up to a time j,

\[
\mathcal{Z}_{p,q}^{i,j} = \mathcal{Z}_p^i \cup \mathcal{Z}_q^j
= \{ \mathbf{Z}_1^i, \cdots, \mathbf{Z}_p^i, \mathbf{Z}_{p+1}^j, \cdots, \mathbf{Z}_q^j \},
\qquad p < q, \ i \ge j, \tag{152}
\]

and in particular the set consisting of all observations made by all sensors up to a time k−1 together with the observations made by the first p sensors up to time k:

\[
\mathcal{Z}_{p,S}^{k,k-1} = \mathcal{Z}_p^k \cup \mathcal{Z}_S^{k-1}
= \{ \mathbf{Z}_1^k, \cdots, \mathbf{Z}_p^k, \mathbf{Z}_{p+1}^{k-1}, \cdots, \mathbf{Z}_S^{k-1} \}
= \mathcal{Z}^p(k) \cup \mathcal{Z}_S^{k-1}, \qquad p < S. \tag{153}
\]

The estimate constructed at each time-step on the basis of all observations from all sensors up to time k−1, and on the observations made by the first p sensors up to time k, is defined by

\[
x(k \mid k, p) = E\{ x(k) \mid \mathcal{Z}_{p,S}^{k,k-1} \}, \tag{154}
\]

where in particular

\[
x(k \mid k, 0) = x(k \mid k-1) = E\{ x(k) \mid \mathcal{Z}_S^{k-1} \} \tag{155}
\]

is the prediction of the state at time k before any observations are made, and

\[
x(k \mid k, S) = x(k \mid k) = E\{ x(k) \mid \mathcal{Z}_S^k \} \tag{156}
\]


is the estimate at time k based on the observations made by all sensors.
It is assumed that Equations 140, 142, and 143 hold. For the first observation considered at any one time-step, the prediction stage for the sequential-sensor estimator is very similar to the prediction stage for the single-sensor estimator. The state prediction is simply

\[
x(k \mid k, 0) = x(k \mid k-1)
= F(k) x(k-1 \mid k-1) + G(k) u(k)
= F(k) x(k-1 \mid k-1, S) + G(k) u(k), \tag{157}
\]

with corresponding covariance

\[
P(k \mid k, 0) = P(k \mid k-1)
= F(k) P(k-1 \mid k-1) F^T(k) + Q(k)
= F(k) P(k-1 \mid k-1, S) F^T(k) + Q(k). \tag{158}
\]

The new estimate found by integrating the observation z_1(k) made by the first sensor at time k is then simply given by

\[
x(k \mid k, 1) = x(k \mid k, 0) + W_1(k) \left[ z_1(k) - H_1(k) x(k \mid k, 0) \right] \tag{159}
\]

with covariance

\[
P(k \mid k, 1) = P(k \mid k, 0) - W_1(k) S_1(k) W_1^T(k), \tag{160}
\]

where

\[
W_1(k) = P(k \mid k, 0) H_1^T(k) S_1^{-1}(k) \tag{161}
\]

and

\[
S_1(k) = H_1(k) P(k \mid k, 0) H_1^T(k) + R_1(k). \tag{162}
\]

To now integrate the observation made by the second sensor we need to employ the estimate x(k | k, 1) to generate a prediction. However, if the observations made by the sensors are assumed synchronous, the state does not evolve between successive sensor readings from the same time-step, and so F(k) = 1 and Q(k) = 0 (this is not true in the asynchronous case). This means that the estimate x(k | k, 1) is itself the prediction for the next update. In general, the estimate of the state x(k | k, p) at time k, based on observations made by all sensors up to time k − 1 and by the first p sensors up to time k, can be computed from the estimate x(k | k, p − 1) as

x(k | k, p) = x(k | k, p − 1) + W_p(k)[ z_p(k) − H_p(k)x(k | k, p − 1) ]   (163)

with covariance

P(k | k, p) = P(k | k, p − 1) − W_p(k)S_p(k)W_p^T(k),   (164)


where

W_p(k) = P(k | k, p − 1)H_p^T(k)S_p^{-1}(k)   (165)

and

S_p(k) = H_p(k)P(k | k, p − 1)H_p^T(k) + R_p(k).   (166)

The state update at each time-step could be computed in a batch mode by explicitly expanding Equation 163 to become

x(k | k) = [ ∏_{i=1}^S ( 1 − W_i(k)H_i(k) ) ] x(k | k − 1) + Σ_{i=1}^S [ ∏_{j=i+1}^S ( 1 − W_j(k)H_j(k) ) ] W_i(k)z_i(k).   (167)

The case in which the observations are not synchronous may be dealt with in a similar way to the single-sensor asynchronous case, providing that due care is taken to generate a proper prediction at each time at which each sensor records an observation.

As with the group-sensor approach to multi-sensor estimation, this method works well for small numbers of sensors8. However, the amount of computation that must be performed increases with the number of sensors, as a new gain matrix must be computed for each observation from each sensor at each time step. Although this increase is linear (the dimension of the innovation covariance matrix which must be inverted does not increase), it may still become a problem when large numbers of sensors are employed.

8 Indeed, the sequential-sensor method is to be preferred when the observations are not synchronous, as in the asynchronous case a new prediction and gain matrix must be computed for each new observation regardless.
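To make the recursion concrete, the following is a minimal numpy sketch of the sequential-sensor update cycle of Equations 163 to 166 for the synchronous case (the function and the list-of-sensor-models interface are illustrative assumptions, not part of the original formulation):

    import numpy as np

    def sequential_sensor_update(x_pred, P_pred, z_list, H_list, R_list):
        # Fold S synchronous observations into one prediction (Equations
        # 163-166). Each update acts as the 'prediction' for the next,
        # since the state does not evolve within a single time-step.
        x, P = x_pred.copy(), P_pred.copy()
        for z, H, R in zip(z_list, H_list, R_list):
            S = H @ P @ H.T + R                # innovation covariance, Eq. 166
            W = P @ H.T @ np.linalg.inv(S)     # gain, Eq. 165
            x = x + W @ (z - H @ x)            # state update, Eq. 163
            P = P - W @ S @ W.T                # covariance update, Eq. 164
        return x, P

Note that one innovation-sized matrix inverse is computed per sensor per time-step, which is the source of the linear growth in computation noted above.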

3.2.4 The Inverse-Covariance Form

Neither the group-sensor nor the sequential-sensor algorithms are of much help when the number of sensors becomes large. In such cases it would be useful to find an explicit set of equations and provide an algorithm which would allow the direct integration of multiple observations into a single composite estimate.

The multi-sensor estimation problem would be considerably simplified if it were possible to write the estimate as a simple linear combination of the innovations and prediction in the standard Kalman filter form. Unfortunately, the natural candidate

x(k | k) = x(k | k − 1) + Σ_{i=1}^S W_i(k)[ z_i(k) − H_i(k)x(k | k − 1) ],   (168)

with

W_i(k) = P(k | k − 1)H_i^T(k)S_i^{-1}(k),   i = 1, ..., S,

and

S_i(k) = H_i(k)P(k | k − 1)H_i^T(k) + R_i(k),   i = 1, ..., S,

does not hold.


If the equality in Equation 168 were to hold, the group-sensor gain matrix would need to be in block-diagonal form. For the gain matrix to be block-diagonal, Equation 96 shows that the group-sensor innovation covariance matrix must also be in block-diagonal form. Unfortunately, this is never the case. For example, with two sensors, the group-sensor observation vector is

z(k) = [ z_1^T(k), z_2^T(k) ]^T,

the group-sensor observation matrix is

H(k) = [ H_1^T(k), H_2^T(k) ]^T,

and the group-sensor observation noise covariance matrix

R(k) = blockdiag{ R_1(k), R_2(k) }.

The group-sensor innovation covariance is thus

S(k) = H(k)P(k | k − 1)H^T(k) + R(k)
     = [ H_1(k); H_2(k) ] P(k | k − 1) [ H_1^T(k)  H_2^T(k) ] + [ R_1(k), 0; 0, R_2(k) ]
     = [ H_1(k)P(k | k − 1)H_1^T(k) + R_1(k),  H_1(k)P(k | k − 1)H_2^T(k);
         H_2(k)P(k | k − 1)H_1^T(k),  H_2(k)P(k | k − 1)H_2^T(k) + R_2(k) ]   (169)

(block matrices are written row-wise, with rows separated by semicolons),

which is clearly not block-diagonal; so the gain matrix

W(k) = P(k | k − 1)H^T(k)S^{-1}(k)

will also not be block-diagonal.

Fundamentally, the reason why Equation 168 may be used to integrate single-sensor observations over time, but not to integrate multiple observations at a single time-step, is that the innovations are correlated9. This correlation is due to the fact that the innovations at a single time-step have a common prediction. This correlation is reflected in the off-diagonal terms of Equation 169 of the form H_i(k)P(k | k − 1)H_j^T(k).

The fact that the innovations generated by each sensor are correlated results in significant problems in the implementation of multi-sensor estimation algorithms. Ignoring correlations between innovations (assuming equality in Equation 168 and constructing an estimate from a simple weighted sum of innovations) will result in disaster. This is because the information due to the prediction will be 'double counted' in each update, thus implying considerably more information (and confidence) than is actually the case. 'Forgetting factors' or 'fading memory filters' are routinely used to address this issue, although these are really no more than domain-specific hacks. It should also be pointed out that the correlatedness of innovations in multi-sensor estimation exposes the popular fallacy that it is always 'cheaper' to communicate innovations in such algorithms.

9 The innovations from one time-step to the next are uncorrelated, but the innovations generated by many sensors at a single time are correlated.


Example 25
For a two-sensor system we can explicitly evaluate the combined innovation covariance matrix using the matrix inversion lemma, and compute the appropriate gain matrices for the innovations of both sensors. Dropping all time subscripts, the inverse innovation covariance may be written as [24]

S^{-1} = [ S_11, S_12; S_12^T, S_22 ]^{-1}
       = [ Δ^{-1}, −Δ^{-1}S_12 S_22^{-1};
           −S_22^{-1}S_12^T Δ^{-1}, S_22^{-1} + S_22^{-1}S_12^T Δ^{-1}S_12 S_22^{-1} ],   (170)

where

Δ = S_11 − S_12 S_22^{-1} S_12^T.

Making appropriate substitutions from Equation 169, and employing the matrix inversion lemma twice, we have

Δ^{-1} = [ S_11 − S_12 S_22^{-1} S_12^T ]^{-1}
       = [ H_1 P H_1^T + R_1 − H_1 P H_2^T [ H_2 P H_2^T + R_2 ]^{-1} H_2 P H_1^T ]^{-1}
       = [ R_1 + H_1 ( P − P H_2^T [ H_2 P H_2^T + R_2 ]^{-1} H_2 P ) H_1^T ]^{-1}
       = [ R_1 + H_1 [ P^{-1} + H_2^T R_2^{-1} H_2 ]^{-1} H_1^T ]^{-1}
       = R_1^{-1} − R_1^{-1} H_1 [ P^{-1} + H_1^T R_1^{-1} H_1 + H_2^T R_2^{-1} H_2 ]^{-1} H_1^T R_1^{-1}.

The two gain matrices may now be computed from

[ W_1  W_2 ] = P [ H_1^T  H_2^T ] [ S_11, S_12; S_12^T, S_22 ]^{-1}.

Substituting in from Equations 169 and 170 we have for the first gain

W_1 = P H_1^T Δ^{-1} − P H_2^T [ H_2 P H_2^T + R_2 ]^{-1} H_2 P H_1^T Δ^{-1}
    = ( P − P H_2^T [ H_2 P H_2^T + R_2 ]^{-1} H_2 P ) H_1^T Δ^{-1}
    = [ P^{-1} + H_2^T R_2^{-1} H_2 ]^{-1} H_1^T Δ^{-1}
    = [ P^{-1} + H_2^T R_2^{-1} H_2 ]^{-1} ( 1 − H_1^T R_1^{-1} H_1 [ P^{-1} + H_1^T R_1^{-1} H_1 + H_2^T R_2^{-1} H_2 ]^{-1} ) H_1^T R_1^{-1}
    = [ P^{-1} + H_1^T R_1^{-1} H_1 + H_2^T R_2^{-1} H_2 ]^{-1} H_1^T R_1^{-1},


and similarly for the second gain

W_2 = [ P^{-1} + H_1^T R_1^{-1} H_1 + H_2^T R_2^{-1} H_2 ]^{-1} H_2^T R_2^{-1}.

To derive a set of explicit equations for multi-sensor estimation problems, we could begin by employing the matrix inversion lemma to invert the innovation matrix for the two-sensor case given in Equation 169, and then proceed to simplify the equation for the group-sensor gain matrix. However, it is easier in the first instance to write the gain and update equations for the group-sensor system in inverse-covariance form. First, rewrite the weight associated with the prediction:

I − W(k)H(k) = [ P(k | k − 1) − W(k)H(k)P(k | k − 1) ] P^{-1}(k | k − 1)
             = [ P(k | k − 1) − W(k)S(k)( S^{-1}(k)H(k)P(k | k − 1) ) ] P^{-1}(k | k − 1)
             = [ P(k | k − 1) − W(k)S(k)W^T(k) ] P^{-1}(k | k − 1)
             = P(k | k)P^{-1}(k | k − 1).   (171)

Similarly,

W(k) = P(k | k − 1)H^T(k)[ H(k)P(k | k − 1)H^T(k) + R(k) ]^{-1}

W(k)[ H(k)P(k | k − 1)H^T(k) + R(k) ] = P(k | k − 1)H^T(k)

W(k)R(k) = [ 1 − W(k)H(k) ]P(k | k − 1)H^T(k),

so

W(k) = P(k | k)H^T(k)R^{-1}(k).   (172)

Substituting Equations 171 and 172 into Equation 94 gives the state update equation as

x(k | k) = P(k | k)[ P^{-1}(k | k − 1)x(k | k − 1) + H^T(k)R^{-1}(k)z(k) ].   (173)

From Equations 95, 96 and 171 we have

P(k | k) = [ I − W(k)H(k) ]P(k | k − 1)[ I − W(k)H(k) ]^T + W(k)R(k)W^T(k).   (174)

Substituting in Equations 171 and 172 gives

P(k | k) = [ P(k | k)P^{-1}(k | k − 1) ] P(k | k − 1) [ P(k | k)P^{-1}(k | k − 1) ]^T
         + [ P(k | k)H^T(k)R^{-1}(k) ] R(k) [ P(k | k)H^T(k)R^{-1}(k) ]^T.   (175)


Pre- and post-multiplying by P^{-1}(k | k), then simplifying, gives the covariance update equation as

P(k | k) = [ P^{-1}(k | k − 1) + H^T(k)R^{-1}(k)H(k) ]^{-1}.   (176)

Thus, in the inverse-covariance filter10, the state covariance is first obtained from Equation 176 and then the state itself is found from Equation 173.

10 Strictly, the inverse-covariance filter is defined using the inverse of Equation 176 [28].

Consider now S sensors, each observing a common state according to

z_s(k) = H_s(k)x(k) + v_s(k),   s = 1, ..., S,   (177)

where the noise v_s(k) is assumed to be white and uncorrelated in both time and between sensors:

E{ v_s(k) } = 0,   E{ v_s(i)v_p^T(j) } = δ_ij δ_sp R_s(i),   s, p = 1, ..., S.   (178)

A composite observation vector z(k), comprising a stacked vector of observations from the S sensors, may be constructed in the form

z(k) = [ z_1^T(k), ..., z_S^T(k) ]^T,   (179)

a composite observation matrix H(k), comprising the stacked matrix of individual sensor observation matrices,

H(k) = [ H_1^T(k), ..., H_S^T(k) ]^T,   (180)

and a composite noise vector v(k), comprising a stacked vector of noise vectors from each of the sensors,

v(k) = [ v_1^T(k), ..., v_S^T(k) ]^T.   (181)

From Equation 178, the covariance of this composite noise vector is a block-diagonal matrix

R(k) = E{ v(k)v^T(k) } = blockdiag( R_1(k), ..., R_S(k) ).   (182)

With these definitions, we have

H^T(k)R^{-1}(k)z(k) = [ H_1^T(k)  H_2^T(k)  ...  H_S^T(k) ] blockdiag( R_1^{-1}(k), ..., R_S^{-1}(k) ) [ z_1(k); z_2(k); ...; z_S(k) ]
                    = Σ_{i=1}^S H_i^T(k)R_i^{-1}(k)z_i(k),   (183)


and

H^T(k)R^{-1}(k)H(k) = [ H_1^T(k)  H_2^T(k)  ...  H_S^T(k) ] blockdiag( R_1^{-1}(k), ..., R_S^{-1}(k) ) [ H_1(k); H_2(k); ...; H_S(k) ]
                    = Σ_{i=1}^S H_i^T(k)R_i^{-1}(k)H_i(k).   (184)

Substituting Equation 183 into Equation 173, the state update equation becomes

x(k | k) = P(k | k)[ P^{-1}(k | k − 1)x(k | k − 1) + Σ_{i=1}^S H_i^T(k)R_i^{-1}(k)z_i(k) ].   (185)

Substituting Equation 184 into Equation 176, the state-covariance update equation becomes

P(k | k) = [ P^{-1}(k | k − 1) + Σ_{i=1}^S H_i^T(k)R_i^{-1}(k)H_i(k) ]^{-1}.   (186)

The multi-sensor inverse-covariance filter thus consists of a conventional state and state-covariance prediction stage, given by Equations 85 and 86, followed by a state and state-covariance update from Equations 185 and 186.
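As an illustration, a minimal numpy sketch of this update cycle, assuming the prediction x_pred, P_pred has already been formed in the conventional way (the interface names are hypothetical):

    import numpy as np

    def inverse_covariance_update(x_pred, P_pred, z_list, H_list, R_list):
        # Multi-sensor update in inverse-covariance form (Equations 185
        # and 186): information terms from all sensors are simply summed,
        # so only state-dimension matrices are ever inverted.
        Y_pred = np.linalg.inv(P_pred)
        info_mat = Y_pred.copy()
        info_vec = Y_pred @ x_pred
        for z, H, R in zip(z_list, H_list, R_list):
            Rinv = np.linalg.inv(R)
            info_mat += H.T @ Rinv @ H         # summand of Eq. 186
            info_vec += H.T @ Rinv @ z         # summand of Eq. 185
        P_upd = np.linalg.inv(info_mat)
        return P_upd @ info_vec, P_upd

The additivity of the sensor terms is what makes this form attractive for large sensor networks: adding a sensor adds two terms to the sums and nothing else.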

Example 26
Consider again the multi-sensor tracking problem of Example 23 with three sensors:

z_1 = [ 1  0 ] [ x(k); ẋ(k) ] + w_1(k),   E{ w_1^2(k) } = σ_{r1}^2,

z_2 = [ 0  1 ] [ x(k); ẋ(k) ] + w_2(k),   E{ w_2^2(k) } = σ_{r2}^2,

and

z_3 = [ 1  0 ] [ x(k); ẋ(k) ] + w_3(k),   E{ w_3^2(k) } = σ_{r3}^2.

Thus the group sensor model is given by

H = [ H_1; H_2; H_3 ] = [ 1, 0; 0, 1; 1, 0 ],

R = blockdiag{ R_1, R_2, R_3 } = [ σ_{r1}^2, 0, 0; 0, σ_{r2}^2, 0; 0, 0, σ_{r3}^2 ].


It follows that

H^T R^{-1} z = [ 1, 0, 1; 0, 1, 0 ] [ 1/σ_{r1}^2, 0, 0; 0, 1/σ_{r2}^2, 0; 0, 0, 1/σ_{r3}^2 ] [ z_1; z_2; z_3 ]
             = [ z_1/σ_{r1}^2 + z_3/σ_{r3}^2; z_2/σ_{r2}^2 ]
             = Σ_{i=1}^3 H_i^T R_i^{-1} z_i,

and

H^T R^{-1} H = [ 1, 0, 1; 0, 1, 0 ] [ 1/σ_{r1}^2, 0, 0; 0, 1/σ_{r2}^2, 0; 0, 0, 1/σ_{r3}^2 ] [ 1, 0; 0, 1; 1, 0 ]
             = [ 1/σ_{r1}^2 + 1/σ_{r3}^2, 0; 0, 1/σ_{r2}^2 ]
             = Σ_{i=1}^3 H_i^T R_i^{-1} H_i.

The advantages of the information-filter form over the group-sensor and sequential-sensor approaches to the multi-sensor estimation problem derive directly from the formulation of the update equations. Regardless of the number of sensors employed, the largest matrix inversion required is of the dimension of the state vector. The addition of new sensors simply requires that the new terms H_i^T(k)R_i^{-1}(k)z_i(k) and H_i^T(k)R_i^{-1}(k)H_i(k) be constructed and added together. Thus the complexity of the update algorithm grows only linearly with the number of sensors employed. In addition, the update stage can take place in one single step.

The advantages of the inverse-covariance estimator only become obvious when significant numbers of sensors are employed. It is clear from Equation 186 that both the prediction covariance and the updated inverse covariance matrix must be inverted in each cycle of the filter, and so the inverse-covariance filter obtains an advantage only when the dimension of the composite observation vector is approximately twice the dimension of the common state vector. As we shall see later, it is possible to write the prediction stage directly in terms of the inverse covariance, although this does not significantly reduce the amount of computation required.

3.2.5 Track-to-Track Fusion

Track-to-track fusion encompasses algorithms which combine estimates from sensor sites. This is distinct from algorithms that combine observations from different sensors; the former is often called track fusion, the latter "scan" fusion. In track-to-track fusion algorithms, local sensor sites generate local track estimates using a local Kalman filter. These tracks are then communicated to a central fusion site where they are combined to


Figure 22: A typical track-to-track fusion architecture in which local tracks are generated at local sensor sites and then communicated to a central fusion centre where a global track file is assimilated.

generate a global track estimate. A typical track-to-track fusion architecture is shown in Figure 22. In some configurations, the global estimate is then communicated back to the local sensor sites (called a feed-back configuration). Track-to-track fusion algorithms have a number of potential advantages over scan fusion methods:

1. Local track information is made available for use locally at each sensor site.

2. Track information may be communicated at a lower, more compact, rate to the central site for fusion.

However, track-to-track fusion algorithms also add complexity to a system. Detailed expositions of the track-to-track fusion method can be found in [4, 9].

The track-to-track fusion method begins by assuming a common underlying model of the state being observed, in the standard form

x(k) = F(k)x(k − 1) + B(k)u(k) + G(k)v(k).   (187)

The state is assumed to be observed by a number of sensors, each with a different observation model, in the form

z_i(k) = H_i(k)x(k) + w_i(k),   i = 1, 2,   (188)

where the observation noises are assumed independent:

E{ w_i(k)w_j^T(k) } = δ_ij R_i(k).   (189)

A local track is formed by each sensor node, on the basis of only local observations, using the normal Kalman filter algorithm as

x_i(k | k) = x_i(k | k − 1) + W_i(k)[ z_i(k) − H_i(k)x_i(k | k − 1) ],   (190)


and

P_i(k | k) = P_i(k | k − 1) − W_i(k)S_i(k)W_i^T(k),   (191)

where

W_i(k) = P_i(k | k − 1)H_i^T(k)S_i^{-1}(k)   (192)

and

S_i(k) = H_i(k)P_i(k | k − 1)H_i^T(k) + R_i(k).   (193)

Local state predictions are generated from a common state model:

x_i(k | k − 1) = F(k)x_i(k − 1 | k − 1) + B(k)u(k)   (194)

and

P_i(k | k − 1) = F(k)P_i(k − 1 | k − 1)F^T(k) + G(k)Q(k)G^T(k).   (195)

A straightforward track fusion algorithm is simply to take the variance-weighted average of tracks as follows:

x_T(k | k) = P_T(k | k) Σ_{i=1}^N P_i^{-1}(k | k) x_i(k | k)   (196)

P_T(k | k) = [ Σ_{i=1}^N P_i^{-1}(k | k) ]^{-1}.   (197)

This track-to-track fusion method is commonly used in practice.

However, a central issue in track-to-track fusion is that any two tracks x_i(k | k) and x_j(k | k) are correlated, because they have a common prediction error resulting from a common process model. This correlation is intrinsic to the problem; it is only because the states have this process model in common that there is a reason to fuse the two. Thus it is not possible in general to compute a fused track x_T(k | k) from a simple linear weighted sum of local tracks.
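For reference, a minimal sketch of the simple variance-weighted rule of Equations 196 and 197, with tracks supplied as hypothetical (x_i, P_i) pairs; as just discussed, it treats the tracks as independent and can therefore be over-confident:

    import numpy as np

    def naive_track_fusion(tracks):
        # Variance-weighted average of local tracks (Equations 196-197).
        # Caution: ignores track-to-track cross-correlations.
        infos = [np.linalg.inv(P) for _, P in tracks]
        P_T = np.linalg.inv(sum(infos))
        x_T = P_T @ sum(Y @ x for (x, _), Y in zip(tracks, infos))
        return x_T, P_T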

To address the general track-to-track fusion problem, it is necessary to find an expression for the correlation between two tracks. When this is known, it is possible to fuse the two estimates. To begin, we compute expressions for the track estimate and prediction errors as

x̃_i(k | k) = x(k) − x_i(k | k)
 = x(k) − x_i(k | k − 1) − W_i(k)[ z_i(k) − H_i(k)x_i(k | k − 1) ]
 = x(k) − x_i(k | k − 1) − W_i(k)[ H_i(k)x(k) + w_i(k) − H_i(k)x_i(k | k − 1) ]
 = [ 1 − W_i(k)H_i(k) ] x̃_i(k | k − 1) − W_i(k)w_i(k)   (198)

and

x̃_i(k | k − 1) = x(k) − x_i(k | k − 1)
 = F(k)x(k − 1) − F(k)x_i(k − 1 | k − 1) + G(k)v(k)
 = F(k)x̃_i(k − 1 | k − 1) + G(k)v(k).   (199)


Squaring and taking expectations of Equation 199 yields an expression for the predicted track-to-track cross-correlation:

P_ij(k | k − 1) = E{ x̃_i(k | k − 1)x̃_j^T(k | k − 1) | Z^{k−1} }
 = E{ [ F(k)x̃_i(k − 1 | k − 1) + G(k)v(k) ][ F(k)x̃_j(k − 1 | k − 1) + G(k)v(k) ]^T }
 = F(k)E{ x̃_i(k − 1 | k − 1)x̃_j^T(k − 1 | k − 1) | Z^{k−1} }F^T(k) + G(k)E{ v(k)v^T(k) }G^T(k)
 = F(k)P_ij(k − 1 | k − 1)F^T(k) + G(k)Q(k)G^T(k).   (200)

Squaring and taking expectations of Equation 198 yields an expression for the estimate track-to-track cross-correlation:

P_ij(k | k) = E{ x̃_i(k | k)x̃_j^T(k | k) }
 = E{ [ (1 − W_i(k)H_i(k))x̃_i(k | k − 1) − W_i(k)w_i(k) ][ (1 − W_j(k)H_j(k))x̃_j(k | k − 1) − W_j(k)w_j(k) ]^T | Z^k }
 = [ 1 − W_i(k)H_i(k) ]E{ x̃_i(k | k − 1)x̃_j^T(k | k − 1) }[ 1 − W_j(k)H_j(k) ]^T + W_i(k)E{ w_i(k)w_j^T(k) }W_j^T(k)
 = [ 1 − W_i(k)H_i(k) ]P_ij(k | k − 1)[ 1 − W_j(k)H_j(k) ]^T,   (201)

where we have used the fact that E{ w_i(k)w_j^T(k) } = 0 for i ≠ j. Together, Equations 200 and 201 provide a recursive relationship for computing the cross-correlation P_ij(k | k) between the two track estimates x_i(k | k) and x_j(k | k).

Fusing together two tracks is essentially the same as adding observation information, except that the data are correlated. Recall

x(k | k) = x(k | k − 1) + P_xz P_zz^{-1} [ z(k) − z(k | k − 1) ]   (202)

and

P(k | k) = P(k | k − 1) − P_xz P_zz^{-1} P_xz^T,   (203)

so

x_T(k | k) = x_i(k | k) + P_{i|j}(k | k)P_{i+j}^{-1}(k | k)[ x_j(k | k) − x_i(k | k) ]   (204)

and

P_T(k | k) = P_i(k | k) − P_{i|j}(k | k)P_{i+j}^{-1}(k | k)P_{i|j}^T(k | k),   (205)

where we identify

P_{i|j}(k | k) = P_i(k | k) − P_ij(k | k)   (206)

and

P_{i+j}(k | k) = P_i(k | k) + P_j(k | k) − P_ij(k | k) − P_ij^T(k | k).   (207)


Substituting Equations 206 and 207 into Equation 204 gives

x_T(k | k) = x_i(k | k) + [ P_i(k | k) − P_ij(k | k) ][ P_i(k | k) + P_j(k | k) − P_ij(k | k) − P_ij^T(k | k) ]^{-1}[ x_j(k | k) − x_i(k | k) ]   (208)

as the combined track estimate, and

P_T(k | k) = P_i(k | k) − [ P_i(k | k) − P_ij(k | k) ][ P_i(k | k) + P_j(k | k) − P_ij(k | k) − P_ij^T(k | k) ]^{-1}[ P_i(k | k) − P_ij(k | k) ]^T.   (209)

Equations 208 and 209 are in the form of predictor-corrector equations. As written, x_i(k | k) is the predicted track, and x_j(k | k) − x_i(k | k) is the correction term, weighted by a gain proportional to the corresponding covariances. The equations are symmetric, so that the roles of i and j are interchangeable.
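A sketch of this fusion step in numpy, assuming the cross-covariance P_ij has been propagated recursively with Equations 200 and 201 (the function and argument names are illustrative):

    import numpy as np

    def fuse_correlated_tracks(xi, Pi, xj, Pj, Pij):
        # Combine two correlated track estimates (Equations 208 and 209).
        Pi_minus_Pij = Pi - Pij                       # Eq. 206
        P_sum = Pi + Pj - Pij - Pij.T                 # Eq. 207
        K = Pi_minus_Pij @ np.linalg.inv(P_sum)       # gain on the correction
        x_T = xi + K @ (xj - xi)                      # Eq. 208
        P_T = Pi - K @ Pi_minus_Pij.T                 # Eq. 209
        return x_T, P_T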

There are a number of extensions to the basic track-to-track fusion algorithm. Notable is the use of "equivalent" measurements described in [9]. However, these methods are more appropriately described in the context of the information filter.

3.3 Non-linear Data Fusion Methods

In a wide number of practical problems, the errors in process model predictions and in observation model updates are not easily described by symmetric and well-behaved Gaussian distributions. Further, the process and observation models themselves may not be linear or even smooth in the state. In such cases the Kalman filter, as a method for fusing information, may not be the appropriate data fusion algorithm for the task. Instances where this will be the case include:

• Systems in which the process or observation models are highly sensitive to the state prediction values, and thus to the point around which an extended Kalman filter may be linearised.

• Systems in which the process model is discontinuous, and indeed might have been formed from the fitting of lines or curves to experimental data.

• Systems in which the error models for either process or observation are highly skewed, so that the mean and the most likely values are substantially different.

• Systems in which the error models are multi-modal, having more than one "likely" value.

In general, the most appropriate course of action in these situations is to return to Bayes Theorem and seek an estimator which can more readily deal with either non-Gaussian errors or with non-linear process or observation models.


3.3.1 Likelihood Estimation Methods

The recursive estimation problem can be derived entirely in probabilistic form using Bayes theorem. The general estimation problem requires that the probability distribution

P(x_k | Z^k, U^k, x_0)   (210)

be computed for all times k. This probability distribution describes the posterior density of the vehicle state (at time k) given the recorded observations and control inputs up to and including time k, together with the initial state. To develop a recursive estimation algorithm, a prior P(x_{k−1} | Z^{k−1}, U^{k−1}) at time k − 1 is assumed. The posterior, following a control u_k and observation z_k, is then computed using Bayes Theorem. This computation requires that a state transition model and an observation model are defined.

The observation model is generally described in the form

P(z_k | x_k),   (211)

and the observations are assumed conditionally independent given the state, so that

P(Z^k | X^k) = ∏_{i=1}^k P(z_i | X^k) = ∏_{i=1}^k P(z_i | x_i).   (212)

The transition model for the state can be described in terms of a probability distribution on state transitions in the form

P(x_k | u_k, x_{k−1}).   (213)

The state transition may reasonably be assumed to be a Markov process, in which the next state x_k depends only on the immediately preceding state x_{k−1} and the applied control u_k, and is independent of the observations made.

With these definitions and models, Bayes Theorem may be employed to define a recursive solution to Equation 210.

To derive a recursive update rule for the posterior, the chain rule of conditional probability is employed to expand the joint distribution on the state and observation, first in terms of the state

P(x_k, z_k | Z^{k−1}, U^k, x_0) = P(x_k | z_k, Z^{k−1}, U^k, x_0)P(z_k | Z^{k−1}, U^k, x_0)
 = P(x_k | Z^k, U^k, x_0)P(z_k | Z^{k−1}, U^k, x_0),   (214)

and then in terms of the observation

P(x_k, z_k | Z^{k−1}, U^k, x_0) = P(z_k | x_k, Z^{k−1}, U^k, x_0)P(x_k | Z^{k−1}, U^k, x_0)
 = P(z_k | x_k)P(x_k | Z^{k−1}, U^k, x_0),   (215)


where the last equality employs the assumptions established for the sensor model in Equation 211.

Equating Equations 214 and 215 and rearranging gives

P(x_k | Z^k, U^k, x_0) = P(z_k | x_k)P(x_k | Z^{k−1}, U^k, x_0) / P(z_k | Z^{k−1}, U^k).   (216)

The denominator in Equation 216 is independent of the state and can therefore be set to some normalising constant K. The total probability theorem can be used to rewrite the second term in the numerator in terms of the state process model and the joint posterior from time-step k − 1 as

P(x_k | Z^{k−1}, U^k, x_0) = ∫ P(x_k, x_{k−1} | Z^{k−1}, U^k, x_0) dx_{k−1}
 = ∫ P(x_k | x_{k−1}, Z^{k−1}, U^k, x_0)P(x_{k−1} | Z^{k−1}, U^k, x_0) dx_{k−1}
 = ∫ P(x_k | x_{k−1}, u_k)P(x_{k−1} | Z^{k−1}, U^{k−1}, x_0) dx_{k−1},   (217)

where the last equality follows from the assumed independence of state transition from observations, and from the causality of the control input on state transition. Substituting Equation 217 into Equation 216 then gives

P(x_k | Z^k, U^k, x_0) = K · P(z_k | x_k) ∫ P(x_k | x_{k−1}, u_k)P(x_{k−1} | Z^{k−1}, U^{k−1}, x_0) dx_{k−1}.   (218)

Equation 218 provides a recursive formulation of the posterior in terms of an observation model and a motion model. It is completely general, allowing any probabilistic and kinematic model for both state transition and observation. It also allows any estimation scheme (maximum likelihood, minimax, or minimum mean squared error, for example) to be applied once the posterior is computed.

The following algorithms use this formulation of the estimation problem. They are becoming increasingly important as available computational resources increase and as the data fusion problems addressed become more complex. However, time precludes discussing these algorithms in detail.
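Although these methods are not treated in detail here, Equation 218 is straightforward to exercise on a discretised state space; a minimal sketch (the grid representation is purely an illustrative assumption):

    import numpy as np

    def recursive_bayes_update(prior, transition, likelihood):
        # One cycle of Equation 218 on a grid of discrete states.
        # prior[i]         = P(x_{k-1} = i | Z^{k-1}, U^{k-1})
        # transition[i, j] = P(x_k = j | x_{k-1} = i, u_k)
        # likelihood[j]    = P(z_k | x_k = j)
        predicted = prior @ transition      # total-probability step, Eq. 217
        posterior = likelihood * predicted  # Bayes numerator
        return posterior / posterior.sum()  # normalising constant K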

3.3.2 The Particle Filter

The particle filter is essentially a Monte-Carlo simulation of Equation 218. See in particular [40] for a data fusion application of the particle filter method; this also incorporates other related report information (such as shoreline boundaries) in full probabilistic form.

3.3.3 The Sum-of-Gaussians Method

The sum-of-Gaussians method replaces the distributions in the general formulation with an approximation as a sum of Gaussians. In principle, this can be used to model any distribution. Further, there are efficient update rules for a sum-of-Gaussians approximation. See in particular [22, 1, 29].


3.3.4 The Distribution Approximation Filter (DAF)

The distribution approximation filter (DAF) is a recent method which provides an efficient means of approximating the non-linear transformation of Gaussians. See [23] for details.

3.4 Multi-Sensor Data Association

Figure 23: The validation gate. The gate is defined in the observation space and is centered on the observation prediction H(k)x(k | k − 1). The gate statistic is the normalised innovation. For a fixed value of d^2 = γ the gate is ellipsoidal and defined by the eigenvalues of the inverse innovation covariance. The size of the gate is set by requiring the probability of correct association to be above a certain threshold. Observations that fall within this gate are considered 'valid'.

In multi-sensor estimation problems, the many measurements made need to be correctly associated with the underlying states that are being observed. This is the data association problem.

The data association problem includes issues of validating data (ensuring it is not erroneous or arising from clutter11, for example), associating the correct measurements to


the correct states (particularly in multi-target tracking problems), and initialising new tracks or states as required. Whereas conventional tracking is really concerned with uncertainty in measurement location, data association is concerned with uncertainty in measurement origin.

11 The term clutter refers to detections or returns from nearby objects, weather, electromagnetic interference, acoustic anomalies, false alarms, etc., that are generally random in number, location, and intensity. This leads to the occurrence of (possibly) several measurements in the vicinity of the target. The implication of the single-target assumption is that the undesirable measurements constitute random interference. A common mathematical model for such interference is a uniform distribution in the measurement space.

The foundations of data association lie in the use of the normalised innovation or 'validation gate'. The innovation ν_ij(k) is the difference between the observation z_i(k) actually made and that predicted by the filter, z_j(k | k − 1) = H_j(k)x_j(k | k − 1):

ν_ij(k) = z_i(k) − z_j(k | k − 1) = z_i(k) − H_j(k)x_j(k | k − 1).

The innovation variance is simply defined as

S_ij(k) = E{ ν_ij(k)ν_ij^T(k) } = H_j(k)P_j(k | k − 1)H_j^T(k) + R_i(k).

The normalised innovation between an observation i and a track j is then given by

d_ij^2(k) = ν_ij^T(k) S_ij^{-1}(k) ν_ij(k).   (219)

The normalised innovation is a scalar quantity. It can be shown that if the innovations are zero mean and white (the filter is operating correctly) then the normalised innovation is a χ² random variable in n_z degrees of freedom (the dimension of the observation vector). It is therefore possible to establish values for d_ij^2 which enclose a specific probability of the observation and predicted observation being correctly associated [7, 18]. This is key to establishing data association methods. Referring to Figure 23, for a fixed value of d_ij^2 the normalised innovation describes a quadratic centered on the prediction, generating an ellipsoidal volume (of dimension n_z) whose axes are proportional to the reciprocals of the eigenvalues of S_ij^{-1}(k). This ellipsoid defines an area or volume in observation space centered on the observation prediction. If an observation falls within this volume then it is considered 'valid'; thus the term validation gate.

The normalised innovation serves as the basis for all data association techniques. The reasons for this are clear. First, the innovation is practically the only measure of divergence of the state estimate from the observation sequence. Second, it admits a precise probabilistic measure of correct association.
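A minimal sketch of gate construction and testing based on Equation 219, using a chi-square quantile to set the gate size γ for a chosen association probability (scipy is assumed available; names are illustrative):

    import numpy as np
    from scipy.stats import chi2

    def in_gate(z, z_pred, S, prob=0.99):
        # Accept z if the normalised innovation d^2 (Equation 219)
        # falls inside the gate d^2 <= gamma for the chosen probability.
        nu = z - z_pred
        d2 = nu @ np.linalg.solve(S, nu)
        gamma = chi2.ppf(prob, df=z.size)
        return d2 <= gamma, d2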

3.4.1 The Nearest-Neighbour Standard Filter

The nearest neighbour standard filter (NNSF) applies the obvious data association policy: it simply chooses the measurement 'nearest' to the predicted measurement as the only validated observation, and the remaining observations are rejected from consideration (Figure 24). The single validated measurement is used for updating the state estimate of the target. The definition of 'nearest' is the observation which yields the minimum


Figure 24: For a single target track, the nearest-neighbour standard filter selects the observation 'nearest' the prediction for association. All other observations are rejected. The proximity measure is the normalised innovation.

normalised innovation as defined in Equation 219. If there are no observations with normalised innovation values less than some defined d_ij^2, then no observation is associated. The NNSF algorithm implicitly assumes a high probability of detection for the target and a low rate of clutter in the gate. In such situations, the NNSF algorithm is usually adequate and is practically used in many applications.

There are a number of problems with the basic NNSF algorithm, particularly in situations where there is a high degree of clutter and where there are potentially a number of closely spaced target states to be tracked. Figure 25 shows a particular situation in which two track validation gates overlap. In this, and many other cases, the problem of correctly associating an observation with a track is complex. This is first because the obvious pair-wise closest association is not necessarily correct, and second because association is order dependent, so all associations must be considered together to achieve a correct assignment (even when the probability of detection is unity and the false alarm rate is zero). The following two algorithms are the most widely used methods of overcoming these two problems (see [7, 9] for detailed expositions of the data association problem).

3.4.2 The Probabilistic Data Association Filter

The Probabilistic Data Association Filter (PDAF) and Joint Probabilistic Data Association Filter (JPDAF) are Bayesian algorithms which compute a probability of correct association between an observation and track. This probability is then used to form a


Figure 25: For multiple targets, each target track selects its nearest observation independently of other possible observation-track associations. This can easily lead to erroneous assignments in high target density situations.

weighted average 'correct track'.

The PDAF algorithm starts with a set of validated measurements (those that fall within the initial validation gate). Denote the set of validated measurements at time k as

Z(k) ≡ { z_i(k) }_{i=1}^{m_k},

where m_k is the number of measurements in the validation region. The cumulative set of measurements is denoted

Z^k ≡ { Z(j) }_{j=1}^k.

A basic (sub-optimal) assumption is that the latest state is normally distributed according to the latest estimate and covariance matrix:

p[ x(k) | Z^{k−1} ] = N[ x(k); x(k | k − 1), P(k | k − 1) ].

This assumption means that Z(k) contains measurements from the same elliptical validation gate as used in the NNSF algorithm. This is shown in Figure 26.

In the PDAF algorithm, it is assumed that there is only one target of interest whose track has been initialized (the JPDAF algorithm deals with the more complex problem of coupled multi-target tracks). At each time step a validation region is set up. Among the possibly several validated measurements, one can be target originated, if the target was detected. The remaining measurements are assumed due to false alarms or residual


Figure 26: The probabilistic data association algorithm associates all valid observations with a track. For each validated observation, an updated estimate x_i(k | k) is computed. A probability of correct association β_i is computed for each such track. Then a combined track is formed from the weighted average of these tracks: x(k | k) = Σ_i β_i x_i(k | k). For multiple targets, the same process occurs, although the probability calculations are more complex.


clutter, and are modeled as IID random variables with uniform density. Define θ_i(k) to be the event that the measurement z_i(k) originated from the target, and θ_0(k) as the event that no measurement originated from the target. Let the probabilities of these events be

β_i(k) ≡ P[ θ_i(k) | Z^k ],   i = 0, 1, ..., m_k.

The events are assumed mutually exclusive and exhaustive:

Σ_{i=0}^{m_k} β_i(k) = 1.

Using the total probability theorem with respect to the above events, the conditional mean of the state at time k can be written as

x(k | k) = E{ x(k) | Z^k }
 = Σ_{i=0}^{m_k} E{ x(k) | θ_i(k), Z^k } P( θ_i(k) | Z^k )
 = Σ_{i=0}^{m_k} x_i(k | k) β_i(k),   (220)

where x_i(k | k) is the updated estimate conditioned on the event θ_i(k) that the ith validated measurement is correct. This estimate is itself found from

x_i(k | k) = x(k | k − 1) + W(k)ν_i(k),   i = 1, ..., m_k,   (221)

where

ν_i(k) = z_i(k) − H(k)x(k | k − 1),

and if none of the measurements is correct, the estimate is

x_0(k | k) = x(k | k − 1).

The PDAF algorithm now proceeds as follows:

1. The set of validated measurements is computed.

2. For each validated measurement an updated track is computed using Equation 221.

3. For each updated track an association probability β_i is computed. The calculation of this probability can be quite complex and depends on the assumed clutter densities. However, it is normally adequate to set β_i proportional to the normalised innovation for the association.

4. A combined (average) track is computed using Equation 220. A combined average covariance can also be computed, although this can become quite complex (see [7] for details).

5. A single prediction of the combined track is then made to the next scan time, and the process is repeated.

The PDAF and JPDAF methods are appropriate in situations where there is a high degree of clutter. As pointed out to the author in conversation once: "The great advantage with the PDAF method is that you are never wrong. The problem is you are also never right".
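A simplified sketch of one PDAF cycle (steps 2 to 4 above). The association weights here use a Gaussian function of the normalised innovation, which is one simple common choice rather than the full clutter-dependent calculation, and the no-association weight beta0 is a hypothetical tuning parameter:

    import numpy as np

    def pdaf_update(x_pred, validated, H, S, W, beta0=0.1):
        # Combined PDAF estimate (Equations 220 and 221).
        S_inv = np.linalg.inv(S)
        weights = [beta0]            # theta_0: no measurement is correct
        tracks = [x_pred]            # x_0(k|k) = x(k|k-1)
        for z in validated:
            nu = z - H @ x_pred
            weights.append(np.exp(-0.5 * nu @ S_inv @ nu))  # assumed beta_i
            tracks.append(x_pred + W @ nu)                  # Eq. 221
        beta = np.array(weights) / np.sum(weights)
        return sum(b * x for b, x in zip(beta, tracks))     # Eq. 220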


3.4.3 The Multiple Hypothesis Tracking (MHT) Filter

The Multiple Hypothesis Tracking (MHT) filter (and the related Track Splitting Filter (TSF)) maintains separate tracks for each possible associated observation. At each time step, the predicted observation is used to establish a validation gate (see Figure 27). For each measurement that is found in this validation gate, a new hypothesis track is generated. Thus a single track is split into n tracks: one associated with each valid measurement, plus one track (usually denoted 0) for the no-association hypothesis. Each of these new tracks is then treated independently and used to generate new predictions for the next time step. Since the number of branches into which the track is split can grow exponentially, the likelihood function of each split track is computed and the unlikely ones are discarded.

The MHT algorithm works on complete sequences of observations; that is, it evaluates the probability that a given branch sequence of observations (from root to leaf) is correct. Denote the lth sequence of measurements up to time k as:

Z_l^k ≡ { z_{i_1,l}(1), ..., z_{i_k,l}(k) },

where z_i(j) is the ith measurement at time j. Let Θ_{k,l} be the event that the sequence Z_l^k is a correct track; the likelihood function for this event is then

Λ(Θ_{k,l}) = P( Z_l^k | Θ_{k,l} ) = P( z_{i_1,l}(1), ..., z_{i_k,l}(k) | Θ_{k,l} ).   (222)

Denoting by Z^k the cumulative set of all measurements up to time k, Equation 222 reduces to

Λ(Θ_{k,l}) = ∏_{j=1}^k P( z_{i_j,l}(j) | Z^{j−1}, Θ_{k,l} ).

Under assumptions of linearity and Gaussianity, this reduces to

Λ(Θ_{k,l}) = c_k exp{ −(1/2) Σ_{j=1}^k ν^T(j)S^{-1}(j)ν(j) },

where ν(j) = z(j) − z(j | j − 1) is the innovation between track and measurement. The corresponding modified log-likelihood function is given by

λ(k) ≡ −2 log[ Λ(Θ_{k,l}) / c_k ] = Σ_{j=1}^k ν^T(j)S^{-1}(j)ν(j),   (223)

which can be recursively computed as

λ(k) = λ(k − 1) + ν^T(k)S^{-1}(k)ν(k).   (224)

The last term above is a measure of the "goodness of fit" of the measurements. A statistical test for accepting a track is that the log-likelihood function satisfies λ(k) < d.
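The recursion of Equation 224 and the acceptance test reduce to a few lines per hypothesis; a sketch (the gate value d is a hypothetical design parameter):

    import numpy as np

    def update_hypothesis_loglikelihood(lam_prev, nu, S, d):
        # Modified log-likelihood recursion (Equation 224); hypotheses
        # failing the test lam < d are candidates for pruning.
        lam = lam_prev + nu @ np.linalg.solve(S, nu)
        return lam, lam < d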

The MHT algorithm now proceeds as follows:


Figure 27: In the track splitting or multiple hypothesis filter, every validated observation z_p(k) is used to establish a new track x_p(k | k). In addition, the 'false alarm' and/or 'missed observation' hypothesis also generates a track x_0(k | k). These tracks are propagated forward to the next gate, each track is again associated with each valid observation z_q(k + 1), and the tracks are again split into tracks associated with each possible pair-wise association x_pq(k + 1 | k + 1). Probabilities or likelihoods λ_pq of correct track histories are maintained to prune the resulting hypothesis tree.


1. A predicted observation is produced and a validation region around the prediction established.

2. All observations that fall within the validation gate are considered candidate associations. In addition, the no-association hypothesis '0' is formed.

3. The track is updated with each validated hypothesis according to Equation 221.

4. A likelihood associated with correct association is computed according to Equation 223.

5. The likelihood of association is recursively added to the likelihood of the prediction path having been correctly associated, according to Equation 224. This yields a likelihood of the entire sequence of observations to this point being correct.

6. Some pruning of the hypothesis tree may take place at this point.

7. Each track hypothesis is now independently predicted forward to the next time-step, effectively splitting the track equally into each possible competing hypothesis.

8. The process repeats.

There are three points to note about the MHT (and related TSF) algorithm:

• We have implicitly assumed that the probability of detection is unity; that is, only complete sequences of measurements are considered in the track formation process. Missing observations can be taken into account by incorporating a probability of detection at each stage.

• In practice, the likelihood pruning method does not work well with long measurement sequences, as it becomes dominated by old measurements. One "hack" around this is to use a fading-memory window.

• The method is clearly dominated by the computational and memory requirements of the splitting algorithm. At each stage each hypothesis can generate many new hypotheses and filters that must be run in parallel.

The MHT algorithm is generally good in situations where there are low clutter rates but high track uncertainty (crossing tracks, maneuvering targets, etc.). Practically, the algorithm is dominated by the approach used for pruning unlikely target hypotheses.

3.4.4 Data Association in Track-to-Track Fusion

Data association methods can be extended to association of tracks rather than observations. Such situations occur in distributed architectures, and particularly in track-to-track fusion algorithms.


Following the track-to-track fusion algorithm of Equations 208 and 209, a gate equivalent to Equation 219 can be established for testing whether two tracks i and j can be associated:

d_ij^2(k) = [ x_i(k | k) − x_j(k | k) ]^T [ P_i(k | k) + P_j(k | k) − P_ij(k | k) − P_ij^T(k | k) ]^{-1} [ x_i(k | k) − x_j(k | k) ].   (225)

This gate defines a similar ellipsoidal region to the normal validation gate. The test statistic is also χ² distributed, but in n_x degrees of freedom (the dimension of the state vector).

Having established this gate, the normal NNSF-type algorithm is usually used for associating tracks. However, there is no objection in principle to using either PDAF or MHT algorithms for a similar purpose.

3.4.5 Maintaining and Managing Track Files

In multiple target problems in particular, there is a need to maintain a record of how observations and tracks are associated together. This is generally termed a Track File. Track files may be maintained centrally as a 'Joint Assignment Matrix' (JAM)12. Track files may also be maintained locally at a sensor site, although these must then be coordinated in some way. A detailed exposition of track file maintenance is beyond the scope of this course (see [9] pp. 603–605 for a detailed discussion).

12 The Joint Assignment Matrix is used, algorithmically, in a number of advanced data association methods.


4 Distributed and Decentralised Data Fusion Systems

This section addresses the development of data fusion algorithms for distributed and decentralised data fusion architectures.

The nature of data fusion is that there are a number of sensors physically distributed around an environment. In a centralised data fusion system, raw sensor information is communicated back to a central processor where the information is combined to produce a single fused picture of the environment. In a distributed data fusion system, each sensor has its own local processor which can generally extract useful information from the raw sensor data prior to communication. This has the advantage that less information is normally communicated, the computational load on the central processor is reduced, and the sensors themselves can be constructed in a reasonably modular manner. The degree to which local processing occurs at a sensor site varies substantially, from simple validation and data compression up to the full construction of tracks or interpretation of information locally.

While for many systems a centralised approach to data fusion is adequate, the increasing sophistication, functional requirements, complexity and size of data fusion systems, coupled with the ever-reducing cost of computing power, argues more and more toward some form of distributed processing. The central issue in designing distributed data fusion systems is the development of appropriate algorithms which can operate at a number of distributed sites in a consistent manner. This is the focus of this section.

This section begins with a general discussion of data fusion architectures and the challenges posed in developing distributed data fusion algorithms. The information filter, and more generally the log-likelihood implementations of Bayes theorem, are then developed, and it is shown how these can readily be mapped to many distributed and decentralised data fusion systems. The issue of communication is a major problem in distributed data fusion systems. This is because of both limited communication bandwidth and also the time delays and communication failures that can occur between sensing and fusion processes. The communication problem is dealt with in depth in this section. Finally, some advanced issues in distributed and decentralised data fusion are briefly considered. These include the problem of model distribution, where each local sensor maintains a different model of the environment; sensor management, in which a limited set of sensor resources must be used cooperatively; and system organisation, in which the optimal design of sensor networks is addressed.

4.1 Data Fusion Architectures

Distributed data fusion systems may take many forms. At the simplest level, sensors could communicate information directly to a central processor where it is combined. Little or no local processing of information need take place, and the relative advantage of having many sources of information is sacrificed to having complete centralised control over the processing and interpretation of this information. As more processing occurs locally, so


computational and communication burden can be removed from the fusion center, but at the cost of reduced direct control of low-level sensor information.

Increasing intelligence of local sensor nodes naturally results in a hierarchical structure for the fusion architecture. This has the advantage of imposing some order on the fusion process, but the disadvantage of placing a specific and often rigid structure on the fusion system.

Other distributed architectures consider sensor nodes with significant local ability to generate tracks and engage in fusion tasks. Such architectures include 'Blackboard' and agent-based systems.

Fully decentralised architectures have no central processor and no common communication system. In such systems, nodes can operate in a fully autonomous manner, coordinating only through the anonymous communication of information.

The following sections consider these architectures and their associated data fusion algorithms.

4.1.1 Hierarchical Data Fusion Architectures

In a hierarchical structure, the lowest level processing elements transmit information upwards, through successive levels, where the information is combined and refined, until at the top level some global view of the state of the system is made available. Such hierarchical structures are common in many organisations and have many well-known advantages over fully centralised systems, particularly in reducing the load on a centralised processor while maintaining strict control over sub-processor operations.


Figure 28: Single Level Hierarchical Multiple Sensor Tracking System

The hierarchical approach to systems design has been employed in a number of data fusion systems and has resulted in a variety of useful algorithms for combining information at different levels of a hierarchical structure. General hierarchical Bayesian algorithms are based on the independent likelihood pool architectures shown in Figures 4 and 5, or on the log-likelihood opinion pools shown in Figures 8 and 9. Here the focus is on hierarchical estimation and tracking algorithms (see Figures 28 and 29).



Figure 29: Multiple Level Hierarchical Multiple Sensor Tracking System


First it is assumed that all sensors are observing a common state or track x(k). Observations are made at local sites of this common state according to a local observation equation in the form

z_i(k) = H_i(k)x(k) + w_i(k),   i = 1, ..., S.   (226)

In principle, each site may then operate a conventional Kalman filter or state estimator to provide local state estimates, based only on local observations, in the form

x_i(k | k) = x_i(k | k − 1) + W_i(k)[ z_i(k) − H_i(k)x_i(k | k − 1) ],   i = 1, ..., S.   (227)

These local estimates x_i(k | k) may then be passed further up the hierarchy to a central or intermediate processor which combines or fuses tracks to form a global estimate, based on all observations, in the form

x(k | k) = Σ_{i=1}^S ω_i(k) x_i(k | k),   (228)

where the ω_i(k) are site weighting matrices.

The essential problem, as described in Section 3, is that each sensor site must be observing a common true state, and so the local process models are related through some common global model in the form

x_i(k) = F_i(k)x(k) + B_i(k)u(k) + G_i(k)v(k),   i = 1, ..., S.

This means that the predictions made at local sites are correlated, and so the updated local estimates in Equation 227 must also be correlated, despite the fact that the observations made at each site are different. Consequently, the local estimates generated at each site cannot be combined in the independent fashion implied by Equation 228.

The correct method of dealing with this problem is to explicitly account for these correlations in the calculation of the site weighting matrices ω_i(k). In particular, the track-to-track fusion algorithms described in Section 3.2.5, in the form of Equations 208 and 209, are appropriate to this problem. These algorithms require that the correlations between all sites be explicitly computed, in addition to the covariance associated with each local state estimate.

There have been a number of papers on hierarchical estimation systems. The paper by Hashemipour, Roy and Laub [21] is notable in employing, indirectly, the information form of the Kalman filter to derive a hierarchical estimation algorithm. The earlier paper by Speyer [39] has a similar formulation, and although it is concerned with distributed linear quadratic Gaussian (LQG) control problems, also specifically deals with communication or transmission requirements. Other large-scale systems in control exhibit similar properties [37]. In addition, the track-to-track fusion techniques described by Bar-Shalom [15, 4] serve as the basis for many derived architectures.

A hierarchical approach to the design of data fusion systems also comes with a number of inherent disadvantages. The ultimate reliance on some central processor or controlling


level within the hierarchy means that reliability and flexibility are often compromised. Failure of this central unit leads to failure of the whole system, and changes in the system often mean changes in both the central unit and in all related sub-units. Further, the burden placed on the central unit in terms of combining information can often still be prohibitive, and lead to an inability of the design methodology to be extended to incorporate an increasing number of sources of information. Finally, the inability of information sources to communicate, other than through some higher level in the hierarchy, eliminates the possibility of any synergy being developed between two or more complementary sources of information, and restricts the system designer to rigid predetermined combinations of information. The limitations imposed by a strict hierarchy have been widely recognised both in human information processing systems as well as in computer-based systems.

4.1.2 Distributed Data Fusion Architectures

The move to more distributed, autonomous, organisations is clear in many information processing systems. This is most often motivated by two main considerations: the desire to make the system more modular and flexible, and a recognition that a centralised or hierarchical structure imposes unacceptable overheads on communication and central computation. The migration to distributed system organisations is most apparent in Artificial Intelligence (AI) application areas, where distributed AI has become a research area in its own right. Many of the most interesting distributed processing organisations have originated in this area.


Figure 30: Blackboard Architecture in Data Fusion

Notable is the "Blackboard" architecture (see Figure 30), originally developed in the Hearsay speech understanding programme, but now widely employed in many areas of AI


and data fusion research. A Blackboard architecture consists of a number of independent autonomous "agents". Each agent represents a source of expert knowledge or specialised information processing capability. Agents exchange information through a common communication facility or shared memory resource. This resource is called a blackboard. The blackboard is designed to closely replicate its physical analogue. Each agent is able to write information or local knowledge to this resource. Every agent in the system is able to read from this resource, in an unrestricted manner, any information which it considers useful in its current task. In principle, every agent can be made modular, and new agents may be added to the system when needed without changing the underlying architecture or operation of the system as a whole. The flexibility of this approach to system organisation has made the Blackboard architecture popular in a range of application domains [33]. In data fusion, the Blackboard approach has been most widely used for knowledge-based data fusion systems in data interpretation and situation assessment (see [19] and [20] for example). However, the structured nature of tracking and identification problems does not lend itself to this anarchic organisational form.

The Blackboard architecture has a number of basic problems. All of these stem from the use of a common communication or memory resource. The core problem is that a central resource naturally entails the need for some type of central control, in which a single decision maker is used to sequence and organise the reading and writing of information from the shared resource. Practically, with such a control mechanism, a blackboard architecture becomes no more than a one-level hierarchy, with a consequent lack of flexibility and with the inherent limitations imposed by the use of a central resource.

4.1.3 Decentralised Data Fusion Architectures

A decentralized data fusion system consists of a network of sensor nodes, each with its own processing facility, which together do not require any central fusion or central communication facility. In such a system, fusion occurs locally at each node on the basis of local observations and the information communicated from neighbouring nodes. At no point is there a common place where fusion or global decisions are made.

A decentralised data fusion system is characterised by three constraints:

1. There is no single central fusion center; no one node should be central to the successful operation of the network.

2. There is no common communication facility; nodes cannot broadcast results, and communication must be kept on a strictly node-to-node basis.

3. Sensor nodes do not have any global knowledge of the sensor network topology; nodes should only know about connections in their own neighbourhood.

Figures 31, 32 and 33 show three possible realisations of a decentralised data fusion system. The key point is that none of these systems has a central fusion center (unlike the 'decentralised' systems often described in the literature, which are actually typically distributed or hierarchical).


The constraints imposed provide a number of important characteristics for decen-tralised data fusion systems:

• Eliminating the central fusion center and any common communication facility en-sures that the system is scalable as there are no limits imposed by centralizedcomputational bottlenecks or lack of communication bandwidth.

• Ensuring that no node is central and that no global knowledge of the networktopology is required for fusion means that the system can be made survivable tothe on-line loss (or addition) of sensing nodes and to dynamic changes in the networkstructure.

• As all fusion processes must take place locally at each sensor site and no global knowledge of the network is required a priori, nodes can be constructed and programmed in a modular fashion.

These characteristics give decentralised systems a major advantage over more traditional sensing architectures, particularly in defense applications.


Figure 31: A decentralised data fusion system implemented with a point-to-point communication architecture.

A decentralized organization differs from a distributed processing system in having no central processing or communication facilities.



Figure 32: A decentralised data fusion system implemented with a broadcast, fully connected, communication architecture. Technically, a common communication facility violates the decentralised data fusion constraints. However, a broadcast medium is often a good model of real communication networks.


Figure 33: A decentralised data fusion system implemented with a hybrid, broadcast and point-to-point, communication architecture.


Each sensor node in a decentralized organization is entirely self-contained and can operate completely independently of any other component in the system. Communication between nodes is strictly one-to-one and requires no remote knowledge of node capability. Throughout this section we distinguish between decentralized organizations that have no common resources, and distributed organizations where some residual centralized facility is maintained.

4.2 Decentralised Estimation

Decentralised data fusion is based on the idea of using formal information measures as the means of quantifying, communicating and assimilating sensory data. In decentralised estimation of continuous-valued states, this is implemented in the form of an information filter. In this section, the full form of the information filter is derived. It is then demonstrated how the filter may be decentralised amongst a number of sensing nodes. Later sections then describe how to deal with issues of communication and data association in decentralised sensing.

4.2.1 The Information Filter

Conventional Kalman filters deal with the estimation of states x(i), and yield estimates x(i | j) together with a corresponding estimate variance P(i | j). The information filter deals instead with the information-state vector y(i | j) and information matrix Y(i | j), defined as

y(i | j) = P^{-1}(i | j) x(i | j),   Y(i | j) = P^{-1}(i | j).   (229)

These information quantities have an interpretation related to the underlying probability distributions associated with the estimation problem. The information matrix in particular is closely associated with the Fisher information measures introduced in Section 2.

A set of recursion equations for the information state and information matrix can be derived directly from the equations for the Kalman filter. The resulting information filter is mathematically identical to the conventional Kalman filter.

Recall the update stage for the Kalman filter:

x(k | k) = (1 − W(k)H(k)) x(k | k − 1) + W(k)z(k)   (230)

P(k | k) = (1 − W(k)H(k)) P(k | k − 1) (1 − W(k)H(k))^T + W(k)R(k)W^T(k)   (231)

Now, from Equations 171 and 172, we have

1 − W(k)H(k) = P(k | k) P^{-1}(k | k − 1),   (232)

and

W(k) = P(k | k) H^T(k) R^{-1}(k).   (233)

Substituting Equations 232 and 233 into Equation 230 gives

x(k | k) = P(k | k) P^{-1}(k | k − 1) x(k | k − 1) + P(k | k) H^T(k) R^{-1}(k) z(k).


Pre-multiplying through by P^{-1}(k | k) gives the update equation for the information-state vector as

P^{-1}(k | k) x(k | k) = P^{-1}(k | k − 1) x(k | k − 1) + H^T(k) R^{-1}(k) z(k).   (234)

Defining

i(k) = H^T(k) R^{-1}(k) z(k)   (235)

as the information-state contribution from an observation z(k), and with the definitions in Equation 229, Equation 234 becomes

y(k | k) = y(k | k − 1) + i(k). (236)

A similar expression can be obtained for the covariance update of Equation 231. Substituting Equations 232 and 233 into Equation 231 and rearranging gives

P^{-1}(k | k) = P^{-1}(k | k − 1) + H^T(k) R^{-1}(k) H(k).   (237)

Defining

I(k) = H^T(k) R^{-1}(k) H(k)   (238)

as the information matrix associated with the observation, and with the definitions in Equation 229, Equation 237 becomes the information matrix update equation

Y(k | k) = Y(k | k − 1) + I(k). (239)

A comparison of the Kalman filter update stage (Equations 230 and 231) with the information filter update stage (Equations 236 and 239) highlights the simplicity of the information filter update over the Kalman filter update. Indeed, in information form, the update stage is a straight addition of information from a prediction and from an observation. It is this simplicity which gives the information filter its advantage in multi-sensor estimation problems.

The simplicity of the update stage of the information filter comes at the cost of increased complexity in the prediction stage. Recall the covariance prediction equation

P(k | k − 1) = F(k) P(k − 1 | k − 1) F^T(k) + G(k) Q(k) G^T(k).   (240)

To derive the prediction stage for the information filter, the following version of the matrix inversion lemma is noted [28]:

(A + B^T C)^{-1} = A^{-1} − A^{-1} B^T (1 + C A^{-1} B^T)^{-1} C A^{-1}.

Identifying

A = F(k) P(k − 1 | k − 1) F^T(k),   B^T = G(k)Q(k),   C = G^T(k),


the inverse of Equation 240 becomes

P^{-1}(k | k − 1) = M(k) − M(k)G(k) [G^T(k)M(k)G(k) + Q^{-1}(k)]^{-1} G^T(k)M(k)   (241)

where

M(k) = F^{-T}(k) P^{-1}(k − 1 | k − 1) F^{-1}(k)   (242)

when Q(k) is non-singular. Noting the definition of the state transition matrix,

F(k) = Φ(t_k, t_{k−1}),

it follows that F^{-1}(k) always exists; indeed,

F^{-1}(k) = Φ(t_{k−1}, t_k)

is simply the state transition matrix defined backwards, from time t_k to t_{k−1}. Now, defining

Σ(k) = [G^T(k)M(k)G(k) + Q^{-1}(k)],   (243)

and

Ω(k) = M(k)G(k) [G^T(k)M(k)G(k) + Q^{-1}(k)]^{-1} = M(k)G(k)Σ^{-1}(k),   (244)

the information matrix prediction equation becomes

Y(k | k − 1) = P^{-1}(k | k − 1) = M(k) − Ω(k)Σ(k)Ω^T(k).   (245)

A number of alternate expressions for the information prediction stage can also be derived:

Y(k | k − 1) = [1 − Ω(k)G^T(k)] M(k)   (246)

and

Y(k | k − 1) = [1 − Ω(k)G^T(k)] M(k) [1 − Ω(k)G^T(k)]^T + Ω(k)Q^{-1}(k)Ω^T(k).   (247)

The information-state prediction equations may also be obtained as

y(k | k − 1) = [1 − Ω(k)G^T(k)] F^{-T}(k) [y(k − 1 | k − 1) + Y(k − 1 | k − 1)F^{-1}(k)B(k)u(k)]
             = [1 − Ω(k)G^T(k)] [F^{-T}(k)y(k − 1 | k − 1) + M(k)B(k)u(k)]
             = [1 − Ω(k)G^T(k)] F^{-T}(k)y(k − 1 | k − 1) + Y(k | k − 1)B(k)u(k).   (248)

It should be noted that the complexity of the inversion of Σ(k) is only of order the dimension of the driving noise (often scalar). Further, Q(k) is almost never singular, and


indeed, if it were, the singularity could be eliminated by appropriate definition of G(k). The special case in which Q(k) = 0 yields prediction equations of the form:

Y(k | k − 1) = M(k),   y(k | k − 1) = F^{-T}(k)y(k − 1 | k − 1) + M(k)B(k)u(k).   (249)

The information filter is now summarised.

Prediction:

y(k | k − 1) = [1 − Ω(k)G^T(k)] F^{-T}(k)y(k − 1 | k − 1) + Y(k | k − 1)B(k)u(k)   (250)

Y(k | k − 1) = M(k) − Ω(k)Σ(k)Ω^T(k)   (251)

where

M(k) = F^{-T}(k)Y(k − 1 | k − 1)F^{-1}(k),   (252)

Ω(k) = M(k)G(k)Σ^{-1}(k),   (253)

and

Σ(k) = [G^T(k)M(k)G(k) + Q^{-1}(k)].   (254)

Estimate:

y(k | k) = y(k | k − 1) + i(k)   (255)

Y(k | k) = Y(k | k − 1) + I(k)   (256)

where

i(k) = H^T(k)R^{-1}(k)z(k),   I(k) = H^T(k)R^{-1}(k)H(k).   (257)

Given the interpretation of the information matrix Y(i | j) as the Fisher information, the update Equation 256 is simply seen to add the information contributed by the observation. Conversely, the prediction Equation 251 subtracts information caused by process uncertainties.
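As a concrete reference, the following is a minimal numpy sketch of one prediction/update cycle transcribed from Equations 250–257. The function names are ours, and Q(k), R(k) and Σ(k) are assumed invertible, as in the derivation above.

import numpy as np

def information_predict(y, Y, F, G, Q, B=None, u=None):
    """Prediction step of the information filter (Equations 250-254)."""
    Finv = np.linalg.inv(F)
    M = Finv.T @ Y @ Finv                            # Equation 252
    Sigma = G.T @ M @ G + np.linalg.inv(Q)           # Equation 254
    Omega = M @ G @ np.linalg.inv(Sigma)             # Equation 253
    Y_pred = M - Omega @ Sigma @ Omega.T             # Equation 251
    n = Y.shape[0]
    y_pred = (np.eye(n) - Omega @ G.T) @ Finv.T @ y  # Equation 250
    if B is not None and u is not None:
        y_pred = y_pred + Y_pred @ B @ u             # control input term
    return y_pred, Y_pred

def information_update(y_pred, Y_pred, z, H, R):
    """Update step (Equations 255-257): a plain addition of information."""
    Rinv = np.linalg.inv(R)
    i = H.T @ Rinv @ z      # information-state contribution i(k)
    I = H.T @ Rinv @ H      # information matrix contribution I(k)
    return y_pred + i, Y_pred + I

Note that the update is just two additions; all the matrix algebra lives in the prediction, exactly as the text observes.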

Example 27
Consider again Example 17 of constant-velocity particle motion:

d/dt [ x(t) ; ẋ(t) ] = [ 0 1 ; 0 0 ] [ x(t) ; ẋ(t) ] + [ 0 ; 1 ] v(t),

where v(t) is a zero-mean white noise process with E{v^2(t)} = q. The state transition matrix and its inverse over a time interval δt are given by

F(δt) = [ 1 δt ; 0 1 ],   F^{-1}(δt) = [ 1 −δt ; 0 1 ],

and the noise transition matrix by (Equation 73)

G(δt) = ∫_0^{δt} [ 1 τ ; 0 1 ] [ 0 ; 1 ] dτ = [ δt^2/2 ; δt ].


Setting

Y(k − 1 | k − 1) = [ Y_xx Y_xv ; Y_xv Y_vv ],

the propagated information is

M(k) = F^{-T}(k)Y(k − 1 | k − 1)F^{-1}(k) = [ Y_xx , Y_xv − Y_xxδt ; Y_xv − Y_xxδt , Y_xxδt^2 − 2Y_xvδt + Y_vv ].

Then

Σ(k) = [G^T(k)M(k)G(k) + Q^{-1}(k)] = δt^2 [ Y_xxδt^2/4 − Y_xvδt + Y_vv ] + q^{-1}

and

Ω(k) = M(k)G(k)Σ^{-1}(k)
     = (1 / (Y_xxδt^4/4 − Y_xvδt^3 + Y_vvδt^2 + q^{-1})) [ Y_xvδt − Y_xxδt^2/2 ; Y_xxδt^3/2 − 3Y_xvδt^2/2 + Y_vvδt ].

Finally, denoting

M(k) = [ M_xx M_xv ; M_xv M_vv ],

the propagation gain matrix is found as

1 − Ω(k)G^T(k) = (1 / (M_xxδt^2/4 + M_xvδt + M_vv + q^{-1}/δt^2))
    × [ M_xvδt/2 + M_vv + q^{-1}/δt^2 , −(M_xxδt/2 + M_xv) ; −(M_xvδt^2/4 + M_vvδt/2) , M_xxδt^2/4 + M_xvδt/2 + q^{-1}/δt^2 ].

Note also the term

F^{-T}(k) y(k − 1 | k − 1) = [ 1 0 ; −δt 1 ] [ y ; ẏ ] = [ y ; ẏ − δt·y ].

Assume observations are made of the position of the particle at discrete, synchronous time intervals according to

z(k) = [ 1 0 ] [ x(k) ; ẋ(k) ] + w(k),

where w(t) is a zero-mean white noise process with E{w^2(t)} = r. The information-state contribution is simply given by

i(k) = H^T(k)R^{-1}(k)z(k) = [ 1 ; 0 ] r^{-1} z_x = [ z_x/r ; 0 ],

and the information matrix by

I(k) = H^T(k)R^{-1}(k)H(k) = [ 1 ; 0 ] r^{-1} [ 1 0 ] = [ 1/r 0 ; 0 0 ].
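Using the information_predict and information_update sketches given earlier, the example can also be run in numbers. The values of δt, q, r and the two simulated position measurements below are our own choices for illustration; after two cycles Y(k | k) is invertible and the state estimate can be recovered.

# Example 27 in numbers: constant-velocity particle, position-only
# observations.  dt, q, r and the measurements are assumed values.
dt, q, r = 0.1, 0.01, 0.25
F = np.array([[1.0, dt], [0.0, 1.0]])
G = np.array([[dt**2 / 2], [dt]])
Q = np.array([[q]])
H = np.array([[1.0, 0.0]])
R = np.array([[r]])

y = np.zeros(2)           # y(0|0) = 0: no prior information
Y = np.zeros((2, 2))      # Y(0|0) = 0: infinite prior uncertainty
for z in (np.array([1.02]), np.array([1.11])):   # two noisy positions
    y, Y = information_predict(y, Y, F, G, Q)
    y, Y = information_update(y, Y, z, H, R)

x_hat = np.linalg.solve(Y, y)   # x(k|k) = Y^{-1}(k|k) y(k|k)
print(x_hat)                    # recovered position and velocity

Starting from Y(0 | 0) = 0 is the natural expression of "no prior information" in information form; the equivalent covariance-form initialisation would require an infinite P(0 | 0).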


A key problem with this information-state vector y(i | j) is that it has no obvious metric properties; the difference between two information-state vectors does not relate at all to the difference in state vectors, because of scaling by the information matrix. This can make the interpretation of the information filter more difficult than the (state-based) Kalman filter.

There is a clear duality between the information filter and the conventional Kalman filter, in which the prediction stage of the information filter is related to the update stage of the Kalman filter, and the update stage of the information filter to the prediction stage of the Kalman filter [2]. In particular, identifying the correspondences

Ω(k) → W(k),   Σ(k) → S(k),   G^T(k) → H(k),

Ω(k) is seen to take on the role of an information prediction gain matrix, Σ(k) an 'information innovation matrix', and G^T(k) an 'information observation'. This duality is instructive in understanding the relationship of information to state as equivalent representations. It is also of assistance in the implementation of the information filtering equations.

With this interpretation, the information-filter form has the advantage that the update Equations 255 and 256 for the estimator are computationally simpler than the corresponding equations for the Kalman filter, at the cost of increased complexity in prediction. The value of this in decentralized sensing is that estimation occurs locally at each node, requiring partition of the estimation equations, which are simpler in their information form. Prediction, which is more complex in this form, relies on a propagation coefficient which is independent of the observations made, and so is again simple to decouple and decentralize amongst a network of sensor nodes. This property is exploited in subsequent sections.

4.2.2 The Information Filter and Bayes Theorem

There is a strong relationship between the information state and the underlying probability distribution. Recall Bayes Theorem:

P(x(k) | Z^k) = P(z(k) | x(k)) P(x(k) | Z^{k−1}) / P(z(k) | Z^{k−1}).   (258)

If it is assumed that the prior and likelihood are Gaussian as

P(x(k) | Z^{k−1}) ∝ exp{ −(1/2) [x(k) − x(k | k − 1)]^T P^{-1}(k | k − 1) [x(k) − x(k | k − 1)] },

P(z(k) | x(k)) ∝ exp{ −(1/2) [z(k) − H(k)x(k)]^T R^{-1}(k) [z(k) − H(k)x(k)] },

then the posterior will also be Gaussian, in the form

P(x(k) | Z^k) ∝ exp{ −(1/2) [x(k) − x(k | k)]^T P^{-1}(k | k) [x(k) − x(k | k)] }.


Now, substituting these distributions into Equation 258 and taking logs gives

[x(k) − x(k | k)]^T P^{-1}(k | k) [x(k) − x(k | k)]
  = [z(k) − H(k)x(k)]^T R^{-1}(k) [z(k) − H(k)x(k)]
  + [x(k) − x(k | k − 1)]^T P^{-1}(k | k − 1) [x(k) − x(k | k − 1)] + C(z(k))   (259)

where C(z(k)) is independent of x(k). This equation relates the log-likelihoods of these Gaussian distributions and is essentially a quadratic in x(k). Now, differentiating this expression once with respect to x(k) and rearranging gives Equation 255. Differentiating a second time and rearranging gives

P^{-1}(k | k) = H^T(k)R^{-1}(k)H(k) + P^{-1}(k | k − 1),

which is Equation 256. The first derivative of the log-likelihood is known as the score function. The second derivative is, of course, the Fisher information of Equation 50.

This analysis suggests that the information filter is, fundamentally, a log-likelihood implementation of Bayes Theorem. The first and second derivatives of the log-likelihood are essentially moment generating functions (the first derivative is the centre of mass, and the second, the moment of inertia [34]). The information filter is thus a means for the recursive calculation of sufficient statistics (the mean and variance) in the case when the distributions of interest are Gaussian.

This interpretation of the information filter shows the relationship between the Kalman filter and the more general probabilistic estimation problem. Indeed, if the distributions were not Gaussian, and possibly discrete, the information filter would reduce to the general fusion of log-likelihoods as described in Section 2.2.6. This can be exploited in the development of efficient distributed data fusion algorithms for problems which are not linear or Gaussian.

4.2.3 The Information Filter in Multi-Sensor Estimation

The information filter is reasonably well known in the literature on estimation [2, 28]. However, its use in data fusion has been largely neglected in favour of conventional state-based Kalman filtering methods. The reasons for this appear to be somewhat spurious, based largely on the incorrect hypothesis that it is "cheaper" to communicate innovation information (of the dimension of the observation vector) than to communicate information-state vectors (of the dimension of the state vector). The basic problem is that it is generally not possible to extend Equations 230 and 231 in any simple way to deal with multiple observations. The reason for this, as we have seen, is that although the innovation vectors at different times are uncorrelated (by construction), the innovations generated by different sensors at the same time are correlated, by virtue of the fact that they use a common prediction. Thus, the innovation covariance matrix S(k) can never be diagonal, and so cannot be simply partitioned and inverted to yield a gain matrix for each individual observation.


Thus, for a set of sensors i = 1, · · · , N, it is not possible to compute the simple sum of innovations as

x(k | k) = x(k | k − 1) + Σ_{i=1}^N W_i(k) [z_i(k) − H_i(k) x(k | k − 1)],

in which the individual gains are given by

W_i(k) = P(k | k − 1) H_i^T(k) S_i^{-1}(k).

However, the information filter provides a direct means of overcoming these problems. As will be seen, for the same set of sensors, i = 1, · · · , N, and without any additional assumptions, it is the case that the information contributions from all sensors can simply be added to obtain an updated estimate in the form

y(k | k) = y(k | k − 1) + Σ_{j=1}^N i_j(k),

which is algebraically equivalent to a full (all-sensor) state-based Kalman filter estimate.

The reason why this can be done with the information filter is that the information contributions made by the observations are directly related to the underlying likelihood functions for the states, rather than to the state estimates themselves. This can be appreciated by considering the interpretation of the information filter directly as an implementation of Bayes theorem in terms of log-likelihood rather than in terms of state. Indeed, making the usual assumption that the observations made by the various sensors are conditionally independent given the true state, as

P(z_1(k), · · · , z_N(k) | x(k)) = Π_{i=1}^N P(z_i(k) | x(k)),   (260)

then Bayes Theorem gives (Equation 19)

P(x(k) | Z^n(k)) = P(x(k) | Z^n(k − 1)) Π_{i=1}^n P(z_i(k) | x(k)) · [P(Z^n(k) | Z^n(k − 1))]^{-1}.   (261)

Taking logs of Equation 261 then gives

ln P(x(k) | Z^n(k)) = ln P(x(k) | Z^n(k − 1)) + Σ_{i=1}^n ln P(z_i(k) | x(k)) − ln P(Z^n(k) | Z^n(k − 1)).   (262)

Given Equation 259, an identification can be made as

Σ_{i=1}^n ln P(z_i(k) | x(k))  ≡  { Σ_{i=1}^n i_i(k), Σ_{i=1}^n I_i(k) }.   (263)


This demonstrates why, fundamentally, it is possible to add information states and information matrices from different sensors, while it is not possible to add innovations without accounting for cross-correlations. For this reason also, the information filter is occasionally referred to as the likelihood filter.

Computationally, the information filter thus provides a far more natural means of assimilating information than does the conventional Kalman filter, and a far simpler method of dealing with complex multi-sensor data fusion problems.

The linear addition of information in the information filter can also be obtained by direct algebraic manipulation. Consider a system comprising N sensors, each taking observations according to

z_i(k) = H_i(k) x(k) + v_i(k)   (264)

with

E{v_p(i) v_q^T(j)} = δ_{ij} δ_{pq} R_p(i).   (265)

The observations can be stacked into a composite observation

z(k) = [z_1^T(k), · · · , z_N^T(k)]^T.   (266)

The observation model can also be stacked into a composite model

H(k) = [H_1^T(k), · · · , H_N^T(k)]^T,   (267)

and

v(k) = [v_1^T(k), · · · , v_N^T(k)]^T,   (268)

to give a composite observation equation in familiar form

z(k) = H(k)x(k) + v(k).

Noting that

E{v(k)v^T(k)} = R(k) = blockdiag{R_1(k), · · · , R_N(k)},   (269)

a multiple-sensor form of Equation 257 can be obtained as

i(k) = Σ_{i=1}^N i_i(k) = Σ_{i=1}^N H_i^T(k) R_i^{-1}(k) z_i(k)   (270)

and

I(k) = Σ_{i=1}^N I_i(k) = Σ_{i=1}^N H_i^T(k) R_i^{-1}(k) H_i(k),   (271)

where

i_i(k) = H_i^T(k) R_i^{-1}(k) z_i(k)   (272)

is the information-state contribution from observation z_i(k), and

I_i(k) = H_i^T(k) R_i^{-1}(k) H_i(k)   (273)


is its associated information matrix.

Equations 270 and 271 show that the total information available to the filter at any

time-step is simply the sum of the information contributions from each of the individual sensors. Further, Equations 255 and 256, describing the single-sensor information update, are easily extended to multiple-sensor updates in a straight-forward manner as

y(k | k) = y(k | k − 1) + Σ_{i=1}^N i_i(k)   (274)

Y(k | k) = Y(k | k − 1) + Σ_{i=1}^N I_i(k).   (275)

These equations should immediately be compared to the considerably more complex multiple-sensor form of the Kalman filter update equations.
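As a sketch of how direct this is in practice, the following hypothetical helper (in the style of the earlier numpy fragments) fuses any number of sensors in one update by summing their information contributions, per Equations 270–275.

def multisensor_information_update(y_pred, Y_pred, observations):
    """Multi-sensor update (Equations 270-275): the contributions from
    any number of sensors are simply summed onto the prediction.
    `observations` is a list of (z_i, H_i, R_i) triples."""
    y, Y = y_pred.copy(), Y_pred.copy()
    for z, H, R in observations:
        Rinv = np.linalg.inv(R)
        y = y + H.T @ Rinv @ z     # i_i(k), Equation 272
        Y = Y + H.T @ Rinv @ H     # I_i(k), Equation 273
    return y, Y

There is no joint gain matrix to compute and no stacked innovation covariance to invert; each sensor's term is formed independently.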

4.2.4 The Hierarchical Information Filter

It is now shown how the information filter may be partitioned to provide a simple hierarchical estimation architecture, based first on the communication of the information terms i(·) and I(·) from sensor nodes to a common fusion center, and second on the communication of partial information-state estimates from nodes to a central assimilation point. The latter case corresponds to the algorithm developed in [21]. In Section 2.2.6 it was demonstrated that, because of simple information summation, the log-likelihood form of Bayes theorem can readily be mapped to a number of different architectural forms. For the same reasons, the information additions in Equations 274 and 275 can also be distributed in a simple manner.

One hierarchical architecture that employs this additive property is shown in Figure 34 (this is the information filter form of Figure 8). Each sensor incorporates a full state model and takes observations according to Equation 264. They all calculate an information-state contribution from their observations, in terms of i_i(k) and I_i(k). These are then communicated to the fusion centre and are incorporated into the global estimate through Equations 274 and 275. The information-state prediction is generated centrally using Equations 250 and 251, and the state estimate itself may be found at any stage from x(i | j) = Y^{-1}(i | j) y(i | j). To avoid communicating predictions to nodes, any validation or data association should take place at the fusion center.

A second hierarchical system, which allows local tracks to be maintained at local sensor sites, is shown in Figure 35 (this is the information filter form of Figure 9). In this system, each sensing node produces local information-state estimates on the basis of its own observations and communicates these back to a central fusion center, where they are assimilated to provide a global estimate of information-state. Let y_i(· | ·) be the information-state estimate arrived at by each sensor site based only on its own observations and local information-state prediction y_i(k | k − 1). This local estimate can be found from a local form of Equations 255 and 256 as

y_i(k | k) = y_i(k | k − 1) + i_i(k)   (276)



Figure 34: A hierarchical data fusion system where information-state contributions are calculated at each sensor node and transmitted to a central fusion center, where a common estimate is obtained by simple summation. All state predictions are undertaken at the central processor.


Figure 35: A hierarchical data fusion system where tracks are formed locally and communicated to a central fusion site. Each sensor node undertakes a prediction stage and maintains a track based only on its local observations. The central processor fuses these tracks.


and

Y_i(k | k) = Y_i(k | k − 1) + I_i(k).   (277)

These partial information-state estimates are communicated to a fusion center, where they can be assimilated according to

y(k | k) = y(k | k − 1) + Σ_{i=1}^N [y_i(k | k) − y(k | k − 1)]   (278)

and

Y(k | k) = Y(k | k − 1) + Σ_{i=1}^N [Y_i(k | k) − Y(k | k − 1)].   (279)

With the assumption that the local predictions at each node are the same as the prediction produced by a central fusion center, it can be seen that these assimilation equations are identical to Equations 274 and 275.
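A minimal sketch of this central assimilation step, under the same assumption that every node shares the central prediction (the function name is ours):

def assimilate_local_tracks(y_pred, Y_pred, local_tracks):
    """Central assimilation of local partial estimates (Equations
    278-279), assuming every node used the identical prediction
    (y_pred, Y_pred).  `local_tracks` is a list of (y_i, Y_i) pairs."""
    y = y_pred + sum(y_i - y_pred for y_i, _ in local_tracks)
    Y = Y_pred + sum(Y_i - Y_pred for _, Y_i in local_tracks)
    return y, Y

Each bracketed difference recovers exactly the node's observation information i_i(k), I_i(k), which is why the result coincides with Equations 274 and 275.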


Figure 36: A hierarchical data fusion system where tracks are formed locally and communicated to a central fusion site, then track updates are communicated back to the sensor nodes. Each sensor node undertakes a prediction stage and maintains a global track based on the observations of all sensor nodes.

A third hierarchical system, which allows global tracks to be maintained at local sensor sites, is shown in Figure 36. This architecture is similar to that shown in Figure 35, except that once a global estimate is obtained at the central fusion site, it is communicated back to the local sites, where it is assimilated to form a global estimate. The advantage of this architecture is that each site can now act in an "autonomous" manner with access to global track information.

There are a number of other possible implementations of these hierarchical estimation equations, depending on where it is most efficient to generate predictions and where different parts of the system model reside. The important point to note is that the information filter provides a simple and natural method of mapping estimation equations to different architectures.


4.2.5 The Decentralised Information Filter


Figure 37: A decentralized, fully connected, data fusion system where tracks are formed locally and communicated to neighbouring nodes, where they are assimilated to provide global track information. Each sensor node undertakes a prediction stage and maintains a global track based on the observations of all sensor nodes. This architecture is also directly equivalent to a broadcast or bus communication system.

It is a relatively simple step to decentralize the assimilation Equations 278 and 279 in systems where there is a fully connected network of sensing nodes, as shown in Figure 37. In this type of system, each node generates a prediction, takes an observation and computes a local estimate, which is communicated to all neighbouring nodes. Each node receives all local estimates and implements a local form of the assimilation equations to produce a global estimate of information-state, equivalent to that obtained by a central fusion center. In effect, this is the same as replicating the central assimilation equations of Figure 36 at every local sensor site, and then simplifying the resulting equations.

It is assumed that each local sensor site or node maintains a state-space model (F(k), G(k), Q(k)) identical to an equivalent centralized model, so that y_i(· | ·) ≡ y(· | ·) for all i = 1, · · · , N. Each node begins by computing a local estimate y_i(k | k), based on a local prediction y_i(k | k − 1) and the observed local information i_i(k), according to (Equations 250 and 251)

Prediction:

y_i(k | k − 1) = [1 − Ω_i(k)G^T(k)] F^{-T}(k) y_i(k − 1 | k − 1) + Y_i(k | k − 1)B(k)u(k)   (280)

Y_i(k | k − 1) = M_i(k) − Ω_i(k)Σ_i(k)Ω_i^T(k)   (281)

where

M_i(k) = F^{-T}(k) Y_i(k − 1 | k − 1) F^{-1}(k),   (282)

Ω_i(k) = M_i(k)G(k)Σ_i^{-1}(k),   (283)

and

Σ_i(k) = [G^T(k)M_i(k)G(k) + Q^{-1}(k)].   (284)

Estimate:

y_i(k | k) = y_i(k | k − 1) + i_i(k)   (285)

Y_i(k | k) = Y_i(k | k − 1) + I_i(k).   (286)

These partial information-state estimates are then communicated to neighbouring nodes, where they are assimilated according to

Assimilate:

y_i(k | k) = y_i(k | k − 1) + Σ_{j=1}^N [y_j(k | k) − y_j(k | k − 1)]   (287)

Y_i(k | k) = Y_i(k | k − 1) + Σ_{j=1}^N [Y_j(k | k) − Y_j(k | k − 1)].   (288)

If each node begins with a common initial information-state estimate y_j(0 | 0) = 0, Y_j(0 | 0) = 0, and the network is fully connected, then the estimates obtained by each node will be identical.

The quantities communicated between sensor nodes, (y_j(k | k) − y_j(k | k − 1)) and (Y_j(k | k) − Y_j(k | k − 1)), consist of the difference between the local information at a time k and the prediction based only on information up to time k − 1. This can be interpreted as the new information obtained by that node at the current time step. Indeed, the communicated terms are algebraically equivalent to i_j(k) and I_j(k); logically, the new information available at a time k is just the information obtained through observation at that time. Thus, the operation of the sensor network can be envisioned as a group of local estimators which communicate new, independent, information between each other, and which assimilate both local and communicated information to individually obtain a globally optimal local estimate.

There are three interesting points that can be made about these decentralized equations:

• The additional computation required of each node to assimilate information from adjacent nodes is small: a summation of vectors in Equation 287 and a summation of matrices in Equation 288. This is a direct consequence of the use of the information form of the Kalman filter, which places the computational burden on the generation of predictions.

• The amount of communication that needs to take place is actually less than is required in a hierarchical organization. This is because each node individually computes a global estimate, so that there is no need for estimates or predictions to be communicated prior to an estimation cycle. This results in a halving of the required communication bandwidth, which may be further improved if model distribution is incorporated.


• The assimilation equations are the same as those that would be obtained in a system with distributed sensing nodes and a broadcast communication system.

The algorithm defined by Equations 280–288 is appropriate both for fully connected sensor networks and for sensors connected to a broadcast communication facility (such as a bus or Blackboard).
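One node's complete cycle can then be sketched as follows, reusing the earlier numpy fragments. The channel object and its send/receive methods are placeholders for whatever transport the network provides, not an API defined in the text.

def decentralised_node_cycle(y, Y, z, H, R, F, G, Q, channel):
    """One cycle of Equations 280-288 at a single node of a fully
    connected decentralised network.  channel.send(dy, dY) broadcasts
    this node's new information; channel.receive() yields the
    (dy_j, dY_j) differences sent by the other nodes."""
    # Local prediction, Equations 280-284.
    y_pred, Y_pred = information_predict(y, Y, F, G, Q)
    # Local estimate, Equations 285-286.
    y_loc, Y_loc = information_update(y_pred, Y_pred, z, H, R)
    # Communicate the new information (algebraically i_i(k), I_i(k)).
    channel.send(y_loc - y_pred, Y_loc - Y_pred)
    # Assimilate the neighbours' contributions, Equations 287-288.
    for dy, dY in channel.receive():
        y_loc = y_loc + dy
        Y_loc = Y_loc + dY
    return y_loc, Y_loc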

4.3 Decentralised Multi-Target Tracking

4.3.1 Decentralised Data Association

Data association in distributed systems is a complex problem. The reason for this is that hard association decisions made locally, in an optimal manner with respect to local observations, may not be optimal at the global level when all sensor information is made available. Further, an incorrect association decision is almost impossible to undo once data has been fused into a track.

The normal approach to this problem is to maintain both a local and a global track file and periodically to synchronize the two (see for example [9], pp. 602–605). Alternative approaches involve using either probabilistic data association methods or multiple-hypothesis trackers, both of which avoid the need to make hard local association decisions (as described in Section 3.4).

All data association methods require that the normalised innovation is made available at each of the local processor nodes. In a decentralised data fusion architecture, the information transmitted from one node to another is in the form of the information-theoretic quantities i(k) and I(k). To implement a local validation procedure, it is therefore necessary to derive an expression, from the prediction and communication terms, which allows computation of a normalised innovation by every node on the basis of the local information y(k | k − 1) and Y(k | k − 1). The result is a normalised information residual from which an information gate, equivalent to the innovation gate, can be obtained.

To formulate the information gate, the inverse of the information matrix I(k) is required. Generally, the dimension of the observation vector is less than that of the state vector, so the information matrix I(k) = H^T(k)R^{-1}(k)H(k) is singular: it has rank n_z, equal to the size of the observation vector, but dimension n_x × n_x, the size of the state vector, and generally n_x > n_z. Consequently, the inverse information matrix (the corresponding covariance matrix) is not well defined, and instead the generalised inverse I†(k) is used. The generalised inverse is defined such that

I(k)I†(k) = E, (289)

where E is an idempotent matrix which acts as the identity matrix for the information matrix and its generalised inverse [25, 36]:

I(k)E = I(k), I†(k)E = I†(k), (290)

so that

I(k)I†(k)I(k) = I(k),   I†(k)I(k)I†(k) = I†(k).   (291)


Amongst all possible generalised inverses that satisfy Equations 289, 290 and 291, the most appropriate definition is that which projects I(k) into the observation space, in the following form:

I†(k) = H^T(k) [H(k)I(k)H^T(k)]^{-1} H(k).   (292)

This generalised inverse exploits the role of H(k) as a projection operation, taking state space to observation space, and H^T(k) as a back-projection, taking observation space to state space [42]. Neither projection changes the content of the projected matrix, so they can be applied without modifying the information contribution. Multiplying Equation 292 by H(k)I(k) demonstrates this:

H(k)I(k)I†(k) = H(k)I(k)H^T(k) [H(k)I(k)H^T(k)]^{-1} H(k) = H(k).   (293)

The innovation measure is essential for data association and fault detection. The innovation is based on the difference between a predicted and an observed measurement:

ν(k) = z(k)−H(k)x(k | k − 1). (294)

In information form, the measurement information is provided by the information vector i(k) = H^T(k)R^{-1}(k)z(k). The information residual vector is therefore defined analogously as

υ(k) = H^T(k)R^{-1}(k)ν(k),   (295)

which is simply the innovation ν(k) projected into information space. Substituting Equation 294 into Equation 295 gives

υ(k) = H^T(k)R^{-1}(k)z(k) − H^T(k)R^{-1}(k)H(k)x(k | k − 1)   (296)

or

υ(k) = i(k) − I(k)Y^{-1}(k | k − 1)y(k | k − 1).   (297)

The information residual variance is computed from

Υ(k) = E{υ(k)υ^T(k) | Z^{k−1}}.   (298)

Substituting in Equation 295 gives

Υ(k) = H^T(k)R^{-1}(k) E{ν(k)ν^T(k) | Z^{k−1}} R^{-1}(k)H(k)
     = H^T(k)R^{-1}(k) [H(k)P(k | k − 1)H^T(k) + R(k)] R^{-1}(k)H(k)
     = I(k) + I(k)Y^{-1}(k | k − 1)I(k)
     = I(k) [I†(k) + Y^{-1}(k | k − 1)] I(k).   (299)

The normalised information residual can now be computed from

Γ(k) = υ^T(k)Υ†(k)υ(k),   (300)


noting that the pseudo-inverse for Υ(k) is

Υ†(k) = H^T(k) [H(k)Υ(k)H^T(k)]^{-1} H(k)
      = H^T(k) [H(k)H^T(k) R^{-1}(k)S(k)R^{-1}(k) H(k)H^T(k)]^{-1} H(k)
      = H^T(k) [H(k)H^T(k)]^{-1} R(k)S^{-1}(k)R(k) [H(k)H^T(k)]^{-1} H(k),   (301)

and substituting Equations 295 and 301 into Equation 300 gives

υ^T(k)Υ†(k)υ(k) = ν^T(k)R^{-1}(k)H(k) H^T(k)[H(k)H^T(k)]^{-1} R(k) S^{-1}(k) R(k) [H(k)H^T(k)]^{-1} H(k)H^T(k) R^{-1}(k)ν(k)
               = ν^T(k)S^{-1}(k)ν(k).   (302)

That is, the normalised information residual is identically the conventional (observation) residual.

Thus, to implement a data association or gating policy for a decentralised sensing network using the information filter, a normalised gate (or modified log-likelihood) can be constructed using Equations 297 and 299. This gate will be identical to the gate used in conventional multi-target tracking algorithms. Once the gate is established, it becomes possible to use any of the data association methods described in Section 3.4.
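A compact numpy sketch of such a gate, computed entirely from information-form quantities, follows (the function name is ours; the threshold gamma would be taken from a chi-squared table, and numpy's pinv agrees here with the generalised inverse of Equation 301, since that inverse can be checked to satisfy the Moore–Penrose conditions):

def information_gate(i_k, I_k, y_pred, Y_pred, gamma):
    """Normalised information residual gate (Equations 297-300).
    Returns True when the observation information (i_k, I_k) is
    accepted; gamma is a designer-chosen chi-squared threshold."""
    x_pred = np.linalg.solve(Y_pred, y_pred)       # Y^{-1} y: state prediction
    upsilon = i_k - I_k @ x_pred                   # Equation 297
    Upsilon = I_k + I_k @ np.linalg.solve(Y_pred, I_k)   # Equation 299
    Gamma = upsilon @ np.linalg.pinv(Upsilon) @ upsilon  # Equation 300
    return Gamma <= gamma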

4.3.2 Decentralised Identification and Bayes Theorem

Decentralised data fusion principles can be easily extended to situations in which the underlying probability densities are not Gaussian, and indeed where the densities are discrete. This provides a means of implementing decentralised (discrete) identification algorithms. The method is based on the use of log-likelihoods and extends the hierarchical log-likelihood architectures described in Section 2.2.6.

Recall the recursive form of Bayes theorem

P(x | Z^k) = P(z(k) | x) P(x | Z^{k−1}) / P(z(k) | Z^{k−1}),   (303)

which may be written in terms of log-likelihoods as

ln P(x | Z^k) = ln P(x | Z^{k−1}) + ln [P(z(k) | x) / P(z(k) | Z^{k−1})],   (304)

where x is the state to be estimated and Z^k is the set of observations up to the kth timestep. As has been previously demonstrated, the information form of the Kalman filter, for example, can be derived directly from this expression.

In Equation 304, the term ln P(x | Z^{k−1}) corresponds to information accumulated about the state up to time k − 1. The term

ln [P(z(k) | x) / P(z(k) | Z^{k−1})]



Figure 38: A decentralised, fully connected, Bayesian data fusion system appropriate for decentralised identification tasks.

corresponds to the new information generated at time k. Equation 304 exactly represents the communication requirements of a fully connected decentralised system, in which each node communicates a likelihood based on its own observation to all others, and receives the likelihoods of all its neighbours, which are then fused to form a global estimate. Rewriting Equation 304 in terms of an individual node i, this global estimate is computed as

ln P(x | Z^k) = ln P(x | Z^{k−1}) + Σ_j ln [P(z_j(k) | x) / P(z_j(k) | Z^{k−1})],   (305)

where the summation represents the communicated terms. This is illustrated for a node i in Figure 38.
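For a discrete hypothesis set, this update is a few lines of numpy; the sketch below (names ours) absorbs the denominator terms of Equation 305 into a single renormalisation.

def decentralised_identity_update(log_prior, local_loglik, received_logliks):
    """Discrete decentralised identification (Equation 305).  Each node
    adds its own observation log-likelihood and those communicated by
    its neighbours to the log-prior over a common discrete hypothesis
    set, then renormalises.  All arguments are numpy arrays over the
    same hypothesis set."""
    log_post = log_prior + local_loglik + sum(received_logliks)
    log_post = log_post - np.log(np.sum(np.exp(log_post)))  # renormalise
    return log_post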

4.4 Communication in Decentralised Sensing Systems

The decentralised data fusion algorithms described so far are implicitly limited in requiring full communication, either as a fully connected sensing network or as a broadcast system. The reason for this is that it is assumed that all new information in the network is made available to all sensors at observation time.

Fully connected, "complete information" networks are not practically viable: any realisable distributed sensing network should be able to cater for a variety of communication topologies, variable communication delays and the insertion of new sensors into the network. The key to providing these abilities lies in the algorithms used to decide what information should be communicated between different sensing nodes. This is the focus of this section.

It is first demonstrated that the need for full connectedness is a result of the assumption that all nodes share a common prediction. Implementation of the straight-forward decentralised data fusion equations consequently results in a nearest-neighbour communication policy, in which information is not propagated through the network.


To overcome this, the idea of a channel filter is introduced. A channel filter records the information that is communicated between nodes. In a tree-connected network, the information communicated between nodes is clearly also the information they have in common. By subtracting this common information from any future communication, only "new" information is communicated. However, it is also demonstrated that in general sensor network topologies, the channel filter is impossible to implement in practice. This is because the common information between two nodes cannot be established uniquely. Three (sub-optimal) methods of overcoming this problem are described.

The channel filter provides a valuable buffer between a sensor node and the remainder of the sensing network. The channel filter can be used to implement many practical aspects of network operation, including intermittent communication and the insertion of new and removal of old communication links. An algorithm is developed for dealing with information that is delayed or asequent (out of temporal order). Using this algorithm, general channel operations are described.

4.4.1 Fully Connected and Broadcast Sensor Networks

The assumption of a fully connected topology is, in general, unrealistic within the node-to-node communication constraints imposed. This is because, as the number of nodes grows, the number of communication links required by each node also increases. In addition, the loss of any one communication channel will result in the fully-connected assumption being violated.

Recall the assimilation equations

y_i(k | k) = y_i(k | k − 1) + Σ_j [y_j(k | k) − y_j(k | k − 1)]
           = y_i(k | k − 1) + Σ_{j∈N_i} i_j(k)   (306)

and

Y_i(k | k) = Y_i(k | k − 1) + Σ_j [Y_j(k | k) − Y_j(k | k − 1)]
           = Y_i(k | k − 1) + Σ_{j∈N_i} I_j(k).   (307)

Equations 306 and 307 make it clear that the estimate arrived at locally by a node i is based on the observation information i_j(k) and I_j(k) communicated to it by nodes j. If node i only communicates with a local neighbourhood N_i, a subset of the complete network, then the estimate arrived at locally will be based only on this information and will not be equivalent to a centralised estimate.

In effect, each time a new local estimate y_j(k | k) is obtained at a node j, the prior information at this node, y_j(k | k − 1), is subtracted to provide the 'new' information to be communicated to a node i. The assumption here is that the prior information y_j(k | k − 1) at node j is the information that nodes i and j have in common up to time


k − 1. Subtracting this from the local estimate should then give the new information to be communicated from node j to node i. In the fully connected case, the prior information at node j is indeed the common information between nodes i and j, and so only the information i_j(k) is communicated. In the non-fully connected case, the prior information at node j includes not only information common to node i, but also information from other branches in the network. Subtracting this from the local estimate, however, again results in only the information i_j(k) being communicated to node i. This is because it assumes that node i already has the information communicated to node j through other branches of the network. The net result of this is that nodes only exchange information in their immediate neighbourhoods and do not propagate information from distant branches.


Figure 39: A fully connected sensor network consisting of four nodes, A, B, C, and D. All nodes in this network can generate estimates, equal to a centralised estimate, using only Equations 306 and 307.

Figure 39 shows a fully connected network in which local estimates will be equal to global estimates using only Equations 306 and 307 for assimilation. Figure 40 shows a linearly connected sensor network and Figure 41 shows a tree-connected sensor network. Here, the estimates arrived at by each node, using Equations 306 and 307 for assimilation, will be based only on observations made by sensors in the immediate neighbourhood.

4.4.2 Identification of Redundant Information in Sensor Networks

The key step in deriving fusion equations for decentralised sensing networks is to identify the common information among estimates so that it is not used redundantly. The problem of accounting for redundant information in decentralised communication structures is encountered in many applications¹³.



Figure 40: A linear sensor network consisting of four nodes, A, B, C, and D. Nodes in this network, using only Equations 306 and 307, can only generate estimates based on information available in their immediate neighbourhood.


Figure 41: A tree-connected sensor network consisting of four nodes, A, B, C, and D. Nodes in this network, using only Equations 306 and 307, generate estimates based on information available in their immediate neighbourhood.


In decentralised data fusion systems, the incorporation of redundant information may lead to bias, over-confidence and divergence in estimates.

The problem of identifying common information is most generally considered in terms of information sets. A neighbourhood of a node is the set of nodes to which it is linked directly. A complete neighbourhood includes the node itself. A node i forms an information set Z_i^k at time k based on a local sensor observation z_i(k) and the information communicated by its neighbours. The objective of each node is to form the union of the information sets in the complete neighbourhood [N_i]: ⋃_{j∈[N_i]} Z_j^k. In particular, consider

communications between node i and a neighbour j. The union of information sets, on which estimates are to be based, may be written as

Z_i^k ∪ Z_j^k = Z_i^k + Z_j^k − Z_{i∩j}^k,   (308)

that is, the union is equal to the sum of the information sets communicated, minus the intersection of, or common information between, these information sets. A fully decentralised solution to the estimation problem is only possible where the intersection, or common communicated information Z_{i∩j}^k, can be determined from information which is available locally.

The topology of the sensing network is the most important factor in determining common communicated information. Consider the following three cases:

• Full Connection: Consider the case in which each node is connected to every other in a fully connected, or completely connected, topology (Figure 39). In this case, the sensor nodes may acquire observation information from the entire network through direct communication, and the neighbourhood of any node is the full network. In a fully connected network, the problem of locally detecting and eliminating redundant information is considerably simplified, as every node has access to the same information. In this situation, the estimates formed by the nodes are identical, and new information is immediately communicated after the removal of the estimate from the previous time-step as

⋃_{j∈[N_i]} Z_j^k = Σ_{j∈[N_i]} Z_j^k − ⋃_{j∈[N_i]} Z_j^{k−1}.   (309)

This gives rise to a communication strategy where each node subtracts the estimate formed at the previous timestep prior to communicating its current observation information. Thus, Equation 309 is an information-set equivalent of Equations 306 and 307, which explicitly subtract the common prediction from the local partial estimates. Since the fully connected condition means that global and local information are in fact the same, this must be regarded as a special case of the common information problem.

¹³ In human communication, decentralised and cyclic communication structures give rise to "rumour propagation" or the "chicken licken" problem.


• Tree Connection: In a tree-connected topology, there is only one path between each pair of nodes (see Figures 40 and 41). Therefore, a receiving node can be certain that the only redundant information communicated by a neighbour is the information that they have exchanged in the past. Thus, the observation information history that is required for a tree communication system extends only to the previous timestep. The only common information between node i and a neighbour j at time k is the information which they exchanged in the last time interval k − 1. This results in a pair-wise communication algorithm. For a node i on link (i, j),

Z_i^k ∪ Z_j^k = Z_i^k + Z_j^k − Z_{i∩j}^k,

where, crucially, the intersection can be found from information sets at the previous timestep,

Z_{i∩j}^{k−1} = Z_i^{k−1} ∪ Z_j^{k−1}.   (310)

This generalises to a communication strategy over the neighbourhood as

⋃_{j∈[N_i]} Z_j^k = Z_i^k + Σ_{j∈N_i} Z_j^k − ⋃_{j∈N_i} Z_j^{k−1},   (311)

where [N_i] is the complete neighbourhood and i ∉ N_i.

• Arbitrary Networks: In an arbitrary network, the connectedness of nodes is unknown; the network may be partially connected, or a non-fully connected non-tree. Removal of common information in arbitrary network topologies is complex because the pattern of communication varies from node to node, yet each node is required to implement the same algorithm. Consider again two communicating nodes:

Z_i^k ∪ Z_j^k = Z_i^k + Z_j^k − Z_{i∩j}^k.

In the arbitrary network case, the intersection term Z_{i∩j}^k cannot be simply determined from past communication on a single link. In the tree case, the common information term depends only on the observation information of i and j. In a non-tree network, the information common to i and j may contain terms from other nodes outside of the neighbourhood. This is because multiple propagation paths are possible. Figure 42 illustrates the information that is communicated to nodes in the three cases. The communicated information terms at i and j are denoted T_i and T_j respectively. The information T_j is integrated into the observation information sets Z_j upon arrival at j. This information must then be acquired by i through some operation of union or intersection of information sets. To maintain the constraints imposed by full decentralisation, it must be possible to eliminate common communicated information terms between any pair of communicating nodes on the basis of local information only. In a fully connected network, the solution is immediate, since T_i = T_j at every time-step and Equation 309 follows. A tree network is partitioned about any pair of connected nodes (i, j) such that the information T_i held


in the subtree from i and that held in the subtree from j are disjoint: T_i ∩ T_j = ∅. Therefore, i acquires the terms from the subtree T_j only through j. In an arbitrary network, it may be possible for i to acquire the information known to j along other routes. The problem is that the communicated terms are not necessarily disjoint; therefore, each node must be able to determine T_i ∩ T_j locally. As shown in Figure 42, information in the region of intersection arrives at both i and j. This multiple propagation must be accounted for. The problem of arbitrary networks can alternately be considered as estimation in networks which admit multiple cycles.

Determination of the communication requirements for non-fully connected decentralised networks therefore hinges on the extraction of common information.

Figure 42: Communicated information sets in the three classes of topology: subtrees connected through (i, j); full connection (T_i = T_j); subnetworks connected through (i, j).

4.4.3 Bayesian Communication in Sensor Networks

Expressions for the common information between two nodes are now derived from Bayes theorem. This in turn establishes the relationships between log-likelihoods from which


common information filters can be derived.

The interaction of pairs of nodes is now considered. For each communicating pair (i, j),

the required probability is P(Z_i ∪ Z_j). Let the union of the individual observation information sets be partitioned into disjoint sets as

Z_i ∪ Z_j = Z_{i\j} ∪ Z_{j\i} ∪ Z_{ij}   (312)

where

Z_{i\j} = Z_i \ Z_{ij},   Z_{j\i} = Z_j \ Z_{ij},   Z_{ij} = Z_i ∩ Z_j,

and where the notation p\r (the restriction operation) means the elements of the set p excluding those elements that are also in the set r. Note also that

Z_{i\j} ∪ Z_{ij} = Z_i,   Z_{j\i} ∪ Z_{ij} = Z_j.
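The partition of Equation 312 can be illustrated with ordinary Python sets standing in for the information sets (the observation labels are invented for this toy example):

# Information sets as plain Python sets of observation labels.
Z_i = {"z_i(1)", "z_i(2)", "z_j(1)"}   # node i's information set
Z_j = {"z_j(1)", "z_j(2)"}             # node j's information set

Z_ij = Z_i & Z_j                       # common information Z_ij
Z_i_only = Z_i - Z_ij                  # Z_{i\j}
Z_j_only = Z_j - Z_ij                  # Z_{j\i}

# Equation 312: the union partitions into three disjoint sets,
assert Z_i | Z_j == Z_i_only | Z_j_only | Z_ij
# and restriction plus intersection recovers each original set.
assert (Z_i_only | Z_ij == Z_i) and (Z_j_only | Z_ij == Z_j)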

Then,

P(Z_i ∪ Z_j | x) = P(Z_{i\j} ∪ Z_{j\i} ∪ Z_{ij} | x)
  = P(Z_{i\j} | Z_{j\i} ∪ Z_{ij}, x) P(Z_{j\i} ∪ Z_{ij} | x)
  = P(Z_{i\j} | Z_{ij}, x) P(Z_j | x)
  = [P(Z_{i\j} ∪ Z_{ij} | x) / P(Z_{ij} | x)] P(Z_j | x)
  = [P(Z_i | x) / P(Z_{ij} | x)] P(Z_j | x)
  = P(Z_i | x) P(Z_j | x) / P(Z_i ∩ Z_j | x).   (313)

Substituting Equation 313 into Bayes theorem gives

P(x | Z_i ∪ Z_j) = P(Z_i ∪ Z_j | x) P(x) / P(Z_i ∪ Z_j)
  = [P(Z_i | x) P(Z_j | x) / P(Z_i ∩ Z_j | x)] · [P(x) / P(Z_i ∪ Z_j)]
  = [P(x | Z_i) P(x | Z_j) / P(x | Z_i ∩ Z_j)] · [P(Z_i) P(Z_j) / (P(Z_i ∩ Z_j) P(Z_i ∪ Z_j))]
  = c · P(x | Z_i) P(x | Z_j) / P(x | Z_i ∩ Z_j).   (314)

This expresses the posterior probability of the unknown state given information from both nodes, P(x | Z_i ∪ Z_j), as a function of the posteriors based only on locally available information, P(x | Z_i) and P(x | Z_j), and the information the two nodes have in common, P(x | Z_i ∩ Z_j).

Taking logs of Equation 314 gives

ln P(x | Z_i ∪ Z_j) = ln P(x | Z_i) + ln P(x | Z_j) − ln P(x | Z_i ∩ Z_j).   (315)

Equation 315 simply states that the fused information is constructed from the sum of the information from each of the nodes, minus the information they have in common. The term ln P(x | Z_i ∩ Z_j) describes the common information between the two nodes, which must be removed before fusion.

Equation 315 serves as the basis for developing information communication policies for non-fully connected sensor networks.


4.4.4 The Channel Filter

The probability density functions in Equation 315 can represent four different information or Kalman filter estimates:

• The local estimate at node i:

x_i(k | k) = E{x(k) | Z_i^k},

with covariance P_i(k | k).

• The local estimate at node j:

x_j(k | k) = E{x(k) | Z_j^k},

with covariance P_j(k | k).

• The estimate based on the union of all information possessed by nodes i and j (in effect, the global estimate):

x_{i∪j}(k | k) = E{x(k) | Z_i^k ∪ Z_j^k},

with covariance P_{i∪j}(k | k).

• The estimate based on the common information between nodes i and j:

x_{i∩j}(k | k) = E{x(k) | Z_i^k ∩ Z_j^k},

with covariance P_{i∩j}(k | k).

Following Equation 259, substituting Gaussian distributions for the probability density functions in Equation 315 and taking natural logarithms immediately gives the information filter equivalent of Equation 315 as

y_{i∪j}(k | k) = y_i(k | k) + y_j(k | k) − y_{i∩j}(k | k)   (316)

Y_{i∪j}(k | k) = Y_i(k | k) + Y_j(k | k) − Y_{i∩j}(k | k)   (317)

where

y_i(k | k) = y_i(k | k − 1) + i_i(k),   y_j(k | k) = y_j(k | k − 1) + i_j(k),   (318)

Y_i(k | k) = Y_i(k | k − 1) + I_i(k),   Y_j(k | k) = Y_j(k | k − 1) + I_j(k).   (319)

It remains to evaluate the common information terms

y_{ij}(k | k) = y_{i∩j}(k | k),   Y_{ij}(k | k) = Y_{i∩j}(k | k).

There are three cases:


• Fully Connected: When the network is fully connected, the common information between two nodes at a time k is exactly the information communicated up to time k − 1. This is simply the common prediction:

y_{ij}(k | k) = y_i(k | k − 1) = y_j(k | k − 1),

Y_{ij}(k | k) = Y_i(k | k − 1) = Y_j(k | k − 1).   (320)

Substitution of Equations 320, 318 and 319 into Equations 316 and 317 yields the previously derived decentralised data fusion Equations 306 and 307.

• Tree Connected: When there is only one pathway joining any two sensor nodes, the common information between these nodes can be obtained locally by simply adding up the information that has previously been communicated on the channel connecting the two nodes. Equations 287 and 288 provide a recursive expression for the total information communicated between the two nodes as

y_{ij}(k | k) = y_{ij}(k | k − 1) + [y_i(k | k) − y_{ij}(k | k − 1)] + [y_j(k | k) − y_{ij}(k | k − 1)]
             = y_i(k | k) + y_j(k | k) − y_{ij}(k | k − 1)   (321)

and

Y_{ij}(k | k) = Y_{ij}(k | k − 1) + [Y_i(k | k) − Y_{ij}(k | k − 1)] + [Y_j(k | k) − Y_{ij}(k | k − 1)]
             = Y_i(k | k) + Y_j(k | k) − Y_{ij}(k | k − 1).   (322)

This estimate of common information replaces the prior information terms in Equations 287 and 288. The local updates at each node remain unchanged:

y_i(k | k) = y_i(k | k − 1) + i_i(k)   (323)

Y_i(k | k) = Y_i(k | k − 1) + I_i(k)   (324)

and the assimilation stage becomes

y_i(k | k) = y_i(k | k) + Σ_{j∈N_i} [y_j(k | k) − y_{ji}(k | k − 1)]   (325)

Y_i(k | k) = Y_i(k | k) + Σ_{j∈N_i} [Y_j(k | k) − Y_{ji}(k | k − 1)].   (326)

This filter is clearly symmetric, y_{ij}(· | ·) = y_{ji}(· | ·), as two nodes have the same common information, so it need only be computed once for each channel.
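A minimal sketch of the per-link bookkeeping this implies, with the channel filter state (y_ij, Y_ij) assumed already predicted to the current time (the function name and calling convention are ours):

def channel_assimilate(y_i, Y_i, y_j, Y_j, y_ij, Y_ij):
    """Fuse the estimate received from neighbour j into node i, and
    update the channel filter for link (i, j) (Equations 321-326)."""
    # Assimilation (Equations 325-326): add what j knows, minus what
    # the two nodes already share.
    y_fused = y_i + (y_j - y_ij)
    Y_fused = Y_i + (Y_j - Y_ij)
    # Channel filter update (Equations 321-322): the new common
    # information is everything both nodes now hold.
    y_ij_new = y_i + y_j - y_ij
    Y_ij_new = Y_i + Y_j - Y_ij
    return y_fused, Y_fused, y_ij_new, Y_ij_new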


• General Networks: In networks which admit multiple cycles, it is possible for information from a node j to be communicated to a node i both directly, through the link (i, j), and indirectly, through other nodes connected to both i and j. In such cases, simply adding up information transmitted through the direct connection (i, j) is not the same as determining the common information y_{ij}(k | k) and Y_{ij}(k | k) between these two nodes. Indeed, in general it is not possible to determine this common information on the basis of local information only. A number of alternative approaches to fusion in general networks are considered in Section 4.4.7.

Equations 321 and 322 define an information filter which estimates the common information between nodes. The filter is completed by the addition of a corresponding prediction stage as

y_{ij}(k | k − 1) = [1 − Ω_{ij}(k)G^T(k)] F^{-T}(k) y_{ij}(k − 1 | k − 1) + Y_{ij}(k | k − 1)B(k)u(k),

Y_{ij}(k | k − 1) = M_{ij}(k) − Ω_{ij}(k)Σ_{ij}(k)Ω_{ij}^T(k)   (327)

where

M_{ij}(k) = F^{-T}(k)Y_{ij}(k − 1 | k − 1)F^{-1}(k),   Ω_{ij}(k) = M_{ij}(k)G(k)Σ_{ij}^{-1}(k),   (328)

and

Σ_{ij}(k) = [G^T(k)M_{ij}(k)G(k) + Q^{-1}(k)].   (329)

The channel filter is simply an information-state estimator which generates estimates on the basis of information communicated through a channel joining two adjacent nodes. The effect of the channel filter is to provide information to nodes from further afield in the network at no extra communication cost. If the network is strictly synchronous (all nodes cycle at the same speed), then the information arriving at a specific node will be time-delayed in proportion to the number of nodes through which it must pass. This gives rise to a characteristic triangular 'wave-front' in the information map which describes the way information is used at any one node to obtain an estimate.

4.4.5 Delayed, Asequent and Burst Communication

The channel filter provides a mechanism for handling the practical problems of communication between sensor nodes. A communication medium for a sensor network will always be subject to bandwidth limitations, and possibly intermittent or jammed communications between nodes. This in turn gives rise to data that is delayed prior to fusion and possibly data that is asequent (arriving out of temporal order). If problems of data delay and order can be resolved, then efficient strategies can be developed for managing communications in sensor networks.

At the heart of practical communication issues is the delayed data problem. The solution to the delayed data problem requires an ability to propagate information both forward and backward in time using the information prediction Equations 250 and 251.


Consider the information matrix Y(k | k) obtained locally at a node at time k. Following Equation 251, this information matrix can be propagated forward in time n steps according to

$$\mathbf{Y}(k+n \mid k) = \mathbf{M}_n(k) - \mathbf{M}_n(k)\mathbf{G}_n(k)\mathbf{\Sigma}_n^{-1}(k)\mathbf{G}_n^T(k)\mathbf{M}_n(k) \tag{330}$$

where

$$\mathbf{M}_n(k) = \left[\mathbf{F}^{-T}(k)\right]^n \mathbf{Y}(k \mid k)\left[\mathbf{F}^{-1}(k)\right]^n, \qquad \mathbf{\Sigma}_n(k) = \mathbf{G}_n^T(k)\mathbf{M}_n(k)\mathbf{G}_n(k) + \mathbf{Q}_n^{-1}(k)$$

and where the subscript n denotes that the associated transition matrices are evaluated from the integrals in Equation 73 over the interval n∆T. Likewise, from Equation 250, the information-state vector can be propagated forward according to

$$\hat{y}(k+n \mid k) = \left[\mathbf{1} - \mathbf{M}_n(k)\mathbf{G}_n(k)\mathbf{\Sigma}_n^{-1}(k)\mathbf{G}_n^T(k)\right]\left[\mathbf{F}^{-T}(k)\right]^n \hat{y}(k \mid k) + \mathbf{Y}(k+n \mid k)\mathbf{B}_n(k)\mathbf{u}_n(k). \tag{331}$$

An equivalent expression can be obtained for propagating information backwards in time as

$$\mathbf{Y}(k-n \mid k) = \mathbf{M}_{-n}(k) + \mathbf{M}_{-n}(k)\mathbf{G}_{-n}(k)\mathbf{\Sigma}_{-n}^{-1}(k)\mathbf{G}_{-n}^T(k)\mathbf{M}_{-n}(k) \tag{332}$$

where

$$\mathbf{M}_{-n}(k) = \left[\mathbf{F}^T(k)\right]^n \mathbf{Y}(k \mid k)\left[\mathbf{F}(k)\right]^n, \qquad \mathbf{\Sigma}_{-n}(k) = \mathbf{Q}_n^{-1}(k) - \mathbf{G}_{-n}^T(k)\mathbf{M}_{-n}(k)\mathbf{G}_{-n}(k)$$

and where the subscript −n denotes that the associated transition matrices are evaluated from the integrals in Equation 73 over the interval −n∆T. Similarly, the information-state vector can be propagated backward according to

$$\hat{y}(k-n \mid k) = \left[\mathbf{1} - \mathbf{M}_{-n}(k)\mathbf{G}_{-n}(k)\mathbf{\Sigma}_{-n}^{-1}(k)\mathbf{G}_{-n}^T(k)\right]\left[\mathbf{F}^T(k)\right]^n \hat{y}(k \mid k) + \mathbf{Y}(k-n \mid k)\mathbf{B}_{-n}(k)\mathbf{u}_n(k). \tag{333}$$

The delayed data problem can now be solved using the following algorithm (a code sketch follows the list):

1. Back-propagate the estimate Y(k | k) to the time k − n at which the information I(k − n) was obtained, using Equations 332 and 333.

2. Add the delayed data I(k − n) to the estimate Y(k − n | k) in the normal manner.

3. Propagate the new fused estimate back to time k using Equations 330 and 331.
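The sketch below transcribes this procedure directly from Equations 330–333, applying the one-step (n = 1) propagation n times rather than evaluating the n-interval transition matrices of Equation 73, and taking the control input as zero for clarity; the function names are illustrative.

```python
import numpy as np

def info_predict_forward(y, Y, F, G, Q):
    """One forward step of Equations 330-331 (n = 1, zero control input)."""
    Fi = np.linalg.inv(F)
    M = Fi.T @ Y @ Fi
    Sigma = G.T @ M @ G + np.linalg.inv(Q)
    Omega = M @ G @ np.linalg.inv(Sigma)
    Y_f = M - Omega @ Sigma @ Omega.T
    y_f = (np.eye(len(y)) - Omega @ G.T) @ Fi.T @ y
    return y_f, Y_f

def info_predict_backward(y, Y, F, G, Q):
    """One backward step, transcribed from Equations 332-333
    (n = 1, zero control input)."""
    M = F.T @ Y @ F
    Sigma = np.linalg.inv(Q) - G.T @ M @ G
    K = M @ G @ np.linalg.inv(Sigma) @ G.T
    Y_b = M + K @ M
    y_b = (np.eye(len(y)) - K) @ F.T @ y
    return y_b, Y_b

def fuse_delayed(y, Y, i_del, I_del, n, F, G, Q):
    """The three-step delayed-data algorithm."""
    for _ in range(n):                     # 1. back-propagate to time k - n
        y, Y = info_predict_backward(y, Y, F, G, Q)
    y, Y = y + i_del, Y + I_del            # 2. fuse the delayed information
    for _ in range(n):                     # 3. re-propagate to time k
        y, Y = info_predict_forward(y, Y, F, G, Q)
    return y, Y
```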

It can be demonstrated that the net effect of this algorithm is to produce an estimate of the form

$$\mathbf{Y}(k \mid k) = \mathbf{Y}_Y(k \mid k) + \mathbf{Y}_I(k \mid k) + \mathbf{Y}_{YI}(k \mid k)$$


where $\mathbf{Y}_Y(k \mid k)$ is the estimate obtained without the delayed information, $\mathbf{Y}_I(k \mid k)$ is the estimate obtained using only the delayed information, and $\mathbf{Y}_{YI}(k \mid k)$ is a cross-information term (uniquely defined by $\mathbf{Y}_Y(k \mid k)$ and $\mathbf{Y}_I(k \mid k)$) describing the cross-correlation between the information states caused by propagation through a common process model. It should be noted that $\mathbf{Y}_{YI}(k \mid k)$ is additive, so the obvious approximation

$$\mathbf{Y}(k \mid k) = \mathbf{Y}_Y(k \mid k) + \mathbf{Y}_I(k \mid k)$$

is conservative.

Recall that the channel filter is simply a standard information filter used to maintain an estimate of the common data passed through a particular channel. A channel filter on node i connected to node j maintains the common information vector ŷij(k | k) and the common information matrix Yij(k | k).

The prediction equations (for both forward and backward propagation) used in the channel filter are the same as Equations 330–333 described above. For the update stage, a channel filter receives information estimates from other nodes. When this happens, the channel filter predicts the received information to a local time horizon and then determines the new information at that time. The new information at the channel filter at node i when data arrives through the channel connected to node j is the information gain from node j

$$m_j(k) = \hat{y}_j(k \mid k-n) - \hat{y}_{ij}(k \mid k-m)$$

$$\mathbf{M}_j(k) = \mathbf{Y}_j(k \mid k-n) - \mathbf{Y}_{ij}(k \mid k-m) \tag{334}$$

Note that the time indices of Yj(k | k − n) and Yij(k | k − m) are different, as the nodes are asynchronous and the two information sets can be predicted over different time horizons. The information gain mj(k) and Mj(k) is analogous to the observation information vector i(k) and matrix I(k), and is merely added in the update stage. The update for the channel filter between nodes i and j when new information has been received from node j can then be written as

$$\hat{y}_{ij}(k \mid k) = \hat{y}_{ij}(k \mid k-m) + m_j(k)$$

$$\mathbf{Y}_{ij}(k \mid k) = \mathbf{Y}_{ij}(k \mid k-m) + \mathbf{M}_j(k). \tag{335}$$

This can be further simplified by substituting Equation 334

$$\begin{aligned}
\hat{y}_{ij}(k \mid k) &= \hat{y}_{ij}(k \mid k-m) + m_j(k) \\
&= \hat{y}_{ij}(k \mid k-m) + \hat{y}_j(k \mid k-n) - \hat{y}_{ij}(k \mid k-m) \\
&= \hat{y}_j(k \mid k-n).
\end{aligned} \tag{336}$$

This reveals a trivial update stage in which the channel filter merely overwrites the previous state with the newer one. When a new information set arrives at node i from node j, the channel filter is updated by

$$\hat{y}_{ij}(k \mid k) = \hat{y}_j(k \mid k-n)$$

$$\mathbf{Y}_{ij}(k \mid k) = \mathbf{Y}_j(k \mid k-n). \tag{337}$$


Alternatively, if the new information at the channel filter of node i is from the local filter at node i, the update becomes

$$\hat{y}_{ij}(k \mid k) = \hat{y}_i(k \mid k)$$

$$\mathbf{Y}_{ij}(k \mid k) = \mathbf{Y}_i(k \mid k). \tag{338}$$

While the normal information filter update of Equation 335 is perfectly valid, implementation in the form of Equations 337 and 338 is much simpler as there is no computation, just the overwriting of the previous state.
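A minimal sketch makes the equivalence explicit (the dictionary-based channel state is an illustrative choice, not part of the original formulation):

```python
def information_gain(chan, y_j, Y_j):
    """Equation 334: new information in the estimate received from node j."""
    return y_j - chan['y'], Y_j - chan['Y']

def channel_update(chan, y_j, Y_j):
    """Equations 335-337: adding the information gain to the channel state is
    algebraically identical to overwriting it with the received estimate."""
    m, M = information_gain(chan, y_j, Y_j)
    chan['y'] = chan['y'] + m              # == y_j
    chan['Y'] = chan['Y'] + M              # == Y_j
```

In an implementation only the overwrite need be performed; Equation 338 is the same overwrite applied when the new information comes from the local filter.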

Together, the channel update equations allow data delayed from neighbouring nodes to be fused in an optimal manner. Combined with the local assimilation equations, this permits burst communication of estimates accumulated over a time period in a manner that ensures no double counting of information. It also means that the channel filter algorithms are robust to intermittent communication failure.

4.4.6 Management of Channel Operations

[Figure 43 appears here: it shows a preprocess stage producing local information i(k); a local prediction stage producing ŷi(k | k − 1); a summation forming the total local information ỹi(k | k); and a channel manager with channel filters on channels P and Q, transmitting the differences ỹi(k | k) − ŷip(k | k − n) and ỹi(k | k) − ŷiq(k | k − n) and assimilating the incoming terms ∆ŷpi(k | k) and ∆ŷqi(k | k) to give ŷi(k | k).]

Figure 43: Algorithmic structure of a decentralised sensing node.

Figure 43 shows a typical sensor node, i, in a decentralised data fusion system. The node generates information measures ŷi(k | k) at a time k given observations made locally and information communicated to the node up to time k. The node implements a local prediction stage to produce information measure predictions ŷi(k | k − 1) at time k given all local and communicated data up to time k − 1 (this prediction stage is often the same on each node and may, for example, correspond to the path predictions of a number of common targets). At this time, local observations produce local information measures i(k). The prediction and local information measures are combined, by simple addition, into a total local information measure ỹi(k | k) at time k. This measure is handed down to the communication channels for subsequent communication to other nodes in the decentralised network. Incoming information from other nodes, ŷji(k | k), is extracted from the appropriate channels and is assimilated with the total local information by simple addition. The result of this fusion is a locally available global information measure ŷi(k | k). The algorithm then repeats recursively.


The communication channels exploit the associativity property of information measures. The channels take the total local information ỹi(k | k) and subtract out all information that has previously been communicated down the channel, ŷij(k | k), thus transmitting only new information obtained by node i since the last communication. Intuitively, communicated data from node i thus consists only of information not previously transmitted to a node j; because common data has already been removed from the communication, node j can simply assimilate incoming information measures by addition. As these channels essentially act as information assimilation processes, they are usually referred to as channel filters.

Channel filters have two important characteristics:

1. Incoming data from remote sensor nodes is assimilated by the local sensor node before being communicated on to subsequent nodes. Therefore, no matter the number of incoming messages, there is only a single outgoing message to each node. Consequently, as the sensor network grows in size, the amount of information sent down any one channel remains constant. This leads to an important conclusion: A decentralised data fusion system can scale up in size indefinitely without any increase in node-to-node communication bandwidth.

2. A channel filter compares what has been previously communicated with the total local information at the node. Thus, if the operation of the channel is suspended, the filter simply accumulates information in an additive fashion. When the channel is re-opened, the total accumulated information in the channel is communicated in one single message. The consequences of this are many-fold; burst transmission of accumulated data can be employed to substantially reduce communication bandwidth requirements (and indeed be used to manage communications); if a node is disconnected from the communications network, it can be re-introduced and information synchronised in one step (the same applies to new nodes entering the system, dynamic network changes and signal jamming). This leads to a second important conclusion: A decentralised data fusion system is robust to changes and loss of communication media.

Together, the node and channel operations define a system which is fully modular in algorithmic construction, is indefinitely scalable in the number of nodes and communication topology, and is survivable in the face of failures of both sensor nodes and communication facilities.

To summarise, this algorithm provides a framework that enables information to flow through the network with a fixed communication overhead, independent of network size. Information flow is achieved without any additional communication over that required for the simple nearest-neighbour fusion algorithm described in Equations 306 and 307. This involves the transmission and reception of one information matrix and one information-state vector to and from each neighbouring node. The algorithm yields estimates equivalent to those of a centralised system with some data delays (due to the time taken for information to travel across the network).


4.4.7 Communication Topologies

The sensor networks considered in preceding sections were restricted to having only a single path between any two nodes. Clearly such networks are not robust, as failure of any one channel or node will divide the network into two non-communicating halves. Networks which are to be made reliable must ensure that there are multiple, redundant paths between sensor nodes.

However, the algorithms described in the previous section will not produce consistent information-state estimates for networks with multiple paths. The reason for this is that the channel filter ŷij(· | ·), which is intended to determine the information common to nodes i and j, does so only by summing up the information communicated through the channel linking these two nodes. The assumption is that there is only one path between two neighbouring nodes, so the information they have in common must have passed through the communication link connecting them. If information is communicated to both nodes i and j directly by some third node then this information will clearly be common information held by both i and j, but will not appear in the channel filter ŷij(· | ·) because it has not passed through the channel between them.

There are three possible solutions to this problem:

• Data Tagging: An obvious solution is to tag information as it passes through the network so that nodes can determine which components are “new” and which have already been communicated through other links. The problem with this method is that it will not scale well with large data sets and with large sensor networks.

• Spanning Trees: A viable solution is to use an algorithm which dynamically selects a minimal spanning tree for the network. This, in effect, would allow redundant links to be artificially “broken” at run-time, allowing tree-based communication algorithms to function in a consistent manner. Links can be artificially broken by setting ŷij(· | ·) = ŷj(· | ·), ensuring no information is communicated across the cut. The distributed Bellman-Ford algorithm is one algorithm which allows such a tree to be constructed in a fully distributed or decentralised manner (it is commonly used for internet routing). Overall system performance depends on reducing data delays by choosing a topology with short paths between nodes. The best topology can then be implemented by artificially breaking the necessary links and using the channel filter algorithm. The network remains scalable, although reconfiguration may be necessary if nodes are to be added or removed. This may, at first, seem to be a disadvantage but, should the network sustain damage to links or nodes, such a strategy will enable the surviving nodes to reconfigure to form a new topology, bringing formerly disused links into service to bypass damaged nodes or links. Such a system will be more robust than a purely tree-connected system, which will be divided into two non-communicating parts by the loss of a node or link.

• Dynamic Determination of Cross-Information: A promising approach in highly dynamic, large-scale networks is to use a local algorithm which attempts to determine how the information in two incoming channels is correlated. One such


method is the covariance intersect filter. This computes the relative alignment between information matrices and produces a conservative local update based on the worst-case correlation between incoming messages. The advantage of this method is that it is fully general and will work in any network topology. The disadvantage is that the estimates arrived at, while consistent, are often too conservative and far from optimal.
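As a sketch of this idea, the following covariance intersect update in information form fuses two possibly correlated information pairs; choosing ω to maximise the determinant of the fused information matrix is one common criterion among several, and the function name is illustrative.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def covariance_intersect(y_a, Y_a, y_b, Y_b):
    """Covariance intersect in information form: any convex combination of
    the two information contributions is consistent, whatever the (unknown)
    cross-correlation between them."""
    # one common criterion: choose omega to maximise the determinant of the
    # fused information matrix
    obj = lambda w: -np.linalg.det(w * Y_a + (1.0 - w) * Y_b)
    w = minimize_scalar(obj, bounds=(0.0, 1.0), method='bounded').x
    return w * y_a + (1.0 - w) * y_b, w * Y_a + (1.0 - w) * Y_b
```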

4.5 Advanced Methods in Decentralised Data Fusion

Essential decentralised data fusion algorithms have been developed and extended in a number of ways. These include model distribution methods [31], in which each local sensor node maintains only a part of the global state-space model; sensor management methods [27], using natural entropic measures of information arising from the log-likelihood definitions intrinsic to the channel filter; and system organisation. These are not addressed further in this course.


5 Making Decisions

Low-level data fusion tasks are concerned with the problem of modeling and combining information. Having obtained this information it is usually necessary to make some kind of decision; to make an estimate of the true state, to perform a control action, or to decide to obtain more information. Higher-level data fusion problems involve significant decision making processes.

A decision model is a function which takes as an argument all the information currently available to the system and which produces an action corresponding to the required decision. Central to the decision model is the idea of loss or utility. A loss function (or equivalently a utility function) provides a means of evaluating different actions, allowing for direct comparison of alternative decision rules. Given a loss function, it becomes possible to model the decision making process and to evaluate different data-fusion strategies. This section begins by defining the concepts of action, loss and utility14, and by describing the decision making process that results. Some important elements of statistical decision theory are then introduced, with particular emphasis on multiple-person decision making problems. This provides some powerful techniques for modeling the organization and control of information in complex data fusion systems.

5.1 Actions, Loss and Utility

Decision making begins with an unknown state of nature x ∈ X. An experiment is made which generates an outcome or observation z ∈ Z. On the basis of this observation an action a ∈ A must be computed. The action may be to make an estimate of the true state, in which case a ∈ X, but in general it is simply to choose one of a specified set of possible actions in A. A decision rule δ is simply a function taking observations into actions; δ : Z → A, so that for every possible observation z ∈ Z, δ defines what action a ∈ A must be taken; a = δ(z).

5.1.1 Utility or Loss

A loss function L(·, ·) (or corresponding utility function U(·, ·)) is defined as a mapping from pairs of states and actions to the real line;

$$L : \mathcal{X} \times \mathcal{A} \longrightarrow \mathbb{R}, \qquad U : \mathcal{X} \times \mathcal{A} \longrightarrow \mathbb{R}.$$

The interpretation of a loss function L(x, a) is that L is the loss incurred in taking the action a when the true state of nature is x. Similarly, the interpretation of a utility function U(x, a) is that U is the gain obtained in taking the action a when the true state of nature is x. In principle, it should be the case that for a fixed state of nature, both

14Engineers, being pessimists by nature, tend to use the term ‘Loss’ to compare different decisions. Economists, being optimists by necessity, use the term ‘Utility’. With the obvious sign change, the terms can be used interchangeably and results derived with respect to one can always be interpreted in terms of the other.


utility and loss will induce a preference ordering on A. To ensure this, loss and utility functions must obey four rules, called the utility or ‘rationality axioms’, which guarantee a preference pattern. We state these rules here in terms of utility, following convention. An analogous set of rules applies to loss functions. For fixed x ∈ X:

1. Given any a1, a2 ∈ A, either U(x, a1) < U(x, a2), U(x, a1) = U(x, a2), or U(x, a1) > U(x, a2). That is, given any two actions we can assign real numbers which indicate our preferred alternative.

2. If U(x, a1) < U(x, a2) and U(x, a2) < U(x, a3) then U(x, a1) < U(x, a3). That is, if we prefer action a2 to a1 and action a3 to a2, then we must prefer action a3 to action a1; the preference ordering is transitive.

3. If U(x, a1) < U(x, a2), then αU(x, a1) + (1 − α)U(x, a3) < αU(x, a2) + (1 − α)U(x, a3), for any 0 < α < 1. That is, if a2 is preferred to a1 then, in a choice between two random situations which are identical except that one involves a1 where the other involves a2 (each occurring with probability α), the situation involving a2 will be preferred.

4. If U(x, a1) < U(x, a2) < U(x, a3), there are numbers 0 < α < 1 and 0 < β < 1 such that αU(x, a1) + (1 − α)U(x, a3) < U(x, a2) and βU(x, a1) + (1 − β)U(x, a3) > U(x, a2). That is, there is no infinitely desirable or infinitely bad reward (no heaven or hell).

A proof that these axioms imply the existence of a utility function can be found in [8]. The first three axioms seem intuitively reasonable. The final axiom is rather more difficult to justify. Indeed, the well-known least-squares loss function does not satisfy this final axiom because it is not bounded from above. However, it turns out that the need for boundedness rarely affects a decision problem. One exception to this, in group decision making problems, will be discussed in later sections.

It is well known, and much discussed by economists, that people do not always act rationally according to the ‘rationality axioms’ and that indeed they normally have considerable problems constructing any kind of consistent utility function by which to judge decisions. The definition of rationality does not really concern us when dealing with a data fusion system that consists of deterministic algorithms; we can always enforce the definition of rationality chosen. What is of importance is the construction of the loss function itself. It is, in principle, important to ensure that the loss (or utility) function employed truly represents the value we place on different decisions. Constructing such a loss function can be quite difficult (see [8] for a systematic approach to the construction of utility functions). Fortunately however, many of the decision rules we will have cause to consider are insensitive to the exact form of the loss function employed provided it is restricted to a certain class; all symmetric loss functions, for example.

5.1.2 Expected Utility or Loss

Except in the case of perfect knowledge, a loss function in the form L(x, a) is not very useful because the true state of nature x ∈ X will not be known with precision, and so the true


loss incurred in taking an action will not be known. Rather, there will be a probability distribution P(x) summarizing all the (probabilistic) information we have about the state at the time of decision making. With this information, one natural method of defining loss is as an expected loss (or Bayes expected loss), which for continuous-valued state variables is simply

$$\beta(a) = E\{L(x, a)\} = \int_{-\infty}^{\infty} L(x, a)\, P(x)\, dx, \tag{339}$$

and for discrete-valued state variables is given by

$$\rho(a) = E\{L(x, a)\} = \sum_{x \in \mathcal{X}} L(x, a)\, P(x). \tag{340}$$

Clearly, the Bayes expected loss simply weights the loss incurred by the probability of occurrence (an average loss). More pessimistic approaches, such as game-theoretic methods, could also be used (the choice depends also on the attitude to risk).
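As a concrete illustration of Equation 340, the following sketch evaluates the expected loss of each action for a hypothetical two-state, three-action problem (all numbers are invented for illustration) and selects the action of minimum Bayes expected loss.

```python
import numpy as np

P = np.array([0.7, 0.3])          # P(x): belief over two states of nature
L = np.array([[0.0, 1.0],         # L[a, x]: loss of action a in state x
              [0.5, 0.5],
              [1.0, 0.0]])

rho = L @ P                       # Equation 340: expected loss of each action
a_star = int(np.argmin(rho))      # the action of minimum Bayes expected loss
print(rho, a_star)                # [0.3 0.5 0.7] -> action 0
```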

5.1.3 Bayesian Decision Theory

It is most likely that probabilistic information concerning x is obtained after taking a number of observations, in the form of a posterior distribution P(x | Zⁿ). An expected utility (or loss) β following an action a may then be defined with respect to a specified posterior distribution in the true state, as

$$\beta(P(x \mid \mathbf{Z}^n), a) = E\{U(x, a)\} = \int_x U(x, a)\, P(x \mid \mathbf{Z}^n)\, dx. \tag{341}$$

The Bayes action $a^*$ is defined as the strategy which maximises the posterior expected utility

$$a^* = \arg\max_a \beta(P(x \mid \mathbf{Z}^n), a). \tag{342}$$

It can be shown that this is equivalent to maximising

$$\int_x U(x, a)\, P(\mathbf{Z}^n \mid x)\, P(x)\, dx. \tag{343}$$

Many well-known decision theoretic principles can be stated in terms of a Bayes action. For example, in estimation problems, the action set is made equal to the set of possible states of nature (A = X). In particular, the MMSE estimate defined by

$$\begin{aligned}
\hat{x} &= \arg\min_{a \in \mathcal{X}} \int_x (x - a)^T (x - a)\, P(x \mid \mathbf{Z}^n)\, dx \\
&= \arg\min_{a \in \mathcal{X}} \int_x L(x, a)\, P(x \mid \mathbf{Z}^n)\, dx, \tag{344}
\end{aligned}$$

is clearly a Bayes action with respect to the squared-error loss function defined by $L(x, \hat{x}) = (x - \hat{x})^T(x - \hat{x})$.
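The following sketch checks this numerically: for an arbitrary example posterior, the action minimising a Monte Carlo estimate of the expected squared-error loss coincides with the posterior mean (the posterior, sample size and candidate grid are illustrative choices).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.5, size=100_000)   # samples from an example P(x | Z^n)

# Equation 344: search for the action minimising expected squared-error loss
candidates = np.linspace(0.0, 4.0, 401)
losses = [np.mean((x - a) ** 2) for a in candidates]
a_star = candidates[int(np.argmin(losses))]

print(a_star, x.mean())                  # both close to the posterior mean 2.0
```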


5.2 Decision Making with Multiple Information Sources

Decision-making with multiple sources of information is fundamentally more complex than the problem of single-source decision making. The essential reason for this is that it is difficult to provide a consistent measure of utility or loss for each decision maker. In particular, two basic problems exist. First, how can the utilities of two different decision makers be compared unless there is a common measure of value? This is known as the problem of inter-personal utility comparison. Second, should decision makers aim to maximize a utility which expresses only local preferences, or should each decision maker evaluate its actions with respect to some common or group utility function? In the literature on the subject, no single agreed solution exists for these two problems. However, for specific decision making problems in which concepts of utility may be simplified and made precise, it is possible to arrive at consistent and useful solutions to both the utility comparison and the group utility problems. These are briefly discussed here.

The problem of making decisions involving multiple information sources has been formulated under a variety of assumptions in the literature [10][26]. We shall consider the following representative cases:

5.2.1 The Super Bayesian

Case 1: Consider a system consisting of N information sources and a single overall decision maker. The task of the decision maker is, in the first instance, to combine probabilistic information from all the sources and then to make decisions based on the global posterior. Given that the global posterior is P(x | Zᵏ), the Bayes group action is given by

$$a^* = \arg\max_a \beta(P(x \mid \mathbf{Z}^k), a) = \arg\max_a E\{U(x, a) \mid \mathbf{Z}^k\}, \tag{345}$$

where U(x, a) is a group utility function. The solution in this case is well defined in terms of classical Bayesian analysis and is often termed the “super Bayesian” approach [44].

5.2.2 Multiple Bayesians

Case 2: Consider a system consisting of N Bayesians, each able to obtain its own probabilistic information which it shares with all the other Bayesians before computing a posterior PDF. From the previous discussion, the posterior PDF obtained by each Bayesian i is identical and is given by $P(x \mid \mathbf{Z}^k_i)$. Each Bayesian is then required to compute an optimal action which is consistent with those of the other Bayesians.

This second case describes a fully decentralised system. It presents a considerable challenge for which a universally prescribed solution is not possible. This is partly because of the lack of generally applicable criteria for determining rationality and optimality, coupled with the question of whether to pursue group or individual optimality. The trivial case


exists where the optimal action $a_i$ at each Bayesian i happens to be the same for all the Bayesians and thus becomes the group action. In general, local decisions are not the same, and so a solution proceeds as follows: each Bayesian i computes an acceptable (admissible) set of actions $\mathcal{A}_i \subseteq \mathcal{A}$ and, if the set $\mathcal{A}^* = \cap_j \mathcal{A}_j$ is non-empty, the group action is selected from the set $\mathcal{A}^*$. An acceptable class of actions can be obtained by maximising

$$\beta(P(x \mid \mathbf{Z}^k), a) = \sum_j w_j\, E\{U_j(x, a) \mid \mathbf{Z}^k\} \tag{346}$$

where $0 \le w_j \le 1$ and $\sum_j w_j = 1$. Equivalently, from the Likelihood Principle (Equation 343), this can be written as a maximisation of

$$\sum_j w_j \int U_j(x, a)\, P(\mathbf{Z}^k_j \mid x)\, P(x)\, dx. \tag{347}$$

Although it is clear that Equation 346 and Equation 347 represent acceptable actions, it is extremely difficult to choose a single optimal action $a^*$ from this acceptable set. The real problem is that the individual utilities $U_i$ cannot, in general, be compared as they may correspond to completely different scales of value. If each local utility function $U_i$ were described on a common scale, then maximisation of Equation 346 or Equation 347 clearly gives the optimal group action. The general problem of comparing local utilities to arrive at an optimal group decision is known to be unsolvable (Arrow's impossibility theorem [3]). However, it is also known that with a few additional restrictions on the form of $U_i$, such comparisons can be made to give sensible results.

In the case where comparison between local utilities can be justified, a solution to the group decision problem may be obtained by maximising Equation 346. A rather more general solution to the group decision problem in this case has been suggested in [44]. The optimal group decision is obtained from

$$a^* = \arg\max_a \left\{ \sum_j w_j \left[ E\{U_j(x, a)\} - c(j) \right]^{\gamma} \right\}^{1/\gamma}, \tag{348}$$

where c(j) is regarded as decision maker j's security level, which plays the role of “safeguarding j's interests”. The weight $w_j$ is as given in Equation 346 and $-\infty \le \gamma \le \infty$. The case γ = 1 gives a solution which maximises Equation 346, while γ = −∞ and γ = ∞ give the so-called Bayesian max-min and min-max solutions respectively [8][44]. The solution arising from using γ = 1 and γ = ∞ can result in an individual decision-maker ending up with a net loss of expected utility, depending on the value of c(j).
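A sketch of Equation 348 follows (the expected-utility table EU, weights w, security levels c and exponent γ are all assumed inputs, and each expected utility is assumed to exceed its security level so that the γ-power is well defined).

```python
import numpy as np

def group_action(EU, w, c, gamma):
    """Equation 348. EU[j, a] is decision maker j's posterior expected utility
    for action a; w are the weights of Equation 346, c the security levels.
    Assumes EU[j, a] > c[j] so the gamma-power is well defined."""
    score = np.sum(w[:, None] * (EU - c[:, None]) ** gamma, axis=0) ** (1.0 / gamma)
    return int(np.argmax(score))

EU = np.array([[4.0, 2.0, 3.0],    # illustrative utilities: 2 decision makers,
               [1.5, 3.5, 3.0]])   # 3 candidate actions
print(group_action(EU, w=np.array([0.5, 0.5]),
                   c=np.array([1.0, 1.0]), gamma=1.0))   # -> action 2
```

With γ = 1 the score reduces to the weighted sum of Equation 346.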

5.2.3 Nash Solutions

One solution to the case when direct comparisons of utility cannot be justified is the well-known Nash product [32]. The Nash product consists simply of a product of individual


utilities. In this case, the absolute scale of each individual utility has no effect on the optimal action chosen. The Nash solution may be obtained from

$$a^* = \arg\max_a \prod_j \left[ E\{U_j(x, a)\} - c(j) \right], \tag{349}$$

in which the value of c(j) can be used to maintain some level of individual optimality. The value of c(j) plays no part in the actual derivation of the Nash solution and is thus arbitrary.
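A corresponding sketch of the Nash product of Equation 349, using the same assumed expected-utility table as above; rescaling any one decision maker's utilities scales the product uniformly over actions, so the maximising action is unchanged.

```python
import numpy as np

def nash_action(EU, c):
    """Equation 349: maximise the product of individual expected-utility
    gains over the security levels c(j)."""
    score = np.prod(EU - c[:, None], axis=0)
    return int(np.argmax(score))

EU = np.array([[4.0, 2.0, 3.0],    # same illustrative utilities as above
               [1.5, 3.5, 3.0]])
print(nash_action(EU, c=np.array([1.0, 1.0])))   # -> action 2
```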

In many situations, it may be that the various decision makers do not come to a single unified opinion on what the global decision should be. Such a situation occurs when the local utility of a global decision is much poorer than the value that would be obtained by “agreeing to disagree”. This leads to a range of further issues in cooperative game playing and bargaining.


References

[1] D.L. Alspach and H.W. Sorenson. Nonlinear Bayesian estimation using Gaussian sum approximations. IEEE Trans. Automatic Control, 17(4):439–448, 1972.

[2] B.D.O. Anderson and J.B. Moore. Optimal Filtering. Prentice Hall, 1979.

[3] K.J. Arrow. Social Choice and Individual Values. Wiley, 1966.

[4] Y. Bar-Shalom. On the track-to-track correlation problem. IEEE Trans. Automatic Control, 25(8):802–807, 1981.

[5] Y. Bar-Shalom. Multi-Target Multi-Sensor Tracking. Artech House, 1990.

[6] Y. Bar-Shalom. Multi-Target Multi-Sensor Tracking II. Artech House, 1990.

[7] Y. Bar-Shalom and T.E. Fortmann. Tracking and Data Association. Academic Press, 1988.

[8] J.O. Berger. Statistical Decision Theory and Bayesian Analysis. Springer Verlag, 1985.

[9] S. Blackman and R. Popoli. Design and Analysis of Modern Tracking Systems. Artech House, 1999.

[10] D. Blackwell and M.A. Girshick. Theory of Games and Statistical Decisions. Dover Publications, 1954.

[11] P.L. Bogler. Shafer-Dempster reasoning with applications to multisensor target identification systems. IEEE Trans. Systems, Man and Cybernetics, 17(6):968–977, 1987.

[12] R.G. Brown and P.Y.C. Hwang. Introduction to Random Signals and Applied Kalman Filtering. John Wiley, 1992.

[13] D.M. Buede. Shafer-Dempster and Bayesian reasoning: a response to ‘Shafer-Dempster reasoning with applications to multisensor target identification systems’. IEEE Trans. Systems, Man and Cybernetics, 18(6):1009–1011, 1988.

[14] D.E. Catlin. Estimation, Control and the Discrete Kalman Filter. Springer Verlag, 1984.

[15] C. Chong, K. Chang, and Y. Bar-Shalom. Joint probabilistic data association in distributed sensor networks. IEEE Trans. Automatic Control, 31(10):889–897, 1986.

[16] T.M. Cover and J.A. Thomas. Elements of Information Theory. John Wiley, 1991.

[17] D. Dubois and H. Prade. Fuzzy Sets and Systems: Theory and Applications. Academic Press, 1980.

[18] H.F. Durrant-Whyte. Introduction to Estimation and The Kalman Filter. Australian Centre for Field Robotics, 2000.

[19] S.Y. Harmon. Sensor data fusion through a blackboard. In Proc. IEEE Int. Conf. Robotics and Automation, page 1449, 1986.


[20] C.J. Harris and I. White. Advances in Command, Control and Communication Systems. IEE Press, 1987.

[21] H.R. Hashemipour, S. Roy, and A.J. Laub. Decentralized structures for parallel Kalman filtering. IEEE Trans. Automatic Control, 33(1):88–93, 1988.

[22] K. Ito and K. Xiong. Gaussian filters for nonlinear filtering problems. IEEE Trans. Automatic Control, 45(5):910–927, 2000.

[23] S. Julier, J. Uhlmann, and H.F. Durrant-Whyte. A new method for the non-linear approximation of means and covariances in filters and estimators. IEEE Trans. Automatic Control, 45(3):477–481, 2000.

[24] T. Kailath. Linear Systems. Prentice Hall, 1980.

[25] P. Lancaster and M. Tismenetsky. The Theory of Matrices. Academic Press, 1985.

[26] R.D. Luce and H. Raiffa. Games and Decisions. Wiley, 1957.

[27] J. Manyika and H.F. Durrant-Whyte. Data Fusion and Sensor Management: An Information-Theoretic Approach. Prentice Hall, 1994.

[28] P.S. Maybeck. Stochastic Models, Estimation and Control, Vol. I. Academic Press, 1979.

[29] V. Mazya and G. Schmidt. On approximate approximations using Gaussian kernels. IMA J. Numerical Anal., 16:13–29, 1996.

[30] R.E. Moore. Interval Analysis. Prentice Hall, 1966.

[31] A.G.O. Mutambara. Decentralised Estimation and Control for Multisensor Systems. CRC Press, 1998.

[32] J.F. Nash. The bargaining problem. Econometrica, 18(2):155–162, 1950.

[33] H.P. Nii. Blackboard systems. AI Magazine, 1986.

[34] A. Papoulis. Probability, Random Variables, and Stochastic Processes, Third Edition. McGraw-Hill, 1991.

[35] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers Inc., 1988.

[36] C.R. Rao. Linear Statistical Inference and its Applications. John Wiley, 1965.

[37] N.R. Sandell, P. Varaiya, M. Athans, and M.G. Safonov. Survey of decentralized control methods for large scale systems. IEEE Trans. Automatic Control, 23(2):108–128, 1978.

[38] H.W. Sorenson. Special issue on applications of Kalman filtering. IEEE Trans. Automatic Control, 28(3), 1983.

[39] J.L. Speyer. Communication and transmission requirements for a decentralized linear-quadratic-Gaussian control problem. IEEE Trans. Automatic Control, 24(2):266–269, 1979.


[40] L.D. Stone, C.A. Barlow, and T.L. Corwin. Bayesian Multiple Target Tracking. Artech House, 1999.

[41] M. Stone. Coordinate-Free Multivariable Statistics. Oxford University Press, 1989.

[42] G. Strang. Linear Algebra and its Applications, Third Edition. Harcourt Brace Jovanovich, 1988.

[43] E.L. Waltz and J. Llinas. Sensor Fusion. Artech House, 1991.

[44] S. Weerahandi and J.V. Zidek. Elements of multi-Bayesian decision theory. The Annals of Statistics, 11(4):1032–1046, 1983.