LETTER Communicated by Sumeetpal Singh

Filtering with State-Observation Examples via Kernel Monte Carlo Filter

Motonobu Kanagawa
SOKENDAI (Graduate University for Advanced Studies), Tokyo 190-8562, Japan, and Institute of Statistical Mathematics, Tokyo 190-8562, Japan

Yu Nishiyama
University of Electro-Communications, Tokyo 182-8585, Japan

Arthur Gretton
Gatsby Computational Neuroscience Unit, University College London, London

Kenji Fukumizu
SOKENDAI (Graduate University for Advanced Studies), Tokyo 190-8562, Japan, and Institute of Statistical Mathematics, Tokyo 190-8562, Japan

This letter addresses the problem of filtering with a state-space model. Standard approaches for filtering assume that a probabilistic model for observations (i.e., the observation model) is given explicitly or at least parametrically. We consider a setting where this assumption is not satisfied; we assume that the knowledge of the observation model is provided only by examples of state-observation pairs. This setting is important and appears when state variables are defined as quantities that are very different from the observations. We propose kernel Monte Carlo filter, a novel filtering method that is focused on this setting. Our approach is based on the framework of kernel mean embeddings, which enables nonparametric posterior inference using the state-observation examples. The proposed method represents state distributions as weighted samples, propagates these samples by sampling, estimates the state posteriors by kernel Bayes' rule, and resamples by kernel herding. In particular, the sampling and resampling procedures are novel in being expressed using kernel mean embeddings, so we theoretically analyze their behaviors. We reveal the following properties, which are similar to those of corresponding procedures in particle methods: the performance of sampling can degrade if the effective sample size of a weighted sample is small, and resampling improves the sampling performance by increasing the effective sample size. We first demonstrate these theoretical findings by synthetic experiments. Then we show the effectiveness of the proposed filter by artificial and real data experiments, which include vision-based mobile robot localization.

Neural Computation 28, 382-444 (2016) © 2016 Massachusetts Institute of Technology. doi:10.1162/NECO_a_00806

1 Introduction

Time-series data are ubiquitous in science and engineering. We often wish to extract useful information from such time-series data. State-space models have been one of the most successful approaches for this purpose (see Durbin & Koopman, 2012). Suppose that we have a sequence of observations y_1, ..., y_t, ..., y_T. A state-space model assumes that for each observation y_t, there is a hidden state x_t that generates y_t and that these states x_1, ..., x_t, ..., x_T follow a Markov process (see Figure 1). Therefore, the state-space model is characterized by two components: (1) an observation model p(y_t|x_t), the conditional distribution of an observation given a state, and (2) a transition model p(x_t|x_{t-1}), the conditional distribution of a state given the previous one.

This letter addresses the problem of filtering, a central topic in the literature on state-space models. The task is to estimate a posterior distribution of the state for each time t, based on observations up to that time:

p(x_t | y_1, ..., y_t),  t = 1, 2, ..., T.   (1.1)

The estimation is to be done online (sequentially) as each y_t is received. For example, a tracking problem can be formulated as filtering, where x_t is the position of an object to be tracked and y_t is a noisy observation of x_t (Ristic, Arulampalam, & Gordon, 2004).

As an inference problem, the starting point of filtering is that the observation model p(y_t|x_t) and the transition model p(x_t|x_{t-1}) are given in some form. The simplest form is a linear-gaussian state-space model, which enables analytic computation of the posteriors; this is the principle of the classical Kalman filter (Kalman, 1960). The filtering problem is more difficult if the observation and transition models involve nonlinear transformation and nongaussian noise. Standard solutions for such situations include extended and unscented Kalman filters (Anderson & Moore, 1979; Julier & Uhlmann, 1997, 2004) and particle filters (Gordon, Salmond, & Smith, 1993; Doucet, Freitas, & Gordon, 2001; Doucet & Johansen, 2011). Particle filters in particular have wide applicability, since they require only that (unnormalized) density values of the observation model are computable and that sampling with the transition model is possible. Thus, particle methods are applicable to basically any nonlinear, nongaussian state-space model and have been used in various fields such as computer vision, robotics, and computational biology (see Doucet et al., 2001).


Figure 1: Graphical representation of a state-space model. y_1, ..., y_T denote observations, and x_1, ..., x_T denote states. The states are hidden and to be estimated from the observations.

However, it can be restrictive even to assume that the observation model p(y_t|x_t) is given as a probabilistic model. An important point is that in practice, we may define the states x_1, ..., x_T arbitrarily as quantities that we wish to estimate from available observations y_1, ..., y_T. Thus, if these quantities are very different from the observations, the observation model may not admit a simple parametric form. For example, in location estimation problems in robotics, states are locations in a map, while observations are sensor data such as camera images and signal strength measurements of a wireless device (Vlassis, Terwijn, & Kröse, 2002; Wolf, Burgard, & Burkhardt, 2005; Ferris, Hähnel, & Fox, 2006). In brain-computer interface applications, states are defined as positions of a device to be manipulated, while observations are brain signals (Pistohl, Ball, Schulze-Bonhage, Aertsen, & Mehring, 2008; Wang, Ji, Miller, & Schalk, 2011). In these applications, it is hard to define the observation model as a probabilistic model in parametric form.

For such applications, where the observation model is very complicated, information about the relation between states and observations is given as examples of state-observation pairs {(X_i, Y_i)}; such examples are often available before conducting filtering in the test phase. For example, one can collect location-sensor examples for the location estimation problems by making use of more expensive sensors than those used for filtering (Quigley, Stavens, Coates, & Thrun, 2010). The brain-computer interface problems also allow us to obtain training samples for the relation between device positions and brain signals (Schalk et al., 2007). However, making use of such examples for learning the observation model is not straightforward. If one relies on a parametric approach, it would require exhaustive efforts to design a parametric model that fits the complicated (true) observation model. Nonparametric methods such as kernel density estimation (Silverman, 1986), on the other hand, suffer from the curse of dimensionality when applied to high-dimensional observations. Moreover, observations may be suitable to be represented as structured (nonvectorial) data, as in the case of images and text. Such situations are not straightforward for either approach, since they usually require that data be given as real vectors.

1.1 Kernel Monte Carlo Filter. In this letter, we propose a filtering method that is focused on situations where the information of the observation model p(y_t|x_t) is given only through the state-observation examples {(X_i, Y_i)}. The proposed method, which we call the kernel Monte Carlo filter (KMCF), is applicable when the following are satisfied:

1. Positive-definite kernels (reproducing kernels) are defined on the states and observations. Roughly, a positive-definite kernel is a similarity function that takes two data points as input and outputs their similarity value.

2. Sampling with the transition model p(x_t|x_{t-1}) is possible. This is the same assumption as for standard particle filters; the probabilistic model can be arbitrarily nonlinear and nongaussian.

The past decades of research on kernel methods have yielded numerous kernels for real vectors and for structured data of various types (Schölkopf & Smola, 2002; Hofmann, Schölkopf, & Smola, 2008). Examples include kernels for images in computer vision (Lazebnik, Schmid, & Ponce, 2006), graph-structured data in bioinformatics (Schölkopf et al., 2004), and genomic sequences (Schaid, 2010a, 2010b). Therefore, we can apply KMCF to such structured data by making use of the kernels developed in these fields. On the other hand, this letter assumes that the transition model is given explicitly; we do not discuss parameter learning (for the case of a parametric transition model), and we assume that parameters are fixed.

KMCF is based on probability representations provided by the framework of kernel mean embeddings, a recent development in the field of kernel methods (Smola, Gretton, Song, & Schölkopf, 2007; Sriperumbudur, Gretton, Fukumizu, Schölkopf, & Lanckriet, 2010; Song, Fukumizu, & Gretton, 2013). In this framework, any probability distribution is represented as a uniquely associated function in a reproducing kernel Hilbert space (RKHS), which is known as a kernel mean. This representation enables us to estimate a distribution of interest by alternatively estimating the corresponding kernel mean. One significant feature of kernel mean embeddings is kernel Bayes' rule (Fukumizu, Song, & Gretton, 2011, 2013), by which KMCF estimates posteriors based on the state-observation examples. Kernel Bayes' rule has the following properties. First, it is theoretically grounded and is proven to get more accurate as the number of the examples increases. Second, it requires neither parametric assumptions nor heuristic approximations for the observation model. Third, similar to other kernel methods in machine learning, kernel Bayes' rule is empirically known to perform well for high-dimensional data when compared to classical nonparametric methods. KMCF inherits these favorable properties.

KMCF sequentially estimates the RKHS representation of the posterior (see equation 1.1) in the form of weighted samples. This estimation consists of three steps: prediction, correction, and resampling. Suppose that we have already obtained an estimate for the posterior of the previous time. In the prediction step, this previous estimate is propagated forward by sampling with the transition model, in the same manner as the sampling procedure of a particle filter. The propagated estimate is then used as a prior for the current state. In the correction step, kernel Bayes' rule is applied to obtain a posterior estimate, using the prior and the state-observation examples {(X_i, Y_i)}_{i=1}^n. Finally, in the resampling step, an approximate version of kernel herding (Chen, Welling, & Smola, 2010) is applied to obtain pseudosamples from the posterior estimate. Kernel herding is a greedy optimization method to generate pseudosamples from a given kernel mean, and it searches for those samples over the entire space X. Our resampling algorithm modifies this and searches for pseudosamples from a finite candidate set of the state samples {X_1, ..., X_n} ⊂ X. The obtained pseudosamples are then used in the prediction step of the next iteration.

While the KMCF algorithm is inspired by particle filters, there are several important differences. First, a weighted sample expression in KMCF is an estimator of the RKHS representation of a probability distribution, while that of a particle filter represents an empirical distribution. This difference can be seen in the fact that the weights of KMCF can take negative values, while the weights of a particle filter are always positive. Second, to estimate a posterior, KMCF uses the state-observation examples {(X_i, Y_i)}_{i=1}^n and does not require the observation model itself, while a particle filter makes use of the observation model to update the weights. In other words, KMCF involves nonparametric estimation of the observation model, while a particle filter does not. Third, KMCF achieves resampling based on kernel herding, while a particle filter uses a standard resampling procedure with an empirical distribution. We use kernel herding because the resampling procedure of particle methods is not appropriate for KMCF, as the weights in KMCF may take negative values.

Since the theory of particle methods cannot be used to justify our approach, we conduct the following theoretical analysis:

• We derive error bounds for the sampling procedure in the prediction step in section 5.1. This justifies the use of the sampling procedure with weighted sample expressions of kernel mean embeddings. The bounds are not trivial, since the weights of kernel mean embeddings can take negative values.

• We discuss how resampling works with kernel mean embeddings (see section 5.2). It improves the estimation accuracy of the subsequent sampling procedure by increasing the effective sample size of an empirical kernel mean. This mechanism is essentially the same as that of a particle filter.

• We provide novel convergence rates of kernel herding when pseudosamples are searched from a finite candidate set (see section 5.3). This justifies our resampling algorithm. This result may be of independent interest to the kernel community, as it describes how kernel herding is often used in practice.

• We show the consistency of the overall filtering procedure of KMCF under certain smoothness assumptions (see section 5.4): KMCF provides consistent posterior estimates as the number of state-observation examples {(X_i, Y_i)}_{i=1}^n increases.

The rest of the letter is organized as follows. In section 2, we review related work. Section 3 is devoted to preliminaries; to make the letter self-contained, we review the theory of kernel mean embeddings. Section 4 presents the kernel Monte Carlo filter, and section 5 shows theoretical results. In section 6, we demonstrate the effectiveness of KMCF by artificial and real-data experiments. The real experiment is on vision-based mobile robot localization, an example of the location estimation problems mentioned above. The appendixes present two methods for reducing the computational costs of KMCF.

This letter expands on a conference paper by Kanagawa, Nishiyama, Gretton, and Fukumizu (2014). It differs from that earlier work in that it introduces and justifies the use of kernel herding for resampling. The resampling step allows us to control the effective sample size of an empirical kernel mean, an important factor that determines the accuracy of the sampling procedure, as in particle methods.

2 Related Work

We consider the following setting. First, the observation model p(y_t|x_t) is not known explicitly or even parametrically. Instead, state-observation examples {(X_i, Y_i)} are available before the test phase. Second, sampling from the transition model p(x_t|x_{t-1}) is possible. Note that standard particle filters cannot be applied to this setting directly, since they require the observation model to be given as a parametric model.

As far as we know, only a few methods can be applied to this setting directly (Vlassis et al., 2002; Ferris et al., 2006). These methods learn the observation model from state-observation examples nonparametrically and then use it to run a particle filter with a transition model. Vlassis et al. (2002) proposed to apply conditional density estimation based on the k-nearest neighbors approach (Stone, 1977) for learning the observation model. A problem here is that conditional density estimation suffers from the curse of dimensionality if observations are high-dimensional (Silverman, 1986). Vlassis et al. (2002) avoided this problem by estimating the conditional density function of a state given an observation and used it as an alternative for the observation model. This heuristic may introduce bias in the estimation, however. Ferris et al. (2006) proposed using gaussian process regression for learning the observation model. This method will perform well if the gaussian noise assumption is satisfied, but it cannot be applied to structured observations.

There exist related but different problem settings from ours. One situation is that examples for state transitions are also given and the transition model is to be learned nonparametrically from these examples. For this setting, there are methods based on kernel mean embeddings (Song, Huang, Smola, & Fukumizu, 2009; Fukumizu et al., 2011, 2013) and gaussian processes (Ko & Fox, 2009; Deisenroth, Huber, & Hanebeck, 2009). The filtering method by Fukumizu et al. (2011, 2013) is in particular closely related to KMCF, as it also uses kernel Bayes' rule. A main difference from KMCF is that it computes forward probabilities by the kernel sum rule (Song et al., 2009, 2013), which nonparametrically learns the transition model from the state transition examples. While the setting is different from ours, we compare KMCF with this method in our experiments as a baseline.

Another related setting is that the observation model itself is given and sampling is possible, but computation of its values is expensive or even impossible. Therefore, ordinary Bayes' rule cannot be used for filtering. To overcome this limitation, Jasra, Singh, Martin, and McCoy (2012) and Calvet and Czellar (2015) proposed applying approximate Bayesian computation (ABC) methods. For each iteration of filtering, these methods generate state-observation pairs from the observation model. Then they pick the pairs whose observations are close to the test observation and regard the states in these pairs as samples from a posterior. Note that these methods are not applicable to our setting, since we do not assume that the observation model is provided. That said, our method may be applied to their setting by generating state-observation examples from the observation model. While such a comparison would be interesting, this letter focuses on comparison among the methods applicable to our setting.

3 Kernel Mean Embeddings of Distributions

Here we briefly review the framework of kernel mean embeddings. For details, we refer to the tutorial papers (Smola et al., 2007; Song et al., 2013).

3.1 Positive-Definite Kernel and RKHS. We begin by introducing positive-definite kernels and reproducing kernel Hilbert spaces, details of which can be found in Schölkopf and Smola (2002), Berlinet and Thomas-Agnan (2004), and Steinwart and Christmann (2008).

Let X be a set and k: X × X → R be a positive-definite (p.d.) kernel.^1

^1 A symmetric kernel k: X × X → R is called positive definite (p.d.) if for all n ∈ N, c_1, ..., c_n ∈ R, and X_1, ..., X_n ∈ X, we have Σ_{i=1}^n Σ_{j=1}^n c_i c_j k(X_i, X_j) ≥ 0.

Any positive-definite kernel is uniquely associated with a reproducing kernel Hilbert space (RKHS) (Aronszajn, 1950). Let H be the RKHS associated with k. The RKHS H is a Hilbert space of functions on X that satisfies the following important properties:

• Feature vector: k(·, x) ∈ H for all x ∈ X.
• Reproducing property: f(x) = ⟨f, k(·, x)⟩_H for all f ∈ H and x ∈ X,

where ⟨·, ·⟩_H denotes the inner product equipped with H and k(·, x) is a function with x fixed. By the reproducing property, we have

k(x, x') = ⟨k(·, x), k(·, x')⟩_H  for all x, x' ∈ X.

Namely, k(x, x') implicitly computes the inner product between the functions k(·, x) and k(·, x'). From this property, k(·, x) can be seen as an implicit representation of x in H. Therefore, k(·, x) is called the feature vector of x, and H the feature space. It is also known that the subspace spanned by the feature vectors {k(·, x) | x ∈ X} is dense in H. This means that any function f in H can be written as the limit of functions of the form f_n = Σ_{i=1}^n c_i k(·, X_i), where c_1, ..., c_n ∈ R and X_1, ..., X_n ∈ X.

For example, positive-definite kernels on the Euclidean space X = R^d include the gaussian kernel k(x, x') = exp(−‖x − x'‖²/2σ²) and the Laplace kernel k(x, x') = exp(−‖x − x'‖_1/σ), where σ > 0 and ‖·‖_1 denotes the ℓ_1 norm. Notably, kernel methods allow X to be a set of structured data such as images, texts, or graphs. In fact, there exist various positive-definite kernels developed for such structured data (Hofmann et al., 2008). Note that the notion of positive-definite kernels is different from smoothing kernels in kernel density estimation (Silverman, 1986); a smoothing kernel does not necessarily define an RKHS.
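To make the definitions above concrete, here is a minimal sketch (our own illustration, assuming NumPy and a gaussian kernel; the function names are ours, not from the letter) that evaluates k(x, x'), builds the Gram matrix (k(X_i, X_j)), and checks numerically that it is positive semidefinite.

```python
import numpy as np

def gauss_kernel(x, xp, sigma=1.0):
    """Gaussian kernel k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    d = x - xp
    return np.exp(-np.dot(d, d) / (2.0 * sigma ** 2))

def kernel_matrix(X, sigma=1.0):
    """Gram matrix (k(X_i, X_j))_{i,j} for points stored as rows of X."""
    n = X.shape[0]
    G = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            G[i, j] = gauss_kernel(X[i], X[j], sigma)
    return G

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 2))          # five points in R^2
    G = kernel_matrix(X, sigma=0.8)
    # Positive definiteness: all eigenvalues of the Gram matrix are nonnegative
    print(np.linalg.eigvalsh(G).min() >= -1e-10)
```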

3.2 Kernel Means. We use the kernel k and the RKHS H to represent probability distributions on X. This is the framework of kernel mean embeddings (Smola et al., 2007). Let X be a measurable space, and let k be measurable and bounded on X.^2 Let P be an arbitrary probability distribution on X. Then the representation of P in H is defined as the mean of the feature vector:

m_P = ∫ k(·, x) dP(x) ∈ H,   (3.1)

which is called the kernel mean of P.

^2 k is bounded on X if sup_{x∈X} k(x, x) < ∞.

If k is characteristic, the kernel mean, equation 3.1, preserves all the information about P: a positive-definite kernel k is defined to be characteristic if the mapping P → m_P ∈ H is one-to-one (Fukumizu, Bach, & Jordan, 2004; Fukumizu, Gretton, Sun, & Schölkopf, 2008; Sriperumbudur et al., 2010). This means that the RKHS is rich enough to distinguish among all distributions. For example, the gaussian and Laplace kernels are characteristic. (For conditions for kernels to be characteristic, see Fukumizu, Sriperumbudur, Gretton, & Schölkopf, 2009, and Sriperumbudur et al., 2010.) We assume henceforth that kernels are characteristic.

An important property of the kernel mean, equation 3.1, is the following: by the reproducing property, we have

⟨m_P, f⟩_H = ∫ f(x) dP(x) = E_{X∼P}[f(X)]  for all f ∈ H,   (3.2)

that is, the expectation of any function in the RKHS can be given by the inner product between the kernel mean and that function.

3.3 Estimation of Kernel Means. Suppose that the distribution P is unknown and that we wish to estimate P from available samples. This can be equivalently done by estimating its kernel mean m_P, since m_P preserves all the information about P.

For example, let X_1, ..., X_n be an independent and identically distributed (i.i.d.) sample from P. Define an estimator of m_P by the empirical mean

m̂_P = (1/n) Σ_{i=1}^n k(·, X_i).

Then this converges to m_P at a rate ‖m̂_P − m_P‖_H = O_p(n^{−1/2}) (Smola et al., 2007), where O_p denotes the asymptotic order in probability and ‖·‖_H is the norm of the RKHS: ‖f‖_H = √⟨f, f⟩_H for all f ∈ H. Note that this rate is independent of the dimensionality of the space X.
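Since everything that follows is expressed through weighted samples of the form Σ_i w_i k(·, X_i), it may help to see that both the evaluation of an empirical kernel mean and the RKHS distance between two such estimates reduce to finite sums of kernel values: ‖Σ_i w_i k(·, X_i) − Σ_j v_j k(·, Z_j)‖²_H = wᵀK_XX w − 2 wᵀK_XZ v + vᵀK_ZZ v. The sketch below is our own illustration with a gaussian kernel.

```python
import numpy as np

def gauss_gram(A, B, sigma=1.0):
    """Gram matrix (k(a_i, b_j)) for the gaussian kernel."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def kernel_mean_eval(x, X, w, sigma=1.0):
    """Evaluate the empirical kernel mean  m(x) = sum_i w_i k(x, X_i)."""
    return gauss_gram(x[None, :], X, sigma)[0] @ w

def rkhs_distance(X, w, Z, v, sigma=1.0):
    """RKHS norm || sum_i w_i k(., X_i) - sum_j v_j k(., Z_j) ||_H."""
    sq = (w @ gauss_gram(X, X, sigma) @ w
          - 2.0 * w @ gauss_gram(X, Z, sigma) @ v
          + v @ gauss_gram(Z, Z, sigma) @ v)
    return np.sqrt(max(sq, 0.0))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 1))       # i.i.d. sample from P = N(0, 1)
    w = np.full(200, 1.0 / 200)         # uniform weights: hat{m}_P
    Z = rng.normal(size=(2000, 1))      # a larger sample as a proxy for m_P
    v = np.full(2000, 1.0 / 2000)
    print(kernel_mean_eval(np.zeros(1), X, w))
    print(rkhs_distance(X, w, Z, v))    # shrinks as the sample sizes grow
```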

Next we explain kernel Bayes' rule (KBR), which serves as a building block of our filtering algorithm. To this end, we introduce two measurable spaces X and Y. Let p(x, y) be a joint probability on the product space X × Y that decomposes as p(x, y) = p(y|x)p(x). Let π(x) be a prior distribution on X. Then the conditional probability p(y|x) and the prior π(x) define the posterior distribution by Bayes' rule:

p^π(x|y) ∝ p(y|x)π(x).

The assumption here is that the conditional probability p(y|x) is unknown. Instead, we are given an i.i.d. sample (X_1, Y_1), ..., (X_n, Y_n) from the joint probability p(x, y). We wish to estimate the posterior p^π(x|y) using the sample. KBR achieves this by estimating the kernel mean of p^π(x|y).

KBR requires that kernels be defined on X and Y. Let k_X and k_Y be kernels on X and Y, respectively. Define the kernel means of the prior π(x) and the posterior p^π(x|y):

m_π = ∫ k_X(·, x) π(x) dx,   m^π_{X|y} = ∫ k_X(·, x) p^π(x|y) dx.

KBR also requires that m_π be expressed as a weighted sample. Let m̂_π = Σ_{j=1}^ℓ γ_j k_X(·, U_j) be a sample expression of m_π, where ℓ ∈ N, γ_1, ..., γ_ℓ ∈ R, and U_1, ..., U_ℓ ∈ X. For example, suppose U_1, ..., U_ℓ are i.i.d. drawn from π(x). Then γ_j = 1/ℓ suffices.

Given the joint sample {(X_i, Y_i)}_{i=1}^n and the empirical prior mean m̂_π, KBR estimates the kernel posterior mean m^π_{X|y} as a weighted sum of the feature vectors:

m̂^π_{X|y} = Σ_{i=1}^n w_i k_X(·, X_i),   (3.3)

where the weights w = (w_1, ..., w_n)^T ∈ R^n are given by algorithm 1. Here diag(v) for v ∈ R^n denotes a diagonal matrix with diagonal entries v. The algorithm takes as input (1) the vectors k_Y = (k_Y(y, Y_1), ..., k_Y(y, Y_n))^T and m_π = (m̂_π(X_1), ..., m̂_π(X_n))^T ∈ R^n, where m̂_π(X_i) = Σ_{j=1}^ℓ γ_j k_X(X_i, U_j); (2) the kernel matrices G_X = (k_X(X_i, X_j)), G_Y = (k_Y(Y_i, Y_j)) ∈ R^{n×n}; and (3) the regularization constants ε, δ > 0. The weight vector w = (w_1, ..., w_n)^T ∈ R^n is obtained by matrix computations involving two regularized matrix inversions. Note that these weights can be negative.

Fukumizu et al. (2013) showed that KBR is a consistent estimator of the kernel posterior mean under certain smoothness assumptions: the estimate, equation 3.3, converges to m^π_{X|y} as the sample size goes to infinity, n → ∞, and m̂_π converges to m_π (with ε, δ → 0 at an appropriate speed). (For details, see Fukumizu et al., 2013; Song et al., 2013.)
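The algorithm 1 box itself is not reproduced in this transcript. As a hedged sketch, the following implements one standard form of the kernel Bayes' rule weight computation with Tikhonov-type regularization (following Fukumizu et al., 2013); the exact matrix expressions in the letter's algorithm 1 may differ in minor details, and the function name is ours.

```python
import numpy as np

def kbr_weights(G_X, G_Y, m_pi, k_y, eps, delta):
    """Kernel Bayes' rule: weights w such that
    hat{m}^pi_{X|y} = sum_i w_i k_X(., X_i).

    G_X, G_Y   : (n, n) kernel matrices on the joint sample
    m_pi       : (n,) vector (hat{m}_pi(X_1), ..., hat{m}_pi(X_n))
    k_y        : (n,) vector (k_Y(y, Y_1), ..., k_Y(y, Y_n))
    eps, delta : regularization constants
    """
    n = G_X.shape[0]
    I = np.eye(n)
    # Weights expressing the prior in terms of the sample X_1, ..., X_n
    mu = n * np.linalg.solve(G_X + n * eps * I, m_pi)
    L = np.diag(mu)
    LG = L @ G_Y
    # Regularized "division" by the prior-weighted Gram matrix
    w = LG @ np.linalg.solve(LG @ LG + delta * I, L @ k_y)
    return w
```

As noted in the text, the returned weights can be negative; eps and delta play the roles of ε and δ above.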

3.4 Decoding from Empirical Kernel Means. In general, as shown above, a kernel mean m_P is estimated as a weighted sum of feature vectors,

m̂_P = Σ_{i=1}^n w_i k(·, X_i),   (3.4)

with samples X_1, ..., X_n ∈ X and (possibly negative) weights w_1, ..., w_n ∈ R. Suppose m̂_P is close to m_P, that is, ‖m̂_P − m_P‖_H is small. Then m̂_P is supposed to have accurate information about P, as m_P preserves all the information of P.


How can we decode the information of P from m̂_P? The empirical kernel mean, equation 3.4, has the following property, which is due to the reproducing property of the kernel:

⟨m̂_P, f⟩_H = Σ_{i=1}^n w_i f(X_i)  for all f ∈ H.   (3.5)

Namely, the weighted average of any function in the RKHS is equal to the inner product between the empirical kernel mean and that function. This is analogous to the property 3.2 of the population kernel mean m_P. Let f be any function in H. From these properties, equations 3.2 and 3.5, we have

| E_{X∼P}[f(X)] − Σ_{i=1}^n w_i f(X_i) | = |⟨m_P − m̂_P, f⟩_H| ≤ ‖m_P − m̂_P‖_H ‖f‖_H,

where we used the Cauchy-Schwartz inequality. Therefore, the left-hand side will be close to 0 if the error ‖m_P − m̂_P‖_H is small. This shows that the expectation of f can be estimated by the weighted average Σ_{i=1}^n w_i f(X_i). Note that here f is a function in the RKHS, but the same can also be shown for functions outside the RKHS under certain assumptions (Kanagawa & Fukumizu, 2014). In this way, the estimator of the form 3.4 provides estimators of moments, probability masses on sets, and the density function (if it exists). We explain this in the context of state-space models in section 4.4.
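As a small, self-contained illustration of this decoding property (our own example, not from the letter): the expectation of a function is estimated by the weighted average over the sample points carrying the kernel mean estimate.

```python
import numpy as np

def expectation_from_kernel_mean(f, X, w):
    """Estimate E_{X~P}[f(X)] by sum_i w_i f(X_i), where (w_i, X_i)
    represents an empirical kernel mean  hat{m}_P = sum_i w_i k(., X_i)."""
    return sum(wi * f(xi) for wi, xi in zip(w, X))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=500)             # sample from N(0, 1)
    w = np.full(500, 1.0 / 500)          # uniform weights
    print(expectation_from_kernel_mean(lambda x: x ** 2, X, w))  # approx. 1.0
```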

3.5 Kernel Herding. Here we explain kernel herding (Chen et al., 2010), another building block of the proposed filter. Suppose the kernel mean m_P is known. We wish to generate samples x_1, x_2, ..., x_ℓ ∈ X such that the empirical mean m̂_P = (1/ℓ) Σ_{i=1}^ℓ k(·, x_i) is close to m_P, that is, ‖m̂_P − m_P‖_H is small. This should be done only using m_P. Kernel herding achieves this by greedy optimization using the following update equations:

x_1 = argmax_{x∈X} m_P(x),   (3.6)

x_ℓ = argmax_{x∈X} [ m_P(x) − (1/ℓ) Σ_{i=1}^{ℓ−1} k(x, x_i) ]  (ℓ ≥ 2),   (3.7)

where m_P(x) denotes the evaluation of m_P at x (recall that m_P is a function in H).

An intuitive interpretation of this procedure can be given if there is a constant R > 0 such that k(x, x) = R for all x ∈ X (e.g., R = 1 if k is gaussian). Suppose that x_1, ..., x_{ℓ−1} are already calculated. In this case, it can be shown that x_ℓ in equation 3.7 is the minimizer of

E_ℓ = ‖ m_P − (1/ℓ) Σ_{i=1}^ℓ k(·, x_i) ‖_H.   (3.8)

Thus, kernel herding performs greedy minimization of the distance between m_P and the empirical kernel mean m̂_P = (1/ℓ) Σ_{i=1}^ℓ k(·, x_i).

It can be shown that the error E_ℓ of equation 3.8 decreases at a rate at least O(ℓ^{−1/2}) under the assumption that k is bounded (Bach, Lacoste-Julien, & Obozinski, 2012). In other words, the herding samples x_1, ..., x_ℓ provide a convergent approximation of m_P. In this sense, kernel herding can be seen as a (pseudo) sampling method. Note that m_P itself can be an empirical kernel mean of the form 3.4. These properties are important for our resampling algorithm developed in section 4.2.

It should be noted that E_ℓ decreases at a faster rate O(ℓ^{−1}) under a certain assumption (Chen et al., 2010); this is much faster than the rate O(ℓ^{−1/2}) of i.i.d. samples. Unfortunately, this assumption holds only when H is finite dimensional (Bach et al., 2012), and therefore the fast rate of O(ℓ^{−1}) has not been guaranteed for infinite-dimensional cases. Nevertheless, this fast rate motivates the use of kernel herding in the data reduction method in section C.2 in appendix C (we will use kernel herding for two different purposes).
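The updates, equations 3.6 and 3.7, can be sketched in a few lines. The code below is our own illustration with a gaussian kernel; for simplicity, it replaces the search over the entire space X by a search over a user-supplied finite candidate set, which is exactly the modification adopted for resampling in section 4.2.

```python
import numpy as np

def gauss_gram(A, B, sigma=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def kernel_herding(X, w, candidates, num_samples, sigma=1.0):
    """Greedily pick points x_1, ..., x_L (from `candidates`) so that
    (1/L) sum_l k(., x_l) approximates m_P = sum_i w_i k(., X_i)."""
    m_vals = gauss_gram(candidates, X, sigma) @ w   # m_P evaluated at candidates
    K_cc = gauss_gram(candidates, candidates, sigma)
    picked = []
    running = np.zeros(len(candidates))             # sum_{i<l} k(c, x_i) per candidate
    for l in range(1, num_samples + 1):
        scores = m_vals - running / l               # eq. 3.7 (eq. 3.6 when l = 1)
        j = int(np.argmax(scores))
        picked.append(j)
        running += K_cc[:, j]
    return candidates[picked]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 1))                   # weighted sample carrying m_P
    w = np.full(300, 1.0 / 300)
    grid = np.linspace(-3, 3, 200)[:, None]         # candidate set standing in for X
    print(kernel_herding(X, w, grid, num_samples=20)[:5])
```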

4 Kernel Monte Carlo Filter

In this section, we present our kernel Monte Carlo filter (KMCF). First, we define notation and review the problem setting in section 4.1. We then describe the algorithm of KMCF in section 4.2. We discuss implementation issues such as hyperparameter selection and computational cost in section 4.3. We explain how to decode the information on the posteriors from the estimated kernel means in section 4.4.

4.1 Notation and Problem Setup. Here we formally define the setup explained in section 1. The notation is summarized in Table 1.

We consider a state-space model (see Figure 1). Let X and Y be measurable spaces, which serve as a state space and an observation space, respectively. Let x_1, ..., x_t, ..., x_T ∈ X be a sequence of hidden states, which follow a Markov process. Let p(x_t|x_{t−1}) denote the transition model that defines this Markov process. Let y_1, ..., y_t, ..., y_T ∈ Y be a sequence of observations. Each observation y_t is assumed to be generated from an observation model p(y_t|x_t), conditioned on the corresponding state x_t. We use the abbreviation y_{1:t} = {y_1, ..., y_t}.

Table 1: Notation.

  X                       State space
  Y                       Observation space
  x_t ∈ X                 State at time t
  y_t ∈ Y                 Observation at time t
  p(y_t|x_t)              Observation model
  p(x_t|x_{t−1})          Transition model
  {(X_i, Y_i)}_{i=1}^n    State-observation examples
  k_X                     Positive-definite kernel on X
  k_Y                     Positive-definite kernel on Y
  H_X                     RKHS associated with k_X
  H_Y                     RKHS associated with k_Y

We consider a filtering problem of estimating the posterior distribution p(x_t|y_{1:t}) for each time t = 1, ..., T. The estimation is to be done online, as each y_t is given. Specifically, we consider the following setting (see also section 1):

1. The observation model p(y_t|x_t) is not known explicitly or even parametrically. Instead, we are given examples of state-observation pairs {(X_i, Y_i)}_{i=1}^n ⊂ X × Y prior to the test phase. The observation model is also assumed to be time homogeneous.

2. Sampling from the transition model p(x_t|x_{t−1}) is possible. Its probabilistic model can be an arbitrary nonlinear, nongaussian distribution, as for standard particle filters. It can further depend on time. For example, control input can be included in the transition model as p(x_t|x_{t−1}) = p(x_t|x_{t−1}, u_t), where u_t denotes control input provided by a user at time t.

Let k_X: X × X → R and k_Y: Y × Y → R be positive-definite kernels on X and Y, respectively. Denote by H_X and H_Y their respective RKHSs. We address the above filtering problem by estimating the kernel means of the posteriors:

m_{x_t|y_{1:t}} = ∫ k_X(·, x_t) p(x_t|y_{1:t}) dx_t ∈ H_X  (t = 1, ..., T).   (4.1)

These preserve all the information of the corresponding posteriors if the kernels are characteristic (see section 3.2). Therefore, the resulting estimates of these kernel means provide us the information of the posteriors, as explained in section 4.4.

Figure 2: One iteration of KMCF. Here X_1, ..., X_8 and Y_1, ..., Y_8 denote states and observations, respectively, in the state-observation examples {(X_i, Y_i)}_{i=1}^n (suppose n = 8). 1. Prediction step: the kernel mean of the prior, equation 4.5, is estimated by sampling with the transition model p(x_t|x_{t−1}). 2. Correction step: the kernel mean of the posterior, equation 4.1, is estimated by applying kernel Bayes' rule (see algorithm 1). The estimation makes use of the information of the prior (expressed as m̂_π = (m̂_{x_t|y_{1:t−1}}(X_i)) ∈ R^8) as well as that of a new observation y_t (expressed as k_Y = (k_Y(y_t, Y_i)) ∈ R^8). The resulting estimate, equation 4.6, is expressed as a weighted sample {(w_{t,i}, X_i)}_{i=1}^n. Note that the weights may be negative. 3. Resampling step: samples associated with small weights are eliminated, and those with large weights are replicated by applying kernel herding (see algorithm 2). The resulting samples provide an empirical kernel mean, equation 4.7, which will be used in the next iteration.

4.2 Algorithm. KMCF iterates three steps of prediction, correction, and resampling for each time t. Suppose that we have just finished the iteration at time t − 1. Then, as shown later, the resampling step yields the following estimator of equation 4.1 at time t − 1:

m̄_{x_{t−1}|y_{1:t−1}} = (1/n) Σ_{i=1}^n k_X(·, X̄_{t−1,i}),   (4.2)

where X̄_{t−1,1}, ..., X̄_{t−1,n} ∈ X. We show one iteration of KMCF that estimates the kernel mean, equation 4.1, at time t (see also Figure 2).

4.2.1 Prediction Step. The prediction step is as follows. We generate a sample from the transition model for each X̄_{t−1,i} in equation 4.2:

X_{t,i} ∼ p(x_t | x_{t−1} = X̄_{t−1,i})  (i = 1, ..., n).   (4.3)

We then specify a new empirical kernel mean,

m̂_{x_t|y_{1:t−1}} = (1/n) Σ_{i=1}^n k_X(·, X_{t,i}).   (4.4)

This is an estimator of the following kernel mean of the prior:

m_{x_t|y_{1:t−1}} = ∫ k_X(·, x_t) p(x_t|y_{1:t−1}) dx_t ∈ H_X,   (4.5)

where

p(x_t|y_{1:t−1}) = ∫ p(x_t|x_{t−1}) p(x_{t−1}|y_{1:t−1}) dx_{t−1}

is the prior distribution of the current state x_t. Thus, equation 4.4 serves as a prior for the subsequent posterior estimation.

In section 5, we theoretically analyze this sampling procedure in detail and provide justification of equation 4.4 as an estimator of the kernel mean, equation 4.5. We emphasize here that such an analysis is necessary even though the sampling procedure is similar to that of a particle filter: the theory of particle methods does not provide a theoretical justification of equation 4.4 as a kernel mean estimator, since it deals with probabilities as empirical distributions.
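A sketch of the prediction step, equations 4.3 and 4.4, is given below (our own illustration; the random-walk transition model in the usage example is a hypothetical stand-in for p(x_t|x_{t−1})). The propagated points define the prior estimate with uniform weights 1/n.

```python
import numpy as np

def prediction_step(X_resampled, transition_sampler, rng):
    """Propagate resampled states through the transition model (eq. 4.3).
    The returned points X_prior define the prior estimate
    hat{m}_{x_t|y_{1:t-1}} = (1/n) sum_i k_X(., X_prior[i])  (eq. 4.4)."""
    return np.array([transition_sampler(x, rng) for x in X_resampled])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical random-walk transition: x_t = x_{t-1} + gaussian noise
    transition = lambda x, rng: x + 0.1 * rng.normal(size=x.shape)
    X_bar = rng.normal(size=(100, 2))        # resampled states at time t-1
    X_prior = prediction_step(X_bar, transition, rng)
    print(X_prior.shape)                     # (100, 2)
```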

4.2.2 Correction Step. This step estimates the kernel mean, equation 4.1, of the posterior by using kernel Bayes' rule (see algorithm 1) in section 3.3. This makes use of the new observation y_t, the state-observation examples {(X_i, Y_i)}_{i=1}^n, and the estimate, equation 4.4, of the prior.

The input of algorithm 1 consists of (1) the vectors

k_Y = (k_Y(y_t, Y_1), ..., k_Y(y_t, Y_n))^T ∈ R^n,

m̂_π = (m̂_{x_t|y_{1:t−1}}(X_1), ..., m̂_{x_t|y_{1:t−1}}(X_n))^T = ( (1/n) Σ_{i=1}^n k_X(X_q, X_{t,i}) )_{q=1}^n ∈ R^n,

which are interpreted as expressions of y_t and m̂_{x_t|y_{1:t−1}} using the sample {(X_i, Y_i)}_{i=1}^n; (2) the kernel matrices G_X = (k_X(X_i, X_j)), G_Y = (k_Y(Y_i, Y_j)) ∈ R^{n×n}; and (3) the regularization constants ε, δ > 0. These constants ε, δ, as well as the kernels k_X, k_Y, are hyperparameters of KMCF (we discuss how to choose these parameters later).


Algorithm 1 outputs a weight vector w = (w_1, ..., w_n)^T ∈ R^n. Normalizing these weights, w_t = w / Σ_{i=1}^n w_i, we obtain an estimator of equation 4.1^3 as

m̂_{x_t|y_{1:t}} = Σ_{i=1}^n w_{t,i} k_X(·, X_i).   (4.6)

^3 For this normalization procedure, see the discussion in section 4.3.

The apparent difference from a particle filter is that the posterior (kernel mean) estimator, equation 4.6, is expressed in terms of the samples X_1, ..., X_n in the training sample {(X_i, Y_i)}_{i=1}^n, not in terms of the samples from the prior, equation 4.4. This requires that the training samples X_1, ..., X_n cover the support of the posterior p(x_t|y_{1:t}) sufficiently well. If this does not hold, we cannot expect good performance for the posterior estimate. Note that this is also true for any method that deals with the setting of this letter: poverty of training samples in a certain region means that we do not have any information about the observation model p(y_t|x_t) in that region.

4.2.3 Resampling Step. This step applies the update equations 3.6 and 3.7 of kernel herding in section 3.5 to the estimate, equation 4.6. This is to obtain samples X̄_{t,1}, ..., X̄_{t,n} such that

m̄_{x_t|y_{1:t}} = (1/n) Σ_{i=1}^n k_X(·, X̄_{t,i})   (4.7)

is close to equation 4.6 in the RKHS. Our theoretical analysis in section 5 shows that such a procedure can reduce the error of the prediction step at time t + 1.

The procedure is summarized in algorithm 2. Specifically, we generate each X̄_{t,i} by searching for the solution of the optimization problem in equations 3.6 and 3.7 from the finite set of samples X_1, ..., X_n in equation 4.6. We allow repetitions in X̄_{t,1}, ..., X̄_{t,n}. We can expect that the resulting equation 4.7 is close to equation 4.6 in the RKHS if the samples X_1, ..., X_n cover the support of the posterior p(x_t|y_{1:t}) sufficiently. This is verified by the theoretical analysis of section 5.3.

Here, searching for the solutions from a finite set reduces the computational cost of kernel herding. It is possible to search from the entire space X if we have sufficient time or if the sample size n is small enough; it depends on the application and the available computational resources. We also note that the number of resampled points need not be n; this depends on how accurately these samples approximate equation 4.6. Thus, a smaller number of samples may be sufficient. In this case, we can reduce the computational cost of resampling, as discussed in section 5.2.

The aim of our resampling step is similar to that of the resampling step of a particle filter (see Doucet & Johansen, 2011). Intuitively, the aim is to eliminate samples with very small weights and replicate those with large weights (see Figures 2 and 3). In particle methods, this is realized by generating samples from the empirical distribution defined by a weighted sample (therefore this procedure is called resampling). Our resampling step is a realization of such a procedure in terms of the kernel mean embedding: we generate samples X̄_{t,1}, ..., X̄_{t,n} from the empirical kernel mean, equation 4.6.

Note that the resampling algorithm of particle methods is not appropriate for use with kernel mean embeddings. This is because it assumes that the weights are positive, but our weights in equation 4.6 can be negative, as this equation is a kernel mean estimator. One may apply the resampling algorithm of particle methods by first truncating the samples with negative weights. However, there is no guarantee that samples obtained by this heuristic produce a good approximation of equation 4.6 as a kernel mean, as shown by the experiments in section 6.1. In this sense, the use of kernel herding is more natural, since it generates samples that approximate a kernel mean.
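Algorithm 2 is, in essence, the herding update of section 3.5 with the candidate set fixed to the training states X_1, ..., X_n and the target set to the weighted estimate of equation 4.6. The sketch below is our own illustration under the same gaussian-kernel assumption as before; repeated indices correspond to replicated samples.

```python
import numpy as np

def gauss_gram(A, B, sigma=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def herding_resample(X, w, num_samples, sigma=1.0):
    """Kernel herding restricted to the finite candidate set {X_1, ..., X_n}.
    Returns indices into X; repetitions are allowed, so heavily weighted
    points tend to be replicated and points with tiny weights are dropped."""
    G = gauss_gram(X, X, sigma)
    m_vals = G @ w                       # estimate (4.6) evaluated at each X_i
    running = np.zeros(len(X))
    idx = []
    for l in range(1, num_samples + 1):
        j = int(np.argmax(m_vals - running / l))
        idx.append(j)
        running += G[:, j]
    return np.array(idx)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 1))
    w = 1.0 / 50 + 0.03 * rng.normal(size=50)   # KBR-like weights; some may be negative
    w /= w.sum()
    idx = herding_resample(X, w, num_samples=50)
    X_bar = X[idx]                              # equal-weight sample for eq. 4.7
    print(np.bincount(idx, minlength=50))       # shows replications and omissions
```

Because the selection works directly with the kernel mean estimate, negative weights pose no problem, in contrast to the truncation heuristic discussed above.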

4.2.4 Overall Algorithm. We summarize the overall procedure of KMCF in algorithm 3, where p_init denotes a prior distribution for the initial state x_1. For each time t, KMCF takes as input an observation y_t and outputs a weight vector w_t = (w_{t,1}, ..., w_{t,n})^T ∈ R^n. Combined with the samples X_1, ..., X_n in the state-observation examples {(X_i, Y_i)}_{i=1}^n, these weights provide an estimator, equation 4.6, of the kernel mean of the posterior, equation 4.1.

We first compute the kernel matrices G_X, G_Y (lines 4-5), which are used in algorithm 1 of kernel Bayes' rule (line 15). For t = 1, we generate an i.i.d. sample X_{1,1}, ..., X_{1,n} from the initial distribution p_init (line 8), which provides an estimator of the prior corresponding to equation 4.4. Line 10 is the resampling step at time t − 1, and line 11 is the prediction step at time t. Lines 13 to 16 correspond to the correction step.
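Putting the pieces together, the following self-contained sketch mirrors the structure of algorithm 3 on a toy one-dimensional model. It is our own illustration: the gaussian kernels, the toy transition and observation models, the hyperparameter values, and all function names are assumptions, and the kernel Bayes' rule step uses the standard regularized formulas rather than the letter's exact algorithm box. Resampling is applied at every step here for brevity, whereas section 5.2 recommends applying it only when the effective sample size is small.

```python
import numpy as np

def gram(A, B, sigma):
    return np.exp(-((A[:, None] - B[None, :]) ** 2) / (2.0 * sigma ** 2))

def kbr_weights(G_X, G_Y, m_pi, k_y, eps, delta):
    n = G_X.shape[0]; I = np.eye(n)
    mu = n * np.linalg.solve(G_X + n * eps * I, m_pi)   # prior weights on X_train
    LG = np.diag(mu) @ G_Y
    return LG @ np.linalg.solve(LG @ LG + delta * I, mu * k_y)

def herding_resample(G, w, L):
    m_vals = G @ w; running = np.zeros(len(w)); idx = []
    for l in range(1, L + 1):
        j = int(np.argmax(m_vals - running / l)); idx.append(j); running += G[:, j]
    return np.array(idx)

def kmcf(X_train, Y_train, y_seq, transition, p_init, sx, sy, eps, delta, rng):
    """Kernel Monte Carlo filter (sketch of algorithm 3).
    Returns the weight vectors w_t defining eq. 4.6 for each time step."""
    n = len(X_train)
    G_X, G_Y = gram(X_train, X_train, sx), gram(Y_train, Y_train, sy)
    weights = []
    for t, y in enumerate(y_seq):
        if t == 0:
            X_pred = p_init(n, rng)                                 # prior sample for x_1
        else:
            X_bar = X_train[herding_resample(G_X, weights[-1], n)]  # resampling step
            X_pred = transition(X_bar, rng)                         # prediction (eq. 4.3)
        m_pi = gram(X_train, X_pred, sx).mean(axis=1)               # prior vector (eq. 4.4)
        k_y = gram(np.array([y]), Y_train, sy)[0]
        w = kbr_weights(G_X, G_Y, m_pi, k_y, eps, delta)            # correction step
        weights.append(w / w.sum())                                 # normalization
    return np.array(weights)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy model: x_t = 0.9 x_{t-1} + noise,  y_t = x_t + noise
    X_train = rng.uniform(-3, 3, size=300)
    Y_train = X_train + 0.2 * rng.normal(size=300)
    transition = lambda X, rng: 0.9 * X + 0.3 * rng.normal(size=X.shape)
    p_init = lambda n, rng: rng.normal(size=n)
    xs, ys, x = [], [], 0.0
    for _ in range(30):
        x = 0.9 * x + 0.3 * rng.normal(); xs.append(x); ys.append(x + 0.2 * rng.normal())
    W = kmcf(X_train, Y_train, np.array(ys), transition, p_init,
             sx=0.5, sy=0.5, eps=0.1, delta=0.1, rng=rng)
    print(np.c_[xs, W @ X_train][:5])   # true state vs. posterior mean estimate
```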


4.3 Discussion. The estimation accuracy of KMCF can depend on several factors in practice.

4.3.1 Training Samples. We first note that the training samples {(X_i, Y_i)}_{i=1}^n should provide the information concerning the observation model p(y_t|x_t). For example, {(X_i, Y_i)}_{i=1}^n may be an i.i.d. sample from a joint distribution p(x, y) on X × Y, which decomposes as p(x, y) = p(y|x)p(x). Here p(y|x) is the observation model and p(x) is some distribution on X. The support of p(x) should cover the region where states x_1, ..., x_T may pass in the test phase, as discussed in section 4.2. For example, this is satisfied when the state space X is compact and the support of p(x) is the entire X.

Note that the training samples {(X_i, Y_i)}_{i=1}^n can also be non-i.i.d. in practice. For example, we may deterministically select X_1, ..., X_n so that they cover the region of interest. In location estimation problems in robotics, for instance, we may collect location-sensor examples {(X_i, Y_i)}_{i=1}^n so that the locations X_1, ..., X_n cover the region where location estimation is to be conducted (Quigley et al., 2010).

4.3.2 Hyperparameters. As in other kernel methods in general, the performance of KMCF depends on the choice of its hyperparameters, which are the kernels k_X and k_Y (or parameters in the kernels, e.g., the bandwidth of the gaussian kernel) and the regularization constants δ, ε > 0. We need to define these hyperparameters based on the joint sample {(X_i, Y_i)}_{i=1}^n before running the algorithm on the test data y_1, ..., y_T. This can be done by cross-validation. Suppose that {(X_i, Y_i)}_{i=1}^n is given as a sequence from the state-space model. We can then apply two-fold cross-validation by dividing the sequence into two subsequences. If {(X_i, Y_i)}_{i=1}^n is not a sequence, we can rely on the cross-validation procedure for kernel Bayes' rule (see section 4.2 of Fukumizu et al., 2013).

4.3.3 Normalization of Weights. We found in our preliminary experiments that normalization of the weights (see line 16, algorithm 3) is beneficial to the filtering performance. This may be justified by the following discussion about a kernel mean estimator in general. Let us consider a consistent kernel mean estimator m̂_P = Σ_{i=1}^n w_i k(·, X_i) such that lim_{n→∞} ‖m̂_P − m_P‖_H = 0. Then we can show that the sum of the weights converges to 1, lim_{n→∞} Σ_{i=1}^n w_i = 1, under certain assumptions (Kanagawa & Fukumizu, 2014). This could be explained as follows. Recall that the weighted average Σ_{i=1}^n w_i f(X_i) of a function f is an estimator of the expectation ∫ f(x) dP(x). Let f be a function that takes the value 1 for any input: f(x) = 1 for all x ∈ X. Then we have Σ_{i=1}^n w_i f(X_i) = Σ_{i=1}^n w_i and ∫ f(x) dP(x) = 1. Therefore Σ_{i=1}^n w_i is an estimator of 1. In other words, if the error ‖m̂_P − m_P‖_H is small, then the sum of the weights Σ_{i=1}^n w_i should be close to 1. Conversely, if the sum of the weights is far from 1, it suggests that the estimate m̂_P is not accurate. Based on this theoretical observation, we suppose that normalization of the weights (this makes the sum equal to 1) results in a better estimate.

4.3.4 Time Complexity. For each time t, the naive implementation of algorithm 3 requires a time complexity of O(n³) for the size n of the joint sample {(X_i, Y_i)}_{i=1}^n. This comes from algorithm 1 in line 15 (kernel Bayes' rule) and algorithm 2 in line 10 (resampling). The complexity O(n³) of algorithm 1 is due to the matrix inversions. Note that one of the inversions, (G_X + nεI_n)^{−1}, can be computed before the test phase, as it does not involve the test data. Algorithm 2 also has complexity O(n³). In section 5.2, we will explain how this cost can be reduced to O(ℓn²) by generating only ℓ < n samples by resampling.

4.3.5 Speeding-Up Methods. In appendix C, we describe two methods for reducing the computational costs of KMCF, both of which only need to be applied prior to the test phase. The first is a low-rank approximation of the kernel matrices G_X, G_Y, which reduces the complexity to O(nr²), where r is the rank of the low-rank matrices. Low-rank approximation works well in practice, since the eigenvalues of a kernel matrix often decay very rapidly. Indeed, this has been theoretically shown for some cases (see Widom, 1963, 1964, and discussions in Bach & Jordan, 2002). The second is a data-reduction method based on kernel herding, which efficiently selects joint subsamples from the training set {(X_i, Y_i)}_{i=1}^n. Algorithm 3 is then applied based only on those subsamples. The resulting complexity is thus O(r³), where r is the number of subsamples. This method is motivated by the fast convergence rate of kernel herding (Chen et al., 2010).

Both methods require the number r to be chosen, which is either the rank for low-rank approximation or the number of subsamples in data reduction. This determines the trade-off between accuracy and computational time. In practice, there are two ways of selecting the number r. By regarding r as a hyperparameter of KMCF, we can select it by cross-validation, or we can choose r by comparing the resulting approximation error, which is measured in a matrix norm for low-rank approximation and in an RKHS norm for the subsampling method. (For details, see appendix C.)

4.3.6 Transfer Learning Setting. We assumed that the observation model in the test phase is the same as for the training samples. However, this might not hold in some situations. For example, in the vision-based localization problem, the illumination conditions for the test and training phases might be different (e.g., the test is done at night, while the training samples are collected in the morning). Without taking into account such a significant change in the observation model, KMCF would not perform well in practice.

This problem could be addressed by exploiting the framework of transfer learning (Pan & Yang, 2010). This framework aims at situations where the probability distribution that generates test data is different from that of the training samples. The main assumption is that there exist a small number of examples from the test distribution. Transfer learning then provides a way of combining such test examples and abundant training samples, thereby improving the test performance. The application of transfer learning in our setting remains a topic for future research.

4.4 Estimation of Posterior Statistics. By algorithm 3, we obtain the estimates of the kernel means of the posteriors, equation 4.1, as

m̂_{x_t|y_{1:t}} = Σ_{i=1}^n w_{t,i} k_X(·, X_i)  (t = 1, ..., T).   (4.8)

These contain the information on the posteriors p(x_t|y_{1:t}) (see sections 3.2 and 3.4). We now show how to estimate statistics of the posteriors using these estimates. For ease of presentation, we consider the case X = R^d. Theoretical arguments to justify these operations are provided by Kanagawa and Fukumizu (2014).

4.4.1 Mean and Covariance. Consider the posterior mean ∫ x_t p(x_t|y_{1:t}) dx_t ∈ R^d and the posterior (uncentered) covariance ∫ x_t x_t^T p(x_t|y_{1:t}) dx_t ∈ R^{d×d}. These quantities can be estimated as

Σ_{i=1}^n w_{t,i} X_i  (mean),   Σ_{i=1}^n w_{t,i} X_i X_i^T  (covariance).

4.4.2 Probability Mass. Let A ⊂ X be a measurable set with smooth boundary. Define the indicator function I_A(x) by I_A(x) = 1 for x ∈ A and I_A(x) = 0 otherwise. Consider the probability mass ∫ I_A(x) p(x_t|y_{1:t}) dx_t. This can be estimated as Σ_{i=1}^n w_{t,i} I_A(X_i).

4.4.3 Density. Suppose p(x_t|y_{1:t}) has a density function. Let J(x) be a smoothing kernel satisfying ∫ J(x) dx = 1 and J(x) ≥ 0. Let h > 0 and define J_h(x) = (1/h^d) J(x/h). Then the density of p(x_t|y_{1:t}) can be estimated as

p̂(x_t|y_{1:t}) = Σ_{i=1}^n w_{t,i} J_h(x_t − X_i),   (4.9)

with an appropriate choice of h.

4.4.4 Mode. The mode may be obtained by finding a point that maximizes equation 4.9. However, this requires a careful choice of h. Instead, we may use X_{i_max} with i_max = argmax_i w_{t,i} as a mode estimate. This is the point in {X_1, ..., X_n} that is associated with the maximum weight in {w_{t,1}, ..., w_{t,n}}. This point can be interpreted as the point that maximizes equation 4.9 in the limit of h → 0.

4.4.5 Other Methods. Other ways of using equation 4.8 include preimage computation and the fitting of gaussian mixtures (see, e.g., Song et al., 2009; Fukumizu et al., 2013; McCalman, O'Callaghan, & Ramos, 2013).
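Given the normalized weights w_t and the training states, the statistics of sections 4.4.1 to 4.4.4 reduce to a few lines of array arithmetic. The sketch below is our own illustration for X = R^d, with a gaussian smoothing kernel J_h and a hypothetical region indicator.

```python
import numpy as np

def posterior_statistics(X, w, region=None, h=0.3):
    """Decode statistics from hat{m}_{x_t|y_{1:t}} = sum_i w_i k_X(., X_i).
    X : (n, d) training states,  w : (n,) normalized KBR weights."""
    mean = w @ X                                   # sec. 4.4.1, posterior mean
    cov = (X * w[:, None]).T @ X                   # uncentered covariance
    mode = X[int(np.argmax(w))]                    # sec. 4.4.4, maximum-weight point
    stats = {"mean": mean, "cov": cov, "mode": mode}
    if region is not None:                         # sec. 4.4.2: P(x_t in A)
        stats["mass"] = w[region(X)].sum()
    # sec. 4.4.3: density estimate at the posterior mean, gaussian smoothing kernel J_h
    d = X.shape[1]
    J = np.exp(-((mean - X) ** 2).sum(1) / (2 * h * h)) / (2 * np.pi * h * h) ** (d / 2)
    stats["density_at_mean"] = w @ J
    return stats

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    w = np.full(200, 1.0 / 200)
    print(posterior_statistics(X, w, region=lambda X: X[:, 0] > 0.0))
```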

5 Theoretical Analysis

In this section, we analyze the sampling procedure of the prediction step in section 4.2. Specifically, we derive an upper bound on the error of the estimator 4.4. We also discuss in detail how the resampling step in section 4.2 works as a preprocessing step of the prediction step.

To make our analysis clear, we slightly generalize the setting of the prediction step and discuss the sampling and resampling procedures in this setting.

5.1 Error Bound for the Prediction Step. Let X be a measurable space and P be a probability distribution on X. Let p(·|x) be a conditional distribution on X conditioned on x ∈ X. Let Q be a marginal distribution on X defined by Q(B) = ∫ p(B|x) dP(x) for all measurable B ⊂ X. In the filtering setting of section 4, the space X corresponds to the state space, and the distributions P, p(·|x), and Q correspond to the posterior p(x_{t−1}|y_{1:t−1}) at time t − 1, the transition model p(x_t|x_{t−1}), and the prior p(x_t|y_{1:t−1}) at time t, respectively.

Let k_X be a positive-definite kernel on X and H_X be the RKHS associated with k_X. Let m_P = ∫ k_X(·, x) dP(x) and m_Q = ∫ k_X(·, x) dQ(x) be the kernel means of P and Q, respectively. Suppose that we are given an empirical estimate of m_P as

m̂_P = Σ_{i=1}^n w_i k_X(·, X_i),   (5.1)

where w_1, ..., w_n ∈ R and X_1, ..., X_n ∈ X. Considering this weighted sample form enables us to explain the mechanism of the resampling step.

The prediction step can then be cast as the following procedure: for each sample X_i, we generate a new sample X'_i with the conditional distribution X'_i ∼ p(·|X_i). Then we estimate m_Q by

m̂_Q = Σ_{i=1}^n w_i k_X(·, X'_i),   (5.2)

which corresponds to the estimate, equation 4.4, of the prior kernel mean at time t.

The following theorem provides an upper bound on the error of equation 5.2 and reveals properties of equation 5.1 that affect the error of the estimator, equation 5.2. The proof is given in appendix A.

Theorem 1. Let m̂_P be a fixed estimate of m_P given by equation 5.1. Define a function θ on X × X by θ(x_1, x_2) = ∫∫ k_X(x'_1, x'_2) dp(x'_1|x_1) dp(x'_2|x_2) for all (x_1, x_2) ∈ X × X, and assume that θ is included in the tensor RKHS H_X ⊗ H_X.^4 The estimator m̂_Q, equation 5.2, then satisfies

E_{X'_1, ..., X'_n}[ ‖m̂_Q − m_Q‖²_{H_X} ]
   ≤ Σ_{i=1}^n w_i² ( E_{X'_i}[k_X(X'_i, X'_i)] − E_{X'_i, X̃'_i}[k_X(X'_i, X̃'_i)] )   (5.3)
   + ‖m̂_P − m_P‖²_{H_X} ‖θ‖_{H_X ⊗ H_X},   (5.4)

where X'_i ∼ p(·|X_i) and X̃'_i is an independent copy of X'_i.

^4 The tensor RKHS H_X ⊗ H_X is the RKHS of a product kernel k_{X×X} on X × X defined as k_{X×X}((x_a, x_b), (x_c, x_d)) = k_X(x_a, x_c) k_X(x_b, x_d) for all (x_a, x_b), (x_c, x_d) ∈ X × X. This space H_X ⊗ H_X consists of smooth functions on X × X if the kernel k_X is smooth (e.g., if k_X is gaussian; see section 4 of Steinwart & Christmann, 2008). In this case, we can interpret this assumption as requiring that θ be smooth as a function on X × X. The function θ can be written as the inner product between the kernel means of the conditional distributions: θ(x_1, x_2) = ⟨m_{p(·|x_1)}, m_{p(·|x_2)}⟩_{H_X}, where m_{p(·|x)} = ∫ k_X(·, x') dp(x'|x). Therefore, the assumption may be further seen as requiring that the map x → m_{p(·|x)} be smooth. Note that while similar assumptions are common in the literature on kernel mean embeddings (e.g., theorem 5 of Fukumizu et al., 2013), we may relax this assumption by using approximate arguments in learning theory (e.g., theorems 2.2 and 2.3 of Eberts & Steinwart, 2013). This analysis remains a topic for future research.

From theorem 1, we can make the following observations. First, the second term, equation 5.4, of the upper bound shows that the error of the estimator, equation 5.2, is likely to be large if the given estimate, equation 5.1, has large error ‖m̂_P − m_P‖²_{H_X}, which is reasonable to expect.

Second, the first term, equation 5.3, shows that the error of equation 5.2 can be large if the distribution of X'_i (i.e., p(·|X_i)) has large variance. For example, suppose X'_i = f(X_i) + ε_i, where f: X → X is some mapping and ε_i is a random variable with mean 0. Let k_X be the gaussian kernel k_X(x, x') = exp(−‖x − x'‖²/2α) for some α > 0. Then E_{X'_i}[k_X(X'_i, X'_i)] − E_{X'_i, X̃'_i}[k_X(X'_i, X̃'_i)] increases from 0 to 1 as the variance of ε_i (i.e., the variance of X'_i) increases from 0 to infinity. Therefore, in this case, equation 5.3 is upper-bounded at worst by Σ_{i=1}^n w_i². Note that E_{X'_i}[k_X(X'_i, X'_i)] − E_{X'_i, X̃'_i}[k_X(X'_i, X̃'_i)] is always nonnegative.^5

5.1.1 Effective Sample Size. Now let us assume that the kernel k_X is bounded: there is a constant C > 0 such that sup_{x∈X} k_X(x, x) < C. Then the inequality of theorem 1 can be further bounded as

E_{X'_1, ..., X'_n}[ ‖m̂_Q − m_Q‖²_{H_X} ] ≤ 2C Σ_{i=1}^n w_i² + ‖m̂_P − m_P‖²_{H_X} ‖θ‖_{H_X ⊗ H_X}.   (5.5)

This bound shows that two quantities are important in the estimate, equation 5.1: (1) the sum of squared weights Σ_{i=1}^n w_i² and (2) the error ‖m̂_P − m_P‖²_{H_X}. In other words, the error of equation 5.2 can be large if the quantity Σ_{i=1}^n w_i² is large, regardless of the accuracy of equation 5.1 as an estimator of m_P. In fact, the estimator of the form 5.1 can have large Σ_{i=1}^n w_i² even when ‖m̂_P − m_P‖²_{H_X} is small, as shown in section 6.1.

^5 To show this, it is sufficient to prove that ∫∫ k_X(x, x̃) dP(x) dP(x̃) ≤ ∫ k_X(x, x) dP(x) for any probability P. This can be shown as follows: ∫∫ k_X(x, x̃) dP(x) dP(x̃) = ∫∫ ⟨k_X(·, x), k_X(·, x̃)⟩_{H_X} dP(x) dP(x̃) ≤ ∫∫ √k_X(x, x) √k_X(x̃, x̃) dP(x) dP(x̃) ≤ ∫ k_X(x, x) dP(x). Here we used the reproducing property, the Cauchy-Schwartz inequality, and Jensen's inequality.


Figure 3: An illustration of the sampling procedure with (right) and without (left) the resampling algorithm. The left panel corresponds to the kernel mean estimators, equations 5.1 and 5.2, in section 5.1, and the right panel corresponds to equations 5.6 and 5.7 in section 5.2.

The inverse of the sum of the squared weights, 1 / Σ_{i=1}^n w_i², can be interpreted as the effective sample size (ESS) of the empirical kernel mean, equation 5.1. To explain this, suppose that the weights are normalized: Σ_{i=1}^n w_i = 1. Then the ESS takes its maximum n when the weights are uniform, w_1 = ... = w_n = 1/n. It becomes small when only a few samples have large weights (see the left side in Figure 3). Therefore, the bound, equation 5.5, can be interpreted as follows: to make equation 5.2 a good estimator of m_Q, we need to have equation 5.1 such that the ESS is large and the error ‖m̂_P − m_P‖_H is small. Here we borrowed the notion of ESS from the literature on particle methods, in which ESS has also played an important role (see section 2.5.3 of Liu, 2001, and section 3.5 of Doucet & Johansen, 2011).

52 Role of Resampling Based on these arguments we explain how theresampling step in section 42 works as a preprocessing step for the samplingprocedure Consider mP in equation 51 as an estimate equation 46 givenby the correction step at time t minus 1 Then we can think of mQ equation 52as an estimator of the kernel mean equation 45 of the prior without theresampling step

The resampling step is application of kernel herding to mP to obtainsamples X1 Xn which provide a new estimate of mP with uniformweights

mP = 1n

nsumi=1

kX (middot Xi) (56)

The subsequent prediction step is to generate a sample X primei sim p(middot|Xi) for each

Xi (i = 1 n) and estimate mQ as

406 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

mQ = 1n

nsumi=1

kX (middot X primei ) (57)

Theorem 1 gives the following bound for this estimator that corresponds toequation 55

EX prime1X

primen[mQ minus mQ2

HX] le 2C

n+ mP minus mP2

HθHX otimesHX (58)

A comparison of the upper bounds of equations 55 and 58 implies thatthe resampling step is beneficial when

sumni=1 w2

i is large (ie the ESS is small)and mP minus mPHX

is small The condition on mP minus mPHXmeans that the

loss by kernel herding (in terms of the RKHS distance) is small This impliesmP minus mPHX

asymp mP minus mPHX so the second term of equation 58 is close

to that of equation 55 On the other hand the first term of equation 58 willbe much smaller than that of equation 55 if

sumni=1 w2

i 1n In other wordsthe resampling step improves the accuracy of the sampling procedure byincreasing the ESS of the kernel mean estimate mP This is illustrated inFigure 3

The above observations lead to the following procedures

521 When to Apply Resampling Ifsumn

i=1 w2i is not large the gain by

the resampling step will be small Therefore the resampling algorithmshould be applied when

sumni=1 w2

i is above a certain threshold say 2n Thesame strategy has been commonly used in particle methods (see Doucet ampJohansen 2011)

Also the bound equation 53 of theorem 1 shows that resampling is notbeneficial if the variance of the conditional distribution p(middot|x) is very small(ie if state transition is nearly deterministic) In this case the error of thesampling procedure may increase due to the loss mP minus mPHX

caused bykernel herding

522 Reduction of Computational Cost Algorithm 2 generates n samplesX1 Xn with time complexity O(n3) Suppose that the first samplesX1 X where lt n already approximate mP well 1

sumi=1 kX (middot Xi) minus

mPHXis small We do not then need to generate the rest of samples

X+1 Xn we can make n samples by copying the samples n times(suppose n can be divided by for simplicity say n = 2) Let X1 Xn

denote these n samples Then 1

sumi=1 kX (middot Xi) = 1

n

sumni=1 kX (middot Xi) by defi-

nition so 1n

sumni=1 kX (middot Xi) minus mPHX

is also small This reduces the time

complexity of algorithm 2 to O(n2)One might think that it is unnecessary to copy n times to make n

samples This is not true however Suppose that we just use the first

Filtering with State-Observation Examples 407

samples to define mP = 1

sumi=1 kX (middot Xi) Then the first term of equation

58 becomes 2C which is larger than 2Cn of n samples This differenceinvolves sampling with the conditional distribution X prime

i sim p(middot|Xi) If we usejust the samples sampling is done times If we use the copied n samplessampling is done n times Thus the benefit of making n samples comesfrom sampling with the conditional distribution many times This matchesthe bound of theorem 1 where the first term involves the variance of theconditional distribution

53 Convergence Rates for Resampling Our resampling algorithm (seealgorithm 2) is an approximate version of kernel herding in section 35algorithm 2 searches for the solutions of the update equations 36 and 37from a finite set X1 Xn sub X not from the entire space X Thereforeexisting theoretical guarantees for kernel herding (Chen et al 2010 Bachet al 2012) do not apply to algorithm 2 Here we provide a theoreticaljustification

531 Generalized Version We consider a slightly generalized versionshown in algorithm 4 It takes as input (1) a kernel mean estimator mP of akernel mean mP (2) candidate samples Z1 ZN and (3) the number ofresampling It then outputs resampling samples X1 X isin Z1 ZNwhich form a new estimator mP = 1

sumi=1 kX (middot Xi) Here N is the number

of the candidate samplesAlgorithm 4 searches for solutions of the update equations 36 and 37

from the candidate set Z1 ZN Note that here these samples Z1 ZNcan be different from those expressing the estimator mP If they are thesamemdashthe estimator is expressed as mP = sumn

i=1 wtik(middot Xi) with n = N andXi = Zi (i = 1 n)mdashthen algorithm 4 reduces to algorithm 2 In facttheorem 2 allows mP to be any element in the RKHS

532 Convergence Rates in Terms of N and Algorithm 4 gives the newestimator mP of the kernel mean mP The error of this new estimatormP minus mPHX

should be close to that of the given estimator mP minus mPHX

408 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Theorem 2 guarantees this In particular it provides convergence rates ofmP minus mPHX

approaching mP minus mPHX as N and go to infinity This

theorem follows from theorem 3 in appendix B which holds under weakerassumptions

Theorem 2 Let mP be the kernel mean of a distribution P and mP be any element inthe RKHS HX Let Z1 ZN be an iid sample from a distribution with densityq Assume that P has a density function p such that supxisinX p(x)q (x) lt infin LetX1 X be samples given by algorithm 4 applied to mP with candidate samplesZ1 ZN Then for mP = 1

sumi=1 k(middot Xi ) we have

mP minus mP2HX

= (mP minus mPHX+ Op(Nminus12))2 + O

(ln

) (N rarr infin) (59)

Our proof in appendix B relies on the fact that kernel herding can beseen as the Frank-Wolfe optimization method (Bach et al 2012) Indeedthe error O(ln ) in equation 59 comes from the optimization error of theFrank-Wolfe method after iterations (Freund amp Grigas 2014 bound 32)The error Op(N

minus12) is due to the approximation of the solution space by afinite set Z1 ZN These errors will be small if N and are large enoughand the error of the given estimator mP minus mPHX

is relatively large Thisis formally stated in corollary 1 below

Theorem 2 assumes that the candidate samples are iid with a density qThe assumption supxisinX p(x)q(x) lt infin requires that the support of q con-tains that of p This is a formal characterization of the explanation in section42 that the samples X1 XN should cover the support of P sufficientlyNote that the statement of theorem 2 also holds for non-iid candidatesamples as shown in theorem 3 of appendix B

533 Convergence Rates as mP Goes to mP Theorem 2 provides conver-gence rates when the estimator mP is fixed In corollary 1 below we let mPapproach mP and provide convergence rates for mP of algorithm 4 approach-ing mP This corollary directly follows from theorem 2 since the constantterms in Op(N

minus12) and O(ln ) in equation 59 do not depend on mPwhich can be seen from the proof in section B

Corollary 1 Assume that P and Z1 ZN satisfy the conditions in theorem 2for all N Let m(n)

P be an estimator of mP such that m(n)P minus mPHX

= Op(nminusb) as

n rarr infin for some constant b gt 06 Let N = = n2b Let X(n)1 X(n)

be samples

6Here the estimator m(n)P and the candidate samples Z1 ZN can be dependent

Filtering with State-Observation Examples 409

given by algorithm 4 applied to m(n)P with candidate samples Z1 ZN Then

for m(n)P = 1

sumi=1 kX (middot X(n)

i ) we have

m(n)P minus mPHX

= Op(nminusb) (n rarr infin) (510)

Corollary 1 assumes that the estimator m(n)

P converges to mP at a rateOp(n

minusb) for some constant b gt 0 Then the resulting estimator m(n)

P by algo-rithm 4 also converges to mP at the same rate O(nminusb) if we set N = = n2bThis implies that if we use sufficiently large N and the errors Op(N

minus12)

and O(ln ) in equation 59 can be negligible as stated earlier Note thatN = = n2b implies that N and can be smaller than n since typically wehave b le 12 (b = 12 corresponds to the convergence rates of parametricmodels) This provides a support for the discussion in section 52 (reductionof computational cost)

534 Convergence Rates of Sampling after Resampling We can derive con-vergence rates of the estimator mQ equation 57 in section 52 Here we con-sider the following construction of mQ as discussed in section 52 (reductionof computational cost) First apply algorithm 4 to m(n)

P and obtain resam-pling samples X (n)

1 X (n) isin Z1 ZN Then copy these samples n

times and let X (n)

1 X (n)n be the resulting times n samples Finally

sample with the conditional distribution Xprime(n)i sim p(middot|Xi) (i = 1 n)

and define

m(n)

Q = 1n

nsumi=1

kX (middot Xprime(n)i ) (511)

The following corollary is a consequence of corollary 1 theorem 1 andthe bound equation 58 Note that theorem 1 obtains convergence in expec-tation which implies convergence in probability

Corollary 2 Let θ be the function defined in theorem 1 and assume θ isin HX otimes HX Assume that P and Z1 ZN satisfy the conditions in theorem 2 for all N Letm(n)

P be an estimator of mP such that m(n)P minus mPHX

= Op(nminusb) as n rarr infin forsome constant b gt 0 Let N = = n2b Then for the estimator m(n)

Q defined asequation 511 we have

m(n)Q minus mQHX

= Op(nminusmin(b12)) (n rarr infin)

Suppose b le 12 which holds with basically any nonparametric esti-mators Then corollary 2 shows that the estimator m(n)

Q achieves the sameconvergence rate as the input estimator m(n)

P Note that without resampling

410 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

the rate becomes Op(

radicsumni=1(w

(n)i )2 + nminusb) where the weights are given by

the input estimator m(n)

P = sumni=1 w

(n)i kX (middot X (n)

i ) (see the bound equation55) Thanks to resampling (the square root of) the sum of the squaredweights in the case of corollary 2 becomes 1

radicn le 1

radicn which is

usually smaller thanradicsumn

i=1(w(n)i )2 and is faster than or equal to Op(n

minusb)This shows the merit of resampling in terms of convergence rates (see alsothe discussions in section 52)

54 Consistency of the Overall Procedure Here we show the consis-tency of the overall procedure in KMCF This is based on corollary 2 whichshows the consistency of the resampling step followed by the predictionstep and on theorem 5 of Fukumizu et al (2013) which guarantees theconsistency of kernel Bayesrsquo rule in the correction step Thus we considerthree steps in the following order resampling prediction and correctionMore specifically we show the consistency of the estimator equation 46of the posterior kernel mean at time t given that the one at time t minus 1 isconsistent

To state our assumptions we will need the following functions θpos Y times Y rarr R θobs X times X rarr R and θtra X times X rarr R

θpos(y y) =intint

kX (xt xt )dp(xt |y1tminus1 yt = y)dp(xt |y1tminus1 yt = y)

(512)

θobs(x x) =intint

kY (yt yt )dp(yt |xt = x)dp(yt |xt = x) (513)

θtra(x x) =intint

kX (xt xt )dp(xt |xtminus1 = x)dp(xt |xtminus1 = x) (514)

These functions contain the information concerning the distributions in-volved In equation 512 the distribution p(xt |y1tminus1 yt = y) denotes theposterior of the state at time t given that the observation at time t is yt = ySimilarly p(xt |y1tminus1 yt = y) is the posterior at time t given that the observa-tion is yt = y In equation 513 the distributions p(yt |xt = x) and p(yt |xt = x)

denote the observation model when the state is xt = x or xt = x respectivelyIn equation 514 the distributions p(xt |xtminus1 = x) and p(xt |xtminus1 = x) denotethe transition model with the previous state given by xtminus1 = x or xtminus1 = xrespectively

For simplicity of presentation we consider here N = = n for the resam-pling step Below denote by F otimes G the tensor product space of two RKHSsF and G

Corollary 3 Let (X1 Y1) (Xn Yn) be an iid sample with a joint densityp(x y) = p(y|x)q (x) where p(y|x) is the observation model Assume that the

Filtering with State-Observation Examples 411

posterior p(xt|y1t) has a density p and that supxisinX p(x)q (x) lt infin Assumethat the functions defined by equations 512 to 514 satisfy θpos isin HY otimes HY θobs isin HX otimes HX and θtra isin HX otimes HX respectively Suppose that mxtminus1|y1tminus1

minusmxtminus1|y1tminus1

HXrarr 0 as n rarr infin in probability Then for any sufficiently slow decay

of regularization constants εn and δn of algorithm 1 we have

mxt |y1tminus mxt |y1t

HXrarr 0 (n rarr infin)

in probability

Corollary 3 follows from theorem 5 of Fukumizu et al (2013) and corol-lary 2 The assumptions θpos isin HY otimes HY and θobs isin HX otimes HX are due totheorem 5 of Fukumizu et al (2013) for the correction step while the as-sumption θtra isin HX otimes HX is due to theorem 1 for the prediction step fromwhich corollary 2 follows As we discussed in note 4 of section 51 these es-sentially assume that the functions θpos θobs and θtra are smooth Theorem 5of Fukumizu et al (2013) also requires that the regularization constantsεn δn of kernel Bayesrsquo rule should decay sufficiently slowly as the samplesize goes to infinity (εn δn rarr 0 as n rarr infin) (For details see sections 52 and62 in Fukumizu et al 2013)

It would be more interesting to investigate the convergence rates of theoverall procedure However this requires a refined theoretical analysis ofkernel Bayesrsquo rule which is beyond the scope of this letter This is becausecurrently there is no theoretical result on convergence rates of kernel Bayesrsquorule as an estimator of a posterior kernel mean (existing convergence resultsare for the expectation of function values see theorems 6 and 7 in Fukumizuet al 2013) This remains a topic for future research

6 Experiments

This section is devoted to experiments In section 61 we conduct basicexperiments on the prediction and resampling steps before going on to thefiltering problem Here we consider the problem described in section 5 Insection 62 the proposed KMCF (see algorithm 3) is applied to syntheticstate-space models Comparisons are made with existing methods applica-ble to the setting of the letter (see also section 2) In section 63 we applyKMCF to the real problem of vision-based robot localization

In the following N(μ σ 2) denotes the gaussian distribution with meanμ isin R and variance σ 2 gt 0

61 Sampling and Resampling Procedures The purpose here is to seehow the prediction and resampling steps work empirically To this end weconsider the problem described in section 5 with X = R (see section 51 fordetails) Specifications of the problem are described below

412 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

We will need to evaluate the errors mP minus mPHXand mQ minus mQHX

sowe need to know the true kernel means mP and mQ To this end we definethe distributions and the kernel to be gaussian this allows us to obtainanalytic expressions for mP and mQ

611 Distributions and Kernel More specifically we define the marginalP and the conditional distribution p(middot|x) to be gaussian P = N(0 σ 2

P ) andp(middot|x) = N(x σ 2

cond) Then the resulting Q = intp(middot|x)dP(x) also becomes

gaussian Q = N(0 σ 2P + σ 2

cond) We define kX to be the gaussian kernelkX (x xprime) = exp(minus(x minus xprime)22γ 2) We set σP = σcond = γ = 01

612 Kernel Means Due to the convolution theorem of gaussian func-tions the kernel means mP = int

kX (middot x)dP(x) and mQ = intkX (middot x)dQ(x)

can be analytically computed mP(x) =radic

γ 2

σ 2+γ 2 exp(minus x2

2(γ 2+σ 2P )

) mQ(x) =radicγ 2

(σ 2+σ 2cond+γ 2 )

exp(minus x2

2(σ 2P+σ 2

cond+γ 2 ))

613 Empirical Estimates We artificially defined an estimate mP =sumni=1 wikX (middot Xi) as follows First we generated n = 100 samples

X1 X100 from a uniform distribution on [minusA A] with some A gt 0 (spec-ified below) We computed the weights w1 wn by solving an optimiza-tion problem

minwisinRn

nsum

i=1

wikX (middot Xi) minus mP2H + λw2

and then applied normalization so thatsumn

i=1 wi = 1 Here λ gt 0 is a regular-ization constant which allows us to control the trade-off between the errormP minus mP2

HXand the quantity

sumni=1 w2

i = w2 If λ is very small the re-sulting mP becomes accurate mP minus mP2

HXis small but has large

sumni=1 w2

i If λ is large the error mP minus mP2

HXmay not be very small but

sumni=1 w2

i

becomes small This enables us to see how the error mQ minus mQ2HX

changesas we vary these quantities

614 Comparison Given mP = sumni=1 wikX (middot Xi) we wish to estimate the

kernel mean mQ We compare three estimators

bull woRes Estimate mQ without resampling Generate samples X primei sim

p(middot|Xi) to produce the estimate mQ = sumni=1 wikX (middot X prime

i ) This corre-sponds to the estimator discussed in section 51

bull Res-KH First apply the resampling algorithm of algorithm 2 to mPyielding X1 Xn Then generate X prime

i sim p(middot|Xi) for each Xi giving

Filtering with State-Observation Examples 413

Figure 4 Results of the experiments from section 61 (Top left and right)Sample-weight pairs of mP = sumn

i=1 wikX (middot Xi) and mQ = sumni=1 wik(middot X prime

i ) (Mid-dle left and right) Histogram of samples X1 Xn generated by algorithm2 and that of samples X prime

1 X primen from the conditional distribution (Bottom

left and right) Histogram of samples generated with multinomial resamplingafter truncating negative weights and that of samples from the conditionaldistribution

the estimate mQ = 1n

sumni=1 k(middot X prime

i ) This is the estimator discussed insection 52

bull Res-Trunc Instead of algorithm 2 first truncate negative weights inw1 wn to be 0 and apply normalization to make the sum of theweights to be 1 Then apply the multinomial resampling algorithmof particle methods and estimate mQ as Res-KH

615 Demonstration Before starting quantitative comparisons wedemonstrate how the above estimators work Figure 4 shows demonstra-tion results with A = 1 First note that for mP = sumn

i=1 wik(middot Xi) samplesassociated with large weights are located around the mean of P as thestandard deviation of P is relatively small σP = 01 Note also that some

414 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

of the weights are negative In this example the error of mP is very smallmP minus mP2

HX= 849e minus 10 while that of the estimate mQ given by woRes is

mQ minus mQ2HX

= 0125 This shows that even if mP minus mP2HX

is very small

the resulting mQ minus mQ2HX

may not be small as implied by theorem 1 andthe bound equation 55

We can observe the following First algorithm 2 successfully discardedsamples associated with very small weights Almost all the generatedsamples X1 Xn are located in [minus2σP 2σP] where σP is the standarddeviation of P The error is mP minus mP2

HX= 474e minus 5 which is greater

than mP minus mP2HX

This is due to the additional error caused by the re-sampling algorithm Note that the resulting estimate mQ is of the errormQ minus mQ2

HX= 000827 This is much smaller than the estimate mQ by

woRes showing the merit of the resampling algorithmRes-Trunc first truncated the negative weights in w1 wn Let us

see the region where the density of P is very smallmdashthe region outside[minus2σP 2σP] We can observe that the absolute values of weights are verysmall in this region Note that there exist positive and negative weightsThese weights maintain balance such that the amounts of positive and neg-ative values are almost the same Therefore the truncation of the negativeweights breaks this balance As a result the amount of the positive weightssurpasses the amount needed to represent the density of P This can be seenfrom the histogram for Res-Trunc some of the samples X1 Xn gener-ated by Res-Trunc are located in the region where the density of P is verysmall Thus the resulting error mP minus mP2

HX= 00538 is much larger than

that of Res-KH This demonstrates why the resampling algorithm of parti-cle methods is not appropriate for kernel mean embeddings as discussedin section 42

616 Effects of the Sum of Squared Weights The purpose here is to see howthe error mQ minus mQ2

HXchanges as we vary the quantity

sumni=1 w2

i (recall that

the bound equation 55 indicates that mQ minus mQ2HX

increases assumn

i=1 w2i

increases) To this end we made mP = sumni=1 wikX (middot Xi) for several values of

the regularization constant λ as described above For each λ we constructedmP and estimated mQ using each of the three estimators above We repeatedthis 20 times for each λ and averaged the values of mP minus mP2

HXsumn

i=1 w2i

and the errors mQ minus mQ2HX

by the three estimators Figure 5 shows theseresults where the both axes are in the log scale Here we used A = 5 forthe support of the uniform distribution7 The results are summarized asfollows

7This enables us to maintain the values for mP minus mP2HX

in almost the same amountwhile changing the values for

sumni=1 w2

i

Filtering with State-Observation Examples 415

Figure 5 Results of synthetic experiments for the sampling and resampling pro-cedure in section 61 Vertical axis errors in the squared RKHS norm Horizontalaxis values of

sumni=1 w2

i for different mP Black the error of mP (mP minus mP2HX

)Blue green and red the errors on mQ by woRes Res-KH and Res-Truncrespectively

bull The error of woRes (blue) increases proportionally to the amount ofsumni=1 w2

i This matches the bound equation 55bull The error of Res-KH is not affected by

sumni=1 w2

i Rather it changes inparallel with the error of mP This is explained by the discussions insection 52 on how our resampling algorithm improves the accuracyof the sampling procedure

bull Res-Trunc is worse than Res-KH especially for largesumn

i=1 w2i This

is also explained with the bound equation 58 Here mP is the onegiven by Res-Trunc so the error mP minus mPHX

can be large due tothe truncation of negative weights as shown in the demonstrationresults This makes the resulting error mQ minus mQHX

large

Note that mP and mQ are different kernel means so it can happen thatthe errors mQ minus mQHX

by Res-KH are less than mp minus mPHX as in

Figure 5

416 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

62 Filtering with Synthetic State-Space Models Here we applyKMCF to synthetic state-space models Comparisons were made with thefollowing methods

kNN-PF (Vlassis et al 2002) This method uses k-NN-based condi-tional density estimation (Stone 1977) for learning the observation modelFirst it estimates the conditional density of the inverse direction p(x|y)

from the training sample (XiYi) The learned conditional density is thenused as an alternative for the likelihood p(yt |xt ) this is a heuristic todeal with high-dimensional yt Then it applies particle filter (PF) basedon the approximated observation model and the given transition modelp(xt |xtminus1)

GP-PF (Ferris et al 2006) This method learns p(yt |xt ) from (XiYi)with gaussian process (GP) regression Then particle filter is applied basedon the learned observation model and the transition model We used theopen source code for GP-regression in this experiment so comparison incomputational time is omitted for this method8

KBR filter (Fukumizu et al 2011 2013) This method is also based onkernel mean embeddings as is KMCF It applies kernel Bayesrsquo rule (KBR) inposterior estimation using the joint sample (XiYi) This method assumesthat there also exist training samples for the transition model Thus inthe following experiments we additionally drew training samples for thetransition model Fukumizu et al (2011 2013) showed that this methodoutperforms extended and unscented Kalman filters when a state-spacemodel has strong nonlinearity (in that experiment these Kalman filterswere given the full knowledge of a state-space model) We use this methodas a baseline

We used state-space models defined in Table 2 where ldquoSSMrdquo stands forldquostate-space modelrdquo In Table 2 ut denotes a control input at time t vtand wt denotes independent gaussian noise vt wt sim N(0 1) Wt denotes10-dimensional gaussian noise Wt sim N(0 I10) We generated each controlut randomly from the gaussian distribution N(0 1)

The state and observation spaces for SSMs 1a 1b 2a 2b 4a 4b aredefined as X = Y = R for SSMs 3a 3b X = RY = R

10 The models inSSMs 1a 2a 3a 4a and SSMs 1b 2b 3b 4b with the same number (eg1a and 1b) are almost the same the difference is whether ut exists in thetransition model Prior distributions for the initial state x1 for SSMs 1a 1b2a 2b 3a 3b are defined as pinit = N(0 1(1 minus 092)) and those for 4a4b are defined as a uniform distribution on [minus3 3]

SSMs 1a and 1b are linear gaussian models SSMs 2a and 2b are the so-called stochastic volatility models Their transition models are the same asthose of SSMs 1a and 1b The observation model has strong nonlinearityand the noise wt is multiplicative SSMs 3a and 3b are almost the same as

8httpwwwgaussianprocessorggpmlcodematlabdoc

Filtering with State-Observation Examples 417

Table 2 State-Space Models for Synthetic Experiments

SSM Transition Model Observation Model

1a xt = 09xtminus1 + vt yt = xt + wt

1b xt = 09xtminus1 + 1radic2(ut + vt ) yt = xt + wt

2a xt = 09xtminus1 + vt yt = 05 exp(xt2)wt

2b xt = 09xtminus1 + 1radic2(ut + vt ) yt = 05 exp(xt2)wt

3a xt = 09xtminus1 + vt yt = 05 exp(xt2)Wt

3b xt = 09xtminus1 + 1radic2(ut + vt ) yt = 05 exp(xt2)Wt

4a at = xtminus1 + radic2vt bt = xt + wt

xt =

at (if |at | le 3)

minus3 (otherwise)yt =

bt (if |bt | le 3)

bt minus 6bt|bt | (otherwise)

4b at = xtminus1 + ut + vt bt = xt + wt

xt =

at (if |at | le 3)

minus3 (otherwise)yt =

bt (if |bt | le 3)

bt minus 6bt|bt | (otherwise)

SSMs 2a and 2b The difference is that the observation yt is 10-dimensionalas Wt is 10-dimensional gaussian noise SSMs 4a and 4b are more complexthan the other models Both the transition and observation models havestrong nonlinearities states and observations located around the edges ofthe interval [minus3 3] may abruptly jump to distant places

For each model we generated the training samples (XiYi)ni=1 by sim-

ulating the model Test data (xt yt )Tt=1 were also generated by indepen-

dent simulation (recall that xt is hidden for each method) The length ofthe test sequence was set as T = 100 We fixed the number of particles inkNN-PF and GP-PF to 5000 in primary experiments we did not observeany improvements even when more particles were used For the same rea-son we fixed the size of transition examples for KBR filter to 1000 Eachmethod estimated the ground-truth states x1 xT by estimating the pos-terior means

intxt p(xt |y1t )dxt (t = 1 T ) The performance was evaluated

with RMSE (root mean squared errors) of the point estimates defined as

RMSE =radic

1T

sumTt=1(xt minus xt )

2 where xt is the point estimateFor KMCF and KBR filter we used gaussian kernels for each of X and Y

(and also for controls in KBR filter) We determined the hyperparameters ofeach method by two-fold cross-validation by dividing the training data intotwo sequences The hyperparameters in the GP-regressor for PF-GP wereoptimized by maximizing the marginal likelihood of the training data Toreduce the costs of the resampling step of KMCF we used the method dis-cussed in section 52 with = 50 We also used the low-rank approximationmethod (see algorithm 5) and the subsampling method (see algorithm 6)in appendix C to reduce the computational costs of KMCF Specifically we

418 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Figure 6 RMSE of the synthetic experiments in section 62 The state-spacemodels of these figures have no control in their transition models

used r = 10 20 (rank of low-rank matrices) for algorithm 5 (described asKMCF-low10 and KMCF-low20 in the results below) r = 50 100 (numberof subsamples) for algorithm 6 (described as KMCF-sub50 and KMCF-sub100) We repeated experiments 20 times for each of different trainingsample size n

Figure 6 shows the results in RMSE for SSMs 1a 2a 3a 4a and Figure 7shows those for SSMs 1b 2b 3b 4b Figure 8 describes the results incomputational time for SSMs 1a and 1b the results for the other models aresimilar so we omit them We do not show the results of KMCF-low10 inFigures 6 and 7 since they were numerically unstable and gave very largeRMSEs

GP-PF performed the best for SSMs 1a and 1b This may be becausethese models fit the assumption of GP-regression as their noise is addi-tive gaussian For the other models however GP-PF performed poorly

Filtering with State-Observation Examples 419

Figure 7 RMSE of synthetic experiments in section 62 The state-space modelsof these figures include control ut in their transition models

Figure 8 Computation time of synthetic experiments in section 62 (Left) SSM1a (Right) SSM 1b

420 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

the observation models of these models have strong nonlinearities and thenoise is not additive gaussian For these models KMCF performed thebest or competitively with the other methods This indicates that KMCFsuccessfully exploits the state-observation examples (XiYi)n

i=1 in dealingwith the complicated observation models Recall that our focus has beenon situations where the relations between states and observations are socomplicated that the observation model is not known the results indicatethat KMCF is promising for such situations On the other hand the KBRfilter performed worse than KMCF for most of the models KBF filter alsouses kernel Bayesrsquo rule as KMCF The difference is that KMCF makes use ofthe transition models directly by sampling while the KBR filter must learnthe transition models from training data for state transitions This indicatesthat the incorporation of the knowledge expressed in the transition model isvery important for the filtering performance This can also be seen by com-paring Figures 6 and 7 The performance of the methods other than KBRfilter improved for SSMs 1b 2b 3b 4b compared to the performance forthe corresponding models in SSMs 1a 2a 3a 4a Recall that SSMs 1b2b 3b 4b include control ut in their transition models The informationof control input is helpful for filtering in general Thus the improvementssuggest that KMCF kNN-PF and GP-PF successfully incorporate the in-formation of controls they achieve this by sampling with p(xt |xtminus1 ut ) Onthe other hand KBF filter must learn the transition model p(xt |xtminus1 ut ) thiscan be harder than learning the transition model p(xt |xtminus1) which has nocontrol input

We next compare computation time (see Figure 8) KMCF was competi-tive or even slower than the KBR filter This is due to the resampling step inKMCF The speed-up methods (KMCF-low10 KMCF-low20 KMCF-sub50and KMCF-sub100) successfully reduced the costs of KMCF KMCF-low10and KMCF-low20 scaled linearly to the sample size n this matches the factthat algorithm 5 reduces the costs of kernel Bayesrsquo Rule to O(nr2) The costsof KMCF-sub50 and KMCF-sub100 remained almost the same over the dif-ferent sample sizes This is because they reduce the sample size itself fromn to r so the costs are reduced to O(r3) (see algorithm 6) KMCF-sub50 andKMCF-sub100 are competitive to kNN-PF which is fast as it only needskNN searches to deal with the training sample (XiYi)n

i=1 In Figures 6and 7 KMCF-low20 and KMCF-sub100 produced the results competitiveto KMCF for SSMs 1a 2a 4a 1b 2b 4b Thus for these models suchmethods reduce the computational costs of KMCF without losing muchaccuracy KMCF-sub50 was slightly worse than KMCF-100 This indicatesthat the number of subsamples cannot be reduced to this extent if we wishto maintain accuracy For SSMs 3a and 3b the performance of KMCF-low20and KMCF-sub100 was worse than KMCF in contrast to the performancefor the other models The difference of SSMs 3a and 3b from the other mod-els is that the observation space is 10-dimensional Y = R

10 This suggeststhat if the dimension is high r needs to be large to maintain accuracy (recall

Filtering with State-Observation Examples 421

that r is the rank of low-rank matrices in algorithm 5 and the number ofsubsamples in algorithm 6) This is also implied by the experiments in thesection 63

63 Vision-Based Mobile Robot Localization We applied KMCF to theproblem of vision-based mobile robot localization (Vlassis et al 2002 Wolfet al 2005 Quigley et al 2010) We consider a robot moving in a buildingThe robot takes images with its vision camera as it moves Thus the visionimages form a sequence of observations y1 yT in time series each ytis an image The robot does not know its positions in the building wedefine state xt as the robotrsquos position at time t The robot wishes to estimateits position xt from the sequence of its vision images y1 yt This canbe done by filtering that is by estimating the posteriors p(xt |y1 yt )

(t = 1 T ) This is the robot localization problem It is fundamental inrobotics as a basis for more involved applications such as navigation andreinforcement learning (Thrun Burgard amp Fox 2005)

The state-space model is defined as follows The observation modelp(yt |xt ) is the conditional distribution of images given position which isvery complicated and considered unknown We need to assume position-image examples (XiYi)n

i=1 these samples are given in the data setdescribed below The transition model p(xt |xtminus1) = p(xt |xtminus1 ut ) is the con-ditional distribution of the current position given the previous one Thisinvolves a control input ut that specifies the movement of the robot In thedata set we use the control is given as odometry measurements Thus wedefine p(xt |xtminus1 ut ) as the odometry motion model which is fairly standardin robotics (Thrun et al 2005) Specifically we used the algorithm describedin table 56 of Thrun et al (2005) with all of its parameters fixed to 01 Theprior pinit of the initial position x1 is defined as a uniform distribution overthe samples X1 Xn in (XiYi)n

i=1As a kernel kY for observations (images) we used the spatial pyramid

matching kernel of Lazebnik et al (2006) This is a positive-definite kerneldeveloped in the computer vision community and is also fairly standardSpecifically we set the parameters of this kernel as suggested in Lazebniket al (2006) This gives a 4200-dimensional histogram for each image Wedefined the kernel kX for states (positions) as gaussian Here the state spaceis the four-dimensional space X = R

4 two dimensions for location and therest for the orientation of the robot9

The data set we used is the COLD database (Pronobis amp Caputo 2009)which is publicly available Specifically we used the data set Freiburg PartA Path 1 cloudy This data set consists of three similar trajectories of arobot moving in a building each of which provides position-image pairs(xt yt )T

t=1 We used two trajectories for training and validation and the

9We projected the robotrsquos orientation in [0 2π ] onto the unit circle in R2

422 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

rest for test We made state-observation examples (XiYi)ni=1 by randomly

subsampling the pairs in the trajectory for training Note that the difficultyof localization may depend on the time interval (ie the interval between tand t minus 1 in sec) Therefore we made three test sets (and training samplesfor state transitions in KBR filter) with different time intervals 227 sec(T = 168) 454 sec (T = 84) and 681 sec (T = 56)

In these experiments we compared KMCF with three methods kNN-PFKBR filter and the naive method (NAI) defined below For KBR filter wealso defined the gaussian kernel on the control ut that is on the differenceof odometry measurements at time t minus 1 and t The naive method (NAI)estimates the state xt as a point Xj in the training set (XiYi) such that thecorresponding observation Yj is closest to the observation yt We performedthis as a baseline We also used the spatial pyramid matching kernel forthese methods (for kNN-PF and NAI as a similarity measure of the nearestneighbors search) We did not compare with GP-PF since it assumes thatobservations are real vectors and thus cannot be applied to this problemstraightforwardly We determined the hyperparameters in each methodby cross-validation To reduce the cost of the resampling step in KMCFwe used the method discussed in section 52 with = 100 The low-rankapproximation method (see algorithm 5) and the subsampling method (seealgorithm 6) were also applied to reduce the computational costs of KMCFSpecifically we set r = 50 100 for algorithm 5 (described as KMCF-low50and KMCF-low100 in the results below) and r = 150 300 for algorithm 6(KMCF-sub150 and KMCF-sub300)

Note that in this problem the posteriors p(xt |y1t ) can be highly multi-modal This is because similar images appear in distant locations Thereforethe posterior mean

intxt p(xt |y1t )dxt is not appropriate for point estimation

of the ground-truth position xt Thus for KMCF and KBR filter we em-ployed the heuristic for mode estimation explained in section 44 For kNN-PF we used a particle with maximum weight for the point estimationWe evaluated the performance of each method by RMSE of location esti-mates We ran each experiment 20 times for each training set of differentsize

631 Results First we demonstrate the behaviors of KMCF with this lo-calization problem Figures 9 and 10 show iterations of KMCF with n = 400applied to the test data with time interval 681 sec Figure 9 illustrates itera-tions that produced accurate estimates while Figure 10 describes situationswhere location estimation is difficult

Figures 11 and 12 show the results in RMSE and computational timerespectively For all the results KMCF and that with the computational re-duction methods (KMCF-low50 KMCF-low100 KMCF-sub150 and KMCF-sub300) performed better than KBR filter These results show the bene-fit of directly manipulating the transition models with sampling KMCF

Filtering with State-Observation Examples 423

Figure 9 Demonstration results Each column corresponds to one iteration ofKMCF (Top) Prediction step histogram of samples for prior (Middle) Cor-rection step weighted samples for posterior The blue and red stems indi-cate positive and negative weights respectively The yellow ball representsthe ground-truth location xt and the green diamond the estimated one xt (Bottom) Resampling step histogram of samples given by the resampling step

was competitive with kNN-PF for the interval 227 sec note that kNN-PFwas originally proposed for the robot localization problem For the resultswith the longer time intervals (454 sec and 681 sec) KMCF outperformedkNN-PF

We next investigate the effect on KMCF of the methods to reduce com-putational cost The performance of KMCF-low100 and KMCF-sub300 iscompetitive with KMCF those of KMCF-low50 and KMCF-sub150 de-grade as the sample size increases Note that r = 50 100 for algorithm 5are larger than those in section 62 though the values of the sample size nare larger than those in section 62 Also note that the performance of KMCF-sub150 is much worse than KMCF-sub300 These results indicate that wemay need large values for r to maintain the accuracy for this localizationproblem Recall that the spatial pyramid matching kernel gives essentially ahigh-dimensional feature vector (histogram) for each observation Thus the

424 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Figure 10 Demonstration results (see also the caption of Figure 9) Here weshow time points where observed images are similar to those in distant placesSuch a situation often occurs at corners and makes location estimation difficult(a) The prior estimate is reasonable but the resulting posterior has modes indistant places This makes the location estimate (green diamond) far from thetrue location (yellow ball) (b) While the location estimate is very accuratemodes also appear at distant locations

observation space Y may be considered high-dimensional This supportsthe hypothesis in section 62 that if the dimension is high the computationalcost-reduction methods may require larger r to maintain accuracy

Finally let us look at the results in computation time (see Figure 12)The results are similar to those in section 62 Although the values for r arerelatively large algorithms 5 and 6 successfully reduced the computationalcosts of KMCF

7 Conclusion and Future Work

This letter has proposed kernel Monte Carlo filter a novel filtering methodfor state-space models We have considered the situation where the obser-vation model is not known explicitly or even parametrically and where

Filtering with State-Observation Examples 425

Figure 11 RMSE of the robot localization experiments in section 63 Panels a band c show the cases for time interval 227 sec 454 sec and 681 sec respectively

examples of the state-observation relation are given instead of the obser-vation model Our approach is based on the framework of kernel meanembeddings which enables us to deal with the observation model in adata-driven manner The resulting filtering method consists of the predic-tion correction and resampling steps all realized in terms of kernel meanembeddings Methodological novelties lie in the prediction and resamplingsteps Thus we analyzed their behaviors by deriving error bounds for theestimator of the prediction step The analysis revealed that the effectivesample size of a weighted sample plays an important role as in parti-cle methods This analysis also explained how our resampling algorithmworks We applied the proposed method to synthetic and real problemsconfirming the effectiveness of our approach

426 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Figure 12 Computation time of the localization experiments in section 63Panels a b and c show the cases for time interval 227 sec 454 sec and 681 secrespectively Note that the results show the runtime of each method

One interesting topic for future research would be parameter estimationfor the transition model We did not discuss this and assumed that param-eters if they exist are given and fixed If the state observation examples(XiYi)n

i=1 are given as a sequence from the state-space model then wecan use the state samples X1 Xn for estimating those parameters Oth-erwise we need to estimate the parameters based on test data This mightbe possible by exploiting approaches for parameter estimation in particlemethods (eg section IV in Cappe et al (2007))

Another important topic is the situation where the observation modelin the test and training phases is different As discussed in section 43 this

Filtering with State-Observation Examples 427

might be addressed by exploiting the framework of transfer learning (Panamp Yang 2010) This would require extension of kernel mean embeddingsto the setting of transfer learning since there has been no work in thisdirection We consider that such extension is interesting in its own right

Appendix A Proof of Theorem 1

Before going to the proof we review some basic facts that are neededLet mP = int

kX (middot x)dP(x) and mP = sumni=1 wikX (middot Xi) By the reproducing

property of the kernel kX the following hold for any f isin HX

〈mP f 〉HX=

langintkX (middot x)dP(x) f

rangHX

=int

〈kX (middot x) f 〉HXdP(x)

=int

f (x)dP(x) = EXsimP[ f (X)] (A1)

〈mP f 〉HX=

langnsum

i=1

wikX (middot Xi) f

rangHX

=nsum

i=1

wi f (Xi) (A2)

For any f g isin HX we denote by f otimes g isin HX otimes HX the tensor productof f and g defined as

f otimes g(x1 x2) = f (x1)g(x2) forallx1 x2 isin X (A3)

The inner product of the tensor RKHS HX otimes HX satisfies

〈 f1 otimesg1 f2 otimes g2〉HXotimesHX= 〈 f1 f2〉HX

〈g1 g2〉HXforall f1 f2 g1 g2 isinHX

(A4)

Let φiIs=1 sub HX be complete orthonormal bases of HX where I isin N cup infin

Assume θ isin HX otimes HX (recall that this is an assumption of theorem 1) Thenθ is expressed as

θ =Isum

st=1

αstφs otimes φt (A5)

withsum

st |αst |2 lt infin (see Aronszajn 1950)

Proof of Theorem 1 Recall that mQ = sumni=1 wikX (middot X prime

i ) where X primei sim

p(middot|Xi) (i = 1 n) Then

EX prime1X

primen[mQ minus mQ2

HX]

428 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

= EX prime1X

primen[〈mQ mQ〉HX

minus 2〈mQ mQ〉HX+ 〈mQ mQ〉HX

]

=nsum

i j=1

wiw jEX primei X

primej[kX (X prime

i X primej)]

minus 2nsum

i=1

wiEX primesimQX primei[kX (X prime X prime

i )] + EX primeX primesimQ[kX (X prime X prime)]

=sumi = j

wiw jEX primei X

primej[kX (X prime

i X primej)] +

nsumi=1

w2i EX prime

i[kX (X prime

i X primei )]

minus 2nsum

i=1

wiEX primesimQX primei[kX (X prime X prime

i )] + EX primeX primesimQ[kX (X prime X prime)] (A6)

where X prime denotes an independent copy of X primeRecall that Q= int

p(middot|x)dP(x) and θ (x x) = int intkX (xprime xprime)dp(xprime|x)dp(xprime|x)

We can then rewrite terms in equation A6 as

EX primesimQX primei[kX (X prime X prime

i )]

=int (intint

kX (xprime xprimei)dp(xprime|x)dp(xprime

i|Xi)

)dP(x)

=int

θ (x Xi)dP(x) = EXsimP[θ (X Xi)]

EX primeX primesimQ[kX (X prime X prime)]

=intint (intint

kX (xprime xprime)dp(xprime|x)p(xprime|x)

)dP(x)dP(x)

=intint

θ (x x)dP(x)dP(x) = EXXsimP[θ (X X)]

Thus equation A6 is equal to

nsumi=1

w2i (EX prime

i[kX (X prime

i X primei )] minus EX prime

i Xprimei[kX (X prime

i X primei )]) +

nsumi j=1

wiw jθ (Xi Xj)

minus 2nsum

i=1

wiEXsimP[θ (X Xi)] + EXXsimP[θ (X X)] (A7)

Filtering with State-Observation Examples 429

We can rewrite terms in equation A7 as follows using the facts A1 A2A3 A4 and A5

sumi j

wiw jθ (Xi Xj) =sumi j

wiw j

sumst

αstφs(Xi)φt (Xj)

=sumst

αst

sumi

wiφs(Xi)sum

j

w jφt(Xj) =sumst

αst〈mP φs〉HX〈mP φt〉HX

=sumst

αst〈mP otimes mP φs otimes φt〉HX otimesHX= 〈mP otimes mP θ〉HX otimesHX

sumi

wiEXsimP[θ (X Xi)] =sum

i

wiEXsimP

[sumst

αstφs(X)φt (Xi)

]

=sumst

αstEXsimP[φs(X)]sum

i

wiφt (Xi) =sumst

αst〈mP φs〉HX〈mP φt〉HX

=sumst

αst〈mP otimes mP φs otimes φt〉HX otimesHX= 〈mP otimes mP θ〉HX otimesHX

EXXsimP[θ (X X)] = EXXsimP

[sumst

αstφs(X)φt (X)

]

=sumst

αst〈mP φs〉HX〈mP φt〉HX

=sumst

αst〈mP otimes mP φs otimes φt〉HX otimesHX

= 〈mP otimes mP θ〉HX otimesHX

Thus equation A7 is equal to

nsumi=1

w2i (EX prime

i[kX (X prime

i X primei )] minus EX prime

i Xprimei[kX (X prime

i X primei )]) + 〈mP otimes mP θ〉HX otimesHX

minus 2〈mP otimes mP θ〉HX otimesHX+ 〈mP otimes mP θ〉HX otimesHX

=nsum

i=1

w2i (EX prime

i[kX (X prime

i X primei )] minus EX prime

i Xprimei[kX (X prime

i X primei )])

+〈(mP minus mP) otimes (mP minus mP) θ〉HX otimesHX

Finally the Cauchy-Schwartz inequality gives

〈(mP minus mP) otimes (mP minus mP) θ〉HX otimesHXle mP minus mP2

HXθHX otimesHX

This completes the proof

430 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Appendix B Proof of Theorem 2

Theorem 2 provides convergence rates for the resampling algorithmAlgorithm 4 This theorem assumes that the candidate samples Z1 ZNfor resampling are iid with a density q Here we prove theorem 2 byshowing that the same statement holds under weaker assumptions (seetheorem 3 below)

We first describe assumptions Let P be the distribution of the kernelmean mP and L2(P) be the Hilbert space of square-integrable functions onX with respect to P For any f isin L2(P) we write its norm by fL2(P) =int

f 2(x)dP(x)

Assumption 1 The candidate samples Z1 ZN are independent Thereare probability distributions Q1 QN on X such that for any boundedmeasurable function g X rarr R we have

E

⎡⎣ 1

N minus 1

sumj =i

g(Zj)

⎤⎦ = EXsimQi

[g(X)] (i = 1 N) (B1)

Assumption 2 The distributions Q1 QN have density functions q1

qN respectively Define Q = 1N

sumNi=1 Qi and q = 1

N

sumNi=1 qi There is a con-

stant A gt 0 that does not depend on N such that

∥∥∥∥qi

qminus 1

∥∥∥∥2

L2(P)

le AradicN

(i = 1 N) (B2)

Assumption 3 The distribution P has a density function p such thatsupxisinX

p(x)

q(x)lt infin There is a constant σ gt 0 such that

radicN

(1N

Nsumi=1

p(Zi)

q(Zi)minus 1

)DrarrN (0 σ 2) (B3)

whereDrarr denotes convergence in distribution and N (0 σ 2) the normal

distribution with mean 0 and variance σ 2

These assumptions are weaker than those in theorem 2 which requirethat Z1 ZN be iid For example assumption 1 is clearly satisfied for theiid case since in this case we have Q = Q1= middot middot middot = QN The inequalityequation B2 in assumption 2 requires that the distributions Q1 QN getsimilar as the sample size increases This is also satisfied under the iid

Filtering with State-Observation Examples 431

assumption Likewise the convergence equation B3 in assumption 3 issatisfied from the central limit theorem if Z1 ZN are iid

We will need the following lemma

Lemma 1 Let Z1 ZN be samples satisfying assumption 1 Then the followingholds for any bounded measurable function g X rarr R

E

[1N

Nsumi=1

g(Zi )

]=

intg(x)d Q(x)

Proof

E

[1N

Nsumi=1

g(Zi)

]= E

⎡⎣ 1

N(N minus 1)

Nsumi=1

sumj =i

g(Zj)

⎤⎦

= 1N

Nsumi=1

E

⎡⎣ 1

N minus 1

sumj =i

g(Zj)

⎤⎦ = 1

N

Nsumi=1

intg(x)Qi(x) =

intg(x)dQ(x)

The following theorem shows the convergence rates of our resampling al-gorithm Note that it does not assume that the candidate samples Z1 ZNare identical to those expressing the estimator mP

Theorem 3 Let k be a bounded positive definite kernel and H be the asso-ciated RKHS Let Z1 ZN be candidate samples satisfying assumptions 12 and 3 Let P be a probability distribution satisfying assumption 3 and letmP =

intk(middot x)d P(x) be the kernel mean Let mP isin H be any element in H Sup-

pose we apply algorithm 4 to mP isin H with candidate samples Z1 ZN and letX1 X isin Z1 ZN be the resulting samples Then the following holds

∥∥∥∥∥mP minus 1

sumi=1

k(middot Xi )

∥∥∥∥∥2

H

= (mP minus mPH + Op(Nminus12))2 + O(

ln

)

Proof Our proof is based on the fact (Bach et al 2012) that kernel herdingcan be seen as the Frank-Wolfe optimization method with step size 1( + 1)

for the th iteration For details of the Frank-Wolfe method we refer to Jaggi(2013) and Freund and Grigas (2014)

Fix the samples Z1 ZN Let MN be the convex hull of the set k(middot Z1)

k(middot ZN ) sub H Define a loss function J H rarr R by

J(g) = 12g minus mP2

H g isin H (B4)

432 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Then algorithm 4 can be seen as the Frank-Wolfe method that iterativelyminimizes this loss function over the convex hull MN

infgisinMN

J(g)

More precisely the Frank-Wolfe method solves this problem by the follow-ing iterations

s = arg mingisinMN

〈gnablaJ(gminus1)〉H

g = (1 minus γ )gminus1 + γ s ( ge 1)

where γ is a step size defined as γ = 1 and nablaJ(gminus1) is the gradient of J atgminus1 nablaJ(gminus1) = gminus1 minus mP Here the initial point is defined as g0 = 0 It canbe easily shown that g = 1

sumi=1 k(middot Xi) where X1 X are the samples

given by algorithm 4 (For details see Bach et al 2012)Let LJMN

gt 0 be the Lipschitz constant of the gradient nablaJ over MN andDiam MN gt 0 be the diameter of MN

LJMN= sup

g1g2isinMN

nablaJ(g1) minus nablaJ(g2)Hg1 minus g2H

= supg1g2isinMN

g1 minus g2Hg1 minus g2H

= 1 (B5)

Diam MN = supg1g2isinMN

g1 minus g2H

le supg1g2isinMN

g1H + g2H le 2C (B6)

where C = supxisinX k(middot x)H = supxisinXradic

k(x x) lt infinFrom bound 32 and equation 8 of Freund and Grigas (2014) we then

have

J(g) minus infgisinMN

J(g) leLJMN

(Diam MN)2(1 + ln )

2(B7)

le 2C2(1 + ln )

(B8)

where the last inequality follows from equations B5 and B6

Filtering with State-Observation Examples 433

Note that the upper bound of equation B8 does not depend on thecandidate samples Z1 ZN Hence combined with equation B4 thefollowing holds for any choice of Z1 ZN

∥∥∥∥∥mP minus 1

sumi=1

k(middot Xi)

∥∥∥∥∥2

H

le infgisinMN

mP minus g2H + 4C2(1 + ln )

(B9)

Below we will focus on bounding the first term of equation B9 Recallhere that Z1 ZN are random samples Define a random variable SN =sumN

i=1p(Zi )

q(Zi ) Since MN is the convex hull of the k(middot Z1) k(middot ZN ) we

have

infgisinMN

mP minus gH

= infαisinRN αge0

sumi αile1

∥∥∥∥∥mP minussum

i

αik(middot Zi)

∥∥∥∥∥H

le∥∥∥∥∥mP minus 1

SN

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

le mP minus mPH +∥∥∥∥∥mP minus 1

N

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

+∥∥∥∥∥ 1

N

sumi

p(Zi)

q(Zi)k(middot Zi) minus 1

SN

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

Therefore we have

mP minus 1

sumi=1

k(middot Xi)2H

le(

mP minus mPH +∥∥∥∥∥mP minus 1

N

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

+∥∥∥∥∥ 1

N

sumi

p(Zi)

q(Zi)k(middot Zi) minus 1

SN

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

)2

+ O(

ln

)

(B10)

Below we derive rates of convergence for the second and third terms

434 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

For the second term we derive a rate of convergence in expectationwhich implies a rate of convergence in probability To this end we use thefollowing fact Let f isin H be any function in the RKHS By the assumptionsupxisinX

p(x)

q(x)lt infin and the boundedness of k functions x rarr p(x)

q(x)f (x) and

x rarr (p(x)

q(x))2 f (x) are bounded

E

⎡⎣

∥∥∥∥∥mP minus 1N

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥2

H

⎤⎦

= mP2H minus 2E

[1N

sumi

p(Zi)

q(Zi)mP(Zi)

]

+ E

⎡⎣ 1

N2

sumi

sumj

p(Zi)

q(Zi)

p(Zj)

q(Zj)k(Zi Zj)

⎤⎦

= mP2H minus 2

intp(x)

q(x)mP(x)q(x)dx

+ E

⎡⎣ 1

N2

sumi

sumj =i

p(Zi)

q(Zi)

p(Zj)

q(Zj)k(Zi Zj)

⎤⎦

+E

[1

N2

sumi

(p(Zi)

q(Zi)

)2

k(Zi Zi)

]

= mP2H minus 2mP2

H + E

[N minus 1

N2

sumi

p(Zi)

q(Zi)

intp(x)

q(x)k(Zi x)qi(x)dx

]

+ 1N

int (p(x)

q(x)

)2

k(x x)q(x)dx

= minusmP2H + E

[N minus 1

N2

sumi

p(Zi)

q(Zi)

intp(x)

q(x)k(Zi x)qi(x)dx

]

+ 1N

intp(x)

q(x)k(x x)dP(x)

We further rewrite the second term of the last equality as follows

E

[N minus 1

N2

sumi

p(Zi)

q(Zi)

intp(x)

q(x)k(Zi x)qi(x)dx

]

Filtering with State-Observation Examples 435

= E

[N minus 1

N2

sumi

p(Zi)

q(Zi)

intp(x)

q(x)k(Zi x)(qi(x) minus q(x))dx

]

+ E

[N minus 1

N2

sumi

p(Zi)

q(Zi)

intp(x)

q(x)k(Zi x)q(x)dx

]

= E

[N minus 1

N2

sumi

p(Zi)

q(Zi)

int radicp(x)k(Zi x)

radicp(x)

(qi(x)

q(x)minus 1

)dx

]

+ N minus 1N

mP2H

le E

[N minus 1

N2

sumi

p(Zi)

q(Zi)k(Zi middot)L2(P)

∥∥∥∥qi(x)

q(x)minus 1

∥∥∥∥L2(P)

]+ N minus 1

NmP2

H

le E

[N minus 1

N3

sumi

p(Zi)

q(Zi)C2A

]+ N minus 1

NmP2

H

= C2A(N minus 1)

N2 + N minus 1N

mP2H

where the first inequality follows from Cauchy-Schwartz Using this weobtain

E[

∥∥∥∥∥mP minus 1N

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥2

H

le 1N

(intp(x)

q(x)k(x x)dP(x) minus mP2

H

)+ C2(N minus 1)A

N2

= O(Nminus1)

Therefore we have

∥∥∥∥∥mP minus 1N

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

= Op(Nminus12) (N rarr infin) (B11)

We can bound the third term as follows∥∥∥∥∥ 1N

sumi

p(Zi)

q(Zi)k(middot Zi) minus 1

SN

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

=∥∥∥∥∥ 1

N

sumi

p(Zi)

q(Zi)k(middot Zi)

(1 minus N

SN

)∥∥∥∥∥H

436 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

=∣∣∣∣1 minus N

SN

∣∣∣∣∥∥∥∥∥ 1

N

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

le∣∣∣∣1 minus N

SN

∣∣∣∣C pqinfin

=∣∣∣∣∣1 minus 1

1N

sumNi=1 p(Zi)q(Zi)

∣∣∣∣∣C pqinfin

where pqinfin = supxisinXp(x)

q(x)lt infin Therefore the following holds by as-

sumption 3 and the delta method

∥∥∥∥∥ 1N

sumi

p(Zi)

q(Zi)k(middot Zi) minus 1

SN

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

= Op(Nminus12) (B12)

The assertion of the theorem follows from equations B10 to B12

Appendix C Reduction of Computational Cost

We have seen in section 43 that the time complexity of KMCF in one timestep is O(n3) where n is the number of the state-observation examples(XiYi)n

i=1 This can be costly if one wishes to use KMCF in real-timeapplications with a large number of samples Here we show two methodsfor reducing the costs one based on low-rank approximation of kernelmatrices and one based on kernel herding Note that kernel herding is alsoused in the resampling step The purpose here is different however wemake use of kernel herding for finding a reduced representation of the data(XiYi)n

i=1

C1 Low-rank Approximation of Kernel Matrices Our goal is to re-duce the costs of algorithm 1 of kernel Bayesrsquo rule Algorithm 1 involvestwo matrix inversions (GX + nεIn)minus1 in line 3 and ((GY )2 + δIn)minus1 in line4 Note that (GX + nεIn)minus1 does not involve the test data so it can be com-puted before the test phase On the other hand ((GY )2 + δIn)minus1 dependson matrix This matrix involves the vector mπ which essentially repre-sents the prior of the current state (see line 13 of algorithm 3) Therefore((GY )2 + δIn)minus1 needs to be computed for each iteration in the test phaseThis has the complexity of O(n3) Note that even if (GX + nεIn)minus1 can becomputed in the training phase the multiplication (GX + nεIn)minus1mπ in line3 requires O(n2) Thus it can also be costly Here we consider methods toreduce both costs in lines 3 and 4

Suppose that there exist low-rank matrices U, V ∈ R^{n×r}, where r < n, that approximate the kernel matrices: G_X ≈ UU^T, G_Y ≈ VV^T. Such low-rank matrices can be obtained by, for example, incomplete Cholesky decomposition with time complexity O(nr^2) (Fine & Scheinberg, 2001; Bach & Jordan, 2002). Note that the computation of these matrices is required only once before the test phase. Therefore their time complexities are not the problem here.

C.1.1 Derivation. First, we approximate (G_X + nεI_n)^{-1} m_π in line 3 using G_X ≈ UU^T. By the Woodbury identity, we have

\[
(G_X + n\varepsilon I_n)^{-1} m_\pi \approx (UU^T + n\varepsilon I_n)^{-1} m_\pi
= \frac{1}{n\varepsilon}\big(I_n - U(n\varepsilon I_r + U^T U)^{-1} U^T\big) m_\pi,
\]

where I_r ∈ R^{r×r} denotes the identity matrix. Note that (nεI_r + U^T U)^{-1} does not involve the test data, so it can be computed in the training phase. Thus the above approximation of μ can be computed with complexity O(nr^2).

Next, we approximate w = G_Y((G_Y)^2 + δI)^{-1} k_Y in line 4 using G_Y ≈ VV^T. Define B = V ∈ R^{n×r}, C = V^T V ∈ R^{r×r}, and D = V^T ∈ R^{r×n}. Then (G_Y)^2 ≈ (VV^T)^2 = BCD. By the Woodbury identity, we obtain

\[
(\delta I_n + (G_Y)^2)^{-1} \approx (\delta I_n + BCD)^{-1}
= \frac{1}{\delta}\big(I_n - B(\delta C^{-1} + DB)^{-1} D\big).
\]

Thus w can be approximated as

\[
w = G_Y\big((G_Y)^2 + \delta I\big)^{-1} k_Y
\approx \frac{1}{\delta}\, VV^T\big(I_n - B(\delta C^{-1} + DB)^{-1} D\big) k_Y.
\]

The computation of this approximation requires O(nr^2 + r^3) = O(nr^2). Thus, in total, the complexity of algorithm 1 can be reduced to O(nr^2). We summarize the above approximations in algorithm 5.
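For concreteness, the following is a minimal Python (NumPy) sketch of the two Woodbury-based approximations above. It is not algorithm 5 itself: the function name is illustrative, the inputs are assumed to be NumPy arrays of compatible shapes, and the line-4 weights are computed in the simplified form w = G_Y((G_Y)^2 + δI)^{-1} k_Y used in this appendix.

import numpy as np

def kbr_lowrank_step(U, V, m_pi, k_y, eps, delta):
    # Low-rank approximation of one call to kernel Bayes' rule, assuming
    # G_X ~ U U^T and G_Y ~ V V^T with U, V of shape (n, r) and V of full
    # column rank.
    n, r = U.shape

    # Line 3: (G_X + n*eps*I_n)^{-1} m_pi
    #   ~ (1/(n*eps)) (I_n - U (n*eps*I_r + U^T U)^{-1} U^T) m_pi
    A = n * eps * np.eye(r) + U.T @ U            # r x r; can be prefactorized offline
    mu = (m_pi - U @ np.linalg.solve(A, U.T @ m_pi)) / (n * eps)

    # Line 4: w = G_Y ((G_Y)^2 + delta*I_n)^{-1} k_Y with (G_Y)^2 ~ B C D,
    #   where B = V, C = V^T V, D = V^T.
    B, C, D = V, V.T @ V, V.T
    core = delta * np.linalg.inv(C) + D @ B      # r x r
    inv_ky = (k_y - B @ np.linalg.solve(core, D @ k_y)) / delta
    w = V @ (V.T @ inv_ky)                       # G_Y k_Y ~ V V^T k_Y
    return mu, w

Every operation in this sketch is O(nr^2) or cheaper, matching the overall O(nr^2) cost stated above; the factors U and V would come from, for example, incomplete Cholesky decomposition computed once in the training phase.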


C.1.2 How to Use. Algorithm 5 can be used with algorithm 3 of KMCF by modifying algorithm 3 in the following manner. Compute the low-rank matrices U, V right after lines 4 and 5. This can be done by using, for example, incomplete Cholesky decomposition (Fine & Scheinberg, 2001; Bach & Jordan, 2002). Then replace algorithm 1 in line 15 by algorithm 5.

C.1.3 How to Select the Rank. As discussed in section 4.3, one way of selecting the rank r is to use cross-validation by regarding r as a hyperparameter of KMCF. Another way is to measure the approximation errors ‖G_X − UU^T‖ and ‖G_Y − VV^T‖ with some matrix norm, such as the Frobenius norm. Indeed, we can compute the smallest rank r such that these errors are below a prespecified threshold, and this can be done efficiently with time complexity O(nr^2) (Bach & Jordan, 2002).
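As an illustration of the second option, the following Python sketch (a simplified stand-in for the incomplete Cholesky routine of Fine & Scheinberg, 2001; the function name and the NumPy dependency are assumptions of this sketch) grows the rank until the trace of the residual, which upper-bounds the Frobenius error for a positive semidefinite matrix, falls below a prespecified threshold.

import numpy as np

def pivoted_cholesky(G, tol):
    # Pivoted (incomplete) Cholesky factorization of a PSD kernel matrix G.
    # Returns U with G ~ U U^T, stopping at the smallest rank r such that
    # trace(G - U U^T) <= tol; since the residual is PSD, this trace bounds
    # the Frobenius error ||G - U U^T||_F from above.  Total cost: O(n r^2).
    n = G.shape[0]
    d = np.diag(G).astype(float).copy()   # diagonal of the current residual
    U = np.zeros((n, 0))
    while d.sum() > tol:
        j = int(np.argmax(d))             # pivot: largest residual diagonal
        col = (G[:, j] - U @ U[j, :]) / np.sqrt(d[j])
        U = np.column_stack([U, col])
        d = np.maximum(d - col ** 2, 0.0) # update residual diagonal
    return U                              # rank r = U.shape[1]

Applying the same routine to G_X and G_Y gives U and V together with the smallest ranks meeting the threshold.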

C.2 Data Reduction with Kernel Herding. Here we describe an approach to reduce the size of the representation of the state-observation examples {(X_i, Y_i)}_{i=1}^n in an efficient way. By "efficient," we mean that the information contained in {(X_i, Y_i)}_{i=1}^n will be preserved even after the reduction. Recall that {(X_i, Y_i)}_{i=1}^n contains the information of the observation model p(y_t|x_t) (recall also that p(y_t|x_t) is assumed time-homogeneous; see section 4.1). This information is used only in algorithm 1 of kernel Bayes' rule (line 15, algorithm 3). Therefore it suffices to consider how kernel Bayes' rule accesses the information contained in the joint sample {(X_i, Y_i)}_{i=1}^n.

C.2.1 Representation of the Joint Sample. To this end, we need to show how the joint sample {(X_i, Y_i)}_{i=1}^n can be represented with a kernel mean embedding. Recall that (k_X, H_X) and (k_Y, H_Y) are the kernels and the associated RKHSs on the state space X and the observation space Y, respectively. Let X × Y be the product space of X and Y. Then we can define a kernel k_{X×Y} on X × Y as the product of k_X and k_Y:

\[
k_{X \times Y}\big((x,y),(x',y')\big) = k_X(x,x')\, k_Y(y,y') \quad \text{for all } (x,y),(x',y') \in X \times Y.
\]

This product kernel k_{X×Y} defines an RKHS of X × Y; let H_{X×Y} denote this RKHS. As in section 3, we can use k_{X×Y} and H_{X×Y} for a kernel mean embedding. In particular, the empirical distribution (1/n) Σ_{i=1}^n δ_{(X_i,Y_i)} of the joint sample {(X_i, Y_i)}_{i=1}^n ⊂ X × Y can be represented as an empirical kernel mean in H_{X×Y}:

\[
\hat{m}_{XY} = \frac{1}{n} \sum_{i=1}^n k_{X \times Y}\big((\cdot,\cdot),(X_i,Y_i)\big) \in H_{X \times Y}. \tag{C.1}
\]

This is the representation of the joint sample {(X_i, Y_i)}_{i=1}^n.

The information of {(X_i, Y_i)}_{i=1}^n is provided to kernel Bayes' rule essentially through the form of equation C.1 (Fukumizu et al., 2011, 2013). Recall that equation C.1 is a point in the RKHS H_{X×Y}. Any point close to equation C.1 in H_{X×Y} would also contain information close to that contained in equation C.1. Therefore we propose to find a subset {(X̄_1, Ȳ_1), ..., (X̄_r, Ȳ_r)} ⊂ {(X_i, Y_i)}_{i=1}^n, where r < n, such that its representation in H_{X×Y},

\[
\bar{m}_{XY} = \frac{1}{r} \sum_{i=1}^r k_{X \times Y}\big((\cdot,\cdot),(\bar X_i, \bar Y_i)\big) \in H_{X \times Y}, \tag{C.2}
\]

is close to equation C.1. Namely, we wish to find subsamples such that ‖m̂_{XY} − m̄_{XY}‖_{H_{X×Y}} is small. If this error is small enough, equation C.2 would provide information close to that given by equation C.1 for kernel Bayes' rule. Thus kernel Bayes' rule based on such subsamples {(X̄_i, Ȳ_i)}_{i=1}^r would not perform much worse than the one based on the entire set of samples {(X_i, Y_i)}_{i=1}^n.

C.2.2 Subsampling Method. To find such subsamples, we make use of kernel herding in section 3.5. Namely, we apply the update equations 3.6 and 3.7 to approximate equation C.1 with the kernel k_{X×Y} and RKHS H_{X×Y}. We greedily find subsamples D_r = {(X̄_1, Ȳ_1), ..., (X̄_r, Ȳ_r)} as

\[
\begin{aligned}
(\bar X_r, \bar Y_r)
&= \mathop{\arg\max}_{(x,y) \in \mathcal{D} \setminus \mathcal{D}_{r-1}}\;
\frac{1}{n} \sum_{i=1}^n k_{X \times Y}\big((x,y),(X_i,Y_i)\big)
- \frac{1}{r} \sum_{j=1}^{r-1} k_{X \times Y}\big((x,y),(\bar X_j,\bar Y_j)\big) \\
&= \mathop{\arg\max}_{(x,y) \in \mathcal{D} \setminus \mathcal{D}_{r-1}}\;
\frac{1}{n} \sum_{i=1}^n k_X(x, X_i)\, k_Y(y, Y_i)
- \frac{1}{r} \sum_{j=1}^{r-1} k_X(x, \bar X_j)\, k_Y(y, \bar Y_j).
\end{aligned}
\]

The resulting algorithm is shown in algorithm 6. The time complexity is O(n^2 r) for selecting r subsamples.
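To make the selection rule concrete, here is a minimal Python sketch of this greedy search over the finite candidate set, assuming the Gram matrices G_X = (k_X(X_i, X_j)) and G_Y = (k_Y(Y_i, Y_j)) have been precomputed; the function name is illustrative, and this is a simplified stand-in for algorithm 6 rather than the algorithm itself.

import numpy as np

def herding_subsample(GX, GY, r):
    # Greedily select r joint subsamples from {(X_i, Y_i)}_{i=1}^n by kernel
    # herding on the product kernel, whose Gram matrix is the elementwise
    # product K = GX * GY.  Returns the indices of the selected pairs.
    K = GX * GY
    first_term = K.mean(axis=1)          # (1/n) sum_i k_{XxY}((x,y),(X_i,Y_i))
    selected = []
    for step in range(1, r + 1):
        if selected:
            penalty = K[:, selected].sum(axis=1) / step
        else:
            penalty = np.zeros_like(first_term)
        score = first_term - penalty
        score[selected] = -np.inf        # search over D \ D_{step-1} only
        selected.append(int(np.argmax(score)))
    return selected

With the Gram matrices precomputed (O(n^2) once), the greedy loop itself costs O(nr^2); evaluating the kernels on the fly instead, as in algorithm 6, gives the O(n^2 r) complexity stated above.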

C.2.3 How to Use. By using algorithm 6, we can reduce the time complexity of KMCF (see algorithm 3) in each iteration from O(n^3) to O(r^3). This can be done by obtaining subsamples {(X̄_i, Ȳ_i)}_{i=1}^r by applying algorithm 6 to {(X_i, Y_i)}_{i=1}^n, then replacing {(X_i, Y_i)}_{i=1}^n in the requirement of algorithm 3 by {(X̄_i, Ȳ_i)}_{i=1}^r and using the number r instead of n.

C.2.4 How to Select the Number of Subsamples. The number r of subsamples determines the trade-off between the accuracy and computational time of KMCF. It may be selected by cross-validation or by measuring the approximation error ‖m̂_{XY} − m̄_{XY}‖_{H_{X×Y}}, as for the case of selecting the rank of the low-rank approximation in section C.1.
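This approximation error can be evaluated directly from the Gram matrices by expanding the squared RKHS norm. A minimal sketch under the same assumptions as above (the function name is illustrative, and `selected` is the list of chosen indices, for instance as returned by a greedy selection like the sketch in section C.2.2):

import numpy as np

def embedding_error_sq(GX, GY, selected):
    # Squared RKHS distance || m_hat_XY - m_bar_XY ||^2 between the joint
    # empirical kernel mean (C.1) and its subsample approximation (C.2),
    # computed from the Gram matrices via the product kernel K = GX * GY.
    K = GX * GY
    S = list(selected)
    n, r = K.shape[0], len(S)
    return (K.mean()                           # (1/n^2) sum_{i,j} K_ij
            - 2.0 * K[:, S].sum() / (n * r)    # cross term
            + K[np.ix_(S, S)].sum() / r ** 2)  # (1/r^2) sum over subsamples

One can then increase r until this quantity falls below a prespecified threshold.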

C.2.5 Discussion. Recall that kernel herding generates samples such that they approximate a given kernel mean (see section 3.5). Under certain assumptions, the error of this approximation is of O(r^{-1}) with r samples, which is faster than that of iid samples, O(r^{-1/2}). This indicates that subsamples {(X̄_i, Ȳ_i)}_{i=1}^r selected with kernel herding may approximate equation C.1 well. Here, however, we find the solutions of the optimization problems 3.6 and 3.7 from the finite set {(X_i, Y_i)}_{i=1}^n rather than the entire joint space X × Y. The convergence guarantee is provided only for the case of the entire joint space X × Y; thus, for our case, the convergence guarantee is no longer provided. Moreover, the fast rate O(r^{-1}) is guaranteed only for finite-dimensional RKHSs. Gaussian kernels, which we often use in practice, define infinite-dimensional RKHSs. Therefore the fast rate is not guaranteed if we use gaussian kernels. Nevertheless, we can use algorithm 6 as a heuristic for data reduction.

Acknowledgments

We express our gratitude to the associate editor and the anonymous reviewer for their time and helpful suggestions. We also thank Masashi Shimbo, Momoko Hayamizu, Yoshimasa Uematsu, and Katsuhiro Omae for their helpful comments. This work has been supported in part by MEXT Grant-in-Aid for Scientific Research on Innovative Areas 25120012. M.K. has been supported by JSPS Grant-in-Aid for JSPS Fellows 15J04406.

References

Anderson, B., & Moore, J. (1979). Optimal filtering. Englewood Cliffs, NJ: Prentice Hall.
Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3), 337–404.
Bach, F., & Jordan, M. I. (2002). Kernel independent component analysis. Journal of Machine Learning Research, 3, 1–48.
Bach, F., Lacoste-Julien, S., & Obozinski, G. (2012). On the equivalence between herding and conditional gradient algorithms. In Proceedings of the 29th International Conference on Machine Learning (ICML 2012) (pp. 1359–1366). Madison, WI: Omnipress.
Berlinet, A., & Thomas-Agnan, C. (2004). Reproducing kernel Hilbert spaces in probability and statistics. Dordrecht: Kluwer Academic.
Calvet, L. E., & Czellar, V. (2015). Accurate methods for approximate Bayesian computation filtering. Journal of Financial Econometrics, 13, 798–838. doi:10.1093/jjfinec/nbu019
Cappe, O., Godsill, S. J., & Moulines, E. (2007). An overview of existing methods and recent advances in sequential Monte Carlo. IEEE Proceedings, 95(5), 899–924.
Chen, Y., Welling, M., & Smola, A. (2010). Supersamples from kernel-herding. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence (pp. 109–116). Cambridge, MA: MIT Press.
Deisenroth, M., Huber, M., & Hanebeck, U. (2009). Analytic moment-based gaussian process filtering. In Proceedings of the 26th International Conference on Machine Learning (pp. 225–232). Madison, WI: Omnipress.
Doucet, A., Freitas, N. D., & Gordon, N. J. (Eds.). (2001). Sequential Monte Carlo methods in practice. New York: Springer.
Doucet, A., & Johansen, A. M. (2011). A tutorial on particle filtering and smoothing: Fifteen years later. In D. Crisan & B. Rozovskii (Eds.), The Oxford handbook of nonlinear filtering (pp. 656–704). New York: Oxford University Press.
Durbin, J., & Koopman, S. J. (2012). Time series analysis by state space methods (2nd ed.). New York: Oxford University Press.
Eberts, M., & Steinwart, I. (2013). Optimal regression rates for SVMs using gaussian kernels. Electronic Journal of Statistics, 7, 1–42.
Ferris, B., Hahnel, D., & Fox, D. (2006). Gaussian processes for signal strength-based location estimation. In Proceedings of Robotics: Science and Systems. Cambridge, MA: MIT Press.
Fine, S., & Scheinberg, K. (2001). Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2, 243–264.
Freund, R. M., & Grigas, P. (2014). New analysis and results for the Frank–Wolfe method. Mathematical Programming. doi:10.1007/s10107-014-0841-6
Fukumizu, K., Bach, F., & Jordan, M. I. (2004). Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5, 73–99.
Fukumizu, K., Gretton, A., Sun, X., & Scholkopf, B. (2008). Kernel measures of conditional dependence. In J. C. Platt, D. Koller, Y. Singer, & S. Roweis (Eds.), Advances in neural information processing systems, 20 (pp. 489–496). Cambridge, MA: MIT Press.
Fukumizu, K., Song, L., & Gretton, A. (2011). Kernel Bayes' rule. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. C. N. Pereira, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 24 (pp. 1737–1745). Red Hook, NY: Curran.
Fukumizu, K., Song, L., & Gretton, A. (2013). Kernel Bayes' rule: Bayesian inference with positive definite kernels. Journal of Machine Learning Research, 14, 3753–3783.
Fukumizu, K., Sriperumbudur, B., Gretton, A., & Scholkopf, B. (2009). Characteristic kernels on groups and semigroups. In D. Koller, D. Schuurmans, Y. Bengio, & L. Bottou (Eds.), Advances in neural information processing systems, 21 (pp. 473–480). Cambridge, MA: MIT Press.
Gordon, N. J., Salmond, D. J., & Smith, A. F. M. (1993). Novel approach to nonlinear/non-gaussian Bayesian state estimation. IEE Proceedings F, 140, 107–113.
Hofmann, T., Scholkopf, B., & Smola, A. J. (2008). Kernel methods in machine learning. Annals of Statistics, 36(3), 1171–1220.
Jaggi, M. (2013). Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning (pp. 427–435).
Jasra, A., Singh, S. S., Martin, J. S., & McCoy, E. (2012). Filtering via approximate Bayesian computation. Statistics and Computing, 22, 1223–1237.
Julier, S. J., & Uhlmann, J. K. (1997). A new extension of the Kalman filter to nonlinear systems. In Proceedings of AeroSense: The 11th International Symposium Aerospace/Defence Sensing, Simulation and Controls. Bellingham, WA: SPIE.
Julier, S. J., & Uhlmann, J. K. (2004). Unscented filtering and nonlinear estimation. IEEE Review, 92, 401–422.
Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Transactions of the ASME, Journal of Basic Engineering, 82, 35–45.
Kanagawa, M., & Fukumizu, K. (2014). Recovering distributions from gaussian RKHS embeddings. In Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (pp. 457–465). JMLR.
Kanagawa, M., Nishiyama, Y., Gretton, A., & Fukumizu, K. (2014). Monte Carlo filtering using kernel embedding of distributions. In Proceedings of the 28th AAAI Conference on Artificial Intelligence (pp. 1897–1903). Cambridge, MA: MIT Press.
Ko, J., & Fox, D. (2009). GP-BayesFilters: Bayesian filtering using gaussian process prediction and observation models. Autonomous Robots, 72(1), 75–90.
Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proceedings of 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (vol. 2, pp. 2169–2178). Washington, DC: IEEE Computer Society.
Liu, J. S. (2001). Monte Carlo strategies in scientific computing. New York: Springer-Verlag.
McCalman, L., O'Callaghan, S., & Ramos, F. (2013). Multi-modal estimation with kernel embeddings for learning motion models. In Proceedings of 2013 IEEE International Conference on Robotics and Automation (pp. 2845–2852). Piscataway, NJ: IEEE.
Pan, S. J., & Yang, Q. (2010). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10), 1345–1359.
Pistohl, T., Ball, T., Schulze-Bonhage, A., Aertsen, A., & Mehring, C. (2008). Prediction of arm movement trajectories from ECoG-recordings in humans. Journal of Neuroscience Methods, 167(1), 105–114.
Pronobis, A., & Caputo, B. (2009). COLD: COsy localization database. International Journal of Robotics Research, 28(5), 588–594.
Quigley, M., Stavens, D., Coates, A., & Thrun, S. (2010). Sub-meter indoor localization in unmodified environments with inexpensive sensors. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 2010 (vol. 1, pp. 2039–2046). Piscataway, NJ: IEEE.
Ristic, B., Arulampalam, S., & Gordon, N. (2004). Beyond the Kalman filter: Particle filters for tracking applications. Norwood, MA: Artech House.
Schaid, D. J. (2010a). Genomic similarity and kernel methods I: Advancements by building on mathematical and statistical foundations. Human Heredity, 70(2), 109–131.
Schaid, D. J. (2010b). Genomic similarity and kernel methods II: Methods for genomic information. Human Heredity, 70(2), 132–140.
Schalk, G., Kubanek, J., Miller, K. J., Anderson, N. R., Leuthardt, E. C., Ojemann, J. G., ... Wolpaw, J. R. (2007). Decoding two-dimensional movement trajectories using electrocorticographic signals in humans. Journal of Neural Engineering, 4(264), 264–275.
Scholkopf, B., & Smola, A. J. (2002). Learning with kernels. Cambridge, MA: MIT Press.
Scholkopf, B., Tsuda, K., & Vert, J. P. (2004). Kernel methods in computational biology. Cambridge, MA: MIT Press.
Silverman, B. W. (1986). Density estimation for statistics and data analysis. London: Chapman and Hall.
Smola, A., Gretton, A., Song, L., & Scholkopf, B. (2007). A Hilbert space embedding for distributions. In Proceedings of the International Conference on Algorithmic Learning Theory (pp. 13–31). New York: Springer.
Song, L., Fukumizu, K., & Gretton, A. (2013). Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine, 30(4), 98–111.
Song, L., Huang, J., Smola, A., & Fukumizu, K. (2009). Hilbert space embeddings of conditional distributions with applications to dynamical systems. In Proceedings of the 26th International Conference on Machine Learning (pp. 961–968). Madison, WI: Omnipress.
Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Scholkopf, B., & Lanckriet, G. R. (2010). Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11, 1517–1561.
Steinwart, I., & Christmann, A. (2008). Support vector machines. New York: Springer.
Stone, C. J. (1977). Consistent nonparametric regression. Annals of Statistics, 5(4), 595–620.
Thrun, S., Burgard, W., & Fox, D. (2005). Probabilistic robotics. Cambridge, MA: MIT Press.
Vlassis, N., Terwijn, B., & Krose, B. (2002). Auxiliary particle filter robot localization from high-dimensional sensor observations. In Proceedings of the International Conference on Robotics and Automation (pp. 7–12). Piscataway, NJ: IEEE.
Wang, Z., Ji, Q., Miller, K. J., & Schalk, G. (2011). Prior knowledge improves decoding of finger flexion from electrocorticographic signals. Frontiers in Neuroscience, 5, 127.
Widom, H. (1963). Asymptotic behavior of the eigenvalues of certain integral equations. Transactions of the American Mathematical Society, 109, 278–295.
Widom, H. (1964). Asymptotic behavior of the eigenvalues of certain integral equations II. Archive for Rational Mechanics and Analysis, 17, 215–229.
Wolf, J., Burgard, W., & Burkhardt, H. (2005). Robust vision-based localization by combining an image retrieval system with Monte Carlo localization. IEEE Transactions on Robotics, 21(2), 208–216.

Received May 18, 2015; accepted October 14, 2015.



44 Estimation of Posterior Statistics By algorithm 3 we obtain theestimates of the kernel means of posteriors equation 41 as

mxt |y1t=

nsumi=1

wtikX (middot Xi) (t = 1 T ) (48)

These contain the information on the posteriors p(xt |y1t ) (see sections 32and 34) We now show how to estimate statistics of the posteriors usingthese estimates For ease of presentation we consider the case X = R

dTheoretical arguments to justify these operations are provided by Kana-gawa and Fukumizu (2014)

441 Mean and Covariance Consider the posterior meanint

xt p(xt |y1t )dxtisin R

d and the posterior (uncentered) covarianceint

xtxTt p(xt |y1t )dxt isin R

dtimesd

402 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

These quantities can be estimated as

nsumi=1

wtiXi (mean)

nsumi=1

wtiXiXTi (covariance)

442 Probability Mass Let A sub X be a measurable set with smoothboundary Define the indicator function IA(x) by IA(x) = 1 for x isin A andIA(x) = 0 otherwise Consider the probability mass

intIA(x)p(xt |y1t )dxt This

can be estimated assumn

i=1 wtiIA(Xi)

443 Density Suppose p(xt |y1t ) has a density function Let J(x) be asmoothing kernel satisfying

intJ(x)dx = 1 and J(x) ge 0 Let h gt 0 and define

Jh(x) = 1hd J

( xh

) Then the density of p(xt |y1t ) can be estimated as

p(xt |y1t ) =nsum

i=1

wtiJh(xt minus Xi) (49)

with an appropriate choice of h

444 Mode The mode may be obtained by finding a point that maxi-mizes equation 49 However this requires a careful choice of h Instead wemay use Ximax

with imax = arg maxi wti as a mode estimate This is the pointin X1 Xn that is associated with the maximum weight in wt1 wtnThis point can be interpreted as the point that maximizes equation 49 inthe limit of h rarr 0

445 Other Methods Other ways of using equation 48 include the preim-age computation and fitting of gaussian mixtures (See eg Song et al 2009Fukumizu et al 2013 McCalman OrsquoCallaghan amp Ramos 2013)

5 Theoretical Analysis

In this section we analyze the sampling procedure of the prediction stepin section 42 Specifically we derive an upper bound on the error of theestimator 44 We also discuss in detail how the resampling step in section 42works as a preprocessing step of the prediction step

To make our analysis clear we slightly generalize the setting of theprediction step and discuss the sampling and resampling procedures inthis setting

51 Error Bound for the Prediction Step Let X be a measurable spaceand P be a probability distribution on X Let p(middot|x) be a conditional

Filtering with State-Observation Examples 403

distribution on X conditioned on x isin X Let Q be a marginal distributionon X defined by Q(B) = int

p(B|x)dP(x) for all measurable B sub X In the fil-tering setting of section 4 the space X corresponds to the state space andthe distributions P p(middot|x) and Q correspond to the posterior p(xtminus1|y1tminus1)

at time t minus 1 the transition model p(xt |xtminus1) and the prior p(xt |y1tminus1) attime t respectively

Let kX be a positive-definite kernel onX andHX be the RKHS associatedwith kX Let mP = int

kX (middot x)dP(x) and mQ = intkX (middot x)dQ(x) be the kernel

means of P and Q respectively Suppose that we are given an empiricalestimate of mP as

mP =nsum

i=1

wikX (middot Xi) (51)

where w1 wn isin R and X1 Xn isin X Considering this weighted sam-ple form enables us to explain the mechanism of the resampling step

The prediction step can then be cast as the following procedure for eachsample Xi we generate a new sample X prime

i with the conditional distributionX prime

i sim p(middot|Xi) Then we estimate mQ by

mQ =nsum

i=1

wikX (middot X primei ) (52)

which corresponds to the estimate 44 of the prior kernel mean at time tThe following theorem provides an upper bound on the error of equa-

tion 52 and reveals properties of equation 51 that affect the error of theestimator equation 52 The proof is given in appendix A

Theorem 1 Let mP be a fixed estimate of mP given by equation 51 Define afunction θ on X times X by θ (x1 x2) =

int intkX (xprime

1 xprime2)dp(xprime

1|x1)dp(xprime2|x2)forallx1 x2 isin

X times X and assume that θ is included in the tensor RKHS HX otimes HX 4 The

4 The tensor RKHS HX otimes HX is the RKHS of a product kernel kXtimesX on X times X de-fined as kXtimesX ((xa xb) (xc xd )) = kX (xa xc)kX (xb xd ) forall(xa xb) (xc xd ) isin X times X Thisspace HX otimes HX consists of smooth functions on X times X if the kernel kX is smooth (egif kX is gaussian see section 4 of Steinwart amp Christmann 2008) In this case we caninterpret this assumption as requiring that θ be smooth as a function on X times X

The function θ can be written as the inner product between the kernel means ofthe conditional distributions θ (x1 x2) = 〈mp(middot|x1 )

mp(middot|x2 )〉HX

where mp(middot|x)= int

kX (middot xprime)dp(xprime|x) Therefore the assumption may be further seen as requiring that the mapx rarr mp(middot|x)

be smooth Note that while similar assumptions are common in the litera-ture on kernel mean embeddings (eg theorem 5 of Fukumizu et al 2013) we may relaxthis assumption by using approximate arguments in learning theory (eg theorems 22and 23 of Eberts amp Steinwart 2013) This analysis remains a topic for future research

404 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

estimator mQ equation 52 then satisfies

EXprime1Xprime

n[mQ minus mQ2

HX]

lensum

i=1

w2i (EXprime

i[kX (Xprime

i Xprimei )] minus EXprime

i Xprimei[kX (Xprime

i Xprimei )]) (53)

+ mP minus mP2HX

θHX otimesHX (54)

where Xprimei sim p(middot|Xi ) and Xprime

i is an independent copy of Xprimei

From theorem 1 we can make the following observations First thesecond term equation 54 of the upper bound shows that the error of theestimator equation 52 is likely to be large if the given estimate equation 51has large error mP minus mP2

HX which is reasonable to expect

Second the first term equation 53 shows that the error of equation52 can be large if the distribution of X prime

i (ie p(middot|Xi)) has large varianceFor example suppose X prime

i = f (Xi) + εi where f X rarr X is some mappingand εi is a random variable with mean 0 Let kX be the gaussian ker-nel kX (x xprime) = exp(minusx minus xprime22α) for some α gt 0 Then EX prime

i[kX (X prime

i X primei )] minus

EX primei X

primei[kX (X prime

i X primei )] increases from 0 to 1 as the variance of εi (ie the vari-

ance of X primei ) increases from 0 to infinity Therefore in this case equation

53 is upper-bounded at worst bysumn

i=1 w2i Note that EX prime

i[kX (X prime

i X primei )] minus

EX primei X

primei[kX (X prime

i X primei )] is always nonnegative5

511 Effective Sample Size Now let us assume that the kernel kX isbounded there is a constant C gt 0 such that supxisinX kX (x x) lt C Thenthe inequality of theorem 1 can be further bounded as

EX prime1X

primen[mQ minusmQ2

HX] le 2C

nsumi=1

w2i +mP minusmP2

HXθHX otimesHX

(55)

This bound shows that two quantities are important in the estimateequation 51 (1) the sum of squared weights

sumni=1 w2

i and (2) the errormP minus mP2

HX In other words the error of equation 52 can be large if the

quantitysumn

i=1 w2i is large regardless of the accuracy of equation 51 as an

estimator of mP In fact the estimator of the form 51 can have largesumn

i=1 w2i

even when mP minus mP2HX

is small as shown in section 61

5To show this it is sufficient to prove thatintint

kX (x x)dP(x)dP(x) le intkX (x x)dP(x)

for any probability P This can be shown as followsintint

kX (x x)dP(x)dP(x) = intint 〈kX (middot x)

kX (middot x)〉HXdP(x)dP(x) le intint radic

kX (x x)radic

kX (x x)dP(x)dP(x) le intkX (x x)dP(x) Here we

used the reproducing property the Cauchy-Schwartz inequality and Jensenrsquos inequality

Filtering with State-Observation Examples 405

Figure 3 An illustration of the sampling procedure with (right) and without(left) the resampling algorithm The left panel corresponds to the kernel meanestimators equations 51 and 52 in section 51 and the right panel correspondsto equations 56 and 57 in section 52

The inverse of the sum of the squared weights 1sumn

i=1 w2i can be in-

terpreted as the effective sample size (ESS) of the empirical kernel meanequation 51 To explain this suppose that the weights are normalizedsumn

i=1 wi = 1 Then ESS takes its maximum n when the weights are uniformw1 = middot middot middot wn = 1n It becomes small when only a few samples have largeweights (see the left side in Figure 3) Therefore the bound equation 55can be interpreted as follows To make equation 52 a good estimator ofmQ we need to have equation 51 such that the ESS is large and the errormP minus mPH is small Here we borrowed the notion of ESS from the litera-ture on particle methods in which ESS has also been played an importantrole (see section 253 of Liu 2001 and section 35 of Doucet amp Johansen2011)

52 Role of Resampling Based on these arguments we explain how theresampling step in section 42 works as a preprocessing step for the samplingprocedure Consider mP in equation 51 as an estimate equation 46 givenby the correction step at time t minus 1 Then we can think of mQ equation 52as an estimator of the kernel mean equation 45 of the prior without theresampling step

The resampling step is application of kernel herding to mP to obtainsamples X1 Xn which provide a new estimate of mP with uniformweights

mP = 1n

nsumi=1

kX (middot Xi) (56)

The subsequent prediction step is to generate a sample X primei sim p(middot|Xi) for each

Xi (i = 1 n) and estimate mQ as

406 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

mQ = 1n

nsumi=1

kX (middot X primei ) (57)

Theorem 1 gives the following bound for this estimator that corresponds toequation 55

EX prime1X

primen[mQ minus mQ2

HX] le 2C

n+ mP minus mP2

HθHX otimesHX (58)

A comparison of the upper bounds of equations 55 and 58 implies thatthe resampling step is beneficial when

sumni=1 w2

i is large (ie the ESS is small)and mP minus mPHX

is small The condition on mP minus mPHXmeans that the

loss by kernel herding (in terms of the RKHS distance) is small This impliesmP minus mPHX

asymp mP minus mPHX so the second term of equation 58 is close

to that of equation 55 On the other hand the first term of equation 58 willbe much smaller than that of equation 55 if

sumni=1 w2

i 1n In other wordsthe resampling step improves the accuracy of the sampling procedure byincreasing the ESS of the kernel mean estimate mP This is illustrated inFigure 3

The above observations lead to the following procedures

521 When to Apply Resampling Ifsumn

i=1 w2i is not large the gain by

the resampling step will be small Therefore the resampling algorithmshould be applied when

sumni=1 w2

i is above a certain threshold say 2n Thesame strategy has been commonly used in particle methods (see Doucet ampJohansen 2011)

Also the bound equation 53 of theorem 1 shows that resampling is notbeneficial if the variance of the conditional distribution p(middot|x) is very small(ie if state transition is nearly deterministic) In this case the error of thesampling procedure may increase due to the loss mP minus mPHX

caused bykernel herding

522 Reduction of Computational Cost Algorithm 2 generates n samplesX1 Xn with time complexity O(n3) Suppose that the first samplesX1 X where lt n already approximate mP well 1

sumi=1 kX (middot Xi) minus

mPHXis small We do not then need to generate the rest of samples

X+1 Xn we can make n samples by copying the samples n times(suppose n can be divided by for simplicity say n = 2) Let X1 Xn

denote these n samples Then 1

sumi=1 kX (middot Xi) = 1

n

sumni=1 kX (middot Xi) by defi-

nition so 1n

sumni=1 kX (middot Xi) minus mPHX

is also small This reduces the time

complexity of algorithm 2 to O(n2)One might think that it is unnecessary to copy n times to make n

samples This is not true however Suppose that we just use the first

Filtering with State-Observation Examples 407

samples to define mP = 1

sumi=1 kX (middot Xi) Then the first term of equation

58 becomes 2C which is larger than 2Cn of n samples This differenceinvolves sampling with the conditional distribution X prime

i sim p(middot|Xi) If we usejust the samples sampling is done times If we use the copied n samplessampling is done n times Thus the benefit of making n samples comesfrom sampling with the conditional distribution many times This matchesthe bound of theorem 1 where the first term involves the variance of theconditional distribution

53 Convergence Rates for Resampling Our resampling algorithm (seealgorithm 2) is an approximate version of kernel herding in section 35algorithm 2 searches for the solutions of the update equations 36 and 37from a finite set X1 Xn sub X not from the entire space X Thereforeexisting theoretical guarantees for kernel herding (Chen et al 2010 Bachet al 2012) do not apply to algorithm 2 Here we provide a theoreticaljustification

531 Generalized Version We consider a slightly generalized versionshown in algorithm 4 It takes as input (1) a kernel mean estimator mP of akernel mean mP (2) candidate samples Z1 ZN and (3) the number ofresampling It then outputs resampling samples X1 X isin Z1 ZNwhich form a new estimator mP = 1

sumi=1 kX (middot Xi) Here N is the number

of the candidate samplesAlgorithm 4 searches for solutions of the update equations 36 and 37

from the candidate set Z1 ZN Note that here these samples Z1 ZNcan be different from those expressing the estimator mP If they are thesamemdashthe estimator is expressed as mP = sumn

i=1 wtik(middot Xi) with n = N andXi = Zi (i = 1 n)mdashthen algorithm 4 reduces to algorithm 2 In facttheorem 2 allows mP to be any element in the RKHS

532 Convergence Rates in Terms of N and Algorithm 4 gives the newestimator mP of the kernel mean mP The error of this new estimatormP minus mPHX

should be close to that of the given estimator mP minus mPHX

408 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Theorem 2 guarantees this In particular it provides convergence rates ofmP minus mPHX

approaching mP minus mPHX as N and go to infinity This

theorem follows from theorem 3 in appendix B which holds under weakerassumptions

Theorem 2 Let mP be the kernel mean of a distribution P and mP be any element inthe RKHS HX Let Z1 ZN be an iid sample from a distribution with densityq Assume that P has a density function p such that supxisinX p(x)q (x) lt infin LetX1 X be samples given by algorithm 4 applied to mP with candidate samplesZ1 ZN Then for mP = 1

sumi=1 k(middot Xi ) we have

mP minus mP2HX

= (mP minus mPHX+ Op(Nminus12))2 + O

(ln

) (N rarr infin) (59)

Our proof in appendix B relies on the fact that kernel herding can beseen as the Frank-Wolfe optimization method (Bach et al 2012) Indeedthe error O(ln ) in equation 59 comes from the optimization error of theFrank-Wolfe method after iterations (Freund amp Grigas 2014 bound 32)The error Op(N

minus12) is due to the approximation of the solution space by afinite set Z1 ZN These errors will be small if N and are large enoughand the error of the given estimator mP minus mPHX

is relatively large Thisis formally stated in corollary 1 below

Theorem 2 assumes that the candidate samples are iid with a density qThe assumption supxisinX p(x)q(x) lt infin requires that the support of q con-tains that of p This is a formal characterization of the explanation in section42 that the samples X1 XN should cover the support of P sufficientlyNote that the statement of theorem 2 also holds for non-iid candidatesamples as shown in theorem 3 of appendix B

533 Convergence Rates as mP Goes to mP Theorem 2 provides conver-gence rates when the estimator mP is fixed In corollary 1 below we let mPapproach mP and provide convergence rates for mP of algorithm 4 approach-ing mP This corollary directly follows from theorem 2 since the constantterms in Op(N

minus12) and O(ln ) in equation 59 do not depend on mPwhich can be seen from the proof in section B

Corollary 1 Assume that P and Z1 ZN satisfy the conditions in theorem 2for all N Let m(n)

P be an estimator of mP such that m(n)P minus mPHX

= Op(nminusb) as

n rarr infin for some constant b gt 06 Let N = = n2b Let X(n)1 X(n)

be samples

6Here the estimator m(n)P and the candidate samples Z1 ZN can be dependent

Filtering with State-Observation Examples 409

given by algorithm 4 applied to m(n)P with candidate samples Z1 ZN Then

for m(n)P = 1

sumi=1 kX (middot X(n)

i ) we have

m(n)P minus mPHX

= Op(nminusb) (n rarr infin) (510)

Corollary 1 assumes that the estimator m(n)

P converges to mP at a rateOp(n

minusb) for some constant b gt 0 Then the resulting estimator m(n)

P by algo-rithm 4 also converges to mP at the same rate O(nminusb) if we set N = = n2bThis implies that if we use sufficiently large N and the errors Op(N

minus12)

and O(ln ) in equation 59 can be negligible as stated earlier Note thatN = = n2b implies that N and can be smaller than n since typically wehave b le 12 (b = 12 corresponds to the convergence rates of parametricmodels) This provides a support for the discussion in section 52 (reductionof computational cost)

534 Convergence Rates of Sampling after Resampling We can derive con-vergence rates of the estimator mQ equation 57 in section 52 Here we con-sider the following construction of mQ as discussed in section 52 (reductionof computational cost) First apply algorithm 4 to m(n)

P and obtain resam-pling samples X (n)

1 X (n) isin Z1 ZN Then copy these samples n

times and let X (n)

1 X (n)n be the resulting times n samples Finally

sample with the conditional distribution Xprime(n)i sim p(middot|Xi) (i = 1 n)

and define

m(n)

Q = 1n

nsumi=1

kX (middot Xprime(n)i ) (511)

The following corollary is a consequence of corollary 1 theorem 1 andthe bound equation 58 Note that theorem 1 obtains convergence in expec-tation which implies convergence in probability

Corollary 2 Let θ be the function defined in theorem 1 and assume θ isin HX otimes HX Assume that P and Z1 ZN satisfy the conditions in theorem 2 for all N Letm(n)

P be an estimator of mP such that m(n)P minus mPHX

= Op(nminusb) as n rarr infin forsome constant b gt 0 Let N = = n2b Then for the estimator m(n)

Q defined asequation 511 we have

m(n)Q minus mQHX

= Op(nminusmin(b12)) (n rarr infin)

Suppose b le 12 which holds with basically any nonparametric esti-mators Then corollary 2 shows that the estimator m(n)

Q achieves the sameconvergence rate as the input estimator m(n)

P Note that without resampling

410 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

the rate becomes Op(

radicsumni=1(w

(n)i )2 + nminusb) where the weights are given by

the input estimator m(n)

P = sumni=1 w

(n)i kX (middot X (n)

i ) (see the bound equation55) Thanks to resampling (the square root of) the sum of the squaredweights in the case of corollary 2 becomes 1

radicn le 1

radicn which is

usually smaller thanradicsumn

i=1(w(n)i )2 and is faster than or equal to Op(n

minusb)This shows the merit of resampling in terms of convergence rates (see alsothe discussions in section 52)

54 Consistency of the Overall Procedure Here we show the consis-tency of the overall procedure in KMCF This is based on corollary 2 whichshows the consistency of the resampling step followed by the predictionstep and on theorem 5 of Fukumizu et al (2013) which guarantees theconsistency of kernel Bayesrsquo rule in the correction step Thus we considerthree steps in the following order resampling prediction and correctionMore specifically we show the consistency of the estimator equation 46of the posterior kernel mean at time t given that the one at time t minus 1 isconsistent

To state our assumptions we will need the following functions θpos Y times Y rarr R θobs X times X rarr R and θtra X times X rarr R

θpos(y y) =intint

kX (xt xt )dp(xt |y1tminus1 yt = y)dp(xt |y1tminus1 yt = y)

(512)

θobs(x x) =intint

kY (yt yt )dp(yt |xt = x)dp(yt |xt = x) (513)

θtra(x x) =intint

kX (xt xt )dp(xt |xtminus1 = x)dp(xt |xtminus1 = x) (514)

These functions contain the information concerning the distributions in-volved In equation 512 the distribution p(xt |y1tminus1 yt = y) denotes theposterior of the state at time t given that the observation at time t is yt = ySimilarly p(xt |y1tminus1 yt = y) is the posterior at time t given that the observa-tion is yt = y In equation 513 the distributions p(yt |xt = x) and p(yt |xt = x)

denote the observation model when the state is xt = x or xt = x respectivelyIn equation 514 the distributions p(xt |xtminus1 = x) and p(xt |xtminus1 = x) denotethe transition model with the previous state given by xtminus1 = x or xtminus1 = xrespectively

For simplicity of presentation we consider here N = = n for the resam-pling step Below denote by F otimes G the tensor product space of two RKHSsF and G

Corollary 3 Let (X1 Y1) (Xn Yn) be an iid sample with a joint densityp(x y) = p(y|x)q (x) where p(y|x) is the observation model Assume that the

Filtering with State-Observation Examples 411

posterior p(xt|y1t) has a density p and that supxisinX p(x)q (x) lt infin Assumethat the functions defined by equations 512 to 514 satisfy θpos isin HY otimes HY θobs isin HX otimes HX and θtra isin HX otimes HX respectively Suppose that mxtminus1|y1tminus1

minusmxtminus1|y1tminus1

HXrarr 0 as n rarr infin in probability Then for any sufficiently slow decay

of regularization constants εn and δn of algorithm 1 we have

mxt |y1tminus mxt |y1t

HXrarr 0 (n rarr infin)

in probability

Corollary 3 follows from theorem 5 of Fukumizu et al (2013) and corol-lary 2 The assumptions θpos isin HY otimes HY and θobs isin HX otimes HX are due totheorem 5 of Fukumizu et al (2013) for the correction step while the as-sumption θtra isin HX otimes HX is due to theorem 1 for the prediction step fromwhich corollary 2 follows As we discussed in note 4 of section 51 these es-sentially assume that the functions θpos θobs and θtra are smooth Theorem 5of Fukumizu et al (2013) also requires that the regularization constantsεn δn of kernel Bayesrsquo rule should decay sufficiently slowly as the samplesize goes to infinity (εn δn rarr 0 as n rarr infin) (For details see sections 52 and62 in Fukumizu et al 2013)

It would be more interesting to investigate the convergence rates of theoverall procedure However this requires a refined theoretical analysis ofkernel Bayesrsquo rule which is beyond the scope of this letter This is becausecurrently there is no theoretical result on convergence rates of kernel Bayesrsquorule as an estimator of a posterior kernel mean (existing convergence resultsare for the expectation of function values see theorems 6 and 7 in Fukumizuet al 2013) This remains a topic for future research

6 Experiments

This section is devoted to experiments In section 61 we conduct basicexperiments on the prediction and resampling steps before going on to thefiltering problem Here we consider the problem described in section 5 Insection 62 the proposed KMCF (see algorithm 3) is applied to syntheticstate-space models Comparisons are made with existing methods applica-ble to the setting of the letter (see also section 2) In section 63 we applyKMCF to the real problem of vision-based robot localization

In the following N(μ σ 2) denotes the gaussian distribution with meanμ isin R and variance σ 2 gt 0

61 Sampling and Resampling Procedures The purpose here is to seehow the prediction and resampling steps work empirically To this end weconsider the problem described in section 5 with X = R (see section 51 fordetails) Specifications of the problem are described below

412 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

We will need to evaluate the errors mP minus mPHXand mQ minus mQHX

sowe need to know the true kernel means mP and mQ To this end we definethe distributions and the kernel to be gaussian this allows us to obtainanalytic expressions for mP and mQ

611 Distributions and Kernel More specifically we define the marginalP and the conditional distribution p(middot|x) to be gaussian P = N(0 σ 2

P ) andp(middot|x) = N(x σ 2

cond) Then the resulting Q = intp(middot|x)dP(x) also becomes

gaussian Q = N(0 σ 2P + σ 2

cond) We define kX to be the gaussian kernelkX (x xprime) = exp(minus(x minus xprime)22γ 2) We set σP = σcond = γ = 01

612 Kernel Means Due to the convolution theorem of gaussian func-tions the kernel means mP = int

kX (middot x)dP(x) and mQ = intkX (middot x)dQ(x)

can be analytically computed mP(x) =radic

γ 2

σ 2+γ 2 exp(minus x2

2(γ 2+σ 2P )

) mQ(x) =radicγ 2

(σ 2+σ 2cond+γ 2 )

exp(minus x2

2(σ 2P+σ 2

cond+γ 2 ))

613 Empirical Estimates We artificially defined an estimate mP =sumni=1 wikX (middot Xi) as follows First we generated n = 100 samples

X1 X100 from a uniform distribution on [minusA A] with some A gt 0 (spec-ified below) We computed the weights w1 wn by solving an optimiza-tion problem

minwisinRn

nsum

i=1

wikX (middot Xi) minus mP2H + λw2

and then applied normalization so thatsumn

i=1 wi = 1 Here λ gt 0 is a regular-ization constant which allows us to control the trade-off between the errormP minus mP2

HXand the quantity

sumni=1 w2

i = w2 If λ is very small the re-sulting mP becomes accurate mP minus mP2

HXis small but has large

sumni=1 w2

i If λ is large the error mP minus mP2

HXmay not be very small but

sumni=1 w2

i

becomes small This enables us to see how the error mQ minus mQ2HX

changesas we vary these quantities

614 Comparison Given mP = sumni=1 wikX (middot Xi) we wish to estimate the

kernel mean mQ We compare three estimators

bull woRes Estimate mQ without resampling Generate samples X primei sim

p(middot|Xi) to produce the estimate mQ = sumni=1 wikX (middot X prime

i ) This corre-sponds to the estimator discussed in section 51

bull Res-KH First apply the resampling algorithm of algorithm 2 to mPyielding X1 Xn Then generate X prime

i sim p(middot|Xi) for each Xi giving

Filtering with State-Observation Examples 413

Figure 4 Results of the experiments from section 61 (Top left and right)Sample-weight pairs of mP = sumn

i=1 wikX (middot Xi) and mQ = sumni=1 wik(middot X prime

i ) (Mid-dle left and right) Histogram of samples X1 Xn generated by algorithm2 and that of samples X prime

1 X primen from the conditional distribution (Bottom

left and right) Histogram of samples generated with multinomial resamplingafter truncating negative weights and that of samples from the conditionaldistribution

the estimate mQ = 1n

sumni=1 k(middot X prime

i ) This is the estimator discussed insection 52

bull Res-Trunc Instead of algorithm 2 first truncate negative weights inw1 wn to be 0 and apply normalization to make the sum of theweights to be 1 Then apply the multinomial resampling algorithmof particle methods and estimate mQ as Res-KH

615 Demonstration Before starting quantitative comparisons wedemonstrate how the above estimators work Figure 4 shows demonstra-tion results with A = 1 First note that for mP = sumn

i=1 wik(middot Xi) samplesassociated with large weights are located around the mean of P as thestandard deviation of P is relatively small σP = 01 Note also that some

414 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

of the weights are negative In this example the error of mP is very smallmP minus mP2

HX= 849e minus 10 while that of the estimate mQ given by woRes is

mQ minus mQ2HX

= 0125 This shows that even if mP minus mP2HX

is very small

the resulting mQ minus mQ2HX

may not be small as implied by theorem 1 andthe bound equation 55

We can observe the following First algorithm 2 successfully discardedsamples associated with very small weights Almost all the generatedsamples X1 Xn are located in [minus2σP 2σP] where σP is the standarddeviation of P The error is mP minus mP2

HX= 474e minus 5 which is greater

than mP minus mP2HX

This is due to the additional error caused by the re-sampling algorithm Note that the resulting estimate mQ is of the errormQ minus mQ2

HX= 000827 This is much smaller than the estimate mQ by

woRes showing the merit of the resampling algorithmRes-Trunc first truncated the negative weights in w1 wn Let us

see the region where the density of P is very smallmdashthe region outside[minus2σP 2σP] We can observe that the absolute values of weights are verysmall in this region Note that there exist positive and negative weightsThese weights maintain balance such that the amounts of positive and neg-ative values are almost the same Therefore the truncation of the negativeweights breaks this balance As a result the amount of the positive weightssurpasses the amount needed to represent the density of P This can be seenfrom the histogram for Res-Trunc some of the samples X1 Xn gener-ated by Res-Trunc are located in the region where the density of P is verysmall Thus the resulting error mP minus mP2

HX= 00538 is much larger than

that of Res-KH This demonstrates why the resampling algorithm of parti-cle methods is not appropriate for kernel mean embeddings as discussedin section 42

616 Effects of the Sum of Squared Weights The purpose here is to see howthe error mQ minus mQ2

HXchanges as we vary the quantity

sumni=1 w2

i (recall that

the bound equation 55 indicates that mQ minus mQ2HX

increases assumn

i=1 w2i

increases) To this end we made mP = sumni=1 wikX (middot Xi) for several values of

the regularization constant λ as described above For each λ we constructedmP and estimated mQ using each of the three estimators above We repeatedthis 20 times for each λ and averaged the values of mP minus mP2

HXsumn

i=1 w2i

and the errors mQ minus mQ2HX

by the three estimators Figure 5 shows theseresults where the both axes are in the log scale Here we used A = 5 forthe support of the uniform distribution7 The results are summarized asfollows

7This enables us to maintain the values for mP minus mP2HX

in almost the same amountwhile changing the values for

sumni=1 w2

i

Filtering with State-Observation Examples 415

Figure 5 Results of synthetic experiments for the sampling and resampling pro-cedure in section 61 Vertical axis errors in the squared RKHS norm Horizontalaxis values of

sumni=1 w2

i for different mP Black the error of mP (mP minus mP2HX

)Blue green and red the errors on mQ by woRes Res-KH and Res-Truncrespectively

bull The error of woRes (blue) increases proportionally to the amount ofsumni=1 w2

i This matches the bound equation 55bull The error of Res-KH is not affected by

sumni=1 w2

i Rather it changes inparallel with the error of mP This is explained by the discussions insection 52 on how our resampling algorithm improves the accuracyof the sampling procedure

bull Res-Trunc is worse than Res-KH especially for largesumn

i=1 w2i This

is also explained with the bound equation 58 Here mP is the onegiven by Res-Trunc so the error mP minus mPHX

can be large due tothe truncation of negative weights as shown in the demonstrationresults This makes the resulting error mQ minus mQHX

large

Note that mP and mQ are different kernel means so it can happen thatthe errors mQ minus mQHX

by Res-KH are less than mp minus mPHX as in

Figure 5

416 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

62 Filtering with Synthetic State-Space Models Here we applyKMCF to synthetic state-space models Comparisons were made with thefollowing methods

kNN-PF (Vlassis et al 2002) This method uses k-NN-based condi-tional density estimation (Stone 1977) for learning the observation modelFirst it estimates the conditional density of the inverse direction p(x|y)

from the training sample (XiYi) The learned conditional density is thenused as an alternative for the likelihood p(yt |xt ) this is a heuristic todeal with high-dimensional yt Then it applies particle filter (PF) basedon the approximated observation model and the given transition modelp(xt |xtminus1)

GP-PF (Ferris et al 2006) This method learns p(yt |xt ) from (XiYi)with gaussian process (GP) regression Then particle filter is applied basedon the learned observation model and the transition model We used theopen source code for GP-regression in this experiment so comparison incomputational time is omitted for this method8

KBR filter (Fukumizu et al 2011 2013) This method is also based onkernel mean embeddings as is KMCF It applies kernel Bayesrsquo rule (KBR) inposterior estimation using the joint sample (XiYi) This method assumesthat there also exist training samples for the transition model Thus inthe following experiments we additionally drew training samples for thetransition model Fukumizu et al (2011 2013) showed that this methodoutperforms extended and unscented Kalman filters when a state-spacemodel has strong nonlinearity (in that experiment these Kalman filterswere given the full knowledge of a state-space model) We use this methodas a baseline

We used state-space models defined in Table 2 where ldquoSSMrdquo stands forldquostate-space modelrdquo In Table 2 ut denotes a control input at time t vtand wt denotes independent gaussian noise vt wt sim N(0 1) Wt denotes10-dimensional gaussian noise Wt sim N(0 I10) We generated each controlut randomly from the gaussian distribution N(0 1)

The state and observation spaces for SSMs 1a 1b 2a 2b 4a 4b aredefined as X = Y = R for SSMs 3a 3b X = RY = R

10 The models inSSMs 1a 2a 3a 4a and SSMs 1b 2b 3b 4b with the same number (eg1a and 1b) are almost the same the difference is whether ut exists in thetransition model Prior distributions for the initial state x1 for SSMs 1a 1b2a 2b 3a 3b are defined as pinit = N(0 1(1 minus 092)) and those for 4a4b are defined as a uniform distribution on [minus3 3]

SSMs 1a and 1b are linear gaussian models SSMs 2a and 2b are the so-called stochastic volatility models Their transition models are the same asthose of SSMs 1a and 1b The observation model has strong nonlinearityand the noise wt is multiplicative SSMs 3a and 3b are almost the same as

8httpwwwgaussianprocessorggpmlcodematlabdoc

Filtering with State-Observation Examples 417

Table 2 State-Space Models for Synthetic Experiments

SSM Transition Model Observation Model

1a xt = 09xtminus1 + vt yt = xt + wt

1b xt = 09xtminus1 + 1radic2(ut + vt ) yt = xt + wt

2a xt = 09xtminus1 + vt yt = 05 exp(xt2)wt

2b xt = 09xtminus1 + 1radic2(ut + vt ) yt = 05 exp(xt2)wt

3a xt = 09xtminus1 + vt yt = 05 exp(xt2)Wt

3b xt = 09xtminus1 + 1radic2(ut + vt ) yt = 05 exp(xt2)Wt

4a at = xtminus1 + radic2vt bt = xt + wt

xt =

at (if |at | le 3)

minus3 (otherwise)yt =

bt (if |bt | le 3)

bt minus 6bt|bt | (otherwise)

4b at = xtminus1 + ut + vt bt = xt + wt

xt =

at (if |at | le 3)

minus3 (otherwise)yt =

bt (if |bt | le 3)

bt minus 6bt|bt | (otherwise)

SSMs 2a and 2b The difference is that the observation yt is 10-dimensionalas Wt is 10-dimensional gaussian noise SSMs 4a and 4b are more complexthan the other models Both the transition and observation models havestrong nonlinearities states and observations located around the edges ofthe interval [minus3 3] may abruptly jump to distant places

For each model we generated the training samples (XiYi)ni=1 by sim-

ulating the model Test data (xt yt )Tt=1 were also generated by indepen-

dent simulation (recall that xt is hidden for each method) The length ofthe test sequence was set as T = 100 We fixed the number of particles inkNN-PF and GP-PF to 5000 in primary experiments we did not observeany improvements even when more particles were used For the same rea-son we fixed the size of transition examples for KBR filter to 1000 Eachmethod estimated the ground-truth states x1 xT by estimating the pos-terior means

intxt p(xt |y1t )dxt (t = 1 T ) The performance was evaluated

with RMSE (root mean squared errors) of the point estimates defined as

RMSE =radic

1T

sumTt=1(xt minus xt )

2 where xt is the point estimateFor KMCF and KBR filter we used gaussian kernels for each of X and Y

(and also for controls in KBR filter) We determined the hyperparameters ofeach method by two-fold cross-validation by dividing the training data intotwo sequences The hyperparameters in the GP-regressor for PF-GP wereoptimized by maximizing the marginal likelihood of the training data Toreduce the costs of the resampling step of KMCF we used the method dis-cussed in section 52 with = 50 We also used the low-rank approximationmethod (see algorithm 5) and the subsampling method (see algorithm 6)in appendix C to reduce the computational costs of KMCF Specifically we

418 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Figure 6 RMSE of the synthetic experiments in section 62 The state-spacemodels of these figures have no control in their transition models

used r = 10 20 (rank of low-rank matrices) for algorithm 5 (described asKMCF-low10 and KMCF-low20 in the results below) r = 50 100 (numberof subsamples) for algorithm 6 (described as KMCF-sub50 and KMCF-sub100) We repeated experiments 20 times for each of different trainingsample size n

Figure 6 shows the results in RMSE for SSMs 1a 2a 3a 4a and Figure 7shows those for SSMs 1b 2b 3b 4b Figure 8 describes the results incomputational time for SSMs 1a and 1b the results for the other models aresimilar so we omit them We do not show the results of KMCF-low10 inFigures 6 and 7 since they were numerically unstable and gave very largeRMSEs

GP-PF performed the best for SSMs 1a and 1b This may be becausethese models fit the assumption of GP-regression as their noise is addi-tive gaussian For the other models however GP-PF performed poorly

Filtering with State-Observation Examples 419

Figure 7 RMSE of synthetic experiments in section 62 The state-space modelsof these figures include control ut in their transition models

Figure 8 Computation time of synthetic experiments in section 62 (Left) SSM1a (Right) SSM 1b

420 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

the observation models of these models have strong nonlinearities and thenoise is not additive gaussian For these models KMCF performed thebest or competitively with the other methods This indicates that KMCFsuccessfully exploits the state-observation examples (XiYi)n

i=1 in dealingwith the complicated observation models Recall that our focus has beenon situations where the relations between states and observations are socomplicated that the observation model is not known the results indicatethat KMCF is promising for such situations On the other hand the KBRfilter performed worse than KMCF for most of the models KBF filter alsouses kernel Bayesrsquo rule as KMCF The difference is that KMCF makes use ofthe transition models directly by sampling while the KBR filter must learnthe transition models from training data for state transitions This indicatesthat the incorporation of the knowledge expressed in the transition model isvery important for the filtering performance This can also be seen by com-paring Figures 6 and 7 The performance of the methods other than KBRfilter improved for SSMs 1b 2b 3b 4b compared to the performance forthe corresponding models in SSMs 1a 2a 3a 4a Recall that SSMs 1b2b 3b 4b include control ut in their transition models The informationof control input is helpful for filtering in general Thus the improvementssuggest that KMCF kNN-PF and GP-PF successfully incorporate the in-formation of controls they achieve this by sampling with p(xt |xtminus1 ut ) Onthe other hand KBF filter must learn the transition model p(xt |xtminus1 ut ) thiscan be harder than learning the transition model p(xt |xtminus1) which has nocontrol input

We next compare computation time (see Figure 8) KMCF was competi-tive or even slower than the KBR filter This is due to the resampling step inKMCF The speed-up methods (KMCF-low10 KMCF-low20 KMCF-sub50and KMCF-sub100) successfully reduced the costs of KMCF KMCF-low10and KMCF-low20 scaled linearly to the sample size n this matches the factthat algorithm 5 reduces the costs of kernel Bayesrsquo Rule to O(nr2) The costsof KMCF-sub50 and KMCF-sub100 remained almost the same over the dif-ferent sample sizes This is because they reduce the sample size itself fromn to r so the costs are reduced to O(r3) (see algorithm 6) KMCF-sub50 andKMCF-sub100 are competitive to kNN-PF which is fast as it only needskNN searches to deal with the training sample (XiYi)n

i=1 In Figures 6and 7 KMCF-low20 and KMCF-sub100 produced the results competitiveto KMCF for SSMs 1a 2a 4a 1b 2b 4b Thus for these models suchmethods reduce the computational costs of KMCF without losing muchaccuracy KMCF-sub50 was slightly worse than KMCF-100 This indicatesthat the number of subsamples cannot be reduced to this extent if we wishto maintain accuracy For SSMs 3a and 3b the performance of KMCF-low20and KMCF-sub100 was worse than KMCF in contrast to the performancefor the other models The difference of SSMs 3a and 3b from the other mod-els is that the observation space is 10-dimensional Y = R

10 This suggeststhat if the dimension is high r needs to be large to maintain accuracy (recall

Filtering with State-Observation Examples 421

that r is the rank of low-rank matrices in algorithm 5 and the number ofsubsamples in algorithm 6) This is also implied by the experiments in thesection 63

63 Vision-Based Mobile Robot Localization We applied KMCF to theproblem of vision-based mobile robot localization (Vlassis et al 2002 Wolfet al 2005 Quigley et al 2010) We consider a robot moving in a buildingThe robot takes images with its vision camera as it moves Thus the visionimages form a sequence of observations y1 yT in time series each ytis an image The robot does not know its positions in the building wedefine state xt as the robotrsquos position at time t The robot wishes to estimateits position xt from the sequence of its vision images y1 yt This canbe done by filtering that is by estimating the posteriors p(xt |y1 yt )

(t = 1 T ) This is the robot localization problem It is fundamental inrobotics as a basis for more involved applications such as navigation andreinforcement learning (Thrun Burgard amp Fox 2005)

The state-space model is defined as follows The observation modelp(yt |xt ) is the conditional distribution of images given position which isvery complicated and considered unknown We need to assume position-image examples (XiYi)n

i=1 these samples are given in the data setdescribed below The transition model p(xt |xtminus1) = p(xt |xtminus1 ut ) is the con-ditional distribution of the current position given the previous one Thisinvolves a control input ut that specifies the movement of the robot In thedata set we use the control is given as odometry measurements Thus wedefine p(xt |xtminus1 ut ) as the odometry motion model which is fairly standardin robotics (Thrun et al 2005) Specifically we used the algorithm describedin table 56 of Thrun et al (2005) with all of its parameters fixed to 01 Theprior pinit of the initial position x1 is defined as a uniform distribution overthe samples X1 Xn in (XiYi)n

i=1As a kernel kY for observations (images) we used the spatial pyramid

matching kernel of Lazebnik et al (2006) This is a positive-definite kerneldeveloped in the computer vision community and is also fairly standardSpecifically we set the parameters of this kernel as suggested in Lazebniket al (2006) This gives a 4200-dimensional histogram for each image Wedefined the kernel kX for states (positions) as gaussian Here the state spaceis the four-dimensional space X = R

4 two dimensions for location and therest for the orientation of the robot9

The data set we used is the COLD database (Pronobis amp Caputo 2009)which is publicly available Specifically we used the data set Freiburg PartA Path 1 cloudy This data set consists of three similar trajectories of arobot moving in a building each of which provides position-image pairs(xt yt )T

t=1 We used two trajectories for training and validation and the

9We projected the robotrsquos orientation in [0 2π ] onto the unit circle in R2

422 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

rest for test We made state-observation examples (XiYi)ni=1 by randomly

subsampling the pairs in the trajectory for training Note that the difficultyof localization may depend on the time interval (ie the interval between tand t minus 1 in sec) Therefore we made three test sets (and training samplesfor state transitions in KBR filter) with different time intervals 227 sec(T = 168) 454 sec (T = 84) and 681 sec (T = 56)

In these experiments we compared KMCF with three methods kNN-PFKBR filter and the naive method (NAI) defined below For KBR filter wealso defined the gaussian kernel on the control ut that is on the differenceof odometry measurements at time t minus 1 and t The naive method (NAI)estimates the state xt as a point Xj in the training set (XiYi) such that thecorresponding observation Yj is closest to the observation yt We performedthis as a baseline We also used the spatial pyramid matching kernel forthese methods (for kNN-PF and NAI as a similarity measure of the nearestneighbors search) We did not compare with GP-PF since it assumes thatobservations are real vectors and thus cannot be applied to this problemstraightforwardly We determined the hyperparameters in each methodby cross-validation To reduce the cost of the resampling step in KMCFwe used the method discussed in section 52 with = 100 The low-rankapproximation method (see algorithm 5) and the subsampling method (seealgorithm 6) were also applied to reduce the computational costs of KMCFSpecifically we set r = 50 100 for algorithm 5 (described as KMCF-low50and KMCF-low100 in the results below) and r = 150 300 for algorithm 6(KMCF-sub150 and KMCF-sub300)

Note that in this problem the posteriors p(xt |y1t ) can be highly multi-modal This is because similar images appear in distant locations Thereforethe posterior mean

intxt p(xt |y1t )dxt is not appropriate for point estimation

of the ground-truth position xt Thus for KMCF and KBR filter we em-ployed the heuristic for mode estimation explained in section 44 For kNN-PF we used a particle with maximum weight for the point estimationWe evaluated the performance of each method by RMSE of location esti-mates We ran each experiment 20 times for each training set of differentsize

631 Results First we demonstrate the behaviors of KMCF with this lo-calization problem Figures 9 and 10 show iterations of KMCF with n = 400applied to the test data with time interval 681 sec Figure 9 illustrates itera-tions that produced accurate estimates while Figure 10 describes situationswhere location estimation is difficult

Figures 11 and 12 show the results in RMSE and computational timerespectively For all the results KMCF and that with the computational re-duction methods (KMCF-low50 KMCF-low100 KMCF-sub150 and KMCF-sub300) performed better than KBR filter These results show the bene-fit of directly manipulating the transition models with sampling KMCF

Filtering with State-Observation Examples 423

Figure 9 Demonstration results Each column corresponds to one iteration ofKMCF (Top) Prediction step histogram of samples for prior (Middle) Cor-rection step weighted samples for posterior The blue and red stems indi-cate positive and negative weights respectively The yellow ball representsthe ground-truth location xt and the green diamond the estimated one xt (Bottom) Resampling step histogram of samples given by the resampling step

was competitive with kNN-PF for the interval 227 sec note that kNN-PFwas originally proposed for the robot localization problem For the resultswith the longer time intervals (454 sec and 681 sec) KMCF outperformedkNN-PF

We next investigate the effect on KMCF of the methods to reduce com-putational cost The performance of KMCF-low100 and KMCF-sub300 iscompetitive with KMCF those of KMCF-low50 and KMCF-sub150 de-grade as the sample size increases Note that r = 50 100 for algorithm 5are larger than those in section 62 though the values of the sample size nare larger than those in section 62 Also note that the performance of KMCF-sub150 is much worse than KMCF-sub300 These results indicate that wemay need large values for r to maintain the accuracy for this localizationproblem Recall that the spatial pyramid matching kernel gives essentially ahigh-dimensional feature vector (histogram) for each observation Thus the

424 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Figure 10 Demonstration results (see also the caption of Figure 9) Here weshow time points where observed images are similar to those in distant placesSuch a situation often occurs at corners and makes location estimation difficult(a) The prior estimate is reasonable but the resulting posterior has modes indistant places This makes the location estimate (green diamond) far from thetrue location (yellow ball) (b) While the location estimate is very accuratemodes also appear at distant locations

observation space Y may be considered high-dimensional This supportsthe hypothesis in section 62 that if the dimension is high the computationalcost-reduction methods may require larger r to maintain accuracy

Finally let us look at the results in computation time (see Figure 12)The results are similar to those in section 62 Although the values for r arerelatively large algorithms 5 and 6 successfully reduced the computationalcosts of KMCF

7 Conclusion and Future Work

This letter has proposed kernel Monte Carlo filter a novel filtering methodfor state-space models We have considered the situation where the obser-vation model is not known explicitly or even parametrically and where

Filtering with State-Observation Examples 425

Figure 11 RMSE of the robot localization experiments in section 63 Panels a band c show the cases for time interval 227 sec 454 sec and 681 sec respectively

examples of the state-observation relation are given instead of the obser-vation model Our approach is based on the framework of kernel meanembeddings which enables us to deal with the observation model in adata-driven manner The resulting filtering method consists of the predic-tion correction and resampling steps all realized in terms of kernel meanembeddings Methodological novelties lie in the prediction and resamplingsteps Thus we analyzed their behaviors by deriving error bounds for theestimator of the prediction step The analysis revealed that the effectivesample size of a weighted sample plays an important role as in parti-cle methods This analysis also explained how our resampling algorithmworks We applied the proposed method to synthetic and real problemsconfirming the effectiveness of our approach

426 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Figure 12 Computation time of the localization experiments in section 63Panels a b and c show the cases for time interval 227 sec 454 sec and 681 secrespectively Note that the results show the runtime of each method

One interesting topic for future research would be parameter estimationfor the transition model We did not discuss this and assumed that param-eters if they exist are given and fixed If the state observation examples(XiYi)n

i=1 are given as a sequence from the state-space model then wecan use the state samples X1 Xn for estimating those parameters Oth-erwise we need to estimate the parameters based on test data This mightbe possible by exploiting approaches for parameter estimation in particlemethods (eg section IV in Cappe et al (2007))

Another important topic is the situation where the observation modelin the test and training phases is different As discussed in section 43 this

Filtering with State-Observation Examples 427

might be addressed by exploiting the framework of transfer learning (Panamp Yang 2010) This would require extension of kernel mean embeddingsto the setting of transfer learning since there has been no work in thisdirection We consider that such extension is interesting in its own right

Appendix A Proof of Theorem 1

Before going to the proof, we review some basic facts that are needed. Let $m_P = \int k_X(\cdot, x)\, dP(x)$ and $\hat{m}_P = \sum_{i=1}^n w_i k_X(\cdot, X_i)$. By the reproducing property of the kernel $k_X$, the following hold for any $f \in H_X$:

$$\langle m_P, f \rangle_{H_X} = \left\langle \int k_X(\cdot, x)\, dP(x), f \right\rangle_{H_X} = \int \langle k_X(\cdot, x), f \rangle_{H_X}\, dP(x) = \int f(x)\, dP(x) = E_{X \sim P}[f(X)], \quad (A.1)$$

$$\langle \hat{m}_P, f \rangle_{H_X} = \left\langle \sum_{i=1}^n w_i k_X(\cdot, X_i), f \right\rangle_{H_X} = \sum_{i=1}^n w_i f(X_i). \quad (A.2)$$

For any $f, g \in H_X$, we denote by $f \otimes g \in H_X \otimes H_X$ the tensor product of $f$ and $g$, defined as

$$f \otimes g(x_1, x_2) = f(x_1) g(x_2), \quad \forall x_1, x_2 \in \mathcal{X}. \quad (A.3)$$

The inner product of the tensor RKHS $H_X \otimes H_X$ satisfies

$$\langle f_1 \otimes g_1, f_2 \otimes g_2 \rangle_{H_X \otimes H_X} = \langle f_1, f_2 \rangle_{H_X} \langle g_1, g_2 \rangle_{H_X}, \quad \forall f_1, f_2, g_1, g_2 \in H_X. \quad (A.4)$$

Let $\{\phi_s\}_{s=1}^I \subset H_X$ be a complete orthonormal basis of $H_X$, where $I \in \mathbb{N} \cup \{\infty\}$. Assume $\theta \in H_X \otimes H_X$ (recall that this is an assumption of theorem 1). Then $\theta$ is expressed as

$$\theta = \sum_{s,t=1}^I \alpha_{st}\, \phi_s \otimes \phi_t \quad (A.5)$$

with $\sum_{s,t} |\alpha_{st}|^2 < \infty$ (see Aronszajn, 1950).
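As a quick numerical illustration of facts A.1 and A.2 (a self-contained sketch that is not part of the proof; the gaussian kernel, the test function $f = k_X(\cdot, 0.5)$, the distribution $P = \mathcal{N}(0,1)$, and all variable names are our own choices), the weighted-sample form of a kernel mean turns expectations of RKHS functions into weighted averages:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, c, n = 1.0, 0.5, 5000

# f is the RKHS function k_X(., c) for a gaussian kernel with bandwidth sigma
f = lambda x: np.exp(-(x - c) ** 2 / (2 * sigma ** 2))

# Empirical kernel mean of P = N(0, 1) with uniform weights w_i = 1/n
X = rng.standard_normal(n)
w = np.full(n, 1.0 / n)
weighted_avg = np.sum(w * f(X))          # <\hat m_P, f> via fact A.2

# Exact value of <m_P, f> = E_{X~P}[f(X)] for this gaussian case
exact = sigma / np.sqrt(1 + sigma ** 2) * np.exp(-c ** 2 / (2 * (1 + sigma ** 2)))

print(weighted_avg, exact)   # the two values agree up to Monte Carlo error
```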

Proof of Theorem 1. Recall that $\hat{m}_Q = \sum_{i=1}^n w_i k_X(\cdot, X'_i)$, where $X'_i \sim p(\cdot | X_i)$ $(i = 1, \dots, n)$. Then

$$E_{X'_1, \dots, X'_n}\left[ \| \hat{m}_Q - m_Q \|^2_{H_X} \right] = E_{X'_1, \dots, X'_n}\left[ \langle \hat{m}_Q, \hat{m}_Q \rangle_{H_X} - 2 \langle \hat{m}_Q, m_Q \rangle_{H_X} + \langle m_Q, m_Q \rangle_{H_X} \right]$$
$$= \sum_{i,j=1}^n w_i w_j E_{X'_i, X'_j}[k_X(X'_i, X'_j)] - 2 \sum_{i=1}^n w_i E_{X' \sim Q,\, X'_i}[k_X(X', X'_i)] + E_{X', \tilde{X}' \sim Q}[k_X(X', \tilde{X}')]$$
$$= \sum_{i \neq j} w_i w_j E_{X'_i, X'_j}[k_X(X'_i, X'_j)] + \sum_{i=1}^n w_i^2 E_{X'_i}[k_X(X'_i, X'_i)] - 2 \sum_{i=1}^n w_i E_{X' \sim Q,\, X'_i}[k_X(X', X'_i)] + E_{X', \tilde{X}' \sim Q}[k_X(X', \tilde{X}')], \quad (A.6)$$

where $\tilde{X}'$ denotes an independent copy of $X'$.

Recall that $Q = \int p(\cdot|x)\, dP(x)$ and $\theta(x, \tilde{x}) = \int\!\!\int k_X(x', \tilde{x}')\, dp(x'|x)\, dp(\tilde{x}'|\tilde{x})$. We can then rewrite terms in equation A.6 as

$$E_{X' \sim Q,\, X'_i}[k_X(X', X'_i)] = \int \left( \int\!\!\int k_X(x', x'_i)\, dp(x'|x)\, dp(x'_i|X_i) \right) dP(x) = \int \theta(x, X_i)\, dP(x) = E_{X \sim P}[\theta(X, X_i)],$$

$$E_{X', \tilde{X}' \sim Q}[k_X(X', \tilde{X}')] = \int\!\!\int \left( \int\!\!\int k_X(x', \tilde{x}')\, dp(x'|x)\, dp(\tilde{x}'|\tilde{x}) \right) dP(x)\, dP(\tilde{x}) = \int\!\!\int \theta(x, \tilde{x})\, dP(x)\, dP(\tilde{x}) = E_{X, \tilde{X} \sim P}[\theta(X, \tilde{X})].$$

Thus equation A.6 is equal to

$$\sum_{i=1}^n w_i^2 \left( E_{X'_i}[k_X(X'_i, X'_i)] - E_{X'_i, \tilde{X}'_i}[k_X(X'_i, \tilde{X}'_i)] \right) + \sum_{i,j=1}^n w_i w_j \theta(X_i, X_j) - 2 \sum_{i=1}^n w_i E_{X \sim P}[\theta(X, X_i)] + E_{X, \tilde{X} \sim P}[\theta(X, \tilde{X})]. \quad (A.7)$$

We can rewrite terms in equation A.7 as follows, using the facts A.1, A.2, A.3, A.4, and A.5:

$$\sum_{i,j} w_i w_j \theta(X_i, X_j) = \sum_{i,j} w_i w_j \sum_{s,t} \alpha_{st} \phi_s(X_i) \phi_t(X_j) = \sum_{s,t} \alpha_{st} \sum_i w_i \phi_s(X_i) \sum_j w_j \phi_t(X_j) = \sum_{s,t} \alpha_{st} \langle \hat{m}_P, \phi_s \rangle_{H_X} \langle \hat{m}_P, \phi_t \rangle_{H_X}$$
$$= \sum_{s,t} \alpha_{st} \langle \hat{m}_P \otimes \hat{m}_P, \phi_s \otimes \phi_t \rangle_{H_X \otimes H_X} = \langle \hat{m}_P \otimes \hat{m}_P, \theta \rangle_{H_X \otimes H_X},$$

$$\sum_i w_i E_{X \sim P}[\theta(X, X_i)] = \sum_i w_i E_{X \sim P}\left[ \sum_{s,t} \alpha_{st} \phi_s(X) \phi_t(X_i) \right] = \sum_{s,t} \alpha_{st} E_{X \sim P}[\phi_s(X)] \sum_i w_i \phi_t(X_i) = \sum_{s,t} \alpha_{st} \langle m_P, \phi_s \rangle_{H_X} \langle \hat{m}_P, \phi_t \rangle_{H_X}$$
$$= \sum_{s,t} \alpha_{st} \langle m_P \otimes \hat{m}_P, \phi_s \otimes \phi_t \rangle_{H_X \otimes H_X} = \langle m_P \otimes \hat{m}_P, \theta \rangle_{H_X \otimes H_X},$$

$$E_{X, \tilde{X} \sim P}[\theta(X, \tilde{X})] = E_{X, \tilde{X} \sim P}\left[ \sum_{s,t} \alpha_{st} \phi_s(X) \phi_t(\tilde{X}) \right] = \sum_{s,t} \alpha_{st} \langle m_P, \phi_s \rangle_{H_X} \langle m_P, \phi_t \rangle_{H_X} = \sum_{s,t} \alpha_{st} \langle m_P \otimes m_P, \phi_s \otimes \phi_t \rangle_{H_X \otimes H_X} = \langle m_P \otimes m_P, \theta \rangle_{H_X \otimes H_X}.$$

Thus equation A.7 is equal to

$$\sum_{i=1}^n w_i^2 \left( E_{X'_i}[k_X(X'_i, X'_i)] - E_{X'_i, \tilde{X}'_i}[k_X(X'_i, \tilde{X}'_i)] \right) + \langle \hat{m}_P \otimes \hat{m}_P, \theta \rangle_{H_X \otimes H_X} - 2 \langle m_P \otimes \hat{m}_P, \theta \rangle_{H_X \otimes H_X} + \langle m_P \otimes m_P, \theta \rangle_{H_X \otimes H_X}$$
$$= \sum_{i=1}^n w_i^2 \left( E_{X'_i}[k_X(X'_i, X'_i)] - E_{X'_i, \tilde{X}'_i}[k_X(X'_i, \tilde{X}'_i)] \right) + \langle (\hat{m}_P - m_P) \otimes (\hat{m}_P - m_P), \theta \rangle_{H_X \otimes H_X}.$$

Finally, the Cauchy-Schwartz inequality gives

$$\langle (\hat{m}_P - m_P) \otimes (\hat{m}_P - m_P), \theta \rangle_{H_X \otimes H_X} \le \| \hat{m}_P - m_P \|^2_{H_X} \| \theta \|_{H_X \otimes H_X}.$$

This completes the proof.


Appendix B Proof of Theorem 2

Theorem 2 provides convergence rates for the resampling algorithm, algorithm 4. This theorem assumes that the candidate samples $Z_1, \dots, Z_N$ for resampling are i.i.d. with a density $q$. Here we prove theorem 2 by showing that the same statement holds under weaker assumptions (see theorem 3 below).

We first describe the assumptions. Let $P$ be the distribution of the kernel mean $m_P$, and $L_2(P)$ be the Hilbert space of square-integrable functions on $\mathcal{X}$ with respect to $P$. For any $f \in L_2(P)$, we write its norm as $\|f\|_{L_2(P)} = \left( \int f^2(x)\, dP(x) \right)^{1/2}$.

Assumption 1. The candidate samples $Z_1, \dots, Z_N$ are independent. There are probability distributions $Q_1, \dots, Q_N$ on $\mathcal{X}$ such that for any bounded measurable function $g: \mathcal{X} \to \mathbb{R}$, we have

$$E\left[ \frac{1}{N-1} \sum_{j \neq i} g(Z_j) \right] = E_{X \sim Q_i}[g(X)] \quad (i = 1, \dots, N). \quad (B.1)$$

Assumption 2. The distributions $Q_1, \dots, Q_N$ have density functions $q_1, \dots, q_N$, respectively. Define $Q = \frac{1}{N} \sum_{i=1}^N Q_i$ and $q = \frac{1}{N} \sum_{i=1}^N q_i$. There is a constant $A > 0$ that does not depend on $N$ such that

$$\left\| \frac{q_i}{q} - 1 \right\|^2_{L_2(P)} \le \frac{A}{\sqrt{N}} \quad (i = 1, \dots, N). \quad (B.2)$$

Assumption 3. The distribution $P$ has a density function $p$ such that $\sup_{x \in \mathcal{X}} \frac{p(x)}{q(x)} < \infty$. There is a constant $\sigma > 0$ such that

$$\sqrt{N} \left( \frac{1}{N} \sum_{i=1}^N \frac{p(Z_i)}{q(Z_i)} - 1 \right) \xrightarrow{D} \mathcal{N}(0, \sigma^2), \quad (B.3)$$

where $\xrightarrow{D}$ denotes convergence in distribution and $\mathcal{N}(0, \sigma^2)$ the normal distribution with mean 0 and variance $\sigma^2$.

These assumptions are weaker than those in theorem 2, which require that $Z_1, \dots, Z_N$ be i.i.d. For example, assumption 1 is clearly satisfied for the i.i.d. case, since in this case we have $Q = Q_1 = \dots = Q_N$. The inequality, equation B.2, in assumption 2 requires that the distributions $Q_1, \dots, Q_N$ get similar as the sample size increases. This is also satisfied under the i.i.d. assumption. Likewise, the convergence, equation B.3, in assumption 3 is satisfied from the central limit theorem if $Z_1, \dots, Z_N$ are i.i.d.

We will need the following lemma.

Lemma 1. Let $Z_1, \dots, Z_N$ be samples satisfying assumption 1. Then the following holds for any bounded measurable function $g: \mathcal{X} \to \mathbb{R}$:

$$E\left[ \frac{1}{N} \sum_{i=1}^N g(Z_i) \right] = \int g(x)\, dQ(x).$$

Proof.

$$E\left[ \frac{1}{N} \sum_{i=1}^N g(Z_i) \right] = E\left[ \frac{1}{N(N-1)} \sum_{i=1}^N \sum_{j \neq i} g(Z_j) \right] = \frac{1}{N} \sum_{i=1}^N E\left[ \frac{1}{N-1} \sum_{j \neq i} g(Z_j) \right] = \frac{1}{N} \sum_{i=1}^N \int g(x)\, dQ_i(x) = \int g(x)\, dQ(x).$$

The following theorem shows the convergence rates of our resampling algorithm. Note that it does not assume that the candidate samples $Z_1, \dots, Z_N$ are identical to those expressing the estimator $\hat{m}_P$.

Theorem 3. Let $k$ be a bounded positive-definite kernel and $H$ be the associated RKHS. Let $Z_1, \dots, Z_N$ be candidate samples satisfying assumptions 1, 2, and 3. Let $P$ be a probability distribution satisfying assumption 3, and let $m_P = \int k(\cdot, x)\, dP(x)$ be the kernel mean. Let $\hat{m}_P \in H$ be any element in $H$. Suppose we apply algorithm 4 to $\hat{m}_P \in H$ with candidate samples $Z_1, \dots, Z_N$, and let $X_1, \dots, X_\ell \in \{Z_1, \dots, Z_N\}$ be the resulting samples. Then the following holds:

$$\left\| \hat{m}_P - \frac{1}{\ell} \sum_{i=1}^{\ell} k(\cdot, X_i) \right\|^2_H = \left( \| \hat{m}_P - m_P \|_H + O_p(N^{-1/2}) \right)^2 + O\left( \frac{\ln \ell}{\ell} \right).$$

Proof. Our proof is based on the fact (Bach et al., 2012) that kernel herding can be seen as the Frank-Wolfe optimization method with step size $1/(\ell + 1)$ for the $\ell$-th iteration. For details of the Frank-Wolfe method, we refer to Jaggi (2013) and Freund and Grigas (2014).

Fix the samples $Z_1, \dots, Z_N$. Let $\mathcal{M}_N$ be the convex hull of the set $\{k(\cdot, Z_1), \dots, k(\cdot, Z_N)\} \subset H$. Define a loss function $J: H \to \mathbb{R}$ by

$$J(g) = \frac{1}{2} \| g - \hat{m}_P \|^2_H, \quad g \in H. \quad (B.4)$$

Then algorithm 4 can be seen as the Frank-Wolfe method that iteratively minimizes this loss function over the convex hull $\mathcal{M}_N$:

$$\inf_{g \in \mathcal{M}_N} J(g).$$

More precisely, the Frank-Wolfe method solves this problem by the following iterations:

$$s_\ell = \arg\min_{g \in \mathcal{M}_N} \langle g, \nabla J(g_{\ell-1}) \rangle_H,$$
$$g_\ell = (1 - \gamma_\ell) g_{\ell-1} + \gamma_\ell s_\ell \quad (\ell \ge 1),$$

where $\gamma_\ell$ is a step size defined as $\gamma_\ell = 1/\ell$, and $\nabla J(g_{\ell-1})$ is the gradient of $J$ at $g_{\ell-1}$: $\nabla J(g_{\ell-1}) = g_{\ell-1} - \hat{m}_P$. Here the initial point is defined as $g_0 = 0$. It can be easily shown that $g_\ell = \frac{1}{\ell} \sum_{i=1}^\ell k(\cdot, X_i)$, where $X_1, \dots, X_\ell$ are the samples given by algorithm 4 (for details, see Bach et al., 2012).
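To make the Frank-Wolfe view concrete, here is a minimal sketch of kernel herding over a finite candidate set written as the above iteration (our own illustrative code following the formulation of the proof, not the paper's algorithm 4 verbatim; function and variable names are assumptions). Because the linear step minimizes a linear function over the convex hull $\mathcal{M}_N$, its solution can be taken at a vertex $k(\cdot, Z_j)$, so the search reduces to an argmin over the candidates:

```python
import numpy as np

def herding_frank_wolfe(K_zz, m_hat_z, num_iters):
    """Kernel herding over a finite candidate set, viewed as Frank-Wolfe.

    K_zz     : (N, N) kernel matrix k(Z_i, Z_j) of the candidate samples
    m_hat_z  : (N,) evaluations of the herding target \hat{m}_P at the candidates
    num_iters: number of samples to generate (ell)
    Returns indices of the selected candidates X_1, ..., X_ell.
    """
    N = K_zz.shape[0]
    g_z = np.zeros(N)          # evaluations of the current iterate g_{l-1} at the candidates
    chosen = []
    for l in range(1, num_iters + 1):
        # Linear step: minimize <k(., Z_j), grad J(g_{l-1})> = g_{l-1}(Z_j) - m_hat(Z_j)
        j = int(np.argmin(g_z - m_hat_z))
        chosen.append(j)
        # Convex update with step size 1/l: g_l = (1 - 1/l) g_{l-1} + (1/l) k(., Z_j)
        g_z = (1.0 - 1.0 / l) * g_z + (1.0 / l) * K_zz[:, j]
    return chosen
```

With $g_0 = 0$ and step size $1/\ell$, the iterate maintained here is exactly the running average of the selected feature vectors, consistent with $g_\ell = \frac{1}{\ell}\sum_{i=1}^\ell k(\cdot, X_i)$ above.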

Let $L_{J, \mathcal{M}_N} > 0$ be the Lipschitz constant of the gradient $\nabla J$ over $\mathcal{M}_N$, and $\mathrm{Diam}\, \mathcal{M}_N > 0$ be the diameter of $\mathcal{M}_N$:

$$L_{J, \mathcal{M}_N} = \sup_{g_1, g_2 \in \mathcal{M}_N} \frac{\| \nabla J(g_1) - \nabla J(g_2) \|_H}{\| g_1 - g_2 \|_H} = \sup_{g_1, g_2 \in \mathcal{M}_N} \frac{\| g_1 - g_2 \|_H}{\| g_1 - g_2 \|_H} = 1, \quad (B.5)$$

$$\mathrm{Diam}\, \mathcal{M}_N = \sup_{g_1, g_2 \in \mathcal{M}_N} \| g_1 - g_2 \|_H \le \sup_{g_1, g_2 \in \mathcal{M}_N} \left( \| g_1 \|_H + \| g_2 \|_H \right) \le 2C, \quad (B.6)$$

where $C = \sup_{x \in \mathcal{X}} \| k(\cdot, x) \|_H = \sup_{x \in \mathcal{X}} \sqrt{k(x, x)} < \infty$.

From bound 3.2 and equation 8 of Freund and Grigas (2014), we then have

$$J(g_\ell) - \inf_{g \in \mathcal{M}_N} J(g) \le \frac{L_{J, \mathcal{M}_N} (\mathrm{Diam}\, \mathcal{M}_N)^2 (1 + \ln \ell)}{2 \ell} \quad (B.7)$$
$$\le \frac{2 C^2 (1 + \ln \ell)}{\ell}, \quad (B.8)$$

where the last inequality follows from equations B.5 and B.6.

Note that the upper bound of equation B.8 does not depend on the candidate samples $Z_1, \dots, Z_N$. Hence, combined with equation B.4, the following holds for any choice of $Z_1, \dots, Z_N$:

$$\left\| \hat{m}_P - \frac{1}{\ell} \sum_{i=1}^\ell k(\cdot, X_i) \right\|^2_H \le \inf_{g \in \mathcal{M}_N} \| \hat{m}_P - g \|^2_H + \frac{4 C^2 (1 + \ln \ell)}{\ell}. \quad (B.9)$$

Below we will focus on bounding the first term of equation B.9. Recall here that $Z_1, \dots, Z_N$ are random samples. Define a random variable $S_N = \sum_{i=1}^N \frac{p(Z_i)}{q(Z_i)}$. Since $\mathcal{M}_N$ is the convex hull of $\{k(\cdot, Z_1), \dots, k(\cdot, Z_N)\}$, we have

$$\inf_{g \in \mathcal{M}_N} \| \hat{m}_P - g \|_H = \inf_{\alpha \in \mathbb{R}^N,\, \alpha \ge 0,\, \sum_i \alpha_i \le 1} \left\| \hat{m}_P - \sum_i \alpha_i k(\cdot, Z_i) \right\|_H \le \left\| \hat{m}_P - \frac{1}{S_N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \right\|_H$$
$$\le \| \hat{m}_P - m_P \|_H + \left\| m_P - \frac{1}{N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \right\|_H + \left\| \frac{1}{N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) - \frac{1}{S_N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \right\|_H.$$

Therefore we have

$$\left\| \hat{m}_P - \frac{1}{\ell} \sum_{i=1}^\ell k(\cdot, X_i) \right\|^2_H \le \left( \| \hat{m}_P - m_P \|_H + \left\| m_P - \frac{1}{N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \right\|_H \right.$$
$$\left. + \left\| \frac{1}{N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) - \frac{1}{S_N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \right\|_H \right)^2 + O\left( \frac{\ln \ell}{\ell} \right). \quad (B.10)$$

Below we derive rates of convergence for the second and third terms.

For the second term, we derive a rate of convergence in expectation, which implies a rate of convergence in probability. To this end, we use the following fact. Let $f \in H$ be any function in the RKHS. By the assumption $\sup_{x \in \mathcal{X}} \frac{p(x)}{q(x)} < \infty$ and the boundedness of $k$, the functions $x \mapsto \frac{p(x)}{q(x)} f(x)$ and $x \mapsto \left( \frac{p(x)}{q(x)} \right)^2 f(x)$ are bounded. We have

$$E\left[ \left\| m_P - \frac{1}{N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \right\|^2_H \right]$$
$$= \| m_P \|^2_H - 2 E\left[ \frac{1}{N} \sum_i \frac{p(Z_i)}{q(Z_i)} m_P(Z_i) \right] + E\left[ \frac{1}{N^2} \sum_i \sum_j \frac{p(Z_i)}{q(Z_i)} \frac{p(Z_j)}{q(Z_j)} k(Z_i, Z_j) \right]$$
$$= \| m_P \|^2_H - 2 \int \frac{p(x)}{q(x)} m_P(x) q(x)\, dx + E\left[ \frac{1}{N^2} \sum_i \sum_{j \neq i} \frac{p(Z_i)}{q(Z_i)} \frac{p(Z_j)}{q(Z_j)} k(Z_i, Z_j) \right] + E\left[ \frac{1}{N^2} \sum_i \left( \frac{p(Z_i)}{q(Z_i)} \right)^2 k(Z_i, Z_i) \right]$$
$$= \| m_P \|^2_H - 2 \| m_P \|^2_H + E\left[ \frac{N-1}{N^2} \sum_i \frac{p(Z_i)}{q(Z_i)} \int \frac{p(x)}{q(x)} k(Z_i, x) q_i(x)\, dx \right] + \frac{1}{N} \int \left( \frac{p(x)}{q(x)} \right)^2 k(x, x) q(x)\, dx$$
$$= - \| m_P \|^2_H + E\left[ \frac{N-1}{N^2} \sum_i \frac{p(Z_i)}{q(Z_i)} \int \frac{p(x)}{q(x)} k(Z_i, x) q_i(x)\, dx \right] + \frac{1}{N} \int \frac{p(x)}{q(x)} k(x, x)\, dP(x).$$

We further rewrite the second term of the last equality as follows:

$$E\left[ \frac{N-1}{N^2} \sum_i \frac{p(Z_i)}{q(Z_i)} \int \frac{p(x)}{q(x)} k(Z_i, x) q_i(x)\, dx \right]$$
$$= E\left[ \frac{N-1}{N^2} \sum_i \frac{p(Z_i)}{q(Z_i)} \int \frac{p(x)}{q(x)} k(Z_i, x) (q_i(x) - q(x))\, dx \right] + E\left[ \frac{N-1}{N^2} \sum_i \frac{p(Z_i)}{q(Z_i)} \int \frac{p(x)}{q(x)} k(Z_i, x) q(x)\, dx \right]$$
$$= E\left[ \frac{N-1}{N^2} \sum_i \frac{p(Z_i)}{q(Z_i)} \int \sqrt{p(x)}\, k(Z_i, x) \sqrt{p(x)} \left( \frac{q_i(x)}{q(x)} - 1 \right) dx \right] + \frac{N-1}{N} \| m_P \|^2_H$$
$$\le E\left[ \frac{N-1}{N^2} \sum_i \frac{p(Z_i)}{q(Z_i)} \| k(Z_i, \cdot) \|_{L_2(P)} \left\| \frac{q_i}{q} - 1 \right\|_{L_2(P)} \right] + \frac{N-1}{N} \| m_P \|^2_H$$
$$\le E\left[ \frac{N-1}{N^3} \sum_i \frac{p(Z_i)}{q(Z_i)} C^2 A \right] + \frac{N-1}{N} \| m_P \|^2_H = \frac{C^2 A (N-1)}{N^2} + \frac{N-1}{N} \| m_P \|^2_H,$$

where the first inequality follows from Cauchy-Schwartz. Using this, we obtain

$$E\left[ \left\| m_P - \frac{1}{N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \right\|^2_H \right] \le \frac{1}{N} \left( \int \frac{p(x)}{q(x)} k(x, x)\, dP(x) - \| m_P \|^2_H \right) + \frac{C^2 (N-1) A}{N^2} = O(N^{-1}).$$

Therefore we have

$$\left\| m_P - \frac{1}{N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \right\|_H = O_p(N^{-1/2}) \quad (N \to \infty). \quad (B.11)$$

We can bound the third term as follows:

$$\left\| \frac{1}{N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) - \frac{1}{S_N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \right\|_H = \left\| \frac{1}{N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \left( 1 - \frac{N}{S_N} \right) \right\|_H$$
$$= \left| 1 - \frac{N}{S_N} \right| \left\| \frac{1}{N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \right\|_H \le \left| 1 - \frac{N}{S_N} \right| C \left\| \frac{p}{q} \right\|_\infty = \left| 1 - \frac{1}{\frac{1}{N} \sum_{i=1}^N p(Z_i)/q(Z_i)} \right| C \left\| \frac{p}{q} \right\|_\infty,$$

where $\| p/q \|_\infty = \sup_{x \in \mathcal{X}} \frac{p(x)}{q(x)} < \infty$. Therefore the following holds by assumption 3 and the delta method:

$$\left\| \frac{1}{N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) - \frac{1}{S_N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \right\|_H = O_p(N^{-1/2}). \quad (B.12)$$

The assertion of the theorem follows from equations B.10 to B.12.

Appendix C Reduction of Computational Cost

We have seen in section 4.3 that the time complexity of KMCF in one time step is $O(n^3)$, where $n$ is the number of the state-observation examples $(X_i, Y_i)_{i=1}^n$. This can be costly if one wishes to use KMCF in real-time applications with a large number of samples. Here we show two methods for reducing the costs: one based on low-rank approximation of kernel matrices and one based on kernel herding. Note that kernel herding is also used in the resampling step. The purpose here is different, however; we make use of kernel herding for finding a reduced representation of the data $(X_i, Y_i)_{i=1}^n$.

C.1 Low-Rank Approximation of Kernel Matrices. Our goal is to reduce the costs of algorithm 1 of kernel Bayes' rule. Algorithm 1 involves two matrix inversions: $(G_X + n\varepsilon I_n)^{-1}$ in line 3 and $((\Lambda G_Y)^2 + \delta I_n)^{-1}$ in line 4. Note that $(G_X + n\varepsilon I_n)^{-1}$ does not involve the test data, so it can be computed before the test phase. On the other hand, $((\Lambda G_Y)^2 + \delta I_n)^{-1}$ depends on the matrix $\Lambda$. This matrix involves the vector $m_\pi$, which essentially represents the prior of the current state (see line 13 of algorithm 3). Therefore $((\Lambda G_Y)^2 + \delta I_n)^{-1}$ needs to be computed for each iteration in the test phase. This has the complexity of $O(n^3)$. Note that even if $(G_X + n\varepsilon I_n)^{-1}$ can be computed in the training phase, the multiplication $(G_X + n\varepsilon I_n)^{-1} m_\pi$ in line 3 requires $O(n^2)$. Thus it can also be costly. Here we consider methods to reduce both costs in lines 3 and 4.

Suppose that there exist low-rank matrices $U, V \in \mathbb{R}^{n \times r}$, where $r < n$, that approximate the kernel matrices: $G_X \approx UU^T$, $G_Y \approx VV^T$. Such low-rank matrices can be obtained by, for example, incomplete Cholesky decomposition with time complexity $O(nr^2)$ (Fine & Scheinberg, 2001; Bach & Jordan, 2002). Note that the computation of these matrices is required only once before the test phase. Therefore their time complexities are not the problem here.

C.1.1 Derivation. First we approximate $(G_X + n\varepsilon I_n)^{-1} m_\pi$ in line 3, using $G_X \approx UU^T$. By the Woodbury identity, we have

$$(G_X + n\varepsilon I_n)^{-1} m_\pi \approx (UU^T + n\varepsilon I_n)^{-1} m_\pi = \frac{1}{n\varepsilon} \left( I_n - U (n\varepsilon I_r + U^T U)^{-1} U^T \right) m_\pi,$$

where $I_r \in \mathbb{R}^{r \times r}$ denotes the identity. Note that $(n\varepsilon I_r + U^T U)^{-1}$ does not involve the test data, so it can be computed in the training phase. Thus the above approximation of $\mu$ can be computed with complexity $O(nr^2)$.

Next we approximate $w = \Lambda G_Y ((\Lambda G_Y)^2 + \delta I)^{-1} \Lambda k_Y$ in line 4, using $G_Y \approx VV^T$. Define $B = \Lambda V \in \mathbb{R}^{n \times r}$, $C = V^T \Lambda V \in \mathbb{R}^{r \times r}$, and $D = V^T \in \mathbb{R}^{r \times n}$. Then $(\Lambda G_Y)^2 \approx (\Lambda VV^T)^2 = BCD$. By the Woodbury identity, we obtain

$$(\delta I_n + (\Lambda G_Y)^2)^{-1} \approx (\delta I_n + BCD)^{-1} = \frac{1}{\delta} \left( I_n - B (\delta C^{-1} + DB)^{-1} D \right).$$

Thus $w$ can be approximated as

$$w = \Lambda G_Y ((\Lambda G_Y)^2 + \delta I)^{-1} \Lambda k_Y \approx \frac{1}{\delta} \Lambda VV^T \left( I_n - B (\delta C^{-1} + DB)^{-1} D \right) \Lambda k_Y.$$

The computation of this approximation requires $O(nr^2 + r^3) = O(nr^2)$. Thus, in total, the complexity of algorithm 1 can be reduced to $O(nr^2)$. We summarize the above approximations in algorithm 5.
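The following is a minimal NumPy sketch of these two Woodbury-based approximations (our own illustrative code, not the paper's algorithm 5 itself; the function name, the argument names, and the use of the vector `lam_diag` to form the diagonal matrix $\Lambda$ follow our reading of algorithm 1 and are assumptions made for the sketch):

```python
import numpy as np

def kbr_weights_lowrank(U, V, lam_diag, m_pi, k_y, eps, delta):
    """Sketch of the low-rank KBR weight computation via the Woodbury identity.

    U, V       : (n, r) low-rank factors with G_X ~ U U^T and G_Y ~ V V^T
    lam_diag   : (n,) diagonal of the matrix Lambda used in line 4 (an assumption here)
    m_pi       : (n,) prior vector evaluated at the training states
    k_y        : (n,) kernel vector k_Y(y_t, Y_i) for the new observation
    eps, delta : regularization constants
    """
    n, r = U.shape
    I_r = np.eye(r)

    # Line 3: mu ~ (G_X + n*eps*I_n)^{-1} m_pi, using (U U^T + c I)^{-1} via Woodbury.
    c = n * eps
    inner = np.linalg.solve(c * I_r + U.T @ U, U.T @ m_pi)   # (r,)
    mu = (m_pi - U @ inner) / c                              # costs O(n r^2)

    # Line 4: w ~ Lam G_Y ((Lam G_Y)^2 + delta I)^{-1} Lam k_y with G_Y ~ V V^T.
    Lam = lam_diag
    B = Lam[:, None] * V          # B = Lam V, shape (n, r)
    C = V.T @ B                   # C = V^T Lam V, shape (r, r)
    D = V.T                       # shape (r, n)
    rhs = Lam * k_y               # Lam k_y
    # (delta I + B C D)^{-1} x = (x - B (delta C^{-1} + D B)^{-1} D x) / delta
    core = np.linalg.solve(delta * np.linalg.inv(C) + D @ B, D @ rhs)
    inv_times = (rhs - B @ core) / delta
    w = Lam * (V @ (V.T @ inv_times))   # Lam V V^T (...), costs O(n r^2)
    return w, mu
```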


C.1.2 How to Use. Algorithm 5 can be used with algorithm 3 of KMCF by modifying algorithm 3 in the following manner. Compute the low-rank matrices $U, V$ right after lines 4 and 5. This can be done by using, for example, incomplete Cholesky decomposition (Fine & Scheinberg, 2001; Bach & Jordan, 2002). Then replace algorithm 1 in line 15 by algorithm 5.

C.1.3 How to Select the Rank. As discussed in section 4.3, one way of selecting the rank $r$ is to use cross-validation by regarding $r$ as a hyperparameter of KMCF. Another way is to measure the approximation errors $\|G_X - UU^T\|$ and $\|G_Y - VV^T\|$ with some matrix norm, such as the Frobenius norm. Indeed, we can compute the smallest rank $r$ such that these errors are below a prespecified threshold, and this can be done efficiently with time complexity $O(nr^2)$ (Bach & Jordan, 2002).
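As an illustration of the second option, a pivoted incomplete Cholesky factorization can be stopped as soon as the residual error falls below a prespecified threshold, which yields the smallest such rank as a by-product (a sketch under our own assumptions; it uses the trace of the residual as the error measure, a common choice for incomplete Cholesky, rather than the Frobenius norm mentioned above):

```python
import numpy as np

def pivoted_chol(K, tol):
    """Pivoted incomplete Cholesky of a PSD kernel matrix K.
    Returns U of shape (n, r) with K ~ U U^T and residual trace below tol."""
    n = K.shape[0]
    d = np.diag(K).astype(float).copy()   # residual diagonal; its sum is the trace error
    U = np.zeros((n, n))
    r = 0
    while d.sum() > tol and r < n:
        j = int(np.argmax(d))              # pivot with the largest residual
        if d[j] <= 1e-12:
            break
        U[:, r] = (K[:, j] - U[:, :r] @ U[j, :r]) / np.sqrt(d[j])
        d -= U[:, r] ** 2
        d[d < 0] = 0.0
        r += 1
    return U[:, :r]
```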

C.2 Data Reduction with Kernel Herding. Here we describe an approach to reduce the size of the representation of the state-observation examples $(X_i, Y_i)_{i=1}^n$ in an efficient way. By "efficient," we mean that the information contained in $(X_i, Y_i)_{i=1}^n$ will be preserved even after the reduction. Recall that $(X_i, Y_i)_{i=1}^n$ contains the information of the observation model $p(y_t | x_t)$ (recall also that $p(y_t | x_t)$ is assumed time-homogeneous; see section 4.1). This information is used only in algorithm 1 of kernel Bayes' rule (line 15, algorithm 3). Therefore it suffices to consider how kernel Bayes' rule accesses the information contained in the joint sample $(X_i, Y_i)_{i=1}^n$.

i=1

C21 Representation of the Joint Sample To this end we need to show howthe joint sample (XiYi)n

i=1 can be represented with a kernel mean embed-ding Recall that (kX HX ) and (kY HY ) are kernels and the associatedRKHSs on the state-space X and the observation-space Y respectively LetX times Y be the product space ofX andY Then we can define a kernel kXtimesY onX times Y as the product of kX and kY kXtimesY ((x y) (xprime yprime)) = kX (x xprime)kY (y yprime)for all (x y) (xprime yprime) isin X times Y This product kernel kXtimesY defines an RKHS ofX times Y Let HXtimesY denote this RKHS As in section 3 we can use kXtimesY andHXtimesY for a kernel mean embedding In particular the empirical distribution1n

sumni=1 δ(XiYi )

of the joint sample (XiYi)ni=1 sub X times Y can be represented as

an empirical kernel mean in HXtimesY

mXY = 1n

nsumi=1

kXtimesY ((middot middot) (XiYi)) isin HXtimesY (C1)

This is the representation of the joint sample (XiYi)ni=1

The information of $(X_i, Y_i)_{i=1}^n$ is provided for kernel Bayes' rule essentially through the form of equation C.1 (Fukumizu et al., 2011, 2013). Recall that equation C.1 is a point in the RKHS $H_{\mathcal{X} \times \mathcal{Y}}$. Any point close to this equation in $H_{\mathcal{X} \times \mathcal{Y}}$ would also contain information close to that contained in equation C.1. Therefore we propose to find a subset $\{(\bar{X}_1, \bar{Y}_1), \dots, (\bar{X}_r, \bar{Y}_r)\} \subset \{(X_i, Y_i)\}_{i=1}^n$, where $r < n$, such that its representation in $H_{\mathcal{X} \times \mathcal{Y}}$,

$$\bar{m}_{XY} = \frac{1}{r} \sum_{i=1}^r k_{\mathcal{X} \times \mathcal{Y}}((\cdot, \cdot), (\bar{X}_i, \bar{Y}_i)) \in H_{\mathcal{X} \times \mathcal{Y}}, \quad (C.2)$$

is close to equation C.1. Namely, we wish to find subsamples such that $\| \hat{m}_{XY} - \bar{m}_{XY} \|_{H_{\mathcal{X} \times \mathcal{Y}}}$ is small. If the error $\| \hat{m}_{XY} - \bar{m}_{XY} \|_{H_{\mathcal{X} \times \mathcal{Y}}}$ is small enough, equation C.2 would provide information close to that given by equation C.1 for kernel Bayes' rule. Thus kernel Bayes' rule based on such subsamples $(\bar{X}_i, \bar{Y}_i)_{i=1}^r$ would not perform much worse than the one based on the entire set of samples $(X_i, Y_i)_{i=1}^n$.

C.2.2 Subsampling Method. To find such subsamples, we make use of kernel herding in section 3.5. Namely, we apply the update equations 3.6 and 3.7 to approximate equation C.1, with kernel $k_{\mathcal{X} \times \mathcal{Y}}$ and RKHS $H_{\mathcal{X} \times \mathcal{Y}}$. We greedily find subsamples $D_r = \{(\bar{X}_1, \bar{Y}_1), \dots, (\bar{X}_r, \bar{Y}_r)\}$ as

$$(\bar{X}_r, \bar{Y}_r) = \arg\max_{(x, y) \in D \setminus D_{r-1}} \frac{1}{n} \sum_{i=1}^n k_{\mathcal{X} \times \mathcal{Y}}((x, y), (X_i, Y_i)) - \frac{1}{r} \sum_{j=1}^{r-1} k_{\mathcal{X} \times \mathcal{Y}}((x, y), (\bar{X}_j, \bar{Y}_j))$$
$$= \arg\max_{(x, y) \in D \setminus D_{r-1}} \frac{1}{n} \sum_{i=1}^n k_X(x, X_i) k_Y(y, Y_i) - \frac{1}{r} \sum_{j=1}^{r-1} k_X(x, \bar{X}_j) k_Y(y, \bar{Y}_j),$$

where $D = \{(X_i, Y_i)\}_{i=1}^n$. The resulting algorithm is shown in algorithm 6. The time complexity is $O(n^2 r)$ for selecting $r$ subsamples.
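A minimal sketch of this greedy selection, working directly on the kernel matrices (our own illustrative code, not the paper's algorithm 6 verbatim; the function and variable names are assumptions):

```python
import numpy as np

def herding_subsample(GX, GY, r):
    """Greedy data-reduction sketch: pick r of the n joint samples (X_i, Y_i)
    so that their equal-weight embedding approximates the full joint embedding.

    GX, GY : (n, n) kernel matrices k_X(X_i, X_j) and k_Y(Y_i, Y_j)
    r      : number of subsamples to keep
    Returns the indices of the selected subsamples.
    """
    n = GX.shape[0]
    GJ = GX * GY                       # joint (product) kernel matrix, elementwise
    target = GJ.mean(axis=1)           # (1/n) sum_i k_joint(candidate, (X_i, Y_i))
    penalty = np.zeros(n)              # running sum of k_joint(candidate, chosen)
    selected = []
    for t in range(1, r + 1):
        scores = target - penalty / t  # herding objective at iteration t
        scores[selected] = -np.inf     # restrict the search to unused candidates
        j = int(np.argmax(scores))
        selected.append(j)
        penalty += GJ[:, j]
    return selected
```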

C.2.3 How to Use. By using algorithm 6, we can reduce the time complexity of KMCF (see algorithm 3) in each iteration from $O(n^3)$ to $O(r^3)$. This can be done by obtaining subsamples $(\bar{X}_i, \bar{Y}_i)_{i=1}^r$ by applying algorithm 6 to $(X_i, Y_i)_{i=1}^n$, and then replacing $(X_i, Y_i)_{i=1}^n$ in the requirement of algorithm 3 by $(\bar{X}_i, \bar{Y}_i)_{i=1}^r$ and using the number $r$ instead of $n$.

C.2.4 How to Select the Number of Subsamples. The number $r$ of subsamples determines the trade-off between the accuracy and computational time of KMCF. It may be selected by cross-validation or by measuring the approximation error $\| \hat{m}_{XY} - \bar{m}_{XY} \|_{H_{\mathcal{X} \times \mathcal{Y}}}$, as for the case of selecting the rank of low-rank approximation in section C.1.

C.2.5 Discussion. Recall that kernel herding generates samples such that they approximate a given kernel mean (see section 3.5). Under certain assumptions, the error of this approximation is of $O(r^{-1})$ with $r$ samples, which is faster than that of i.i.d. samples, $O(r^{-1/2})$. This indicates that subsamples $(\bar{X}_i, \bar{Y}_i)_{i=1}^r$ selected with kernel herding may approximate equation C.1 well. Here, however, we find the solutions of the optimization problems 3.6 and 3.7 from the finite set $(X_i, Y_i)_{i=1}^n$, rather than the entire joint space $\mathcal{X} \times \mathcal{Y}$. The convergence guarantee is provided only for the case of the entire joint space $\mathcal{X} \times \mathcal{Y}$; thus, for our case, the convergence guarantee is no longer provided. Moreover, the fast rate $O(r^{-1})$ is guaranteed only for finite-dimensional RKHSs. Gaussian kernels, which we often use in practice, define infinite-dimensional RKHSs. Therefore the fast rate is not guaranteed if we use gaussian kernels. Nevertheless, we can use algorithm 6 as a heuristic for data reduction.

Acknowledgments

We express our gratitude to the associate editor and the anonymous reviewer for their time and helpful suggestions. We also thank Masashi Shimbo, Momoko Hayamizu, Yoshimasa Uematsu, and Katsuhiro Omae for their helpful comments. This work has been supported in part by MEXT Grant-in-Aid for Scientific Research on Innovative Areas 25120012. M.K. has been supported by JSPS Grant-in-Aid for JSPS Fellows 15J04406.

References

Anderson, B., & Moore, J. (1979). Optimal filtering. Englewood Cliffs, NJ: Prentice Hall.
Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3), 337–404.
Bach, F., & Jordan, M. I. (2002). Kernel independent component analysis. Journal of Machine Learning Research, 3, 1–48.
Bach, F., Lacoste-Julien, S., & Obozinski, G. (2012). On the equivalence between herding and conditional gradient algorithms. In Proceedings of the 29th International Conference on Machine Learning (ICML 2012) (pp. 1359–1366). Madison, WI: Omnipress.
Berlinet, A., & Thomas-Agnan, C. (2004). Reproducing kernel Hilbert spaces in probability and statistics. Dordrecht: Kluwer Academic.
Calvet, L. E., & Czellar, V. (2015). Accurate methods for approximate Bayesian computation filtering. Journal of Financial Econometrics, 13, 798–838. doi:10.1093/jjfinec/nbu019
Cappé, O., Godsill, S. J., & Moulines, E. (2007). An overview of existing methods and recent advances in sequential Monte Carlo. IEEE Proceedings, 95(5), 899–924.
Chen, Y., Welling, M., & Smola, A. (2010). Supersamples from kernel-herding. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence (pp. 109–116). Cambridge, MA: MIT Press.
Deisenroth, M., Huber, M., & Hanebeck, U. (2009). Analytic moment-based gaussian process filtering. In Proceedings of the 26th International Conference on Machine Learning (pp. 225–232). Madison, WI: Omnipress.
Doucet, A., Freitas, N. D., & Gordon, N. J. (Eds.). (2001). Sequential Monte Carlo methods in practice. New York: Springer.
Doucet, A., & Johansen, A. M. (2011). A tutorial on particle filtering and smoothing: Fifteen years later. In D. Crisan & B. Rozovskii (Eds.), The Oxford handbook of nonlinear filtering (pp. 656–704). New York: Oxford University Press.
Durbin, J., & Koopman, S. J. (2012). Time series analysis by state space methods (2nd ed.). New York: Oxford University Press.
Eberts, M., & Steinwart, I. (2013). Optimal regression rates for SVMs using gaussian kernels. Electronic Journal of Statistics, 7, 1–42.
Ferris, B., Hähnel, D., & Fox, D. (2006). Gaussian processes for signal strength-based location estimation. In Proceedings of Robotics: Science and Systems. Cambridge, MA: MIT Press.
Fine, S., & Scheinberg, K. (2001). Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2, 243–264.
Freund, R. M., & Grigas, P. (2014). New analysis and results for the Frank-Wolfe method. Mathematical Programming. doi:10.1007/s10107-014-0841-6
Fukumizu, K., Bach, F., & Jordan, M. I. (2004). Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5, 73–99.
Fukumizu, K., Gretton, A., Sun, X., & Schölkopf, B. (2008). Kernel measures of conditional dependence. In J. C. Platt, D. Koller, Y. Singer, & S. Roweis (Eds.), Advances in neural information processing systems, 20 (pp. 489–496). Cambridge, MA: MIT Press.
Fukumizu, K., Song, L., & Gretton, A. (2011). Kernel Bayes' rule. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. C. N. Pereira, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 24 (pp. 1737–1745). Red Hook, NY: Curran.
Fukumizu, K., Song, L., & Gretton, A. (2013). Kernel Bayes' rule: Bayesian inference with positive definite kernels. Journal of Machine Learning Research, 14, 3753–3783.
Fukumizu, K., Sriperumbudur, B., Gretton, A., & Schölkopf, B. (2009). Characteristic kernels on groups and semigroups. In D. Koller, D. Schuurmans, Y. Bengio, & L. Bottou (Eds.), Advances in neural information processing systems, 21 (pp. 473–480). Cambridge, MA: MIT Press.
Gordon, N. J., Salmond, D. J., & Smith, A. F. M. (1993). Novel approach to nonlinear/non-gaussian Bayesian state estimation. IEE Proceedings F, 140, 107–113.
Hofmann, T., Schölkopf, B., & Smola, A. J. (2008). Kernel methods in machine learning. Annals of Statistics, 36(3), 1171–1220.
Jaggi, M. (2013). Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning (pp. 427–435).
Jasra, A., Singh, S. S., Martin, J. S., & McCoy, E. (2012). Filtering via approximate Bayesian computation. Statistics and Computing, 22, 1223–1237.
Julier, S. J., & Uhlmann, J. K. (1997). A new extension of the Kalman filter to nonlinear systems. In Proceedings of AeroSense: The 11th International Symposium on Aerospace/Defence Sensing, Simulation and Controls. Bellingham, WA: SPIE.
Julier, S. J., & Uhlmann, J. K. (2004). Unscented filtering and nonlinear estimation. IEEE Review, 92, 401–422.
Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Transactions of the ASME, Journal of Basic Engineering, 82, 35–45.
Kanagawa, M., & Fukumizu, K. (2014). Recovering distributions from gaussian RKHS embeddings. In Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (pp. 457–465). JMLR.
Kanagawa, M., Nishiyama, Y., Gretton, A., & Fukumizu, K. (2014). Monte Carlo filtering using kernel embedding of distributions. In Proceedings of the 28th AAAI Conference on Artificial Intelligence (pp. 1897–1903). Cambridge, MA: MIT Press.
Ko, J., & Fox, D. (2009). GP-BayesFilters: Bayesian filtering using gaussian process prediction and observation models. Autonomous Robots, 27(1), 75–90.
Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Vol. 2, pp. 2169–2178). Washington, DC: IEEE Computer Society.
Liu, J. S. (2001). Monte Carlo strategies in scientific computing. New York: Springer-Verlag.
McCalman, L., O'Callaghan, S., & Ramos, F. (2013). Multi-modal estimation with kernel embeddings for learning motion models. In Proceedings of the 2013 IEEE International Conference on Robotics and Automation (pp. 2845–2852). Piscataway, NJ: IEEE.
Pan, S. J., & Yang, Q. (2010). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10), 1345–1359.
Pistohl, T., Ball, T., Schulze-Bonhage, A., Aertsen, A., & Mehring, C. (2008). Prediction of arm movement trajectories from ECoG-recordings in humans. Journal of Neuroscience Methods, 167(1), 105–114.
Pronobis, A., & Caputo, B. (2009). COLD: COsy Localization Database. International Journal of Robotics Research, 28(5), 588–594.
Quigley, M., Stavens, D., Coates, A., & Thrun, S. (2010). Sub-meter indoor localization in unmodified environments with inexpensive sensors. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems 2010 (Vol. 1, pp. 2039–2046). Piscataway, NJ: IEEE.
Ristic, B., Arulampalam, S., & Gordon, N. (2004). Beyond the Kalman filter: Particle filters for tracking applications. Norwood, MA: Artech House.
Schaid, D. J. (2010a). Genomic similarity and kernel methods I: Advancements by building on mathematical and statistical foundations. Human Heredity, 70(2), 109–131.
Schaid, D. J. (2010b). Genomic similarity and kernel methods II: Methods for genomic information. Human Heredity, 70(2), 132–140.
Schalk, G., Kubanek, J., Miller, K. J., Anderson, N. R., Leuthardt, E. C., Ojemann, J. G., ... Wolpaw, J. R. (2007). Decoding two-dimensional movement trajectories using electrocorticographic signals in humans. Journal of Neural Engineering, 4(264), 264–275.
Schölkopf, B., & Smola, A. J. (2002). Learning with kernels. Cambridge, MA: MIT Press.
Schölkopf, B., Tsuda, K., & Vert, J. P. (2004). Kernel methods in computational biology. Cambridge, MA: MIT Press.
Silverman, B. W. (1986). Density estimation for statistics and data analysis. London: Chapman and Hall.
Smola, A., Gretton, A., Song, L., & Schölkopf, B. (2007). A Hilbert space embedding for distributions. In Proceedings of the International Conference on Algorithmic Learning Theory (pp. 13–31). New York: Springer.
Song, L., Fukumizu, K., & Gretton, A. (2013). Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine, 30(4), 98–111.
Song, L., Huang, J., Smola, A., & Fukumizu, K. (2009). Hilbert space embeddings of conditional distributions with applications to dynamical systems. In Proceedings of the 26th International Conference on Machine Learning (pp. 961–968). Madison, WI: Omnipress.
Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Schölkopf, B., & Lanckriet, G. R. (2010). Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11, 1517–1561.
Steinwart, I., & Christmann, A. (2008). Support vector machines. New York: Springer.
Stone, C. J. (1977). Consistent nonparametric regression. Annals of Statistics, 5(4), 595–620.
Thrun, S., Burgard, W., & Fox, D. (2005). Probabilistic robotics. Cambridge, MA: MIT Press.
Vlassis, N., Terwijn, B., & Kröse, B. (2002). Auxiliary particle filter robot localization from high-dimensional sensor observations. In Proceedings of the International Conference on Robotics and Automation (pp. 7–12). Piscataway, NJ: IEEE.
Wang, Z., Ji, Q., Miller, K. J., & Schalk, G. (2011). Prior knowledge improves decoding of finger flexion from electrocorticographic signals. Frontiers in Neuroscience, 5, 127.
Widom, H. (1963). Asymptotic behavior of the eigenvalues of certain integral equations. Transactions of the American Mathematical Society, 109, 278–295.
Widom, H. (1964). Asymptotic behavior of the eigenvalues of certain integral equations II. Archive for Rational Mechanics and Analysis, 17, 215–229.
Wolf, J., Burgard, W., & Burkhardt, H. (2005). Robust vision-based localization by combining an image retrieval system with Monte Carlo localization. IEEE Transactions on Robotics, 21(2), 208–216.

Received May 18, 2015; accepted October 14, 2015.

Page 3: Filtering with State-Observation Examples via Kernel Monte ...

384 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Figure 1 Graphical representation of a state-space model y1 yT denoteobservations and x1 xT denote states The states are hidden and to beestimated from the observations

However it can be restrictive even to assume that the observation modelp(yt |xt ) is given as a probabilistic model An important point is that in prac-tice we may define the states x1 xT arbitrarily as quantities that we wishto estimate from available observations y1 yT Thus if these quantitiesare very different from the observations the observation model may notadmit a simple parametric form For example in location estimation prob-lems in robotics states are locations in a map while observations are sensordata such as camera images and signal strength measurements of a wire-less device (Vlassis Terwijn amp Krose 2002 Wolf Burgard amp Burkhardt2005 Ferris Hahnel amp Fox 2006) In brain-computer interface applicationsstates are defined as positions of a device to be manipulated while observa-tions are brain signals (Pistohl Ball Schulze-Bonhage Aertsen amp Mehring2008 Wang Ji Miller amp Schalk 2011) In these applications it is hard todefine the observation model as a probabilistic model in parametric form

For such applications where the observation model is very complicatedinformation about the relation between states and observations is given asexamples of state-observation pairs (XiYi) such examples are often avail-able before conducting filtering in test phase For example one can collectlocation-sensor examples for the location estimation problems by makinguse of more expensive sensors than those for filtering (Quigley StavensCoates amp Thrun 2010) The brain-computer interface problems also allowus to obtain training samples for the relation between device positions andbrain signals (Schalk et al 2007) However making use of such examplesfor learning the observation model is not straightforward If one relies ona parametric approach it would require exhaustive efforts for designinga parametric model to fit the complicated (true) observation model Non-parametric methods such as kernel density estimation (Silverman 1986)on the other hand suffer from the curse of dimensionality when applied tohigh-dimensional observations Moreover observations may be suitable tobe represented as structured (nonvectorial) data as for the cases of imageand text Such situations are not straightforward for either approach sincethey usually require that data is given as real vectors

11 Kernel Monte Carlo Filter In this letter we propose a filter-ing method that is focused on situations where the information of the

Filtering with State-Observation Examples 385

observation model p(yt |xt ) is given only through the state-observation ex-amples (XiYi) The proposed method which we call the kernel MonteCarlo filter (KMCF) is applicable when the following are satisfied

1 Positive-definite kernels (reproducing kernels) are defined on thestates and observations Roughly a positive-definite kernel is a sim-ilarity function that takes two data points as input and outputs theirsimilarity value

2 Sampling with the transition model p(xt |xtminus1) is possible This is thesame assumption as for standard particle filters the probabilisticmodel can be arbitrarily nonlinear and nongaussian

The past decades of research on kernel methods have yielded numerouskernels for real vectors and for structured data of various types (Scholkopfamp Smola 2002 Hofmann Scholkopf amp Smola 2008) Examples includekernels for images in computer vision (Lazebnik Schmid amp Ponce 2006)graph-structured data in bioinformatics (Scholkopf et al 2004) and ge-nomic sequences (Schaid 2010a 2010b) Therefore we can apply KMCFto such structured data by making use of the kernels developed in thesefields On the other hand this letter assumes that the transition modelis given explicitly we do not discuss parameter learning (for the caseof a parametric transition model) and we assume that parameters arefixed

KMCF is based on probability representations provided by the frame-work of kernel mean embeddings a recent development in the field ofkernel methods (Smola Gretton Song amp Scholkopf 2007 SriperumbudurGretton Fukumizu Scholkopf amp Lanckriet 2010 Song Fukumizu amp Gret-ton 2013) In this framework any probability distribution is represented as auniquely associated function in a reproducing kernel Hilbert space (RKHS)which is known as a kernel mean This representation enables us to esti-mate a distribution of interest by alternatively estimating the correspondingkernel mean One significant feature of kernel mean embeddings is kernelBayesrsquo rule (Fukumizu Song amp Gretton 2011 2013) by which KMCF es-timates posteriors based on the state-observation examples Kernel Bayesrsquorule has the following properties First it is theoretically grounded andis proven to get more accurate as the number of the examples increasesSecond it requires neither parametric assumptions nor heuristic approxi-mations for the observation model Third similar to other kernel methodsin machine learning kernel Bayesrsquo rule is empirically known to performwell for high-dimensional data when compared to classical nonparametricmethods KMCF inherits these favorable properties

KMCF sequentially estimates the RKHS representation of the posterior(see equation 11) in the form of weighted samples This estimation consistsof three steps prediction correction and resampling Suppose that wealready obtained an estimate for the posterior of the previous time In theprediction step this previous estimate is propagated forward by sampling

386 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

with the transition model in the same manner as the sampling procedure of aparticle filter The propagated estimate is then used as a prior for the currentstate In the correction step kernel Bayesrsquo rule is applied to obtain a posteriorestimate using the prior and the state-observation examples (XiYi)n

i=1Finally in the resampling step an approximate version of kernel herding(Chen Welling amp Smola 2010) is applied to obtain pseudosamples fromthe posterior estimate Kernel herding is a greedy optimization method togenerate pseudosamples from a given kernel mean and searches for thosesamples from the entire space X Our resampling algorithm modifies thisand searches for pseudosamples from a finite candidate set of the statesamples X1 Xn sub X The obtained pseudosamples are then used inthe prediction step of the next iteration

While the KMCF algorithm is inspired by particle filters there are severalimportant differences First a weighted sample expression in KMCF is anestimator of the RKHS representation of a probability distribution whilethat of a particle filter represents an empirical distribution This differencecan be seen in the fact that weights of KMCF can take negative valueswhile weights of a particle filter are always positive Second to estimate aposterior KMCF uses the state-observation examples (XiYi)n

i=1 and doesnot require the observation model itself while a particle filter makes use ofthe observation model to update weights In other words KMCF involvesnonparametric estimation of the observation model while a particle filterdoes not Third KMCF achieves resampling based on kernel herding whilea particle filter uses a standard resampling procedure with an empiricaldistribution We use kernel herding because the resampling procedure ofparticle methods is not appropriate for KMCF as the weights in KMCF maytake negative values

Since the theory of particle methods cannot be used to justify our ap-proach we conduct the following theoretical analysis

bull We derive error bounds for the sampling procedure in the predictionstep in section 51 This justifies the use of the sampling procedurewith weighted sample expressions of kernel mean embeddings Thebounds are not trivial since the weights of kernel mean embeddingscan take negative values

bull We discuss how resampling works with kernel mean embeddings (seesection 52) It improves the estimation accuracy of the subsequentsampling procedure by increasing the effective sample size of anempirical kernel mean This mechanism is essentially the same asthat of a particle filter

bull We provide novel convergence rates of kernel herding when pseu-dosamples are searched from a finite candidate set (see section 53)This justifies our resampling algorithm This result may be of inde-pendent interest to the kernel community as it describes how kernelherding is often used in practice

Filtering with State-Observation Examples 387

bull We show the consistency of the overall filtering procedure of KMCFunder certain smoothness assumptions (see section 54) KMCFprovides consistent posterior estimates as the number of state-observation examples (XiYi)n

i=1 increases

The rest of the letter is organized as follows In section 2 we reviewrelated work Section 3 is devoted to preliminaries to make the letter self-contained we review the theory of kernel mean embeddings Section 4presents the kernel Monte Carlo filter and section 5 shows theoretical re-sults In section 6 we demonstrate the effectiveness of KMCF by artificialand real-data experiments The real experiment is on vision-based mobilerobot localization an example of the location estimation problems men-tioned above The appendixes present two methods for reducing KMCFcomputational costs

This letter expands on a conference paper by Kanagawa NishiyamaGretton and Fukumizu (2014) It differs from that earlier work in that itintroduces and justifies the use of kernel herding for resampling The re-sampling step allows us to control the effective sample size of an empiricalkernel mean an important factor that determines the accuracy of the sam-pling procedure as in particle methods

2 Related Work

We consider the following setting First the observation model p(yt |xt ) isnot known explicitly or even parametrically Instead state-observation ex-amples (XiYi) are available before the test phase Second sampling fromthe transition model p(xt |xtminus1) is possible Note that standard particle filterscannot be applied to this setting directly since they require the observationmodel to be given as a parametric model

As far as we know a few methods can be applied to this setting directly(Vlassis et al 2002 Ferris et al 2006) These methods learn the observationmodel from state-observation examples nonparametrically and then use itto run a particle filter with a transition model Vlassis et al (2002) proposedto apply conditional density estimation based on the k-nearest neighborsapproach (Stone 1977) for learning the observation model A problem hereis that conditional density estimation suffers from the curse of dimension-ality if observations are high-dimensional (Silverman 1986) Vlassis et al(2002) avoided this problem by estimating the conditional density functionof a state given an observation and used it as an alternative for the obser-vation model This heuristic may introduce bias in estimation howeverFerris et al (2006) proposed using gaussian process regression for learningthe observation model This method will perform well if the gaussian noiseassumption is satisfied but cannot be applied to structured observations

There exist related but different problem settings from ours One situa-tion is that examples for state transitions are also given and the transition

388 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

model is to be learned nonparametrically from these examples For this set-ting there are methods based on kernel mean embeddings (Song HuangSmola amp Fukumizu 2009 Fukumizu et al 2011 2013) and gaussian pro-cesses (Ko amp Fox 2009 Deisenroth Huber amp Hanebeck 2009) The filteringmethod by Fukumizu et al (2011 2013) is in particular closely related toKMCF as it also uses kernel Bayesrsquo rule A main difference from KMCF isthat it computes forward probabilities by kernel sum rule (Song et al 20092013) which nonparametrically learns the transition model from the statetransition examples While the setting is different from ours we compareKMCF with this method in our experiments as a baseline

Another related setting is that the observation model itself is given andsampling is possible but computation of its values is expensive or evenimpossible Therefore ordinary Bayesrsquo rule cannot be used for filtering Toovercome this limitation Jasra Singh Martin and McCoy (2012) and Calvetand Czellar (2015) proposed applying approximate Bayesian computation(ABC) methods For each iteration of filtering these methods generate state-observation pairs from the observation model Then they pick some pairsthat have close observations to the test observation and regard the statesin these pairs as samples from a posterior Note that these methods arenot applicable to our setting since we do not assume that the observationmodel is provided That said our method may be applied to their setting bygenerating state-observation examples from the observation model Whilesuch a comparison would be interesting this letter focuses on comparisonamong the methods applicable to our setting

3 Kernel Mean Embeddings of Distributions

Here we briefly review the framework of kernel mean embeddings Fordetails we refer to the tutorial papers (Smola et al 2007 Song et al 2013)

31 Positive-Definite Kernel and RKHS We begin by introducingpositive-definite kernels and reproducing kernel Hilbert spaces details ofwhich can be found in Scholkopf and Smola (2002) Berlinet and Thomas-Agnan (2004) and Steinwart and Christmann (2008)

Let X be a set and k X times X rarr R be a positive-definite (pd) kernel1

Any positive-definite kernel is uniquely associated with a reproducing ker-nel Hilbert space (RKHS) (Aronszajn 1950) Let H be the RKHS associatedwith k The RKHS H is a Hilbert space of functions on X that satisfies thefollowing important properties

1A symmetric kernel k X times X rarr R is called positive definite (pd) if for all n isin Nc1 cn isin R and X1 Xn isin X we have

nsumi=1

nsumj=1

cic jk(Xi Xj ) ge 0

Filtering with State-Observation Examples 389

bull Feature vector k(middot x) isin H for all x isin X bull Reproducing property f (x) = 〈 f k(middot x)〉H for all f isin H and x isin X

where 〈middot middot〉H denotes the inner product equipped with H and k(middot x) is afunction with x fixed By the reproducing property we have

k(x xprime) = 〈k(middot x) k(middot xprime)〉H forallx xprime isin X

Namely k(x xprime) implicitly computes the inner product between the func-tions k(middot x) and k(middot xprime) From this property k(middot x) can be seen as an implicitrepresentation of x in H Therefore k(middot x) is called the feature vector of xand H the feature space It is also known that the subspace spanned by thefeature vectors k(middot x)|x isin X is dense in H This means that any function fin H can be written as the limit of functions of the form fn = sumn

i=1 cik(middot Xi)where c1 cn isin R and X1 Xn isin X

For example positive-definite kernels on the Euclidean space X = Rd

include gaussian kernel k(x xprime) = exp(minusx minus xprime222σ 2) and Laplace kernel

k(x xprime) = exp(minusx minus x1σ ) where σ gt 0 and middot 1 denotes the 1 normNotably kernel methods allow X to be a set of structured data such asimages texts or graphs In fact there exist various positive-definite kernelsdeveloped for such structured data (Hofmann et al 2008) Note that thenotion of positive-definite kernels is different from smoothing kernels inkernel density estimation (Silverman 1986) a smoothing kernel does notnecessarily define an RKHS

32 Kernel Means We use the kernel k and the RKHS H to representprobability distributions onX This is the framework of kernel mean embed-dings (Smola et al 2007) Let X be a measurable space and k be measurableand bounded on X 2 Let P be an arbitrary probability distribution on X Then the representation of P in H is defined as the mean of the featurevector

mP =int

k(middot x)dP(x) isin H (31)

which is called the kernel mean of PIf k is characteristic the kernel mean equation 31 preserves all the in-

formation about P a positive-definite kernel k is defined to be characteristicif the mapping P rarr mP isin H is one-to-one (Fukumizu Bach amp Jordan 2004Fukumizu Gretton Sun amp Scholkopf 2008 Sriperumbudur et al 2010)This means that the RKHS is rich enough to distinguish among all distribu-tions For example the gaussian and Laplace kernels are characteristic (Forconditions for kernels to be characteristic see Fukumizu Sriperumbudur

2k is bounded on X if supxisinX k(x x) lt infin

390 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Gretton amp Scholkopf 2009 and Sriperumbudur et al 2010) We assumehenceforth that kernels are characteristic

An important property of the kernel mean equation 31 is the followingby the reproducing property we have

〈mP f 〉H =int

f (x)dP(x) = EXsimP[ f (X)] forall f isin H (32)

that is the expectation of any function in the RKHS can be given by theinner product between the kernel mean and that function

33 Estimation of Kernel Means Suppose that distribution P is un-known and that we wish to estimate P from available samples This can beequivalently done by estimating its kernel mean mP since mP preserves allthe information about P

For example let X1 Xn be an independent and identically distributed(iid) sample from P Define an estimator of mP by the empirical mean

mP = 1n

nsumi=1

k(middot Xi)

Then this converges to mP at a rate mP minus mPH = Op(nminus12) (Smola et al

2007) where Op denotes the asymptotic order in probability and middot H isthe norm of the RKHS fH = radic〈 f f 〉H for all f isin H Note that this rateis independent of the dimensionality of the space X

Next we explain kernel Bayesrsquo rule which serves as a building block ofour filtering algorithm To this end we introduce two measurable spacesX and Y Let p(xy) be a joint probability on the product space X times Y thatdecomposes as p(x y) = p(y|x)p(x) Let π(x) be a prior distribution onX Then the conditional probability p(y|x) and the prior π(x) define theposterior distribution by Bayesrsquo rule

pπ (x|y) prop p(y|x)π(x)

The assumption here is that the conditional probability p(y|x) is un-known Instead we are given an iid sample (X1Y1) (XnYn) fromthe joint probability p(xy) We wish to estimate the posterior pπ (x|y) usingthe sample KBR achieves this by estimating the kernel mean of pπ (x|y)

KBR requires that kernels be defined on X and Y Let kX and kY bekernels on X and Y respectively Define the kernel means of the prior π(x)

and the posterior pπ (x|y)

mπ =int

kX (middot x)π(x)dx mπX|y =

intkX (middot x)pπ (x|y)dx

Filtering with State-Observation Examples 391

KBR also requires that mπ be expressed as a weighted sample Let mπ =sumj=1 γ jkX (middotUj) be a sample expression of mπ where isin N γ1 γ isin R

and U1 U isin X For example suppose U1 U are iid drawn fromπ(x) Then γ j = 1 suffices

Given the joint sample (XiYi)ni=1 and the empirical prior mean mπ

KBR estimates the kernel posterior mean mπX|y as a weighted sum of the

feature vectors

mπX|y =

nsumi=1

wikX (middot Xi) (33)

where the weights w = (w1 wn)T isin Rn are given by algorithm 1 Here

diag(v) for v isin Rn denotes a diagonal matrix with diagonal entries v It takes

as input (1) vectors kY = (kY (yY1) kY (yYn))T mπ = (mπ (X1)

mπ (Xn))T isin Rn where mπ (Xi) = sum

j=1 γ jkX (XiUj) (2) kernel matricesGX = (kX (Xi Xj)) GY = (kY (YiYj )) isin R

ntimesn and (3) regularization con-stants ε δ gt 0 The weight vector w = (w1 wn)T isin R

n is obtained bymatrix computations involving two regularized matrix inversions Notethat these weights can be negative

Fukumizu et al (2013) showed that KBR is a consistent estimator of thekernel posterior mean under certain smoothness assumptions the estimateequation 33 converges to mπ

X|y as the sample size goes to infinity n rarr infinand mπ converges to mπ (with ε δ rarr 0 in appropriate speed) (For detailssee Fukumizu et al 2013 Song et al 2013)

34 Decoding from Empirical Kernel Means In general as shownabove a kernel mean mP is estimated as a weighted sum of feature vectors

mP =nsum

i=1

wik(middot Xi) (34)

with samples X1 Xn isin X and (possibly negative) weights w1 wn isinR Suppose mP is close to mP that is mP minus mPH is small Then mP issupposed to have accurate information about P as mP preserves all theinformation of P

392 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

How can we decode the information of P from mP The empirical ker-nel mean equation 34 has the following property which is due to thereproducing property of the kernel

〈mP f 〉H =nsum

i=1

wi f (Xi) forall f isin H (35)

Namely the weighted average of any function in the RKHS is equal to theinner product between the empirical kernel mean and that function Thisis analogous to the property 32 of the population kernel mean mP Let fbe any function in H From these properties equations 32 and 35 we have

∣∣∣∣∣EXsimP[ f (X)] minusnsum

i=1

wi f (Xi)

∣∣∣∣∣ = |〈mP minus mP f 〉H| le mP minus mPH fH

where we used the Cauchy-Schwartz inequality Therefore the left-handside will be close to 0 if the error mP minus mPH is small This shows that theexpectation of f can be estimated by the weighted average

sumni=1 wi f (Xi)

Note that here f is a function in the RKHS but the same can also be shownfor functions outside the RKHS under certain assumptions (Kanagawa ampFukumizu 2014) In this way the estimator of the form 34 provides es-timators of moments probability masses on sets and the density func-tion (if this exists) We explain this in the context of state-space models insection 44

35 Kernel Herding Here we explain kernel herding (Chen et al 2010)another building block of the proposed filter Suppose the kernel mean mPis known We wish to generate samples x1 x2 x isin X such that theempirical mean mP = 1

sumi=1 k(middot xi) is close to mP that is mP minus mPH is

small This should be done only using mP Kernel herding achieves this bygreedy optimization using the following update equations

x1 = arg maxxisinX

mP(x) (36)

x = arg maxxisinX

mP(x) minus 1

minus1sumi=1

k(x xi) ( ge 2) (37)

where mP(x) denotes the evaluation of mP at x (recall that mP is a functionin H)

An intuitive interpretation of this procedure can be given if there is aconstant R gt 0 such that k(x x) = R for all x isin X (eg R = 1 if k is gaussian)Suppose that x1 xminus1 are already calculated In this case it can be shown

Filtering with State-Observation Examples 393

that x in equation 37 is the minimizer of

E =∥∥∥∥∥mP minus 1

sumi=1

k(middot xi)

∥∥∥∥∥H

(38)

Thus kernel herding performs greedy minimization of the distance betweenmP and the empirical kernel mean mP = 1

sumi=1 k(middot xi)

It can be shown that the error E of equation 38 decreases at a rate at leastO(minus12) under the assumption that k is bounded (Bach Lacoste-Julien ampObozinski 2012) In other words the herding samples x1 x provide aconvergent approximation of mP In this sense kernel herding can be seen asa (pseudo) sampling method Note that mP itself can be an empirical kernelmean of the form 34 These properties are important for our resamplingalgorithm developed in section 42

It should be noted that E decreases at a faster rate O(minus1) under a certainassumption (Chen et al 2010) this is much faster than the rate of iidsamples O(minus12) Unfortunately this assumption holds only when H isfinite dimensional (Bach et al 2012) and therefore the fast rate of O(minus1)

has not been guaranteed for infinite-dimensional cases Nevertheless thisfast rate motivates the use of kernel herding in the data reduction methodin section C2 in appendix C (we will use kernel herding for two differentpurposes)

4 Kernel Monte Carlo Filter

In this section we present our kernel Monte Carlo filter (KMCF) Firstwe define notation and review the problem setting in section 41 We thendescribe the algorithm of KMCF in section 42 We discuss implementationissues such as hyperparameter selection and computational cost in section43 We explain how to decode the information on the posteriors from theestimated kernel means in section 44

41 Notation and Problem Setup Here we formally define the setupexplained in section 1 The notation is summarized in Table 1

We consider a state-space model (see Figure 1) Let X and Y be mea-surable spaces which serve as a state space and an observation space re-spectively Let x1 xt xT isin X be a sequence of hidden states whichfollow a Markov process Let p(xt |xtminus1) denote a transition model that de-fines this Markov process Let y1 yt yT isin Y be a sequence of obser-vations Each observation yt is assumed to be generated from an observationmodel p(yt |xt ) conditioned on the corresponding state xt We use the abbre-viation y1t = y1 yt

We consider a filtering problem of estimating the posterior distributionp(xt |y1t ) for each time t = 1 T The estimation is to be done online

394 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Table 1: Notation.

$\mathcal{X}$: State space
$\mathcal{Y}$: Observation space
$x_t \in \mathcal{X}$: State at time $t$
$y_t \in \mathcal{Y}$: Observation at time $t$
$p(y_t|x_t)$: Observation model
$p(x_t|x_{t-1})$: Transition model
$\{(X_i, Y_i)\}_{i=1}^n$: State-observation examples
$k_{\mathcal{X}}$: Positive-definite kernel on $\mathcal{X}$
$k_{\mathcal{Y}}$: Positive-definite kernel on $\mathcal{Y}$
$\mathcal{H}_{\mathcal{X}}$: RKHS associated with $k_{\mathcal{X}}$
$\mathcal{H}_{\mathcal{Y}}$: RKHS associated with $k_{\mathcal{Y}}$

Specifically, we consider the following setting (see also section 1):

1. The observation model $p(y_t|x_t)$ is not known explicitly or even parametrically. Instead, we are given examples of state-observation pairs $\{(X_i, Y_i)\}_{i=1}^n \subset \mathcal{X} \times \mathcal{Y}$ prior to the test phase. The observation model is also assumed time homogeneous.
2. Sampling from the transition model $p(x_t|x_{t-1})$ is possible. Its probabilistic model can be an arbitrary nonlinear, nongaussian distribution, as for standard particle filters. It can further depend on time. For example, control input can be included in the transition model as $p(x_t|x_{t-1}) = p(x_t|x_{t-1}, u_t)$, where $u_t$ denotes control input provided by a user at time $t$.

Let $k_{\mathcal{X}}: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ and $k_{\mathcal{Y}}: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ be positive-definite kernels on $\mathcal{X}$ and $\mathcal{Y}$, respectively. Denote by $\mathcal{H}_{\mathcal{X}}$ and $\mathcal{H}_{\mathcal{Y}}$ their respective RKHSs. We address the above filtering problem by estimating the kernel means of the posteriors:

$$m_{x_t|y_{1:t}} := \int k_{\mathcal{X}}(\cdot, x_t)\, p(x_t|y_{1:t})\, dx_t \in \mathcal{H}_{\mathcal{X}} \quad (t = 1, \ldots, T). \qquad (4.1)$$

These preserve all the information of the corresponding posteriors if the kernels are characteristic (see section 3.2). Therefore, the resulting estimates of these kernel means provide us the information of the posteriors, as explained in section 4.4.

4.2 Algorithm. KMCF iterates three steps of prediction, correction, and resampling for each time $t$. Suppose that we have just finished the iteration at time $t - 1$. Then, as shown later, the resampling step yields the following estimator of equation 4.1 at time $t - 1$:


$$\hat{m}_{x_{t-1}|y_{1:t-1}} = \frac{1}{n}\sum_{i=1}^{n} k_{\mathcal{X}}(\cdot, \bar{X}_{t-1,i}), \qquad (4.2)$$

where $\bar{X}_{t-1,1}, \ldots, \bar{X}_{t-1,n} \in \mathcal{X}$. We show one iteration of KMCF that estimates the kernel mean 4.1 at time $t$ (see also Figure 2).

Figure 2: One iteration of KMCF. Here $X_1, \ldots, X_8$ and $Y_1, \ldots, Y_8$ denote states and observations, respectively, in the state-observation examples $\{(X_i, Y_i)\}_{i=1}^n$ (suppose $n = 8$). (1) Prediction step: the kernel mean of the prior, equation 4.5, is estimated by sampling with the transition model $p(x_t|x_{t-1})$. (2) Correction step: the kernel mean of the posterior, equation 4.1, is estimated by applying kernel Bayes' rule (see algorithm 1). The estimation makes use of the information of the prior (expressed as $\mathbf{m}_{\pi} = (\hat{m}_{x_t|y_{1:t-1}}(X_i)) \in \mathbb{R}^8$) as well as that of a new observation $y_t$ (expressed as $\mathbf{k}_Y = (k_{\mathcal{Y}}(y_t, Y_i)) \in \mathbb{R}^8$). The resulting estimate, equation 4.6, is expressed as a weighted sample $\{(w_{t,i}, X_i)\}_{i=1}^n$. Note that the weights may be negative. (3) Resampling step: samples associated with small weights are eliminated, and those with large weights are replicated, by applying kernel herding (see algorithm 2). The resulting samples provide an empirical kernel mean, equation 4.7, which will be used in the next iteration.

4.2.1 Prediction Step. The prediction step is as follows. We generate a sample from the transition model for each $\bar{X}_{t-1,i}$ in equation 4.2:

$$X_{t,i} \sim p(x_t | x_{t-1} = \bar{X}_{t-1,i}) \quad (i = 1, \ldots, n). \qquad (4.3)$$

We then specify a new empirical kernel mean,

$$\hat{m}_{x_t|y_{1:t-1}} = \frac{1}{n}\sum_{i=1}^{n} k_{\mathcal{X}}(\cdot, X_{t,i}). \qquad (4.4)$$

This is an estimator of the following kernel mean of the prior,

$$m_{x_t|y_{1:t-1}} = \int k_{\mathcal{X}}(\cdot, x_t)\, p(x_t|y_{1:t-1})\, dx_t \in \mathcal{H}_{\mathcal{X}}, \qquad (4.5)$$

where

$$p(x_t|y_{1:t-1}) = \int p(x_t|x_{t-1})\, p(x_{t-1}|y_{1:t-1})\, dx_{t-1}$$

is the prior distribution of the current state $x_t$. Thus, equation 4.4 serves as a prior for the subsequent posterior estimation.

In section 5 we theoretically analyze this sampling procedure in detail and provide justification of equation 4.4 as an estimator of the kernel mean, equation 4.5. We emphasize here that such an analysis is necessary even though the sampling procedure is similar to that of a particle filter: the theory of particle methods does not provide a theoretical justification of equation 4.4 as a kernel mean estimator, since it deals with probabilities as empirical distributions.
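To make the prediction step concrete, the following is a minimal NumPy sketch of equations 4.3 and 4.4 under an assumed one-dimensional transition model; transition_sample is a hypothetical user-supplied sampler and is not part of the letter.

    import numpy as np

    def predict_step(X_bar, transition_sample, rng):
        """Prediction step (eqs. 4.3-4.4): propagate each resampled state
        X_bar[i] through the transition model. The returned points define
        the uniform-weight empirical kernel mean of the prior (eq. 4.4)."""
        return np.array([transition_sample(x, rng) for x in X_bar])

    # Example: a gaussian random-walk-style transition, assumed only for illustration.
    rng = np.random.default_rng(0)
    transition_sample = lambda x, rng: 0.9 * x + rng.normal(0.0, 1.0)
    X_prev = rng.normal(size=100)                            # stand-in for resampled states at t-1
    X_prior = predict_step(X_prev, transition_sample, rng)   # samples representing eq. 4.4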

4.2.2 Correction Step. This step estimates the kernel mean, equation 4.1, of the posterior by using kernel Bayes' rule (see algorithm 1) in section 3.3. This makes use of the new observation $y_t$, the state-observation examples $\{(X_i, Y_i)\}_{i=1}^n$, and the estimate, equation 4.4, of the prior.

The input of algorithm 1 consists of (1) the vectors

$$\mathbf{k}_Y = (k_{\mathcal{Y}}(y_t, Y_1), \ldots, k_{\mathcal{Y}}(y_t, Y_n))^T \in \mathbb{R}^n,$$

$$\mathbf{m}_{\pi} = (\hat{m}_{x_t|y_{1:t-1}}(X_1), \ldots, \hat{m}_{x_t|y_{1:t-1}}(X_n))^T = \Big( \frac{1}{n}\sum_{i=1}^{n} k_{\mathcal{X}}(X_q, X_{t,i}) \Big)_{q=1}^{n} \in \mathbb{R}^n,$$

which are interpreted as expressions of $y_t$ and $\hat{m}_{x_t|y_{1:t-1}}$ using the sample $\{(X_i, Y_i)\}_{i=1}^n$; (2) the kernel matrices $G_X = (k_{\mathcal{X}}(X_i, X_j)),\ G_Y = (k_{\mathcal{Y}}(Y_i, Y_j)) \in \mathbb{R}^{n\times n}$; and (3) regularization constants $\varepsilon, \delta > 0$. These constants $\varepsilon, \delta$, as well as the kernels $k_{\mathcal{X}}, k_{\mathcal{Y}}$, are hyperparameters of KMCF (we discuss how to choose these parameters later).


Algorithm 1 outputs a weight vector $w = (w_1, \ldots, w_n)^T \in \mathbb{R}^n$. Normalizing these weights, $w_{t,i} := w_i / \sum_{j=1}^{n} w_j$, we obtain an estimator of equation 4.1^3 as

$$\hat{m}_{x_t|y_{1:t}} = \sum_{i=1}^{n} w_{t,i}\, k_{\mathcal{X}}(\cdot, X_i). \qquad (4.6)$$

The apparent difference from a particle filter is that the posterior (kernel mean) estimator, equation 4.6, is expressed in terms of the samples $X_1, \ldots, X_n$ in the training sample $\{(X_i, Y_i)\}_{i=1}^n$, not in terms of the samples from the prior, equation 4.4. This requires that the training samples $X_1, \ldots, X_n$ cover the support of the posterior $p(x_t|y_{1:t})$ sufficiently well. If this does not hold, we cannot expect good performance for the posterior estimate. Note that this is also true for any method that deals with the setting of this letter: poverty of training samples in a certain region means that we do not have any information about the observation model $p(y_t|x_t)$ in that region.
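For concreteness, the inputs $\mathbf{k}_Y$, $\mathbf{m}_\pi$, $G_X$, $G_Y$ and the resulting weights can be assembled as in the sketch below. It assumes gaussian kernels and the form of kernel Bayes' rule given in Fukumizu et al. (2013); the exact scaling and regularization conventions of the authors' algorithm 1 may differ, so treat this as an illustration rather than a faithful reproduction.

    import numpy as np

    def gauss_kernel(A, B, sigma):
        # Gram matrix k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2)); A, B are 2-D arrays.
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-d2 / (2 * sigma**2))

    def correction_step(X, Y, X_prior, y_t, sx, sy, eps, delta):
        """Correction step (eq. 4.6): weights over the training states X,
        given prior samples X_prior (eq. 4.4) and a new observation y_t."""
        n = X.shape[0]
        GX = gauss_kernel(X, X, sx)                   # G_X
        GY = gauss_kernel(Y, Y, sy)                   # G_Y
        m_pi = gauss_kernel(X, X_prior, sx).mean(1)   # m_pi[q] = (1/n) sum_i k(X_q, X_prior_i)
        kY = gauss_kernel(Y, y_t[None, :], sy)[:, 0]  # k_Y = (k(y_t, Y_i))_i
        mu = np.linalg.solve(GX + n * eps * np.eye(n), m_pi)   # prior weights over training states
        L = np.diag(mu) @ GY
        w = L @ np.linalg.solve(L @ L + delta * np.eye(n), np.diag(mu) @ kY)
        return w / w.sum()                            # normalization (line 16 of algorithm 3)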

4.2.3 Resampling Step. This step applies the update equations 3.6 and 3.7 of kernel herding in section 3.5 to the estimate, equation 4.6. This is to obtain samples $\bar{X}_{t,1}, \ldots, \bar{X}_{t,n}$ such that

$$\bar{m}_{x_t|y_{1:t}} = \frac{1}{n}\sum_{i=1}^{n} k_{\mathcal{X}}(\cdot, \bar{X}_{t,i}) \qquad (4.7)$$

is close to equation 4.6 in the RKHS. Our theoretical analysis in section 5 shows that such a procedure can reduce the error of the prediction step at time $t + 1$.

The procedure is summarized in algorithm 2. Specifically, we generate each $\bar{X}_{t,i}$ by searching for the solution of the optimization problem in equations 3.6 and 3.7 from the finite set of samples $X_1, \ldots, X_n$ in equation 4.6. We allow repetitions in $\bar{X}_{t,1}, \ldots, \bar{X}_{t,n}$. We can expect that the resulting equation 4.7 is close to equation 4.6 in the RKHS if the samples $X_1, \ldots, X_n$ cover the support of the posterior $p(x_t|y_{1:t})$ sufficiently. This is verified by the theoretical analysis in section 5.3.

^3 For this normalization procedure, see the discussion in section 4.3.

Here, searching for the solutions from a finite set reduces the computational costs of kernel herding. It is possible to search from the entire space $\mathcal{X}$ if we have sufficient time or if the sample size $n$ is small enough; this depends on applications and available computational resources. We also note that the size of the resampling samples is not necessarily $n$; this depends on how accurately these samples approximate equation 4.6. Thus, a smaller number of samples may be sufficient. In this case, we can reduce the computational costs of resampling, as discussed in section 5.2.

The aim of our resampling step is similar to that of the resampling step of a particle filter (see Doucet & Johansen, 2011). Intuitively, the aim is to eliminate samples with very small weights and replicate those with large weights (see Figures 2 and 3). In particle methods, this is realized by generating samples from the empirical distribution defined by a weighted sample (therefore, this procedure is called resampling). Our resampling step is a realization of such a procedure in terms of the kernel mean embedding: we generate samples $\bar{X}_{t,1}, \ldots, \bar{X}_{t,n}$ from the empirical kernel mean, equation 4.6.

Note that the resampling algorithm of particle methods is not appropriate for use with kernel mean embeddings. This is because it assumes that weights are positive, but our weights in equation 4.6 can be negative, as this equation is a kernel mean estimator. One may apply the resampling algorithm of particle methods by first truncating the samples with negative weights. However, there is no guarantee that samples obtained by this heuristic produce a good approximation of equation 4.6 as a kernel mean, as shown by the experiments in section 6.1. In this sense, the use of kernel herding is more natural, since it generates samples that approximate a kernel mean.
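A minimal sketch of this herding-based resampling over the finite candidate set is given below; it implements the greedy updates of equations 3.6 and 3.7 restricted to the candidate points, with incremental bookkeeping of the pairwise kernel sums (an implementation choice of this sketch, not necessarily of the authors' algorithm 2).

    import numpy as np

    def herding_resample(w, G, n_out):
        """Greedily pick candidate indices so that the uniform-weight empirical
        mean approximates the weighted kernel mean sum_i w[i] k(., X[i]).
        w: weights over the n candidates; G: Gram matrix k(X_i, X_j)."""
        mhat = G @ w                       # mhat[j] = m_hat_P(X_j)
        ksum = np.zeros_like(mhat)         # running sum of k(X_., X_picked)
        picked = []
        for j in range(n_out):
            obj = mhat - ksum / (j + 1)    # herding objective (eqs. 3.6-3.7) on the finite set
            idx = int(np.argmax(obj))
            picked.append(idx)             # repetitions are allowed
            ksum += G[:, idx]
        return np.array(picked)            # indices into X_1, ..., X_n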

4.2.4 Overall Algorithm. We summarize the overall procedure of KMCF in algorithm 3, where $p_{init}$ denotes a prior distribution for the initial state $x_1$. For each time $t$, KMCF takes as input an observation $y_t$ and outputs a weight vector $w_t = (w_{t,1}, \ldots, w_{t,n})^T \in \mathbb{R}^n$. Combined with the samples $X_1, \ldots, X_n$ in the state-observation examples $\{(X_i, Y_i)\}_{i=1}^n$, these weights provide the estimator, equation 4.6, of the kernel mean of the posterior, equation 4.1.

We first compute the kernel matrices $G_X, G_Y$ (lines 4-5), which are used in algorithm 1 of kernel Bayes' rule (line 15). For $t = 1$, we generate an i.i.d. sample $X_{1,1}, \ldots, X_{1,n}$ from the initial distribution $p_{init}$ (line 8), which provides an estimator of the prior corresponding to equation 4.4. Line 10 is the resampling step at time $t - 1$, and line 11 is the prediction step at time $t$. Lines 13 to 16 correspond to the correction step.
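Combining the sketches above, one filtering iteration could be organized as follows. The helper functions herding_resample and correction_step are the hypothetical sketches from the previous subsections (not the authors' code), ell plays the role of $\ell$ in section 5.2, and for $t = 1$ one would instead draw the prior samples i.i.d. from $p_{init}$, as in line 8 of algorithm 3.

    import numpy as np

    def kmcf_step(X, Y, G_X, w_prev, y_t, transition_sample, params, rng, ell=50):
        """One KMCF iteration: resampling (at t-1), prediction, correction."""
        n = X.shape[0]
        # resampling: ell herding picks, copied to n samples (section 5.2)
        idx = herding_resample(w_prev, G_X, ell)
        X_bar = np.repeat(X[idx], n // ell, axis=0)
        # prediction: propagate through the transition model (eq. 4.3)
        X_prior = np.array([transition_sample(x, rng) for x in X_bar])
        # correction: kernel Bayes' rule gives new weights over the training states X
        w_t = correction_step(X, Y, X_prior, y_t, *params)   # params = (sx, sy, eps, delta)
        return w_t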


4.3 Discussion. The estimation accuracy of KMCF can depend on several factors in practice.

4.3.1 Training Samples. We first note that the training samples $\{(X_i, Y_i)\}_{i=1}^n$ should provide the information concerning the observation model $p(y_t|x_t)$. For example, $\{(X_i, Y_i)\}_{i=1}^n$ may be an i.i.d. sample from a joint distribution $p(x, y)$ on $\mathcal{X} \times \mathcal{Y}$ that decomposes as $p(x, y) = p(y|x)p(x)$. Here, $p(y|x)$ is the observation model, and $p(x)$ is some distribution on $\mathcal{X}$. The support of $p(x)$ should cover the region where the states $x_1, \ldots, x_T$ may pass in the test phase, as discussed in section 4.2. For example, this is satisfied when the state space $\mathcal{X}$ is compact and the support of $p(x)$ is the entire $\mathcal{X}$.

Note that the training samples $\{(X_i, Y_i)\}_{i=1}^n$ can also be non-i.i.d. in practice. For example, we may deterministically select $X_1, \ldots, X_n$ so that they cover the region of interest. In location estimation problems in robotics, for instance, we may collect location-sensor examples $\{(X_i, Y_i)\}_{i=1}^n$ so that the locations $X_1, \ldots, X_n$ cover the region where location estimation is to be conducted (Quigley et al., 2010).

4.3.2 Hyperparameters. As in other kernel methods in general, the performance of KMCF depends on the choice of its hyperparameters, which are the kernels $k_{\mathcal{X}}$ and $k_{\mathcal{Y}}$ (or parameters in the kernels, e.g., the bandwidth of the gaussian kernel) and the regularization constants $\delta, \varepsilon > 0$. We need to define these hyperparameters based on the joint sample $\{(X_i, Y_i)\}_{i=1}^n$ before running the algorithm on the test data $y_1, \ldots, y_T$. This can be done by cross-validation. Suppose that $\{(X_i, Y_i)\}_{i=1}^n$ is given as a sequence from the state-space model. We can then apply two-fold cross-validation by dividing the sequence into two subsequences. If $\{(X_i, Y_i)\}_{i=1}^n$ is not a sequence, we can rely on the cross-validation procedure for kernel Bayes' rule (see section 4.2 of Fukumizu et al., 2013).

4.3.3 Normalization of Weights. We found in our preliminary experiments that normalization of the weights (see line 16, algorithm 3) is beneficial to the filtering performance. This may be justified by the following discussion about a kernel mean estimator in general. Let us consider a consistent kernel mean estimator $\hat{m}_P = \sum_{i=1}^{n} w_i k(\cdot, X_i)$ such that $\lim_{n\to\infty} \|\hat{m}_P - m_P\|_{\mathcal{H}} = 0$. Then we can show that the sum of the weights converges to 1, $\lim_{n\to\infty} \sum_{i=1}^{n} w_i = 1$, under certain assumptions (Kanagawa & Fukumizu, 2014). This could be explained as follows. Recall that the weighted average $\sum_{i=1}^{n} w_i f(X_i)$ of a function $f$ is an estimator of the expectation $\int f(x)\, dP(x)$. Let $f$ be a function that takes the value 1 for any input: $f(x) = 1,\ \forall x \in \mathcal{X}$. Then we have $\sum_{i=1}^{n} w_i f(X_i) = \sum_{i=1}^{n} w_i$ and $\int f(x)\, dP(x) = 1$. Therefore, $\sum_{i=1}^{n} w_i$ is an estimator of 1. In other words, if the error $\|\hat{m}_P - m_P\|_{\mathcal{H}}$ is small, then the sum of the weights $\sum_{i=1}^{n} w_i$ should be close to 1. Conversely, if the sum of the weights is far from 1, it suggests that the estimate $\hat{m}_P$ is not accurate. Based on this theoretical observation, we suppose that normalization of the weights (this makes the sum equal to 1) results in a better estimate.

4.3.4 Time Complexity. For each time $t$, the naive implementation of algorithm 3 requires a time complexity of $O(n^3)$ for the size $n$ of the joint sample $\{(X_i, Y_i)\}_{i=1}^n$. This comes from algorithm 1 in line 15 (kernel Bayes' rule) and algorithm 2 in line 10 (resampling). The complexity $O(n^3)$ of algorithm 1 is due to the matrix inversions. Note that one of the inversions, $(G_X + n\varepsilon I_n)^{-1}$, can be computed before the test phase, as it does not involve the test data. Algorithm 2 also has complexity $O(n^3)$. In section 5.2 we will explain how this cost can be reduced to $O(n^2)$ by generating only $\ell < n$ samples by resampling.

4.3.5 Speeding Up Methods. In appendix C we describe two methods for reducing the computational costs of KMCF, both of which only need to be applied prior to the test phase. The first is a low-rank approximation of the kernel matrices $G_X, G_Y$, which reduces the complexity to $O(nr^2)$, where $r$ is the rank of the low-rank matrices. Low-rank approximation works well in practice, since the eigenvalues of a kernel matrix often decay very rapidly. Indeed, this has been theoretically shown for some cases (see Widom, 1963, 1964, and discussions in Bach & Jordan, 2002). The second is a data-reduction method based on kernel herding, which efficiently selects joint subsamples from the training set $\{(X_i, Y_i)\}_{i=1}^n$. Algorithm 3 is then applied based only on those subsamples. The resulting complexity is thus $O(r^3)$, where $r$ is the number of subsamples. This method is motivated by the fast convergence rate of kernel herding (Chen et al., 2010).

Both methods require the number $r$ to be chosen, which is either the rank for low-rank approximation or the number of subsamples in data reduction. This determines the trade-off between accuracy and computational time. In practice, there are two ways of selecting the number $r$. By regarding $r$ as a hyperparameter of KMCF, we can select it by cross-validation; or we can choose $r$ by comparing the resulting approximation error, which is measured in a matrix norm for low-rank approximation and in an RKHS norm for the subsampling method (for details, see appendix C).
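Appendix C is not reproduced here, but to give a feel for the kind of low-rank approximation involved, the following is a generic Nystrom-style factorization of a kernel matrix; it is a standard construction shown only for illustration and is not necessarily the authors' algorithm 5.

    import numpy as np

    def nystrom_factor(K, r, rng):
        """Return F (n x r) with K approximately equal to F @ F.T, using r
        randomly chosen landmark columns (generic Nystrom-style low rank)."""
        n = K.shape[0]
        idx = rng.choice(n, size=r, replace=False)
        C = K[:, idx]                          # n x r block of landmark columns
        W = K[np.ix_(idx, idx)]                # r x r landmark block
        vals, vecs = np.linalg.eigh(W)         # W^{-1/2} via a small eigendecomposition
        vals = np.maximum(vals, 1e-12)
        W_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
        return C @ W_inv_sqrt                  # F F^T = C W^{-1} C^T, an approximation of K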

4.3.6 Transfer Learning Setting. We assumed that the observation model in the test phase is the same as for the training samples. However, this might not hold in some situations. For example, in the vision-based localization problem, the illumination conditions for the test and training phases might be different (e.g., the test is done at night, while the training samples are collected in the morning). Without taking into account such a significant change in the observation model, KMCF would not perform well in practice.

This problem could be addressed by exploiting the framework of transfer learning (Pan & Yang, 2010). This framework aims at situations where the probability distribution that generates test data is different from that of the training samples. The main assumption is that there exist a small number of examples from the test distribution. Transfer learning then provides a way of combining such test examples and abundant training samples, thereby improving the test performance. The application of transfer learning in our setting remains a topic for future research.

4.4 Estimation of Posterior Statistics. By algorithm 3, we obtain the estimates of the kernel means of the posteriors, equation 4.1, as

$$\hat{m}_{x_t|y_{1:t}} = \sum_{i=1}^{n} w_{t,i}\, k_{\mathcal{X}}(\cdot, X_i) \quad (t = 1, \ldots, T). \qquad (4.8)$$

These contain the information on the posteriors $p(x_t|y_{1:t})$ (see sections 3.2 and 3.4). We now show how to estimate statistics of the posteriors using these estimates. For ease of presentation, we consider the case $\mathcal{X} = \mathbb{R}^d$. Theoretical arguments to justify these operations are provided by Kanagawa and Fukumizu (2014).

4.4.1 Mean and Covariance. Consider the posterior mean $\int x_t\, p(x_t|y_{1:t})\, dx_t \in \mathbb{R}^d$ and the posterior (uncentered) covariance $\int x_t x_t^{T}\, p(x_t|y_{1:t})\, dx_t \in \mathbb{R}^{d\times d}$. These quantities can be estimated as

$$\sum_{i=1}^{n} w_{t,i} X_i \ \ \text{(mean)}, \qquad \sum_{i=1}^{n} w_{t,i} X_i X_i^{T} \ \ \text{(covariance)}.$$

4.4.2 Probability Mass. Let $A \subset \mathcal{X}$ be a measurable set with smooth boundary. Define the indicator function $I_A(x)$ by $I_A(x) = 1$ for $x \in A$ and $I_A(x) = 0$ otherwise. Consider the probability mass $\int I_A(x)\, p(x_t|y_{1:t})\, dx_t$. This can be estimated as $\sum_{i=1}^{n} w_{t,i} I_A(X_i)$.

4.4.3 Density. Suppose $p(x_t|y_{1:t})$ has a density function. Let $J(x)$ be a smoothing kernel satisfying $\int J(x)\, dx = 1$ and $J(x) \ge 0$. Let $h > 0$ and define $J_h(x) = \frac{1}{h^d} J(\frac{x}{h})$. Then the density of $p(x_t|y_{1:t})$ can be estimated as

$$\hat{p}(x_t|y_{1:t}) = \sum_{i=1}^{n} w_{t,i}\, J_h(x_t - X_i), \qquad (4.9)$$

with an appropriate choice of $h$.

4.4.4 Mode. The mode may be obtained by finding a point that maximizes equation 4.9. However, this requires a careful choice of $h$. Instead, we may use $X_{i_{\max}}$ with $i_{\max} := \arg\max_i w_{t,i}$ as a mode estimate. This is the point in $X_1, \ldots, X_n$ that is associated with the maximum weight in $w_{t,1}, \ldots, w_{t,n}$. This point can be interpreted as the point that maximizes equation 4.9 in the limit of $h \to 0$.

4.4.5 Other Methods. Other ways of using equation 4.8 include the preimage computation and fitting of gaussian mixtures (see, e.g., Song et al., 2009; Fukumizu et al., 2013; McCalman, O'Callaghan, & Ramos, 2013).
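The decoding rules of this section translate directly into code. The sketch below assumes a gaussian smoothing kernel $J$ for the density estimate and represents the set $A$ by an indicator function; both are illustrative choices, not prescriptions from the letter.

    import numpy as np

    def posterior_statistics(w, X, A=None, h=None, x_grid=None):
        """Decode statistics from the weighted representation (eq. 4.8).
        w: weights (may be negative); X: training states, shape (n, d)."""
        stats = {}
        stats["mean"] = X.T @ w                                    # sum_i w_i X_i
        stats["cov"] = (X * w[:, None]).T @ X                      # sum_i w_i X_i X_i^T (uncentered)
        stats["mode"] = X[int(np.argmax(w))]                       # X_{i_max}, section 4.4.4
        if A is not None:                                          # probability mass of a set A
            stats["mass"] = w[np.array([A(x) for x in X])].sum()   # A: indicator function on states
        if h is not None and x_grid is not None:                   # smoothed density (eq. 4.9),
            d = X.shape[1]                                         # gaussian J assumed here
            diff = x_grid[:, None, :] - X[None, :, :]
            J = np.exp(-np.sum(diff**2, -1) / (2 * h**2)) / (np.sqrt(2 * np.pi) * h) ** d
            stats["density"] = J @ w
        return stats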

5 Theoretical Analysis

In this section we analyze the sampling procedure of the prediction step in section 4.2. Specifically, we derive an upper bound on the error of the estimator 4.4. We also discuss in detail how the resampling step in section 4.2 works as a preprocessing step of the prediction step.

To make our analysis clear, we slightly generalize the setting of the prediction step and discuss the sampling and resampling procedures in this setting.

5.1 Error Bound for the Prediction Step. Let $\mathcal{X}$ be a measurable space and $P$ be a probability distribution on $\mathcal{X}$. Let $p(\cdot|x)$ be a conditional distribution on $\mathcal{X}$ conditioned on $x \in \mathcal{X}$. Let $Q$ be a marginal distribution on $\mathcal{X}$ defined by $Q(B) = \int p(B|x)\, dP(x)$ for all measurable $B \subset \mathcal{X}$. In the filtering setting of section 4, the space $\mathcal{X}$ corresponds to the state space, and the distributions $P$, $p(\cdot|x)$, and $Q$ correspond to the posterior $p(x_{t-1}|y_{1:t-1})$ at time $t - 1$, the transition model $p(x_t|x_{t-1})$, and the prior $p(x_t|y_{1:t-1})$ at time $t$, respectively.

Let $k_{\mathcal{X}}$ be a positive-definite kernel on $\mathcal{X}$ and $\mathcal{H}_{\mathcal{X}}$ be the RKHS associated with $k_{\mathcal{X}}$. Let $m_P = \int k_{\mathcal{X}}(\cdot, x)\, dP(x)$ and $m_Q = \int k_{\mathcal{X}}(\cdot, x)\, dQ(x)$ be the kernel means of $P$ and $Q$, respectively. Suppose that we are given an empirical estimate of $m_P$ as

$$\hat{m}_P = \sum_{i=1}^{n} w_i\, k_{\mathcal{X}}(\cdot, X_i), \qquad (5.1)$$

where $w_1, \ldots, w_n \in \mathbb{R}$ and $X_1, \ldots, X_n \in \mathcal{X}$. Considering this weighted sample form enables us to explain the mechanism of the resampling step.

The prediction step can then be cast as the following procedure: for each sample $X_i$, we generate a new sample $X'_i$ with the conditional distribution, $X'_i \sim p(\cdot|X_i)$. Then we estimate $m_Q$ by

$$\hat{m}_Q = \sum_{i=1}^{n} w_i\, k_{\mathcal{X}}(\cdot, X'_i), \qquad (5.2)$$

which corresponds to the estimate 4.4 of the prior kernel mean at time $t$.

The following theorem provides an upper bound on the error of equation 5.2 and reveals the properties of equation 5.1 that affect the error of the estimator, equation 5.2. The proof is given in appendix A.

Theorem 1. Let $\hat{m}_P$ be a fixed estimate of $m_P$ given by equation 5.1. Define a function $\theta$ on $\mathcal{X} \times \mathcal{X}$ by $\theta(x_1, x_2) = \int\!\!\int k_{\mathcal{X}}(x'_1, x'_2)\, dp(x'_1|x_1)\, dp(x'_2|x_2)$ for all $(x_1, x_2) \in \mathcal{X} \times \mathcal{X}$, and assume that $\theta$ is included in the tensor RKHS $\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}$.^4 The estimator $\hat{m}_Q$, equation 5.2, then satisfies

$$E_{X'_1, \ldots, X'_n}\big[\|\hat{m}_Q - m_Q\|^2_{\mathcal{H}_{\mathcal{X}}}\big] \le \sum_{i=1}^{n} w_i^2 \Big( E_{X'_i}[k_{\mathcal{X}}(X'_i, X'_i)] - E_{X'_i, \tilde{X}'_i}[k_{\mathcal{X}}(X'_i, \tilde{X}'_i)] \Big) \qquad (5.3)$$
$$+\ \|\hat{m}_P - m_P\|^2_{\mathcal{H}_{\mathcal{X}}}\, \|\theta\|_{\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}}, \qquad (5.4)$$

where $X'_i \sim p(\cdot|X_i)$ and $\tilde{X}'_i$ is an independent copy of $X'_i$.

^4 The tensor RKHS $\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}$ is the RKHS of a product kernel $k_{\mathcal{X}\times\mathcal{X}}$ on $\mathcal{X} \times \mathcal{X}$, defined as $k_{\mathcal{X}\times\mathcal{X}}((x_a, x_b), (x_c, x_d)) = k_{\mathcal{X}}(x_a, x_c)\, k_{\mathcal{X}}(x_b, x_d)$ for all $(x_a, x_b), (x_c, x_d) \in \mathcal{X} \times \mathcal{X}$. This space $\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}$ consists of smooth functions on $\mathcal{X} \times \mathcal{X}$ if the kernel $k_{\mathcal{X}}$ is smooth (e.g., if $k_{\mathcal{X}}$ is gaussian; see section 4 of Steinwart & Christmann, 2008). In this case, we can interpret this assumption as requiring that $\theta$ be smooth as a function on $\mathcal{X} \times \mathcal{X}$. The function $\theta$ can be written as the inner product between the kernel means of the conditional distributions: $\theta(x_1, x_2) = \langle m_{p(\cdot|x_1)}, m_{p(\cdot|x_2)} \rangle_{\mathcal{H}_{\mathcal{X}}}$, where $m_{p(\cdot|x)} = \int k_{\mathcal{X}}(\cdot, x')\, dp(x'|x)$. Therefore, the assumption may be further seen as requiring that the map $x \mapsto m_{p(\cdot|x)}$ be smooth. Note that while similar assumptions are common in the literature on kernel mean embeddings (e.g., theorem 5 of Fukumizu et al., 2013), we may relax this assumption by using approximate arguments in learning theory (e.g., theorems 2.2 and 2.3 of Eberts & Steinwart, 2013). This analysis remains a topic for future research.

From theorem 1 we can make the following observations. First, the second term, equation 5.4, of the upper bound shows that the error of the estimator, equation 5.2, is likely to be large if the given estimate, equation 5.1, has large error $\|\hat{m}_P - m_P\|^2_{\mathcal{H}_{\mathcal{X}}}$, which is reasonable to expect.

Second, the first term, equation 5.3, shows that the error of equation 5.2 can be large if the distribution of $X'_i$ (i.e., $p(\cdot|X_i)$) has large variance. For example, suppose $X'_i = f(X_i) + \varepsilon_i$, where $f: \mathcal{X} \to \mathcal{X}$ is some mapping and $\varepsilon_i$ is a random variable with mean 0. Let $k_{\mathcal{X}}$ be the gaussian kernel, $k_{\mathcal{X}}(x, x') = \exp(-\|x - x'\|^2 / 2\alpha)$ for some $\alpha > 0$. Then $E_{X'_i}[k_{\mathcal{X}}(X'_i, X'_i)] - E_{X'_i, \tilde{X}'_i}[k_{\mathcal{X}}(X'_i, \tilde{X}'_i)]$ increases from 0 to 1 as the variance of $\varepsilon_i$ (i.e., the variance of $X'_i$) increases from 0 to infinity. Therefore, in this case, equation 5.3 is upper-bounded at worst by $\sum_{i=1}^{n} w_i^2$. Note that $E_{X'_i}[k_{\mathcal{X}}(X'_i, X'_i)] - E_{X'_i, \tilde{X}'_i}[k_{\mathcal{X}}(X'_i, \tilde{X}'_i)]$ is always nonnegative.^5

5.1.1 Effective Sample Size. Now let us assume that the kernel $k_{\mathcal{X}}$ is bounded: there is a constant $C > 0$ such that $\sup_{x\in\mathcal{X}} k_{\mathcal{X}}(x, x) < C$. Then the inequality of theorem 1 can be further bounded as

$$E_{X'_1, \ldots, X'_n}\big[\|\hat{m}_Q - m_Q\|^2_{\mathcal{H}_{\mathcal{X}}}\big] \le 2C \sum_{i=1}^{n} w_i^2 + \|\hat{m}_P - m_P\|^2_{\mathcal{H}_{\mathcal{X}}}\, \|\theta\|_{\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}}. \qquad (5.5)$$

This bound shows that two quantities are important in the estimate, equation 5.1: (1) the sum of squared weights $\sum_{i=1}^{n} w_i^2$ and (2) the error $\|\hat{m}_P - m_P\|^2_{\mathcal{H}_{\mathcal{X}}}$. In other words, the error of equation 5.2 can be large if the quantity $\sum_{i=1}^{n} w_i^2$ is large, regardless of the accuracy of equation 5.1 as an estimator of $m_P$. In fact, the estimator of the form 5.1 can have large $\sum_{i=1}^{n} w_i^2$ even when $\|\hat{m}_P - m_P\|^2_{\mathcal{H}_{\mathcal{X}}}$ is small, as shown in section 6.1.

^5 To show this, it is sufficient to prove that $\int\!\!\int k_{\mathcal{X}}(x, \tilde{x})\, dP(x)\, dP(\tilde{x}) \le \int k_{\mathcal{X}}(x, x)\, dP(x)$ for any probability $P$. This can be shown as follows: $\int\!\!\int k_{\mathcal{X}}(x, \tilde{x})\, dP(x)\, dP(\tilde{x}) = \int\!\!\int \langle k_{\mathcal{X}}(\cdot, x), k_{\mathcal{X}}(\cdot, \tilde{x}) \rangle_{\mathcal{H}_{\mathcal{X}}}\, dP(x)\, dP(\tilde{x}) \le \int\!\!\int \sqrt{k_{\mathcal{X}}(x, x)} \sqrt{k_{\mathcal{X}}(\tilde{x}, \tilde{x})}\, dP(x)\, dP(\tilde{x}) \le \int k_{\mathcal{X}}(x, x)\, dP(x)$. Here we used the reproducing property, the Cauchy-Schwartz inequality, and Jensen's inequality.


Figure 3: An illustration of the sampling procedure with (right) and without (left) the resampling algorithm. The left panel corresponds to the kernel mean estimators, equations 5.1 and 5.2, in section 5.1, and the right panel corresponds to equations 5.6 and 5.7 in section 5.2.

The inverse of the sum of the squared weights, $1/\sum_{i=1}^{n} w_i^2$, can be interpreted as the effective sample size (ESS) of the empirical kernel mean, equation 5.1. To explain this, suppose that the weights are normalized: $\sum_{i=1}^{n} w_i = 1$. Then the ESS takes its maximum $n$ when the weights are uniform, $w_1 = \cdots = w_n = 1/n$. It becomes small when only a few samples have large weights (see the left side of Figure 3). Therefore, the bound, equation 5.5, can be interpreted as follows: to make equation 5.2 a good estimator of $m_Q$, we need to have equation 5.1 such that the ESS is large and the error $\|\hat{m}_P - m_P\|_{\mathcal{H}}$ is small. Here we borrowed the notion of ESS from the literature on particle methods, in which ESS has also played an important role (see section 2.5.3 of Liu, 2001, and section 3.5 of Doucet & Johansen, 2011).

5.2 Role of Resampling. Based on these arguments, we explain how the resampling step in section 4.2 works as a preprocessing step for the sampling procedure. Consider $\hat{m}_P$ in equation 5.1 as an estimate, equation 4.6, given by the correction step at time $t - 1$. Then we can think of $\hat{m}_Q$, equation 5.2, as an estimator of the kernel mean, equation 4.5, of the prior without the resampling step.

The resampling step is an application of kernel herding to $\hat{m}_P$, to obtain samples $\bar{X}_1, \ldots, \bar{X}_n$ that provide a new estimate of $m_P$ with uniform weights:

$$\bar{m}_P = \frac{1}{n}\sum_{i=1}^{n} k_{\mathcal{X}}(\cdot, \bar{X}_i). \qquad (5.6)$$

The subsequent prediction step is to generate a sample $X'_i \sim p(\cdot|\bar{X}_i)$ for each $\bar{X}_i$ ($i = 1, \ldots, n$) and to estimate $m_Q$ as

$$\bar{m}_Q = \frac{1}{n}\sum_{i=1}^{n} k_{\mathcal{X}}(\cdot, X'_i). \qquad (5.7)$$

Theorem 1 gives the following bound for this estimator, corresponding to equation 5.5:

$$E_{X'_1, \ldots, X'_n}\big[\|\bar{m}_Q - m_Q\|^2_{\mathcal{H}_{\mathcal{X}}}\big] \le \frac{2C}{n} + \|\bar{m}_P - m_P\|^2_{\mathcal{H}_{\mathcal{X}}}\, \|\theta\|_{\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}}. \qquad (5.8)$$

A comparison of the upper bounds of equations 5.5 and 5.8 implies that the resampling step is beneficial when $\sum_{i=1}^{n} w_i^2$ is large (i.e., the ESS is small) and $\|\bar{m}_P - \hat{m}_P\|_{\mathcal{H}_{\mathcal{X}}}$ is small. The condition on $\|\bar{m}_P - \hat{m}_P\|_{\mathcal{H}_{\mathcal{X}}}$ means that the loss by kernel herding (in terms of the RKHS distance) is small. This implies $\|\bar{m}_P - m_P\|_{\mathcal{H}_{\mathcal{X}}} \approx \|\hat{m}_P - m_P\|_{\mathcal{H}_{\mathcal{X}}}$, so the second term of equation 5.8 is close to that of equation 5.5. On the other hand, the first term of equation 5.8 will be much smaller than that of equation 5.5 if $\sum_{i=1}^{n} w_i^2 \gg 1/n$. In other words, the resampling step improves the accuracy of the sampling procedure by increasing the ESS of the kernel mean estimate $\hat{m}_P$. This is illustrated in Figure 3.

The above observations lead to the following procedures

5.2.1 When to Apply Resampling. If $\sum_{i=1}^{n} w_i^2$ is not large, the gain by the resampling step will be small. Therefore, the resampling algorithm should be applied when $\sum_{i=1}^{n} w_i^2$ is above a certain threshold, say $2/n$. The same strategy has been commonly used in particle methods (see Doucet & Johansen, 2011).

Also, the bound, equation 5.3, of theorem 1 shows that resampling is not beneficial if the variance of the conditional distribution $p(\cdot|x)$ is very small (i.e., if the state transition is nearly deterministic). In this case, the error of the sampling procedure may increase due to the loss $\|\bar{m}_P - \hat{m}_P\|_{\mathcal{H}_{\mathcal{X}}}$ caused by kernel herding.
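In code, this threshold rule could look as follows; this is a sketch reusing the hypothetical herding_resample from section 4.2.3, with the threshold $2/n$ following the suggestion above.

    import numpy as np

    def maybe_resample(w, G, threshold_factor=2.0):
        """Apply herding resampling only when sum_i w_i^2 exceeds threshold_factor / n
        (i.e., the ESS = 1 / sum_i w_i^2 is small); otherwise keep the weighted sample."""
        n = len(w)
        if np.sum(w**2) > threshold_factor / n:
            idx = herding_resample(w, G, n)      # indices of the resampled states
            return idx, np.full(n, 1.0 / n)      # uniform weights after resampling
        return np.arange(n), w                   # unchanged weighted sample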

5.2.2 Reduction of Computational Cost. Algorithm 2 generates $n$ samples $\bar{X}_1, \ldots, \bar{X}_n$ with time complexity $O(n^3)$. Suppose that the first $\ell$ samples $\bar{X}_1, \ldots, \bar{X}_\ell$, where $\ell < n$, already approximate $\hat{m}_P$ well: $\|\frac{1}{\ell}\sum_{i=1}^{\ell} k_{\mathcal{X}}(\cdot, \bar{X}_i) - \hat{m}_P\|_{\mathcal{H}_{\mathcal{X}}}$ is small. We then do not need to generate the rest of the samples $\bar{X}_{\ell+1}, \ldots, \bar{X}_n$; we can make $n$ samples by copying the $\ell$ samples $n/\ell$ times (suppose $n$ can be divided by $\ell$ for simplicity, say $n = 2\ell$). Let $\bar{X}_1, \ldots, \bar{X}_n$ denote these $n$ samples. Then $\frac{1}{\ell}\sum_{i=1}^{\ell} k_{\mathcal{X}}(\cdot, \bar{X}_i) = \frac{1}{n}\sum_{i=1}^{n} k_{\mathcal{X}}(\cdot, \bar{X}_i)$ by definition, so $\|\frac{1}{n}\sum_{i=1}^{n} k_{\mathcal{X}}(\cdot, \bar{X}_i) - \hat{m}_P\|_{\mathcal{H}_{\mathcal{X}}}$ is also small. This reduces the time complexity of algorithm 2 to $O(n^2)$.

One might think that it is unnecessary to copy the $\ell$ samples $n/\ell$ times to make $n$ samples. This is not true, however. Suppose that we just use the first $\ell$ samples to define $\bar{m}_P = \frac{1}{\ell}\sum_{i=1}^{\ell} k_{\mathcal{X}}(\cdot, \bar{X}_i)$. Then the first term of equation 5.8 becomes $2C/\ell$, which is larger than the $2C/n$ of $n$ samples. This difference involves sampling with the conditional distribution $X'_i \sim p(\cdot|\bar{X}_i)$: if we use just the $\ell$ samples, sampling is done $\ell$ times; if we use the copied $n$ samples, sampling is done $n$ times. Thus, the benefit of making $n$ samples comes from sampling with the conditional distribution many times. This matches the bound of theorem 1, where the first term involves the variance of the conditional distribution.
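The copying trick amounts to a single tiling operation, as the sketch below illustrates; it again reuses the hypothetical herding_resample and returns indices so that the caller can look up the corresponding states.

    import numpy as np

    def copy_resample(w, G, ell, n):
        """Section 5.2.2 trick: run herding for only ell < n steps, then copy the
        ell picks n/ell times. The copied set has the same empirical kernel mean,
        but the prediction step still draws n transition samples."""
        assert n % ell == 0
        idx_ell = herding_resample(w, G, ell)    # first ell herding picks
        return np.tile(idx_ell, n // ell)        # n indices (each pick repeated n/ell times)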

5.3 Convergence Rates for Resampling. Our resampling algorithm (see algorithm 2) is an approximate version of kernel herding in section 3.5: algorithm 2 searches for the solutions of the update equations 3.6 and 3.7 from a finite set $\{X_1, \ldots, X_n\} \subset \mathcal{X}$, not from the entire space $\mathcal{X}$. Therefore, existing theoretical guarantees for kernel herding (Chen et al., 2010; Bach et al., 2012) do not apply to algorithm 2. Here we provide a theoretical justification.

5.3.1 Generalized Version. We consider a slightly generalized version, shown in algorithm 4. It takes as input (1) a kernel mean estimator $\hat{m}_P$ of a kernel mean $m_P$, (2) candidate samples $Z_1, \ldots, Z_N$, and (3) the number $\ell$ of resampling samples. It then outputs resampling samples $\bar{X}_1, \ldots, \bar{X}_\ell \in \{Z_1, \ldots, Z_N\}$, which form a new estimator $\bar{m}_P = \frac{1}{\ell}\sum_{i=1}^{\ell} k_{\mathcal{X}}(\cdot, \bar{X}_i)$. Here $N$ is the number of the candidate samples.

Algorithm 4 searches for solutions of the update equations 3.6 and 3.7 from the candidate set $\{Z_1, \ldots, Z_N\}$. Note that these samples $Z_1, \ldots, Z_N$ can be different from those expressing the estimator $\hat{m}_P$. If they are the same, that is, if the estimator is expressed as $\hat{m}_P = \sum_{i=1}^{n} w_i k(\cdot, X_i)$ with $n = N$ and $X_i = Z_i$ ($i = 1, \ldots, n$), then algorithm 4 reduces to algorithm 2. In fact, theorem 2 allows $\hat{m}_P$ to be any element in the RKHS.

5.3.2 Convergence Rates in Terms of N and ℓ. Algorithm 4 gives the new estimator $\bar{m}_P$ of the kernel mean $m_P$. The error of this new estimator, $\|\bar{m}_P - m_P\|_{\mathcal{H}_{\mathcal{X}}}$, should be close to that of the given estimator, $\|\hat{m}_P - m_P\|_{\mathcal{H}_{\mathcal{X}}}$. Theorem 2 guarantees this. In particular, it provides convergence rates of $\|\bar{m}_P - m_P\|_{\mathcal{H}_{\mathcal{X}}}$ approaching $\|\hat{m}_P - m_P\|_{\mathcal{H}_{\mathcal{X}}}$ as $N$ and $\ell$ go to infinity. This theorem follows from theorem 3 in appendix B, which holds under weaker assumptions.

Theorem 2. Let $m_P$ be the kernel mean of a distribution $P$, and let $\hat{m}_P$ be any element in the RKHS $\mathcal{H}_{\mathcal{X}}$. Let $Z_1, \ldots, Z_N$ be an i.i.d. sample from a distribution with density $q$. Assume that $P$ has a density function $p$ such that $\sup_{x\in\mathcal{X}} p(x)/q(x) < \infty$. Let $\bar{X}_1, \ldots, \bar{X}_\ell$ be samples given by algorithm 4 applied to $\hat{m}_P$ with candidate samples $Z_1, \ldots, Z_N$. Then for $\bar{m}_P = \frac{1}{\ell}\sum_{i=1}^{\ell} k(\cdot, \bar{X}_i)$, we have

$$\|\bar{m}_P - m_P\|^2_{\mathcal{H}_{\mathcal{X}}} = \big( \|\hat{m}_P - m_P\|_{\mathcal{H}_{\mathcal{X}}} + O_p(N^{-1/2}) \big)^2 + O\Big(\frac{\ln \ell}{\ell}\Big) \quad (N, \ell \to \infty). \qquad (5.9)$$

Our proof in appendix B relies on the fact that kernel herding can be seen as the Frank-Wolfe optimization method (Bach et al., 2012). Indeed, the error $O(\ln \ell / \ell)$ in equation 5.9 comes from the optimization error of the Frank-Wolfe method after $\ell$ iterations (Freund & Grigas, 2014, bound 3.2). The error $O_p(N^{-1/2})$ is due to the approximation of the solution space by a finite set $Z_1, \ldots, Z_N$. These errors will be small if $N$ and $\ell$ are large enough and the error of the given estimator, $\|\hat{m}_P - m_P\|_{\mathcal{H}_{\mathcal{X}}}$, is relatively large. This is formally stated in corollary 1 below.

Theorem 2 assumes that the candidate samples are i.i.d. with a density $q$. The assumption $\sup_{x\in\mathcal{X}} p(x)/q(x) < \infty$ requires that the support of $q$ contain that of $p$. This is a formal characterization of the explanation in section 4.2 that the samples $X_1, \ldots, X_N$ should cover the support of $P$ sufficiently. Note that the statement of theorem 2 also holds for non-i.i.d. candidate samples, as shown in theorem 3 of appendix B.

5.3.3 Convergence Rates as $\hat{m}_P$ Goes to $m_P$. Theorem 2 provides convergence rates when the estimator $\hat{m}_P$ is fixed. In corollary 1 below, we let $\hat{m}_P$ approach $m_P$ and provide convergence rates for $\bar{m}_P$ of algorithm 4 approaching $m_P$. This corollary directly follows from theorem 2, since the constant terms in $O_p(N^{-1/2})$ and $O(\ln \ell / \ell)$ in equation 5.9 do not depend on $\hat{m}_P$, which can be seen from the proof in section B.

Corollary 1. Assume that $P$ and $Z_1, \ldots, Z_N$ satisfy the conditions in theorem 2 for all $N$. Let $\hat{m}_P^{(n)}$ be an estimator of $m_P$ such that $\|\hat{m}_P^{(n)} - m_P\|_{\mathcal{H}_{\mathcal{X}}} = O_p(n^{-b})$ as $n \to \infty$ for some constant $b > 0$.^6 Let $N = \ell = n^{2b}$. Let $\bar{X}_1^{(n)}, \ldots, \bar{X}_\ell^{(n)}$ be samples given by algorithm 4 applied to $\hat{m}_P^{(n)}$ with candidate samples $Z_1, \ldots, Z_N$. Then for $\bar{m}_P^{(n)} = \frac{1}{\ell}\sum_{i=1}^{\ell} k_{\mathcal{X}}(\cdot, \bar{X}_i^{(n)})$, we have

$$\|\bar{m}_P^{(n)} - m_P\|_{\mathcal{H}_{\mathcal{X}}} = O_p(n^{-b}) \quad (n \to \infty). \qquad (5.10)$$

^6 Here the estimator $\hat{m}_P^{(n)}$ and the candidate samples $Z_1, \ldots, Z_N$ can be dependent.

Corollary 1 assumes that the estimator $\hat{m}_P^{(n)}$ converges to $m_P$ at a rate $O_p(n^{-b})$ for some constant $b > 0$. Then the resulting estimator $\bar{m}_P^{(n)}$ of algorithm 4 also converges to $m_P$ at the same rate $O_p(n^{-b})$, if we set $N = \ell = n^{2b}$. This implies that if we use sufficiently large $N$ and $\ell$, the errors $O_p(N^{-1/2})$ and $O(\ln \ell / \ell)$ in equation 5.9 can be negligible, as stated earlier. Note that $N = \ell = n^{2b}$ implies that $N$ and $\ell$ can be smaller than $n$, since typically we have $b \le 1/2$ ($b = 1/2$ corresponds to the convergence rates of parametric models). This provides support for the discussion in section 5.2 (reduction of computational cost).

5.3.4 Convergence Rates of Sampling after Resampling. We can derive convergence rates of the estimator $\bar{m}_Q$, equation 5.7, in section 5.2. Here we consider the following construction of $\bar{m}_Q$, as discussed in section 5.2 (reduction of computational cost). First, apply algorithm 4 to $\hat{m}_P^{(n)}$ and obtain resampling samples $\bar{X}_1^{(n)}, \ldots, \bar{X}_\ell^{(n)} \in \{Z_1, \ldots, Z_N\}$. Then copy these samples $n/\ell$ times, and let $\bar{X}_1^{(n)}, \ldots, \bar{X}_n^{(n)}$ be the resulting $\ell \times n/\ell = n$ samples. Finally, sample with the conditional distribution, $X_i'^{(n)} \sim p(\cdot|\bar{X}_i^{(n)})$ ($i = 1, \ldots, n$), and define

$$\bar{m}_Q^{(n)} = \frac{1}{n}\sum_{i=1}^{n} k_{\mathcal{X}}(\cdot, X_i'^{(n)}). \qquad (5.11)$$

The following corollary is a consequence of corollary 1, theorem 1, and the bound, equation 5.8. Note that theorem 1 obtains convergence in expectation, which implies convergence in probability.

Corollary 2. Let $\theta$ be the function defined in theorem 1, and assume $\theta \in \mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}$. Assume that $P$ and $Z_1, \ldots, Z_N$ satisfy the conditions in theorem 2 for all $N$. Let $\hat{m}_P^{(n)}$ be an estimator of $m_P$ such that $\|\hat{m}_P^{(n)} - m_P\|_{\mathcal{H}_{\mathcal{X}}} = O_p(n^{-b})$ as $n \to \infty$ for some constant $b > 0$. Let $N = \ell = n^{2b}$. Then for the estimator $\bar{m}_Q^{(n)}$ defined as equation 5.11, we have

$$\|\bar{m}_Q^{(n)} - m_Q\|_{\mathcal{H}_{\mathcal{X}}} = O_p(n^{-\min(b, 1/2)}) \quad (n \to \infty).$$

Suppose $b \le 1/2$, which holds for basically any nonparametric estimator. Then corollary 2 shows that the estimator $\bar{m}_Q^{(n)}$ achieves the same convergence rate as the input estimator $\hat{m}_P^{(n)}$. Note that without resampling, the rate becomes $O_p\big(\sqrt{\sum_{i=1}^{n} (w_i^{(n)})^2} + n^{-b}\big)$, where the weights are given by the input estimator $\hat{m}_P^{(n)} = \sum_{i=1}^{n} w_i^{(n)} k_{\mathcal{X}}(\cdot, X_i^{(n)})$ (see the bound, equation 5.5). Thanks to resampling, (the square root of) the sum of the squared weights in the case of corollary 2 becomes $1/\sqrt{n}$ (the $n$ copied samples have uniform weights $1/n$), which is usually smaller than $\sqrt{\sum_{i=1}^{n} (w_i^{(n)})^2}$ and is faster than or equal to $O_p(n^{-b})$. This shows the merit of resampling in terms of convergence rates (see also the discussions in section 5.2).

5.4 Consistency of the Overall Procedure. Here we show the consistency of the overall procedure in KMCF. This is based on corollary 2, which shows the consistency of the resampling step followed by the prediction step, and on theorem 5 of Fukumizu et al. (2013), which guarantees the consistency of kernel Bayes' rule in the correction step. Thus, we consider three steps in the following order: resampling, prediction, and correction. More specifically, we show the consistency of the estimator, equation 4.6, of the posterior kernel mean at time $t$, given that the one at time $t - 1$ is consistent.

To state our assumptions, we will need the following functions, $\theta_{pos}: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, $\theta_{obs}: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, and $\theta_{tra}: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$:

$$\theta_{pos}(y, \tilde{y}) = \int\!\!\int k_{\mathcal{X}}(x_t, \tilde{x}_t)\, dp(x_t|y_{1:t-1}, y_t = y)\, dp(\tilde{x}_t|y_{1:t-1}, y_t = \tilde{y}), \qquad (5.12)$$

$$\theta_{obs}(x, \tilde{x}) = \int\!\!\int k_{\mathcal{Y}}(y_t, \tilde{y}_t)\, dp(y_t|x_t = x)\, dp(\tilde{y}_t|x_t = \tilde{x}), \qquad (5.13)$$

$$\theta_{tra}(x, \tilde{x}) = \int\!\!\int k_{\mathcal{X}}(x_t, \tilde{x}_t)\, dp(x_t|x_{t-1} = x)\, dp(\tilde{x}_t|x_{t-1} = \tilde{x}). \qquad (5.14)$$

These functions contain the information concerning the distributions involved. In equation 5.12, the distribution $p(x_t|y_{1:t-1}, y_t = y)$ denotes the posterior of the state at time $t$, given that the observation at time $t$ is $y_t = y$; similarly, $p(\tilde{x}_t|y_{1:t-1}, y_t = \tilde{y})$ is the posterior at time $t$, given that the observation is $y_t = \tilde{y}$. In equation 5.13, the distributions $p(y_t|x_t = x)$ and $p(\tilde{y}_t|x_t = \tilde{x})$ denote the observation model when the state is $x_t = x$ or $x_t = \tilde{x}$, respectively. In equation 5.14, the distributions $p(x_t|x_{t-1} = x)$ and $p(\tilde{x}_t|x_{t-1} = \tilde{x})$ denote the transition model with the previous state given by $x_{t-1} = x$ or $x_{t-1} = \tilde{x}$, respectively.

For simplicity of presentation, we consider here $N = \ell = n$ for the resampling step. Below, denote by $\mathcal{F} \otimes \mathcal{G}$ the tensor product space of two RKHSs $\mathcal{F}$ and $\mathcal{G}$.

Corollary 3. Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be an i.i.d. sample with a joint density $p(x, y) = p(y|x)\, q(x)$, where $p(y|x)$ is the observation model. Assume that the posterior $p(x_t|y_{1:t})$ has a density $p$ and that $\sup_{x\in\mathcal{X}} p(x)/q(x) < \infty$. Assume that the functions defined by equations 5.12 to 5.14 satisfy $\theta_{pos} \in \mathcal{H}_{\mathcal{Y}} \otimes \mathcal{H}_{\mathcal{Y}}$, $\theta_{obs} \in \mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}$, and $\theta_{tra} \in \mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}$, respectively. Suppose that $\|\hat{m}_{x_{t-1}|y_{1:t-1}} - m_{x_{t-1}|y_{1:t-1}}\|_{\mathcal{H}_{\mathcal{X}}} \to 0$ as $n \to \infty$ in probability. Then, for any sufficiently slow decay of the regularization constants $\varepsilon_n$ and $\delta_n$ of algorithm 1, we have

$$\|\hat{m}_{x_t|y_{1:t}} - m_{x_t|y_{1:t}}\|_{\mathcal{H}_{\mathcal{X}}} \to 0 \quad (n \to \infty)$$

in probability.

Corollary 3 follows from theorem 5 of Fukumizu et al. (2013) and corollary 2. The assumptions $\theta_{pos} \in \mathcal{H}_{\mathcal{Y}} \otimes \mathcal{H}_{\mathcal{Y}}$ and $\theta_{obs} \in \mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}$ are due to theorem 5 of Fukumizu et al. (2013) for the correction step, while the assumption $\theta_{tra} \in \mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}$ is due to theorem 1 for the prediction step, from which corollary 2 follows. As we discussed in note 4 of section 5.1, these essentially assume that the functions $\theta_{pos}$, $\theta_{obs}$, and $\theta_{tra}$ are smooth. Theorem 5 of Fukumizu et al. (2013) also requires that the regularization constants $\varepsilon_n, \delta_n$ of kernel Bayes' rule decay sufficiently slowly as the sample size goes to infinity ($\varepsilon_n, \delta_n \to 0$ as $n \to \infty$). (For details, see sections 5.2 and 6.2 in Fukumizu et al., 2013.)

It would be more interesting to investigate the convergence rates of the overall procedure. However, this requires a refined theoretical analysis of kernel Bayes' rule, which is beyond the scope of this letter. This is because currently there is no theoretical result on convergence rates of kernel Bayes' rule as an estimator of a posterior kernel mean (existing convergence results are for the expectation of function values; see theorems 6 and 7 in Fukumizu et al., 2013). This remains a topic for future research.

6 Experiments

This section is devoted to experiments. In section 6.1 we conduct basic experiments on the prediction and resampling steps before going on to the filtering problem; here we consider the problem described in section 5. In section 6.2 the proposed KMCF (see algorithm 3) is applied to synthetic state-space models, and comparisons are made with existing methods applicable to the setting of this letter (see also section 2). In section 6.3 we apply KMCF to the real problem of vision-based robot localization.

In the following, $N(\mu, \sigma^2)$ denotes the gaussian distribution with mean $\mu \in \mathbb{R}$ and variance $\sigma^2 > 0$.

6.1 Sampling and Resampling Procedures. The purpose here is to see how the prediction and resampling steps work empirically. To this end, we consider the problem described in section 5 with $\mathcal{X} = \mathbb{R}$ (see section 5.1 for details). Specifications of the problem are described below.


We will need to evaluate the errors $\|\hat{m}_P - m_P\|_{\mathcal{H}_{\mathcal{X}}}$ and $\|\hat{m}_Q - m_Q\|_{\mathcal{H}_{\mathcal{X}}}$, so we need to know the true kernel means $m_P$ and $m_Q$. To this end, we define the distributions and the kernel to be gaussian; this allows us to obtain analytic expressions for $m_P$ and $m_Q$.

6.1.1 Distributions and Kernel. More specifically, we define the marginal $P$ and the conditional distribution $p(\cdot|x)$ to be gaussian: $P = N(0, \sigma_P^2)$ and $p(\cdot|x) = N(x, \sigma_{cond}^2)$. Then the resulting $Q = \int p(\cdot|x)\, dP(x)$ also becomes gaussian: $Q = N(0, \sigma_P^2 + \sigma_{cond}^2)$. We define $k_{\mathcal{X}}$ to be the gaussian kernel, $k_{\mathcal{X}}(x, x') = \exp(-(x - x')^2 / 2\gamma^2)$. We set $\sigma_P = \sigma_{cond} = \gamma = 0.1$.

6.1.2 Kernel Means. Due to the convolution theorem of gaussian functions, the kernel means $m_P = \int k_{\mathcal{X}}(\cdot, x)\, dP(x)$ and $m_Q = \int k_{\mathcal{X}}(\cdot, x)\, dQ(x)$ can be analytically computed:

$$m_P(x) = \sqrt{\frac{\gamma^2}{\sigma_P^2 + \gamma^2}}\, \exp\Big( -\frac{x^2}{2(\gamma^2 + \sigma_P^2)} \Big), \qquad m_Q(x) = \sqrt{\frac{\gamma^2}{\sigma_P^2 + \sigma_{cond}^2 + \gamma^2}}\, \exp\Big( -\frac{x^2}{2(\sigma_P^2 + \sigma_{cond}^2 + \gamma^2)} \Big).$$

6.1.3 Empirical Estimates. We artificially defined an estimate $\hat{m}_P = \sum_{i=1}^{n} w_i k_{\mathcal{X}}(\cdot, X_i)$ as follows. First, we generated $n = 100$ samples $X_1, \ldots, X_{100}$ from a uniform distribution on $[-A, A]$ with some $A > 0$ (specified below). We computed the weights $w_1, \ldots, w_n$ by solving the optimization problem

$$\min_{w \in \mathbb{R}^n} \Big\| \sum_{i=1}^{n} w_i k_{\mathcal{X}}(\cdot, X_i) - m_P \Big\|^2_{\mathcal{H}} + \lambda \|w\|^2,$$

and then applied normalization so that $\sum_{i=1}^{n} w_i = 1$. Here $\lambda > 0$ is a regularization constant, which allows us to control the trade-off between the error $\|\hat{m}_P - m_P\|^2_{\mathcal{H}_{\mathcal{X}}}$ and the quantity $\sum_{i=1}^{n} w_i^2 = \|w\|^2$. If $\lambda$ is very small, the resulting $\hat{m}_P$ becomes accurate ($\|\hat{m}_P - m_P\|^2_{\mathcal{H}_{\mathcal{X}}}$ is small) but has large $\sum_{i=1}^{n} w_i^2$. If $\lambda$ is large, the error $\|\hat{m}_P - m_P\|^2_{\mathcal{H}_{\mathcal{X}}}$ may not be very small, but $\sum_{i=1}^{n} w_i^2$ becomes small. This enables us to see how the error $\|\hat{m}_Q - m_Q\|^2_{\mathcal{H}_{\mathcal{X}}}$ changes as we vary these quantities.
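This construction is easy to reproduce. The sketch below uses the closed-form $m_P$ from section 6.1.2 and the ridge solution of the stated optimization, $w = (G + \lambda I)^{-1} (m_P(X_i))_i$; the closed form used for $\|m_P\|^2_{\mathcal{H}}$ is derived by the same gaussian convolution argument and is an assumption of this sketch rather than a formula stated in the text.

    import numpy as np

    rng = np.random.default_rng(0)
    sP = gamma = 0.1
    A, n, lam = 5.0, 100, 1e-6

    k = lambda a, b: np.exp(-(a[:, None] - b[None, :])**2 / (2 * gamma**2))
    mP = lambda x: np.sqrt(gamma**2 / (sP**2 + gamma**2)) * np.exp(-x**2 / (2 * (sP**2 + gamma**2)))
    mP_sqnorm = np.sqrt(gamma**2 / (gamma**2 + 2 * sP**2))   # ||m_P||_H^2 (gaussian convolution; assumed)

    X = rng.uniform(-A, A, n)
    G = k(X, X)
    z = mP(X)                                   # z_i = <k(., X_i), m_P>
    w = np.linalg.solve(G + lam * np.eye(n), z) # ridge solution of the stated optimization
    w /= w.sum()                                # normalization: sum_i w_i = 1

    err_P = w @ G @ w - 2 * w @ z + mP_sqnorm   # ||m_hat_P - m_P||_H^2
    sum_sq = np.sum(w**2)                       # sum of squared weights (inverse ESS)
    print(err_P, sum_sq)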

6.1.4 Comparison. Given $\hat{m}_P = \sum_{i=1}^{n} w_i k_{\mathcal{X}}(\cdot, X_i)$, we wish to estimate the kernel mean $m_Q$. We compare three estimators:

• woRes: Estimate $m_Q$ without resampling. Generate samples $X'_i \sim p(\cdot|X_i)$ to produce the estimate $\hat{m}_Q = \sum_{i=1}^{n} w_i k_{\mathcal{X}}(\cdot, X'_i)$. This corresponds to the estimator discussed in section 5.1.
• Res-KH: First apply the resampling algorithm of algorithm 2 to $\hat{m}_P$, yielding $\bar{X}_1, \ldots, \bar{X}_n$. Then generate $X'_i \sim p(\cdot|\bar{X}_i)$ for each $\bar{X}_i$, giving the estimate $\bar{m}_Q = \frac{1}{n}\sum_{i=1}^{n} k(\cdot, X'_i)$. This is the estimator discussed in section 5.2.
• Res-Trunc: Instead of algorithm 2, first truncate the negative weights in $w_1, \ldots, w_n$ to be 0, and apply normalization to make the sum of the weights 1. Then apply the multinomial resampling algorithm of particle methods, and estimate $m_Q$ as in Res-KH.

Figure 4: Results of the experiments from section 6.1. (Top left and right) Sample-weight pairs of $\hat{m}_P = \sum_{i=1}^{n} w_i k_{\mathcal{X}}(\cdot, X_i)$ and $\hat{m}_Q = \sum_{i=1}^{n} w_i k(\cdot, X'_i)$. (Middle left and right) Histogram of samples $\bar{X}_1, \ldots, \bar{X}_n$ generated by algorithm 2, and that of samples $X'_1, \ldots, X'_n$ from the conditional distribution. (Bottom left and right) Histogram of samples generated with multinomial resampling after truncating negative weights, and that of samples from the conditional distribution.

6.1.5 Demonstration. Before starting quantitative comparisons, we demonstrate how the above estimators work. Figure 4 shows demonstration results with $A = 1$. First, note that for $\hat{m}_P = \sum_{i=1}^{n} w_i k(\cdot, X_i)$, the samples associated with large weights are located around the mean of $P$, as the standard deviation of $P$ is relatively small ($\sigma_P = 0.1$). Note also that some of the weights are negative. In this example, the error of $\hat{m}_P$ is very small, $\|\hat{m}_P - m_P\|^2_{\mathcal{H}_{\mathcal{X}}} = 8.49\mathrm{e}{-10}$, while that of the estimate $\hat{m}_Q$ given by woRes is $\|\hat{m}_Q - m_Q\|^2_{\mathcal{H}_{\mathcal{X}}} = 0.125$. This shows that even if $\|\hat{m}_P - m_P\|^2_{\mathcal{H}_{\mathcal{X}}}$ is very small, the resulting $\|\hat{m}_Q - m_Q\|^2_{\mathcal{H}_{\mathcal{X}}}$ may not be small, as implied by theorem 1 and the bound, equation 5.5.

We can observe the following. First, algorithm 2 successfully discarded samples associated with very small weights. Almost all the generated samples $\bar{X}_1, \ldots, \bar{X}_n$ are located in $[-2\sigma_P, 2\sigma_P]$, where $\sigma_P$ is the standard deviation of $P$. The error is $\|\bar{m}_P - m_P\|^2_{\mathcal{H}_{\mathcal{X}}} = 4.74\mathrm{e}{-5}$, which is greater than $\|\hat{m}_P - m_P\|^2_{\mathcal{H}_{\mathcal{X}}}$. This is due to the additional error caused by the resampling algorithm. Note that the resulting estimate $\bar{m}_Q$ has error $\|\bar{m}_Q - m_Q\|^2_{\mathcal{H}_{\mathcal{X}}} = 0.00827$. This is much smaller than the estimate $\hat{m}_Q$ by woRes, showing the merit of the resampling algorithm.

Res-Trunc first truncated the negative weights in $w_1, \ldots, w_n$. Let us consider the region where the density of $P$ is very small: the region outside $[-2\sigma_P, 2\sigma_P]$. We can observe that the absolute values of the weights are very small in this region. Note that there exist positive and negative weights. These weights maintain balance such that the amounts of positive and negative values are almost the same. Therefore, the truncation of the negative weights breaks this balance. As a result, the amount of the positive weights surpasses the amount needed to represent the density of $P$. This can be seen from the histogram for Res-Trunc: some of the samples $\bar{X}_1, \ldots, \bar{X}_n$ generated by Res-Trunc are located in the region where the density of $P$ is very small. Thus, the resulting error $\|\bar{m}_P - m_P\|^2_{\mathcal{H}_{\mathcal{X}}} = 0.0538$ is much larger than that of Res-KH. This demonstrates why the resampling algorithm of particle methods is not appropriate for kernel mean embeddings, as discussed in section 4.2.

6.1.6 Effects of the Sum of Squared Weights. The purpose here is to see how the error $\|\hat{m}_Q - m_Q\|^2_{\mathcal{H}_{\mathcal{X}}}$ changes as we vary the quantity $\sum_{i=1}^{n} w_i^2$ (recall that the bound, equation 5.5, indicates that $\|\hat{m}_Q - m_Q\|^2_{\mathcal{H}_{\mathcal{X}}}$ increases as $\sum_{i=1}^{n} w_i^2$ increases). To this end, we made $\hat{m}_P = \sum_{i=1}^{n} w_i k_{\mathcal{X}}(\cdot, X_i)$ for several values of the regularization constant $\lambda$, as described above. For each $\lambda$, we constructed $\hat{m}_P$ and estimated $m_Q$ using each of the three estimators above. We repeated this 20 times for each $\lambda$ and averaged the values of $\|\hat{m}_P - m_P\|^2_{\mathcal{H}_{\mathcal{X}}}$, $\sum_{i=1}^{n} w_i^2$, and the errors $\|\hat{m}_Q - m_Q\|^2_{\mathcal{H}_{\mathcal{X}}}$ by the three estimators. Figure 5 shows these results, where both axes are in the log scale. Here we used $A = 5$ for the support of the uniform distribution.^7 The results are summarized as follows.

^7 This enables us to maintain the values for $\|\hat{m}_P - m_P\|^2_{\mathcal{H}_{\mathcal{X}}}$ in almost the same amount, while changing the values for $\sum_{i=1}^{n} w_i^2$.


Figure 5: Results of the synthetic experiments for the sampling and resampling procedure in section 6.1. Vertical axis: errors in the squared RKHS norm. Horizontal axis: values of $\sum_{i=1}^{n} w_i^2$ for different $\hat{m}_P$. Black: the error of $\hat{m}_P$ ($\|\hat{m}_P - m_P\|^2_{\mathcal{H}_{\mathcal{X}}}$). Blue, green, and red: the errors on $m_Q$ by woRes, Res-KH, and Res-Trunc, respectively.

• The error of woRes (blue) increases proportionally to the amount of $\sum_{i=1}^{n} w_i^2$. This matches the bound, equation 5.5.
• The error of Res-KH is not affected by $\sum_{i=1}^{n} w_i^2$. Rather, it changes in parallel with the error of $\hat{m}_P$. This is explained by the discussions in section 5.2 on how our resampling algorithm improves the accuracy of the sampling procedure.
• Res-Trunc is worse than Res-KH, especially for large $\sum_{i=1}^{n} w_i^2$. This is also explained by the bound, equation 5.8. Here $\bar{m}_P$ is the one given by Res-Trunc, so the error $\|\bar{m}_P - \hat{m}_P\|_{\mathcal{H}_{\mathcal{X}}}$ can be large due to the truncation of the negative weights, as shown in the demonstration results. This makes the resulting error $\|\bar{m}_Q - m_Q\|_{\mathcal{H}_{\mathcal{X}}}$ large.

Note that $m_P$ and $m_Q$ are different kernel means, so it can happen that the errors $\|\bar{m}_Q - m_Q\|_{\mathcal{H}_{\mathcal{X}}}$ by Res-KH are less than $\|\hat{m}_P - m_P\|_{\mathcal{H}_{\mathcal{X}}}$, as in Figure 5.


6.2 Filtering with Synthetic State-Space Models. Here we apply KMCF to synthetic state-space models. Comparisons were made with the following methods.

kNN-PF (Vlassis et al., 2002). This method uses k-NN-based conditional density estimation (Stone, 1977) for learning the observation model. First, it estimates the conditional density of the inverse direction, $p(x|y)$, from the training sample $\{(X_i, Y_i)\}$. The learned conditional density is then used as an alternative for the likelihood $p(y_t|x_t)$; this is a heuristic to deal with high-dimensional $y_t$. Then it applies a particle filter (PF) based on the approximated observation model and the given transition model $p(x_t|x_{t-1})$.

GP-PF (Ferris et al., 2006). This method learns $p(y_t|x_t)$ from $\{(X_i, Y_i)\}$ with gaussian process (GP) regression. Then a particle filter is applied based on the learned observation model and the transition model. We used the open source code for GP regression in this experiment, so comparison in computational time is omitted for this method.^8

KBR filter (Fukumizu et al., 2011, 2013). This method is also based on kernel mean embeddings, as is KMCF. It applies kernel Bayes' rule (KBR) in posterior estimation, using the joint sample $\{(X_i, Y_i)\}$. This method assumes that there also exist training samples for the transition model. Thus, in the following experiments, we additionally drew training samples for the transition model. Fukumizu et al. (2011, 2013) showed that this method outperforms extended and unscented Kalman filters when a state-space model has strong nonlinearity (in that experiment, these Kalman filters were given full knowledge of the state-space model). We use this method as a baseline.

We used the state-space models defined in Table 2, where "SSM" stands for "state-space model." In Table 2, $u_t$ denotes a control input at time $t$; $v_t$ and $w_t$ denote independent gaussian noise, $v_t, w_t \sim N(0, 1)$; and $W_t$ denotes 10-dimensional gaussian noise, $W_t \sim N(0, I_{10})$. We generated each control $u_t$ randomly from the gaussian distribution $N(0, 1)$.

The state and observation spaces for SSMs 1a, 1b, 2a, 2b, 4a, 4b are defined as $\mathcal{X} = \mathcal{Y} = \mathbb{R}$; for SSMs 3a, 3b, $\mathcal{X} = \mathbb{R}$ and $\mathcal{Y} = \mathbb{R}^{10}$. The models in SSMs 1a, 2a, 3a, 4a and SSMs 1b, 2b, 3b, 4b with the same number (e.g., 1a and 1b) are almost the same; the difference is whether $u_t$ exists in the transition model. Prior distributions for the initial state $x_1$ for SSMs 1a, 1b, 2a, 2b, 3a, 3b are defined as $p_{init} = N(0, 1/(1 - 0.9^2))$, and those for 4a, 4b are defined as a uniform distribution on $[-3, 3]$.

Table 2: State-Space Models for Synthetic Experiments.

SSM 1a. Transition: $x_t = 0.9 x_{t-1} + v_t$. Observation: $y_t = x_t + w_t$.
SSM 1b. Transition: $x_t = 0.9 x_{t-1} + \frac{1}{\sqrt{2}}(u_t + v_t)$. Observation: $y_t = x_t + w_t$.
SSM 2a. Transition: $x_t = 0.9 x_{t-1} + v_t$. Observation: $y_t = 0.5 \exp(x_t/2)\, w_t$.
SSM 2b. Transition: $x_t = 0.9 x_{t-1} + \frac{1}{\sqrt{2}}(u_t + v_t)$. Observation: $y_t = 0.5 \exp(x_t/2)\, w_t$.
SSM 3a. Transition: $x_t = 0.9 x_{t-1} + v_t$. Observation: $y_t = 0.5 \exp(x_t/2)\, W_t$.
SSM 3b. Transition: $x_t = 0.9 x_{t-1} + \frac{1}{\sqrt{2}}(u_t + v_t)$. Observation: $y_t = 0.5 \exp(x_t/2)\, W_t$.
SSM 4a. Transition: $a_t = x_{t-1} + \sqrt{2}\, v_t$; $x_t = a_t$ if $|a_t| \le 3$, and $x_t = -3$ otherwise. Observation: $b_t = x_t + w_t$; $y_t = b_t$ if $|b_t| \le 3$, and $y_t = b_t - 6 b_t/|b_t|$ otherwise.
SSM 4b. Transition: $a_t = x_{t-1} + u_t + v_t$; $x_t = a_t$ if $|a_t| \le 3$, and $x_t = -3$ otherwise. Observation: $b_t = x_t + w_t$; $y_t = b_t$ if $|b_t| \le 3$, and $y_t = b_t - 6 b_t/|b_t|$ otherwise.

SSMs 1a and 1b are linear gaussian models. SSMs 2a and 2b are the so-called stochastic volatility models. Their transition models are the same as those of SSMs 1a and 1b; the observation model has strong nonlinearity, and the noise $w_t$ is multiplicative. SSMs 3a and 3b are almost the same as SSMs 2a and 2b; the difference is that the observation $y_t$ is 10-dimensional, as $W_t$ is 10-dimensional gaussian noise. SSMs 4a and 4b are more complex than the other models. Both the transition and observation models have strong nonlinearities: states and observations located around the edges of the interval $[-3, 3]$ may abruptly jump to distant places.

^8 http://www.gaussianprocess.org/gpml/code/matlab/doc
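For reference, training pairs and a test sequence can be generated from these models by direct simulation; the snippet below sketches this data-generation step for SSM 2a only (it does not implement any of the compared filters).

    import numpy as np

    def simulate_ssm2a(T, rng):
        """Simulate SSM 2a: x_t = 0.9 x_{t-1} + v_t,  y_t = 0.5 exp(x_t / 2) w_t,
        with v_t, w_t ~ N(0, 1) and x_1 ~ N(0, 1 / (1 - 0.9**2))."""
        x = np.empty(T); y = np.empty(T)
        x[0] = rng.normal(0.0, np.sqrt(1.0 / (1.0 - 0.9**2)))
        y[0] = 0.5 * np.exp(x[0] / 2) * rng.normal()
        for t in range(1, T):
            x[t] = 0.9 * x[t - 1] + rng.normal()
            y[t] = 0.5 * np.exp(x[t] / 2) * rng.normal()
        return x, y

    rng = np.random.default_rng(0)
    X_train, Y_train = simulate_ssm2a(1000, rng)   # state-observation examples {(X_i, Y_i)}
    x_test, y_test = simulate_ssm2a(100, rng)      # test sequence (x_t hidden from the filters)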

For each model, we generated the training samples $\{(X_i, Y_i)\}_{i=1}^n$ by simulating the model. Test data $\{(x_t, y_t)\}_{t=1}^T$ were also generated by independent simulation (recall that $x_t$ is hidden from each method). The length of the test sequence was set as $T = 100$. We fixed the number of particles in kNN-PF and GP-PF to 5000; in primary experiments, we did not observe any improvements even when more particles were used. For the same reason, we fixed the size of the transition examples for the KBR filter to 1000. Each method estimated the ground-truth states $x_1, \ldots, x_T$ by estimating the posterior means $\int x_t\, p(x_t|y_{1:t})\, dx_t$ ($t = 1, \ldots, T$). The performance was evaluated with the RMSE (root mean squared error) of the point estimates, defined as $\mathrm{RMSE} = \sqrt{\frac{1}{T}\sum_{t=1}^{T} (\hat{x}_t - x_t)^2}$, where $\hat{x}_t$ is the point estimate.

For KMCF and the KBR filter, we used gaussian kernels for each of $\mathcal{X}$ and $\mathcal{Y}$ (and also for the controls in the KBR filter). We determined the hyperparameters of each method by two-fold cross-validation, by dividing the training data into two sequences. The hyperparameters in the GP regressor for GP-PF were optimized by maximizing the marginal likelihood of the training data. To reduce the costs of the resampling step of KMCF, we used the method discussed in section 5.2 with $\ell = 50$. We also used the low-rank approximation method (see algorithm 5) and the subsampling method (see algorithm 6) in appendix C to reduce the computational costs of KMCF. Specifically, we used $r = 10, 20$ (rank of the low-rank matrices) for algorithm 5 (described as KMCF-low10 and KMCF-low20 in the results below) and $r = 50, 100$ (number of subsamples) for algorithm 6 (described as KMCF-sub50 and KMCF-sub100). We repeated the experiments 20 times for each different training sample size $n$.

Figure 6: RMSE of the synthetic experiments in section 6.2. The state-space models of these figures have no control in their transition models.

Figure 6 shows the results in RMSE for SSMs 1a, 2a, 3a, 4a, and Figure 7 shows those for SSMs 1b, 2b, 3b, 4b. Figure 8 describes the results in computation time for SSMs 1a and 1b; the results for the other models are similar, so we omit them. We do not show the results of KMCF-low10 in Figures 6 and 7, since they were numerically unstable and gave very large RMSEs.

Figure 7: RMSE of the synthetic experiments in section 6.2. The state-space models of these figures include control $u_t$ in their transition models.

Figure 8: Computation time of the synthetic experiments in section 6.2. (Left) SSM 1a. (Right) SSM 1b.

GP-PF performed the best for SSMs 1a and 1b. This may be because these models fit the assumption of GP regression, as their noise is additive gaussian. For the other models, however, GP-PF performed poorly: the observation models of these models have strong nonlinearities, and the noise is not additive gaussian. For these models, KMCF performed the best or competitively with the other methods. This indicates that KMCF successfully exploits the state-observation examples $\{(X_i, Y_i)\}_{i=1}^n$ in dealing with the complicated observation models. Recall that our focus has been on situations where the relations between states and observations are so complicated that the observation model is not known; the results indicate that KMCF is promising for such situations. On the other hand, the KBR filter performed worse than KMCF for most of the models. The KBR filter also uses kernel Bayes' rule, as does KMCF. The difference is that KMCF makes use of the transition models directly by sampling, while the KBR filter must learn the transition models from training data for state transitions. This indicates that the incorporation of the knowledge expressed in the transition model is very important for the filtering performance. This can also be seen by comparing Figures 6 and 7. The performance of the methods other than the KBR filter improved for SSMs 1b, 2b, 3b, 4b, compared to the performance for the corresponding models in SSMs 1a, 2a, 3a, 4a. Recall that SSMs 1b, 2b, 3b, 4b include control $u_t$ in their transition models. The information of control input is helpful for filtering in general. Thus, the improvements suggest that KMCF, kNN-PF, and GP-PF successfully incorporate the information of the controls; they achieve this by sampling with $p(x_t|x_{t-1}, u_t)$. On the other hand, the KBR filter must learn the transition model $p(x_t|x_{t-1}, u_t)$; this can be harder than learning the transition model $p(x_t|x_{t-1})$, which has no control input.

We next compare computation time (see Figure 8). KMCF was competitive with or even slower than the KBR filter. This is due to the resampling step in KMCF. The speed-up methods (KMCF-low10, KMCF-low20, KMCF-sub50, and KMCF-sub100) successfully reduced the costs of KMCF. KMCF-low10 and KMCF-low20 scaled linearly with the sample size $n$; this matches the fact that algorithm 5 reduces the costs of kernel Bayes' rule to $O(nr^2)$. The costs of KMCF-sub50 and KMCF-sub100 remained almost the same over the different sample sizes. This is because they reduce the sample size itself from $n$ to $r$, so the costs are reduced to $O(r^3)$ (see algorithm 6). KMCF-sub50 and KMCF-sub100 are competitive with kNN-PF, which is fast as it only needs kNN searches to deal with the training sample $\{(X_i, Y_i)\}_{i=1}^n$. In Figures 6 and 7, KMCF-low20 and KMCF-sub100 produced results competitive with KMCF for SSMs 1a, 2a, 4a, 1b, 2b, 4b. Thus, for these models, such methods reduce the computational costs of KMCF without losing much accuracy. KMCF-sub50 was slightly worse than KMCF-sub100. This indicates that the number of subsamples cannot be reduced to this extent if we wish to maintain accuracy. For SSMs 3a and 3b, the performance of KMCF-low20 and KMCF-sub100 was worse than KMCF, in contrast to the performance for the other models. The difference of SSMs 3a and 3b from the other models is that the observation space is 10-dimensional, $\mathcal{Y} = \mathbb{R}^{10}$. This suggests that if the dimension is high, $r$ needs to be large to maintain accuracy (recall that $r$ is the rank of the low-rank matrices in algorithm 5 and the number of subsamples in algorithm 6). This is also implied by the experiments in section 6.3.

6.3 Vision-Based Mobile Robot Localization. We applied KMCF to the problem of vision-based mobile robot localization (Vlassis et al., 2002; Wolf et al., 2005; Quigley et al., 2010). We consider a robot moving in a building. The robot takes images with its vision camera as it moves; thus the vision images form a sequence of observations $y_1, \ldots, y_T$ in time series, where each $y_t$ is an image. The robot does not know its positions in the building; we define the state $x_t$ as the robot's position at time $t$. The robot wishes to estimate its position $x_t$ from the sequence of its vision images $y_1, \ldots, y_t$. This can be done by filtering, that is, by estimating the posteriors $p(x_t|y_1, \ldots, y_t)$ ($t = 1, \ldots, T$). This is the robot localization problem. It is fundamental in robotics as a basis for more involved applications such as navigation and reinforcement learning (Thrun, Burgard, & Fox, 2005).

The state-space model is defined as follows The observation modelp(yt |xt ) is the conditional distribution of images given position which isvery complicated and considered unknown We need to assume position-image examples (XiYi)n

i=1 these samples are given in the data setdescribed below The transition model p(xt |xtminus1) = p(xt |xtminus1 ut ) is the con-ditional distribution of the current position given the previous one Thisinvolves a control input ut that specifies the movement of the robot In thedata set we use the control is given as odometry measurements Thus wedefine p(xt |xtminus1 ut ) as the odometry motion model which is fairly standardin robotics (Thrun et al 2005) Specifically we used the algorithm describedin table 56 of Thrun et al (2005) with all of its parameters fixed to 01 Theprior pinit of the initial position x1 is defined as a uniform distribution overthe samples X1 Xn in (XiYi)n

As a kernel k_Y for observations (images), we used the spatial pyramid matching kernel of Lazebnik et al. (2006). This is a positive-definite kernel developed in the computer vision community and is also fairly standard. Specifically, we set the parameters of this kernel as suggested in Lazebnik et al. (2006); this gives a 4200-dimensional histogram for each image. We defined the kernel k_X for states (positions) as gaussian. Here the state space is the four-dimensional space X = R^4: two dimensions for the location and the rest for the orientation of the robot.9

9 We projected the robot's orientation in [0, 2π] onto the unit circle in R^2.
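As an illustration of this choice of k_X (our sketch; the bandwidth sigma is a hyperparameter chosen by cross-validation, and the function name is ours), the gaussian kernel is evaluated after mapping the orientation to the unit circle, as in footnote 9:

```python
import numpy as np

def k_state(x1, x2, sigma=1.0):
    # Gaussian state kernel on (x, y, theta): the location is used as is, and the
    # orientation is represented as (cos theta, sin theta), giving points in R^4.
    def embed(x):
        px, py, theta = x
        return np.array([px, py, np.cos(theta), np.sin(theta)])
    d = embed(x1) - embed(x2)
    return np.exp(-np.dot(d, d) / (2.0 * sigma ** 2))
```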

The data set we used is the COLD database (Pronobis & Caputo, 2009), which is publicly available. Specifically, we used the data set Freiburg, Part A, Path 1, cloudy. This data set consists of three similar trajectories of a robot moving in a building, each of which provides position-image pairs {(x_t, y_t)}_{t=1}^T. We used two trajectories for training and validation and the rest for test. We made state-observation examples {(X_i, Y_i)}_{i=1}^n by randomly subsampling the pairs in the trajectory for training. Note that the difficulty of localization may depend on the time interval (i.e., the interval between t and t - 1 in sec). Therefore we made three test sets (and training samples for state transitions in the KBR filter) with different time intervals: 2.27 sec (T = 168), 4.54 sec (T = 84), and 6.81 sec (T = 56).

In these experiments we compared KMCF with three methods: kNN-PF, the KBR filter, and the naive method (NAI) defined below. For the KBR filter, we also defined the gaussian kernel on the control u_t, that is, on the difference of the odometry measurements at time t - 1 and t. The naive method (NAI) estimates the state x_t as the point X_j in the training set {(X_i, Y_i)} such that the corresponding observation Y_j is closest to the observation y_t. We performed this as a baseline. We also used the spatial pyramid matching kernel for these methods (for kNN-PF and NAI, as a similarity measure for the nearest neighbor search). We did not compare with GP-PF, since it assumes that observations are real vectors and thus cannot be applied to this problem straightforwardly. We determined the hyperparameters in each method by cross-validation. To reduce the cost of the resampling step in KMCF, we used the method discussed in section 5.2 with ℓ = 100. The low-rank approximation method (see algorithm 5) and the subsampling method (see algorithm 6) were also applied to reduce the computational costs of KMCF. Specifically, we set r = 50, 100 for algorithm 5 (described as KMCF-low50 and KMCF-low100 in the results below) and r = 150, 300 for algorithm 6 (KMCF-sub150 and KMCF-sub300).

Note that in this problem the posteriors p(x_t | y_{1:t}) can be highly multimodal. This is because similar images appear in distant locations. Therefore the posterior mean ∫ x_t p(x_t | y_{1:t}) dx_t is not appropriate for point estimation of the ground-truth position x_t. Thus, for KMCF and the KBR filter, we employed the heuristic for mode estimation explained in section 4.4. For kNN-PF, we used the particle with maximum weight for the point estimation. We evaluated the performance of each method by the RMSE of the location estimates. We ran each experiment 20 times for each training set of different size.

6.3.1 Results. First we demonstrate the behaviors of KMCF with this localization problem. Figures 9 and 10 show iterations of KMCF with n = 400 applied to the test data with time interval 6.81 sec. Figure 9 illustrates iterations that produced accurate estimates, while Figure 10 describes situations where location estimation is difficult.

Figures 11 and 12 show the results in RMSE and computation time, respectively. For all the results, KMCF and its variants with the computational reduction methods (KMCF-low50, KMCF-low100, KMCF-sub150, and KMCF-sub300) performed better than the KBR filter. These results show the benefit of directly manipulating the transition models with sampling. KMCF was competitive with kNN-PF for the interval 2.27 sec; note that kNN-PF was originally proposed for the robot localization problem. For the results with the longer time intervals (4.54 sec and 6.81 sec), KMCF outperformed kNN-PF.

Figure 9: Demonstration results. Each column corresponds to one iteration of KMCF. (Top) Prediction step: histogram of samples for the prior. (Middle) Correction step: weighted samples for the posterior. The blue and red stems indicate positive and negative weights, respectively. The yellow ball represents the ground-truth location x_t, and the green diamond the estimated one. (Bottom) Resampling step: histogram of samples given by the resampling step.

We next investigate the effect on KMCF of the methods to reduce computational cost. The performance of KMCF-low100 and KMCF-sub300 is competitive with KMCF; those of KMCF-low50 and KMCF-sub150 degrade as the sample size increases. Note that the values r = 50, 100 for algorithm 5 are larger than those in section 6.2, though the values of the sample size n are also larger than those in section 6.2. Also note that the performance of KMCF-sub150 is much worse than KMCF-sub300. These results indicate that we may need large values for r to maintain the accuracy for this localization problem. Recall that the spatial pyramid matching kernel gives essentially a high-dimensional feature vector (histogram) for each observation. Thus the observation space Y may be considered high-dimensional. This supports the hypothesis in section 6.2 that if the dimension is high, the computational cost-reduction methods may require larger r to maintain accuracy.

Figure 10: Demonstration results (see also the caption of Figure 9). Here we show time points where observed images are similar to those in distant places. Such a situation often occurs at corners and makes location estimation difficult. (a) The prior estimate is reasonable, but the resulting posterior has modes in distant places. This makes the location estimate (green diamond) far from the true location (yellow ball). (b) While the location estimate is very accurate, modes also appear at distant locations.

Finally, let us look at the results in computation time (see Figure 12). The results are similar to those in section 6.2. Although the values for r are relatively large, algorithms 5 and 6 successfully reduced the computational costs of KMCF.

7 Conclusion and Future Work

This letter has proposed the kernel Monte Carlo filter, a novel filtering method for state-space models. We have considered the situation where the observation model is not known explicitly or even parametrically and where examples of the state-observation relation are given instead of the observation model. Our approach is based on the framework of kernel mean embeddings, which enables us to deal with the observation model in a data-driven manner. The resulting filtering method consists of the prediction, correction, and resampling steps, all realized in terms of kernel mean embeddings. Methodological novelties lie in the prediction and resampling steps. Thus we analyzed their behaviors by deriving error bounds for the estimator of the prediction step. The analysis revealed that the effective sample size of a weighted sample plays an important role, as in particle methods. This analysis also explained how our resampling algorithm works. We applied the proposed method to synthetic and real problems, confirming the effectiveness of our approach.

Figure 11: RMSE of the robot localization experiments in section 6.3. Panels a, b, and c show the cases for time interval 2.27 sec, 4.54 sec, and 6.81 sec, respectively.


Figure 12: Computation time of the localization experiments in section 6.3. Panels a, b, and c show the cases for time interval 2.27 sec, 4.54 sec, and 6.81 sec, respectively. Note that the results show the runtime of each method.

One interesting topic for future research would be parameter estimation for the transition model. We did not discuss this and assumed that parameters, if they exist, are given and fixed. If the state-observation examples {(X_i, Y_i)}_{i=1}^n are given as a sequence from the state-space model, then we can use the state samples X_1, ..., X_n for estimating those parameters. Otherwise, we need to estimate the parameters based on test data. This might be possible by exploiting approaches for parameter estimation in particle methods (e.g., section IV in Cappe et al., 2007).

Another important topic is the situation where the observation model in the test and training phases is different. As discussed in section 4.3, this might be addressed by exploiting the framework of transfer learning (Pan & Yang, 2010). This would require extension of kernel mean embeddings to the setting of transfer learning, since there has been no work in this direction. We consider that such extension is interesting in its own right.

Appendix A Proof of Theorem 1

Before going to the proof, we review some basic facts that are needed. Let $m_P = \int k_{\mathcal X}(\cdot,x)\,dP(x)$ and $\hat m_P = \sum_{i=1}^n w_i k_{\mathcal X}(\cdot,X_i)$. By the reproducing property of the kernel $k_{\mathcal X}$, the following hold for any $f \in \mathcal H_{\mathcal X}$:
\[
\langle m_P, f\rangle_{\mathcal H_{\mathcal X}}
= \Bigl\langle \int k_{\mathcal X}(\cdot,x)\,dP(x),\; f\Bigr\rangle_{\mathcal H_{\mathcal X}}
= \int \langle k_{\mathcal X}(\cdot,x), f\rangle_{\mathcal H_{\mathcal X}}\,dP(x)
= \int f(x)\,dP(x) = \mathbb E_{X\sim P}[f(X)], \tag{A.1}
\]
\[
\langle \hat m_P, f\rangle_{\mathcal H_{\mathcal X}}
= \Bigl\langle \sum_{i=1}^n w_i k_{\mathcal X}(\cdot,X_i),\; f\Bigr\rangle_{\mathcal H_{\mathcal X}}
= \sum_{i=1}^n w_i f(X_i). \tag{A.2}
\]

For any $f, g \in \mathcal H_{\mathcal X}$, we denote by $f \otimes g \in \mathcal H_{\mathcal X}\otimes\mathcal H_{\mathcal X}$ the tensor product of $f$ and $g$, defined as
\[
f\otimes g\,(x_1,x_2) = f(x_1)g(x_2), \qquad \forall x_1,x_2\in\mathcal X. \tag{A.3}
\]
The inner product of the tensor RKHS $\mathcal H_{\mathcal X}\otimes\mathcal H_{\mathcal X}$ satisfies
\[
\langle f_1\otimes g_1,\, f_2\otimes g_2\rangle_{\mathcal H_{\mathcal X}\otimes\mathcal H_{\mathcal X}}
= \langle f_1,f_2\rangle_{\mathcal H_{\mathcal X}}\langle g_1,g_2\rangle_{\mathcal H_{\mathcal X}},
\qquad \forall f_1,f_2,g_1,g_2\in\mathcal H_{\mathcal X}. \tag{A.4}
\]

Let $\{\phi_i\}_{i=1}^I \subset \mathcal H_{\mathcal X}$ be complete orthonormal bases of $\mathcal H_{\mathcal X}$, where $I\in\mathbb N\cup\{\infty\}$. Assume $\theta\in\mathcal H_{\mathcal X}\otimes\mathcal H_{\mathcal X}$ (recall that this is an assumption of theorem 1). Then $\theta$ is expressed as
\[
\theta = \sum_{s,t=1}^I \alpha_{st}\,\phi_s\otimes\phi_t \tag{A.5}
\]
with $\sum_{s,t}|\alpha_{st}|^2 < \infty$ (see Aronszajn, 1950).

Proof of Theorem 1. Recall that $\hat m_Q = \sum_{i=1}^n w_i k_{\mathcal X}(\cdot,X_i')$, where $X_i'\sim p(\cdot|X_i)$ $(i=1,\dots,n)$. Then
\[
\begin{aligned}
&\mathbb E_{X_1',\dots,X_n'}\bigl[\|\hat m_Q - m_Q\|_{\mathcal H_{\mathcal X}}^2\bigr]
= \mathbb E_{X_1',\dots,X_n'}\bigl[\langle \hat m_Q,\hat m_Q\rangle_{\mathcal H_{\mathcal X}}
- 2\langle\hat m_Q, m_Q\rangle_{\mathcal H_{\mathcal X}} + \langle m_Q,m_Q\rangle_{\mathcal H_{\mathcal X}}\bigr] \\
&= \sum_{i,j=1}^n w_iw_j\,\mathbb E_{X_i',X_j'}[k_{\mathcal X}(X_i',X_j')]
- 2\sum_{i=1}^n w_i\,\mathbb E_{X'\sim Q,\,X_i'}[k_{\mathcal X}(X',X_i')]
+ \mathbb E_{X',\tilde X'\sim Q}[k_{\mathcal X}(X',\tilde X')] \\
&= \sum_{i\neq j} w_iw_j\,\mathbb E_{X_i',X_j'}[k_{\mathcal X}(X_i',X_j')]
+ \sum_{i=1}^n w_i^2\,\mathbb E_{X_i'}[k_{\mathcal X}(X_i',X_i')] \\
&\qquad - 2\sum_{i=1}^n w_i\,\mathbb E_{X'\sim Q,\,X_i'}[k_{\mathcal X}(X',X_i')]
+ \mathbb E_{X',\tilde X'\sim Q}[k_{\mathcal X}(X',\tilde X')],
\end{aligned} \tag{A.6}
\]
where $\tilde X'$ denotes an independent copy of $X'$.

Recall that $Q = \int p(\cdot|x)\,dP(x)$ and $\theta(x,\tilde x) = \int\!\!\int k_{\mathcal X}(x',\tilde x')\,dp(x'|x)\,dp(\tilde x'|\tilde x)$. We can then rewrite terms in equation A.6 as
\[
\mathbb E_{X'\sim Q,\,X_i'}[k_{\mathcal X}(X',X_i')]
= \int\Bigl(\int\!\!\int k_{\mathcal X}(x',x_i')\,dp(x'|x)\,dp(x_i'|X_i)\Bigr)dP(x)
= \int \theta(x,X_i)\,dP(x) = \mathbb E_{X\sim P}[\theta(X,X_i)],
\]
\[
\mathbb E_{X',\tilde X'\sim Q}[k_{\mathcal X}(X',\tilde X')]
= \int\!\!\int\Bigl(\int\!\!\int k_{\mathcal X}(x',\tilde x')\,dp(x'|x)\,dp(\tilde x'|\tilde x)\Bigr)dP(x)\,dP(\tilde x)
= \int\!\!\int\theta(x,\tilde x)\,dP(x)\,dP(\tilde x) = \mathbb E_{X,\tilde X\sim P}[\theta(X,\tilde X)].
\]
Thus equation A.6 is equal to
\[
\sum_{i=1}^n w_i^2\bigl(\mathbb E_{X_i'}[k_{\mathcal X}(X_i',X_i')] - \mathbb E_{X_i',\tilde X_i'}[k_{\mathcal X}(X_i',\tilde X_i')]\bigr)
+ \sum_{i,j=1}^n w_iw_j\,\theta(X_i,X_j)
- 2\sum_{i=1}^n w_i\,\mathbb E_{X\sim P}[\theta(X,X_i)]
+ \mathbb E_{X,\tilde X\sim P}[\theta(X,\tilde X)]. \tag{A.7}
\]
We can rewrite terms in equation A.7 as follows, using the facts A.1, A.2, A.3, A.4, and A.5:
\[
\begin{aligned}
\sum_{i,j} w_iw_j\,\theta(X_i,X_j)
&= \sum_{i,j} w_iw_j \sum_{s,t}\alpha_{st}\phi_s(X_i)\phi_t(X_j)
= \sum_{s,t}\alpha_{st}\sum_i w_i\phi_s(X_i)\sum_j w_j\phi_t(X_j) \\
&= \sum_{s,t}\alpha_{st}\langle\hat m_P,\phi_s\rangle_{\mathcal H_{\mathcal X}}\langle\hat m_P,\phi_t\rangle_{\mathcal H_{\mathcal X}}
= \sum_{s,t}\alpha_{st}\langle\hat m_P\otimes\hat m_P,\,\phi_s\otimes\phi_t\rangle_{\mathcal H_{\mathcal X}\otimes\mathcal H_{\mathcal X}}
= \langle\hat m_P\otimes\hat m_P,\,\theta\rangle_{\mathcal H_{\mathcal X}\otimes\mathcal H_{\mathcal X}},
\end{aligned}
\]
\[
\begin{aligned}
\sum_i w_i\,\mathbb E_{X\sim P}[\theta(X,X_i)]
&= \sum_i w_i\,\mathbb E_{X\sim P}\Bigl[\sum_{s,t}\alpha_{st}\phi_s(X)\phi_t(X_i)\Bigr]
= \sum_{s,t}\alpha_{st}\,\mathbb E_{X\sim P}[\phi_s(X)]\sum_i w_i\phi_t(X_i) \\
&= \sum_{s,t}\alpha_{st}\langle m_P,\phi_s\rangle_{\mathcal H_{\mathcal X}}\langle\hat m_P,\phi_t\rangle_{\mathcal H_{\mathcal X}}
= \sum_{s,t}\alpha_{st}\langle m_P\otimes\hat m_P,\,\phi_s\otimes\phi_t\rangle_{\mathcal H_{\mathcal X}\otimes\mathcal H_{\mathcal X}}
= \langle m_P\otimes\hat m_P,\,\theta\rangle_{\mathcal H_{\mathcal X}\otimes\mathcal H_{\mathcal X}},
\end{aligned}
\]
\[
\begin{aligned}
\mathbb E_{X,\tilde X\sim P}[\theta(X,\tilde X)]
&= \mathbb E_{X,\tilde X\sim P}\Bigl[\sum_{s,t}\alpha_{st}\phi_s(X)\phi_t(\tilde X)\Bigr]
= \sum_{s,t}\alpha_{st}\langle m_P,\phi_s\rangle_{\mathcal H_{\mathcal X}}\langle m_P,\phi_t\rangle_{\mathcal H_{\mathcal X}} \\
&= \sum_{s,t}\alpha_{st}\langle m_P\otimes m_P,\,\phi_s\otimes\phi_t\rangle_{\mathcal H_{\mathcal X}\otimes\mathcal H_{\mathcal X}}
= \langle m_P\otimes m_P,\,\theta\rangle_{\mathcal H_{\mathcal X}\otimes\mathcal H_{\mathcal X}}.
\end{aligned}
\]
Thus equation A.7 is equal to
\[
\begin{aligned}
&\sum_{i=1}^n w_i^2\bigl(\mathbb E_{X_i'}[k_{\mathcal X}(X_i',X_i')] - \mathbb E_{X_i',\tilde X_i'}[k_{\mathcal X}(X_i',\tilde X_i')]\bigr)
+ \langle\hat m_P\otimes\hat m_P,\theta\rangle_{\mathcal H_{\mathcal X}\otimes\mathcal H_{\mathcal X}}
- 2\langle m_P\otimes\hat m_P,\theta\rangle_{\mathcal H_{\mathcal X}\otimes\mathcal H_{\mathcal X}}
+ \langle m_P\otimes m_P,\theta\rangle_{\mathcal H_{\mathcal X}\otimes\mathcal H_{\mathcal X}} \\
&= \sum_{i=1}^n w_i^2\bigl(\mathbb E_{X_i'}[k_{\mathcal X}(X_i',X_i')] - \mathbb E_{X_i',\tilde X_i'}[k_{\mathcal X}(X_i',\tilde X_i')]\bigr)
+ \langle(\hat m_P - m_P)\otimes(\hat m_P - m_P),\,\theta\rangle_{\mathcal H_{\mathcal X}\otimes\mathcal H_{\mathcal X}}.
\end{aligned}
\]
Finally, the Cauchy-Schwartz inequality gives
\[
\langle(\hat m_P - m_P)\otimes(\hat m_P - m_P),\,\theta\rangle_{\mathcal H_{\mathcal X}\otimes\mathcal H_{\mathcal X}}
\le \|\hat m_P - m_P\|_{\mathcal H_{\mathcal X}}^2\,\|\theta\|_{\mathcal H_{\mathcal X}\otimes\mathcal H_{\mathcal X}}.
\]
This completes the proof.
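As a supplementary worked consequence of the bound just derived (our own addition, not part of the original proof), assume in addition that the kernel is bounded, with $k_{\mathcal X}(x,x)\le C^2$ for all $x$. Then $|k_{\mathcal X}(x,\tilde x)|\le C^2$ as well, so each term in the first sum is at most $2C^2$ and
\[
  \mathbb{E}_{X_1',\dots,X_n'}\bigl[\|\hat m_Q - m_Q\|_{\mathcal H_{\mathcal X}}^2\bigr]
  \;\le\; 2C^2 \sum_{i=1}^n w_i^2
  \;+\; \|\hat m_P - m_P\|_{\mathcal H_{\mathcal X}}^2\,
        \|\theta\|_{\mathcal H_{\mathcal X}\otimes\mathcal H_{\mathcal X}} .
\]
Writing $\mathrm{ESS} = 1/\sum_{i=1}^n w_i^2$ for the usual effective sample size of the weighted sample, the first term reads $2C^2/\mathrm{ESS}$; this is one way to see why a small effective sample size degrades the sampling step and why the resampling step, which raises the ESS, helps.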


Appendix B Proof of Theorem 2

Theorem 2 provides convergence rates for the resampling algorithm, algorithm 4. This theorem assumes that the candidate samples $Z_1,\dots,Z_N$ for resampling are i.i.d. with a density $q$. Here we prove theorem 2 by showing that the same statement holds under weaker assumptions (see theorem 3 below).

We first describe the assumptions. Let $P$ be the distribution of the kernel mean $m_P$, and let $L_2(P)$ be the Hilbert space of square-integrable functions on $\mathcal X$ with respect to $P$. For any $f\in L_2(P)$, we write its norm as $\|f\|_{L_2(P)}^2 = \int f^2(x)\,dP(x)$.

Assumption 1. The candidate samples $Z_1,\dots,Z_N$ are independent. There are probability distributions $Q_1,\dots,Q_N$ on $\mathcal X$ such that for any bounded measurable function $g:\mathcal X\to\mathbb R$, we have
\[
\mathbb E\Bigl[\frac{1}{N-1}\sum_{j\neq i} g(Z_j)\Bigr] = \mathbb E_{X\sim Q_i}[g(X)] \qquad (i=1,\dots,N). \tag{B.1}
\]

Assumption 2. The distributions $Q_1,\dots,Q_N$ have density functions $q_1,\dots,q_N$, respectively. Define $\bar Q = \frac1N\sum_{i=1}^N Q_i$ and $\bar q = \frac1N\sum_{i=1}^N q_i$. There is a constant $A>0$ that does not depend on $N$ such that
\[
\Bigl\|\frac{q_i}{\bar q} - 1\Bigr\|_{L_2(P)}^2 \le \frac{A}{\sqrt N} \qquad (i=1,\dots,N). \tag{B.2}
\]

Assumption 3. The distribution $P$ has a density function $p$ such that $\sup_{x\in\mathcal X} p(x)/\bar q(x) < \infty$. There is a constant $\sigma>0$ such that
\[
\sqrt N\Bigl(\frac1N\sum_{i=1}^N \frac{p(Z_i)}{\bar q(Z_i)} - 1\Bigr) \xrightarrow{\;D\;} \mathcal N(0,\sigma^2), \tag{B.3}
\]
where $\xrightarrow{\;D\;}$ denotes convergence in distribution and $\mathcal N(0,\sigma^2)$ the normal distribution with mean 0 and variance $\sigma^2$.

These assumptions are weaker than those in theorem 2, which require that $Z_1,\dots,Z_N$ be i.i.d. For example, assumption 1 is clearly satisfied for the i.i.d. case, since in this case we have $\bar Q = Q_1 = \cdots = Q_N$. The inequality, equation B.2, in assumption 2 requires that the distributions $Q_1,\dots,Q_N$ get similar as the sample size increases. This is also satisfied under the i.i.d. assumption. Likewise, the convergence, equation B.3, in assumption 3 is satisfied from the central limit theorem if $Z_1,\dots,Z_N$ are i.i.d.

We will need the following lemma.

Lemma 1. Let $Z_1,\dots,Z_N$ be samples satisfying assumption 1. Then the following holds for any bounded measurable function $g:\mathcal X\to\mathbb R$:
\[
\mathbb E\Bigl[\frac1N\sum_{i=1}^N g(Z_i)\Bigr] = \int g(x)\,d\bar Q(x).
\]

Proof.
\[
\mathbb E\Bigl[\frac1N\sum_{i=1}^N g(Z_i)\Bigr]
= \mathbb E\Bigl[\frac{1}{N(N-1)}\sum_{i=1}^N\sum_{j\neq i} g(Z_j)\Bigr]
= \frac1N\sum_{i=1}^N \mathbb E\Bigl[\frac{1}{N-1}\sum_{j\neq i} g(Z_j)\Bigr]
= \frac1N\sum_{i=1}^N \int g(x)\,dQ_i(x) = \int g(x)\,d\bar Q(x).
\]

The following theorem shows the convergence rates of our resampling algorithm. Note that it does not assume that the candidate samples $Z_1,\dots,Z_N$ are identical to those expressing the estimator $\hat m_P$.

Theorem 3. Let $k$ be a bounded positive-definite kernel and $\mathcal H$ be the associated RKHS. Let $Z_1,\dots,Z_N$ be candidate samples satisfying assumptions 1, 2, and 3. Let $P$ be a probability distribution satisfying assumption 3, and let $m_P = \int k(\cdot,x)\,dP(x)$ be the kernel mean. Let $\hat m_P\in\mathcal H$ be any element in $\mathcal H$. Suppose we apply algorithm 4 to $\hat m_P\in\mathcal H$ with candidate samples $Z_1,\dots,Z_N$, and let $\bar X_1,\dots,\bar X_\ell \in \{Z_1,\dots,Z_N\}$ be the resulting samples. Then the following holds:
\[
\Bigl\|\hat m_P - \frac1\ell\sum_{i=1}^\ell k(\cdot,\bar X_i)\Bigr\|_{\mathcal H}^2
= \bigl(\|\hat m_P - m_P\|_{\mathcal H} + O_p(N^{-1/2})\bigr)^2 + O\Bigl(\frac{\ln\ell}{\ell}\Bigr).
\]

Proof. Our proof is based on the fact (Bach et al., 2012) that kernel herding can be seen as the Frank-Wolfe optimization method with step size $1/(\ell+1)$ for the $\ell$th iteration. For details of the Frank-Wolfe method, we refer to Jaggi (2013) and Freund and Grigas (2014).

Fix the samples $Z_1,\dots,Z_N$. Let $\mathcal M_N$ be the convex hull of the set $\{k(\cdot,Z_1),\dots,k(\cdot,Z_N)\}\subset\mathcal H$. Define a loss function $J:\mathcal H\to\mathbb R$ by
\[
J(g) = \frac12\|g - \hat m_P\|_{\mathcal H}^2, \qquad g\in\mathcal H. \tag{B.4}
\]
Then algorithm 4 can be seen as the Frank-Wolfe method that iteratively minimizes this loss function over the convex hull $\mathcal M_N$:
\[
\inf_{g\in\mathcal M_N} J(g).
\]
More precisely, the Frank-Wolfe method solves this problem by the following iterations:
\[
s_\ell = \arg\min_{g\in\mathcal M_N}\,\langle g,\, \nabla J(g_{\ell-1})\rangle_{\mathcal H},
\qquad
g_\ell = (1-\gamma_\ell)\,g_{\ell-1} + \gamma_\ell\, s_\ell \qquad (\ell\ge1),
\]
where $\gamma_\ell$ is a step size defined as $\gamma_\ell = 1/\ell$, and $\nabla J(g_{\ell-1})$ is the gradient of $J$ at $g_{\ell-1}$: $\nabla J(g_{\ell-1}) = g_{\ell-1} - \hat m_P$. Here the initial point is defined as $g_0 = 0$. It can be easily shown that $g_\ell = \frac1\ell\sum_{i=1}^\ell k(\cdot,\bar X_i)$, where $\bar X_1,\dots,\bar X_\ell$ are the samples given by algorithm 4. (For details, see Bach et al., 2012.)

Let $L_{J,\mathcal M_N}>0$ be the Lipschitz constant of the gradient $\nabla J$ over $\mathcal M_N$, and let $\operatorname{Diam}\mathcal M_N>0$ be the diameter of $\mathcal M_N$:
\[
L_{J,\mathcal M_N}
= \sup_{g_1,g_2\in\mathcal M_N}\frac{\|\nabla J(g_1)-\nabla J(g_2)\|_{\mathcal H}}{\|g_1-g_2\|_{\mathcal H}}
= \sup_{g_1,g_2\in\mathcal M_N}\frac{\|g_1-g_2\|_{\mathcal H}}{\|g_1-g_2\|_{\mathcal H}} = 1, \tag{B.5}
\]
\[
\operatorname{Diam}\mathcal M_N
= \sup_{g_1,g_2\in\mathcal M_N}\|g_1-g_2\|_{\mathcal H}
\le \sup_{g_1,g_2\in\mathcal M_N}\bigl(\|g_1\|_{\mathcal H} + \|g_2\|_{\mathcal H}\bigr) \le 2C, \tag{B.6}
\]
where $C = \sup_{x\in\mathcal X}\|k(\cdot,x)\|_{\mathcal H} = \sup_{x\in\mathcal X}\sqrt{k(x,x)} < \infty$.

From bound 3.2 and equation 8 of Freund and Grigas (2014), we then have
\[
J(g_\ell) - \inf_{g\in\mathcal M_N} J(g)
\le \frac{L_{J,\mathcal M_N}(\operatorname{Diam}\mathcal M_N)^2(1+\ln\ell)}{2\ell} \tag{B.7}
\]
\[
\le \frac{2C^2(1+\ln\ell)}{\ell}, \tag{B.8}
\]
where the last inequality follows from equations B.5 and B.6.

Note that the upper bound of equation B.8 does not depend on the candidate samples $Z_1,\dots,Z_N$. Hence, combined with equation B.4, the following holds for any choice of $Z_1,\dots,Z_N$:
\[
\Bigl\|\hat m_P - \frac1\ell\sum_{i=1}^\ell k(\cdot,\bar X_i)\Bigr\|_{\mathcal H}^2
\le \inf_{g\in\mathcal M_N}\|\hat m_P - g\|_{\mathcal H}^2 + \frac{4C^2(1+\ln\ell)}{\ell}. \tag{B.9}
\]

Below we will focus on bounding the first term of equation B.9. Recall here that $Z_1,\dots,Z_N$ are random samples. Define a random variable $S_N = \sum_{i=1}^N p(Z_i)/\bar q(Z_i)$. Since $\mathcal M_N$ is the convex hull of $\{k(\cdot,Z_1),\dots,k(\cdot,Z_N)\}$, we have
\[
\begin{aligned}
\inf_{g\in\mathcal M_N}\|\hat m_P - g\|_{\mathcal H}
&= \inf_{\alpha\in\mathbb R^N,\ \alpha\ge0,\ \sum_i\alpha_i\le1}\Bigl\|\hat m_P - \sum_i\alpha_i k(\cdot,Z_i)\Bigr\|_{\mathcal H}
\le \Bigl\|\hat m_P - \frac{1}{S_N}\sum_i\frac{p(Z_i)}{\bar q(Z_i)}k(\cdot,Z_i)\Bigr\|_{\mathcal H} \\
&\le \|\hat m_P - m_P\|_{\mathcal H}
+ \Bigl\|m_P - \frac1N\sum_i\frac{p(Z_i)}{\bar q(Z_i)}k(\cdot,Z_i)\Bigr\|_{\mathcal H}
+ \Bigl\|\frac1N\sum_i\frac{p(Z_i)}{\bar q(Z_i)}k(\cdot,Z_i) - \frac{1}{S_N}\sum_i\frac{p(Z_i)}{\bar q(Z_i)}k(\cdot,Z_i)\Bigr\|_{\mathcal H}.
\end{aligned}
\]
Therefore we have
\[
\begin{aligned}
\Bigl\|\hat m_P - \frac1\ell\sum_{i=1}^\ell k(\cdot,\bar X_i)\Bigr\|_{\mathcal H}^2
\le{}& \biggl(\|\hat m_P - m_P\|_{\mathcal H}
+ \Bigl\|m_P - \frac1N\sum_i\frac{p(Z_i)}{\bar q(Z_i)}k(\cdot,Z_i)\Bigr\|_{\mathcal H} \\
&\;\; + \Bigl\|\frac1N\sum_i\frac{p(Z_i)}{\bar q(Z_i)}k(\cdot,Z_i) - \frac{1}{S_N}\sum_i\frac{p(Z_i)}{\bar q(Z_i)}k(\cdot,Z_i)\Bigr\|_{\mathcal H}\biggr)^2
+ O\Bigl(\frac{\ln\ell}{\ell}\Bigr).
\end{aligned} \tag{B.10}
\]
Below we derive rates of convergence for the second and third terms.

For the second term, we derive a rate of convergence in expectation, which implies a rate of convergence in probability. To this end, we use the following fact: for any $f\in\mathcal H$, the assumption $\sup_{x\in\mathcal X}p(x)/\bar q(x)<\infty$ and the boundedness of $k$ imply that the functions $x\mapsto\frac{p(x)}{\bar q(x)}f(x)$ and $x\mapsto\bigl(\frac{p(x)}{\bar q(x)}\bigr)^2 f(x)$ are bounded. Then
\[
\begin{aligned}
&\mathbb E\biggl[\Bigl\|m_P - \frac1N\sum_i\frac{p(Z_i)}{\bar q(Z_i)}k(\cdot,Z_i)\Bigr\|_{\mathcal H}^2\biggr] \\
&= \|m_P\|_{\mathcal H}^2 - 2\,\mathbb E\Bigl[\frac1N\sum_i\frac{p(Z_i)}{\bar q(Z_i)}m_P(Z_i)\Bigr]
+ \mathbb E\Bigl[\frac{1}{N^2}\sum_i\sum_j\frac{p(Z_i)}{\bar q(Z_i)}\frac{p(Z_j)}{\bar q(Z_j)}k(Z_i,Z_j)\Bigr] \\
&= \|m_P\|_{\mathcal H}^2 - 2\int\frac{p(x)}{\bar q(x)}m_P(x)\,\bar q(x)\,dx
+ \mathbb E\Bigl[\frac{1}{N^2}\sum_i\sum_{j\neq i}\frac{p(Z_i)}{\bar q(Z_i)}\frac{p(Z_j)}{\bar q(Z_j)}k(Z_i,Z_j)\Bigr]
+ \mathbb E\Bigl[\frac{1}{N^2}\sum_i\Bigl(\frac{p(Z_i)}{\bar q(Z_i)}\Bigr)^2 k(Z_i,Z_i)\Bigr] \\
&= \|m_P\|_{\mathcal H}^2 - 2\|m_P\|_{\mathcal H}^2
+ \mathbb E\Bigl[\frac{N-1}{N^2}\sum_i\frac{p(Z_i)}{\bar q(Z_i)}\int\frac{p(x)}{\bar q(x)}k(Z_i,x)\,q_i(x)\,dx\Bigr]
+ \frac1N\int\Bigl(\frac{p(x)}{\bar q(x)}\Bigr)^2 k(x,x)\,\bar q(x)\,dx \\
&= -\|m_P\|_{\mathcal H}^2
+ \mathbb E\Bigl[\frac{N-1}{N^2}\sum_i\frac{p(Z_i)}{\bar q(Z_i)}\int\frac{p(x)}{\bar q(x)}k(Z_i,x)\,q_i(x)\,dx\Bigr]
+ \frac1N\int\frac{p(x)}{\bar q(x)}k(x,x)\,dP(x).
\end{aligned}
\]
We further rewrite the second term of the last equality as follows:
\[
\begin{aligned}
&\mathbb E\Bigl[\frac{N-1}{N^2}\sum_i\frac{p(Z_i)}{\bar q(Z_i)}\int\frac{p(x)}{\bar q(x)}k(Z_i,x)\,q_i(x)\,dx\Bigr] \\
&= \mathbb E\Bigl[\frac{N-1}{N^2}\sum_i\frac{p(Z_i)}{\bar q(Z_i)}\int\frac{p(x)}{\bar q(x)}k(Z_i,x)\bigl(q_i(x)-\bar q(x)\bigr)\,dx\Bigr]
+ \mathbb E\Bigl[\frac{N-1}{N^2}\sum_i\frac{p(Z_i)}{\bar q(Z_i)}\int\frac{p(x)}{\bar q(x)}k(Z_i,x)\,\bar q(x)\,dx\Bigr] \\
&= \mathbb E\Bigl[\frac{N-1}{N^2}\sum_i\frac{p(Z_i)}{\bar q(Z_i)}\int\sqrt{p(x)}\,k(Z_i,x)\,\sqrt{p(x)}\Bigl(\frac{q_i(x)}{\bar q(x)}-1\Bigr)dx\Bigr]
+ \frac{N-1}{N}\|m_P\|_{\mathcal H}^2 \\
&\le \mathbb E\Bigl[\frac{N-1}{N^2}\sum_i\frac{p(Z_i)}{\bar q(Z_i)}\,\|k(Z_i,\cdot)\|_{L_2(P)}\Bigl\|\frac{q_i}{\bar q}-1\Bigr\|_{L_2(P)}\Bigr]
+ \frac{N-1}{N}\|m_P\|_{\mathcal H}^2 \\
&\le \mathbb E\Bigl[\frac{N-1}{N^3}\sum_i\frac{p(Z_i)}{\bar q(Z_i)}\,C^2A\Bigr]
+ \frac{N-1}{N}\|m_P\|_{\mathcal H}^2
= \frac{C^2A(N-1)}{N^2} + \frac{N-1}{N}\|m_P\|_{\mathcal H}^2,
\end{aligned}
\]
where the first inequality follows from Cauchy-Schwartz. Using this, we obtain
\[
\mathbb E\biggl[\Bigl\|m_P - \frac1N\sum_i\frac{p(Z_i)}{\bar q(Z_i)}k(\cdot,Z_i)\Bigr\|_{\mathcal H}^2\biggr]
\le \frac1N\Bigl(\int\frac{p(x)}{\bar q(x)}k(x,x)\,dP(x) - \|m_P\|_{\mathcal H}^2\Bigr) + \frac{C^2(N-1)A}{N^2}
= O(N^{-1}).
\]
Therefore we have
\[
\Bigl\|m_P - \frac1N\sum_i\frac{p(Z_i)}{\bar q(Z_i)}k(\cdot,Z_i)\Bigr\|_{\mathcal H} = O_p(N^{-1/2}) \qquad (N\to\infty). \tag{B.11}
\]

We can bound the third term as follows:
\[
\begin{aligned}
&\Bigl\|\frac1N\sum_i\frac{p(Z_i)}{\bar q(Z_i)}k(\cdot,Z_i) - \frac{1}{S_N}\sum_i\frac{p(Z_i)}{\bar q(Z_i)}k(\cdot,Z_i)\Bigr\|_{\mathcal H}
= \Bigl\|\frac1N\sum_i\frac{p(Z_i)}{\bar q(Z_i)}k(\cdot,Z_i)\Bigl(1-\frac{N}{S_N}\Bigr)\Bigr\|_{\mathcal H} \\
&= \Bigl|1-\frac{N}{S_N}\Bigr|\,\Bigl\|\frac1N\sum_i\frac{p(Z_i)}{\bar q(Z_i)}k(\cdot,Z_i)\Bigr\|_{\mathcal H}
\le \Bigl|1-\frac{N}{S_N}\Bigr|\,C\,\Bigl\|\frac{p}{\bar q}\Bigr\|_\infty
= \biggl|1-\frac{1}{\frac1N\sum_{i=1}^N p(Z_i)/\bar q(Z_i)}\biggr|\,C\,\Bigl\|\frac{p}{\bar q}\Bigr\|_\infty,
\end{aligned}
\]
where $\|p/\bar q\|_\infty = \sup_{x\in\mathcal X}p(x)/\bar q(x) < \infty$. Therefore the following holds by assumption 3 and the delta method:
\[
\Bigl\|\frac1N\sum_i\frac{p(Z_i)}{\bar q(Z_i)}k(\cdot,Z_i) - \frac{1}{S_N}\sum_i\frac{p(Z_i)}{\bar q(Z_i)}k(\cdot,Z_i)\Bigr\|_{\mathcal H} = O_p(N^{-1/2}). \tag{B.12}
\]

The assertion of the theorem follows from equations B.10 to B.12.
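To make the finite-candidate herding analyzed above concrete, here is a minimal sketch (ours; the names and interface are hypothetical, and it is not meant to reproduce algorithm 4 verbatim) of the greedy updates 3.6 and 3.7 with the argmax restricted to the candidates $Z_1,\dots,Z_N$ and with repeated selections allowed, as in resampling.

```python
import numpy as np

def herding_resample(candidates, weights, num_samples, kernel):
    # Greedy kernel herding over a finite candidate set.
    # Approximates m_hat(x) = sum_i weights[i] * kernel(candidates[i], x)
    # by num_samples equally weighted points chosen from candidates.
    n = len(candidates)
    G = np.array([[kernel(zi, zj) for zj in candidates] for zi in candidates])
    m_hat = np.asarray(weights) @ G        # m_hat evaluated at every candidate
    herd_sum = np.zeros(n)                 # running sum of kernel(chosen, candidate)
    chosen = []
    for ell in range(1, num_samples + 1):
        scores = m_hat - herd_sum / ell    # objective of update 3.7 on the candidates
        j = int(np.argmax(scores))
        chosen.append(j)
        herd_sum += G[j]
    return [candidates[j] for j in chosen]
```

Each iteration costs O(N) once the Gram matrix is available, and the equal-weight empirical kernel mean of the returned points is the quantity bounded in theorem 3.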

Appendix C Reduction of Computational Cost

We have seen in section 4.3 that the time complexity of KMCF in one time step is O(n^3), where n is the number of the state-observation examples {(X_i, Y_i)}_{i=1}^n. This can be costly if one wishes to use KMCF in real-time applications with a large number of samples. Here we show two methods for reducing the costs: one based on low-rank approximation of kernel matrices and one based on kernel herding. Note that kernel herding is also used in the resampling step. The purpose here is different, however; we make use of kernel herding for finding a reduced representation of the data {(X_i, Y_i)}_{i=1}^n.

C.1 Low-Rank Approximation of Kernel Matrices. Our goal is to reduce the costs of algorithm 1 of kernel Bayes' rule. Algorithm 1 involves two matrix inversions: $(G_X + n\varepsilon I_n)^{-1}$ in line 3 and $((\Lambda G_Y)^2 + \delta I_n)^{-1}$ in line 4. Note that $(G_X + n\varepsilon I_n)^{-1}$ does not involve the test data, so it can be computed before the test phase. On the other hand, $((\Lambda G_Y)^2 + \delta I_n)^{-1}$ depends on the matrix $\Lambda$. This matrix involves the vector $\hat m_\pi$, which essentially represents the prior of the current state (see line 13 of algorithm 3). Therefore $((\Lambda G_Y)^2 + \delta I_n)^{-1}$ needs to be computed for each iteration in the test phase. This has the complexity of O(n^3). Note that even if $(G_X + n\varepsilon I_n)^{-1}$ can be computed in the training phase, the multiplication $(G_X + n\varepsilon I_n)^{-1}\hat m_\pi$ in line 3 requires O(n^2). Thus it can also be costly. Here we consider methods to reduce both costs in lines 3 and 4.

Suppose that there exist low-rank matrices $U, V \in \mathbb R^{n\times r}$, where $r < n$, that approximate the kernel matrices: $G_X \approx UU^T$, $G_Y \approx VV^T$. Such low-rank matrices can be obtained by, for example, incomplete Cholesky decomposition with time complexity O(nr^2) (Fine & Scheinberg, 2001; Bach & Jordan, 2002). Note that the computation of these matrices is required only once before the test phase. Therefore their time complexities are not the problem here.

C.1.1 Derivation. First we approximate $(G_X + n\varepsilon I_n)^{-1}\hat m_\pi$ in line 3 using $G_X \approx UU^T$. By the Woodbury identity, we have
\[
(G_X + n\varepsilon I_n)^{-1}\hat m_\pi \approx (UU^T + n\varepsilon I_n)^{-1}\hat m_\pi
= \frac{1}{n\varepsilon}\bigl(I_n - U(n\varepsilon I_r + U^TU)^{-1}U^T\bigr)\hat m_\pi,
\]
where $I_r \in \mathbb R^{r\times r}$ denotes the identity. Note that $(n\varepsilon I_r + U^TU)^{-1}$ does not involve the test data, so it can be computed in the training phase. Thus the above approximation of $\mu$ can be computed with complexity O(nr^2).

Next we approximate $w = \Lambda G_Y((\Lambda G_Y)^2 + \delta I_n)^{-1}\Lambda k_Y$ in line 4 using $G_Y \approx VV^T$. Define $B = \Lambda V \in \mathbb R^{n\times r}$, $C = V^T\Lambda V \in \mathbb R^{r\times r}$, and $D = V^T \in \mathbb R^{r\times n}$. Then $(\Lambda G_Y)^2 \approx (\Lambda VV^T)^2 = BCD$. By the Woodbury identity, we obtain
\[
\bigl(\delta I_n + (\Lambda G_Y)^2\bigr)^{-1} \approx (\delta I_n + BCD)^{-1}
= \frac{1}{\delta}\bigl(I_n - B(\delta C^{-1} + DB)^{-1}D\bigr).
\]
Thus $w$ can be approximated as
\[
w = \Lambda G_Y\bigl((\Lambda G_Y)^2 + \delta I_n\bigr)^{-1}\Lambda k_Y
\approx \frac{1}{\delta}\,\Lambda VV^T\bigl(I_n - B(\delta C^{-1} + DB)^{-1}D\bigr)\Lambda k_Y.
\]
The computation of this approximation requires O(nr^2 + r^3) = O(nr^2). Thus, in total, the complexity of algorithm 1 can be reduced to O(nr^2). We summarize the above approximations in algorithm 5.
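The following sketch (ours, in Python/NumPy; not the letter's implementation) combines the two Woodbury approximations above to compute the weight vector w in O(nr^2), under the reconstruction above in which line 3 yields mu and line 4 uses Lambda = diag(mu); the function name and interface are hypothetical.

```python
import numpy as np

def kbr_weights_lowrank(U, V, m_pi, k_y, eps, delta):
    # Approximate kernel Bayes' rule weights with G_X ~= U @ U.T, G_Y ~= V @ V.T.
    # U, V : (n, r) low-rank factors; m_pi : (m_pi(X_1), ..., m_pi(X_n));
    # k_y  : (k_Y(y, Y_1), ..., k_Y(y, Y_n)) for the test observation y.
    n, r = U.shape
    # Line 3 via Woodbury: mu ~= (G_X + n*eps*I_n)^{-1} m_pi
    mu = (m_pi - U @ np.linalg.solve(n * eps * np.eye(r) + U.T @ U, U.T @ m_pi)) / (n * eps)
    # Line 4: w = Lambda G_Y ((Lambda G_Y)^2 + delta I_n)^{-1} Lambda k_y,
    # with Lambda = diag(mu) kept as a vector for elementwise products.
    B = mu[:, None] * V                    # Lambda V          (n, r)
    C = V.T @ B                            # V^T Lambda V      (r, r), assumed invertible
    D = V.T                                # V^T               (r, n)
    lam_ky = mu * k_y                      # Lambda k_y
    # Woodbury: (delta I + B C D)^{-1} x = (x - B (delta C^{-1} + D B)^{-1} D x) / delta
    core = np.linalg.solve(delta * np.linalg.inv(C) + D @ B, D @ lam_ky)
    inv_lam_ky = (lam_ky - B @ core) / delta
    return mu * (V @ (V.T @ inv_lam_ky))   # apply Lambda G_Y ~= Lambda V V^T
```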


C.1.2 How to Use. Algorithm 5 can be used with algorithm 3 of KMCF by modifying algorithm 3 in the following manner. Compute the low-rank matrices U, V right after lines 4 and 5. This can be done by using, for example, incomplete Cholesky decomposition (Fine & Scheinberg, 2001; Bach & Jordan, 2002). Then replace algorithm 1 in line 15 by algorithm 5.

C.1.3 How to Select the Rank. As discussed in section 4.3, one way of selecting the rank r is to use cross-validation, by regarding r as a hyperparameter of KMCF. Another way is to measure the approximation errors of G_X - UU^T and G_Y - VV^T with some matrix norm, such as the Frobenius norm. Indeed, we can compute the smallest rank r such that these errors are below a prespecified threshold, and this can be done efficiently with time complexity O(nr^2) (Bach & Jordan, 2002).
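For instance (our sketch, not the letter's code), a pivoted incomplete Cholesky factorization can be stopped as soon as the trace of the residual falls below a prescribed threshold, which yields both the factor U and the rank r at O(nr^2) cost:

```python
import numpy as np

def incomplete_cholesky(G, tol=1e-3, max_rank=None):
    # Pivoted (incomplete) Cholesky of a PSD Gram matrix G, returning U with
    # G ~= U @ U.T; stops when trace(G - U U^T) < tol (cf. Fine & Scheinberg, 2001).
    n = G.shape[0]
    max_rank = n if max_rank is None else max_rank
    d = np.diag(G).astype(float).copy()    # diagonal of the current residual
    U = np.zeros((n, max_rank))
    for r in range(max_rank):
        if d.sum() < tol:
            return U[:, :r]
        j = int(np.argmax(d))
        if d[j] <= 0:
            return U[:, :r]
        U[:, r] = (G[:, j] - U[:, :r] @ U[j, :r]) / np.sqrt(d[j])
        d -= U[:, r] ** 2
        np.clip(d, 0.0, None, out=d)
    return U
```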

C.2 Data Reduction with Kernel Herding. Here we describe an approach to reduce the size of the representation of the state-observation examples {(X_i, Y_i)}_{i=1}^n in an efficient way. By "efficient" we mean that the information contained in {(X_i, Y_i)}_{i=1}^n will be preserved even after the reduction. Recall that {(X_i, Y_i)}_{i=1}^n contains the information of the observation model p(y_t | x_t) (recall also that p(y_t | x_t) is assumed time-homogeneous; see section 4.1). This information is used only in algorithm 1 of kernel Bayes' rule (line 15, algorithm 3). Therefore it suffices to consider how kernel Bayes' rule accesses the information contained in the joint sample {(X_i, Y_i)}_{i=1}^n.

C.2.1 Representation of the Joint Sample. To this end, we need to show how the joint sample {(X_i, Y_i)}_{i=1}^n can be represented with a kernel mean embedding. Recall that $(k_{\mathcal X}, \mathcal H_{\mathcal X})$ and $(k_{\mathcal Y}, \mathcal H_{\mathcal Y})$ are the kernels and the associated RKHSs on the state space $\mathcal X$ and the observation space $\mathcal Y$, respectively. Let $\mathcal X\times\mathcal Y$ be the product space of $\mathcal X$ and $\mathcal Y$. Then we can define a kernel $k_{\mathcal X\times\mathcal Y}$ on $\mathcal X\times\mathcal Y$ as the product of $k_{\mathcal X}$ and $k_{\mathcal Y}$: $k_{\mathcal X\times\mathcal Y}((x,y),(x',y')) = k_{\mathcal X}(x,x')\,k_{\mathcal Y}(y,y')$ for all $(x,y),(x',y')\in\mathcal X\times\mathcal Y$. This product kernel $k_{\mathcal X\times\mathcal Y}$ defines an RKHS over $\mathcal X\times\mathcal Y$; let $\mathcal H_{\mathcal X\times\mathcal Y}$ denote this RKHS. As in section 3, we can use $k_{\mathcal X\times\mathcal Y}$ and $\mathcal H_{\mathcal X\times\mathcal Y}$ for a kernel mean embedding. In particular, the empirical distribution $\frac1n\sum_{i=1}^n\delta_{(X_i,Y_i)}$ of the joint sample $\{(X_i,Y_i)\}_{i=1}^n \subset \mathcal X\times\mathcal Y$ can be represented as an empirical kernel mean in $\mathcal H_{\mathcal X\times\mathcal Y}$:
\[
\hat m_{XY} = \frac{1}{n}\sum_{i=1}^n k_{\mathcal X\times\mathcal Y}\bigl((\cdot,\cdot),(X_i,Y_i)\bigr) \in \mathcal H_{\mathcal X\times\mathcal Y}. \tag{C.1}
\]
This is the representation of the joint sample $\{(X_i,Y_i)\}_{i=1}^n$.

The information of $\{(X_i,Y_i)\}_{i=1}^n$ is provided to kernel Bayes' rule essentially through the form of equation C.1 (Fukumizu et al., 2011, 2013). Recall that equation C.1 is a point in the RKHS $\mathcal H_{\mathcal X\times\mathcal Y}$. Any point close to equation C.1 in $\mathcal H_{\mathcal X\times\mathcal Y}$ would also contain information close to that contained in equation C.1. Therefore we propose to find a subset $\{(\bar X_1,\bar Y_1),\dots,(\bar X_r,\bar Y_r)\} \subset \{(X_i,Y_i)\}_{i=1}^n$, where $r < n$, such that its representation in $\mathcal H_{\mathcal X\times\mathcal Y}$,
\[
\bar m_{XY} = \frac{1}{r}\sum_{i=1}^r k_{\mathcal X\times\mathcal Y}\bigl((\cdot,\cdot),(\bar X_i,\bar Y_i)\bigr) \in \mathcal H_{\mathcal X\times\mathcal Y}, \tag{C.2}
\]
is close to equation C.1. Namely, we wish to find subsamples such that $\|\hat m_{XY} - \bar m_{XY}\|_{\mathcal H_{\mathcal X\times\mathcal Y}}$ is small. If the error $\|\hat m_{XY} - \bar m_{XY}\|_{\mathcal H_{\mathcal X\times\mathcal Y}}$ is small enough, equation C.2 would provide information close to that given by equation C.1 for kernel Bayes' rule. Thus kernel Bayes' rule based on such subsamples $\{(\bar X_i,\bar Y_i)\}_{i=1}^r$ would not perform much worse than the one based on the entire set of samples $\{(X_i,Y_i)\}_{i=1}^n$.

C.2.2 Subsampling Method. To find such subsamples, we make use of kernel herding in section 3.5. Namely, we apply the update equations 3.6 and 3.7 to approximate equation C.1, with kernel $k_{\mathcal X\times\mathcal Y}$ and RKHS $\mathcal H_{\mathcal X\times\mathcal Y}$. We greedily find subsamples $\mathcal D_r = \{(\bar X_1,\bar Y_1),\dots,(\bar X_r,\bar Y_r)\}$ as
\[
\begin{aligned}
(\bar X_r, \bar Y_r)
&= \arg\max_{(x,y)\in\mathcal D\setminus\mathcal D_{r-1}}\ \frac{1}{n}\sum_{i=1}^n k_{\mathcal X\times\mathcal Y}\bigl((x,y),(X_i,Y_i)\bigr) - \frac{1}{r}\sum_{j=1}^{r-1} k_{\mathcal X\times\mathcal Y}\bigl((x,y),(\bar X_j,\bar Y_j)\bigr) \\
&= \arg\max_{(x,y)\in\mathcal D\setminus\mathcal D_{r-1}}\ \frac{1}{n}\sum_{i=1}^n k_{\mathcal X}(x,X_i)\,k_{\mathcal Y}(y,Y_i) - \frac{1}{r}\sum_{j=1}^{r-1} k_{\mathcal X}(x,\bar X_j)\,k_{\mathcal Y}(y,\bar Y_j),
\end{aligned}
\]
where $\mathcal D$ denotes the set of training pairs $\{(X_i,Y_i)\}_{i=1}^n$. The resulting algorithm is shown in algorithm 6. The time complexity is O(n^2 r) for selecting r subsamples.
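A compact sketch of this greedy selection over the finite set D (ours; the Gram matrices are assumed precomputed and the function name is hypothetical) uses the fact that the product-kernel Gram matrix is the elementwise product of the two Gram matrices:

```python
import numpy as np

def herd_subsample_pairs(K_X, K_Y, r):
    # Greedily select r of the n state-observation pairs so that their
    # equal-weight embedding approximates equation C.1.
    K = K_X * K_Y                          # product kernel between all training pairs
    n = K.shape[0]
    target = K.mean(axis=0)                # (1/n) sum_i k((X_j, Y_j), (X_i, Y_i))
    herd_sum = np.zeros(n)
    selected = []
    for step in range(1, r + 1):
        scores = target - herd_sum / step
        scores[selected] = -np.inf         # restrict the argmax to D \ D_{r-1}
        j = int(np.argmax(scores))
        selected.append(j)
        herd_sum += K[j]
    return selected                        # indices of the chosen pairs
```

Excluding already chosen indices mirrors the constraint (x, y) in D \ D_{r-1} in the update above.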

C.2.3 How to Use. By using algorithm 6, we can reduce the time complexity of KMCF (see algorithm 3) in each iteration from O(n^3) to O(r^3). This can be done by obtaining subsamples $\{(\bar X_i,\bar Y_i)\}_{i=1}^r$ by applying algorithm 6 to $\{(X_i,Y_i)\}_{i=1}^n$, and then replacing $\{(X_i,Y_i)\}_{i=1}^n$ in the requirement of algorithm 3 by $\{(\bar X_i,\bar Y_i)\}_{i=1}^r$ and using the number r instead of n.

C.2.4 How to Select the Number of Subsamples. The number r of subsamples determines the trade-off between the accuracy and the computational time of KMCF. It may be selected by cross-validation or by measuring the approximation error $\|\hat m_{XY} - \bar m_{XY}\|_{\mathcal H_{\mathcal X\times\mathcal Y}}$, as for the case of selecting the rank of the low-rank approximation in section C.1.

C.2.5 Discussion. Recall that kernel herding generates samples such that they approximate a given kernel mean (see section 3.5). Under certain assumptions, the error of this approximation is of O(r^{-1}) with r samples, which is faster than that of i.i.d. samples, O(r^{-1/2}). This indicates that subsamples $\{(\bar X_i,\bar Y_i)\}_{i=1}^r$ selected with kernel herding may approximate equation C.1 well. Here, however, we find the solutions of the optimization problems 3.6 and 3.7 from the finite set $\{(X_i,Y_i)\}_{i=1}^n$, rather than from the entire joint space $\mathcal X\times\mathcal Y$. The convergence guarantee is provided only for the case of the entire joint space $\mathcal X\times\mathcal Y$; thus, for our case, the convergence guarantee is no longer provided. Moreover, the fast rate O(r^{-1}) is guaranteed only for finite-dimensional RKHSs. Gaussian kernels, which we often use in practice, define infinite-dimensional RKHSs. Therefore the fast rate is not guaranteed if we use gaussian kernels. Nevertheless, we can use algorithm 6 as a heuristic for data reduction.

Acknowledgments

We express our gratitude to the associate editor and the anonymous reviewer for their time and helpful suggestions. We also thank Masashi Shimbo, Momoko Hayamizu, Yoshimasa Uematsu, and Katsuhiro Omae for their helpful comments. This work has been supported in part by MEXT Grant-in-Aid for Scientific Research on Innovative Areas 25120012. M.K. has been supported by JSPS Grant-in-Aid for JSPS Fellows 15J04406.

References

Anderson, B., & Moore, J. (1979). Optimal filtering. Englewood Cliffs, NJ: Prentice Hall.
Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3), 337–404.
Bach, F., & Jordan, M. I. (2002). Kernel independent component analysis. Journal of Machine Learning Research, 3, 1–48.
Bach, F., Lacoste-Julien, S., & Obozinski, G. (2012). On the equivalence between herding and conditional gradient algorithms. In Proceedings of the 29th International Conference on Machine Learning (ICML 2012) (pp. 1359–1366). Madison, WI: Omnipress.
Berlinet, A., & Thomas-Agnan, C. (2004). Reproducing kernel Hilbert spaces in probability and statistics. Dordrecht: Kluwer Academic.
Calvet, L. E., & Czellar, V. (2015). Accurate methods for approximate Bayesian computation filtering. Journal of Financial Econometrics, 13, 798–838. doi:10.1093/jjfinec/nbu019
Cappe, O., Godsill, S. J., & Moulines, E. (2007). An overview of existing methods and recent advances in sequential Monte Carlo. IEEE Proceedings, 95(5), 899–924.
Chen, Y., Welling, M., & Smola, A. (2010). Supersamples from kernel-herding. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence (pp. 109–116). Cambridge, MA: MIT Press.
Deisenroth, M., Huber, M., & Hanebeck, U. (2009). Analytic moment-based gaussian process filtering. In Proceedings of the 26th International Conference on Machine Learning (pp. 225–232). Madison, WI: Omnipress.
Doucet, A., Freitas, N. D., & Gordon, N. J. (Eds.). (2001). Sequential Monte Carlo methods in practice. New York: Springer.
Doucet, A., & Johansen, A. M. (2011). A tutorial on particle filtering and smoothing: Fifteen years later. In D. Crisan & B. Rozovskii (Eds.), The Oxford handbook of nonlinear filtering (pp. 656–704). New York: Oxford University Press.
Durbin, J., & Koopman, S. J. (2012). Time series analysis by state space methods (2nd ed.). New York: Oxford University Press.
Eberts, M., & Steinwart, I. (2013). Optimal regression rates for SVMs using gaussian kernels. Electronic Journal of Statistics, 7, 1–42.
Ferris, B., Hahnel, D., & Fox, D. (2006). Gaussian processes for signal strength-based location estimation. In Proceedings of Robotics: Science and Systems. Cambridge, MA: MIT Press.
Fine, S., & Scheinberg, K. (2001). Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2, 243–264.
Freund, R. M., & Grigas, P. (2014). New analysis and results for the Frank–Wolfe method. Mathematical Programming. doi:10.1007/s10107-014-0841-6
Fukumizu, K., Bach, F., & Jordan, M. I. (2004). Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5, 73–99.
Fukumizu, K., Gretton, A., Sun, X., & Scholkopf, B. (2008). Kernel measures of conditional dependence. In J. C. Platt, D. Koller, Y. Singer, & S. Roweis (Eds.), Advances in neural information processing systems, 20 (pp. 489–496). Cambridge, MA: MIT Press.
Fukumizu, K., Song, L., & Gretton, A. (2011). Kernel Bayes' rule. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. C. N. Pereira, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 24 (pp. 1737–1745). Red Hook, NY: Curran.
Fukumizu, K., Song, L., & Gretton, A. (2013). Kernel Bayes' rule: Bayesian inference with positive definite kernels. Journal of Machine Learning Research, 14, 3753–3783.
Fukumizu, K., Sriperumbudur, B., Gretton, A., & Scholkopf, B. (2009). Characteristic kernels on groups and semigroups. In D. Koller, D. Schuurmans, Y. Bengio, & L. Bottou (Eds.), Advances in neural information processing systems, 21 (pp. 473–480). Cambridge, MA: MIT Press.
Gordon, N. J., Salmond, D. J., & Smith, A. F. M. (1993). Novel approach to nonlinear/non-gaussian Bayesian state estimation. IEE Proceedings F, 140, 107–113.
Hofmann, T., Scholkopf, B., & Smola, A. J. (2008). Kernel methods in machine learning. Annals of Statistics, 36(3), 1171–1220.
Jaggi, M. (2013). Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning (pp. 427–435).
Jasra, A., Singh, S. S., Martin, J. S., & McCoy, E. (2012). Filtering via approximate Bayesian computation. Statistics and Computing, 22, 1223–1237.
Julier, S. J., & Uhlmann, J. K. (1997). A new extension of the Kalman filter to nonlinear systems. In Proceedings of AeroSense: The 11th International Symposium Aerospace/Defence Sensing, Simulation and Controls. Bellingham, WA: SPIE.
Julier, S. J., & Uhlmann, J. K. (2004). Unscented filtering and nonlinear estimation. IEEE Review, 92, 401–422.
Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Transactions of the ASME, Journal of Basic Engineering, 82, 35–45.
Kanagawa, M., & Fukumizu, K. (2014). Recovering distributions from gaussian RKHS embeddings. In Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (pp. 457–465). JMLR.
Kanagawa, M., Nishiyama, Y., Gretton, A., & Fukumizu, K. (2014). Monte Carlo filtering using kernel embedding of distributions. In Proceedings of the 28th AAAI Conference on Artificial Intelligence (pp. 1897–1903). Cambridge, MA: MIT Press.
Ko, J., & Fox, D. (2009). GP-BayesFilters: Bayesian filtering using gaussian process prediction and observation models. Autonomous Robots, 72(1), 75–90.
Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proceedings of 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (vol. 2, pp. 2169–2178). Washington, DC: IEEE Computer Society.
Liu, J. S. (2001). Monte Carlo strategies in scientific computing. New York: Springer-Verlag.
McCalman, L., O'Callaghan, S., & Ramos, F. (2013). Multi-modal estimation with kernel embeddings for learning motion models. In Proceedings of 2013 IEEE International Conference on Robotics and Automation (pp. 2845–2852). Piscataway, NJ: IEEE.
Pan, S. J., & Yang, Q. (2010). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10), 1345–1359.
Pistohl, T., Ball, T., Schulze-Bonhage, A., Aertsen, A., & Mehring, C. (2008). Prediction of arm movement trajectories from ECoG-recordings in humans. Journal of Neuroscience Methods, 167(1), 105–114.
Pronobis, A., & Caputo, B. (2009). COLD: COsy localization database. International Journal of Robotics Research, 28(5), 588–594.
Quigley, M., Stavens, D., Coates, A., & Thrun, S. (2010). Sub-meter indoor localization in unmodified environments with inexpensive sensors. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 2010 (vol. 1, pp. 2039–2046). Piscataway, NJ: IEEE.
Ristic, B., Arulampalam, S., & Gordon, N. (2004). Beyond the Kalman filter: Particle filters for tracking applications. Norwood, MA: Artech House.
Schaid, D. J. (2010a). Genomic similarity and kernel methods I: Advancements by building on mathematical and statistical foundations. Human Heredity, 70(2), 109–131.
Schaid, D. J. (2010b). Genomic similarity and kernel methods II: Methods for genomic information. Human Heredity, 70(2), 132–140.
Schalk, G., Kubanek, J., Miller, K. J., Anderson, N. R., Leuthardt, E. C., Ojemann, J. G., ... Wolpaw, J. R. (2007). Decoding two-dimensional movement trajectories using electrocorticographic signals in humans. Journal of Neural Engineering, 4(264), 264–275.
Scholkopf, B., & Smola, A. J. (2002). Learning with kernels. Cambridge, MA: MIT Press.
Scholkopf, B., Tsuda, K., & Vert, J. P. (2004). Kernel methods in computational biology. Cambridge, MA: MIT Press.
Silverman, B. W. (1986). Density estimation for statistics and data analysis. London: Chapman and Hall.
Smola, A., Gretton, A., Song, L., & Scholkopf, B. (2007). A Hilbert space embedding for distributions. In Proceedings of the International Conference on Algorithmic Learning Theory (pp. 13–31). New York: Springer.
Song, L., Fukumizu, K., & Gretton, A. (2013). Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine, 30(4), 98–111.
Song, L., Huang, J., Smola, A., & Fukumizu, K. (2009). Hilbert space embeddings of conditional distributions with applications to dynamical systems. In Proceedings of the 26th International Conference on Machine Learning (pp. 961–968). Madison, WI: Omnipress.
Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Scholkopf, B., & Lanckriet, G. R. (2010). Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11, 1517–1561.
Steinwart, I., & Christmann, A. (2008). Support vector machines. New York: Springer.
Stone, C. J. (1977). Consistent nonparametric regression. Annals of Statistics, 5(4), 595–620.
Thrun, S., Burgard, W., & Fox, D. (2005). Probabilistic robotics. Cambridge, MA: MIT Press.
Vlassis, N., Terwijn, B., & Krose, B. (2002). Auxiliary particle filter robot localization from high-dimensional sensor observations. In Proceedings of the International Conference on Robotics and Automation (pp. 7–12). Piscataway, NJ: IEEE.
Wang, Z., Ji, Q., Miller, K. J., & Schalk, G. (2011). Prior knowledge improves decoding of finger flexion from electrocorticographic signals. Frontiers in Neuroscience, 5, 127.
Widom, H. (1963). Asymptotic behavior of the eigenvalues of certain integral equations. Transactions of the American Mathematical Society, 109, 278–295.
Widom, H. (1964). Asymptotic behavior of the eigenvalues of certain integral equations II. Archive for Rational Mechanics and Analysis, 17, 215–229.
Wolf, J., Burgard, W., & Burkhardt, H. (2005). Robust vision-based localization by combining an image retrieval system with Monte Carlo localization. IEEE Transactions on Robotics, 21(2), 208–216.

Received May 18, 2015; accepted October 14, 2015.

Page 4: Filtering with State-Observation Examples via Kernel Monte ...

Filtering with State-Observation Examples 385

observation model p(yt |xt ) is given only through the state-observation ex-amples (XiYi) The proposed method which we call the kernel MonteCarlo filter (KMCF) is applicable when the following are satisfied

1 Positive-definite kernels (reproducing kernels) are defined on thestates and observations Roughly a positive-definite kernel is a sim-ilarity function that takes two data points as input and outputs theirsimilarity value

2 Sampling with the transition model p(xt |xtminus1) is possible This is thesame assumption as for standard particle filters the probabilisticmodel can be arbitrarily nonlinear and nongaussian

The past decades of research on kernel methods have yielded numerouskernels for real vectors and for structured data of various types (Scholkopfamp Smola 2002 Hofmann Scholkopf amp Smola 2008) Examples includekernels for images in computer vision (Lazebnik Schmid amp Ponce 2006)graph-structured data in bioinformatics (Scholkopf et al 2004) and ge-nomic sequences (Schaid 2010a 2010b) Therefore we can apply KMCFto such structured data by making use of the kernels developed in thesefields On the other hand this letter assumes that the transition modelis given explicitly we do not discuss parameter learning (for the caseof a parametric transition model) and we assume that parameters arefixed

KMCF is based on probability representations provided by the frame-work of kernel mean embeddings a recent development in the field ofkernel methods (Smola Gretton Song amp Scholkopf 2007 SriperumbudurGretton Fukumizu Scholkopf amp Lanckriet 2010 Song Fukumizu amp Gret-ton 2013) In this framework any probability distribution is represented as auniquely associated function in a reproducing kernel Hilbert space (RKHS)which is known as a kernel mean This representation enables us to esti-mate a distribution of interest by alternatively estimating the correspondingkernel mean One significant feature of kernel mean embeddings is kernelBayesrsquo rule (Fukumizu Song amp Gretton 2011 2013) by which KMCF es-timates posteriors based on the state-observation examples Kernel Bayesrsquorule has the following properties First it is theoretically grounded andis proven to get more accurate as the number of the examples increasesSecond it requires neither parametric assumptions nor heuristic approxi-mations for the observation model Third similar to other kernel methodsin machine learning kernel Bayesrsquo rule is empirically known to performwell for high-dimensional data when compared to classical nonparametricmethods KMCF inherits these favorable properties

KMCF sequentially estimates the RKHS representation of the posterior(see equation 11) in the form of weighted samples This estimation consistsof three steps prediction correction and resampling Suppose that wealready obtained an estimate for the posterior of the previous time In theprediction step this previous estimate is propagated forward by sampling

386 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

with the transition model in the same manner as the sampling procedure of aparticle filter The propagated estimate is then used as a prior for the currentstate In the correction step kernel Bayesrsquo rule is applied to obtain a posteriorestimate using the prior and the state-observation examples (XiYi)n

i=1Finally in the resampling step an approximate version of kernel herding(Chen Welling amp Smola 2010) is applied to obtain pseudosamples fromthe posterior estimate Kernel herding is a greedy optimization method togenerate pseudosamples from a given kernel mean and searches for thosesamples from the entire space X Our resampling algorithm modifies thisand searches for pseudosamples from a finite candidate set of the statesamples X1 Xn sub X The obtained pseudosamples are then used inthe prediction step of the next iteration

While the KMCF algorithm is inspired by particle filters there are severalimportant differences First a weighted sample expression in KMCF is anestimator of the RKHS representation of a probability distribution whilethat of a particle filter represents an empirical distribution This differencecan be seen in the fact that weights of KMCF can take negative valueswhile weights of a particle filter are always positive Second to estimate aposterior KMCF uses the state-observation examples (XiYi)n

i=1 and doesnot require the observation model itself while a particle filter makes use ofthe observation model to update weights In other words KMCF involvesnonparametric estimation of the observation model while a particle filterdoes not Third KMCF achieves resampling based on kernel herding whilea particle filter uses a standard resampling procedure with an empiricaldistribution We use kernel herding because the resampling procedure ofparticle methods is not appropriate for KMCF as the weights in KMCF maytake negative values

Since the theory of particle methods cannot be used to justify our ap-proach we conduct the following theoretical analysis

bull We derive error bounds for the sampling procedure in the predictionstep in section 51 This justifies the use of the sampling procedurewith weighted sample expressions of kernel mean embeddings Thebounds are not trivial since the weights of kernel mean embeddingscan take negative values

bull We discuss how resampling works with kernel mean embeddings (seesection 52) It improves the estimation accuracy of the subsequentsampling procedure by increasing the effective sample size of anempirical kernel mean This mechanism is essentially the same asthat of a particle filter

bull We provide novel convergence rates of kernel herding when pseu-dosamples are searched from a finite candidate set (see section 53)This justifies our resampling algorithm This result may be of inde-pendent interest to the kernel community as it describes how kernelherding is often used in practice

Filtering with State-Observation Examples 387

bull We show the consistency of the overall filtering procedure of KMCFunder certain smoothness assumptions (see section 54) KMCFprovides consistent posterior estimates as the number of state-observation examples (XiYi)n

i=1 increases

The rest of the letter is organized as follows In section 2 we reviewrelated work Section 3 is devoted to preliminaries to make the letter self-contained we review the theory of kernel mean embeddings Section 4presents the kernel Monte Carlo filter and section 5 shows theoretical re-sults In section 6 we demonstrate the effectiveness of KMCF by artificialand real-data experiments The real experiment is on vision-based mobilerobot localization an example of the location estimation problems men-tioned above The appendixes present two methods for reducing KMCFcomputational costs

This letter expands on a conference paper by Kanagawa NishiyamaGretton and Fukumizu (2014) It differs from that earlier work in that itintroduces and justifies the use of kernel herding for resampling The re-sampling step allows us to control the effective sample size of an empiricalkernel mean an important factor that determines the accuracy of the sam-pling procedure as in particle methods

2 Related Work

We consider the following setting First the observation model p(yt |xt ) isnot known explicitly or even parametrically Instead state-observation ex-amples (XiYi) are available before the test phase Second sampling fromthe transition model p(xt |xtminus1) is possible Note that standard particle filterscannot be applied to this setting directly since they require the observationmodel to be given as a parametric model

As far as we know a few methods can be applied to this setting directly(Vlassis et al 2002 Ferris et al 2006) These methods learn the observationmodel from state-observation examples nonparametrically and then use itto run a particle filter with a transition model Vlassis et al (2002) proposedto apply conditional density estimation based on the k-nearest neighborsapproach (Stone 1977) for learning the observation model A problem hereis that conditional density estimation suffers from the curse of dimension-ality if observations are high-dimensional (Silverman 1986) Vlassis et al(2002) avoided this problem by estimating the conditional density functionof a state given an observation and used it as an alternative for the obser-vation model This heuristic may introduce bias in estimation howeverFerris et al (2006) proposed using gaussian process regression for learningthe observation model This method will perform well if the gaussian noiseassumption is satisfied but cannot be applied to structured observations

There exist related but different problem settings from ours One situa-tion is that examples for state transitions are also given and the transition

388 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

model is to be learned nonparametrically from these examples For this set-ting there are methods based on kernel mean embeddings (Song HuangSmola amp Fukumizu 2009 Fukumizu et al 2011 2013) and gaussian pro-cesses (Ko amp Fox 2009 Deisenroth Huber amp Hanebeck 2009) The filteringmethod by Fukumizu et al (2011 2013) is in particular closely related toKMCF as it also uses kernel Bayesrsquo rule A main difference from KMCF isthat it computes forward probabilities by kernel sum rule (Song et al 20092013) which nonparametrically learns the transition model from the statetransition examples While the setting is different from ours we compareKMCF with this method in our experiments as a baseline

Another related setting is that the observation model itself is given andsampling is possible but computation of its values is expensive or evenimpossible Therefore ordinary Bayesrsquo rule cannot be used for filtering Toovercome this limitation Jasra Singh Martin and McCoy (2012) and Calvetand Czellar (2015) proposed applying approximate Bayesian computation(ABC) methods For each iteration of filtering these methods generate state-observation pairs from the observation model Then they pick some pairsthat have close observations to the test observation and regard the statesin these pairs as samples from a posterior Note that these methods arenot applicable to our setting since we do not assume that the observationmodel is provided That said our method may be applied to their setting bygenerating state-observation examples from the observation model Whilesuch a comparison would be interesting this letter focuses on comparisonamong the methods applicable to our setting

3 Kernel Mean Embeddings of Distributions

Here we briefly review the framework of kernel mean embeddings Fordetails we refer to the tutorial papers (Smola et al 2007 Song et al 2013)

31 Positive-Definite Kernel and RKHS We begin by introducingpositive-definite kernels and reproducing kernel Hilbert spaces details ofwhich can be found in Scholkopf and Smola (2002) Berlinet and Thomas-Agnan (2004) and Steinwart and Christmann (2008)

Let X be a set and k X times X rarr R be a positive-definite (pd) kernel1

Any positive-definite kernel is uniquely associated with a reproducing ker-nel Hilbert space (RKHS) (Aronszajn 1950) Let H be the RKHS associatedwith k The RKHS H is a Hilbert space of functions on X that satisfies thefollowing important properties

1A symmetric kernel k X times X rarr R is called positive definite (pd) if for all n isin Nc1 cn isin R and X1 Xn isin X we have

nsumi=1

nsumj=1

cic jk(Xi Xj ) ge 0

Filtering with State-Observation Examples 389

bull Feature vector k(middot x) isin H for all x isin X bull Reproducing property f (x) = 〈 f k(middot x)〉H for all f isin H and x isin X

where 〈middot middot〉H denotes the inner product equipped with H and k(middot x) is afunction with x fixed By the reproducing property we have

k(x xprime) = 〈k(middot x) k(middot xprime)〉H forallx xprime isin X

Namely k(x xprime) implicitly computes the inner product between the func-tions k(middot x) and k(middot xprime) From this property k(middot x) can be seen as an implicitrepresentation of x in H Therefore k(middot x) is called the feature vector of xand H the feature space It is also known that the subspace spanned by thefeature vectors k(middot x)|x isin X is dense in H This means that any function fin H can be written as the limit of functions of the form fn = sumn

i=1 cik(middot Xi)where c1 cn isin R and X1 Xn isin X

For example positive-definite kernels on the Euclidean space X = Rd

include gaussian kernel k(x xprime) = exp(minusx minus xprime222σ 2) and Laplace kernel

k(x xprime) = exp(minusx minus x1σ ) where σ gt 0 and middot 1 denotes the 1 normNotably kernel methods allow X to be a set of structured data such asimages texts or graphs In fact there exist various positive-definite kernelsdeveloped for such structured data (Hofmann et al 2008) Note that thenotion of positive-definite kernels is different from smoothing kernels inkernel density estimation (Silverman 1986) a smoothing kernel does notnecessarily define an RKHS

32 Kernel Means We use the kernel k and the RKHS H to representprobability distributions onX This is the framework of kernel mean embed-dings (Smola et al 2007) Let X be a measurable space and k be measurableand bounded on X 2 Let P be an arbitrary probability distribution on X Then the representation of P in H is defined as the mean of the featurevector

mP =int

k(middot x)dP(x) isin H (31)

which is called the kernel mean of PIf k is characteristic the kernel mean equation 31 preserves all the in-

formation about P a positive-definite kernel k is defined to be characteristicif the mapping P rarr mP isin H is one-to-one (Fukumizu Bach amp Jordan 2004Fukumizu Gretton Sun amp Scholkopf 2008 Sriperumbudur et al 2010)This means that the RKHS is rich enough to distinguish among all distribu-tions For example the gaussian and Laplace kernels are characteristic (Forconditions for kernels to be characteristic see Fukumizu Sriperumbudur

2k is bounded on X if supxisinX k(x x) lt infin

390 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Gretton amp Scholkopf 2009 and Sriperumbudur et al 2010) We assumehenceforth that kernels are characteristic

An important property of the kernel mean equation 31 is the followingby the reproducing property we have

〈mP f 〉H =int

f (x)dP(x) = EXsimP[ f (X)] forall f isin H (32)

that is the expectation of any function in the RKHS can be given by theinner product between the kernel mean and that function

33 Estimation of Kernel Means Suppose that distribution P is un-known and that we wish to estimate P from available samples This can beequivalently done by estimating its kernel mean mP since mP preserves allthe information about P

For example let X1 Xn be an independent and identically distributed(iid) sample from P Define an estimator of mP by the empirical mean

mP = 1n

nsumi=1

k(middot Xi)

Then this converges to mP at a rate mP minus mPH = Op(nminus12) (Smola et al

2007) where Op denotes the asymptotic order in probability and middot H isthe norm of the RKHS fH = radic〈 f f 〉H for all f isin H Note that this rateis independent of the dimensionality of the space X

Next we explain kernel Bayesrsquo rule which serves as a building block ofour filtering algorithm To this end we introduce two measurable spacesX and Y Let p(xy) be a joint probability on the product space X times Y thatdecomposes as p(x y) = p(y|x)p(x) Let π(x) be a prior distribution onX Then the conditional probability p(y|x) and the prior π(x) define theposterior distribution by Bayesrsquo rule

pπ (x|y) prop p(y|x)π(x)

The assumption here is that the conditional probability p(y|x) is un-known Instead we are given an iid sample (X1Y1) (XnYn) fromthe joint probability p(xy) We wish to estimate the posterior pπ (x|y) usingthe sample KBR achieves this by estimating the kernel mean of pπ (x|y)

KBR requires that kernels be defined on X and Y Let kX and kY bekernels on X and Y respectively Define the kernel means of the prior π(x)

and the posterior pπ (x|y)

mπ =int

kX (middot x)π(x)dx mπX|y =

intkX (middot x)pπ (x|y)dx

Filtering with State-Observation Examples 391

KBR also requires that mπ be expressed as a weighted sample Let mπ =sumj=1 γ jkX (middotUj) be a sample expression of mπ where isin N γ1 γ isin R

and U1 U isin X For example suppose U1 U are iid drawn fromπ(x) Then γ j = 1 suffices

Given the joint sample $(X_i, Y_i)_{i=1}^n$ and the empirical prior mean $\hat{m}_\pi$, KBR estimates the kernel posterior mean $m^\pi_{X|y}$ as a weighted sum of the feature vectors:

$$\hat{m}^\pi_{X|y} = \sum_{i=1}^n w_i k_\mathcal{X}(\cdot, X_i), \qquad (3.3)$$

where the weights $w = (w_1, \dots, w_n)^T \in \mathbb{R}^n$ are given by algorithm 1. Here $\mathrm{diag}(v)$ for $v \in \mathbb{R}^n$ denotes a diagonal matrix with diagonal entries $v$. The algorithm takes as input (1) vectors $k_Y = (k_\mathcal{Y}(y, Y_1), \dots, k_\mathcal{Y}(y, Y_n))^T$ and $m_\pi = (\hat{m}_\pi(X_1), \dots, \hat{m}_\pi(X_n))^T \in \mathbb{R}^n$, where $\hat{m}_\pi(X_i) = \sum_{j=1}^{\ell} \gamma_j k_\mathcal{X}(X_i, U_j)$; (2) kernel matrices $G_X = (k_\mathcal{X}(X_i, X_j)),\ G_Y = (k_\mathcal{Y}(Y_i, Y_j)) \in \mathbb{R}^{n \times n}$; and (3) regularization constants $\varepsilon, \delta > 0$. The weight vector $w = (w_1, \dots, w_n)^T \in \mathbb{R}^n$ is obtained by matrix computations involving two regularized matrix inversions. Note that these weights can be negative.

Fukumizu et al. (2013) showed that KBR is a consistent estimator of the kernel posterior mean under certain smoothness assumptions: the estimate, equation 3.3, converges to $m^\pi_{X|y}$ as the sample size goes to infinity, $n \to \infty$, and $\hat{m}_\pi$ converges to $m_\pi$ (with $\varepsilon, \delta \to 0$ at an appropriate speed). (For details, see Fukumizu et al., 2013; Song et al., 2013.)
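For illustration only, the following sketch shows one standard form of these two regularized inversions, following the kernel Bayes' rule of Fukumizu et al. (2013); the exact scaling of the regularization constants in algorithm 1 may differ from this sketch, and the function and variable names are ours.

    import numpy as np

    def kbr_weights(GX, GY, m_pi_vec, k_y, eps, delta):
        # GX, GY: n x n kernel matrices on states X_i and observations Y_i
        # m_pi_vec: vector (m_pi(X_1), ..., m_pi(X_n)) of prior kernel mean evaluations
        # k_y: vector (k_Y(y, Y_1), ..., k_Y(y, Y_n)) for a new observation y
        n = GX.shape[0]
        # first regularized inversion: coefficients expressing the prior
        mu = np.linalg.solve(GX + n * eps * np.eye(n), m_pi_vec)
        L = np.diag(mu)
        LG = L @ GY
        # second regularized inversion (Tikhonov regularization of the squared operator)
        w = LG @ np.linalg.solve(LG @ LG + delta * np.eye(n), L @ k_y)
        return w  # weights of equation 3.3; they may be negative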

3.4 Decoding from Empirical Kernel Means. In general, as shown above, a kernel mean $m_P$ is estimated as a weighted sum of feature vectors,

$$\hat{m}_P = \sum_{i=1}^n w_i k(\cdot, X_i), \qquad (3.4)$$

with samples $X_1, \dots, X_n \in \mathcal{X}$ and (possibly negative) weights $w_1, \dots, w_n \in \mathbb{R}$. Suppose $\hat{m}_P$ is close to $m_P$, that is, $\|\hat{m}_P - m_P\|_{\mathcal{H}}$ is small. Then $\hat{m}_P$ is supposed to have accurate information about $P$, as $m_P$ preserves all the information of $P$.


How can we decode the information of $P$ from $\hat{m}_P$? The empirical kernel mean, equation 3.4, has the following property, which is due to the reproducing property of the kernel:

$$\langle \hat{m}_P, f \rangle_{\mathcal{H}} = \sum_{i=1}^n w_i f(X_i), \quad \forall f \in \mathcal{H}. \qquad (3.5)$$

Namely, the weighted average of any function in the RKHS is equal to the inner product between the empirical kernel mean and that function. This is analogous to the property, equation 3.2, of the population kernel mean $m_P$. Let $f$ be any function in $\mathcal{H}$. From these properties, equations 3.2 and 3.5, we have

$$\left| \mathbb{E}_{X \sim P}[f(X)] - \sum_{i=1}^n w_i f(X_i) \right| = |\langle m_P - \hat{m}_P, f \rangle_{\mathcal{H}}| \le \|m_P - \hat{m}_P\|_{\mathcal{H}} \, \|f\|_{\mathcal{H}},$$

where we used the Cauchy-Schwarz inequality. Therefore, the left-hand side will be close to 0 if the error $\|m_P - \hat{m}_P\|_{\mathcal{H}}$ is small. This shows that the expectation of $f$ can be estimated by the weighted average $\sum_{i=1}^n w_i f(X_i)$. Note that here $f$ is a function in the RKHS, but the same can also be shown for functions outside the RKHS under certain assumptions (Kanagawa & Fukumizu, 2014). In this way, the estimator of the form 3.4 provides estimators of moments, probability masses on sets, and the density function (if it exists). We explain this in the context of state-space models in section 4.4.

3.5 Kernel Herding. Here we explain kernel herding (Chen et al., 2010), another building block of the proposed filter. Suppose the kernel mean $m_P$ is known. We wish to generate samples $x_1, x_2, \dots, x_\ell \in \mathcal{X}$ such that the empirical mean $\hat{m}_P = \frac{1}{\ell}\sum_{i=1}^{\ell} k(\cdot, x_i)$ is close to $m_P$, that is, $\|\hat{m}_P - m_P\|_{\mathcal{H}}$ is small. This should be done only using $m_P$. Kernel herding achieves this by greedy optimization using the following update equations:

$$x_1 = \arg\max_{x \in \mathcal{X}} m_P(x), \qquad (3.6)$$

$$x_\ell = \arg\max_{x \in \mathcal{X}} \left[ m_P(x) - \frac{1}{\ell} \sum_{i=1}^{\ell - 1} k(x, x_i) \right] \quad (\ell \ge 2), \qquad (3.7)$$

where $m_P(x)$ denotes the evaluation of $m_P$ at $x$ (recall that $m_P$ is a function in $\mathcal{H}$).

An intuitive interpretation of this procedure can be given if there is a constant $R > 0$ such that $k(x, x) = R$ for all $x \in \mathcal{X}$ (e.g., $R = 1$ if $k$ is gaussian). Suppose that $x_1, \dots, x_{\ell-1}$ are already calculated. In this case, it can be shown that $x_\ell$ in equation 3.7 is the minimizer of

$$E_\ell = \left\| m_P - \frac{1}{\ell} \sum_{i=1}^{\ell} k(\cdot, x_i) \right\|_{\mathcal{H}}. \qquad (3.8)$$

Thus kernel herding performs greedy minimization of the distance between $m_P$ and the empirical kernel mean $\hat{m}_P = \frac{1}{\ell}\sum_{i=1}^{\ell} k(\cdot, x_i)$.

It can be shown that the error $E_\ell$ of equation 3.8 decreases at a rate at least $O(\ell^{-1/2})$ under the assumption that $k$ is bounded (Bach, Lacoste-Julien, & Obozinski, 2012). In other words, the herding samples $x_1, \dots, x_\ell$ provide a convergent approximation of $m_P$. In this sense, kernel herding can be seen as a (pseudo) sampling method. Note that $m_P$ itself can be an empirical kernel mean of the form 3.4. These properties are important for our resampling algorithm developed in section 4.2.

It should be noted that $E_\ell$ decreases at a faster rate $O(\ell^{-1})$ under a certain assumption (Chen et al., 2010); this is much faster than the rate $O(\ell^{-1/2})$ of i.i.d. samples. Unfortunately, this assumption holds only when $\mathcal{H}$ is finite dimensional (Bach et al., 2012), and therefore the fast rate of $O(\ell^{-1})$ has not been guaranteed for infinite-dimensional cases. Nevertheless, this fast rate motivates the use of kernel herding in the data reduction method in section C.2 in appendix C (we will use kernel herding for two different purposes).
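The resampling step of section 4.2 will run these updates over a finite candidate set; a minimal sketch (ours) of that restricted greedy search is given below, with all names illustrative.

    import numpy as np

    def herding_from_candidates(m_at_Z, K_ZZ, ell):
        # Greedy updates 3.6-3.7 restricted to candidates Z_1, ..., Z_N.
        # m_at_Z[j] = m_P(Z_j) (or an empirical estimate); K_ZZ[i, j] = k(Z_i, Z_j).
        running = np.zeros(len(m_at_Z))   # running[j] = sum_{i < l} k(Z_j, x_i)
        chosen = []
        for l in range(1, ell + 1):
            j = int(np.argmax(m_at_Z - running / l))   # equations 3.6 (l = 1) and 3.7
            chosen.append(j)
            running += K_ZZ[:, j]
        return chosen   # indices of the herding samples x_1, ..., x_ell (repetitions allowed)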

4 Kernel Monte Carlo Filter

In this section we present our kernel Monte Carlo filter (KMCF). First, we define notation and review the problem setting in section 4.1. We then describe the algorithm of KMCF in section 4.2. We discuss implementation issues such as hyperparameter selection and computational cost in section 4.3. We explain how to decode the information on the posteriors from the estimated kernel means in section 4.4.

4.1 Notation and Problem Setup. Here we formally define the setup explained in section 1. The notation is summarized in Table 1.

Table 1: Notation.

    $\mathcal{X}$                    State space
    $\mathcal{Y}$                    Observation space
    $x_t \in \mathcal{X}$            State at time $t$
    $y_t \in \mathcal{Y}$            Observation at time $t$
    $p(y_t|x_t)$                     Observation model
    $p(x_t|x_{t-1})$                 Transition model
    $(X_i, Y_i)_{i=1}^n$             State-observation examples
    $k_\mathcal{X}$                  Positive-definite kernel on $\mathcal{X}$
    $k_\mathcal{Y}$                  Positive-definite kernel on $\mathcal{Y}$
    $\mathcal{H}_\mathcal{X}$        RKHS associated with $k_\mathcal{X}$
    $\mathcal{H}_\mathcal{Y}$        RKHS associated with $k_\mathcal{Y}$

We consider a state-space model (see Figure 1). Let $\mathcal{X}$ and $\mathcal{Y}$ be measurable spaces, which serve as a state space and an observation space, respectively. Let $x_1, \dots, x_t, \dots, x_T \in \mathcal{X}$ be a sequence of hidden states, which follow a Markov process. Let $p(x_t|x_{t-1})$ denote a transition model that defines this Markov process. Let $y_1, \dots, y_t, \dots, y_T \in \mathcal{Y}$ be a sequence of observations. Each observation $y_t$ is assumed to be generated from an observation model $p(y_t|x_t)$ conditioned on the corresponding state $x_t$. We use the abbreviation $y_{1:t} := y_1, \dots, y_t$.

We consider a filtering problem of estimating the posterior distribution $p(x_t|y_{1:t})$ for each time $t = 1, \dots, T$. The estimation is to be done online,

as each $y_t$ is given. Specifically, we consider the following setting (see also section 1):

1. The observation model $p(y_t|x_t)$ is not known explicitly or even parametrically. Instead, we are given examples of state-observation pairs $(X_i, Y_i)_{i=1}^n \subset \mathcal{X} \times \mathcal{Y}$ prior to the test phase. The observation model is also assumed time homogeneous.

2. Sampling from the transition model $p(x_t|x_{t-1})$ is possible. Its probabilistic model can be an arbitrary nonlinear, nongaussian distribution, as for standard particle filters. It can further depend on time. For example, control input can be included in the transition model as $p(x_t|x_{t-1}) = p(x_t|x_{t-1}, u_t)$, where $u_t$ denotes control input provided by a user at time $t$.

Let $k_\mathcal{X}: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ and $k_\mathcal{Y}: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ be positive-definite kernels on $\mathcal{X}$ and $\mathcal{Y}$, respectively. Denote by $\mathcal{H}_\mathcal{X}$ and $\mathcal{H}_\mathcal{Y}$ their respective RKHSs. We address the above filtering problem by estimating the kernel means of the posteriors:

$$m_{x_t|y_{1:t}} := \int k_\mathcal{X}(\cdot, x_t) p(x_t|y_{1:t}) \, dx_t \in \mathcal{H}_\mathcal{X} \quad (t = 1, \dots, T). \qquad (4.1)$$

These preserve all the information of the corresponding posteriors if the kernels are characteristic (see section 3.2). Therefore, the resulting estimates of these kernel means provide us the information of the posteriors, as explained in section 4.4.

4.2 Algorithm. KMCF iterates three steps of prediction, correction, and resampling for each time $t$. Suppose that we have just finished the iteration at time $t-1$. Then, as shown later, the resampling step yields the following estimator of equation 4.1 at time $t-1$:

$$\bar{m}_{x_{t-1}|y_{1:t-1}} = \frac{1}{n} \sum_{i=1}^n k_\mathcal{X}(\cdot, \bar{X}_{t-1,i}), \qquad (4.2)$$

where $\bar{X}_{t-1,1}, \dots, \bar{X}_{t-1,n} \in \mathcal{X}$. We show one iteration of KMCF that estimates the kernel mean, equation 4.1, at time $t$ (see also Figure 2).

Figure 2: One iteration of KMCF. Here $X_1, \dots, X_8$ and $Y_1, \dots, Y_8$ denote states and observations, respectively, in the state-observation examples $(X_i, Y_i)_{i=1}^n$ (suppose $n = 8$). 1. Prediction step: The kernel mean of the prior, equation 4.5, is estimated by sampling with the transition model $p(x_t|x_{t-1})$. 2. Correction step: The kernel mean of the posterior, equation 4.1, is estimated by applying kernel Bayes' rule (see algorithm 1). The estimation makes use of the information of the prior (expressed as $m_\pi = (\hat{m}_{x_t|y_{1:t-1}}(X_i)) \in \mathbb{R}^8$) as well as that of a new observation $y_t$ (expressed as $k_Y = (k_\mathcal{Y}(y_t, Y_i)) \in \mathbb{R}^8$). The resulting estimate, equation 4.6, is expressed as a weighted sample $(w_{t,i}, X_i)_{i=1}^n$. Note that the weights may be negative. 3. Resampling step: Samples associated with small weights are eliminated, and those with large weights are replicated by applying kernel herding (see algorithm 2). The resulting samples provide an empirical kernel mean, equation 4.7, which will be used in the next iteration.

4.2.1 Prediction Step. The prediction step is as follows. We generate a sample from the transition model for each $\bar{X}_{t-1,i}$ in equation 4.2:

$$X_{t,i} \sim p(x_t|x_{t-1} = \bar{X}_{t-1,i}) \quad (i = 1, \dots, n). \qquad (4.3)$$

We then specify a new empirical kernel mean,

$$\hat{m}_{x_t|y_{1:t-1}} = \frac{1}{n} \sum_{i=1}^n k_\mathcal{X}(\cdot, X_{t,i}). \qquad (4.4)$$

This is an estimator of the following kernel mean of the prior:

$$m_{x_t|y_{1:t-1}} := \int k_\mathcal{X}(\cdot, x_t) p(x_t|y_{1:t-1}) \, dx_t \in \mathcal{H}_\mathcal{X}, \qquad (4.5)$$

where

$$p(x_t|y_{1:t-1}) = \int p(x_t|x_{t-1}) p(x_{t-1}|y_{1:t-1}) \, dx_{t-1}$$

is the prior distribution of the current state $x_t$. Thus, equation 4.4 serves as a prior for the subsequent posterior estimation.

In section 5, we theoretically analyze this sampling procedure in detail and provide justification of equation 4.4 as an estimator of the kernel mean, equation 4.5. We emphasize here that such an analysis is necessary even though the sampling procedure is similar to that of a particle filter: the theory of particle methods does not provide a theoretical justification of equation 4.4 as a kernel mean estimator, since it deals with probabilities as empirical distributions.
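A minimal sketch (ours) of this prediction step is given below; the transition model is only assumed to be a user-supplied sampler, as in section 4.1, and all names are illustrative.

    import numpy as np

    def predict(X_bar, transition_sampler, rng):
        # equation 4.3: one draw X_{t,i} ~ p(. | x_{t-1} = X_bar[i]) per resampled state;
        # the prior estimate, equation 4.4, is the uniform-weight kernel mean of the outputs
        return np.stack([transition_sampler(x, rng) for x in X_bar])

    # example transition model: x_t = 0.9 x_{t-1} + v_t, v_t ~ N(0, 1)
    rng = np.random.default_rng(0)
    X_bar_prev = rng.normal(size=(100, 1))
    X_pred = predict(X_bar_prev, lambda x, r: 0.9 * x + r.normal(size=x.shape), rng)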

4.2.2 Correction Step. This step estimates the kernel mean, equation 4.1, of the posterior by using kernel Bayes' rule (see algorithm 1) in section 3.3. This makes use of the new observation $y_t$, the state-observation examples $(X_i, Y_i)_{i=1}^n$, and the estimate, equation 4.4, of the prior.

The input of algorithm 1 consists of (1) vectors

$$k_Y = (k_\mathcal{Y}(y_t, Y_1), \dots, k_\mathcal{Y}(y_t, Y_n))^T \in \mathbb{R}^n,$$
$$m_\pi = (\hat{m}_{x_t|y_{1:t-1}}(X_1), \dots, \hat{m}_{x_t|y_{1:t-1}}(X_n))^T = \left( \frac{1}{n} \sum_{i=1}^n k_\mathcal{X}(X_q, X_{t,i}) \right)_{q=1}^n \in \mathbb{R}^n,$$

which are interpreted as expressions of $y_t$ and $\hat{m}_{x_t|y_{1:t-1}}$ using the sample $(X_i, Y_i)_{i=1}^n$; (2) kernel matrices $G_X = (k_\mathcal{X}(X_i, X_j)),\ G_Y = (k_\mathcal{Y}(Y_i, Y_j)) \in \mathbb{R}^{n \times n}$; and (3) regularization constants $\varepsilon, \delta > 0$. These constants $\varepsilon, \delta$, as well as the kernels $k_\mathcal{X}, k_\mathcal{Y}$, are hyperparameters of KMCF (we discuss how to choose these parameters later).

Algorithm 1 outputs a weight vector $w = (w_1, \dots, w_n)^T \in \mathbb{R}^n$. Normalizing these weights, $w_t := w / \sum_{i=1}^n w_i$ (for this normalization procedure, see the discussion in section 4.3), we obtain an estimator of equation 4.1 as

$$\hat{m}_{x_t|y_{1:t}} = \sum_{i=1}^n w_{t,i} k_\mathcal{X}(\cdot, X_i). \qquad (4.6)$$

The apparent difference from a particle filter is that the posterior (kernel mean) estimator, equation 4.6, is expressed in terms of the samples $X_1, \dots, X_n$ in the training sample $(X_i, Y_i)_{i=1}^n$, not with the samples from the prior, equation 4.4. This requires that the training samples $X_1, \dots, X_n$ cover the support of the posterior $p(x_t|y_{1:t})$ sufficiently well. If this does not hold, we cannot expect good performance for the posterior estimate. Note that this is also true for any method that deals with the setting of this letter: poverty of training samples in a certain region means that we do not have any information about the observation model $p(y_t|x_t)$ in that region.

4.2.3 Resampling Step. This step applies the update equations 3.6 and 3.7 of kernel herding in section 3.5 to the estimate, equation 4.6. This is to obtain samples $\bar{X}_{t,1}, \dots, \bar{X}_{t,n}$ such that

$$\bar{m}_{x_t|y_{1:t}} = \frac{1}{n} \sum_{i=1}^n k_\mathcal{X}(\cdot, \bar{X}_{t,i}) \qquad (4.7)$$

is close to equation 4.6 in the RKHS. Our theoretical analysis in section 5 shows that such a procedure can reduce the error of the prediction step at time $t+1$.

The procedure is summarized in algorithm 2. Specifically, we generate each $\bar{X}_{t,i}$ by searching for the solution of the optimization problem in equations 3.6 and 3.7 from the finite set of samples $X_1, \dots, X_n$ in equation 4.6. We allow repetitions in $\bar{X}_{t,1}, \dots, \bar{X}_{t,n}$. We can expect that the resulting equation 4.7 is close to equation 4.6 in the RKHS if the samples $X_1, \dots, X_n$ cover the support of the posterior $p(x_t|y_{1:t})$ sufficiently. This is verified by the theoretical analysis of section 5.3.

Here, searching for the solutions from a finite set reduces the computational costs of kernel herding. It is possible to search from the entire space $\mathcal{X}$ if we have sufficient time or if the sample size $n$ is small enough; it depends on applications and available computational resources. We also note that the size of the resampling samples is not necessarily $n$; this depends on how accurately these samples approximate equation 4.6. Thus, a smaller number of samples may be sufficient. In this case, we can reduce the computational costs of resampling, as discussed in section 5.2.

The aim of our resampling step is similar to that of the resampling step of a particle filter (see Doucet & Johansen, 2011). Intuitively, the aim is to eliminate samples with very small weights and replicate those with large weights (see Figures 2 and 3). In particle methods, this is realized by generating samples from the empirical distribution defined by a weighted sample (therefore, this procedure is called resampling). Our resampling step is a realization of such a procedure in terms of the kernel mean embedding: we generate samples $\bar{X}_{t,1}, \dots, \bar{X}_{t,n}$ from the empirical kernel mean, equation 4.6.

Note that the resampling algorithm of particle methods is not appropriate for use with kernel mean embeddings. This is because it assumes that weights are positive, but our weights in equation 4.6 can be negative, as this equation is a kernel mean estimator. One may apply the resampling algorithm of particle methods by first truncating the samples with negative weights. However, there is no guarantee that samples obtained by this heuristic produce a good approximation of equation 4.6 as a kernel mean, as shown by experiments in section 6.1. In this sense, the use of kernel herding is more natural, since it generates samples that approximate a kernel mean.

4.2.4 Overall Algorithm. We summarize the overall procedure of KMCF in algorithm 3, where $p_{\mathrm{init}}$ denotes a prior distribution for the initial state $x_1$. For each time $t$, KMCF takes as input an observation $y_t$ and outputs a weight vector $w_t = (w_{t,1}, \dots, w_{t,n})^T \in \mathbb{R}^n$. Combined with the samples $X_1, \dots, X_n$ in the state-observation examples $(X_i, Y_i)_{i=1}^n$, these weights provide an estimator, equation 4.6, of the kernel mean of the posterior, equation 4.1.

We first compute the kernel matrices $G_X, G_Y$ (lines 4-5), which are used in algorithm 1 of kernel Bayes' rule (line 15). For $t = 1$, we generate an i.i.d. sample $X_{1,1}, \dots, X_{1,n}$ from the initial distribution $p_{\mathrm{init}}$ (line 8), which provides an estimator of the prior corresponding to equation 4.4. Line 10 is the resampling step at time $t-1$, and line 11 is the prediction step at time $t$. Lines 13 to 16 correspond to the correction step.
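For concreteness, here is a minimal sketch (ours) of one KMCF iteration that wires together the hypothetical helpers sketched earlier (predict, kbr_weights, herding_from_candidates); it follows the resampling-prediction-correction flow of algorithm 3 but omits the copying trick of section 5.2 and the speed-ups of appendix C, and all names are illustrative.

    import numpy as np

    def kmcf_step(X_bar_prev, y_t, X_train, Y_train, GX, GY, kx, ky,
                  transition_sampler, eps, delta, rng):
        # prediction step: equations 4.3 and 4.4
        X_pred = predict(X_bar_prev, transition_sampler, rng)
        # inputs of algorithm 1: m_pi and k_Y expressed on the training sample (section 4.2.2)
        m_pi = np.array([np.mean([kx(Xq, Xp) for Xp in X_pred]) for Xq in X_train])
        k_y = np.array([ky(y_t, Yi) for Yi in Y_train])
        # correction step: kernel Bayes' rule and weight normalization (equation 4.6)
        w = kbr_weights(GX, GY, m_pi, k_y, eps, delta)
        w = w / np.sum(w)
        # resampling step: kernel herding over the training states (equation 4.7)
        m_post_at_train = GX @ w          # posterior kernel mean evaluated at X_1, ..., X_n
        idx = herding_from_candidates(m_post_at_train, GX, len(X_train))
        X_bar = X_train[np.array(idx)]
        return w, X_bar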


4.3 Discussion. The estimation accuracy of KMCF can depend on several factors in practice.

4.3.1 Training Samples. We first note that the training samples $(X_i, Y_i)_{i=1}^n$ should provide the information concerning the observation model $p(y_t|x_t)$. For example, $(X_i, Y_i)_{i=1}^n$ may be an i.i.d. sample from a joint distribution $p(x, y)$ on $\mathcal{X} \times \mathcal{Y}$ that decomposes as $p(x, y) = p(y|x)p(x)$. Here $p(y|x)$ is the observation model and $p(x)$ is some distribution on $\mathcal{X}$. The support of $p(x)$ should cover the region where the states $x_1, \dots, x_T$ may pass in the test phase, as discussed in section 4.2. For example, this is satisfied when the state space $\mathcal{X}$ is compact and the support of $p(x)$ is the entire $\mathcal{X}$.

Note that the training samples $(X_i, Y_i)_{i=1}^n$ can also be non-i.i.d. in practice. For example, we may deterministically select $X_1, \dots, X_n$ so that they cover the region of interest. In location estimation problems in robotics, for instance, we may collect location-sensor examples $(X_i, Y_i)_{i=1}^n$ so that the locations $X_1, \dots, X_n$ cover the region where location estimation is to be conducted (Quigley et al., 2010).

4.3.2 Hyperparameters. As in other kernel methods in general, the performance of KMCF depends on the choice of its hyperparameters, which are the kernels $k_\mathcal{X}$ and $k_\mathcal{Y}$ (or parameters in the kernels, e.g., the bandwidth of the gaussian kernel) and the regularization constants $\delta, \varepsilon > 0$. We need to define these hyperparameters based on the joint sample $(X_i, Y_i)_{i=1}^n$ before running the algorithm on the test data $y_1, \dots, y_T$. This can be done by cross-validation. Suppose that $(X_i, Y_i)_{i=1}^n$ is given as a sequence from the state-space model. We can then apply two-fold cross-validation by dividing the sequence into two subsequences. If $(X_i, Y_i)_{i=1}^n$ is not a sequence, we can rely on the cross-validation procedure for kernel Bayes' rule (see section 4.2 of Fukumizu et al., 2013).

4.3.3 Normalization of Weights. We found in our preliminary experiments that normalization of the weights (see line 16, algorithm 3) is beneficial to the filtering performance. This may be justified by the following discussion about a kernel mean estimator in general. Let us consider a consistent kernel mean estimator $\hat{m}_P = \sum_{i=1}^n w_i k(\cdot, X_i)$ such that $\lim_{n \to \infty} \|\hat{m}_P - m_P\|_{\mathcal{H}} = 0$. Then we can show that the sum of the weights converges to 1, $\lim_{n \to \infty} \sum_{i=1}^n w_i = 1$, under certain assumptions (Kanagawa & Fukumizu, 2014). This could be explained as follows. Recall that the weighted average $\sum_{i=1}^n w_i f(X_i)$ of a function $f$ is an estimator of the expectation $\int f(x) \, dP(x)$. Let $f$ be a function that takes the value 1 for any input: $f(x) = 1$, $\forall x \in \mathcal{X}$. Then we have $\sum_{i=1}^n w_i f(X_i) = \sum_{i=1}^n w_i$ and $\int f(x) \, dP(x) = 1$. Therefore, $\sum_{i=1}^n w_i$ is an estimator of 1. In other words, if the error $\|\hat{m}_P - m_P\|_{\mathcal{H}}$ is small, then the sum of the weights $\sum_{i=1}^n w_i$ should be close to 1. Conversely, if the sum of the weights is far from 1, it suggests that the estimate $\hat{m}_P$ is not accurate. Based on this theoretical observation, we suppose that normalization of the weights (which makes the sum equal to 1) results in a better estimate.

4.3.4 Time Complexity. For each time $t$, the naive implementation of algorithm 3 requires a time complexity of $O(n^3)$ for the size $n$ of the joint sample $(X_i, Y_i)_{i=1}^n$. This comes from algorithm 1 in line 15 (kernel Bayes' rule) and algorithm 2 in line 10 (resampling). The complexity $O(n^3)$ of algorithm 1 is due to the matrix inversions. Note that one of the inversions, $(G_X + n\varepsilon I_n)^{-1}$, can be computed before the test phase, as it does not involve the test data. Algorithm 2 also has complexity $O(n^3)$. In section 5.2, we explain how this cost can be reduced to $O(\ell n^2)$ by generating only $\ell < n$ samples by resampling.

4.3.5 Speeding-Up Methods. In appendix C, we describe two methods for reducing the computational costs of KMCF, both of which need to be applied only prior to the test phase. The first is a low-rank approximation of the kernel matrices $G_X, G_Y$, which reduces the complexity to $O(nr^2)$, where $r$ is the rank of the low-rank matrices. Low-rank approximation works well in practice, since the eigenvalues of a kernel matrix often decay very rapidly. Indeed, this has been theoretically shown for some cases (see Widom, 1963, 1964, and discussions in Bach & Jordan, 2002). The second is a data-reduction method based on kernel herding, which efficiently selects joint subsamples from the training set $(X_i, Y_i)_{i=1}^n$. Algorithm 3 is then applied based only on those subsamples. The resulting complexity is thus $O(r^3)$, where $r$ is the number of subsamples. This method is motivated by the fast convergence rate of kernel herding (Chen et al., 2010).

Both methods require the number $r$ to be chosen, which is either the rank for low-rank approximation or the number of subsamples in data reduction. This determines the trade-off between accuracy and computational time. In practice, there are two ways of selecting the number $r$. By regarding $r$ as a hyperparameter of KMCF, we can select it by cross-validation, or we can choose $r$ by comparing the resulting approximation error, which is measured in a matrix norm for low-rank approximation and in an RKHS norm for the subsampling method. (For details, see appendix C.)

4.3.6 Transfer Learning Setting. We assumed that the observation model in the test phase is the same as for the training samples. However, this might not hold in some situations. For example, in the vision-based localization problem, the illumination conditions for the test and training phases might be different (e.g., the test is done at night, while the training samples are collected in the morning). Without taking into account such a significant change in the observation model, KMCF would not perform well in practice.

This problem could be addressed by exploiting the framework of transfer learning (Pan & Yang, 2010). This framework aims at situations where the probability distribution that generates test data is different from that of the training samples. The main assumption is that there exist a small number of examples from the test distribution. Transfer learning then provides a way of combining such test examples and abundant training samples, thereby improving the test performance. The application of transfer learning in our setting remains a topic for future research.

4.4 Estimation of Posterior Statistics. By algorithm 3, we obtain the estimates of the kernel means of the posteriors, equation 4.1, as

$$\hat{m}_{x_t|y_{1:t}} = \sum_{i=1}^n w_{t,i} k_\mathcal{X}(\cdot, X_i) \quad (t = 1, \dots, T). \qquad (4.8)$$

These contain the information on the posteriors $p(x_t|y_{1:t})$ (see sections 3.2 and 3.4). We now show how to estimate statistics of the posteriors using these estimates. For ease of presentation, we consider the case $\mathcal{X} = \mathbb{R}^d$. Theoretical arguments to justify these operations are provided by Kanagawa and Fukumizu (2014).

4.4.1 Mean and Covariance. Consider the posterior mean $\int x_t \, p(x_t|y_{1:t}) \, dx_t \in \mathbb{R}^d$ and the posterior (uncentered) covariance $\int x_t x_t^T \, p(x_t|y_{1:t}) \, dx_t \in \mathbb{R}^{d \times d}$. These quantities can be estimated as

$$\sum_{i=1}^n w_{t,i} X_i \ \ \text{(mean)}, \qquad \sum_{i=1}^n w_{t,i} X_i X_i^T \ \ \text{(covariance)}.$$

4.4.2 Probability Mass. Let $A \subset \mathcal{X}$ be a measurable set with smooth boundary. Define the indicator function $I_A(x)$ by $I_A(x) = 1$ for $x \in A$ and $I_A(x) = 0$ otherwise. Consider the probability mass $\int I_A(x_t) p(x_t|y_{1:t}) \, dx_t$. This can be estimated as $\sum_{i=1}^n w_{t,i} I_A(X_i)$.

4.4.3 Density. Suppose $p(x_t|y_{1:t})$ has a density function. Let $J(x)$ be a smoothing kernel satisfying $\int J(x) \, dx = 1$ and $J(x) \ge 0$. Let $h > 0$ and define $J_h(x) := \frac{1}{h^d} J\left(\frac{x}{h}\right)$. Then the density of $p(x_t|y_{1:t})$ can be estimated as

$$\hat{p}(x_t|y_{1:t}) = \sum_{i=1}^n w_{t,i} J_h(x_t - X_i), \qquad (4.9)$$

with an appropriate choice of $h$.

4.4.4 Mode. The mode may be obtained by finding a point that maximizes equation 4.9. However, this requires a careful choice of $h$. Instead, we may use $X_{i_{\max}}$ with $i_{\max} := \arg\max_i w_{t,i}$ as a mode estimate. This is the point in $X_1, \dots, X_n$ that is associated with the maximum weight in $w_{t,1}, \dots, w_{t,n}$. This point can be interpreted as the point that maximizes equation 4.9 in the limit of $h \to 0$.
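A minimal sketch (ours) of these decoding operations from the weights of equation 4.8 follows; the indicator set and all names are illustrative assumptions.

    import numpy as np

    def posterior_statistics(w, X, indicator=None):
        # w: weights (w_{t,1}, ..., w_{t,n}); X: array of shape (n, d) of training states
        mean = w @ X                                  # section 4.4.1, posterior mean
        cov = (w[:, None] * X).T @ X                  # uncentered covariance sum_i w_i X_i X_i^T
        mode = X[int(np.argmax(w))]                   # section 4.4.4, maximum-weight point
        mass = None
        if indicator is not None:                     # section 4.4.2, probability mass of a set A
            mass = float(np.sum(w * np.array([indicator(x) for x in X])))
        return mean, cov, mode, mass

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))
    w = np.full(100, 1.0 / 100)
    print(posterior_statistics(w, X, indicator=lambda x: float(np.all(np.abs(x) < 1.0))))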

4.4.5 Other Methods. Other ways of using equation 4.8 include the preimage computation and the fitting of gaussian mixtures (see, e.g., Song et al., 2009; Fukumizu et al., 2013; McCalman, O'Callaghan, & Ramos, 2013).

5 Theoretical Analysis

In this section, we analyze the sampling procedure of the prediction step in section 4.2. Specifically, we derive an upper bound on the error of the estimator, equation 4.4. We also discuss in detail how the resampling step in section 4.2 works as a preprocessing step of the prediction step.

To make our analysis clear, we slightly generalize the setting of the prediction step and discuss the sampling and resampling procedures in this setting.

5.1 Error Bound for the Prediction Step. Let $\mathcal{X}$ be a measurable space and $P$ be a probability distribution on $\mathcal{X}$. Let $p(\cdot|x)$ be a conditional distribution on $\mathcal{X}$ conditioned on $x \in \mathcal{X}$. Let $Q$ be a marginal distribution on $\mathcal{X}$ defined by $Q(B) = \int p(B|x) \, dP(x)$ for all measurable $B \subset \mathcal{X}$. In the filtering setting of section 4, the space $\mathcal{X}$ corresponds to the state space, and the distributions $P$, $p(\cdot|x)$, and $Q$ correspond to the posterior $p(x_{t-1}|y_{1:t-1})$ at time $t-1$, the transition model $p(x_t|x_{t-1})$, and the prior $p(x_t|y_{1:t-1})$ at time $t$, respectively.

Let $k_\mathcal{X}$ be a positive-definite kernel on $\mathcal{X}$ and $\mathcal{H}_\mathcal{X}$ be the RKHS associated with $k_\mathcal{X}$. Let $m_P = \int k_\mathcal{X}(\cdot, x) \, dP(x)$ and $m_Q = \int k_\mathcal{X}(\cdot, x) \, dQ(x)$ be the kernel means of $P$ and $Q$, respectively. Suppose that we are given an empirical estimate of $m_P$ as

$$\hat{m}_P = \sum_{i=1}^n w_i k_\mathcal{X}(\cdot, X_i), \qquad (5.1)$$

where $w_1, \dots, w_n \in \mathbb{R}$ and $X_1, \dots, X_n \in \mathcal{X}$. Considering this weighted sample form enables us to explain the mechanism of the resampling step.

The prediction step can then be cast as the following procedure: for each sample $X_i$, we generate a new sample $X'_i$ with the conditional distribution, $X'_i \sim p(\cdot|X_i)$. Then we estimate $m_Q$ by

$$\hat{m}_Q = \sum_{i=1}^n w_i k_\mathcal{X}(\cdot, X'_i), \qquad (5.2)$$

which corresponds to the estimate, equation 4.4, of the prior kernel mean at time $t$.

The following theorem provides an upper bound on the error of equation 5.2 and reveals the properties of equation 5.1 that affect the error of the estimator, equation 5.2. The proof is given in appendix A.

Theorem 1. Let $\hat{m}_P$ be a fixed estimate of $m_P$ given by equation 5.1. Define a function $\theta$ on $\mathcal{X} \times \mathcal{X}$ by $\theta(x_1, x_2) = \int\int k_\mathcal{X}(x'_1, x'_2) \, dp(x'_1|x_1) \, dp(x'_2|x_2)$, $\forall (x_1, x_2) \in \mathcal{X} \times \mathcal{X}$, and assume that $\theta$ is included in the tensor RKHS $\mathcal{H}_\mathcal{X} \otimes \mathcal{H}_\mathcal{X}$. The estimator $\hat{m}_Q$, equation 5.2, then satisfies

$$\mathbb{E}_{X'_1, \dots, X'_n}\left[\|\hat{m}_Q - m_Q\|^2_{\mathcal{H}_\mathcal{X}}\right] \le \sum_{i=1}^n w_i^2 \left( \mathbb{E}_{X'_i}[k_\mathcal{X}(X'_i, X'_i)] - \mathbb{E}_{X'_i, \tilde{X}'_i}[k_\mathcal{X}(X'_i, \tilde{X}'_i)] \right) \qquad (5.3)$$
$$\qquad\qquad + \|\hat{m}_P - m_P\|^2_{\mathcal{H}_\mathcal{X}} \, \|\theta\|_{\mathcal{H}_\mathcal{X} \otimes \mathcal{H}_\mathcal{X}}, \qquad (5.4)$$

where $X'_i \sim p(\cdot|X_i)$ and $\tilde{X}'_i$ is an independent copy of $X'_i$.

(A note on the assumption of theorem 1: the tensor RKHS $\mathcal{H}_\mathcal{X} \otimes \mathcal{H}_\mathcal{X}$ is the RKHS of a product kernel $k_{\mathcal{X} \times \mathcal{X}}$ on $\mathcal{X} \times \mathcal{X}$ defined as $k_{\mathcal{X} \times \mathcal{X}}((x_a, x_b), (x_c, x_d)) = k_\mathcal{X}(x_a, x_c) k_\mathcal{X}(x_b, x_d)$, $\forall (x_a, x_b), (x_c, x_d) \in \mathcal{X} \times \mathcal{X}$. This space $\mathcal{H}_\mathcal{X} \otimes \mathcal{H}_\mathcal{X}$ consists of smooth functions on $\mathcal{X} \times \mathcal{X}$ if the kernel $k_\mathcal{X}$ is smooth (e.g., if $k_\mathcal{X}$ is gaussian; see section 4 of Steinwart & Christmann, 2008). In this case, we can interpret the assumption as requiring that $\theta$ be smooth as a function on $\mathcal{X} \times \mathcal{X}$. The function $\theta$ can be written as the inner product between the kernel means of the conditional distributions, $\theta(x_1, x_2) = \langle m_{p(\cdot|x_1)}, m_{p(\cdot|x_2)} \rangle_{\mathcal{H}_\mathcal{X}}$, where $m_{p(\cdot|x)} := \int k_\mathcal{X}(\cdot, x') \, dp(x'|x)$. Therefore, the assumption may be further seen as requiring that the map $x \to m_{p(\cdot|x)}$ be smooth. Note that while similar assumptions are common in the literature on kernel mean embeddings (e.g., theorem 5 of Fukumizu et al., 2013), we may relax this assumption by using approximate arguments in learning theory (e.g., theorems 2.2 and 2.3 of Eberts & Steinwart, 2013). This analysis remains a topic for future research.)

From theorem 1, we can make the following observations. First, the second term, equation 5.4, of the upper bound shows that the error of the estimator, equation 5.2, is likely to be large if the given estimate, equation 5.1, has large error $\|\hat{m}_P - m_P\|^2_{\mathcal{H}_\mathcal{X}}$, which is reasonable to expect.

Second, the first term, equation 5.3, shows that the error of equation 5.2 can be large if the distribution of $X'_i$ (i.e., $p(\cdot|X_i)$) has large variance. For example, suppose $X'_i = f(X_i) + \varepsilon_i$, where $f: \mathcal{X} \to \mathcal{X}$ is some mapping and $\varepsilon_i$ is a random variable with mean 0. Let $k_\mathcal{X}$ be the gaussian kernel, $k_\mathcal{X}(x, x') = \exp(-\|x - x'\|^2/2\alpha)$ for some $\alpha > 0$. Then $\mathbb{E}_{X'_i}[k_\mathcal{X}(X'_i, X'_i)] - \mathbb{E}_{X'_i, \tilde{X}'_i}[k_\mathcal{X}(X'_i, \tilde{X}'_i)]$ increases from 0 to 1 as the variance of $\varepsilon_i$ (i.e., the variance of $X'_i$) increases from 0 to infinity. Therefore, in this case, equation 5.3 is upper-bounded at worst by $\sum_{i=1}^n w_i^2$. Note that $\mathbb{E}_{X'_i}[k_\mathcal{X}(X'_i, X'_i)] - \mathbb{E}_{X'_i, \tilde{X}'_i}[k_\mathcal{X}(X'_i, \tilde{X}'_i)]$ is always nonnegative. (To show this, it is sufficient to prove that $\int\int k_\mathcal{X}(x, \tilde{x}) \, dP(x) \, dP(\tilde{x}) \le \int k_\mathcal{X}(x, x) \, dP(x)$ for any probability $P$. This can be shown as follows: $\int\int k_\mathcal{X}(x, \tilde{x}) \, dP(x) \, dP(\tilde{x}) = \int\int \langle k_\mathcal{X}(\cdot, x), k_\mathcal{X}(\cdot, \tilde{x}) \rangle_{\mathcal{H}_\mathcal{X}} \, dP(x) \, dP(\tilde{x}) \le \int\int \sqrt{k_\mathcal{X}(x, x)} \sqrt{k_\mathcal{X}(\tilde{x}, \tilde{x})} \, dP(x) \, dP(\tilde{x}) \le \int k_\mathcal{X}(x, x) \, dP(x)$, where we used the reproducing property, the Cauchy-Schwarz inequality, and Jensen's inequality.)

5.1.1 Effective Sample Size. Now let us assume that the kernel $k_\mathcal{X}$ is bounded: there is a constant $C > 0$ such that $\sup_{x \in \mathcal{X}} k_\mathcal{X}(x, x) < C$. Then the inequality of theorem 1 can be further bounded as

$$\mathbb{E}_{X'_1, \dots, X'_n}\left[\|\hat{m}_Q - m_Q\|^2_{\mathcal{H}_\mathcal{X}}\right] \le 2C \sum_{i=1}^n w_i^2 + \|\hat{m}_P - m_P\|^2_{\mathcal{H}_\mathcal{X}} \, \|\theta\|_{\mathcal{H}_\mathcal{X} \otimes \mathcal{H}_\mathcal{X}}. \qquad (5.5)$$

This bound shows that two quantities are important in the estimate, equation 5.1: (1) the sum of the squared weights $\sum_{i=1}^n w_i^2$ and (2) the error $\|\hat{m}_P - m_P\|^2_{\mathcal{H}_\mathcal{X}}$. In other words, the error of equation 5.2 can be large if the quantity $\sum_{i=1}^n w_i^2$ is large, regardless of the accuracy of equation 5.1 as an estimator of $m_P$. In fact, the estimator of the form 5.1 can have large $\sum_{i=1}^n w_i^2$ even when $\|\hat{m}_P - m_P\|^2_{\mathcal{H}_\mathcal{X}}$ is small, as shown in section 6.1.

Figure 3: An illustration of the sampling procedure with (right) and without (left) the resampling algorithm. The left panel corresponds to the kernel mean estimators, equations 5.1 and 5.2, in section 5.1, and the right panel corresponds to equations 5.6 and 5.7 in section 5.2.

The inverse of the sum of the squared weights, $1/\sum_{i=1}^n w_i^2$, can be interpreted as the effective sample size (ESS) of the empirical kernel mean, equation 5.1. To explain this, suppose that the weights are normalized, $\sum_{i=1}^n w_i = 1$. Then the ESS takes its maximum $n$ when the weights are uniform, $w_1 = \cdots = w_n = 1/n$. It becomes small when only a few samples have large weights (see the left side of Figure 3). Therefore, the bound, equation 5.5, can be interpreted as follows: to make equation 5.2 a good estimator of $m_Q$, we need to have equation 5.1 such that the ESS is large and the error $\|\hat{m}_P - m_P\|_{\mathcal{H}}$ is small. Here we borrowed the notion of ESS from the literature on particle methods, in which ESS has also played an important role (see section 2.5.3 of Liu, 2001, and section 3.5 of Doucet & Johansen, 2011).

5.2 Role of Resampling. Based on these arguments, we explain how the resampling step in section 4.2 works as a preprocessing step for the sampling procedure. Consider $\hat{m}_P$ in equation 5.1 as an estimate, equation 4.6, given by the correction step at time $t-1$. Then we can think of $\hat{m}_Q$, equation 5.2, as an estimator of the kernel mean, equation 4.5, of the prior without the resampling step.

The resampling step is the application of kernel herding to $\hat{m}_P$ to obtain samples $\bar{X}_1, \dots, \bar{X}_n$, which provide a new estimate of $m_P$ with uniform weights:

$$\bar{m}_P = \frac{1}{n} \sum_{i=1}^n k_\mathcal{X}(\cdot, \bar{X}_i). \qquad (5.6)$$

The subsequent prediction step is to generate a sample $X'_i \sim p(\cdot|\bar{X}_i)$ for each $\bar{X}_i$ ($i = 1, \dots, n$) and estimate $m_Q$ as

$$\bar{m}_Q = \frac{1}{n} \sum_{i=1}^n k_\mathcal{X}(\cdot, X'_i). \qquad (5.7)$$

Theorem 1 gives the following bound for this estimator, corresponding to equation 5.5:

$$\mathbb{E}_{X'_1, \dots, X'_n}\left[\|\bar{m}_Q - m_Q\|^2_{\mathcal{H}_\mathcal{X}}\right] \le \frac{2C}{n} + \|\bar{m}_P - m_P\|^2_{\mathcal{H}} \, \|\theta\|_{\mathcal{H}_\mathcal{X} \otimes \mathcal{H}_\mathcal{X}}. \qquad (5.8)$$

A comparison of the upper bounds of equations 5.5 and 5.8 implies that the resampling step is beneficial when $\sum_{i=1}^n w_i^2$ is large (i.e., the ESS is small) and $\|\bar{m}_P - \hat{m}_P\|_{\mathcal{H}_\mathcal{X}}$ is small. The condition on $\|\bar{m}_P - \hat{m}_P\|_{\mathcal{H}_\mathcal{X}}$ means that the loss by kernel herding (in terms of the RKHS distance) is small. This implies $\|\bar{m}_P - m_P\|_{\mathcal{H}_\mathcal{X}} \approx \|\hat{m}_P - m_P\|_{\mathcal{H}_\mathcal{X}}$, so the second term of equation 5.8 is close to that of equation 5.5. On the other hand, the first term of equation 5.8 will be much smaller than that of equation 5.5 if $\sum_{i=1}^n w_i^2 \gg 1/n$. In other words, the resampling step improves the accuracy of the sampling procedure by increasing the ESS of the kernel mean estimate $\hat{m}_P$. This is illustrated in Figure 3.

The above observations lead to the following procedures.

5.2.1 When to Apply Resampling. If $\sum_{i=1}^n w_i^2$ is not large, the gain by the resampling step will be small. Therefore, the resampling algorithm should be applied when $\sum_{i=1}^n w_i^2$ is above a certain threshold, say $2/n$. The same strategy has been commonly used in particle methods (see Doucet & Johansen, 2011).
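A minimal sketch (ours) of this trigger; the threshold factor encodes the 2/n rule just mentioned.

    import numpy as np

    def needs_resampling(w, factor=2.0):
        # apply kernel herding only when sum_i w_i^2 > factor / n,
        # i.e., when the effective sample size 1 / sum_i w_i^2 drops below n / factor
        n = len(w)
        return float(np.sum(w ** 2)) > factor / n

    print(needs_resampling(np.full(100, 0.01)))                  # False: ESS equals n
    print(needs_resampling(np.array([0.9] + [0.1 / 99] * 99)))   # True: ESS is small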

Also, the bound, equation 5.3, of theorem 1 shows that resampling is not beneficial if the variance of the conditional distribution $p(\cdot|x)$ is very small (i.e., if the state transition is nearly deterministic). In this case, the error of the sampling procedure may increase due to the loss $\|\bar{m}_P - \hat{m}_P\|_{\mathcal{H}_\mathcal{X}}$ caused by kernel herding.

5.2.2 Reduction of Computational Cost. Algorithm 2 generates $n$ samples $\bar{X}_1, \dots, \bar{X}_n$ with time complexity $O(n^3)$. Suppose that the first $\ell$ samples $\bar{X}_1, \dots, \bar{X}_\ell$, where $\ell < n$, already approximate $\hat{m}_P$ well: $\|\frac{1}{\ell}\sum_{i=1}^{\ell} k_\mathcal{X}(\cdot, \bar{X}_i) - \hat{m}_P\|_{\mathcal{H}_\mathcal{X}}$ is small. We do not then need to generate the rest of the samples $\bar{X}_{\ell+1}, \dots, \bar{X}_n$; we can make $n$ samples by copying the $\ell$ samples $n/\ell$ times (suppose $n$ can be divided by $\ell$ for simplicity, say $n = 2\ell$). Let $\tilde{X}_1, \dots, \tilde{X}_n$ denote these $n$ samples. Then $\frac{1}{\ell}\sum_{i=1}^{\ell} k_\mathcal{X}(\cdot, \bar{X}_i) = \frac{1}{n}\sum_{i=1}^n k_\mathcal{X}(\cdot, \tilde{X}_i)$ by definition, so $\|\frac{1}{n}\sum_{i=1}^n k_\mathcal{X}(\cdot, \tilde{X}_i) - \hat{m}_P\|_{\mathcal{H}_\mathcal{X}}$ is also small. This reduces the time complexity of algorithm 2 to $O(\ell n^2)$.

One might think that it is unnecessary to copy the $\ell$ samples to make $n$ samples. This is not true, however. Suppose that we just use the first $\ell$ samples to define $\bar{m}_P = \frac{1}{\ell}\sum_{i=1}^{\ell} k_\mathcal{X}(\cdot, \bar{X}_i)$. Then the first term of equation 5.8 becomes $2C/\ell$, which is larger than the $2C/n$ of $n$ samples. This difference involves sampling with the conditional distribution, $X'_i \sim p(\cdot|\bar{X}_i)$: if we use just the $\ell$ samples, sampling is done $\ell$ times; if we use the copied $n$ samples, sampling is done $n$ times. Thus the benefit of making $n$ samples comes from sampling with the conditional distribution many times. This matches the bound of theorem 1, where the first term involves the variance of the conditional distribution.

5.3 Convergence Rates for Resampling. Our resampling algorithm (see algorithm 2) is an approximate version of kernel herding in section 3.5: algorithm 2 searches for the solutions of the update equations 3.6 and 3.7 from a finite set $\{X_1, \dots, X_n\} \subset \mathcal{X}$, not from the entire space $\mathcal{X}$. Therefore, existing theoretical guarantees for kernel herding (Chen et al., 2010; Bach et al., 2012) do not apply to algorithm 2. Here we provide a theoretical justification.

5.3.1 Generalized Version. We consider a slightly generalized version, shown in algorithm 4. It takes as input (1) a kernel mean estimator $\hat{m}_P$ of a kernel mean $m_P$, (2) candidate samples $Z_1, \dots, Z_N$, and (3) the number of resampling $\ell$. It then outputs resampling samples $\bar{X}_1, \dots, \bar{X}_\ell \in \{Z_1, \dots, Z_N\}$, which form a new estimator $\bar{m}_P = \frac{1}{\ell}\sum_{i=1}^{\ell} k_\mathcal{X}(\cdot, \bar{X}_i)$. Here $N$ is the number of the candidate samples.

Algorithm 4 searches for solutions of the update equations 3.6 and 3.7 from the candidate set $\{Z_1, \dots, Z_N\}$. Note that here these samples $Z_1, \dots, Z_N$ can be different from those expressing the estimator $\hat{m}_P$. If they are the same (that is, the estimator is expressed as $\hat{m}_P = \sum_{i=1}^n w_{t,i} k(\cdot, X_i)$ with $n = N$ and $X_i = Z_i$, $i = 1, \dots, n$), then algorithm 4 reduces to algorithm 2. In fact, theorem 2 allows $\hat{m}_P$ to be any element in the RKHS.

5.3.2 Convergence Rates in Terms of $N$ and $\ell$. Algorithm 4 gives the new estimator $\bar{m}_P$ of the kernel mean $m_P$. The error of this new estimator, $\|\bar{m}_P - m_P\|_{\mathcal{H}_\mathcal{X}}$, should be close to that of the given estimator, $\|\hat{m}_P - m_P\|_{\mathcal{H}_\mathcal{X}}$. Theorem 2 guarantees this. In particular, it provides convergence rates of $\|\bar{m}_P - m_P\|_{\mathcal{H}_\mathcal{X}}$ approaching $\|\hat{m}_P - m_P\|_{\mathcal{H}_\mathcal{X}}$ as $N$ and $\ell$ go to infinity. This theorem follows from theorem 3 in appendix B, which holds under weaker assumptions.

Theorem 2. Let $m_P$ be the kernel mean of a distribution $P$ and $\hat{m}_P$ be any element in the RKHS $\mathcal{H}_\mathcal{X}$. Let $Z_1, \dots, Z_N$ be an i.i.d. sample from a distribution with density $q$. Assume that $P$ has a density function $p$ such that $\sup_{x \in \mathcal{X}} p(x)/q(x) < \infty$. Let $\bar{X}_1, \dots, \bar{X}_\ell$ be samples given by algorithm 4 applied to $\hat{m}_P$ with candidate samples $Z_1, \dots, Z_N$. Then for $\bar{m}_P = \frac{1}{\ell}\sum_{i=1}^{\ell} k(\cdot, \bar{X}_i)$, we have

$$\|\bar{m}_P - m_P\|^2_{\mathcal{H}_\mathcal{X}} = \left( \|\hat{m}_P - m_P\|_{\mathcal{H}_\mathcal{X}} + O_p(N^{-1/2}) \right)^2 + O\left(\frac{\ln \ell}{\ell}\right) \quad (N, \ell \to \infty). \qquad (5.9)$$

Our proof in appendix B relies on the fact that kernel herding can be seen as the Frank-Wolfe optimization method (Bach et al., 2012). Indeed, the error $O(\ln \ell/\ell)$ in equation 5.9 comes from the optimization error of the Frank-Wolfe method after $\ell$ iterations (Freund & Grigas, 2014, bound 3.2). The error $O_p(N^{-1/2})$ is due to the approximation of the solution space by a finite set $Z_1, \dots, Z_N$. These errors will be small if $N$ and $\ell$ are large enough and the error of the given estimator $\|\hat{m}_P - m_P\|_{\mathcal{H}_\mathcal{X}}$ is relatively large. This is formally stated in corollary 1 below.

Theorem 2 assumes that the candidate samples are i.i.d. with a density $q$. The assumption $\sup_{x \in \mathcal{X}} p(x)/q(x) < \infty$ requires that the support of $q$ contain that of $p$. This is a formal characterization of the explanation in section 4.2 that the samples $X_1, \dots, X_N$ should cover the support of $P$ sufficiently. Note that the statement of theorem 2 also holds for non-i.i.d. candidate samples, as shown in theorem 3 of appendix B.

5.3.3 Convergence Rates as $\hat{m}_P$ Goes to $m_P$. Theorem 2 provides convergence rates when the estimator $\hat{m}_P$ is fixed. In corollary 1 below, we let $\hat{m}_P$ approach $m_P$ and provide convergence rates for $\bar{m}_P$ of algorithm 4 approaching $m_P$. This corollary directly follows from theorem 2, since the constant terms in $O_p(N^{-1/2})$ and $O(\ln \ell/\ell)$ in equation 5.9 do not depend on $\hat{m}_P$, which can be seen from the proof in appendix B.

Corollary 1. Assume that $P$ and $Z_1, \dots, Z_N$ satisfy the conditions in theorem 2 for all $N$. Let $\hat{m}^{(n)}_P$ be an estimator of $m_P$ such that $\|\hat{m}^{(n)}_P - m_P\|_{\mathcal{H}_\mathcal{X}} = O_p(n^{-b})$ as $n \to \infty$ for some constant $b > 0$ (here the estimator $\hat{m}^{(n)}_P$ and the candidate samples $Z_1, \dots, Z_N$ can be dependent). Let $N = \ell = n^{2b}$. Let $\bar{X}^{(n)}_1, \dots, \bar{X}^{(n)}_\ell$ be samples given by algorithm 4 applied to $\hat{m}^{(n)}_P$ with candidate samples $Z_1, \dots, Z_N$. Then for $\bar{m}^{(n)}_P = \frac{1}{\ell}\sum_{i=1}^{\ell} k_\mathcal{X}(\cdot, \bar{X}^{(n)}_i)$, we have

$$\|\bar{m}^{(n)}_P - m_P\|_{\mathcal{H}_\mathcal{X}} = O_p(n^{-b}) \quad (n \to \infty). \qquad (5.10)$$

Corollary 1 assumes that the estimator $\hat{m}^{(n)}_P$ converges to $m_P$ at a rate $O_p(n^{-b})$ for some constant $b > 0$. Then the resulting estimator $\bar{m}^{(n)}_P$ by algorithm 4 also converges to $m_P$ at the same rate $O(n^{-b})$, if we set $N = \ell = n^{2b}$. This implies that if we use sufficiently large $N$ and $\ell$, the errors $O_p(N^{-1/2})$ and $O(\ln \ell/\ell)$ in equation 5.9 can be negligible, as stated earlier. Note that $N = \ell = n^{2b}$ implies that $N$ and $\ell$ can be smaller than $n$, since typically we have $b \le 1/2$ ($b = 1/2$ corresponds to the convergence rates of parametric models). This provides support for the discussion in section 5.2 (reduction of computational cost).

5.3.4 Convergence Rates of Sampling after Resampling. We can derive convergence rates of the estimator $\bar{m}_Q$, equation 5.7, in section 5.2. Here we consider the following construction of $\bar{m}_Q$, as discussed in section 5.2 (reduction of computational cost). First, apply algorithm 4 to $\hat{m}^{(n)}_P$ and obtain resampling samples $\bar{X}^{(n)}_1, \dots, \bar{X}^{(n)}_\ell \in \{Z_1, \dots, Z_N\}$. Then copy these samples $n/\ell$ times, and let $\tilde{X}^{(n)}_1, \dots, \tilde{X}^{(n)}_n$ be the resulting $n$ samples. Finally, sample with the conditional distribution, $X'^{(n)}_i \sim p(\cdot|\tilde{X}^{(n)}_i)$ ($i = 1, \dots, n$), and define

$$\bar{m}^{(n)}_Q = \frac{1}{n} \sum_{i=1}^n k_\mathcal{X}(\cdot, X'^{(n)}_i). \qquad (5.11)$$

The following corollary is a consequence of corollary 1, theorem 1, and the bound, equation 5.8. Note that theorem 1 obtains convergence in expectation, which implies convergence in probability.

Corollary 2. Let $\theta$ be the function defined in theorem 1 and assume $\theta \in \mathcal{H}_\mathcal{X} \otimes \mathcal{H}_\mathcal{X}$. Assume that $P$ and $Z_1, \dots, Z_N$ satisfy the conditions in theorem 2 for all $N$. Let $\hat{m}^{(n)}_P$ be an estimator of $m_P$ such that $\|\hat{m}^{(n)}_P - m_P\|_{\mathcal{H}_\mathcal{X}} = O_p(n^{-b})$ as $n \to \infty$ for some constant $b > 0$. Let $N = \ell = n^{2b}$. Then for the estimator $\bar{m}^{(n)}_Q$ defined as equation 5.11, we have

$$\|\bar{m}^{(n)}_Q - m_Q\|_{\mathcal{H}_\mathcal{X}} = O_p(n^{-\min(b, 1/2)}) \quad (n \to \infty).$$

Suppose $b \le 1/2$, which holds with basically any nonparametric estimator. Then corollary 2 shows that the estimator $\bar{m}^{(n)}_Q$ achieves the same convergence rate as the input estimator $\hat{m}^{(n)}_P$. Note that without resampling, the rate becomes $O_p(\sqrt{\sum_{i=1}^n (w^{(n)}_i)^2} + n^{-b})$, where the weights are given by the input estimator $\hat{m}^{(n)}_P = \sum_{i=1}^n w^{(n)}_i k_\mathcal{X}(\cdot, X^{(n)}_i)$ (see the bound, equation 5.5). Thanks to resampling, (the square root of) the sum of the squared weights in the case of corollary 2 becomes $1/\sqrt{n}$, which is usually smaller than $\sqrt{\sum_{i=1}^n (w^{(n)}_i)^2}$ and is faster than or equal to $O_p(n^{-b})$. This shows the merit of resampling in terms of convergence rates (see also the discussions in section 5.2).

5.4 Consistency of the Overall Procedure. Here we show the consistency of the overall procedure in KMCF. This is based on corollary 2, which shows the consistency of the resampling step followed by the prediction step, and on theorem 5 of Fukumizu et al. (2013), which guarantees the consistency of kernel Bayes' rule in the correction step. Thus we consider three steps in the following order: resampling, prediction, and correction. More specifically, we show the consistency of the estimator, equation 4.6, of the posterior kernel mean at time $t$, given that the one at time $t-1$ is consistent.

To state our assumptions, we will need the following functions, $\theta_{\mathrm{pos}}: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, $\theta_{\mathrm{obs}}: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, and $\theta_{\mathrm{tra}}: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$:

$$\theta_{\mathrm{pos}}(y, \tilde{y}) = \int\int k_\mathcal{X}(x_t, \tilde{x}_t) \, dp(x_t|y_{1:t-1}, y_t = y) \, dp(\tilde{x}_t|y_{1:t-1}, y_t = \tilde{y}), \qquad (5.12)$$
$$\theta_{\mathrm{obs}}(x, \tilde{x}) = \int\int k_\mathcal{Y}(y_t, \tilde{y}_t) \, dp(y_t|x_t = x) \, dp(\tilde{y}_t|x_t = \tilde{x}), \qquad (5.13)$$
$$\theta_{\mathrm{tra}}(x, \tilde{x}) = \int\int k_\mathcal{X}(x_t, \tilde{x}_t) \, dp(x_t|x_{t-1} = x) \, dp(\tilde{x}_t|x_{t-1} = \tilde{x}). \qquad (5.14)$$

These functions contain the information concerning the distributions involved. In equation 5.12, the distribution $p(x_t|y_{1:t-1}, y_t = y)$ denotes the posterior of the state at time $t$ given that the observation at time $t$ is $y_t = y$; similarly, $p(\tilde{x}_t|y_{1:t-1}, y_t = \tilde{y})$ is the posterior at time $t$ given that the observation is $y_t = \tilde{y}$. In equation 5.13, the distributions $p(y_t|x_t = x)$ and $p(\tilde{y}_t|x_t = \tilde{x})$ denote the observation model when the state is $x_t = x$ or $x_t = \tilde{x}$, respectively. In equation 5.14, the distributions $p(x_t|x_{t-1} = x)$ and $p(\tilde{x}_t|x_{t-1} = \tilde{x})$ denote the transition model with the previous state given by $x_{t-1} = x$ or $x_{t-1} = \tilde{x}$, respectively.

For simplicity of presentation, we consider here $N = \ell = n$ for the resampling step. Below, denote by $\mathcal{F} \otimes \mathcal{G}$ the tensor product space of two RKHSs $\mathcal{F}$ and $\mathcal{G}$.

Corollary 3. Let $(X_1, Y_1), \dots, (X_n, Y_n)$ be an i.i.d. sample with a joint density $p(x, y) = p(y|x)q(x)$, where $p(y|x)$ is the observation model. Assume that the posterior $p(x_t|y_{1:t})$ has a density $p$ and that $\sup_{x \in \mathcal{X}} p(x)/q(x) < \infty$. Assume that the functions defined by equations 5.12 to 5.14 satisfy $\theta_{\mathrm{pos}} \in \mathcal{H}_\mathcal{Y} \otimes \mathcal{H}_\mathcal{Y}$, $\theta_{\mathrm{obs}} \in \mathcal{H}_\mathcal{X} \otimes \mathcal{H}_\mathcal{X}$, and $\theta_{\mathrm{tra}} \in \mathcal{H}_\mathcal{X} \otimes \mathcal{H}_\mathcal{X}$, respectively. Suppose that $\|\hat{m}_{x_{t-1}|y_{1:t-1}} - m_{x_{t-1}|y_{1:t-1}}\|_{\mathcal{H}_\mathcal{X}} \to 0$ as $n \to \infty$ in probability. Then, for any sufficiently slow decay of the regularization constants $\varepsilon_n$ and $\delta_n$ of algorithm 1, we have

$$\|\hat{m}_{x_t|y_{1:t}} - m_{x_t|y_{1:t}}\|_{\mathcal{H}_\mathcal{X}} \to 0 \quad (n \to \infty)$$

in probability.

Corollary 3 follows from theorem 5 of Fukumizu et al. (2013) and corollary 2. The assumptions $\theta_{\mathrm{pos}} \in \mathcal{H}_\mathcal{Y} \otimes \mathcal{H}_\mathcal{Y}$ and $\theta_{\mathrm{obs}} \in \mathcal{H}_\mathcal{X} \otimes \mathcal{H}_\mathcal{X}$ are due to theorem 5 of Fukumizu et al. (2013) for the correction step, while the assumption $\theta_{\mathrm{tra}} \in \mathcal{H}_\mathcal{X} \otimes \mathcal{H}_\mathcal{X}$ is due to theorem 1 for the prediction step, from which corollary 2 follows. As we discussed in the note on theorem 1 in section 5.1, these essentially assume that the functions $\theta_{\mathrm{pos}}$, $\theta_{\mathrm{obs}}$, and $\theta_{\mathrm{tra}}$ are smooth. Theorem 5 of Fukumizu et al. (2013) also requires that the regularization constants $\varepsilon_n, \delta_n$ of kernel Bayes' rule decay sufficiently slowly as the sample size goes to infinity ($\varepsilon_n, \delta_n \to 0$ as $n \to \infty$). (For details, see sections 5.2 and 6.2 in Fukumizu et al., 2013.)

It would be more interesting to investigate the convergence rates of the overall procedure. However, this requires a refined theoretical analysis of kernel Bayes' rule, which is beyond the scope of this letter. This is because currently there is no theoretical result on convergence rates of kernel Bayes' rule as an estimator of a posterior kernel mean (existing convergence results are for the expectation of function values; see theorems 6 and 7 in Fukumizu et al., 2013). This remains a topic for future research.

6 Experiments

This section is devoted to experiments. In section 6.1, we conduct basic experiments on the prediction and resampling steps before going on to the filtering problem; here we consider the problem described in section 5. In section 6.2, the proposed KMCF (see algorithm 3) is applied to synthetic state-space models. Comparisons are made with existing methods applicable to the setting of this letter (see also section 2). In section 6.3, we apply KMCF to the real problem of vision-based robot localization.

In the following, $N(\mu, \sigma^2)$ denotes the gaussian distribution with mean $\mu \in \mathbb{R}$ and variance $\sigma^2 > 0$.

6.1 Sampling and Resampling Procedures. The purpose here is to see how the prediction and resampling steps work empirically. To this end, we consider the problem described in section 5 with $\mathcal{X} = \mathbb{R}$ (see section 5.1 for details). Specifications of the problem are described below.

We will need to evaluate the errors $\|\hat{m}_P - m_P\|_{\mathcal{H}_\mathcal{X}}$ and $\|\hat{m}_Q - m_Q\|_{\mathcal{H}_\mathcal{X}}$, so we need to know the true kernel means $m_P$ and $m_Q$. To this end, we define the distributions and the kernel to be gaussian; this allows us to obtain analytic expressions for $m_P$ and $m_Q$.

6.1.1 Distributions and Kernel. More specifically, we define the marginal $P$ and the conditional distribution $p(\cdot|x)$ to be gaussian: $P = N(0, \sigma_P^2)$ and $p(\cdot|x) = N(x, \sigma_{\mathrm{cond}}^2)$. Then the resulting $Q = \int p(\cdot|x) \, dP(x)$ also becomes gaussian: $Q = N(0, \sigma_P^2 + \sigma_{\mathrm{cond}}^2)$. We define $k_\mathcal{X}$ to be the gaussian kernel, $k_\mathcal{X}(x, x') = \exp(-(x - x')^2/2\gamma^2)$. We set $\sigma_P = \sigma_{\mathrm{cond}} = \gamma = 0.1$.

6.1.2 Kernel Means. Due to the convolution theorem of gaussian functions, the kernel means $m_P = \int k_\mathcal{X}(\cdot, x) \, dP(x)$ and $m_Q = \int k_\mathcal{X}(\cdot, x) \, dQ(x)$ can be analytically computed:

$$m_P(x) = \sqrt{\frac{\gamma^2}{\sigma_P^2 + \gamma^2}} \exp\left( -\frac{x^2}{2(\gamma^2 + \sigma_P^2)} \right), \qquad m_Q(x) = \sqrt{\frac{\gamma^2}{\sigma_P^2 + \sigma_{\mathrm{cond}}^2 + \gamma^2}} \exp\left( -\frac{x^2}{2(\sigma_P^2 + \sigma_{\mathrm{cond}}^2 + \gamma^2)} \right).$$

6.1.3 Empirical Estimates. We artificially defined an estimate $\hat{m}_P = \sum_{i=1}^n w_i k_\mathcal{X}(\cdot, X_i)$ as follows. First, we generated $n = 100$ samples $X_1, \dots, X_{100}$ from a uniform distribution on $[-A, A]$ with some $A > 0$ (specified below). We computed the weights $w_1, \dots, w_n$ by solving an optimization problem,

$$\min_{w \in \mathbb{R}^n} \left\| \sum_{i=1}^n w_i k_\mathcal{X}(\cdot, X_i) - m_P \right\|^2_{\mathcal{H}} + \lambda \|w\|^2,$$

and then applied normalization so that $\sum_{i=1}^n w_i = 1$. Here $\lambda > 0$ is a regularization constant, which allows us to control the trade-off between the error $\|\hat{m}_P - m_P\|^2_{\mathcal{H}_\mathcal{X}}$ and the quantity $\sum_{i=1}^n w_i^2 = \|w\|^2$. If $\lambda$ is very small, the resulting $\hat{m}_P$ becomes accurate ($\|\hat{m}_P - m_P\|^2_{\mathcal{H}_\mathcal{X}}$ is small) but has large $\sum_{i=1}^n w_i^2$. If $\lambda$ is large, the error $\|\hat{m}_P - m_P\|^2_{\mathcal{H}_\mathcal{X}}$ may not be very small, but $\sum_{i=1}^n w_i^2$ becomes small. This enables us to see how the error $\|\hat{m}_Q - m_Q\|^2_{\mathcal{H}_\mathcal{X}}$ changes as we vary these quantities.
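A minimal sketch (ours) of this construction, assuming the standard closed form of the regularized least-squares problem: the objective above is minimized by $w = (G + \lambda I)^{-1} g$, where $G$ is the Gram matrix and $g_i = m_P(X_i)$, which is available analytically here (section 6.1.2); the function name and the sample sizes are illustrative.

    import numpy as np

    def make_weights(X, lam, sigma_P=0.1, gamma=0.1):
        G = np.exp(-(X[:, None] - X[None, :]) ** 2 / (2 * gamma ** 2))   # Gram matrix
        # analytic m_P(X_i) from section 6.1.2
        g = np.sqrt(gamma ** 2 / (sigma_P ** 2 + gamma ** 2)) * \
            np.exp(-X ** 2 / (2 * (gamma ** 2 + sigma_P ** 2)))
        w = np.linalg.solve(G + lam * np.eye(len(X)), g)                 # ridge solution
        return w / np.sum(w)                                             # normalize to sum to 1

    X = np.random.default_rng(0).uniform(-1.0, 1.0, size=100)           # A = 1 here
    w_small_lam, w_large_lam = make_weights(X, 1e-8), make_weights(X, 1e-2)
    print(np.sum(w_small_lam ** 2), np.sum(w_large_lam ** 2))           # trade-off in sum of squares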

6.1.4 Comparison. Given $\hat{m}_P = \sum_{i=1}^n w_i k_\mathcal{X}(\cdot, X_i)$, we wish to estimate the kernel mean $m_Q$. We compare three estimators:

- woRes: Estimate $m_Q$ without resampling. Generate samples $X'_i \sim p(\cdot|X_i)$ to produce the estimate $\hat{m}_Q = \sum_{i=1}^n w_i k_\mathcal{X}(\cdot, X'_i)$. This corresponds to the estimator discussed in section 5.1.
- Res-KH: First apply the resampling algorithm of algorithm 2 to $\hat{m}_P$, yielding $\bar{X}_1, \dots, \bar{X}_n$. Then generate $X'_i \sim p(\cdot|\bar{X}_i)$ for each $\bar{X}_i$, giving the estimate $\bar{m}_Q = \frac{1}{n}\sum_{i=1}^n k(\cdot, X'_i)$. This is the estimator discussed in section 5.2.
- Res-Trunc: Instead of algorithm 2, first truncate the negative weights in $w_1, \dots, w_n$ to be 0, and apply normalization to make the sum of the weights 1. Then apply the multinomial resampling algorithm of particle methods, and estimate $m_Q$ as in Res-KH.

Figure 4: Results of the experiments from section 6.1. (Top left and right) Sample-weight pairs of $\hat{m}_P = \sum_{i=1}^n w_i k_\mathcal{X}(\cdot, X_i)$ and $\hat{m}_Q = \sum_{i=1}^n w_i k(\cdot, X'_i)$. (Middle left and right) Histogram of samples $\bar{X}_1, \dots, \bar{X}_n$ generated by algorithm 2 and that of samples $X'_1, \dots, X'_n$ from the conditional distribution. (Bottom left and right) Histogram of samples generated with multinomial resampling after truncating negative weights and that of samples from the conditional distribution.

6.1.5 Demonstration. Before starting quantitative comparisons, we demonstrate how the above estimators work. Figure 4 shows demonstration results with $A = 1$. First, note that for $\hat{m}_P = \sum_{i=1}^n w_i k(\cdot, X_i)$, the samples associated with large weights are located around the mean of $P$, as the standard deviation of $P$ is relatively small ($\sigma_P = 0.1$). Note also that some of the weights are negative. In this example, the error of $\hat{m}_P$ is very small, $\|\hat{m}_P - m_P\|^2_{\mathcal{H}_\mathcal{X}} = 8.49\mathrm{e}{-10}$, while that of the estimate $\hat{m}_Q$ given by woRes is $\|\hat{m}_Q - m_Q\|^2_{\mathcal{H}_\mathcal{X}} = 0.125$. This shows that even if $\|\hat{m}_P - m_P\|^2_{\mathcal{H}_\mathcal{X}}$ is very small, the resulting $\|\hat{m}_Q - m_Q\|^2_{\mathcal{H}_\mathcal{X}}$ may not be small, as implied by theorem 1 and the bound, equation 5.5.

We can observe the following. First, algorithm 2 successfully discarded samples associated with very small weights. Almost all the generated samples $\bar{X}_1, \dots, \bar{X}_n$ are located in $[-2\sigma_P, 2\sigma_P]$, where $\sigma_P$ is the standard deviation of $P$. The error is $\|\bar{m}_P - m_P\|^2_{\mathcal{H}_\mathcal{X}} = 4.74\mathrm{e}{-5}$, which is greater than $\|\hat{m}_P - m_P\|^2_{\mathcal{H}_\mathcal{X}}$. This is due to the additional error caused by the resampling algorithm. Note that the resulting estimate $\bar{m}_Q$ has error $\|\bar{m}_Q - m_Q\|^2_{\mathcal{H}_\mathcal{X}} = 0.00827$. This is much smaller than that of the estimate $\hat{m}_Q$ by woRes, showing the merit of the resampling algorithm.

Res-Trunc first truncated the negative weights in $w_1, \dots, w_n$. Let us look at the region where the density of $P$ is very small, that is, the region outside $[-2\sigma_P, 2\sigma_P]$. We can observe that the absolute values of the weights are very small in this region. Note that there exist positive and negative weights. These weights maintain balance such that the amounts of positive and negative values are almost the same. Therefore, the truncation of the negative weights breaks this balance. As a result, the amount of the positive weights surpasses the amount needed to represent the density of $P$. This can be seen from the histogram for Res-Trunc: some of the samples $\bar{X}_1, \dots, \bar{X}_n$ generated by Res-Trunc are located in the region where the density of $P$ is very small. Thus, the resulting error $\|\bar{m}_P - m_P\|^2_{\mathcal{H}_\mathcal{X}} = 0.0538$ is much larger than that of Res-KH. This demonstrates why the resampling algorithm of particle methods is not appropriate for kernel mean embeddings, as discussed in section 4.2.

6.1.6 Effects of the Sum of Squared Weights. The purpose here is to see how the error $\|\hat{m}_Q - m_Q\|^2_{\mathcal{H}_\mathcal{X}}$ changes as we vary the quantity $\sum_{i=1}^n w_i^2$ (recall that the bound, equation 5.5, indicates that $\|\hat{m}_Q - m_Q\|^2_{\mathcal{H}_\mathcal{X}}$ increases as $\sum_{i=1}^n w_i^2$ increases). To this end, we made $\hat{m}_P = \sum_{i=1}^n w_i k_\mathcal{X}(\cdot, X_i)$ for several values of the regularization constant $\lambda$, as described above. For each $\lambda$, we constructed $\hat{m}_P$ and estimated $m_Q$ using each of the three estimators above. We repeated this 20 times for each $\lambda$ and averaged the values of $\|\hat{m}_P - m_P\|^2_{\mathcal{H}_\mathcal{X}}$, $\sum_{i=1}^n w_i^2$, and the errors $\|\hat{m}_Q - m_Q\|^2_{\mathcal{H}_\mathcal{X}}$ by the three estimators. Figure 5 shows these results, where both axes are in log scale. Here we used $A = 5$ for the support of the uniform distribution (this enables us to maintain the values of $\|\hat{m}_P - m_P\|^2_{\mathcal{H}_\mathcal{X}}$ at almost the same level while changing the values of $\sum_{i=1}^n w_i^2$).

Figure 5: Results of synthetic experiments for the sampling and resampling procedure in section 6.1. Vertical axis: errors in the squared RKHS norm. Horizontal axis: values of $\sum_{i=1}^n w_i^2$ for different $\hat{m}_P$. Black: the error of $\hat{m}_P$ ($\|\hat{m}_P - m_P\|^2_{\mathcal{H}_\mathcal{X}}$). Blue, green, and red: the errors on $m_Q$ by woRes, Res-KH, and Res-Trunc, respectively.

The results are summarized as follows:

- The error of woRes (blue) increases proportionally to the amount of $\sum_{i=1}^n w_i^2$. This matches the bound, equation 5.5.
- The error of Res-KH is not affected by $\sum_{i=1}^n w_i^2$. Rather, it changes in parallel with the error of $\hat{m}_P$. This is explained by the discussions in section 5.2 on how our resampling algorithm improves the accuracy of the sampling procedure.
- Res-Trunc is worse than Res-KH, especially for large $\sum_{i=1}^n w_i^2$. This is also explained by the bound, equation 5.8. Here $\bar{m}_P$ is the one given by Res-Trunc, so the error $\|\bar{m}_P - m_P\|_{\mathcal{H}_\mathcal{X}}$ can be large due to the truncation of negative weights, as shown in the demonstration results. This makes the resulting error $\|\bar{m}_Q - m_Q\|_{\mathcal{H}_\mathcal{X}}$ large.

Note that $m_P$ and $m_Q$ are different kernel means, so it can happen that the errors $\|\bar{m}_Q - m_Q\|_{\mathcal{H}_\mathcal{X}}$ by Res-KH are less than $\|\hat{m}_P - m_P\|_{\mathcal{H}_\mathcal{X}}$, as in Figure 5.


6.2 Filtering with Synthetic State-Space Models. Here we apply KMCF to synthetic state-space models. Comparisons were made with the following methods.

kNN-PF (Vlassis et al., 2002). This method uses k-NN-based conditional density estimation (Stone, 1977) for learning the observation model. First, it estimates the conditional density of the inverse direction, $p(x|y)$, from the training sample $(X_i, Y_i)$. The learned conditional density is then used as an alternative for the likelihood $p(y_t|x_t)$; this is a heuristic to deal with high-dimensional $y_t$. Then it applies a particle filter (PF) based on the approximated observation model and the given transition model $p(x_t|x_{t-1})$.

GP-PF (Ferris et al., 2006). This method learns $p(y_t|x_t)$ from $(X_i, Y_i)$ with gaussian process (GP) regression. Then a particle filter is applied based on the learned observation model and the transition model. We used the open source code for GP regression (http://www.gaussianprocess.org/gpml/code/matlab/doc) in this experiment, so comparison in computational time is omitted for this method.

KBR filter (Fukumizu et al., 2011, 2013). This method is also based on kernel mean embeddings, as is KMCF. It applies kernel Bayes' rule (KBR) in posterior estimation, using the joint sample $(X_i, Y_i)$. This method assumes that there also exist training samples for the transition model. Thus, in the following experiments, we additionally drew training samples for the transition model. Fukumizu et al. (2011, 2013) showed that this method outperforms extended and unscented Kalman filters when a state-space model has strong nonlinearity (in that experiment, these Kalman filters were given the full knowledge of the state-space model). We use this method as a baseline.

We used the state-space models defined in Table 2, where "SSM" stands for "state-space model." In Table 2, $u_t$ denotes a control input at time $t$; $v_t$ and $w_t$ denote independent gaussian noise, $v_t, w_t \sim N(0, 1)$; and $W_t$ denotes 10-dimensional gaussian noise, $W_t \sim N(0, I_{10})$. We generated each control $u_t$ randomly from the gaussian distribution $N(0, 1)$.

Table 2: State-Space Models for Synthetic Experiments.

    SSM 1a: transition x_t = 0.9 x_{t-1} + v_t;                    observation y_t = x_t + w_t
    SSM 1b: transition x_t = 0.9 x_{t-1} + (1/sqrt(2))(u_t + v_t); observation y_t = x_t + w_t
    SSM 2a: transition x_t = 0.9 x_{t-1} + v_t;                    observation y_t = 0.5 exp(x_t/2) w_t
    SSM 2b: transition x_t = 0.9 x_{t-1} + (1/sqrt(2))(u_t + v_t); observation y_t = 0.5 exp(x_t/2) w_t
    SSM 3a: transition x_t = 0.9 x_{t-1} + v_t;                    observation y_t = 0.5 exp(x_t/2) W_t
    SSM 3b: transition x_t = 0.9 x_{t-1} + (1/sqrt(2))(u_t + v_t); observation y_t = 0.5 exp(x_t/2) W_t
    SSM 4a: transition a_t = x_{t-1} + sqrt(2) v_t, x_t = a_t if |a_t| <= 3, x_t = -3 otherwise;
            observation b_t = x_t + w_t, y_t = b_t if |b_t| <= 3, y_t = b_t - 6 b_t/|b_t| otherwise
    SSM 4b: transition a_t = x_{t-1} + u_t + v_t, x_t = a_t if |a_t| <= 3, x_t = -3 otherwise;
            observation b_t = x_t + w_t, y_t = b_t if |b_t| <= 3, y_t = b_t - 6 b_t/|b_t| otherwise

The state and observation spaces for SSMs 1a, 1b, 2a, 2b, 4a, 4b are defined as $\mathcal{X} = \mathcal{Y} = \mathbb{R}$; for SSMs 3a, 3b, $\mathcal{X} = \mathbb{R}$, $\mathcal{Y} = \mathbb{R}^{10}$. The models in SSMs 1a, 2a, 3a, 4a and SSMs 1b, 2b, 3b, 4b with the same number (e.g., 1a and 1b) are almost the same; the difference is whether $u_t$ exists in the transition model. Prior distributions for the initial state $x_1$ for SSMs 1a, 1b, 2a, 2b, 3a, 3b are defined as $p_{\mathrm{init}} = N(0, 1/(1 - 0.9^2))$, and those for 4a, 4b are defined as a uniform distribution on $[-3, 3]$.

SSMs 1a and 1b are linear gaussian models. SSMs 2a and 2b are the so-called stochastic volatility models. Their transition models are the same as those of SSMs 1a and 1b. The observation model has strong nonlinearity, and the noise $w_t$ is multiplicative. SSMs 3a and 3b are almost the same as SSMs 2a and 2b; the difference is that the observation $y_t$ is 10-dimensional, as $W_t$ is 10-dimensional gaussian noise. SSMs 4a and 4b are more complex than the other models. Both the transition and observation models have strong nonlinearities: states and observations located around the edges of the interval $[-3, 3]$ may abruptly jump to distant places.
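For example, a minimal sketch (ours) of simulating SSM 2a from Table 2 to produce both training pairs and a test sequence; the sample sizes and names are illustrative.

    import numpy as np

    def simulate_ssm2a(T, rng):
        # SSM 2a: x_t = 0.9 x_{t-1} + v_t,  y_t = 0.5 exp(x_t / 2) w_t,  v_t, w_t ~ N(0, 1)
        x = rng.normal(0.0, np.sqrt(1.0 / (1.0 - 0.9 ** 2)))   # initial state from p_init
        xs, ys = [], []
        for _ in range(T):
            x = 0.9 * x + rng.normal()
            ys.append(0.5 * np.exp(x / 2.0) * rng.normal())
            xs.append(x)
        return np.array(xs), np.array(ys)

    rng = np.random.default_rng(0)
    X_train, Y_train = simulate_ssm2a(1000, rng)   # state-observation examples (X_i, Y_i)
    x_test, y_test = simulate_ssm2a(100, rng)      # test sequence of length T = 100 (states hidden)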

For each model, we generated the training samples $(X_i, Y_i)_{i=1}^n$ by simulating the model. Test data $(x_t, y_t)_{t=1}^T$ were also generated by independent simulation (recall that $x_t$ is hidden for each method). The length of the test sequence was set as $T = 100$. We fixed the number of particles in kNN-PF and GP-PF to 5000; in primary experiments, we did not observe any improvements even when more particles were used. For the same reason, we fixed the size of transition examples for the KBR filter to 1000. Each method estimated the ground-truth states $x_1, \dots, x_T$ by estimating the posterior means $\int x_t \, p(x_t|y_{1:t}) \, dx_t$ ($t = 1, \dots, T$). The performance was evaluated with the RMSE (root mean squared error) of the point estimates, defined as

$$\mathrm{RMSE} = \sqrt{\frac{1}{T} \sum_{t=1}^T (x_t - \hat{x}_t)^2},$$

where $\hat{x}_t$ is the point estimate.

For KMCF and the KBR filter, we used gaussian kernels for each of $\mathcal{X}$ and $\mathcal{Y}$ (and also for controls in the KBR filter). We determined the hyperparameters of each method by two-fold cross-validation, dividing the training data into two sequences. The hyperparameters in the GP regressor for GP-PF were optimized by maximizing the marginal likelihood of the training data. To reduce the costs of the resampling step of KMCF, we used the method discussed in section 5.2 with $\ell = 50$. We also used the low-rank approximation method (see algorithm 5) and the subsampling method (see algorithm 6) in appendix C to reduce the computational costs of KMCF. Specifically, we


Figure 6: RMSE of the synthetic experiments in section 6.2. The state-space models of these figures have no control in their transition models.

Specifically, we used r = 10, 20 (the rank of the low-rank matrices) for algorithm 5 (described as KMCF-low10 and KMCF-low20 in the results below) and r = 50, 100 (the number of subsamples) for algorithm 6 (described as KMCF-sub50 and KMCF-sub100). We repeated the experiments 20 times for each different training sample size n.
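To make the data generation and evaluation above concrete, the following is a minimal sketch (ours, not code from the letter) that simulates SSM 2a to produce training pairs {(X_i, Y_i)} and a test sequence, and computes the RMSE of a sequence of point estimates; the function names and the use of NumPy are our own choices.

```python
import numpy as np

def simulate_ssm2a(T, rng):
    """Simulate SSM 2a: x_t = 0.9 x_{t-1} + v_t,  y_t = 0.5 exp(x_t / 2) w_t."""
    x = np.empty(T)
    y = np.empty(T)
    # initial state drawn from the prior p_init = N(0, 1 / (1 - 0.9^2))
    x_prev = rng.normal(0.0, np.sqrt(1.0 / (1.0 - 0.9 ** 2)))
    for t in range(T):
        x[t] = 0.9 * x_prev + rng.normal()              # transition noise v_t ~ N(0, 1)
        y[t] = 0.5 * np.exp(x[t] / 2.0) * rng.normal()  # observation noise w_t ~ N(0, 1)
        x_prev = x[t]
    return x, y

def rmse(x_true, x_est):
    """Root mean squared error of point estimates, as defined above."""
    return np.sqrt(np.mean((np.asarray(x_true) - np.asarray(x_est)) ** 2))

rng = np.random.default_rng(0)
X_train, Y_train = simulate_ssm2a(1000, rng)  # state-observation examples {(X_i, Y_i)}
x_test, y_test = simulate_ssm2a(100, rng)     # test sequence; x_test is hidden at test time
```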

Figure 6 shows the results in RMSE for SSMs 1a, 2a, 3a, and 4a, and Figure 7 shows those for SSMs 1b, 2b, 3b, and 4b. Figure 8 describes the results in computation time for SSMs 1a and 1b; the results for the other models are similar, so we omit them. We do not show the results of KMCF-low10 in Figures 6 and 7, since they were numerically unstable and gave very large RMSEs.

GP-PF performed the best for SSMs 1a and 1b. This may be because these models fit the assumptions of GP regression, as their noise is additive gaussian. For the other models, however, GP-PF performed poorly.


Figure 7: RMSE of the synthetic experiments in section 6.2. The state-space models of these figures include the control u_t in their transition models.

Figure 8: Computation time of the synthetic experiments in section 6.2. (Left) SSM 1a. (Right) SSM 1b.


The observation models of these models have strong nonlinearities, and the noise is not additive gaussian. For these models, KMCF performed the best or competitively with the other methods. This indicates that KMCF successfully exploits the state-observation examples {(X_i, Y_i)}_{i=1}^n in dealing with the complicated observation models. Recall that our focus has been on situations where the relations between states and observations are so complicated that the observation model is not known; the results indicate that KMCF is promising for such situations. On the other hand, the KBR filter performed worse than KMCF for most of the models. The KBR filter also uses kernel Bayes' rule, as does KMCF; the difference is that KMCF makes use of the transition models directly by sampling, while the KBR filter must learn the transition models from training data for state transitions. This indicates that the incorporation of the knowledge expressed in the transition model is very important for the filtering performance. This can also be seen by comparing Figures 6 and 7. The performance of the methods other than the KBR filter improved for SSMs 1b, 2b, 3b, and 4b, compared to the performance for the corresponding models in SSMs 1a, 2a, 3a, and 4a. Recall that SSMs 1b, 2b, 3b, and 4b include the control u_t in their transition models. The information of the control input is helpful for filtering in general. Thus, the improvements suggest that KMCF, kNN-PF, and GP-PF successfully incorporate the information of the controls; they achieve this by sampling with p(x_t | x_{t-1}, u_t). On the other hand, the KBR filter must learn the transition model p(x_t | x_{t-1}, u_t); this can be harder than learning the transition model p(x_t | x_{t-1}), which has no control input.

We next compare computation time (see Figure 8). KMCF was competitive with or even slower than the KBR filter; this is due to the resampling step in KMCF. The speed-up methods (KMCF-low10, KMCF-low20, KMCF-sub50, and KMCF-sub100) successfully reduced the costs of KMCF. KMCF-low10 and KMCF-low20 scaled linearly with the sample size n; this matches the fact that algorithm 5 reduces the costs of kernel Bayes' rule to O(nr^2). The costs of KMCF-sub50 and KMCF-sub100 remained almost the same over the different sample sizes. This is because they reduce the sample size itself from n to r, so the costs are reduced to O(r^3) (see algorithm 6). KMCF-sub50 and KMCF-sub100 are competitive with kNN-PF, which is fast as it only needs kNN searches to deal with the training sample {(X_i, Y_i)}_{i=1}^n. In Figures 6 and 7, KMCF-low20 and KMCF-sub100 produced results competitive with KMCF for SSMs 1a, 2a, 4a, 1b, 2b, and 4b. Thus, for these models, such methods reduce the computational costs of KMCF without losing much accuracy. KMCF-sub50 was slightly worse than KMCF-sub100. This indicates that the number of subsamples cannot be reduced to this extent if we wish to maintain accuracy. For SSMs 3a and 3b, the performance of KMCF-low20 and KMCF-sub100 was worse than that of KMCF, in contrast to the performance for the other models. The difference of SSMs 3a and 3b from the other models is that the observation space is 10-dimensional, Y = R^10. This suggests that if the dimension is high, r needs to be large to maintain accuracy (recall that r is the rank of the low-rank matrices in algorithm 5 and the number of subsamples in algorithm 6). This is also implied by the experiments in section 6.3.

6.3 Vision-Based Mobile Robot Localization. We applied KMCF to the problem of vision-based mobile robot localization (Vlassis et al., 2002; Wolf et al., 2005; Quigley et al., 2010). We consider a robot moving in a building. The robot takes images with its vision camera as it moves; thus, the vision images form a sequence of observations y_1, ..., y_T in time series, each y_t being an image. The robot does not know its position in the building, so we define the state x_t as the robot's position at time t. The robot wishes to estimate its position x_t from the sequence of its vision images y_1, ..., y_t. This can be done by filtering, that is, by estimating the posteriors p(x_t | y_1, ..., y_t) (t = 1, ..., T). This is the robot localization problem. It is fundamental in robotics as a basis for more involved applications such as navigation and reinforcement learning (Thrun, Burgard, & Fox, 2005).

The state-space model is defined as follows. The observation model p(y_t | x_t) is the conditional distribution of images given position, which is very complicated and considered unknown. We need to assume position-image examples {(X_i, Y_i)}_{i=1}^n; these samples are given in the data set described below. The transition model p(x_t | x_{t-1}) = p(x_t | x_{t-1}, u_t) is the conditional distribution of the current position given the previous one. This involves a control input u_t that specifies the movement of the robot. In the data set we use, the control is given as odometry measurements. Thus, we define p(x_t | x_{t-1}, u_t) as the odometry motion model, which is fairly standard in robotics (Thrun et al., 2005). Specifically, we used the algorithm described in table 5.6 of Thrun et al. (2005), with all of its parameters fixed to 0.1. The prior p_init of the initial position x_1 is defined as a uniform distribution over the samples X_1, ..., X_n in {(X_i, Y_i)}_{i=1}^n.

As the kernel k_Y for observations (images), we used the spatial pyramid matching kernel of Lazebnik et al. (2006). This is a positive-definite kernel developed in the computer vision community and is also fairly standard. Specifically, we set the parameters of this kernel as suggested in Lazebnik et al. (2006); this gives a 4200-dimensional histogram for each image. We defined the kernel k_X for states (positions) as gaussian. Here the state space is the four-dimensional space X = R^4: two dimensions for location and the rest for the orientation of the robot.9

The data set we used is the COLD database (Pronobis & Caputo, 2009), which is publicly available. Specifically, we used the data set Freiburg, Part A, Path 1, cloudy. This data set consists of three similar trajectories of a robot moving in a building, each of which provides position-image pairs {(x_t, y_t)}_{t=1}^T.

9. We projected the robot's orientation in [0, 2π] onto the unit circle in R^2.


We used two trajectories for training and validation and the rest for the test. We made state-observation examples {(X_i, Y_i)}_{i=1}^n by randomly subsampling the pairs in the trajectory for training. Note that the difficulty of localization may depend on the time interval (i.e., the interval between t and t - 1 in seconds). Therefore, we made three test sets (and training samples for state transitions in the KBR filter) with different time intervals: 2.27 sec (T = 168), 4.54 sec (T = 84), and 6.81 sec (T = 56).

In these experiments, we compared KMCF with three methods: kNN-PF, the KBR filter, and the naive method (NAI) defined below. For the KBR filter, we also defined a gaussian kernel on the control u_t, that is, on the difference of the odometry measurements at times t - 1 and t. The naive method (NAI) estimates the state x_t as the point X_j in the training set {(X_i, Y_i)} such that the corresponding observation Y_j is closest to the observation y_t; we included this as a baseline. We also used the spatial pyramid matching kernel for these methods (for kNN-PF and NAI, as a similarity measure for the nearest neighbor search). We did not compare with GP-PF, since it assumes that observations are real vectors and thus cannot be applied to this problem straightforwardly. We determined the hyperparameters in each method by cross-validation. To reduce the cost of the resampling step in KMCF, we used the method discussed in section 5.2 with its parameter set to 100. The low-rank approximation method (see algorithm 5) and the subsampling method (see algorithm 6) were also applied to reduce the computational costs of KMCF. Specifically, we set r = 50, 100 for algorithm 5 (described as KMCF-low50 and KMCF-low100 in the results below) and r = 150, 300 for algorithm 6 (KMCF-sub150 and KMCF-sub300).

Note that in this problem, the posteriors p(x_t | y_{1:t}) can be highly multimodal because similar images appear in distant locations. Therefore, the posterior mean \int x_t p(x_t | y_{1:t}) dx_t is not appropriate for point estimation of the ground-truth position x_t. Thus, for KMCF and the KBR filter, we employed the heuristic for mode estimation explained in section 4.4. For kNN-PF, we used the particle with maximum weight as the point estimate. We evaluated the performance of each method by the RMSE of the location estimates. We ran each experiment 20 times for each training set of different size.

6.3.1 Results. First, we demonstrate the behavior of KMCF on this localization problem. Figures 9 and 10 show iterations of KMCF with n = 400 applied to the test data with time interval 6.81 sec. Figure 9 illustrates iterations that produced accurate estimates, while Figure 10 describes situations where location estimation is difficult.

Figures 11 and 12 show the results in RMSE and computation time, respectively. For all the results, KMCF and its variants with the computational reduction methods (KMCF-low50, KMCF-low100, KMCF-sub150, and KMCF-sub300) performed better than the KBR filter. These results show the benefit of directly manipulating the transition models with sampling.


Figure 9: Demonstration results. Each column corresponds to one iteration of KMCF. (Top) Prediction step: histogram of samples for the prior. (Middle) Correction step: weighted samples for the posterior. The blue and red stems indicate positive and negative weights, respectively. The yellow ball represents the ground-truth location x_t, and the green diamond the estimated one \hat{x}_t. (Bottom) Resampling step: histogram of samples given by the resampling step.

KMCF was competitive with kNN-PF for the interval 2.27 sec; note that kNN-PF was originally proposed for the robot localization problem. For the results with the longer time intervals (4.54 sec and 6.81 sec), KMCF outperformed kNN-PF.

We next investigate the effect on KMCF of the methods to reduce computational cost. The performance of KMCF-low100 and KMCF-sub300 is competitive with KMCF; those of KMCF-low50 and KMCF-sub150 degrade as the sample size increases. Note that the values r = 50, 100 for algorithm 5 are larger than those in section 6.2, though the values of the sample size n are also larger than those in section 6.2. Also note that the performance of KMCF-sub150 is much worse than that of KMCF-sub300. These results indicate that we may need large values of r to maintain accuracy for this localization problem. Recall that the spatial pyramid matching kernel gives essentially a high-dimensional feature vector (histogram) for each image.


Figure 10: Demonstration results (see also the caption of Figure 9). Here we show time points where observed images are similar to those in distant places. Such a situation often occurs at corners and makes location estimation difficult. (a) The prior estimate is reasonable, but the resulting posterior has modes in distant places; this makes the location estimate (green diamond) far from the true location (yellow ball). (b) While the location estimate is very accurate, modes also appear at distant locations.

Thus, the observation space Y may be considered high-dimensional. This supports the hypothesis in section 6.2 that if the dimension is high, the computational cost-reduction methods may require a larger r to maintain accuracy.

Finally, let us look at the results in computation time (see Figure 12). The results are similar to those in section 6.2. Although the values of r are relatively large, algorithms 5 and 6 successfully reduced the computational costs of KMCF.

7 Conclusion and Future Work

This letter has proposed the kernel Monte Carlo filter, a novel filtering method for state-space models. We have considered the situation where the observation model is not known explicitly or even parametrically and where examples of the state-observation relation are given instead of the observation model.


Figure 11: RMSE of the robot localization experiments in section 6.3. Panels a, b, and c show the cases for time intervals 2.27 sec, 4.54 sec, and 6.81 sec, respectively.

Our approach is based on the framework of kernel mean embeddings, which enables us to deal with the observation model in a data-driven manner. The resulting filtering method consists of the prediction, correction, and resampling steps, all realized in terms of kernel mean embeddings. The methodological novelties lie in the prediction and resampling steps. Thus, we analyzed their behaviors by deriving error bounds for the estimator of the prediction step. The analysis revealed that the effective sample size of a weighted sample plays an important role, as in particle methods. This analysis also explained how our resampling algorithm works. We applied the proposed method to synthetic and real problems, confirming the effectiveness of our approach.


Figure 12: Computation time of the localization experiments in section 6.3. Panels a, b, and c show the cases for time intervals 2.27 sec, 4.54 sec, and 6.81 sec, respectively. Note that the results show the runtime of each method.

One interesting topic for future research would be parameter estimation for the transition model. We did not discuss this and assumed that parameters, if they exist, are given and fixed. If the state-observation examples {(X_i, Y_i)}_{i=1}^n are given as a sequence from the state-space model, then we can use the state samples X_1, ..., X_n for estimating those parameters. Otherwise, we need to estimate the parameters based on test data. This might be possible by exploiting approaches for parameter estimation in particle methods (e.g., section IV in Cappe et al., 2007).

Another important topic is the situation where the observation model in the test and training phases is different. As discussed in section 4.3, this might be addressed by exploiting the framework of transfer learning (Pan & Yang, 2010). This would require an extension of kernel mean embeddings to the setting of transfer learning, since there has been no work in this direction. We consider that such an extension is interesting in its own right.

Appendix A Proof of Theorem 1

Before going to the proof, we review some basic facts that are needed. Let m_P = \int k_X(\cdot, x) dP(x) and \hat{m}_P = \sum_{i=1}^n w_i k_X(\cdot, X_i). By the reproducing property of the kernel k_X, the following hold for any f \in H_X:

\[
\langle m_P, f \rangle_{H_X} = \Big\langle \int k_X(\cdot, x) dP(x), f \Big\rangle_{H_X}
= \int \langle k_X(\cdot, x), f \rangle_{H_X} dP(x)
= \int f(x) dP(x) = E_{X \sim P}[f(X)], \tag{A.1}
\]
\[
\langle \hat{m}_P, f \rangle_{H_X} = \Big\langle \sum_{i=1}^n w_i k_X(\cdot, X_i), f \Big\rangle_{H_X}
= \sum_{i=1}^n w_i f(X_i). \tag{A.2}
\]

For any f, g \in H_X, we denote by f \otimes g \in H_X \otimes H_X the tensor product of f and g, defined as

\[
f \otimes g(x_1, x_2) = f(x_1) g(x_2), \quad \forall x_1, x_2 \in X. \tag{A.3}
\]

The inner product of the tensor RKHS H_X \otimes H_X satisfies

\[
\langle f_1 \otimes g_1, f_2 \otimes g_2 \rangle_{H_X \otimes H_X}
= \langle f_1, f_2 \rangle_{H_X} \langle g_1, g_2 \rangle_{H_X}, \quad \forall f_1, f_2, g_1, g_2 \in H_X. \tag{A.4}
\]

Let \{\phi_s\}_{s=1}^{I} \subset H_X be complete orthonormal bases of H_X, where I \in \mathbb{N} \cup \{\infty\}. Assume \theta \in H_X \otimes H_X (recall that this is an assumption of theorem 1). Then \theta is expressed as

\[
\theta = \sum_{s,t=1}^{I} \alpha_{st}\, \phi_s \otimes \phi_t, \tag{A.5}
\]

with \sum_{s,t} |\alpha_{st}|^2 < \infty (see Aronszajn, 1950).
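The proofs below repeatedly expand squared RKHS distances between weighted kernel means, as in equation A.6. The following minimal sketch (ours, not from the letter; it assumes a gaussian kernel and NumPy) shows how such a squared distance between two weighted kernel means is computed from Gram matrices via the reproducing property.

```python
import numpy as np

def gauss_gram(A, B, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-d2 / (2.0 * sigma**2))

rng = np.random.default_rng(0)
X, w = rng.normal(size=(50, 2)), rng.normal(size=50)     # weights may be negative, as in KMCF
Z, v = rng.normal(size=(80, 2)), np.full(80, 1.0 / 80)   # e.g., a uniformly weighted sample

# ||sum_i w_i k(., X_i) - sum_j v_j k(., Z_j)||_H^2
#   = w^T K_XX w - 2 w^T K_XZ v + v^T K_ZZ v   (expand and use <k(., x), k(., x')> = k(x, x'))
sq_dist = (w @ gauss_gram(X, X) @ w
           - 2.0 * w @ gauss_gram(X, Z) @ v
           + v @ gauss_gram(Z, Z) @ v)
print(sq_dist)
```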

Proof of Theorem 1. Recall that \hat{m}_Q = \sum_{i=1}^n w_i k_X(\cdot, X'_i), where X'_i \sim p(\cdot | X_i) (i = 1, ..., n). Then

\[
E_{X'_1, ..., X'_n}\big[\| \hat{m}_Q - m_Q \|^2_{H_X}\big]
= E_{X'_1, ..., X'_n}\big[\langle \hat{m}_Q, \hat{m}_Q \rangle_{H_X} - 2 \langle \hat{m}_Q, m_Q \rangle_{H_X} + \langle m_Q, m_Q \rangle_{H_X}\big]
\]
\[
= \sum_{i,j=1}^n w_i w_j E_{X'_i, X'_j}[k_X(X'_i, X'_j)]
- 2 \sum_{i=1}^n w_i E_{X' \sim Q,\, X'_i}[k_X(X', X'_i)]
+ E_{X', \tilde{X}' \sim Q}[k_X(X', \tilde{X}')]
\]
\[
= \sum_{i \neq j} w_i w_j E_{X'_i, X'_j}[k_X(X'_i, X'_j)]
+ \sum_{i=1}^n w_i^2 E_{X'_i}[k_X(X'_i, X'_i)]
- 2 \sum_{i=1}^n w_i E_{X' \sim Q,\, X'_i}[k_X(X', X'_i)]
+ E_{X', \tilde{X}' \sim Q}[k_X(X', \tilde{X}')], \tag{A.6}
\]

where \tilde{X}' denotes an independent copy of X'.

Recall that Q = \int p(\cdot | x) dP(x) and \theta(x, \tilde{x}) = \int\!\!\int k_X(x', \tilde{x}') dp(x'|x) dp(\tilde{x}'|\tilde{x}). We can then rewrite terms in equation A.6 as

\[
E_{X' \sim Q,\, X'_i}[k_X(X', X'_i)]
= \int \Big( \int\!\!\int k_X(x', x'_i) dp(x'|x) dp(x'_i|X_i) \Big) dP(x)
= \int \theta(x, X_i) dP(x) = E_{X \sim P}[\theta(X, X_i)],
\]
\[
E_{X', \tilde{X}' \sim Q}[k_X(X', \tilde{X}')]
= \int\!\!\int \Big( \int\!\!\int k_X(x', \tilde{x}') dp(x'|x) dp(\tilde{x}'|\tilde{x}) \Big) dP(x) dP(\tilde{x})
= \int\!\!\int \theta(x, \tilde{x}) dP(x) dP(\tilde{x}) = E_{X, \tilde{X} \sim P}[\theta(X, \tilde{X})].
\]

Thus equation A.6 is equal to

\[
\sum_{i=1}^n w_i^2 \big( E_{X'_i}[k_X(X'_i, X'_i)] - E_{X'_i, \tilde{X}'_i}[k_X(X'_i, \tilde{X}'_i)] \big)
+ \sum_{i,j=1}^n w_i w_j \theta(X_i, X_j)
- 2 \sum_{i=1}^n w_i E_{X \sim P}[\theta(X, X_i)]
+ E_{X, \tilde{X} \sim P}[\theta(X, \tilde{X})]. \tag{A.7}
\]

We can rewrite terms in equation A.7 as follows, using the facts A.1, A.2, A.3, A.4, and A.5:

\[
\sum_{i,j} w_i w_j \theta(X_i, X_j)
= \sum_{i,j} w_i w_j \sum_{s,t} \alpha_{st} \phi_s(X_i) \phi_t(X_j)
= \sum_{s,t} \alpha_{st} \sum_i w_i \phi_s(X_i) \sum_j w_j \phi_t(X_j)
= \sum_{s,t} \alpha_{st} \langle \hat{m}_P, \phi_s \rangle_{H_X} \langle \hat{m}_P, \phi_t \rangle_{H_X}
\]
\[
= \sum_{s,t} \alpha_{st} \langle \hat{m}_P \otimes \hat{m}_P, \phi_s \otimes \phi_t \rangle_{H_X \otimes H_X}
= \langle \hat{m}_P \otimes \hat{m}_P, \theta \rangle_{H_X \otimes H_X},
\]
\[
\sum_i w_i E_{X \sim P}[\theta(X, X_i)]
= \sum_i w_i E_{X \sim P}\Big[\sum_{s,t} \alpha_{st} \phi_s(X) \phi_t(X_i)\Big]
= \sum_{s,t} \alpha_{st} E_{X \sim P}[\phi_s(X)] \sum_i w_i \phi_t(X_i)
= \sum_{s,t} \alpha_{st} \langle m_P, \phi_s \rangle_{H_X} \langle \hat{m}_P, \phi_t \rangle_{H_X}
\]
\[
= \sum_{s,t} \alpha_{st} \langle m_P \otimes \hat{m}_P, \phi_s \otimes \phi_t \rangle_{H_X \otimes H_X}
= \langle m_P \otimes \hat{m}_P, \theta \rangle_{H_X \otimes H_X},
\]
\[
E_{X, \tilde{X} \sim P}[\theta(X, \tilde{X})]
= E_{X, \tilde{X} \sim P}\Big[\sum_{s,t} \alpha_{st} \phi_s(X) \phi_t(\tilde{X})\Big]
= \sum_{s,t} \alpha_{st} \langle m_P, \phi_s \rangle_{H_X} \langle m_P, \phi_t \rangle_{H_X}
= \sum_{s,t} \alpha_{st} \langle m_P \otimes m_P, \phi_s \otimes \phi_t \rangle_{H_X \otimes H_X}
= \langle m_P \otimes m_P, \theta \rangle_{H_X \otimes H_X}.
\]

Thus equation A.7 is equal to

\[
\sum_{i=1}^n w_i^2 \big( E_{X'_i}[k_X(X'_i, X'_i)] - E_{X'_i, \tilde{X}'_i}[k_X(X'_i, \tilde{X}'_i)] \big)
+ \langle \hat{m}_P \otimes \hat{m}_P, \theta \rangle_{H_X \otimes H_X}
- 2 \langle m_P \otimes \hat{m}_P, \theta \rangle_{H_X \otimes H_X}
+ \langle m_P \otimes m_P, \theta \rangle_{H_X \otimes H_X}
\]
\[
= \sum_{i=1}^n w_i^2 \big( E_{X'_i}[k_X(X'_i, X'_i)] - E_{X'_i, \tilde{X}'_i}[k_X(X'_i, \tilde{X}'_i)] \big)
+ \langle (\hat{m}_P - m_P) \otimes (\hat{m}_P - m_P), \theta \rangle_{H_X \otimes H_X}.
\]

Finally, the Cauchy-Schwarz inequality gives

\[
\langle (\hat{m}_P - m_P) \otimes (\hat{m}_P - m_P), \theta \rangle_{H_X \otimes H_X}
\le \| \hat{m}_P - m_P \|^2_{H_X} \, \| \theta \|_{H_X \otimes H_X}.
\]

This completes the proof.


Appendix B Proof of Theorem 2

Theorem 2 provides convergence rates for the resampling algorithm, algorithm 4. This theorem assumes that the candidate samples Z_1, ..., Z_N for resampling are i.i.d. with a density q. Here we prove theorem 2 by showing that the same statement holds under weaker assumptions (see theorem 3 below).

We first describe the assumptions. Let P be the distribution of the kernel mean m_P, and let L_2(P) be the Hilbert space of square-integrable functions on X with respect to P. For any f \in L_2(P), we write its norm as \|f\|_{L_2(P)} = \big( \int f^2(x) dP(x) \big)^{1/2}.

Assumption 1. The candidate samples Z_1, ..., Z_N are independent. There are probability distributions Q_1, ..., Q_N on X such that, for any bounded measurable function g: X \to \mathbb{R}, we have

\[
E\Big[\frac{1}{N-1} \sum_{j \neq i} g(Z_j)\Big] = E_{X \sim Q_i}[g(X)] \quad (i = 1, ..., N). \tag{B.1}
\]

Assumption 2. The distributions Q_1, ..., Q_N have density functions q_1, ..., q_N, respectively. Define \bar{Q} = \frac{1}{N}\sum_{i=1}^N Q_i and \bar{q} = \frac{1}{N}\sum_{i=1}^N q_i. There is a constant A > 0 that does not depend on N such that

\[
\Big\| \frac{q_i}{\bar{q}} - 1 \Big\|^2_{L_2(P)} \le \frac{A}{\sqrt{N}} \quad (i = 1, ..., N). \tag{B.2}
\]

Assumption 3. The distribution P has a density function p such that \sup_{x \in X} p(x)/\bar{q}(x) < \infty. There is a constant \sigma > 0 such that

\[
\sqrt{N}\Big( \frac{1}{N} \sum_{i=1}^N \frac{p(Z_i)}{\bar{q}(Z_i)} - 1 \Big) \xrightarrow{D} \mathcal{N}(0, \sigma^2), \tag{B.3}
\]

where \xrightarrow{D} denotes convergence in distribution and \mathcal{N}(0, \sigma^2) the normal distribution with mean 0 and variance \sigma^2.

These assumptions are weaker than those in theorem 2, which require that Z_1, ..., Z_N be i.i.d. For example, assumption 1 is clearly satisfied in the i.i.d. case, since then \bar{Q} = Q_1 = \cdots = Q_N. The inequality B.2 in assumption 2 requires that the distributions Q_1, ..., Q_N become similar as the sample size increases; this is also satisfied under the i.i.d. assumption. Likewise, the convergence B.3 in assumption 3 is satisfied by the central limit theorem if Z_1, ..., Z_N are i.i.d.

We will need the following lemma.

Lemma 1. Let Z_1, ..., Z_N be samples satisfying assumption 1. Then the following holds for any bounded measurable function g: X \to \mathbb{R}:

\[
E\Big[\frac{1}{N} \sum_{i=1}^N g(Z_i)\Big] = \int g(x) d\bar{Q}(x).
\]

Proof.

\[
E\Big[\frac{1}{N} \sum_{i=1}^N g(Z_i)\Big]
= E\Big[\frac{1}{N(N-1)} \sum_{i=1}^N \sum_{j \neq i} g(Z_j)\Big]
= \frac{1}{N} \sum_{i=1}^N E\Big[\frac{1}{N-1} \sum_{j \neq i} g(Z_j)\Big]
= \frac{1}{N} \sum_{i=1}^N \int g(x) dQ_i(x)
= \int g(x) d\bar{Q}(x).
\]

The following theorem shows the convergence rates of our resampling algorithm. Note that it does not assume that the candidate samples Z_1, ..., Z_N are identical to those expressing the estimator \hat{m}_P.

Theorem 3. Let k be a bounded positive-definite kernel and H be the associated RKHS. Let Z_1, ..., Z_N be candidate samples satisfying assumptions 1, 2, and 3. Let P be a probability distribution satisfying assumption 3, and let m_P = \int k(\cdot, x) dP(x) be the kernel mean. Let \hat{m}_P \in H be any element in H. Suppose we apply algorithm 4 to \hat{m}_P \in H with candidate samples Z_1, ..., Z_N, and let \bar{X}_1, ..., \bar{X}_\ell \in \{Z_1, ..., Z_N\} be the resulting samples. Then the following holds:

\[
\Big\| m_P - \frac{1}{\ell} \sum_{i=1}^{\ell} k(\cdot, \bar{X}_i) \Big\|^2_H
= \big( \| m_P - \hat{m}_P \|_H + O_p(N^{-1/2}) \big)^2 + O\Big(\frac{\ln \ell}{\ell}\Big).
\]

Proof. Our proof is based on the fact (Bach et al., 2012) that kernel herding can be seen as the Frank-Wolfe optimization method with step size 1/(\ell + 1) for the \ell-th iteration. For details of the Frank-Wolfe method, we refer to Jaggi (2013) and Freund and Grigas (2014).

Fix the samples Z_1, ..., Z_N. Let M_N be the convex hull of the set \{k(\cdot, Z_1), ..., k(\cdot, Z_N)\} \subset H. Define a loss function J: H \to \mathbb{R} by

\[
J(g) = \frac{1}{2} \| g - \hat{m}_P \|^2_H, \quad g \in H. \tag{B.4}
\]

Then algorithm 4 can be seen as the Frank-Wolfe method that iteratively minimizes this loss function over the convex hull M_N:

\[
\inf_{g \in M_N} J(g).
\]

More precisely, the Frank-Wolfe method solves this problem by the following iterations:

\[
s_\ell = \arg\min_{g \in M_N} \langle g, \nabla J(g_{\ell-1}) \rangle_H, \qquad
g_\ell = (1 - \gamma_\ell) g_{\ell-1} + \gamma_\ell s_\ell \quad (\ell \ge 1),
\]

where \gamma_\ell is a step size defined as \gamma_\ell = 1/\ell, and \nabla J(g_{\ell-1}) is the gradient of J at g_{\ell-1}: \nabla J(g_{\ell-1}) = g_{\ell-1} - \hat{m}_P. Here the initial point is defined as g_0 = 0. It can be easily shown that g_\ell = \frac{1}{\ell} \sum_{i=1}^{\ell} k(\cdot, \bar{X}_i), where \bar{X}_1, ..., \bar{X}_\ell are the samples given by algorithm 4. (For details, see Bach et al., 2012.)

Let L_{J, M_N} > 0 be the Lipschitz constant of the gradient \nabla J over M_N, and Diam M_N > 0 be the diameter of M_N:

\[
L_{J, M_N} = \sup_{g_1, g_2 \in M_N} \frac{\| \nabla J(g_1) - \nabla J(g_2) \|_H}{\| g_1 - g_2 \|_H}
= \sup_{g_1, g_2 \in M_N} \frac{\| g_1 - g_2 \|_H}{\| g_1 - g_2 \|_H} = 1, \tag{B.5}
\]
\[
\mathrm{Diam}\, M_N = \sup_{g_1, g_2 \in M_N} \| g_1 - g_2 \|_H
\le \sup_{g_1, g_2 \in M_N} \big( \| g_1 \|_H + \| g_2 \|_H \big) \le 2C, \tag{B.6}
\]

where C = \sup_{x \in X} \| k(\cdot, x) \|_H = \sup_{x \in X} \sqrt{k(x, x)} < \infty.

From bound 3.2 and equation 8 of Freund and Grigas (2014), we then have

\[
J(g_\ell) - \inf_{g \in M_N} J(g)
\le \frac{L_{J, M_N} (\mathrm{Diam}\, M_N)^2 (1 + \ln \ell)}{2 \ell} \tag{B.7}
\]
\[
\le \frac{2 C^2 (1 + \ln \ell)}{\ell}, \tag{B.8}
\]

where the last inequality follows from equations B.5 and B.6.

Note that the upper bound B.8 does not depend on the candidate samples Z_1, ..., Z_N. Hence, combined with equation B.4, the following holds for any choice of Z_1, ..., Z_N:

\[
\Big\| \hat{m}_P - \frac{1}{\ell} \sum_{i=1}^{\ell} k(\cdot, \bar{X}_i) \Big\|^2_H
\le \inf_{g \in M_N} \| \hat{m}_P - g \|^2_H + \frac{4 C^2 (1 + \ln \ell)}{\ell}. \tag{B.9}
\]

Below we will focus on bounding the first term of equation B.9. Recall here that Z_1, ..., Z_N are random samples. Define a random variable S_N = \sum_{i=1}^N \frac{p(Z_i)}{\bar{q}(Z_i)}. Since M_N is the convex hull of \{k(\cdot, Z_1), ..., k(\cdot, Z_N)\}, we have

\[
\inf_{g \in M_N} \| \hat{m}_P - g \|_H
= \inf_{\alpha \in \mathbb{R}^N:\ \alpha \ge 0,\ \sum_i \alpha_i \le 1} \Big\| \hat{m}_P - \sum_i \alpha_i k(\cdot, Z_i) \Big\|_H
\le \Big\| \hat{m}_P - \frac{1}{S_N} \sum_i \frac{p(Z_i)}{\bar{q}(Z_i)} k(\cdot, Z_i) \Big\|_H
\]
\[
\le \| \hat{m}_P - m_P \|_H
+ \Big\| m_P - \frac{1}{N} \sum_i \frac{p(Z_i)}{\bar{q}(Z_i)} k(\cdot, Z_i) \Big\|_H
+ \Big\| \frac{1}{N} \sum_i \frac{p(Z_i)}{\bar{q}(Z_i)} k(\cdot, Z_i) - \frac{1}{S_N} \sum_i \frac{p(Z_i)}{\bar{q}(Z_i)} k(\cdot, Z_i) \Big\|_H.
\]

Therefore we have

\[
\Big\| m_P - \frac{1}{\ell} \sum_{i=1}^{\ell} k(\cdot, \bar{X}_i) \Big\|^2_H
\le \Bigg( \| m_P - \hat{m}_P \|_H
+ \Big\| m_P - \frac{1}{N} \sum_i \frac{p(Z_i)}{\bar{q}(Z_i)} k(\cdot, Z_i) \Big\|_H
+ \Big\| \frac{1}{N} \sum_i \frac{p(Z_i)}{\bar{q}(Z_i)} k(\cdot, Z_i) - \frac{1}{S_N} \sum_i \frac{p(Z_i)}{\bar{q}(Z_i)} k(\cdot, Z_i) \Big\|_H \Bigg)^2
+ O\Big( \frac{\ln \ell}{\ell} \Big). \tag{B.10}
\]

Below we derive rates of convergence for the second and third terms.

For the second term, we derive a rate of convergence in expectation, which implies a rate of convergence in probability. To this end, we use the following fact. Let f \in H be any function in the RKHS. By the assumption \sup_{x \in X} p(x)/\bar{q}(x) < \infty and the boundedness of k, the functions x \mapsto \frac{p(x)}{\bar{q}(x)} f(x) and x \mapsto \big(\frac{p(x)}{\bar{q}(x)}\big)^2 f(x) are bounded. Then

\[
E\Bigg[ \Big\| m_P - \frac{1}{N} \sum_i \frac{p(Z_i)}{\bar{q}(Z_i)} k(\cdot, Z_i) \Big\|^2_H \Bigg]
= \| m_P \|^2_H - 2 E\Big[ \frac{1}{N} \sum_i \frac{p(Z_i)}{\bar{q}(Z_i)} m_P(Z_i) \Big]
+ E\Big[ \frac{1}{N^2} \sum_i \sum_j \frac{p(Z_i)}{\bar{q}(Z_i)} \frac{p(Z_j)}{\bar{q}(Z_j)} k(Z_i, Z_j) \Big]
\]
\[
= \| m_P \|^2_H - 2 \int \frac{p(x)}{\bar{q}(x)} m_P(x) \bar{q}(x) dx
+ E\Big[ \frac{1}{N^2} \sum_i \sum_{j \neq i} \frac{p(Z_i)}{\bar{q}(Z_i)} \frac{p(Z_j)}{\bar{q}(Z_j)} k(Z_i, Z_j) \Big]
+ E\Big[ \frac{1}{N^2} \sum_i \Big( \frac{p(Z_i)}{\bar{q}(Z_i)} \Big)^2 k(Z_i, Z_i) \Big]
\]
\[
= \| m_P \|^2_H - 2 \| m_P \|^2_H
+ E\Big[ \frac{N-1}{N^2} \sum_i \frac{p(Z_i)}{\bar{q}(Z_i)} \int \frac{p(x)}{\bar{q}(x)} k(Z_i, x) q_i(x) dx \Big]
+ \frac{1}{N} \int \Big( \frac{p(x)}{\bar{q}(x)} \Big)^2 k(x, x) \bar{q}(x) dx
\]
\[
= - \| m_P \|^2_H
+ E\Big[ \frac{N-1}{N^2} \sum_i \frac{p(Z_i)}{\bar{q}(Z_i)} \int \frac{p(x)}{\bar{q}(x)} k(Z_i, x) q_i(x) dx \Big]
+ \frac{1}{N} \int \frac{p(x)}{\bar{q}(x)} k(x, x) dP(x).
\]

We further rewrite the second term of the last equality as follows:

\[
E\Big[ \frac{N-1}{N^2} \sum_i \frac{p(Z_i)}{\bar{q}(Z_i)} \int \frac{p(x)}{\bar{q}(x)} k(Z_i, x) q_i(x) dx \Big]
= E\Big[ \frac{N-1}{N^2} \sum_i \frac{p(Z_i)}{\bar{q}(Z_i)} \int \frac{p(x)}{\bar{q}(x)} k(Z_i, x) (q_i(x) - \bar{q}(x)) dx \Big]
+ E\Big[ \frac{N-1}{N^2} \sum_i \frac{p(Z_i)}{\bar{q}(Z_i)} \int \frac{p(x)}{\bar{q}(x)} k(Z_i, x) \bar{q}(x) dx \Big]
\]
\[
= E\Big[ \frac{N-1}{N^2} \sum_i \frac{p(Z_i)}{\bar{q}(Z_i)} \int \sqrt{p(x)}\, k(Z_i, x)\, \sqrt{p(x)} \Big( \frac{q_i(x)}{\bar{q}(x)} - 1 \Big) dx \Big]
+ \frac{N-1}{N} \| m_P \|^2_H
\]
\[
\le E\Big[ \frac{N-1}{N^2} \sum_i \frac{p(Z_i)}{\bar{q}(Z_i)} \| k(Z_i, \cdot) \|_{L_2(P)} \Big\| \frac{q_i}{\bar{q}} - 1 \Big\|_{L_2(P)} \Big]
+ \frac{N-1}{N} \| m_P \|^2_H
\]
\[
\le E\Big[ \frac{N-1}{N^3} \sum_i \frac{p(Z_i)}{\bar{q}(Z_i)} C^2 A \Big]
+ \frac{N-1}{N} \| m_P \|^2_H
= \frac{C^2 A (N-1)}{N^2} + \frac{N-1}{N} \| m_P \|^2_H,
\]

where the first inequality follows from Cauchy-Schwarz. Using this, we obtain

\[
E\Bigg[ \Big\| m_P - \frac{1}{N} \sum_i \frac{p(Z_i)}{\bar{q}(Z_i)} k(\cdot, Z_i) \Big\|^2_H \Bigg]
\le \frac{1}{N} \Big( \int \frac{p(x)}{\bar{q}(x)} k(x, x) dP(x) - \| m_P \|^2_H \Big) + \frac{C^2 (N-1) A}{N^2}
= O(N^{-1}).
\]

Therefore we have

\[
\Big\| m_P - \frac{1}{N} \sum_i \frac{p(Z_i)}{\bar{q}(Z_i)} k(\cdot, Z_i) \Big\|_H = O_p(N^{-1/2}) \quad (N \to \infty). \tag{B.11}
\]

We can bound the third term as follows:

\[
\Big\| \frac{1}{N} \sum_i \frac{p(Z_i)}{\bar{q}(Z_i)} k(\cdot, Z_i) - \frac{1}{S_N} \sum_i \frac{p(Z_i)}{\bar{q}(Z_i)} k(\cdot, Z_i) \Big\|_H
= \Big\| \frac{1}{N} \sum_i \frac{p(Z_i)}{\bar{q}(Z_i)} k(\cdot, Z_i) \Big( 1 - \frac{N}{S_N} \Big) \Big\|_H
\]
\[
= \Big| 1 - \frac{N}{S_N} \Big| \, \Big\| \frac{1}{N} \sum_i \frac{p(Z_i)}{\bar{q}(Z_i)} k(\cdot, Z_i) \Big\|_H
\le \Big| 1 - \frac{N}{S_N} \Big| \, C \, \Big\| \frac{p}{\bar{q}} \Big\|_\infty
= \Bigg| 1 - \frac{1}{\frac{1}{N} \sum_{i=1}^N p(Z_i)/\bar{q}(Z_i)} \Bigg| \, C \, \Big\| \frac{p}{\bar{q}} \Big\|_\infty,
\]

where \| p/\bar{q} \|_\infty = \sup_{x \in X} p(x)/\bar{q}(x) < \infty. Therefore the following holds by assumption 3 and the delta method:

\[
\Big\| \frac{1}{N} \sum_i \frac{p(Z_i)}{\bar{q}(Z_i)} k(\cdot, Z_i) - \frac{1}{S_N} \sum_i \frac{p(Z_i)}{\bar{q}(Z_i)} k(\cdot, Z_i) \Big\|_H = O_p(N^{-1/2}). \tag{B.12}
\]

The assertion of the theorem follows from equations B.10 to B.12.
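For intuition about the object analyzed in theorem 3, the following is a minimal sketch (ours, not the letter's algorithm 4 verbatim; it assumes a gaussian kernel and NumPy) of herding-type resampling from a finite candidate set: given an estimate \hat{m}_P = \sum_i w_i k(\cdot, X_i), it greedily picks \ell candidates so that their uniformly weighted kernel mean approximates \hat{m}_P.

```python
import numpy as np

def gauss_gram(A, B, sigma=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-d2 / (2.0 * sigma**2))

def herding_resample(X, w, Z, n_out, sigma=1.0):
    """Greedily pick n_out points from candidates Z so that their uniform kernel mean
    approximates m_hat = sum_i w_i k(., X_i)."""
    m_hat_on_Z = gauss_gram(Z, X, sigma) @ w      # m_hat evaluated at each candidate
    K_ZZ = gauss_gram(Z, Z, sigma)
    picked = []
    sum_k = np.zeros(len(Z))                      # running sum_j k(Z, picked_j)
    for t in range(1, n_out + 1):
        # herding score at step t: m_hat(z) - (1/t) * sum_{j < t} k(z, picked_j)
        scores = m_hat_on_Z - sum_k / t
        idx = int(np.argmax(scores))
        picked.append(idx)
        sum_k += K_ZZ[:, idx]
    return Z[picked]                              # uniformly weighted pseudosamples

# usage: resample 100 points from a weighted sample (X, w), with candidates Z = X
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 1))
w = rng.dirichlet(np.ones(300))
X_bar = herding_resample(X, w, X, n_out=100)
```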

Appendix C Reduction of Computational Cost

We have seen in section 4.3 that the time complexity of KMCF in one time step is O(n^3), where n is the number of state-observation examples {(X_i, Y_i)}_{i=1}^n. This can be costly if one wishes to use KMCF in real-time applications with a large number of samples. Here we show two methods for reducing the costs: one based on a low-rank approximation of kernel matrices and one based on kernel herding. Note that kernel herding is also used in the resampling step; the purpose here is different, however: we make use of kernel herding to find a reduced representation of the data {(X_i, Y_i)}_{i=1}^n.

C.1 Low-Rank Approximation of Kernel Matrices. Our goal is to reduce the costs of algorithm 1, kernel Bayes' rule. Algorithm 1 involves two matrix inversions: (G_X + nεI_n)^{-1} in line 3 and ((G_Y)^2 + δI_n)^{-1} in line 4. Note that (G_X + nεI_n)^{-1} does not involve the test data, so it can be computed before the test phase. On the other hand, ((G_Y)^2 + δI_n)^{-1} depends on a matrix that involves the vector m_π, which essentially represents the prior of the current state (see line 13 of algorithm 3). Therefore, ((G_Y)^2 + δI_n)^{-1} needs to be computed for each iteration in the test phase, and this has a complexity of O(n^3). Note that even if (G_X + nεI_n)^{-1} can be computed in the training phase, the multiplication (G_X + nεI_n)^{-1} m_π in line 3 requires O(n^2), so it can also be costly. Here we consider methods to reduce both costs in lines 3 and 4.

Suppose that there exist low-rank matrices U, V ∈ R^{n×r}, where r < n, that approximate the kernel matrices: G_X ≈ U U^T, G_Y ≈ V V^T. Such low-rank matrices can be obtained by, for example, incomplete Cholesky decomposition, with time complexity O(nr^2) (Fine & Scheinberg, 2001; Bach & Jordan, 2002). Note that the computation of these matrices is required only once before the test phase; therefore, their time complexities are not the problem here.

C.1.1 Derivation. First, we approximate (G_X + nεI_n)^{-1} m_π in line 3 using G_X ≈ U U^T. By the Woodbury identity, we have

\[
(G_X + n\varepsilon I_n)^{-1} m_\pi \approx (U U^T + n\varepsilon I_n)^{-1} m_\pi
= \frac{1}{n\varepsilon} \big( I_n - U (n\varepsilon I_r + U^T U)^{-1} U^T \big) m_\pi,
\]

where I_r ∈ R^{r×r} denotes the identity. Note that (nεI_r + U^T U)^{-1} does not involve the test data, so it can be computed in the training phase. Thus the above approximation of μ can be computed with complexity O(nr^2).

Next, we approximate w = G_Y ((G_Y)^2 + δI)^{-1} k_Y in line 4 using G_Y ≈ V V^T. Define B = V ∈ R^{n×r}, C = V^T V ∈ R^{r×r}, and D = V^T ∈ R^{r×n}. Then (G_Y)^2 ≈ (V V^T)^2 = B C D. By the Woodbury identity, we obtain

\[
(\delta I_n + (G_Y)^2)^{-1} \approx (\delta I_n + B C D)^{-1}
= \frac{1}{\delta} \big( I_n - B (\delta C^{-1} + D B)^{-1} D \big).
\]

Thus w can be approximated as

\[
w = G_Y ((G_Y)^2 + \delta I)^{-1} k_Y
\approx \frac{1}{\delta} V V^T \big( I_n - B (\delta C^{-1} + D B)^{-1} D \big) k_Y.
\]

The computation of this approximation requires O(nr^2 + r^3) = O(nr^2). Thus, in total, the complexity of algorithm 1 can be reduced to O(nr^2). We summarize the above approximations in algorithm 5.
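The following NumPy sketch (ours; the variable names are illustrative, and the factors U, V and the vectors m_π, k_Y are assumed to be given) implements the two Woodbury-based approximations above, so that no n × n inverse is ever formed.

```python
import numpy as np

def approx_kbr_inversions(U, V, m_pi, k_y, eps, delta):
    """Low-rank approximation of the two inversions above (appendix C.1.1).
    U, V: (n, r) factors with G_X ~ U U^T and G_Y ~ V V^T."""
    n, r = U.shape

    # mu ~ (G_X + n*eps*I)^{-1} m_pi
    #    = (1/(n*eps)) (I - U (n*eps*I_r + U^T U)^{-1} U^T) m_pi
    inner_u = np.linalg.inv(n * eps * np.eye(r) + U.T @ U)   # r x r, precomputable
    mu = (m_pi - U @ (inner_u @ (U.T @ m_pi))) / (n * eps)

    # w ~ G_Y ((G_Y)^2 + delta*I)^{-1} k_y with (G_Y)^2 ~ B C D, B = V, C = V^T V, D = V^T
    C = V.T @ V
    inner_v = np.linalg.inv(delta * np.linalg.inv(C) + C)    # (delta C^{-1} + D B)^{-1}
    tmp = k_y - V @ (inner_v @ (V.T @ k_y))                  # (I - B (...) D) k_y
    w = V @ (V.T @ tmp) / delta                              # (1/delta) V V^T (...)
    return mu, w
```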


C.1.2 How to Use. Algorithm 5 can be used with algorithm 3 of KMCF by modifying algorithm 3 in the following manner. Compute the low-rank matrices U, V right after lines 4 and 5; this can be done by using, for example, incomplete Cholesky decomposition (Fine & Scheinberg, 2001; Bach & Jordan, 2002). Then replace algorithm 1 in line 15 by algorithm 5.

C.1.3 How to Select the Rank. As discussed in section 4.3, one way of selecting the rank r is to use cross-validation, regarding r as a hyperparameter of KMCF. Another way is to measure the approximation errors ||G_X - U U^T|| and ||G_Y - V V^T|| with some matrix norm, such as the Frobenius norm. Indeed, we can compute the smallest rank r such that these errors are below a prespecified threshold, and this can be done efficiently with time complexity O(nr^2) (Bach & Jordan, 2002).
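As an illustration of the second criterion, the following sketch (ours; it uses a full eigendecomposition for clarity, whereas the text's O(nr^2) route is incomplete Cholesky) finds the smallest rank r whose best low-rank approximation error, in Frobenius norm, falls below a threshold.

```python
import numpy as np

def smallest_rank(G, tol):
    """Smallest r with ||G - G_r||_F <= tol, where G_r is the best rank-r
    approximation of the symmetric PSD kernel matrix G (Eckart-Young)."""
    eigvals = np.linalg.eigvalsh(G)[::-1]                     # eigenvalues, descending
    tail = np.sqrt(np.cumsum((eigvals**2)[::-1]))[::-1]       # tail[r] = ||G - G_r||_F
    for r in range(len(eigvals) + 1):
        err = tail[r] if r < len(eigvals) else 0.0
        if err <= tol:
            return r
    return len(eigvals)
```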

C.2 Data Reduction with Kernel Herding. Here we describe an approach to reducing the size of the representation of the state-observation examples {(X_i, Y_i)}_{i=1}^n in an efficient way. By "efficient" we mean that the information contained in {(X_i, Y_i)}_{i=1}^n will be preserved even after the reduction. Recall that {(X_i, Y_i)}_{i=1}^n contains the information of the observation model p(y_t | x_t) (recall also that p(y_t | x_t) is assumed time-homogeneous; see section 4.1). This information is used only in algorithm 1, kernel Bayes' rule (line 15, algorithm 3). Therefore, it suffices to consider how kernel Bayes' rule accesses the information contained in the joint sample {(X_i, Y_i)}_{i=1}^n.

i=1

C21 Representation of the Joint Sample To this end we need to show howthe joint sample (XiYi)n

i=1 can be represented with a kernel mean embed-ding Recall that (kX HX ) and (kY HY ) are kernels and the associatedRKHSs on the state-space X and the observation-space Y respectively LetX times Y be the product space ofX andY Then we can define a kernel kXtimesY onX times Y as the product of kX and kY kXtimesY ((x y) (xprime yprime)) = kX (x xprime)kY (y yprime)for all (x y) (xprime yprime) isin X times Y This product kernel kXtimesY defines an RKHS ofX times Y Let HXtimesY denote this RKHS As in section 3 we can use kXtimesY andHXtimesY for a kernel mean embedding In particular the empirical distribution1n

sumni=1 δ(XiYi )

of the joint sample (XiYi)ni=1 sub X times Y can be represented as

an empirical kernel mean in HXtimesY

mXY = 1n

nsumi=1

kXtimesY ((middot middot) (XiYi)) isin HXtimesY (C1)

This is the representation of the joint sample (XiYi)ni=1

The information of (XiYi)ni=1 is provided for kernel Bayesrsquo rule essen-

tially through the form of equation C1 (Fukumizu et al 2011 2013) Recallthat equation C1 is a point in the RKHS HXtimesY Any point close to thisequation in HXtimesY would also contain information close to that contained in

Filtering with State-Observation Examples 439

equation C1 Therefore we propose to find a subset (X1 Y1) (Xr Xr) sub(XiYi)n

i=1 where r lt n such that its representation in HXtimesY

mXY = 1r

rsumi=1

kXtimesY ((middot middot) (Xi Yi)) isin HXtimesY (C2)

is close to equation C1 Namely we wish to find subsamples such thatmXY minus mXYHXtimesY

is small If the error mXY minus mXYHXtimesYis small enough

equation C2 would provide information close to that given by equation C1for kernel Bayesrsquo rule Thus kernel Bayesrsquo rule based on such subsamples(Xi Yi)r

i=1 would not perform much worse than the one based on the entireset of samples (XiYi)n

i=1

C22 Subsampling Method To find such subsamples we make use ofkernel herding in section 35 Namely we apply the update equations 36and 37 to approximate equation C1 with kernel kXtimesY and RKHS HXtimesY We greedily find subsamples Dr = (X1 Y1) (Xr Yr) as

(Xr Yr)

= arg max(xy)isinDDrminus1

1n

nsumi=1

kXtimesY ((x y) (XiYi))minus1r

rminus1sumj=1

kXtimesY ((x r) (Xi Yi))

= arg max(xy)isinDDrminus1

1n

nsumi=1

kX (x Xi)kY (yYi)minus1r

rminus1sumj=1

kX (x Xj)kY (y Yj)

The resulting algorithm is shown in algorithm 6 The time complexity isO(n2r) for selecting r subsamples

C23 How to Use By using algorithm 6 we can reduce the the time com-plexity of KMCF (see algorithm 3) in each iteration from O(n3) to O(r3) Thiscan be done by obtaining subsamples (Xi Yi)r

i=1 by applying algorithm 6to (XiYi)n

i=1 and then replacing (XiYi)ni=1 in requirement of algorithm 3

by (Xi Yi)ri=1 and using the number r instead of n

C24 How to Select the Number of Subsamples The number r of subsam-ples determines the trade-off between the accuracy and computational timeof KMCF It may be selected by cross-validation or by measuring the ap-proximation error mXY minus mXYHXtimesY

as for the case of selecting the rankof low-rank approximation in section C1

C25 Discussion Recall that kernel herding generates samples such thatthey approximate a given kernel mean (see section 35) Under certain

440 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

assumptions the error of this approximation is of O(rminus1) with r sampleswhich is faster than that of iid samples O(rminus12) This indicates that sub-samples (Xi Yi)r

i=1 selected with kernel herding may approximate equa-tion C1 well Here however we find the solutions of the optimizationproblems 36 and 37 from the finite set (XiYi)n

i=1 rather than the entirejoint space X times Y The convergence guarantee is provided only for the caseof the entire joint space X times Y Thus for our case the convergence guar-antee is no longer provided Moreover the fast rate O(rminus1) is guaranteedonly for finite-dimensional RKHSs Gaussian kernels which we often use inpractice define infinite-dimensional RKHSs Therefore the fast rate is notguaranteed if we use gaussian kernels Nevertheless we can use algorithm6 as a heuristic for data reduction

Acknowledgments

We express our gratitude to the associate editor and the anonymous reviewer for their time and helpful suggestions. We also thank Masashi Shimbo, Momoko Hayamizu, Yoshimasa Uematsu, and Katsuhiro Omae for their helpful comments. This work has been supported in part by MEXT Grant-in-Aid for Scientific Research on Innovative Areas 25120012. M.K. has been supported by JSPS Grant-in-Aid for JSPS Fellows 15J04406.

References

Anderson, B., & Moore, J. (1979). Optimal filtering. Englewood Cliffs, NJ: Prentice Hall.
Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3), 337-404.
Bach, F., & Jordan, M. I. (2002). Kernel independent component analysis. Journal of Machine Learning Research, 3, 1-48.
Bach, F., Lacoste-Julien, S., & Obozinski, G. (2012). On the equivalence between herding and conditional gradient algorithms. In Proceedings of the 29th International Conference on Machine Learning (ICML 2012) (pp. 1359-1366). Madison, WI: Omnipress.
Berlinet, A., & Thomas-Agnan, C. (2004). Reproducing kernel Hilbert spaces in probability and statistics. Dordrecht: Kluwer Academic.
Calvet, L. E., & Czellar, V. (2015). Accurate methods for approximate Bayesian computation filtering. Journal of Financial Econometrics, 13, 798-838. doi:10.1093/jjfinec/nbu019
Cappe, O., Godsill, S. J., & Moulines, E. (2007). An overview of existing methods and recent advances in sequential Monte Carlo. IEEE Proceedings, 95(5), 899-924.
Chen, Y., Welling, M., & Smola, A. (2010). Supersamples from kernel-herding. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence (pp. 109-116). Cambridge, MA: MIT Press.
Deisenroth, M., Huber, M., & Hanebeck, U. (2009). Analytic moment-based gaussian process filtering. In Proceedings of the 26th International Conference on Machine Learning (pp. 225-232). Madison, WI: Omnipress.
Doucet, A., Freitas, N. D., & Gordon, N. J. (Eds.). (2001). Sequential Monte Carlo methods in practice. New York: Springer.
Doucet, A., & Johansen, A. M. (2011). A tutorial on particle filtering and smoothing: Fifteen years later. In D. Crisan & B. Rozovskii (Eds.), The Oxford handbook of nonlinear filtering (pp. 656-704). New York: Oxford University Press.
Durbin, J., & Koopman, S. J. (2012). Time series analysis by state space methods (2nd ed.). New York: Oxford University Press.
Eberts, M., & Steinwart, I. (2013). Optimal regression rates for SVMs using gaussian kernels. Electronic Journal of Statistics, 7, 1-42.
Ferris, B., Hahnel, D., & Fox, D. (2006). Gaussian processes for signal strength-based location estimation. In Proceedings of Robotics: Science and Systems. Cambridge, MA: MIT Press.
Fine, S., & Scheinberg, K. (2001). Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2, 243-264.
Freund, R. M., & Grigas, P. (2014). New analysis and results for the Frank-Wolfe method. Mathematical Programming. doi:10.1007/s10107-014-0841-6
Fukumizu, K., Bach, F., & Jordan, M. I. (2004). Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5, 73-99.
Fukumizu, K., Gretton, A., Sun, X., & Scholkopf, B. (2008). Kernel measures of conditional dependence. In J. C. Platt, D. Koller, Y. Singer, & S. Roweis (Eds.), Advances in neural information processing systems, 20 (pp. 489-496). Cambridge, MA: MIT Press.
Fukumizu, K., Song, L., & Gretton, A. (2011). Kernel Bayes' rule. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. C. N. Pereira, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 24 (pp. 1737-1745). Red Hook, NY: Curran.
Fukumizu, K., Song, L., & Gretton, A. (2013). Kernel Bayes' rule: Bayesian inference with positive definite kernels. Journal of Machine Learning Research, 14, 3753-3783.
Fukumizu, K., Sriperumbudur, B., Gretton, A., & Scholkopf, B. (2009). Characteristic kernels on groups and semigroups. In D. Koller, D. Schuurmans, Y. Bengio, & L. Bottou (Eds.), Advances in neural information processing systems, 21 (pp. 473-480). Cambridge, MA: MIT Press.
Gordon, N. J., Salmond, D. J., & Smith, A. F. M. (1993). Novel approach to nonlinear/non-gaussian Bayesian state estimation. IEE Proceedings F, 140, 107-113.
Hofmann, T., Scholkopf, B., & Smola, A. J. (2008). Kernel methods in machine learning. Annals of Statistics, 36(3), 1171-1220.
Jaggi, M. (2013). Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning (pp. 427-435).
Jasra, A., Singh, S. S., Martin, J. S., & McCoy, E. (2012). Filtering via approximate Bayesian computation. Statistics and Computing, 22, 1223-1237.
Julier, S. J., & Uhlmann, J. K. (1997). A new extension of the Kalman filter to nonlinear systems. In Proceedings of AeroSense: The 11th International Symposium on Aerospace/Defence Sensing, Simulation and Controls. Bellingham, WA: SPIE.
Julier, S. J., & Uhlmann, J. K. (2004). Unscented filtering and nonlinear estimation. IEEE Review, 92, 401-422.
Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Transactions of the ASME, Journal of Basic Engineering, 82, 35-45.
Kanagawa, M., & Fukumizu, K. (2014). Recovering distributions from gaussian RKHS embeddings. In Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (pp. 457-465). JMLR.
Kanagawa, M., Nishiyama, Y., Gretton, A., & Fukumizu, K. (2014). Monte Carlo filtering using kernel embedding of distributions. In Proceedings of the 28th AAAI Conference on Artificial Intelligence (pp. 1897-1903). Cambridge, MA: MIT Press.
Ko, J., & Fox, D. (2009). GP-BayesFilters: Bayesian filtering using gaussian process prediction and observation models. Autonomous Robots, 72(1), 75-90.
Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (vol. 2, pp. 2169-2178). Washington, DC: IEEE Computer Society.
Liu, J. S. (2001). Monte Carlo strategies in scientific computing. New York: Springer-Verlag.
McCalman, L., O'Callaghan, S., & Ramos, F. (2013). Multi-modal estimation with kernel embeddings for learning motion models. In Proceedings of the 2013 IEEE International Conference on Robotics and Automation (pp. 2845-2852). Piscataway, NJ: IEEE.
Pan, S. J., & Yang, Q. (2010). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10), 1345-1359.
Pistohl, T., Ball, T., Schulze-Bonhage, A., Aertsen, A., & Mehring, C. (2008). Prediction of arm movement trajectories from ECoG-recordings in humans. Journal of Neuroscience Methods, 167(1), 105-114.
Pronobis, A., & Caputo, B. (2009). COLD: COsy localization database. International Journal of Robotics Research, 28(5), 588-594.
Quigley, M., Stavens, D., Coates, A., & Thrun, S. (2010). Sub-meter indoor localization in unmodified environments with inexpensive sensors. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 2010 (vol. 1, pp. 2039-2046). Piscataway, NJ: IEEE.
Ristic, B., Arulampalam, S., & Gordon, N. (2004). Beyond the Kalman filter: Particle filters for tracking applications. Norwood, MA: Artech House.
Schaid, D. J. (2010a). Genomic similarity and kernel methods I: Advancements by building on mathematical and statistical foundations. Human Heredity, 70(2), 109-131.
Schaid, D. J. (2010b). Genomic similarity and kernel methods II: Methods for genomic information. Human Heredity, 70(2), 132-140.
Schalk, G., Kubanek, J., Miller, K. J., Anderson, N. R., Leuthardt, E. C., Ojemann, J. G., ... Wolpaw, J. R. (2007). Decoding two-dimensional movement trajectories using electrocorticographic signals in humans. Journal of Neural Engineering, 4(264), 264-275.
Scholkopf, B., & Smola, A. J. (2002). Learning with kernels. Cambridge, MA: MIT Press.
Scholkopf, B., Tsuda, K., & Vert, J. P. (2004). Kernel methods in computational biology. Cambridge, MA: MIT Press.
Silverman, B. W. (1986). Density estimation for statistics and data analysis. London: Chapman and Hall.
Smola, A., Gretton, A., Song, L., & Scholkopf, B. (2007). A Hilbert space embedding for distributions. In Proceedings of the International Conference on Algorithmic Learning Theory (pp. 13-31). New York: Springer.
Song, L., Fukumizu, K., & Gretton, A. (2013). Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine, 30(4), 98-111.
Song, L., Huang, J., Smola, A., & Fukumizu, K. (2009). Hilbert space embeddings of conditional distributions with applications to dynamical systems. In Proceedings of the 26th International Conference on Machine Learning (pp. 961-968). Madison, WI: Omnipress.
Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Scholkopf, B., & Lanckriet, G. R. (2010). Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11, 1517-1561.
Steinwart, I., & Christmann, A. (2008). Support vector machines. New York: Springer.
Stone, C. J. (1977). Consistent nonparametric regression. Annals of Statistics, 5(4), 595-620.
Thrun, S., Burgard, W., & Fox, D. (2005). Probabilistic robotics. Cambridge, MA: MIT Press.
Vlassis, N., Terwijn, B., & Krose, B. (2002). Auxiliary particle filter robot localization from high-dimensional sensor observations. In Proceedings of the International Conference on Robotics and Automation (pp. 7-12). Piscataway, NJ: IEEE.
Wang, Z., Ji, Q., Miller, K. J., & Schalk, G. (2011). Prior knowledge improves decoding of finger flexion from electrocorticographic signals. Frontiers in Neuroscience, 5, 127.
Widom, H. (1963). Asymptotic behavior of the eigenvalues of certain integral equations. Transactions of the American Mathematical Society, 109, 278-295.
Widom, H. (1964). Asymptotic behavior of the eigenvalues of certain integral equations II. Archive for Rational Mechanics and Analysis, 17, 215-229.
Wolf, J., Burgard, W., & Burkhardt, H. (2005). Robust vision-based localization by combining an image retrieval system with Monte Carlo localization. IEEE Transactions on Robotics, 21(2), 208-216.

Received May 18, 2015; accepted October 14, 2015.

Page 5: Filtering with State-Observation Examples via Kernel Monte ...

386 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

with the transition model in the same manner as the sampling procedure of aparticle filter The propagated estimate is then used as a prior for the currentstate In the correction step kernel Bayesrsquo rule is applied to obtain a posteriorestimate using the prior and the state-observation examples (XiYi)n

i=1Finally in the resampling step an approximate version of kernel herding(Chen Welling amp Smola 2010) is applied to obtain pseudosamples fromthe posterior estimate Kernel herding is a greedy optimization method togenerate pseudosamples from a given kernel mean and searches for thosesamples from the entire space X Our resampling algorithm modifies thisand searches for pseudosamples from a finite candidate set of the statesamples X1 Xn sub X The obtained pseudosamples are then used inthe prediction step of the next iteration

While the KMCF algorithm is inspired by particle filters there are severalimportant differences First a weighted sample expression in KMCF is anestimator of the RKHS representation of a probability distribution whilethat of a particle filter represents an empirical distribution This differencecan be seen in the fact that weights of KMCF can take negative valueswhile weights of a particle filter are always positive Second to estimate aposterior KMCF uses the state-observation examples (XiYi)n

i=1 and doesnot require the observation model itself while a particle filter makes use ofthe observation model to update weights In other words KMCF involvesnonparametric estimation of the observation model while a particle filterdoes not Third KMCF achieves resampling based on kernel herding whilea particle filter uses a standard resampling procedure with an empiricaldistribution We use kernel herding because the resampling procedure ofparticle methods is not appropriate for KMCF as the weights in KMCF maytake negative values

Since the theory of particle methods cannot be used to justify our ap-proach we conduct the following theoretical analysis

bull We derive error bounds for the sampling procedure in the predictionstep in section 51 This justifies the use of the sampling procedurewith weighted sample expressions of kernel mean embeddings Thebounds are not trivial since the weights of kernel mean embeddingscan take negative values

bull We discuss how resampling works with kernel mean embeddings (seesection 52) It improves the estimation accuracy of the subsequentsampling procedure by increasing the effective sample size of anempirical kernel mean This mechanism is essentially the same asthat of a particle filter

bull We provide novel convergence rates of kernel herding when pseu-dosamples are searched from a finite candidate set (see section 53)This justifies our resampling algorithm This result may be of inde-pendent interest to the kernel community as it describes how kernelherding is often used in practice

Filtering with State-Observation Examples 387

bull We show the consistency of the overall filtering procedure of KMCFunder certain smoothness assumptions (see section 54) KMCFprovides consistent posterior estimates as the number of state-observation examples (XiYi)n

i=1 increases

The rest of the letter is organized as follows In section 2 we reviewrelated work Section 3 is devoted to preliminaries to make the letter self-contained we review the theory of kernel mean embeddings Section 4presents the kernel Monte Carlo filter and section 5 shows theoretical re-sults In section 6 we demonstrate the effectiveness of KMCF by artificialand real-data experiments The real experiment is on vision-based mobilerobot localization an example of the location estimation problems men-tioned above The appendixes present two methods for reducing KMCFcomputational costs

This letter expands on a conference paper by Kanagawa NishiyamaGretton and Fukumizu (2014) It differs from that earlier work in that itintroduces and justifies the use of kernel herding for resampling The re-sampling step allows us to control the effective sample size of an empiricalkernel mean an important factor that determines the accuracy of the sam-pling procedure as in particle methods

2 Related Work

We consider the following setting First the observation model p(yt |xt ) isnot known explicitly or even parametrically Instead state-observation ex-amples (XiYi) are available before the test phase Second sampling fromthe transition model p(xt |xtminus1) is possible Note that standard particle filterscannot be applied to this setting directly since they require the observationmodel to be given as a parametric model

As far as we know a few methods can be applied to this setting directly(Vlassis et al 2002 Ferris et al 2006) These methods learn the observationmodel from state-observation examples nonparametrically and then use itto run a particle filter with a transition model Vlassis et al (2002) proposedto apply conditional density estimation based on the k-nearest neighborsapproach (Stone 1977) for learning the observation model A problem hereis that conditional density estimation suffers from the curse of dimension-ality if observations are high-dimensional (Silverman 1986) Vlassis et al(2002) avoided this problem by estimating the conditional density functionof a state given an observation and used it as an alternative for the obser-vation model This heuristic may introduce bias in estimation howeverFerris et al (2006) proposed using gaussian process regression for learningthe observation model This method will perform well if the gaussian noiseassumption is satisfied but cannot be applied to structured observations

There exist related but different problem settings from ours One situa-tion is that examples for state transitions are also given and the transition

388 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

model is to be learned nonparametrically from these examples For this set-ting there are methods based on kernel mean embeddings (Song HuangSmola amp Fukumizu 2009 Fukumizu et al 2011 2013) and gaussian pro-cesses (Ko amp Fox 2009 Deisenroth Huber amp Hanebeck 2009) The filteringmethod by Fukumizu et al (2011 2013) is in particular closely related toKMCF as it also uses kernel Bayesrsquo rule A main difference from KMCF isthat it computes forward probabilities by kernel sum rule (Song et al 20092013) which nonparametrically learns the transition model from the statetransition examples While the setting is different from ours we compareKMCF with this method in our experiments as a baseline

Another related setting is that the observation model itself is given andsampling is possible but computation of its values is expensive or evenimpossible Therefore ordinary Bayesrsquo rule cannot be used for filtering Toovercome this limitation Jasra Singh Martin and McCoy (2012) and Calvetand Czellar (2015) proposed applying approximate Bayesian computation(ABC) methods For each iteration of filtering these methods generate state-observation pairs from the observation model Then they pick some pairsthat have close observations to the test observation and regard the statesin these pairs as samples from a posterior Note that these methods arenot applicable to our setting since we do not assume that the observationmodel is provided That said our method may be applied to their setting bygenerating state-observation examples from the observation model Whilesuch a comparison would be interesting this letter focuses on comparisonamong the methods applicable to our setting

3 Kernel Mean Embeddings of Distributions

Here we briefly review the framework of kernel mean embeddings Fordetails we refer to the tutorial papers (Smola et al 2007 Song et al 2013)

31 Positive-Definite Kernel and RKHS We begin by introducingpositive-definite kernels and reproducing kernel Hilbert spaces details ofwhich can be found in Scholkopf and Smola (2002) Berlinet and Thomas-Agnan (2004) and Steinwart and Christmann (2008)

Let X be a set and k X times X rarr R be a positive-definite (pd) kernel1

Any positive-definite kernel is uniquely associated with a reproducing ker-nel Hilbert space (RKHS) (Aronszajn 1950) Let H be the RKHS associatedwith k The RKHS H is a Hilbert space of functions on X that satisfies thefollowing important properties

1A symmetric kernel k X times X rarr R is called positive definite (pd) if for all n isin Nc1 cn isin R and X1 Xn isin X we have

nsumi=1

nsumj=1

cic jk(Xi Xj ) ge 0

Filtering with State-Observation Examples 389

• Feature vector: k(·, x) ∈ H for all x ∈ X.
• Reproducing property: f(x) = ⟨f, k(·, x)⟩_H for all f ∈ H and x ∈ X,

where ⟨·, ·⟩_H denotes the inner product equipped with H, and k(·, x) is a function with x fixed. By the reproducing property, we have

k(x, x′) = ⟨k(·, x), k(·, x′)⟩_H  for all x, x′ ∈ X.

Namely, k(x, x′) implicitly computes the inner product between the functions k(·, x) and k(·, x′). From this property, k(·, x) can be seen as an implicit representation of x in H. Therefore, k(·, x) is called the feature vector of x, and H the feature space. It is also known that the subspace spanned by the feature vectors {k(·, x) | x ∈ X} is dense in H. This means that any function f in H can be written as the limit of functions of the form f_n = ∑_{i=1}^n c_i k(·, X_i), where c_1, ..., c_n ∈ ℝ and X_1, ..., X_n ∈ X.

For example, positive-definite kernels on the Euclidean space X = ℝ^d include the gaussian kernel k(x, x′) = exp(−‖x − x′‖²_2 / 2σ²) and the Laplace kernel k(x, x′) = exp(−‖x − x′‖_1 / σ), where σ > 0 and ‖·‖_1 denotes the ℓ1 norm. Notably, kernel methods allow X to be a set of structured data such as images, texts, or graphs. In fact, there exist various positive-definite kernels developed for such structured data (Hofmann et al., 2008). Note that the notion of positive-definite kernels is different from that of smoothing kernels in kernel density estimation (Silverman, 1986); a smoothing kernel does not necessarily define an RKHS.
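For illustration, the gaussian and Laplace kernels above can be implemented in a few lines. The following minimal Python sketch (the function names are ours, not from the letter) also checks positive definiteness numerically through the eigenvalues of a Gram matrix.

```python
import numpy as np

def gauss_kernel(x, y, sigma=1.0):
    # k(x, x') = exp(-||x - x'||_2^2 / (2 sigma^2))
    d = np.atleast_1d(x) - np.atleast_1d(y)
    return float(np.exp(-np.dot(d, d) / (2.0 * sigma ** 2)))

def laplace_kernel(x, y, sigma=1.0):
    # k(x, x') = exp(-||x - x'||_1 / sigma)
    d = np.atleast_1d(x) - np.atleast_1d(y)
    return float(np.exp(-np.sum(np.abs(d)) / sigma))

def gram_matrix(kernel, X):
    # Gram matrix G_ij = k(X_i, X_j) for samples X_1, ..., X_n
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(8, 2))                    # n = 8 samples in R^2
    G = gram_matrix(gauss_kernel, X)
    # positive definiteness: the eigenvalues of G are nonnegative (up to rounding)
    print(np.linalg.eigvalsh(G).min() >= -1e-10)   # True
```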

3.2 Kernel Means. We use the kernel k and the RKHS H to represent probability distributions on X. This is the framework of kernel mean embeddings (Smola et al., 2007). Let X be a measurable space and k be measurable and bounded on X.² Let P be an arbitrary probability distribution on X. Then the representation of P in H is defined as the mean of the feature vector,

m_P = ∫ k(·, x) dP(x) ∈ H,    (3.1)

which is called the kernel mean of P.

If k is characteristic, the kernel mean, equation 3.1, preserves all the information about P: a positive-definite kernel k is defined to be characteristic if the mapping P ↦ m_P ∈ H is one-to-one (Fukumizu, Bach, & Jordan, 2004; Fukumizu, Gretton, Sun, & Schölkopf, 2008; Sriperumbudur et al., 2010). This means that the RKHS is rich enough to distinguish among all distributions. For example, the gaussian and Laplace kernels are characteristic. (For conditions for kernels to be characteristic, see Fukumizu, Sriperumbudur, Gretton, & Schölkopf, 2009, and Sriperumbudur et al., 2010.) We assume henceforth that kernels are characteristic.

²k is bounded on X if sup_{x∈X} k(x, x) < ∞.

An important property of the kernel mean, equation 3.1, is the following: by the reproducing property, we have

⟨m_P, f⟩_H = ∫ f(x) dP(x) = E_{X∼P}[f(X)]  for all f ∈ H;    (3.2)

that is, the expectation of any function in the RKHS is given by the inner product between the kernel mean and that function.

3.3 Estimation of Kernel Means. Suppose that distribution P is unknown and that we wish to estimate P from available samples. This can equivalently be done by estimating its kernel mean m_P, since m_P preserves all the information about P.

For example, let X_1, ..., X_n be an independent and identically distributed (i.i.d.) sample from P. Define an estimator of m_P by the empirical mean,

m̂_P = (1/n) ∑_{i=1}^n k(·, X_i).

Then this converges to m_P at a rate ‖m̂_P − m_P‖_H = O_p(n^{−1/2}) (Smola et al., 2007), where O_p denotes the asymptotic order in probability and ‖·‖_H is the norm of the RKHS, ‖f‖_H = √⟨f, f⟩_H for all f ∈ H. Note that this rate is independent of the dimensionality of the space X.
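For illustration, the following minimal Python sketch (ours; gaussian kernel, scalar samples) represents empirical kernel means as sample-weight pairs and computes the squared RKHS distance between two such estimates using only the reproducing property.

```python
import numpy as np

def gauss_gram(X, Y, sigma=0.1):
    # Gram matrix G_ij = exp(-(X_i - Y_j)^2 / (2 sigma^2)) for scalar samples
    X = np.asarray(X, float).reshape(-1, 1)
    Y = np.asarray(Y, float).reshape(1, -1)
    return np.exp(-(X - Y) ** 2 / (2.0 * sigma ** 2))

def kernel_mean_eval(x, X, w, sigma=0.1):
    # evaluate the empirical kernel mean m_hat(x) = sum_i w_i k(x, X_i)
    return float(gauss_gram([x], X, sigma)[0] @ w)

def rkhs_dist2(X, w, Z, v, sigma=0.1):
    # || sum_i w_i k(., X_i) - sum_j v_j k(., Z_j) ||_H^2, via the reproducing property
    return (w @ gauss_gram(X, X, sigma) @ w
            - 2.0 * w @ gauss_gram(X, Z, sigma) @ v
            + v @ gauss_gram(Z, Z, sigma) @ v)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=1000)            # i.i.d. sample from P = N(0, 1)
    w = np.full(1000, 1.0 / 1000)        # uniform weights: the empirical kernel mean
    Z = rng.normal(size=2000)            # a second, larger sample from P
    v = np.full(2000, 1.0 / 2000)
    # the two estimates of m_P should be close in the RKHS norm
    print(rkhs_dist2(X, w, Z, v))
```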

Next we explain kernel Bayes' rule (KBR), which serves as a building block of our filtering algorithm. To this end, we introduce two measurable spaces X and Y. Let p(x, y) be a joint probability on the product space X × Y that decomposes as p(x, y) = p(y|x)p(x). Let π(x) be a prior distribution on X. Then the conditional probability p(y|x) and the prior π(x) define the posterior distribution by Bayes' rule:

p_π(x|y) ∝ p(y|x)π(x).

The assumption here is that the conditional probability p(y|x) is unknown. Instead, we are given an i.i.d. sample (X_1, Y_1), ..., (X_n, Y_n) from the joint probability p(x, y). We wish to estimate the posterior p_π(x|y) using the sample. KBR achieves this by estimating the kernel mean of p_π(x|y).

KBR requires that kernels be defined on X and Y. Let k_X and k_Y be kernels on X and Y, respectively. Define the kernel means of the prior π(x) and the posterior p_π(x|y):

m_π = ∫ k_X(·, x) π(x) dx,    m^π_{X|y} = ∫ k_X(·, x) p_π(x|y) dx.

KBR also requires that m_π be expressed as a weighted sample. Let m̂_π = ∑_{j=1}^ℓ γ_j k_X(·, U_j) be a sample expression of m_π, where ℓ ∈ ℕ, γ_1, ..., γ_ℓ ∈ ℝ, and U_1, ..., U_ℓ ∈ X. For example, suppose U_1, ..., U_ℓ are i.i.d. drawn from π(x); then γ_j = 1/ℓ suffices.

Given the joint sample (X_i, Y_i)_{i=1}^n and the empirical prior mean m̂_π, KBR estimates the kernel posterior mean m^π_{X|y} as a weighted sum of the feature vectors:

m̂^π_{X|y} = ∑_{i=1}^n w_i k_X(·, X_i),    (3.3)

where the weights w = (w_1, ..., w_n)^T ∈ ℝ^n are given by algorithm 1. Here, diag(v) for v ∈ ℝ^n denotes a diagonal matrix with diagonal entries v. The algorithm takes as input (1) vectors k_Y = (k_Y(y, Y_1), ..., k_Y(y, Y_n))^T and m_π = (m̂_π(X_1), ..., m̂_π(X_n))^T ∈ ℝ^n, where m̂_π(X_i) = ∑_{j=1}^ℓ γ_j k_X(X_i, U_j); (2) kernel matrices G_X = (k_X(X_i, X_j)), G_Y = (k_Y(Y_i, Y_j)) ∈ ℝ^{n×n}; and (3) regularization constants ε, δ > 0. The weight vector w = (w_1, ..., w_n)^T ∈ ℝ^n is obtained by matrix computations involving two regularized matrix inversions. Note that these weights can be negative.
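For illustration, a minimal Python sketch of the weight computation is given below. Algorithm 1 itself is not reproduced in this section; the sketch follows the standard form of kernel Bayes' rule in Fukumizu et al. (2013), with two regularized matrix inversions as described above, and the function name kbr_weights is ours.

```python
import numpy as np

def kbr_weights(G_X, G_Y, k_y, m_pi, eps, delta):
    """Kernel Bayes' rule weights (a sketch following Fukumizu et al., 2013).

    G_X, G_Y   : (n, n) kernel matrices on X-samples and Y-samples
    k_y        : (n,) vector (k_Y(y, Y_1), ..., k_Y(y, Y_n)) for the conditioning point y
    m_pi       : (n,) vector (m_pi(X_1), ..., m_pi(X_n)) of prior kernel mean evaluations
    eps, delta : regularization constants
    """
    n = G_X.shape[0]
    I = np.eye(n)
    # first regularized inversion: coefficients representing the prior
    mu = np.linalg.solve(G_X + n * eps * I, m_pi)
    L = np.diag(mu) @ G_Y
    # second regularized inversion (Tikhonov regularization in the square of L)
    w = L @ np.linalg.solve(L @ L + delta * I, np.diag(mu) @ k_y)
    return w   # note: the weights may be negative
```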

Fukumizu et al. (2013) showed that KBR is a consistent estimator of the kernel posterior mean under certain smoothness assumptions: the estimate, equation 3.3, converges to m^π_{X|y} as the sample size goes to infinity, n → ∞, and m̂_π converges to m_π (with ε, δ → 0 at an appropriate speed). (For details, see Fukumizu et al., 2013, and Song et al., 2013.)

3.4 Decoding from Empirical Kernel Means. In general, as shown above, a kernel mean m_P is estimated as a weighted sum of feature vectors,

m̂_P = ∑_{i=1}^n w_i k(·, X_i),    (3.4)

with samples X_1, ..., X_n ∈ X and (possibly negative) weights w_1, ..., w_n ∈ ℝ. Suppose m̂_P is close to m_P, that is, ‖m̂_P − m_P‖_H is small. Then m̂_P is supposed to have accurate information about P, as m_P preserves all the information of P.


How can we decode the information of P from m̂_P? The empirical kernel mean, equation 3.4, has the following property, which is due to the reproducing property of the kernel:

⟨m̂_P, f⟩_H = ∑_{i=1}^n w_i f(X_i)  for all f ∈ H.    (3.5)

Namely, the weighted average of any function in the RKHS is equal to the inner product between the empirical kernel mean and that function. This is analogous to the property, equation 3.2, of the population kernel mean m_P. Let f be any function in H. From these properties, equations 3.2 and 3.5, we have

| E_{X∼P}[f(X)] − ∑_{i=1}^n w_i f(X_i) | = |⟨m_P − m̂_P, f⟩_H| ≤ ‖m_P − m̂_P‖_H ‖f‖_H,

where we used the Cauchy-Schwarz inequality. Therefore, the left-hand side will be close to 0 if the error ‖m_P − m̂_P‖_H is small. This shows that the expectation of f can be estimated by the weighted average ∑_{i=1}^n w_i f(X_i). Note that here f is a function in the RKHS, but the same can also be shown for functions outside the RKHS under certain assumptions (Kanagawa & Fukumizu, 2014). In this way, the estimator of the form 3.4 provides estimators of moments, probability masses on sets, and the density function (if it exists). We explain this in the context of state-space models in section 4.4.

3.5 Kernel Herding. Here we explain kernel herding (Chen et al., 2010), another building block of the proposed filter. Suppose the kernel mean m_P is known. We wish to generate samples x_1, x_2, ..., x_ℓ ∈ X such that the empirical mean m̂_P = (1/ℓ) ∑_{i=1}^ℓ k(·, x_i) is close to m_P, that is, ‖m̂_P − m_P‖_H is small. This should be done only using m_P. Kernel herding achieves this by greedy optimization using the following update equations:

x_1 = argmax_{x∈X} m_P(x),    (3.6)

x_ℓ = argmax_{x∈X} [ m_P(x) − (1/ℓ) ∑_{i=1}^{ℓ−1} k(x, x_i) ]  (ℓ ≥ 2),    (3.7)

where m_P(x) denotes the evaluation of m_P at x (recall that m_P is a function in H).

An intuitive interpretation of this procedure can be given if there is a constant R > 0 such that k(x, x) = R for all x ∈ X (e.g., R = 1 if k is gaussian). Suppose that x_1, ..., x_{ℓ−1} are already calculated. In this case, it can be shown that x_ℓ in equation 3.7 is the minimizer of

E_ℓ = ‖ m_P − (1/ℓ) ∑_{i=1}^ℓ k(·, x_i) ‖_H.    (3.8)

Thus, kernel herding performs greedy minimization of the distance between m_P and the empirical kernel mean m̂_P = (1/ℓ) ∑_{i=1}^ℓ k(·, x_i).

It can be shown that the error E_ℓ of equation 3.8 decreases at a rate at least O(ℓ^{−1/2}) under the assumption that k is bounded (Bach, Lacoste-Julien, & Obozinski, 2012). In other words, the herding samples x_1, ..., x_ℓ provide a convergent approximation of m_P. In this sense, kernel herding can be seen as a (pseudo) sampling method. Note that m_P itself can be an empirical kernel mean of the form 3.4. These properties are important for our resampling algorithm developed in section 4.2.

It should be noted that E_ℓ decreases at a faster rate O(ℓ^{−1}) under a certain assumption (Chen et al., 2010); this is much faster than the rate of i.i.d. samples, O(ℓ^{−1/2}). Unfortunately, this assumption holds only when H is finite dimensional (Bach et al., 2012), and therefore the fast rate of O(ℓ^{−1}) has not been guaranteed for infinite-dimensional cases. Nevertheless, this fast rate motivates the use of kernel herding in the data reduction method in section C.2 in appendix C (we will use kernel herding for two different purposes).
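For illustration, the following Python sketch (ours) runs the herding updates, equations 3.6 and 3.7, for an empirical kernel mean with a gaussian kernel; as a simplifying assumption, the argmax is taken over a finite candidate grid rather than the entire space.

```python
import numpy as np

def gauss_gram(X, Y, sigma=0.1):
    X = np.asarray(X, float).reshape(-1, 1)
    Y = np.asarray(Y, float).reshape(1, -1)
    return np.exp(-(X - Y) ** 2 / (2.0 * sigma ** 2))

def kernel_herding(candidates, X, w, n_samples, sigma=0.1):
    """Greedy herding (equations 3.6 and 3.7) with the argmax restricted to a finite
    candidate set; m_P(x) = sum_i w_i k(x, X_i) is given as an empirical kernel mean."""
    m_P = gauss_gram(candidates, X, sigma) @ w          # m_P(x) on the candidate grid
    K_cand = gauss_gram(candidates, candidates, sigma)  # k(x, x_i) among candidates
    chosen = []
    herd_sum = np.zeros(len(candidates))                # sum_i k(x, x_i) over chosen x_i
    for ell in range(1, n_samples + 1):
        obj = m_P - herd_sum / ell                      # equation 3.7 (3.6 when ell = 1)
        j = int(np.argmax(obj))
        chosen.append(candidates[j])
        herd_sum += K_cand[:, j]
    return np.array(chosen)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=200)                            # samples defining m_P
    w = np.full(200, 1.0 / 200)
    grid = np.linspace(-3, 3, 601)                      # finite approximation of X
    print(kernel_herding(grid, X, w, n_samples=10)[:5])
```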

4 Kernel Monte Carlo Filter

In this section, we present our kernel Monte Carlo filter (KMCF). First, we define notation and review the problem setting in section 4.1. We then describe the algorithm of KMCF in section 4.2. We discuss implementation issues such as hyperparameter selection and computational cost in section 4.3. We explain how to decode the information on the posteriors from the estimated kernel means in section 4.4.

4.1 Notation and Problem Setup. Here we formally define the setup explained in section 1. The notation is summarized in Table 1.

We consider a state-space model (see Figure 1). Let X and Y be measurable spaces, which serve as a state space and an observation space, respectively. Let x_1, ..., x_t, ..., x_T ∈ X be a sequence of hidden states, which follow a Markov process. Let p(x_t | x_{t−1}) denote a transition model that defines this Markov process. Let y_1, ..., y_t, ..., y_T ∈ Y be a sequence of observations. Each observation y_t is assumed to be generated from an observation model p(y_t | x_t), conditioned on the corresponding state x_t. We use the abbreviation y_{1:t} = {y_1, ..., y_t}.

Table 1: Notation.

X                      State space
Y                      Observation space
x_t ∈ X                State at time t
y_t ∈ Y                Observation at time t
p(y_t | x_t)           Observation model
p(x_t | x_{t−1})       Transition model
(X_i, Y_i)_{i=1}^n     State-observation examples
k_X                    Positive-definite kernel on X
k_Y                    Positive-definite kernel on Y
H_X                    RKHS associated with k_X
H_Y                    RKHS associated with k_Y

We consider the filtering problem of estimating the posterior distribution p(x_t | y_{1:t}) for each time t = 1, ..., T. The estimation is to be done online, as each y_t is given. Specifically, we consider the following setting (see also section 1):

1. The observation model p(y_t | x_t) is not known explicitly or even parametrically. Instead, we are given examples of state-observation pairs {(X_i, Y_i)}_{i=1}^n ⊂ X × Y prior to the test phase. The observation model is also assumed to be time homogeneous.

2. Sampling from the transition model p(x_t | x_{t−1}) is possible. Its probabilistic model can be an arbitrary nonlinear, nongaussian distribution, as for standard particle filters. It can further depend on time. For example, control input can be included in the transition model as p(x_t | x_{t−1}) = p(x_t | x_{t−1}, u_t), where u_t denotes control input provided by a user at time t.

Let k_X : X × X → ℝ and k_Y : Y × Y → ℝ be positive-definite kernels on X and Y, respectively. Denote by H_X and H_Y their respective RKHSs. We address the above filtering problem by estimating the kernel means of the posteriors:

m_{x_t | y_{1:t}} = ∫ k_X(·, x_t) p(x_t | y_{1:t}) dx_t ∈ H_X  (t = 1, ..., T).    (4.1)

These preserve all the information of the corresponding posteriors if the kernels are characteristic (see section 3.2). Therefore, the resulting estimates of these kernel means provide us with the information of the posteriors, as explained in section 4.4.

4.2 Algorithm. KMCF iterates three steps of prediction, correction, and resampling for each time t. Suppose that we have just finished the iteration at time t − 1. Then, as shown later, the resampling step yields the following estimator of equation 4.1 at time t − 1:

m̄_{x_{t−1} | y_{1:t−1}} = (1/n) ∑_{i=1}^n k_X(·, X̄_{t−1,i}),    (4.2)

where X̄_{t−1,1}, ..., X̄_{t−1,n} ∈ X. We show one iteration of KMCF that estimates the kernel mean, equation 4.1, at time t (see also Figure 2).

Figure 2: One iteration of KMCF. Here X_1, ..., X_8 and Y_1, ..., Y_8 denote states and observations, respectively, in the state-observation examples (X_i, Y_i)_{i=1}^n (suppose n = 8). (1) Prediction step: the kernel mean of the prior, equation 4.5, is estimated by sampling with the transition model p(x_t | x_{t−1}). (2) Correction step: the kernel mean of the posterior, equation 4.1, is estimated by applying kernel Bayes' rule (see algorithm 1). The estimation makes use of the information of the prior (expressed as m_π = (m̂_{x_t | y_{1:t−1}}(X_i)) ∈ ℝ^8) as well as that of a new observation y_t (expressed as k_Y = (k_Y(y_t, Y_i)) ∈ ℝ^8). The resulting estimate, equation 4.6, is expressed as a weighted sample (w_{t,i}, X_i)_{i=1}^n. Note that the weights may be negative. (3) Resampling step: samples associated with small weights are eliminated, and those with large weights are replicated by applying kernel herding (see algorithm 2). The resulting samples provide an empirical kernel mean, equation 4.7, which will be used in the next iteration.

4.2.1 Prediction Step. The prediction step is as follows. We generate a sample from the transition model for each X̄_{t−1,i} in equation 4.2,

X_{t,i} ∼ p(x_t | x_{t−1} = X̄_{t−1,i})  (i = 1, ..., n).    (4.3)

We then specify a new empirical kernel mean,

m̂_{x_t | y_{1:t−1}} = (1/n) ∑_{i=1}^n k_X(·, X_{t,i}).    (4.4)

This is an estimator of the following kernel mean of the prior,

m_{x_t | y_{1:t−1}} = ∫ k_X(·, x_t) p(x_t | y_{1:t−1}) dx_t ∈ H_X,    (4.5)

where

p(x_t | y_{1:t−1}) = ∫ p(x_t | x_{t−1}) p(x_{t−1} | y_{1:t−1}) dx_{t−1}

is the prior distribution of the current state x_t. Thus, equation 4.4 serves as a prior for the subsequent posterior estimation.

In section 5, we theoretically analyze this sampling procedure in detail and provide justification of equation 4.4 as an estimator of the kernel mean, equation 4.5. We emphasize here that such an analysis is necessary even though the sampling procedure is similar to that of a particle filter: the theory of particle methods does not provide a theoretical justification of equation 4.4 as a kernel mean estimator, since it deals with probabilities as empirical distributions.
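For illustration, the prediction step amounts to one draw from the transition model per resampled point; the sketch below is ours and assumes a scalar state and a gaussian random-walk transition model purely as an example.

```python
import numpy as np

def prediction_step(X_resampled, transition_sample, rng):
    """Equation 4.3: propagate each resampled point through the transition model.
    The prior kernel mean estimate, equation 4.4, is the uniform-weight empirical
    kernel mean over the returned points."""
    return np.array([transition_sample(x, rng) for x in X_resampled])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # assumed transition model (not from the letter): x_t = 0.9 x_{t-1} + gaussian noise
    transition_sample = lambda x_prev, rng: 0.9 * x_prev + rng.normal(scale=0.3)
    X_prev = rng.normal(size=100)              # resampled points from time t-1
    X_pred = prediction_step(X_prev, transition_sample, rng)
    weights_prior = np.full(len(X_pred), 1.0 / len(X_pred))   # uniform weights in eq. 4.4
    print(X_pred[:3], weights_prior[:3])
```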

4.2.2 Correction Step. This step estimates the kernel mean, equation 4.1, of the posterior by using kernel Bayes' rule (see algorithm 1) in section 3.3. This makes use of the new observation y_t, the state-observation examples (X_i, Y_i)_{i=1}^n, and the estimate, equation 4.4, of the prior.

The input of algorithm 1 consists of (1) vectors

k_Y = (k_Y(y_t, Y_1), ..., k_Y(y_t, Y_n))^T ∈ ℝ^n,
m_π = (m̂_{x_t|y_{1:t−1}}(X_1), ..., m̂_{x_t|y_{1:t−1}}(X_n))^T = ( (1/n) ∑_{i=1}^n k_X(X_q, X_{t,i}) )_{q=1}^n ∈ ℝ^n,

which are interpreted as expressions of y_t and m̂_{x_t|y_{1:t−1}} using the sample (X_i, Y_i)_{i=1}^n; (2) kernel matrices G_X = (k_X(X_i, X_j)), G_Y = (k_Y(Y_i, Y_j)) ∈ ℝ^{n×n}; and (3) regularization constants ε, δ > 0. These constants ε, δ, as well as the kernels k_X, k_Y, are hyperparameters of KMCF (we discuss how to choose these parameters later).

Algorithm 1 outputs a weight vector w = (w_1, ..., w_n)^T ∈ ℝ^n. Normalizing these weights, w_t := w / ∑_{i=1}^n w_i, we obtain an estimator of equation 4.1³ as

m̂_{x_t|y_{1:t}} = ∑_{i=1}^n w_{t,i} k_X(·, X_i).    (4.6)

The apparent difference from a particle filter is that the posterior (kernel mean) estimator, equation 4.6, is expressed in terms of the samples X_1, ..., X_n in the training sample (X_i, Y_i)_{i=1}^n, not with the samples from the prior, equation 4.4. This requires that the training samples X_1, ..., X_n cover the support of the posterior p(x_t | y_{1:t}) sufficiently well. If this does not hold, we cannot expect good performance for the posterior estimate. Note that this is also true for any method that deals with the setting of this letter: poverty of training samples in a certain region means that we do not have any information about the observation model p(y_t | x_t) in that region.

³For this normalization procedure, see the discussion in section 4.3.
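For illustration, the sketch below (ours; scalar states, gaussian kernels) constructs the input vectors k_Y and m_π of algorithm 1 from the new observation and the prior points of equation 4.4. The weights of equation 4.6 are then obtained by passing these vectors, together with G_X, G_Y, ε, and δ, to a kernel Bayes' rule routine such as the kbr_weights sketch in section 3.3, followed by normalization.

```python
import numpy as np

def gauss_gram(A, B, sigma):
    A = np.asarray(A, float).reshape(-1, 1)
    B = np.asarray(B, float).reshape(1, -1)
    return np.exp(-(A - B) ** 2 / (2.0 * sigma ** 2))

def kbr_inputs(y_t, Y_train, X_train, X_pred, sigma_x, sigma_y):
    """Build the input vectors of algorithm 1 for the correction step (scalar case).

    k_y[i]  = k_Y(y_t, Y_i)                   -- expression of the new observation
    m_pi[q] = (1/n) sum_i k_X(X_q, X_{t,i})   -- prior kernel mean evaluated at X_q
    """
    k_y = gauss_gram([y_t], Y_train, sigma_y)[0]
    m_pi = gauss_gram(X_train, X_pred, sigma_x).mean(axis=1)
    return k_y, m_pi
```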

4.2.3 Resampling Step. This step applies the update equations 3.6 and 3.7 of kernel herding in section 3.5 to the estimate, equation 4.6. This is to obtain samples X̄_{t,1}, ..., X̄_{t,n} such that

m̄_{x_t|y_{1:t}} = (1/n) ∑_{i=1}^n k_X(·, X̄_{t,i})    (4.7)

is close to equation 4.6 in the RKHS. Our theoretical analysis in section 5 shows that such a procedure can reduce the error of the prediction step at time t + 1.

The procedure is summarized in algorithm 2. Specifically, we generate each X̄_{t,i} by searching for the solution of the optimization problem in equations 3.6 and 3.7 from the finite set of samples {X_1, ..., X_n} in equation 4.6. We allow repetitions in X̄_{t,1}, ..., X̄_{t,n}. We can expect that the resulting equation 4.7 is close to equation 4.6 in the RKHS if the samples X_1, ..., X_n cover the support of the posterior p(x_t | y_{1:t}) sufficiently. This is verified by the theoretical analysis of section 5.3.

Here, searching for the solutions from a finite set reduces the computational cost of kernel herding. It is possible to search over the entire space X if we have sufficient time or if the sample size n is small enough; this depends on the application and the available computational resources. We also note that the number of resampled points need not be n; it depends on how accurately these samples approximate equation 4.6. Thus, a smaller number of samples may be sufficient, in which case we can reduce the computational cost of resampling, as discussed in section 5.2.

The aim of our resampling step is similar to that of the resampling step of a particle filter (see Doucet & Johansen, 2011). Intuitively, the aim is to eliminate samples with very small weights and replicate those with large weights (see Figures 2 and 3). In particle methods, this is realized by generating samples from the empirical distribution defined by a weighted sample (therefore this procedure is called resampling). Our resampling step is a realization of such a procedure in terms of the kernel mean embedding: we generate samples X̄_{t,1}, ..., X̄_{t,n} from the empirical kernel mean, equation 4.6.

Note that the resampling algorithm of particle methods is not appropriate for use with kernel mean embeddings. This is because it assumes that the weights are positive, but our weights in equation 4.6 can be negative, as this equation is a kernel mean estimator. One may apply the resampling algorithm of particle methods by first truncating the samples with negative weights. However, there is no guarantee that samples obtained by this heuristic produce a good approximation of equation 4.6 as a kernel mean, as shown by the experiments in section 6.1. In this sense, the use of kernel herding is more natural, since it generates samples that approximate a kernel mean.
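For illustration, the following Python sketch (ours) implements the herding search of algorithm 2 over the finite candidate set {X_1, ..., X_n}, given only the kernel matrix G_X and the weights of equation 4.6.

```python
import numpy as np

def resampling_step(G_X, w, n_out):
    """Algorithm 2 (sketch): kernel herding over the finite set {X_1, ..., X_n}.

    G_X   : (n, n) kernel matrix k_X(X_i, X_j) of the training states
    w     : weights of the posterior estimate, equation 4.6 (may be negative)
    n_out : number of points to resample (repetitions are allowed)
    Returns indices of the chosen points; equation 4.7 is their uniform-weight mean.
    """
    m_post = G_X @ w                     # m_hat(X_q) = sum_i w_i k_X(X_q, X_i)
    herd_sum = np.zeros_like(m_post)     # sum of k_X(X_q, chosen points)
    idx = []
    for ell in range(1, n_out + 1):
        j = int(np.argmax(m_post - herd_sum / ell))   # equations 3.6 and 3.7
        idx.append(j)
        herd_sum += G_X[:, j]
    return np.array(idx)
```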

4.2.4 Overall Algorithm. We summarize the overall procedure of KMCF in algorithm 3, where p_init denotes a prior distribution for the initial state x_1. For each time t, KMCF takes as input an observation y_t and outputs a weight vector w_t = (w_{t,1}, ..., w_{t,n})^T ∈ ℝ^n. Combined with the samples X_1, ..., X_n in the state-observation examples (X_i, Y_i)_{i=1}^n, these weights provide an estimator, equation 4.6, of the kernel mean of the posterior, equation 4.1.

We first compute the kernel matrices G_X, G_Y (lines 4-5), which are used in algorithm 1 of kernel Bayes' rule (line 15). For t = 1, we generate an i.i.d. sample X_{1,1}, ..., X_{1,n} from the initial distribution p_init (line 8), which provides an estimator of the prior corresponding to equation 4.4. Line 10 is the resampling step at time t − 1, and line 11 is the prediction step at time t. Lines 13 to 16 correspond to the correction step.
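For concreteness, a minimal end-to-end sketch of the KMCF loop is given below. It is our own composition of the earlier sketches (gauss_gram, kbr_weights, and resampling_step are assumed to be in scope), it assumes scalar states and gaussian kernels, it applies resampling at every step for simplicity, and it does not reproduce algorithm 3's exact line numbering.

```python
import numpy as np

# assumes gauss_gram, kbr_weights, and resampling_step from the earlier sketches
# are in scope; this is an illustrative sketch, not the letter's reference code

def kmcf(Y_obs, X_train, Y_train, transition_sample, p_init_sample,
         sigma_x=0.1, sigma_y=0.1, eps=0.01, delta=0.01, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(X_train)
    G_X = gauss_gram(X_train, X_train, sigma_x)      # kernel matrices, computed once
    G_Y = gauss_gram(Y_train, Y_train, sigma_y)
    weights = []
    for t, y_t in enumerate(Y_obs):
        if t == 0:
            X_pred = p_init_sample(n, rng)           # i.i.d. draws from p_init
        else:
            # resampling step at time t-1, then prediction step at time t
            idx = resampling_step(G_X, w_t, n_out=n)
            X_pred = np.array([transition_sample(X_train[j], rng) for j in idx])
        # correction step: kernel Bayes' rule with the prior points X_pred
        k_y = gauss_gram([y_t], Y_train, sigma_y)[0]
        m_pi = gauss_gram(X_train, X_pred, sigma_x).mean(axis=1)
        w_t = kbr_weights(G_X, G_Y, k_y, m_pi, eps, delta)
        w_t = w_t / w_t.sum()                        # normalization (see section 4.3)
        weights.append(w_t)
    return np.array(weights)                         # w_t defines equation 4.6 at each t
```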


4.3 Discussion. The estimation accuracy of KMCF can depend on several factors in practice.

4.3.1 Training Samples. We first note that the training samples (X_i, Y_i)_{i=1}^n should provide the information concerning the observation model p(y_t | x_t). For example, (X_i, Y_i)_{i=1}^n may be an i.i.d. sample from a joint distribution p(x, y) on X × Y, which decomposes as p(x, y) = p(y|x)p(x). Here, p(y|x) is the observation model and p(x) is some distribution on X. The support of p(x) should cover the region where the states x_1, ..., x_T may pass in the test phase, as discussed in section 4.2. For example, this is satisfied when the state space X is compact and the support of p(x) is the entire X.

Note that the training samples (X_i, Y_i)_{i=1}^n can also be non-i.i.d. in practice. For example, we may deterministically select X_1, ..., X_n so that they cover the region of interest. In location estimation problems in robotics, for instance, we may collect location-sensor examples (X_i, Y_i)_{i=1}^n so that the locations X_1, ..., X_n cover the region where location estimation is to be conducted (Quigley et al., 2010).

4.3.2 Hyperparameters. As in other kernel methods in general, the performance of KMCF depends on the choice of its hyperparameters, which are the kernels k_X and k_Y (or parameters in the kernels, e.g., the bandwidth of the gaussian kernel) and the regularization constants δ, ε > 0. We need to define these hyperparameters based on the joint sample (X_i, Y_i)_{i=1}^n before running the algorithm on the test data y_1, ..., y_T. This can be done by cross-validation. Suppose that (X_i, Y_i)_{i=1}^n is given as a sequence from the state-space model. We can then apply two-fold cross-validation by dividing the sequence into two subsequences. If (X_i, Y_i)_{i=1}^n is not a sequence, we can rely on the cross-validation procedure for kernel Bayes' rule (see section 4.2 of Fukumizu et al., 2013).

4.3.3 Normalization of Weights. We found in our preliminary experiments that normalization of the weights (see line 16, algorithm 3) is beneficial to the filtering performance. This may be justified by the following discussion about kernel mean estimators in general. Let us consider a consistent kernel mean estimator m̂_P = ∑_{i=1}^n w_i k(·, X_i) such that lim_{n→∞} ‖m̂_P − m_P‖_H = 0. Then we can show that the sum of the weights converges to 1, lim_{n→∞} ∑_{i=1}^n w_i = 1, under certain assumptions (Kanagawa & Fukumizu, 2014). This can be explained as follows. Recall that the weighted average ∑_{i=1}^n w_i f(X_i) of a function f is an estimator of the expectation ∫ f(x) dP(x). Let f be the function that takes the value 1 for any input, f(x) = 1 for all x ∈ X. Then we have ∑_{i=1}^n w_i f(X_i) = ∑_{i=1}^n w_i and ∫ f(x) dP(x) = 1. Therefore, ∑_{i=1}^n w_i is an estimator of 1. In other words, if the error ‖m̂_P − m_P‖_H is small, then the sum of the weights ∑_{i=1}^n w_i should be close to 1. Conversely, if the sum of the weights is far from 1, it suggests that the estimate m̂_P is not accurate. Based on this theoretical observation, we suppose that normalization of the weights (which makes the sum equal to 1) results in a better estimate.

4.3.4 Time Complexity. For each time t, the naive implementation of algorithm 3 requires a time complexity of O(n³) in the size n of the joint sample (X_i, Y_i)_{i=1}^n. This comes from algorithm 1 in line 15 (kernel Bayes' rule) and algorithm 2 in line 10 (resampling). The complexity O(n³) of algorithm 1 is due to the matrix inversions. Note that one of the inversions, (G_X + nεI_n)^{−1}, can be computed before the test phase, as it does not involve the test data. Algorithm 2 also has complexity O(n³). In section 5.2, we will explain how this cost can be reduced to O(ℓn²) by generating only ℓ < n samples by resampling.

4.3.5 Speeding-Up Methods. In appendix C, we describe two methods for reducing the computational costs of KMCF, both of which need to be applied only prior to the test phase. The first is a low-rank approximation of the kernel matrices G_X, G_Y, which reduces the complexity to O(nr²), where r is the rank of the low-rank matrices. Low-rank approximation works well in practice, since the eigenvalues of a kernel matrix often decay very rapidly. Indeed, this has been shown theoretically for some cases (see Widom, 1963, 1964, and discussions in Bach & Jordan, 2002). The second is a data-reduction method based on kernel herding, which efficiently selects joint subsamples from the training set (X_i, Y_i)_{i=1}^n. Algorithm 3 is then applied based only on those subsamples. The resulting complexity is thus O(r³), where r is the number of subsamples. This method is motivated by the fast convergence rate of kernel herding (Chen et al., 2010).

Both methods require the number r to be chosen, which is either the rank for low-rank approximation or the number of subsamples in data reduction. This determines the trade-off between accuracy and computational time. In practice, there are two ways of selecting the number r. By regarding r as a hyperparameter of KMCF, we can select it by cross-validation; or we can choose r by comparing the resulting approximation error, which is measured in a matrix norm for low-rank approximation and in an RKHS norm for the subsampling method. (For details, see appendix C.)

4.3.6 Transfer Learning Setting. We assumed that the observation model in the test phase is the same as for the training samples. However, this might not hold in some situations. For example, in the vision-based localization problem, the illumination conditions for the test and training phases might be different (e.g., the test is done at night, while the training samples are collected in the morning). Without taking into account such a significant change in the observation model, KMCF would not perform well in practice.

This problem could be addressed by exploiting the framework of transfer learning (Pan & Yang, 2010). This framework aims at situations where the probability distribution that generates the test data is different from that of the training samples. The main assumption is that there exist a small number of examples from the test distribution. Transfer learning then provides a way of combining such test examples and abundant training samples, thereby improving the test performance. The application of transfer learning in our setting remains a topic for future research.

4.4 Estimation of Posterior Statistics. By algorithm 3, we obtain the estimates of the kernel means of the posteriors, equation 4.1, as

m̂_{x_t|y_{1:t}} = ∑_{i=1}^n w_{t,i} k_X(·, X_i)  (t = 1, ..., T).    (4.8)

These contain the information on the posteriors p(x_t | y_{1:t}) (see sections 3.2 and 3.4). We now show how to estimate statistics of the posteriors using these estimates. For ease of presentation, we consider the case X = ℝ^d. Theoretical arguments to justify these operations are provided by Kanagawa and Fukumizu (2014).

4.4.1 Mean and Covariance. Consider the posterior mean ∫ x_t p(x_t|y_{1:t}) dx_t ∈ ℝ^d and the posterior (uncentered) covariance ∫ x_t x_t^T p(x_t|y_{1:t}) dx_t ∈ ℝ^{d×d}. These quantities can be estimated as

∑_{i=1}^n w_{t,i} X_i  (mean),    ∑_{i=1}^n w_{t,i} X_i X_i^T  (covariance).

4.4.2 Probability Mass. Let A ⊂ X be a measurable set with smooth boundary. Define the indicator function I_A(x) by I_A(x) = 1 for x ∈ A and I_A(x) = 0 otherwise. Consider the probability mass ∫ I_A(x) p(x_t|y_{1:t}) dx_t. This can be estimated as ∑_{i=1}^n w_{t,i} I_A(X_i).

4.4.3 Density. Suppose p(x_t|y_{1:t}) has a density function. Let J(x) be a smoothing kernel satisfying ∫ J(x) dx = 1 and J(x) ≥ 0. Let h > 0 and define J_h(x) = h^{−d} J(x/h). Then the density of p(x_t|y_{1:t}) can be estimated as

p̂(x_t|y_{1:t}) = ∑_{i=1}^n w_{t,i} J_h(x_t − X_i),    (4.9)

with an appropriate choice of h.

4.4.4 Mode. The mode may be obtained by finding a point that maximizes equation 4.9. However, this requires a careful choice of h. Instead, we may use X_{i_max} with i_max = argmax_i w_{t,i} as a mode estimate. This is the point in {X_1, ..., X_n} that is associated with the maximum weight in w_{t,1}, ..., w_{t,n}. This point can be interpreted as the point that maximizes equation 4.9 in the limit of h → 0.
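For illustration, these estimators are simple weighted sums over the training states; the sketch below (ours; scalar states, a gaussian smoothing kernel J, and the example set A = (0, 1)) computes them from the weights of equation 4.8.

```python
import numpy as np

def posterior_statistics(X_train, w_t, h=0.1):
    """Estimate posterior statistics from the weighted estimate of equation 4.8
    (scalar states; the smoothing kernel J is taken to be a standard gaussian)."""
    mean = np.sum(w_t * X_train)                                  # section 4.4.1
    cov = np.sum(w_t * X_train ** 2)                              # uncentered covariance
    mass_A = np.sum(w_t * ((X_train > 0.0) & (X_train < 1.0)))    # P(x_t in A), A = (0, 1)
    density = lambda x: np.sum(w_t / h * np.exp(-(x - X_train) ** 2 / (2 * h ** 2))
                               / np.sqrt(2 * np.pi))              # equation 4.9
    mode = X_train[np.argmax(w_t)]                                # section 4.4.4
    return mean, cov, mass_A, density, mode
```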

4.4.5 Other Methods. Other ways of using equation 4.8 include preimage computation and the fitting of gaussian mixtures (see, e.g., Song et al., 2009; Fukumizu et al., 2013; McCalman, O'Callaghan, & Ramos, 2013).

5 Theoretical Analysis

In this section, we analyze the sampling procedure of the prediction step in section 4.2. Specifically, we derive an upper bound on the error of the estimator, equation 4.4. We also discuss in detail how the resampling step in section 4.2 works as a preprocessing step for the prediction step.

To make our analysis clear, we slightly generalize the setting of the prediction step and discuss the sampling and resampling procedures in this setting.

5.1 Error Bound for the Prediction Step. Let X be a measurable space and P be a probability distribution on X. Let p(·|x) be a conditional distribution on X conditioned on x ∈ X. Let Q be a marginal distribution on X defined by Q(B) = ∫ p(B|x) dP(x) for all measurable B ⊂ X. In the filtering setting of section 4, the space X corresponds to the state space, and the distributions P, p(·|x), and Q correspond to the posterior p(x_{t−1}|y_{1:t−1}) at time t − 1, the transition model p(x_t|x_{t−1}), and the prior p(x_t|y_{1:t−1}) at time t, respectively.

Let k_X be a positive-definite kernel on X and H_X be the RKHS associated with k_X. Let m_P = ∫ k_X(·, x) dP(x) and m_Q = ∫ k_X(·, x) dQ(x) be the kernel means of P and Q, respectively. Suppose that we are given an empirical estimate of m_P as

m̂_P = ∑_{i=1}^n w_i k_X(·, X_i),    (5.1)

where w_1, ..., w_n ∈ ℝ and X_1, ..., X_n ∈ X. Considering this weighted sample form enables us to explain the mechanism of the resampling step.

The prediction step can then be cast as the following procedure: for each sample X_i, we generate a new sample X′_i with the conditional distribution, X′_i ∼ p(·|X_i). Then we estimate m_Q by

m̂_Q = ∑_{i=1}^n w_i k_X(·, X′_i),    (5.2)

which corresponds to the estimate, equation 4.4, of the prior kernel mean at time t.

The following theorem provides an upper bound on the error of equation 5.2 and reveals the properties of equation 5.1 that affect the error of the estimator, equation 5.2. The proof is given in appendix A.

Theorem 1. Let m̂_P be a fixed estimate of m_P given by equation 5.1. Define a function θ on X × X by θ(x_1, x_2) = ∫∫ k_X(x′_1, x′_2) dp(x′_1|x_1) dp(x′_2|x_2) for all (x_1, x_2) ∈ X × X, and assume that θ is included in the tensor RKHS H_X ⊗ H_X.⁴ The estimator m̂_Q, equation 5.2, then satisfies

E_{X′_1,...,X′_n}[ ‖m̂_Q − m_Q‖²_{H_X} ]
  ≤ ∑_{i=1}^n w_i² ( E_{X′_i}[k_X(X′_i, X′_i)] − E_{X′_i, X̃′_i}[k_X(X′_i, X̃′_i)] )    (5.3)
  + ‖m̂_P − m_P‖²_{H_X} ‖θ‖_{H_X ⊗ H_X},    (5.4)

where X′_i ∼ p(·|X_i) and X̃′_i is an independent copy of X′_i.

⁴The tensor RKHS H_X ⊗ H_X is the RKHS of a product kernel k_{X×X} on X × X defined as k_{X×X}((x_a, x_b), (x_c, x_d)) = k_X(x_a, x_c) k_X(x_b, x_d) for all (x_a, x_b), (x_c, x_d) ∈ X × X. This space H_X ⊗ H_X consists of smooth functions on X × X if the kernel k_X is smooth (e.g., if k_X is gaussian; see section 4 of Steinwart & Christmann, 2008). In this case, we can interpret this assumption as requiring that θ be smooth as a function on X × X. The function θ can be written as the inner product between the kernel means of the conditional distributions, θ(x_1, x_2) = ⟨m_{p(·|x_1)}, m_{p(·|x_2)}⟩_{H_X}, where m_{p(·|x)} = ∫ k_X(·, x′) dp(x′|x). Therefore, the assumption may further be seen as requiring that the map x ↦ m_{p(·|x)} be smooth. Note that while similar assumptions are common in the literature on kernel mean embeddings (e.g., theorem 5 of Fukumizu et al., 2013), we may relax this assumption by using approximation arguments in learning theory (e.g., theorems 2.2 and 2.3 of Eberts & Steinwart, 2013). This analysis remains a topic for future research.

From theorem 1, we can make the following observations. First, the second term, equation 5.4, of the upper bound shows that the error of the estimator, equation 5.2, is likely to be large if the given estimate, equation 5.1, has large error ‖m̂_P − m_P‖²_{H_X}, which is reasonable to expect.

Second, the first term, equation 5.3, shows that the error of equation 5.2 can be large if the distribution of X′_i (i.e., p(·|X_i)) has large variance. For example, suppose X′_i = f(X_i) + ε_i, where f : X → X is some mapping and ε_i is a random variable with mean 0. Let k_X be the gaussian kernel, k_X(x, x′) = exp(−‖x − x′‖²/2α) for some α > 0. Then E_{X′_i}[k_X(X′_i, X′_i)] − E_{X′_i, X̃′_i}[k_X(X′_i, X̃′_i)] increases from 0 to 1 as the variance of ε_i (i.e., the variance of X′_i) increases from 0 to infinity. Therefore, in this case, equation 5.3 is upper-bounded at worst by ∑_{i=1}^n w_i². Note that E_{X′_i}[k_X(X′_i, X′_i)] − E_{X′_i, X̃′_i}[k_X(X′_i, X̃′_i)] is always nonnegative.⁵

5.1.1 Effective Sample Size. Now let us assume that the kernel k_X is bounded: there is a constant C > 0 such that sup_{x∈X} k_X(x, x) < C. Then the inequality of theorem 1 can be further bounded as

E_{X′_1,...,X′_n}[ ‖m̂_Q − m_Q‖²_{H_X} ] ≤ 2C ∑_{i=1}^n w_i² + ‖m̂_P − m_P‖²_{H_X} ‖θ‖_{H_X ⊗ H_X}.    (5.5)

This bound shows that two quantities are important in the estimate, equation 5.1: (1) the sum of squared weights, ∑_{i=1}^n w_i², and (2) the error ‖m̂_P − m_P‖²_{H_X}. In other words, the error of equation 5.2 can be large if the quantity ∑_{i=1}^n w_i² is large, regardless of the accuracy of equation 5.1 as an estimator of m_P. In fact, the estimator of the form 5.1 can have large ∑_{i=1}^n w_i² even when ‖m̂_P − m_P‖²_{H_X} is small, as shown in section 6.1.

⁵To show this, it is sufficient to prove that ∫∫ k_X(x, x̃) dP(x) dP(x̃) ≤ ∫ k_X(x, x) dP(x) for any probability P. This can be shown as follows: ∫∫ k_X(x, x̃) dP(x) dP(x̃) = ∫∫ ⟨k_X(·, x), k_X(·, x̃)⟩_{H_X} dP(x) dP(x̃) ≤ ∫∫ √k_X(x, x) √k_X(x̃, x̃) dP(x) dP(x̃) ≤ ∫ k_X(x, x) dP(x). Here we used the reproducing property, the Cauchy-Schwarz inequality, and Jensen's inequality.


Figure 3: An illustration of the sampling procedure with (right) and without (left) the resampling algorithm. The left panel corresponds to the kernel mean estimators, equations 5.1 and 5.2, in section 5.1, and the right panel corresponds to equations 5.6 and 5.7 in section 5.2.

The inverse of the sum of the squared weights, 1/∑_{i=1}^n w_i², can be interpreted as the effective sample size (ESS) of the empirical kernel mean, equation 5.1. To explain this, suppose that the weights are normalized, ∑_{i=1}^n w_i = 1. Then the ESS takes its maximum n when the weights are uniform, w_1 = ... = w_n = 1/n. It becomes small when only a few samples have large weights (see the left side of Figure 3). Therefore, the bound, equation 5.5, can be interpreted as follows: to make equation 5.2 a good estimator of m_Q, we need equation 5.1 to be such that the ESS is large and the error ‖m̂_P − m_P‖_{H_X} is small. Here we borrowed the notion of ESS from the literature on particle methods, in which it has also played an important role (see section 2.5.3 of Liu, 2001, and section 3.5 of Doucet & Johansen, 2011).

5.2 Role of Resampling. Based on these arguments, we explain how the resampling step in section 4.2 works as a preprocessing step for the sampling procedure. Consider m̂_P in equation 5.1 as an estimate, equation 4.6, given by the correction step at time t − 1. Then we can think of m̂_Q, equation 5.2, as an estimator of the kernel mean, equation 4.5, of the prior without the resampling step.

The resampling step is an application of kernel herding to m̂_P to obtain samples X̄_1, ..., X̄_n, which provide a new estimate of m_P with uniform weights,

m̄_P = (1/n) ∑_{i=1}^n k_X(·, X̄_i).    (5.6)

The subsequent prediction step is to generate a sample X̄′_i ∼ p(·|X̄_i) for each X̄_i (i = 1, ..., n) and estimate m_Q as

m̄_Q = (1/n) ∑_{i=1}^n k_X(·, X̄′_i).    (5.7)

Theorem 1 gives the following bound for this estimator, corresponding to equation 5.5:

E_{X̄′_1,...,X̄′_n}[ ‖m̄_Q − m_Q‖²_{H_X} ] ≤ 2C/n + ‖m̄_P − m_P‖²_{H_X} ‖θ‖_{H_X ⊗ H_X}.    (5.8)

A comparison of the upper bounds of equations 5.5 and 5.8 implies that the resampling step is beneficial when ∑_{i=1}^n w_i² is large (i.e., the ESS is small) and ‖m̄_P − m̂_P‖_{H_X} is small. The condition on ‖m̄_P − m̂_P‖_{H_X} means that the loss incurred by kernel herding (in terms of the RKHS distance) is small. This implies ‖m̄_P − m_P‖_{H_X} ≈ ‖m̂_P − m_P‖_{H_X}, so the second term of equation 5.8 is close to that of equation 5.5. On the other hand, the first term of equation 5.8 will be much smaller than that of equation 5.5 if ∑_{i=1}^n w_i² ≫ 1/n. In other words, the resampling step improves the accuracy of the sampling procedure by increasing the ESS of the kernel mean estimate m̂_P. This is illustrated in Figure 3.

The above observations lead to the following procedures.

5.2.1 When to Apply Resampling. If ∑_{i=1}^n w_i² is not large, the gain by the resampling step will be small. Therefore, the resampling algorithm should be applied when ∑_{i=1}^n w_i² is above a certain threshold, say 2/n. The same strategy has been commonly used in particle methods (see Doucet & Johansen, 2011).

Also, the bound, equation 5.3, of theorem 1 shows that resampling is not beneficial if the variance of the conditional distribution p(·|x) is very small (i.e., if the state transition is nearly deterministic). In this case, the error of the sampling procedure may increase due to the loss ‖m̄_P − m̂_P‖_{H_X} caused by kernel herding.
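In code, this decision rule is a one-line check on the current weights; the sketch below (ours) uses the example threshold 2/n mentioned above.

```python
import numpy as np

def should_resample(w, factor=2.0):
    # resample when sum_i w_i^2 exceeds factor / n, i.e., when the ESS
    # (1 / sum_i w_i^2) drops below n / factor
    n = len(w)
    return np.sum(w ** 2) > factor / n
```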

5.2.2 Reduction of Computational Cost. Algorithm 2 generates n samples X̄_1, ..., X̄_n with time complexity O(n³). Suppose that the first ℓ samples X̄_1, ..., X̄_ℓ, where ℓ < n, already approximate m̂_P well: ‖(1/ℓ) ∑_{i=1}^ℓ k_X(·, X̄_i) − m̂_P‖_{H_X} is small. We then do not need to generate the rest of the samples X̄_{ℓ+1}, ..., X̄_n; we can make n samples by copying the ℓ samples n/ℓ times (suppose n can be divided by ℓ for simplicity, say n = 2ℓ). Let X̃_1, ..., X̃_n denote these n samples. Then (1/ℓ) ∑_{i=1}^ℓ k_X(·, X̄_i) = (1/n) ∑_{i=1}^n k_X(·, X̃_i) by definition, so ‖(1/n) ∑_{i=1}^n k_X(·, X̃_i) − m̂_P‖_{H_X} is also small. This reduces the time complexity of algorithm 2 to O(ℓn²).

One might think that it is unnecessary to copy the ℓ samples n/ℓ times to make n samples. This is not true, however. Suppose that we just use the first ℓ samples to define m̄_P = (1/ℓ) ∑_{i=1}^ℓ k_X(·, X̄_i). Then the first term of equation 5.8 becomes 2C/ℓ, which is larger than the 2C/n of n samples. This difference involves sampling with the conditional distribution, X̄′_i ∼ p(·|X̄_i). If we use just the ℓ samples, sampling is done ℓ times; if we use the copied n samples, sampling is done n times. Thus, the benefit of making n samples comes from sampling with the conditional distribution many times. This matches the bound of theorem 1, where the first term involves the variance of the conditional distribution.

5.3 Convergence Rates for Resampling. Our resampling algorithm (see algorithm 2) is an approximate version of kernel herding in section 3.5: algorithm 2 searches for the solutions of the update equations 3.6 and 3.7 from a finite set {X_1, ..., X_n} ⊂ X, not from the entire space X. Therefore, existing theoretical guarantees for kernel herding (Chen et al., 2010; Bach et al., 2012) do not apply to algorithm 2. Here we provide a theoretical justification.

5.3.1 Generalized Version. We consider a slightly generalized version, shown in algorithm 4. It takes as input (1) a kernel mean estimator m̂_P of a kernel mean m_P, (2) candidate samples Z_1, ..., Z_N, and (3) the number ℓ of resampled points. It then outputs resampled points X̄_1, ..., X̄_ℓ ∈ {Z_1, ..., Z_N}, which form a new estimator m̄_P = (1/ℓ) ∑_{i=1}^ℓ k_X(·, X̄_i). Here, N is the number of the candidate samples.

Algorithm 4 searches for solutions of the update equations 3.6 and 3.7 from the candidate set {Z_1, ..., Z_N}. Note that these samples Z_1, ..., Z_N can be different from those expressing the estimator m̂_P. If they are the same (i.e., the estimator is expressed as m̂_P = ∑_{i=1}^n w_i k(·, X_i) with n = N and X_i = Z_i, i = 1, ..., n), then algorithm 4 reduces to algorithm 2. In fact, theorem 2 allows m̂_P to be any element in the RKHS.

5.3.2 Convergence Rates in Terms of N and ℓ. Algorithm 4 gives the new estimator m̄_P of the kernel mean m_P. The error of this new estimator, ‖m̄_P − m_P‖_{H_X}, should be close to that of the given estimator, ‖m̂_P − m_P‖_{H_X}. Theorem 2 guarantees this. In particular, it provides convergence rates of ‖m̄_P − m_P‖_{H_X} approaching ‖m̂_P − m_P‖_{H_X} as N and ℓ go to infinity. This theorem follows from theorem 3 in appendix B, which holds under weaker assumptions.

Theorem 2. Let m_P be the kernel mean of a distribution P, and m̂_P be any element in the RKHS H_X. Let Z_1, ..., Z_N be an i.i.d. sample from a distribution with density q. Assume that P has a density function p such that sup_{x∈X} p(x)/q(x) < ∞. Let X̄_1, ..., X̄_ℓ be samples given by algorithm 4 applied to m̂_P with candidate samples Z_1, ..., Z_N. Then, for m̄_P = (1/ℓ) ∑_{i=1}^ℓ k(·, X̄_i), we have

‖m̄_P − m_P‖²_{H_X} = ( ‖m̂_P − m_P‖_{H_X} + O_p(N^{−1/2}) )² + O( ln ℓ / ℓ )  (ℓ, N → ∞).    (5.9)

Our proof in appendix B relies on the fact that kernel herding can be seen as the Frank-Wolfe optimization method (Bach et al., 2012). Indeed, the error O(ln ℓ/ℓ) in equation 5.9 comes from the optimization error of the Frank-Wolfe method after ℓ iterations (Freund & Grigas, 2014, bound 3.2). The error O_p(N^{−1/2}) is due to the approximation of the solution space by a finite set Z_1, ..., Z_N. These errors will be small if N and ℓ are large enough and the error of the given estimator, ‖m̂_P − m_P‖_{H_X}, is relatively large. This is formally stated in corollary 1 below.

Theorem 2 assumes that the candidate samples are i.i.d. with a density q. The assumption sup_{x∈X} p(x)/q(x) < ∞ requires that the support of q contain that of p. This is a formal characterization of the explanation in section 4.2 that the samples X_1, ..., X_N should cover the support of P sufficiently. Note that the statement of theorem 2 also holds for non-i.i.d. candidate samples, as shown in theorem 3 of appendix B.

5.3.3 Convergence Rates as m̂_P Goes to m_P. Theorem 2 provides convergence rates when the estimator m̂_P is fixed. In corollary 1 below, we let m̂_P approach m_P and provide convergence rates for m̄_P of algorithm 4 approaching m_P. This corollary directly follows from theorem 2, since the constant terms in O_p(N^{−1/2}) and O(ln ℓ/ℓ) in equation 5.9 do not depend on m̂_P, which can be seen from the proof in section B.

Corollary 1. Assume that P and Z_1, ..., Z_N satisfy the conditions in theorem 2 for all N. Let m̂_P^{(n)} be an estimator of m_P such that ‖m̂_P^{(n)} − m_P‖_{H_X} = O_p(n^{−b}) as n → ∞ for some constant b > 0.⁶ Let N = ℓ = n^{2b}. Let X̄_1^{(n)}, ..., X̄_ℓ^{(n)} be samples given by algorithm 4 applied to m̂_P^{(n)} with candidate samples Z_1, ..., Z_N. Then, for m̄_P^{(n)} = (1/ℓ) ∑_{i=1}^ℓ k_X(·, X̄_i^{(n)}), we have

‖m̄_P^{(n)} − m_P‖_{H_X} = O_p(n^{−b})  (n → ∞).    (5.10)

Corollary 1 assumes that the estimator m̂_P^{(n)} converges to m_P at a rate O_p(n^{−b}) for some constant b > 0. Then the resulting estimator m̄_P^{(n)} by algorithm 4 also converges to m_P at the same rate O_p(n^{−b}), if we set N = ℓ = n^{2b}. This implies that if we use sufficiently large N and ℓ, the errors O_p(N^{−1/2}) and O(ln ℓ/ℓ) in equation 5.9 can be negligible, as stated earlier. Note that N = ℓ = n^{2b} implies that N and ℓ can be smaller than n, since typically we have b ≤ 1/2 (b = 1/2 corresponds to the convergence rates of parametric models). This provides support for the discussion in section 5.2 (reduction of computational cost).

⁶Here the estimator m̂_P^{(n)} and the candidate samples Z_1, ..., Z_N can be dependent.

5.3.4 Convergence Rates of Sampling after Resampling. We can derive convergence rates of the estimator m̄_Q, equation 5.7, in section 5.2. Here we consider the following construction of m̄_Q, as discussed in section 5.2 (reduction of computational cost). First, apply algorithm 4 to m̂_P^{(n)} and obtain resampled points X̄_1^{(n)}, ..., X̄_ℓ^{(n)} ∈ {Z_1, ..., Z_N}. Then copy these ℓ samples n/ℓ times, and let X̃_1^{(n)}, ..., X̃_n^{(n)} be the resulting n samples. Finally, sample with the conditional distribution, X′^{(n)}_i ∼ p(·|X̃_i^{(n)}) (i = 1, ..., n), and define

m̄_Q^{(n)} = (1/n) ∑_{i=1}^n k_X(·, X′^{(n)}_i).    (5.11)

The following corollary is a consequence of corollary 1, theorem 1, and the bound, equation 5.8. Note that theorem 1 obtains convergence in expectation, which implies convergence in probability.

Corollary 2. Let θ be the function defined in theorem 1, and assume θ ∈ H_X ⊗ H_X. Assume that P and Z_1, ..., Z_N satisfy the conditions in theorem 2 for all N. Let m̂_P^{(n)} be an estimator of m_P such that ‖m̂_P^{(n)} − m_P‖_{H_X} = O_p(n^{−b}) as n → ∞ for some constant b > 0. Let N = ℓ = n^{2b}. Then, for the estimator m̄_Q^{(n)} defined as equation 5.11, we have

‖m̄_Q^{(n)} − m_Q‖_{H_X} = O_p(n^{−min(b, 1/2)})  (n → ∞).

Suppose b ≤ 1/2, which holds for basically any nonparametric estimator. Then corollary 2 shows that the estimator m̄_Q^{(n)} achieves the same convergence rate as the input estimator m̂_P^{(n)}. Note that without resampling, the rate becomes O_p(√(∑_{i=1}^n (w_i^{(n)})²) + n^{−b}), where the weights are given by the input estimator m̂_P^{(n)} = ∑_{i=1}^n w_i^{(n)} k_X(·, X_i^{(n)}) (see the bound, equation 5.5). Thanks to resampling, (the square root of) the sum of the squared weights in the case of corollary 2 becomes 1/√n ≤ 1/√ℓ, which is usually smaller than √(∑_{i=1}^n (w_i^{(n)})²) and is faster than or equal to O_p(n^{−b}). This shows the merit of resampling in terms of convergence rates (see also the discussions in section 5.2).

5.4 Consistency of the Overall Procedure. Here we show the consistency of the overall procedure in KMCF. This is based on corollary 2, which shows the consistency of the resampling step followed by the prediction step, and on theorem 5 of Fukumizu et al. (2013), which guarantees the consistency of kernel Bayes' rule in the correction step. Thus, we consider three steps in the following order: resampling, prediction, and correction. More specifically, we show the consistency of the estimator, equation 4.6, of the posterior kernel mean at time t, given that the one at time t − 1 is consistent.

To state our assumptions, we will need the following functions, θ_pos : Y × Y → ℝ, θ_obs : X × X → ℝ, and θ_tra : X × X → ℝ:

θ_pos(y, ỹ) = ∫∫ k_X(x_t, x̃_t) dp(x_t | y_{1:t−1}, y_t = y) dp(x̃_t | y_{1:t−1}, y_t = ỹ),    (5.12)
θ_obs(x, x̃) = ∫∫ k_Y(y_t, ỹ_t) dp(y_t | x_t = x) dp(ỹ_t | x_t = x̃),    (5.13)
θ_tra(x, x̃) = ∫∫ k_X(x_t, x̃_t) dp(x_t | x_{t−1} = x) dp(x̃_t | x_{t−1} = x̃).    (5.14)

These functions contain the information concerning the distributions involved. In equation 5.12, the distribution p(x_t | y_{1:t−1}, y_t = y) denotes the posterior of the state at time t, given that the observation at time t is y_t = y. Similarly, p(x̃_t | y_{1:t−1}, y_t = ỹ) is the posterior at time t, given that the observation is y_t = ỹ. In equation 5.13, the distributions p(y_t | x_t = x) and p(ỹ_t | x_t = x̃) denote the observation model when the state is x_t = x or x_t = x̃, respectively. In equation 5.14, the distributions p(x_t | x_{t−1} = x) and p(x̃_t | x_{t−1} = x̃) denote the transition model with the previous state given by x_{t−1} = x or x_{t−1} = x̃, respectively.

For simplicity of presentation, we consider here N = ℓ = n for the resampling step. Below, denote by F ⊗ G the tensor product space of two RKHSs F and G.

Corollary 3. Let (X_1, Y_1), ..., (X_n, Y_n) be an i.i.d. sample with a joint density p(x, y) = p(y|x)q(x), where p(y|x) is the observation model. Assume that the posterior p(x_t | y_{1:t}) has a density p and that sup_{x∈X} p(x)/q(x) < ∞. Assume that the functions defined by equations 5.12 to 5.14 satisfy θ_pos ∈ H_Y ⊗ H_Y, θ_obs ∈ H_X ⊗ H_X, and θ_tra ∈ H_X ⊗ H_X, respectively. Suppose that ‖m̂_{x_{t−1}|y_{1:t−1}} − m_{x_{t−1}|y_{1:t−1}}‖_{H_X} → 0 as n → ∞ in probability. Then, for any sufficiently slow decay of the regularization constants ε_n and δ_n of algorithm 1, we have

‖m̂_{x_t|y_{1:t}} − m_{x_t|y_{1:t}}‖_{H_X} → 0  (n → ∞)

in probability.

Corollary 3 follows from theorem 5 of Fukumizu et al. (2013) and corollary 2. The assumptions θ_pos ∈ H_Y ⊗ H_Y and θ_obs ∈ H_X ⊗ H_X are due to theorem 5 of Fukumizu et al. (2013) for the correction step, while the assumption θ_tra ∈ H_X ⊗ H_X is due to theorem 1 for the prediction step, from which corollary 2 follows. As we discussed in note 4 of section 5.1, these essentially assume that the functions θ_pos, θ_obs, and θ_tra are smooth. Theorem 5 of Fukumizu et al. (2013) also requires that the regularization constants ε_n, δ_n of kernel Bayes' rule decay sufficiently slowly as the sample size goes to infinity (ε_n, δ_n → 0 as n → ∞). (For details, see sections 5.2 and 6.2 in Fukumizu et al., 2013.)

It would be more interesting to investigate the convergence rates of the overall procedure. However, this requires a refined theoretical analysis of kernel Bayes' rule, which is beyond the scope of this letter. This is because currently there is no theoretical result on the convergence rates of kernel Bayes' rule as an estimator of a posterior kernel mean (existing convergence results are for the expectation of function values; see theorems 6 and 7 in Fukumizu et al., 2013). This remains a topic for future research.

6 Experiments

This section is devoted to experiments. In section 6.1, we conduct basic experiments on the prediction and resampling steps before going on to the filtering problem; here we consider the problem described in section 5. In section 6.2, the proposed KMCF (see algorithm 3) is applied to synthetic state-space models. Comparisons are made with existing methods applicable to the setting of this letter (see also section 2). In section 6.3, we apply KMCF to the real problem of vision-based robot localization.

In the following, N(μ, σ²) denotes the gaussian distribution with mean μ ∈ ℝ and variance σ² > 0.

6.1 Sampling and Resampling Procedures. The purpose here is to see how the prediction and resampling steps work empirically. To this end, we consider the problem described in section 5 with X = ℝ (see section 5.1 for details). Specifications of the problem are described below.


We will need to evaluate the errors ‖m̂_P − m_P‖_{H_X} and ‖m̂_Q − m_Q‖_{H_X}, so we need to know the true kernel means m_P and m_Q. To this end, we define the distributions and the kernel to be gaussian; this allows us to obtain analytic expressions for m_P and m_Q.

6.1.1 Distributions and Kernel. More specifically, we define the marginal P and the conditional distribution p(·|x) to be gaussian: P = N(0, σ_P²) and p(·|x) = N(x, σ_cond²). Then the resulting Q = ∫ p(·|x) dP(x) also becomes gaussian: Q = N(0, σ_P² + σ_cond²). We define k_X to be the gaussian kernel, k_X(x, x′) = exp(−(x − x′)²/2γ²). We set σ_P = σ_cond = γ = 0.1.

6.1.2 Kernel Means. Due to the convolution theorem of gaussian functions, the kernel means m_P = ∫ k_X(·, x) dP(x) and m_Q = ∫ k_X(·, x) dQ(x) can be analytically computed:

m_P(x) = √(γ² / (σ_P² + γ²)) exp( −x² / (2(γ² + σ_P²)) ),
m_Q(x) = √(γ² / (σ_P² + σ_cond² + γ²)) exp( −x² / (2(σ_P² + σ_cond² + γ²)) ).

6.1.3 Empirical Estimates. We artificially defined an estimate m̂_P = ∑_{i=1}^n w_i k_X(·, X_i) as follows. First, we generated n = 100 samples X_1, ..., X_100 from a uniform distribution on [−A, A] with some A > 0 (specified below). We computed the weights w_1, ..., w_n by solving the optimization problem

min_{w∈ℝ^n} ‖ ∑_{i=1}^n w_i k_X(·, X_i) − m_P ‖²_H + λ‖w‖²,

and then applied normalization so that ∑_{i=1}^n w_i = 1. Here, λ > 0 is a regularization constant, which allows us to control the trade-off between the error ‖m̂_P − m_P‖²_{H_X} and the quantity ∑_{i=1}^n w_i² = ‖w‖². If λ is very small, the resulting m̂_P becomes accurate (‖m̂_P − m_P‖²_{H_X} is small) but has large ∑_{i=1}^n w_i². If λ is large, the error ‖m̂_P − m_P‖²_{H_X} may not be very small, but ∑_{i=1}^n w_i² becomes small. This enables us to see how the error ‖m̂_Q − m_Q‖²_{H_X} changes as we vary these quantities.

6.1.4 Comparison. Given m̂_P = ∑_{i=1}^n w_i k_X(·, X_i), we wish to estimate the kernel mean m_Q. We compare three estimators:

• woRes: Estimate m_Q without resampling. Generate samples X′_i ∼ p(·|X_i) to produce the estimate m̂_Q = ∑_{i=1}^n w_i k_X(·, X′_i). This corresponds to the estimator discussed in section 5.1.

• Res-KH: First apply the resampling algorithm of algorithm 2 to m̂_P, yielding X̄_1, ..., X̄_n. Then generate X̄′_i ∼ p(·|X̄_i) for each X̄_i, giving the estimate m̄_Q = (1/n) ∑_{i=1}^n k(·, X̄′_i). This is the estimator discussed in section 5.2.

• Res-Trunc: Instead of algorithm 2, first truncate the negative weights in w_1, ..., w_n to 0, and apply normalization to make the sum of the weights 1. Then apply the multinomial resampling algorithm of particle methods, and estimate m_Q as in Res-KH.

Figure 4: Results of the experiments from section 6.1. (Top left and right) Sample-weight pairs of m̂_P = ∑_{i=1}^n w_i k_X(·, X_i) and m̂_Q = ∑_{i=1}^n w_i k(·, X′_i). (Middle left and right) Histogram of samples X̄_1, ..., X̄_n generated by algorithm 2 and that of samples X̄′_1, ..., X̄′_n from the conditional distribution. (Bottom left and right) Histogram of samples generated with multinomial resampling after truncating negative weights and that of samples from the conditional distribution.

bull Res-Trunc Instead of algorithm 2 first truncate negative weights inw1 wn to be 0 and apply normalization to make the sum of theweights to be 1 Then apply the multinomial resampling algorithmof particle methods and estimate mQ as Res-KH

615 Demonstration Before starting quantitative comparisons wedemonstrate how the above estimators work Figure 4 shows demonstra-tion results with A = 1 First note that for mP = sumn

i=1 wik(middot Xi) samplesassociated with large weights are located around the mean of P as thestandard deviation of P is relatively small σP = 01 Note also that some

414 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

of the weights are negative In this example the error of mP is very smallmP minus mP2

HX= 849e minus 10 while that of the estimate mQ given by woRes is

mQ minus mQ2HX

= 0125 This shows that even if mP minus mP2HX

is very small

the resulting mQ minus mQ2HX

may not be small as implied by theorem 1 andthe bound equation 55

We can observe the following. First, algorithm 2 successfully discarded samples associated with very small weights. Almost all the generated samples X̄_1, ..., X̄_n are located in [−2σ_P, 2σ_P], where σ_P is the standard deviation of P. The error is ‖m̄_P − m_P‖²_{H_X} = 4.74e-5, which is greater than ‖m̂_P − m_P‖²_{H_X}. This is due to the additional error caused by the resampling algorithm. Note that the resulting estimate m̄_Q has error ‖m̄_Q − m_Q‖²_{H_X} = 0.00827. This is much smaller than that of the estimate m̂_Q by woRes, showing the merit of the resampling algorithm.

Res-Trunc first truncated the negative weights in w_1, ..., w_n. Let us look at the region where the density of P is very small, that is, the region outside [−2σ_P, 2σ_P]. We can observe that the absolute values of the weights are very small in this region. Note that there exist positive and negative weights. These weights maintain balance such that the amounts of positive and negative values are almost the same. Therefore, the truncation of the negative weights breaks this balance. As a result, the amount of the positive weights surpasses the amount needed to represent the density of P. This can be seen from the histogram for Res-Trunc: some of the samples X̄_1, ..., X̄_n generated by Res-Trunc are located in the region where the density of P is very small. Thus, the resulting error ‖m̄_P − m_P‖²_{H_X} = 0.0538 is much larger than that of Res-KH. This demonstrates why the resampling algorithm of particle methods is not appropriate for kernel mean embeddings, as discussed in section 4.2.

6.1.6 Effects of the Sum of Squared Weights. The purpose here is to see how the error ‖m̂_Q − m_Q‖²_{H_X} changes as we vary the quantity ∑_{i=1}^n w_i² (recall that the bound, equation 5.5, indicates that ‖m̂_Q − m_Q‖²_{H_X} increases as ∑_{i=1}^n w_i² increases). To this end, we made m̂_P = ∑_{i=1}^n w_i k_X(·, X_i) for several values of the regularization constant λ, as described above. For each λ, we constructed m̂_P and estimated m_Q using each of the three estimators above. We repeated this 20 times for each λ and averaged the values of ‖m̂_P − m_P‖²_{H_X}, ∑_{i=1}^n w_i², and the errors ‖m̂_Q − m_Q‖²_{H_X} by the three estimators. Figure 5 shows these results, where both axes are in the log scale. Here we used A = 5 for the support of the uniform distribution.⁷ The results are summarized as follows:

⁷This enables us to maintain the values of ‖m̂_P − m_P‖²_{H_X} at almost the same amount while changing the values of ∑_{i=1}^n w_i².

Figure 5: Results of synthetic experiments for the sampling and resampling procedure in section 6.1. Vertical axis: errors in the squared RKHS norm. Horizontal axis: values of ∑_{i=1}^n w_i² for different m̂_P. Black: the error of m̂_P (‖m̂_P − m_P‖²_{H_X}). Blue, green, and red: the errors on m_Q by woRes, Res-KH, and Res-Trunc, respectively.

• The error of woRes (blue) increases proportionally to the amount of ∑_{i=1}^n w_i². This matches the bound, equation 5.5.

• The error of Res-KH is not affected by ∑_{i=1}^n w_i². Rather, it changes in parallel with the error of m̂_P. This is explained by the discussions in section 5.2 on how our resampling algorithm improves the accuracy of the sampling procedure.

• Res-Trunc is worse than Res-KH, especially for large ∑_{i=1}^n w_i². This is also explained by the bound, equation 5.8. Here m̄_P is the one given by Res-Trunc, so the error ‖m̄_P − m_P‖_{H_X} can be large due to the truncation of the negative weights, as shown in the demonstration results. This makes the resulting error ‖m̄_Q − m_Q‖_{H_X} large.

Note that mP and mQ are different kernel means so it can happen thatthe errors mQ minus mQHX

by Res-KH are less than mp minus mPHX as in

Figure 5


6.2 Filtering with Synthetic State-Space Models. Here we apply KMCF to synthetic state-space models. Comparisons were made with the following methods.

kNN-PF (Vlassis et al., 2002). This method uses k-NN-based conditional density estimation (Stone, 1977) for learning the observation model. First, it estimates the conditional density of the inverse direction, $p(x|y)$, from the training sample $\{(X_i, Y_i)\}$. The learned conditional density is then used as an alternative for the likelihood $p(y_t|x_t)$; this is a heuristic to deal with high-dimensional $y_t$. Then it applies a particle filter (PF) based on the approximated observation model and the given transition model $p(x_t|x_{t-1})$.

GP-PF (Ferris et al., 2006). This method learns $p(y_t|x_t)$ from $\{(X_i, Y_i)\}$ with gaussian process (GP) regression. Then a particle filter is applied based on the learned observation model and the transition model. We used the open source code for GP regression in this experiment, so comparison in computational time is omitted for this method.⁸

KBR filter (Fukumizu et al., 2011, 2013). This method is also based on kernel mean embeddings, as is KMCF. It applies kernel Bayes' rule (KBR) in posterior estimation, using the joint sample $\{(X_i, Y_i)\}$. This method assumes that there also exist training samples for the transition model. Thus, in the following experiments, we additionally drew training samples for the transition model. Fukumizu et al. (2011, 2013) showed that this method outperforms extended and unscented Kalman filters when a state-space model has strong nonlinearity (in that experiment, these Kalman filters were given the full knowledge of the state-space model). We use this method as a baseline.
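To make the second comparison method concrete, the following is a minimal numpy sketch (ours, not the authors' or Ferris et al.'s implementation) of the idea behind GP-PF: fit GP regression to the state-observation examples and weight particles by the gaussian predictive density of the observed $y_t$. The kernel, its fixed hyperparameters, and the one-dimensional setting are our own assumptions.

```python
import numpy as np

def rbf(A, B, ell=1.0):
    d2 = (A[:, None] - B[None, :]) ** 2
    return np.exp(-0.5 * d2 / ell ** 2)

class GPObservationModel:
    """GP regression y = f(x) + noise fitted to state-observation pairs (X_i, Y_i)."""
    def __init__(self, X, Y, ell=1.0, noise=0.1):
        self.X, self.ell, self.noise = X, ell, noise
        K = rbf(X, X, ell) + noise ** 2 * np.eye(len(X))
        self.L = np.linalg.cholesky(K)
        self.alpha = np.linalg.solve(self.L.T, np.linalg.solve(self.L, Y))

    def log_likelihood(self, x_particles, y_obs):
        # GP predictive mean/variance at each particle, then gaussian log-density of y_obs
        Ks = rbf(x_particles, self.X, self.ell)
        mean = Ks @ self.alpha
        v = np.linalg.solve(self.L, Ks.T)
        var = 1.0 - np.sum(v ** 2, axis=0) + self.noise ** 2
        return -0.5 * (np.log(2 * np.pi * var) + (y_obs - mean) ** 2 / var)

# one correction step of a particle filter using the learned observation model
rng = np.random.default_rng(1)
X_train = rng.normal(size=100)
Y_train = X_train + 0.1 * rng.normal(size=100)    # e.g., data resembling SSM 1a
model = GPObservationModel(X_train, Y_train)

particles = rng.normal(size=500)                  # particles from the transition model
logw = model.log_likelihood(particles, y_obs=0.3)
weights = np.exp(logw - logw.max())
weights /= weights.sum()
```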

We used the state-space models defined in Table 2, where "SSM" stands for "state-space model." In Table 2, $u_t$ denotes a control input at time t, and $v_t$ and $w_t$ denote independent gaussian noise, $v_t, w_t \sim N(0, 1)$. $W_t$ denotes 10-dimensional gaussian noise, $W_t \sim N(0, I_{10})$. We generated each control $u_t$ randomly from the gaussian distribution N(0, 1).

The state and observation spaces for SSMs 1a, 1b, 2a, 2b, 4a, 4b are defined as $\mathcal{X} = \mathcal{Y} = \mathbb{R}$; for SSMs 3a, 3b, $\mathcal{X} = \mathbb{R}$, $\mathcal{Y} = \mathbb{R}^{10}$. The models in SSMs 1a, 2a, 3a, 4a and SSMs 1b, 2b, 3b, 4b with the same number (e.g., 1a and 1b) are almost the same; the difference is whether $u_t$ exists in the transition model. Prior distributions for the initial state $x_1$ for SSMs 1a, 1b, 2a, 2b, 3a, 3b are defined as $p_{\mathrm{init}} = N(0, 1/(1 - 0.9^2))$, and those for 4a, 4b are defined as a uniform distribution on $[-3, 3]$.

SSMs 1a and 1b are linear gaussian models. SSMs 2a and 2b are the so-called stochastic volatility models. Their transition models are the same as those of SSMs 1a and 1b. The observation model has strong nonlinearity, and the noise $w_t$ is multiplicative.

8 http://www.gaussianprocess.org/gpml/code/matlab/doc/


Table 2: State-Space Models for Synthetic Experiments.

SSM 1a. Transition: $x_t = 0.9 x_{t-1} + v_t$. Observation: $y_t = x_t + w_t$.
SSM 1b. Transition: $x_t = 0.9 x_{t-1} + \frac{1}{\sqrt{2}}(u_t + v_t)$. Observation: $y_t = x_t + w_t$.
SSM 2a. Transition: $x_t = 0.9 x_{t-1} + v_t$. Observation: $y_t = 0.5 \exp(x_t/2)\, w_t$.
SSM 2b. Transition: $x_t = 0.9 x_{t-1} + \frac{1}{\sqrt{2}}(u_t + v_t)$. Observation: $y_t = 0.5 \exp(x_t/2)\, w_t$.
SSM 3a. Transition: $x_t = 0.9 x_{t-1} + v_t$. Observation: $y_t = 0.5 \exp(x_t/2)\, W_t$.
SSM 3b. Transition: $x_t = 0.9 x_{t-1} + \frac{1}{\sqrt{2}}(u_t + v_t)$. Observation: $y_t = 0.5 \exp(x_t/2)\, W_t$.
SSM 4a. Transition: $a_t = x_{t-1} + \sqrt{2}\, v_t$; $x_t = a_t$ if $|a_t| \le 3$, $x_t = -3$ otherwise. Observation: $b_t = x_t + w_t$; $y_t = b_t$ if $|b_t| \le 3$, $y_t = b_t - 6 b_t/|b_t|$ otherwise.
SSM 4b. Transition: $a_t = x_{t-1} + u_t + v_t$; $x_t = a_t$ if $|a_t| \le 3$, $x_t = -3$ otherwise. Observation: $b_t = x_t + w_t$; $y_t = b_t$ if $|b_t| \le 3$, $y_t = b_t - 6 b_t/|b_t|$ otherwise.

SSMs 3a and 3b are almost the same as SSMs 2a and 2b. The difference is that the observation $y_t$ is 10-dimensional, as $W_t$ is 10-dimensional gaussian noise. SSMs 4a and 4b are more complex than the other models. Both the transition and observation models have strong nonlinearities: states and observations located around the edges of the interval $[-3, 3]$ may abruptly jump to distant places.
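As a concrete illustration of the experimental setup, the following sketch (ours) simulates one of the models in Table 2 (SSM 2a) to produce state-observation examples and a test sequence, and evaluates a point estimate by the RMSE criterion used below; the placeholder estimator and the sample sizes are our own choices.

```python
import numpy as np

def simulate_ssm2a(T, rng):
    """Simulate SSM 2a: x_t = 0.9 x_{t-1} + v_t,  y_t = 0.5 exp(x_t / 2) w_t."""
    x = np.empty(T)
    y = np.empty(T)
    x_prev = rng.normal(0.0, np.sqrt(1.0 / (1.0 - 0.9 ** 2)))  # prior p_init
    for t in range(T):
        x[t] = 0.9 * x_prev + rng.normal()
        y[t] = 0.5 * np.exp(x[t] / 2.0) * rng.normal()
        x_prev = x[t]
    return x, y

rng = np.random.default_rng(0)
X_train, Y_train = simulate_ssm2a(1000, rng)   # state-observation examples (X_i, Y_i)
x_test, y_test = simulate_ssm2a(100, rng)      # test sequence; x_test is hidden at test time

# RMSE of point estimates x_hat (here a trivial placeholder estimator)
x_hat = np.zeros_like(x_test)
rmse = np.sqrt(np.mean((x_test - x_hat) ** 2))
print("RMSE:", rmse)
```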

For each model, we generated the training samples $\{(X_i, Y_i)\}_{i=1}^n$ by simulating the model. Test data $\{(x_t, y_t)\}_{t=1}^T$ were also generated by independent simulation (recall that $x_t$ is hidden for each method). The length of the test sequence was set as T = 100. We fixed the number of particles in kNN-PF and GP-PF to 5000; in preliminary experiments, we did not observe any improvements even when more particles were used. For the same reason, we fixed the size of transition examples for the KBR filter to 1000. Each method estimated the ground-truth states $x_1, \ldots, x_T$ by estimating the posterior means $\int x_t\, p(x_t|y_{1:t})\,dx_t$ $(t = 1, \ldots, T)$. The performance was evaluated with the RMSE (root mean squared error) of the point estimates, defined as $\mathrm{RMSE} = \sqrt{\frac{1}{T}\sum_{t=1}^T (x_t - \hat{x}_t)^2}$, where $\hat{x}_t$ is the point estimate.

For KMCF and the KBR filter, we used gaussian kernels for each of $\mathcal{X}$ and $\mathcal{Y}$ (and also for controls in the KBR filter). We determined the hyperparameters of each method by two-fold cross-validation, dividing the training data into two sequences. The hyperparameters in the GP regressor for GP-PF were optimized by maximizing the marginal likelihood of the training data. To reduce the costs of the resampling step of KMCF, we used the method discussed in section 5.2 with $\ell = 50$. We also used the low-rank approximation method (see algorithm 5) and the subsampling method (see algorithm 6) in appendix C to reduce the computational costs of KMCF. Specifically, we used r = 10, 20 (rank of low-rank matrices) for algorithm 5 (described as KMCF-low10 and KMCF-low20 in the results below), and r = 50, 100 (number of subsamples) for algorithm 6 (described as KMCF-sub50 and KMCF-sub100). We repeated the experiments 20 times for each of the different training sample sizes n.

Figure 6: RMSE of the synthetic experiments in section 6.2. The state-space models of these figures have no control in their transition models.

Figure 6 shows the results in RMSE for SSMs 1a, 2a, 3a, 4a, and Figure 7 shows those for SSMs 1b, 2b, 3b, 4b. Figure 8 describes the results in computational time for SSMs 1a and 1b; the results for the other models are similar, so we omit them. We do not show the results of KMCF-low10 in Figures 6 and 7, since they were numerically unstable and gave very large RMSEs.

GP-PF performed the best for SSMs 1a and 1b. This may be because these models fit the assumption of GP regression, as their noise is additive gaussian. For the other models, however, GP-PF performed poorly: the observation models of these models have strong nonlinearities, and the noise is not additive gaussian. For these models, KMCF performed the best or competitively with the other methods. This indicates that KMCF successfully exploits the state-observation examples $\{(X_i, Y_i)\}_{i=1}^n$ in dealing with the complicated observation models. Recall that our focus has been on situations where the relations between states and observations are so complicated that the observation model is not known; the results indicate that KMCF is promising for such situations. On the other hand, the KBR filter performed worse than KMCF for most of the models. The KBR filter also uses kernel Bayes' rule, as does KMCF. The difference is that KMCF makes use of the transition models directly by sampling, while the KBR filter must learn the transition models from training data for state transitions. This indicates that the incorporation of the knowledge expressed in the transition model is very important for the filtering performance. This can also be seen by comparing Figures 6 and 7. The performance of the methods other than the KBR filter improved for SSMs 1b, 2b, 3b, 4b, compared to the performance for the corresponding models in SSMs 1a, 2a, 3a, 4a. Recall that SSMs 1b, 2b, 3b, 4b include control $u_t$ in their transition models. The information of control input is helpful for filtering in general. Thus, the improvements suggest that KMCF, kNN-PF, and GP-PF successfully incorporate the information of controls; they achieve this by sampling with $p(x_t|x_{t-1}, u_t)$. On the other hand, the KBR filter must learn the transition model $p(x_t|x_{t-1}, u_t)$; this can be harder than learning the transition model $p(x_t|x_{t-1})$, which has no control input.

Figure 7: RMSE of synthetic experiments in section 6.2. The state-space models of these figures include control $u_t$ in their transition models.

Figure 8: Computation time of synthetic experiments in section 6.2. (Left) SSM 1a. (Right) SSM 1b.

We next compare computation time (see Figure 8). KMCF was competitive with or even slower than the KBR filter. This is due to the resampling step in KMCF. The speed-up methods (KMCF-low10, KMCF-low20, KMCF-sub50, and KMCF-sub100) successfully reduced the costs of KMCF. KMCF-low10 and KMCF-low20 scaled linearly to the sample size n; this matches the fact that algorithm 5 reduces the costs of kernel Bayes' rule to O(nr²). The costs of KMCF-sub50 and KMCF-sub100 remained almost the same over the different sample sizes. This is because they reduce the sample size itself from n to r, so the costs are reduced to O(r³) (see algorithm 6). KMCF-sub50 and KMCF-sub100 are competitive to kNN-PF, which is fast as it only needs kNN searches to deal with the training sample $\{(X_i, Y_i)\}_{i=1}^n$. In Figures 6 and 7, KMCF-low20 and KMCF-sub100 produced results competitive to KMCF for SSMs 1a, 2a, 4a, 1b, 2b, 4b. Thus, for these models, such methods reduce the computational costs of KMCF without losing much accuracy. KMCF-sub50 was slightly worse than KMCF-sub100. This indicates that the number of subsamples cannot be reduced to this extent if we wish to maintain accuracy. For SSMs 3a and 3b, the performance of KMCF-low20 and KMCF-sub100 was worse than KMCF, in contrast to the performance for the other models. The difference of SSMs 3a and 3b from the other models is that the observation space is 10-dimensional, $\mathcal{Y} = \mathbb{R}^{10}$. This suggests that if the dimension is high, r needs to be large to maintain accuracy (recall that r is the rank of low-rank matrices in algorithm 5 and the number of subsamples in algorithm 6). This is also implied by the experiments in section 6.3.

6.3 Vision-Based Mobile Robot Localization. We applied KMCF to the problem of vision-based mobile robot localization (Vlassis et al., 2002; Wolf et al., 2005; Quigley et al., 2010). We consider a robot moving in a building. The robot takes images with its vision camera as it moves. Thus the vision images form a sequence of observations $y_1, \ldots, y_T$ in time series; each $y_t$ is an image. The robot does not know its positions in the building; we define the state $x_t$ as the robot's position at time t. The robot wishes to estimate its position $x_t$ from the sequence of its vision images $y_1, \ldots, y_t$. This can be done by filtering, that is, by estimating the posteriors $p(x_t|y_1, \ldots, y_t)$ $(t = 1, \ldots, T)$. This is the robot localization problem. It is fundamental in robotics as a basis for more involved applications such as navigation and reinforcement learning (Thrun, Burgard, & Fox, 2005).

The state-space model is defined as follows. The observation model $p(y_t|x_t)$ is the conditional distribution of images given position, which is very complicated and considered unknown. We need to assume position-image examples $\{(X_i, Y_i)\}_{i=1}^n$; these samples are given in the data set described below. The transition model $p(x_t|x_{t-1}) = p(x_t|x_{t-1}, u_t)$ is the conditional distribution of the current position given the previous one. This involves a control input $u_t$ that specifies the movement of the robot. In the data set we use, the control is given as odometry measurements. Thus we define $p(x_t|x_{t-1}, u_t)$ as the odometry motion model, which is fairly standard in robotics (Thrun et al., 2005). Specifically, we used the algorithm described in table 5.6 of Thrun et al. (2005), with all of its parameters fixed to 0.1. The prior $p_{\mathrm{init}}$ of the initial position $x_1$ is defined as a uniform distribution over the samples $X_1, \ldots, X_n$ in $\{(X_i, Y_i)\}_{i=1}^n$.

As a kernel $k_Y$ for observations (images), we used the spatial pyramid matching kernel of Lazebnik et al. (2006). This is a positive-definite kernel developed in the computer vision community and is also fairly standard. Specifically, we set the parameters of this kernel as suggested in Lazebnik et al. (2006). This gives a 4200-dimensional histogram for each image. We defined the kernel $k_X$ for states (positions) as gaussian. Here the state space is the four-dimensional space $\mathcal{X} = \mathbb{R}^4$: two dimensions for location and the rest for the orientation of the robot.⁹
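As a small illustration of this state representation, the sketch below (ours) builds the four-dimensional state vector by projecting the orientation onto the unit circle (see footnote 9) and evaluates a gaussian kernel on it. The bandwidth and the example coordinates are our own choices; in the experiments the bandwidth was selected by cross-validation.

```python
import numpy as np

def to_state_vector(loc_xy, theta):
    # 4D state: 2D location plus orientation mapped onto the unit circle
    return np.array([loc_xy[0], loc_xy[1], np.cos(theta), np.sin(theta)])

def k_X(s1, s2, sigma=1.0):
    # gaussian kernel on the 4D state space
    return np.exp(-np.sum((s1 - s2) ** 2) / (2.0 * sigma ** 2))

s_a = to_state_vector((1.2, -0.5), theta=0.1)
s_b = to_state_vector((1.0, -0.4), theta=2 * np.pi - 0.1)  # nearby angle across the wrap-around
print(k_X(s_a, s_b))   # large value: the two poses are treated as close
```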

The data set we used is the COLD database (Pronobis & Caputo, 2009), which is publicly available. Specifically, we used the data set Freiburg, Part A, Path 1, cloudy. This data set consists of three similar trajectories of a robot moving in a building, each of which provides position-image pairs $\{(x_t, y_t)\}_{t=1}^T$. We used two trajectories for training and validation and the rest for test. We made state-observation examples $\{(X_i, Y_i)\}_{i=1}^n$ by randomly subsampling the pairs in the trajectory for training. Note that the difficulty of localization may depend on the time interval (i.e., the interval between t and t−1 in sec). Therefore, we made three test sets (and training samples for state transitions in the KBR filter) with different time intervals: 2.27 sec (T = 168), 4.54 sec (T = 84), and 6.81 sec (T = 56).

9 We projected the robot's orientation in $[0, 2\pi]$ onto the unit circle in $\mathbb{R}^2$.

In these experiments, we compared KMCF with three methods: kNN-PF, the KBR filter, and the naive method (NAI) defined below. For the KBR filter, we also defined the gaussian kernel on the control $u_t$, that is, on the difference of odometry measurements at times t−1 and t. The naive method (NAI) estimates the state $x_t$ as a point $X_j$ in the training set $\{(X_i, Y_i)\}$ such that the corresponding observation $Y_j$ is closest to the observation $y_t$. We performed this as a baseline. We also used the spatial pyramid matching kernel for these methods (for kNN-PF and NAI, as a similarity measure for the nearest neighbor search). We did not compare with GP-PF, since it assumes that observations are real vectors and thus cannot be applied to this problem straightforwardly. We determined the hyperparameters in each method by cross-validation. To reduce the cost of the resampling step in KMCF, we used the method discussed in section 5.2 with $\ell = 100$. The low-rank approximation method (see algorithm 5) and the subsampling method (see algorithm 6) were also applied to reduce the computational costs of KMCF. Specifically, we set r = 50, 100 for algorithm 5 (described as KMCF-low50 and KMCF-low100 in the results below) and r = 150, 300 for algorithm 6 (KMCF-sub150 and KMCF-sub300).

Note that in this problem, the posteriors $p(x_t|y_{1:t})$ can be highly multimodal. This is because similar images appear in distant locations. Therefore, the posterior mean $\int x_t\, p(x_t|y_{1:t})\,dx_t$ is not appropriate for point estimation of the ground-truth position $x_t$. Thus, for KMCF and the KBR filter, we employed the heuristic for mode estimation explained in section 4.4. For kNN-PF, we used a particle with maximum weight for the point estimation. We evaluated the performance of each method by the RMSE of location estimates. We ran each experiment 20 times for each training set of different size.

6.3.1 Results. First, we demonstrate the behaviors of KMCF on this localization problem. Figures 9 and 10 show iterations of KMCF with n = 400, applied to the test data with time interval 6.81 sec. Figure 9 illustrates iterations that produced accurate estimates, while Figure 10 describes situations where location estimation is difficult.

Figure 9: Demonstration results. Each column corresponds to one iteration of KMCF. (Top) Prediction step: histogram of samples for the prior. (Middle) Correction step: weighted samples for the posterior. The blue and red stems indicate positive and negative weights, respectively. The yellow ball represents the ground-truth location $x_t$, and the green diamond the estimated one $\hat{x}_t$. (Bottom) Resampling step: histogram of samples given by the resampling step.

Figures 11 and 12 show the results in RMSE and computational time, respectively. For all the results, KMCF and its variants with the computational reduction methods (KMCF-low50, KMCF-low100, KMCF-sub150, and KMCF-sub300) performed better than the KBR filter. These results show the benefit of directly manipulating the transition models with sampling. KMCF was competitive with kNN-PF for the interval 2.27 sec; note that kNN-PF was originally proposed for the robot localization problem. For the results with the longer time intervals (4.54 sec and 6.81 sec), KMCF outperformed kNN-PF.

We next investigate the effect on KMCF of the methods to reduce computational cost. The performance of KMCF-low100 and KMCF-sub300 is competitive with KMCF; those of KMCF-low50 and KMCF-sub150 degrade as the sample size increases. Note that r = 50, 100 for algorithm 5 are larger than those in section 6.2, though the values of the sample size n are larger than those in section 6.2. Also note that the performance of KMCF-sub150 is much worse than KMCF-sub300. These results indicate that we may need large values for r to maintain the accuracy for this localization problem. Recall that the spatial pyramid matching kernel gives essentially a high-dimensional feature vector (histogram) for each observation. Thus the observation space $\mathcal{Y}$ may be considered high-dimensional. This supports the hypothesis in section 6.2 that if the dimension is high, the computational cost-reduction methods may require larger r to maintain accuracy.

Figure 10: Demonstration results (see also the caption of Figure 9). Here we show time points where observed images are similar to those in distant places. Such a situation often occurs at corners and makes location estimation difficult. (a) The prior estimate is reasonable, but the resulting posterior has modes in distant places. This makes the location estimate (green diamond) far from the true location (yellow ball). (b) While the location estimate is very accurate, modes also appear at distant locations.

Finally, let us look at the results in computation time (see Figure 12). The results are similar to those in section 6.2. Although the values for r are relatively large, algorithms 5 and 6 successfully reduced the computational costs of KMCF.

7 Conclusion and Future Work

This letter has proposed the kernel Monte Carlo filter, a novel filtering method for state-space models. We have considered the situation where the observation model is not known explicitly or even parametrically and where examples of the state-observation relation are given instead of the observation model. Our approach is based on the framework of kernel mean embeddings, which enables us to deal with the observation model in a data-driven manner. The resulting filtering method consists of the prediction, correction, and resampling steps, all realized in terms of kernel mean embeddings. Methodological novelties lie in the prediction and resampling steps. Thus, we analyzed their behaviors by deriving error bounds for the estimator of the prediction step. The analysis revealed that the effective sample size of a weighted sample plays an important role, as in particle methods. This analysis also explained how our resampling algorithm works. We applied the proposed method to synthetic and real problems, confirming the effectiveness of our approach.

Figure 11: RMSE of the robot localization experiments in section 6.3. Panels a, b, and c show the cases for time interval 2.27 sec, 4.54 sec, and 6.81 sec, respectively.

Figure 12: Computation time of the localization experiments in section 6.3. Panels a, b, and c show the cases for time interval 2.27 sec, 4.54 sec, and 6.81 sec, respectively. Note that the results show the runtime of each method.

One interesting topic for future research would be parameter estimation for the transition model. We did not discuss this and assumed that parameters, if they exist, are given and fixed. If the state-observation examples $\{(X_i, Y_i)\}_{i=1}^n$ are given as a sequence from the state-space model, then we can use the state samples $X_1, \ldots, X_n$ for estimating those parameters. Otherwise, we need to estimate the parameters based on test data. This might be possible by exploiting approaches for parameter estimation in particle methods (e.g., section IV in Cappé et al. (2007)).

Another important topic is the situation where the observation model in the test and training phases is different. As discussed in section 4.3, this might be addressed by exploiting the framework of transfer learning (Pan & Yang, 2010). This would require extension of kernel mean embeddings to the setting of transfer learning, since there has been no work in this direction. We consider that such extension is interesting in its own right.

Appendix A: Proof of Theorem 1

Before going to the proof, we review some basic facts that are needed. Let $m_P = \int k_X(\cdot, x)\,dP(x)$ and $\hat{m}_P = \sum_{i=1}^n w_i k_X(\cdot, X_i)$. By the reproducing property of the kernel $k_X$, the following hold for any $f \in \mathcal{H}_X$:
\[
\langle m_P, f \rangle_{\mathcal{H}_X} = \Big\langle \int k_X(\cdot, x)\,dP(x),\, f \Big\rangle_{\mathcal{H}_X} = \int \langle k_X(\cdot, x), f \rangle_{\mathcal{H}_X}\,dP(x) = \int f(x)\,dP(x) = \mathbb{E}_{X \sim P}[f(X)], \tag{A.1}
\]
\[
\langle \hat{m}_P, f \rangle_{\mathcal{H}_X} = \Big\langle \sum_{i=1}^n w_i k_X(\cdot, X_i),\, f \Big\rangle_{\mathcal{H}_X} = \sum_{i=1}^n w_i f(X_i). \tag{A.2}
\]
For any $f, g \in \mathcal{H}_X$, we denote by $f \otimes g \in \mathcal{H}_X \otimes \mathcal{H}_X$ the tensor product of f and g, defined as
\[
f \otimes g\,(x_1, x_2) = f(x_1) g(x_2), \quad \forall x_1, x_2 \in \mathcal{X}. \tag{A.3}
\]
The inner product of the tensor RKHS $\mathcal{H}_X \otimes \mathcal{H}_X$ satisfies
\[
\langle f_1 \otimes g_1, f_2 \otimes g_2 \rangle_{\mathcal{H}_X \otimes \mathcal{H}_X} = \langle f_1, f_2 \rangle_{\mathcal{H}_X} \langle g_1, g_2 \rangle_{\mathcal{H}_X}, \quad \forall f_1, f_2, g_1, g_2 \in \mathcal{H}_X. \tag{A.4}
\]
Let $\{\phi_i\}_{i=1}^I \subset \mathcal{H}_X$ be complete orthonormal bases of $\mathcal{H}_X$, where $I \in \mathbb{N} \cup \{\infty\}$. Assume $\theta \in \mathcal{H}_X \otimes \mathcal{H}_X$ (recall that this is an assumption of theorem 1). Then $\theta$ is expressed as
\[
\theta = \sum_{s,t=1}^I \alpha_{st}\, \phi_s \otimes \phi_t, \tag{A.5}
\]
with $\sum_{s,t} |\alpha_{st}|^2 < \infty$ (see Aronszajn, 1950).

Proof of Theorem 1. Recall that $\hat{m}_Q = \sum_{i=1}^n w_i k_X(\cdot, X'_i)$, where $X'_i \sim p(\cdot|X_i)$ $(i = 1, \ldots, n)$. Then
\[
\mathbb{E}_{X'_1, \ldots, X'_n}\big[\|\hat{m}_Q - m_Q\|^2_{\mathcal{H}_X}\big]
= \mathbb{E}_{X'_1, \ldots, X'_n}\big[\langle \hat{m}_Q, \hat{m}_Q \rangle_{\mathcal{H}_X} - 2 \langle \hat{m}_Q, m_Q \rangle_{\mathcal{H}_X} + \langle m_Q, m_Q \rangle_{\mathcal{H}_X}\big]
\]
\[
= \sum_{i,j=1}^n w_i w_j\, \mathbb{E}_{X'_i, X'_j}[k_X(X'_i, X'_j)]
- 2 \sum_{i=1}^n w_i\, \mathbb{E}_{X' \sim Q,\, X'_i}[k_X(X', X'_i)]
+ \mathbb{E}_{X', \tilde{X}' \sim Q}[k_X(X', \tilde{X}')]
\]
\[
= \sum_{i \ne j} w_i w_j\, \mathbb{E}_{X'_i, X'_j}[k_X(X'_i, X'_j)]
+ \sum_{i=1}^n w_i^2\, \mathbb{E}_{X'_i}[k_X(X'_i, X'_i)]
- 2 \sum_{i=1}^n w_i\, \mathbb{E}_{X' \sim Q,\, X'_i}[k_X(X', X'_i)]
+ \mathbb{E}_{X', \tilde{X}' \sim Q}[k_X(X', \tilde{X}')], \tag{A.6}
\]
where $\tilde{X}'$ denotes an independent copy of $X'$.

Recall that $Q = \int p(\cdot|x)\,dP(x)$ and $\theta(x, \tilde{x}) = \int\!\!\int k_X(x', \tilde{x}')\,dp(x'|x)\,dp(\tilde{x}'|\tilde{x})$. We can then rewrite terms in equation A.6 as
\[
\mathbb{E}_{X' \sim Q,\, X'_i}[k_X(X', X'_i)]
= \int \Big( \int\!\!\int k_X(x', x'_i)\,dp(x'|x)\,dp(x'_i|X_i) \Big)\,dP(x)
= \int \theta(x, X_i)\,dP(x) = \mathbb{E}_{X \sim P}[\theta(X, X_i)],
\]
\[
\mathbb{E}_{X', \tilde{X}' \sim Q}[k_X(X', \tilde{X}')]
= \int\!\!\int \Big( \int\!\!\int k_X(x', \tilde{x}')\,dp(x'|x)\,dp(\tilde{x}'|\tilde{x}) \Big)\,dP(x)\,dP(\tilde{x})
= \int\!\!\int \theta(x, \tilde{x})\,dP(x)\,dP(\tilde{x}) = \mathbb{E}_{X, \tilde{X} \sim P}[\theta(X, \tilde{X})].
\]
Thus, equation A.6 is equal to
\[
\sum_{i=1}^n w_i^2 \big( \mathbb{E}_{X'_i}[k_X(X'_i, X'_i)] - \mathbb{E}_{X'_i, \tilde{X}'_i}[k_X(X'_i, \tilde{X}'_i)] \big)
+ \sum_{i,j=1}^n w_i w_j\, \theta(X_i, X_j)
- 2 \sum_{i=1}^n w_i\, \mathbb{E}_{X \sim P}[\theta(X, X_i)]
+ \mathbb{E}_{X, \tilde{X} \sim P}[\theta(X, \tilde{X})]. \tag{A.7}
\]

We can rewrite terms in equation A.7 as follows, using the facts A.1, A.2, A.3, A.4, and A.5:
\[
\sum_{i,j} w_i w_j\, \theta(X_i, X_j)
= \sum_{i,j} w_i w_j \sum_{s,t} \alpha_{st}\, \phi_s(X_i) \phi_t(X_j)
= \sum_{s,t} \alpha_{st} \sum_i w_i \phi_s(X_i) \sum_j w_j \phi_t(X_j)
= \sum_{s,t} \alpha_{st} \langle \hat{m}_P, \phi_s \rangle_{\mathcal{H}_X} \langle \hat{m}_P, \phi_t \rangle_{\mathcal{H}_X}
\]
\[
= \sum_{s,t} \alpha_{st} \langle \hat{m}_P \otimes \hat{m}_P, \phi_s \otimes \phi_t \rangle_{\mathcal{H}_X \otimes \mathcal{H}_X}
= \langle \hat{m}_P \otimes \hat{m}_P, \theta \rangle_{\mathcal{H}_X \otimes \mathcal{H}_X},
\]
\[
\sum_i w_i\, \mathbb{E}_{X \sim P}[\theta(X, X_i)]
= \sum_i w_i\, \mathbb{E}_{X \sim P}\Big[ \sum_{s,t} \alpha_{st}\, \phi_s(X) \phi_t(X_i) \Big]
= \sum_{s,t} \alpha_{st}\, \mathbb{E}_{X \sim P}[\phi_s(X)] \sum_i w_i \phi_t(X_i)
= \sum_{s,t} \alpha_{st} \langle m_P, \phi_s \rangle_{\mathcal{H}_X} \langle \hat{m}_P, \phi_t \rangle_{\mathcal{H}_X}
\]
\[
= \sum_{s,t} \alpha_{st} \langle m_P \otimes \hat{m}_P, \phi_s \otimes \phi_t \rangle_{\mathcal{H}_X \otimes \mathcal{H}_X}
= \langle m_P \otimes \hat{m}_P, \theta \rangle_{\mathcal{H}_X \otimes \mathcal{H}_X},
\]
\[
\mathbb{E}_{X, \tilde{X} \sim P}[\theta(X, \tilde{X})]
= \mathbb{E}_{X, \tilde{X} \sim P}\Big[ \sum_{s,t} \alpha_{st}\, \phi_s(X) \phi_t(\tilde{X}) \Big]
= \sum_{s,t} \alpha_{st} \langle m_P, \phi_s \rangle_{\mathcal{H}_X} \langle m_P, \phi_t \rangle_{\mathcal{H}_X}
= \sum_{s,t} \alpha_{st} \langle m_P \otimes m_P, \phi_s \otimes \phi_t \rangle_{\mathcal{H}_X \otimes \mathcal{H}_X}
= \langle m_P \otimes m_P, \theta \rangle_{\mathcal{H}_X \otimes \mathcal{H}_X}.
\]
Thus, equation A.7 is equal to
\[
\sum_{i=1}^n w_i^2 \big( \mathbb{E}_{X'_i}[k_X(X'_i, X'_i)] - \mathbb{E}_{X'_i, \tilde{X}'_i}[k_X(X'_i, \tilde{X}'_i)] \big)
+ \langle \hat{m}_P \otimes \hat{m}_P, \theta \rangle_{\mathcal{H}_X \otimes \mathcal{H}_X}
- 2 \langle m_P \otimes \hat{m}_P, \theta \rangle_{\mathcal{H}_X \otimes \mathcal{H}_X}
+ \langle m_P \otimes m_P, \theta \rangle_{\mathcal{H}_X \otimes \mathcal{H}_X}
\]
\[
= \sum_{i=1}^n w_i^2 \big( \mathbb{E}_{X'_i}[k_X(X'_i, X'_i)] - \mathbb{E}_{X'_i, \tilde{X}'_i}[k_X(X'_i, \tilde{X}'_i)] \big)
+ \langle (\hat{m}_P - m_P) \otimes (\hat{m}_P - m_P), \theta \rangle_{\mathcal{H}_X \otimes \mathcal{H}_X}.
\]
Finally, the Cauchy-Schwartz inequality gives
\[
\langle (\hat{m}_P - m_P) \otimes (\hat{m}_P - m_P), \theta \rangle_{\mathcal{H}_X \otimes \mathcal{H}_X}
\le \|\hat{m}_P - m_P\|^2_{\mathcal{H}_X}\, \|\theta\|_{\mathcal{H}_X \otimes \mathcal{H}_X}.
\]
This completes the proof.


Appendix B: Proof of Theorem 2

Theorem 2 provides convergence rates for the resampling algorithm, algorithm 4. This theorem assumes that the candidate samples $Z_1, \ldots, Z_N$ for resampling are i.i.d. with a density q. Here we prove theorem 2 by showing that the same statement holds under weaker assumptions (see theorem 3 below).

We first describe assumptions. Let P be the distribution of the kernel mean $m_P$, and $L_2(P)$ be the Hilbert space of square-integrable functions on $\mathcal{X}$ with respect to P. For any $f \in L_2(P)$, we write its norm by $\|f\|^2_{L_2(P)} = \int f^2(x)\,dP(x)$.

Assumption 1. The candidate samples $Z_1, \ldots, Z_N$ are independent. There are probability distributions $Q_1, \ldots, Q_N$ on $\mathcal{X}$ such that for any bounded measurable function $g: \mathcal{X} \to \mathbb{R}$, we have
\[
\mathbb{E}\Big[ \frac{1}{N-1} \sum_{j \ne i} g(Z_j) \Big] = \mathbb{E}_{X \sim Q_i}[g(X)] \quad (i = 1, \ldots, N). \tag{B.1}
\]

Assumption 2. The distributions $Q_1, \ldots, Q_N$ have density functions $q_1, \ldots, q_N$, respectively. Define $Q = \frac{1}{N}\sum_{i=1}^N Q_i$ and $q = \frac{1}{N}\sum_{i=1}^N q_i$. There is a constant $A > 0$ that does not depend on N, such that
\[
\Big\| \frac{q_i}{q} - 1 \Big\|^2_{L_2(P)} \le \frac{A}{\sqrt{N}} \quad (i = 1, \ldots, N). \tag{B.2}
\]

Assumption 3. The distribution P has a density function p such that $\sup_{x \in \mathcal{X}} \frac{p(x)}{q(x)} < \infty$. There is a constant $\sigma > 0$ such that
\[
\sqrt{N} \Big( \frac{1}{N} \sum_{i=1}^N \frac{p(Z_i)}{q(Z_i)} - 1 \Big) \xrightarrow{D} \mathcal{N}(0, \sigma^2), \tag{B.3}
\]
where $\xrightarrow{D}$ denotes convergence in distribution and $\mathcal{N}(0, \sigma^2)$ the normal distribution with mean 0 and variance $\sigma^2$.

These assumptions are weaker than those in theorem 2, which require that $Z_1, \ldots, Z_N$ be i.i.d. For example, assumption 1 is clearly satisfied for the i.i.d. case, since in this case we have $Q = Q_1 = \cdots = Q_N$. The inequality, equation B.2, in assumption 2 requires that the distributions $Q_1, \ldots, Q_N$ get similar as the sample size increases. This is also satisfied under the i.i.d. assumption. Likewise, the convergence, equation B.3, in assumption 3 is satisfied from the central limit theorem if $Z_1, \ldots, Z_N$ are i.i.d.

We will need the following lemma.

Lemma 1. Let $Z_1, \ldots, Z_N$ be samples satisfying assumption 1. Then the following holds for any bounded measurable function $g: \mathcal{X} \to \mathbb{R}$:
\[
\mathbb{E}\Big[ \frac{1}{N} \sum_{i=1}^N g(Z_i) \Big] = \int g(x)\,dQ(x).
\]

Proof.
\[
\mathbb{E}\Big[ \frac{1}{N} \sum_{i=1}^N g(Z_i) \Big]
= \mathbb{E}\Big[ \frac{1}{N(N-1)} \sum_{i=1}^N \sum_{j \ne i} g(Z_j) \Big]
= \frac{1}{N} \sum_{i=1}^N \mathbb{E}\Big[ \frac{1}{N-1} \sum_{j \ne i} g(Z_j) \Big]
= \frac{1}{N} \sum_{i=1}^N \int g(x)\,dQ_i(x)
= \int g(x)\,dQ(x).
\]

The following theorem shows the convergence rates of our resampling algorithm. Note that it does not assume that the candidate samples $Z_1, \ldots, Z_N$ are identical to those expressing the estimator $\hat{m}_P$.

Theorem 3. Let k be a bounded positive-definite kernel, and $\mathcal{H}$ be the associated RKHS. Let $Z_1, \ldots, Z_N$ be candidate samples satisfying assumptions 1, 2, and 3. Let P be a probability distribution satisfying assumption 3, and let $m_P = \int k(\cdot, x)\,dP(x)$ be the kernel mean. Let $\hat{m}_P \in \mathcal{H}$ be any element in $\mathcal{H}$. Suppose we apply algorithm 4 to $\hat{m}_P \in \mathcal{H}$ with candidate samples $Z_1, \ldots, Z_N$, and let $\bar{X}_1, \ldots, \bar{X}_\ell \in \{Z_1, \ldots, Z_N\}$ be the resulting samples. Then the following holds:
\[
\Big\| \hat{m}_P - \frac{1}{\ell} \sum_{i=1}^{\ell} k(\cdot, \bar{X}_i) \Big\|^2_{\mathcal{H}}
= \big( \|m_P - \hat{m}_P\|_{\mathcal{H}} + O_p(N^{-1/2}) \big)^2 + O\Big( \frac{\ln \ell}{\ell} \Big).
\]

Proof. Our proof is based on the fact (Bach et al., 2012) that kernel herding can be seen as the Frank-Wolfe optimization method with step size $1/(\ell+1)$ for the $\ell$-th iteration. For details of the Frank-Wolfe method, we refer to Jaggi (2013) and Freund and Grigas (2014).

Fix the samples $Z_1, \ldots, Z_N$. Let $M_N$ be the convex hull of the set $\{k(\cdot, Z_1), \ldots, k(\cdot, Z_N)\} \subset \mathcal{H}$. Define a loss function $J: \mathcal{H} \to \mathbb{R}$ by
\[
J(g) = \frac{1}{2} \|g - \hat{m}_P\|^2_{\mathcal{H}}, \quad g \in \mathcal{H}. \tag{B.4}
\]
Then algorithm 4 can be seen as the Frank-Wolfe method that iteratively minimizes this loss function over the convex hull $M_N$:
\[
\inf_{g \in M_N} J(g).
\]
More precisely, the Frank-Wolfe method solves this problem by the following iterations:
\[
s_\ell = \arg\min_{g \in M_N} \langle g, \nabla J(g_{\ell-1}) \rangle_{\mathcal{H}},
\qquad
g_\ell = (1 - \gamma_\ell)\, g_{\ell-1} + \gamma_\ell\, s_\ell \quad (\ell \ge 1),
\]
where $\gamma_\ell$ is a step size defined as $\gamma_\ell = 1/\ell$, and $\nabla J(g_{\ell-1})$ is the gradient of J at $g_{\ell-1}$: $\nabla J(g_{\ell-1}) = g_{\ell-1} - \hat{m}_P$. Here the initial point is defined as $g_0 = 0$. It can be easily shown that $g_\ell = \frac{1}{\ell}\sum_{i=1}^{\ell} k(\cdot, \bar{X}_i)$, where $\bar{X}_1, \ldots, \bar{X}_\ell$ are the samples given by algorithm 4. (For details, see Bach et al., 2012.)

Let $L_{J,M_N} > 0$ be the Lipschitz constant of the gradient $\nabla J$ over $M_N$, and $\mathrm{Diam}\,M_N > 0$ be the diameter of $M_N$:
\[
L_{J,M_N} = \sup_{g_1, g_2 \in M_N} \frac{\|\nabla J(g_1) - \nabla J(g_2)\|_{\mathcal{H}}}{\|g_1 - g_2\|_{\mathcal{H}}}
= \sup_{g_1, g_2 \in M_N} \frac{\|g_1 - g_2\|_{\mathcal{H}}}{\|g_1 - g_2\|_{\mathcal{H}}} = 1, \tag{B.5}
\]
\[
\mathrm{Diam}\,M_N = \sup_{g_1, g_2 \in M_N} \|g_1 - g_2\|_{\mathcal{H}}
\le \sup_{g_1, g_2 \in M_N} \big( \|g_1\|_{\mathcal{H}} + \|g_2\|_{\mathcal{H}} \big) \le 2C, \tag{B.6}
\]
where $C = \sup_{x \in \mathcal{X}} \|k(\cdot, x)\|_{\mathcal{H}} = \sup_{x \in \mathcal{X}} \sqrt{k(x, x)} < \infty$.

From bound 3.2 and equation 8 of Freund and Grigas (2014), we then have
\[
J(g_\ell) - \inf_{g \in M_N} J(g) \le \frac{L_{J,M_N} (\mathrm{Diam}\,M_N)^2 (1 + \ln \ell)}{2\ell} \tag{B.7}
\]
\[
\le \frac{2 C^2 (1 + \ln \ell)}{\ell}, \tag{B.8}
\]
where the last inequality follows from equations B.5 and B.6.

Note that the upper bound of equation B.8 does not depend on the candidate samples $Z_1, \ldots, Z_N$. Hence, combined with equation B.4, the following holds for any choice of $Z_1, \ldots, Z_N$:
\[
\Big\| \hat{m}_P - \frac{1}{\ell} \sum_{i=1}^{\ell} k(\cdot, \bar{X}_i) \Big\|^2_{\mathcal{H}}
\le \inf_{g \in M_N} \|\hat{m}_P - g\|^2_{\mathcal{H}} + \frac{4 C^2 (1 + \ln \ell)}{\ell}. \tag{B.9}
\]

Below we will focus on bounding the first term of equation B.9. Recall here that $Z_1, \ldots, Z_N$ are random samples. Define a random variable $S_N = \sum_{i=1}^N \frac{p(Z_i)}{q(Z_i)}$. Since $M_N$ is the convex hull of $\{k(\cdot, Z_1), \ldots, k(\cdot, Z_N)\}$, we have
\[
\inf_{g \in M_N} \|\hat{m}_P - g\|_{\mathcal{H}}
= \inf_{\alpha \in \mathbb{R}^N,\ \alpha \ge 0,\ \sum_i \alpha_i \le 1} \Big\| \hat{m}_P - \sum_i \alpha_i k(\cdot, Z_i) \Big\|_{\mathcal{H}}
\le \Big\| \hat{m}_P - \frac{1}{S_N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \Big\|_{\mathcal{H}}
\]
\[
\le \|\hat{m}_P - m_P\|_{\mathcal{H}}
+ \Big\| m_P - \frac{1}{N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \Big\|_{\mathcal{H}}
+ \Big\| \frac{1}{N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) - \frac{1}{S_N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \Big\|_{\mathcal{H}}.
\]
Therefore, we have
\[
\Big\| \hat{m}_P - \frac{1}{\ell} \sum_{i=1}^{\ell} k(\cdot, \bar{X}_i) \Big\|^2_{\mathcal{H}}
\le \bigg( \|\hat{m}_P - m_P\|_{\mathcal{H}}
+ \Big\| m_P - \frac{1}{N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \Big\|_{\mathcal{H}}
+ \Big\| \frac{1}{N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) - \frac{1}{S_N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \Big\|_{\mathcal{H}} \bigg)^2
+ O\Big( \frac{\ln \ell}{\ell} \Big). \tag{B.10}
\]
Below we derive rates of convergence for the second and third terms.

For the second term, we derive a rate of convergence in expectation, which implies a rate of convergence in probability. To this end, we use the following fact. Let $f \in \mathcal{H}$ be any function in the RKHS. By the assumption $\sup_{x \in \mathcal{X}} \frac{p(x)}{q(x)} < \infty$ and the boundedness of k, the functions $x \mapsto \frac{p(x)}{q(x)} f(x)$ and $x \mapsto \big(\frac{p(x)}{q(x)}\big)^2 f(x)$ are bounded. We have
\[
\mathbb{E}\bigg[ \Big\| m_P - \frac{1}{N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \Big\|^2_{\mathcal{H}} \bigg]
= \|m_P\|^2_{\mathcal{H}} - 2\, \mathbb{E}\Big[ \frac{1}{N} \sum_i \frac{p(Z_i)}{q(Z_i)} m_P(Z_i) \Big]
+ \mathbb{E}\Big[ \frac{1}{N^2} \sum_i \sum_j \frac{p(Z_i)}{q(Z_i)} \frac{p(Z_j)}{q(Z_j)} k(Z_i, Z_j) \Big]
\]
\[
= \|m_P\|^2_{\mathcal{H}} - 2 \int \frac{p(x)}{q(x)} m_P(x)\, q(x)\,dx
+ \mathbb{E}\Big[ \frac{1}{N^2} \sum_i \sum_{j \ne i} \frac{p(Z_i)}{q(Z_i)} \frac{p(Z_j)}{q(Z_j)} k(Z_i, Z_j) \Big]
+ \mathbb{E}\Big[ \frac{1}{N^2} \sum_i \Big( \frac{p(Z_i)}{q(Z_i)} \Big)^2 k(Z_i, Z_i) \Big]
\]
\[
= \|m_P\|^2_{\mathcal{H}} - 2 \|m_P\|^2_{\mathcal{H}}
+ \mathbb{E}\Big[ \frac{N-1}{N^2} \sum_i \frac{p(Z_i)}{q(Z_i)} \int \frac{p(x)}{q(x)} k(Z_i, x)\, q_i(x)\,dx \Big]
+ \frac{1}{N} \int \Big( \frac{p(x)}{q(x)} \Big)^2 k(x, x)\, q(x)\,dx
\]
\[
= -\|m_P\|^2_{\mathcal{H}}
+ \mathbb{E}\Big[ \frac{N-1}{N^2} \sum_i \frac{p(Z_i)}{q(Z_i)} \int \frac{p(x)}{q(x)} k(Z_i, x)\, q_i(x)\,dx \Big]
+ \frac{1}{N} \int \frac{p(x)}{q(x)} k(x, x)\,dP(x).
\]
We further rewrite the second term of the last equality as follows:
\[
\mathbb{E}\Big[ \frac{N-1}{N^2} \sum_i \frac{p(Z_i)}{q(Z_i)} \int \frac{p(x)}{q(x)} k(Z_i, x)\, q_i(x)\,dx \Big]
= \mathbb{E}\Big[ \frac{N-1}{N^2} \sum_i \frac{p(Z_i)}{q(Z_i)} \int \frac{p(x)}{q(x)} k(Z_i, x) \big( q_i(x) - q(x) \big)\,dx \Big]
+ \mathbb{E}\Big[ \frac{N-1}{N^2} \sum_i \frac{p(Z_i)}{q(Z_i)} \int \frac{p(x)}{q(x)} k(Z_i, x)\, q(x)\,dx \Big]
\]
\[
= \mathbb{E}\Big[ \frac{N-1}{N^2} \sum_i \frac{p(Z_i)}{q(Z_i)} \int \sqrt{p(x)}\, k(Z_i, x)\, \sqrt{p(x)} \Big( \frac{q_i(x)}{q(x)} - 1 \Big)\,dx \Big]
+ \frac{N-1}{N} \|m_P\|^2_{\mathcal{H}}
\]
\[
\le \mathbb{E}\Big[ \frac{N-1}{N^2} \sum_i \frac{p(Z_i)}{q(Z_i)} \|k(Z_i, \cdot)\|_{L_2(P)} \Big\| \frac{q_i}{q} - 1 \Big\|_{L_2(P)} \Big]
+ \frac{N-1}{N} \|m_P\|^2_{\mathcal{H}}
\]
\[
\le \mathbb{E}\Big[ \frac{N-1}{N^3} \sum_i \frac{p(Z_i)}{q(Z_i)}\, C^2 A \Big]
+ \frac{N-1}{N} \|m_P\|^2_{\mathcal{H}}
= \frac{C^2 A (N-1)}{N^2} + \frac{N-1}{N} \|m_P\|^2_{\mathcal{H}},
\]
where the first inequality follows from Cauchy-Schwartz. Using this, we obtain
\[
\mathbb{E}\bigg[ \Big\| m_P - \frac{1}{N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \Big\|^2_{\mathcal{H}} \bigg]
\le \frac{1}{N} \Big( \int \frac{p(x)}{q(x)} k(x, x)\,dP(x) - \|m_P\|^2_{\mathcal{H}} \Big)
+ \frac{C^2 (N-1) A}{N^2} = O(N^{-1}).
\]
Therefore, we have
\[
\Big\| m_P - \frac{1}{N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \Big\|_{\mathcal{H}} = O_p(N^{-1/2}) \quad (N \to \infty). \tag{B.11}
\]

We can bound the third term as follows:
\[
\Big\| \frac{1}{N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) - \frac{1}{S_N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \Big\|_{\mathcal{H}}
= \Big\| \frac{1}{N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \Big( 1 - \frac{N}{S_N} \Big) \Big\|_{\mathcal{H}}
\]
\[
= \Big| 1 - \frac{N}{S_N} \Big|\, \Big\| \frac{1}{N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \Big\|_{\mathcal{H}}
\le \Big| 1 - \frac{N}{S_N} \Big|\, C\, \Big\| \frac{p}{q} \Big\|_\infty
= \bigg| 1 - \frac{1}{\frac{1}{N}\sum_{i=1}^N p(Z_i)/q(Z_i)} \bigg|\, C\, \Big\| \frac{p}{q} \Big\|_\infty,
\]
where $\|p/q\|_\infty = \sup_{x \in \mathcal{X}} \frac{p(x)}{q(x)} < \infty$. Therefore, the following holds by assumption 3 and the delta method:
\[
\Big\| \frac{1}{N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) - \frac{1}{S_N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \Big\|_{\mathcal{H}} = O_p(N^{-1/2}). \tag{B.12}
\]
The assertion of the theorem follows from equations B.10 to B.12.

Appendix C: Reduction of Computational Cost

We have seen in section 4.3 that the time complexity of KMCF in one time step is O(n³), where n is the number of the state-observation examples $\{(X_i, Y_i)\}_{i=1}^n$. This can be costly if one wishes to use KMCF in real-time applications with a large number of samples. Here we show two methods for reducing the costs: one based on low-rank approximation of kernel matrices and one based on kernel herding. Note that kernel herding is also used in the resampling step. The purpose here is different, however: we make use of kernel herding for finding a reduced representation of the data $\{(X_i, Y_i)\}_{i=1}^n$.

C.1 Low-Rank Approximation of Kernel Matrices. Our goal is to reduce the costs of algorithm 1 of kernel Bayes' rule. Algorithm 1 involves two matrix inversions: $(G_X + n\varepsilon I_n)^{-1}$ in line 3 and $((\Lambda G_Y)^2 + \delta I_n)^{-1}$ in line 4. Note that $(G_X + n\varepsilon I_n)^{-1}$ does not involve the test data, so it can be computed before the test phase. On the other hand, $((\Lambda G_Y)^2 + \delta I_n)^{-1}$ depends on the matrix $\Lambda$. This matrix involves the vector $m_\pi$, which essentially represents the prior of the current state (see line 13 of algorithm 3). Therefore, $((\Lambda G_Y)^2 + \delta I_n)^{-1}$ needs to be computed for each iteration in the test phase. This has a complexity of O(n³). Note that even if $(G_X + n\varepsilon I_n)^{-1}$ can be computed in the training phase, the multiplication $(G_X + n\varepsilon I_n)^{-1} m_\pi$ in line 3 requires O(n²). Thus, it can also be costly. Here we consider methods to reduce both costs in lines 3 and 4.

Suppose that there exist low-rank matrices $U, V \in \mathbb{R}^{n \times r}$, where r < n, that approximate the kernel matrices: $G_X \approx UU^T$, $G_Y \approx VV^T$. Such low-rank matrices can be obtained by, for example, incomplete Cholesky decomposition with time complexity O(nr²) (Fine & Scheinberg, 2001; Bach & Jordan, 2002). Note that the computation of these matrices is required only once, before the test phase. Therefore, their time complexities are not the problem here.

C.1.1 Derivation. First, we approximate $(G_X + n\varepsilon I_n)^{-1} m_\pi$ in line 3, using $G_X \approx UU^T$. By the Woodbury identity, we have
\[
(G_X + n\varepsilon I_n)^{-1} m_\pi \approx (UU^T + n\varepsilon I_n)^{-1} m_\pi
= \frac{1}{n\varepsilon} \big( I_n - U (n\varepsilon I_r + U^T U)^{-1} U^T \big) m_\pi,
\]
where $I_r \in \mathbb{R}^{r \times r}$ denotes the identity. Note that $(n\varepsilon I_r + U^T U)^{-1}$ does not involve the test data, so it can be computed in the training phase. Thus, the above approximation of $\mu$ can be computed with complexity O(nr²).

Next, we approximate $w = \Lambda G_Y ((\Lambda G_Y)^2 + \delta I)^{-1} \Lambda k_Y$ in line 4, using $G_Y \approx VV^T$. Define $B = \Lambda V \in \mathbb{R}^{n \times r}$, $C = V^T \Lambda V \in \mathbb{R}^{r \times r}$, and $D = V^T \in \mathbb{R}^{r \times n}$. Then $(\Lambda G_Y)^2 \approx (\Lambda VV^T)^2 = BCD$. By the Woodbury identity, we obtain
\[
(\delta I_n + (\Lambda G_Y)^2)^{-1} \approx (\delta I_n + BCD)^{-1}
= \frac{1}{\delta} \big( I_n - B (\delta C^{-1} + DB)^{-1} D \big).
\]
Thus, w can be approximated as
\[
w = \Lambda G_Y ((\Lambda G_Y)^2 + \delta I)^{-1} \Lambda k_Y
\approx \frac{1}{\delta}\, \Lambda V V^T \big( I_n - B (\delta C^{-1} + DB)^{-1} D \big) \Lambda k_Y.
\]
The computation of this approximation requires O(nr² + r³) = O(nr²). Thus, in total, the complexity of algorithm 1 can be reduced to O(nr²). We summarize the above approximations in algorithm 5.
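The two Woodbury-based approximations above can be written in a few lines. The following sketch (ours, not the paper's algorithm 5) checks them numerically against the exact O(n³) computations, taking the kernel Bayes' rule quantities $m_\pi$, $\Lambda$, and $k_Y$ simply as given vectors and matrices; the toy sizes and regularization constants are arbitrary.

```python
import numpy as np

def approx_mu(U, m_pi, n_eps):
    """(G_X + n*eps*I)^{-1} m_pi with G_X ~= U U^T, via the Woodbury identity."""
    r = U.shape[1]
    inner = np.linalg.inv(n_eps * np.eye(r) + U.T @ U)   # r x r, precomputable
    return (m_pi - U @ (inner @ (U.T @ m_pi))) / n_eps

def approx_w(V, Lam_diag, kY, delta):
    """Lam G_Y ((Lam G_Y)^2 + delta I)^{-1} Lam kY with G_Y ~= V V^T."""
    B = Lam_diag[:, None] * V                            # Lam V
    C = V.T @ B                                          # V^T Lam V
    D = V.T
    core = np.linalg.inv(delta * np.linalg.inv(C) + D @ B)   # r x r
    LamkY = Lam_diag * kY
    t = LamkY - B @ (core @ (D @ LamkY))                 # delta * (delta I + BCD)^{-1} Lam kY
    return (Lam_diag * (V @ (V.T @ t))) / delta

# toy check against the exact computation
rng = np.random.default_rng(0)
n, r = 200, 10
U, V = rng.normal(size=(n, r)), rng.normal(size=(n, r))
GX, GY = U @ U.T, V @ V.T                                # exactly low rank for the check
m_pi, kY = rng.normal(size=n), rng.normal(size=n)
lam = rng.uniform(0.5, 1.5, size=n)
Lam = np.diag(lam)

mu_exact = np.linalg.solve(GX + 0.1 * np.eye(n), m_pi)
w_exact = Lam @ GY @ np.linalg.solve((Lam @ GY) @ (Lam @ GY) + 1e-3 * np.eye(n), Lam @ kY)
print(np.allclose(mu_exact, approx_mu(U, m_pi, 0.1)))
print(np.allclose(w_exact, approx_w(V, lam, kY, 1e-3)))
```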

C.1.2 How to Use. Algorithm 5 can be used with algorithm 3 of KMCF by modifying algorithm 3 in the following manner. Compute the low-rank matrices U, V right after lines 4 and 5. This can be done by using, for example, incomplete Cholesky decomposition (Fine & Scheinberg, 2001; Bach & Jordan, 2002). Then replace algorithm 1 in line 15 by algorithm 5.

C.1.3 How to Select the Rank. As discussed in section 4.3, one way of selecting the rank r is to use cross-validation, by regarding r as a hyperparameter of KMCF. Another way is to measure the approximation errors $\|G_X - UU^T\|$ and $\|G_Y - VV^T\|$ with some matrix norm, such as the Frobenius norm. Indeed, we can compute the smallest rank r such that these errors are below a prespecified threshold, and this can be done efficiently with time complexity O(nr²) (Bach & Jordan, 2002).
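A minimal sketch of the pivoted (incomplete) Cholesky idea referred to above: it greedily grows the rank until the trace of the residual drops below a prescribed threshold, which also yields the smallest rank meeting that error level. This is our own illustration, not the code of Fine and Scheinberg (2001) or Bach and Jordan (2002); the tolerance and data are arbitrary.

```python
import numpy as np

def incomplete_cholesky(G, tol=1e-2, max_rank=None):
    """Greedy pivoted Cholesky: returns U with G ~= U @ U.T and
    trace(G - U @ U.T) below tol (or rank capped at max_rank)."""
    n = G.shape[0]
    max_rank = n if max_rank is None else max_rank
    d = np.diag(G).astype(float).copy()          # diagonal of the current residual
    cols = []
    while len(cols) < max_rank and d.sum() > tol:
        i = int(np.argmax(d))                    # pivot: largest residual diagonal
        prev = np.column_stack(cols) if cols else np.zeros((n, 0))
        col = (G[:, i] - prev @ prev[i, :]) / np.sqrt(d[i])
        cols.append(col)
        d = np.clip(d - col ** 2, 0.0, None)     # update residual diagonal
    return np.column_stack(cols) if cols else np.zeros((n, 0))

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
GX = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2.0)   # gaussian Gram matrix
U = incomplete_cholesky(GX, tol=1e-2)
print(U.shape[1], np.linalg.norm(GX - U @ U.T, "fro"))
```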

C.2 Data Reduction with Kernel Herding. Here we describe an approach to reduce the size of the representation of the state-observation examples $\{(X_i, Y_i)\}_{i=1}^n$ in an efficient way. By "efficient," we mean that the information contained in $\{(X_i, Y_i)\}_{i=1}^n$ will be preserved even after the reduction. Recall that $\{(X_i, Y_i)\}_{i=1}^n$ contains the information of the observation model $p(y_t|x_t)$ (recall also that $p(y_t|x_t)$ is assumed time-homogeneous; see section 4.1). This information is used only in algorithm 1 of kernel Bayes' rule (line 15, algorithm 3). Therefore, it suffices to consider how kernel Bayes' rule accesses the information contained in the joint sample $\{(X_i, Y_i)\}_{i=1}^n$.

C.2.1 Representation of the Joint Sample. To this end, we need to show how the joint sample $\{(X_i, Y_i)\}_{i=1}^n$ can be represented with a kernel mean embedding. Recall that $(k_X, \mathcal{H}_X)$ and $(k_Y, \mathcal{H}_Y)$ are kernels and the associated RKHSs on the state space $\mathcal{X}$ and the observation space $\mathcal{Y}$, respectively. Let $\mathcal{X} \times \mathcal{Y}$ be the product space of $\mathcal{X}$ and $\mathcal{Y}$. Then we can define a kernel $k_{\mathcal{X} \times \mathcal{Y}}$ on $\mathcal{X} \times \mathcal{Y}$ as the product of $k_X$ and $k_Y$: $k_{\mathcal{X} \times \mathcal{Y}}((x, y), (x', y')) = k_X(x, x')\, k_Y(y, y')$ for all $(x, y), (x', y') \in \mathcal{X} \times \mathcal{Y}$. This product kernel $k_{\mathcal{X} \times \mathcal{Y}}$ defines an RKHS of $\mathcal{X} \times \mathcal{Y}$. Let $\mathcal{H}_{\mathcal{X} \times \mathcal{Y}}$ denote this RKHS. As in section 3, we can use $k_{\mathcal{X} \times \mathcal{Y}}$ and $\mathcal{H}_{\mathcal{X} \times \mathcal{Y}}$ for a kernel mean embedding. In particular, the empirical distribution $\frac{1}{n}\sum_{i=1}^n \delta_{(X_i, Y_i)}$ of the joint sample $\{(X_i, Y_i)\}_{i=1}^n \subset \mathcal{X} \times \mathcal{Y}$ can be represented as an empirical kernel mean in $\mathcal{H}_{\mathcal{X} \times \mathcal{Y}}$:
\[
\hat{m}_{XY} = \frac{1}{n} \sum_{i=1}^n k_{\mathcal{X} \times \mathcal{Y}}((\cdot, \cdot), (X_i, Y_i)) \in \mathcal{H}_{\mathcal{X} \times \mathcal{Y}}. \tag{C.1}
\]
This is the representation of the joint sample $\{(X_i, Y_i)\}_{i=1}^n$.

The information of $\{(X_i, Y_i)\}_{i=1}^n$ is provided for kernel Bayes' rule essentially through the form of equation C.1 (Fukumizu et al., 2011, 2013). Recall that equation C.1 is a point in the RKHS $\mathcal{H}_{\mathcal{X} \times \mathcal{Y}}$. Any point close to this equation in $\mathcal{H}_{\mathcal{X} \times \mathcal{Y}}$ would also contain information close to that contained in equation C.1. Therefore, we propose to find a subset $\{(\bar{X}_1, \bar{Y}_1), \ldots, (\bar{X}_r, \bar{Y}_r)\} \subset \{(X_i, Y_i)\}_{i=1}^n$, where r < n, such that its representation in $\mathcal{H}_{\mathcal{X} \times \mathcal{Y}}$,
\[
\bar{m}_{XY} = \frac{1}{r} \sum_{i=1}^r k_{\mathcal{X} \times \mathcal{Y}}((\cdot, \cdot), (\bar{X}_i, \bar{Y}_i)) \in \mathcal{H}_{\mathcal{X} \times \mathcal{Y}}, \tag{C.2}
\]
is close to equation C.1. Namely, we wish to find subsamples such that $\|\hat{m}_{XY} - \bar{m}_{XY}\|_{\mathcal{H}_{\mathcal{X} \times \mathcal{Y}}}$ is small. If the error $\|\hat{m}_{XY} - \bar{m}_{XY}\|_{\mathcal{H}_{\mathcal{X} \times \mathcal{Y}}}$ is small enough, equation C.2 would provide information close to that given by equation C.1 for kernel Bayes' rule. Thus kernel Bayes' rule based on such subsamples $\{(\bar{X}_i, \bar{Y}_i)\}_{i=1}^r$ would not perform much worse than the one based on the entire set of samples $\{(X_i, Y_i)\}_{i=1}^n$.

C.2.2 Subsampling Method. To find such subsamples, we make use of kernel herding in section 3.5. Namely, we apply the update equations 3.6 and 3.7 to approximate equation C.1, with kernel $k_{\mathcal{X} \times \mathcal{Y}}$ and RKHS $\mathcal{H}_{\mathcal{X} \times \mathcal{Y}}$. We greedily find subsamples $D_r = \{(\bar{X}_1, \bar{Y}_1), \ldots, (\bar{X}_r, \bar{Y}_r)\}$ as
\[
(\bar{X}_r, \bar{Y}_r)
= \arg\max_{(x, y) \in D \setminus D_{r-1}} \frac{1}{n} \sum_{i=1}^n k_{\mathcal{X} \times \mathcal{Y}}((x, y), (X_i, Y_i))
- \frac{1}{r} \sum_{j=1}^{r-1} k_{\mathcal{X} \times \mathcal{Y}}((x, y), (\bar{X}_j, \bar{Y}_j))
\]
\[
= \arg\max_{(x, y) \in D \setminus D_{r-1}} \frac{1}{n} \sum_{i=1}^n k_X(x, X_i)\, k_Y(y, Y_i)
- \frac{1}{r} \sum_{j=1}^{r-1} k_X(x, \bar{X}_j)\, k_Y(y, \bar{Y}_j).
\]
The resulting algorithm is shown in algorithm 6. The time complexity is O(n²r) for selecting r subsamples.
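As a concrete illustration of this greedy rule (our sketch, with gaussian kernels and arbitrary bandwidths; algorithm 6 itself is not reproduced here), the following selects r of the n pairs by maximizing the criterion above over the remaining candidates, using a precomputed joint Gram matrix.

```python
import numpy as np

def gauss_gram(A, B, sigma):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def herding_subsample(X, Y, r, sx=1.0, sy=1.0):
    """Greedy selection of r indices approximating the joint kernel mean (C.1)."""
    Kjoint = gauss_gram(X, X, sx) * gauss_gram(Y, Y, sy)   # product kernel k_X * k_Y
    target = Kjoint.mean(axis=1)        # (1/n) sum_i k((.,.), (X_i, Y_i)) at each candidate
    selected = []
    for t in range(1, r + 1):
        penalty = Kjoint[:, selected].sum(axis=1) / t if selected else 0.0
        score = target - penalty
        score[selected] = -np.inf       # sample without replacement
        selected.append(int(np.argmax(score)))
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 1))
Y = X + 0.1 * rng.normal(size=(1000, 1))
idx = herding_subsample(X, Y, r=100)
X_sub, Y_sub = X[idx], Y[idx]           # reduced training set to be used in place of (X_i, Y_i)
```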

C.2.3 How to Use. By using algorithm 6, we can reduce the time complexity of KMCF (see algorithm 3) in each iteration from O(n³) to O(r³). This can be done by obtaining subsamples $\{(\bar{X}_i, \bar{Y}_i)\}_{i=1}^r$ by applying algorithm 6 to $\{(X_i, Y_i)\}_{i=1}^n$, and then replacing $\{(X_i, Y_i)\}_{i=1}^n$ in the requirement of algorithm 3 by $\{(\bar{X}_i, \bar{Y}_i)\}_{i=1}^r$ and using the number r instead of n.

C.2.4 How to Select the Number of Subsamples. The number r of subsamples determines the trade-off between the accuracy and computational time of KMCF. It may be selected by cross-validation or by measuring the approximation error $\|\hat{m}_{XY} - \bar{m}_{XY}\|_{\mathcal{H}_{\mathcal{X} \times \mathcal{Y}}}$, as for the case of selecting the rank of the low-rank approximation in section C.1.

C.2.5 Discussion. Recall that kernel herding generates samples such that they approximate a given kernel mean (see section 3.5). Under certain assumptions, the error of this approximation is of O(r⁻¹) with r samples, which is faster than that of i.i.d. samples, O(r⁻¹ᐟ²). This indicates that subsamples $\{(\bar{X}_i, \bar{Y}_i)\}_{i=1}^r$ selected with kernel herding may approximate equation C.1 well. Here, however, we find the solutions of the optimization problems 3.6 and 3.7 from the finite set $\{(X_i, Y_i)\}_{i=1}^n$, rather than the entire joint space $\mathcal{X} \times \mathcal{Y}$. The convergence guarantee is provided only for the case of the entire joint space $\mathcal{X} \times \mathcal{Y}$. Thus, for our case, the convergence guarantee is no longer provided. Moreover, the fast rate O(r⁻¹) is guaranteed only for finite-dimensional RKHSs; gaussian kernels, which we often use in practice, define infinite-dimensional RKHSs. Therefore, the fast rate is not guaranteed if we use gaussian kernels. Nevertheless, we can use algorithm 6 as a heuristic for data reduction.

Acknowledgments

We express our gratitude to the associate editor and the anonymous reviewer for their time and helpful suggestions. We also thank Masashi Shimbo, Momoko Hayamizu, Yoshimasa Uematsu, and Katsuhiro Omae for their helpful comments. This work has been supported in part by MEXT Grant-in-Aid for Scientific Research on Innovative Areas 25120012. M.K. has been supported by JSPS Grant-in-Aid for JSPS Fellows 15J04406.

References

Anderson, B., & Moore, J. (1979). Optimal filtering. Englewood Cliffs, NJ: Prentice Hall.
Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3), 337–404.
Bach, F., & Jordan, M. I. (2002). Kernel independent component analysis. Journal of Machine Learning Research, 3, 1–48.
Bach, F., Lacoste-Julien, S., & Obozinski, G. (2012). On the equivalence between herding and conditional gradient algorithms. In Proceedings of the 29th International Conference on Machine Learning (ICML 2012) (pp. 1359–1366). Madison, WI: Omnipress.
Berlinet, A., & Thomas-Agnan, C. (2004). Reproducing kernel Hilbert spaces in probability and statistics. Dordrecht: Kluwer Academic.
Calvet, L. E., & Czellar, V. (2015). Accurate methods for approximate Bayesian computation filtering. Journal of Financial Econometrics, 13, 798–838. doi:10.1093/jjfinec/nbu019
Cappé, O., Godsill, S. J., & Moulines, E. (2007). An overview of existing methods and recent advances in sequential Monte Carlo. IEEE Proceedings, 95(5), 899–924.
Chen, Y., Welling, M., & Smola, A. (2010). Supersamples from kernel-herding. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence (pp. 109–116). Cambridge, MA: MIT Press.
Deisenroth, M., Huber, M., & Hanebeck, U. (2009). Analytic moment-based gaussian process filtering. In Proceedings of the 26th International Conference on Machine Learning (pp. 225–232). Madison, WI: Omnipress.
Doucet, A., Freitas, N. D., & Gordon, N. J. (Eds.). (2001). Sequential Monte Carlo methods in practice. New York: Springer.
Doucet, A., & Johansen, A. M. (2011). A tutorial on particle filtering and smoothing: Fifteen years later. In D. Crisan & B. Rozovskii (Eds.), The Oxford handbook of nonlinear filtering (pp. 656–704). New York: Oxford University Press.
Durbin, J., & Koopman, S. J. (2012). Time series analysis by state space methods (2nd ed.). New York: Oxford University Press.
Eberts, M., & Steinwart, I. (2013). Optimal regression rates for SVMs using gaussian kernels. Electronic Journal of Statistics, 7, 1–42.
Ferris, B., Hähnel, D., & Fox, D. (2006). Gaussian processes for signal strength-based location estimation. In Proceedings of Robotics: Science and Systems. Cambridge, MA: MIT Press.
Fine, S., & Scheinberg, K. (2001). Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2, 243–264.
Freund, R. M., & Grigas, P. (2014). New analysis and results for the Frank–Wolfe method. Mathematical Programming. doi:10.1007/s10107-014-0841-6
Fukumizu, K., Bach, F., & Jordan, M. I. (2004). Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5, 73–99.
Fukumizu, K., Gretton, A., Sun, X., & Schölkopf, B. (2008). Kernel measures of conditional dependence. In J. C. Platt, D. Koller, Y. Singer, & S. Roweis (Eds.), Advances in neural information processing systems, 20 (pp. 489–496). Cambridge, MA: MIT Press.
Fukumizu, K., Song, L., & Gretton, A. (2011). Kernel Bayes' rule. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. C. N. Pereira, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 24 (pp. 1737–1745). Red Hook, NY: Curran.
Fukumizu, K., Song, L., & Gretton, A. (2013). Kernel Bayes' rule: Bayesian inference with positive definite kernels. Journal of Machine Learning Research, 14, 3753–3783.
Fukumizu, K., Sriperumbudur, B., Gretton, A., & Schölkopf, B. (2009). Characteristic kernels on groups and semigroups. In D. Koller, D. Schuurmans, Y. Bengio, & L. Bottou (Eds.), Advances in neural information processing systems, 21 (pp. 473–480). Cambridge, MA: MIT Press.
Gordon, N. J., Salmond, D. J., & Smith, A. F. M. (1993). Novel approach to nonlinear/non-gaussian Bayesian state estimation. IEE Proceedings F, 140, 107–113.
Hofmann, T., Schölkopf, B., & Smola, A. J. (2008). Kernel methods in machine learning. Annals of Statistics, 36(3), 1171–1220.
Jaggi, M. (2013). Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning (pp. 427–435).
Jasra, A., Singh, S. S., Martin, J. S., & McCoy, E. (2012). Filtering via approximate Bayesian computation. Statistics and Computing, 22, 1223–1237.
Julier, S. J., & Uhlmann, J. K. (1997). A new extension of the Kalman filter to nonlinear systems. In Proceedings of AeroSense: The 11th International Symposium on Aerospace/Defence Sensing, Simulation and Controls. Bellingham, WA: SPIE.
Julier, S. J., & Uhlmann, J. K. (2004). Unscented filtering and nonlinear estimation. IEEE Review, 92, 401–422.
Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Transactions of the ASME, Journal of Basic Engineering, 82, 35–45.
Kanagawa, M., & Fukumizu, K. (2014). Recovering distributions from gaussian RKHS embeddings. In Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (pp. 457–465). JMLR.
Kanagawa, M., Nishiyama, Y., Gretton, A., & Fukumizu, K. (2014). Monte Carlo filtering using kernel embedding of distributions. In Proceedings of the 28th AAAI Conference on Artificial Intelligence (pp. 1897–1903). Cambridge, MA: MIT Press.
Ko, J., & Fox, D. (2009). GP-BayesFilters: Bayesian filtering using gaussian process prediction and observation models. Autonomous Robots, 72(1), 75–90.
Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (vol. 2, pp. 2169–2178). Washington, DC: IEEE Computer Society.
Liu, J. S. (2001). Monte Carlo strategies in scientific computing. New York: Springer-Verlag.
McCalman, L., O'Callaghan, S., & Ramos, F. (2013). Multi-modal estimation with kernel embeddings for learning motion models. In Proceedings of the 2013 IEEE International Conference on Robotics and Automation (pp. 2845–2852). Piscataway, NJ: IEEE.
Pan, S. J., & Yang, Q. (2010). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10), 1345–1359.
Pistohl, T., Ball, T., Schulze-Bonhage, A., Aertsen, A., & Mehring, C. (2008). Prediction of arm movement trajectories from ECoG-recordings in humans. Journal of Neuroscience Methods, 167(1), 105–114.
Pronobis, A., & Caputo, B. (2009). COLD: COsy localization database. International Journal of Robotics Research, 28(5), 588–594.
Quigley, M., Stavens, D., Coates, A., & Thrun, S. (2010). Sub-meter indoor localization in unmodified environments with inexpensive sensors. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 2010 (vol. 1, pp. 2039–2046). Piscataway, NJ: IEEE.
Ristic, B., Arulampalam, S., & Gordon, N. (2004). Beyond the Kalman filter: Particle filters for tracking applications. Norwood, MA: Artech House.
Schaid, D. J. (2010a). Genomic similarity and kernel methods I: Advancements by building on mathematical and statistical foundations. Human Heredity, 70(2), 109–131.
Schaid, D. J. (2010b). Genomic similarity and kernel methods II: Methods for genomic information. Human Heredity, 70(2), 132–140.
Schalk, G., Kubanek, J., Miller, K. J., Anderson, N. R., Leuthardt, E. C., Ojemann, J. G., ... Wolpaw, J. R. (2007). Decoding two dimensional movement trajectories using electrocorticographic signals in humans. Journal of Neural Engineering, 4(264), 264–275.
Schölkopf, B., & Smola, A. J. (2002). Learning with kernels. Cambridge, MA: MIT Press.
Schölkopf, B., Tsuda, K., & Vert, J. P. (2004). Kernel methods in computational biology. Cambridge, MA: MIT Press.
Silverman, B. W. (1986). Density estimation for statistics and data analysis. London: Chapman and Hall.
Smola, A., Gretton, A., Song, L., & Schölkopf, B. (2007). A Hilbert space embedding for distributions. In Proceedings of the International Conference on Algorithmic Learning Theory (pp. 13–31). New York: Springer.
Song, L., Fukumizu, K., & Gretton, A. (2013). Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine, 30(4), 98–111.
Song, L., Huang, J., Smola, A., & Fukumizu, K. (2009). Hilbert space embeddings of conditional distributions with applications to dynamical systems. In Proceedings of the 26th International Conference on Machine Learning (pp. 961–968). Madison, WI: Omnipress.
Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Schölkopf, B., & Lanckriet, G. R. (2010). Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11, 1517–1561.
Steinwart, I., & Christmann, A. (2008). Support vector machines. New York: Springer.
Stone, C. J. (1977). Consistent nonparametric regression. Annals of Statistics, 5(4), 595–620.
Thrun, S., Burgard, W., & Fox, D. (2005). Probabilistic robotics. Cambridge, MA: MIT Press.
Vlassis, N., Terwijn, B., & Kröse, B. (2002). Auxiliary particle filter robot localization from high-dimensional sensor observations. In Proceedings of the International Conference on Robotics and Automation (pp. 7–12). Piscataway, NJ: IEEE.
Wang, Z., Ji, Q., Miller, K. J., & Schalk, G. (2011). Prior knowledge improves decoding of finger flexion from electrocorticographic signals. Frontiers in Neuroscience, 5, 127.
Widom, H. (1963). Asymptotic behavior of the eigenvalues of certain integral equations. Transactions of the American Mathematical Society, 109, 278–295.
Widom, H. (1964). Asymptotic behavior of the eigenvalues of certain integral equations II. Archive for Rational Mechanics and Analysis, 17, 215–229.
Wolf, J., Burgard, W., & Burkhardt, H. (2005). Robust vision-based localization by combining an image retrieval system with Monte Carlo localization. IEEE Transactions on Robotics, 21(2), 208–216.

Received May 18, 2015; accepted October 14, 2015.

Page 6: Filtering with State-Observation Examples via Kernel Monte ...

• We show the consistency of the overall filtering procedure of KMCF under certain smoothness assumptions (see section 5.4): KMCF provides consistent posterior estimates as the number of state-observation examples $\{(X_i, Y_i)\}_{i=1}^n$ increases.

The rest of the letter is organized as follows. In section 2, we review related work. Section 3 is devoted to preliminaries; to make the letter self-contained, we review the theory of kernel mean embeddings. Section 4 presents the kernel Monte Carlo filter, and section 5 shows theoretical results. In section 6, we demonstrate the effectiveness of KMCF by artificial and real-data experiments. The real experiment is on vision-based mobile robot localization, an example of the location estimation problems mentioned above. The appendixes present two methods for reducing KMCF computational costs.

This letter expands on a conference paper by Kanagawa, Nishiyama, Gretton, and Fukumizu (2014). It differs from that earlier work in that it introduces and justifies the use of kernel herding for resampling. The resampling step allows us to control the effective sample size of an empirical kernel mean, an important factor that determines the accuracy of the sampling procedure, as in particle methods.

2 Related Work

We consider the following setting. First, the observation model $p(y_t|x_t)$ is not known explicitly or even parametrically. Instead, state-observation examples $\{(X_i, Y_i)\}$ are available before the test phase. Second, sampling from the transition model $p(x_t|x_{t-1})$ is possible. Note that standard particle filters cannot be applied to this setting directly, since they require the observation model to be given as a parametric model.

As far as we know, a few methods can be applied to this setting directly (Vlassis et al., 2002; Ferris et al., 2006). These methods learn the observation model from state-observation examples nonparametrically and then use it to run a particle filter with a transition model. Vlassis et al. (2002) proposed to apply conditional density estimation based on the k-nearest neighbors approach (Stone, 1977) for learning the observation model. A problem here is that conditional density estimation suffers from the curse of dimensionality if observations are high-dimensional (Silverman, 1986). Vlassis et al. (2002) avoided this problem by estimating the conditional density function of a state given an observation and used it as an alternative for the observation model. This heuristic may introduce bias in estimation, however. Ferris et al. (2006) proposed using gaussian process regression for learning the observation model. This method will perform well if the gaussian noise assumption is satisfied but cannot be applied to structured observations.

There exist related but different problem settings from ours. One situation is that examples for state transitions are also given, and the transition model is to be learned nonparametrically from these examples. For this setting, there are methods based on kernel mean embeddings (Song, Huang, Smola, & Fukumizu, 2009; Fukumizu et al., 2011, 2013) and gaussian processes (Ko & Fox, 2009; Deisenroth, Huber, & Hanebeck, 2009). The filtering method by Fukumizu et al. (2011, 2013) is in particular closely related to KMCF, as it also uses kernel Bayes' rule. A main difference from KMCF is that it computes forward probabilities by the kernel sum rule (Song et al., 2009, 2013), which nonparametrically learns the transition model from the state transition examples. While the setting is different from ours, we compare KMCF with this method in our experiments as a baseline.

Another related setting is that the observation model itself is given and sampling is possible, but computation of its values is expensive or even impossible. Therefore, ordinary Bayes' rule cannot be used for filtering. To overcome this limitation, Jasra, Singh, Martin, and McCoy (2012) and Calvet and Czellar (2015) proposed applying approximate Bayesian computation (ABC) methods. For each iteration of filtering, these methods generate state-observation pairs from the observation model. Then they pick some pairs that have observations close to the test observation and regard the states in these pairs as samples from a posterior. Note that these methods are not applicable to our setting, since we do not assume that the observation model is provided. That said, our method may be applied to their setting by generating state-observation examples from the observation model. While such a comparison would be interesting, this letter focuses on comparison among the methods applicable to our setting.

3 Kernel Mean Embeddings of Distributions

Here we briefly review the framework of kernel mean embeddings. For details, we refer to the tutorial papers (Smola et al., 2007; Song et al., 2013).

3.1 Positive-Definite Kernel and RKHS. We begin by introducing positive-definite kernels and reproducing kernel Hilbert spaces, details of which can be found in Scholkopf and Smola (2002), Berlinet and Thomas-Agnan (2004), and Steinwart and Christmann (2008).

Let X be a set and k : X × X → R be a positive-definite (p.d.) kernel.^1

^1 A symmetric kernel k : X × X → R is called positive definite (p.d.) if for all n ∈ N, c_1, ..., c_n ∈ R, and X_1, ..., X_n ∈ X, we have
$$\sum_{i=1}^n \sum_{j=1}^n c_i c_j k(X_i, X_j) \geq 0.$$

Any positive-definite kernel is uniquely associated with a reproducing kernel Hilbert space (RKHS) (Aronszajn, 1950). Let H be the RKHS associated with k. The RKHS H is a Hilbert space of functions on X that satisfies the following important properties:


• Feature vector: k(·, x) ∈ H for all x ∈ X.
• Reproducing property: f(x) = ⟨f, k(·, x)⟩_H for all f ∈ H and x ∈ X,

where ⟨·, ·⟩_H denotes the inner product equipped with H and k(·, x) is a function with x fixed. By the reproducing property, we have
$$k(x, x') = \langle k(\cdot, x), k(\cdot, x') \rangle_H \quad \forall x, x' \in X.$$
Namely, k(x, x') implicitly computes the inner product between the functions k(·, x) and k(·, x'). From this property, k(·, x) can be seen as an implicit representation of x in H. Therefore, k(·, x) is called the feature vector of x, and H the feature space. It is also known that the subspace spanned by the feature vectors {k(·, x) | x ∈ X} is dense in H. This means that any function f in H can be written as the limit of functions of the form $f_n = \sum_{i=1}^n c_i k(\cdot, X_i)$, where c_1, ..., c_n ∈ R and X_1, ..., X_n ∈ X.

For example, positive-definite kernels on the Euclidean space X = R^d include the gaussian kernel $k(x, x') = \exp(-\|x - x'\|_2^2 / 2\sigma^2)$ and the Laplace kernel $k(x, x') = \exp(-\|x - x'\|_1 / \sigma)$, where σ > 0 and ‖·‖_1 denotes the ℓ_1 norm. Notably, kernel methods allow X to be a set of structured data such as images, texts, or graphs. In fact, there exist various positive-definite kernels developed for such structured data (Hofmann et al., 2008). Note that the notion of positive-definite kernels is different from smoothing kernels in kernel density estimation (Silverman, 1986): a smoothing kernel does not necessarily define an RKHS.
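To make the two kernels above concrete, the following short Python sketch (ours, not part of the letter; NumPy only) evaluates the gaussian and Laplace kernels on R^d and checks positive definiteness empirically by confirming that a Gram matrix has no negative eigenvalues (up to numerical error). All function names here are illustrative.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=0.1):
    # k(x, x') = exp(-||x - x'||_2^2 / (2 sigma^2))
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def laplace_kernel(x, y, sigma=0.1):
    # k(x, x') = exp(-||x - x'||_1 / sigma)
    return np.exp(-np.sum(np.abs(x - y)) / sigma)

def gram_matrix(kernel, X):
    # X: (n, d) array of points; returns the n x n Gram matrix
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 2))
    G = gram_matrix(gaussian_kernel, X)
    # Positive definiteness implies all eigenvalues are nonnegative.
    print(np.min(np.linalg.eigvalsh(G)) >= -1e-10)
```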

3.2 Kernel Means. We use the kernel k and the RKHS H to represent probability distributions on X. This is the framework of kernel mean embeddings (Smola et al., 2007). Let X be a measurable space and k be measurable and bounded on X.^2 Let P be an arbitrary probability distribution on X. Then the representation of P in H is defined as the mean of the feature vector:
$$m_P = \int k(\cdot, x) dP(x) \in H, \qquad (3.1)$$
which is called the kernel mean of P.

^2 k is bounded on X if $\sup_{x \in X} k(x, x) < \infty$.

If k is characteristic, the kernel mean, equation 3.1, preserves all the information about P: a positive-definite kernel k is defined to be characteristic if the mapping P → m_P ∈ H is one-to-one (Fukumizu, Bach, & Jordan, 2004; Fukumizu, Gretton, Sun, & Scholkopf, 2008; Sriperumbudur et al., 2010). This means that the RKHS is rich enough to distinguish among all distributions. For example, the gaussian and Laplace kernels are characteristic. (For conditions for kernels to be characteristic, see Fukumizu, Sriperumbudur, Gretton, & Scholkopf, 2009, and Sriperumbudur et al., 2010.) We assume henceforth that kernels are characteristic.

An important property of the kernel mean, equation 3.1, is the following: by the reproducing property, we have
$$\langle m_P, f \rangle_H = \int f(x) dP(x) = E_{X \sim P}[f(X)] \quad \forall f \in H; \qquad (3.2)$$
that is, the expectation of any function in the RKHS can be given by the inner product between the kernel mean and that function.

3.3 Estimation of Kernel Means. Suppose that distribution P is unknown and that we wish to estimate P from available samples. This can be equivalently done by estimating its kernel mean m_P, since m_P preserves all the information about P.

For example, let X_1, ..., X_n be an independent and identically distributed (i.i.d.) sample from P. Define an estimator of m_P by the empirical mean
$$\hat{m}_P = \frac{1}{n} \sum_{i=1}^n k(\cdot, X_i).$$
Then this converges to m_P at a rate $\|\hat{m}_P - m_P\|_H = O_p(n^{-1/2})$ (Smola et al., 2007), where O_p denotes the asymptotic order in probability and ‖·‖_H is the norm of the RKHS: $\|f\|_H = \sqrt{\langle f, f \rangle_H}$ for all f ∈ H. Note that this rate is independent of the dimensionality of the space X.
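As an illustration (ours, not from the letter), the next Python sketch forms the empirical kernel mean of an i.i.d. sample with a gaussian kernel and uses the empirical analogue of property 3.2 to approximate an expectation E_{X~P}[f(X)] by the corresponding weighted average; the test function `f` and all variable names are assumptions made for the example.

```python
import numpy as np

def gaussian_kernel_vec(x, X, sigma=0.5):
    # k(x, X_i) = exp(-||x - X_i||_2^2 / (2 sigma^2)) for all rows X_i of X
    return np.exp(-np.sum((X - x) ** 2, axis=1) / (2.0 * sigma ** 2))

rng = np.random.default_rng(1)
X = rng.normal(loc=0.0, scale=1.0, size=(2000, 1))   # i.i.d. sample from P = N(0, 1)
n = len(X)
weights = np.full(n, 1.0 / n)                        # uniform weights of the empirical kernel mean

# Evaluation of the empirical kernel mean at a point x: m_hat(x) = (1/n) sum_i k(x, X_i)
m_hat = lambda x: np.dot(weights, gaussian_kernel_vec(x, X))
print(m_hat(np.zeros(1)))

# Empirical analogue of <m_P, f>_H = E[f(X)]: the weighted average of f over the sample.
f = lambda x: np.exp(-x[0] ** 2)                     # test function; for N(0,1), E[f(X)] = 1/sqrt(3)
print(np.dot(weights, np.array([f(x) for x in X])), 1.0 / np.sqrt(3.0))
```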

Next we explain kernel Bayes' rule, which serves as a building block of our filtering algorithm. To this end, we introduce two measurable spaces X and Y. Let p(x, y) be a joint probability on the product space X × Y that decomposes as p(x, y) = p(y|x)p(x). Let π(x) be a prior distribution on X. Then the conditional probability p(y|x) and the prior π(x) define the posterior distribution by Bayes' rule:
$$p^\pi(x|y) \propto p(y|x)\, \pi(x).$$

The assumption here is that the conditional probability p(y|x) is unknown. Instead, we are given an i.i.d. sample (X_1, Y_1), ..., (X_n, Y_n) from the joint probability p(x, y). We wish to estimate the posterior p^π(x|y) using the sample. KBR achieves this by estimating the kernel mean of p^π(x|y).

KBR requires that kernels be defined on X and Y. Let k_X and k_Y be kernels on X and Y, respectively. Define the kernel means of the prior π(x) and the posterior p^π(x|y):
$$m_\pi = \int k_X(\cdot, x)\pi(x)dx, \qquad m^\pi_{X|y} = \int k_X(\cdot, x)p^\pi(x|y)dx.$$


KBR also requires that m_π be expressed as a weighted sample. Let $\hat{m}_\pi = \sum_{j=1}^\ell \gamma_j k_X(\cdot, U_j)$ be a sample expression of m_π, where ℓ ∈ N, γ_1, ..., γ_ℓ ∈ R, and U_1, ..., U_ℓ ∈ X. For example, suppose U_1, ..., U_ℓ are i.i.d. drawn from π(x). Then γ_j = 1/ℓ suffices.

Given the joint sample $(X_i, Y_i)_{i=1}^n$ and the empirical prior mean $\hat{m}_\pi$, KBR estimates the kernel posterior mean $m^\pi_{X|y}$ as a weighted sum of the feature vectors:
$$\hat{m}^\pi_{X|y} = \sum_{i=1}^n w_i k_X(\cdot, X_i), \qquad (3.3)$$
where the weights $w = (w_1, \ldots, w_n)^T \in R^n$ are given by algorithm 1. Here, diag(v) for v ∈ R^n denotes a diagonal matrix with diagonal entries v. It takes as input (1) vectors $k_Y = (k_Y(y, Y_1), \ldots, k_Y(y, Y_n))^T$ and $\hat{m}_\pi = (\hat{m}_\pi(X_1), \ldots, \hat{m}_\pi(X_n))^T \in R^n$, where $\hat{m}_\pi(X_i) = \sum_{j=1}^\ell \gamma_j k_X(X_i, U_j)$; (2) kernel matrices $G_X = (k_X(X_i, X_j)),\ G_Y = (k_Y(Y_i, Y_j)) \in R^{n \times n}$; and (3) regularization constants ε, δ > 0. The weight vector $w = (w_1, \ldots, w_n)^T \in R^n$ is obtained by matrix computations involving two regularized matrix inversions. Note that these weights can be negative.

Fukumizu et al. (2013) showed that KBR is a consistent estimator of the kernel posterior mean under certain smoothness assumptions: the estimate, equation 3.3, converges to $m^\pi_{X|y}$ as the sample size goes to infinity, n → ∞, and $\hat{m}_\pi$ converges to m_π (with ε, δ → 0 at an appropriate speed). (For details, see Fukumizu et al., 2013; Song et al., 2013.)
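The letter defers the exact weight computation to algorithm 1. The following Python sketch shows one common form of the kernel Bayes' rule weights, following our reading of Fukumizu et al. (2013): a first regularized inversion turns the prior into a diagonal weighting Λ, and a second regularized inversion involving Λ G_Y produces the posterior weights. The precise scaling of the regularizers (nε versus ε, and similarly for δ) is an assumption here and may differ from algorithm 1.

```python
import numpy as np

def kbr_weights(G_X, G_Y, m_pi, k_y, eps, delta):
    """Kernel Bayes' rule weights (hedged sketch, not algorithm 1 verbatim).

    G_X, G_Y   : (n, n) kernel matrices on states and observations
    m_pi       : (n,) vector of prior kernel-mean evaluations at X_1, ..., X_n
    k_y        : (n,) vector (k_Y(y, Y_1), ..., k_Y(y, Y_n)) for the new observation y
    eps, delta : regularization constants
    Returns the (possibly negative) posterior weights w of equation 3.3.
    """
    n = G_X.shape[0]
    # First regularized inversion: express the prior on the joint sample.
    mu = np.linalg.solve(G_X + n * eps * np.eye(n), m_pi)
    Lam = np.diag(mu)
    LG = Lam @ G_Y
    # Second regularized inversion: posterior weights for the observation y.
    w = LG @ np.linalg.solve(LG @ LG + delta * np.eye(n), Lam @ k_y)
    return w
```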

3.4 Decoding from Empirical Kernel Means. In general, as shown above, a kernel mean m_P is estimated as a weighted sum of feature vectors:
$$\hat{m}_P = \sum_{i=1}^n w_i k(\cdot, X_i), \qquad (3.4)$$
with samples X_1, ..., X_n ∈ X and (possibly negative) weights w_1, ..., w_n ∈ R. Suppose $\hat{m}_P$ is close to m_P, that is, $\|\hat{m}_P - m_P\|_H$ is small. Then $\hat{m}_P$ is supposed to have accurate information about P, as m_P preserves all the information of P.


How can we decode the information of P from $\hat{m}_P$? The empirical kernel mean, equation 3.4, has the following property, which is due to the reproducing property of the kernel:
$$\langle \hat{m}_P, f \rangle_H = \sum_{i=1}^n w_i f(X_i) \quad \forall f \in H. \qquad (3.5)$$
Namely, the weighted average of any function in the RKHS is equal to the inner product between the empirical kernel mean and that function. This is analogous to the property 3.2 of the population kernel mean m_P. Let f be any function in H. From these properties, equations 3.2 and 3.5, we have
$$\left| E_{X \sim P}[f(X)] - \sum_{i=1}^n w_i f(X_i) \right| = |\langle m_P - \hat{m}_P, f \rangle_H| \le \|m_P - \hat{m}_P\|_H \|f\|_H,$$
where we used the Cauchy-Schwartz inequality. Therefore, the left-hand side will be close to 0 if the error $\|m_P - \hat{m}_P\|_H$ is small. This shows that the expectation of f can be estimated by the weighted average $\sum_{i=1}^n w_i f(X_i)$. Note that here f is a function in the RKHS, but the same can also be shown for functions outside the RKHS under certain assumptions (Kanagawa & Fukumizu, 2014). In this way, the estimator of the form 3.4 provides estimators of moments, probability masses on sets, and the density function (if this exists). We explain this in the context of state-space models in section 4.4.

3.5 Kernel Herding. Here we explain kernel herding (Chen et al., 2010), another building block of the proposed filter. Suppose the kernel mean m_P is known. We wish to generate samples $x_1, x_2, \ldots, x_\ell \in X$ such that the empirical mean $\hat{m}_P = \frac{1}{\ell} \sum_{i=1}^\ell k(\cdot, x_i)$ is close to m_P, that is, $\|\hat{m}_P - m_P\|_H$ is small. This should be done only using m_P. Kernel herding achieves this by greedy optimization using the following update equations:
$$x_1 = \arg\max_{x \in X} m_P(x), \qquad (3.6)$$
$$x_\ell = \arg\max_{x \in X} \left[ m_P(x) - \frac{1}{\ell} \sum_{i=1}^{\ell-1} k(x, x_i) \right] \quad (\ell \ge 2), \qquad (3.7)$$
where m_P(x) denotes the evaluation of m_P at x (recall that m_P is a function in H).

An intuitive interpretation of this procedure can be given if there is a constant R > 0 such that k(x, x) = R for all x ∈ X (e.g., R = 1 if k is gaussian). Suppose that $x_1, \ldots, x_{\ell-1}$ are already calculated. In this case, it can be shown


that $x_\ell$ in equation 3.7 is the minimizer of
$$E_\ell = \left\| m_P - \frac{1}{\ell} \sum_{i=1}^\ell k(\cdot, x_i) \right\|_H. \qquad (3.8)$$
Thus, kernel herding performs greedy minimization of the distance between m_P and the empirical kernel mean $\hat{m}_P = \frac{1}{\ell} \sum_{i=1}^\ell k(\cdot, x_i)$.

It can be shown that the error $E_\ell$ of equation 3.8 decreases at a rate at least $O(\ell^{-1/2})$ under the assumption that k is bounded (Bach, Lacoste-Julien, & Obozinski, 2012). In other words, the herding samples $x_1, \ldots, x_\ell$ provide a convergent approximation of m_P. In this sense, kernel herding can be seen as a (pseudo) sampling method. Note that m_P itself can be an empirical kernel mean of the form 3.4. These properties are important for our resampling algorithm developed in section 4.2.

It should be noted that $E_\ell$ decreases at a faster rate $O(\ell^{-1})$ under a certain assumption (Chen et al., 2010); this is much faster than the rate of i.i.d. samples, $O(\ell^{-1/2})$. Unfortunately, this assumption holds only when H is finite dimensional (Bach et al., 2012), and therefore the fast rate of $O(\ell^{-1})$ has not been guaranteed for infinite-dimensional cases. Nevertheless, this fast rate motivates the use of kernel herding in the data reduction method in section C.2 in appendix C (we will use kernel herding for two different purposes).

4 Kernel Monte Carlo Filter

In this section, we present our kernel Monte Carlo filter (KMCF). First, we define notation and review the problem setting in section 4.1. We then describe the algorithm of KMCF in section 4.2. We discuss implementation issues such as hyperparameter selection and computational cost in section 4.3. We explain how to decode the information on the posteriors from the estimated kernel means in section 4.4.

4.1 Notation and Problem Setup. Here we formally define the setup explained in section 1. The notation is summarized in Table 1.

We consider a state-space model (see Figure 1). Let X and Y be measurable spaces, which serve as a state space and an observation space, respectively. Let $x_1, \ldots, x_t, \ldots, x_T \in X$ be a sequence of hidden states, which follow a Markov process. Let $p(x_t|x_{t-1})$ denote a transition model that defines this Markov process. Let $y_1, \ldots, y_t, \ldots, y_T \in Y$ be a sequence of observations. Each observation y_t is assumed to be generated from an observation model $p(y_t|x_t)$ conditioned on the corresponding state x_t. We use the abbreviation $y_{1:t} = \{y_1, \ldots, y_t\}$.

We consider a filtering problem of estimating the posterior distribution $p(x_t|y_{1:t})$ for each time t = 1, ..., T. The estimation is to be done online,


Table 1: Notation.

X : State space
Y : Observation space
x_t ∈ X : State at time t
y_t ∈ Y : Observation at time t
p(y_t|x_t) : Observation model
p(x_t|x_{t-1}) : Transition model
(X_i, Y_i)_{i=1}^n : State-observation examples
k_X : Positive-definite kernel on X
k_Y : Positive-definite kernel on Y
H_X : RKHS associated with k_X
H_Y : RKHS associated with k_Y

as each y_t is given. Specifically, we consider the following setting (see also section 1):

1. The observation model $p(y_t|x_t)$ is not known explicitly or even parametrically. Instead, we are given examples of state-observation pairs $(X_i, Y_i)_{i=1}^n \subset X \times Y$ prior to the test phase. The observation model is also assumed time homogeneous.
2. Sampling from the transition model $p(x_t|x_{t-1})$ is possible. Its probabilistic model can be an arbitrary nonlinear, nongaussian distribution, as for standard particle filters. It can further depend on time. For example, control input can be included in the transition model as $p(x_t|x_{t-1}) = p(x_t|x_{t-1}, u_t)$, where u_t denotes control input provided by a user at time t.

Let $k_X : X \times X \to R$ and $k_Y : Y \times Y \to R$ be positive-definite kernels on X and Y, respectively. Denote by H_X and H_Y their respective RKHSs. We address the above filtering problem by estimating the kernel means of the posteriors:
$$m_{x_t|y_{1:t}} = \int k_X(\cdot, x_t) p(x_t|y_{1:t}) dx_t \in H_X \quad (t = 1, \ldots, T). \qquad (4.1)$$
These preserve all the information of the corresponding posteriors if the kernels are characteristic (see section 3.2). Therefore, the resulting estimates of these kernel means provide us the information of the posteriors, as explained in section 4.4.

4.2 Algorithm. KMCF iterates three steps of prediction, correction, and resampling for each time t. Suppose that we have just finished the iteration at time t − 1. Then, as shown later, the resampling step yields the following estimator of equation 4.1 at time t − 1:


Figure 2: One iteration of KMCF. Here X_1, ..., X_8 and Y_1, ..., Y_8 denote states and observations, respectively, in the state-observation examples $(X_i, Y_i)_{i=1}^n$ (suppose n = 8). 1. Prediction step: The kernel mean of the prior, equation 4.5, is estimated by sampling with the transition model $p(x_t|x_{t-1})$. 2. Correction step: The kernel mean of the posterior, equation 4.1, is estimated by applying kernel Bayes' rule (see algorithm 1). The estimation makes use of the information of the prior (expressed as $\hat{m}_\pi = (\hat{m}_{x_t|y_{1:t-1}}(X_i)) \in R^8$) as well as that of a new observation y_t (expressed as $k_Y = (k_Y(y_t, Y_i)) \in R^8$). The resulting estimate, equation 4.6, is expressed as a weighted sample $(w_{t,i}, X_i)_{i=1}^n$. Note that the weights may be negative. 3. Resampling step: Samples associated with small weights are eliminated, and those with large weights are replicated by applying kernel herding (see algorithm 2). The resulting samples provide an empirical kernel mean, equation 4.7, which will be used in the next iteration.

$$\bar{m}_{x_{t-1}|y_{1:t-1}} = \frac{1}{n} \sum_{i=1}^n k_X(\cdot, \bar{X}_{t-1,i}), \qquad (4.2)$$
where $\bar{X}_{t-1,1}, \ldots, \bar{X}_{t-1,n} \in X$. We show one iteration of KMCF that estimates the kernel mean 4.1 at time t (see also Figure 2).

4.2.1 Prediction Step. The prediction step is as follows. We generate a sample from the transition model for each $\bar{X}_{t-1,i}$ in equation 4.2:
$$X_{t,i} \sim p(x_t | x_{t-1} = \bar{X}_{t-1,i}) \quad (i = 1, \ldots, n). \qquad (4.3)$$


We then specify a new empirical kernel mean:
$$\hat{m}_{x_t|y_{1:t-1}} = \frac{1}{n} \sum_{i=1}^n k_X(\cdot, X_{t,i}). \qquad (4.4)$$
This is an estimator of the following kernel mean of the prior:
$$m_{x_t|y_{1:t-1}} = \int k_X(\cdot, x_t) p(x_t|y_{1:t-1}) dx_t \in H_X, \qquad (4.5)$$
where
$$p(x_t|y_{1:t-1}) = \int p(x_t|x_{t-1}) p(x_{t-1}|y_{1:t-1}) dx_{t-1}$$
is the prior distribution of the current state x_t. Thus, equation 4.4 serves as a prior for the subsequent posterior estimation.

In section 5, we theoretically analyze this sampling procedure in detail and provide justification of equation 4.4 as an estimator of the kernel mean, equation 4.5. We emphasize here that such an analysis is necessary even though the sampling procedure is similar to that of a particle filter: the theory of particle methods does not provide a theoretical justification of equation 4.4 as a kernel mean estimator, since it deals with probabilities as empirical distributions.
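As an illustration (ours, not from the letter), the prediction step for a one-dimensional gaussian random-walk transition model might look as follows; the transition model here is an assumed example, not one used in the letter.

```python
import numpy as np

def prediction_step(X_prev, transition_sampler, rng):
    """Equation 4.3: propagate each resampled state through the transition model.

    X_prev             : (n, d) array of resampled states at time t-1
    transition_sampler : function (x, rng) -> one sample from p(x_t | x_{t-1} = x)
    Returns an (n, d) array whose uniform-weight kernel mean is equation 4.4.
    """
    return np.array([transition_sampler(x, rng) for x in X_prev])

# Example: a gaussian random-walk transition model (an assumption for this sketch).
random_walk = lambda x, rng: x + rng.normal(scale=0.1, size=x.shape)

rng = np.random.default_rng(0)
X_prev = rng.normal(size=(100, 1))          # stand-in for the resampled states at time t-1
X_pred = prediction_step(X_prev, random_walk, rng)
```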

4.2.2 Correction Step. This step estimates the kernel mean, equation 4.1, of the posterior by using kernel Bayes' rule (see algorithm 1) in section 3.3. This makes use of the new observation y_t, the state-observation examples $(X_i, Y_i)_{i=1}^n$, and the estimate, equation 4.4, of the prior.

The input of algorithm 1 consists of (1) vectors
$$k_Y = (k_Y(y_t, Y_1), \ldots, k_Y(y_t, Y_n))^T \in R^n,$$
$$\hat{m}_\pi = (\hat{m}_{x_t|y_{1:t-1}}(X_1), \ldots, \hat{m}_{x_t|y_{1:t-1}}(X_n))^T = \left( \frac{1}{n} \sum_{i=1}^n k_X(X_q, X_{t,i}) \right)_{q=1}^n \in R^n,$$
which are interpreted as expressions of y_t and $\hat{m}_{x_t|y_{1:t-1}}$ using the sample $(X_i, Y_i)_{i=1}^n$; (2) kernel matrices $G_X = (k_X(X_i, X_j)),\ G_Y = (k_Y(Y_i, Y_j)) \in R^{n \times n}$; and (3) regularization constants ε, δ > 0. These constants ε, δ, as well as the kernels k_X, k_Y, are hyperparameters of KMCF (we discuss how to choose these parameters later).


Algorithm 1 outputs a weight vector $w = (w_1, \ldots, w_n) \in R^n$. Normalizing these weights, $w_t := w / \sum_{i=1}^n w_i$, we obtain an estimator of equation 4.1^3 as
$$\hat{m}_{x_t|y_{1:t}} = \sum_{i=1}^n w_{t,i} k_X(\cdot, X_i). \qquad (4.6)$$
The apparent difference from a particle filter is that the posterior (kernel mean) estimator, equation 4.6, is expressed in terms of the samples X_1, ..., X_n in the training sample $(X_i, Y_i)_{i=1}^n$, not with the samples from the prior, equation 4.4. This requires that the training samples X_1, ..., X_n cover the support of the posterior $p(x_t|y_{1:t})$ sufficiently well. If this does not hold, we cannot expect good performance for the posterior estimate. Note that this is also true for any methods that deal with the setting of this letter: poverty of training samples in a certain region means that we do not have any information about the observation model $p(y_t|x_t)$ in that region.

4.2.3 Resampling Step. This step applies the update equations 3.6 and 3.7 of kernel herding in section 3.5 to the estimate, equation 4.6. This is to obtain samples $\bar{X}_{t,1}, \ldots, \bar{X}_{t,n}$ such that
$$\bar{m}_{x_t|y_{1:t}} = \frac{1}{n} \sum_{i=1}^n k_X(\cdot, \bar{X}_{t,i}) \qquad (4.7)$$
is close to equation 4.6 in the RKHS. Our theoretical analysis in section 5 shows that such a procedure can reduce the error of the prediction step at time t + 1.

The procedure is summarized in algorithm 2. Specifically, we generate each $\bar{X}_{t,i}$ by searching for the solution of the optimization problem in equations 3.6 and 3.7 from the finite set of samples X_1, ..., X_n in equation 4.6. We allow repetitions in $\bar{X}_{t,1}, \ldots, \bar{X}_{t,n}$. We can expect that the resulting equation 4.7 is close to equation 4.6 in the RKHS if the samples X_1, ..., X_n cover the support of the posterior $p(x_t|y_{1:t})$ sufficiently. This is verified by the theoretical analysis of section 5.3.

^3 For this normalization procedure, see the discussion in section 4.3.

Here, searching for the solutions from a finite set reduces the computational costs of kernel herding. It is possible to search from the entire space X if we have sufficient time or if the sample size n is small enough; this depends on applications and available computational resources. We also note that the size of the resampling samples is not necessarily n; this depends on how accurately these samples approximate equation 4.6. Thus, a smaller number of samples may be sufficient. In this case, we can reduce the computational costs of resampling, as discussed in section 5.2.

The aim of our resampling step is similar to that of the resampling step of a particle filter (see Doucet & Johansen, 2011). Intuitively, the aim is to eliminate samples with very small weights and replicate those with large weights (see Figures 2 and 3). In particle methods, this is realized by generating samples from the empirical distribution defined by a weighted sample (therefore, this procedure is called resampling). Our resampling step is a realization of such a procedure in terms of the kernel mean embedding: we generate samples $\bar{X}_{t,1}, \ldots, \bar{X}_{t,n}$ from the empirical kernel mean, equation 4.6.

Note that the resampling algorithm of particle methods is not appropriate for use with kernel mean embeddings. This is because it assumes that weights are positive, but our weights in equation 4.6 can be negative, as this equation is a kernel mean estimator. One may apply the resampling algorithm of particle methods by first truncating the samples with negative weights. However, there is no guarantee that samples obtained by this heuristic produce a good approximation of equation 4.6 as a kernel mean, as shown by experiments in section 6.1. In this sense, the use of kernel herding is more natural, since it generates samples that approximate a kernel mean.
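A minimal Python sketch of the resampling step (our reading of algorithm 2, not the letter's pseudocode): it reuses the herding updates of section 3.5 but restricts the argmax to the points X_1, ..., X_n that carry the weights of equation 4.6. Function and variable names are ours.

```python
import numpy as np

def herding_resample(weights, G_X, num_samples):
    """Resampling step (sketch of algorithm 2).

    weights : (n,) KBR weights w_t of equation 4.6 (may be negative)
    G_X     : (n, n) kernel matrix (k_X(X_i, X_j)) on the training states
    Returns indices into X_1, ..., X_n of the herding samples (repetitions allowed),
    whose uniform-weight kernel mean is equation 4.7.
    """
    mP_vals = G_X @ weights                  # estimate 4.6 evaluated at X_1, ..., X_n
    running_sum = np.zeros(len(weights))     # sum of k(., chosen) evaluated at X_1, ..., X_n
    indices = []
    for l in range(1, num_samples + 1):
        objective = mP_vals - running_sum / l    # updates 3.6 and 3.7 restricted to {X_i}
        idx = int(np.argmax(objective))
        indices.append(idx)
        running_sum += G_X[:, idx]
    return np.array(indices)
```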

4.2.4 Overall Algorithm. We summarize the overall procedure of KMCF in algorithm 3, where p_init denotes a prior distribution for the initial state x_1. For each time t, KMCF takes as input an observation y_t and outputs a weight vector $w_t = (w_{t,1}, \ldots, w_{t,n})^T \in R^n$. Combined with the samples X_1, ..., X_n in the state-observation examples $(X_i, Y_i)_{i=1}^n$, these weights provide an estimator, equation 4.6, of the kernel mean of the posterior, equation 4.1.

We first compute kernel matrices G_X, G_Y (lines 4-5), which are used in algorithm 1 of kernel Bayes' rule (line 15). For t = 1, we generate an i.i.d. sample $X_{1,1}, \ldots, X_{1,n}$ from the initial distribution p_init (line 8), which provides an estimator of the prior corresponding to equation 4.4. Line 10 is the resampling step at time t − 1, and line 11 is the prediction step at time t. Lines 13 to 16 correspond to the correction step.
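Putting the pieces together, one KMCF iteration might be wired up as below. This sketch (ours, not the letter's algorithm 3 verbatim) reuses the hypothetical helpers `gaussian_gram`, `kbr_weights`, `herding_resample`, and `prediction_step` from the earlier sketches; the regularization scaling inside `kbr_weights` remains an assumption.

```python
import numpy as np

def kmcf_step(y_t, X_train, G_X, G_Y, kY_func, X_pred, eps, delta, rng, transition_sampler):
    """One KMCF iteration (sketch): correction at time t, then resampling and prediction for t+1.

    y_t     : new observation
    X_train : (n, d) training states X_1, ..., X_n
    G_X/G_Y : (n, n) kernel matrices on training states / observations
    kY_func : function y -> (n,) vector (k_Y(y, Y_1), ..., k_Y(y, Y_n))
    X_pred  : (n, d) samples X_{t,i} from the prediction step, representing the prior 4.4
    Returns the normalized posterior weights w_t and the propagated samples for time t+1.
    """
    n = len(X_train)
    # Prior kernel mean evaluated at the training states (input m_pi of algorithm 1).
    m_pi = gaussian_gram(X_train, X_pred) @ np.full(n, 1.0 / n)
    # Correction step: kernel Bayes' rule, then weight normalization (section 4.3.3).
    w = kbr_weights(G_X, G_Y, m_pi, kY_func(y_t), eps, delta)
    w_t = w / np.sum(w)
    # Resampling step followed by the prediction step for the next iteration.
    idx = herding_resample(w_t, G_X, n)
    X_next = prediction_step(X_train[idx], transition_sampler, rng)
    return w_t, X_next
```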


4.3 Discussion. The estimation accuracy of KMCF can depend on several factors in practice.

4.3.1 Training Samples. We first note that training samples $(X_i, Y_i)_{i=1}^n$ should provide the information concerning the observation model $p(y_t|x_t)$. For example, $(X_i, Y_i)_{i=1}^n$ may be an i.i.d. sample from a joint distribution p(x, y) on X × Y, which decomposes as p(x, y) = p(y|x)p(x). Here, p(y|x) is the observation model and p(x) is some distribution on X. The support of p(x) should cover the region where states $x_1, \ldots, x_T$ may pass in the test phase, as discussed in section 4.2. For example, this is satisfied when the state space X is compact and the support of p(x) is the entire X.

Note that training samples $(X_i, Y_i)_{i=1}^n$ can also be non-i.i.d. in practice. For example, we may deterministically select X_1, ..., X_n so that they cover the region of interest. In location estimation problems in robotics, for instance, we may collect location-sensor examples $(X_i, Y_i)_{i=1}^n$ so that locations X_1, ..., X_n cover the region where location estimation is to be conducted (Quigley et al., 2010).

4.3.2 Hyperparameters. As in other kernel methods in general, the performance of KMCF depends on the choice of its hyperparameters, which are the kernels k_X and k_Y (or parameters in the kernels, e.g., the bandwidth of the gaussian kernel) and the regularization constants δ, ε > 0. We need to define these hyperparameters based on the joint sample $(X_i, Y_i)_{i=1}^n$ before running the algorithm on the test data $y_1, \ldots, y_T$. This can be done by cross-validation. Suppose that $(X_i, Y_i)_{i=1}^n$ is given as a sequence from the


state-space model. We can then apply two-fold cross-validation by dividing the sequence into two subsequences. If $(X_i, Y_i)_{i=1}^n$ is not a sequence, we can rely on the cross-validation procedure for kernel Bayes' rule (see section 4.2 of Fukumizu et al., 2013).

4.3.3 Normalization of Weights. We found in our preliminary experiments that normalization of the weights (see line 16, algorithm 3) is beneficial to the filtering performance. This may be justified by the following discussion about a kernel mean estimator in general. Let us consider a consistent kernel mean estimator $\hat{m}_P = \sum_{i=1}^n w_i k(\cdot, X_i)$ such that $\lim_{n\to\infty} \|\hat{m}_P - m_P\|_H = 0$. Then we can show that the sum of the weights converges to 1, $\lim_{n\to\infty} \sum_{i=1}^n w_i = 1$, under certain assumptions (Kanagawa & Fukumizu, 2014). This could be explained as follows. Recall that the weighted average $\sum_{i=1}^n w_i f(X_i)$ of a function f is an estimator of the expectation $\int f(x) dP(x)$. Let f be a function that takes the value 1 for any input: f(x) = 1 for all x ∈ X. Then we have $\sum_{i=1}^n w_i f(X_i) = \sum_{i=1}^n w_i$ and $\int f(x) dP(x) = 1$. Therefore, $\sum_{i=1}^n w_i$ is an estimator of 1. In other words, if the error $\|\hat{m}_P - m_P\|_H$ is small, then the sum of the weights $\sum_{i=1}^n w_i$ should be close to 1. Conversely, if the sum of the weights is far from 1, it suggests that the estimate $\hat{m}_P$ is not accurate. Based on this theoretical observation, we suppose that normalization of the weights (this makes the sum equal to 1) results in a better estimate.

4.3.4 Time Complexity. For each time t, the naive implementation of algorithm 3 requires a time complexity of O(n^3) for the size n of the joint sample $(X_i, Y_i)_{i=1}^n$. This comes from algorithm 1 in line 15 (kernel Bayes' rule) and algorithm 2 in line 10 (resampling). The complexity O(n^3) of algorithm 1 is due to the matrix inversions. Note that one of the inversions, $(G_X + n\varepsilon I_n)^{-1}$, can be computed before the test phase, as it does not involve the test data. Algorithm 2 also has complexity O(n^3). In section 5.2, we will explain how this cost can be reduced to O(ℓn^2) by generating only ℓ < n samples by resampling.

4.3.5 Speeding-Up Methods. In appendix C, we describe two methods for reducing the computational costs of KMCF, both of which only need to be applied prior to the test phase. The first is a low-rank approximation of the kernel matrices G_X, G_Y, which reduces the complexity to O(nr^2), where r is the rank of the low-rank matrices. Low-rank approximation works well in practice, since eigenvalues of a kernel matrix often decay very rapidly. Indeed, this has been theoretically shown for some cases (see Widom, 1963, 1964, and discussions in Bach & Jordan, 2002). The second is a data-reduction method based on kernel herding, which efficiently selects joint subsamples from the training set $(X_i, Y_i)_{i=1}^n$. Algorithm 3 is then applied based only on those subsamples. The resulting complexity is thus O(r^3), where r is the


number of subsamples. This method is motivated by the fast convergence rate of kernel herding (Chen et al., 2010).

Both methods require the number r to be chosen, which is either the rank for low-rank approximation or the number of subsamples in data reduction. This determines the trade-off between accuracy and computational time. In practice, there are two ways of selecting the number r. By regarding r as a hyperparameter of KMCF, we can select it by cross-validation, or we can choose r by comparing the resulting approximation error, which is measured in a matrix norm for low-rank approximation and in an RKHS norm for the subsampling method. (For details, see appendix C.)

4.3.6 Transfer Learning Setting. We assumed that the observation model in the test phase is the same as for the training samples. However, this might not hold in some situations. For example, in the vision-based localization problem, the illumination conditions for the test and training phases might be different (e.g., the test is done at night, while the training samples are collected in the morning). Without taking into account such a significant change in the observation model, KMCF would not perform well in practice.

This problem could be addressed by exploiting the framework of transfer learning (Pan & Yang, 2010). This framework aims at situations where the probability distribution that generates test data is different from that of training samples. The main assumption is that there exist a small number of examples from the test distribution. Transfer learning then provides a way of combining such test examples and abundant training samples, thereby improving the test performance. The application of transfer learning in our setting remains a topic for future research.

4.4 Estimation of Posterior Statistics. By algorithm 3, we obtain the estimates of the kernel means of posteriors, equation 4.1, as
$$\hat{m}_{x_t|y_{1:t}} = \sum_{i=1}^n w_{t,i} k_X(\cdot, X_i) \quad (t = 1, \ldots, T). \qquad (4.8)$$
These contain the information on the posteriors $p(x_t|y_{1:t})$ (see sections 3.2 and 3.4). We now show how to estimate statistics of the posteriors using these estimates. For ease of presentation, we consider the case X = R^d. Theoretical arguments to justify these operations are provided by Kanagawa and Fukumizu (2014).

4.4.1 Mean and Covariance. Consider the posterior mean $\int x_t p(x_t|y_{1:t}) dx_t \in R^d$ and the posterior (uncentered) covariance $\int x_t x_t^T p(x_t|y_{1:t}) dx_t \in R^{d \times d}$.


These quantities can be estimated as
$$\sum_{i=1}^n w_{t,i} X_i \ \text{(mean)}, \qquad \sum_{i=1}^n w_{t,i} X_i X_i^T \ \text{(covariance)}.$$

4.4.2 Probability Mass. Let A ⊂ X be a measurable set with smooth boundary. Define the indicator function $I_A(x)$ by $I_A(x) = 1$ for x ∈ A and $I_A(x) = 0$ otherwise. Consider the probability mass $\int I_A(x_t) p(x_t|y_{1:t}) dx_t$. This can be estimated as $\sum_{i=1}^n w_{t,i} I_A(X_i)$.

4.4.3 Density. Suppose $p(x_t|y_{1:t})$ has a density function. Let J(x) be a smoothing kernel satisfying $\int J(x) dx = 1$ and J(x) ≥ 0. Let h > 0 and define $J_h(x) = \frac{1}{h^d} J(\frac{x}{h})$. Then the density of $p(x_t|y_{1:t})$ can be estimated as
$$\hat{p}(x_t|y_{1:t}) = \sum_{i=1}^n w_{t,i} J_h(x_t - X_i), \qquad (4.9)$$
with an appropriate choice of h.

4.4.4 Mode. The mode may be obtained by finding a point that maximizes equation 4.9. However, this requires a careful choice of h. Instead, we may use $X_{i_{\max}}$ with $i_{\max} = \arg\max_i w_{t,i}$ as a mode estimate. This is the point in X_1, ..., X_n that is associated with the maximum weight in $w_{t,1}, \ldots, w_{t,n}$. This point can be interpreted as the point that maximizes equation 4.9 in the limit of h → 0.

4.4.5 Other Methods. Other ways of using equation 4.8 include the preimage computation and fitting of gaussian mixtures (see, e.g., Song et al., 2009; Fukumizu et al., 2013; McCalman, O'Callaghan, & Ramos, 2013).
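The estimators in sections 4.4.1 to 4.4.4 are simple weighted sums; a short Python sketch (ours, not the letter's) is given below, with a gaussian smoothing kernel assumed for the density estimate of equation 4.9.

```python
import numpy as np

def posterior_statistics(weights, X, A_indicator=None, h=0.1, x_query=None):
    """Decode posterior statistics from the estimate 4.8 given by (weights, X).

    weights : (n,) normalized KBR weights w_t (may contain negative entries)
    X       : (n, d) training states X_1, ..., X_n
    """
    n, d = X.shape
    stats = {
        "mean": weights @ X,                                  # section 4.4.1, posterior mean
        "covariance": (weights[:, None] * X).T @ X,           # uncentered covariance
        "mode": X[int(np.argmax(weights))],                   # section 4.4.4
    }
    if A_indicator is not None:                               # section 4.4.2, mass of a set A
        stats["prob_mass"] = weights @ np.array([A_indicator(x) for x in X])
    if x_query is not None:                                   # section 4.4.3, equation 4.9
        J_h = np.exp(-np.sum((x_query - X) ** 2, axis=1) / (2 * h ** 2)) \
              / ((2 * np.pi) ** (d / 2) * h ** d)             # gaussian smoothing kernel J_h
        stats["density"] = weights @ J_h
    return stats
```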

5 Theoretical Analysis

In this section, we analyze the sampling procedure of the prediction step in section 4.2. Specifically, we derive an upper bound on the error of the estimator 4.4. We also discuss in detail how the resampling step in section 4.2 works as a preprocessing step of the prediction step.

To make our analysis clear, we slightly generalize the setting of the prediction step and discuss the sampling and resampling procedures in this setting.

5.1 Error Bound for the Prediction Step. Let X be a measurable space and P be a probability distribution on X. Let p(·|x) be a conditional


distribution on X conditioned on x ∈ X. Let Q be a marginal distribution on X defined by $Q(B) = \int p(B|x) dP(x)$ for all measurable B ⊂ X. In the filtering setting of section 4, the space X corresponds to the state space, and the distributions P, p(·|x), and Q correspond to the posterior $p(x_{t-1}|y_{1:t-1})$ at time t − 1, the transition model $p(x_t|x_{t-1})$, and the prior $p(x_t|y_{1:t-1})$ at time t, respectively.

Let k_X be a positive-definite kernel on X and H_X be the RKHS associated with k_X. Let $m_P = \int k_X(\cdot, x) dP(x)$ and $m_Q = \int k_X(\cdot, x) dQ(x)$ be the kernel means of P and Q, respectively. Suppose that we are given an empirical estimate of m_P as
$$\hat{m}_P = \sum_{i=1}^n w_i k_X(\cdot, X_i), \qquad (5.1)$$
where w_1, ..., w_n ∈ R and X_1, ..., X_n ∈ X. Considering this weighted sample form enables us to explain the mechanism of the resampling step.

The prediction step can then be cast as the following procedure: for each sample X_i, we generate a new sample $X'_i$ with the conditional distribution $X'_i \sim p(\cdot|X_i)$. Then we estimate m_Q by
$$\hat{m}_Q = \sum_{i=1}^n w_i k_X(\cdot, X'_i), \qquad (5.2)$$
which corresponds to the estimate 4.4 of the prior kernel mean at time t.

The following theorem provides an upper bound on the error of equation 5.2 and reveals properties of equation 5.1 that affect the error of the estimator, equation 5.2. The proof is given in appendix A.

Theorem 1. Let $\hat{m}_P$ be a fixed estimate of m_P given by equation 5.1. Define a function θ on X × X by $\theta(x_1, x_2) = \int\int k_X(x'_1, x'_2)\, dp(x'_1|x_1)\, dp(x'_2|x_2),\ \forall x_1, x_2 \in X \times X$, and assume that θ is included in the tensor RKHS $H_X \otimes H_X$.^4 The

The function θ can be written as the inner product between the kernel means ofthe conditional distributions θ (x1 x2) = 〈mp(middot|x1 )

mp(middot|x2 )〉HX

where mp(middot|x)= int

kX (middot xprime)dp(xprime|x) Therefore the assumption may be further seen as requiring that the mapx rarr mp(middot|x)

be smooth Note that while similar assumptions are common in the litera-ture on kernel mean embeddings (eg theorem 5 of Fukumizu et al 2013) we may relaxthis assumption by using approximate arguments in learning theory (eg theorems 22and 23 of Eberts amp Steinwart 2013) This analysis remains a topic for future research


estimator $\hat{m}_Q$, equation 5.2, then satisfies
$$E_{X'_1, \ldots, X'_n}\big[\|\hat{m}_Q - m_Q\|^2_{H_X}\big] \le \sum_{i=1}^n w_i^2 \Big( E_{X'_i}[k_X(X'_i, X'_i)] - E_{X'_i, \tilde{X}'_i}[k_X(X'_i, \tilde{X}'_i)] \Big) \qquad (5.3)$$
$$\qquad\qquad + \|\hat{m}_P - m_P\|^2_{H_X}\, \|\theta\|_{H_X \otimes H_X}, \qquad (5.4)$$
where $X'_i \sim p(\cdot|X_i)$ and $\tilde{X}'_i$ is an independent copy of $X'_i$.

From theorem 1, we can make the following observations. First, the second term, equation 5.4, of the upper bound shows that the error of the estimator, equation 5.2, is likely to be large if the given estimate, equation 5.1, has large error $\|\hat{m}_P - m_P\|^2_{H_X}$, which is reasonable to expect.

Second, the first term, equation 5.3, shows that the error of equation 5.2 can be large if the distribution of $X'_i$ (i.e., $p(\cdot|X_i)$) has large variance. For example, suppose $X'_i = f(X_i) + \varepsilon_i$, where f : X → X is some mapping and ε_i is a random variable with mean 0. Let k_X be the gaussian kernel $k_X(x, x') = \exp(-\|x - x'\|_2^2 / 2\alpha)$ for some α > 0. Then $E_{X'_i}[k_X(X'_i, X'_i)] - E_{X'_i, \tilde{X}'_i}[k_X(X'_i, \tilde{X}'_i)]$ increases from 0 to 1 as the variance of ε_i (i.e., the variance of $X'_i$) increases from 0 to infinity. Therefore, in this case, equation 5.3 is upper-bounded at worst by $\sum_{i=1}^n w_i^2$. Note that $E_{X'_i}[k_X(X'_i, X'_i)] - E_{X'_i, \tilde{X}'_i}[k_X(X'_i, \tilde{X}'_i)]$ is always nonnegative.^5

5.1.1 Effective Sample Size. Now let us assume that the kernel k_X is bounded: there is a constant C > 0 such that $\sup_{x \in X} k_X(x, x) < C$. Then the inequality of theorem 1 can be further bounded as
$$E_{X'_1, \ldots, X'_n}\big[\|\hat{m}_Q - m_Q\|^2_{H_X}\big] \le 2C \sum_{i=1}^n w_i^2 + \|\hat{m}_P - m_P\|^2_{H_X}\, \|\theta\|_{H_X \otimes H_X}. \qquad (5.5)$$

This bound shows that two quantities are important in the estimate, equation 5.1: (1) the sum of squared weights $\sum_{i=1}^n w_i^2$ and (2) the error $\|\hat{m}_P - m_P\|^2_{H_X}$. In other words, the error of equation 5.2 can be large if the quantity $\sum_{i=1}^n w_i^2$ is large, regardless of the accuracy of equation 5.1 as an estimator of m_P. In fact, the estimator of the form 5.1 can have large $\sum_{i=1}^n w_i^2$ even when $\|\hat{m}_P - m_P\|^2_{H_X}$ is small, as shown in section 6.1.

^5 To show this, it is sufficient to prove that $\int\int k_X(x, \tilde{x}) dP(x) dP(\tilde{x}) \le \int k_X(x, x) dP(x)$ for any probability P. This can be shown as follows: $\int\int k_X(x, \tilde{x}) dP(x) dP(\tilde{x}) = \int\int \langle k_X(\cdot, x), k_X(\cdot, \tilde{x}) \rangle_{H_X} dP(x) dP(\tilde{x}) \le \int\int \sqrt{k_X(x, x)} \sqrt{k_X(\tilde{x}, \tilde{x})} dP(x) dP(\tilde{x}) \le \int k_X(x, x) dP(x)$. Here we used the reproducing property, the Cauchy-Schwartz inequality, and Jensen's inequality.


Figure 3: An illustration of the sampling procedure with (right) and without (left) the resampling algorithm. The left panel corresponds to the kernel mean estimators, equations 5.1 and 5.2, in section 5.1, and the right panel corresponds to equations 5.6 and 5.7 in section 5.2.

The inverse of the sum of the squared weights, $1 / \sum_{i=1}^n w_i^2$, can be interpreted as the effective sample size (ESS) of the empirical kernel mean, equation 5.1. To explain this, suppose that the weights are normalized: $\sum_{i=1}^n w_i = 1$. Then ESS takes its maximum n when the weights are uniform, $w_1 = \cdots = w_n = 1/n$. It becomes small when only a few samples have large weights (see the left side in Figure 3). Therefore, the bound, equation 5.5, can be interpreted as follows: to make equation 5.2 a good estimator of m_Q, we need to have equation 5.1 such that the ESS is large and the error $\|\hat{m}_P - m_P\|_{H_X}$ is small. Here we borrowed the notion of ESS from the literature on particle methods, in which ESS has also played an important role (see section 2.5.3 of Liu, 2001, and section 3.5 of Doucet & Johansen, 2011).

5.2 Role of Resampling. Based on these arguments, we explain how the resampling step in section 4.2 works as a preprocessing step for the sampling procedure. Consider $\hat{m}_P$ in equation 5.1 as an estimate, equation 4.6, given by the correction step at time t − 1. Then we can think of $\hat{m}_Q$, equation 5.2, as an estimator of the kernel mean, equation 4.5, of the prior without the resampling step.

The resampling step is application of kernel herding to $\hat{m}_P$ to obtain samples $\bar{X}_1, \ldots, \bar{X}_n$, which provide a new estimate of m_P with uniform weights:
$$\bar{m}_P = \frac{1}{n} \sum_{i=1}^n k_X(\cdot, \bar{X}_i). \qquad (5.6)$$
The subsequent prediction step is to generate a sample $X'_i \sim p(\cdot|\bar{X}_i)$ for each $\bar{X}_i$ (i = 1, ..., n) and estimate m_Q as


$$\bar{m}_Q = \frac{1}{n} \sum_{i=1}^n k_X(\cdot, X'_i). \qquad (5.7)$$
Theorem 1 gives the following bound for this estimator that corresponds to equation 5.5:
$$E_{X'_1, \ldots, X'_n}\big[\|\bar{m}_Q - m_Q\|^2_{H_X}\big] \le \frac{2C}{n} + \|\bar{m}_P - m_P\|^2_{H_X}\, \|\theta\|_{H_X \otimes H_X}. \qquad (5.8)$$

A comparison of the upper bounds of equations 5.5 and 5.8 implies that the resampling step is beneficial when $\sum_{i=1}^n w_i^2$ is large (i.e., the ESS is small) and $\|\bar{m}_P - \hat{m}_P\|_{H_X}$ is small. The condition on $\|\bar{m}_P - \hat{m}_P\|_{H_X}$ means that the loss by kernel herding (in terms of the RKHS distance) is small. This implies $\|\bar{m}_P - m_P\|_{H_X} \approx \|\hat{m}_P - m_P\|_{H_X}$, so the second term of equation 5.8 is close to that of equation 5.5. On the other hand, the first term of equation 5.8 will be much smaller than that of equation 5.5 if $\sum_{i=1}^n w_i^2 \gg 1/n$. In other words, the resampling step improves the accuracy of the sampling procedure by increasing the ESS of the kernel mean estimate $\hat{m}_P$. This is illustrated in Figure 3.

The above observations lead to the following procedures.

5.2.1 When to Apply Resampling. If $\sum_{i=1}^n w_i^2$ is not large, the gain by the resampling step will be small. Therefore, the resampling algorithm should be applied when $\sum_{i=1}^n w_i^2$ is above a certain threshold, say 2/n. The same strategy has been commonly used in particle methods (see Doucet & Johansen, 2011).
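As an illustration (ours, not the letter's), this decision rule amounts to monitoring the effective sample size and triggering the herding resampling step only when the sum of squared weights exceeds a threshold such as 2/n; the sketch reuses the hypothetical `herding_resample` from section 4.2.3.

```python
import numpy as np

def maybe_resample(weights, G_X, threshold_factor=2.0):
    """Adaptive resampling: resample only when sum_i w_i^2 > threshold_factor / n.

    Returns (indices, new_weights): herding indices and uniform weights if resampling
    was triggered, otherwise the identity indices and the original weights.
    """
    n = len(weights)
    ess = 1.0 / np.sum(weights ** 2)          # effective sample size of the estimate 5.1
    if ess < n / threshold_factor:            # equivalently, sum_i w_i^2 > threshold_factor / n
        idx = herding_resample(weights, G_X, n)
        return idx, np.full(n, 1.0 / n)
    return np.arange(n), weights
```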

Also, the bound, equation 5.3, of theorem 1 shows that resampling is not beneficial if the variance of the conditional distribution p(·|x) is very small (i.e., if state transition is nearly deterministic). In this case, the error of the sampling procedure may increase due to the loss $\|\bar{m}_P - \hat{m}_P\|_{H_X}$ caused by kernel herding.

5.2.2 Reduction of Computational Cost. Algorithm 2 generates n samples $\bar{X}_1, \ldots, \bar{X}_n$ with time complexity O(n^3). Suppose that the first ℓ samples $\bar{X}_1, \ldots, \bar{X}_\ell$, where ℓ < n, already approximate $\hat{m}_P$ well: $\|\frac{1}{\ell}\sum_{i=1}^\ell k_X(\cdot, \bar{X}_i) - \hat{m}_P\|_{H_X}$ is small. We do not then need to generate the rest of the samples $\bar{X}_{\ell+1}, \ldots, \bar{X}_n$: we can make n samples by copying the ℓ samples n/ℓ times (suppose n can be divided by ℓ for simplicity, say n = 2ℓ). Let $\bar{X}_1, \ldots, \bar{X}_n$ denote these n samples. Then $\frac{1}{\ell}\sum_{i=1}^\ell k_X(\cdot, \bar{X}_i) = \frac{1}{n}\sum_{i=1}^n k_X(\cdot, \bar{X}_i)$ by definition, so $\|\frac{1}{n}\sum_{i=1}^n k_X(\cdot, \bar{X}_i) - \hat{m}_P\|_{H_X}$ is also small. This reduces the time complexity of algorithm 2 to O(ℓn^2).

One might think that it is unnecessary to copy the ℓ samples n/ℓ times to make n


samples. This is not true, however. Suppose that we just use the first ℓ samples to define $\bar{m}_P = \frac{1}{\ell}\sum_{i=1}^\ell k_X(\cdot, \bar{X}_i)$. Then the first term of equation 5.8 becomes 2C/ℓ, which is larger than 2C/n of n samples. This difference involves sampling with the conditional distribution $X'_i \sim p(\cdot|\bar{X}_i)$: if we use just the ℓ samples, sampling is done ℓ times; if we use the copied n samples, sampling is done n times. Thus, the benefit of making n samples comes from sampling with the conditional distribution many times. This matches the bound of theorem 1, where the first term involves the variance of the conditional distribution.
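A small sketch (ours) of the copy trick described above: herd ℓ samples, tile them n/ℓ times, and only then propagate each copy independently through the transition model. It reuses the hypothetical `herding_resample` from section 4.2.3.

```python
import numpy as np

def resample_with_copies(weights, G_X, X_train, n_total, num_herding, transition_sampler, rng):
    """Herd `num_herding` samples, copy them up to `n_total`, then propagate each copy.

    Copying before sampling matters: each copy receives its own draw from p(.|x),
    which is what shrinks the first term of the bound 5.8 from 2C/l to 2C/n.
    """
    idx = herding_resample(weights, G_X, num_herding)
    reps = n_total // num_herding            # assume num_herding divides n_total
    tiled = np.tile(X_train[idx], (reps, 1)) # n_total = reps * num_herding states
    return np.array([transition_sampler(x, rng) for x in tiled])
```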

5.3 Convergence Rates for Resampling. Our resampling algorithm (see algorithm 2) is an approximate version of kernel herding in section 3.5: algorithm 2 searches for the solutions of the update equations 3.6 and 3.7 from a finite set {X_1, ..., X_n} ⊂ X, not from the entire space X. Therefore, existing theoretical guarantees for kernel herding (Chen et al., 2010; Bach et al., 2012) do not apply to algorithm 2. Here we provide a theoretical justification.

5.3.1 Generalized Version. We consider a slightly generalized version, shown in algorithm 4. It takes as input (1) a kernel mean estimator $\hat{m}_P$ of a kernel mean m_P, (2) candidate samples Z_1, ..., Z_N, and (3) the number ℓ of resampling. It then outputs resampling samples $\bar{X}_1, \ldots, \bar{X}_\ell \in \{Z_1, \ldots, Z_N\}$, which form a new estimator $\bar{m}_P = \frac{1}{\ell}\sum_{i=1}^\ell k_X(\cdot, \bar{X}_i)$. Here N is the number of the candidate samples.

Algorithm 4 searches for solutions of the update equations 3.6 and 3.7 from the candidate set {Z_1, ..., Z_N}. Note that here these samples Z_1, ..., Z_N can be different from those expressing the estimator $\hat{m}_P$. If they are the same (the estimator is expressed as $\hat{m}_P = \sum_{i=1}^n w_{t,i} k(\cdot, X_i)$ with n = N and X_i = Z_i for i = 1, ..., n), then algorithm 4 reduces to algorithm 2. In fact, theorem 2 allows $\hat{m}_P$ to be any element in the RKHS.

5.3.2 Convergence Rates in Terms of N and ℓ. Algorithm 4 gives the new estimator $\bar{m}_P$ of the kernel mean m_P. The error of this new estimator, $\|\bar{m}_P - m_P\|_{H_X}$, should be close to that of the given estimator, $\|\hat{m}_P - m_P\|_{H_X}$.


Theorem 2 guarantees this. In particular, it provides convergence rates of $\|\bar{m}_P - m_P\|_{H_X}$ approaching $\|\hat{m}_P - m_P\|_{H_X}$ as N and ℓ go to infinity. This theorem follows from theorem 3 in appendix B, which holds under weaker assumptions.

Theorem 2. Let m_P be the kernel mean of a distribution P and $\hat{m}_P$ be any element in the RKHS H_X. Let Z_1, ..., Z_N be an i.i.d. sample from a distribution with density q. Assume that P has a density function p such that $\sup_{x \in X} p(x)/q(x) < \infty$. Let $\bar{X}_1, \ldots, \bar{X}_\ell$ be samples given by algorithm 4 applied to $\hat{m}_P$ with candidate samples Z_1, ..., Z_N. Then, for $\bar{m}_P = \frac{1}{\ell}\sum_{i=1}^\ell k(\cdot, \bar{X}_i)$, we have
$$\|\bar{m}_P - m_P\|^2_{H_X} = \left( \|\hat{m}_P - m_P\|_{H_X} + O_p(N^{-1/2}) \right)^2 + O\left( \frac{\ln \ell}{\ell} \right) \quad (N, \ell \to \infty). \qquad (5.9)$$

Our proof in appendix B relies on the fact that kernel herding can be seen as the Frank-Wolfe optimization method (Bach et al., 2012). Indeed, the error $O(\ln \ell / \ell)$ in equation 5.9 comes from the optimization error of the Frank-Wolfe method after ℓ iterations (Freund & Grigas, 2014, bound 3.2). The error $O_p(N^{-1/2})$ is due to the approximation of the solution space by a finite set {Z_1, ..., Z_N}. These errors will be small if N and ℓ are large enough and the error of the given estimator, $\|\hat{m}_P - m_P\|_{H_X}$, is relatively large. This is formally stated in corollary 1 below.

Theorem 2 assumes that the candidate samples are i.i.d. with a density q. The assumption $\sup_{x \in X} p(x)/q(x) < \infty$ requires that the support of q contain that of p. This is a formal characterization of the explanation in section 4.2 that the samples X_1, ..., X_N should cover the support of P sufficiently. Note that the statement of theorem 2 also holds for non-i.i.d. candidate samples, as shown in theorem 3 of appendix B.

5.3.3 Convergence Rates as $\hat{m}_P$ Goes to m_P. Theorem 2 provides convergence rates when the estimator $\hat{m}_P$ is fixed. In corollary 1 below, we let $\hat{m}_P$ approach m_P and provide convergence rates for $\bar{m}_P$ of algorithm 4 approaching m_P. This corollary directly follows from theorem 2, since the constant terms in $O_p(N^{-1/2})$ and $O(\ln \ell / \ell)$ in equation 5.9 do not depend on $\hat{m}_P$, which can be seen from the proof in section B.

Corollary 1. Assume that P and Z_1, ..., Z_N satisfy the conditions in theorem 2 for all N. Let $\hat{m}_P^{(n)}$ be an estimator of m_P such that $\|\hat{m}_P^{(n)} - m_P\|_{H_X} = O_p(n^{-b})$ as n → ∞ for some constant b > 0.^6 Let $N = \ell = n^{2b}$. Let $\bar{X}_1^{(n)}, \ldots, \bar{X}_\ell^{(n)}$ be samples

^6 Here the estimator $\hat{m}_P^{(n)}$ and the candidate samples Z_1, ..., Z_N can be dependent.


given by algorithm 4 applied to $\hat{m}_P^{(n)}$ with candidate samples Z_1, ..., Z_N. Then, for $\bar{m}_P^{(n)} = \frac{1}{\ell}\sum_{i=1}^\ell k_X(\cdot, \bar{X}_i^{(n)})$, we have
$$\|\bar{m}_P^{(n)} - m_P\|_{H_X} = O_p(n^{-b}) \quad (n \to \infty). \qquad (5.10)$$

Corollary 1 assumes that the estimator $\hat{m}_P^{(n)}$ converges to m_P at a rate $O_p(n^{-b})$ for some constant b > 0. Then the resulting estimator $\bar{m}_P^{(n)}$ by algorithm 4 also converges to m_P at the same rate $O_p(n^{-b})$ if we set $N = \ell = n^{2b}$. This implies that if we use sufficiently large N and ℓ, the errors $O_p(N^{-1/2})$ and $O(\ln \ell/\ell)$ in equation 5.9 can be negligible, as stated earlier. Note that $N = \ell = n^{2b}$ implies that N and ℓ can be smaller than n, since typically we have b ≤ 1/2 (b = 1/2 corresponds to the convergence rates of parametric models). This provides support for the discussion in section 5.2 (reduction of computational cost).

5.3.4 Convergence Rates of Sampling after Resampling. We can derive convergence rates of the estimator $\bar{m}_Q$, equation 5.7, in section 5.2. Here we consider the following construction of $\bar{m}_Q$, as discussed in section 5.2 (reduction of computational cost). First, apply algorithm 4 to $\hat{m}_P^{(n)}$ and obtain resampling samples $\bar{X}_1^{(n)}, \ldots, \bar{X}_\ell^{(n)} \in \{Z_1, \ldots, Z_N\}$. Then copy these samples n/ℓ times, and let $\bar{X}_1^{(n)}, \ldots, \bar{X}_n^{(n)}$ be the resulting n samples. Finally, sample with the conditional distribution $X_i'^{(n)} \sim p(\cdot|\bar{X}_i^{(n)})$ (i = 1, ..., n) and define
$$\bar{m}_Q^{(n)} = \frac{1}{n} \sum_{i=1}^n k_X(\cdot, X_i'^{(n)}). \qquad (5.11)$$

The following corollary is a consequence of corollary 1, theorem 1, and the bound, equation 5.8. Note that theorem 1 obtains convergence in expectation, which implies convergence in probability.

Corollary 2. Let θ be the function defined in theorem 1, and assume $\theta \in H_X \otimes H_X$. Assume that P and Z_1, ..., Z_N satisfy the conditions in theorem 2 for all N. Let $\hat{m}_P^{(n)}$ be an estimator of m_P such that $\|\hat{m}_P^{(n)} - m_P\|_{H_X} = O_p(n^{-b})$ as n → ∞ for some constant b > 0. Let $N = \ell = n^{2b}$. Then, for the estimator $\bar{m}_Q^{(n)}$ defined as equation 5.11, we have
$$\|\bar{m}_Q^{(n)} - m_Q\|_{H_X} = O_p(n^{-\min(b, 1/2)}) \quad (n \to \infty).$$

Suppose b ≤ 1/2, which holds with basically any nonparametric estimators. Then corollary 2 shows that the estimator $\bar{m}_Q^{(n)}$ achieves the same convergence rate as the input estimator $\hat{m}_P^{(n)}$. Note that without resampling,


the rate becomes $O_p\big(\sqrt{\sum_{i=1}^n (w_i^{(n)})^2} + n^{-b}\big)$, where the weights are given by the input estimator $\hat{m}_P^{(n)} = \sum_{i=1}^n w_i^{(n)} k_X(\cdot, X_i^{(n)})$ (see the bound, equation 5.5). Thanks to resampling, (the square root of) the sum of the squared weights in the case of corollary 2 becomes $1/\sqrt{n} \le 1/\sqrt{\ell}$, which is usually smaller than $\sqrt{\sum_{i=1}^n (w_i^{(n)})^2}$ and is faster than or equal to $O_p(n^{-b})$. This shows the merit of resampling in terms of convergence rates (see also the discussions in section 5.2).

5.4 Consistency of the Overall Procedure. Here we show the consistency of the overall procedure in KMCF. This is based on corollary 2, which shows the consistency of the resampling step followed by the prediction step, and on theorem 5 of Fukumizu et al. (2013), which guarantees the consistency of kernel Bayes' rule in the correction step. Thus, we consider three steps in the following order: resampling, prediction, and correction. More specifically, we show the consistency of the estimator, equation 4.6, of the posterior kernel mean at time t, given that the one at time t − 1 is consistent.

To state our assumptions, we will need the following functions: $\theta_{\rm pos} : Y \times Y \to R$, $\theta_{\rm obs} : X \times X \to R$, and $\theta_{\rm tra} : X \times X \to R$:
$$\theta_{\rm pos}(y, \tilde{y}) = \int\int k_X(x_t, \tilde{x}_t)\, dp(x_t|y_{1:t-1}, y_t = y)\, dp(\tilde{x}_t|y_{1:t-1}, y_t = \tilde{y}), \qquad (5.12)$$
$$\theta_{\rm obs}(x, \tilde{x}) = \int\int k_Y(y_t, \tilde{y}_t)\, dp(y_t|x_t = x)\, dp(\tilde{y}_t|x_t = \tilde{x}), \qquad (5.13)$$
$$\theta_{\rm tra}(x, \tilde{x}) = \int\int k_X(x_t, \tilde{x}_t)\, dp(x_t|x_{t-1} = x)\, dp(\tilde{x}_t|x_{t-1} = \tilde{x}). \qquad (5.14)$$

These functions contain the information concerning the distributions involved. In equation 5.12, the distribution $p(x_t|y_{1:t-1}, y_t = y)$ denotes the posterior of the state at time t, given that the observation at time t is y_t = y. Similarly, $p(\tilde{x}_t|y_{1:t-1}, y_t = \tilde{y})$ is the posterior at time t given that the observation is $y_t = \tilde{y}$. In equation 5.13, the distributions $p(y_t|x_t = x)$ and $p(\tilde{y}_t|x_t = \tilde{x})$ denote the observation model when the state is x_t = x or $x_t = \tilde{x}$, respectively. In equation 5.14, the distributions $p(x_t|x_{t-1} = x)$ and $p(\tilde{x}_t|x_{t-1} = \tilde{x})$ denote the transition model with the previous state given by $x_{t-1} = x$ or $x_{t-1} = \tilde{x}$, respectively.

For simplicity of presentation, we consider here N = ℓ = n for the resampling step. Below, denote by F ⊗ G the tensor product space of two RKHSs F and G.

Corollary 3. Let (X_1, Y_1), ..., (X_n, Y_n) be an i.i.d. sample with a joint density p(x, y) = p(y|x)q(x), where p(y|x) is the observation model. Assume that the


posterior $p(x_t|y_{1:t})$ has a density p, and that $\sup_{x \in X} p(x)/q(x) < \infty$. Assume that the functions defined by equations 5.12 to 5.14 satisfy $\theta_{\rm pos} \in H_Y \otimes H_Y$, $\theta_{\rm obs} \in H_X \otimes H_X$, and $\theta_{\rm tra} \in H_X \otimes H_X$, respectively. Suppose that $\|\hat{m}_{x_{t-1}|y_{1:t-1}} - m_{x_{t-1}|y_{1:t-1}}\|_{H_X} \to 0$ as n → ∞ in probability. Then, for any sufficiently slow decay of regularization constants ε_n and δ_n of algorithm 1, we have
$$\|\hat{m}_{x_t|y_{1:t}} - m_{x_t|y_{1:t}}\|_{H_X} \to 0 \quad (n \to \infty)$$
in probability.

Corollary 3 follows from theorem 5 of Fukumizu et al. (2013) and corollary 2. The assumptions $\theta_{\rm pos} \in H_Y \otimes H_Y$ and $\theta_{\rm obs} \in H_X \otimes H_X$ are due to theorem 5 of Fukumizu et al. (2013) for the correction step, while the assumption $\theta_{\rm tra} \in H_X \otimes H_X$ is due to theorem 1 for the prediction step, from which corollary 2 follows. As we discussed in note 4 of section 5.1, these essentially assume that the functions $\theta_{\rm pos}$, $\theta_{\rm obs}$, and $\theta_{\rm tra}$ are smooth. Theorem 5 of Fukumizu et al. (2013) also requires that the regularization constants ε_n, δ_n of kernel Bayes' rule should decay sufficiently slowly as the sample size goes to infinity (ε_n, δ_n → 0 as n → ∞). (For details, see sections 5.2 and 6.2 in Fukumizu et al., 2013.)

It would be more interesting to investigate the convergence rates of the overall procedure. However, this requires a refined theoretical analysis of kernel Bayes' rule, which is beyond the scope of this letter. This is because currently there is no theoretical result on convergence rates of kernel Bayes' rule as an estimator of a posterior kernel mean (existing convergence results are for the expectation of function values; see theorems 6 and 7 in Fukumizu et al., 2013). This remains a topic for future research.

6 Experiments

This section is devoted to experiments. In section 6.1, we conduct basic experiments on the prediction and resampling steps before going on to the filtering problem. Here we consider the problem described in section 5. In section 6.2, the proposed KMCF (see algorithm 3) is applied to synthetic state-space models. Comparisons are made with existing methods applicable to the setting of the letter (see also section 2). In section 6.3, we apply KMCF to the real problem of vision-based robot localization.

In the following, N(μ, σ^2) denotes the gaussian distribution with mean μ ∈ R and variance σ^2 > 0.

6.1 Sampling and Resampling Procedures. The purpose here is to see how the prediction and resampling steps work empirically. To this end, we consider the problem described in section 5 with X = R (see section 5.1 for details). Specifications of the problem are described below.


We will need to evaluate the errors $\|\hat{m}_P - m_P\|_{H_X}$ and $\|\hat{m}_Q - m_Q\|_{H_X}$, so we need to know the true kernel means m_P and m_Q. To this end, we define the distributions and the kernel to be gaussian; this allows us to obtain analytic expressions for m_P and m_Q.

6.1.1 Distributions and Kernel. More specifically, we define the marginal P and the conditional distribution p(·|x) to be gaussian: $P = N(0, \sigma_P^2)$ and $p(\cdot|x) = N(x, \sigma_{\rm cond}^2)$. Then the resulting $Q = \int p(\cdot|x) dP(x)$ also becomes gaussian: $Q = N(0, \sigma_P^2 + \sigma_{\rm cond}^2)$. We define k_X to be the gaussian kernel $k_X(x, x') = \exp(-(x - x')^2 / 2\gamma^2)$. We set $\sigma_P = \sigma_{\rm cond} = \gamma = 0.1$.

6.1.2 Kernel Means. Due to the convolution theorem of gaussian functions, the kernel means $m_P = \int k_X(\cdot, x) dP(x)$ and $m_Q = \int k_X(\cdot, x) dQ(x)$ can be analytically computed:
$$m_P(x) = \sqrt{\frac{\gamma^2}{\sigma_P^2 + \gamma^2}} \exp\left( -\frac{x^2}{2(\gamma^2 + \sigma_P^2)} \right), \qquad m_Q(x) = \sqrt{\frac{\gamma^2}{\sigma_P^2 + \sigma_{\rm cond}^2 + \gamma^2}} \exp\left( -\frac{x^2}{2(\sigma_P^2 + \sigma_{\rm cond}^2 + \gamma^2)} \right).$$
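For concreteness (ours, not from the letter), these closed forms can be checked numerically against the empirical kernel mean of a large i.i.d. sample; the sample size and evaluation point below are arbitrary choices.

```python
import numpy as np

sigma_P, sigma_cond, gamma = 0.1, 0.1, 0.1

def m_P(x):
    # Analytic kernel mean of P = N(0, sigma_P^2) under the gaussian kernel with bandwidth gamma.
    return np.sqrt(gamma**2 / (sigma_P**2 + gamma**2)) * np.exp(-x**2 / (2 * (gamma**2 + sigma_P**2)))

def m_Q(x):
    # Analytic kernel mean of Q = N(0, sigma_P^2 + sigma_cond^2).
    s2 = sigma_P**2 + sigma_cond**2
    return np.sqrt(gamma**2 / (s2 + gamma**2)) * np.exp(-x**2 / (2 * (s2 + gamma**2)))

rng = np.random.default_rng(0)
X = rng.normal(0.0, sigma_P, size=100000)
x0 = 0.05
empirical = np.mean(np.exp(-(x0 - X) ** 2 / (2 * gamma**2)))   # (1/n) sum_i k(x0, X_i)
print(empirical, m_P(x0))                                      # the two values should be close
```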

6.1.3 Empirical Estimates. We artificially defined an estimate $\hat{m}_P = \sum_{i=1}^n w_i k_X(\cdot, X_i)$ as follows. First, we generated n = 100 samples X_1, ..., X_100 from a uniform distribution on [−A, A] with some A > 0 (specified below). We computed the weights w_1, ..., w_n by solving an optimization problem,
$$\min_{w \in R^n} \left\| \sum_{i=1}^n w_i k_X(\cdot, X_i) - m_P \right\|^2_{H_X} + \lambda \|w\|^2,$$
and then applied normalization so that $\sum_{i=1}^n w_i = 1$. Here, λ > 0 is a regularization constant, which allows us to control the trade-off between the error $\|\hat{m}_P - m_P\|^2_{H_X}$ and the quantity $\sum_{i=1}^n w_i^2 = \|w\|^2$. If λ is very small, the resulting $\hat{m}_P$ becomes accurate ($\|\hat{m}_P - m_P\|^2_{H_X}$ is small) but has large $\sum_{i=1}^n w_i^2$. If λ is large, the error $\|\hat{m}_P - m_P\|^2_{H_X}$ may not be very small, but $\sum_{i=1}^n w_i^2$ becomes small. This enables us to see how the error $\|\hat{m}_Q - m_Q\|^2_{H_X}$ changes as we vary these quantities.
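This regularized problem has a closed-form solution: setting the gradient of the quadratic objective to zero gives $w = (G_X + \lambda I)^{-1}(m_P(X_1), \ldots, m_P(X_n))^T$. A short Python sketch (ours) is given below; the function name is illustrative.

```python
import numpy as np

def fit_weights(gram, m_P_vals, lam):
    """Solve min_w ||sum_i w_i k(., X_i) - m_P||_H^2 + lam * ||w||^2 in closed form.

    gram     : (n, n) kernel matrix (k_X(X_i, X_j))
    m_P_vals : (n,) vector (m_P(X_1), ..., m_P(X_n)), available analytically in section 6.1.2
    Returns weights normalized to sum to 1.
    """
    n = gram.shape[0]
    w = np.linalg.solve(gram + lam * np.eye(n), m_P_vals)
    return w / np.sum(w)
```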

6.1.4 Comparison. Given $\hat{m}_P = \sum_{i=1}^n w_i k_X(\cdot, X_i)$, we wish to estimate the kernel mean m_Q. We compare three estimators:

• woRes: Estimate m_Q without resampling. Generate samples $X'_i \sim p(\cdot|X_i)$ to produce the estimate $\hat{m}_Q = \sum_{i=1}^n w_i k_X(\cdot, X'_i)$. This corresponds to the estimator discussed in section 5.1.
• Res-KH: First apply the resampling algorithm of algorithm 2 to $\hat{m}_P$, yielding $\bar{X}_1, \ldots, \bar{X}_n$. Then generate $X'_i \sim p(\cdot|\bar{X}_i)$ for each $\bar{X}_i$, giving


Figure 4: Results of the experiments from section 6.1. (Top left and right) Sample-weight pairs of $\hat{m}_P = \sum_{i=1}^n w_i k_X(\cdot, X_i)$ and $\hat{m}_Q = \sum_{i=1}^n w_i k(\cdot, X'_i)$. (Middle left and right) Histogram of samples $\bar{X}_1, \ldots, \bar{X}_n$ generated by algorithm 2 and that of samples $X'_1, \ldots, X'_n$ from the conditional distribution. (Bottom left and right) Histogram of samples generated with multinomial resampling after truncating negative weights and that of samples from the conditional distribution.

the estimate $\bar{m}_Q = \frac{1}{n}\sum_{i=1}^n k(\cdot, X'_i)$. This is the estimator discussed in section 5.2.
• Res-Trunc: Instead of algorithm 2, first truncate the negative weights in w_1, ..., w_n to be 0, and apply normalization to make the sum of the weights be 1. Then apply the multinomial resampling algorithm of particle methods, and estimate m_Q as in Res-KH.

6.1.5 Demonstration. Before starting quantitative comparisons, we demonstrate how the above estimators work. Figure 4 shows demonstration results with A = 1. First, note that for $\hat{m}_P = \sum_{i=1}^n w_i k(\cdot, X_i)$, samples associated with large weights are located around the mean of P, as the standard deviation of P is relatively small (σ_P = 0.1). Note also that some


of the weights are negative. In this example, the error of $\hat{m}_P$ is very small, $\|\hat{m}_P - m_P\|^2_{H_X} = 8.49 \times 10^{-10}$, while that of the estimate $\hat{m}_Q$ given by woRes is $\|\hat{m}_Q - m_Q\|^2_{H_X} = 0.125$. This shows that even if $\|\hat{m}_P - m_P\|^2_{H_X}$ is very small, the resulting $\|\hat{m}_Q - m_Q\|^2_{H_X}$ may not be small, as implied by theorem 1 and the bound, equation 5.5.

We can observe the following. First, algorithm 2 successfully discarded samples associated with very small weights. Almost all the generated samples $\bar{X}_1, \ldots, \bar{X}_n$ are located in $[-2\sigma_P, 2\sigma_P]$, where σ_P is the standard deviation of P. The error is $\|\bar{m}_P - m_P\|^2_{H_X} = 4.74 \times 10^{-5}$, which is greater than $\|\hat{m}_P - m_P\|^2_{H_X}$. This is due to the additional error caused by the resampling algorithm. Note that the resulting estimate $\bar{m}_Q$ is of the error $\|\bar{m}_Q - m_Q\|^2_{H_X} = 0.00827$. This is much smaller than the estimate $\hat{m}_Q$ by woRes, showing the merit of the resampling algorithm.

Res-Trunc first truncated the negative weights in w_1, ..., w_n. Let us see the region where the density of P is very small (the region outside $[-2\sigma_P, 2\sigma_P]$). We can observe that the absolute values of the weights are very small in this region. Note that there exist positive and negative weights. These weights maintain balance such that the amounts of positive and negative values are almost the same. Therefore, the truncation of the negative weights breaks this balance. As a result, the amount of the positive weights surpasses the amount needed to represent the density of P. This can be seen from the histogram for Res-Trunc: some of the samples $\bar{X}_1, \ldots, \bar{X}_n$ generated by Res-Trunc are located in the region where the density of P is very small. Thus, the resulting error $\|\bar{m}_P - m_P\|^2_{H_X} = 0.0538$ is much larger than that of Res-KH. This demonstrates why the resampling algorithm of particle methods is not appropriate for kernel mean embeddings, as discussed in section 4.2.

6.1.6 Effects of the Sum of Squared Weights. The purpose here is to see how the error $\|\hat{m}_Q - m_Q\|^2_{H_X}$ changes as we vary the quantity $\sum_{i=1}^n w_i^2$ (recall that the bound, equation 5.5, indicates that $\|\hat{m}_Q - m_Q\|^2_{H_X}$ increases as $\sum_{i=1}^n w_i^2$ increases). To this end, we made $\hat{m}_P = \sum_{i=1}^n w_i k_X(\cdot, X_i)$ for several values of the regularization constant λ, as described above. For each λ, we constructed $\hat{m}_P$ and estimated m_Q using each of the three estimators above. We repeated this 20 times for each λ and averaged the values of $\|\hat{m}_P - m_P\|^2_{H_X}$, $\sum_{i=1}^n w_i^2$, and the errors $\|\hat{m}_Q - m_Q\|^2_{H_X}$ by the three estimators. Figure 5 shows these results, where both axes are in the log scale. Here we used A = 5 for the support of the uniform distribution.^7 The results are summarized as follows.

7This enables us to maintain the values for mP minus mP2HX

in almost the same amountwhile changing the values for

sumni=1 w2

i


Figure 5: Results of synthetic experiments for the sampling and resampling procedure in section 6.1. Vertical axis: errors in the squared RKHS norm. Horizontal axis: values of $\sum_{i=1}^n w_i^2$ for different $\hat m_P$. Black: the error of $\hat m_P$ ($\|\hat m_P - m_P\|^2_{\mathcal H_X}$). Blue, green, and red: the errors on $m_Q$ by woRes, Res-KH, and Res-Trunc, respectively.

• The error of woRes (blue) increases proportionally to the amount of $\sum_{i=1}^n w_i^2$. This matches the bound, equation 5.5.

• The error of Res-KH is not affected by $\sum_{i=1}^n w_i^2$. Rather, it changes in parallel with the error of $\hat m_P$. This is explained by the discussion in section 5.2 on how our resampling algorithm improves the accuracy of the sampling procedure.

• Res-Trunc is worse than Res-KH, especially for large $\sum_{i=1}^n w_i^2$. This is also explained by the bound, equation 5.8. Here $\bar m_P$ is the one given by Res-Trunc, so the error $\|\bar m_P - m_P\|_{\mathcal H_X}$ can be large due to the truncation of negative weights, as shown in the demonstration results. This makes the resulting error $\|\hat m_Q - m_Q\|_{\mathcal H_X}$ large.

Note that $m_P$ and $m_Q$ are different kernel means, so it can happen that the errors $\|\hat m_Q - m_Q\|_{\mathcal H_X}$ by Res-KH are less than $\|\hat m_P - m_P\|_{\mathcal H_X}$, as in Figure 5.


6.2 Filtering with Synthetic State-Space Models

Here we apply KMCF to synthetic state-space models. Comparisons were made with the following methods.

kNN-PF (Vlassis et al., 2002). This method uses k-NN-based conditional density estimation (Stone, 1977) for learning the observation model. First, it estimates the conditional density of the inverse direction, p(x|y), from the training sample $\{(X_i, Y_i)\}$. The learned conditional density is then used as an alternative for the likelihood $p(y_t|x_t)$; this is a heuristic to deal with high-dimensional $y_t$. Then it applies a particle filter (PF) based on the approximated observation model and the given transition model $p(x_t|x_{t-1})$.

GP-PF (Ferris et al., 2006). This method learns $p(y_t|x_t)$ from $\{(X_i, Y_i)\}$ with gaussian process (GP) regression. Then a particle filter is applied based on the learned observation model and the transition model. We used the open source code for GP regression in this experiment, so comparison in computational time is omitted for this method.[8]

[8] http://www.gaussianprocess.org/gpml/code/matlab/doc/

KBR filter (Fukumizu et al., 2011, 2013). This method is also based on kernel mean embeddings, as is KMCF. It applies kernel Bayes' rule (KBR) in posterior estimation, using the joint sample $\{(X_i, Y_i)\}$. This method assumes that there also exist training samples for the transition model; thus, in the following experiments, we additionally drew training samples for the transition model. Fukumizu et al. (2011, 2013) showed that this method outperforms extended and unscented Kalman filters when a state-space model has strong nonlinearity (in that experiment, these Kalman filters were given the full knowledge of the state-space model). We use this method as a baseline.

We used the state-space models defined in Table 2, where "SSM" stands for "state-space model." In Table 2, $u_t$ denotes a control input at time t; $v_t$ and $w_t$ denote independent gaussian noise, $v_t, w_t \sim N(0, 1)$; and $W_t$ denotes 10-dimensional gaussian noise, $W_t \sim N(0, I_{10})$. We generated each control $u_t$ randomly from the gaussian distribution $N(0, 1)$.

The state and observation spaces for SSMs 1a, 1b, 2a, 2b, 4a, 4b are defined as $\mathcal X = \mathcal Y = \mathbb R$; for SSMs 3a, 3b, $\mathcal X = \mathbb R$ and $\mathcal Y = \mathbb R^{10}$. The models in SSMs 1a, 2a, 3a, 4a and SSMs 1b, 2b, 3b, 4b with the same number (e.g., 1a and 1b) are almost the same; the difference is whether $u_t$ exists in the transition model. Prior distributions for the initial state $x_1$ for SSMs 1a, 1b, 2a, 2b, 3a, 3b are defined as $p_{\mathrm{init}} = N(0, 1/(1 - 0.9^2))$, and those for 4a, 4b are defined as a uniform distribution on $[-3, 3]$.

Table 2: State-Space Models for Synthetic Experiments.

SSM 1a. Transition: $x_t = 0.9 x_{t-1} + v_t$. Observation: $y_t = x_t + w_t$.
SSM 1b. Transition: $x_t = 0.9 x_{t-1} + \frac{1}{\sqrt 2}(u_t + v_t)$. Observation: $y_t = x_t + w_t$.
SSM 2a. Transition: $x_t = 0.9 x_{t-1} + v_t$. Observation: $y_t = 0.5 \exp(x_t/2)\, w_t$.
SSM 2b. Transition: $x_t = 0.9 x_{t-1} + \frac{1}{\sqrt 2}(u_t + v_t)$. Observation: $y_t = 0.5 \exp(x_t/2)\, w_t$.
SSM 3a. Transition: $x_t = 0.9 x_{t-1} + v_t$. Observation: $y_t = 0.5 \exp(x_t/2)\, W_t$.
SSM 3b. Transition: $x_t = 0.9 x_{t-1} + \frac{1}{\sqrt 2}(u_t + v_t)$. Observation: $y_t = 0.5 \exp(x_t/2)\, W_t$.
SSM 4a. Transition: $a_t = x_{t-1} + \sqrt 2\, v_t$; $x_t = a_t$ if $|a_t| \le 3$, and $x_t = -3$ otherwise. Observation: $b_t = x_t + w_t$; $y_t = b_t$ if $|b_t| \le 3$, and $y_t = b_t - 6 b_t/|b_t|$ otherwise.
SSM 4b. Transition: $a_t = x_{t-1} + u_t + v_t$; $x_t = a_t$ if $|a_t| \le 3$, and $x_t = -3$ otherwise. Observation: $b_t = x_t + w_t$; $y_t = b_t$ if $|b_t| \le 3$, and $y_t = b_t - 6 b_t/|b_t|$ otherwise.

SSMs 1a and 1b are linear gaussian models. SSMs 2a and 2b are the so-called stochastic volatility models. Their transition models are the same as those of SSMs 1a and 1b. The observation model has strong nonlinearity, and the noise $w_t$ is multiplicative. SSMs 3a and 3b are almost the same as SSMs 2a and 2b; the difference is that the observation $y_t$ is 10-dimensional, as $W_t$ is 10-dimensional gaussian noise. SSMs 4a and 4b are more complex than the other models. Both the transition and observation models have strong nonlinearities: states and observations located around the edges of the interval $[-3, 3]$ may abruptly jump to distant places.
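For concreteness, the following is a minimal sketch (our own illustration, not the authors' code) of how training pairs and a test sequence can be generated for one of these models. It assumes SSM 2a with the noise and initial distribution given above; the function and variable names are ours.

```python
import numpy as np

# Sketch of data generation for SSM 2a (stochastic volatility type):
#   x_t = 0.9 x_{t-1} + v_t,   y_t = 0.5 * exp(x_t / 2) * w_t,
# with v_t, w_t ~ N(0, 1) and x_1 ~ N(0, 1 / (1 - 0.9^2)), as in Table 2.
rng = np.random.default_rng(0)

def simulate_ssm2a(T):
    x = np.empty(T)
    y = np.empty(T)
    x[0] = rng.normal(0.0, np.sqrt(1.0 / (1.0 - 0.9 ** 2)))
    y[0] = 0.5 * np.exp(x[0] / 2) * rng.normal()
    for t in range(1, T):
        x[t] = 0.9 * x[t - 1] + rng.normal()
        y[t] = 0.5 * np.exp(x[t] / 2) * rng.normal()
    return x, y

# Training examples {(X_i, Y_i)}_{i=1}^n from one simulated run ...
X_train, Y_train = simulate_ssm2a(1000)
# ... and an independent test sequence of length T = 100 (x_test is hidden
# from the filters and used only to evaluate the RMSE of the point estimates).
x_test, y_test = simulate_ssm2a(100)
```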

For each model, we generated the training samples $\{(X_i, Y_i)\}_{i=1}^n$ by simulating the model. Test data $\{(x_t, y_t)\}_{t=1}^T$ were also generated by independent simulation (recall that $x_t$ is hidden for each method). The length of the test sequence was set as T = 100. We fixed the number of particles in kNN-PF and GP-PF to 5000; in preliminary experiments, we did not observe any improvements even when more particles were used. For the same reason, we fixed the size of the transition examples for the KBR filter to 1000. Each method estimated the ground-truth states $x_1, \dots, x_T$ by estimating the posterior means $\int x_t\, p(x_t|y_{1:t})\, dx_t$ $(t = 1, \dots, T)$. The performance was evaluated with the RMSE (root mean squared error) of the point estimates, defined as $\mathrm{RMSE} = \sqrt{\frac{1}{T} \sum_{t=1}^T (x_t - \hat x_t)^2}$, where $\hat x_t$ is the point estimate.

For KMCF and the KBR filter, we used gaussian kernels for each of $\mathcal X$ and $\mathcal Y$ (and also for controls in the KBR filter). We determined the hyperparameters of each method by two-fold cross-validation, dividing the training data into two sequences. The hyperparameters in the GP regressor for GP-PF were optimized by maximizing the marginal likelihood of the training data. To reduce the costs of the resampling step of KMCF, we used the method discussed in section 5.2 with $\ell = 50$. We also used the low-rank approximation method (see algorithm 5) and the subsampling method (see algorithm 6) in appendix C to reduce the computational costs of KMCF. Specifically, we used r = 10, 20 (rank of the low-rank matrices) for algorithm 5 (described as KMCF-low10 and KMCF-low20 in the results below) and r = 50, 100 (number of subsamples) for algorithm 6 (described as KMCF-sub50 and KMCF-sub100). We repeated the experiments 20 times for each different training sample size n.

Figure 6: RMSE of the synthetic experiments in section 6.2. The state-space models of these figures have no control in their transition models.

Figure 6 shows the results in RMSE for SSMs 1a, 2a, 3a, 4a, and Figure 7 shows those for SSMs 1b, 2b, 3b, 4b. Figure 8 describes the results in computation time for SSMs 1a and 1b; the results for the other models are similar, so we omit them. We do not show the results of KMCF-low10 in Figures 6 and 7, since they were numerically unstable and gave very large RMSEs.

Figure 7: RMSE of the synthetic experiments in section 6.2. The state-space models of these figures include control $u_t$ in their transition models.

Figure 8: Computation time of the synthetic experiments in section 6.2. (Left) SSM 1a. (Right) SSM 1b.

GP-PF performed the best for SSMs 1a and 1b. This may be because these models fit the assumption of GP regression, as their noise is additive gaussian. For the other models, however, GP-PF performed poorly: the observation models of these models have strong nonlinearities, and the noise is not additive gaussian. For these models, KMCF performed the best or competitively with the other methods. This indicates that KMCF successfully exploits the state-observation examples $\{(X_i, Y_i)\}_{i=1}^n$ in dealing with the complicated observation models. Recall that our focus has been on situations where the relations between states and observations are so complicated that the observation model is not known; the results indicate that KMCF is promising for such situations. On the other hand, the KBR filter performed worse than KMCF for most of the models. The KBR filter also uses kernel Bayes' rule, as does KMCF. The difference is that KMCF makes use of the transition models directly by sampling, while the KBR filter must learn the transition models from training data for state transitions. This indicates that the incorporation of the knowledge expressed in the transition model is very important for the filtering performance. This can also be seen by comparing Figures 6 and 7. The performance of the methods other than the KBR filter improved for SSMs 1b, 2b, 3b, 4b compared to the performance for the corresponding models in SSMs 1a, 2a, 3a, 4a. Recall that SSMs 1b, 2b, 3b, 4b include control $u_t$ in their transition models. The information of a control input is helpful for filtering in general. Thus the improvements suggest that KMCF, kNN-PF, and GP-PF successfully incorporate the information of controls; they achieve this by sampling with $p(x_t|x_{t-1}, u_t)$. On the other hand, the KBR filter must learn the transition model $p(x_t|x_{t-1}, u_t)$; this can be harder than learning the transition model $p(x_t|x_{t-1})$, which has no control input.

We next compare computation time (see Figure 8). KMCF was competitive with, or even slower than, the KBR filter. This is due to the resampling step in KMCF. The speed-up methods (KMCF-low10, KMCF-low20, KMCF-sub50, and KMCF-sub100) successfully reduced the costs of KMCF. KMCF-low10 and KMCF-low20 scaled linearly with the sample size n; this matches the fact that algorithm 5 reduces the costs of kernel Bayes' rule to O(nr²). The costs of KMCF-sub50 and KMCF-sub100 remained almost the same over the different sample sizes. This is because they reduce the sample size itself from n to r, so the costs are reduced to O(r³) (see algorithm 6). KMCF-sub50 and KMCF-sub100 are competitive with kNN-PF, which is fast as it only needs kNN searches to deal with the training sample $\{(X_i, Y_i)\}_{i=1}^n$. In Figures 6 and 7, KMCF-low20 and KMCF-sub100 produced results competitive with KMCF for SSMs 1a, 2a, 4a, 1b, 2b, 4b. Thus, for these models, such methods reduce the computational costs of KMCF without losing much accuracy. KMCF-sub50 was slightly worse than KMCF-sub100. This indicates that the number of subsamples cannot be reduced to this extent if we wish to maintain accuracy. For SSMs 3a and 3b, the performance of KMCF-low20 and KMCF-sub100 was worse than that of KMCF, in contrast to the performance for the other models. The difference of SSMs 3a and 3b from the other models is that the observation space is 10-dimensional, $\mathcal Y = \mathbb R^{10}$. This suggests that if the dimension is high, r needs to be large to maintain accuracy (recall that r is the rank of the low-rank matrices in algorithm 5 and the number of subsamples in algorithm 6). This is also implied by the experiments in section 6.3.

6.3 Vision-Based Mobile Robot Localization

We applied KMCF to the problem of vision-based mobile robot localization (Vlassis et al., 2002; Wolf et al., 2005; Quigley et al., 2010). We consider a robot moving in a building. The robot takes images with its vision camera as it moves; thus the vision images form a sequence of observations $y_1, \dots, y_T$ in time series, where each $y_t$ is an image. The robot does not know its positions in the building; we define the state $x_t$ as the robot's position at time t. The robot wishes to estimate its position $x_t$ from the sequence of its vision images $y_1, \dots, y_t$. This can be done by filtering, that is, by estimating the posteriors $p(x_t|y_1, \dots, y_t)$ $(t = 1, \dots, T)$. This is the robot localization problem. It is fundamental in robotics as a basis for more involved applications such as navigation and reinforcement learning (Thrun, Burgard, & Fox, 2005).

The state-space model is defined as follows. The observation model $p(y_t|x_t)$ is the conditional distribution of images given position, which is very complicated and considered unknown. We need to assume position-image examples $\{(X_i, Y_i)\}_{i=1}^n$; these samples are given in the data set described below. The transition model $p(x_t|x_{t-1}) = p(x_t|x_{t-1}, u_t)$ is the conditional distribution of the current position given the previous one. This involves a control input $u_t$ that specifies the movement of the robot. In the data set we use, the control is given as odometry measurements. Thus we define $p(x_t|x_{t-1}, u_t)$ as the odometry motion model, which is fairly standard in robotics (Thrun et al., 2005). Specifically, we used the algorithm described in table 5.6 of Thrun et al. (2005), with all of its parameters fixed to 0.1. The prior $p_{\mathrm{init}}$ of the initial position $x_1$ is defined as a uniform distribution over the samples $X_1, \dots, X_n$ in $\{(X_i, Y_i)\}_{i=1}^n$.

As a kernel $k_Y$ for observations (images), we used the spatial pyramid matching kernel of Lazebnik et al. (2006). This is a positive-definite kernel developed in the computer vision community and is also fairly standard. Specifically, we set the parameters of this kernel as suggested in Lazebnik et al. (2006); this gives a 4200-dimensional histogram for each image. We defined the kernel $k_X$ for states (positions) as gaussian. Here the state space is the four-dimensional space $\mathcal X = \mathbb R^4$: two dimensions for location and the rest for the orientation of the robot.[9]

[9] We projected the robot's orientation in $[0, 2\pi]$ onto the unit circle in $\mathbb R^2$.
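The following short sketch (ours, not the authors' code) illustrates the state kernel just described: a Gaussian kernel on $\mathbb R^4$, with the orientation mapped to the unit circle as in footnote 9. The function names and the single shared bandwidth `sigma` are assumptions for illustration; in the experiments the bandwidth is a hyperparameter chosen by cross-validation.

```python
import numpy as np

# Pose (px, py, theta) -> state (px, py, cos(theta), sin(theta)), i.e., the
# orientation in [0, 2*pi] is projected onto the unit circle (footnote 9).
def pose_to_state(px, py, theta):
    return np.array([px, py, np.cos(theta), np.sin(theta)])

def k_X(s1, s2, sigma=1.0):          # Gaussian kernel on the 4-D state space
    return np.exp(-np.sum((s1 - s2) ** 2) / (2 * sigma ** 2))

s_a = pose_to_state(1.0, 2.0, 0.1)
s_b = pose_to_state(1.2, 2.1, 2 * np.pi - 0.1)   # close in angle despite wrap-around
print(k_X(s_a, s_b))
```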

The data set we used is the COLD database (Pronobis & Caputo, 2009), which is publicly available. Specifically, we used the data set Freiburg, Part A, Path 1, cloudy. This data set consists of three similar trajectories of a robot moving in a building, each of which provides position-image pairs $\{(x_t, y_t)\}_{t=1}^T$. We used two trajectories for training and validation and the rest for testing. We made state-observation examples $\{(X_i, Y_i)\}_{i=1}^n$ by randomly subsampling the pairs in the trajectory for training. Note that the difficulty of localization may depend on the time interval (i.e., the interval between t and t − 1 in seconds). Therefore we made three test sets (and training samples for state transitions in the KBR filter) with different time intervals: 2.27 sec (T = 168), 4.54 sec (T = 84), and 6.81 sec (T = 56).

In these experiments, we compared KMCF with three methods: kNN-PF, the KBR filter, and the naive method (NAI) defined below. For the KBR filter, we also defined a gaussian kernel on the control $u_t$, that is, on the difference of the odometry measurements at times t − 1 and t. The naive method (NAI) estimates the state $x_t$ as the point $X_j$ in the training set $\{(X_i, Y_i)\}$ such that the corresponding observation $Y_j$ is closest to the observation $y_t$. We performed this as a baseline. We also used the spatial pyramid matching kernel for these methods (for kNN-PF and NAI, as a similarity measure for the nearest neighbor search). We did not compare with GP-PF, since it assumes that observations are real vectors and thus cannot be applied to this problem straightforwardly. We determined the hyperparameters in each method by cross-validation. To reduce the cost of the resampling step in KMCF, we used the method discussed in section 5.2 with $\ell = 100$. The low-rank approximation method (see algorithm 5) and the subsampling method (see algorithm 6) were also applied to reduce the computational costs of KMCF. Specifically, we set r = 50, 100 for algorithm 5 (described as KMCF-low50 and KMCF-low100 in the results below) and r = 150, 300 for algorithm 6 (KMCF-sub150 and KMCF-sub300).

Note that in this problem the posteriors $p(x_t|y_{1:t})$ can be highly multimodal. This is because similar images appear at distant locations. Therefore the posterior mean $\int x_t\, p(x_t|y_{1:t})\, dx_t$ is not appropriate for point estimation of the ground-truth position $x_t$. Thus, for KMCF and the KBR filter, we employed the heuristic for mode estimation explained in section 4.4. For kNN-PF, we used the particle with maximum weight for the point estimation. We evaluated the performance of each method by the RMSE of the location estimates. We ran each experiment 20 times for each training set of different size.

6.3.1 Results. First, we demonstrate the behaviors of KMCF on this localization problem. Figures 9 and 10 show iterations of KMCF with n = 400 applied to the test data with time interval 6.81 sec. Figure 9 illustrates iterations that produced accurate estimates, while Figure 10 describes situations where location estimation is difficult.

Figure 9: Demonstration results. Each column corresponds to one iteration of KMCF. (Top) Prediction step: histogram of samples for the prior. (Middle) Correction step: weighted samples for the posterior. The blue and red stems indicate positive and negative weights, respectively. The yellow ball represents the ground-truth location $x_t$ and the green diamond the estimated one $\hat x_t$. (Bottom) Resampling step: histogram of samples given by the resampling step.

Figure 10: Demonstration results (see also the caption of Figure 9). Here we show time points where observed images are similar to those in distant places. Such a situation often occurs at corners and makes location estimation difficult. (a) The prior estimate is reasonable, but the resulting posterior has modes in distant places. This makes the location estimate (green diamond) far from the true location (yellow ball). (b) While the location estimate is very accurate, modes also appear at distant locations.

Figures 11 and 12 show the results in RMSE and computation time, respectively. For all the results, KMCF and its variants with the computational reduction methods (KMCF-low50, KMCF-low100, KMCF-sub150, and KMCF-sub300) performed better than the KBR filter. These results show the benefit of directly manipulating the transition models with sampling. KMCF was competitive with kNN-PF for the interval 2.27 sec; note that kNN-PF was originally proposed for the robot localization problem. For the results with the longer time intervals (4.54 sec and 6.81 sec), KMCF outperformed kNN-PF.

Figure 11: RMSE of the robot localization experiments in section 6.3. Panels a, b, and c show the cases for time intervals 2.27 sec, 4.54 sec, and 6.81 sec, respectively.

Figure 12: Computation time of the localization experiments in section 6.3. Panels a, b, and c show the cases for time intervals 2.27 sec, 4.54 sec, and 6.81 sec, respectively. Note that the results show the runtime of each method.

We next investigate the effect on KMCF of the methods to reduce computational cost. The performance of KMCF-low100 and KMCF-sub300 is competitive with KMCF; those of KMCF-low50 and KMCF-sub150 degrade as the sample size increases. Note that r = 50, 100 for algorithm 5 are larger than those in section 6.2, though the values of the sample size n are larger than those in section 6.2. Also note that the performance of KMCF-sub150 is much worse than that of KMCF-sub300. These results indicate that we may need large values of r to maintain accuracy for this localization problem. Recall that the spatial pyramid matching kernel gives essentially a high-dimensional feature vector (histogram) for each observation. Thus the observation space $\mathcal Y$ may be considered high-dimensional. This supports the hypothesis in section 6.2 that if the dimension is high, the computational cost-reduction methods may require larger r to maintain accuracy.

Finally, let us look at the results in computation time (see Figure 12). The results are similar to those in section 6.2. Although the values of r are relatively large, algorithms 5 and 6 successfully reduced the computational costs of KMCF.

7 Conclusion and Future Work

This letter has proposed the kernel Monte Carlo filter, a novel filtering method for state-space models. We have considered the situation where the observation model is not known explicitly or even parametrically, and where examples of the state-observation relation are given instead of the observation model. Our approach is based on the framework of kernel mean embeddings, which enables us to deal with the observation model in a data-driven manner. The resulting filtering method consists of the prediction, correction, and resampling steps, all realized in terms of kernel mean embeddings. Methodological novelties lie in the prediction and resampling steps. Thus we analyzed their behaviors by deriving error bounds for the estimator of the prediction step. The analysis revealed that the effective sample size of a weighted sample plays an important role, as in particle methods. This analysis also explained how our resampling algorithm works. We applied the proposed method to synthetic and real problems, confirming the effectiveness of our approach.

One interesting topic for future research would be parameter estimation for the transition model. We did not discuss this and assumed that parameters, if they exist, are given and fixed. If the state-observation examples $\{(X_i, Y_i)\}_{i=1}^n$ are given as a sequence from the state-space model, then we can use the state samples $X_1, \dots, X_n$ for estimating those parameters. Otherwise, we need to estimate the parameters based on test data. This might be possible by exploiting approaches for parameter estimation in particle methods (e.g., section IV in Cappe et al., 2007).

Another important topic is the situation where the observation model in the test and training phases is different. As discussed in section 4.3, this might be addressed by exploiting the framework of transfer learning (Pan & Yang, 2010). This would require an extension of kernel mean embeddings to the setting of transfer learning, since there has been no work in this direction. We consider that such an extension is interesting in its own right.

Appendix A: Proof of Theorem 1

Before going to the proof, we review some basic facts that are needed. Let $m_P = \int k_X(\cdot, x)\, dP(x)$ and $\hat m_P = \sum_{i=1}^n w_i k_X(\cdot, X_i)$. By the reproducing property of the kernel $k_X$, the following hold for any $f \in \mathcal H_X$:

\[ \langle m_P, f \rangle_{\mathcal H_X} = \left\langle \int k_X(\cdot, x)\, dP(x), f \right\rangle_{\mathcal H_X} = \int \langle k_X(\cdot, x), f \rangle_{\mathcal H_X}\, dP(x) = \int f(x)\, dP(x) = E_{X \sim P}[f(X)], \quad (A.1) \]

\[ \langle \hat m_P, f \rangle_{\mathcal H_X} = \left\langle \sum_{i=1}^n w_i k_X(\cdot, X_i), f \right\rangle_{\mathcal H_X} = \sum_{i=1}^n w_i f(X_i). \quad (A.2) \]

For any $f, g \in \mathcal H_X$, we denote by $f \otimes g \in \mathcal H_X \otimes \mathcal H_X$ the tensor product of f and g, defined as

\[ f \otimes g\, (x_1, x_2) = f(x_1) g(x_2), \quad \forall x_1, x_2 \in \mathcal X. \quad (A.3) \]

The inner product of the tensor RKHS $\mathcal H_X \otimes \mathcal H_X$ satisfies

\[ \langle f_1 \otimes g_1, f_2 \otimes g_2 \rangle_{\mathcal H_X \otimes \mathcal H_X} = \langle f_1, f_2 \rangle_{\mathcal H_X} \langle g_1, g_2 \rangle_{\mathcal H_X}, \quad \forall f_1, f_2, g_1, g_2 \in \mathcal H_X. \quad (A.4) \]

Let $\{\phi_i\}_{i=1}^I \subset \mathcal H_X$ be a complete orthonormal basis of $\mathcal H_X$, where $I \in \mathbb N \cup \{\infty\}$. Assume $\theta \in \mathcal H_X \otimes \mathcal H_X$ (recall that this is an assumption of theorem 1). Then $\theta$ is expressed as

\[ \theta = \sum_{s,t=1}^I \alpha_{st}\, \phi_s \otimes \phi_t, \quad (A.5) \]

with $\sum_{s,t} |\alpha_{st}|^2 < \infty$ (see Aronszajn, 1950).

Proof of Theorem 1. Recall that $\hat m_Q = \sum_{i=1}^n w_i k_X(\cdot, X'_i)$, where $X'_i \sim p(\cdot|X_i)$ $(i = 1, \dots, n)$. Then

\[
E_{X'_1, \dots, X'_n}[\|\hat m_Q - m_Q\|^2_{\mathcal H_X}]
= E_{X'_1, \dots, X'_n}[\langle \hat m_Q, \hat m_Q \rangle_{\mathcal H_X} - 2 \langle \hat m_Q, m_Q \rangle_{\mathcal H_X} + \langle m_Q, m_Q \rangle_{\mathcal H_X}]
\]
\[
= \sum_{i,j=1}^n w_i w_j E_{X'_i, X'_j}[k_X(X'_i, X'_j)] - 2 \sum_{i=1}^n w_i E_{X' \sim Q,\, X'_i}[k_X(X', X'_i)] + E_{X', \tilde X' \sim Q}[k_X(X', \tilde X')]
\]
\[
= \sum_{i \ne j} w_i w_j E_{X'_i, X'_j}[k_X(X'_i, X'_j)] + \sum_{i=1}^n w_i^2 E_{X'_i}[k_X(X'_i, X'_i)] - 2 \sum_{i=1}^n w_i E_{X' \sim Q,\, X'_i}[k_X(X', X'_i)] + E_{X', \tilde X' \sim Q}[k_X(X', \tilde X')], \quad (A.6)
\]

where $\tilde X'$ denotes an independent copy of $X'$.

Recall that $Q = \int p(\cdot|x)\, dP(x)$ and $\theta(x, \tilde x) = \int\!\!\int k_X(x', \tilde x')\, dp(x'|x)\, dp(\tilde x'|\tilde x)$. We can then rewrite terms in equation A.6 as

\[
E_{X' \sim Q,\, X'_i}[k_X(X', X'_i)] = \int \left( \int\!\!\int k_X(x', x'_i)\, dp(x'|x)\, dp(x'_i|X_i) \right) dP(x) = \int \theta(x, X_i)\, dP(x) = E_{X \sim P}[\theta(X, X_i)],
\]
\[
E_{X', \tilde X' \sim Q}[k_X(X', \tilde X')] = \int\!\!\int \left( \int\!\!\int k_X(x', \tilde x')\, dp(x'|x)\, dp(\tilde x'|\tilde x) \right) dP(x)\, dP(\tilde x) = \int\!\!\int \theta(x, \tilde x)\, dP(x)\, dP(\tilde x) = E_{X, \tilde X \sim P}[\theta(X, \tilde X)].
\]

Thus equation A.6 is equal to

\[
\sum_{i=1}^n w_i^2 \left( E_{X'_i}[k_X(X'_i, X'_i)] - E_{X'_i, \tilde X'_i}[k_X(X'_i, \tilde X'_i)] \right) + \sum_{i,j=1}^n w_i w_j \theta(X_i, X_j) - 2 \sum_{i=1}^n w_i E_{X \sim P}[\theta(X, X_i)] + E_{X, \tilde X \sim P}[\theta(X, \tilde X)]. \quad (A.7)
\]

We can rewrite terms in equation A.7 as follows, using the facts A.1, A.2, A.3, A.4, and A.5:

\[
\sum_{i,j} w_i w_j \theta(X_i, X_j) = \sum_{i,j} w_i w_j \sum_{s,t} \alpha_{st} \phi_s(X_i) \phi_t(X_j)
= \sum_{s,t} \alpha_{st} \sum_i w_i \phi_s(X_i) \sum_j w_j \phi_t(X_j)
= \sum_{s,t} \alpha_{st} \langle \hat m_P, \phi_s \rangle_{\mathcal H_X} \langle \hat m_P, \phi_t \rangle_{\mathcal H_X}
= \sum_{s,t} \alpha_{st} \langle \hat m_P \otimes \hat m_P, \phi_s \otimes \phi_t \rangle_{\mathcal H_X \otimes \mathcal H_X}
= \langle \hat m_P \otimes \hat m_P, \theta \rangle_{\mathcal H_X \otimes \mathcal H_X},
\]
\[
\sum_i w_i E_{X \sim P}[\theta(X, X_i)] = \sum_i w_i E_{X \sim P}\!\left[ \sum_{s,t} \alpha_{st} \phi_s(X) \phi_t(X_i) \right]
= \sum_{s,t} \alpha_{st} E_{X \sim P}[\phi_s(X)] \sum_i w_i \phi_t(X_i)
= \sum_{s,t} \alpha_{st} \langle m_P, \phi_s \rangle_{\mathcal H_X} \langle \hat m_P, \phi_t \rangle_{\mathcal H_X}
= \sum_{s,t} \alpha_{st} \langle m_P \otimes \hat m_P, \phi_s \otimes \phi_t \rangle_{\mathcal H_X \otimes \mathcal H_X}
= \langle m_P \otimes \hat m_P, \theta \rangle_{\mathcal H_X \otimes \mathcal H_X},
\]
\[
E_{X, \tilde X \sim P}[\theta(X, \tilde X)] = E_{X, \tilde X \sim P}\!\left[ \sum_{s,t} \alpha_{st} \phi_s(X) \phi_t(\tilde X) \right]
= \sum_{s,t} \alpha_{st} \langle m_P, \phi_s \rangle_{\mathcal H_X} \langle m_P, \phi_t \rangle_{\mathcal H_X}
= \sum_{s,t} \alpha_{st} \langle m_P \otimes m_P, \phi_s \otimes \phi_t \rangle_{\mathcal H_X \otimes \mathcal H_X}
= \langle m_P \otimes m_P, \theta \rangle_{\mathcal H_X \otimes \mathcal H_X}.
\]

Thus equation A.7 is equal to

\[
\sum_{i=1}^n w_i^2 \left( E_{X'_i}[k_X(X'_i, X'_i)] - E_{X'_i, \tilde X'_i}[k_X(X'_i, \tilde X'_i)] \right) + \langle \hat m_P \otimes \hat m_P, \theta \rangle_{\mathcal H_X \otimes \mathcal H_X} - 2 \langle m_P \otimes \hat m_P, \theta \rangle_{\mathcal H_X \otimes \mathcal H_X} + \langle m_P \otimes m_P, \theta \rangle_{\mathcal H_X \otimes \mathcal H_X}
\]
\[
= \sum_{i=1}^n w_i^2 \left( E_{X'_i}[k_X(X'_i, X'_i)] - E_{X'_i, \tilde X'_i}[k_X(X'_i, \tilde X'_i)] \right) + \langle (\hat m_P - m_P) \otimes (\hat m_P - m_P), \theta \rangle_{\mathcal H_X \otimes \mathcal H_X}.
\]

Finally, the Cauchy-Schwarz inequality gives

\[
\langle (\hat m_P - m_P) \otimes (\hat m_P - m_P), \theta \rangle_{\mathcal H_X \otimes \mathcal H_X} \le \| \hat m_P - m_P \|^2_{\mathcal H_X} \| \theta \|_{\mathcal H_X \otimes \mathcal H_X}.
\]

This completes the proof.

Appendix B: Proof of Theorem 2

Theorem 2 provides convergence rates for the resampling algorithm, algorithm 4. This theorem assumes that the candidate samples $Z_1, \dots, Z_N$ for resampling are i.i.d. with a density q. Here we prove theorem 2 by showing that the same statement holds under weaker assumptions (see theorem 3 below).

We first describe the assumptions. Let P be the distribution of the kernel mean $m_P$, and $L_2(P)$ be the Hilbert space of square-integrable functions on $\mathcal X$ with respect to P. For any $f \in L_2(P)$, we write its norm as $\|f\|_{L_2(P)} = \left( \int f^2(x)\, dP(x) \right)^{1/2}$.

Assumption 1. The candidate samples $Z_1, \dots, Z_N$ are independent. There are probability distributions $Q_1, \dots, Q_N$ on $\mathcal X$ such that for any bounded measurable function $g: \mathcal X \to \mathbb R$, we have

\[ E\left[ \frac{1}{N-1} \sum_{j \ne i} g(Z_j) \right] = E_{X \sim Q_i}[g(X)] \quad (i = 1, \dots, N). \quad (B.1) \]

Assumption 2. The distributions $Q_1, \dots, Q_N$ have density functions $q_1, \dots, q_N$, respectively. Define $Q = \frac{1}{N} \sum_{i=1}^N Q_i$ and $q = \frac{1}{N} \sum_{i=1}^N q_i$. There is a constant A > 0 that does not depend on N such that

\[ \left\| \frac{q_i}{q} - 1 \right\|^2_{L_2(P)} \le \frac{A}{\sqrt N} \quad (i = 1, \dots, N). \quad (B.2) \]

Assumption 3. The distribution P has a density function p such that $\sup_{x \in \mathcal X} \frac{p(x)}{q(x)} < \infty$. There is a constant $\sigma > 0$ such that

\[ \sqrt N \left( \frac{1}{N} \sum_{i=1}^N \frac{p(Z_i)}{q(Z_i)} - 1 \right) \xrightarrow{D} \mathcal N(0, \sigma^2), \quad (B.3) \]

where $\xrightarrow{D}$ denotes convergence in distribution and $\mathcal N(0, \sigma^2)$ the normal distribution with mean 0 and variance $\sigma^2$.

These assumptions are weaker than those in theorem 2, which require that $Z_1, \dots, Z_N$ be i.i.d. For example, assumption 1 is clearly satisfied in the i.i.d. case, since in this case we have $Q = Q_1 = \dots = Q_N$. The inequality B.2 in assumption 2 requires that the distributions $Q_1, \dots, Q_N$ become similar as the sample size increases. This is also satisfied under the i.i.d. assumption. Likewise, the convergence B.3 in assumption 3 is satisfied, by the central limit theorem, if $Z_1, \dots, Z_N$ are i.i.d.

We will need the following lemma.

Lemma 1. Let $Z_1, \dots, Z_N$ be samples satisfying assumption 1. Then the following holds for any bounded measurable function $g: \mathcal X \to \mathbb R$:

\[ E\left[ \frac{1}{N} \sum_{i=1}^N g(Z_i) \right] = \int g(x)\, dQ(x). \]

Proof.

\[
E\left[ \frac{1}{N} \sum_{i=1}^N g(Z_i) \right]
= E\left[ \frac{1}{N(N-1)} \sum_{i=1}^N \sum_{j \ne i} g(Z_j) \right]
= \frac{1}{N} \sum_{i=1}^N E\left[ \frac{1}{N-1} \sum_{j \ne i} g(Z_j) \right]
= \frac{1}{N} \sum_{i=1}^N \int g(x)\, dQ_i(x)
= \int g(x)\, dQ(x).
\]

The following theorem shows the convergence rates of our resampling algorithm. Note that it does not assume that the candidate samples $Z_1, \dots, Z_N$ are identical to those expressing the estimator $\hat m_P$.

Theorem 3. Let k be a bounded positive-definite kernel and $\mathcal H$ be the associated RKHS. Let $Z_1, \dots, Z_N$ be candidate samples satisfying assumptions 1, 2, and 3. Let P be a probability distribution satisfying assumption 3, and let $m_P = \int k(\cdot, x)\, dP(x)$ be its kernel mean. Let $\hat m_P \in \mathcal H$ be any element in $\mathcal H$. Suppose we apply algorithm 4 to $\hat m_P \in \mathcal H$ with candidate samples $Z_1, \dots, Z_N$, and let $\bar X_1, \dots, \bar X_\ell \in \{Z_1, \dots, Z_N\}$ be the resulting samples. Then the following holds:

\[ \left\| \hat m_P - \frac{1}{\ell} \sum_{i=1}^\ell k(\cdot, \bar X_i) \right\|^2_{\mathcal H} = \left( \| \hat m_P - m_P \|_{\mathcal H} + O_p(N^{-1/2}) \right)^2 + O\!\left( \frac{\ln \ell}{\ell} \right). \]

Proof. Our proof is based on the fact (Bach et al., 2012) that kernel herding can be seen as the Frank-Wolfe optimization method with step size $1/(\ell+1)$ for the $\ell$-th iteration. For details of the Frank-Wolfe method, we refer to Jaggi (2013) and Freund and Grigas (2014).

Fix the samples $Z_1, \dots, Z_N$. Let $\mathcal M_N$ be the convex hull of the set $\{k(\cdot, Z_1), \dots, k(\cdot, Z_N)\} \subset \mathcal H$. Define a loss function $J: \mathcal H \to \mathbb R$ by

\[ J(g) = \frac{1}{2} \| g - \hat m_P \|^2_{\mathcal H}, \quad g \in \mathcal H. \quad (B.4) \]

Then algorithm 4 can be seen as the Frank-Wolfe method that iteratively minimizes this loss function over the convex hull $\mathcal M_N$:

\[ \inf_{g \in \mathcal M_N} J(g). \]

More precisely, the Frank-Wolfe method solves this problem by the following iterations:

\[ s_\ell = \arg\min_{g \in \mathcal M_N} \langle g, \nabla J(g_{\ell-1}) \rangle_{\mathcal H}, \qquad g_\ell = (1 - \gamma_\ell) g_{\ell-1} + \gamma_\ell s_\ell \quad (\ell \ge 1), \]

where $\gamma_\ell$ is a step size defined as $\gamma_\ell = 1/\ell$, and $\nabla J(g_{\ell-1})$ is the gradient of J at $g_{\ell-1}$: $\nabla J(g_{\ell-1}) = g_{\ell-1} - \hat m_P$. Here the initial point is defined as $g_0 = 0$. It can easily be shown that $g_\ell = \frac{1}{\ell} \sum_{i=1}^\ell k(\cdot, \bar X_i)$, where $\bar X_1, \dots, \bar X_\ell$ are the samples given by algorithm 4. (For details, see Bach et al., 2012.)

given by algorithm 4 (For details see Bach et al 2012)Let LJMN

gt 0 be the Lipschitz constant of the gradient nablaJ over MN andDiam MN gt 0 be the diameter of MN

LJMN= sup

g1g2isinMN

nablaJ(g1) minus nablaJ(g2)Hg1 minus g2H

= supg1g2isinMN

g1 minus g2Hg1 minus g2H

= 1 (B5)

Diam MN = supg1g2isinMN

g1 minus g2H

le supg1g2isinMN

g1H + g2H le 2C (B6)

where C = supxisinX k(middot x)H = supxisinXradic

k(x x) lt infinFrom bound 32 and equation 8 of Freund and Grigas (2014) we then

have

J(g) minus infgisinMN

J(g) leLJMN

(Diam MN)2(1 + ln )

2(B7)

le 2C2(1 + ln )

(B8)

where the last inequality follows from equations B5 and B6

Filtering with State-Observation Examples 433

Note that the upper bound of equation B8 does not depend on thecandidate samples Z1 ZN Hence combined with equation B4 thefollowing holds for any choice of Z1 ZN

∥∥∥∥∥mP minus 1

sumi=1

k(middot Xi)

∥∥∥∥∥2

H

le infgisinMN

mP minus g2H + 4C2(1 + ln )

(B9)

Below we will focus on bounding the first term of equation B.9. Recall here that $Z_1, \dots, Z_N$ are random samples. Define a random variable $S_N = \sum_{i=1}^N \frac{p(Z_i)}{q(Z_i)}$. Since $\mathcal M_N$ is the convex hull of $\{k(\cdot, Z_1), \dots, k(\cdot, Z_N)\}$, we have

\[
\inf_{g \in \mathcal M_N} \| \hat m_P - g \|_{\mathcal H}
= \inf_{\alpha \in \mathbb R^N,\ \alpha \ge 0,\ \sum_i \alpha_i \le 1} \left\| \hat m_P - \sum_i \alpha_i k(\cdot, Z_i) \right\|_{\mathcal H}
\le \left\| \hat m_P - \frac{1}{S_N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \right\|_{\mathcal H}
\]
\[
\le \| \hat m_P - m_P \|_{\mathcal H} + \left\| m_P - \frac{1}{N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \right\|_{\mathcal H} + \left\| \frac{1}{N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) - \frac{1}{S_N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \right\|_{\mathcal H}.
\]

Therefore we have

\[
\left\| \hat m_P - \frac{1}{\ell} \sum_{i=1}^\ell k(\cdot, \bar X_i) \right\|^2_{\mathcal H}
\le \left( \| \hat m_P - m_P \|_{\mathcal H} + \left\| m_P - \frac{1}{N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \right\|_{\mathcal H} + \left\| \frac{1}{N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) - \frac{1}{S_N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \right\|_{\mathcal H} \right)^2 + O\!\left( \frac{\ln \ell}{\ell} \right). \quad (B.10)
\]

Below we derive rates of convergence for the second and third terms.

For the second term, we derive a rate of convergence in expectation, which implies a rate of convergence in probability. To this end, we use the following fact: let $f \in \mathcal H$ be any function in the RKHS. By the assumption $\sup_{x \in \mathcal X} \frac{p(x)}{q(x)} < \infty$ and the boundedness of k, the functions $x \mapsto \frac{p(x)}{q(x)} f(x)$ and $x \mapsto \left( \frac{p(x)}{q(x)} \right)^2 f(x)$ are bounded. We have

\[
E\left[ \left\| m_P - \frac{1}{N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \right\|^2_{\mathcal H} \right]
= \| m_P \|^2_{\mathcal H} - 2 E\left[ \frac{1}{N} \sum_i \frac{p(Z_i)}{q(Z_i)} m_P(Z_i) \right] + E\left[ \frac{1}{N^2} \sum_i \sum_j \frac{p(Z_i)}{q(Z_i)} \frac{p(Z_j)}{q(Z_j)} k(Z_i, Z_j) \right]
\]
\[
= \| m_P \|^2_{\mathcal H} - 2 \int \frac{p(x)}{q(x)} m_P(x)\, q(x)\, dx + E\left[ \frac{1}{N^2} \sum_i \sum_{j \ne i} \frac{p(Z_i)}{q(Z_i)} \frac{p(Z_j)}{q(Z_j)} k(Z_i, Z_j) \right] + E\left[ \frac{1}{N^2} \sum_i \left( \frac{p(Z_i)}{q(Z_i)} \right)^2 k(Z_i, Z_i) \right]
\]
\[
= \| m_P \|^2_{\mathcal H} - 2 \| m_P \|^2_{\mathcal H} + E\left[ \frac{N-1}{N^2} \sum_i \frac{p(Z_i)}{q(Z_i)} \int \frac{p(x)}{q(x)} k(Z_i, x)\, q_i(x)\, dx \right] + \frac{1}{N} \int \left( \frac{p(x)}{q(x)} \right)^2 k(x, x)\, q(x)\, dx
\]
\[
= -\| m_P \|^2_{\mathcal H} + E\left[ \frac{N-1}{N^2} \sum_i \frac{p(Z_i)}{q(Z_i)} \int \frac{p(x)}{q(x)} k(Z_i, x)\, q_i(x)\, dx \right] + \frac{1}{N} \int \frac{p(x)}{q(x)} k(x, x)\, dP(x).
\]

We further rewrite the second term of the last equality as follows:

\[
E\left[ \frac{N-1}{N^2} \sum_i \frac{p(Z_i)}{q(Z_i)} \int \frac{p(x)}{q(x)} k(Z_i, x)\, q_i(x)\, dx \right]
= E\left[ \frac{N-1}{N^2} \sum_i \frac{p(Z_i)}{q(Z_i)} \int \frac{p(x)}{q(x)} k(Z_i, x)\, (q_i(x) - q(x))\, dx \right] + E\left[ \frac{N-1}{N^2} \sum_i \frac{p(Z_i)}{q(Z_i)} \int \frac{p(x)}{q(x)} k(Z_i, x)\, q(x)\, dx \right]
\]
\[
= E\left[ \frac{N-1}{N^2} \sum_i \frac{p(Z_i)}{q(Z_i)} \int \sqrt{p(x)}\, k(Z_i, x)\, \sqrt{p(x)} \left( \frac{q_i(x)}{q(x)} - 1 \right) dx \right] + \frac{N-1}{N} \| m_P \|^2_{\mathcal H}
\]
\[
\le E\left[ \frac{N-1}{N^2} \sum_i \frac{p(Z_i)}{q(Z_i)} \| k(Z_i, \cdot) \|_{L_2(P)} \left\| \frac{q_i}{q} - 1 \right\|_{L_2(P)} \right] + \frac{N-1}{N} \| m_P \|^2_{\mathcal H}
\]
\[
\le E\left[ \frac{N-1}{N^3} \sum_i \frac{p(Z_i)}{q(Z_i)}\, C^2 A \right] + \frac{N-1}{N} \| m_P \|^2_{\mathcal H}
= \frac{C^2 A (N-1)}{N^2} + \frac{N-1}{N} \| m_P \|^2_{\mathcal H},
\]

where the first inequality follows from Cauchy-Schwarz. Using this, we obtain

\[
E\left[ \left\| m_P - \frac{1}{N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \right\|^2_{\mathcal H} \right]
\le \frac{1}{N} \left( \int \frac{p(x)}{q(x)} k(x, x)\, dP(x) - \| m_P \|^2_{\mathcal H} \right) + \frac{C^2 (N-1) A}{N^2} = O(N^{-1}).
\]

Therefore we have

\[ \left\| m_P - \frac{1}{N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \right\|_{\mathcal H} = O_p(N^{-1/2}) \quad (N \to \infty). \quad (B.11) \]

We can bound the third term as follows:

\[
\left\| \frac{1}{N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) - \frac{1}{S_N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \right\|_{\mathcal H}
= \left\| \frac{1}{N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \left( 1 - \frac{N}{S_N} \right) \right\|_{\mathcal H}
= \left| 1 - \frac{N}{S_N} \right| \left\| \frac{1}{N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \right\|_{\mathcal H}
\]
\[
\le \left| 1 - \frac{N}{S_N} \right| C \left\| \frac{p}{q} \right\|_\infty
= \left| 1 - \frac{1}{\frac{1}{N} \sum_{i=1}^N p(Z_i)/q(Z_i)} \right| C \left\| \frac{p}{q} \right\|_\infty,
\]

where $\| p/q \|_\infty = \sup_{x \in \mathcal X} \frac{p(x)}{q(x)} < \infty$. Therefore the following holds by assumption 3 and the delta method:

\[ \left\| \frac{1}{N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) - \frac{1}{S_N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \right\|_{\mathcal H} = O_p(N^{-1/2}). \quad (B.12) \]

The assertion of the theorem follows from equations B.10 to B.12.

Appendix C: Reduction of Computational Cost

We have seen in section 4.3 that the time complexity of KMCF in one time step is O(n³), where n is the number of state-observation examples $\{(X_i, Y_i)\}_{i=1}^n$. This can be costly if one wishes to use KMCF in real-time applications with a large number of samples. Here we show two methods for reducing the costs: one based on low-rank approximation of kernel matrices and one based on kernel herding. Note that kernel herding is also used in the resampling step. The purpose here is different, however: we make use of kernel herding for finding a reduced representation of the data $\{(X_i, Y_i)\}_{i=1}^n$.

C.1 Low-Rank Approximation of Kernel Matrices. Our goal is to reduce the costs of algorithm 1 of kernel Bayes' rule. Algorithm 1 involves two matrix inversions: $(G_X + n\varepsilon I_n)^{-1}$ in line 3 and $((\Lambda G_Y)^2 + \delta I_n)^{-1}$ in line 4. Note that $(G_X + n\varepsilon I_n)^{-1}$ does not involve the test data, so it can be computed before the test phase. On the other hand, $((\Lambda G_Y)^2 + \delta I_n)^{-1}$ depends on the matrix $\Lambda$. This matrix involves the vector $m_\pi$, which essentially represents the prior of the current state (see line 13 of algorithm 3). Therefore $((\Lambda G_Y)^2 + \delta I_n)^{-1}$ needs to be computed at each iteration in the test phase. This has a complexity of O(n³). Note that even if $(G_X + n\varepsilon I_n)^{-1}$ can be computed in the training phase, the multiplication $(G_X + n\varepsilon I_n)^{-1} m_\pi$ in line 3 requires O(n²); thus it can also be costly. Here we consider methods to reduce both costs in lines 3 and 4.

Suppose that there exist low-rank matrices $U, V \in \mathbb R^{n \times r}$, where r < n, that approximate the kernel matrices: $G_X \approx U U^T$, $G_Y \approx V V^T$. Such low-rank matrices can be obtained by, for example, incomplete Cholesky decomposition, with time complexity O(nr²) (Fine & Scheinberg, 2001; Bach & Jordan, 2002). Note that the computation of these matrices is required only once, before the test phase. Therefore their time complexities are not the problem here.

C.1.1 Derivation. First, we approximate $(G_X + n\varepsilon I_n)^{-1} m_\pi$ in line 3, using $G_X \approx U U^T$. By the Woodbury identity, we have

\[ (G_X + n\varepsilon I_n)^{-1} m_\pi \approx (U U^T + n\varepsilon I_n)^{-1} m_\pi = \frac{1}{n\varepsilon} \left( I_n - U (n\varepsilon I_r + U^T U)^{-1} U^T \right) m_\pi, \]

where $I_r \in \mathbb R^{r \times r}$ denotes the identity. Note that $(n\varepsilon I_r + U^T U)^{-1}$ does not involve the test data, so it can be computed in the training phase. Thus the above approximation of $\mu$ can be computed with complexity O(nr²).

Next, we approximate $w = \Lambda G_Y ((\Lambda G_Y)^2 + \delta I)^{-1} \Lambda k_Y$ in line 4, using $G_Y \approx V V^T$. Define $B = \Lambda V \in \mathbb R^{n \times r}$, $C = V^T \Lambda V \in \mathbb R^{r \times r}$, and $D = V^T \in \mathbb R^{r \times n}$. Then $(\Lambda G_Y)^2 \approx (\Lambda V V^T)^2 = B C D$. By the Woodbury identity, we obtain

\[ (\delta I_n + (\Lambda G_Y)^2)^{-1} \approx (\delta I_n + B C D)^{-1} = \frac{1}{\delta} \left( I_n - B (\delta C^{-1} + D B)^{-1} D \right). \]

Thus w can be approximated as

\[ w = \Lambda G_Y ((\Lambda G_Y)^2 + \delta I)^{-1} \Lambda k_Y \approx \frac{1}{\delta} \Lambda V V^T \left( I_n - B (\delta C^{-1} + D B)^{-1} D \right) \Lambda k_Y. \]

The computation of this approximation requires O(nr² + r³) = O(nr²). Thus, in total, the complexity of algorithm 1 can be reduced to O(nr²). We summarize the above approximations in algorithm 5.
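As a concrete illustration of these Woodbury-based approximations, here is a small sketch in Python. It is our own illustration, not the paper's algorithm 5: the low-rank factors `U`, `V` are assumed to be given (e.g., from an incomplete Cholesky decomposition), and it further assumes, as in kernel Bayes' rule, that $\Lambda$ is the diagonal matrix formed from $\mu$; all function and variable names are ours.

```python
import numpy as np

def lowrank_kbr_weights(U, V, m_pi, k_y, eps, delta):
    """Sketch of the approximations above (assumed notation):
    - mu ~= (G_X + n*eps*I)^{-1} m_pi          with G_X ~= U U^T   (Woodbury),
    - w  ~= Lambda G_Y ((Lambda G_Y)^2 + delta*I)^{-1} Lambda k_y
            with G_Y ~= V V^T and Lambda = diag(mu)                (Woodbury).
    For U, V of shape (n, r), every step below costs at most O(n r^2)."""
    n, r = U.shape

    # Line 3: mu ~= (1/(n*eps)) * (I - U (n*eps*I_r + U^T U)^{-1} U^T) m_pi
    inner = np.linalg.inv(n * eps * np.eye(r) + U.T @ U)      # r x r, precomputable
    mu = (m_pi - U @ (inner @ (U.T @ m_pi))) / (n * eps)

    # Line 4: Woodbury for (delta*I + B C D)^{-1}, B = Lambda V, C = V^T Lambda V, D = V^T
    lam = mu                                                  # diagonal of Lambda
    B = lam[:, None] * V
    C = V.T @ B
    D = V.T
    core = np.linalg.inv(delta * np.linalg.inv(C) + D @ B)    # r x r
    lam_ky = lam * k_y
    tmp = lam_ky - B @ (core @ (D @ lam_ky))                  # (I - B(.)^{-1} D) Lambda k_y
    w = lam * (V @ (V.T @ tmp)) / delta                       # (1/delta) Lambda V V^T (...)
    return mu, w

# Usage sketch with random factors (shapes only; not meaningful data):
rng = np.random.default_rng(0)
n, r = 200, 10
U, V = rng.normal(size=(n, r)), rng.normal(size=(n, r))
mu, w = lowrank_kbr_weights(U, V, rng.normal(size=n), rng.normal(size=n), 0.1, 0.1)
print(w.shape)
```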

C.1.2 How to Use. Algorithm 5 can be used with algorithm 3 of KMCF by modifying algorithm 3 in the following manner. Compute the low-rank matrices U, V right after lines 4 and 5. This can be done by using, for example, incomplete Cholesky decomposition (Fine & Scheinberg, 2001; Bach & Jordan, 2002). Then replace algorithm 1 in line 15 by algorithm 5.

C.1.3 How to Select the Rank. As discussed in section 4.3, one way of selecting the rank r is to use cross-validation, regarding r as a hyperparameter of KMCF. Another way is to measure the approximation errors $\|G_X - U U^T\|$ and $\|G_Y - V V^T\|$ with some matrix norm, such as the Frobenius norm. Indeed, we can compute the smallest rank r such that these errors are below a prespecified threshold, and this can be done efficiently with time complexity O(nr²) (Bach & Jordan, 2002).

C.2 Data Reduction with Kernel Herding. Here we describe an approach to reduce the size of the representation of the state-observation examples $\{(X_i, Y_i)\}_{i=1}^n$ in an efficient way. By "efficient," we mean that the information contained in $\{(X_i, Y_i)\}_{i=1}^n$ will be preserved even after the reduction. Recall that $\{(X_i, Y_i)\}_{i=1}^n$ contains the information of the observation model $p(y_t|x_t)$ (recall also that $p(y_t|x_t)$ is assumed time-homogeneous; see section 4.1). This information is used only in algorithm 1 of kernel Bayes' rule (line 15, algorithm 3). Therefore it suffices to consider how kernel Bayes' rule accesses the information contained in the joint sample $\{(X_i, Y_i)\}_{i=1}^n$.

C.2.1 Representation of the Joint Sample. To this end, we need to show how the joint sample $\{(X_i, Y_i)\}_{i=1}^n$ can be represented with a kernel mean embedding. Recall that $(k_X, \mathcal H_X)$ and $(k_Y, \mathcal H_Y)$ are the kernels and the associated RKHSs on the state space $\mathcal X$ and the observation space $\mathcal Y$, respectively. Let $\mathcal X \times \mathcal Y$ be the product space of $\mathcal X$ and $\mathcal Y$. Then we can define a kernel $k_{\mathcal X \times \mathcal Y}$ on $\mathcal X \times \mathcal Y$ as the product of $k_X$ and $k_Y$:

\[ k_{\mathcal X \times \mathcal Y}((x, y), (x', y')) = k_X(x, x')\, k_Y(y, y') \quad \text{for all } (x, y), (x', y') \in \mathcal X \times \mathcal Y. \]

This product kernel $k_{\mathcal X \times \mathcal Y}$ defines an RKHS of functions on $\mathcal X \times \mathcal Y$; let $\mathcal H_{\mathcal X \times \mathcal Y}$ denote this RKHS. As in section 3, we can use $k_{\mathcal X \times \mathcal Y}$ and $\mathcal H_{\mathcal X \times \mathcal Y}$ for a kernel mean embedding. In particular, the empirical distribution $\frac{1}{n} \sum_{i=1}^n \delta_{(X_i, Y_i)}$ of the joint sample $\{(X_i, Y_i)\}_{i=1}^n \subset \mathcal X \times \mathcal Y$ can be represented as an empirical kernel mean in $\mathcal H_{\mathcal X \times \mathcal Y}$:

\[ \hat m_{XY} = \frac{1}{n} \sum_{i=1}^n k_{\mathcal X \times \mathcal Y}((\cdot, \cdot), (X_i, Y_i)) \in \mathcal H_{\mathcal X \times \mathcal Y}. \quad (C.1) \]

This is the representation of the joint sample $\{(X_i, Y_i)\}_{i=1}^n$.

The information of $\{(X_i, Y_i)\}_{i=1}^n$ is provided to kernel Bayes' rule essentially through the form of equation C.1 (Fukumizu et al., 2011, 2013). Recall that equation C.1 is a point in the RKHS $\mathcal H_{\mathcal X \times \mathcal Y}$. Any point close to equation C.1 in $\mathcal H_{\mathcal X \times \mathcal Y}$ would also contain information close to that contained in equation C.1. Therefore we propose to find a subset $\{(\tilde X_1, \tilde Y_1), \dots, (\tilde X_r, \tilde Y_r)\} \subset \{(X_i, Y_i)\}_{i=1}^n$, where r < n, such that its representation in $\mathcal H_{\mathcal X \times \mathcal Y}$,

\[ \tilde m_{XY} = \frac{1}{r} \sum_{i=1}^r k_{\mathcal X \times \mathcal Y}((\cdot, \cdot), (\tilde X_i, \tilde Y_i)) \in \mathcal H_{\mathcal X \times \mathcal Y}, \quad (C.2) \]

is close to equation C.1. Namely, we wish to find subsamples such that $\|\hat m_{XY} - \tilde m_{XY}\|_{\mathcal H_{\mathcal X \times \mathcal Y}}$ is small. If the error $\|\hat m_{XY} - \tilde m_{XY}\|_{\mathcal H_{\mathcal X \times \mathcal Y}}$ is small enough, equation C.2 would provide information close to that given by equation C.1 for kernel Bayes' rule. Thus kernel Bayes' rule based on such subsamples $\{(\tilde X_i, \tilde Y_i)\}_{i=1}^r$ would not perform much worse than the one based on the entire set of samples $\{(X_i, Y_i)\}_{i=1}^n$.

C.2.2 Subsampling Method. To find such subsamples, we make use of the kernel herding method of section 3.5. Namely, we apply the update equations 3.6 and 3.7 to approximate equation C.1, with kernel $k_{\mathcal X \times \mathcal Y}$ and RKHS $\mathcal H_{\mathcal X \times \mathcal Y}$. We greedily find subsamples $D_r = \{(\tilde X_1, \tilde Y_1), \dots, (\tilde X_r, \tilde Y_r)\}$ as

\[
(\tilde X_r, \tilde Y_r) = \arg\max_{(x, y) \in D \setminus D_{r-1}} \frac{1}{n} \sum_{i=1}^n k_{\mathcal X \times \mathcal Y}((x, y), (X_i, Y_i)) - \frac{1}{r} \sum_{j=1}^{r-1} k_{\mathcal X \times \mathcal Y}((x, y), (\tilde X_j, \tilde Y_j))
\]
\[
= \arg\max_{(x, y) \in D \setminus D_{r-1}} \frac{1}{n} \sum_{i=1}^n k_X(x, X_i)\, k_Y(y, Y_i) - \frac{1}{r} \sum_{j=1}^{r-1} k_X(x, \tilde X_j)\, k_Y(y, \tilde Y_j),
\]

where $D = \{(X_i, Y_i)\}_{i=1}^n$. The resulting algorithm is shown in algorithm 6. The time complexity is O(n²r) for selecting r subsamples.
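The greedy selection above can be written in a few lines. The following is a sketch (ours, not the paper's algorithm 6) that assumes precomputed Gram matrices `KX` and `KY` on the full sample, so that the product-kernel values are their elementwise product; the toy data and names are illustrative.

```python
import numpy as np

def herding_subsample(KX, KY, r):
    """Greedy kernel herding on the product kernel k_X * k_Y over the finite set
    {(X_i, Y_i)}: returns indices of r subsamples (sketch of the scheme above)."""
    KXY = KX * KY                      # KXY[i, j] = k_X(X_i, X_j) * k_Y(Y_i, Y_j)
    target = KXY.mean(axis=1)          # (1/n) sum_i k_XY((x, y), (X_i, Y_i)) at each candidate
    chosen = []
    for m in range(1, r + 1):
        penalty = KXY[:, chosen].sum(axis=1) / m if chosen else 0.0
        scores = target - penalty
        scores[chosen] = -np.inf       # select without replacement, as in D \ D_{m-1}
        chosen.append(int(np.argmax(scores)))
    return chosen

# Usage sketch with Gaussian Gram matrices on toy data:
rng = np.random.default_rng(0)
X, Y = rng.normal(size=100), rng.normal(size=100)
KX = np.exp(-(X[:, None] - X[None, :]) ** 2 / 2)
KY = np.exp(-(Y[:, None] - Y[None, :]) ** 2 / 2)
print(herding_subsample(KX, KY, 10))
```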

C.2.3 How to Use. By using algorithm 6, we can reduce the time complexity of KMCF (see algorithm 3) in each iteration from O(n³) to O(r³). This can be done by obtaining subsamples $\{(\tilde X_i, \tilde Y_i)\}_{i=1}^r$ by applying algorithm 6 to $\{(X_i, Y_i)\}_{i=1}^n$, then replacing $\{(X_i, Y_i)\}_{i=1}^n$ in the requirements of algorithm 3 by $\{(\tilde X_i, \tilde Y_i)\}_{i=1}^r$, and using the number r instead of n.

C.2.4 How to Select the Number of Subsamples. The number r of subsamples determines the trade-off between the accuracy and the computational time of KMCF. It may be selected by cross-validation or by measuring the approximation error $\|\hat m_{XY} - \tilde m_{XY}\|_{\mathcal H_{\mathcal X \times \mathcal Y}}$, as for the case of selecting the rank of the low-rank approximation in section C.1.

C.2.5 Discussion. Recall that kernel herding generates samples such that they approximate a given kernel mean (see section 3.5). Under certain assumptions, the error of this approximation is of O(r⁻¹) with r samples, which is faster than that of i.i.d. samples, O(r⁻¹ᐟ²). This indicates that subsamples $\{(\tilde X_i, \tilde Y_i)\}_{i=1}^r$ selected with kernel herding may approximate equation C.1 well. Here, however, we find the solutions of the optimization problems 3.6 and 3.7 over the finite set $\{(X_i, Y_i)\}_{i=1}^n$ rather than the entire joint space $\mathcal X \times \mathcal Y$. The convergence guarantee is provided only for the case of the entire joint space $\mathcal X \times \mathcal Y$; thus, for our case, the convergence guarantee is no longer provided. Moreover, the fast rate O(r⁻¹) is guaranteed only for finite-dimensional RKHSs. Gaussian kernels, which we often use in practice, define infinite-dimensional RKHSs; therefore the fast rate is not guaranteed if we use gaussian kernels. Nevertheless, we can use algorithm 6 as a heuristic for data reduction.

Acknowledgments

We express our gratitude to the associate editor and the anonymous reviewer for their time and helpful suggestions. We also thank Masashi Shimbo, Momoko Hayamizu, Yoshimasa Uematsu, and Katsuhiro Omae for their helpful comments. This work has been supported in part by MEXT Grant-in-Aid for Scientific Research on Innovative Areas 25120012. M.K. has been supported by JSPS Grant-in-Aid for JSPS Fellows 15J04406.

References

Anderson, B., & Moore, J. (1979). Optimal filtering. Englewood Cliffs, NJ: Prentice Hall.
Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3), 337-404.
Bach, F., & Jordan, M. I. (2002). Kernel independent component analysis. Journal of Machine Learning Research, 3, 1-48.
Bach, F., Lacoste-Julien, S., & Obozinski, G. (2012). On the equivalence between herding and conditional gradient algorithms. In Proceedings of the 29th International Conference on Machine Learning (ICML 2012) (pp. 1359-1366). Madison, WI: Omnipress.
Berlinet, A., & Thomas-Agnan, C. (2004). Reproducing kernel Hilbert spaces in probability and statistics. Dordrecht: Kluwer Academic.
Calvet, L. E., & Czellar, V. (2015). Accurate methods for approximate Bayesian computation filtering. Journal of Financial Econometrics, 13, 798-838. doi:10.1093/jjfinec/nbu019
Cappe, O., Godsill, S. J., & Moulines, E. (2007). An overview of existing methods and recent advances in sequential Monte Carlo. IEEE Proceedings, 95(5), 899-924.
Chen, Y., Welling, M., & Smola, A. (2010). Supersamples from kernel-herding. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence (pp. 109-116). Cambridge, MA: MIT Press.
Deisenroth, M., Huber, M., & Hanebeck, U. (2009). Analytic moment-based gaussian process filtering. In Proceedings of the 26th International Conference on Machine Learning (pp. 225-232). Madison, WI: Omnipress.
Doucet, A., Freitas, N. D., & Gordon, N. J. (Eds.). (2001). Sequential Monte Carlo methods in practice. New York: Springer.
Doucet, A., & Johansen, A. M. (2011). A tutorial on particle filtering and smoothing: Fifteen years later. In D. Crisan & B. Rozovskii (Eds.), The Oxford handbook of nonlinear filtering (pp. 656-704). New York: Oxford University Press.
Durbin, J., & Koopman, S. J. (2012). Time series analysis by state space methods (2nd ed.). New York: Oxford University Press.
Eberts, M., & Steinwart, I. (2013). Optimal regression rates for SVMs using gaussian kernels. Electronic Journal of Statistics, 7, 1-42.
Ferris, B., Hahnel, D., & Fox, D. (2006). Gaussian processes for signal strength-based location estimation. In Proceedings of Robotics: Science and Systems. Cambridge, MA: MIT Press.
Fine, S., & Scheinberg, K. (2001). Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2, 243-264.
Freund, R. M., & Grigas, P. (2014). New analysis and results for the Frank-Wolfe method. Mathematical Programming. doi:10.1007/s10107-014-0841-6
Fukumizu, K., Bach, F., & Jordan, M. I. (2004). Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5, 73-99.
Fukumizu, K., Gretton, A., Sun, X., & Scholkopf, B. (2008). Kernel measures of conditional dependence. In J. C. Platt, D. Koller, Y. Singer, & S. Roweis (Eds.), Advances in neural information processing systems, 20 (pp. 489-496). Cambridge, MA: MIT Press.
Fukumizu, K., Song, L., & Gretton, A. (2011). Kernel Bayes' rule. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. C. N. Pereira, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 24 (pp. 1737-1745). Red Hook, NY: Curran.
Fukumizu, K., Song, L., & Gretton, A. (2013). Kernel Bayes' rule: Bayesian inference with positive definite kernels. Journal of Machine Learning Research, 14, 3753-3783.
Fukumizu, K., Sriperumbudur, B., Gretton, A., & Scholkopf, B. (2009). Characteristic kernels on groups and semigroups. In D. Koller, D. Schuurmans, Y. Bengio, & L. Bottou (Eds.), Advances in neural information processing systems, 21 (pp. 473-480). Cambridge, MA: MIT Press.
Gordon, N. J., Salmond, D. J., & Smith, A. F. M. (1993). Novel approach to nonlinear/non-gaussian Bayesian state estimation. IEE Proceedings-F, 140, 107-113.
Hofmann, T., Scholkopf, B., & Smola, A. J. (2008). Kernel methods in machine learning. Annals of Statistics, 36(3), 1171-1220.
Jaggi, M. (2013). Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning (pp. 427-435).
Jasra, A., Singh, S. S., Martin, J. S., & McCoy, E. (2012). Filtering via approximate Bayesian computation. Statistics and Computing, 22, 1223-1237.
Julier, S. J., & Uhlmann, J. K. (1997). A new extension of the Kalman filter to nonlinear systems. In Proceedings of AeroSense: The 11th International Symposium on Aerospace/Defence Sensing, Simulation and Controls. Bellingham, WA: SPIE.
Julier, S. J., & Uhlmann, J. K. (2004). Unscented filtering and nonlinear estimation. IEEE Review, 92, 401-422.
Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Transactions of the ASME, Journal of Basic Engineering, 82, 35-45.
Kanagawa, M., & Fukumizu, K. (2014). Recovering distributions from gaussian RKHS embeddings. In Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (pp. 457-465). JMLR.
Kanagawa, M., Nishiyama, Y., Gretton, A., & Fukumizu, K. (2014). Monte Carlo filtering using kernel embedding of distributions. In Proceedings of the 28th AAAI Conference on Artificial Intelligence (pp. 1897-1903). Cambridge, MA: MIT Press.
Ko, J., & Fox, D. (2009). GP-BayesFilters: Bayesian filtering using gaussian process prediction and observation models. Autonomous Robots, 72(1), 75-90.
Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Vol. 2, pp. 2169-2178). Washington, DC: IEEE Computer Society.
Liu, J. S. (2001). Monte Carlo strategies in scientific computing. New York: Springer-Verlag.
McCalman, L., O'Callaghan, S., & Ramos, F. (2013). Multi-modal estimation with kernel embeddings for learning motion models. In Proceedings of the 2013 IEEE International Conference on Robotics and Automation (pp. 2845-2852). Piscataway, NJ: IEEE.
Pan, S. J., & Yang, Q. (2010). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10), 1345-1359.
Pistohl, T., Ball, T., Schulze-Bonhage, A., Aertsen, A., & Mehring, C. (2008). Prediction of arm movement trajectories from ECoG-recordings in humans. Journal of Neuroscience Methods, 167(1), 105-114.
Pronobis, A., & Caputo, B. (2009). COLD: COsy localization database. International Journal of Robotics Research, 28(5), 588-594.
Quigley, M., Stavens, D., Coates, A., & Thrun, S. (2010). Sub-meter indoor localization in unmodified environments with inexpensive sensors. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 2010 (Vol. 1, pp. 2039-2046). Piscataway, NJ: IEEE.
Ristic, B., Arulampalam, S., & Gordon, N. (2004). Beyond the Kalman filter: Particle filters for tracking applications. Norwood, MA: Artech House.
Schaid, D. J. (2010a). Genomic similarity and kernel methods I: Advancements by building on mathematical and statistical foundations. Human Heredity, 70(2), 109-131.
Schaid, D. J. (2010b). Genomic similarity and kernel methods II: Methods for genomic information. Human Heredity, 70(2), 132-140.
Schalk, G., Kubanek, J., Miller, K. J., Anderson, N. R., Leuthardt, E. C., Ojemann, J. G., ... Wolpaw, J. R. (2007). Decoding two-dimensional movement trajectories using electrocorticographic signals in humans. Journal of Neural Engineering, 4, 264-275.
Scholkopf, B., & Smola, A. J. (2002). Learning with kernels. Cambridge, MA: MIT Press.
Scholkopf, B., Tsuda, K., & Vert, J. P. (2004). Kernel methods in computational biology. Cambridge, MA: MIT Press.
Silverman, B. W. (1986). Density estimation for statistics and data analysis. London: Chapman and Hall.
Smola, A., Gretton, A., Song, L., & Scholkopf, B. (2007). A Hilbert space embedding for distributions. In Proceedings of the International Conference on Algorithmic Learning Theory (pp. 13-31). New York: Springer.
Song, L., Fukumizu, K., & Gretton, A. (2013). Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine, 30(4), 98-111.
Song, L., Huang, J., Smola, A., & Fukumizu, K. (2009). Hilbert space embeddings of conditional distributions with applications to dynamical systems. In Proceedings of the 26th International Conference on Machine Learning (pp. 961-968). Madison, WI: Omnipress.
Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Scholkopf, B., & Lanckriet, G. R. (2010). Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11, 1517-1561.
Steinwart, I., & Christmann, A. (2008). Support vector machines. New York: Springer.
Stone, C. J. (1977). Consistent nonparametric regression. Annals of Statistics, 5(4), 595-620.
Thrun, S., Burgard, W., & Fox, D. (2005). Probabilistic robotics. Cambridge, MA: MIT Press.
Vlassis, N., Terwijn, B., & Krose, B. (2002). Auxiliary particle filter robot localization from high-dimensional sensor observations. In Proceedings of the International Conference on Robotics and Automation (pp. 7-12). Piscataway, NJ: IEEE.
Wang, Z., Ji, Q., Miller, K. J., & Schalk, G. (2011). Prior knowledge improves decoding of finger flexion from electrocorticographic signals. Frontiers in Neuroscience, 5, 127.
Widom, H. (1963). Asymptotic behavior of the eigenvalues of certain integral equations. Transactions of the American Mathematical Society, 109, 278-295.
Widom, H. (1964). Asymptotic behavior of the eigenvalues of certain integral equations II. Archive for Rational Mechanics and Analysis, 17, 215-229.
Wolf, J., Burgard, W., & Burkhardt, H. (2005). Robust vision-based localization by combining an image retrieval system with Monte Carlo localization. IEEE Transactions on Robotics, 21(2), 208-216.

Received May 18, 2015; accepted October 14, 2015.

Page 7: Filtering with State-Observation Examples via Kernel Monte ...

388 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

model is to be learned nonparametrically from these examples For this set-ting there are methods based on kernel mean embeddings (Song HuangSmola amp Fukumizu 2009 Fukumizu et al 2011 2013) and gaussian pro-cesses (Ko amp Fox 2009 Deisenroth Huber amp Hanebeck 2009) The filteringmethod by Fukumizu et al (2011 2013) is in particular closely related toKMCF as it also uses kernel Bayesrsquo rule A main difference from KMCF isthat it computes forward probabilities by kernel sum rule (Song et al 20092013) which nonparametrically learns the transition model from the statetransition examples While the setting is different from ours we compareKMCF with this method in our experiments as a baseline

Another related setting is that the observation model itself is given andsampling is possible but computation of its values is expensive or evenimpossible Therefore ordinary Bayesrsquo rule cannot be used for filtering Toovercome this limitation Jasra Singh Martin and McCoy (2012) and Calvetand Czellar (2015) proposed applying approximate Bayesian computation(ABC) methods For each iteration of filtering these methods generate state-observation pairs from the observation model Then they pick some pairsthat have close observations to the test observation and regard the statesin these pairs as samples from a posterior Note that these methods arenot applicable to our setting since we do not assume that the observationmodel is provided That said our method may be applied to their setting bygenerating state-observation examples from the observation model Whilesuch a comparison would be interesting this letter focuses on comparisonamong the methods applicable to our setting

3 Kernel Mean Embeddings of Distributions

Here we briefly review the framework of kernel mean embeddings Fordetails we refer to the tutorial papers (Smola et al 2007 Song et al 2013)

31 Positive-Definite Kernel and RKHS We begin by introducingpositive-definite kernels and reproducing kernel Hilbert spaces details ofwhich can be found in Scholkopf and Smola (2002) Berlinet and Thomas-Agnan (2004) and Steinwart and Christmann (2008)

Let X be a set and k X times X rarr R be a positive-definite (pd) kernel1

Any positive-definite kernel is uniquely associated with a reproducing ker-nel Hilbert space (RKHS) (Aronszajn 1950) Let H be the RKHS associatedwith k The RKHS H is a Hilbert space of functions on X that satisfies thefollowing important properties

1A symmetric kernel k X times X rarr R is called positive definite (pd) if for all n isin Nc1 cn isin R and X1 Xn isin X we have

nsumi=1

nsumj=1

cic jk(Xi Xj ) ge 0

Filtering with State-Observation Examples 389

- Feature vector: $k(\cdot, x) \in \mathcal{H}$ for all $x \in \mathcal{X}$.
- Reproducing property: $f(x) = \langle f, k(\cdot, x) \rangle_{\mathcal{H}}$ for all $f \in \mathcal{H}$ and $x \in \mathcal{X}$,

where $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ denotes the inner product equipped with $\mathcal{H}$, and $k(\cdot, x)$ is a function with $x$ fixed. By the reproducing property, we have
$$k(x, x') = \langle k(\cdot, x), k(\cdot, x') \rangle_{\mathcal{H}} \quad \forall x, x' \in \mathcal{X}.$$
Namely, $k(x, x')$ implicitly computes the inner product between the functions $k(\cdot, x)$ and $k(\cdot, x')$. From this property, $k(\cdot, x)$ can be seen as an implicit representation of $x$ in $\mathcal{H}$. Therefore, $k(\cdot, x)$ is called the feature vector of $x$, and $\mathcal{H}$ the feature space. It is also known that the subspace spanned by the feature vectors $\{k(\cdot, x) \mid x \in \mathcal{X}\}$ is dense in $\mathcal{H}$. This means that any function $f$ in $\mathcal{H}$ can be written as the limit of functions of the form $f_n = \sum_{i=1}^{n} c_i k(\cdot, X_i)$, where $c_1, \dots, c_n \in \mathbb{R}$ and $X_1, \dots, X_n \in \mathcal{X}$.

For example, positive-definite kernels on the Euclidean space $\mathcal{X} = \mathbb{R}^d$ include the gaussian kernel $k(x, x') = \exp(-\|x - x'\|_2^2 / 2\sigma^2)$ and the Laplace kernel $k(x, x') = \exp(-\|x - x'\|_1 / \sigma)$, where $\sigma > 0$ and $\|\cdot\|_1$ denotes the $\ell_1$ norm. Notably, kernel methods allow $\mathcal{X}$ to be a set of structured data such as images, texts, or graphs. In fact, there exist various positive-definite kernels developed for such structured data (Hofmann et al., 2008). Note that the notion of positive-definite kernels is different from smoothing kernels in kernel density estimation (Silverman, 1986): a smoothing kernel does not necessarily define an RKHS.
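As a concrete illustration, the following minimal Python sketch (our own addition for this presentation, not part of the original letter) computes Gram matrices for the gaussian and Laplace kernels; the function names and the NumPy-based implementation are our own choices.

```python
import numpy as np

def gaussian_kernel(X, Z, sigma):
    # k(x, x') = exp(-||x - x'||_2^2 / (2 sigma^2)); X: (n, d), Z: (m, d)
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def laplace_kernel(X, Z, sigma):
    # k(x, x') = exp(-||x - x'||_1 / sigma)
    l1_dists = np.abs(X[:, None, :] - Z[None, :, :]).sum(axis=-1)
    return np.exp(-l1_dists / sigma)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 2))
    G = gaussian_kernel(X, X, sigma=1.0)
    # A Gram matrix of a p.d. kernel is positive semidefinite:
    print(np.linalg.eigvalsh(G).min() >= -1e-10)
```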

3.2 Kernel Means. We use the kernel $k$ and the RKHS $\mathcal{H}$ to represent probability distributions on $\mathcal{X}$. This is the framework of kernel mean embeddings (Smola et al., 2007). Let $\mathcal{X}$ be a measurable space and $k$ be measurable and bounded on $\mathcal{X}$.^2 Let $P$ be an arbitrary probability distribution on $\mathcal{X}$. Then the representation of $P$ in $\mathcal{H}$ is defined as the mean of the feature vector:
$$m_P = \int k(\cdot, x)\, dP(x) \in \mathcal{H}, \qquad (3.1)$$
which is called the kernel mean of $P$.

If $k$ is characteristic, the kernel mean, equation 3.1, preserves all the information about $P$: a positive-definite kernel $k$ is defined to be characteristic if the mapping $P \mapsto m_P \in \mathcal{H}$ is one-to-one (Fukumizu, Bach, & Jordan, 2004; Fukumizu, Gretton, Sun, & Schölkopf, 2008; Sriperumbudur et al., 2010). This means that the RKHS is rich enough to distinguish among all distributions. For example, the gaussian and Laplace kernels are characteristic. (For conditions for kernels to be characteristic, see Fukumizu, Sriperumbudur, Gretton, & Schölkopf, 2009, and Sriperumbudur et al., 2010.) We assume henceforth that kernels are characteristic.

^2 $k$ is bounded on $\mathcal{X}$ if $\sup_{x \in \mathcal{X}} k(x, x) < \infty$.

An important property of the kernel mean, equation 3.1, is the following: by the reproducing property, we have
$$\langle m_P, f \rangle_{\mathcal{H}} = \int f(x)\, dP(x) = \mathbb{E}_{X \sim P}[f(X)] \quad \forall f \in \mathcal{H}; \qquad (3.2)$$
that is, the expectation of any function in the RKHS can be given by the inner product between the kernel mean and that function.

3.3 Estimation of Kernel Means. Suppose that distribution $P$ is unknown and that we wish to estimate $P$ from available samples. This can be equivalently done by estimating its kernel mean $m_P$, since $m_P$ preserves all the information about $P$.

For example, let $X_1, \dots, X_n$ be an independent and identically distributed (i.i.d.) sample from $P$. Define an estimator of $m_P$ by the empirical mean
$$\hat{m}_P = \frac{1}{n} \sum_{i=1}^{n} k(\cdot, X_i).$$
Then this converges to $m_P$ at a rate $\|\hat{m}_P - m_P\|_{\mathcal{H}} = O_p(n^{-1/2})$ (Smola et al., 2007), where $O_p$ denotes the asymptotic order in probability and $\|\cdot\|_{\mathcal{H}}$ is the norm of the RKHS, $\|f\|_{\mathcal{H}} = \sqrt{\langle f, f \rangle_{\mathcal{H}}}$ for all $f \in \mathcal{H}$. Note that this rate is independent of the dimensionality of the space $\mathcal{X}$.
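The empirical kernel mean is straightforward to work with in code. The short sketch below (our own illustration; the function and variable names are not from the letter) builds the empirical mean from an i.i.d. sample and uses equation 3.2 to approximate an expectation by an RKHS inner product.

```python
import numpy as np

def gaussian_kernel(X, Z, sigma=0.5):
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2 * sigma ** 2))

rng = np.random.default_rng(1)
X = rng.normal(loc=0.0, scale=1.0, size=(2000, 1))   # i.i.d. sample from P = N(0, 1)

def m_P_hat(x):
    # Empirical kernel mean evaluated at points x: (1/n) sum_i k(x, X_i).
    return gaussian_kernel(x, X).mean(axis=1)

# Estimate E_{X~P}[f(X)] for f(.) = k(., 0), which lies in the RKHS:
f_center = np.zeros((1, 1))
estimate = m_P_hat(f_center)      # equals <m_P_hat, k(., 0)>_H
print(estimate)                    # close to E[exp(-X^2 / (2 * 0.25))] = sqrt(0.2)
```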

Next we explain kernel Bayes' rule (KBR), which serves as a building block of our filtering algorithm. To this end, we introduce two measurable spaces $\mathcal{X}$ and $\mathcal{Y}$. Let $p(x, y)$ be a joint probability on the product space $\mathcal{X} \times \mathcal{Y}$ that decomposes as $p(x, y) = p(y|x)p(x)$. Let $\pi(x)$ be a prior distribution on $\mathcal{X}$. Then the conditional probability $p(y|x)$ and the prior $\pi(x)$ define the posterior distribution by Bayes' rule:
$$p^{\pi}(x|y) \propto p(y|x)\pi(x).$$

The assumption here is that the conditional probability $p(y|x)$ is unknown. Instead, we are given an i.i.d. sample $(X_1, Y_1), \dots, (X_n, Y_n)$ from the joint probability $p(x, y)$. We wish to estimate the posterior $p^{\pi}(x|y)$ using the sample. KBR achieves this by estimating the kernel mean of $p^{\pi}(x|y)$.

KBR requires that kernels be defined on $\mathcal{X}$ and $\mathcal{Y}$. Let $k_{\mathcal{X}}$ and $k_{\mathcal{Y}}$ be kernels on $\mathcal{X}$ and $\mathcal{Y}$, respectively. Define the kernel means of the prior $\pi(x)$ and the posterior $p^{\pi}(x|y)$:
$$m_{\pi} = \int k_{\mathcal{X}}(\cdot, x)\pi(x)\, dx, \qquad m^{\pi}_{X|y} = \int k_{\mathcal{X}}(\cdot, x)p^{\pi}(x|y)\, dx.$$


KBR also requires that $m_{\pi}$ be expressed as a weighted sample. Let $\hat{m}_{\pi} = \sum_{j=1}^{\ell} \gamma_j k_{\mathcal{X}}(\cdot, U_j)$ be a sample expression of $m_{\pi}$, where $\ell \in \mathbb{N}$, $\gamma_1, \dots, \gamma_{\ell} \in \mathbb{R}$, and $U_1, \dots, U_{\ell} \in \mathcal{X}$. For example, suppose $U_1, \dots, U_{\ell}$ are i.i.d. drawn from $\pi(x)$. Then $\gamma_j = 1/\ell$ suffices.

Given the joint sample $\{(X_i, Y_i)\}_{i=1}^{n}$ and the empirical prior mean $\hat{m}_{\pi}$, KBR estimates the kernel posterior mean $m^{\pi}_{X|y}$ as a weighted sum of the feature vectors:
$$\hat{m}^{\pi}_{X|y} = \sum_{i=1}^{n} w_i k_{\mathcal{X}}(\cdot, X_i), \qquad (3.3)$$
where the weights $w = (w_1, \dots, w_n)^T \in \mathbb{R}^n$ are given by algorithm 1. Here $\mathrm{diag}(v)$ for $v \in \mathbb{R}^n$ denotes a diagonal matrix with diagonal entries $v$. The algorithm takes as input (1) vectors $k_Y = (k_{\mathcal{Y}}(y, Y_1), \dots, k_{\mathcal{Y}}(y, Y_n))^T$ and $m_{\pi} = (\hat{m}_{\pi}(X_1), \dots, \hat{m}_{\pi}(X_n))^T \in \mathbb{R}^n$, where $\hat{m}_{\pi}(X_i) = \sum_{j=1}^{\ell} \gamma_j k_{\mathcal{X}}(X_i, U_j)$; (2) kernel matrices $G_X = (k_{\mathcal{X}}(X_i, X_j)),\ G_Y = (k_{\mathcal{Y}}(Y_i, Y_j)) \in \mathbb{R}^{n \times n}$; and (3) regularization constants $\varepsilon, \delta > 0$. The weight vector $w = (w_1, \dots, w_n)^T \in \mathbb{R}^n$ is obtained by matrix computations involving two regularized matrix inversions. Note that these weights can be negative.

Fukumizu et al. (2013) showed that KBR is a consistent estimator of the kernel posterior mean under certain smoothness assumptions: the estimate, equation 3.3, converges to $m^{\pi}_{X|y}$ as the sample size goes to infinity, $n \to \infty$, and $\hat{m}_{\pi}$ converges to $m_{\pi}$ (with $\varepsilon, \delta \to 0$ at an appropriate speed). (For details, see Fukumizu et al., 2013; Song et al., 2013.)
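For concreteness, the following Python sketch implements the two-stage KBR weight computation in the form given by Fukumizu et al. (2013). It is our own rendering, not a verbatim transcription of algorithm 1; in particular, the exact scaling of the regularization constants (e.g., whether $n\varepsilon$ or $\varepsilon$ multiplies the identity) is an assumption and may differ in detail from the letter's pseudocode.

```python
import numpy as np

def kbr_weights(GX, GY, kY, m_pi, eps, delta):
    """Kernel Bayes' rule weight computation (a sketch).

    GX, GY : (n, n) kernel matrices on the states X_i and observations Y_i.
    kY     : (n,) vector (k_Y(y, Y_1), ..., k_Y(y, Y_n)) for the test observation y.
    m_pi   : (n,) vector of the empirical prior mean evaluated at X_1, ..., X_n.
    Returns the (possibly negative) weights w of the posterior kernel mean estimate.
    """
    n = GX.shape[0]
    # First regularized inversion: weights mu representing the prior-weighted joint embedding.
    mu = n * np.linalg.solve(GX + n * eps * np.eye(n), m_pi)
    L = np.diag(mu)
    LG = L @ GY
    # Second regularized inversion (kernel Bayes' rule).
    w = LG @ np.linalg.solve(LG @ LG + delta * np.eye(n), L @ kY)
    return w
```

In KMCF, these weights are later normalized to sum to one (see section 4.3).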

3.4 Decoding from Empirical Kernel Means. In general, as shown above, a kernel mean $m_P$ is estimated as a weighted sum of feature vectors,
$$\hat{m}_P = \sum_{i=1}^{n} w_i k(\cdot, X_i), \qquad (3.4)$$
with samples $X_1, \dots, X_n \in \mathcal{X}$ and (possibly negative) weights $w_1, \dots, w_n \in \mathbb{R}$. Suppose $\hat{m}_P$ is close to $m_P$, that is, $\|\hat{m}_P - m_P\|_{\mathcal{H}}$ is small. Then $\hat{m}_P$ is supposed to have accurate information about $P$, as $m_P$ preserves all the information of $P$.


How can we decode the information of $P$ from $\hat{m}_P$? The empirical kernel mean, equation 3.4, has the following property, which is due to the reproducing property of the kernel:
$$\langle \hat{m}_P, f \rangle_{\mathcal{H}} = \sum_{i=1}^{n} w_i f(X_i) \quad \forall f \in \mathcal{H}. \qquad (3.5)$$
Namely, the weighted average of any function in the RKHS is equal to the inner product between the empirical kernel mean and that function. This is analogous to the property, equation 3.2, of the population kernel mean $m_P$. Let $f$ be any function in $\mathcal{H}$. From these properties, equations 3.2 and 3.5, we have
$$\left| \mathbb{E}_{X \sim P}[f(X)] - \sum_{i=1}^{n} w_i f(X_i) \right| = |\langle m_P - \hat{m}_P, f \rangle_{\mathcal{H}}| \le \|m_P - \hat{m}_P\|_{\mathcal{H}} \|f\|_{\mathcal{H}},$$
where we used the Cauchy-Schwartz inequality. Therefore, the left-hand side will be close to 0 if the error $\|m_P - \hat{m}_P\|_{\mathcal{H}}$ is small. This shows that the expectation of $f$ can be estimated by the weighted average $\sum_{i=1}^{n} w_i f(X_i)$. Note that here $f$ is a function in the RKHS, but the same can also be shown for functions outside the RKHS under certain assumptions (Kanagawa & Fukumizu, 2014). In this way, the estimator of the form 3.4 provides estimators of moments, probability masses on sets, and the density function (if this exists). We explain this in the context of state-space models in section 4.4.

3.5 Kernel Herding. Here we explain kernel herding (Chen et al., 2010), another building block of the proposed filter. Suppose the kernel mean $m_P$ is known. We wish to generate samples $x_1, x_2, \dots, x_{\ell} \in \mathcal{X}$ such that the empirical mean $\hat{m}_P = \frac{1}{\ell} \sum_{i=1}^{\ell} k(\cdot, x_i)$ is close to $m_P$, that is, $\|\hat{m}_P - m_P\|_{\mathcal{H}}$ is small. This should be done only using $m_P$. Kernel herding achieves this by greedy optimization using the following update equations:
$$x_1 = \arg\max_{x \in \mathcal{X}} m_P(x), \qquad (3.6)$$
$$x_{\ell} = \arg\max_{x \in \mathcal{X}} \left[ m_P(x) - \frac{1}{\ell} \sum_{i=1}^{\ell-1} k(x, x_i) \right] \quad (\ell \ge 2), \qquad (3.7)$$
where $m_P(x)$ denotes the evaluation of $m_P$ at $x$ (recall that $m_P$ is a function in $\mathcal{H}$).

An intuitive interpretation of this procedure can be given if there is a constant $R > 0$ such that $k(x, x) = R$ for all $x \in \mathcal{X}$ (e.g., $R = 1$ if $k$ is gaussian). Suppose that $x_1, \dots, x_{\ell-1}$ are already calculated. In this case, it can be shown that $x_{\ell}$ in equation 3.7 is the minimizer of
$$E_{\ell} = \left\| m_P - \frac{1}{\ell} \sum_{i=1}^{\ell} k(\cdot, x_i) \right\|_{\mathcal{H}}. \qquad (3.8)$$
Thus, kernel herding performs greedy minimization of the distance between $m_P$ and the empirical kernel mean $\hat{m}_P = \frac{1}{\ell} \sum_{i=1}^{\ell} k(\cdot, x_i)$.

It can be shown that the error $E_{\ell}$ of equation 3.8 decreases at a rate at least $O(\ell^{-1/2})$ under the assumption that $k$ is bounded (Bach, Lacoste-Julien, & Obozinski, 2012). In other words, the herding samples $x_1, \dots, x_{\ell}$ provide a convergent approximation of $m_P$. In this sense, kernel herding can be seen as a (pseudo) sampling method. Note that $m_P$ itself can be an empirical kernel mean of the form 3.4. These properties are important for our resampling algorithm developed in section 4.2.

It should be noted that $E_{\ell}$ decreases at a faster rate $O(\ell^{-1})$ under a certain assumption (Chen et al., 2010); this is much faster than the rate of i.i.d. samples, $O(\ell^{-1/2})$. Unfortunately, this assumption holds only when $\mathcal{H}$ is finite dimensional (Bach et al., 2012), and therefore the fast rate of $O(\ell^{-1})$ has not been guaranteed for infinite-dimensional cases. Nevertheless, this fast rate motivates the use of kernel herding in the data reduction method in section C.2 in appendix C (we will use kernel herding for two different purposes).
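The greedy updates, equations 3.6 and 3.7, are easy to implement when the argmax is restricted to a finite candidate set, which is also how the resampling step of KMCF will use them (see algorithm 2 in section 4.2). The sketch below is our own illustration under that restriction; it assumes the kernel mean is given as a weighted sample of the form 3.4.

```python
import numpy as np

def herd_from_candidates(weights, G, num_samples):
    """Greedy kernel herding restricted to a finite candidate set (a sketch).

    weights     : (n,) weights of an empirical kernel mean m_P = sum_i w_i k(., X_i).
    G           : (n, n) kernel matrix k(X_i, X_j) over the candidate points.
    num_samples : number of herding samples to draw (repetition is allowed).
    Returns indices of the selected candidates.
    """
    mP_vals = G @ weights              # m_P evaluated at every candidate X_j
    acc = np.zeros_like(mP_vals)       # running sum of k(X_j, x_i) over chosen x_i
    chosen = []
    for ell in range(1, num_samples + 1):
        scores = mP_vals - acc / ell   # equation 3.7 (equation 3.6 when ell = 1)
        j = int(np.argmax(scores))
        chosen.append(j)
        acc += G[:, j]
    return np.array(chosen)
```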

4 Kernel Monte Carlo Filter

In this section we present our kernel Monte Carlo filter (KMCF). First, we define notation and review the problem setting in section 4.1. We then describe the algorithm of KMCF in section 4.2. We discuss implementation issues such as hyperparameter selection and computational cost in section 4.3. We explain how to decode the information on the posteriors from the estimated kernel means in section 4.4.

4.1 Notation and Problem Setup. Here we formally define the setup explained in section 1. The notation is summarized in Table 1.

Table 1: Notation.
$\mathcal{X}$ : State space
$\mathcal{Y}$ : Observation space
$x_t \in \mathcal{X}$ : State at time $t$
$y_t \in \mathcal{Y}$ : Observation at time $t$
$p(y_t|x_t)$ : Observation model
$p(x_t|x_{t-1})$ : Transition model
$\{(X_i, Y_i)\}_{i=1}^{n}$ : State-observation examples
$k_{\mathcal{X}}$ : Positive-definite kernel on $\mathcal{X}$
$k_{\mathcal{Y}}$ : Positive-definite kernel on $\mathcal{Y}$
$\mathcal{H}_{\mathcal{X}}$ : RKHS associated with $k_{\mathcal{X}}$
$\mathcal{H}_{\mathcal{Y}}$ : RKHS associated with $k_{\mathcal{Y}}$

We consider a state-space model (see Figure 1). Let $\mathcal{X}$ and $\mathcal{Y}$ be measurable spaces, which serve as a state space and an observation space, respectively. Let $x_1, \dots, x_t, \dots, x_T \in \mathcal{X}$ be a sequence of hidden states, which follow a Markov process. Let $p(x_t|x_{t-1})$ denote a transition model that defines this Markov process. Let $y_1, \dots, y_t, \dots, y_T \in \mathcal{Y}$ be a sequence of observations. Each observation $y_t$ is assumed to be generated from an observation model $p(y_t|x_t)$, conditioned on the corresponding state $x_t$. We use the abbreviation $y_{1:t} = \{y_1, \dots, y_t\}$.

We consider a filtering problem of estimating the posterior distribution $p(x_t|y_{1:t})$ for each time $t = 1, \dots, T$. The estimation is to be done online, as each $y_t$ is given. Specifically, we consider the following setting (see also section 1):

1. The observation model $p(y_t|x_t)$ is not known explicitly or even parametrically. Instead, we are given examples of state-observation pairs $\{(X_i, Y_i)\}_{i=1}^{n} \subset \mathcal{X} \times \mathcal{Y}$ prior to the test phase. The observation model is also assumed time homogeneous.

2. Sampling from the transition model $p(x_t|x_{t-1})$ is possible. Its probabilistic model can be an arbitrary nonlinear, nongaussian distribution, as for standard particle filters. It can further depend on time. For example, control input can be included in the transition model as $p(x_t|x_{t-1}) = p(x_t|x_{t-1}, u_t)$, where $u_t$ denotes control input provided by a user at time $t$.

Let $k_{\mathcal{X}} : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ and $k_{\mathcal{Y}} : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ be positive-definite kernels on $\mathcal{X}$ and $\mathcal{Y}$, respectively. Denote by $\mathcal{H}_{\mathcal{X}}$ and $\mathcal{H}_{\mathcal{Y}}$ their respective RKHSs. We address the above filtering problem by estimating the kernel means of the posteriors:
$$m_{x_t|y_{1:t}} = \int k_{\mathcal{X}}(\cdot, x_t)\, p(x_t|y_{1:t})\, dx_t \in \mathcal{H}_{\mathcal{X}} \quad (t = 1, \dots, T). \qquad (4.1)$$
These preserve all the information of the corresponding posteriors if the kernels are characteristic (see section 3.2). Therefore, the resulting estimates of these kernel means provide us the information of the posteriors, as explained in section 4.4.

4.2 Algorithm. KMCF iterates three steps of prediction, correction, and resampling for each time $t$. Suppose that we have just finished the iteration at time $t - 1$. Then, as shown later, the resampling step yields the following estimator of equation 4.1 at time $t - 1$:


Figure 2: One iteration of KMCF. Here $X_1, \dots, X_8$ and $Y_1, \dots, Y_8$ denote states and observations, respectively, in the state-observation examples $\{(X_i, Y_i)\}_{i=1}^{n}$ (suppose $n = 8$). 1. Prediction step: The kernel mean of the prior, equation 4.5, is estimated by sampling with the transition model $p(x_t|x_{t-1})$. 2. Correction step: The kernel mean of the posterior, equation 4.1, is estimated by applying kernel Bayes' rule (see algorithm 1). The estimation makes use of the information of the prior (expressed as $m_{\pi} = (\hat{m}_{x_t|y_{1:t-1}}(X_i)) \in \mathbb{R}^8$) as well as that of a new observation $y_t$ (expressed as $k_Y = (k_{\mathcal{Y}}(y_t, Y_i)) \in \mathbb{R}^8$). The resulting estimate, equation 4.6, is expressed as a weighted sample $\{(w_{t,i}, X_i)\}_{i=1}^{n}$. Note that the weights may be negative. 3. Resampling step: Samples associated with small weights are eliminated, and those with large weights are replicated by applying kernel herding (see algorithm 2). The resulting samples provide an empirical kernel mean, equation 4.7, which will be used in the next iteration.

$$\bar{m}_{x_{t-1}|y_{1:t-1}} = \frac{1}{n} \sum_{i=1}^{n} k_{\mathcal{X}}(\cdot, \bar{X}_{t-1,i}), \qquad (4.2)$$
where $\bar{X}_{t-1,1}, \dots, \bar{X}_{t-1,n} \in \mathcal{X}$. We show one iteration of KMCF that estimates the kernel mean, equation 4.1, at time $t$ (see also Figure 2).

4.2.1 Prediction Step. The prediction step is as follows. We generate a sample from the transition model for each $\bar{X}_{t-1,i}$ in equation 4.2:
$$X_{t,i} \sim p(x_t|x_{t-1} = \bar{X}_{t-1,i}) \quad (i = 1, \dots, n). \qquad (4.3)$$
We then specify a new empirical kernel mean,
$$\hat{m}_{x_t|y_{1:t-1}} = \frac{1}{n} \sum_{i=1}^{n} k_{\mathcal{X}}(\cdot, X_{t,i}). \qquad (4.4)$$
This is an estimator of the following kernel mean of the prior:
$$m_{x_t|y_{1:t-1}} = \int k_{\mathcal{X}}(\cdot, x_t)\, p(x_t|y_{1:t-1})\, dx_t \in \mathcal{H}_{\mathcal{X}}, \qquad (4.5)$$
where
$$p(x_t|y_{1:t-1}) = \int p(x_t|x_{t-1})\, p(x_{t-1}|y_{1:t-1})\, dx_{t-1}$$
is the prior distribution of the current state $x_t$. Thus, equation 4.4 serves as a prior for the subsequent posterior estimation.

In section 5, we theoretically analyze this sampling procedure in detail and provide justification of equation 4.4 as an estimator of the kernel mean, equation 4.5. We emphasize here that such an analysis is necessary even though the sampling procedure is similar to that of a particle filter: the theory of particle methods does not provide a theoretical justification of equation 4.4 as a kernel mean estimator, since it deals with probabilities as empirical distributions.
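As an illustration of the prediction step, the sketch below propagates each resampled state through a user-supplied transition sampler and evaluates the resulting prior kernel mean, equation 4.4, at arbitrary points. The transition model used here (a linear gaussian random walk) is only a placeholder assumption for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def transition_sample(X_prev):
    # Placeholder transition model: x_t = 0.9 x_{t-1} + gaussian noise.
    return 0.9 * X_prev + 0.3 * rng.normal(size=X_prev.shape)

def gaussian_kernel(X, Z, sigma=0.5):
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2 * sigma ** 2))

X_bar_prev = rng.normal(size=(100, 1))     # resampled states from time t-1 (equation 4.2)
X_pred = transition_sample(X_bar_prev)     # equation 4.3

def prior_mean(x):
    # Prior kernel mean, equation 4.4, evaluated at the points x.
    return gaussian_kernel(x, X_pred).mean(axis=1)

print(prior_mean(np.zeros((1, 1))))
```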

4.2.2 Correction Step. This step estimates the kernel mean, equation 4.1, of the posterior by using kernel Bayes' rule (see algorithm 1) in section 3.3. This makes use of the new observation $y_t$, the state-observation examples $\{(X_i, Y_i)\}_{i=1}^{n}$, and the estimate, equation 4.4, of the prior.

The input of algorithm 1 consists of (1) vectors
$$k_Y = (k_{\mathcal{Y}}(y_t, Y_1), \dots, k_{\mathcal{Y}}(y_t, Y_n))^T \in \mathbb{R}^n,$$
$$m_{\pi} = (\hat{m}_{x_t|y_{1:t-1}}(X_1), \dots, \hat{m}_{x_t|y_{1:t-1}}(X_n))^T = \left( \frac{1}{n} \sum_{i=1}^{n} k_{\mathcal{X}}(X_q, X_{t,i}) \right)_{q=1}^{n} \in \mathbb{R}^n,$$
which are interpreted as expressions of $y_t$ and $\hat{m}_{x_t|y_{1:t-1}}$ using the sample $\{(X_i, Y_i)\}_{i=1}^{n}$; (2) kernel matrices $G_X = (k_{\mathcal{X}}(X_i, X_j)),\ G_Y = (k_{\mathcal{Y}}(Y_i, Y_j)) \in \mathbb{R}^{n \times n}$; and (3) regularization constants $\varepsilon, \delta > 0$. These constants $\varepsilon, \delta$, as well as the kernels $k_{\mathcal{X}}, k_{\mathcal{Y}}$, are hyperparameters of KMCF (we discuss how to choose these parameters later).


Algorithm 1 outputs a weight vector $w = (w_1, \dots, w_n)^T \in \mathbb{R}^n$. Normalizing these weights, $w_t := w / \sum_{i=1}^{n} w_i$, we obtain an estimator of equation 4.1^3 as
$$\hat{m}_{x_t|y_{1:t}} = \sum_{i=1}^{n} w_{t,i} k_{\mathcal{X}}(\cdot, X_i). \qquad (4.6)$$

The apparent difference from a particle filter is that the posterior (kernel mean) estimator, equation 4.6, is expressed in terms of the samples $X_1, \dots, X_n$ in the training sample $\{(X_i, Y_i)\}_{i=1}^{n}$, not with the samples from the prior, equation 4.4. This requires that the training samples $X_1, \dots, X_n$ cover the support of the posterior $p(x_t|y_{1:t})$ sufficiently well. If this does not hold, we cannot expect good performance for the posterior estimate. Note that this is also true for any method that deals with the setting of this letter: poverty of training samples in a certain region means that we do not have any information about the observation model $p(y_t|x_t)$ in that region.

4.2.3 Resampling Step. This step applies the update equations 3.6 and 3.7 of kernel herding in section 3.5 to the estimate, equation 4.6. This is to obtain samples $\bar{X}_{t,1}, \dots, \bar{X}_{t,n}$ such that
$$\bar{m}_{x_t|y_{1:t}} = \frac{1}{n} \sum_{i=1}^{n} k_{\mathcal{X}}(\cdot, \bar{X}_{t,i}) \qquad (4.7)$$
is close to equation 4.6 in the RKHS. Our theoretical analysis in section 5 shows that such a procedure can reduce the error of the prediction step at time $t + 1$.

The procedure is summarized in algorithm 2. Specifically, we generate each $\bar{X}_{t,i}$ by searching for the solution of the optimization problem in equations 3.6 and 3.7 from the finite set of samples $X_1, \dots, X_n$ in equation 4.6. We allow repetitions in $\bar{X}_{t,1}, \dots, \bar{X}_{t,n}$. We can expect that the resulting equation 4.7 is close to equation 4.6 in the RKHS if the samples $X_1, \dots, X_n$ cover the support of the posterior $p(x_t|y_{1:t})$ sufficiently. This is verified by the theoretical analysis of section 5.3.

Here, searching for the solutions from a finite set reduces the computational cost of kernel herding. It is possible to search from the entire space $\mathcal{X}$ if we have sufficient time or if the sample size $n$ is small enough; it depends on applications and available computational resources. We also note that the size of the resampling samples is not necessarily $n$; this depends on how accurately these samples approximate equation 4.6. Thus, a smaller number of samples may be sufficient. In this case, we can reduce the computational cost of resampling, as discussed in section 5.2.

The aim of our resampling step is similar to that of the resampling step of a particle filter (see Doucet & Johansen, 2011). Intuitively, the aim is to eliminate samples with very small weights and replicate those with large weights (see Figures 2 and 3). In particle methods, this is realized by generating samples from the empirical distribution defined by a weighted sample (therefore, this procedure is called resampling). Our resampling step is a realization of such a procedure in terms of the kernel mean embedding: we generate samples $\bar{X}_{t,1}, \dots, \bar{X}_{t,n}$ from the empirical kernel mean, equation 4.6.

Note that the resampling algorithm of particle methods is not appropriate for use with kernel mean embeddings. This is because it assumes that weights are positive, but our weights in equation 4.6 can be negative, as this equation is a kernel mean estimator. One may apply the resampling algorithm of particle methods by first truncating the samples with negative weights. However, there is no guarantee that samples obtained by this heuristic produce a good approximation of equation 4.6 as a kernel mean, as shown by experiments in section 6.1. In this sense, the use of kernel herding is more natural, since it generates samples that approximate a kernel mean.

^3 For this normalization procedure, see the discussion in section 4.3.

4.2.4 Overall Algorithm. We summarize the overall procedure of KMCF in algorithm 3, where $p_{\mathrm{init}}$ denotes a prior distribution for the initial state $x_1$. For each time $t$, KMCF takes as input an observation $y_t$ and outputs a weight vector $w_t = (w_{t,1}, \dots, w_{t,n})^T \in \mathbb{R}^n$. Combined with the samples $X_1, \dots, X_n$ in the state-observation examples $\{(X_i, Y_i)\}_{i=1}^{n}$, these weights provide an estimator, equation 4.6, of the kernel mean of the posterior, equation 4.1.

We first compute kernel matrices $G_X, G_Y$ (lines 4-5), which are used in algorithm 1 of kernel Bayes' rule (line 15). For $t = 1$, we generate an i.i.d. sample $X_{1,1}, \dots, X_{1,n}$ from the initial distribution $p_{\mathrm{init}}$ (line 8), which provides an estimator of the prior corresponding to equation 4.4. Line 10 is the resampling step at time $t - 1$, and line 11 is the prediction step at time $t$. Lines 13 to 16 correspond to the correction step.
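To make the data flow of one KMCF iteration concrete, the following self-contained Python sketch strings the three steps together. It condenses the KBR and herding sketches given earlier; the gaussian kernels, the placeholder transition sampler, the toy training pairs, and the exact scaling of the regularization constants inside kernel Bayes' rule are our own assumptions rather than a transcription of algorithm 3.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_kernel(A, B, sigma):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2 * sigma ** 2))

def kmcf_iteration(X, Y, GX, GY, y_t, X_bar_prev, transition_sample,
                   sigma_x, sigma_y, eps, delta):
    n = X.shape[0]
    # --- Prediction step (equations 4.3 and 4.4) ---
    X_pred = transition_sample(X_bar_prev)
    m_pi = gaussian_kernel(X, X_pred, sigma_x).mean(axis=1)   # prior mean at X_1..X_n
    # --- Correction step: kernel Bayes' rule (equation 4.6) ---
    kY = gaussian_kernel(Y, y_t[None, :], sigma_y).ravel()
    mu = n * np.linalg.solve(GX + n * eps * np.eye(n), m_pi)
    L = np.diag(mu)
    LG = L @ GY
    w = LG @ np.linalg.solve(LG @ LG + delta * np.eye(n), L @ kY)
    w = w / w.sum()                                           # weight normalization
    # --- Resampling step: kernel herding over the candidates X_1..X_n (equation 4.7) ---
    mP_vals = GX @ w
    acc = np.zeros(n)
    idx = []
    for ell in range(1, n + 1):
        j = int(np.argmax(mP_vals - acc / ell))
        idx.append(j)
        acc += GX[:, j]
    return w, X[np.array(idx)]

# Toy usage with synthetic training pairs (X_i, Y_i) and a random-walk transition model.
n = 200
X_train = rng.uniform(-3, 3, size=(n, 1))
Y_train = X_train + 0.2 * rng.normal(size=(n, 1))             # placeholder observation model
sigma_x = sigma_y = 0.5
GX = gaussian_kernel(X_train, X_train, sigma_x)
GY = gaussian_kernel(Y_train, Y_train, sigma_y)
transition = lambda Xp: 0.9 * Xp + 0.3 * rng.normal(size=Xp.shape)
X_bar = rng.normal(size=(n, 1))                               # stands in for time t-1 resampled states
w_t, X_bar_new = kmcf_iteration(X_train, Y_train, GX, GY, np.array([0.5]),
                                X_bar, transition, sigma_x, sigma_y, eps=0.1, delta=0.1)
print(w_t.sum(), X_bar_new.shape)
```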


4.3 Discussion. The estimation accuracy of KMCF can depend on several factors in practice.

4.3.1 Training Samples. We first note that the training samples $\{(X_i, Y_i)\}_{i=1}^{n}$ should provide the information concerning the observation model $p(y_t|x_t)$. For example, $\{(X_i, Y_i)\}_{i=1}^{n}$ may be an i.i.d. sample from a joint distribution $p(x, y)$ on $\mathcal{X} \times \mathcal{Y}$, which decomposes as $p(x, y) = p(y|x)p(x)$. Here $p(y|x)$ is the observation model and $p(x)$ is some distribution on $\mathcal{X}$. The support of $p(x)$ should cover the region where states $x_1, \dots, x_T$ may pass in the test phase, as discussed in section 4.2. For example, this is satisfied when the state space $\mathcal{X}$ is compact and the support of $p(x)$ is the entire $\mathcal{X}$.

Note that training samples $\{(X_i, Y_i)\}_{i=1}^{n}$ can also be non-i.i.d. in practice. For example, we may deterministically select $X_1, \dots, X_n$ so that they cover the region of interest. In location estimation problems in robotics, for instance, we may collect location-sensor examples $\{(X_i, Y_i)\}_{i=1}^{n}$ so that locations $X_1, \dots, X_n$ cover the region where location estimation is to be conducted (Quigley et al., 2010).

4.3.2 Hyperparameters. As in other kernel methods in general, the performance of KMCF depends on the choice of its hyperparameters, which are the kernels $k_{\mathcal{X}}$ and $k_{\mathcal{Y}}$ (or parameters in the kernels, e.g., the bandwidth of the gaussian kernel) and the regularization constants $\delta, \varepsilon > 0$. We need to define these hyperparameters based on the joint sample $\{(X_i, Y_i)\}_{i=1}^{n}$ before running the algorithm on the test data $y_1, \dots, y_T$. This can be done by cross-validation. Suppose that $\{(X_i, Y_i)\}_{i=1}^{n}$ is given as a sequence from the state-space model. We can then apply two-fold cross-validation by dividing the sequence into two subsequences. If $\{(X_i, Y_i)\}_{i=1}^{n}$ is not a sequence, we can rely on the cross-validation procedure for kernel Bayes' rule (see section 4.2 of Fukumizu et al., 2013).

4.3.3 Normalization of Weights. We found in our preliminary experiments that normalization of the weights (see line 16, algorithm 3) is beneficial to the filtering performance. This may be justified by the following discussion about a kernel mean estimator in general. Let us consider a consistent kernel mean estimator $\hat{m}_P = \sum_{i=1}^{n} w_i k(\cdot, X_i)$ such that $\lim_{n \to \infty} \|\hat{m}_P - m_P\|_{\mathcal{H}} = 0$. Then we can show that the sum of the weights converges to 1, $\lim_{n \to \infty} \sum_{i=1}^{n} w_i = 1$, under certain assumptions (Kanagawa & Fukumizu, 2014). This could be explained as follows. Recall that the weighted average $\sum_{i=1}^{n} w_i f(X_i)$ of a function $f$ is an estimator of the expectation $\int f(x)\, dP(x)$. Let $f$ be a function that takes the value 1 for any input: $f(x) = 1\ \forall x \in \mathcal{X}$. Then we have $\sum_{i=1}^{n} w_i f(X_i) = \sum_{i=1}^{n} w_i$ and $\int f(x)\, dP(x) = 1$; therefore, $\sum_{i=1}^{n} w_i$ is an estimator of 1. In other words, if the error $\|\hat{m}_P - m_P\|_{\mathcal{H}}$ is small, then the sum of the weights $\sum_{i=1}^{n} w_i$ should be close to 1. Conversely, if the sum of the weights is far from 1, it suggests that the estimate $\hat{m}_P$ is not accurate. Based on this theoretical observation, we suppose that normalization of the weights (which makes the sum equal to 1) results in a better estimate.

4.3.4 Time Complexity. For each time $t$, the naive implementation of algorithm 3 requires a time complexity of $O(n^3)$ for the size $n$ of the joint sample $\{(X_i, Y_i)\}_{i=1}^{n}$. This comes from algorithm 1 in line 15 (kernel Bayes' rule) and algorithm 2 in line 10 (resampling). The complexity $O(n^3)$ of algorithm 1 is due to the matrix inversions. Note that one of the inversions, $(G_X + n\varepsilon I_n)^{-1}$, can be computed before the test phase, as it does not involve the test data. Algorithm 2 also has complexity $O(n^3)$. In section 5.2, we explain how this cost can be reduced to $O(n^2)$ by generating only $\ell < n$ samples by resampling.

4.3.5 Speeding-Up Methods. In appendix C, we describe two methods for reducing the computational costs of KMCF, both of which need to be applied only prior to the test phase. The first is a low-rank approximation of the kernel matrices $G_X, G_Y$, which reduces the complexity to $O(nr^2)$, where $r$ is the rank of the low-rank matrices. Low-rank approximation works well in practice, since the eigenvalues of a kernel matrix often decay very rapidly. Indeed, this has been theoretically shown for some cases (see Widom, 1963, 1964, and discussions in Bach & Jordan, 2002). The second is a data-reduction method based on kernel herding, which efficiently selects joint subsamples from the training set $\{(X_i, Y_i)\}_{i=1}^{n}$. Algorithm 3 is then applied based only on those subsamples. The resulting complexity is thus $O(r^3)$, where $r$ is the number of subsamples. This method is motivated by the fast convergence rate of kernel herding (Chen et al., 2010).

Both methods require the number $r$ to be chosen, which is either the rank for low-rank approximation or the number of subsamples in data reduction. This determines the trade-off between accuracy and computational time. In practice, there are two ways of selecting the number $r$. By regarding $r$ as a hyperparameter of KMCF, we can select it by cross-validation, or we can choose $r$ by comparing the resulting approximation error, which is measured in a matrix norm for low-rank approximation and in an RKHS norm for the subsampling method. (For details, see appendix C.)

4.3.6 Transfer Learning Setting. We assumed that the observation model in the test phase is the same as for the training samples. However, this might not hold in some situations. For example, in the vision-based localization problem, the illumination conditions for the test and training phases might be different (e.g., the test is done at night, while the training samples are collected in the morning). Without taking into account such a significant change in the observation model, KMCF would not perform well in practice.

This problem could be addressed by exploiting the framework of transfer learning (Pan & Yang, 2010). This framework aims at situations where the probability distribution that generates test data is different from that of training samples. The main assumption is that there exist a small number of examples from the test distribution. Transfer learning then provides a way of combining such test examples and abundant training samples, thereby improving the test performance. The application of transfer learning in our setting remains a topic for future research.

4.4 Estimation of Posterior Statistics. By algorithm 3, we obtain the estimates of the kernel means of posteriors, equation 4.1, as
$$\hat{m}_{x_t|y_{1:t}} = \sum_{i=1}^{n} w_{t,i} k_{\mathcal{X}}(\cdot, X_i) \quad (t = 1, \dots, T). \qquad (4.8)$$
These contain the information on the posteriors $p(x_t|y_{1:t})$ (see sections 3.2 and 3.4). We now show how to estimate statistics of the posteriors using these estimates. For ease of presentation, we consider the case $\mathcal{X} = \mathbb{R}^d$. Theoretical arguments to justify these operations are provided by Kanagawa and Fukumizu (2014).

4.4.1 Mean and Covariance. Consider the posterior mean $\int x_t\, p(x_t|y_{1:t})\, dx_t \in \mathbb{R}^d$ and the posterior (uncentered) covariance $\int x_t x_t^T\, p(x_t|y_{1:t})\, dx_t \in \mathbb{R}^{d \times d}$. These quantities can be estimated as
$$\sum_{i=1}^{n} w_{t,i} X_i \quad \text{(mean)}, \qquad \sum_{i=1}^{n} w_{t,i} X_i X_i^T \quad \text{(covariance)}.$$

4.4.2 Probability Mass. Let $A \subset \mathcal{X}$ be a measurable set with smooth boundary. Define the indicator function $I_A(x)$ by $I_A(x) = 1$ for $x \in A$ and $I_A(x) = 0$ otherwise. Consider the probability mass $\int I_A(x)\, p(x_t|y_{1:t})\, dx_t$. This can be estimated as $\sum_{i=1}^{n} w_{t,i} I_A(X_i)$.

4.4.3 Density. Suppose $p(x_t|y_{1:t})$ has a density function. Let $J(x)$ be a smoothing kernel satisfying $\int J(x)\, dx = 1$ and $J(x) \ge 0$. Let $h > 0$ and define $J_h(x) = \frac{1}{h^d} J\left(\frac{x}{h}\right)$. Then the density of $p(x_t|y_{1:t})$ can be estimated as
$$\hat{p}(x_t|y_{1:t}) = \sum_{i=1}^{n} w_{t,i} J_h(x_t - X_i), \qquad (4.9)$$
with an appropriate choice of $h$.

4.4.4 Mode. The mode may be obtained by finding a point that maximizes equation 4.9. However, this requires a careful choice of $h$. Instead, we may use $X_{i_{\max}}$ with $i_{\max} = \arg\max_i w_{t,i}$ as a mode estimate. This is the point in $\{X_1, \dots, X_n\}$ that is associated with the maximum weight in $\{w_{t,1}, \dots, w_{t,n}\}$. This point can be interpreted as the point that maximizes equation 4.9 in the limit of $h \to 0$.
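The estimators of sections 4.4.1 to 4.4.4 are one-liners once the weights and training states are available. The sketch below (our own illustration; the variable names are not from the letter) computes them for a weight vector of the kind produced by the correction step.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 2))      # training states X_1..X_n (d = 2)
w = rng.normal(size=200)                    # stand-in for KBR weights (may be negative)
w = w / w.sum()                             # normalized as in algorithm 3

post_mean = w @ X                           # sum_i w_i X_i
post_cov = (X * w[:, None]).T @ X           # sum_i w_i X_i X_i^T (uncentered)
in_A = (np.abs(X) < 1).all(axis=1)          # indicator of the set A = (-1, 1)^2
prob_A = w @ in_A                           # estimated posterior mass of A
mode_est = X[np.argmax(w)]                  # mode estimate from the largest weight

print(post_mean, prob_A, mode_est)
```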

4.4.5 Other Methods. Other ways of using equation 4.8 include preimage computation and the fitting of gaussian mixtures (see, e.g., Song et al., 2009; Fukumizu et al., 2013; McCalman, O'Callaghan, & Ramos, 2013).

5 Theoretical Analysis

In this section, we analyze the sampling procedure of the prediction step in section 4.2. Specifically, we derive an upper bound on the error of the estimator, equation 4.4. We also discuss in detail how the resampling step in section 4.2 works as a preprocessing step of the prediction step.

To make our analysis clear, we slightly generalize the setting of the prediction step and discuss the sampling and resampling procedures in this setting.

5.1 Error Bound for the Prediction Step. Let $\mathcal{X}$ be a measurable space and $P$ be a probability distribution on $\mathcal{X}$. Let $p(\cdot|x)$ be a conditional distribution on $\mathcal{X}$ conditioned on $x \in \mathcal{X}$. Let $Q$ be a marginal distribution on $\mathcal{X}$ defined by $Q(B) = \int p(B|x)\, dP(x)$ for all measurable $B \subset \mathcal{X}$. In the filtering setting of section 4, the space $\mathcal{X}$ corresponds to the state space, and the distributions $P$, $p(\cdot|x)$, and $Q$ correspond to the posterior $p(x_{t-1}|y_{1:t-1})$ at time $t - 1$, the transition model $p(x_t|x_{t-1})$, and the prior $p(x_t|y_{1:t-1})$ at time $t$, respectively.

Let $k_{\mathcal{X}}$ be a positive-definite kernel on $\mathcal{X}$ and $\mathcal{H}_{\mathcal{X}}$ be the RKHS associated with $k_{\mathcal{X}}$. Let $m_P = \int k_{\mathcal{X}}(\cdot, x)\, dP(x)$ and $m_Q = \int k_{\mathcal{X}}(\cdot, x)\, dQ(x)$ be the kernel means of $P$ and $Q$, respectively. Suppose that we are given an empirical estimate of $m_P$ as
$$\hat{m}_P = \sum_{i=1}^{n} w_i k_{\mathcal{X}}(\cdot, X_i), \qquad (5.1)$$
where $w_1, \dots, w_n \in \mathbb{R}$ and $X_1, \dots, X_n \in \mathcal{X}$. Considering this weighted sample form enables us to explain the mechanism of the resampling step.

The prediction step can then be cast as the following procedure: for each sample $X_i$, we generate a new sample $X'_i$ with the conditional distribution $X'_i \sim p(\cdot|X_i)$. Then we estimate $m_Q$ by
$$\hat{m}_Q = \sum_{i=1}^{n} w_i k_{\mathcal{X}}(\cdot, X'_i), \qquad (5.2)$$
which corresponds to the estimate, equation 4.4, of the prior kernel mean at time $t$.

The following theorem provides an upper bound on the error of equation 5.2 and reveals properties of equation 5.1 that affect the error of the estimator, equation 5.2. The proof is given in appendix A.

Theorem 1. Let $\hat{m}_P$ be a fixed estimate of $m_P$ given by equation 5.1. Define a function $\theta$ on $\mathcal{X} \times \mathcal{X}$ by $\theta(x_1, x_2) = \int\!\!\int k_{\mathcal{X}}(x'_1, x'_2)\, dp(x'_1|x_1)\, dp(x'_2|x_2)$ for all $(x_1, x_2) \in \mathcal{X} \times \mathcal{X}$, and assume that $\theta$ is included in the tensor RKHS $\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}$.^4 The estimator $\hat{m}_Q$, equation 5.2, then satisfies
$$\mathbb{E}_{X'_1, \dots, X'_n}\left[ \|\hat{m}_Q - m_Q\|^2_{\mathcal{H}_{\mathcal{X}}} \right] \le \sum_{i=1}^{n} w_i^2 \left( \mathbb{E}_{X'_i}[k_{\mathcal{X}}(X'_i, X'_i)] - \mathbb{E}_{X'_i, \tilde{X}'_i}[k_{\mathcal{X}}(X'_i, \tilde{X}'_i)] \right) \qquad (5.3)$$
$$\qquad + \|\hat{m}_P - m_P\|^2_{\mathcal{H}_{\mathcal{X}}} \|\theta\|_{\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}}, \qquad (5.4)$$
where $X'_i \sim p(\cdot|X_i)$ and $\tilde{X}'_i$ is an independent copy of $X'_i$.

^4 The tensor RKHS $\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}$ is the RKHS of a product kernel $k_{\mathcal{X} \times \mathcal{X}}$ on $\mathcal{X} \times \mathcal{X}$ defined as $k_{\mathcal{X} \times \mathcal{X}}((x_a, x_b), (x_c, x_d)) = k_{\mathcal{X}}(x_a, x_c) k_{\mathcal{X}}(x_b, x_d)$ for all $(x_a, x_b), (x_c, x_d) \in \mathcal{X} \times \mathcal{X}$. This space $\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}$ consists of smooth functions on $\mathcal{X} \times \mathcal{X}$ if the kernel $k_{\mathcal{X}}$ is smooth (e.g., if $k_{\mathcal{X}}$ is gaussian; see section 4 of Steinwart & Christmann, 2008). In this case, we can interpret this assumption as requiring that $\theta$ be smooth as a function on $\mathcal{X} \times \mathcal{X}$. The function $\theta$ can be written as the inner product between the kernel means of the conditional distributions, $\theta(x_1, x_2) = \langle m_{p(\cdot|x_1)}, m_{p(\cdot|x_2)} \rangle_{\mathcal{H}_{\mathcal{X}}}$, where $m_{p(\cdot|x)} = \int k_{\mathcal{X}}(\cdot, x')\, dp(x'|x)$. Therefore, the assumption may be further seen as requiring that the map $x \mapsto m_{p(\cdot|x)}$ be smooth. Note that while similar assumptions are common in the literature on kernel mean embeddings (e.g., theorem 5 of Fukumizu et al., 2013), we may relax this assumption by using approximation arguments in learning theory (e.g., theorems 2.2 and 2.3 of Eberts & Steinwart, 2013). This analysis remains a topic for future research.

From theorem 1, we can make the following observations. First, the second term, equation 5.4, of the upper bound shows that the error of the estimator, equation 5.2, is likely to be large if the given estimate, equation 5.1, has large error $\|\hat{m}_P - m_P\|^2_{\mathcal{H}_{\mathcal{X}}}$, which is reasonable to expect.

Second, the first term, equation 5.3, shows that the error of equation 5.2 can be large if the distribution of $X'_i$ (i.e., $p(\cdot|X_i)$) has large variance. For example, suppose $X'_i = f(X_i) + \varepsilon_i$, where $f : \mathcal{X} \to \mathcal{X}$ is some mapping and $\varepsilon_i$ is a random variable with mean 0. Let $k_{\mathcal{X}}$ be the gaussian kernel $k_{\mathcal{X}}(x, x') = \exp(-\|x - x'\|^2 / 2\alpha)$ for some $\alpha > 0$. Then $\mathbb{E}_{X'_i}[k_{\mathcal{X}}(X'_i, X'_i)] - \mathbb{E}_{X'_i, \tilde{X}'_i}[k_{\mathcal{X}}(X'_i, \tilde{X}'_i)]$ increases from 0 to 1 as the variance of $\varepsilon_i$ (i.e., the variance of $X'_i$) increases from 0 to infinity. Therefore, in this case, equation 5.3 is upper-bounded at worst by $\sum_{i=1}^{n} w_i^2$. Note that $\mathbb{E}_{X'_i}[k_{\mathcal{X}}(X'_i, X'_i)] - \mathbb{E}_{X'_i, \tilde{X}'_i}[k_{\mathcal{X}}(X'_i, \tilde{X}'_i)]$ is always nonnegative.^5
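To make the dependence on the noise variance explicit, the following short calculation (our own addition, for the one-dimensional gaussian case with $\varepsilon_i \sim N(0, \sigma^2)$) evaluates the variance term of equation 5.3 in closed form:
$$\mathbb{E}_{X'_i}[k_{\mathcal{X}}(X'_i, X'_i)] = 1, \qquad X'_i - \tilde{X}'_i = \varepsilon_i - \tilde{\varepsilon}_i \sim N(0, 2\sigma^2),$$
$$\mathbb{E}_{X'_i, \tilde{X}'_i}[k_{\mathcal{X}}(X'_i, \tilde{X}'_i)] = \mathbb{E}_{Z \sim N(0, 2\sigma^2)}\left[ e^{-Z^2/2\alpha} \right] = \sqrt{\frac{\alpha}{\alpha + 2\sigma^2}},$$
so the variance term equals $1 - \sqrt{\alpha/(\alpha + 2\sigma^2)}$, which indeed increases from 0 toward 1 as $\sigma^2$ grows.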

5.1.1 Effective Sample Size. Now let us assume that the kernel $k_{\mathcal{X}}$ is bounded: there is a constant $C > 0$ such that $\sup_{x \in \mathcal{X}} k_{\mathcal{X}}(x, x) < C$. Then the inequality of theorem 1 can be further bounded as
$$\mathbb{E}_{X'_1, \dots, X'_n}\left[ \|\hat{m}_Q - m_Q\|^2_{\mathcal{H}_{\mathcal{X}}} \right] \le 2C \sum_{i=1}^{n} w_i^2 + \|\hat{m}_P - m_P\|^2_{\mathcal{H}_{\mathcal{X}}} \|\theta\|_{\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}}. \qquad (5.5)$$

This bound shows that two quantities are important in the estimate, equation 5.1: (1) the sum of squared weights $\sum_{i=1}^{n} w_i^2$ and (2) the error $\|\hat{m}_P - m_P\|^2_{\mathcal{H}_{\mathcal{X}}}$. In other words, the error of equation 5.2 can be large if the quantity $\sum_{i=1}^{n} w_i^2$ is large, regardless of the accuracy of equation 5.1 as an estimator of $m_P$. In fact, the estimator of the form 5.1 can have large $\sum_{i=1}^{n} w_i^2$ even when $\|\hat{m}_P - m_P\|^2_{\mathcal{H}_{\mathcal{X}}}$ is small, as shown in section 6.1.

^5 To show this, it is sufficient to prove that $\int\!\!\int k_{\mathcal{X}}(x, \tilde{x})\, dP(x)\, dP(\tilde{x}) \le \int k_{\mathcal{X}}(x, x)\, dP(x)$ for any probability $P$. This can be shown as follows: $\int\!\!\int k_{\mathcal{X}}(x, \tilde{x})\, dP(x)\, dP(\tilde{x}) = \int\!\!\int \langle k_{\mathcal{X}}(\cdot, x), k_{\mathcal{X}}(\cdot, \tilde{x}) \rangle_{\mathcal{H}_{\mathcal{X}}}\, dP(x)\, dP(\tilde{x}) \le \int\!\!\int \sqrt{k_{\mathcal{X}}(x, x)} \sqrt{k_{\mathcal{X}}(\tilde{x}, \tilde{x})}\, dP(x)\, dP(\tilde{x}) \le \int k_{\mathcal{X}}(x, x)\, dP(x)$. Here we used the reproducing property, the Cauchy-Schwartz inequality, and Jensen's inequality.


Figure 3: An illustration of the sampling procedure with (right) and without (left) the resampling algorithm. The left panel corresponds to the kernel mean estimators, equations 5.1 and 5.2, in section 5.1, and the right panel corresponds to equations 5.6 and 5.7 in section 5.2.

The inverse of the sum of the squared weights, $1 / \sum_{i=1}^{n} w_i^2$, can be interpreted as the effective sample size (ESS) of the empirical kernel mean, equation 5.1. To explain this, suppose that the weights are normalized: $\sum_{i=1}^{n} w_i = 1$. Then the ESS takes its maximum $n$ when the weights are uniform, $w_1 = \cdots = w_n = 1/n$. It becomes small when only a few samples have large weights (see the left side in Figure 3). Therefore, the bound, equation 5.5, can be interpreted as follows: to make equation 5.2 a good estimator of $m_Q$, we need to have equation 5.1 such that the ESS is large and the error $\|\hat{m}_P - m_P\|_{\mathcal{H}}$ is small. Here we borrowed the notion of ESS from the literature on particle methods, in which ESS has also played an important role (see section 2.5.3 of Liu, 2001, and section 3.5 of Doucet & Johansen, 2011).

5.2 Role of Resampling. Based on these arguments, we explain how the resampling step in section 4.2 works as a preprocessing step for the sampling procedure. Consider $\hat{m}_P$ in equation 5.1 as an estimate, equation 4.6, given by the correction step at time $t - 1$. Then we can think of $\hat{m}_Q$, equation 5.2, as an estimator of the kernel mean, equation 4.5, of the prior without the resampling step.

The resampling step is the application of kernel herding to $\hat{m}_P$ to obtain samples $\bar{X}_1, \dots, \bar{X}_n$, which provide a new estimate of $m_P$ with uniform weights:
$$\bar{m}_P = \frac{1}{n} \sum_{i=1}^{n} k_{\mathcal{X}}(\cdot, \bar{X}_i). \qquad (5.6)$$
The subsequent prediction step is to generate a sample $X'_i \sim p(\cdot|\bar{X}_i)$ for each $\bar{X}_i$ ($i = 1, \dots, n$) and estimate $m_Q$ as
$$\bar{m}_Q = \frac{1}{n} \sum_{i=1}^{n} k_{\mathcal{X}}(\cdot, X'_i). \qquad (5.7)$$

Theorem 1 gives the following bound for this estimator, which corresponds to equation 5.5:
$$\mathbb{E}_{X'_1, \dots, X'_n}\left[ \|\bar{m}_Q - m_Q\|^2_{\mathcal{H}_{\mathcal{X}}} \right] \le \frac{2C}{n} + \|\bar{m}_P - m_P\|^2_{\mathcal{H}_{\mathcal{X}}} \|\theta\|_{\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}}. \qquad (5.8)$$

A comparison of the upper bounds of equations 5.5 and 5.8 implies that the resampling step is beneficial when $\sum_{i=1}^{n} w_i^2$ is large (i.e., the ESS is small) and $\|\bar{m}_P - \hat{m}_P\|_{\mathcal{H}_{\mathcal{X}}}$ is small. The condition on $\|\bar{m}_P - \hat{m}_P\|_{\mathcal{H}_{\mathcal{X}}}$ means that the loss by kernel herding (in terms of the RKHS distance) is small. This implies $\|\bar{m}_P - m_P\|_{\mathcal{H}_{\mathcal{X}}} \approx \|\hat{m}_P - m_P\|_{\mathcal{H}_{\mathcal{X}}}$, so the second term of equation 5.8 is close to that of equation 5.5. On the other hand, the first term of equation 5.8 will be much smaller than that of equation 5.5 if $\sum_{i=1}^{n} w_i^2 \gg 1/n$. In other words, the resampling step improves the accuracy of the sampling procedure by increasing the ESS of the kernel mean estimate $\hat{m}_P$. This is illustrated in Figure 3.

The above observations lead to the following procedures.

5.2.1 When to Apply Resampling. If $\sum_{i=1}^{n} w_i^2$ is not large, the gain by the resampling step will be small. Therefore, the resampling algorithm should be applied when $\sum_{i=1}^{n} w_i^2$ is above a certain threshold, say $2/n$. The same strategy has been commonly used in particle methods (see Doucet & Johansen, 2011).

Also, the bound, equation 5.3, of theorem 1 shows that resampling is not beneficial if the variance of the conditional distribution $p(\cdot|x)$ is very small (i.e., if the state transition is nearly deterministic). In this case, the error of the sampling procedure may increase due to the loss $\|\bar{m}_P - \hat{m}_P\|_{\mathcal{H}_{\mathcal{X}}}$ caused by kernel herding.
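In code, this adaptive rule reduces to monitoring the effective sample size of the current weight vector. The sketch below is our own illustration, with the threshold $2/n$ mentioned above as a default.

```python
import numpy as np

def effective_sample_size(w):
    # ESS = 1 / sum_i w_i^2 for normalized weights (sum_i w_i = 1).
    return 1.0 / np.sum(w ** 2)

def should_resample(w, threshold_factor=2.0):
    # Trigger resampling when sum_i w_i^2 exceeds threshold_factor / n,
    # i.e., when the ESS drops below n / threshold_factor.
    n = w.size
    return np.sum(w ** 2) > threshold_factor / n

w = np.full(100, 1.0 / 100)     # uniform weights: ESS = n, so no resampling is needed
print(effective_sample_size(w), should_resample(w))
```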

5.2.2 Reduction of Computational Cost. Algorithm 2 generates $n$ samples $\bar{X}_1, \dots, \bar{X}_n$ with time complexity $O(n^3)$. Suppose that the first $\ell$ samples $\bar{X}_1, \dots, \bar{X}_{\ell}$, where $\ell < n$, already approximate $\hat{m}_P$ well: $\| \frac{1}{\ell} \sum_{i=1}^{\ell} k_{\mathcal{X}}(\cdot, \bar{X}_i) - \hat{m}_P \|_{\mathcal{H}_{\mathcal{X}}}$ is small. We do not then need to generate the rest of the samples $\bar{X}_{\ell+1}, \dots, \bar{X}_n$: we can make $n$ samples by copying the $\ell$ samples $n/\ell$ times (suppose $n$ can be divided by $\ell$ for simplicity, say $n = 2\ell$). Let $\bar{X}_1, \dots, \bar{X}_n$ denote these $n$ samples (each of the first $\ell$ samples repeated $n/\ell$ times). Then $\frac{1}{\ell} \sum_{i=1}^{\ell} k_{\mathcal{X}}(\cdot, \bar{X}_i) = \frac{1}{n} \sum_{i=1}^{n} k_{\mathcal{X}}(\cdot, \bar{X}_i)$ by definition, so $\| \frac{1}{n} \sum_{i=1}^{n} k_{\mathcal{X}}(\cdot, \bar{X}_i) - \hat{m}_P \|_{\mathcal{H}_{\mathcal{X}}}$ is also small. This reduces the time complexity of algorithm 2 to $O(n^2)$.

One might think that it is unnecessary to copy the samples $n/\ell$ times to make $n$ samples. This is not true, however. Suppose that we just use the first $\ell$ samples to define $\bar{m}_P = \frac{1}{\ell} \sum_{i=1}^{\ell} k_{\mathcal{X}}(\cdot, \bar{X}_i)$. Then the first term of equation 5.8 becomes $2C/\ell$, which is larger than $2C/n$ of $n$ samples. This difference involves sampling with the conditional distribution, $X'_i \sim p(\cdot|\bar{X}_i)$. If we use just the $\ell$ samples, sampling is done $\ell$ times. If we use the copied $n$ samples, sampling is done $n$ times. Thus, the benefit of making $n$ samples comes from sampling with the conditional distribution many times. This matches the bound of theorem 1, where the first term involves the variance of the conditional distribution.

5.3 Convergence Rates for Resampling. Our resampling algorithm (see algorithm 2) is an approximate version of kernel herding in section 3.5: algorithm 2 searches for the solutions of the update equations 3.6 and 3.7 from a finite set $\{X_1, \dots, X_n\} \subset \mathcal{X}$, not from the entire space $\mathcal{X}$. Therefore, existing theoretical guarantees for kernel herding (Chen et al., 2010; Bach et al., 2012) do not apply to algorithm 2. Here we provide a theoretical justification.

5.3.1 Generalized Version. We consider a slightly generalized version, shown in algorithm 4. It takes as input (1) a kernel mean estimator $\hat{m}_P$ of a kernel mean $m_P$, (2) candidate samples $Z_1, \dots, Z_N$, and (3) the number $\ell$ of resampling samples. It then outputs resampling samples $\bar{X}_1, \dots, \bar{X}_{\ell} \in \{Z_1, \dots, Z_N\}$, which form a new estimator $\bar{m}_P = \frac{1}{\ell} \sum_{i=1}^{\ell} k_{\mathcal{X}}(\cdot, \bar{X}_i)$. Here $N$ is the number of the candidate samples.

Algorithm 4 searches for solutions of the update equations 3.6 and 3.7 from the candidate set $\{Z_1, \dots, Z_N\}$. Note that here these samples $Z_1, \dots, Z_N$ can be different from those expressing the estimator $\hat{m}_P$. If they are the same (the estimator is expressed as $\hat{m}_P = \sum_{i=1}^{n} w_{t,i} k(\cdot, X_i)$ with $n = N$ and $X_i = Z_i$ for $i = 1, \dots, n$), then algorithm 4 reduces to algorithm 2. In fact, theorem 2 allows $\hat{m}_P$ to be any element in the RKHS.

5.3.2 Convergence Rates in Terms of N and ℓ. Algorithm 4 gives the new estimator $\bar{m}_P$ of the kernel mean $m_P$. The error of this new estimator, $\|\bar{m}_P - m_P\|_{\mathcal{H}_{\mathcal{X}}}$, should be close to that of the given estimator, $\|\hat{m}_P - m_P\|_{\mathcal{H}_{\mathcal{X}}}$. Theorem 2 guarantees this. In particular, it provides convergence rates of $\|\bar{m}_P - m_P\|_{\mathcal{H}_{\mathcal{X}}}$ approaching $\|\hat{m}_P - m_P\|_{\mathcal{H}_{\mathcal{X}}}$ as $N$ and $\ell$ go to infinity. This theorem follows from theorem 3 in appendix B, which holds under weaker assumptions.

Theorem 2. Let $m_P$ be the kernel mean of a distribution $P$ and $\hat{m}_P$ be any element in the RKHS $\mathcal{H}_{\mathcal{X}}$. Let $Z_1, \dots, Z_N$ be an i.i.d. sample from a distribution with density $q$. Assume that $P$ has a density function $p$ such that $\sup_{x \in \mathcal{X}} p(x)/q(x) < \infty$. Let $\bar{X}_1, \dots, \bar{X}_{\ell}$ be samples given by algorithm 4 applied to $\hat{m}_P$ with candidate samples $Z_1, \dots, Z_N$. Then for $\bar{m}_P = \frac{1}{\ell} \sum_{i=1}^{\ell} k(\cdot, \bar{X}_i)$, we have
$$\|\bar{m}_P - m_P\|^2_{\mathcal{H}_{\mathcal{X}}} = \left( \|\hat{m}_P - m_P\|_{\mathcal{H}_{\mathcal{X}}} + O_p(N^{-1/2}) \right)^2 + O\!\left( \frac{\ln \ell}{\ell} \right) \quad (N, \ell \to \infty). \qquad (5.9)$$

Our proof in appendix B relies on the fact that kernel herding can be seen as the Frank-Wolfe optimization method (Bach et al., 2012). Indeed, the error $O(\ln \ell / \ell)$ in equation 5.9 comes from the optimization error of the Frank-Wolfe method after $\ell$ iterations (Freund & Grigas, 2014, bound 3.2). The error $O_p(N^{-1/2})$ is due to the approximation of the solution space by a finite set $\{Z_1, \dots, Z_N\}$. These errors will be small if $N$ and $\ell$ are large enough and the error of the given estimator, $\|\hat{m}_P - m_P\|_{\mathcal{H}_{\mathcal{X}}}$, is relatively large. This is formally stated in corollary 1 below.

Theorem 2 assumes that the candidate samples are i.i.d. with a density $q$. The assumption $\sup_{x \in \mathcal{X}} p(x)/q(x) < \infty$ requires that the support of $q$ contain that of $p$. This is a formal characterization of the explanation in section 4.2 that the samples $X_1, \dots, X_N$ should cover the support of $P$ sufficiently. Note that the statement of theorem 2 also holds for non-i.i.d. candidate samples, as shown in theorem 3 of appendix B.

5.3.3 Convergence Rates as $\hat{m}_P$ Goes to $m_P$. Theorem 2 provides convergence rates when the estimator $\hat{m}_P$ is fixed. In corollary 1 below, we let $\hat{m}_P$ approach $m_P$ and provide convergence rates for $\bar{m}_P$ of algorithm 4 approaching $m_P$. This corollary directly follows from theorem 2, since the constant terms in $O_p(N^{-1/2})$ and $O(\ln \ell / \ell)$ in equation 5.9 do not depend on $\hat{m}_P$, which can be seen from the proof in section B.

Corollary 1. Assume that $P$ and $Z_1, \dots, Z_N$ satisfy the conditions in theorem 2 for all $N$. Let $\hat{m}^{(n)}_P$ be an estimator of $m_P$ such that $\|\hat{m}^{(n)}_P - m_P\|_{\mathcal{H}_{\mathcal{X}}} = O_p(n^{-b})$ as $n \to \infty$ for some constant $b > 0$.^6 Let $N = \ell = n^{2b}$. Let $\bar{X}^{(n)}_1, \dots, \bar{X}^{(n)}_{\ell}$ be samples given by algorithm 4 applied to $\hat{m}^{(n)}_P$ with candidate samples $Z_1, \dots, Z_N$. Then for $\bar{m}^{(n)}_P = \frac{1}{\ell} \sum_{i=1}^{\ell} k_{\mathcal{X}}(\cdot, \bar{X}^{(n)}_i)$, we have
$$\|\bar{m}^{(n)}_P - m_P\|_{\mathcal{H}_{\mathcal{X}}} = O_p(n^{-b}) \quad (n \to \infty). \qquad (5.10)$$

^6 Here the estimator $\hat{m}^{(n)}_P$ and the candidate samples $Z_1, \dots, Z_N$ can be dependent.

Corollary 1 assumes that the estimator $\hat{m}^{(n)}_P$ converges to $m_P$ at a rate $O_p(n^{-b})$ for some constant $b > 0$. Then the resulting estimator $\bar{m}^{(n)}_P$ by algorithm 4 also converges to $m_P$ at the same rate $O(n^{-b})$ if we set $N = \ell = n^{2b}$. This implies that if we use sufficiently large $N$ and $\ell$, the errors $O_p(N^{-1/2})$ and $O(\ln \ell / \ell)$ in equation 5.9 can be negligible, as stated earlier. Note that $N = \ell = n^{2b}$ implies that $N$ and $\ell$ can be smaller than $n$, since typically we have $b \le 1/2$ ($b = 1/2$ corresponds to the convergence rates of parametric models). This provides support for the discussion in section 5.2 (reduction of computational cost).

5.3.4 Convergence Rates of Sampling after Resampling. We can derive convergence rates of the estimator $\bar{m}_Q$, equation 5.7, in section 5.2. Here we consider the following construction of $\bar{m}_Q$, as discussed in section 5.2 (reduction of computational cost). First, apply algorithm 4 to $\hat{m}^{(n)}_P$ and obtain resampling samples $\bar{X}^{(n)}_1, \dots, \bar{X}^{(n)}_{\ell} \in \{Z_1, \dots, Z_N\}$. Then copy these samples $n/\ell$ times, and let $\bar{X}^{(n)}_1, \dots, \bar{X}^{(n)}_n$ be the resulting $n$ samples. Finally, sample with the conditional distribution, $X'^{(n)}_i \sim p(\cdot|\bar{X}^{(n)}_i)$ ($i = 1, \dots, n$), and define
$$\hat{m}^{(n)}_Q = \frac{1}{n} \sum_{i=1}^{n} k_{\mathcal{X}}(\cdot, X'^{(n)}_i). \qquad (5.11)$$

The following corollary is a consequence of corollary 1, theorem 1, and the bound, equation 5.8. Note that theorem 1 obtains convergence in expectation, which implies convergence in probability.

Corollary 2. Let $\theta$ be the function defined in theorem 1, and assume $\theta \in \mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}$. Assume that $P$ and $Z_1, \dots, Z_N$ satisfy the conditions in theorem 2 for all $N$. Let $\hat{m}^{(n)}_P$ be an estimator of $m_P$ such that $\|\hat{m}^{(n)}_P - m_P\|_{\mathcal{H}_{\mathcal{X}}} = O_p(n^{-b})$ as $n \to \infty$ for some constant $b > 0$. Let $N = \ell = n^{2b}$. Then for the estimator $\hat{m}^{(n)}_Q$ defined as equation 5.11, we have
$$\|\hat{m}^{(n)}_Q - m_Q\|_{\mathcal{H}_{\mathcal{X}}} = O_p(n^{-\min(b, 1/2)}) \quad (n \to \infty).$$

Suppose $b \le 1/2$, which holds for basically any nonparametric estimator. Then corollary 2 shows that the estimator $\hat{m}^{(n)}_Q$ achieves the same convergence rate as the input estimator $\hat{m}^{(n)}_P$. Note that without resampling, the rate becomes $O_p\big( \sqrt{\sum_{i=1}^{n} (w^{(n)}_i)^2} + n^{-b} \big)$, where the weights are given by the input estimator $\hat{m}^{(n)}_P = \sum_{i=1}^{n} w^{(n)}_i k_{\mathcal{X}}(\cdot, X^{(n)}_i)$ (see the bound, equation 5.5). Thanks to resampling, (the square root of) the sum of the squared weights in the case of corollary 2 becomes $1/\sqrt{n} \le 1/\sqrt{\ell}$, which is usually smaller than $\sqrt{\sum_{i=1}^{n} (w^{(n)}_i)^2}$ and is faster than or equal to $O_p(n^{-b})$. This shows the merit of resampling in terms of convergence rates (see also the discussions in section 5.2).

5.4 Consistency of the Overall Procedure. Here we show the consistency of the overall procedure in KMCF. This is based on corollary 2, which shows the consistency of the resampling step followed by the prediction step, and on theorem 5 of Fukumizu et al. (2013), which guarantees the consistency of kernel Bayes' rule in the correction step. Thus, we consider three steps in the following order: resampling, prediction, and correction. More specifically, we show the consistency of the estimator, equation 4.6, of the posterior kernel mean at time $t$, given that the one at time $t - 1$ is consistent.

To state our assumptions, we will need the following functions, $\theta_{\mathrm{pos}} : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, $\theta_{\mathrm{obs}} : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, and $\theta_{\mathrm{tra}} : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$:
$$\theta_{\mathrm{pos}}(y, \tilde{y}) = \int\!\!\int k_{\mathcal{X}}(x_t, \tilde{x}_t)\, dp(x_t|y_{1:t-1}, y_t = y)\, dp(\tilde{x}_t|y_{1:t-1}, y_t = \tilde{y}), \qquad (5.12)$$
$$\theta_{\mathrm{obs}}(x, \tilde{x}) = \int\!\!\int k_{\mathcal{Y}}(y_t, \tilde{y}_t)\, dp(y_t|x_t = x)\, dp(\tilde{y}_t|x_t = \tilde{x}), \qquad (5.13)$$
$$\theta_{\mathrm{tra}}(x, \tilde{x}) = \int\!\!\int k_{\mathcal{X}}(x_t, \tilde{x}_t)\, dp(x_t|x_{t-1} = x)\, dp(\tilde{x}_t|x_{t-1} = \tilde{x}). \qquad (5.14)$$
These functions contain the information concerning the distributions involved. In equation 5.12, the distribution $p(x_t|y_{1:t-1}, y_t = y)$ denotes the posterior of the state at time $t$, given that the observation at time $t$ is $y_t = y$; similarly, $p(\tilde{x}_t|y_{1:t-1}, y_t = \tilde{y})$ is the posterior at time $t$ given that the observation is $y_t = \tilde{y}$. In equation 5.13, the distributions $p(y_t|x_t = x)$ and $p(\tilde{y}_t|x_t = \tilde{x})$ denote the observation model when the state is $x_t = x$ or $x_t = \tilde{x}$, respectively. In equation 5.14, the distributions $p(x_t|x_{t-1} = x)$ and $p(\tilde{x}_t|x_{t-1} = \tilde{x})$ denote the transition model with the previous state given by $x_{t-1} = x$ or $x_{t-1} = \tilde{x}$, respectively.

For simplicity of presentation, we consider here $N = \ell = n$ for the resampling step. Below, denote by $\mathcal{F} \otimes \mathcal{G}$ the tensor product space of two RKHSs $\mathcal{F}$ and $\mathcal{G}$.

Corollary 3. Let $(X_1, Y_1), \dots, (X_n, Y_n)$ be an i.i.d. sample with a joint density $p(x, y) = p(y|x)q(x)$, where $p(y|x)$ is the observation model. Assume that the posterior $p(x_t|y_{1:t})$ has a density $p$ and that $\sup_{x \in \mathcal{X}} p(x)/q(x) < \infty$. Assume that the functions defined by equations 5.12 to 5.14 satisfy $\theta_{\mathrm{pos}} \in \mathcal{H}_{\mathcal{Y}} \otimes \mathcal{H}_{\mathcal{Y}}$, $\theta_{\mathrm{obs}} \in \mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}$, and $\theta_{\mathrm{tra}} \in \mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}$, respectively. Suppose that $\|\hat{m}_{x_{t-1}|y_{1:t-1}} - m_{x_{t-1}|y_{1:t-1}}\|_{\mathcal{H}_{\mathcal{X}}} \to 0$ as $n \to \infty$ in probability. Then for any sufficiently slow decay of regularization constants $\varepsilon_n$ and $\delta_n$ of algorithm 1, we have
$$\|\hat{m}_{x_t|y_{1:t}} - m_{x_t|y_{1:t}}\|_{\mathcal{H}_{\mathcal{X}}} \to 0 \quad (n \to \infty)$$
in probability.

Corollary 3 follows from theorem 5 of Fukumizu et al. (2013) and corollary 2. The assumptions $\theta_{\mathrm{pos}} \in \mathcal{H}_{\mathcal{Y}} \otimes \mathcal{H}_{\mathcal{Y}}$ and $\theta_{\mathrm{obs}} \in \mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}$ are due to theorem 5 of Fukumizu et al. (2013) for the correction step, while the assumption $\theta_{\mathrm{tra}} \in \mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}$ is due to theorem 1 for the prediction step, from which corollary 2 follows. As we discussed in note 4 of section 5.1, these essentially assume that the functions $\theta_{\mathrm{pos}}$, $\theta_{\mathrm{obs}}$, and $\theta_{\mathrm{tra}}$ are smooth. Theorem 5 of Fukumizu et al. (2013) also requires that the regularization constants $\varepsilon_n, \delta_n$ of kernel Bayes' rule decay sufficiently slowly as the sample size goes to infinity ($\varepsilon_n, \delta_n \to 0$ as $n \to \infty$). (For details, see sections 5.2 and 6.2 in Fukumizu et al., 2013.)

It would be more interesting to investigate the convergence rates of the overall procedure. However, this requires a refined theoretical analysis of kernel Bayes' rule, which is beyond the scope of this letter. This is because currently there is no theoretical result on convergence rates of kernel Bayes' rule as an estimator of a posterior kernel mean (existing convergence results are for the expectation of function values; see theorems 6 and 7 in Fukumizu et al., 2013). This remains a topic for future research.

6 Experiments

This section is devoted to experiments. In section 6.1, we conduct basic experiments on the prediction and resampling steps before going on to the filtering problem; here we consider the problem described in section 5. In section 6.2, the proposed KMCF (see algorithm 3) is applied to synthetic state-space models. Comparisons are made with existing methods applicable to the setting of the letter (see also section 2). In section 6.3, we apply KMCF to the real problem of vision-based robot localization.

In the following, $N(\mu, \sigma^2)$ denotes the gaussian distribution with mean $\mu \in \mathbb{R}$ and variance $\sigma^2 > 0$.

6.1 Sampling and Resampling Procedures. The purpose here is to see how the prediction and resampling steps work empirically. To this end, we consider the problem described in section 5 with $\mathcal{X} = \mathbb{R}$ (see section 5.1 for details). Specifications of the problem are described below.


We will need to evaluate the errors $\|\hat{m}_P - m_P\|_{\mathcal{H}_{\mathcal{X}}}$ and $\|\hat{m}_Q - m_Q\|_{\mathcal{H}_{\mathcal{X}}}$, so we need to know the true kernel means $m_P$ and $m_Q$. To this end, we define the distributions and the kernel to be gaussian; this allows us to obtain analytic expressions for $m_P$ and $m_Q$.

6.1.1 Distributions and Kernel. More specifically, we define the marginal $P$ and the conditional distribution $p(\cdot|x)$ to be gaussian: $P = N(0, \sigma_P^2)$ and $p(\cdot|x) = N(x, \sigma_{\mathrm{cond}}^2)$. Then the resulting $Q = \int p(\cdot|x)\, dP(x)$ also becomes gaussian: $Q = N(0, \sigma_P^2 + \sigma_{\mathrm{cond}}^2)$. We define $k_{\mathcal{X}}$ to be the gaussian kernel $k_{\mathcal{X}}(x, x') = \exp(-(x - x')^2 / 2\gamma^2)$. We set $\sigma_P = \sigma_{\mathrm{cond}} = \gamma = 0.1$.

6.1.2 Kernel Means. Due to the convolution theorem of gaussian functions, the kernel means $m_P = \int k_{\mathcal{X}}(\cdot, x)\, dP(x)$ and $m_Q = \int k_{\mathcal{X}}(\cdot, x)\, dQ(x)$ can be analytically computed:
$$m_P(x) = \sqrt{\frac{\gamma^2}{\sigma_P^2 + \gamma^2}} \exp\left( -\frac{x^2}{2(\gamma^2 + \sigma_P^2)} \right), \qquad m_Q(x) = \sqrt{\frac{\gamma^2}{\sigma_P^2 + \sigma_{\mathrm{cond}}^2 + \gamma^2}} \exp\left( -\frac{x^2}{2(\sigma_P^2 + \sigma_{\mathrm{cond}}^2 + \gamma^2)} \right).$$

6.1.3 Empirical Estimates. We artificially defined an estimate $\hat{m}_P = \sum_{i=1}^{n} w_i k_{\mathcal{X}}(\cdot, X_i)$ as follows. First, we generated $n = 100$ samples $X_1, \dots, X_{100}$ from a uniform distribution on $[-A, A]$ with some $A > 0$ (specified below). We computed the weights $w_1, \dots, w_n$ by solving an optimization problem,
$$\min_{w \in \mathbb{R}^n} \left\| \sum_{i=1}^{n} w_i k_{\mathcal{X}}(\cdot, X_i) - m_P \right\|^2_{\mathcal{H}} + \lambda \|w\|^2,$$
and then applied normalization so that $\sum_{i=1}^{n} w_i = 1$. Here $\lambda > 0$ is a regularization constant, which allows us to control the trade-off between the error $\|\hat{m}_P - m_P\|^2_{\mathcal{H}_{\mathcal{X}}}$ and the quantity $\sum_{i=1}^{n} w_i^2 = \|w\|^2$. If $\lambda$ is very small, the resulting $\hat{m}_P$ becomes accurate ($\|\hat{m}_P - m_P\|^2_{\mathcal{H}_{\mathcal{X}}}$ is small) but has large $\sum_{i=1}^{n} w_i^2$. If $\lambda$ is large, the error $\|\hat{m}_P - m_P\|^2_{\mathcal{H}_{\mathcal{X}}}$ may not be very small, but $\sum_{i=1}^{n} w_i^2$ becomes small. This enables us to see how the error $\|\hat{m}_Q - m_Q\|^2_{\mathcal{H}_{\mathcal{X}}}$ changes as we vary these quantities.
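This construction is easy to reproduce because the optimization has a closed-form solution: setting the gradient to zero gives $(G_X + \lambda I) w = z$ with $z_i = m_P(X_i)$, and for gaussian $P$ and a gaussian kernel both $z$ and $\|m_P\|^2_{\mathcal{H}}$ are available in closed form. The sketch below is our own reproduction of this setup under those formulas, not the authors' code.

```python
import numpy as np

sigma_P = gamma = 0.1
n, A, lam = 100, 1.0, 1e-6

rng = np.random.default_rng(0)
X = rng.uniform(-A, A, size=n)

def kern(a, b):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * gamma ** 2))

def m_P(x):
    # Analytic kernel mean of P = N(0, sigma_P^2) under the gaussian kernel.
    return np.sqrt(gamma ** 2 / (gamma ** 2 + sigma_P ** 2)) * \
        np.exp(-x ** 2 / (2 * (gamma ** 2 + sigma_P ** 2)))

# Ridge solution of min_w ||sum_i w_i k(., X_i) - m_P||_H^2 + lam ||w||^2, then normalization.
G = kern(X, X)
z = m_P(X)
w = np.linalg.solve(G + lam * np.eye(n), z)
w = w / w.sum()

# Squared RKHS error ||m_P_hat - m_P||_H^2 = w^T G w - 2 w^T z + ||m_P||_H^2,
# with ||m_P||_H^2 = sqrt(gamma^2 / (gamma^2 + 2 sigma_P^2)).
norm_mP_sq = np.sqrt(gamma ** 2 / (gamma ** 2 + 2 * sigma_P ** 2))
err_sq = w @ G @ w - 2 * w @ z + norm_mP_sq
print(err_sq, np.sum(w ** 2))   # error versus sum of squared weights trade-off
```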

6.1.4 Comparison. Given $\hat{m}_P = \sum_{i=1}^{n} w_i k_{\mathcal{X}}(\cdot, X_i)$, we wish to estimate the kernel mean $m_Q$. We compare three estimators:

- woRes: Estimate $m_Q$ without resampling. Generate samples $X'_i \sim p(\cdot|X_i)$ to produce the estimate $\hat{m}_Q = \sum_{i=1}^{n} w_i k_{\mathcal{X}}(\cdot, X'_i)$. This corresponds to the estimator discussed in section 5.1.
- Res-KH: First apply the resampling algorithm of algorithm 2 to $\hat{m}_P$, yielding $\bar{X}_1, \dots, \bar{X}_n$. Then generate $X'_i \sim p(\cdot|\bar{X}_i)$ for each $\bar{X}_i$, giving the estimate $\bar{m}_Q = \frac{1}{n} \sum_{i=1}^{n} k(\cdot, X'_i)$. This is the estimator discussed in section 5.2.
- Res-Trunc: Instead of algorithm 2, first truncate the negative weights in $w_1, \dots, w_n$ to be 0, and apply normalization to make the sum of the weights 1. Then apply the multinomial resampling algorithm of particle methods and estimate $m_Q$ as in Res-KH.

Figure 4: Results of the experiments from section 6.1. (Top left and right) Sample-weight pairs of $\hat{m}_P = \sum_{i=1}^{n} w_i k_{\mathcal{X}}(\cdot, X_i)$ and $\hat{m}_Q = \sum_{i=1}^{n} w_i k(\cdot, X'_i)$. (Middle left and right) Histogram of samples $\bar{X}_1, \dots, \bar{X}_n$ generated by algorithm 2 and that of samples $X'_1, \dots, X'_n$ from the conditional distribution. (Bottom left and right) Histogram of samples generated with multinomial resampling after truncating negative weights, and that of samples from the conditional distribution.

the estimate mQ = 1n

sumni=1 k(middot X prime

i ) This is the estimator discussed insection 52

bull Res-Trunc Instead of algorithm 2 first truncate negative weights inw1 wn to be 0 and apply normalization to make the sum of theweights to be 1 Then apply the multinomial resampling algorithmof particle methods and estimate mQ as Res-KH

615 Demonstration Before starting quantitative comparisons wedemonstrate how the above estimators work Figure 4 shows demonstra-tion results with A = 1 First note that for mP = sumn

i=1 wik(middot Xi) samplesassociated with large weights are located around the mean of P as thestandard deviation of P is relatively small σP = 01 Note also that some

414 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

of the weights are negative In this example the error of mP is very smallmP minus mP2

HX= 849e minus 10 while that of the estimate mQ given by woRes is

mQ minus mQ2HX

= 0125 This shows that even if mP minus mP2HX

is very small

the resulting mQ minus mQ2HX

may not be small as implied by theorem 1 andthe bound equation 55

We can observe the following First algorithm 2 successfully discardedsamples associated with very small weights Almost all the generatedsamples X1 Xn are located in [minus2σP 2σP] where σP is the standarddeviation of P The error is mP minus mP2

HX= 474e minus 5 which is greater

than mP minus mP2HX

This is due to the additional error caused by the re-sampling algorithm Note that the resulting estimate mQ is of the errormQ minus mQ2

HX= 000827 This is much smaller than the estimate mQ by

woRes showing the merit of the resampling algorithmRes-Trunc first truncated the negative weights in w1 wn Let us

see the region where the density of P is very smallmdashthe region outside[minus2σP 2σP] We can observe that the absolute values of weights are verysmall in this region Note that there exist positive and negative weightsThese weights maintain balance such that the amounts of positive and neg-ative values are almost the same Therefore the truncation of the negativeweights breaks this balance As a result the amount of the positive weightssurpasses the amount needed to represent the density of P This can be seenfrom the histogram for Res-Trunc some of the samples X1 Xn gener-ated by Res-Trunc are located in the region where the density of P is verysmall Thus the resulting error mP minus mP2

HX= 00538 is much larger than

that of Res-KH This demonstrates why the resampling algorithm of parti-cle methods is not appropriate for kernel mean embeddings as discussedin section 42

616 Effects of the Sum of Squared Weights The purpose here is to see howthe error mQ minus mQ2

HXchanges as we vary the quantity

sumni=1 w2

i (recall that

the bound equation 55 indicates that mQ minus mQ2HX

increases assumn

i=1 w2i

increases) To this end we made mP = sumni=1 wikX (middot Xi) for several values of

the regularization constant λ as described above For each λ we constructedmP and estimated mQ using each of the three estimators above We repeatedthis 20 times for each λ and averaged the values of mP minus mP2

HXsumn

i=1 w2i

and the errors mQ minus mQ2HX

by the three estimators Figure 5 shows theseresults where the both axes are in the log scale Here we used A = 5 forthe support of the uniform distribution7 The results are summarized asfollows

7This enables us to maintain the values for mP minus mP2HX

in almost the same amountwhile changing the values for

sumni=1 w2

i


Figure 5: Results of synthetic experiments for the sampling and resampling procedure in section 6.1. Vertical axis: errors in the squared RKHS norm. Horizontal axis: values of $\sum_{i=1}^{n} w_i^2$ for different $\hat{m}_P$. Black: the error of $\hat{m}_P$ ($\|\hat{m}_P - m_P\|^2_{\mathcal{H}_{\mathcal{X}}}$). Blue, green, and red: the errors on $m_Q$ by woRes, Res-KH, and Res-Trunc, respectively.

bull The error of woRes (blue) increases proportionally to the amount ofsumni=1 w2

i This matches the bound equation 55bull The error of Res-KH is not affected by

sumni=1 w2

i Rather it changes inparallel with the error of mP This is explained by the discussions insection 52 on how our resampling algorithm improves the accuracyof the sampling procedure

bull Res-Trunc is worse than Res-KH especially for largesumn

i=1 w2i This

is also explained with the bound equation 58 Here mP is the onegiven by Res-Trunc so the error mP minus mPHX

can be large due tothe truncation of negative weights as shown in the demonstrationresults This makes the resulting error mQ minus mQHX

large

Note that mP and mQ are different kernel means so it can happen thatthe errors mQ minus mQHX

by Res-KH are less than mp minus mPHX as in

Figure 5

416 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

62 Filtering with Synthetic State-Space Models Here we applyKMCF to synthetic state-space models Comparisons were made with thefollowing methods

kNN-PF (Vlassis et al 2002) This method uses k-NN-based condi-tional density estimation (Stone 1977) for learning the observation modelFirst it estimates the conditional density of the inverse direction p(x|y)

from the training sample (XiYi) The learned conditional density is thenused as an alternative for the likelihood p(yt |xt ) this is a heuristic todeal with high-dimensional yt Then it applies particle filter (PF) basedon the approximated observation model and the given transition modelp(xt |xtminus1)

GP-PF (Ferris et al 2006) This method learns p(yt |xt ) from (XiYi)with gaussian process (GP) regression Then particle filter is applied basedon the learned observation model and the transition model We used theopen source code for GP-regression in this experiment so comparison incomputational time is omitted for this method8

KBR filter (Fukumizu et al 2011 2013) This method is also based onkernel mean embeddings as is KMCF It applies kernel Bayesrsquo rule (KBR) inposterior estimation using the joint sample (XiYi) This method assumesthat there also exist training samples for the transition model Thus inthe following experiments we additionally drew training samples for thetransition model Fukumizu et al (2011 2013) showed that this methodoutperforms extended and unscented Kalman filters when a state-spacemodel has strong nonlinearity (in that experiment these Kalman filterswere given the full knowledge of a state-space model) We use this methodas a baseline

We used state-space models defined in Table 2 where ldquoSSMrdquo stands forldquostate-space modelrdquo In Table 2 ut denotes a control input at time t vtand wt denotes independent gaussian noise vt wt sim N(0 1) Wt denotes10-dimensional gaussian noise Wt sim N(0 I10) We generated each controlut randomly from the gaussian distribution N(0 1)

The state and observation spaces for SSMs 1a 1b 2a 2b 4a 4b aredefined as X = Y = R for SSMs 3a 3b X = RY = R

10 The models inSSMs 1a 2a 3a 4a and SSMs 1b 2b 3b 4b with the same number (eg1a and 1b) are almost the same the difference is whether ut exists in thetransition model Prior distributions for the initial state x1 for SSMs 1a 1b2a 2b 3a 3b are defined as pinit = N(0 1(1 minus 092)) and those for 4a4b are defined as a uniform distribution on [minus3 3]

SSMs 1a and 1b are linear gaussian models SSMs 2a and 2b are the so-called stochastic volatility models Their transition models are the same asthose of SSMs 1a and 1b The observation model has strong nonlinearityand the noise wt is multiplicative SSMs 3a and 3b are almost the same as

8httpwwwgaussianprocessorggpmlcodematlabdoc

Filtering with State-Observation Examples 417

Table 2 State-Space Models for Synthetic Experiments

SSM Transition Model Observation Model

1a xt = 09xtminus1 + vt yt = xt + wt

1b xt = 09xtminus1 + 1radic2(ut + vt ) yt = xt + wt

2a xt = 09xtminus1 + vt yt = 05 exp(xt2)wt

2b xt = 09xtminus1 + 1radic2(ut + vt ) yt = 05 exp(xt2)wt

3a xt = 09xtminus1 + vt yt = 05 exp(xt2)Wt

3b xt = 09xtminus1 + 1radic2(ut + vt ) yt = 05 exp(xt2)Wt

4a at = xtminus1 + radic2vt bt = xt + wt

xt =

at (if |at | le 3)

minus3 (otherwise)yt =

bt (if |bt | le 3)

bt minus 6bt|bt | (otherwise)

4b at = xtminus1 + ut + vt bt = xt + wt

xt =

at (if |at | le 3)

minus3 (otherwise)yt =

bt (if |bt | le 3)

bt minus 6bt|bt | (otherwise)

SSMs 2a and 2b The difference is that the observation yt is 10-dimensionalas Wt is 10-dimensional gaussian noise SSMs 4a and 4b are more complexthan the other models Both the transition and observation models havestrong nonlinearities states and observations located around the edges ofthe interval [minus3 3] may abruptly jump to distant places

For each model we generated the training samples (XiYi)ni=1 by sim-

ulating the model Test data (xt yt )Tt=1 were also generated by indepen-

dent simulation (recall that xt is hidden for each method) The length ofthe test sequence was set as T = 100 We fixed the number of particles inkNN-PF and GP-PF to 5000 in primary experiments we did not observeany improvements even when more particles were used For the same rea-son we fixed the size of transition examples for KBR filter to 1000 Eachmethod estimated the ground-truth states x1 xT by estimating the pos-terior means

intxt p(xt |y1t )dxt (t = 1 T ) The performance was evaluated

with RMSE (root mean squared errors) of the point estimates defined as

RMSE =radic

1T

sumTt=1(xt minus xt )

2 where xt is the point estimateFor KMCF and KBR filter we used gaussian kernels for each of X and Y

(and also for controls in KBR filter) We determined the hyperparameters ofeach method by two-fold cross-validation by dividing the training data intotwo sequences The hyperparameters in the GP-regressor for PF-GP wereoptimized by maximizing the marginal likelihood of the training data Toreduce the costs of the resampling step of KMCF we used the method dis-cussed in section 52 with = 50 We also used the low-rank approximationmethod (see algorithm 5) and the subsampling method (see algorithm 6)in appendix C to reduce the computational costs of KMCF Specifically we

418 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Figure 6 RMSE of the synthetic experiments in section 62 The state-spacemodels of these figures have no control in their transition models

used r = 10 20 (rank of low-rank matrices) for algorithm 5 (described asKMCF-low10 and KMCF-low20 in the results below) r = 50 100 (numberof subsamples) for algorithm 6 (described as KMCF-sub50 and KMCF-sub100) We repeated experiments 20 times for each of different trainingsample size n

Figure 6 shows the results in RMSE for SSMs 1a 2a 3a 4a and Figure 7shows those for SSMs 1b 2b 3b 4b Figure 8 describes the results incomputational time for SSMs 1a and 1b the results for the other models aresimilar so we omit them We do not show the results of KMCF-low10 inFigures 6 and 7 since they were numerically unstable and gave very largeRMSEs

GP-PF performed the best for SSMs 1a and 1b This may be becausethese models fit the assumption of GP-regression as their noise is addi-tive gaussian For the other models however GP-PF performed poorly

Filtering with State-Observation Examples 419

Figure 7 RMSE of synthetic experiments in section 62 The state-space modelsof these figures include control ut in their transition models

Figure 8 Computation time of synthetic experiments in section 62 (Left) SSM1a (Right) SSM 1b

420 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

the observation models of these models have strong nonlinearities and thenoise is not additive gaussian For these models KMCF performed thebest or competitively with the other methods This indicates that KMCFsuccessfully exploits the state-observation examples (XiYi)n

i=1 in dealingwith the complicated observation models Recall that our focus has beenon situations where the relations between states and observations are socomplicated that the observation model is not known the results indicatethat KMCF is promising for such situations On the other hand the KBRfilter performed worse than KMCF for most of the models KBF filter alsouses kernel Bayesrsquo rule as KMCF The difference is that KMCF makes use ofthe transition models directly by sampling while the KBR filter must learnthe transition models from training data for state transitions This indicatesthat the incorporation of the knowledge expressed in the transition model isvery important for the filtering performance This can also be seen by com-paring Figures 6 and 7 The performance of the methods other than KBRfilter improved for SSMs 1b 2b 3b 4b compared to the performance forthe corresponding models in SSMs 1a 2a 3a 4a Recall that SSMs 1b2b 3b 4b include control ut in their transition models The informationof control input is helpful for filtering in general Thus the improvementssuggest that KMCF kNN-PF and GP-PF successfully incorporate the in-formation of controls they achieve this by sampling with p(xt |xtminus1 ut ) Onthe other hand KBF filter must learn the transition model p(xt |xtminus1 ut ) thiscan be harder than learning the transition model p(xt |xtminus1) which has nocontrol input

We next compare computation time (see Figure 8) KMCF was competi-tive or even slower than the KBR filter This is due to the resampling step inKMCF The speed-up methods (KMCF-low10 KMCF-low20 KMCF-sub50and KMCF-sub100) successfully reduced the costs of KMCF KMCF-low10and KMCF-low20 scaled linearly to the sample size n this matches the factthat algorithm 5 reduces the costs of kernel Bayesrsquo Rule to O(nr2) The costsof KMCF-sub50 and KMCF-sub100 remained almost the same over the dif-ferent sample sizes This is because they reduce the sample size itself fromn to r so the costs are reduced to O(r3) (see algorithm 6) KMCF-sub50 andKMCF-sub100 are competitive to kNN-PF which is fast as it only needskNN searches to deal with the training sample (XiYi)n

i=1 In Figures 6and 7 KMCF-low20 and KMCF-sub100 produced the results competitiveto KMCF for SSMs 1a 2a 4a 1b 2b 4b Thus for these models suchmethods reduce the computational costs of KMCF without losing muchaccuracy KMCF-sub50 was slightly worse than KMCF-100 This indicatesthat the number of subsamples cannot be reduced to this extent if we wishto maintain accuracy For SSMs 3a and 3b the performance of KMCF-low20and KMCF-sub100 was worse than KMCF in contrast to the performancefor the other models The difference of SSMs 3a and 3b from the other mod-els is that the observation space is 10-dimensional Y = R

10 This suggeststhat if the dimension is high r needs to be large to maintain accuracy (recall

Filtering with State-Observation Examples 421

that r is the rank of low-rank matrices in algorithm 5 and the number ofsubsamples in algorithm 6) This is also implied by the experiments in thesection 63

63 Vision-Based Mobile Robot Localization We applied KMCF to theproblem of vision-based mobile robot localization (Vlassis et al 2002 Wolfet al 2005 Quigley et al 2010) We consider a robot moving in a buildingThe robot takes images with its vision camera as it moves Thus the visionimages form a sequence of observations y1 yT in time series each ytis an image The robot does not know its positions in the building wedefine state xt as the robotrsquos position at time t The robot wishes to estimateits position xt from the sequence of its vision images y1 yt This canbe done by filtering that is by estimating the posteriors p(xt |y1 yt )

(t = 1 T ) This is the robot localization problem It is fundamental inrobotics as a basis for more involved applications such as navigation andreinforcement learning (Thrun Burgard amp Fox 2005)

The state-space model is defined as follows The observation modelp(yt |xt ) is the conditional distribution of images given position which isvery complicated and considered unknown We need to assume position-image examples (XiYi)n

i=1 these samples are given in the data setdescribed below The transition model p(xt |xtminus1) = p(xt |xtminus1 ut ) is the con-ditional distribution of the current position given the previous one Thisinvolves a control input ut that specifies the movement of the robot In thedata set we use the control is given as odometry measurements Thus wedefine p(xt |xtminus1 ut ) as the odometry motion model which is fairly standardin robotics (Thrun et al 2005) Specifically we used the algorithm describedin table 56 of Thrun et al (2005) with all of its parameters fixed to 01 Theprior pinit of the initial position x1 is defined as a uniform distribution overthe samples X1 Xn in (XiYi)n

i=1As a kernel kY for observations (images) we used the spatial pyramid

matching kernel of Lazebnik et al (2006) This is a positive-definite kerneldeveloped in the computer vision community and is also fairly standardSpecifically we set the parameters of this kernel as suggested in Lazebniket al (2006) This gives a 4200-dimensional histogram for each image Wedefined the kernel kX for states (positions) as gaussian Here the state spaceis the four-dimensional space X = R

4 two dimensions for location and therest for the orientation of the robot9

The data set we used is the COLD database (Pronobis amp Caputo 2009)which is publicly available Specifically we used the data set Freiburg PartA Path 1 cloudy This data set consists of three similar trajectories of arobot moving in a building each of which provides position-image pairs(xt yt )T

t=1 We used two trajectories for training and validation and the

9We projected the robotrsquos orientation in [0 2π ] onto the unit circle in R2

422 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

rest for test We made state-observation examples (XiYi)ni=1 by randomly

subsampling the pairs in the trajectory for training Note that the difficultyof localization may depend on the time interval (ie the interval between tand t minus 1 in sec) Therefore we made three test sets (and training samplesfor state transitions in KBR filter) with different time intervals 227 sec(T = 168) 454 sec (T = 84) and 681 sec (T = 56)

In these experiments we compared KMCF with three methods kNN-PFKBR filter and the naive method (NAI) defined below For KBR filter wealso defined the gaussian kernel on the control ut that is on the differenceof odometry measurements at time t minus 1 and t The naive method (NAI)estimates the state xt as a point Xj in the training set (XiYi) such that thecorresponding observation Yj is closest to the observation yt We performedthis as a baseline We also used the spatial pyramid matching kernel forthese methods (for kNN-PF and NAI as a similarity measure of the nearestneighbors search) We did not compare with GP-PF since it assumes thatobservations are real vectors and thus cannot be applied to this problemstraightforwardly We determined the hyperparameters in each methodby cross-validation To reduce the cost of the resampling step in KMCFwe used the method discussed in section 52 with = 100 The low-rankapproximation method (see algorithm 5) and the subsampling method (seealgorithm 6) were also applied to reduce the computational costs of KMCFSpecifically we set r = 50 100 for algorithm 5 (described as KMCF-low50and KMCF-low100 in the results below) and r = 150 300 for algorithm 6(KMCF-sub150 and KMCF-sub300)

Note that in this problem the posteriors p(xt |y1t ) can be highly multi-modal This is because similar images appear in distant locations Thereforethe posterior mean

intxt p(xt |y1t )dxt is not appropriate for point estimation

of the ground-truth position xt Thus for KMCF and KBR filter we em-ployed the heuristic for mode estimation explained in section 44 For kNN-PF we used a particle with maximum weight for the point estimationWe evaluated the performance of each method by RMSE of location esti-mates We ran each experiment 20 times for each training set of differentsize

631 Results First we demonstrate the behaviors of KMCF with this lo-calization problem Figures 9 and 10 show iterations of KMCF with n = 400applied to the test data with time interval 681 sec Figure 9 illustrates itera-tions that produced accurate estimates while Figure 10 describes situationswhere location estimation is difficult

Figures 11 and 12 show the results in RMSE and computational timerespectively For all the results KMCF and that with the computational re-duction methods (KMCF-low50 KMCF-low100 KMCF-sub150 and KMCF-sub300) performed better than KBR filter These results show the bene-fit of directly manipulating the transition models with sampling KMCF

Filtering with State-Observation Examples 423

Figure 9 Demonstration results Each column corresponds to one iteration ofKMCF (Top) Prediction step histogram of samples for prior (Middle) Cor-rection step weighted samples for posterior The blue and red stems indi-cate positive and negative weights respectively The yellow ball representsthe ground-truth location xt and the green diamond the estimated one xt (Bottom) Resampling step histogram of samples given by the resampling step

was competitive with kNN-PF for the interval 227 sec note that kNN-PFwas originally proposed for the robot localization problem For the resultswith the longer time intervals (454 sec and 681 sec) KMCF outperformedkNN-PF

We next investigate the effect on KMCF of the methods to reduce com-putational cost The performance of KMCF-low100 and KMCF-sub300 iscompetitive with KMCF those of KMCF-low50 and KMCF-sub150 de-grade as the sample size increases Note that r = 50 100 for algorithm 5are larger than those in section 62 though the values of the sample size nare larger than those in section 62 Also note that the performance of KMCF-sub150 is much worse than KMCF-sub300 These results indicate that wemay need large values for r to maintain the accuracy for this localizationproblem Recall that the spatial pyramid matching kernel gives essentially ahigh-dimensional feature vector (histogram) for each observation Thus the

424 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Figure 10 Demonstration results (see also the caption of Figure 9) Here weshow time points where observed images are similar to those in distant placesSuch a situation often occurs at corners and makes location estimation difficult(a) The prior estimate is reasonable but the resulting posterior has modes indistant places This makes the location estimate (green diamond) far from thetrue location (yellow ball) (b) While the location estimate is very accuratemodes also appear at distant locations

observation space Y may be considered high-dimensional This supportsthe hypothesis in section 62 that if the dimension is high the computationalcost-reduction methods may require larger r to maintain accuracy

Finally let us look at the results in computation time (see Figure 12)The results are similar to those in section 62 Although the values for r arerelatively large algorithms 5 and 6 successfully reduced the computationalcosts of KMCF

7 Conclusion and Future Work

This letter has proposed kernel Monte Carlo filter a novel filtering methodfor state-space models We have considered the situation where the obser-vation model is not known explicitly or even parametrically and where

Filtering with State-Observation Examples 425

Figure 11 RMSE of the robot localization experiments in section 63 Panels a band c show the cases for time interval 227 sec 454 sec and 681 sec respectively

examples of the state-observation relation are given instead of the obser-vation model Our approach is based on the framework of kernel meanembeddings which enables us to deal with the observation model in adata-driven manner The resulting filtering method consists of the predic-tion correction and resampling steps all realized in terms of kernel meanembeddings Methodological novelties lie in the prediction and resamplingsteps Thus we analyzed their behaviors by deriving error bounds for theestimator of the prediction step The analysis revealed that the effectivesample size of a weighted sample plays an important role as in parti-cle methods This analysis also explained how our resampling algorithmworks We applied the proposed method to synthetic and real problemsconfirming the effectiveness of our approach

426 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Figure 12 Computation time of the localization experiments in section 63Panels a b and c show the cases for time interval 227 sec 454 sec and 681 secrespectively Note that the results show the runtime of each method

One interesting topic for future research would be parameter estimationfor the transition model We did not discuss this and assumed that param-eters if they exist are given and fixed If the state observation examples(XiYi)n

i=1 are given as a sequence from the state-space model then wecan use the state samples X1 Xn for estimating those parameters Oth-erwise we need to estimate the parameters based on test data This mightbe possible by exploiting approaches for parameter estimation in particlemethods (eg section IV in Cappe et al (2007))

Another important topic is the situation where the observation modelin the test and training phases is different As discussed in section 43 this

Filtering with State-Observation Examples 427

might be addressed by exploiting the framework of transfer learning (Panamp Yang 2010) This would require extension of kernel mean embeddingsto the setting of transfer learning since there has been no work in thisdirection We consider that such extension is interesting in its own right

Appendix A Proof of Theorem 1

Before going to the proof we review some basic facts that are neededLet mP = int

kX (middot x)dP(x) and mP = sumni=1 wikX (middot Xi) By the reproducing

property of the kernel kX the following hold for any f isin HX

〈mP f 〉HX=

langintkX (middot x)dP(x) f

rangHX

=int

〈kX (middot x) f 〉HXdP(x)

=int

f (x)dP(x) = EXsimP[ f (X)] (A1)

〈mP f 〉HX=

langnsum

i=1

wikX (middot Xi) f

rangHX

=nsum

i=1

wi f (Xi) (A2)

For any f g isin HX we denote by f otimes g isin HX otimes HX the tensor productof f and g defined as

f otimes g(x1 x2) = f (x1)g(x2) forallx1 x2 isin X (A3)

The inner product of the tensor RKHS HX otimes HX satisfies

〈 f1 otimesg1 f2 otimes g2〉HXotimesHX= 〈 f1 f2〉HX

〈g1 g2〉HXforall f1 f2 g1 g2 isinHX

(A4)

Let φiIs=1 sub HX be complete orthonormal bases of HX where I isin N cup infin

Assume θ isin HX otimes HX (recall that this is an assumption of theorem 1) Thenθ is expressed as

θ =Isum

st=1

αstφs otimes φt (A5)

withsum

st |αst |2 lt infin (see Aronszajn 1950)

Proof of Theorem 1 Recall that mQ = sumni=1 wikX (middot X prime

i ) where X primei sim

p(middot|Xi) (i = 1 n) Then

EX prime1X

primen[mQ minus mQ2

HX]

428 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

= EX prime1X

primen[〈mQ mQ〉HX

minus 2〈mQ mQ〉HX+ 〈mQ mQ〉HX

]

=nsum

i j=1

wiw jEX primei X

primej[kX (X prime

i X primej)]

minus 2nsum

i=1

wiEX primesimQX primei[kX (X prime X prime

i )] + EX primeX primesimQ[kX (X prime X prime)]

=sumi = j

wiw jEX primei X

primej[kX (X prime

i X primej)] +

nsumi=1

w2i EX prime

i[kX (X prime

i X primei )]

minus 2nsum

i=1

wiEX primesimQX primei[kX (X prime X prime

i )] + EX primeX primesimQ[kX (X prime X prime)] (A6)

where X prime denotes an independent copy of X primeRecall that Q= int

p(middot|x)dP(x) and θ (x x) = int intkX (xprime xprime)dp(xprime|x)dp(xprime|x)

We can then rewrite terms in equation A6 as

EX primesimQX primei[kX (X prime X prime

i )]

=int (intint

kX (xprime xprimei)dp(xprime|x)dp(xprime

i|Xi)

)dP(x)

=int

θ (x Xi)dP(x) = EXsimP[θ (X Xi)]

EX primeX primesimQ[kX (X prime X prime)]

=intint (intint

kX (xprime xprime)dp(xprime|x)p(xprime|x)

)dP(x)dP(x)

=intint

θ (x x)dP(x)dP(x) = EXXsimP[θ (X X)]

Thus equation A6 is equal to

nsumi=1

w2i (EX prime

i[kX (X prime

i X primei )] minus EX prime

i Xprimei[kX (X prime

i X primei )]) +

nsumi j=1

wiw jθ (Xi Xj)

minus 2nsum

i=1

wiEXsimP[θ (X Xi)] + EXXsimP[θ (X X)] (A7)

Filtering with State-Observation Examples 429

We can rewrite terms in equation A7 as follows using the facts A1 A2A3 A4 and A5

sumi j

wiw jθ (Xi Xj) =sumi j

wiw j

sumst

αstφs(Xi)φt (Xj)

=sumst

αst

sumi

wiφs(Xi)sum

j

w jφt(Xj) =sumst

αst〈mP φs〉HX〈mP φt〉HX

=sumst

αst〈mP otimes mP φs otimes φt〉HX otimesHX= 〈mP otimes mP θ〉HX otimesHX

sumi

wiEXsimP[θ (X Xi)] =sum

i

wiEXsimP

[sumst

αstφs(X)φt (Xi)

]

=sumst

αstEXsimP[φs(X)]sum

i

wiφt (Xi) =sumst

αst〈mP φs〉HX〈mP φt〉HX

=sumst

αst〈mP otimes mP φs otimes φt〉HX otimesHX= 〈mP otimes mP θ〉HX otimesHX

EXXsimP[θ (X X)] = EXXsimP

[sumst

αstφs(X)φt (X)

]

=sumst

αst〈mP φs〉HX〈mP φt〉HX

=sumst

αst〈mP otimes mP φs otimes φt〉HX otimesHX

= 〈mP otimes mP θ〉HX otimesHX

Thus equation A7 is equal to

nsumi=1

w2i (EX prime

i[kX (X prime

i X primei )] minus EX prime

i Xprimei[kX (X prime

i X primei )]) + 〈mP otimes mP θ〉HX otimesHX

minus 2〈mP otimes mP θ〉HX otimesHX+ 〈mP otimes mP θ〉HX otimesHX

=nsum

i=1

w2i (EX prime

i[kX (X prime

i X primei )] minus EX prime

i Xprimei[kX (X prime

i X primei )])

+〈(mP minus mP) otimes (mP minus mP) θ〉HX otimesHX

Finally the Cauchy-Schwartz inequality gives

〈(mP minus mP) otimes (mP minus mP) θ〉HX otimesHXle mP minus mP2

HXθHX otimesHX

This completes the proof

430 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Appendix B Proof of Theorem 2

Theorem 2 provides convergence rates for the resampling algorithmAlgorithm 4 This theorem assumes that the candidate samples Z1 ZNfor resampling are iid with a density q Here we prove theorem 2 byshowing that the same statement holds under weaker assumptions (seetheorem 3 below)

We first describe assumptions Let P be the distribution of the kernelmean mP and L2(P) be the Hilbert space of square-integrable functions onX with respect to P For any f isin L2(P) we write its norm by fL2(P) =int

f 2(x)dP(x)

Assumption 1 The candidate samples Z1 ZN are independent Thereare probability distributions Q1 QN on X such that for any boundedmeasurable function g X rarr R we have

E

⎡⎣ 1

N minus 1

sumj =i

g(Zj)

⎤⎦ = EXsimQi

[g(X)] (i = 1 N) (B1)

Assumption 2 The distributions Q1 QN have density functions q1

qN respectively Define Q = 1N

sumNi=1 Qi and q = 1

N

sumNi=1 qi There is a con-

stant A gt 0 that does not depend on N such that

∥∥∥∥qi

qminus 1

∥∥∥∥2

L2(P)

le AradicN

(i = 1 N) (B2)

Assumption 3 The distribution P has a density function p such thatsupxisinX

p(x)

q(x)lt infin There is a constant σ gt 0 such that

radicN

(1N

Nsumi=1

p(Zi)

q(Zi)minus 1

)DrarrN (0 σ 2) (B3)

whereDrarr denotes convergence in distribution and N (0 σ 2) the normal

distribution with mean 0 and variance σ 2

These assumptions are weaker than those in theorem 2 which requirethat Z1 ZN be iid For example assumption 1 is clearly satisfied for theiid case since in this case we have Q = Q1= middot middot middot = QN The inequalityequation B2 in assumption 2 requires that the distributions Q1 QN getsimilar as the sample size increases This is also satisfied under the iid

Filtering with State-Observation Examples 431

assumption Likewise the convergence equation B3 in assumption 3 issatisfied from the central limit theorem if Z1 ZN are iid

We will need the following lemma

Lemma 1 Let Z1 ZN be samples satisfying assumption 1 Then the followingholds for any bounded measurable function g X rarr R

E

[1N

Nsumi=1

g(Zi )

]=

intg(x)d Q(x)

Proof

E

[1N

Nsumi=1

g(Zi)

]= E

⎡⎣ 1

N(N minus 1)

Nsumi=1

sumj =i

g(Zj)

⎤⎦

= 1N

Nsumi=1

E

⎡⎣ 1

N minus 1

sumj =i

g(Zj)

⎤⎦ = 1

N

Nsumi=1

intg(x)Qi(x) =

intg(x)dQ(x)

The following theorem shows the convergence rates of our resampling al-gorithm Note that it does not assume that the candidate samples Z1 ZNare identical to those expressing the estimator mP

Theorem 3 Let k be a bounded positive definite kernel and H be the asso-ciated RKHS Let Z1 ZN be candidate samples satisfying assumptions 12 and 3 Let P be a probability distribution satisfying assumption 3 and letmP =

intk(middot x)d P(x) be the kernel mean Let mP isin H be any element in H Sup-

pose we apply algorithm 4 to mP isin H with candidate samples Z1 ZN and letX1 X isin Z1 ZN be the resulting samples Then the following holds

∥∥∥∥∥mP minus 1

sumi=1

k(middot Xi )

∥∥∥∥∥2

H

= (mP minus mPH + Op(Nminus12))2 + O(

ln

)

Proof Our proof is based on the fact (Bach et al 2012) that kernel herdingcan be seen as the Frank-Wolfe optimization method with step size 1( + 1)

for the th iteration For details of the Frank-Wolfe method we refer to Jaggi(2013) and Freund and Grigas (2014)

Fix the samples Z1 ZN Let MN be the convex hull of the set k(middot Z1)

k(middot ZN ) sub H Define a loss function J H rarr R by

J(g) = 12g minus mP2

H g isin H (B4)

432 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Then algorithm 4 can be seen as the Frank-Wolfe method that iterativelyminimizes this loss function over the convex hull MN

infgisinMN

J(g)

More precisely the Frank-Wolfe method solves this problem by the follow-ing iterations

s = arg mingisinMN

〈gnablaJ(gminus1)〉H

g = (1 minus γ )gminus1 + γ s ( ge 1)

where γ is a step size defined as γ = 1 and nablaJ(gminus1) is the gradient of J atgminus1 nablaJ(gminus1) = gminus1 minus mP Here the initial point is defined as g0 = 0 It canbe easily shown that g = 1

sumi=1 k(middot Xi) where X1 X are the samples

given by algorithm 4 (For details see Bach et al 2012)Let LJMN

gt 0 be the Lipschitz constant of the gradient nablaJ over MN andDiam MN gt 0 be the diameter of MN

LJMN= sup

g1g2isinMN

nablaJ(g1) minus nablaJ(g2)Hg1 minus g2H

= supg1g2isinMN

g1 minus g2Hg1 minus g2H

= 1 (B5)

Diam MN = supg1g2isinMN

g1 minus g2H

le supg1g2isinMN

g1H + g2H le 2C (B6)

where C = supxisinX k(middot x)H = supxisinXradic

k(x x) lt infinFrom bound 32 and equation 8 of Freund and Grigas (2014) we then

have

J(g) minus infgisinMN

J(g) leLJMN

(Diam MN)2(1 + ln )

2(B7)

le 2C2(1 + ln )

(B8)

where the last inequality follows from equations B5 and B6

Filtering with State-Observation Examples 433

Note that the upper bound of equation B8 does not depend on thecandidate samples Z1 ZN Hence combined with equation B4 thefollowing holds for any choice of Z1 ZN

∥∥∥∥∥mP minus 1

sumi=1

k(middot Xi)

∥∥∥∥∥2

H

le infgisinMN

mP minus g2H + 4C2(1 + ln )

(B9)

Below we will focus on bounding the first term of equation B9 Recallhere that Z1 ZN are random samples Define a random variable SN =sumN

i=1p(Zi )

q(Zi ) Since MN is the convex hull of the k(middot Z1) k(middot ZN ) we

have

infgisinMN

mP minus gH

= infαisinRN αge0

sumi αile1

∥∥∥∥∥mP minussum

i

αik(middot Zi)

∥∥∥∥∥H

le∥∥∥∥∥mP minus 1

SN

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

le mP minus mPH +∥∥∥∥∥mP minus 1

N

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

+∥∥∥∥∥ 1

N

sumi

p(Zi)

q(Zi)k(middot Zi) minus 1

SN

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

Therefore we have

mP minus 1

sumi=1

k(middot Xi)2H

le(

mP minus mPH +∥∥∥∥∥mP minus 1

N

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

+∥∥∥∥∥ 1

N

sumi

p(Zi)

q(Zi)k(middot Zi) minus 1

SN

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

)2

+ O(

ln

)

(B10)

Below we derive rates of convergence for the second and third terms

434 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

For the second term we derive a rate of convergence in expectationwhich implies a rate of convergence in probability To this end we use thefollowing fact Let f isin H be any function in the RKHS By the assumptionsupxisinX

p(x)

q(x)lt infin and the boundedness of k functions x rarr p(x)

q(x)f (x) and

x rarr (p(x)

q(x))2 f (x) are bounded

E

⎡⎣

∥∥∥∥∥mP minus 1N

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥2

H

⎤⎦

= mP2H minus 2E

[1N

sumi

p(Zi)

q(Zi)mP(Zi)

]

+ E

⎡⎣ 1

N2

sumi

sumj

p(Zi)

q(Zi)

p(Zj)

q(Zj)k(Zi Zj)

⎤⎦

= mP2H minus 2

intp(x)

q(x)mP(x)q(x)dx

+ E

⎡⎣ 1

N2

sumi

sumj =i

p(Zi)

q(Zi)

p(Zj)

q(Zj)k(Zi Zj)

⎤⎦

+E

[1

N2

sumi

(p(Zi)

q(Zi)

)2

k(Zi Zi)

]

= mP2H minus 2mP2

H + E

[N minus 1

N2

sumi

p(Zi)

q(Zi)

intp(x)

q(x)k(Zi x)qi(x)dx

]

+ 1N

int (p(x)

q(x)

)2

k(x x)q(x)dx

= minusmP2H + E

[N minus 1

N2

sumi

p(Zi)

q(Zi)

intp(x)

q(x)k(Zi x)qi(x)dx

]

+ 1N

intp(x)

q(x)k(x x)dP(x)

We further rewrite the second term of the last equality as follows

E

[N minus 1

N2

sumi

p(Zi)

q(Zi)

intp(x)

q(x)k(Zi x)qi(x)dx

]

Filtering with State-Observation Examples 435

= E

[N minus 1

N2

sumi

p(Zi)

q(Zi)

intp(x)

q(x)k(Zi x)(qi(x) minus q(x))dx

]

+ E

[N minus 1

N2

sumi

p(Zi)

q(Zi)

intp(x)

q(x)k(Zi x)q(x)dx

]

= E

[N minus 1

N2

sumi

p(Zi)

q(Zi)

int radicp(x)k(Zi x)

radicp(x)

(qi(x)

q(x)minus 1

)dx

]

+ N minus 1N

mP2H

le E

[N minus 1

N2

sumi

p(Zi)

q(Zi)k(Zi middot)L2(P)

∥∥∥∥qi(x)

q(x)minus 1

∥∥∥∥L2(P)

]+ N minus 1

NmP2

H

le E

[N minus 1

N3

sumi

p(Zi)

q(Zi)C2A

]+ N minus 1

NmP2

H

= C2A(N minus 1)

N2 + N minus 1N

mP2H

where the first inequality follows from Cauchy-Schwartz Using this weobtain

E[

∥∥∥∥∥mP minus 1N

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥2

H

le 1N

(intp(x)

q(x)k(x x)dP(x) minus mP2

H

)+ C2(N minus 1)A

N2

= O(Nminus1)

Therefore we have

∥∥∥∥∥mP minus 1N

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

= Op(Nminus12) (N rarr infin) (B11)

We can bound the third term as follows∥∥∥∥∥ 1N

sumi

p(Zi)

q(Zi)k(middot Zi) minus 1

SN

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

=∥∥∥∥∥ 1

N

sumi

p(Zi)

q(Zi)k(middot Zi)

(1 minus N

SN

)∥∥∥∥∥H

436 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

=∣∣∣∣1 minus N

SN

∣∣∣∣∥∥∥∥∥ 1

N

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

le∣∣∣∣1 minus N

SN

∣∣∣∣C pqinfin

=∣∣∣∣∣1 minus 1

1N

sumNi=1 p(Zi)q(Zi)

∣∣∣∣∣C pqinfin

where pqinfin = supxisinXp(x)

q(x)lt infin Therefore the following holds by as-

sumption 3 and the delta method

∥∥∥∥∥ 1N

sumi

p(Zi)

q(Zi)k(middot Zi) minus 1

SN

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

= Op(Nminus12) (B12)

The assertion of the theorem follows from equations B10 to B12

Appendix C Reduction of Computational Cost

We have seen in section 43 that the time complexity of KMCF in one timestep is O(n3) where n is the number of the state-observation examples(XiYi)n

i=1 This can be costly if one wishes to use KMCF in real-timeapplications with a large number of samples Here we show two methodsfor reducing the costs one based on low-rank approximation of kernelmatrices and one based on kernel herding Note that kernel herding is alsoused in the resampling step The purpose here is different however wemake use of kernel herding for finding a reduced representation of the data(XiYi)n

i=1

C1 Low-rank Approximation of Kernel Matrices Our goal is to re-duce the costs of algorithm 1 of kernel Bayesrsquo rule Algorithm 1 involvestwo matrix inversions (GX + nεIn)minus1 in line 3 and ((GY )2 + δIn)minus1 in line4 Note that (GX + nεIn)minus1 does not involve the test data so it can be com-puted before the test phase On the other hand ((GY )2 + δIn)minus1 dependson matrix This matrix involves the vector mπ which essentially repre-sents the prior of the current state (see line 13 of algorithm 3) Therefore((GY )2 + δIn)minus1 needs to be computed for each iteration in the test phaseThis has the complexity of O(n3) Note that even if (GX + nεIn)minus1 can becomputed in the training phase the multiplication (GX + nεIn)minus1mπ in line3 requires O(n2) Thus it can also be costly Here we consider methods toreduce both costs in lines 3 and 4

Suppose that there exist low-rank matrices UV isin Rntimesr where r lt n that

approximate the kernel matrices GX asymp UUT GY asymp VVT Such low-rank

Filtering with State-Observation Examples 437

matrices can be obtained by for example incomplete Cholesky decomposi-tion with time complexity O(nr2) (Fine amp Scheinberg 2001 Bach amp Jordan2002) Note that the computation of these matrices is required only oncebefore the test phase Therefore their time complexities are not the problemhere

C11 Derivation First we approximate (GX + nεIn)minus1mπ in line 3 usingGX asymp UUT By the Woodbury identity we have

(GX + nεIn)minus1mπ asymp (UUT + nεIn)minus1mπ

= 1nε

(In minus U(nεIr + UTU)minus1UT )mπ

where Ir isin Rrtimesr denotes the identity Note that (nεIr + UTU)minus1 does not

involve the test data so can be computed in the training phase Thus theabove approximation of μ can be computed with complexity O(nr2)

Next we approximate w = GY ((GY )2 + δI)minus1kY in line 4 using GY asympVVT Define B = V isin R

ntimesr C = VTV isin Rrtimesr and D = VT isin R

rtimesn Then(GY )2 asymp (VVT )2 = BCD By the Woodbury identity we obtain

(δIn + (GY )2)minus1 asymp (δIn + BCD)minus1

= 1δ(In minus B(δCminus1 + DB)minus1D)

Thus w can be approximated as

w =GY ((GY )2 + δI)minus1kY

asymp 1δVVT (In minus B(δCminus1 + DB)minus1D)kY

The computation of this approximation requires O(nr2 + r3) = O(nr2) Thusin total the complexity of algorithm 1 can be reduced to O(nr2) We sum-marize the above approximations in algorithm 5

438 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

C12 How to Use Algorithm 5 can be used with algorithm 3 of KMCFby modifying Algorithm 3 in the following manner Compute the low-rank matrices UV right after lines 4 and 5 This can be done by using forexample incomplete Cholesky decomposition (Fine amp Scheinberg 2001Bach amp Jordan 2002) Then replace algorithm 1 in line 15 by algorithm 5

C13 How to Select the Rank As discussed in section 43 one way ofselecting the rank r is to use-cross validation by regarding r as a hyper-parameter of KMCF Another way is to measure the approximation errorsGX minus UUT and GY minus VVT with some matrix norm such as the Frobe-nius norm Indeed we can compute the smallest rank r such that theseerrors are below a prespecified threshold and this can be done efficientlywith time complexity O(nr2) (Bach amp Jordan 2002)

C2 Data Reduction with Kernel Herding Here we describe an ap-proach to reduce the size of the representation of the state-observationexamples (XiYi)n

i=1 in an efficient way By ldquoefficientrdquo we mean that theinformation contained in (XiYi)n

i=1 will be preserved even after the re-duction Recall that (XiYi)n

i=1 contains the information of the observationmodel p(yt |xt ) (recall also that p(yt |xt ) is assumed time-homogeneous seesection 41) This information is used in only algorithm 1 of kernel Bayesrsquorule (line 15 algorithm 3) Therefore it suffices to consider how kernel Bayesrsquorule accesses the information contained in the joint sample (XiYi)n

i=1

C21 Representation of the Joint Sample To this end we need to show howthe joint sample (XiYi)n

i=1 can be represented with a kernel mean embed-ding Recall that (kX HX ) and (kY HY ) are kernels and the associatedRKHSs on the state-space X and the observation-space Y respectively LetX times Y be the product space ofX andY Then we can define a kernel kXtimesY onX times Y as the product of kX and kY kXtimesY ((x y) (xprime yprime)) = kX (x xprime)kY (y yprime)for all (x y) (xprime yprime) isin X times Y This product kernel kXtimesY defines an RKHS ofX times Y Let HXtimesY denote this RKHS As in section 3 we can use kXtimesY andHXtimesY for a kernel mean embedding In particular the empirical distribution1n

sumni=1 δ(XiYi )

of the joint sample (XiYi)ni=1 sub X times Y can be represented as

an empirical kernel mean in HXtimesY

mXY = 1n

nsumi=1

kXtimesY ((middot middot) (XiYi)) isin HXtimesY (C1)

This is the representation of the joint sample (XiYi)ni=1

The information of (XiYi)ni=1 is provided for kernel Bayesrsquo rule essen-

tially through the form of equation C1 (Fukumizu et al 2011 2013) Recallthat equation C1 is a point in the RKHS HXtimesY Any point close to thisequation in HXtimesY would also contain information close to that contained in

Filtering with State-Observation Examples 439

equation C1 Therefore we propose to find a subset (X1 Y1) (Xr Xr) sub(XiYi)n

i=1 where r lt n such that its representation in HXtimesY

mXY = 1r

rsumi=1

kXtimesY ((middot middot) (Xi Yi)) isin HXtimesY (C2)

is close to equation C1 Namely we wish to find subsamples such thatmXY minus mXYHXtimesY

is small If the error mXY minus mXYHXtimesYis small enough

equation C2 would provide information close to that given by equation C1for kernel Bayesrsquo rule Thus kernel Bayesrsquo rule based on such subsamples(Xi Yi)r

i=1 would not perform much worse than the one based on the entireset of samples (XiYi)n

i=1

C22 Subsampling Method To find such subsamples we make use ofkernel herding in section 35 Namely we apply the update equations 36and 37 to approximate equation C1 with kernel kXtimesY and RKHS HXtimesY We greedily find subsamples Dr = (X1 Y1) (Xr Yr) as

(Xr Yr)

= arg max(xy)isinDDrminus1

1n

nsumi=1

kXtimesY ((x y) (XiYi))minus1r

rminus1sumj=1

kXtimesY ((x r) (Xi Yi))

= arg max(xy)isinDDrminus1

1n

nsumi=1

kX (x Xi)kY (yYi)minus1r

rminus1sumj=1

kX (x Xj)kY (y Yj)

The resulting algorithm is shown in algorithm 6 The time complexity isO(n2r) for selecting r subsamples

C23 How to Use By using algorithm 6 we can reduce the the time com-plexity of KMCF (see algorithm 3) in each iteration from O(n3) to O(r3) Thiscan be done by obtaining subsamples (Xi Yi)r

i=1 by applying algorithm 6to (XiYi)n

i=1 and then replacing (XiYi)ni=1 in requirement of algorithm 3

by (Xi Yi)ri=1 and using the number r instead of n

C24 How to Select the Number of Subsamples The number r of subsam-ples determines the trade-off between the accuracy and computational timeof KMCF It may be selected by cross-validation or by measuring the ap-proximation error mXY minus mXYHXtimesY

as for the case of selecting the rankof low-rank approximation in section C1

C25 Discussion Recall that kernel herding generates samples such thatthey approximate a given kernel mean (see section 35) Under certain

440 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

assumptions the error of this approximation is of O(rminus1) with r sampleswhich is faster than that of iid samples O(rminus12) This indicates that sub-samples (Xi Yi)r

i=1 selected with kernel herding may approximate equa-tion C1 well Here however we find the solutions of the optimizationproblems 36 and 37 from the finite set (XiYi)n

i=1 rather than the entirejoint space X times Y The convergence guarantee is provided only for the caseof the entire joint space X times Y Thus for our case the convergence guar-antee is no longer provided Moreover the fast rate O(rminus1) is guaranteedonly for finite-dimensional RKHSs Gaussian kernels which we often use inpractice define infinite-dimensional RKHSs Therefore the fast rate is notguaranteed if we use gaussian kernels Nevertheless we can use algorithm6 as a heuristic for data reduction

Acknowledgments

We express our gratitude to the associate editor and the anonymous re-viewer for their time and helpful suggestions We also thank MasashiShimbo Momoko Hayamizu Yoshimasa Uematsu and Katsuhiro Omaefor their helpful comments This work has been supported in part by MEXTGrant-in-Aid for Scientific Research on Innovative Areas 25120012 MKhas been supported by JSPS Grant-in-Aid for JSPS Fellows 15J04406

References

Anderson B amp Moore J (1979) Optimal filtering Englewood Cliffs NJ PrenticeHall

Aronszajn N (1950) Theory of reproducing kernels Transactions of the AmericanMathematical Society 68(3) 337ndash404

Filtering with State-Observation Examples 441

Bach F amp Jordan M I (2002) Kernel independent component analysis Journal ofMachine Learning Research 3 1ndash48

Bach F Lacoste-Julien S amp Obozinski G (2012) On the equivalence betweenherding and conditional gradient algorithms In Proceedings of the 29th Interna-tional Conference on Machine Learning (ICML2012) (pp 1359ndash1366) Madison WIOmnipress

Berlinet A amp Thomas-Agnan C (2004) Reproducing kernel Hilbert spaces in probabilityand statistics Dordrecht Kluwer Academic

Calvet L E amp Czellar V (2015) Accurate methods for approximate Bayesiancomputation filtering Journal of Financial Econometrics 13 798ndash838 doi101093jjfinecnbu019

Cappe O Godsill S J amp Moulines E (2007) An overview of existing methods andrecent advances in sequential Monte Carlo IEEE Proceedings 95(5) 899ndash924

Chen Y Welling M amp Smola A (2010) Supersamples from kernel-herding InProceedings of the 26th Conference on Uncertainty in Artificial Intelligence (pp 109ndash116) Cambridge MA MIT Press

Deisenroth M Huber M amp Hanebeck U (2009) Analytic moment-based gaussianprocess filtering In Proceedings of the 26th International Conference on MachineLearning (pp 225ndash232) Madison WI Omnipress

Doucet A Freitas N D amp Gordon N J (Eds) (2001) Sequential Monte Carlomethods in practice New York Springer

Doucet A amp Johansen A M (2011) A tutorial on particle filtering and smoothingFifteen years later In D Crisan amp B Rozovskii (Eds) The Oxford handbook ofnonlinear filtering (pp 656ndash704) New York Oxford University Press

Durbin J amp Koopman S J (2012) Time series analysis by state space methods (2nd ed)New York Oxford University Press

Eberts M amp Steinwart I (2013) Optimal regression rates for SVMs using gaussiankernels Electronic Journal of Statistics 7 1ndash42

Ferris B Hahnel D amp Fox D (2006) Gaussian processes for signal strength-basedlocation estimation In Proceedings of Robotics Science and Systems CambridgeMA MIT Press

Fine S amp Scheinberg K (2001) Efficient SVM training using low-rank kernel rep-resentations Journal of Machine Learning Research 2 243ndash264

Freund R M amp Grigas P (2014) New analysis and results for the FrankndashWolfemethod Mathematical Programming doi 101007s10107-014-0841-6

Fukumizu K Bach F amp Jordan M I (2004) Dimensionality reduction for super-vised learning with reproducing kernel Hilbert spaces Journal of Machine LearningResearch 5 73ndash99

Fukumizu K Gretton A Sun X amp Scholkopf B (2008) Kernel measures ofconditional dependence In J C Platt D Koller Y Singer amp S Roweis (Eds)Advances in neural information processing systems 20 (pp 489ndash496) CambridgeMA MIT Press

Fukumizu K Song L amp Gretton A (2011) Kernel Bayesrsquo rule In J Shawe-TaylorR S Zemel P L Bartlett F C N Pereira amp K Q Weinberger (Eds) Advances inneural information processing systems 24 (pp 1737ndash1745) Red Hook NY Curran

Fukumizu K Song L amp Gretton A (2013) Kernel Bayesrsquo rule Bayesian inferencewith positive definite kernels Journal of Machine Learning Research 14 3753ndash3783

442 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Fukumizu K Sriperumbudur B Gretton A amp Scholkopf B (2009) Characteristickernels on groups and semigroups In D Koller D Schuurmans Y Bengio amp LBottou (Eds) Advances in neural information processing systems 21 (pp 473ndash480)Cambridge MA MIT Press

Gordon N J Salmond D J amp Smith AFM (1993) Novel approach tononlinearnon-gaussian Bayesian state estimation IEE-Proceedings-F 140 107ndash113

Hofmann T Scholkopf B amp Smola A J (2008) Kernel methods in machine learn-ing Annals of Statistics 36(3) 1171ndash1220

Jaggi M (2013) Revisiting Frank-Wolfe Projection-free sparse convex optimizationIn Proceedings of the 30th International Conference on Machine Learning (pp 427ndash435)httplinkspringercomarticle101007Fg10107-014-0841-6

Jasra A Singh S S Martin J S amp McCoy E (2012) Filtering via approximateBayesian computation Statistics and Computing 22 1223ndash1237

Julier S J amp Uhlmann J K (1997) A new extension of the Kalman filter tononlinear systems In Proceedings of AeroSense The 11th International SymposiumAerospaceDefence Sensing Simulation and Controls Bellingham WA SPIE

Julier S J amp Uhlmann J K (2004) Unscented filtering and nonlinear estimationIEEE Review 92 401ndash422

Kalman R E (1960) A new approach to linear filtering and prediction problemsTransactions of the ASME Journal of Basic Engineering 82 35ndash45

Kanagawa M amp Fukumizu K (2014) Recovering distributions from gaussianRKHS embeddings In Proceedings of the 17th International Conference on Artifi-cial Intelligence and Statistics (pp 457ndash465) JMLR

Kanagawa M Nishiyama Y Gretton A amp Fukumizu K (2014) Monte Carlofiltering using kernel embedding of distributions In Proceedings of the 28thAAAI Conference on Artificial Intelligence (pp 1897ndash1903) Cambridge MA MITPress

Ko J amp Fox D (2009) GP-BayesFilters Bayesian filtering using gaussian processprediction and observation models Autonomous Robots 72(1) 75ndash90

Lazebnik S Schmid C amp Ponce J (2006) Beyond bags of features Spatial pyramidmatching for recognizing natural scene categories In Proceedings of 2006 IEEEComputer Society Conference on Computer Vision and Pattern Recognition (vol 2 pp2169ndash2178) Washington DC IEEE Computer Society

Liu J S (2001) Monte Carlo strategies in scientific computing New York Springer-Verlag

McCalman L OrsquoCallaghan S amp Ramos F (2013) Multi-modal estimation withkernel embeddings for learning motion models In Proceedings of 2013 IEEE In-ternational Conference on Robotics and Automation (pp 2845ndash2852) Piscataway NJIEEE

Pan S J amp Yang Q (2010) A survey on transfer learning IEEE Transactions onKnowledge and Data Engineering 22(10) 1345ndash1359

Pistohl T Ball T Schulze-Bonhage A Aertsen A amp Mehring C (2008) Predic-tion of arm movement trajectories from ECoG-recordings in humans Journal ofNeuroscience Methods 167(1) 105ndash114

Pronobis A amp Caputo B (2009) COLD COsy localization database InternationalJournal of Robotics Research 28(5) 588ndash594

Filtering with State-Observation Examples 443

Quigley M Stavens D Coates A amp Thrun S (2010) Sub-meter indoor localiza-tion in unmodified environments with inexpensive sensors In Proceedings of theIEEERSJ International Conference on Intelligent Robots and Systems 2010 (vol 1 pp2039ndash2046) Piscataway NJ IEEE

Ristic B Arulampalam S amp Gordon N (2004) Beyond the Kalman filter Particlefilters for tracking applications Norwood MA Artech House

Schaid D J (2010a) Genomic similarity and kernel methods I Advancements bybuilding on mathematical and statistical foundations Human Heredity 70(2) 109ndash131

Schaid D J (2010b) Genomic similarity and kernel methods II Methods for genomicinformation Human Heredity 70(2) 132ndash140

Schalk G Kubanek J Miller K J Anderson N R Leuthardt E C Ojemann J G Wolpaw J R (2007) Decoding two dimensional movement trajectories usingelectrocorticographic signals in humans Journal of Neural Engineering 4(264)264ndash275

Scholkopf B amp Smola A J (2002) Learning with kernels Cambridge MA MIT PressScholkopf B Tsuda K amp Vert J P (2004) Kernel methods in computational biology

Cambridge MA MIT PressSilverman B W (1986) Density estimation for statistics and data analysis London

Chapman and HallSmola A Gretton A Song L amp Scholkopf B (2007) A Hilbert space embed-

ding for distributions In Proceedings of the International Conference on AlgorithmicLearning Theory (pp 13ndash31) New York Springer

Song L Fukumizu K amp Gretton A (2013) Kernel embeddings of conditional dis-tributions A unified kernel framework for nonparametric inference in graphicalmodels IEEE Signal Processing Magazine 30(4) 98ndash111

Song L Huang J Smola A amp Fukumizu K (2009) Hilbert space embeddings ofconditional distributions with applications to dynamical systems In Proceedingsof the 26th International Conference on Machine Learning (pp 961ndash968) MadisonWI Omnipress

Sriperumbudur B K Gretton A Fukumizu K Scholkopf B amp Lanckriet G R(2010) Hilbert space embeddings and metrics on probability measures Journal ofMachine Learning Research 11 1517ndash1561

Steinwart I amp Christmann A (2008) Support vector machines New York SpringerStone C J (1977) Consistent nonparametric regression Annals of Statistics 5(4)

595ndash620Thrun S Burgard W amp Fox D (2005) Probabilistic robotics Cambridge MA MIT

PressVlassis N Terwijn B amp Krose B (2002) Auxiliary particle filter robot local-

ization from high-dimensional sensor observations In Proceedings of the In-ternational Conference on Robotics and Automation (pp 7ndash12) Piscataway NJIEEE

Wang Z Ji Q Miller K J amp Schalk G (2011) Prior knowledge improves decodingof finger flexion from electrocorticographic signals Frontiers in Neuroscience 5127

Widom H (1963) Asymptotic behavior of the eigenvalues of certain integral equa-tions Transactions of the American Mathematical Society 109 278ndash295

444 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Widom H (1964) Asymptotic behavior of the eigenvalues of certain integral equa-tions II Archive for Rational Mechanics and Analysis 17 215ndash229

Wolf J Burgard W amp Burkhardt H (2005) Robust vision-based localization bycombining an image retrieval system with Monte Carlo localization IEEE Trans-actions on Robotics 21(2) 208ndash216

Received May 18 2015 accepted October 14 2015

Page 8: Filtering with State-Observation Examples via Kernel Monte ...

Filtering with State-Observation Examples 389

bull Feature vector k(middot x) isin H for all x isin X bull Reproducing property f (x) = 〈 f k(middot x)〉H for all f isin H and x isin X

where 〈middot middot〉H denotes the inner product equipped with H and k(middot x) is afunction with x fixed By the reproducing property we have

k(x xprime) = 〈k(middot x) k(middot xprime)〉H forallx xprime isin X

Namely k(x xprime) implicitly computes the inner product between the func-tions k(middot x) and k(middot xprime) From this property k(middot x) can be seen as an implicitrepresentation of x in H Therefore k(middot x) is called the feature vector of xand H the feature space It is also known that the subspace spanned by thefeature vectors k(middot x)|x isin X is dense in H This means that any function fin H can be written as the limit of functions of the form fn = sumn

i=1 cik(middot Xi)where c1 cn isin R and X1 Xn isin X

For example positive-definite kernels on the Euclidean space X = Rd

include gaussian kernel k(x xprime) = exp(minusx minus xprime222σ 2) and Laplace kernel

k(x xprime) = exp(minusx minus x1σ ) where σ gt 0 and middot 1 denotes the 1 normNotably kernel methods allow X to be a set of structured data such asimages texts or graphs In fact there exist various positive-definite kernelsdeveloped for such structured data (Hofmann et al 2008) Note that thenotion of positive-definite kernels is different from smoothing kernels inkernel density estimation (Silverman 1986) a smoothing kernel does notnecessarily define an RKHS

32 Kernel Means We use the kernel k and the RKHS H to representprobability distributions onX This is the framework of kernel mean embed-dings (Smola et al 2007) Let X be a measurable space and k be measurableand bounded on X 2 Let P be an arbitrary probability distribution on X Then the representation of P in H is defined as the mean of the featurevector

mP =int

k(middot x)dP(x) isin H (31)

which is called the kernel mean of PIf k is characteristic the kernel mean equation 31 preserves all the in-

formation about P a positive-definite kernel k is defined to be characteristicif the mapping P rarr mP isin H is one-to-one (Fukumizu Bach amp Jordan 2004Fukumizu Gretton Sun amp Scholkopf 2008 Sriperumbudur et al 2010)This means that the RKHS is rich enough to distinguish among all distribu-tions For example the gaussian and Laplace kernels are characteristic (Forconditions for kernels to be characteristic see Fukumizu Sriperumbudur

2k is bounded on X if supxisinX k(x x) lt infin

390 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Gretton amp Scholkopf 2009 and Sriperumbudur et al 2010) We assumehenceforth that kernels are characteristic

An important property of the kernel mean equation 31 is the followingby the reproducing property we have

〈mP f 〉H =int

f (x)dP(x) = EXsimP[ f (X)] forall f isin H (32)

that is the expectation of any function in the RKHS can be given by theinner product between the kernel mean and that function

33 Estimation of Kernel Means Suppose that distribution P is un-known and that we wish to estimate P from available samples This can beequivalently done by estimating its kernel mean mP since mP preserves allthe information about P

For example let X1 Xn be an independent and identically distributed(iid) sample from P Define an estimator of mP by the empirical mean

mP = 1n

nsumi=1

k(middot Xi)

Then this converges to mP at a rate mP minus mPH = Op(nminus12) (Smola et al

2007) where Op denotes the asymptotic order in probability and middot H isthe norm of the RKHS fH = radic〈 f f 〉H for all f isin H Note that this rateis independent of the dimensionality of the space X

Next we explain kernel Bayesrsquo rule which serves as a building block ofour filtering algorithm To this end we introduce two measurable spacesX and Y Let p(xy) be a joint probability on the product space X times Y thatdecomposes as p(x y) = p(y|x)p(x) Let π(x) be a prior distribution onX Then the conditional probability p(y|x) and the prior π(x) define theposterior distribution by Bayesrsquo rule

pπ (x|y) prop p(y|x)π(x)

The assumption here is that the conditional probability p(y|x) is un-known Instead we are given an iid sample (X1Y1) (XnYn) fromthe joint probability p(xy) We wish to estimate the posterior pπ (x|y) usingthe sample KBR achieves this by estimating the kernel mean of pπ (x|y)

KBR requires that kernels be defined on X and Y Let kX and kY bekernels on X and Y respectively Define the kernel means of the prior π(x)

and the posterior pπ (x|y)

mπ =int

kX (middot x)π(x)dx mπX|y =

intkX (middot x)pπ (x|y)dx

Filtering with State-Observation Examples 391

KBR also requires that mπ be expressed as a weighted sample Let mπ =sumj=1 γ jkX (middotUj) be a sample expression of mπ where isin N γ1 γ isin R

and U1 U isin X For example suppose U1 U are iid drawn fromπ(x) Then γ j = 1 suffices

Given the joint sample (XiYi)ni=1 and the empirical prior mean mπ

KBR estimates the kernel posterior mean mπX|y as a weighted sum of the

feature vectors

mπX|y =

nsumi=1

wikX (middot Xi) (33)

where the weights w = (w1 wn)T isin Rn are given by algorithm 1 Here

diag(v) for v isin Rn denotes a diagonal matrix with diagonal entries v It takes

as input (1) vectors kY = (kY (yY1) kY (yYn))T mπ = (mπ (X1)

mπ (Xn))T isin Rn where mπ (Xi) = sum

j=1 γ jkX (XiUj) (2) kernel matricesGX = (kX (Xi Xj)) GY = (kY (YiYj )) isin R

ntimesn and (3) regularization con-stants ε δ gt 0 The weight vector w = (w1 wn)T isin R

n is obtained bymatrix computations involving two regularized matrix inversions Notethat these weights can be negative

Fukumizu et al (2013) showed that KBR is a consistent estimator of thekernel posterior mean under certain smoothness assumptions the estimateequation 33 converges to mπ

X|y as the sample size goes to infinity n rarr infinand mπ converges to mπ (with ε δ rarr 0 in appropriate speed) (For detailssee Fukumizu et al 2013 Song et al 2013)

34 Decoding from Empirical Kernel Means In general as shownabove a kernel mean mP is estimated as a weighted sum of feature vectors

mP =nsum

i=1

wik(middot Xi) (34)

with samples X1 Xn isin X and (possibly negative) weights w1 wn isinR Suppose mP is close to mP that is mP minus mPH is small Then mP issupposed to have accurate information about P as mP preserves all theinformation of P

392 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

How can we decode the information of P from mP The empirical ker-nel mean equation 34 has the following property which is due to thereproducing property of the kernel

〈mP f 〉H =nsum

i=1

wi f (Xi) forall f isin H (35)

Namely the weighted average of any function in the RKHS is equal to theinner product between the empirical kernel mean and that function Thisis analogous to the property 32 of the population kernel mean mP Let fbe any function in H From these properties equations 32 and 35 we have

∣∣∣∣∣EXsimP[ f (X)] minusnsum

i=1

wi f (Xi)

∣∣∣∣∣ = |〈mP minus mP f 〉H| le mP minus mPH fH

where we used the Cauchy-Schwartz inequality Therefore the left-handside will be close to 0 if the error mP minus mPH is small This shows that theexpectation of f can be estimated by the weighted average

sumni=1 wi f (Xi)

Note that here f is a function in the RKHS but the same can also be shownfor functions outside the RKHS under certain assumptions (Kanagawa ampFukumizu 2014) In this way the estimator of the form 34 provides es-timators of moments probability masses on sets and the density func-tion (if this exists) We explain this in the context of state-space models insection 44

35 Kernel Herding Here we explain kernel herding (Chen et al 2010)another building block of the proposed filter Suppose the kernel mean mPis known We wish to generate samples x1 x2 x isin X such that theempirical mean mP = 1

sumi=1 k(middot xi) is close to mP that is mP minus mPH is

small This should be done only using mP Kernel herding achieves this bygreedy optimization using the following update equations

x1 = arg maxxisinX

mP(x) (36)

x = arg maxxisinX

mP(x) minus 1

minus1sumi=1

k(x xi) ( ge 2) (37)

where mP(x) denotes the evaluation of mP at x (recall that mP is a functionin H)

An intuitive interpretation of this procedure can be given if there is aconstant R gt 0 such that k(x x) = R for all x isin X (eg R = 1 if k is gaussian)Suppose that x1 xminus1 are already calculated In this case it can be shown

Filtering with State-Observation Examples 393

that x in equation 37 is the minimizer of

E =∥∥∥∥∥mP minus 1

sumi=1

k(middot xi)

∥∥∥∥∥H

(38)

Thus kernel herding performs greedy minimization of the distance betweenmP and the empirical kernel mean mP = 1

sumi=1 k(middot xi)

It can be shown that the error E of equation 38 decreases at a rate at leastO(minus12) under the assumption that k is bounded (Bach Lacoste-Julien ampObozinski 2012) In other words the herding samples x1 x provide aconvergent approximation of mP In this sense kernel herding can be seen asa (pseudo) sampling method Note that mP itself can be an empirical kernelmean of the form 34 These properties are important for our resamplingalgorithm developed in section 42

It should be noted that E decreases at a faster rate O(minus1) under a certainassumption (Chen et al 2010) this is much faster than the rate of iidsamples O(minus12) Unfortunately this assumption holds only when H isfinite dimensional (Bach et al 2012) and therefore the fast rate of O(minus1)

has not been guaranteed for infinite-dimensional cases Nevertheless thisfast rate motivates the use of kernel herding in the data reduction methodin section C2 in appendix C (we will use kernel herding for two differentpurposes)

4 Kernel Monte Carlo Filter

In this section we present our kernel Monte Carlo filter (KMCF) Firstwe define notation and review the problem setting in section 41 We thendescribe the algorithm of KMCF in section 42 We discuss implementationissues such as hyperparameter selection and computational cost in section43 We explain how to decode the information on the posteriors from theestimated kernel means in section 44

41 Notation and Problem Setup Here we formally define the setupexplained in section 1 The notation is summarized in Table 1

We consider a state-space model (see Figure 1) Let X and Y be mea-surable spaces which serve as a state space and an observation space re-spectively Let x1 xt xT isin X be a sequence of hidden states whichfollow a Markov process Let p(xt |xtminus1) denote a transition model that de-fines this Markov process Let y1 yt yT isin Y be a sequence of obser-vations Each observation yt is assumed to be generated from an observationmodel p(yt |xt ) conditioned on the corresponding state xt We use the abbre-viation y1t = y1 yt

We consider a filtering problem of estimating the posterior distributionp(xt |y1t ) for each time t = 1 T The estimation is to be done online

394 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Table 1 Notation

X State spaceY Observation spacext isin X State at time tyt isin Y Observation at time tp(yt |xt ) Observation modelp(xt |xtminus1) Transition model(XiYi)n

i=1 State-observation exampleskX Positive-definite kernel on XkY Positive-definite kernel on YHX RKHS associated with kXHY RKHS associated with kY

as each yt is given Specifically we consider the following setting (see alsosection 1)

1 The observation model p(yt |xt ) is not known explicitly or even para-metrically Instead we are given examples of state-observation pairs(XiYi)n

i=1 sub X times Y prior to the test phase The observation modelis also assumed time homogeneous

2 Sampling from the transition model p(xt |xtminus1) is possible Its prob-abilistic model can be an arbitrary nonlinear nongaussian distribu-tion as for standard particle filters It can further depend on timeFor example control input can be included in the transition model asp(xt |xtminus1) = p(xt |xtminus1 ut ) where ut denotes control input providedby a user at time t

Let kX X times X rarr R and kY Y times Y rarr R be positive-definite kernels onX and Y respectively Denote by HX and HY their respective RKHSs Weaddress the above filtering problem by estimating the kernel means of theposteriors

mxt |y1t=

intkX (middot xt )p(xt |y1t )dxt isin HX (t = 1 T ) (41)

These preserve all the information of the corresponding posteriors if thekernels are characteristic (see section 32) Therefore the resulting estimatesof these kernel means provide us the information of the posteriors as ex-plained in section 44

42 Algorithm KMCF iterates three steps of prediction correction andresampling for each time t Suppose that we have just finished the iterationat time t minus 1 Then as shown later the resampling step yields the followingestimator of equation 41 at time t minus 1

Filtering with State-Observation Examples 395

Figure 2 One iteration of KMCF Here X1 X8 and Y1 Y8 denote statesand observations respectively in the state-observation examples (XiYi)n

i=1(suppose n = 8) 1 Prediction step The kernel mean of the prior equation 45 isestimated by sampling with the transition model p(xt |xtminus1) 2 Correction stepThe kernel mean of the posterior equation 41 is estimated by applying kernelBayesrsquo rule (see algorithm 1) The estimation makes use of the informationof the prior (expressed as m

π= (mxt |y1tminus1

(Xi)) isin R8) as well as that of a new

observation yt (expressed as kY = (kY (ytYi)) isin R8) The resulting estimate

equation 46 is expressed as a weighted sample (wti Xi)ni=1 Note that the

weights may be negative 3 Resampling step Samples associated with smallweights are eliminated and those with large weights are replicated by applyingkernel herding (see algorithm 2) The resulting samples provide an empiricalkernel mean equation 47 which will be used in the next iteration

mxtminus1|y1tminus1= 1

n

nsumi=1

kX (middot Xtminus1i) (42)

where Xtminus11 Xtminus1n isin X We show one iteration of KMCF that estimatesthe kernel mean (41) at time t (see also Figure 2)

421 Prediction Step The prediction step is as follows We generate asample from the transition model for each Xtminus1i in equation 42

Xti sim p(xt |xtminus1 = Xtminus1i) (i = 1 n) (43)

396 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

We then specify a new empirical kernel mean

mxt |y1tminus1= 1

n

nsumi=1

kX (middot Xti) (44)

This is an estimator of the following kernel mean of the prior

mxt |y1tminus1=

intkX (middot xt )p(xt |y1tminus1)dxt isin HX (45)

where

p(xt |y1tminus1) =int

p(xt |xtminus1)p(xtminus1|y1tminus1)dxtminus1

is the prior distribution of the current state xt Thus equation 44 serves asa prior for the subsequent posterior estimation

In section 5 we theoretically analyze this sampling procedure in detailand provide justification of equation 44 as an estimator of the kernel meanequation 45 We emphasize here that such an analysis is necessary eventhough the sampling procedure is similar to that of a particle filter thetheory of particle methods does not provide a theoretical justification ofequation 44 as a kernel mean estimator since it deals with probabilities asempirical distributions

422 Correction Step This step estimates the kernel mean equation 41of the posterior by using kernel Bayesrsquo rule (see algorithm 1) in section 33This makes use of the new observation yt the state-observation examples(XiYi)n

i=1 and the estimate equation 44 of the priorThe input of algorithm 1 consists of (1) vectors

kY = (kY (ytY1) kY (ytYn))T isin Rn

mπ = (mxt |y1tminus1(X1) mxt |y1tminus1

(Xn))T

=(

1n

nsumi=1

kX (Xq Xti)

)n

q=1

isin Rn

which are interpreted as expressions of yt and mxt |y1tminus1using the sample

(XiYi)ni=1 (2) kernel matrices GX = (kX (Xi Xj)) GY = (kY (YiYj)) isin

Rntimesn and (3) regularization constants ε δ gt 0 These constants ε δ as well

as kernels kX kY are hyperparameters of KMCF (we discuss how to choosethese parameters later)

Filtering with State-Observation Examples 397

Algorithm 1 outputs a weight vector w = (w1 wn) isin Rn Normaliz-

ing these weights wt = wsumn

i=1 wi we obtain an estimator of equation 413

as

mxt |y1t=

nsumi=1

wtikX (middot Xi) (46)

The apparent difference from a particle filter is that the posterior (ker-nel mean) estimator equation 46 is expressed in terms of the samplesX1 Xn in the training sample (XiYi)n

i=1 not with the samples fromthe prior equation 44 This requires that the training samples X1 Xncover the support of posterior p(xt |y1t ) sufficiently well If this does nothold we cannot expect good performance for the posterior estimate Notethat this is also true for any methods that deal with the setting of this letterpoverty of training samples in a certain region means that we do not haveany information about the observation model p(yt |xt ) in that region

423 Resampling Step This step applies the update equations 36 and37 of kernel herding in section 35 to the estimate equation 46 This is toobtain samples Xt1 Xtn such that

mxt |y1t= 1

n

nsumi=1

kX (middot Xti) (47)

is close to equation 46 in the RKHS Our theoretical analysis in section 5shows that such a procedure can reduce the error of the prediction step attime t + 1

The procedure is summarized in algorithm 2 Specifically we gener-ate each Xti by searching the solution of the optimization problem in

3For this normalization procedure see the discussion in section 43

398 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

equations 36 and 37 from a finite set of samples X1 Xn in equation46 We allow repetitions in Xt1 Xtn We can expect that the resultingequation 47 is close to equation 46 in the RKHS if the samples X1 Xncover the support of the posterior p(xt |y1t ) sufficiently This is verified bythe theoretical analysis of section 53

Here searching for the solutions from a finite set reduces the computa-tional costs of kernel herding It is possible to search from the entire space Xif we have sufficient time or if the sample size n is small enough it dependson applications and available computational resources We also note thatthe size of the resampling samples is not necessarily n this depends on howaccurately these samples approximate equation 46 Thus a smaller numberof samples may be sufficient In this case we can reduce the computationalcosts of resampling as discussed in section 52

The aim of our resampling step is similar to that of the resamplingstep of a particle filter (see Doucet amp Johansen 2011) Intuitively the aimis to eliminate samples with very small weights and replicate those withlarge weights (see Figures 2 and 3) In particle methods this is realized bygenerating samples from the empirical distribution defined by a weightedsample (therefore this procedure is called resampling) Our resamplingstep is a realization of such a procedure in terms of the kernel mean embed-ding we generate samples Xt1 Xtn from the empirical kernel meanequation 46

Note that the resampling algorithm of particle methods is not appropri-ate for use with kernel mean embeddings This is because it assumes thatweights are positive but our weights in equation 46 can be negative asthis equation is a kernel mean estimator One may apply the resamplingalgorithm of particle methods by first truncating the samples with nega-tive weights However there is no guarantee that samples obtained by thisheuristic produce a good approximation of equation 46 as a kernel meanas shown by experiments in section 61 In this sense the use of kernel herd-ing is more natural since it generates samples that approximate a kernelmean

424 Overall Algorithm We summarize the overall procedure of KMCFin algorithm 3 where pinit denotes a prior distribution for the initial state x1For each time t KMCF takes as input an observation yt and outputs a weightvector wt = (wt1 wtn)T isin R

n Combined with the samples X1 Xnin the state-observation examples (XiYi)n

i=1 these weights provide anestimator equation 46 of the kernel mean of posterior equation 41

We first compute kernel matrices GX GY (lines 4ndash5) which are usedin algorithm 1 of kernel Bayesrsquo rule (line 15) For t = 1 we generate aniid sample X11 X1n from the initial distribution pinit (line 8) whichprovides an estimator of the prior corresponding to equation 44 Line 10 isthe resampling step at time t minus 1 and line 11 is the prediction step at timet Lines 13 to 16 correspond to the correction step

Filtering with State-Observation Examples 399

43 Discussion The estimation accuracy of KMCF can depend on sev-eral factors in practice

431 Training Samples We first note that training samples (XiYi)ni=1

should provide the information concerning the observation model p(yt |xt )For example (XiYi)n

i=1 may be an iid sample from a joint distributionp(xy) on X times Y which decomposes as p(x y) = p(y|x)p(x) Here p(y|x) isthe observation model and p(x) is some distribution on X The support ofp(x) should cover the region where states x1 xT may pass in the testphase as discussed in section 42 For example this is satisfied when thestate-space X is compact and the support of p(x) is the entire X

Note that training samples (XiYi)ni=1 can also be non-iid in prac-

tice For example we may deterministically select X1 Xn so that theycover the region of interest In location estimation problems in roboticsfor instance we may collect location-sensor examples (XiYi)n

i=1 so thatlocations X1 Xn cover the region where location estimation is to beconducted (Quigley et al 2010)

432 Hyperparameters As in other kernel methods in general the perfor-mance of KMCF depends on the choice of its hyperparameters which arethe kernels kX and kY (or parameters in the kernelsmdasheg the bandwidthof the gaussian kernel) and the regularization constants δ ε gt 0 We needto define these hyperparameters based on the joint sample (XiYi)n

i=1 be-fore running the algorithm on the test data y1 yT This can be done bycross-validation Suppose that (XiYi)n

i=1 is given as a sequence from the

400 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

state-space model We can then apply two-fold cross-validation by dividingthe sequence into two subsequences If (XiYi)n

i=1 is not a sequence we canrely on the cross-validation procedure for kernel Bayesrsquo rule (see section 42of Fukumizu et al 2013)

433 Normalization of Weights We found in our preliminary experimentsthat normalization of the weights (see line 16 algorithm 3) is beneficial tothe filtering performance This may be justified by the following discus-sion about a kernel mean estimator in general Let us consider a consis-tent kernel mean estimator mP = sumn

i=1 wik(middot Xi) such that limnrarrinfin mP minusmPH = 0 Then we can show that the sum of the weights converges to 1limnrarrinfin

sumni=1 wi = 1 under certain assumptions (Kanagawa amp Fukumizu

2014) This could be explained as follows Recall that the weighted averagesumni=1 wi f (Xi) of a function f is an estimator of the expectation

intf (x)dP(x)

Let f be a function that takes the value 1 for any input f (x) = 1 forallx isin X Then we have

sumni=1 wi f (Xi) = sumn

i=1 wi andint

f (x)dP(x) = 1 Thereforesumni=1 wi is an estimator of 1 In other words if the error mP minus mPH is

small then the sum of the weightssumn

i=1 wi should be close to 1 Converselyif the sum of the weights is far from 1 it suggests that the estimate mP isnot accurate Based on this theoretical observation we suppose that nor-malization of the weights (this makes the sum equal to 1) results in a betterestimate

434 Time Complexity For each time t the naive implementation of algo-rithm 3 requires a time complexity of O(n3) for the size n of the joint sample(XiYi)n

i=1 This comes from algorithm 1 in line 15 (kernel Bayesrsquo rule) andalgorithm 2 in line 10 (resampling) The complexity O(n3) of algorithm 1 isdue to the matrix inversions Note that one of the inversions (GX + nεIn)minus1can be computed before the test phase as it does not involve the test dataAlgorithm 2 also has complexity O(n3) In section 52 we will explain howthis cost can be reduced to O(n2) by generating only lt n samples byresampling

435 Speeding Up Methods In appendix C we describe two methods forreducing the computational costs of KMCF both of which only need tobe applied prior to the test phase The first is a low-rank approximationof kernel matrices GX GY which reduces the complexity to O(nr2) wherer is the rank of low-rank matrices Low-rank approximation works wellin practice since eigenvalues of a kernel matrix often decay very rapidlyIndeed this has been theoretically shown for some cases (see Widom 19631964 and discussions in Bach amp Jordan 2002) Second is a data-reductionmethod based on kernel herding which efficiently selects joint subsamplesfrom the training set (XiYi)n

i=1 Algorithm 3 is then applied based onlyon those subsamples The resulting complexity is thus O(r3) where r is the

Filtering with State-Observation Examples 401

number of subsamples This method is motivated by the fast convergencerate of kernel herding (Chen et al 2010)

Both methods require the number r to be chosen which is either the rankfor low-rank approximation or the number of subsamples in data reductionThis determines the trade-off between accuracy and computational timeIn practice there are two ways of selecting the number r By regardingr as a hyperparameter of KMCF we can select it by cross-validation orwe can choose r by comparing the resulting approximation error which ismeasured in a matrix norm for low-rank approximation and in an RKHSnorm for the subsampling method (For details see appendix C)

436 Transfer Learning Setting We assumed that the observation modelin the test phase is the same as for the training samples However thismight not hold in some situations For example in the vision-based local-ization problem the illumination conditions for the test and training phasesmight be different (eg the test is done at night while the training samplesare collected in the morning) Without taking into account such a signifi-cant change in the observation model KMCF would not perform well inpractice

This problem could be addressed by exploiting the framework of transferlearning (Pan amp Yang 2010) This framework aims at situations where theprobability distribution that generates test data is different from that oftraining samples The main assumption is that there exist a small number ofexamples from the test distribution Transfer learning then provides a wayof combining such test examples and abundant training samples therebyimproving the test performance The application of transfer learning in oursetting remains a topic for future research

44 Estimation of Posterior Statistics By algorithm 3 we obtain theestimates of the kernel means of posteriors equation 41 as

mxt |y1t=

nsumi=1

wtikX (middot Xi) (t = 1 T ) (48)

These contain the information on the posteriors p(xt |y1t ) (see sections 32and 34) We now show how to estimate statistics of the posteriors usingthese estimates For ease of presentation we consider the case X = R

dTheoretical arguments to justify these operations are provided by Kana-gawa and Fukumizu (2014)

441 Mean and Covariance Consider the posterior meanint

xt p(xt |y1t )dxtisin R

d and the posterior (uncentered) covarianceint

xtxTt p(xt |y1t )dxt isin R

dtimesd

402 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

These quantities can be estimated as

nsumi=1

wtiXi (mean)

nsumi=1

wtiXiXTi (covariance)

442 Probability Mass Let A sub X be a measurable set with smoothboundary Define the indicator function IA(x) by IA(x) = 1 for x isin A andIA(x) = 0 otherwise Consider the probability mass

intIA(x)p(xt |y1t )dxt This

can be estimated assumn

i=1 wtiIA(Xi)

443 Density Suppose p(xt |y1t ) has a density function Let J(x) be asmoothing kernel satisfying

intJ(x)dx = 1 and J(x) ge 0 Let h gt 0 and define

Jh(x) = 1hd J

( xh

) Then the density of p(xt |y1t ) can be estimated as

p(xt |y1t ) =nsum

i=1

wtiJh(xt minus Xi) (49)

with an appropriate choice of h

444 Mode The mode may be obtained by finding a point that maxi-mizes equation 49 However this requires a careful choice of h Instead wemay use Ximax

with imax = arg maxi wti as a mode estimate This is the pointin X1 Xn that is associated with the maximum weight in wt1 wtnThis point can be interpreted as the point that maximizes equation 49 inthe limit of h rarr 0

445 Other Methods Other ways of using equation 48 include the preim-age computation and fitting of gaussian mixtures (See eg Song et al 2009Fukumizu et al 2013 McCalman OrsquoCallaghan amp Ramos 2013)

5 Theoretical Analysis

In this section we analyze the sampling procedure of the prediction stepin section 42 Specifically we derive an upper bound on the error of theestimator 44 We also discuss in detail how the resampling step in section 42works as a preprocessing step of the prediction step

To make our analysis clear we slightly generalize the setting of theprediction step and discuss the sampling and resampling procedures inthis setting

51 Error Bound for the Prediction Step Let X be a measurable spaceand P be a probability distribution on X Let p(middot|x) be a conditional

Filtering with State-Observation Examples 403

distribution on X conditioned on x isin X Let Q be a marginal distributionon X defined by Q(B) = int

p(B|x)dP(x) for all measurable B sub X In the fil-tering setting of section 4 the space X corresponds to the state space andthe distributions P p(middot|x) and Q correspond to the posterior p(xtminus1|y1tminus1)

at time t minus 1 the transition model p(xt |xtminus1) and the prior p(xt |y1tminus1) attime t respectively

Let kX be a positive-definite kernel onX andHX be the RKHS associatedwith kX Let mP = int

kX (middot x)dP(x) and mQ = intkX (middot x)dQ(x) be the kernel

means of P and Q respectively Suppose that we are given an empiricalestimate of mP as

mP =nsum

i=1

wikX (middot Xi) (51)

where w1 wn isin R and X1 Xn isin X Considering this weighted sam-ple form enables us to explain the mechanism of the resampling step

The prediction step can then be cast as the following procedure for eachsample Xi we generate a new sample X prime

i with the conditional distributionX prime

i sim p(middot|Xi) Then we estimate mQ by

mQ =nsum

i=1

wikX (middot X primei ) (52)

which corresponds to the estimate 44 of the prior kernel mean at time tThe following theorem provides an upper bound on the error of equa-

tion 52 and reveals properties of equation 51 that affect the error of theestimator equation 52 The proof is given in appendix A

Theorem 1 Let mP be a fixed estimate of mP given by equation 51 Define afunction θ on X times X by θ (x1 x2) =

int intkX (xprime

1 xprime2)dp(xprime

1|x1)dp(xprime2|x2)forallx1 x2 isin

X times X and assume that θ is included in the tensor RKHS HX otimes HX 4 The

4 The tensor RKHS HX otimes HX is the RKHS of a product kernel kXtimesX on X times X de-fined as kXtimesX ((xa xb) (xc xd )) = kX (xa xc)kX (xb xd ) forall(xa xb) (xc xd ) isin X times X Thisspace HX otimes HX consists of smooth functions on X times X if the kernel kX is smooth (egif kX is gaussian see section 4 of Steinwart amp Christmann 2008) In this case we caninterpret this assumption as requiring that θ be smooth as a function on X times X

The function θ can be written as the inner product between the kernel means ofthe conditional distributions θ (x1 x2) = 〈mp(middot|x1 )

mp(middot|x2 )〉HX

where mp(middot|x)= int

kX (middot xprime)dp(xprime|x) Therefore the assumption may be further seen as requiring that the mapx rarr mp(middot|x)

be smooth Note that while similar assumptions are common in the litera-ture on kernel mean embeddings (eg theorem 5 of Fukumizu et al 2013) we may relaxthis assumption by using approximate arguments in learning theory (eg theorems 22and 23 of Eberts amp Steinwart 2013) This analysis remains a topic for future research

404 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

estimator mQ equation 52 then satisfies

EXprime1Xprime

n[mQ minus mQ2

HX]

lensum

i=1

w2i (EXprime

i[kX (Xprime

i Xprimei )] minus EXprime

i Xprimei[kX (Xprime

i Xprimei )]) (53)

+ mP minus mP2HX

θHX otimesHX (54)

where Xprimei sim p(middot|Xi ) and Xprime

i is an independent copy of Xprimei

From theorem 1 we can make the following observations First thesecond term equation 54 of the upper bound shows that the error of theestimator equation 52 is likely to be large if the given estimate equation 51has large error mP minus mP2

HX which is reasonable to expect

Second the first term equation 53 shows that the error of equation52 can be large if the distribution of X prime

i (ie p(middot|Xi)) has large varianceFor example suppose X prime

i = f (Xi) + εi where f X rarr X is some mappingand εi is a random variable with mean 0 Let kX be the gaussian ker-nel kX (x xprime) = exp(minusx minus xprime22α) for some α gt 0 Then EX prime

i[kX (X prime

i X primei )] minus

EX primei X

primei[kX (X prime

i X primei )] increases from 0 to 1 as the variance of εi (ie the vari-

ance of X primei ) increases from 0 to infinity Therefore in this case equation

53 is upper-bounded at worst bysumn

i=1 w2i Note that EX prime

i[kX (X prime

i X primei )] minus

EX primei X

primei[kX (X prime

i X primei )] is always nonnegative5

511 Effective Sample Size Now let us assume that the kernel kX isbounded there is a constant C gt 0 such that supxisinX kX (x x) lt C Thenthe inequality of theorem 1 can be further bounded as

EX prime1X

primen[mQ minusmQ2

HX] le 2C

nsumi=1

w2i +mP minusmP2

HXθHX otimesHX

(55)

This bound shows that two quantities are important in the estimateequation 51 (1) the sum of squared weights

sumni=1 w2

i and (2) the errormP minus mP2

HX In other words the error of equation 52 can be large if the

quantitysumn

i=1 w2i is large regardless of the accuracy of equation 51 as an

estimator of mP In fact the estimator of the form 51 can have largesumn

i=1 w2i

even when mP minus mP2HX

is small as shown in section 61

5To show this it is sufficient to prove thatintint

kX (x x)dP(x)dP(x) le intkX (x x)dP(x)

for any probability P This can be shown as followsintint

kX (x x)dP(x)dP(x) = intint 〈kX (middot x)

kX (middot x)〉HXdP(x)dP(x) le intint radic

kX (x x)radic

kX (x x)dP(x)dP(x) le intkX (x x)dP(x) Here we

used the reproducing property the Cauchy-Schwartz inequality and Jensenrsquos inequality

Filtering with State-Observation Examples 405

Figure 3 An illustration of the sampling procedure with (right) and without(left) the resampling algorithm The left panel corresponds to the kernel meanestimators equations 51 and 52 in section 51 and the right panel correspondsto equations 56 and 57 in section 52

The inverse of the sum of the squared weights 1sumn

i=1 w2i can be in-

terpreted as the effective sample size (ESS) of the empirical kernel meanequation 51 To explain this suppose that the weights are normalizedsumn

i=1 wi = 1 Then ESS takes its maximum n when the weights are uniformw1 = middot middot middot wn = 1n It becomes small when only a few samples have largeweights (see the left side in Figure 3) Therefore the bound equation 55can be interpreted as follows To make equation 52 a good estimator ofmQ we need to have equation 51 such that the ESS is large and the errormP minus mPH is small Here we borrowed the notion of ESS from the litera-ture on particle methods in which ESS has also been played an importantrole (see section 253 of Liu 2001 and section 35 of Doucet amp Johansen2011)

52 Role of Resampling Based on these arguments we explain how theresampling step in section 42 works as a preprocessing step for the samplingprocedure Consider mP in equation 51 as an estimate equation 46 givenby the correction step at time t minus 1 Then we can think of mQ equation 52as an estimator of the kernel mean equation 45 of the prior without theresampling step

The resampling step is application of kernel herding to mP to obtainsamples X1 Xn which provide a new estimate of mP with uniformweights

mP = 1n

nsumi=1

kX (middot Xi) (56)

The subsequent prediction step is to generate a sample X primei sim p(middot|Xi) for each

Xi (i = 1 n) and estimate mQ as

406 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

mQ = 1n

nsumi=1

kX (middot X primei ) (57)

Theorem 1 gives the following bound for this estimator that corresponds toequation 55

EX prime1X

primen[mQ minus mQ2

HX] le 2C

n+ mP minus mP2

HθHX otimesHX (58)

A comparison of the upper bounds of equations 55 and 58 implies thatthe resampling step is beneficial when

sumni=1 w2

i is large (ie the ESS is small)and mP minus mPHX

is small The condition on mP minus mPHXmeans that the

loss by kernel herding (in terms of the RKHS distance) is small This impliesmP minus mPHX

asymp mP minus mPHX so the second term of equation 58 is close

to that of equation 55 On the other hand the first term of equation 58 willbe much smaller than that of equation 55 if

sumni=1 w2

i 1n In other wordsthe resampling step improves the accuracy of the sampling procedure byincreasing the ESS of the kernel mean estimate mP This is illustrated inFigure 3

The above observations lead to the following procedures

521 When to Apply Resampling Ifsumn

i=1 w2i is not large the gain by

the resampling step will be small Therefore the resampling algorithmshould be applied when

sumni=1 w2

i is above a certain threshold say 2n Thesame strategy has been commonly used in particle methods (see Doucet ampJohansen 2011)

Also the bound equation 53 of theorem 1 shows that resampling is notbeneficial if the variance of the conditional distribution p(middot|x) is very small(ie if state transition is nearly deterministic) In this case the error of thesampling procedure may increase due to the loss mP minus mPHX

caused bykernel herding

522 Reduction of Computational Cost Algorithm 2 generates n samplesX1 Xn with time complexity O(n3) Suppose that the first samplesX1 X where lt n already approximate mP well 1

sumi=1 kX (middot Xi) minus

mPHXis small We do not then need to generate the rest of samples

X+1 Xn we can make n samples by copying the samples n times(suppose n can be divided by for simplicity say n = 2) Let X1 Xn

denote these n samples Then 1

sumi=1 kX (middot Xi) = 1

n

sumni=1 kX (middot Xi) by defi-

nition so 1n

sumni=1 kX (middot Xi) minus mPHX

is also small This reduces the time

complexity of algorithm 2 to O(n2)One might think that it is unnecessary to copy n times to make n

samples This is not true however Suppose that we just use the first

Filtering with State-Observation Examples 407

samples to define mP = 1

sumi=1 kX (middot Xi) Then the first term of equation

58 becomes 2C which is larger than 2Cn of n samples This differenceinvolves sampling with the conditional distribution X prime

i sim p(middot|Xi) If we usejust the samples sampling is done times If we use the copied n samplessampling is done n times Thus the benefit of making n samples comesfrom sampling with the conditional distribution many times This matchesthe bound of theorem 1 where the first term involves the variance of theconditional distribution

53 Convergence Rates for Resampling Our resampling algorithm (seealgorithm 2) is an approximate version of kernel herding in section 35algorithm 2 searches for the solutions of the update equations 36 and 37from a finite set X1 Xn sub X not from the entire space X Thereforeexisting theoretical guarantees for kernel herding (Chen et al 2010 Bachet al 2012) do not apply to algorithm 2 Here we provide a theoreticaljustification

531 Generalized Version We consider a slightly generalized versionshown in algorithm 4 It takes as input (1) a kernel mean estimator mP of akernel mean mP (2) candidate samples Z1 ZN and (3) the number ofresampling It then outputs resampling samples X1 X isin Z1 ZNwhich form a new estimator mP = 1

sumi=1 kX (middot Xi) Here N is the number

of the candidate samplesAlgorithm 4 searches for solutions of the update equations 36 and 37

from the candidate set Z1 ZN Note that here these samples Z1 ZNcan be different from those expressing the estimator mP If they are thesamemdashthe estimator is expressed as mP = sumn

i=1 wtik(middot Xi) with n = N andXi = Zi (i = 1 n)mdashthen algorithm 4 reduces to algorithm 2 In facttheorem 2 allows mP to be any element in the RKHS

532 Convergence Rates in Terms of N and Algorithm 4 gives the newestimator mP of the kernel mean mP The error of this new estimatormP minus mPHX

should be close to that of the given estimator mP minus mPHX

408 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Theorem 2 guarantees this In particular it provides convergence rates ofmP minus mPHX

approaching mP minus mPHX as N and go to infinity This

theorem follows from theorem 3 in appendix B which holds under weakerassumptions

Theorem 2 Let mP be the kernel mean of a distribution P and mP be any element inthe RKHS HX Let Z1 ZN be an iid sample from a distribution with densityq Assume that P has a density function p such that supxisinX p(x)q (x) lt infin LetX1 X be samples given by algorithm 4 applied to mP with candidate samplesZ1 ZN Then for mP = 1

sumi=1 k(middot Xi ) we have

mP minus mP2HX

= (mP minus mPHX+ Op(Nminus12))2 + O

(ln

) (N rarr infin) (59)

Our proof in appendix B relies on the fact that kernel herding can beseen as the Frank-Wolfe optimization method (Bach et al 2012) Indeedthe error O(ln ) in equation 59 comes from the optimization error of theFrank-Wolfe method after iterations (Freund amp Grigas 2014 bound 32)The error Op(N

minus12) is due to the approximation of the solution space by afinite set Z1 ZN These errors will be small if N and are large enoughand the error of the given estimator mP minus mPHX

is relatively large Thisis formally stated in corollary 1 below

Theorem 2 assumes that the candidate samples are iid with a density qThe assumption supxisinX p(x)q(x) lt infin requires that the support of q con-tains that of p This is a formal characterization of the explanation in section42 that the samples X1 XN should cover the support of P sufficientlyNote that the statement of theorem 2 also holds for non-iid candidatesamples as shown in theorem 3 of appendix B

533 Convergence Rates as mP Goes to mP Theorem 2 provides conver-gence rates when the estimator mP is fixed In corollary 1 below we let mPapproach mP and provide convergence rates for mP of algorithm 4 approach-ing mP This corollary directly follows from theorem 2 since the constantterms in Op(N

minus12) and O(ln ) in equation 59 do not depend on mPwhich can be seen from the proof in section B

Corollary 1 Assume that P and Z1 ZN satisfy the conditions in theorem 2for all N Let m(n)

P be an estimator of mP such that m(n)P minus mPHX

= Op(nminusb) as

n rarr infin for some constant b gt 06 Let N = = n2b Let X(n)1 X(n)

be samples

6Here the estimator m(n)P and the candidate samples Z1 ZN can be dependent

Filtering with State-Observation Examples 409

given by algorithm 4 applied to m(n)P with candidate samples Z1 ZN Then

for m(n)P = 1

sumi=1 kX (middot X(n)

i ) we have

m(n)P minus mPHX

= Op(nminusb) (n rarr infin) (510)

Corollary 1 assumes that the estimator m(n)

P converges to mP at a rateOp(n

minusb) for some constant b gt 0 Then the resulting estimator m(n)

P by algo-rithm 4 also converges to mP at the same rate O(nminusb) if we set N = = n2bThis implies that if we use sufficiently large N and the errors Op(N

minus12)

and O(ln ) in equation 59 can be negligible as stated earlier Note thatN = = n2b implies that N and can be smaller than n since typically wehave b le 12 (b = 12 corresponds to the convergence rates of parametricmodels) This provides a support for the discussion in section 52 (reductionof computational cost)

534 Convergence Rates of Sampling after Resampling We can derive con-vergence rates of the estimator mQ equation 57 in section 52 Here we con-sider the following construction of mQ as discussed in section 52 (reductionof computational cost) First apply algorithm 4 to m(n)

P and obtain resam-pling samples X (n)

1 X (n) isin Z1 ZN Then copy these samples n

times and let X (n)

1 X (n)n be the resulting times n samples Finally

sample with the conditional distribution Xprime(n)i sim p(middot|Xi) (i = 1 n)

and define

m(n)

Q = 1n

nsumi=1

kX (middot Xprime(n)i ) (511)

The following corollary is a consequence of corollary 1 theorem 1 andthe bound equation 58 Note that theorem 1 obtains convergence in expec-tation which implies convergence in probability

Corollary 2 Let θ be the function defined in theorem 1 and assume θ isin HX otimes HX Assume that P and Z1 ZN satisfy the conditions in theorem 2 for all N Letm(n)

P be an estimator of mP such that m(n)P minus mPHX

= Op(nminusb) as n rarr infin forsome constant b gt 0 Let N = = n2b Then for the estimator m(n)

Q defined asequation 511 we have

m(n)Q minus mQHX

= Op(nminusmin(b12)) (n rarr infin)

Suppose b le 12 which holds with basically any nonparametric esti-mators Then corollary 2 shows that the estimator m(n)

Q achieves the sameconvergence rate as the input estimator m(n)

P Note that without resampling

410 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

the rate becomes Op(

radicsumni=1(w

(n)i )2 + nminusb) where the weights are given by

the input estimator m(n)

P = sumni=1 w

(n)i kX (middot X (n)

i ) (see the bound equation55) Thanks to resampling (the square root of) the sum of the squaredweights in the case of corollary 2 becomes 1

radicn le 1

radicn which is

usually smaller thanradicsumn

i=1(w(n)i )2 and is faster than or equal to Op(n

minusb)This shows the merit of resampling in terms of convergence rates (see alsothe discussions in section 52)

54 Consistency of the Overall Procedure Here we show the consis-tency of the overall procedure in KMCF This is based on corollary 2 whichshows the consistency of the resampling step followed by the predictionstep and on theorem 5 of Fukumizu et al (2013) which guarantees theconsistency of kernel Bayesrsquo rule in the correction step Thus we considerthree steps in the following order resampling prediction and correctionMore specifically we show the consistency of the estimator equation 46of the posterior kernel mean at time t given that the one at time t minus 1 isconsistent

To state our assumptions we will need the following functions θpos Y times Y rarr R θobs X times X rarr R and θtra X times X rarr R

θpos(y y) =intint

kX (xt xt )dp(xt |y1tminus1 yt = y)dp(xt |y1tminus1 yt = y)

(512)

θobs(x x) =intint

kY (yt yt )dp(yt |xt = x)dp(yt |xt = x) (513)

θtra(x x) =intint

kX (xt xt )dp(xt |xtminus1 = x)dp(xt |xtminus1 = x) (514)

These functions contain the information concerning the distributions in-volved In equation 512 the distribution p(xt |y1tminus1 yt = y) denotes theposterior of the state at time t given that the observation at time t is yt = ySimilarly p(xt |y1tminus1 yt = y) is the posterior at time t given that the observa-tion is yt = y In equation 513 the distributions p(yt |xt = x) and p(yt |xt = x)

denote the observation model when the state is xt = x or xt = x respectivelyIn equation 514 the distributions p(xt |xtminus1 = x) and p(xt |xtminus1 = x) denotethe transition model with the previous state given by xtminus1 = x or xtminus1 = xrespectively

For simplicity of presentation we consider here N = = n for the resam-pling step Below denote by F otimes G the tensor product space of two RKHSsF and G

Corollary 3 Let (X1 Y1) (Xn Yn) be an iid sample with a joint densityp(x y) = p(y|x)q (x) where p(y|x) is the observation model Assume that the

Filtering with State-Observation Examples 411

posterior p(xt|y1t) has a density p and that supxisinX p(x)q (x) lt infin Assumethat the functions defined by equations 512 to 514 satisfy θpos isin HY otimes HY θobs isin HX otimes HX and θtra isin HX otimes HX respectively Suppose that mxtminus1|y1tminus1

minusmxtminus1|y1tminus1

HXrarr 0 as n rarr infin in probability Then for any sufficiently slow decay

of regularization constants εn and δn of algorithm 1 we have

mxt |y1tminus mxt |y1t

HXrarr 0 (n rarr infin)

in probability

Corollary 3 follows from theorem 5 of Fukumizu et al (2013) and corol-lary 2 The assumptions θpos isin HY otimes HY and θobs isin HX otimes HX are due totheorem 5 of Fukumizu et al (2013) for the correction step while the as-sumption θtra isin HX otimes HX is due to theorem 1 for the prediction step fromwhich corollary 2 follows As we discussed in note 4 of section 51 these es-sentially assume that the functions θpos θobs and θtra are smooth Theorem 5of Fukumizu et al (2013) also requires that the regularization constantsεn δn of kernel Bayesrsquo rule should decay sufficiently slowly as the samplesize goes to infinity (εn δn rarr 0 as n rarr infin) (For details see sections 52 and62 in Fukumizu et al 2013)

It would be more interesting to investigate the convergence rates of theoverall procedure However this requires a refined theoretical analysis ofkernel Bayesrsquo rule which is beyond the scope of this letter This is becausecurrently there is no theoretical result on convergence rates of kernel Bayesrsquorule as an estimator of a posterior kernel mean (existing convergence resultsare for the expectation of function values see theorems 6 and 7 in Fukumizuet al 2013) This remains a topic for future research

6 Experiments

This section is devoted to experiments In section 61 we conduct basicexperiments on the prediction and resampling steps before going on to thefiltering problem Here we consider the problem described in section 5 Insection 62 the proposed KMCF (see algorithm 3) is applied to syntheticstate-space models Comparisons are made with existing methods applica-ble to the setting of the letter (see also section 2) In section 63 we applyKMCF to the real problem of vision-based robot localization

In the following N(μ σ 2) denotes the gaussian distribution with meanμ isin R and variance σ 2 gt 0

61 Sampling and Resampling Procedures The purpose here is to seehow the prediction and resampling steps work empirically To this end weconsider the problem described in section 5 with X = R (see section 51 fordetails) Specifications of the problem are described below

412 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

We will need to evaluate the errors mP minus mPHXand mQ minus mQHX

sowe need to know the true kernel means mP and mQ To this end we definethe distributions and the kernel to be gaussian this allows us to obtainanalytic expressions for mP and mQ

611 Distributions and Kernel More specifically we define the marginalP and the conditional distribution p(middot|x) to be gaussian P = N(0 σ 2

P ) andp(middot|x) = N(x σ 2

cond) Then the resulting Q = intp(middot|x)dP(x) also becomes

gaussian Q = N(0 σ 2P + σ 2

cond) We define kX to be the gaussian kernelkX (x xprime) = exp(minus(x minus xprime)22γ 2) We set σP = σcond = γ = 01

612 Kernel Means Due to the convolution theorem of gaussian func-tions the kernel means mP = int

kX (middot x)dP(x) and mQ = intkX (middot x)dQ(x)

can be analytically computed mP(x) =radic

γ 2

σ 2+γ 2 exp(minus x2

2(γ 2+σ 2P )

) mQ(x) =radicγ 2

(σ 2+σ 2cond+γ 2 )

exp(minus x2

2(σ 2P+σ 2

cond+γ 2 ))

613 Empirical Estimates We artificially defined an estimate mP =sumni=1 wikX (middot Xi) as follows First we generated n = 100 samples

X1 X100 from a uniform distribution on [minusA A] with some A gt 0 (spec-ified below) We computed the weights w1 wn by solving an optimiza-tion problem

minwisinRn

nsum

i=1

wikX (middot Xi) minus mP2H + λw2

and then applied normalization so thatsumn

i=1 wi = 1 Here λ gt 0 is a regular-ization constant which allows us to control the trade-off between the errormP minus mP2

HXand the quantity

sumni=1 w2

i = w2 If λ is very small the re-sulting mP becomes accurate mP minus mP2

HXis small but has large

sumni=1 w2

i If λ is large the error mP minus mP2

HXmay not be very small but

sumni=1 w2

i

becomes small This enables us to see how the error mQ minus mQ2HX

changesas we vary these quantities

614 Comparison Given mP = sumni=1 wikX (middot Xi) we wish to estimate the

kernel mean mQ We compare three estimators

bull woRes Estimate mQ without resampling Generate samples X primei sim

p(middot|Xi) to produce the estimate mQ = sumni=1 wikX (middot X prime

i ) This corre-sponds to the estimator discussed in section 51

bull Res-KH First apply the resampling algorithm of algorithm 2 to mPyielding X1 Xn Then generate X prime

i sim p(middot|Xi) for each Xi giving

Filtering with State-Observation Examples 413

Figure 4 Results of the experiments from section 61 (Top left and right)Sample-weight pairs of mP = sumn

i=1 wikX (middot Xi) and mQ = sumni=1 wik(middot X prime

i ) (Mid-dle left and right) Histogram of samples X1 Xn generated by algorithm2 and that of samples X prime

1 X primen from the conditional distribution (Bottom

left and right) Histogram of samples generated with multinomial resamplingafter truncating negative weights and that of samples from the conditionaldistribution

the estimate mQ = 1n

sumni=1 k(middot X prime

i ) This is the estimator discussed insection 52

bull Res-Trunc Instead of algorithm 2 first truncate negative weights inw1 wn to be 0 and apply normalization to make the sum of theweights to be 1 Then apply the multinomial resampling algorithmof particle methods and estimate mQ as Res-KH

615 Demonstration Before starting quantitative comparisons wedemonstrate how the above estimators work Figure 4 shows demonstra-tion results with A = 1 First note that for mP = sumn

i=1 wik(middot Xi) samplesassociated with large weights are located around the mean of P as thestandard deviation of P is relatively small σP = 01 Note also that some

414 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

of the weights are negative In this example the error of mP is very smallmP minus mP2

HX= 849e minus 10 while that of the estimate mQ given by woRes is

mQ minus mQ2HX

= 0125 This shows that even if mP minus mP2HX

is very small

the resulting mQ minus mQ2HX

may not be small as implied by theorem 1 andthe bound equation 55

We can observe the following First algorithm 2 successfully discardedsamples associated with very small weights Almost all the generatedsamples X1 Xn are located in [minus2σP 2σP] where σP is the standarddeviation of P The error is mP minus mP2

HX= 474e minus 5 which is greater

than mP minus mP2HX

This is due to the additional error caused by the re-sampling algorithm Note that the resulting estimate mQ is of the errormQ minus mQ2

HX= 000827 This is much smaller than the estimate mQ by

woRes showing the merit of the resampling algorithmRes-Trunc first truncated the negative weights in w1 wn Let us

see the region where the density of P is very smallmdashthe region outside[minus2σP 2σP] We can observe that the absolute values of weights are verysmall in this region Note that there exist positive and negative weightsThese weights maintain balance such that the amounts of positive and neg-ative values are almost the same Therefore the truncation of the negativeweights breaks this balance As a result the amount of the positive weightssurpasses the amount needed to represent the density of P This can be seenfrom the histogram for Res-Trunc some of the samples X1 Xn gener-ated by Res-Trunc are located in the region where the density of P is verysmall Thus the resulting error mP minus mP2

HX= 00538 is much larger than

that of Res-KH This demonstrates why the resampling algorithm of parti-cle methods is not appropriate for kernel mean embeddings as discussedin section 42

616 Effects of the Sum of Squared Weights The purpose here is to see howthe error mQ minus mQ2

HXchanges as we vary the quantity

sumni=1 w2

i (recall that

the bound equation 55 indicates that mQ minus mQ2HX

increases assumn

i=1 w2i

increases) To this end we made mP = sumni=1 wikX (middot Xi) for several values of

the regularization constant λ as described above For each λ we constructedmP and estimated mQ using each of the three estimators above We repeatedthis 20 times for each λ and averaged the values of mP minus mP2

HXsumn

i=1 w2i

and the errors mQ minus mQ2HX

by the three estimators Figure 5 shows theseresults where the both axes are in the log scale Here we used A = 5 forthe support of the uniform distribution7 The results are summarized asfollows

7This enables us to maintain the values for mP minus mP2HX

in almost the same amountwhile changing the values for

sumni=1 w2

i

Filtering with State-Observation Examples 415

Figure 5 Results of synthetic experiments for the sampling and resampling pro-cedure in section 61 Vertical axis errors in the squared RKHS norm Horizontalaxis values of

sumni=1 w2

i for different mP Black the error of mP (mP minus mP2HX

)Blue green and red the errors on mQ by woRes Res-KH and Res-Truncrespectively

bull The error of woRes (blue) increases proportionally to the amount ofsumni=1 w2

i This matches the bound equation 55bull The error of Res-KH is not affected by

sumni=1 w2

i Rather it changes inparallel with the error of mP This is explained by the discussions insection 52 on how our resampling algorithm improves the accuracyof the sampling procedure

bull Res-Trunc is worse than Res-KH especially for largesumn

i=1 w2i This

is also explained with the bound equation 58 Here mP is the onegiven by Res-Trunc so the error mP minus mPHX

can be large due tothe truncation of negative weights as shown in the demonstrationresults This makes the resulting error mQ minus mQHX

large

Note that mP and mQ are different kernel means so it can happen thatthe errors mQ minus mQHX

by Res-KH are less than mp minus mPHX as in

Figure 5

416 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

62 Filtering with Synthetic State-Space Models Here we applyKMCF to synthetic state-space models Comparisons were made with thefollowing methods

kNN-PF (Vlassis et al 2002) This method uses k-NN-based condi-tional density estimation (Stone 1977) for learning the observation modelFirst it estimates the conditional density of the inverse direction p(x|y)

from the training sample (XiYi) The learned conditional density is thenused as an alternative for the likelihood p(yt |xt ) this is a heuristic todeal with high-dimensional yt Then it applies particle filter (PF) basedon the approximated observation model and the given transition modelp(xt |xtminus1)

GP-PF (Ferris et al 2006) This method learns p(yt |xt ) from (XiYi)with gaussian process (GP) regression Then particle filter is applied basedon the learned observation model and the transition model We used theopen source code for GP-regression in this experiment so comparison incomputational time is omitted for this method8

KBR filter (Fukumizu et al 2011 2013) This method is also based onkernel mean embeddings as is KMCF It applies kernel Bayesrsquo rule (KBR) inposterior estimation using the joint sample (XiYi) This method assumesthat there also exist training samples for the transition model Thus inthe following experiments we additionally drew training samples for thetransition model Fukumizu et al (2011 2013) showed that this methodoutperforms extended and unscented Kalman filters when a state-spacemodel has strong nonlinearity (in that experiment these Kalman filterswere given the full knowledge of a state-space model) We use this methodas a baseline

We used state-space models defined in Table 2 where ldquoSSMrdquo stands forldquostate-space modelrdquo In Table 2 ut denotes a control input at time t vtand wt denotes independent gaussian noise vt wt sim N(0 1) Wt denotes10-dimensional gaussian noise Wt sim N(0 I10) We generated each controlut randomly from the gaussian distribution N(0 1)

The state and observation spaces for SSMs 1a 1b 2a 2b 4a 4b aredefined as X = Y = R for SSMs 3a 3b X = RY = R

10 The models inSSMs 1a 2a 3a 4a and SSMs 1b 2b 3b 4b with the same number (eg1a and 1b) are almost the same the difference is whether ut exists in thetransition model Prior distributions for the initial state x1 for SSMs 1a 1b2a 2b 3a 3b are defined as pinit = N(0 1(1 minus 092)) and those for 4a4b are defined as a uniform distribution on [minus3 3]

SSMs 1a and 1b are linear gaussian models SSMs 2a and 2b are the so-called stochastic volatility models Their transition models are the same asthose of SSMs 1a and 1b The observation model has strong nonlinearityand the noise wt is multiplicative SSMs 3a and 3b are almost the same as

8httpwwwgaussianprocessorggpmlcodematlabdoc

Filtering with State-Observation Examples 417

Table 2 State-Space Models for Synthetic Experiments

SSM Transition Model Observation Model

1a xt = 09xtminus1 + vt yt = xt + wt

1b xt = 09xtminus1 + 1radic2(ut + vt ) yt = xt + wt

2a xt = 09xtminus1 + vt yt = 05 exp(xt2)wt

2b xt = 09xtminus1 + 1radic2(ut + vt ) yt = 05 exp(xt2)wt

3a xt = 09xtminus1 + vt yt = 05 exp(xt2)Wt

3b xt = 09xtminus1 + 1radic2(ut + vt ) yt = 05 exp(xt2)Wt

4a at = xtminus1 + radic2vt bt = xt + wt

xt =

at (if |at | le 3)

minus3 (otherwise)yt =

bt (if |bt | le 3)

bt minus 6bt|bt | (otherwise)

4b at = xtminus1 + ut + vt bt = xt + wt

xt =

at (if |at | le 3)

minus3 (otherwise)yt =

bt (if |bt | le 3)

bt minus 6bt|bt | (otherwise)

SSMs 2a and 2b The difference is that the observation yt is 10-dimensionalas Wt is 10-dimensional gaussian noise SSMs 4a and 4b are more complexthan the other models Both the transition and observation models havestrong nonlinearities states and observations located around the edges ofthe interval [minus3 3] may abruptly jump to distant places

For each model we generated the training samples (XiYi)ni=1 by sim-

ulating the model Test data (xt yt )Tt=1 were also generated by indepen-

dent simulation (recall that xt is hidden for each method) The length ofthe test sequence was set as T = 100 We fixed the number of particles inkNN-PF and GP-PF to 5000 in primary experiments we did not observeany improvements even when more particles were used For the same rea-son we fixed the size of transition examples for KBR filter to 1000 Eachmethod estimated the ground-truth states x1 xT by estimating the pos-terior means

intxt p(xt |y1t )dxt (t = 1 T ) The performance was evaluated

with RMSE (root mean squared errors) of the point estimates defined as

RMSE =radic

1T

sumTt=1(xt minus xt )

2 where xt is the point estimateFor KMCF and KBR filter we used gaussian kernels for each of X and Y

(and also for controls in KBR filter) We determined the hyperparameters ofeach method by two-fold cross-validation by dividing the training data intotwo sequences The hyperparameters in the GP-regressor for PF-GP wereoptimized by maximizing the marginal likelihood of the training data Toreduce the costs of the resampling step of KMCF we used the method dis-cussed in section 52 with = 50 We also used the low-rank approximationmethod (see algorithm 5) and the subsampling method (see algorithm 6)in appendix C to reduce the computational costs of KMCF Specifically we

418 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Figure 6 RMSE of the synthetic experiments in section 62 The state-spacemodels of these figures have no control in their transition models

used r = 10 20 (rank of low-rank matrices) for algorithm 5 (described asKMCF-low10 and KMCF-low20 in the results below) r = 50 100 (numberof subsamples) for algorithm 6 (described as KMCF-sub50 and KMCF-sub100) We repeated experiments 20 times for each of different trainingsample size n

Figure 6 shows the results in RMSE for SSMs 1a 2a 3a 4a and Figure 7shows those for SSMs 1b 2b 3b 4b Figure 8 describes the results incomputational time for SSMs 1a and 1b the results for the other models aresimilar so we omit them We do not show the results of KMCF-low10 inFigures 6 and 7 since they were numerically unstable and gave very largeRMSEs

GP-PF performed the best for SSMs 1a and 1b This may be becausethese models fit the assumption of GP-regression as their noise is addi-tive gaussian For the other models however GP-PF performed poorly

Filtering with State-Observation Examples 419

Figure 7 RMSE of synthetic experiments in section 62 The state-space modelsof these figures include control ut in their transition models

Figure 8 Computation time of synthetic experiments in section 62 (Left) SSM1a (Right) SSM 1b

420 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

the observation models of these models have strong nonlinearities and thenoise is not additive gaussian For these models KMCF performed thebest or competitively with the other methods This indicates that KMCFsuccessfully exploits the state-observation examples (XiYi)n

i=1 in dealingwith the complicated observation models Recall that our focus has beenon situations where the relations between states and observations are socomplicated that the observation model is not known the results indicatethat KMCF is promising for such situations On the other hand the KBRfilter performed worse than KMCF for most of the models KBF filter alsouses kernel Bayesrsquo rule as KMCF The difference is that KMCF makes use ofthe transition models directly by sampling while the KBR filter must learnthe transition models from training data for state transitions This indicatesthat the incorporation of the knowledge expressed in the transition model isvery important for the filtering performance This can also be seen by com-paring Figures 6 and 7 The performance of the methods other than KBRfilter improved for SSMs 1b 2b 3b 4b compared to the performance forthe corresponding models in SSMs 1a 2a 3a 4a Recall that SSMs 1b2b 3b 4b include control ut in their transition models The informationof control input is helpful for filtering in general Thus the improvementssuggest that KMCF kNN-PF and GP-PF successfully incorporate the in-formation of controls they achieve this by sampling with p(xt |xtminus1 ut ) Onthe other hand KBF filter must learn the transition model p(xt |xtminus1 ut ) thiscan be harder than learning the transition model p(xt |xtminus1) which has nocontrol input

We next compare computation time (see Figure 8) KMCF was competi-tive or even slower than the KBR filter This is due to the resampling step inKMCF The speed-up methods (KMCF-low10 KMCF-low20 KMCF-sub50and KMCF-sub100) successfully reduced the costs of KMCF KMCF-low10and KMCF-low20 scaled linearly to the sample size n this matches the factthat algorithm 5 reduces the costs of kernel Bayesrsquo Rule to O(nr2) The costsof KMCF-sub50 and KMCF-sub100 remained almost the same over the dif-ferent sample sizes This is because they reduce the sample size itself fromn to r so the costs are reduced to O(r3) (see algorithm 6) KMCF-sub50 andKMCF-sub100 are competitive to kNN-PF which is fast as it only needskNN searches to deal with the training sample (XiYi)n

i=1 In Figures 6and 7 KMCF-low20 and KMCF-sub100 produced the results competitiveto KMCF for SSMs 1a 2a 4a 1b 2b 4b Thus for these models suchmethods reduce the computational costs of KMCF without losing muchaccuracy KMCF-sub50 was slightly worse than KMCF-100 This indicatesthat the number of subsamples cannot be reduced to this extent if we wishto maintain accuracy For SSMs 3a and 3b the performance of KMCF-low20and KMCF-sub100 was worse than KMCF in contrast to the performancefor the other models The difference of SSMs 3a and 3b from the other mod-els is that the observation space is 10-dimensional Y = R

10 This suggeststhat if the dimension is high r needs to be large to maintain accuracy (recall

Filtering with State-Observation Examples 421

that r is the rank of low-rank matrices in algorithm 5 and the number ofsubsamples in algorithm 6) This is also implied by the experiments in thesection 63

63 Vision-Based Mobile Robot Localization We applied KMCF to theproblem of vision-based mobile robot localization (Vlassis et al 2002 Wolfet al 2005 Quigley et al 2010) We consider a robot moving in a buildingThe robot takes images with its vision camera as it moves Thus the visionimages form a sequence of observations y1 yT in time series each ytis an image The robot does not know its positions in the building wedefine state xt as the robotrsquos position at time t The robot wishes to estimateits position xt from the sequence of its vision images y1 yt This canbe done by filtering that is by estimating the posteriors p(xt |y1 yt )

(t = 1 T ) This is the robot localization problem It is fundamental inrobotics as a basis for more involved applications such as navigation andreinforcement learning (Thrun Burgard amp Fox 2005)

The state-space model is defined as follows The observation modelp(yt |xt ) is the conditional distribution of images given position which isvery complicated and considered unknown We need to assume position-image examples (XiYi)n

i=1 these samples are given in the data setdescribed below The transition model p(xt |xtminus1) = p(xt |xtminus1 ut ) is the con-ditional distribution of the current position given the previous one Thisinvolves a control input ut that specifies the movement of the robot In thedata set we use the control is given as odometry measurements Thus wedefine p(xt |xtminus1 ut ) as the odometry motion model which is fairly standardin robotics (Thrun et al 2005) Specifically we used the algorithm describedin table 56 of Thrun et al (2005) with all of its parameters fixed to 01 Theprior pinit of the initial position x1 is defined as a uniform distribution overthe samples X1 Xn in (XiYi)n

i=1As a kernel kY for observations (images) we used the spatial pyramid

matching kernel of Lazebnik et al (2006) This is a positive-definite kerneldeveloped in the computer vision community and is also fairly standardSpecifically we set the parameters of this kernel as suggested in Lazebniket al (2006) This gives a 4200-dimensional histogram for each image Wedefined the kernel kX for states (positions) as gaussian Here the state spaceis the four-dimensional space X = R

4 two dimensions for location and therest for the orientation of the robot9

The data set we used is the COLD database (Pronobis amp Caputo 2009)which is publicly available Specifically we used the data set Freiburg PartA Path 1 cloudy This data set consists of three similar trajectories of arobot moving in a building each of which provides position-image pairs(xt yt )T

t=1 We used two trajectories for training and validation and the

9We projected the robotrsquos orientation in [0 2π ] onto the unit circle in R2

422 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

rest for test We made state-observation examples (XiYi)ni=1 by randomly

subsampling the pairs in the trajectory for training Note that the difficultyof localization may depend on the time interval (ie the interval between tand t minus 1 in sec) Therefore we made three test sets (and training samplesfor state transitions in KBR filter) with different time intervals 227 sec(T = 168) 454 sec (T = 84) and 681 sec (T = 56)

In these experiments we compared KMCF with three methods kNN-PFKBR filter and the naive method (NAI) defined below For KBR filter wealso defined the gaussian kernel on the control ut that is on the differenceof odometry measurements at time t minus 1 and t The naive method (NAI)estimates the state xt as a point Xj in the training set (XiYi) such that thecorresponding observation Yj is closest to the observation yt We performedthis as a baseline We also used the spatial pyramid matching kernel forthese methods (for kNN-PF and NAI as a similarity measure of the nearestneighbors search) We did not compare with GP-PF since it assumes thatobservations are real vectors and thus cannot be applied to this problemstraightforwardly We determined the hyperparameters in each methodby cross-validation To reduce the cost of the resampling step in KMCFwe used the method discussed in section 52 with = 100 The low-rankapproximation method (see algorithm 5) and the subsampling method (seealgorithm 6) were also applied to reduce the computational costs of KMCFSpecifically we set r = 50 100 for algorithm 5 (described as KMCF-low50and KMCF-low100 in the results below) and r = 150 300 for algorithm 6(KMCF-sub150 and KMCF-sub300)

Note that in this problem the posteriors p(xt |y1t ) can be highly multi-modal This is because similar images appear in distant locations Thereforethe posterior mean

intxt p(xt |y1t )dxt is not appropriate for point estimation

of the ground-truth position xt Thus for KMCF and KBR filter we em-ployed the heuristic for mode estimation explained in section 44 For kNN-PF we used a particle with maximum weight for the point estimationWe evaluated the performance of each method by RMSE of location esti-mates We ran each experiment 20 times for each training set of differentsize

631 Results First we demonstrate the behaviors of KMCF with this lo-calization problem Figures 9 and 10 show iterations of KMCF with n = 400applied to the test data with time interval 681 sec Figure 9 illustrates itera-tions that produced accurate estimates while Figure 10 describes situationswhere location estimation is difficult

Figures 11 and 12 show the results in RMSE and computational timerespectively For all the results KMCF and that with the computational re-duction methods (KMCF-low50 KMCF-low100 KMCF-sub150 and KMCF-sub300) performed better than KBR filter These results show the bene-fit of directly manipulating the transition models with sampling KMCF

Filtering with State-Observation Examples 423

Figure 9 Demonstration results Each column corresponds to one iteration ofKMCF (Top) Prediction step histogram of samples for prior (Middle) Cor-rection step weighted samples for posterior The blue and red stems indi-cate positive and negative weights respectively The yellow ball representsthe ground-truth location xt and the green diamond the estimated one xt (Bottom) Resampling step histogram of samples given by the resampling step

was competitive with kNN-PF for the interval 227 sec note that kNN-PFwas originally proposed for the robot localization problem For the resultswith the longer time intervals (454 sec and 681 sec) KMCF outperformedkNN-PF

We next investigate the effect on KMCF of the methods to reduce com-putational cost The performance of KMCF-low100 and KMCF-sub300 iscompetitive with KMCF those of KMCF-low50 and KMCF-sub150 de-grade as the sample size increases Note that r = 50 100 for algorithm 5are larger than those in section 62 though the values of the sample size nare larger than those in section 62 Also note that the performance of KMCF-sub150 is much worse than KMCF-sub300 These results indicate that wemay need large values for r to maintain the accuracy for this localizationproblem Recall that the spatial pyramid matching kernel gives essentially ahigh-dimensional feature vector (histogram) for each observation Thus the

424 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Figure 10 Demonstration results (see also the caption of Figure 9) Here weshow time points where observed images are similar to those in distant placesSuch a situation often occurs at corners and makes location estimation difficult(a) The prior estimate is reasonable but the resulting posterior has modes indistant places This makes the location estimate (green diamond) far from thetrue location (yellow ball) (b) While the location estimate is very accuratemodes also appear at distant locations

observation space Y may be considered high-dimensional This supportsthe hypothesis in section 62 that if the dimension is high the computationalcost-reduction methods may require larger r to maintain accuracy

Finally let us look at the results in computation time (see Figure 12)The results are similar to those in section 62 Although the values for r arerelatively large algorithms 5 and 6 successfully reduced the computationalcosts of KMCF

7 Conclusion and Future Work

This letter has proposed kernel Monte Carlo filter a novel filtering methodfor state-space models We have considered the situation where the obser-vation model is not known explicitly or even parametrically and where

Filtering with State-Observation Examples 425

Figure 11 RMSE of the robot localization experiments in section 63 Panels a band c show the cases for time interval 227 sec 454 sec and 681 sec respectively

examples of the state-observation relation are given instead of the obser-vation model Our approach is based on the framework of kernel meanembeddings which enables us to deal with the observation model in adata-driven manner The resulting filtering method consists of the predic-tion correction and resampling steps all realized in terms of kernel meanembeddings Methodological novelties lie in the prediction and resamplingsteps Thus we analyzed their behaviors by deriving error bounds for theestimator of the prediction step The analysis revealed that the effectivesample size of a weighted sample plays an important role as in parti-cle methods This analysis also explained how our resampling algorithmworks We applied the proposed method to synthetic and real problemsconfirming the effectiveness of our approach

426 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Figure 12 Computation time of the localization experiments in section 63Panels a b and c show the cases for time interval 227 sec 454 sec and 681 secrespectively Note that the results show the runtime of each method

One interesting topic for future research would be parameter estimationfor the transition model We did not discuss this and assumed that param-eters if they exist are given and fixed If the state observation examples(XiYi)n

i=1 are given as a sequence from the state-space model then wecan use the state samples X1 Xn for estimating those parameters Oth-erwise we need to estimate the parameters based on test data This mightbe possible by exploiting approaches for parameter estimation in particlemethods (eg section IV in Cappe et al (2007))

Another important topic is the situation where the observation modelin the test and training phases is different As discussed in section 43 this

Filtering with State-Observation Examples 427

might be addressed by exploiting the framework of transfer learning (Panamp Yang 2010) This would require extension of kernel mean embeddingsto the setting of transfer learning since there has been no work in thisdirection We consider that such extension is interesting in its own right

Appendix A Proof of Theorem 1

Before going to the proof we review some basic facts that are neededLet mP = int

kX (middot x)dP(x) and mP = sumni=1 wikX (middot Xi) By the reproducing

property of the kernel kX the following hold for any f isin HX

〈mP f 〉HX=

langintkX (middot x)dP(x) f

rangHX

=int

〈kX (middot x) f 〉HXdP(x)

=int

f (x)dP(x) = EXsimP[ f (X)] (A1)

〈mP f 〉HX=

langnsum

i=1

wikX (middot Xi) f

rangHX

=nsum

i=1

wi f (Xi) (A2)

For any f g isin HX we denote by f otimes g isin HX otimes HX the tensor productof f and g defined as

f otimes g(x1 x2) = f (x1)g(x2) forallx1 x2 isin X (A3)

The inner product of the tensor RKHS HX otimes HX satisfies

〈 f1 otimesg1 f2 otimes g2〉HXotimesHX= 〈 f1 f2〉HX

〈g1 g2〉HXforall f1 f2 g1 g2 isinHX

(A4)

Let φiIs=1 sub HX be complete orthonormal bases of HX where I isin N cup infin

Assume θ isin HX otimes HX (recall that this is an assumption of theorem 1) Thenθ is expressed as

θ =Isum

st=1

αstφs otimes φt (A5)

withsum

st |αst |2 lt infin (see Aronszajn 1950)

Proof of Theorem 1 Recall that mQ = sumni=1 wikX (middot X prime

i ) where X primei sim

p(middot|Xi) (i = 1 n) Then

EX prime1X

primen[mQ minus mQ2

HX]

428 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

= EX prime1X

primen[〈mQ mQ〉HX

minus 2〈mQ mQ〉HX+ 〈mQ mQ〉HX

]

=nsum

i j=1

wiw jEX primei X

primej[kX (X prime

i X primej)]

minus 2nsum

i=1

wiEX primesimQX primei[kX (X prime X prime

i )] + EX primeX primesimQ[kX (X prime X prime)]

=sumi = j

wiw jEX primei X

primej[kX (X prime

i X primej)] +

nsumi=1

w2i EX prime

i[kX (X prime

i X primei )]

minus 2nsum

i=1

wiEX primesimQX primei[kX (X prime X prime

i )] + EX primeX primesimQ[kX (X prime X prime)] (A6)

where X prime denotes an independent copy of X primeRecall that Q= int

p(middot|x)dP(x) and θ (x x) = int intkX (xprime xprime)dp(xprime|x)dp(xprime|x)

We can then rewrite terms in equation A6 as

EX primesimQX primei[kX (X prime X prime

i )]

=int (intint

kX (xprime xprimei)dp(xprime|x)dp(xprime

i|Xi)

)dP(x)

=int

θ (x Xi)dP(x) = EXsimP[θ (X Xi)]

EX primeX primesimQ[kX (X prime X prime)]

=intint (intint

kX (xprime xprime)dp(xprime|x)p(xprime|x)

)dP(x)dP(x)

=intint

θ (x x)dP(x)dP(x) = EXXsimP[θ (X X)]

Thus equation A6 is equal to

nsumi=1

w2i (EX prime

i[kX (X prime

i X primei )] minus EX prime

i Xprimei[kX (X prime

i X primei )]) +

nsumi j=1

wiw jθ (Xi Xj)

minus 2nsum

i=1

wiEXsimP[θ (X Xi)] + EXXsimP[θ (X X)] (A7)

Filtering with State-Observation Examples 429

We can rewrite terms in equation A7 as follows using the facts A1 A2A3 A4 and A5

sumi j

wiw jθ (Xi Xj) =sumi j

wiw j

sumst

αstφs(Xi)φt (Xj)

=sumst

αst

sumi

wiφs(Xi)sum

j

w jφt(Xj) =sumst

αst〈mP φs〉HX〈mP φt〉HX

=sumst

αst〈mP otimes mP φs otimes φt〉HX otimesHX= 〈mP otimes mP θ〉HX otimesHX

sumi

wiEXsimP[θ (X Xi)] =sum

i

wiEXsimP

[sumst

αstφs(X)φt (Xi)

]

=sumst

αstEXsimP[φs(X)]sum

i

wiφt (Xi) =sumst

αst〈mP φs〉HX〈mP φt〉HX

=sumst

αst〈mP otimes mP φs otimes φt〉HX otimesHX= 〈mP otimes mP θ〉HX otimesHX

EXXsimP[θ (X X)] = EXXsimP

[sumst

αstφs(X)φt (X)

]

=sumst

αst〈mP φs〉HX〈mP φt〉HX

=sumst

αst〈mP otimes mP φs otimes φt〉HX otimesHX

= 〈mP otimes mP θ〉HX otimesHX

Thus equation A7 is equal to

nsumi=1

w2i (EX prime

i[kX (X prime

i X primei )] minus EX prime

i Xprimei[kX (X prime

i X primei )]) + 〈mP otimes mP θ〉HX otimesHX

minus 2〈mP otimes mP θ〉HX otimesHX+ 〈mP otimes mP θ〉HX otimesHX

=nsum

i=1

w2i (EX prime

i[kX (X prime

i X primei )] minus EX prime

i Xprimei[kX (X prime

i X primei )])

+〈(mP minus mP) otimes (mP minus mP) θ〉HX otimesHX

Finally the Cauchy-Schwartz inequality gives

〈(mP minus mP) otimes (mP minus mP) θ〉HX otimesHXle mP minus mP2

HXθHX otimesHX

This completes the proof

430 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Appendix B Proof of Theorem 2

Theorem 2 provides convergence rates for the resampling algorithmAlgorithm 4 This theorem assumes that the candidate samples Z1 ZNfor resampling are iid with a density q Here we prove theorem 2 byshowing that the same statement holds under weaker assumptions (seetheorem 3 below)

We first describe assumptions Let P be the distribution of the kernelmean mP and L2(P) be the Hilbert space of square-integrable functions onX with respect to P For any f isin L2(P) we write its norm by fL2(P) =int

f 2(x)dP(x)

Assumption 1 The candidate samples Z1 ZN are independent Thereare probability distributions Q1 QN on X such that for any boundedmeasurable function g X rarr R we have

E

⎡⎣ 1

N minus 1

sumj =i

g(Zj)

⎤⎦ = EXsimQi

[g(X)] (i = 1 N) (B1)

Assumption 2 The distributions Q1 QN have density functions q1

qN respectively Define Q = 1N

sumNi=1 Qi and q = 1

N

sumNi=1 qi There is a con-

stant A gt 0 that does not depend on N such that

∥∥∥∥qi

qminus 1

∥∥∥∥2

L2(P)

le AradicN

(i = 1 N) (B2)

Assumption 3 The distribution P has a density function p such thatsupxisinX

p(x)

q(x)lt infin There is a constant σ gt 0 such that

radicN

(1N

Nsumi=1

p(Zi)

q(Zi)minus 1

)DrarrN (0 σ 2) (B3)

whereDrarr denotes convergence in distribution and N (0 σ 2) the normal

distribution with mean 0 and variance σ 2

These assumptions are weaker than those in theorem 2 which requirethat Z1 ZN be iid For example assumption 1 is clearly satisfied for theiid case since in this case we have Q = Q1= middot middot middot = QN The inequalityequation B2 in assumption 2 requires that the distributions Q1 QN getsimilar as the sample size increases This is also satisfied under the iid

Filtering with State-Observation Examples 431

assumption Likewise the convergence equation B3 in assumption 3 issatisfied from the central limit theorem if Z1 ZN are iid

We will need the following lemma

Lemma 1 Let Z1 ZN be samples satisfying assumption 1 Then the followingholds for any bounded measurable function g X rarr R

E

[1N

Nsumi=1

g(Zi )

]=

intg(x)d Q(x)

Proof

E

[1N

Nsumi=1

g(Zi)

]= E

⎡⎣ 1

N(N minus 1)

Nsumi=1

sumj =i

g(Zj)

⎤⎦

= 1N

Nsumi=1

E

⎡⎣ 1

N minus 1

sumj =i

g(Zj)

⎤⎦ = 1

N

Nsumi=1

intg(x)Qi(x) =

intg(x)dQ(x)

The following theorem shows the convergence rates of our resampling al-gorithm Note that it does not assume that the candidate samples Z1 ZNare identical to those expressing the estimator mP

Theorem 3 Let k be a bounded positive definite kernel and H be the asso-ciated RKHS Let Z1 ZN be candidate samples satisfying assumptions 12 and 3 Let P be a probability distribution satisfying assumption 3 and letmP =

intk(middot x)d P(x) be the kernel mean Let mP isin H be any element in H Sup-

pose we apply algorithm 4 to mP isin H with candidate samples Z1 ZN and letX1 X isin Z1 ZN be the resulting samples Then the following holds

∥∥∥∥∥mP minus 1

sumi=1

k(middot Xi )

∥∥∥∥∥2

H

= (mP minus mPH + Op(Nminus12))2 + O(

ln

)

Proof Our proof is based on the fact (Bach et al 2012) that kernel herdingcan be seen as the Frank-Wolfe optimization method with step size 1( + 1)

for the th iteration For details of the Frank-Wolfe method we refer to Jaggi(2013) and Freund and Grigas (2014)

Fix the samples Z1 ZN Let MN be the convex hull of the set k(middot Z1)

k(middot ZN ) sub H Define a loss function J H rarr R by

J(g) = 12g minus mP2

H g isin H (B4)

432 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Then algorithm 4 can be seen as the Frank-Wolfe method that iterativelyminimizes this loss function over the convex hull MN

infgisinMN

J(g)

More precisely the Frank-Wolfe method solves this problem by the follow-ing iterations

s = arg mingisinMN

〈gnablaJ(gminus1)〉H

g = (1 minus γ )gminus1 + γ s ( ge 1)

where γ is a step size defined as γ = 1 and nablaJ(gminus1) is the gradient of J atgminus1 nablaJ(gminus1) = gminus1 minus mP Here the initial point is defined as g0 = 0 It canbe easily shown that g = 1

sumi=1 k(middot Xi) where X1 X are the samples

given by algorithm 4 (For details see Bach et al 2012)Let LJMN

gt 0 be the Lipschitz constant of the gradient nablaJ over MN andDiam MN gt 0 be the diameter of MN

LJMN= sup

g1g2isinMN

nablaJ(g1) minus nablaJ(g2)Hg1 minus g2H

= supg1g2isinMN

g1 minus g2Hg1 minus g2H

= 1 (B5)

Diam MN = supg1g2isinMN

g1 minus g2H

le supg1g2isinMN

g1H + g2H le 2C (B6)

where C = supxisinX k(middot x)H = supxisinXradic

k(x x) lt infinFrom bound 32 and equation 8 of Freund and Grigas (2014) we then

have

J(g) minus infgisinMN

J(g) leLJMN

(Diam MN)2(1 + ln )

2(B7)

le 2C2(1 + ln )

(B8)

where the last inequality follows from equations B5 and B6

Filtering with State-Observation Examples 433

Note that the upper bound of equation B8 does not depend on thecandidate samples Z1 ZN Hence combined with equation B4 thefollowing holds for any choice of Z1 ZN

∥∥∥∥∥mP minus 1

sumi=1

k(middot Xi)

∥∥∥∥∥2

H

le infgisinMN

mP minus g2H + 4C2(1 + ln )

(B9)

Below we will focus on bounding the first term of equation B9 Recallhere that Z1 ZN are random samples Define a random variable SN =sumN

i=1p(Zi )

q(Zi ) Since MN is the convex hull of the k(middot Z1) k(middot ZN ) we

have

infgisinMN

mP minus gH

= infαisinRN αge0

sumi αile1

∥∥∥∥∥mP minussum

i

αik(middot Zi)

∥∥∥∥∥H

le∥∥∥∥∥mP minus 1

SN

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

le mP minus mPH +∥∥∥∥∥mP minus 1

N

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

+∥∥∥∥∥ 1

N

sumi

p(Zi)

q(Zi)k(middot Zi) minus 1

SN

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

Therefore we have

mP minus 1

sumi=1

k(middot Xi)2H

le(

mP minus mPH +∥∥∥∥∥mP minus 1

N

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

+∥∥∥∥∥ 1

N

sumi

p(Zi)

q(Zi)k(middot Zi) minus 1

SN

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

)2

+ O(

ln

)

(B10)

Below we derive rates of convergence for the second and third terms

434 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

For the second term we derive a rate of convergence in expectationwhich implies a rate of convergence in probability To this end we use thefollowing fact Let f isin H be any function in the RKHS By the assumptionsupxisinX

p(x)

q(x)lt infin and the boundedness of k functions x rarr p(x)

q(x)f (x) and

x rarr (p(x)

q(x))2 f (x) are bounded

E

⎡⎣

∥∥∥∥∥mP minus 1N

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥2

H

⎤⎦

= mP2H minus 2E

[1N

sumi

p(Zi)

q(Zi)mP(Zi)

]

+ E

⎡⎣ 1

N2

sumi

sumj

p(Zi)

q(Zi)

p(Zj)

q(Zj)k(Zi Zj)

⎤⎦

= mP2H minus 2

intp(x)

q(x)mP(x)q(x)dx

+ E

⎡⎣ 1

N2

sumi

sumj =i

p(Zi)

q(Zi)

p(Zj)

q(Zj)k(Zi Zj)

⎤⎦

+E

[1

N2

sumi

(p(Zi)

q(Zi)

)2

k(Zi Zi)

]

= mP2H minus 2mP2

H + E

[N minus 1

N2

sumi

p(Zi)

q(Zi)

intp(x)

q(x)k(Zi x)qi(x)dx

]

+ 1N

int (p(x)

q(x)

)2

k(x x)q(x)dx

= minusmP2H + E

[N minus 1

N2

sumi

p(Zi)

q(Zi)

intp(x)

q(x)k(Zi x)qi(x)dx

]

+ 1N

intp(x)

q(x)k(x x)dP(x)

We further rewrite the second term of the last equality as follows

E

[N minus 1

N2

sumi

p(Zi)

q(Zi)

intp(x)

q(x)k(Zi x)qi(x)dx

]

Filtering with State-Observation Examples 435

= E

[N minus 1

N2

sumi

p(Zi)

q(Zi)

intp(x)

q(x)k(Zi x)(qi(x) minus q(x))dx

]

+ E

[N minus 1

N2

sumi

p(Zi)

q(Zi)

intp(x)

q(x)k(Zi x)q(x)dx

]

= E

[N minus 1

N2

sumi

p(Zi)

q(Zi)

int radicp(x)k(Zi x)

radicp(x)

(qi(x)

q(x)minus 1

)dx

]

+ N minus 1N

mP2H

le E

[N minus 1

N2

sumi

p(Zi)

q(Zi)k(Zi middot)L2(P)

∥∥∥∥qi(x)

q(x)minus 1

∥∥∥∥L2(P)

]+ N minus 1

NmP2

H

le E

[N minus 1

N3

sumi

p(Zi)

q(Zi)C2A

]+ N minus 1

NmP2

H

= C2A(N minus 1)

N2 + N minus 1N

mP2H

where the first inequality follows from Cauchy-Schwartz Using this weobtain

E[

∥∥∥∥∥mP minus 1N

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥2

H

le 1N

(intp(x)

q(x)k(x x)dP(x) minus mP2

H

)+ C2(N minus 1)A

N2

= O(Nminus1)

Therefore we have

∥∥∥∥∥mP minus 1N

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

= Op(Nminus12) (N rarr infin) (B11)

We can bound the third term as follows∥∥∥∥∥ 1N

sumi

p(Zi)

q(Zi)k(middot Zi) minus 1

SN

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

=∥∥∥∥∥ 1

N

sumi

p(Zi)

q(Zi)k(middot Zi)

(1 minus N

SN

)∥∥∥∥∥H

436 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

=∣∣∣∣1 minus N

SN

∣∣∣∣∥∥∥∥∥ 1

N

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

le∣∣∣∣1 minus N

SN

∣∣∣∣C pqinfin

=∣∣∣∣∣1 minus 1

1N

sumNi=1 p(Zi)q(Zi)

∣∣∣∣∣C pqinfin

where pqinfin = supxisinXp(x)

q(x)lt infin Therefore the following holds by as-

sumption 3 and the delta method

∥∥∥∥∥ 1N

sumi

p(Zi)

q(Zi)k(middot Zi) minus 1

SN

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

= Op(Nminus12) (B12)

The assertion of the theorem follows from equations B10 to B12

Appendix C Reduction of Computational Cost

We have seen in section 43 that the time complexity of KMCF in one timestep is O(n3) where n is the number of the state-observation examples(XiYi)n

i=1 This can be costly if one wishes to use KMCF in real-timeapplications with a large number of samples Here we show two methodsfor reducing the costs one based on low-rank approximation of kernelmatrices and one based on kernel herding Note that kernel herding is alsoused in the resampling step The purpose here is different however wemake use of kernel herding for finding a reduced representation of the data(XiYi)n

i=1

C1 Low-rank Approximation of Kernel Matrices Our goal is to re-duce the costs of algorithm 1 of kernel Bayesrsquo rule Algorithm 1 involvestwo matrix inversions (GX + nεIn)minus1 in line 3 and ((GY )2 + δIn)minus1 in line4 Note that (GX + nεIn)minus1 does not involve the test data so it can be com-puted before the test phase On the other hand ((GY )2 + δIn)minus1 dependson matrix This matrix involves the vector mπ which essentially repre-sents the prior of the current state (see line 13 of algorithm 3) Therefore((GY )2 + δIn)minus1 needs to be computed for each iteration in the test phaseThis has the complexity of O(n3) Note that even if (GX + nεIn)minus1 can becomputed in the training phase the multiplication (GX + nεIn)minus1mπ in line3 requires O(n2) Thus it can also be costly Here we consider methods toreduce both costs in lines 3 and 4

Suppose that there exist low-rank matrices UV isin Rntimesr where r lt n that

approximate the kernel matrices GX asymp UUT GY asymp VVT Such low-rank

Filtering with State-Observation Examples 437

matrices can be obtained by for example incomplete Cholesky decomposi-tion with time complexity O(nr2) (Fine amp Scheinberg 2001 Bach amp Jordan2002) Note that the computation of these matrices is required only oncebefore the test phase Therefore their time complexities are not the problemhere

C11 Derivation First we approximate (GX + nεIn)minus1mπ in line 3 usingGX asymp UUT By the Woodbury identity we have

(GX + nεIn)minus1mπ asymp (UUT + nεIn)minus1mπ

= 1nε

(In minus U(nεIr + UTU)minus1UT )mπ

where Ir isin Rrtimesr denotes the identity Note that (nεIr + UTU)minus1 does not

involve the test data so can be computed in the training phase Thus theabove approximation of μ can be computed with complexity O(nr2)

Next we approximate w = GY ((GY )2 + δI)minus1kY in line 4 using GY asympVVT Define B = V isin R

ntimesr C = VTV isin Rrtimesr and D = VT isin R

rtimesn Then(GY )2 asymp (VVT )2 = BCD By the Woodbury identity we obtain

(δIn + (GY )2)minus1 asymp (δIn + BCD)minus1

= 1δ(In minus B(δCminus1 + DB)minus1D)

Thus w can be approximated as

w =GY ((GY )2 + δI)minus1kY

asymp 1δVVT (In minus B(δCminus1 + DB)minus1D)kY

The computation of this approximation requires O(nr2 + r3) = O(nr2) Thusin total the complexity of algorithm 1 can be reduced to O(nr2) We sum-marize the above approximations in algorithm 5

438 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

C12 How to Use Algorithm 5 can be used with algorithm 3 of KMCFby modifying Algorithm 3 in the following manner Compute the low-rank matrices UV right after lines 4 and 5 This can be done by using forexample incomplete Cholesky decomposition (Fine amp Scheinberg 2001Bach amp Jordan 2002) Then replace algorithm 1 in line 15 by algorithm 5

C13 How to Select the Rank As discussed in section 43 one way ofselecting the rank r is to use-cross validation by regarding r as a hyper-parameter of KMCF Another way is to measure the approximation errorsGX minus UUT and GY minus VVT with some matrix norm such as the Frobe-nius norm Indeed we can compute the smallest rank r such that theseerrors are below a prespecified threshold and this can be done efficientlywith time complexity O(nr2) (Bach amp Jordan 2002)

C2 Data Reduction with Kernel Herding Here we describe an ap-proach to reduce the size of the representation of the state-observationexamples (XiYi)n

i=1 in an efficient way By ldquoefficientrdquo we mean that theinformation contained in (XiYi)n

i=1 will be preserved even after the re-duction Recall that (XiYi)n

i=1 contains the information of the observationmodel p(yt |xt ) (recall also that p(yt |xt ) is assumed time-homogeneous seesection 41) This information is used in only algorithm 1 of kernel Bayesrsquorule (line 15 algorithm 3) Therefore it suffices to consider how kernel Bayesrsquorule accesses the information contained in the joint sample (XiYi)n

i=1

C21 Representation of the Joint Sample To this end we need to show howthe joint sample (XiYi)n

i=1 can be represented with a kernel mean embed-ding Recall that (kX HX ) and (kY HY ) are kernels and the associatedRKHSs on the state-space X and the observation-space Y respectively LetX times Y be the product space ofX andY Then we can define a kernel kXtimesY onX times Y as the product of kX and kY kXtimesY ((x y) (xprime yprime)) = kX (x xprime)kY (y yprime)for all (x y) (xprime yprime) isin X times Y This product kernel kXtimesY defines an RKHS ofX times Y Let HXtimesY denote this RKHS As in section 3 we can use kXtimesY andHXtimesY for a kernel mean embedding In particular the empirical distribution1n

sumni=1 δ(XiYi )

of the joint sample (XiYi)ni=1 sub X times Y can be represented as

an empirical kernel mean in HXtimesY

mXY = 1n

nsumi=1

kXtimesY ((middot middot) (XiYi)) isin HXtimesY (C1)

This is the representation of the joint sample (XiYi)ni=1

The information of (XiYi)ni=1 is provided for kernel Bayesrsquo rule essen-

tially through the form of equation C1 (Fukumizu et al 2011 2013) Recallthat equation C1 is a point in the RKHS HXtimesY Any point close to thisequation in HXtimesY would also contain information close to that contained in

Filtering with State-Observation Examples 439

equation C1 Therefore we propose to find a subset (X1 Y1) (Xr Xr) sub(XiYi)n

i=1 where r lt n such that its representation in HXtimesY

mXY = 1r

rsumi=1

kXtimesY ((middot middot) (Xi Yi)) isin HXtimesY (C2)

is close to equation C1 Namely we wish to find subsamples such thatmXY minus mXYHXtimesY

is small If the error mXY minus mXYHXtimesYis small enough

equation C2 would provide information close to that given by equation C1for kernel Bayesrsquo rule Thus kernel Bayesrsquo rule based on such subsamples(Xi Yi)r

i=1 would not perform much worse than the one based on the entireset of samples (XiYi)n

i=1

C22 Subsampling Method To find such subsamples we make use ofkernel herding in section 35 Namely we apply the update equations 36and 37 to approximate equation C1 with kernel kXtimesY and RKHS HXtimesY We greedily find subsamples Dr = (X1 Y1) (Xr Yr) as

(Xr Yr)

= arg max(xy)isinDDrminus1

1n

nsumi=1

kXtimesY ((x y) (XiYi))minus1r

rminus1sumj=1

kXtimesY ((x r) (Xi Yi))

= arg max(xy)isinDDrminus1

1n

nsumi=1

kX (x Xi)kY (yYi)minus1r

rminus1sumj=1

kX (x Xj)kY (y Yj)

The resulting algorithm is shown in algorithm 6 The time complexity isO(n2r) for selecting r subsamples

C23 How to Use By using algorithm 6 we can reduce the the time com-plexity of KMCF (see algorithm 3) in each iteration from O(n3) to O(r3) Thiscan be done by obtaining subsamples (Xi Yi)r

i=1 by applying algorithm 6to (XiYi)n

i=1 and then replacing (XiYi)ni=1 in requirement of algorithm 3

by (Xi Yi)ri=1 and using the number r instead of n

C24 How to Select the Number of Subsamples The number r of subsam-ples determines the trade-off between the accuracy and computational timeof KMCF It may be selected by cross-validation or by measuring the ap-proximation error mXY minus mXYHXtimesY

as for the case of selecting the rankof low-rank approximation in section C1

C25 Discussion Recall that kernel herding generates samples such thatthey approximate a given kernel mean (see section 35) Under certain

440 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

assumptions the error of this approximation is of O(rminus1) with r sampleswhich is faster than that of iid samples O(rminus12) This indicates that sub-samples (Xi Yi)r

i=1 selected with kernel herding may approximate equa-tion C1 well Here however we find the solutions of the optimizationproblems 36 and 37 from the finite set (XiYi)n

i=1 rather than the entirejoint space X times Y The convergence guarantee is provided only for the caseof the entire joint space X times Y Thus for our case the convergence guar-antee is no longer provided Moreover the fast rate O(rminus1) is guaranteedonly for finite-dimensional RKHSs Gaussian kernels which we often use inpractice define infinite-dimensional RKHSs Therefore the fast rate is notguaranteed if we use gaussian kernels Nevertheless we can use algorithm6 as a heuristic for data reduction

Acknowledgments

We express our gratitude to the associate editor and the anonymous re-viewer for their time and helpful suggestions We also thank MasashiShimbo Momoko Hayamizu Yoshimasa Uematsu and Katsuhiro Omaefor their helpful comments This work has been supported in part by MEXTGrant-in-Aid for Scientific Research on Innovative Areas 25120012 MKhas been supported by JSPS Grant-in-Aid for JSPS Fellows 15J04406

References

Anderson B amp Moore J (1979) Optimal filtering Englewood Cliffs NJ PrenticeHall

Aronszajn N (1950) Theory of reproducing kernels Transactions of the AmericanMathematical Society 68(3) 337ndash404

Filtering with State-Observation Examples 441

Bach F amp Jordan M I (2002) Kernel independent component analysis Journal ofMachine Learning Research 3 1ndash48

Bach F Lacoste-Julien S amp Obozinski G (2012) On the equivalence betweenherding and conditional gradient algorithms In Proceedings of the 29th Interna-tional Conference on Machine Learning (ICML2012) (pp 1359ndash1366) Madison WIOmnipress

Berlinet A amp Thomas-Agnan C (2004) Reproducing kernel Hilbert spaces in probabilityand statistics Dordrecht Kluwer Academic

Calvet L E amp Czellar V (2015) Accurate methods for approximate Bayesiancomputation filtering Journal of Financial Econometrics 13 798ndash838 doi101093jjfinecnbu019

Cappe O Godsill S J amp Moulines E (2007) An overview of existing methods andrecent advances in sequential Monte Carlo IEEE Proceedings 95(5) 899ndash924

Chen Y Welling M amp Smola A (2010) Supersamples from kernel-herding InProceedings of the 26th Conference on Uncertainty in Artificial Intelligence (pp 109ndash116) Cambridge MA MIT Press

Deisenroth M Huber M amp Hanebeck U (2009) Analytic moment-based gaussianprocess filtering In Proceedings of the 26th International Conference on MachineLearning (pp 225ndash232) Madison WI Omnipress

Doucet A Freitas N D amp Gordon N J (Eds) (2001) Sequential Monte Carlomethods in practice New York Springer

Doucet A amp Johansen A M (2011) A tutorial on particle filtering and smoothingFifteen years later In D Crisan amp B Rozovskii (Eds) The Oxford handbook ofnonlinear filtering (pp 656ndash704) New York Oxford University Press

Durbin J amp Koopman S J (2012) Time series analysis by state space methods (2nd ed)New York Oxford University Press

Eberts M amp Steinwart I (2013) Optimal regression rates for SVMs using gaussiankernels Electronic Journal of Statistics 7 1ndash42

Ferris B Hahnel D amp Fox D (2006) Gaussian processes for signal strength-basedlocation estimation In Proceedings of Robotics Science and Systems CambridgeMA MIT Press

Fine S amp Scheinberg K (2001) Efficient SVM training using low-rank kernel rep-resentations Journal of Machine Learning Research 2 243ndash264

Freund R M amp Grigas P (2014) New analysis and results for the FrankndashWolfemethod Mathematical Programming doi 101007s10107-014-0841-6

Fukumizu K Bach F amp Jordan M I (2004) Dimensionality reduction for super-vised learning with reproducing kernel Hilbert spaces Journal of Machine LearningResearch 5 73ndash99

Fukumizu K Gretton A Sun X amp Scholkopf B (2008) Kernel measures ofconditional dependence In J C Platt D Koller Y Singer amp S Roweis (Eds)Advances in neural information processing systems 20 (pp 489ndash496) CambridgeMA MIT Press

Fukumizu K Song L amp Gretton A (2011) Kernel Bayesrsquo rule In J Shawe-TaylorR S Zemel P L Bartlett F C N Pereira amp K Q Weinberger (Eds) Advances inneural information processing systems 24 (pp 1737ndash1745) Red Hook NY Curran

Fukumizu K Song L amp Gretton A (2013) Kernel Bayesrsquo rule Bayesian inferencewith positive definite kernels Journal of Machine Learning Research 14 3753ndash3783

442 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Fukumizu K Sriperumbudur B Gretton A amp Scholkopf B (2009) Characteristickernels on groups and semigroups In D Koller D Schuurmans Y Bengio amp LBottou (Eds) Advances in neural information processing systems 21 (pp 473ndash480)Cambridge MA MIT Press

Gordon N J Salmond D J amp Smith AFM (1993) Novel approach tononlinearnon-gaussian Bayesian state estimation IEE-Proceedings-F 140 107ndash113

Hofmann T Scholkopf B amp Smola A J (2008) Kernel methods in machine learn-ing Annals of Statistics 36(3) 1171ndash1220

Jaggi M (2013) Revisiting Frank-Wolfe Projection-free sparse convex optimizationIn Proceedings of the 30th International Conference on Machine Learning (pp 427ndash435)httplinkspringercomarticle101007Fg10107-014-0841-6

Jasra A Singh S S Martin J S amp McCoy E (2012) Filtering via approximateBayesian computation Statistics and Computing 22 1223ndash1237

Julier S J amp Uhlmann J K (1997) A new extension of the Kalman filter tononlinear systems In Proceedings of AeroSense The 11th International SymposiumAerospaceDefence Sensing Simulation and Controls Bellingham WA SPIE

Julier S J amp Uhlmann J K (2004) Unscented filtering and nonlinear estimationIEEE Review 92 401ndash422

Kalman R E (1960) A new approach to linear filtering and prediction problemsTransactions of the ASME Journal of Basic Engineering 82 35ndash45

Kanagawa M amp Fukumizu K (2014) Recovering distributions from gaussianRKHS embeddings In Proceedings of the 17th International Conference on Artifi-cial Intelligence and Statistics (pp 457ndash465) JMLR

Kanagawa M Nishiyama Y Gretton A amp Fukumizu K (2014) Monte Carlofiltering using kernel embedding of distributions In Proceedings of the 28thAAAI Conference on Artificial Intelligence (pp 1897ndash1903) Cambridge MA MITPress

Ko J amp Fox D (2009) GP-BayesFilters Bayesian filtering using gaussian processprediction and observation models Autonomous Robots 72(1) 75ndash90

Lazebnik S Schmid C amp Ponce J (2006) Beyond bags of features Spatial pyramidmatching for recognizing natural scene categories In Proceedings of 2006 IEEEComputer Society Conference on Computer Vision and Pattern Recognition (vol 2 pp2169ndash2178) Washington DC IEEE Computer Society

Liu J S (2001) Monte Carlo strategies in scientific computing New York Springer-Verlag

McCalman L OrsquoCallaghan S amp Ramos F (2013) Multi-modal estimation withkernel embeddings for learning motion models In Proceedings of 2013 IEEE In-ternational Conference on Robotics and Automation (pp 2845ndash2852) Piscataway NJIEEE

Pan S J amp Yang Q (2010) A survey on transfer learning IEEE Transactions onKnowledge and Data Engineering 22(10) 1345ndash1359

Pistohl T Ball T Schulze-Bonhage A Aertsen A amp Mehring C (2008) Predic-tion of arm movement trajectories from ECoG-recordings in humans Journal ofNeuroscience Methods 167(1) 105ndash114

Pronobis A amp Caputo B (2009) COLD COsy localization database InternationalJournal of Robotics Research 28(5) 588ndash594

Filtering with State-Observation Examples 443

Quigley M Stavens D Coates A amp Thrun S (2010) Sub-meter indoor localiza-tion in unmodified environments with inexpensive sensors In Proceedings of theIEEERSJ International Conference on Intelligent Robots and Systems 2010 (vol 1 pp2039ndash2046) Piscataway NJ IEEE

Ristic B Arulampalam S amp Gordon N (2004) Beyond the Kalman filter Particlefilters for tracking applications Norwood MA Artech House

Schaid D J (2010a) Genomic similarity and kernel methods I Advancements bybuilding on mathematical and statistical foundations Human Heredity 70(2) 109ndash131

Schaid D J (2010b) Genomic similarity and kernel methods II Methods for genomicinformation Human Heredity 70(2) 132ndash140

Schalk G Kubanek J Miller K J Anderson N R Leuthardt E C Ojemann J G Wolpaw J R (2007) Decoding two dimensional movement trajectories usingelectrocorticographic signals in humans Journal of Neural Engineering 4(264)264ndash275

Scholkopf B amp Smola A J (2002) Learning with kernels Cambridge MA MIT PressScholkopf B Tsuda K amp Vert J P (2004) Kernel methods in computational biology

Cambridge MA MIT PressSilverman B W (1986) Density estimation for statistics and data analysis London

Chapman and HallSmola A Gretton A Song L amp Scholkopf B (2007) A Hilbert space embed-

ding for distributions In Proceedings of the International Conference on AlgorithmicLearning Theory (pp 13ndash31) New York Springer

Song L Fukumizu K amp Gretton A (2013) Kernel embeddings of conditional dis-tributions A unified kernel framework for nonparametric inference in graphicalmodels IEEE Signal Processing Magazine 30(4) 98ndash111

Song L Huang J Smola A amp Fukumizu K (2009) Hilbert space embeddings ofconditional distributions with applications to dynamical systems In Proceedingsof the 26th International Conference on Machine Learning (pp 961ndash968) MadisonWI Omnipress

Sriperumbudur B K Gretton A Fukumizu K Scholkopf B amp Lanckriet G R(2010) Hilbert space embeddings and metrics on probability measures Journal ofMachine Learning Research 11 1517ndash1561

Steinwart I amp Christmann A (2008) Support vector machines New York SpringerStone C J (1977) Consistent nonparametric regression Annals of Statistics 5(4)

595ndash620Thrun S Burgard W amp Fox D (2005) Probabilistic robotics Cambridge MA MIT

PressVlassis N Terwijn B amp Krose B (2002) Auxiliary particle filter robot local-

ization from high-dimensional sensor observations In Proceedings of the In-ternational Conference on Robotics and Automation (pp 7ndash12) Piscataway NJIEEE

Wang Z Ji Q Miller K J amp Schalk G (2011) Prior knowledge improves decodingof finger flexion from electrocorticographic signals Frontiers in Neuroscience 5127

Widom H (1963) Asymptotic behavior of the eigenvalues of certain integral equa-tions Transactions of the American Mathematical Society 109 278ndash295

444 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Widom H (1964) Asymptotic behavior of the eigenvalues of certain integral equa-tions II Archive for Rational Mechanics and Analysis 17 215ndash229

Wolf J Burgard W amp Burkhardt H (2005) Robust vision-based localization bycombining an image retrieval system with Monte Carlo localization IEEE Trans-actions on Robotics 21(2) 208ndash216

Received May 18 2015 accepted October 14 2015

Page 9: Filtering with State-Observation Examples via Kernel Monte ...

390 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Gretton amp Scholkopf 2009 and Sriperumbudur et al 2010) We assumehenceforth that kernels are characteristic

An important property of the kernel mean equation 31 is the followingby the reproducing property we have

〈mP f 〉H =int

f (x)dP(x) = EXsimP[ f (X)] forall f isin H (32)

that is the expectation of any function in the RKHS can be given by theinner product between the kernel mean and that function

33 Estimation of Kernel Means Suppose that distribution P is un-known and that we wish to estimate P from available samples This can beequivalently done by estimating its kernel mean mP since mP preserves allthe information about P

For example let X1 Xn be an independent and identically distributed(iid) sample from P Define an estimator of mP by the empirical mean

mP = 1n

nsumi=1

k(middot Xi)

Then this converges to mP at a rate mP minus mPH = Op(nminus12) (Smola et al

2007) where Op denotes the asymptotic order in probability and middot H isthe norm of the RKHS fH = radic〈 f f 〉H for all f isin H Note that this rateis independent of the dimensionality of the space X

Next we explain kernel Bayesrsquo rule which serves as a building block ofour filtering algorithm To this end we introduce two measurable spacesX and Y Let p(xy) be a joint probability on the product space X times Y thatdecomposes as p(x y) = p(y|x)p(x) Let π(x) be a prior distribution onX Then the conditional probability p(y|x) and the prior π(x) define theposterior distribution by Bayesrsquo rule

pπ (x|y) prop p(y|x)π(x)

The assumption here is that the conditional probability p(y|x) is un-known Instead we are given an iid sample (X1Y1) (XnYn) fromthe joint probability p(xy) We wish to estimate the posterior pπ (x|y) usingthe sample KBR achieves this by estimating the kernel mean of pπ (x|y)

KBR requires that kernels be defined on X and Y Let kX and kY bekernels on X and Y respectively Define the kernel means of the prior π(x)

and the posterior pπ (x|y)

mπ =int

kX (middot x)π(x)dx mπX|y =

intkX (middot x)pπ (x|y)dx

Filtering with State-Observation Examples 391

KBR also requires that mπ be expressed as a weighted sample Let mπ =sumj=1 γ jkX (middotUj) be a sample expression of mπ where isin N γ1 γ isin R

and U1 U isin X For example suppose U1 U are iid drawn fromπ(x) Then γ j = 1 suffices

Given the joint sample (XiYi)ni=1 and the empirical prior mean mπ

KBR estimates the kernel posterior mean mπX|y as a weighted sum of the

feature vectors

mπX|y =

nsumi=1

wikX (middot Xi) (33)

where the weights w = (w1 wn)T isin Rn are given by algorithm 1 Here

diag(v) for v isin Rn denotes a diagonal matrix with diagonal entries v It takes

as input (1) vectors kY = (kY (yY1) kY (yYn))T mπ = (mπ (X1)

mπ (Xn))T isin Rn where mπ (Xi) = sum

j=1 γ jkX (XiUj) (2) kernel matricesGX = (kX (Xi Xj)) GY = (kY (YiYj )) isin R

ntimesn and (3) regularization con-stants ε δ gt 0 The weight vector w = (w1 wn)T isin R

n is obtained bymatrix computations involving two regularized matrix inversions Notethat these weights can be negative

Fukumizu et al (2013) showed that KBR is a consistent estimator of thekernel posterior mean under certain smoothness assumptions the estimateequation 33 converges to mπ

X|y as the sample size goes to infinity n rarr infinand mπ converges to mπ (with ε δ rarr 0 in appropriate speed) (For detailssee Fukumizu et al 2013 Song et al 2013)

34 Decoding from Empirical Kernel Means In general as shownabove a kernel mean mP is estimated as a weighted sum of feature vectors

mP =nsum

i=1

wik(middot Xi) (34)

with samples X1 Xn isin X and (possibly negative) weights w1 wn isinR Suppose mP is close to mP that is mP minus mPH is small Then mP issupposed to have accurate information about P as mP preserves all theinformation of P

392 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

How can we decode the information of P from mP The empirical ker-nel mean equation 34 has the following property which is due to thereproducing property of the kernel

〈mP f 〉H =nsum

i=1

wi f (Xi) forall f isin H (35)

Namely the weighted average of any function in the RKHS is equal to theinner product between the empirical kernel mean and that function Thisis analogous to the property 32 of the population kernel mean mP Let fbe any function in H From these properties equations 32 and 35 we have

∣∣∣∣∣EXsimP[ f (X)] minusnsum

i=1

wi f (Xi)

∣∣∣∣∣ = |〈mP minus mP f 〉H| le mP minus mPH fH

where we used the Cauchy-Schwartz inequality Therefore the left-handside will be close to 0 if the error mP minus mPH is small This shows that theexpectation of f can be estimated by the weighted average

sumni=1 wi f (Xi)

Note that here f is a function in the RKHS but the same can also be shownfor functions outside the RKHS under certain assumptions (Kanagawa ampFukumizu 2014) In this way the estimator of the form 34 provides es-timators of moments probability masses on sets and the density func-tion (if this exists) We explain this in the context of state-space models insection 44

35 Kernel Herding Here we explain kernel herding (Chen et al 2010)another building block of the proposed filter Suppose the kernel mean mPis known We wish to generate samples x1 x2 x isin X such that theempirical mean mP = 1

sumi=1 k(middot xi) is close to mP that is mP minus mPH is

small This should be done only using mP Kernel herding achieves this bygreedy optimization using the following update equations

x1 = arg maxxisinX

mP(x) (36)

x = arg maxxisinX

mP(x) minus 1

minus1sumi=1

k(x xi) ( ge 2) (37)

where mP(x) denotes the evaluation of mP at x (recall that mP is a functionin H)

An intuitive interpretation of this procedure can be given if there is aconstant R gt 0 such that k(x x) = R for all x isin X (eg R = 1 if k is gaussian)Suppose that x1 xminus1 are already calculated In this case it can be shown

Filtering with State-Observation Examples 393

that x in equation 37 is the minimizer of

E =∥∥∥∥∥mP minus 1

sumi=1

k(middot xi)

∥∥∥∥∥H

(38)

Thus kernel herding performs greedy minimization of the distance betweenmP and the empirical kernel mean mP = 1

sumi=1 k(middot xi)

It can be shown that the error E of equation 38 decreases at a rate at leastO(minus12) under the assumption that k is bounded (Bach Lacoste-Julien ampObozinski 2012) In other words the herding samples x1 x provide aconvergent approximation of mP In this sense kernel herding can be seen asa (pseudo) sampling method Note that mP itself can be an empirical kernelmean of the form 34 These properties are important for our resamplingalgorithm developed in section 42

It should be noted that E decreases at a faster rate O(minus1) under a certainassumption (Chen et al 2010) this is much faster than the rate of iidsamples O(minus12) Unfortunately this assumption holds only when H isfinite dimensional (Bach et al 2012) and therefore the fast rate of O(minus1)

has not been guaranteed for infinite-dimensional cases Nevertheless thisfast rate motivates the use of kernel herding in the data reduction methodin section C2 in appendix C (we will use kernel herding for two differentpurposes)

4 Kernel Monte Carlo Filter

In this section we present our kernel Monte Carlo filter (KMCF). First we define notation and review the problem setting in section 4.1. We then describe the algorithm of KMCF in section 4.2. We discuss implementation issues such as hyperparameter selection and computational cost in section 4.3. We explain how to decode the information on the posteriors from the estimated kernel means in section 4.4.

4.1 Notation and Problem Setup. Here we formally define the setup explained in section 1. The notation is summarized in Table 1.

We consider a state-space model (see Figure 1). Let X and Y be measurable spaces, which serve as a state space and an observation space, respectively. Let x_1, ..., x_t, ..., x_T ∈ X be a sequence of hidden states, which follow a Markov process. Let p(x_t|x_{t-1}) denote a transition model that defines this Markov process. Let y_1, ..., y_t, ..., y_T ∈ Y be a sequence of observations. Each observation y_t is assumed to be generated from an observation model p(y_t|x_t) conditioned on the corresponding state x_t. We use the abbreviation y_{1:t} = {y_1, ..., y_t}.

We consider a filtering problem of estimating the posterior distribution p(x_t|y_{1:t}) for each time t = 1, ..., T. The estimation is to be done online, as each y_t is given.


Table 1: Notation.

X : State space
Y : Observation space
x_t ∈ X : State at time t
y_t ∈ Y : Observation at time t
p(y_t|x_t) : Observation model
p(x_t|x_{t-1}) : Transition model
(X_i, Y_i)_{i=1}^n : State-observation examples
k_X : Positive-definite kernel on X
k_Y : Positive-definite kernel on Y
H_X : RKHS associated with k_X
H_Y : RKHS associated with k_Y

Specifically, we consider the following setting (see also section 1):

1. The observation model p(y_t|x_t) is not known explicitly or even parametrically. Instead, we are given examples of state-observation pairs (X_i, Y_i)_{i=1}^n ⊂ X × Y prior to the test phase. The observation model is also assumed time homogeneous.

2. Sampling from the transition model p(x_t|x_{t-1}) is possible. Its probabilistic model can be an arbitrary nonlinear, nongaussian distribution, as for standard particle filters. It can further depend on time. For example, control input can be included in the transition model as p(x_t|x_{t-1}) = p(x_t|x_{t-1}, u_t), where u_t denotes control input provided by a user at time t.

Let k_X : X × X → R and k_Y : Y × Y → R be positive-definite kernels on X and Y, respectively. Denote by H_X and H_Y their respective RKHSs. We address the above filtering problem by estimating the kernel means of the posteriors:

m_{x_t|y_{1:t}} = ∫ k_X(·, x_t) p(x_t|y_{1:t}) dx_t ∈ H_X   (t = 1, ..., T).   (4.1)

These preserve all the information of the corresponding posteriors if the kernels are characteristic (see section 3.2). Therefore, the resulting estimates of these kernel means provide us the information of the posteriors, as explained in section 4.4.

4.2 Algorithm. KMCF iterates three steps of prediction, correction, and resampling for each time t. Suppose that we have just finished the iteration at time t - 1. Then, as shown later, the resampling step yields the following estimator of equation 4.1 at time t - 1:


Figure 2: One iteration of KMCF. Here X_1, ..., X_8 and Y_1, ..., Y_8 denote states and observations, respectively, in the state-observation examples (X_i, Y_i)_{i=1}^n (suppose n = 8). (1) Prediction step: the kernel mean of the prior, equation 4.5, is estimated by sampling with the transition model p(x_t|x_{t-1}). (2) Correction step: the kernel mean of the posterior, equation 4.1, is estimated by applying kernel Bayes' rule (see algorithm 1). The estimation makes use of the information of the prior (expressed as m_π = (m̂_{x_t|y_{1:t-1}}(X_i)) ∈ R^8) as well as that of a new observation y_t (expressed as k_Y = (k_Y(y_t, Y_i)) ∈ R^8). The resulting estimate, equation 4.6, is expressed as a weighted sample (w_{t,i}, X_i)_{i=1}^n. Note that the weights may be negative. (3) Resampling step: samples associated with small weights are eliminated, and those with large weights are replicated by applying kernel herding (see algorithm 2). The resulting samples provide an empirical kernel mean, equation 4.7, which will be used in the next iteration.

m̄_{x_{t-1}|y_{1:t-1}} = (1/n) Σ_{i=1}^n k_X(·, X̄_{t-1,i}),   (4.2)

where X̄_{t-1,1}, ..., X̄_{t-1,n} ∈ X. We show one iteration of KMCF that estimates the kernel mean 4.1 at time t (see also Figure 2).

4.2.1 Prediction Step. The prediction step is as follows. We generate a sample from the transition model for each X̄_{t-1,i} in equation 4.2,

X_{t,i} ~ p(x_t|x_{t-1} = X̄_{t-1,i})   (i = 1, ..., n).   (4.3)


We then specify a new empirical kernel mean,

m̂_{x_t|y_{1:t-1}} = (1/n) Σ_{i=1}^n k_X(·, X_{t,i}).   (4.4)

This is an estimator of the following kernel mean of the prior:

m_{x_t|y_{1:t-1}} = ∫ k_X(·, x_t) p(x_t|y_{1:t-1}) dx_t ∈ H_X,   (4.5)

where

p(x_t|y_{1:t-1}) = ∫ p(x_t|x_{t-1}) p(x_{t-1}|y_{1:t-1}) dx_{t-1}

is the prior distribution of the current state x_t. Thus equation 4.4 serves as a prior for the subsequent posterior estimation.

In section 5 we theoretically analyze this sampling procedure in detail and provide justification of equation 4.4 as an estimator of the kernel mean, equation 4.5. We emphasize here that such an analysis is necessary even though the sampling procedure is similar to that of a particle filter: the theory of particle methods does not provide a theoretical justification of equation 4.4 as a kernel mean estimator, since it deals with probabilities as empirical distributions.
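A minimal sketch of this prediction step (an illustration, not the letter's code): each resampled point is propagated once through the transition model, and the propagated points with uniform weights define the estimate of equation 4.4. The linear gaussian transition used here is an assumption for the example only.

```python
import numpy as np

def prediction_step(X_resampled, transition_sampler, rng):
    """Propagate each resampled state once through the transition model.
    The returned points, with uniform weights 1/n, represent the prior
    kernel mean estimate of equation 4.4."""
    return np.array([transition_sampler(x, rng) for x in X_resampled])

# Illustrative transition model x_t = 0.9 x_{t-1} + v_t, v_t ~ N(0, 1) (an assumption).
def linear_gaussian_transition(x_prev, rng):
    return 0.9 * x_prev + rng.normal()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_bar = rng.normal(size=50)          # resampled states from time t-1
    X_pred = prediction_step(X_bar, linear_gaussian_transition, rng)
```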

4.2.2 Correction Step. This step estimates the kernel mean, equation 4.1, of the posterior by using kernel Bayes' rule (see algorithm 1) in section 3.3. This makes use of the new observation y_t, the state-observation examples (X_i, Y_i)_{i=1}^n, and the estimate, equation 4.4, of the prior.

The input of algorithm 1 consists of (1) vectors

k_Y = (k_Y(y_t, Y_1), ..., k_Y(y_t, Y_n))^T ∈ R^n,

m_π = (m̂_{x_t|y_{1:t-1}}(X_1), ..., m̂_{x_t|y_{1:t-1}}(X_n))^T = ( (1/n) Σ_{i=1}^n k_X(X_q, X_{t,i}) )_{q=1}^n ∈ R^n,

which are interpreted as expressions of y_t and m̂_{x_t|y_{1:t-1}} using the sample (X_i, Y_i)_{i=1}^n; (2) kernel matrices G_X = (k_X(X_i, X_j)), G_Y = (k_Y(Y_i, Y_j)) ∈ R^{n×n}; and (3) regularization constants ε, δ > 0. These constants ε, δ, as well as the kernels k_X, k_Y, are hyperparameters of KMCF (we discuss how to choose these parameters later).


Algorithm 1 outputs a weight vector w = (w_1, ..., w_n)^T ∈ R^n. Normalizing these weights, w_t = w / Σ_{i=1}^n w_i (for this normalization procedure, see the discussion in section 4.3), we obtain an estimator of equation 4.1 as

m̂_{x_t|y_{1:t}} = Σ_{i=1}^n w_{t,i} k_X(·, X_i).   (4.6)

The apparent difference from a particle filter is that the posterior (kernel mean) estimator, equation 4.6, is expressed in terms of the samples X_1, ..., X_n in the training sample (X_i, Y_i)_{i=1}^n, not with the samples from the prior, equation 4.4. This requires that the training samples X_1, ..., X_n cover the support of the posterior p(x_t|y_{1:t}) sufficiently well. If this does not hold, we cannot expect good performance for the posterior estimate. Note that this is also true for any method that deals with the setting of this letter: poverty of training samples in a certain region means that we do not have any information about the observation model p(y_t|x_t) in that region.
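To make the data flow of the correction step concrete, the following hedged Python sketch builds the input vectors k_Y and m_π exactly as described above and then computes weights with one common form of kernel Bayes' rule (a sketch after Fukumizu et al., 2013). The constant factors in the regularized inverses are assumptions and may differ from algorithm 1; only the overall structure is intended.

```python
import numpy as np

def gauss_gram(A, B, sigma):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def correction_step(y_t, X_train, Y_train, X_pred, sigma_x, sigma_y, eps, delta):
    """Return normalized weights w_t over the training states X_i (equation 4.6).
    The kernel Bayes' rule form below is a sketch; eps/delta scalings are assumptions."""
    n = len(X_train)
    G_X = gauss_gram(X_train, X_train, sigma_x)               # (k_X(X_i, X_j))
    G_Y = gauss_gram(Y_train, Y_train, sigma_y)               # (k_Y(Y_i, Y_j))
    k_Y = gauss_gram(Y_train, y_t[None, :], sigma_y)[:, 0]    # (k_Y(y_t, Y_i))
    m_pi = gauss_gram(X_train, X_pred, sigma_x).mean(axis=1)  # prior mean evaluated at X_q

    mu = np.linalg.solve(G_X + n * eps * np.eye(n), m_pi)     # prior expressed on the training sample
    L = np.diag(mu)
    LG = L @ G_Y
    w = LG @ np.linalg.solve(LG @ LG + delta * np.eye(n), L @ k_Y)
    return w / w.sum()                                        # normalization (see section 4.3)
```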

4.2.3 Resampling Step. This step applies the update equations 3.6 and 3.7 of kernel herding in section 3.5 to the estimate, equation 4.6. This is to obtain samples X̄_{t,1}, ..., X̄_{t,n} such that

m̄_{x_t|y_{1:t}} = (1/n) Σ_{i=1}^n k_X(·, X̄_{t,i})   (4.7)

is close to equation 4.6 in the RKHS. Our theoretical analysis in section 5 shows that such a procedure can reduce the error of the prediction step at time t + 1.

The procedure is summarized in algorithm 2. Specifically, we generate each X̄_{t,i} by searching for the solution of the optimization problem in equations 3.6 and 3.7 from the finite set of samples X_1, ..., X_n in equation 4.6. We allow repetitions in X̄_{t,1}, ..., X̄_{t,n}. We can expect that the resulting equation 4.7 is close to equation 4.6 in the RKHS if the samples X_1, ..., X_n cover the support of the posterior p(x_t|y_{1:t}) sufficiently. This is verified by the theoretical analysis of section 5.3.

Here, searching for the solutions from a finite set reduces the computational costs of kernel herding. It is possible to search from the entire space X if we have sufficient time or if the sample size n is small enough; it depends on applications and available computational resources. We also note that the size of the resampling samples is not necessarily n; this depends on how accurately these samples approximate equation 4.6. Thus a smaller number of samples may be sufficient. In this case we can reduce the computational costs of resampling, as discussed in section 5.2.

The aim of our resampling step is similar to that of the resampling step of a particle filter (see Doucet & Johansen, 2011). Intuitively, the aim is to eliminate samples with very small weights and replicate those with large weights (see Figures 2 and 3). In particle methods this is realized by generating samples from the empirical distribution defined by a weighted sample (therefore this procedure is called resampling). Our resampling step is a realization of such a procedure in terms of the kernel mean embedding: we generate samples X̄_{t,1}, ..., X̄_{t,n} from the empirical kernel mean, equation 4.6.

Note that the resampling algorithm of particle methods is not appropriate for use with kernel mean embeddings. This is because it assumes that weights are positive, but our weights in equation 4.6 can be negative, as this equation is a kernel mean estimator. One may apply the resampling algorithm of particle methods by first truncating the samples with negative weights. However, there is no guarantee that samples obtained by this heuristic produce a good approximation of equation 4.6 as a kernel mean, as shown by experiments in section 6.1. In this sense the use of kernel herding is more natural, since it generates samples that approximate a kernel mean.
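The sketch below follows the description of the resampling step above (greedy search of equations 3.6 and 3.7 restricted to {X_1, ..., X_n}, repetitions allowed); the gaussian kernel and bandwidth are illustrative assumptions. Negative weights need no special treatment, since only the kernel mean estimate itself is used.

```python
import numpy as np

def gauss_gram(A, B, sigma=0.1):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def resampling_step(weights, X_train, num_samples, sigma=0.1):
    """Kernel-herding resampling restricted to the finite set {X_1, ..., X_n}."""
    G = gauss_gram(X_train, X_train, sigma)
    m_hat = G @ weights                    # posterior kernel mean estimate evaluated at X_1..X_n
    running_sum = np.zeros(len(X_train))
    picked = []
    for l in range(1, num_samples + 1):
        j = int(np.argmax(m_hat - running_sum / l))
        picked.append(j)                   # repetitions are allowed
        running_sum += G[:, j]
    return X_train[picked]                 # resampled states, entering the next prediction step
```

The resampled points carry uniform weights 1/n into the next prediction step, which is exactly what increases the effective sample size discussed in section 5.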

4.2.4 Overall Algorithm. We summarize the overall procedure of KMCF in algorithm 3, where p_init denotes a prior distribution for the initial state x_1. For each time t, KMCF takes as input an observation y_t and outputs a weight vector w_t = (w_{t,1}, ..., w_{t,n})^T ∈ R^n. Combined with the samples X_1, ..., X_n in the state-observation examples (X_i, Y_i)_{i=1}^n, these weights provide an estimator, equation 4.6, of the kernel mean of the posterior, equation 4.1.

We first compute kernel matrices G_X, G_Y (lines 4-5), which are used in algorithm 1 of kernel Bayes' rule (line 15). For t = 1, we generate an i.i.d. sample X_{1,1}, ..., X_{1,n} from the initial distribution p_init (line 8), which provides an estimator of the prior corresponding to equation 4.4. Line 10 is the resampling step at time t - 1, and line 11 is the prediction step at time t. Lines 13 to 16 correspond to the correction step.


4.3 Discussion. The estimation accuracy of KMCF can depend on several factors in practice.

4.3.1 Training Samples. We first note that the training samples (X_i, Y_i)_{i=1}^n should provide the information concerning the observation model p(y_t|x_t). For example, (X_i, Y_i)_{i=1}^n may be an i.i.d. sample from a joint distribution p(x, y) on X × Y, which decomposes as p(x, y) = p(y|x)p(x). Here p(y|x) is the observation model and p(x) is some distribution on X. The support of p(x) should cover the region where states x_1, ..., x_T may pass in the test phase, as discussed in section 4.2. For example, this is satisfied when the state space X is compact and the support of p(x) is the entire X.

Note that the training samples (X_i, Y_i)_{i=1}^n can also be non-i.i.d. in practice. For example, we may deterministically select X_1, ..., X_n so that they cover the region of interest. In location estimation problems in robotics, for instance, we may collect location-sensor examples (X_i, Y_i)_{i=1}^n so that the locations X_1, ..., X_n cover the region where location estimation is to be conducted (Quigley et al., 2010).

4.3.2 Hyperparameters. As in other kernel methods in general, the performance of KMCF depends on the choice of its hyperparameters, which are the kernels k_X and k_Y (or parameters in the kernels, e.g., the bandwidth of the gaussian kernel) and the regularization constants δ, ε > 0. We need to define these hyperparameters based on the joint sample (X_i, Y_i)_{i=1}^n before running the algorithm on the test data y_1, ..., y_T. This can be done by cross-validation. Suppose that (X_i, Y_i)_{i=1}^n is given as a sequence from the state-space model. We can then apply two-fold cross-validation by dividing the sequence into two subsequences. If (X_i, Y_i)_{i=1}^n is not a sequence, we can rely on the cross-validation procedure for kernel Bayes' rule (see section 4.2 of Fukumizu et al., 2013).

4.3.3 Normalization of Weights. We found in our preliminary experiments that normalization of the weights (see line 16, algorithm 3) is beneficial to the filtering performance. This may be justified by the following discussion about a kernel mean estimator in general. Let us consider a consistent kernel mean estimator m̂_P = Σ_{i=1}^n w_i k(·, X_i) such that lim_{n→∞} ‖m̂_P - m_P‖_H = 0. Then we can show that the sum of the weights converges to 1, lim_{n→∞} Σ_{i=1}^n w_i = 1, under certain assumptions (Kanagawa & Fukumizu, 2014). This could be explained as follows. Recall that the weighted average Σ_{i=1}^n w_i f(X_i) of a function f is an estimator of the expectation ∫ f(x) dP(x). Let f be a function that takes the value 1 for any input: f(x) = 1, ∀x ∈ X. Then we have Σ_{i=1}^n w_i f(X_i) = Σ_{i=1}^n w_i and ∫ f(x) dP(x) = 1. Therefore Σ_{i=1}^n w_i is an estimator of 1. In other words, if the error ‖m̂_P - m_P‖_H is small, then the sum of the weights Σ_{i=1}^n w_i should be close to 1. Conversely, if the sum of the weights is far from 1, it suggests that the estimate m̂_P is not accurate. Based on this theoretical observation, we suppose that normalization of the weights (this makes the sum equal to 1) results in a better estimate.

4.3.4 Time Complexity. For each time t, the naive implementation of algorithm 3 requires a time complexity of O(n³) for the size n of the joint sample (X_i, Y_i)_{i=1}^n. This comes from algorithm 1 in line 15 (kernel Bayes' rule) and algorithm 2 in line 10 (resampling). The complexity O(n³) of algorithm 1 is due to the matrix inversions. Note that one of the inversions, (G_X + nεI_n)^{-1}, can be computed before the test phase, as it does not involve the test data. Algorithm 2 also has complexity O(n³). In section 5.2 we will explain how this cost can be reduced to O(n²) by generating only ℓ < n samples by resampling.

4.3.5 Speeding Up Methods. In appendix C we describe two methods for reducing the computational costs of KMCF, both of which only need to be applied prior to the test phase. The first is a low-rank approximation of the kernel matrices G_X, G_Y, which reduces the complexity to O(nr²), where r is the rank of the low-rank matrices. Low-rank approximation works well in practice, since the eigenvalues of a kernel matrix often decay very rapidly. Indeed, this has been theoretically shown for some cases (see Widom, 1963, 1964, and discussions in Bach & Jordan, 2002). The second is a data-reduction method based on kernel herding, which efficiently selects joint subsamples from the training set (X_i, Y_i)_{i=1}^n. Algorithm 3 is then applied based only on those subsamples. The resulting complexity is thus O(r³), where r is the number of subsamples. This method is motivated by the fast convergence rate of kernel herding (Chen et al., 2010).

Both methods require the number r to be chosen, which is either the rank for low-rank approximation or the number of subsamples in data reduction. This determines the trade-off between accuracy and computational time. In practice, there are two ways of selecting the number r. By regarding r as a hyperparameter of KMCF, we can select it by cross-validation; or we can choose r by comparing the resulting approximation error, which is measured in a matrix norm for low-rank approximation and in an RKHS norm for the subsampling method (for details, see appendix C).
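As an illustration of the kind of low-rank factorization involved (the specific algorithm of appendix C is not reproduced here), the following sketch forms a Nyström-style approximation G ≈ U U^T of rank r from a random subset of columns. The choice of Nyström and of uniform column sampling are assumptions made for this example.

```python
import numpy as np

def nystrom_factor(G, r, rng):
    """Return U (n x r) with G approximately equal to U @ U.T.
    Generic Nystrom construction: sample r columns, then use the
    pseudo-inverse square root of the corresponding r x r block."""
    n = G.shape[0]
    idx = rng.choice(n, size=r, replace=False)
    C = G[:, idx]                      # n x r block of sampled columns
    W = G[np.ix_(idx, idx)]            # r x r block
    vals, vecs = np.linalg.eigh(W)     # symmetric pseudo-inverse square root of W
    vals = np.clip(vals, 1e-12, None)
    W_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
    return C @ W_inv_sqrt              # G ~ U U^T with U = C W^{-1/2}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 2))
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    G = np.exp(-sq / 2.0)
    U = nystrom_factor(G, r=30, rng=rng)
    print(np.linalg.norm(G - U @ U.T) / np.linalg.norm(G))   # relative approximation error
```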

4.3.6 Transfer Learning Setting. We assumed that the observation model in the test phase is the same as for the training samples. However, this might not hold in some situations. For example, in the vision-based localization problem, the illumination conditions for the test and training phases might be different (e.g., the test is done at night, while the training samples are collected in the morning). Without taking into account such a significant change in the observation model, KMCF would not perform well in practice.

This problem could be addressed by exploiting the framework of transfer learning (Pan & Yang, 2010). This framework aims at situations where the probability distribution that generates test data is different from that of the training samples. The main assumption is that there exist a small number of examples from the test distribution. Transfer learning then provides a way of combining such test examples and abundant training samples, thereby improving the test performance. The application of transfer learning in our setting remains a topic for future research.

4.4 Estimation of Posterior Statistics. By algorithm 3, we obtain the estimates of the kernel means of the posteriors, equation 4.1, as

m̂_{x_t|y_{1:t}} = Σ_{i=1}^n w_{t,i} k_X(·, X_i)   (t = 1, ..., T).   (4.8)

These contain the information on the posteriors p(x_t|y_{1:t}) (see sections 3.2 and 3.4). We now show how to estimate statistics of the posteriors using these estimates. For ease of presentation, we consider the case X = R^d. Theoretical arguments to justify these operations are provided by Kanagawa and Fukumizu (2014).

4.4.1 Mean and Covariance. Consider the posterior mean ∫ x_t p(x_t|y_{1:t}) dx_t ∈ R^d and the posterior (uncentered) covariance ∫ x_t x_t^T p(x_t|y_{1:t}) dx_t ∈ R^{d×d}. These quantities can be estimated as

Σ_{i=1}^n w_{t,i} X_i (mean),   Σ_{i=1}^n w_{t,i} X_i X_i^T (covariance).

4.4.2 Probability Mass. Let A ⊂ X be a measurable set with smooth boundary. Define the indicator function I_A(x) by I_A(x) = 1 for x ∈ A and I_A(x) = 0 otherwise. Consider the probability mass ∫ I_A(x) p(x_t|y_{1:t}) dx_t. This can be estimated as Σ_{i=1}^n w_{t,i} I_A(X_i).

4.4.3 Density. Suppose p(x_t|y_{1:t}) has a density function. Let J(x) be a smoothing kernel satisfying ∫ J(x) dx = 1 and J(x) ≥ 0. Let h > 0 and define J_h(x) = (1/h^d) J(x/h). Then the density of p(x_t|y_{1:t}) can be estimated as

p̂(x_t|y_{1:t}) = Σ_{i=1}^n w_{t,i} J_h(x_t - X_i),   (4.9)

with an appropriate choice of h.

4.4.4 Mode. The mode may be obtained by finding a point that maximizes equation 4.9. However, this requires a careful choice of h. Instead, we may use X_{i_max} with i_max = argmax_i w_{t,i} as a mode estimate. This is the point in X_1, ..., X_n that is associated with the maximum weight in w_{t,1}, ..., w_{t,n}. This point can be interpreted as the point that maximizes equation 4.9 in the limit of h → 0.
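The following sketch collects these point estimates (mean, uncentered covariance, probability mass of a set, kernel density value, and mode) from a weight vector w_t and the training states X_i; the gaussian smoothing kernel for the density is an illustrative choice, not prescribed by the letter.

```python
import numpy as np

def posterior_statistics(w, X, A_indicator=None, x_query=None, h=0.1):
    """Estimate posterior statistics from the weighted estimate of equation 4.8.
    w: (n,) weights, X: (n, d) training states."""
    mean = w @ X                                        # sum_i w_i X_i
    cov = (w[:, None] * X).T @ X                        # sum_i w_i X_i X_i^T (uncentered)
    out = {"mean": mean, "covariance": cov,
           "mode": X[int(np.argmax(w))]}                # point with the largest weight
    if A_indicator is not None:                         # probability mass of a set A
        out["prob_mass"] = float(w @ np.array([A_indicator(x) for x in X]))
    if x_query is not None:                             # kernel density estimate, eq. 4.9
        d = X.shape[1]
        J = np.exp(-((x_query - X) ** 2).sum(-1) / (2 * h ** 2)) / ((2 * np.pi) ** (d / 2) * h ** d)
        out["density"] = float(w @ J)
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    w = np.full(200, 1.0 / 200)
    stats = posterior_statistics(w, X,
                                 A_indicator=lambda x: float(np.all(np.abs(x) <= 1)),
                                 x_query=np.zeros(2))
```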

4.4.5 Other Methods. Other ways of using equation 4.8 include the preimage computation and fitting of gaussian mixtures (see, e.g., Song et al., 2009; Fukumizu et al., 2013; McCalman, O'Callaghan, & Ramos, 2013).

5 Theoretical Analysis

In this section we analyze the sampling procedure of the prediction step in section 4.2. Specifically, we derive an upper bound on the error of the estimator 4.4. We also discuss in detail how the resampling step in section 4.2 works as a preprocessing step of the prediction step.

To make our analysis clear, we slightly generalize the setting of the prediction step and discuss the sampling and resampling procedures in this setting.

5.1 Error Bound for the Prediction Step. Let X be a measurable space and P be a probability distribution on X. Let p(·|x) be a conditional distribution on X conditioned on x ∈ X. Let Q be a marginal distribution on X defined by Q(B) = ∫ p(B|x) dP(x) for all measurable B ⊂ X. In the filtering setting of section 4, the space X corresponds to the state space, and the distributions P, p(·|x), and Q correspond to the posterior p(x_{t-1}|y_{1:t-1}) at time t - 1, the transition model p(x_t|x_{t-1}), and the prior p(x_t|y_{1:t-1}) at time t, respectively.

Let k_X be a positive-definite kernel on X and H_X be the RKHS associated with k_X. Let m_P = ∫ k_X(·, x) dP(x) and m_Q = ∫ k_X(·, x) dQ(x) be the kernel means of P and Q, respectively. Suppose that we are given an empirical estimate of m_P as

m̂_P = Σ_{i=1}^n w_i k_X(·, X_i),   (5.1)

where w_1, ..., w_n ∈ R and X_1, ..., X_n ∈ X. Considering this weighted sample form enables us to explain the mechanism of the resampling step.

The prediction step can then be cast as the following procedure: for each sample X_i, we generate a new sample X'_i with the conditional distribution, X'_i ~ p(·|X_i). Then we estimate m_Q by

m̂_Q = Σ_{i=1}^n w_i k_X(·, X'_i),   (5.2)

which corresponds to the estimate 4.4 of the prior kernel mean at time t.

The following theorem provides an upper bound on the error of equation 5.2 and reveals properties of equation 5.1 that affect the error of the estimator, equation 5.2. The proof is given in appendix A.

Theorem 1. Let m̂_P be a fixed estimate of m_P given by equation 5.1. Define a function θ on X × X by θ(x_1, x_2) = ∫∫ k_X(x'_1, x'_2) dp(x'_1|x_1) dp(x'_2|x_2) for all x_1, x_2 ∈ X, and assume that θ is included in the tensor RKHS H_X ⊗ H_X (see note 4). The estimator m̂_Q, equation 5.2, then satisfies

E_{X'_1,...,X'_n}[ ‖m̂_Q - m_Q‖²_{H_X} ]
  ≤ Σ_{i=1}^n w_i² ( E_{X'_i}[k_X(X'_i, X'_i)] - E_{X'_i, X̃'_i}[k_X(X'_i, X̃'_i)] )   (5.3)
  + ‖m̂_P - m_P‖²_{H_X} ‖θ‖_{H_X ⊗ H_X},   (5.4)

where X'_i ~ p(·|X_i) and X̃'_i is an independent copy of X'_i.

Note 4: The tensor RKHS H_X ⊗ H_X is the RKHS of a product kernel k_{X×X} on X × X defined as k_{X×X}((x_a, x_b), (x_c, x_d)) = k_X(x_a, x_c) k_X(x_b, x_d) for all (x_a, x_b), (x_c, x_d) ∈ X × X. This space H_X ⊗ H_X consists of smooth functions on X × X if the kernel k_X is smooth (e.g., if k_X is gaussian; see section 4 of Steinwart & Christmann, 2008). In this case we can interpret this assumption as requiring that θ be smooth as a function on X × X. The function θ can be written as the inner product between the kernel means of the conditional distributions: θ(x_1, x_2) = ⟨m_{p(·|x_1)}, m_{p(·|x_2)}⟩_{H_X}, where m_{p(·|x)} = ∫ k_X(·, x') dp(x'|x). Therefore the assumption may be further seen as requiring that the map x → m_{p(·|x)} be smooth. Note that while similar assumptions are common in the literature on kernel mean embeddings (e.g., theorem 5 of Fukumizu et al., 2013), we may relax this assumption by using approximate arguments in learning theory (e.g., theorems 2.2 and 2.3 of Eberts & Steinwart, 2013). This analysis remains a topic for future research.

From theorem 1 we can make the following observations. First, the second term, equation 5.4, of the upper bound shows that the error of the estimator, equation 5.2, is likely to be large if the given estimate, equation 5.1, has large error ‖m̂_P - m_P‖²_{H_X}, which is reasonable to expect.

Second, the first term, equation 5.3, shows that the error of equation 5.2 can be large if the distribution of X'_i (i.e., p(·|X_i)) has large variance. For example, suppose X'_i = f(X_i) + ε_i, where f : X → X is some mapping and ε_i is a random variable with mean 0. Let k_X be the gaussian kernel k_X(x, x') = exp(-‖x - x'‖²/2α) for some α > 0. Then E_{X'_i}[k_X(X'_i, X'_i)] - E_{X'_i, X̃'_i}[k_X(X'_i, X̃'_i)] increases from 0 to 1 as the variance of ε_i (i.e., the variance of X'_i) increases from 0 to infinity. Therefore, in this case equation 5.3 is upper-bounded at worst by Σ_{i=1}^n w_i². Note that E_{X'_i}[k_X(X'_i, X'_i)] - E_{X'_i, X̃'_i}[k_X(X'_i, X̃'_i)] is always nonnegative (see note 5).

5.1.1 Effective Sample Size. Now let us assume that the kernel k_X is bounded: there is a constant C > 0 such that sup_{x∈X} k_X(x, x) < C. Then the inequality of theorem 1 can be further bounded as

E_{X'_1,...,X'_n}[ ‖m̂_Q - m_Q‖²_{H_X} ] ≤ 2C Σ_{i=1}^n w_i² + ‖m̂_P - m_P‖²_{H_X} ‖θ‖_{H_X ⊗ H_X}.   (5.5)

This bound shows that two quantities are important in the estimate, equation 5.1: (1) the sum of squared weights Σ_{i=1}^n w_i² and (2) the error ‖m̂_P - m_P‖²_{H_X}. In other words, the error of equation 5.2 can be large if the quantity Σ_{i=1}^n w_i² is large, regardless of the accuracy of equation 5.1 as an estimator of m_P. In fact, the estimator of the form 5.1 can have large Σ_{i=1}^n w_i² even when ‖m̂_P - m_P‖²_{H_X} is small, as shown in section 6.1.

Note 5: To show this, it is sufficient to prove that ∫∫ k_X(x, x̃) dP(x) dP(x̃) ≤ ∫ k_X(x, x) dP(x) for any probability P. This can be shown as follows: ∫∫ k_X(x, x̃) dP(x) dP(x̃) = ∫∫ ⟨k_X(·, x), k_X(·, x̃)⟩_{H_X} dP(x) dP(x̃) ≤ ∫∫ √(k_X(x, x)) √(k_X(x̃, x̃)) dP(x) dP(x̃) ≤ ∫ k_X(x, x) dP(x). Here we used the reproducing property, the Cauchy-Schwartz inequality, and Jensen's inequality.


Figure 3: An illustration of the sampling procedure with (right) and without (left) the resampling algorithm. The left panel corresponds to the kernel mean estimators, equations 5.1 and 5.2, in section 5.1, and the right panel corresponds to equations 5.6 and 5.7 in section 5.2.

The inverse of the sum of the squared weights, 1/Σ_{i=1}^n w_i², can be interpreted as the effective sample size (ESS) of the empirical kernel mean, equation 5.1. To explain this, suppose that the weights are normalized: Σ_{i=1}^n w_i = 1. Then the ESS takes its maximum n when the weights are uniform, w_1 = ... = w_n = 1/n. It becomes small when only a few samples have large weights (see the left side in Figure 3). Therefore the bound, equation 5.5, can be interpreted as follows: to make equation 5.2 a good estimator of m_Q, we need to have equation 5.1 such that the ESS is large and the error ‖m̂_P - m_P‖_H is small. Here we borrowed the notion of ESS from the literature on particle methods, in which ESS has also played an important role (see section 2.5.3 of Liu, 2001, and section 3.5 of Doucet & Johansen, 2011).
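In code, the ESS is a one-liner, and the decision rule discussed in section 5.2.1 below (resample only when the sum of squared weights exceeds a threshold such as 2/n, i.e., the ESS drops below n/2) follows directly; this sketch and its default threshold are illustrative.

```python
import numpy as np

def effective_sample_size(weights):
    """ESS = 1 / sum_i w_i^2 for normalized weights of a kernel mean estimate."""
    w = np.asarray(weights, dtype=float)
    return 1.0 / np.sum(w ** 2)

def should_resample(weights, factor=2.0):
    """Resample when sum_i w_i^2 exceeds factor/n, i.e., when ESS < n/factor."""
    n = len(weights)
    return np.sum(np.asarray(weights) ** 2) > factor / n

if __name__ == "__main__":
    uniform = np.full(100, 1.0 / 100)
    skewed = np.r_[0.9, np.full(99, 0.1 / 99)]
    print(effective_sample_size(uniform), should_resample(uniform))  # 100.0, False
    print(effective_sample_size(skewed), should_resample(skewed))    # about 1.23, True
```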

5.2 Role of Resampling. Based on these arguments, we explain how the resampling step in section 4.2 works as a preprocessing step for the sampling procedure. Consider m̂_P in equation 5.1 as an estimate, equation 4.6, given by the correction step at time t - 1. Then we can think of m̂_Q, equation 5.2, as an estimator of the kernel mean, equation 4.5, of the prior without the resampling step.

The resampling step is application of kernel herding to m̂_P to obtain samples X̄_1, ..., X̄_n, which provide a new estimate of m_P with uniform weights:

m̄_P = (1/n) Σ_{i=1}^n k_X(·, X̄_i).   (5.6)

The subsequent prediction step is to generate a sample X̄'_i ~ p(·|X̄_i) for each X̄_i (i = 1, ..., n) and estimate m_Q as

m̄_Q = (1/n) Σ_{i=1}^n k_X(·, X̄'_i).   (5.7)

Theorem 1 gives the following bound for this estimator that corresponds to equation 5.5:

E_{X̄'_1,...,X̄'_n}[ ‖m̄_Q - m_Q‖²_{H_X} ] ≤ 2C/n + ‖m̄_P - m_P‖²_{H_X} ‖θ‖_{H_X ⊗ H_X}.   (5.8)

A comparison of the upper bounds of equations 5.5 and 5.8 implies that the resampling step is beneficial when Σ_{i=1}^n w_i² is large (i.e., the ESS is small) and ‖m̄_P - m̂_P‖_{H_X} is small. The condition on ‖m̄_P - m̂_P‖_{H_X} means that the loss by kernel herding (in terms of the RKHS distance) is small. This implies ‖m̄_P - m_P‖_{H_X} ≈ ‖m̂_P - m_P‖_{H_X}, so the second term of equation 5.8 is close to that of equation 5.5. On the other hand, the first term of equation 5.8 will be much smaller than that of equation 5.5 if Σ_{i=1}^n w_i² ≫ 1/n. In other words, the resampling step improves the accuracy of the sampling procedure by increasing the ESS of the kernel mean estimate m̂_P. This is illustrated in Figure 3.

The above observations lead to the following procedures.

5.2.1 When to Apply Resampling. If Σ_{i=1}^n w_i² is not large, the gain by the resampling step will be small. Therefore the resampling algorithm should be applied when Σ_{i=1}^n w_i² is above a certain threshold, say 2/n. The same strategy has been commonly used in particle methods (see Doucet & Johansen, 2011).

Also, the bound, equation 5.3, of theorem 1 shows that resampling is not beneficial if the variance of the conditional distribution p(·|x) is very small (i.e., if the state transition is nearly deterministic). In this case, the error of the sampling procedure may increase due to the loss ‖m̄_P - m̂_P‖_{H_X} caused by kernel herding.

5.2.2 Reduction of Computational Cost. Algorithm 2 generates n samples X̄_1, ..., X̄_n with time complexity O(n³). Suppose that the first ℓ samples X̄_1, ..., X̄_ℓ, where ℓ < n, already approximate m̂_P well: ‖(1/ℓ) Σ_{i=1}^ℓ k_X(·, X̄_i) - m̂_P‖_{H_X} is small. We do not then need to generate the rest of the samples X̄_{ℓ+1}, ..., X̄_n; we can make n samples by copying the ℓ samples n/ℓ times (suppose n can be divided by ℓ for simplicity, say n = 2ℓ). Let X̃_1, ..., X̃_n denote these n samples. Then (1/ℓ) Σ_{i=1}^ℓ k_X(·, X̄_i) = (1/n) Σ_{i=1}^n k_X(·, X̃_i) by definition, so ‖(1/n) Σ_{i=1}^n k_X(·, X̃_i) - m̂_P‖_{H_X} is also small. This reduces the time complexity of algorithm 2 to O(n²).

One might think that it is unnecessary to copy the ℓ samples n/ℓ times to make n samples. This is not true, however. Suppose that we just use the first ℓ samples to define m̄_P = (1/ℓ) Σ_{i=1}^ℓ k_X(·, X̄_i). Then the first term of equation 5.8 becomes 2C/ℓ, which is larger than the 2C/n of n samples. This difference involves sampling with the conditional distribution X̄'_i ~ p(·|X̄_i): if we use just the ℓ samples, sampling is done ℓ times; if we use the copied n samples, sampling is done n times. Thus the benefit of making n samples comes from sampling with the conditional distribution many times. This matches the bound of theorem 1, where the first term involves the variance of the conditional distribution.
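A short sketch of this cost-reduction trick (illustrative, one-dimensional states and a linear gaussian transition assumed): generate only ℓ herded points, tile them to length n, and then draw one transition sample per copy.

```python
import numpy as np

def tile_and_propagate(X_herded, n, transition_sampler, rng):
    """Copy the ell herded points n/ell times (assume ell divides n) and then
    sample the transition once per copy, as described in section 5.2.2."""
    ell = len(X_herded)
    assert n % ell == 0
    X_tiled = np.tile(X_herded, n // ell)          # n points, each appearing n/ell times
    return np.array([transition_sampler(x, rng) for x in X_tiled])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_herded = rng.normal(size=10)                 # ell = 10 resampled states
    X_prime = tile_and_propagate(X_herded, n=50,
                                 transition_sampler=lambda x, r: 0.9 * x + r.normal(),
                                 rng=rng)
```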

5.3 Convergence Rates for Resampling. Our resampling algorithm (see algorithm 2) is an approximate version of kernel herding in section 3.5: algorithm 2 searches for the solutions of the update equations 3.6 and 3.7 from a finite set {X_1, ..., X_n} ⊂ X, not from the entire space X. Therefore existing theoretical guarantees for kernel herding (Chen et al., 2010; Bach et al., 2012) do not apply to algorithm 2. Here we provide a theoretical justification.

5.3.1 Generalized Version. We consider a slightly generalized version shown in algorithm 4. It takes as input (1) a kernel mean estimator m̂_P of a kernel mean m_P, (2) candidate samples Z_1, ..., Z_N, and (3) the number ℓ of resampling samples. It then outputs resampling samples X̄_1, ..., X̄_ℓ ∈ {Z_1, ..., Z_N}, which form a new estimator m̄_P = (1/ℓ) Σ_{i=1}^ℓ k_X(·, X̄_i). Here N is the number of the candidate samples.

Algorithm 4 searches for solutions of the update equations 3.6 and 3.7 from the candidate set {Z_1, ..., Z_N}. Note that here these samples Z_1, ..., Z_N can be different from those expressing the estimator m̂_P. If they are the same (the estimator is expressed as m̂_P = Σ_{i=1}^n w_i k(·, X_i) with n = N and X_i = Z_i (i = 1, ..., n)), then algorithm 4 reduces to algorithm 2. In fact, theorem 2 allows m̂_P to be any element in the RKHS.

5.3.2 Convergence Rates in Terms of N and ℓ. Algorithm 4 gives the new estimator m̄_P of the kernel mean m_P. The error of this new estimator, ‖m̄_P - m_P‖_{H_X}, should be close to that of the given estimator, ‖m̂_P - m_P‖_{H_X}. Theorem 2 guarantees this. In particular, it provides convergence rates of ‖m̄_P - m_P‖_{H_X} approaching ‖m̂_P - m_P‖_{H_X} as N and ℓ go to infinity. This theorem follows from theorem 3 in appendix B, which holds under weaker assumptions.

Theorem 2. Let m_P be the kernel mean of a distribution P, and m̂_P be any element in the RKHS H_X. Let Z_1, ..., Z_N be an i.i.d. sample from a distribution with density q. Assume that P has a density function p such that sup_{x∈X} p(x)/q(x) < ∞. Let X̄_1, ..., X̄_ℓ be samples given by algorithm 4 applied to m̂_P with candidate samples Z_1, ..., Z_N. Then for m̄_P = (1/ℓ) Σ_{i=1}^ℓ k(·, X̄_i), we have

‖m̄_P - m_P‖²_{H_X} = ( ‖m̂_P - m_P‖_{H_X} + O_p(N^{-1/2}) )² + O(ln ℓ / ℓ)   (N, ℓ → ∞).   (5.9)

Our proof in appendix B relies on the fact that kernel herding can be seen as the Frank-Wolfe optimization method (Bach et al., 2012). Indeed, the error O(ln ℓ / ℓ) in equation 5.9 comes from the optimization error of the Frank-Wolfe method after ℓ iterations (Freund & Grigas, 2014, bound 3.2). The error O_p(N^{-1/2}) is due to the approximation of the solution space by a finite set Z_1, ..., Z_N. These errors will be small if N and ℓ are large enough and the error of the given estimator, ‖m̂_P - m_P‖_{H_X}, is relatively large. This is formally stated in corollary 1 below.

Theorem 2 assumes that the candidate samples are i.i.d. with a density q. The assumption sup_{x∈X} p(x)/q(x) < ∞ requires that the support of q contain that of p. This is a formal characterization of the explanation in section 4.2 that the samples X_1, ..., X_N should cover the support of P sufficiently. Note that the statement of theorem 2 also holds for non-i.i.d. candidate samples, as shown in theorem 3 of appendix B.

5.3.3 Convergence Rates as m̂_P Goes to m_P. Theorem 2 provides convergence rates when the estimator m̂_P is fixed. In corollary 1 below, we let m̂_P approach m_P and provide convergence rates for m̄_P of algorithm 4 approaching m_P. This corollary directly follows from theorem 2, since the constant terms in O_p(N^{-1/2}) and O(ln ℓ / ℓ) in equation 5.9 do not depend on m̂_P, which can be seen from the proof in section B.

Corollary 1. Assume that P and Z_1, ..., Z_N satisfy the conditions in theorem 2 for all N. Let m̂_P^(n) be an estimator of m_P such that ‖m̂_P^(n) - m_P‖_{H_X} = O_p(n^{-b}) as n → ∞ for some constant b > 0 (see note 6). Let N = ℓ = n^{2b}. Let X̄_1^(n), ..., X̄_ℓ^(n) be samples given by algorithm 4 applied to m̂_P^(n) with candidate samples Z_1, ..., Z_N. Then for m̄_P^(n) = (1/ℓ) Σ_{i=1}^ℓ k_X(·, X̄_i^(n)), we have

‖m̄_P^(n) - m_P‖_{H_X} = O_p(n^{-b})   (n → ∞).   (5.10)

Note 6: Here the estimator m̂_P^(n) and the candidate samples Z_1, ..., Z_N can be dependent.

Corollary 1 assumes that the estimator m̂_P^(n) converges to m_P at a rate O_p(n^{-b}) for some constant b > 0. Then the resulting estimator m̄_P^(n) by algorithm 4 also converges to m_P at the same rate O_p(n^{-b}) if we set N = ℓ = n^{2b}. This implies that if we use sufficiently large N and ℓ, the errors O_p(N^{-1/2}) and O(ln ℓ / ℓ) in equation 5.9 can be negligible, as stated earlier. Note that N = ℓ = n^{2b} implies that N and ℓ can be smaller than n, since typically we have b ≤ 1/2 (b = 1/2 corresponds to the convergence rates of parametric models). This provides support for the discussion in section 5.2 (reduction of computational cost).

5.3.4 Convergence Rates of Sampling after Resampling. We can derive convergence rates of the estimator m̄_Q, equation 5.7, in section 5.2. Here we consider the following construction of m̄_Q, as discussed in section 5.2 (reduction of computational cost). First, apply algorithm 4 to m̂_P^(n) and obtain resampling samples X̄_1^(n), ..., X̄_ℓ^(n) ∈ {Z_1, ..., Z_N}. Then copy these samples n/ℓ times, and let X̃_1^(n), ..., X̃_n^(n) denote the resulting n samples. Finally, sample with the conditional distribution, X'_i^(n) ~ p(·|X̃_i^(n)) (i = 1, ..., n), and define

m̄_Q^(n) = (1/n) Σ_{i=1}^n k_X(·, X'_i^(n)).   (5.11)

The following corollary is a consequence of corollary 1, theorem 1, and the bound, equation 5.8. Note that theorem 1 obtains convergence in expectation, which implies convergence in probability.

Corollary 2. Let θ be the function defined in theorem 1, and assume θ ∈ H_X ⊗ H_X. Assume that P and Z_1, ..., Z_N satisfy the conditions in theorem 2 for all N. Let m̂_P^(n) be an estimator of m_P such that ‖m̂_P^(n) - m_P‖_{H_X} = O_p(n^{-b}) as n → ∞ for some constant b > 0. Let N = ℓ = n^{2b}. Then for the estimator m̄_Q^(n) defined as equation 5.11, we have

‖m̄_Q^(n) - m_Q‖_{H_X} = O_p(n^{-min(b,1/2)})   (n → ∞).

Suppose b ≤ 1/2, which holds with basically any nonparametric estimators. Then corollary 2 shows that the estimator m̄_Q^(n) achieves the same convergence rate as the input estimator m̂_P^(n). Note that without resampling, the rate becomes O_p( √(Σ_{i=1}^n (w_i^(n))²) + n^{-b} ), where the weights are given by the input estimator m̂_P^(n) = Σ_{i=1}^n w_i^(n) k_X(·, X_i^(n)) (see the bound, equation 5.5). Thanks to resampling, (the square root of) the sum of the squared weights in the case of corollary 2 becomes 1/√n ≤ 1/√ℓ, which is usually smaller than √(Σ_{i=1}^n (w_i^(n))²) and is faster than or equal to O_p(n^{-b}). This shows the merit of resampling in terms of convergence rates (see also the discussions in section 5.2).

5.4 Consistency of the Overall Procedure. Here we show the consistency of the overall procedure in KMCF. This is based on corollary 2, which shows the consistency of the resampling step followed by the prediction step, and on theorem 5 of Fukumizu et al. (2013), which guarantees the consistency of kernel Bayes' rule in the correction step. Thus we consider three steps in the following order: resampling, prediction, and correction. More specifically, we show the consistency of the estimator, equation 4.6, of the posterior kernel mean at time t, given that the one at time t - 1 is consistent.

To state our assumptions, we will need the following functions θ_pos : Y × Y → R, θ_obs : X × X → R, and θ_tra : X × X → R:

θ_pos(y, ỹ) = ∫∫ k_X(x_t, x̃_t) dp(x_t|y_{1:t-1}, y_t = y) dp(x̃_t|y_{1:t-1}, y_t = ỹ),   (5.12)

θ_obs(x, x̃) = ∫∫ k_Y(y_t, ỹ_t) dp(y_t|x_t = x) dp(ỹ_t|x_t = x̃),   (5.13)

θ_tra(x, x̃) = ∫∫ k_X(x_t, x̃_t) dp(x_t|x_{t-1} = x) dp(x̃_t|x_{t-1} = x̃).   (5.14)

These functions contain the information concerning the distributions involved. In equation 5.12, the distribution p(x_t|y_{1:t-1}, y_t = y) denotes the posterior of the state at time t, given that the observation at time t is y_t = y. Similarly, p(x̃_t|y_{1:t-1}, y_t = ỹ) is the posterior at time t, given that the observation is y_t = ỹ. In equation 5.13, the distributions p(y_t|x_t = x) and p(ỹ_t|x_t = x̃) denote the observation model when the state is x_t = x or x_t = x̃, respectively. In equation 5.14, the distributions p(x_t|x_{t-1} = x) and p(x̃_t|x_{t-1} = x̃) denote the transition model with the previous state given by x_{t-1} = x or x_{t-1} = x̃, respectively.

For simplicity of presentation, we consider here N = ℓ = n for the resampling step. Below, denote by F ⊗ G the tensor product space of two RKHSs F and G.

Corollary 3. Let (X_1, Y_1), ..., (X_n, Y_n) be an i.i.d. sample with a joint density p(x, y) = p(y|x)q(x), where p(y|x) is the observation model. Assume that the posterior p(x_t|y_{1:t}) has a density p and that sup_{x∈X} p(x)/q(x) < ∞. Assume that the functions defined by equations 5.12 to 5.14 satisfy θ_pos ∈ H_Y ⊗ H_Y, θ_obs ∈ H_X ⊗ H_X, and θ_tra ∈ H_X ⊗ H_X, respectively. Suppose that ‖m̂_{x_{t-1}|y_{1:t-1}} - m_{x_{t-1}|y_{1:t-1}}‖_{H_X} → 0 as n → ∞ in probability. Then, for any sufficiently slow decay of the regularization constants ε_n and δ_n of algorithm 1, we have

‖m̂_{x_t|y_{1:t}} - m_{x_t|y_{1:t}}‖_{H_X} → 0   (n → ∞),

in probability.

Corollary 3 follows from theorem 5 of Fukumizu et al. (2013) and corollary 2. The assumptions θ_pos ∈ H_Y ⊗ H_Y and θ_obs ∈ H_X ⊗ H_X are due to theorem 5 of Fukumizu et al. (2013) for the correction step, while the assumption θ_tra ∈ H_X ⊗ H_X is due to theorem 1 for the prediction step, from which corollary 2 follows. As we discussed in note 4 of section 5.1, these essentially assume that the functions θ_pos, θ_obs, and θ_tra are smooth. Theorem 5 of Fukumizu et al. (2013) also requires that the regularization constants ε_n, δ_n of kernel Bayes' rule should decay sufficiently slowly as the sample size goes to infinity (ε_n, δ_n → 0 as n → ∞). (For details, see sections 5.2 and 6.2 in Fukumizu et al., 2013.)

It would be more interesting to investigate the convergence rates of the overall procedure. However, this requires a refined theoretical analysis of kernel Bayes' rule, which is beyond the scope of this letter. This is because currently there is no theoretical result on convergence rates of kernel Bayes' rule as an estimator of a posterior kernel mean (existing convergence results are for the expectation of function values; see theorems 6 and 7 in Fukumizu et al., 2013). This remains a topic for future research.

6 Experiments

This section is devoted to experiments. In section 6.1 we conduct basic experiments on the prediction and resampling steps before going on to the filtering problem; here we consider the problem described in section 5. In section 6.2 the proposed KMCF (see algorithm 3) is applied to synthetic state-space models. Comparisons are made with existing methods applicable to the setting of the letter (see also section 2). In section 6.3 we apply KMCF to the real problem of vision-based robot localization.

In the following, N(μ, σ²) denotes the gaussian distribution with mean μ ∈ R and variance σ² > 0.

6.1 Sampling and Resampling Procedures. The purpose here is to see how the prediction and resampling steps work empirically. To this end, we consider the problem described in section 5 with X = R (see section 5.1 for details). Specifications of the problem are described below.


We will need to evaluate the errors ‖m̂_P - m_P‖_{H_X} and ‖m̂_Q - m_Q‖_{H_X}, so we need to know the true kernel means m_P and m_Q. To this end, we define the distributions and the kernel to be gaussian; this allows us to obtain analytic expressions for m_P and m_Q.

6.1.1 Distributions and Kernel. More specifically, we define the marginal P and the conditional distribution p(·|x) to be gaussian: P = N(0, σ_P²) and p(·|x) = N(x, σ_cond²). Then the resulting Q = ∫ p(·|x) dP(x) also becomes gaussian: Q = N(0, σ_P² + σ_cond²). We define k_X to be the gaussian kernel k_X(x, x') = exp(-(x - x')²/2γ²). We set σ_P = σ_cond = γ = 0.1.

6.1.2 Kernel Means. Due to the convolution theorem of gaussian functions, the kernel means m_P = ∫ k_X(·, x) dP(x) and m_Q = ∫ k_X(·, x) dQ(x) can be analytically computed:

m_P(x) = √( γ²/(σ_P² + γ²) ) exp( -x²/(2(σ_P² + γ²)) ),

m_Q(x) = √( γ²/(σ_P² + σ_cond² + γ²) ) exp( -x²/(2(σ_P² + σ_cond² + γ²)) ).

X1 X100 from a uniform distribution on [minusA A] with some A gt 0 (spec-ified below) We computed the weights w1 wn by solving an optimiza-tion problem

minwisinRn

nsum

i=1

wikX (middot Xi) minus mP2H + λw2

and then applied normalization so thatsumn

i=1 wi = 1 Here λ gt 0 is a regular-ization constant which allows us to control the trade-off between the errormP minus mP2

HXand the quantity

sumni=1 w2

i = w2 If λ is very small the re-sulting mP becomes accurate mP minus mP2

HXis small but has large

sumni=1 w2

i If λ is large the error mP minus mP2

HXmay not be very small but

sumni=1 w2

i

becomes small This enables us to see how the error mQ minus mQ2HX

changesas we vary these quantities

614 Comparison Given mP = sumni=1 wikX (middot Xi) we wish to estimate the

kernel mean mQ We compare three estimators

bull woRes Estimate mQ without resampling Generate samples X primei sim

p(middot|Xi) to produce the estimate mQ = sumni=1 wikX (middot X prime

i ) This corre-sponds to the estimator discussed in section 51

bull Res-KH First apply the resampling algorithm of algorithm 2 to mPyielding X1 Xn Then generate X prime

i sim p(middot|Xi) for each Xi giving

Filtering with State-Observation Examples 413

Figure 4 Results of the experiments from section 61 (Top left and right)Sample-weight pairs of mP = sumn

i=1 wikX (middot Xi) and mQ = sumni=1 wik(middot X prime

i ) (Mid-dle left and right) Histogram of samples X1 Xn generated by algorithm2 and that of samples X prime

1 X primen from the conditional distribution (Bottom

left and right) Histogram of samples generated with multinomial resamplingafter truncating negative weights and that of samples from the conditionaldistribution

the estimate mQ = 1n

sumni=1 k(middot X prime

i ) This is the estimator discussed insection 52

bull Res-Trunc Instead of algorithm 2 first truncate negative weights inw1 wn to be 0 and apply normalization to make the sum of theweights to be 1 Then apply the multinomial resampling algorithmof particle methods and estimate mQ as Res-KH

6.1.5 Demonstration. Before starting quantitative comparisons, we demonstrate how the above estimators work. Figure 4 shows demonstration results with A = 1. First, note that for m̂_P = Σ_{i=1}^n w_i k(·, X_i), samples associated with large weights are located around the mean of P, as the standard deviation of P is relatively small (σ_P = 0.1). Note also that some of the weights are negative. In this example, the error of m̂_P is very small, ‖m̂_P - m_P‖²_{H_X} = 8.49e-10, while that of the estimate m̂_Q given by woRes is ‖m̂_Q - m_Q‖²_{H_X} = 0.125. This shows that even if ‖m̂_P - m_P‖²_{H_X} is very small, the resulting ‖m̂_Q - m_Q‖²_{H_X} may not be small, as implied by theorem 1 and the bound, equation 5.5.

We can observe the following. First, algorithm 2 successfully discarded samples associated with very small weights. Almost all the generated samples X̄_1, ..., X̄_n are located in [-2σ_P, 2σ_P], where σ_P is the standard deviation of P. The error is ‖m̄_P - m_P‖²_{H_X} = 4.74e-5, which is greater than ‖m̂_P - m_P‖²_{H_X}. This is due to the additional error caused by the resampling algorithm. Note that the resulting estimate m̄_Q is of the error ‖m̄_Q - m_Q‖²_{H_X} = 0.00827. This is much smaller than the estimate m̂_Q by woRes, showing the merit of the resampling algorithm.

Res-Trunc first truncated the negative weights in w_1, ..., w_n. Let us see the region where the density of P is very small, that is, the region outside [-2σ_P, 2σ_P]. We can observe that the absolute values of the weights are very small in this region. Note that there exist positive and negative weights. These weights maintain balance such that the amounts of positive and negative values are almost the same. Therefore, the truncation of the negative weights breaks this balance. As a result, the amount of the positive weights surpasses the amount needed to represent the density of P. This can be seen from the histogram for Res-Trunc: some of the samples X̄_1, ..., X̄_n generated by Res-Trunc are located in the region where the density of P is very small. Thus the resulting error ‖m̄_P - m_P‖²_{H_X} = 0.0538 is much larger than that of Res-KH. This demonstrates why the resampling algorithm of particle methods is not appropriate for kernel mean embeddings, as discussed in section 4.2.

HXchanges as we vary the quantity

sumni=1 w2

i (recall that

the bound equation 55 indicates that mQ minus mQ2HX

increases assumn

i=1 w2i

increases) To this end we made mP = sumni=1 wikX (middot Xi) for several values of

the regularization constant λ as described above For each λ we constructedmP and estimated mQ using each of the three estimators above We repeatedthis 20 times for each λ and averaged the values of mP minus mP2

HXsumn

i=1 w2i

and the errors mQ minus mQ2HX

by the three estimators Figure 5 shows theseresults where the both axes are in the log scale Here we used A = 5 forthe support of the uniform distribution7 The results are summarized asfollows

7This enables us to maintain the values for mP minus mP2HX

in almost the same amountwhile changing the values for

sumni=1 w2

i


Figure 5: Results of synthetic experiments for the sampling and resampling procedure in section 6.1. Vertical axis: errors in the squared RKHS norm. Horizontal axis: values of Σ_{i=1}^n w_i² for different m̂_P. Black: the error of m̂_P (‖m̂_P - m_P‖²_{H_X}). Blue, green, and red: the errors on m_Q by woRes, Res-KH, and Res-Trunc, respectively.

• The error of woRes (blue) increases proportionally to the amount of Σ_{i=1}^n w_i². This matches the bound, equation 5.5.

• The error of Res-KH is not affected by Σ_{i=1}^n w_i². Rather, it changes in parallel with the error of m̂_P. This is explained by the discussions in section 5.2 on how our resampling algorithm improves the accuracy of the sampling procedure.

• Res-Trunc is worse than Res-KH, especially for large Σ_{i=1}^n w_i². This is also explained with the bound, equation 5.8. Here m̄_P is the one given by Res-Trunc, so the error ‖m̄_P - m_P‖_{H_X} can be large due to the truncation of negative weights, as shown in the demonstration results. This makes the resulting error ‖m̄_Q - m_Q‖_{H_X} large.

Note that m_P and m_Q are different kernel means, so it can happen that the errors ‖m̄_Q - m_Q‖_{H_X} by Res-KH are less than ‖m̂_P - m_P‖_{H_X}, as in Figure 5.


6.2 Filtering with Synthetic State-Space Models. Here we apply KMCF to synthetic state-space models. Comparisons were made with the following methods.

kNN-PF (Vlassis et al., 2002): This method uses k-NN-based conditional density estimation (Stone, 1977) for learning the observation model. First, it estimates the conditional density of the inverse direction, p(x|y), from the training sample (X_i, Y_i). The learned conditional density is then used as an alternative for the likelihood p(y_t|x_t); this is a heuristic to deal with high-dimensional y_t. Then it applies a particle filter (PF) based on the approximated observation model and the given transition model p(x_t|x_{t-1}).

GP-PF (Ferris et al., 2006): This method learns p(y_t|x_t) from (X_i, Y_i) with gaussian process (GP) regression. Then a particle filter is applied based on the learned observation model and the transition model. We used the open source code for GP regression in this experiment (see note 8), so comparison in computational time is omitted for this method.

KBR filter (Fukumizu et al., 2011, 2013): This method is also based on kernel mean embeddings, as is KMCF. It applies kernel Bayes' rule (KBR) in posterior estimation, using the joint sample (X_i, Y_i). This method assumes that there also exist training samples for the transition model. Thus, in the following experiments, we additionally drew training samples for the transition model. Fukumizu et al. (2011, 2013) showed that this method outperforms extended and unscented Kalman filters when a state-space model has strong nonlinearity (in that experiment, these Kalman filters were given the full knowledge of the state-space model). We use this method as a baseline.

We used the state-space models defined in Table 2, where "SSM" stands for "state-space model." In Table 2, u_t denotes a control input at time t; v_t and w_t denote independent gaussian noise, v_t, w_t ~ N(0, 1); and W_t denotes 10-dimensional gaussian noise, W_t ~ N(0, I_10). We generated each control u_t randomly from the gaussian distribution N(0, 1).

The state and observation spaces for SSMs 1a, 1b, 2a, 2b, 4a, 4b are defined as X = Y = R; for SSMs 3a, 3b, X = R, Y = R^10. The models in SSMs 1a, 2a, 3a, 4a and SSMs 1b, 2b, 3b, 4b with the same number (e.g., 1a and 1b) are almost the same; the difference is whether u_t exists in the transition model. Prior distributions for the initial state x_1 for SSMs 1a, 1b, 2a, 2b, 3a, 3b are defined as p_init = N(0, 1/(1 - 0.9²)), and those for 4a, 4b are defined as a uniform distribution on [-3, 3].

SSMs 1a and 1b are linear gaussian models. SSMs 2a and 2b are the so-called stochastic volatility models. Their transition models are the same as those of SSMs 1a and 1b. The observation model has strong nonlinearity, and the noise w_t is multiplicative.

Note 8: http://www.gaussianprocess.org/gpml/code/matlab/doc


Table 2: State-Space Models for Synthetic Experiments.

SSM | Transition Model | Observation Model
1a | x_t = 0.9 x_{t-1} + v_t | y_t = x_t + w_t
1b | x_t = 0.9 x_{t-1} + (1/√2)(u_t + v_t) | y_t = x_t + w_t
2a | x_t = 0.9 x_{t-1} + v_t | y_t = 0.5 exp(x_t/2) w_t
2b | x_t = 0.9 x_{t-1} + (1/√2)(u_t + v_t) | y_t = 0.5 exp(x_t/2) w_t
3a | x_t = 0.9 x_{t-1} + v_t | y_t = 0.5 exp(x_t/2) W_t
3b | x_t = 0.9 x_{t-1} + (1/√2)(u_t + v_t) | y_t = 0.5 exp(x_t/2) W_t
4a | a_t = x_{t-1} + √2 v_t; x_t = a_t if |a_t| ≤ 3, x_t = -3 otherwise | b_t = x_t + w_t; y_t = b_t if |b_t| ≤ 3, y_t = b_t - 6 b_t/|b_t| otherwise
4b | a_t = x_{t-1} + u_t + v_t; x_t = a_t if |a_t| ≤ 3, x_t = -3 otherwise | b_t = x_t + w_t; y_t = b_t if |b_t| ≤ 3, y_t = b_t - 6 b_t/|b_t| otherwise

SSMs 3a and 3b are almost the same as SSMs 2a and 2b; the difference is that the observation y_t is 10-dimensional, as W_t is 10-dimensional gaussian noise. SSMs 4a and 4b are more complex than the other models. Both the transition and observation models have strong nonlinearities: states and observations located around the edges of the interval [-3, 3] may abruptly jump to distant places.
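For concreteness, the following short sketch simulates SSM 1a and SSM 2a from Table 2 to produce training pairs (X_i, Y_i) and an independent test sequence; the sample sizes here are placeholders, not the values used in the experiments.

```python
import numpy as np

def simulate_ssm(T, observation, rng):
    """Simulate x_t = 0.9 x_{t-1} + v_t with y_t = observation(x_t, w_t),
    where v_t, w_t ~ N(0, 1) (SSMs 1a and 2a of Table 2)."""
    xs, ys = np.empty(T), np.empty(T)
    x = rng.normal(scale=np.sqrt(1.0 / (1.0 - 0.9 ** 2)))   # stationary prior p_init
    for t in range(T):
        x = 0.9 * x + rng.normal()
        xs[t] = x
        ys[t] = observation(x, rng.normal())
    return xs, ys

obs_1a = lambda x, w: x + w                      # SSM 1a (linear gaussian)
obs_2a = lambda x, w: 0.5 * np.exp(x / 2.0) * w  # SSM 2a (stochastic volatility)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_train, Y_train = simulate_ssm(1000, obs_2a, rng)   # state-observation examples (placeholder size)
    x_test, y_test = simulate_ssm(100, obs_2a, rng)      # test sequence of length T = 100
```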

For each model, we generated the training samples (X_i, Y_i)_{i=1}^n by simulating the model. Test data (x_t, y_t)_{t=1}^T were also generated by independent simulation (recall that x_t is hidden for each method). The length of the test sequence was set as T = 100. We fixed the number of particles in kNN-PF and GP-PF to 5000; in primary experiments, we did not observe any improvements even when more particles were used. For the same reason, we fixed the size of transition examples for the KBR filter to 1000. Each method estimated the ground-truth states x_1, ..., x_T by estimating the posterior means ∫ x_t p(x_t|y_{1:t}) dx_t (t = 1, ..., T). The performance was evaluated with the RMSE (root mean squared error) of the point estimates, defined as

RMSE = √( (1/T) Σ_{t=1}^T (x̂_t - x_t)² ),

where x̂_t is the point estimate.

For KMCF and the KBR filter, we used gaussian kernels for each of X and Y (and also for controls in the KBR filter). We determined the hyperparameters of each method by two-fold cross-validation, dividing the training data into two sequences. The hyperparameters in the GP regressor for GP-PF were optimized by maximizing the marginal likelihood of the training data. To reduce the costs of the resampling step of KMCF, we used the method discussed in section 5.2 with ℓ = 50. We also used the low-rank approximation method (see algorithm 5) and the subsampling method (see algorithm 6) in appendix C to reduce the computational costs of KMCF. Specifically, we used r = 10, 20 (rank of low-rank matrices) for algorithm 5 (described as KMCF-low10 and KMCF-low20 in the results below) and r = 50, 100 (number of subsamples) for algorithm 6 (described as KMCF-sub50 and KMCF-sub100). We repeated experiments 20 times for each different training sample size n.


Figure 6: RMSE of the synthetic experiments in section 6.2. The state-space models of these figures have no control in their transition models.

Figure 6 shows the results in RMSE for SSMs 1a, 2a, 3a, 4a, and Figure 7 shows those for SSMs 1b, 2b, 3b, 4b. Figure 8 describes the results in computation time for SSMs 1a and 1b; the results for the other models are similar, so we omit them. We do not show the results of KMCF-low10 in Figures 6 and 7, since they were numerically unstable and gave very large RMSEs.

GP-PF performed the best for SSMs 1a and 1b. This may be because these models fit the assumption of GP regression, as their noise is additive gaussian. For the other models, however, GP-PF performed poorly.

Figure 7: RMSE of the synthetic experiments in section 6.2. The state-space models of these figures include control u_t in their transition models.

Figure 8: Computation time of the synthetic experiments in section 6.2. (Left) SSM 1a. (Right) SSM 1b.


The observation models of these models have strong nonlinearities, and the noise is not additive gaussian. For these models, KMCF performed the best or competitively with the other methods. This indicates that KMCF successfully exploits the state-observation examples (X_i, Y_i)_{i=1}^n in dealing with the complicated observation models. Recall that our focus has been on situations where the relations between states and observations are so complicated that the observation model is not known; the results indicate that KMCF is promising for such situations. On the other hand, the KBR filter performed worse than KMCF for most of the models. The KBR filter also uses kernel Bayes' rule, as does KMCF. The difference is that KMCF makes use of the transition models directly by sampling, while the KBR filter must learn the transition models from training data for state transitions. This indicates that the incorporation of the knowledge expressed in the transition model is very important for the filtering performance. This can also be seen by comparing Figures 6 and 7. The performance of the methods other than the KBR filter improved for SSMs 1b, 2b, 3b, 4b, compared to the performance for the corresponding models in SSMs 1a, 2a, 3a, 4a. Recall that SSMs 1b, 2b, 3b, 4b include control u_t in their transition models. The information of control input is helpful for filtering in general. Thus the improvements suggest that KMCF, kNN-PF, and GP-PF successfully incorporate the information of controls; they achieve this by sampling with p(x_t|x_{t-1}, u_t). On the other hand, the KBR filter must learn the transition model p(x_t|x_{t-1}, u_t); this can be harder than learning the transition model p(x_t|x_{t-1}), which has no control input.

We next compare computation time (see Figure 8). KMCF was competitive with, or even slower than, the KBR filter. This is due to the resampling step in KMCF. The speed-up methods (KMCF-low10, KMCF-low20, KMCF-sub50, and KMCF-sub100) successfully reduced the costs of KMCF. KMCF-low10 and KMCF-low20 scaled linearly to the sample size n; this matches the fact that algorithm 5 reduces the costs of kernel Bayes' rule to O(nr²). The costs of KMCF-sub50 and KMCF-sub100 remained almost the same over the different sample sizes. This is because they reduce the sample size itself from n to r, so the costs are reduced to O(r³) (see algorithm 6). KMCF-sub50 and KMCF-sub100 are competitive to kNN-PF, which is fast as it only needs kNN searches to deal with the training sample (X_i, Y_i)_{i=1}^n. In Figures 6 and 7, KMCF-low20 and KMCF-sub100 produced results competitive to KMCF for SSMs 1a, 2a, 4a, 1b, 2b, 4b. Thus, for these models, such methods reduce the computational costs of KMCF without losing much accuracy. KMCF-sub50 was slightly worse than KMCF-sub100. This indicates that the number of subsamples cannot be reduced to this extent if we wish to maintain accuracy. For SSMs 3a and 3b, the performance of KMCF-low20 and KMCF-sub100 was worse than KMCF, in contrast to the performance for the other models. The difference of SSMs 3a and 3b from the other models is that the observation space is 10-dimensional, Y = R^10. This suggests that if the dimension is high, r needs to be large to maintain accuracy (recall that r is the rank of low-rank matrices in algorithm 5 and the number of subsamples in algorithm 6). This is also implied by the experiments in section 6.3.

63 Vision-Based Mobile Robot Localization We applied KMCF to theproblem of vision-based mobile robot localization (Vlassis et al 2002 Wolfet al 2005 Quigley et al 2010) We consider a robot moving in a buildingThe robot takes images with its vision camera as it moves Thus the visionimages form a sequence of observations y1 yT in time series each ytis an image The robot does not know its positions in the building wedefine state xt as the robotrsquos position at time t The robot wishes to estimateits position xt from the sequence of its vision images y1 yt This canbe done by filtering that is by estimating the posteriors p(xt |y1 yt )

(t = 1 T ) This is the robot localization problem It is fundamental inrobotics as a basis for more involved applications such as navigation andreinforcement learning (Thrun Burgard amp Fox 2005)

The state-space model is defined as follows The observation modelp(yt |xt ) is the conditional distribution of images given position which isvery complicated and considered unknown We need to assume position-image examples (XiYi)n

i=1 these samples are given in the data setdescribed below The transition model p(xt |xtminus1) = p(xt |xtminus1 ut ) is the con-ditional distribution of the current position given the previous one Thisinvolves a control input ut that specifies the movement of the robot In thedata set we use the control is given as odometry measurements Thus wedefine p(xt |xtminus1 ut ) as the odometry motion model which is fairly standardin robotics (Thrun et al 2005) Specifically we used the algorithm describedin table 56 of Thrun et al (2005) with all of its parameters fixed to 01 Theprior pinit of the initial position x1 is defined as a uniform distribution overthe samples X1 Xn in (XiYi)n

i=1As a kernel kY for observations (images) we used the spatial pyramid

matching kernel of Lazebnik et al (2006) This is a positive-definite kerneldeveloped in the computer vision community and is also fairly standardSpecifically we set the parameters of this kernel as suggested in Lazebniket al (2006) This gives a 4200-dimensional histogram for each image Wedefined the kernel kX for states (positions) as gaussian Here the state spaceis the four-dimensional space X = R

4 two dimensions for location and therest for the orientation of the robot9

The data set we used is the COLD database (Pronobis amp Caputo 2009)which is publicly available Specifically we used the data set Freiburg PartA Path 1 cloudy This data set consists of three similar trajectories of arobot moving in a building each of which provides position-image pairs(xt yt )T

t=1 We used two trajectories for training and validation and the

9We projected the robotrsquos orientation in [0 2π ] onto the unit circle in R2

422 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

rest for test We made state-observation examples (XiYi)ni=1 by randomly

subsampling the pairs in the trajectory for training Note that the difficultyof localization may depend on the time interval (ie the interval between tand t minus 1 in sec) Therefore we made three test sets (and training samplesfor state transitions in KBR filter) with different time intervals 227 sec(T = 168) 454 sec (T = 84) and 681 sec (T = 56)

In these experiments we compared KMCF with three methods kNN-PFKBR filter and the naive method (NAI) defined below For KBR filter wealso defined the gaussian kernel on the control ut that is on the differenceof odometry measurements at time t minus 1 and t The naive method (NAI)estimates the state xt as a point Xj in the training set (XiYi) such that thecorresponding observation Yj is closest to the observation yt We performedthis as a baseline We also used the spatial pyramid matching kernel forthese methods (for kNN-PF and NAI as a similarity measure of the nearestneighbors search) We did not compare with GP-PF since it assumes thatobservations are real vectors and thus cannot be applied to this problemstraightforwardly We determined the hyperparameters in each methodby cross-validation To reduce the cost of the resampling step in KMCFwe used the method discussed in section 52 with = 100 The low-rankapproximation method (see algorithm 5) and the subsampling method (seealgorithm 6) were also applied to reduce the computational costs of KMCFSpecifically we set r = 50 100 for algorithm 5 (described as KMCF-low50and KMCF-low100 in the results below) and r = 150 300 for algorithm 6(KMCF-sub150 and KMCF-sub300)

Note that in this problem the posteriors p(xt |y1t ) can be highly multi-modal This is because similar images appear in distant locations Thereforethe posterior mean

intxt p(xt |y1t )dxt is not appropriate for point estimation

of the ground-truth position xt Thus for KMCF and KBR filter we em-ployed the heuristic for mode estimation explained in section 44 For kNN-PF we used a particle with maximum weight for the point estimationWe evaluated the performance of each method by RMSE of location esti-mates We ran each experiment 20 times for each training set of differentsize

631 Results First we demonstrate the behaviors of KMCF with this lo-calization problem Figures 9 and 10 show iterations of KMCF with n = 400applied to the test data with time interval 681 sec Figure 9 illustrates itera-tions that produced accurate estimates while Figure 10 describes situationswhere location estimation is difficult

Figures 11 and 12 show the results in RMSE and computational timerespectively For all the results KMCF and that with the computational re-duction methods (KMCF-low50 KMCF-low100 KMCF-sub150 and KMCF-sub300) performed better than KBR filter These results show the bene-fit of directly manipulating the transition models with sampling KMCF

Filtering with State-Observation Examples 423

Figure 9 Demonstration results Each column corresponds to one iteration ofKMCF (Top) Prediction step histogram of samples for prior (Middle) Cor-rection step weighted samples for posterior The blue and red stems indi-cate positive and negative weights respectively The yellow ball representsthe ground-truth location xt and the green diamond the estimated one xt (Bottom) Resampling step histogram of samples given by the resampling step

was competitive with kNN-PF for the interval 227 sec note that kNN-PFwas originally proposed for the robot localization problem For the resultswith the longer time intervals (454 sec and 681 sec) KMCF outperformedkNN-PF

We next investigate the effect on KMCF of the methods to reduce com-putational cost The performance of KMCF-low100 and KMCF-sub300 iscompetitive with KMCF those of KMCF-low50 and KMCF-sub150 de-grade as the sample size increases Note that r = 50 100 for algorithm 5are larger than those in section 62 though the values of the sample size nare larger than those in section 62 Also note that the performance of KMCF-sub150 is much worse than KMCF-sub300 These results indicate that wemay need large values for r to maintain the accuracy for this localizationproblem Recall that the spatial pyramid matching kernel gives essentially ahigh-dimensional feature vector (histogram) for each observation Thus the

424 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Figure 10 Demonstration results (see also the caption of Figure 9) Here weshow time points where observed images are similar to those in distant placesSuch a situation often occurs at corners and makes location estimation difficult(a) The prior estimate is reasonable but the resulting posterior has modes indistant places This makes the location estimate (green diamond) far from thetrue location (yellow ball) (b) While the location estimate is very accuratemodes also appear at distant locations

observation space Y may be considered high-dimensional This supportsthe hypothesis in section 62 that if the dimension is high the computationalcost-reduction methods may require larger r to maintain accuracy

Finally let us look at the results in computation time (see Figure 12)The results are similar to those in section 62 Although the values for r arerelatively large algorithms 5 and 6 successfully reduced the computationalcosts of KMCF

7 Conclusion and Future Work

This letter has proposed kernel Monte Carlo filter a novel filtering methodfor state-space models We have considered the situation where the obser-vation model is not known explicitly or even parametrically and where

Filtering with State-Observation Examples 425

Figure 11 RMSE of the robot localization experiments in section 63 Panels a band c show the cases for time interval 227 sec 454 sec and 681 sec respectively

examples of the state-observation relation are given instead of the obser-vation model Our approach is based on the framework of kernel meanembeddings which enables us to deal with the observation model in adata-driven manner The resulting filtering method consists of the predic-tion correction and resampling steps all realized in terms of kernel meanembeddings Methodological novelties lie in the prediction and resamplingsteps Thus we analyzed their behaviors by deriving error bounds for theestimator of the prediction step The analysis revealed that the effectivesample size of a weighted sample plays an important role as in parti-cle methods This analysis also explained how our resampling algorithmworks We applied the proposed method to synthetic and real problemsconfirming the effectiveness of our approach

426 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Figure 12 Computation time of the localization experiments in section 63Panels a b and c show the cases for time interval 227 sec 454 sec and 681 secrespectively Note that the results show the runtime of each method

One interesting topic for future research would be parameter estimationfor the transition model We did not discuss this and assumed that param-eters if they exist are given and fixed If the state observation examples(XiYi)n

i=1 are given as a sequence from the state-space model then wecan use the state samples X1 Xn for estimating those parameters Oth-erwise we need to estimate the parameters based on test data This mightbe possible by exploiting approaches for parameter estimation in particlemethods (eg section IV in Cappe et al (2007))

Another important topic is the situation where the observation modelin the test and training phases is different As discussed in section 43 this

Filtering with State-Observation Examples 427

might be addressed by exploiting the framework of transfer learning (Panamp Yang 2010) This would require extension of kernel mean embeddingsto the setting of transfer learning since there has been no work in thisdirection We consider that such extension is interesting in its own right

Appendix A Proof of Theorem 1

Before going to the proof we review some basic facts that are neededLet mP = int

kX (middot x)dP(x) and mP = sumni=1 wikX (middot Xi) By the reproducing

property of the kernel kX the following hold for any f isin HX

〈mP f 〉HX=

langintkX (middot x)dP(x) f

rangHX

=int

〈kX (middot x) f 〉HXdP(x)

=int

f (x)dP(x) = EXsimP[ f (X)] (A1)

〈mP f 〉HX=

langnsum

i=1

wikX (middot Xi) f

rangHX

=nsum

i=1

wi f (Xi) (A2)

For any f g isin HX we denote by f otimes g isin HX otimes HX the tensor productof f and g defined as

f otimes g(x1 x2) = f (x1)g(x2) forallx1 x2 isin X (A3)

The inner product of the tensor RKHS HX otimes HX satisfies

〈 f1 otimesg1 f2 otimes g2〉HXotimesHX= 〈 f1 f2〉HX

〈g1 g2〉HXforall f1 f2 g1 g2 isinHX

(A4)

Let φiIs=1 sub HX be complete orthonormal bases of HX where I isin N cup infin

Assume θ isin HX otimes HX (recall that this is an assumption of theorem 1) Thenθ is expressed as

θ =Isum

st=1

αstφs otimes φt (A5)

withsum

st |αst |2 lt infin (see Aronszajn 1950)

Proof of Theorem 1 Recall that mQ = sumni=1 wikX (middot X prime

i ) where X primei sim

p(middot|Xi) (i = 1 n) Then

EX prime1X

primen[mQ minus mQ2

HX]

428 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

= EX prime1X

primen[〈mQ mQ〉HX

minus 2〈mQ mQ〉HX+ 〈mQ mQ〉HX

]

=nsum

i j=1

wiw jEX primei X

primej[kX (X prime

i X primej)]

minus 2nsum

i=1

wiEX primesimQX primei[kX (X prime X prime

i )] + EX primeX primesimQ[kX (X prime X prime)]

=sumi = j

wiw jEX primei X

primej[kX (X prime

i X primej)] +

nsumi=1

w2i EX prime

i[kX (X prime

i X primei )]

minus 2nsum

i=1

wiEX primesimQX primei[kX (X prime X prime

i )] + EX primeX primesimQ[kX (X prime X prime)] (A6)

where X prime denotes an independent copy of X primeRecall that Q= int

p(middot|x)dP(x) and θ (x x) = int intkX (xprime xprime)dp(xprime|x)dp(xprime|x)

We can then rewrite terms in equation A6 as

EX primesimQX primei[kX (X prime X prime

i )]

=int (intint

kX (xprime xprimei)dp(xprime|x)dp(xprime

i|Xi)

)dP(x)

=int

θ (x Xi)dP(x) = EXsimP[θ (X Xi)]

EX primeX primesimQ[kX (X prime X prime)]

=intint (intint

kX (xprime xprime)dp(xprime|x)p(xprime|x)

)dP(x)dP(x)

=intint

θ (x x)dP(x)dP(x) = EXXsimP[θ (X X)]

Thus equation A6 is equal to

nsumi=1

w2i (EX prime

i[kX (X prime

i X primei )] minus EX prime

i Xprimei[kX (X prime

i X primei )]) +

nsumi j=1

wiw jθ (Xi Xj)

minus 2nsum

i=1

wiEXsimP[θ (X Xi)] + EXXsimP[θ (X X)] (A7)

Filtering with State-Observation Examples 429

We can rewrite terms in equation A7 as follows using the facts A1 A2A3 A4 and A5

sumi j

wiw jθ (Xi Xj) =sumi j

wiw j

sumst

αstφs(Xi)φt (Xj)

=sumst

αst

sumi

wiφs(Xi)sum

j

w jφt(Xj) =sumst

αst〈mP φs〉HX〈mP φt〉HX

=sumst

αst〈mP otimes mP φs otimes φt〉HX otimesHX= 〈mP otimes mP θ〉HX otimesHX

sumi

wiEXsimP[θ (X Xi)] =sum

i

wiEXsimP

[sumst

αstφs(X)φt (Xi)

]

=sumst

αstEXsimP[φs(X)]sum

i

wiφt (Xi) =sumst

αst〈mP φs〉HX〈mP φt〉HX

=sumst

αst〈mP otimes mP φs otimes φt〉HX otimesHX= 〈mP otimes mP θ〉HX otimesHX

EXXsimP[θ (X X)] = EXXsimP

[sumst

αstφs(X)φt (X)

]

=sumst

αst〈mP φs〉HX〈mP φt〉HX

=sumst

αst〈mP otimes mP φs otimes φt〉HX otimesHX

= 〈mP otimes mP θ〉HX otimesHX

Thus equation A7 is equal to

nsumi=1

w2i (EX prime

i[kX (X prime

i X primei )] minus EX prime

i Xprimei[kX (X prime

i X primei )]) + 〈mP otimes mP θ〉HX otimesHX

minus 2〈mP otimes mP θ〉HX otimesHX+ 〈mP otimes mP θ〉HX otimesHX

=nsum

i=1

w2i (EX prime

i[kX (X prime

i X primei )] minus EX prime

i Xprimei[kX (X prime

i X primei )])

+〈(mP minus mP) otimes (mP minus mP) θ〉HX otimesHX

Finally the Cauchy-Schwartz inequality gives

〈(mP minus mP) otimes (mP minus mP) θ〉HX otimesHXle mP minus mP2

HXθHX otimesHX

This completes the proof

430 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Appendix B Proof of Theorem 2

Theorem 2 provides convergence rates for the resampling algorithmAlgorithm 4 This theorem assumes that the candidate samples Z1 ZNfor resampling are iid with a density q Here we prove theorem 2 byshowing that the same statement holds under weaker assumptions (seetheorem 3 below)

We first describe assumptions Let P be the distribution of the kernelmean mP and L2(P) be the Hilbert space of square-integrable functions onX with respect to P For any f isin L2(P) we write its norm by fL2(P) =int

f 2(x)dP(x)

Assumption 1 The candidate samples Z1 ZN are independent Thereare probability distributions Q1 QN on X such that for any boundedmeasurable function g X rarr R we have

E

⎡⎣ 1

N minus 1

sumj =i

g(Zj)

⎤⎦ = EXsimQi

[g(X)] (i = 1 N) (B1)

Assumption 2 The distributions Q1 QN have density functions q1

qN respectively Define Q = 1N

sumNi=1 Qi and q = 1

N

sumNi=1 qi There is a con-

stant A gt 0 that does not depend on N such that

∥∥∥∥qi

qminus 1

∥∥∥∥2

L2(P)

le AradicN

(i = 1 N) (B2)

Assumption 3 The distribution P has a density function p such thatsupxisinX

p(x)

q(x)lt infin There is a constant σ gt 0 such that

radicN

(1N

Nsumi=1

p(Zi)

q(Zi)minus 1

)DrarrN (0 σ 2) (B3)

whereDrarr denotes convergence in distribution and N (0 σ 2) the normal

distribution with mean 0 and variance σ 2

These assumptions are weaker than those in theorem 2 which requirethat Z1 ZN be iid For example assumption 1 is clearly satisfied for theiid case since in this case we have Q = Q1= middot middot middot = QN The inequalityequation B2 in assumption 2 requires that the distributions Q1 QN getsimilar as the sample size increases This is also satisfied under the iid

Filtering with State-Observation Examples 431

assumption Likewise the convergence equation B3 in assumption 3 issatisfied from the central limit theorem if Z1 ZN are iid

We will need the following lemma

Lemma 1 Let Z1 ZN be samples satisfying assumption 1 Then the followingholds for any bounded measurable function g X rarr R

E

[1N

Nsumi=1

g(Zi )

]=

intg(x)d Q(x)

Proof

E

[1N

Nsumi=1

g(Zi)

]= E

⎡⎣ 1

N(N minus 1)

Nsumi=1

sumj =i

g(Zj)

⎤⎦

= 1N

Nsumi=1

E

⎡⎣ 1

N minus 1

sumj =i

g(Zj)

⎤⎦ = 1

N

Nsumi=1

intg(x)Qi(x) =

intg(x)dQ(x)

The following theorem shows the convergence rates of our resampling al-gorithm Note that it does not assume that the candidate samples Z1 ZNare identical to those expressing the estimator mP

Theorem 3 Let k be a bounded positive definite kernel and H be the asso-ciated RKHS Let Z1 ZN be candidate samples satisfying assumptions 12 and 3 Let P be a probability distribution satisfying assumption 3 and letmP =

intk(middot x)d P(x) be the kernel mean Let mP isin H be any element in H Sup-

pose we apply algorithm 4 to mP isin H with candidate samples Z1 ZN and letX1 X isin Z1 ZN be the resulting samples Then the following holds

∥∥∥∥∥mP minus 1

sumi=1

k(middot Xi )

∥∥∥∥∥2

H

= (mP minus mPH + Op(Nminus12))2 + O(

ln

)

Proof Our proof is based on the fact (Bach et al 2012) that kernel herdingcan be seen as the Frank-Wolfe optimization method with step size 1( + 1)

for the th iteration For details of the Frank-Wolfe method we refer to Jaggi(2013) and Freund and Grigas (2014)

Fix the samples Z1 ZN Let MN be the convex hull of the set k(middot Z1)

k(middot ZN ) sub H Define a loss function J H rarr R by

J(g) = 12g minus mP2

H g isin H (B4)

432 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Then algorithm 4 can be seen as the Frank-Wolfe method that iterativelyminimizes this loss function over the convex hull MN

infgisinMN

J(g)

More precisely the Frank-Wolfe method solves this problem by the follow-ing iterations

s = arg mingisinMN

〈gnablaJ(gminus1)〉H

g = (1 minus γ )gminus1 + γ s ( ge 1)

where γ is a step size defined as γ = 1 and nablaJ(gminus1) is the gradient of J atgminus1 nablaJ(gminus1) = gminus1 minus mP Here the initial point is defined as g0 = 0 It canbe easily shown that g = 1

sumi=1 k(middot Xi) where X1 X are the samples

given by algorithm 4 (For details see Bach et al 2012)Let LJMN

gt 0 be the Lipschitz constant of the gradient nablaJ over MN andDiam MN gt 0 be the diameter of MN

LJMN= sup

g1g2isinMN

nablaJ(g1) minus nablaJ(g2)Hg1 minus g2H

= supg1g2isinMN

g1 minus g2Hg1 minus g2H

= 1 (B5)

Diam MN = supg1g2isinMN

g1 minus g2H

le supg1g2isinMN

g1H + g2H le 2C (B6)

where C = supxisinX k(middot x)H = supxisinXradic

k(x x) lt infinFrom bound 32 and equation 8 of Freund and Grigas (2014) we then

have

J(g) minus infgisinMN

J(g) leLJMN

(Diam MN)2(1 + ln )

2(B7)

le 2C2(1 + ln )

(B8)

where the last inequality follows from equations B5 and B6

Filtering with State-Observation Examples 433

Note that the upper bound of equation B8 does not depend on thecandidate samples Z1 ZN Hence combined with equation B4 thefollowing holds for any choice of Z1 ZN

∥∥∥∥∥mP minus 1

sumi=1

k(middot Xi)

∥∥∥∥∥2

H

le infgisinMN

mP minus g2H + 4C2(1 + ln )

(B9)

Below we will focus on bounding the first term of equation B9 Recallhere that Z1 ZN are random samples Define a random variable SN =sumN

i=1p(Zi )

q(Zi ) Since MN is the convex hull of the k(middot Z1) k(middot ZN ) we

have

infgisinMN

mP minus gH

= infαisinRN αge0

sumi αile1

∥∥∥∥∥mP minussum

i

αik(middot Zi)

∥∥∥∥∥H

le∥∥∥∥∥mP minus 1

SN

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

le mP minus mPH +∥∥∥∥∥mP minus 1

N

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

+∥∥∥∥∥ 1

N

sumi

p(Zi)

q(Zi)k(middot Zi) minus 1

SN

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

Therefore we have

mP minus 1

sumi=1

k(middot Xi)2H

le(

mP minus mPH +∥∥∥∥∥mP minus 1

N

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

+∥∥∥∥∥ 1

N

sumi

p(Zi)

q(Zi)k(middot Zi) minus 1

SN

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

)2

+ O(

ln

)

(B10)

Below we derive rates of convergence for the second and third terms

434 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

For the second term we derive a rate of convergence in expectationwhich implies a rate of convergence in probability To this end we use thefollowing fact Let f isin H be any function in the RKHS By the assumptionsupxisinX

p(x)

q(x)lt infin and the boundedness of k functions x rarr p(x)

q(x)f (x) and

x rarr (p(x)

q(x))2 f (x) are bounded

E

⎡⎣

∥∥∥∥∥mP minus 1N

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥2

H

⎤⎦

= mP2H minus 2E

[1N

sumi

p(Zi)

q(Zi)mP(Zi)

]

+ E

⎡⎣ 1

N2

sumi

sumj

p(Zi)

q(Zi)

p(Zj)

q(Zj)k(Zi Zj)

⎤⎦

= mP2H minus 2

intp(x)

q(x)mP(x)q(x)dx

+ E

⎡⎣ 1

N2

sumi

sumj =i

p(Zi)

q(Zi)

p(Zj)

q(Zj)k(Zi Zj)

⎤⎦

+E

[1

N2

sumi

(p(Zi)

q(Zi)

)2

k(Zi Zi)

]

= mP2H minus 2mP2

H + E

[N minus 1

N2

sumi

p(Zi)

q(Zi)

intp(x)

q(x)k(Zi x)qi(x)dx

]

+ 1N

int (p(x)

q(x)

)2

k(x x)q(x)dx

= minusmP2H + E

[N minus 1

N2

sumi

p(Zi)

q(Zi)

intp(x)

q(x)k(Zi x)qi(x)dx

]

+ 1N

intp(x)

q(x)k(x x)dP(x)

We further rewrite the second term of the last equality as follows

E

[N minus 1

N2

sumi

p(Zi)

q(Zi)

intp(x)

q(x)k(Zi x)qi(x)dx

]

Filtering with State-Observation Examples 435

= E

[N minus 1

N2

sumi

p(Zi)

q(Zi)

intp(x)

q(x)k(Zi x)(qi(x) minus q(x))dx

]

+ E

[N minus 1

N2

sumi

p(Zi)

q(Zi)

intp(x)

q(x)k(Zi x)q(x)dx

]

= E

[N minus 1

N2

sumi

p(Zi)

q(Zi)

int radicp(x)k(Zi x)

radicp(x)

(qi(x)

q(x)minus 1

)dx

]

+ N minus 1N

mP2H

le E

[N minus 1

N2

sumi

p(Zi)

q(Zi)k(Zi middot)L2(P)

∥∥∥∥qi(x)

q(x)minus 1

∥∥∥∥L2(P)

]+ N minus 1

NmP2

H

le E

[N minus 1

N3

sumi

p(Zi)

q(Zi)C2A

]+ N minus 1

NmP2

H

= C2A(N minus 1)

N2 + N minus 1N

mP2H

where the first inequality follows from Cauchy-Schwartz Using this weobtain

E[

∥∥∥∥∥mP minus 1N

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥2

H

le 1N

(intp(x)

q(x)k(x x)dP(x) minus mP2

H

)+ C2(N minus 1)A

N2

= O(Nminus1)

Therefore we have

∥∥∥∥∥mP minus 1N

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

= Op(Nminus12) (N rarr infin) (B11)

We can bound the third term as follows∥∥∥∥∥ 1N

sumi

p(Zi)

q(Zi)k(middot Zi) minus 1

SN

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

=∥∥∥∥∥ 1

N

sumi

p(Zi)

q(Zi)k(middot Zi)

(1 minus N

SN

)∥∥∥∥∥H

436 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

=∣∣∣∣1 minus N

SN

∣∣∣∣∥∥∥∥∥ 1

N

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

le∣∣∣∣1 minus N

SN

∣∣∣∣C pqinfin

=∣∣∣∣∣1 minus 1

1N

sumNi=1 p(Zi)q(Zi)

∣∣∣∣∣C pqinfin

where pqinfin = supxisinXp(x)

q(x)lt infin Therefore the following holds by as-

sumption 3 and the delta method

∥∥∥∥∥ 1N

sumi

p(Zi)

q(Zi)k(middot Zi) minus 1

SN

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

= Op(Nminus12) (B12)

The assertion of the theorem follows from equations B10 to B12

Appendix C Reduction of Computational Cost

We have seen in section 43 that the time complexity of KMCF in one timestep is O(n3) where n is the number of the state-observation examples(XiYi)n

i=1 This can be costly if one wishes to use KMCF in real-timeapplications with a large number of samples Here we show two methodsfor reducing the costs one based on low-rank approximation of kernelmatrices and one based on kernel herding Note that kernel herding is alsoused in the resampling step The purpose here is different however wemake use of kernel herding for finding a reduced representation of the data(XiYi)n

i=1

C1 Low-rank Approximation of Kernel Matrices Our goal is to re-duce the costs of algorithm 1 of kernel Bayesrsquo rule Algorithm 1 involvestwo matrix inversions (GX + nεIn)minus1 in line 3 and ((GY )2 + δIn)minus1 in line4 Note that (GX + nεIn)minus1 does not involve the test data so it can be com-puted before the test phase On the other hand ((GY )2 + δIn)minus1 dependson matrix This matrix involves the vector mπ which essentially repre-sents the prior of the current state (see line 13 of algorithm 3) Therefore((GY )2 + δIn)minus1 needs to be computed for each iteration in the test phaseThis has the complexity of O(n3) Note that even if (GX + nεIn)minus1 can becomputed in the training phase the multiplication (GX + nεIn)minus1mπ in line3 requires O(n2) Thus it can also be costly Here we consider methods toreduce both costs in lines 3 and 4

Suppose that there exist low-rank matrices UV isin Rntimesr where r lt n that

approximate the kernel matrices GX asymp UUT GY asymp VVT Such low-rank

Filtering with State-Observation Examples 437

matrices can be obtained by for example incomplete Cholesky decomposi-tion with time complexity O(nr2) (Fine amp Scheinberg 2001 Bach amp Jordan2002) Note that the computation of these matrices is required only oncebefore the test phase Therefore their time complexities are not the problemhere

C11 Derivation First we approximate (GX + nεIn)minus1mπ in line 3 usingGX asymp UUT By the Woodbury identity we have

(GX + nεIn)minus1mπ asymp (UUT + nεIn)minus1mπ

= 1nε

(In minus U(nεIr + UTU)minus1UT )mπ

where Ir isin Rrtimesr denotes the identity Note that (nεIr + UTU)minus1 does not

involve the test data so can be computed in the training phase Thus theabove approximation of μ can be computed with complexity O(nr2)

Next we approximate w = GY ((GY )2 + δI)minus1kY in line 4 using GY asympVVT Define B = V isin R

ntimesr C = VTV isin Rrtimesr and D = VT isin R

rtimesn Then(GY )2 asymp (VVT )2 = BCD By the Woodbury identity we obtain

(δIn + (GY )2)minus1 asymp (δIn + BCD)minus1

= 1δ(In minus B(δCminus1 + DB)minus1D)

Thus w can be approximated as

w =GY ((GY )2 + δI)minus1kY

asymp 1δVVT (In minus B(δCminus1 + DB)minus1D)kY

The computation of this approximation requires O(nr2 + r3) = O(nr2) Thusin total the complexity of algorithm 1 can be reduced to O(nr2) We sum-marize the above approximations in algorithm 5

438 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

C12 How to Use Algorithm 5 can be used with algorithm 3 of KMCFby modifying Algorithm 3 in the following manner Compute the low-rank matrices UV right after lines 4 and 5 This can be done by using forexample incomplete Cholesky decomposition (Fine amp Scheinberg 2001Bach amp Jordan 2002) Then replace algorithm 1 in line 15 by algorithm 5

C13 How to Select the Rank As discussed in section 43 one way ofselecting the rank r is to use-cross validation by regarding r as a hyper-parameter of KMCF Another way is to measure the approximation errorsGX minus UUT and GY minus VVT with some matrix norm such as the Frobe-nius norm Indeed we can compute the smallest rank r such that theseerrors are below a prespecified threshold and this can be done efficientlywith time complexity O(nr2) (Bach amp Jordan 2002)

C2 Data Reduction with Kernel Herding Here we describe an ap-proach to reduce the size of the representation of the state-observationexamples (XiYi)n

i=1 in an efficient way By ldquoefficientrdquo we mean that theinformation contained in (XiYi)n

i=1 will be preserved even after the re-duction Recall that (XiYi)n

i=1 contains the information of the observationmodel p(yt |xt ) (recall also that p(yt |xt ) is assumed time-homogeneous seesection 41) This information is used in only algorithm 1 of kernel Bayesrsquorule (line 15 algorithm 3) Therefore it suffices to consider how kernel Bayesrsquorule accesses the information contained in the joint sample (XiYi)n

i=1

C21 Representation of the Joint Sample To this end we need to show howthe joint sample (XiYi)n

i=1 can be represented with a kernel mean embed-ding Recall that (kX HX ) and (kY HY ) are kernels and the associatedRKHSs on the state-space X and the observation-space Y respectively LetX times Y be the product space ofX andY Then we can define a kernel kXtimesY onX times Y as the product of kX and kY kXtimesY ((x y) (xprime yprime)) = kX (x xprime)kY (y yprime)for all (x y) (xprime yprime) isin X times Y This product kernel kXtimesY defines an RKHS ofX times Y Let HXtimesY denote this RKHS As in section 3 we can use kXtimesY andHXtimesY for a kernel mean embedding In particular the empirical distribution1n

sumni=1 δ(XiYi )

of the joint sample (XiYi)ni=1 sub X times Y can be represented as

an empirical kernel mean in HXtimesY

mXY = 1n

nsumi=1

kXtimesY ((middot middot) (XiYi)) isin HXtimesY (C1)

This is the representation of the joint sample (XiYi)ni=1

The information of (XiYi)ni=1 is provided for kernel Bayesrsquo rule essen-

tially through the form of equation C1 (Fukumizu et al 2011 2013) Recallthat equation C1 is a point in the RKHS HXtimesY Any point close to thisequation in HXtimesY would also contain information close to that contained in

Filtering with State-Observation Examples 439

equation C1 Therefore we propose to find a subset (X1 Y1) (Xr Xr) sub(XiYi)n

i=1 where r lt n such that its representation in HXtimesY

mXY = 1r

rsumi=1

kXtimesY ((middot middot) (Xi Yi)) isin HXtimesY (C2)

is close to equation C1 Namely we wish to find subsamples such thatmXY minus mXYHXtimesY

is small If the error mXY minus mXYHXtimesYis small enough

equation C2 would provide information close to that given by equation C1for kernel Bayesrsquo rule Thus kernel Bayesrsquo rule based on such subsamples(Xi Yi)r

i=1 would not perform much worse than the one based on the entireset of samples (XiYi)n

i=1

C22 Subsampling Method To find such subsamples we make use ofkernel herding in section 35 Namely we apply the update equations 36and 37 to approximate equation C1 with kernel kXtimesY and RKHS HXtimesY We greedily find subsamples Dr = (X1 Y1) (Xr Yr) as

(Xr Yr)

= arg max(xy)isinDDrminus1

1n

nsumi=1

kXtimesY ((x y) (XiYi))minus1r

rminus1sumj=1

kXtimesY ((x r) (Xi Yi))

= arg max(xy)isinDDrminus1

1n

nsumi=1

kX (x Xi)kY (yYi)minus1r

rminus1sumj=1

kX (x Xj)kY (y Yj)

The resulting algorithm is shown in algorithm 6 The time complexity isO(n2r) for selecting r subsamples

C23 How to Use By using algorithm 6 we can reduce the the time com-plexity of KMCF (see algorithm 3) in each iteration from O(n3) to O(r3) Thiscan be done by obtaining subsamples (Xi Yi)r

i=1 by applying algorithm 6to (XiYi)n

i=1 and then replacing (XiYi)ni=1 in requirement of algorithm 3

by (Xi Yi)ri=1 and using the number r instead of n

C24 How to Select the Number of Subsamples The number r of subsam-ples determines the trade-off between the accuracy and computational timeof KMCF It may be selected by cross-validation or by measuring the ap-proximation error mXY minus mXYHXtimesY

as for the case of selecting the rankof low-rank approximation in section C1

C25 Discussion Recall that kernel herding generates samples such thatthey approximate a given kernel mean (see section 35) Under certain

440 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

assumptions the error of this approximation is of O(rminus1) with r sampleswhich is faster than that of iid samples O(rminus12) This indicates that sub-samples (Xi Yi)r

i=1 selected with kernel herding may approximate equa-tion C1 well Here however we find the solutions of the optimizationproblems 36 and 37 from the finite set (XiYi)n

i=1 rather than the entirejoint space X times Y The convergence guarantee is provided only for the caseof the entire joint space X times Y Thus for our case the convergence guar-antee is no longer provided Moreover the fast rate O(rminus1) is guaranteedonly for finite-dimensional RKHSs Gaussian kernels which we often use inpractice define infinite-dimensional RKHSs Therefore the fast rate is notguaranteed if we use gaussian kernels Nevertheless we can use algorithm6 as a heuristic for data reduction

Acknowledgments

We express our gratitude to the associate editor and the anonymous re-viewer for their time and helpful suggestions We also thank MasashiShimbo Momoko Hayamizu Yoshimasa Uematsu and Katsuhiro Omaefor their helpful comments This work has been supported in part by MEXTGrant-in-Aid for Scientific Research on Innovative Areas 25120012 MKhas been supported by JSPS Grant-in-Aid for JSPS Fellows 15J04406

References

Anderson B amp Moore J (1979) Optimal filtering Englewood Cliffs NJ PrenticeHall

Aronszajn N (1950) Theory of reproducing kernels Transactions of the AmericanMathematical Society 68(3) 337ndash404

Filtering with State-Observation Examples 441

Bach F amp Jordan M I (2002) Kernel independent component analysis Journal ofMachine Learning Research 3 1ndash48

Bach F Lacoste-Julien S amp Obozinski G (2012) On the equivalence betweenherding and conditional gradient algorithms In Proceedings of the 29th Interna-tional Conference on Machine Learning (ICML2012) (pp 1359ndash1366) Madison WIOmnipress

Berlinet A amp Thomas-Agnan C (2004) Reproducing kernel Hilbert spaces in probabilityand statistics Dordrecht Kluwer Academic

Calvet L E amp Czellar V (2015) Accurate methods for approximate Bayesiancomputation filtering Journal of Financial Econometrics 13 798ndash838 doi101093jjfinecnbu019

Cappe O Godsill S J amp Moulines E (2007) An overview of existing methods andrecent advances in sequential Monte Carlo IEEE Proceedings 95(5) 899ndash924

Chen Y Welling M amp Smola A (2010) Supersamples from kernel-herding InProceedings of the 26th Conference on Uncertainty in Artificial Intelligence (pp 109ndash116) Cambridge MA MIT Press

Deisenroth M Huber M amp Hanebeck U (2009) Analytic moment-based gaussianprocess filtering In Proceedings of the 26th International Conference on MachineLearning (pp 225ndash232) Madison WI Omnipress

Doucet A Freitas N D amp Gordon N J (Eds) (2001) Sequential Monte Carlomethods in practice New York Springer

Doucet A amp Johansen A M (2011) A tutorial on particle filtering and smoothingFifteen years later In D Crisan amp B Rozovskii (Eds) The Oxford handbook ofnonlinear filtering (pp 656ndash704) New York Oxford University Press

Durbin J amp Koopman S J (2012) Time series analysis by state space methods (2nd ed)New York Oxford University Press

Eberts M amp Steinwart I (2013) Optimal regression rates for SVMs using gaussiankernels Electronic Journal of Statistics 7 1ndash42

Ferris B Hahnel D amp Fox D (2006) Gaussian processes for signal strength-basedlocation estimation In Proceedings of Robotics Science and Systems CambridgeMA MIT Press

Fine S amp Scheinberg K (2001) Efficient SVM training using low-rank kernel rep-resentations Journal of Machine Learning Research 2 243ndash264

Freund R M amp Grigas P (2014) New analysis and results for the FrankndashWolfemethod Mathematical Programming doi 101007s10107-014-0841-6

Fukumizu K Bach F amp Jordan M I (2004) Dimensionality reduction for super-vised learning with reproducing kernel Hilbert spaces Journal of Machine LearningResearch 5 73ndash99

Fukumizu K Gretton A Sun X amp Scholkopf B (2008) Kernel measures ofconditional dependence In J C Platt D Koller Y Singer amp S Roweis (Eds)Advances in neural information processing systems 20 (pp 489ndash496) CambridgeMA MIT Press

Fukumizu K Song L amp Gretton A (2011) Kernel Bayesrsquo rule In J Shawe-TaylorR S Zemel P L Bartlett F C N Pereira amp K Q Weinberger (Eds) Advances inneural information processing systems 24 (pp 1737ndash1745) Red Hook NY Curran

Fukumizu K Song L amp Gretton A (2013) Kernel Bayesrsquo rule Bayesian inferencewith positive definite kernels Journal of Machine Learning Research 14 3753ndash3783

442 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Fukumizu K Sriperumbudur B Gretton A amp Scholkopf B (2009) Characteristickernels on groups and semigroups In D Koller D Schuurmans Y Bengio amp LBottou (Eds) Advances in neural information processing systems 21 (pp 473ndash480)Cambridge MA MIT Press

Gordon N J Salmond D J amp Smith AFM (1993) Novel approach tononlinearnon-gaussian Bayesian state estimation IEE-Proceedings-F 140 107ndash113

Hofmann T Scholkopf B amp Smola A J (2008) Kernel methods in machine learn-ing Annals of Statistics 36(3) 1171ndash1220

Jaggi M (2013) Revisiting Frank-Wolfe Projection-free sparse convex optimizationIn Proceedings of the 30th International Conference on Machine Learning (pp 427ndash435)httplinkspringercomarticle101007Fg10107-014-0841-6

Jasra A Singh S S Martin J S amp McCoy E (2012) Filtering via approximateBayesian computation Statistics and Computing 22 1223ndash1237

Julier S J amp Uhlmann J K (1997) A new extension of the Kalman filter tononlinear systems In Proceedings of AeroSense The 11th International SymposiumAerospaceDefence Sensing Simulation and Controls Bellingham WA SPIE

Julier S J amp Uhlmann J K (2004) Unscented filtering and nonlinear estimationIEEE Review 92 401ndash422

Kalman R E (1960) A new approach to linear filtering and prediction problemsTransactions of the ASME Journal of Basic Engineering 82 35ndash45

Kanagawa M amp Fukumizu K (2014) Recovering distributions from gaussianRKHS embeddings In Proceedings of the 17th International Conference on Artifi-cial Intelligence and Statistics (pp 457ndash465) JMLR

Kanagawa M Nishiyama Y Gretton A amp Fukumizu K (2014) Monte Carlofiltering using kernel embedding of distributions In Proceedings of the 28thAAAI Conference on Artificial Intelligence (pp 1897ndash1903) Cambridge MA MITPress

Ko J amp Fox D (2009) GP-BayesFilters Bayesian filtering using gaussian processprediction and observation models Autonomous Robots 72(1) 75ndash90

Lazebnik S Schmid C amp Ponce J (2006) Beyond bags of features Spatial pyramidmatching for recognizing natural scene categories In Proceedings of 2006 IEEEComputer Society Conference on Computer Vision and Pattern Recognition (vol 2 pp2169ndash2178) Washington DC IEEE Computer Society

Liu J S (2001) Monte Carlo strategies in scientific computing New York Springer-Verlag

McCalman L OrsquoCallaghan S amp Ramos F (2013) Multi-modal estimation withkernel embeddings for learning motion models In Proceedings of 2013 IEEE In-ternational Conference on Robotics and Automation (pp 2845ndash2852) Piscataway NJIEEE

Pan S J amp Yang Q (2010) A survey on transfer learning IEEE Transactions onKnowledge and Data Engineering 22(10) 1345ndash1359

Pistohl T Ball T Schulze-Bonhage A Aertsen A amp Mehring C (2008) Predic-tion of arm movement trajectories from ECoG-recordings in humans Journal ofNeuroscience Methods 167(1) 105ndash114

Pronobis A amp Caputo B (2009) COLD COsy localization database InternationalJournal of Robotics Research 28(5) 588ndash594

Filtering with State-Observation Examples 443

Quigley M Stavens D Coates A amp Thrun S (2010) Sub-meter indoor localiza-tion in unmodified environments with inexpensive sensors In Proceedings of theIEEERSJ International Conference on Intelligent Robots and Systems 2010 (vol 1 pp2039ndash2046) Piscataway NJ IEEE

Ristic B Arulampalam S amp Gordon N (2004) Beyond the Kalman filter Particlefilters for tracking applications Norwood MA Artech House

Schaid D J (2010a) Genomic similarity and kernel methods I Advancements bybuilding on mathematical and statistical foundations Human Heredity 70(2) 109ndash131

Schaid D J (2010b) Genomic similarity and kernel methods II Methods for genomicinformation Human Heredity 70(2) 132ndash140

Schalk G Kubanek J Miller K J Anderson N R Leuthardt E C Ojemann J G Wolpaw J R (2007) Decoding two dimensional movement trajectories usingelectrocorticographic signals in humans Journal of Neural Engineering 4(264)264ndash275

Scholkopf B amp Smola A J (2002) Learning with kernels Cambridge MA MIT PressScholkopf B Tsuda K amp Vert J P (2004) Kernel methods in computational biology

Cambridge MA MIT PressSilverman B W (1986) Density estimation for statistics and data analysis London

Chapman and HallSmola A Gretton A Song L amp Scholkopf B (2007) A Hilbert space embed-

ding for distributions In Proceedings of the International Conference on AlgorithmicLearning Theory (pp 13ndash31) New York Springer

Song L Fukumizu K amp Gretton A (2013) Kernel embeddings of conditional dis-tributions A unified kernel framework for nonparametric inference in graphicalmodels IEEE Signal Processing Magazine 30(4) 98ndash111

Song L Huang J Smola A amp Fukumizu K (2009) Hilbert space embeddings ofconditional distributions with applications to dynamical systems In Proceedingsof the 26th International Conference on Machine Learning (pp 961ndash968) MadisonWI Omnipress

Sriperumbudur B K Gretton A Fukumizu K Scholkopf B amp Lanckriet G R(2010) Hilbert space embeddings and metrics on probability measures Journal ofMachine Learning Research 11 1517ndash1561

Steinwart I amp Christmann A (2008) Support vector machines New York SpringerStone C J (1977) Consistent nonparametric regression Annals of Statistics 5(4)

595ndash620Thrun S Burgard W amp Fox D (2005) Probabilistic robotics Cambridge MA MIT

PressVlassis N Terwijn B amp Krose B (2002) Auxiliary particle filter robot local-

ization from high-dimensional sensor observations In Proceedings of the In-ternational Conference on Robotics and Automation (pp 7ndash12) Piscataway NJIEEE

Wang Z Ji Q Miller K J amp Schalk G (2011) Prior knowledge improves decodingof finger flexion from electrocorticographic signals Frontiers in Neuroscience 5127

Widom H (1963) Asymptotic behavior of the eigenvalues of certain integral equa-tions Transactions of the American Mathematical Society 109 278ndash295

444 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Widom H (1964) Asymptotic behavior of the eigenvalues of certain integral equa-tions II Archive for Rational Mechanics and Analysis 17 215ndash229

Wolf J Burgard W amp Burkhardt H (2005) Robust vision-based localization bycombining an image retrieval system with Monte Carlo localization IEEE Trans-actions on Robotics 21(2) 208ndash216

Received May 18 2015 accepted October 14 2015

Page 10: Filtering with State-Observation Examples via Kernel Monte ...

Filtering with State-Observation Examples 391

KBR also requires that mπ be expressed as a weighted sample Let mπ =sumj=1 γ jkX (middotUj) be a sample expression of mπ where isin N γ1 γ isin R

and U1 U isin X For example suppose U1 U are iid drawn fromπ(x) Then γ j = 1 suffices

Given the joint sample (XiYi)ni=1 and the empirical prior mean mπ

KBR estimates the kernel posterior mean mπX|y as a weighted sum of the

feature vectors

mπX|y =

nsumi=1

wikX (middot Xi) (33)

where the weights w = (w1 wn)T isin Rn are given by algorithm 1 Here

diag(v) for v isin Rn denotes a diagonal matrix with diagonal entries v It takes

as input (1) vectors kY = (kY (yY1) kY (yYn))T mπ = (mπ (X1)

mπ (Xn))T isin Rn where mπ (Xi) = sum

j=1 γ jkX (XiUj) (2) kernel matricesGX = (kX (Xi Xj)) GY = (kY (YiYj )) isin R

ntimesn and (3) regularization con-stants ε δ gt 0 The weight vector w = (w1 wn)T isin R

n is obtained bymatrix computations involving two regularized matrix inversions Notethat these weights can be negative

Fukumizu et al (2013) showed that KBR is a consistent estimator of thekernel posterior mean under certain smoothness assumptions the estimateequation 33 converges to mπ

X|y as the sample size goes to infinity n rarr infinand mπ converges to mπ (with ε δ rarr 0 in appropriate speed) (For detailssee Fukumizu et al 2013 Song et al 2013)

34 Decoding from Empirical Kernel Means In general as shownabove a kernel mean mP is estimated as a weighted sum of feature vectors

mP =nsum

i=1

wik(middot Xi) (34)

with samples X1 Xn isin X and (possibly negative) weights w1 wn isinR Suppose mP is close to mP that is mP minus mPH is small Then mP issupposed to have accurate information about P as mP preserves all theinformation of P

392 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

How can we decode the information of P from mP The empirical ker-nel mean equation 34 has the following property which is due to thereproducing property of the kernel

〈mP f 〉H =nsum

i=1

wi f (Xi) forall f isin H (35)

Namely the weighted average of any function in the RKHS is equal to theinner product between the empirical kernel mean and that function Thisis analogous to the property 32 of the population kernel mean mP Let fbe any function in H From these properties equations 32 and 35 we have

∣∣∣∣∣EXsimP[ f (X)] minusnsum

i=1

wi f (Xi)

∣∣∣∣∣ = |〈mP minus mP f 〉H| le mP minus mPH fH

where we used the Cauchy-Schwartz inequality Therefore the left-handside will be close to 0 if the error mP minus mPH is small This shows that theexpectation of f can be estimated by the weighted average

sumni=1 wi f (Xi)

Note that here f is a function in the RKHS but the same can also be shownfor functions outside the RKHS under certain assumptions (Kanagawa ampFukumizu 2014) In this way the estimator of the form 34 provides es-timators of moments probability masses on sets and the density func-tion (if this exists) We explain this in the context of state-space models insection 44

35 Kernel Herding Here we explain kernel herding (Chen et al 2010)another building block of the proposed filter Suppose the kernel mean mPis known We wish to generate samples x1 x2 x isin X such that theempirical mean mP = 1

sumi=1 k(middot xi) is close to mP that is mP minus mPH is

small This should be done only using mP Kernel herding achieves this bygreedy optimization using the following update equations

x1 = arg maxxisinX

mP(x) (36)

x = arg maxxisinX

mP(x) minus 1

minus1sumi=1

k(x xi) ( ge 2) (37)

where mP(x) denotes the evaluation of mP at x (recall that mP is a functionin H)

An intuitive interpretation of this procedure can be given if there is aconstant R gt 0 such that k(x x) = R for all x isin X (eg R = 1 if k is gaussian)Suppose that x1 xminus1 are already calculated In this case it can be shown

Filtering with State-Observation Examples 393

that x in equation 37 is the minimizer of

E =∥∥∥∥∥mP minus 1

sumi=1

k(middot xi)

∥∥∥∥∥H

(38)

Thus kernel herding performs greedy minimization of the distance betweenmP and the empirical kernel mean mP = 1

sumi=1 k(middot xi)

It can be shown that the error E of equation 38 decreases at a rate at leastO(minus12) under the assumption that k is bounded (Bach Lacoste-Julien ampObozinski 2012) In other words the herding samples x1 x provide aconvergent approximation of mP In this sense kernel herding can be seen asa (pseudo) sampling method Note that mP itself can be an empirical kernelmean of the form 34 These properties are important for our resamplingalgorithm developed in section 42

It should be noted that E decreases at a faster rate O(minus1) under a certainassumption (Chen et al 2010) this is much faster than the rate of iidsamples O(minus12) Unfortunately this assumption holds only when H isfinite dimensional (Bach et al 2012) and therefore the fast rate of O(minus1)

has not been guaranteed for infinite-dimensional cases Nevertheless thisfast rate motivates the use of kernel herding in the data reduction methodin section C2 in appendix C (we will use kernel herding for two differentpurposes)

4 Kernel Monte Carlo Filter

In this section we present our kernel Monte Carlo filter (KMCF) Firstwe define notation and review the problem setting in section 41 We thendescribe the algorithm of KMCF in section 42 We discuss implementationissues such as hyperparameter selection and computational cost in section43 We explain how to decode the information on the posteriors from theestimated kernel means in section 44

41 Notation and Problem Setup Here we formally define the setupexplained in section 1 The notation is summarized in Table 1

We consider a state-space model (see Figure 1) Let X and Y be mea-surable spaces which serve as a state space and an observation space re-spectively Let x1 xt xT isin X be a sequence of hidden states whichfollow a Markov process Let p(xt |xtminus1) denote a transition model that de-fines this Markov process Let y1 yt yT isin Y be a sequence of obser-vations Each observation yt is assumed to be generated from an observationmodel p(yt |xt ) conditioned on the corresponding state xt We use the abbre-viation y1t = y1 yt

We consider a filtering problem of estimating the posterior distributionp(xt |y1t ) for each time t = 1 T The estimation is to be done online

394 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Table 1 Notation

X State spaceY Observation spacext isin X State at time tyt isin Y Observation at time tp(yt |xt ) Observation modelp(xt |xtminus1) Transition model(XiYi)n

i=1 State-observation exampleskX Positive-definite kernel on XkY Positive-definite kernel on YHX RKHS associated with kXHY RKHS associated with kY

as each yt is given Specifically we consider the following setting (see alsosection 1)

1 The observation model p(yt |xt ) is not known explicitly or even para-metrically Instead we are given examples of state-observation pairs(XiYi)n

i=1 sub X times Y prior to the test phase The observation modelis also assumed time homogeneous

2 Sampling from the transition model p(xt |xtminus1) is possible Its prob-abilistic model can be an arbitrary nonlinear nongaussian distribu-tion as for standard particle filters It can further depend on timeFor example control input can be included in the transition model asp(xt |xtminus1) = p(xt |xtminus1 ut ) where ut denotes control input providedby a user at time t

Let kX X times X rarr R and kY Y times Y rarr R be positive-definite kernels onX and Y respectively Denote by HX and HY their respective RKHSs Weaddress the above filtering problem by estimating the kernel means of theposteriors

mxt |y1t=

intkX (middot xt )p(xt |y1t )dxt isin HX (t = 1 T ) (41)

These preserve all the information of the corresponding posteriors if thekernels are characteristic (see section 32) Therefore the resulting estimatesof these kernel means provide us the information of the posteriors as ex-plained in section 44

42 Algorithm KMCF iterates three steps of prediction correction andresampling for each time t Suppose that we have just finished the iterationat time t minus 1 Then as shown later the resampling step yields the followingestimator of equation 41 at time t minus 1

Filtering with State-Observation Examples 395

Figure 2 One iteration of KMCF Here X1 X8 and Y1 Y8 denote statesand observations respectively in the state-observation examples (XiYi)n

i=1(suppose n = 8) 1 Prediction step The kernel mean of the prior equation 45 isestimated by sampling with the transition model p(xt |xtminus1) 2 Correction stepThe kernel mean of the posterior equation 41 is estimated by applying kernelBayesrsquo rule (see algorithm 1) The estimation makes use of the informationof the prior (expressed as m

π= (mxt |y1tminus1

(Xi)) isin R8) as well as that of a new

observation yt (expressed as kY = (kY (ytYi)) isin R8) The resulting estimate

equation 46 is expressed as a weighted sample (wti Xi)ni=1 Note that the

weights may be negative 3 Resampling step Samples associated with smallweights are eliminated and those with large weights are replicated by applyingkernel herding (see algorithm 2) The resulting samples provide an empiricalkernel mean equation 47 which will be used in the next iteration

mxtminus1|y1tminus1= 1

n

nsumi=1

kX (middot Xtminus1i) (42)

where Xtminus11 Xtminus1n isin X We show one iteration of KMCF that estimatesthe kernel mean (41) at time t (see also Figure 2)

421 Prediction Step The prediction step is as follows We generate asample from the transition model for each Xtminus1i in equation 42

Xti sim p(xt |xtminus1 = Xtminus1i) (i = 1 n) (43)

396 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

We then specify a new empirical kernel mean

mxt |y1tminus1= 1

n

nsumi=1

kX (middot Xti) (44)

This is an estimator of the following kernel mean of the prior

mxt |y1tminus1=

intkX (middot xt )p(xt |y1tminus1)dxt isin HX (45)

where

p(xt |y1tminus1) =int

p(xt |xtminus1)p(xtminus1|y1tminus1)dxtminus1

is the prior distribution of the current state xt Thus equation 44 serves asa prior for the subsequent posterior estimation

In section 5 we theoretically analyze this sampling procedure in detailand provide justification of equation 44 as an estimator of the kernel meanequation 45 We emphasize here that such an analysis is necessary eventhough the sampling procedure is similar to that of a particle filter thetheory of particle methods does not provide a theoretical justification ofequation 44 as a kernel mean estimator since it deals with probabilities asempirical distributions

422 Correction Step This step estimates the kernel mean equation 41of the posterior by using kernel Bayesrsquo rule (see algorithm 1) in section 33This makes use of the new observation yt the state-observation examples(XiYi)n

i=1 and the estimate equation 44 of the priorThe input of algorithm 1 consists of (1) vectors

kY = (kY (ytY1) kY (ytYn))T isin Rn

mπ = (mxt |y1tminus1(X1) mxt |y1tminus1

(Xn))T

=(

1n

nsumi=1

kX (Xq Xti)

)n

q=1

isin Rn

which are interpreted as expressions of yt and mxt |y1tminus1using the sample

(XiYi)ni=1 (2) kernel matrices GX = (kX (Xi Xj)) GY = (kY (YiYj)) isin

Rntimesn and (3) regularization constants ε δ gt 0 These constants ε δ as well

as kernels kX kY are hyperparameters of KMCF (we discuss how to choosethese parameters later)

Filtering with State-Observation Examples 397

Algorithm 1 outputs a weight vector w = (w1 wn) isin Rn Normaliz-

ing these weights wt = wsumn

i=1 wi we obtain an estimator of equation 413

as

mxt |y1t=

nsumi=1

wtikX (middot Xi) (46)

The apparent difference from a particle filter is that the posterior (ker-nel mean) estimator equation 46 is expressed in terms of the samplesX1 Xn in the training sample (XiYi)n

i=1 not with the samples fromthe prior equation 44 This requires that the training samples X1 Xncover the support of posterior p(xt |y1t ) sufficiently well If this does nothold we cannot expect good performance for the posterior estimate Notethat this is also true for any methods that deal with the setting of this letterpoverty of training samples in a certain region means that we do not haveany information about the observation model p(yt |xt ) in that region

423 Resampling Step This step applies the update equations 36 and37 of kernel herding in section 35 to the estimate equation 46 This is toobtain samples Xt1 Xtn such that

mxt |y1t= 1

n

nsumi=1

kX (middot Xti) (47)

is close to equation 46 in the RKHS Our theoretical analysis in section 5shows that such a procedure can reduce the error of the prediction step attime t + 1

The procedure is summarized in algorithm 2 Specifically we gener-ate each Xti by searching the solution of the optimization problem in

3For this normalization procedure see the discussion in section 43

398 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

equations 36 and 37 from a finite set of samples X1 Xn in equation46 We allow repetitions in Xt1 Xtn We can expect that the resultingequation 47 is close to equation 46 in the RKHS if the samples X1 Xncover the support of the posterior p(xt |y1t ) sufficiently This is verified bythe theoretical analysis of section 53

Here searching for the solutions from a finite set reduces the computa-tional costs of kernel herding It is possible to search from the entire space Xif we have sufficient time or if the sample size n is small enough it dependson applications and available computational resources We also note thatthe size of the resampling samples is not necessarily n this depends on howaccurately these samples approximate equation 46 Thus a smaller numberof samples may be sufficient In this case we can reduce the computationalcosts of resampling as discussed in section 52

The aim of our resampling step is similar to that of the resamplingstep of a particle filter (see Doucet amp Johansen 2011) Intuitively the aimis to eliminate samples with very small weights and replicate those withlarge weights (see Figures 2 and 3) In particle methods this is realized bygenerating samples from the empirical distribution defined by a weightedsample (therefore this procedure is called resampling) Our resamplingstep is a realization of such a procedure in terms of the kernel mean embed-ding we generate samples Xt1 Xtn from the empirical kernel meanequation 46

Note that the resampling algorithm of particle methods is not appropri-ate for use with kernel mean embeddings This is because it assumes thatweights are positive but our weights in equation 46 can be negative asthis equation is a kernel mean estimator One may apply the resamplingalgorithm of particle methods by first truncating the samples with nega-tive weights However there is no guarantee that samples obtained by thisheuristic produce a good approximation of equation 46 as a kernel meanas shown by experiments in section 61 In this sense the use of kernel herd-ing is more natural since it generates samples that approximate a kernelmean

4.2.4 Overall Algorithm. We summarize the overall procedure of KMCF in algorithm 3, where $p_{\rm init}$ denotes a prior distribution for the initial state $x_1$. For each time $t$, KMCF takes as input an observation $y_t$ and outputs a weight vector $w_t = (w_{t,1}, \dots, w_{t,n})^T \in \mathbb{R}^n$. Combined with the samples $X_1, \dots, X_n$ in the state-observation examples $\{(X_i, Y_i)\}_{i=1}^n$, these weights provide an estimator, equation 4.6, of the kernel mean of the posterior, equation 4.1.

We first compute the kernel matrices $G_X, G_Y$ (lines 4-5), which are used in algorithm 1 of kernel Bayes' rule (line 15). For $t = 1$, we generate an i.i.d. sample $X_{1,1}, \dots, X_{1,n}$ from the initial distribution $p_{\rm init}$ (line 8), which provides an estimator of the prior corresponding to equation 4.4. Line 10 is the resampling step at time $t-1$, and line 11 is the prediction step at time $t$. Lines 13 to 16 correspond to the correction step.


4.3 Discussion. The estimation accuracy of KMCF can depend on several factors in practice.

4.3.1 Training Samples. We first note that the training samples $\{(X_i, Y_i)\}_{i=1}^n$ should provide information concerning the observation model $p(y_t|x_t)$. For example, $\{(X_i, Y_i)\}_{i=1}^n$ may be an i.i.d. sample from a joint distribution $p(x, y)$ on $\mathcal{X} \times \mathcal{Y}$ that decomposes as $p(x, y) = p(y|x)p(x)$. Here $p(y|x)$ is the observation model, and $p(x)$ is some distribution on $\mathcal{X}$. The support of $p(x)$ should cover the region where the states $x_1, \dots, x_T$ may pass in the test phase, as discussed in section 4.2. For example, this is satisfied when the state space $\mathcal{X}$ is compact and the support of $p(x)$ is the entire $\mathcal{X}$.

Note that the training samples $\{(X_i, Y_i)\}_{i=1}^n$ can also be non-i.i.d. in practice. For example, we may deterministically select $X_1, \dots, X_n$ so that they cover the region of interest. In location estimation problems in robotics, for instance, we may collect location-sensor examples $\{(X_i, Y_i)\}_{i=1}^n$ so that the locations $X_1, \dots, X_n$ cover the region where location estimation is to be conducted (Quigley et al., 2010).

4.3.2 Hyperparameters. As in other kernel methods in general, the performance of KMCF depends on the choice of its hyperparameters, which are the kernels $k_X$ and $k_Y$ (or parameters in the kernels, e.g., the bandwidth of the gaussian kernel) and the regularization constants $\delta, \varepsilon > 0$. We need to define these hyperparameters based on the joint sample $\{(X_i, Y_i)\}_{i=1}^n$ before running the algorithm on the test data $y_1, \dots, y_T$. This can be done by cross-validation. Suppose that $\{(X_i, Y_i)\}_{i=1}^n$ is given as a sequence from the state-space model. We can then apply two-fold cross-validation by dividing the sequence into two subsequences. If $\{(X_i, Y_i)\}_{i=1}^n$ is not a sequence, we can rely on the cross-validation procedure for kernel Bayes' rule (see section 4.2 of Fukumizu et al., 2013).

4.3.3 Normalization of Weights. We found in our preliminary experiments that normalization of the weights (see line 16, algorithm 3) is beneficial to the filtering performance. This may be justified by the following discussion about kernel mean estimators in general. Consider a consistent kernel mean estimator $\hat{m}_P = \sum_{i=1}^n w_i k(\cdot, X_i)$ such that $\lim_{n\to\infty} \|\hat{m}_P - m_P\|_{\mathcal{H}} = 0$. Then we can show that the sum of the weights converges to 1, $\lim_{n\to\infty} \sum_{i=1}^n w_i = 1$, under certain assumptions (Kanagawa & Fukumizu, 2014). This can be explained as follows. Recall that the weighted average $\sum_{i=1}^n w_i f(X_i)$ of a function $f$ is an estimator of the expectation $\int f(x) dP(x)$. Let $f$ be the function that takes the value 1 for any input, $f(x) = 1$ for all $x \in \mathcal{X}$. Then we have $\sum_{i=1}^n w_i f(X_i) = \sum_{i=1}^n w_i$ and $\int f(x) dP(x) = 1$; therefore $\sum_{i=1}^n w_i$ is an estimator of 1. In other words, if the error $\|\hat{m}_P - m_P\|_{\mathcal{H}}$ is small, then the sum of the weights $\sum_{i=1}^n w_i$ should be close to 1. Conversely, if the sum of the weights is far from 1, it suggests that the estimate $\hat{m}_P$ is not accurate. Based on this theoretical observation, we suppose that normalization of the weights (which makes the sum equal to 1) results in a better estimate.

4.3.4 Time Complexity. For each time $t$, the naive implementation of algorithm 3 requires a time complexity of $O(n^3)$ for the size $n$ of the joint sample $\{(X_i, Y_i)\}_{i=1}^n$. This comes from algorithm 1 in line 15 (kernel Bayes' rule) and algorithm 2 in line 10 (resampling). The complexity $O(n^3)$ of algorithm 1 is due to the matrix inversions. Note that one of the inversions, $(G_X + n\varepsilon I_n)^{-1}$, can be computed before the test phase, as it does not involve the test data. Algorithm 2 also has complexity $O(n^3)$. In section 5.2 we explain how this cost can be reduced to $O(\ell n^2)$ by generating only $\ell < n$ samples in the resampling step.

4.3.5 Speeding-Up Methods. In appendix C, we describe two methods for reducing the computational costs of KMCF, both of which need to be applied only prior to the test phase. The first is a low-rank approximation of the kernel matrices $G_X, G_Y$, which reduces the complexity to $O(nr^2)$, where $r$ is the rank of the low-rank matrices. Low-rank approximation works well in practice, since the eigenvalues of a kernel matrix often decay very rapidly. Indeed, this has been shown theoretically for some cases (see Widom, 1963, 1964, and the discussion in Bach & Jordan, 2002). The second is a data-reduction method based on kernel herding, which efficiently selects joint subsamples from the training set $\{(X_i, Y_i)\}_{i=1}^n$. Algorithm 3 is then applied based only on those subsamples. The resulting complexity is thus $O(r^3)$, where $r$ is the number of subsamples. This method is motivated by the fast convergence rate of kernel herding (Chen et al., 2010).

Both methods require the number $r$ to be chosen, which is either the rank for low-rank approximation or the number of subsamples in data reduction. This determines the trade-off between accuracy and computational time. In practice, there are two ways of selecting $r$. Regarding $r$ as a hyperparameter of KMCF, we can select it by cross-validation; or we can choose $r$ by comparing the resulting approximation error, which is measured in a matrix norm for low-rank approximation and in an RKHS norm for the subsampling method (for details, see appendix C).
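Appendix C is not reproduced in this section; as a generic stand-in for the low-rank idea, the sketch below builds a Nystrom-type rank-$r$ factor of a kernel matrix. This is only one common way to obtain such a factor and is not necessarily the letter's algorithm 5.

```python
import numpy as np

def nystrom_factor(K, r, seed=None):
    """Return L (n x r) with K approximately equal to L @ L.T.

    A generic Nystrom-type rank-r factorization, shown only to
    illustrate why low-rank structure yields O(n r^2) operations.
    """
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    idx = rng.choice(n, size=r, replace=False)   # landmark points
    C = K[:, idx]                                # n x r block
    W = K[np.ix_(idx, idx)]                      # r x r block
    evals, evecs = np.linalg.eigh(W + 1e-10 * np.eye(r))
    evals = np.clip(evals, 1e-12, None)
    return C @ (evecs / np.sqrt(evals))          # L = C W^{-1/2}
```

Once a kernel matrix is replaced by $LL^T$ with $L \in \mathbb{R}^{n \times r}$, matrix-vector products and regularized inversions can be carried out through the $r$-dimensional factor, which is where the $O(nr^2)$ cost quoted above comes from.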

4.3.6 Transfer Learning Setting. We assumed that the observation model in the test phase is the same as for the training samples. However, this might not hold in some situations. For example, in the vision-based localization problem, the illumination conditions for the test and training phases might be different (e.g., the test is done at night, while the training samples are collected in the morning). Without taking such a significant change in the observation model into account, KMCF would not perform well in practice.

This problem could be addressed by exploiting the framework of transfer learning (Pan & Yang, 2010). This framework aims at situations where the probability distribution that generates the test data is different from that of the training samples. The main assumption is that there exist a small number of examples from the test distribution. Transfer learning then provides a way of combining such test examples and abundant training samples, thereby improving the test performance. The application of transfer learning in our setting remains a topic for future research.

4.4 Estimation of Posterior Statistics. By algorithm 3, we obtain the estimates of the kernel means of the posteriors, equation 4.1, as

$$\hat{m}_{x_t|y_{1:t}} = \sum_{i=1}^n w_{t,i}\, k_X(\cdot, X_i) \quad (t = 1, \dots, T). \qquad (4.8)$$

These contain the information on the posteriors $p(x_t|y_{1:t})$ (see sections 3.2 and 3.4). We now show how to estimate statistics of the posteriors using these estimates. For ease of presentation, we consider the case $\mathcal{X} = \mathbb{R}^d$. Theoretical arguments to justify these operations are provided by Kanagawa and Fukumizu (2014).

4.4.1 Mean and Covariance. Consider the posterior mean $\int x_t\, p(x_t|y_{1:t})\, dx_t \in \mathbb{R}^d$ and the posterior (uncentered) covariance $\int x_t x_t^T\, p(x_t|y_{1:t})\, dx_t \in \mathbb{R}^{d \times d}$. These quantities can be estimated as

$$\sum_{i=1}^n w_{t,i} X_i \ \ \text{(mean)}, \qquad \sum_{i=1}^n w_{t,i} X_i X_i^T \ \ \text{(covariance)}.$$
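These two estimators are direct weighted sums; a minimal Python sketch, assuming the weights and training states of equation 4.8 are available as arrays, is:

```python
import numpy as np

def posterior_mean_cov(w, X):
    """Weighted estimates of the posterior mean and uncentered covariance.

    w : weights (w_t1, ..., w_tn); may contain negative entries
    X : array of shape (n, d) holding the training states X_1, ..., X_n
    """
    mean = w @ X                      # sum_i w_i X_i, shape (d,)
    cov = (X * w[:, None]).T @ X      # sum_i w_i X_i X_i^T, shape (d, d)
    return mean, cov
```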

4.4.2 Probability Mass. Let $A \subset \mathcal{X}$ be a measurable set with smooth boundary. Define the indicator function $I_A(x)$ by $I_A(x) = 1$ for $x \in A$ and $I_A(x) = 0$ otherwise. Consider the probability mass $\int I_A(x)\, p(x_t|y_{1:t})\, dx_t$. This can be estimated as $\sum_{i=1}^n w_{t,i} I_A(X_i)$.

4.4.3 Density. Suppose $p(x_t|y_{1:t})$ has a density function. Let $J(x)$ be a smoothing kernel satisfying $\int J(x)\, dx = 1$ and $J(x) \ge 0$. Let $h > 0$ and define $J_h(x) = \frac{1}{h^d} J\left(\frac{x}{h}\right)$. Then the density of $p(x_t|y_{1:t})$ can be estimated as

$$\hat{p}(x_t|y_{1:t}) = \sum_{i=1}^n w_{t,i} J_h(x_t - X_i) \qquad (4.9)$$

with an appropriate choice of $h$.

4.4.4 Mode. The mode may be obtained by finding a point that maximizes equation 4.9. However, this requires a careful choice of $h$. Instead, we may use $X_{i_{\max}}$ with $i_{\max} = \arg\max_i w_{t,i}$ as a mode estimate. This is the point in $X_1, \dots, X_n$ that is associated with the maximum weight in $w_{t,1}, \dots, w_{t,n}$. This point can be interpreted as the point that maximizes equation 4.9 in the limit of $h \to 0$.
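The following sketch illustrates equation 4.9 with a gaussian smoothing kernel (one possible choice of $J$, not prescribed by the letter) together with the max-weight mode heuristic; the bandwidth $h$ is a free parameter.

```python
import numpy as np

def posterior_density(w, X, x_query, h=0.1):
    """Density estimate of equation 4.9 with a gaussian smoothing kernel J."""
    n, d = X.shape
    diff = x_query[:, None, :] - X[None, :, :]              # (m, n, d)
    J = np.exp(-np.sum(diff ** 2, axis=2) / (2 * h ** 2))
    J /= (2 * np.pi * h ** 2) ** (d / 2)                    # normalizes J_h
    return J @ w                                            # (m,)

def posterior_mode(w, X):
    """Max-weight heuristic: the training state with the largest weight."""
    return X[np.argmax(w)]
```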

4.4.5 Other Methods. Other ways of using equation 4.8 include preimage computation and the fitting of gaussian mixtures (see, e.g., Song et al., 2009; Fukumizu et al., 2013; McCalman, O'Callaghan, & Ramos, 2013).

5 Theoretical Analysis

In this section, we analyze the sampling procedure of the prediction step in section 4.2. Specifically, we derive an upper bound on the error of the estimator, equation 4.4. We also discuss in detail how the resampling step in section 4.2 works as a preprocessing step for the prediction step.

To make our analysis clear, we slightly generalize the setting of the prediction step and discuss the sampling and resampling procedures in this setting.

5.1 Error Bound for the Prediction Step. Let $\mathcal{X}$ be a measurable space and $P$ be a probability distribution on $\mathcal{X}$. Let $p(\cdot|x)$ be a conditional distribution on $\mathcal{X}$ conditioned on $x \in \mathcal{X}$. Let $Q$ be a marginal distribution on $\mathcal{X}$ defined by $Q(B) = \int p(B|x)\, dP(x)$ for all measurable $B \subset \mathcal{X}$. In the filtering setting of section 4, the space $\mathcal{X}$ corresponds to the state space, and the distributions $P$, $p(\cdot|x)$, and $Q$ correspond to the posterior $p(x_{t-1}|y_{1:t-1})$ at time $t-1$, the transition model $p(x_t|x_{t-1})$, and the prior $p(x_t|y_{1:t-1})$ at time $t$, respectively.

Let $k_X$ be a positive-definite kernel on $\mathcal{X}$ and $\mathcal{H}_X$ be the RKHS associated with $k_X$. Let $m_P = \int k_X(\cdot, x)\, dP(x)$ and $m_Q = \int k_X(\cdot, x)\, dQ(x)$ be the kernel means of $P$ and $Q$, respectively. Suppose that we are given an empirical estimate of $m_P$ as

$$\hat{m}_P = \sum_{i=1}^n w_i k_X(\cdot, X_i), \qquad (5.1)$$

where $w_1, \dots, w_n \in \mathbb{R}$ and $X_1, \dots, X_n \in \mathcal{X}$. Considering this weighted sample form enables us to explain the mechanism of the resampling step.

The prediction step can then be cast as the following procedure: for each sample $X_i$, we generate a new sample $X'_i$ with the conditional distribution, $X'_i \sim p(\cdot|X_i)$. Then we estimate $m_Q$ by

$$\hat{m}_Q = \sum_{i=1}^n w_i k_X(\cdot, X'_i), \qquad (5.2)$$

which corresponds to the estimate, equation 4.4, of the prior kernel mean at time $t$.

The following theorem provides an upper bound on the error of equation 5.2 and reveals the properties of equation 5.1 that affect the error of the estimator, equation 5.2. The proof is given in appendix A.
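In the notation of equations 5.1 and 5.2, the prediction step simply pushes each sample through the transition model while keeping the weights. A minimal sketch, with a hypothetical `sample_transition` standing in for $p(\cdot|x)$ (here a linear-gaussian transition of the kind used in the experiments of section 6.2), is:

```python
import numpy as np

def predict_kernel_mean(w, X, sample_transition, seed=None):
    """From m_hat_P = sum_i w_i k(., X_i) (eq. 5.1) to
    m_hat_Q = sum_i w_i k(., X_i') (eq. 5.2), with X_i' ~ p(.|X_i)."""
    rng = np.random.default_rng(seed)
    X_new = np.array([sample_transition(x, rng) for x in X])
    return w, X_new   # weights unchanged; only sample locations move

# hypothetical transition x_t = 0.9 x_{t-1} + v_t, v_t ~ N(0, 1)
def sample_transition(x, rng):
    return 0.9 * x + rng.standard_normal(size=x.shape)
```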

Theorem 1. Let $\hat{m}_P$ be a fixed estimate of $m_P$ given by equation 5.1. Define a function $\theta$ on $\mathcal{X} \times \mathcal{X}$ by $\theta(x_1, x_2) = \int\!\!\int k_X(x'_1, x'_2)\, dp(x'_1|x_1)\, dp(x'_2|x_2)$ for all $(x_1, x_2) \in \mathcal{X} \times \mathcal{X}$, and assume that $\theta$ is included in the tensor RKHS $\mathcal{H}_X \otimes \mathcal{H}_X$.$^4$ The estimator $\hat{m}_Q$, equation 5.2, then satisfies

$$E_{X'_1, \dots, X'_n}\big[\|\hat{m}_Q - m_Q\|^2_{\mathcal{H}_X}\big] \le \sum_{i=1}^n w_i^2 \big( E_{X'_i}[k_X(X'_i, X'_i)] - E_{X'_i, \tilde{X}'_i}[k_X(X'_i, \tilde{X}'_i)] \big) \qquad (5.3)$$

$$\qquad\qquad + \|\hat{m}_P - m_P\|^2_{\mathcal{H}_X}\, \|\theta\|_{\mathcal{H}_X \otimes \mathcal{H}_X}, \qquad (5.4)$$

where $X'_i \sim p(\cdot|X_i)$ and $\tilde{X}'_i$ is an independent copy of $X'_i$.

$^4$The tensor RKHS $\mathcal{H}_X \otimes \mathcal{H}_X$ is the RKHS of the product kernel $k_{X \times X}$ on $\mathcal{X} \times \mathcal{X}$ defined as $k_{X \times X}((x_a, x_b), (x_c, x_d)) = k_X(x_a, x_c)\, k_X(x_b, x_d)$ for all $(x_a, x_b), (x_c, x_d) \in \mathcal{X} \times \mathcal{X}$. This space $\mathcal{H}_X \otimes \mathcal{H}_X$ consists of smooth functions on $\mathcal{X} \times \mathcal{X}$ if the kernel $k_X$ is smooth (e.g., if $k_X$ is gaussian; see section 4 of Steinwart & Christmann, 2008). In this case, we can interpret this assumption as requiring that $\theta$ be smooth as a function on $\mathcal{X} \times \mathcal{X}$. The function $\theta$ can be written as the inner product between the kernel means of the conditional distributions, $\theta(x_1, x_2) = \langle m_{p(\cdot|x_1)}, m_{p(\cdot|x_2)} \rangle_{\mathcal{H}_X}$, where $m_{p(\cdot|x)} = \int k_X(\cdot, x')\, dp(x'|x)$. Therefore, the assumption may further be seen as requiring that the map $x \mapsto m_{p(\cdot|x)}$ be smooth. Note that while similar assumptions are common in the literature on kernel mean embeddings (e.g., theorem 5 of Fukumizu et al., 2013), we may relax this assumption by using approximation arguments in learning theory (e.g., theorems 2.2 and 2.3 of Eberts & Steinwart, 2013). This analysis remains a topic for future research.

From theorem 1, we can make the following observations. First, the second term, equation 5.4, of the upper bound shows that the error of the estimator, equation 5.2, is likely to be large if the given estimate, equation 5.1, has a large error $\|\hat{m}_P - m_P\|^2_{\mathcal{H}_X}$, which is reasonable to expect.

Second, the first term, equation 5.3, shows that the error of equation 5.2 can be large if the distribution of $X'_i$ (i.e., $p(\cdot|X_i)$) has large variance. For example, suppose $X'_i = f(X_i) + \varepsilon_i$, where $f: \mathcal{X} \to \mathcal{X}$ is some mapping and $\varepsilon_i$ is a random variable with mean 0. Let $k_X$ be the gaussian kernel, $k_X(x, x') = \exp(-\|x - x'\|^2 / 2\alpha)$ for some $\alpha > 0$. Then $E_{X'_i}[k_X(X'_i, X'_i)] - E_{X'_i, \tilde{X}'_i}[k_X(X'_i, \tilde{X}'_i)]$ increases from 0 to 1 as the variance of $\varepsilon_i$ (i.e., the variance of $X'_i$) increases from 0 to infinity. Therefore, in this case, equation 5.3 is upper-bounded at worst by $\sum_{i=1}^n w_i^2$. Note that $E_{X'_i}[k_X(X'_i, X'_i)] - E_{X'_i, \tilde{X}'_i}[k_X(X'_i, \tilde{X}'_i)]$ is always nonnegative.$^5$

5.1.1 Effective Sample Size. Now let us assume that the kernel $k_X$ is bounded: there is a constant $C > 0$ such that $\sup_{x \in \mathcal{X}} k_X(x, x) < C$. Then the inequality of theorem 1 can be further bounded as

$$E_{X'_1, \dots, X'_n}\big[\|\hat{m}_Q - m_Q\|^2_{\mathcal{H}_X}\big] \le 2C \sum_{i=1}^n w_i^2 + \|\hat{m}_P - m_P\|^2_{\mathcal{H}_X}\, \|\theta\|_{\mathcal{H}_X \otimes \mathcal{H}_X}. \qquad (5.5)$$

This bound shows that two quantities are important in the estimate, equation 5.1: (1) the sum of squared weights $\sum_{i=1}^n w_i^2$ and (2) the error $\|\hat{m}_P - m_P\|^2_{\mathcal{H}_X}$. In other words, the error of equation 5.2 can be large if the quantity $\sum_{i=1}^n w_i^2$ is large, regardless of the accuracy of equation 5.1 as an estimator of $m_P$. In fact, an estimator of the form 5.1 can have large $\sum_{i=1}^n w_i^2$ even when $\|\hat{m}_P - m_P\|^2_{\mathcal{H}_X}$ is small, as shown in section 6.1.

$^5$To show this, it is sufficient to prove that $\int\!\!\int k_X(x, \tilde{x})\, dP(x)\, dP(\tilde{x}) \le \int k_X(x, x)\, dP(x)$ for any probability $P$. This can be shown as follows: $\int\!\!\int k_X(x, \tilde{x})\, dP(x)\, dP(\tilde{x}) = \int\!\!\int \langle k_X(\cdot, x), k_X(\cdot, \tilde{x}) \rangle_{\mathcal{H}_X}\, dP(x)\, dP(\tilde{x}) \le \int\!\!\int \sqrt{k_X(x, x)} \sqrt{k_X(\tilde{x}, \tilde{x})}\, dP(x)\, dP(\tilde{x}) \le \int k_X(x, x)\, dP(x)$. Here we used the reproducing property, the Cauchy-Schwartz inequality, and Jensen's inequality.

Figure 3: An illustration of the sampling procedure with (right) and without (left) the resampling algorithm. The left panel corresponds to the kernel mean estimators, equations 5.1 and 5.2, in section 5.1, and the right panel corresponds to equations 5.6 and 5.7 in section 5.2.

The inverse of the sum of the squared weights, $1/\sum_{i=1}^n w_i^2$, can be interpreted as the effective sample size (ESS) of the empirical kernel mean, equation 5.1. To explain this, suppose that the weights are normalized: $\sum_{i=1}^n w_i = 1$. Then the ESS takes its maximum $n$ when the weights are uniform, $w_1 = \cdots = w_n = 1/n$. It becomes small when only a few samples have large weights (see the left side of Figure 3). Therefore, the bound, equation 5.5, can be interpreted as follows: to make equation 5.2 a good estimator of $m_Q$, we need an estimate, equation 5.1, such that the ESS is large and the error $\|\hat{m}_P - m_P\|_{\mathcal{H}}$ is small. Here we borrowed the notion of ESS from the literature on particle methods, in which ESS has also played an important role (see section 2.5.3 of Liu, 2001, and section 3.5 of Doucet & Johansen, 2011).
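The ESS is a two-line computation; the toy check below (an illustration, not part of the letter) shows that uniform weights give the maximum value $n$ while a single dominant weight drives the ESS toward 1.

```python
import numpy as np

def effective_sample_size(w):
    """ESS = 1 / sum_i w_i^2 for a normalized weight vector."""
    w = np.asarray(w, dtype=float)
    w = w / np.sum(w)
    return 1.0 / np.sum(w ** 2)

print(effective_sample_size(np.ones(100) / 100))     # uniform weights -> 100.0
print(effective_sample_size([0.97] + [0.001] * 30))  # one dominant weight -> close to 1
```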

5.2 Role of Resampling. Based on these arguments, we explain how the resampling step in section 4.2 works as a preprocessing step for the sampling procedure. Consider $\hat{m}_P$ in equation 5.1 as an estimate, equation 4.6, given by the correction step at time $t-1$. Then we can think of $\hat{m}_Q$, equation 5.2, as an estimator of the kernel mean, equation 4.5, of the prior without the resampling step.

The resampling step is an application of kernel herding to $\hat{m}_P$ to obtain samples $\bar{X}_1, \dots, \bar{X}_n$, which provide a new estimate of $m_P$ with uniform weights,

$$\bar{m}_P = \frac{1}{n} \sum_{i=1}^n k_X(\cdot, \bar{X}_i). \qquad (5.6)$$

The subsequent prediction step is to generate a sample $X'_i \sim p(\cdot|\bar{X}_i)$ for each $\bar{X}_i$ $(i = 1, \dots, n)$ and estimate $m_Q$ as

$$\bar{m}_Q = \frac{1}{n} \sum_{i=1}^n k_X(\cdot, X'_i). \qquad (5.7)$$

Theorem 1 gives the following bound for this estimator, corresponding to equation 5.5:

$$E_{X'_1, \dots, X'_n}\big[\|\bar{m}_Q - m_Q\|^2_{\mathcal{H}_X}\big] \le \frac{2C}{n} + \|\bar{m}_P - m_P\|^2_{\mathcal{H}}\, \|\theta\|_{\mathcal{H}_X \otimes \mathcal{H}_X}. \qquad (5.8)$$

A comparison of the upper bounds of equations 5.5 and 5.8 implies that the resampling step is beneficial when $\sum_{i=1}^n w_i^2$ is large (i.e., the ESS is small) and $\|\bar{m}_P - \hat{m}_P\|_{\mathcal{H}_X}$ is small. The condition on $\|\bar{m}_P - \hat{m}_P\|_{\mathcal{H}_X}$ means that the loss incurred by kernel herding (in terms of the RKHS distance) is small. This implies $\|\bar{m}_P - m_P\|_{\mathcal{H}_X} \approx \|\hat{m}_P - m_P\|_{\mathcal{H}_X}$, so the second term of equation 5.8 is close to that of equation 5.5. On the other hand, the first term of equation 5.8 will be much smaller than that of equation 5.5 if $\sum_{i=1}^n w_i^2 \gg 1/n$. In other words, the resampling step improves the accuracy of the sampling procedure by increasing the ESS of the kernel mean estimate $\hat{m}_P$. This is illustrated in Figure 3.

The above observations lead to the following procedures.

5.2.1 When to Apply Resampling. If $\sum_{i=1}^n w_i^2$ is not large, the gain from the resampling step will be small. Therefore, the resampling algorithm should be applied when $\sum_{i=1}^n w_i^2$ is above a certain threshold, say $2/n$. The same strategy has been commonly used in particle methods (see Doucet & Johansen, 2011).

Also, the bound, equation 5.3, of theorem 1 shows that resampling is not beneficial if the variance of the conditional distribution $p(\cdot|x)$ is very small (i.e., if the state transition is nearly deterministic). In this case, the error of the sampling procedure may increase due to the loss $\|\bar{m}_P - \hat{m}_P\|_{\mathcal{H}_X}$ caused by kernel herding.

5.2.2 Reduction of Computational Cost. Algorithm 2 generates $n$ samples $\bar{X}_1, \dots, \bar{X}_n$ with time complexity $O(n^3)$. Suppose that the first $\ell$ samples $\bar{X}_1, \dots, \bar{X}_\ell$, where $\ell < n$, already approximate $\hat{m}_P$ well: $\|\frac{1}{\ell} \sum_{i=1}^\ell k_X(\cdot, \bar{X}_i) - \hat{m}_P\|_{\mathcal{H}_X}$ is small. We then do not need to generate the rest of the samples $\bar{X}_{\ell+1}, \dots, \bar{X}_n$; we can make $n$ samples by copying the $\ell$ samples $n/\ell$ times (suppose $n$ can be divided by $\ell$ for simplicity, say $n = 2\ell$). Let $\check{X}_1, \dots, \check{X}_n$ denote these $n$ samples. Then $\frac{1}{\ell} \sum_{i=1}^\ell k_X(\cdot, \bar{X}_i) = \frac{1}{n} \sum_{i=1}^n k_X(\cdot, \check{X}_i)$ by definition, so $\|\frac{1}{n} \sum_{i=1}^n k_X(\cdot, \check{X}_i) - \hat{m}_P\|_{\mathcal{H}_X}$ is also small. This reduces the time complexity of algorithm 2 to $O(\ell n^2)$.

One might think that it is unnecessary to copy the $\ell$ samples $n/\ell$ times to make $n$ samples. This is not true, however. Suppose that we just use the first $\ell$ samples to define $\bar{m}_P = \frac{1}{\ell} \sum_{i=1}^\ell k_X(\cdot, \bar{X}_i)$. Then the first term of equation 5.8 becomes $2C/\ell$, which is larger than the $2C/n$ obtained with $n$ samples. This difference involves sampling with the conditional distribution $X'_i \sim p(\cdot|\bar{X}_i)$: if we use just the $\ell$ samples, sampling is done $\ell$ times; if we use the copied $n$ samples, sampling is done $n$ times. Thus, the benefit of making $n$ samples comes from sampling with the conditional distribution many times. This matches the bound of theorem 1, whose first term involves the variance of the conditional distribution.
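This cost reduction amounts to running the herding search only $\ell$ times and tiling the result. A minimal sketch, assuming $n$ is divisible by $\ell$ and reusing the hypothetical `herding_resample` helper sketched in section 4.2.3 above, is:

```python
import numpy as np

def resample_with_copies(w, K, X, ell):
    """Generate ell herding samples, then copy them n / ell times.

    The copied set represents the same empirical kernel mean as the
    ell originals, but the subsequent prediction step then draws n
    (rather than ell) transitions, which is what tightens bound 5.8.
    """
    n = len(w)
    assert n % ell == 0, "this sketch assumes n is divisible by ell"
    idx = herding_resample(w, K, ell)     # greedy search, ell iterations only
    idx_full = np.tile(idx, n // ell)     # copy the ell indices n / ell times
    return X[idx_full]
```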

5.3 Convergence Rates for Resampling. Our resampling algorithm (see algorithm 2) is an approximate version of kernel herding in section 3.5: algorithm 2 searches for the solutions of the update equations 3.6 and 3.7 from a finite set $\{X_1, \dots, X_n\} \subset \mathcal{X}$, not from the entire space $\mathcal{X}$. Therefore, existing theoretical guarantees for kernel herding (Chen et al., 2010; Bach et al., 2012) do not apply to algorithm 2. Here we provide a theoretical justification.

5.3.1 Generalized Version. We consider a slightly generalized version, shown in algorithm 4. It takes as input (1) a kernel mean estimator $\hat{m}_P$ of a kernel mean $m_P$, (2) candidate samples $Z_1, \dots, Z_N$, and (3) the number $\ell$ of resampling draws. It then outputs resampling samples $\bar{X}_1, \dots, \bar{X}_\ell \in \{Z_1, \dots, Z_N\}$, which form a new estimator $\bar{m}_P = \frac{1}{\ell} \sum_{i=1}^\ell k_X(\cdot, \bar{X}_i)$. Here $N$ is the number of the candidate samples.

Algorithm 4 searches for solutions of the update equations 3.6 and 3.7 from the candidate set $\{Z_1, \dots, Z_N\}$. Note that these samples $Z_1, \dots, Z_N$ can be different from those expressing the estimator $\hat{m}_P$. If they are the same (i.e., the estimator is expressed as $\hat{m}_P = \sum_{i=1}^n w_{t,i} k(\cdot, X_i)$ with $n = N$ and $X_i = Z_i$, $i = 1, \dots, n$), then algorithm 4 reduces to algorithm 2. In fact, theorem 2 allows $\hat{m}_P$ to be any element in the RKHS.

5.3.2 Convergence Rates in Terms of $N$ and $\ell$. Algorithm 4 gives the new estimator $\bar{m}_P$ of the kernel mean $m_P$. The error of this new estimator, $\|\bar{m}_P - m_P\|_{\mathcal{H}_X}$, should be close to that of the given estimator, $\|\hat{m}_P - m_P\|_{\mathcal{H}_X}$. Theorem 2 guarantees this. In particular, it provides convergence rates of $\|\bar{m}_P - m_P\|_{\mathcal{H}_X}$ approaching $\|\hat{m}_P - m_P\|_{\mathcal{H}_X}$ as $N$ and $\ell$ go to infinity. This theorem follows from theorem 3 in appendix B, which holds under weaker assumptions.

Theorem 2. Let $m_P$ be the kernel mean of a distribution $P$, and let $\hat{m}_P$ be any element in the RKHS $\mathcal{H}_X$. Let $Z_1, \dots, Z_N$ be an i.i.d. sample from a distribution with density $q$. Assume that $P$ has a density function $p$ such that $\sup_{x \in \mathcal{X}} p(x)/q(x) < \infty$. Let $\bar{X}_1, \dots, \bar{X}_\ell$ be samples given by algorithm 4 applied to $\hat{m}_P$ with candidate samples $Z_1, \dots, Z_N$. Then for $\bar{m}_P = \frac{1}{\ell} \sum_{i=1}^\ell k(\cdot, \bar{X}_i)$, we have

$$\|\bar{m}_P - m_P\|^2_{\mathcal{H}_X} = \big( \|\hat{m}_P - m_P\|_{\mathcal{H}_X} + O_p(N^{-1/2}) \big)^2 + O\!\left( \frac{\ln \ell}{\ell} \right) \quad (N, \ell \to \infty). \qquad (5.9)$$

Our proof in appendix B relies on the fact that kernel herding can be seen as the Frank-Wolfe optimization method (Bach et al., 2012). Indeed, the error $O(\ln \ell / \ell)$ in equation 5.9 comes from the optimization error of the Frank-Wolfe method after $\ell$ iterations (Freund & Grigas, 2014, bound 3.2). The error $O_p(N^{-1/2})$ is due to the approximation of the solution space by a finite set $\{Z_1, \dots, Z_N\}$. These errors will be small if $N$ and $\ell$ are large enough and the error of the given estimator, $\|\hat{m}_P - m_P\|_{\mathcal{H}_X}$, is relatively large. This is formally stated in corollary 1 below.

Theorem 2 assumes that the candidate samples are i.i.d. with a density $q$. The assumption $\sup_{x \in \mathcal{X}} p(x)/q(x) < \infty$ requires that the support of $q$ contain that of $p$. This is a formal characterization of the explanation in section 4.2 that the samples $X_1, \dots, X_N$ should cover the support of $P$ sufficiently. Note that the statement of theorem 2 also holds for non-i.i.d. candidate samples, as shown in theorem 3 of appendix B.

5.3.3 Convergence Rates as $\hat{m}_P$ Goes to $m_P$. Theorem 2 provides convergence rates when the estimator $\hat{m}_P$ is fixed. In corollary 1 below, we let $\hat{m}_P$ approach $m_P$ and provide convergence rates for $\bar{m}_P$ of algorithm 4 approaching $m_P$. This corollary directly follows from theorem 2, since the constant terms in $O_p(N^{-1/2})$ and $O(\ln \ell / \ell)$ in equation 5.9 do not depend on $\hat{m}_P$, as can be seen from the proof in appendix B.

Corollary 1. Assume that $P$ and $Z_1, \dots, Z_N$ satisfy the conditions in theorem 2 for all $N$. Let $\hat{m}_P^{(n)}$ be an estimator of $m_P$ such that $\|\hat{m}_P^{(n)} - m_P\|_{\mathcal{H}_X} = O_p(n^{-b})$ as $n \to \infty$ for some constant $b > 0$.$^6$ Let $N = \ell = n^{2b}$. Let $\bar{X}^{(n)}_1, \dots, \bar{X}^{(n)}_\ell$ be samples given by algorithm 4 applied to $\hat{m}_P^{(n)}$ with candidate samples $Z_1, \dots, Z_N$. Then for $\bar{m}_P^{(n)} = \frac{1}{\ell} \sum_{i=1}^\ell k_X(\cdot, \bar{X}^{(n)}_i)$, we have

$$\|\bar{m}_P^{(n)} - m_P\|_{\mathcal{H}_X} = O_p(n^{-b}) \quad (n \to \infty). \qquad (5.10)$$

$^6$Here the estimator $\hat{m}_P^{(n)}$ and the candidate samples $Z_1, \dots, Z_N$ can be dependent.

Corollary 1 assumes that the estimator $\hat{m}_P^{(n)}$ converges to $m_P$ at a rate $O_p(n^{-b})$ for some constant $b > 0$. Then the resulting estimator $\bar{m}_P^{(n)}$ given by algorithm 4 also converges to $m_P$ at the same rate $O(n^{-b})$ if we set $N = \ell = n^{2b}$. This implies that if we use sufficiently large $N$ and $\ell$, the errors $O_p(N^{-1/2})$ and $O(\ln \ell / \ell)$ in equation 5.9 can be negligible, as stated earlier. Note that $N = \ell = n^{2b}$ implies that $N$ and $\ell$ can be smaller than $n$, since typically we have $b \le 1/2$ ($b = 1/2$ corresponds to the convergence rates of parametric models). This provides support for the discussion in section 5.2 (reduction of computational cost).

5.3.4 Convergence Rates of Sampling after Resampling. We can derive convergence rates of the estimator $\bar{m}_Q$, equation 5.7, in section 5.2. Here we consider the following construction of $\bar{m}_Q$, as discussed in section 5.2 (reduction of computational cost). First, apply algorithm 4 to $\hat{m}_P^{(n)}$ and obtain resampling samples $\bar{X}^{(n)}_1, \dots, \bar{X}^{(n)}_\ell \in \{Z_1, \dots, Z_N\}$. Then copy these samples $n/\ell$ times, and let $\check{X}^{(n)}_1, \dots, \check{X}^{(n)}_n$ be the resulting $n$ samples. Finally, sample with the conditional distribution, $X'^{(n)}_i \sim p(\cdot|\check{X}^{(n)}_i)$ $(i = 1, \dots, n)$, and define

$$\bar{m}_Q^{(n)} = \frac{1}{n} \sum_{i=1}^n k_X(\cdot, X'^{(n)}_i). \qquad (5.11)$$

The following corollary is a consequence of corollary 1, theorem 1, and the bound, equation 5.8. Note that theorem 1 obtains convergence in expectation, which implies convergence in probability.

Corollary 2. Let $\theta$ be the function defined in theorem 1, and assume $\theta \in \mathcal{H}_X \otimes \mathcal{H}_X$. Assume that $P$ and $Z_1, \dots, Z_N$ satisfy the conditions in theorem 2 for all $N$. Let $\hat{m}_P^{(n)}$ be an estimator of $m_P$ such that $\|\hat{m}_P^{(n)} - m_P\|_{\mathcal{H}_X} = O_p(n^{-b})$ as $n \to \infty$ for some constant $b > 0$. Let $N = \ell = n^{2b}$. Then, for the estimator $\bar{m}_Q^{(n)}$ defined in equation 5.11, we have

$$\|\bar{m}_Q^{(n)} - m_Q\|_{\mathcal{H}_X} = O_p(n^{-\min(b, 1/2)}) \quad (n \to \infty).$$

Suppose $b \le 1/2$, which holds for basically any nonparametric estimator. Then corollary 2 shows that the estimator $\bar{m}_Q^{(n)}$ achieves the same convergence rate as the input estimator $\hat{m}_P^{(n)}$. Note that without resampling, the rate becomes $O_p\big(\sqrt{\sum_{i=1}^n (w^{(n)}_i)^2} + n^{-b}\big)$, where the weights are given by the input estimator $\hat{m}_P^{(n)} = \sum_{i=1}^n w^{(n)}_i k_X(\cdot, X^{(n)}_i)$ (see the bound, equation 5.5). Thanks to resampling, (the square root of) the sum of the squared weights in the case of corollary 2 becomes $1/\sqrt{n} \le 1/\sqrt{\ell}$, which is usually smaller than $\sqrt{\sum_{i=1}^n (w^{(n)}_i)^2}$ and is faster than or equal to $O_p(n^{-b})$. This shows the merit of resampling in terms of convergence rates (see also the discussion in section 5.2).

5.4 Consistency of the Overall Procedure. Here we show the consistency of the overall procedure in KMCF. This is based on corollary 2, which shows the consistency of the resampling step followed by the prediction step, and on theorem 5 of Fukumizu et al. (2013), which guarantees the consistency of kernel Bayes' rule in the correction step. Thus we consider three steps in the following order: resampling, prediction, and correction. More specifically, we show the consistency of the estimator, equation 4.6, of the posterior kernel mean at time $t$, given that the one at time $t-1$ is consistent.

To state our assumptions, we will need the following functions $\theta_{\rm pos}: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, $\theta_{\rm obs}: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, and $\theta_{\rm tra}: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$:

$$\theta_{\rm pos}(y, \tilde{y}) = \int\!\!\int k_X(x_t, \tilde{x}_t)\, dp(x_t|y_{1:t-1}, y_t = y)\, dp(\tilde{x}_t|y_{1:t-1}, y_t = \tilde{y}), \qquad (5.12)$$

$$\theta_{\rm obs}(x, \tilde{x}) = \int\!\!\int k_Y(y_t, \tilde{y}_t)\, dp(y_t|x_t = x)\, dp(\tilde{y}_t|x_t = \tilde{x}), \qquad (5.13)$$

$$\theta_{\rm tra}(x, \tilde{x}) = \int\!\!\int k_X(x_t, \tilde{x}_t)\, dp(x_t|x_{t-1} = x)\, dp(\tilde{x}_t|x_{t-1} = \tilde{x}). \qquad (5.14)$$

These functions contain the information concerning the distributions involved. In equation 5.12, the distribution $p(x_t|y_{1:t-1}, y_t = y)$ denotes the posterior of the state at time $t$ given that the observation at time $t$ is $y_t = y$; similarly, $p(\tilde{x}_t|y_{1:t-1}, y_t = \tilde{y})$ is the posterior at time $t$ given that the observation is $y_t = \tilde{y}$. In equation 5.13, the distributions $p(y_t|x_t = x)$ and $p(\tilde{y}_t|x_t = \tilde{x})$ denote the observation model when the state is $x_t = x$ or $x_t = \tilde{x}$, respectively. In equation 5.14, the distributions $p(x_t|x_{t-1} = x)$ and $p(\tilde{x}_t|x_{t-1} = \tilde{x})$ denote the transition model with the previous state given by $x_{t-1} = x$ or $x_{t-1} = \tilde{x}$, respectively.

For simplicity of presentation, we consider here $N = \ell = n$ for the resampling step. Below, denote by $\mathcal{F} \otimes \mathcal{G}$ the tensor product space of two RKHSs $\mathcal{F}$ and $\mathcal{G}$.

Corollary 3. Let $(X_1, Y_1), \dots, (X_n, Y_n)$ be an i.i.d. sample with a joint density $p(x, y) = p(y|x) q(x)$, where $p(y|x)$ is the observation model. Assume that the posterior $p(x_t|y_{1:t})$ has a density $p$ and that $\sup_{x \in \mathcal{X}} p(x)/q(x) < \infty$. Assume that the functions defined by equations 5.12 to 5.14 satisfy $\theta_{\rm pos} \in \mathcal{H}_Y \otimes \mathcal{H}_Y$, $\theta_{\rm obs} \in \mathcal{H}_X \otimes \mathcal{H}_X$, and $\theta_{\rm tra} \in \mathcal{H}_X \otimes \mathcal{H}_X$, respectively. Suppose that $\|\hat{m}_{x_{t-1}|y_{1:t-1}} - m_{x_{t-1}|y_{1:t-1}}\|_{\mathcal{H}_X} \to 0$ as $n \to \infty$ in probability. Then, for any sufficiently slow decay of the regularization constants $\varepsilon_n$ and $\delta_n$ of algorithm 1, we have

$$\|\hat{m}_{x_t|y_{1:t}} - m_{x_t|y_{1:t}}\|_{\mathcal{H}_X} \to 0 \quad (n \to \infty)$$

in probability.

Corollary 3 follows from theorem 5 of Fukumizu et al. (2013) and corollary 2. The assumptions $\theta_{\rm pos} \in \mathcal{H}_Y \otimes \mathcal{H}_Y$ and $\theta_{\rm obs} \in \mathcal{H}_X \otimes \mathcal{H}_X$ are due to theorem 5 of Fukumizu et al. (2013) for the correction step, while the assumption $\theta_{\rm tra} \in \mathcal{H}_X \otimes \mathcal{H}_X$ is due to theorem 1 for the prediction step, from which corollary 2 follows. As we discussed in note 4 of section 5.1, these essentially assume that the functions $\theta_{\rm pos}$, $\theta_{\rm obs}$, and $\theta_{\rm tra}$ are smooth. Theorem 5 of Fukumizu et al. (2013) also requires that the regularization constants $\varepsilon_n, \delta_n$ of kernel Bayes' rule decay sufficiently slowly as the sample size goes to infinity ($\varepsilon_n, \delta_n \to 0$ as $n \to \infty$). (For details, see sections 5.2 and 6.2 of Fukumizu et al., 2013.)

It would be more interesting to investigate the convergence rates of the overall procedure. However, this requires a refined theoretical analysis of kernel Bayes' rule, which is beyond the scope of this letter. This is because there is currently no theoretical result on the convergence rate of kernel Bayes' rule as an estimator of a posterior kernel mean (existing convergence results are for the expectation of function values; see theorems 6 and 7 in Fukumizu et al., 2013). This remains a topic for future research.

6 Experiments

This section is devoted to experiments. In section 6.1, we conduct basic experiments on the prediction and resampling steps before going on to the filtering problem; here we consider the problem described in section 5. In section 6.2, the proposed KMCF (see algorithm 3) is applied to synthetic state-space models. Comparisons are made with existing methods applicable to the setting of this letter (see also section 2). In section 6.3, we apply KMCF to the real problem of vision-based robot localization.

In the following, $N(\mu, \sigma^2)$ denotes the gaussian distribution with mean $\mu \in \mathbb{R}$ and variance $\sigma^2 > 0$.

6.1 Sampling and Resampling Procedures. The purpose here is to see how the prediction and resampling steps work empirically. To this end, we consider the problem described in section 5 with $\mathcal{X} = \mathbb{R}$ (see section 5.1 for details). Specifications of the problem are described below.

We will need to evaluate the errors $\|\hat{m}_P - m_P\|_{\mathcal{H}_X}$ and $\|\hat{m}_Q - m_Q\|_{\mathcal{H}_X}$, so we need to know the true kernel means $m_P$ and $m_Q$. To this end, we define the distributions and the kernel to be gaussian; this allows us to obtain analytic expressions for $m_P$ and $m_Q$.

6.1.1 Distributions and Kernel. More specifically, we define the marginal $P$ and the conditional distribution $p(\cdot|x)$ to be gaussian: $P = N(0, \sigma_P^2)$ and $p(\cdot|x) = N(x, \sigma_{\rm cond}^2)$. Then the resulting $Q = \int p(\cdot|x)\, dP(x)$ also becomes gaussian: $Q = N(0, \sigma_P^2 + \sigma_{\rm cond}^2)$. We define $k_X$ to be the gaussian kernel $k_X(x, x') = \exp(-(x - x')^2 / 2\gamma^2)$. We set $\sigma_P = \sigma_{\rm cond} = \gamma = 0.1$.

6.1.2 Kernel Means. Due to the convolution theorem of gaussian functions, the kernel means $m_P = \int k_X(\cdot, x)\, dP(x)$ and $m_Q = \int k_X(\cdot, x)\, dQ(x)$ can be analytically computed:

$$m_P(x) = \sqrt{\frac{\gamma^2}{\sigma_P^2 + \gamma^2}} \exp\!\left( -\frac{x^2}{2(\gamma^2 + \sigma_P^2)} \right), \qquad m_Q(x) = \sqrt{\frac{\gamma^2}{\sigma_P^2 + \sigma_{\rm cond}^2 + \gamma^2}} \exp\!\left( -\frac{x^2}{2(\sigma_P^2 + \sigma_{\rm cond}^2 + \gamma^2)} \right).$$

X1 X100 from a uniform distribution on [minusA A] with some A gt 0 (spec-ified below) We computed the weights w1 wn by solving an optimiza-tion problem

minwisinRn

nsum

i=1

wikX (middot Xi) minus mP2H + λw2

and then applied normalization so thatsumn

i=1 wi = 1 Here λ gt 0 is a regular-ization constant which allows us to control the trade-off between the errormP minus mP2

HXand the quantity

sumni=1 w2

i = w2 If λ is very small the re-sulting mP becomes accurate mP minus mP2

HXis small but has large

sumni=1 w2

i If λ is large the error mP minus mP2

HXmay not be very small but

sumni=1 w2

i

becomes small This enables us to see how the error mQ minus mQ2HX

changesas we vary these quantities

614 Comparison Given mP = sumni=1 wikX (middot Xi) we wish to estimate the

kernel mean mQ We compare three estimators

bull woRes Estimate mQ without resampling Generate samples X primei sim

p(middot|Xi) to produce the estimate mQ = sumni=1 wikX (middot X prime

i ) This corre-sponds to the estimator discussed in section 51

bull Res-KH First apply the resampling algorithm of algorithm 2 to mPyielding X1 Xn Then generate X prime

i sim p(middot|Xi) for each Xi giving

Filtering with State-Observation Examples 413

Figure 4 Results of the experiments from section 61 (Top left and right)Sample-weight pairs of mP = sumn

i=1 wikX (middot Xi) and mQ = sumni=1 wik(middot X prime

i ) (Mid-dle left and right) Histogram of samples X1 Xn generated by algorithm2 and that of samples X prime

1 X primen from the conditional distribution (Bottom

left and right) Histogram of samples generated with multinomial resamplingafter truncating negative weights and that of samples from the conditionaldistribution

the estimate mQ = 1n

sumni=1 k(middot X prime

i ) This is the estimator discussed insection 52

bull Res-Trunc Instead of algorithm 2 first truncate negative weights inw1 wn to be 0 and apply normalization to make the sum of theweights to be 1 Then apply the multinomial resampling algorithmof particle methods and estimate mQ as Res-KH

6.1.5 Demonstration. Before starting quantitative comparisons, we demonstrate how the above estimators work. Figure 4 shows demonstration results with $A = 1$. First, note that for $\hat{m}_P = \sum_{i=1}^n w_i k(\cdot, X_i)$, samples associated with large weights are located around the mean of $P$, as the standard deviation of $P$ is relatively small ($\sigma_P = 0.1$). Note also that some of the weights are negative. In this example, the error of $\hat{m}_P$ is very small, $\|\hat{m}_P - m_P\|^2_{\mathcal{H}_X} = 8.49\mathrm{e}{-10}$, while that of the estimate $\hat{m}_Q$ given by woRes is $\|\hat{m}_Q - m_Q\|^2_{\mathcal{H}_X} = 0.125$. This shows that even if $\|\hat{m}_P - m_P\|^2_{\mathcal{H}_X}$ is very small, the resulting $\|\hat{m}_Q - m_Q\|^2_{\mathcal{H}_X}$ may not be small, as implied by theorem 1 and the bound, equation 5.5.

We can observe the following. First, algorithm 2 successfully discarded samples associated with very small weights: almost all the generated samples $\bar{X}_1, \dots, \bar{X}_n$ are located in $[-2\sigma_P, 2\sigma_P]$, where $\sigma_P$ is the standard deviation of $P$. The error is $\|\bar{m}_P - m_P\|^2_{\mathcal{H}_X} = 4.74\mathrm{e}{-5}$, which is greater than $\|\hat{m}_P - m_P\|^2_{\mathcal{H}_X}$. This is due to the additional error caused by the resampling algorithm. Note that the resulting estimate $\bar{m}_Q$ has error $\|\bar{m}_Q - m_Q\|^2_{\mathcal{H}_X} = 0.00827$. This is much smaller than the estimate $\hat{m}_Q$ by woRes, showing the merit of the resampling algorithm.

Res-Trunc first truncated the negative weights in $w_1, \dots, w_n$. Let us look at the region where the density of $P$ is very small, namely the region outside $[-2\sigma_P, 2\sigma_P]$. We can observe that the absolute values of the weights are very small in this region. Note that there exist positive and negative weights. These weights maintain a balance such that the amounts of positive and negative values are almost the same. Therefore, the truncation of the negative weights breaks this balance. As a result, the amount of the positive weights surpasses the amount needed to represent the density of $P$. This can be seen from the histogram for Res-Trunc: some of the samples $\bar{X}_1, \dots, \bar{X}_n$ generated by Res-Trunc are located in the region where the density of $P$ is very small. Thus the resulting error $\|\bar{m}_P - m_P\|^2_{\mathcal{H}_X} = 0.0538$ is much larger than that of Res-KH. This demonstrates why the resampling algorithm of particle methods is not appropriate for kernel mean embeddings, as discussed in section 4.2.

6.1.6 Effects of the Sum of Squared Weights. The purpose here is to see how the error $\|\hat{m}_Q - m_Q\|^2_{\mathcal{H}_X}$ changes as we vary the quantity $\sum_{i=1}^n w_i^2$ (recall that the bound, equation 5.5, indicates that $\|\hat{m}_Q - m_Q\|^2_{\mathcal{H}_X}$ increases as $\sum_{i=1}^n w_i^2$ increases). To this end, we made $\hat{m}_P = \sum_{i=1}^n w_i k_X(\cdot, X_i)$ for several values of the regularization constant $\lambda$, as described above. For each $\lambda$, we constructed $\hat{m}_P$ and estimated $m_Q$ using each of the three estimators above. We repeated this 20 times for each $\lambda$ and averaged the values of $\|\hat{m}_P - m_P\|^2_{\mathcal{H}_X}$, $\sum_{i=1}^n w_i^2$, and the errors $\|\hat{m}_Q - m_Q\|^2_{\mathcal{H}_X}$ of the three estimators. Figure 5 shows these results, where both axes are in the log scale. Here we used $A = 5$ for the support of the uniform distribution.$^7$ The results are summarized as follows.

$^7$This enables us to keep the values of $\|\hat{m}_P - m_P\|^2_{\mathcal{H}_X}$ at almost the same level while changing the values of $\sum_{i=1}^n w_i^2$.

Figure 5: Results of the synthetic experiments for the sampling and resampling procedures in section 6.1. Vertical axis: errors in the squared RKHS norm. Horizontal axis: values of $\sum_{i=1}^n w_i^2$ for different $\hat{m}_P$. Black: the error of $\hat{m}_P$ ($\|\hat{m}_P - m_P\|^2_{\mathcal{H}_X}$). Blue, green, and red: the errors on $m_Q$ by woRes, Res-KH, and Res-Trunc, respectively.

• The error of woRes (blue) increases proportionally to the amount of $\sum_{i=1}^n w_i^2$. This matches the bound, equation 5.5.

• The error of Res-KH is not affected by $\sum_{i=1}^n w_i^2$. Rather, it changes in parallel with the error of $\hat{m}_P$. This is explained by the discussion in section 5.2 of how our resampling algorithm improves the accuracy of the sampling procedure.

• Res-Trunc is worse than Res-KH, especially for large $\sum_{i=1}^n w_i^2$. This is also explained by the bound, equation 5.8. Here $\bar{m}_P$ is the one given by Res-Trunc, so the error $\|\bar{m}_P - m_P\|_{\mathcal{H}_X}$ can be large due to the truncation of negative weights, as shown in the demonstration results. This makes the resulting error $\|\bar{m}_Q - m_Q\|_{\mathcal{H}_X}$ large.

Note that $m_P$ and $m_Q$ are different kernel means, so it can happen that the errors $\|\bar{m}_Q - m_Q\|_{\mathcal{H}_X}$ of Res-KH are less than $\|\hat{m}_P - m_P\|_{\mathcal{H}_X}$, as in Figure 5.

6.2 Filtering with Synthetic State-Space Models. Here we apply KMCF to synthetic state-space models. Comparisons were made with the following methods.

kNN-PF (Vlassis et al., 2002). This method uses k-NN-based conditional density estimation (Stone, 1977) for learning the observation model. First, it estimates the conditional density of the inverse direction, $p(x|y)$, from the training sample $\{(X_i, Y_i)\}$. The learned conditional density is then used as an alternative for the likelihood $p(y_t|x_t)$; this is a heuristic to deal with high-dimensional $y_t$. Then it applies a particle filter (PF) based on the approximated observation model and the given transition model $p(x_t|x_{t-1})$.

GP-PF (Ferris et al., 2006). This method learns $p(y_t|x_t)$ from $\{(X_i, Y_i)\}$ with gaussian process (GP) regression. Then a particle filter is applied based on the learned observation model and the transition model. We used open source code for GP regression in this experiment, so comparison in computational time is omitted for this method.$^8$

KBR filter (Fukumizu et al., 2011, 2013). This method is also based on kernel mean embeddings, as is KMCF. It applies kernel Bayes' rule (KBR) to posterior estimation using the joint sample $\{(X_i, Y_i)\}$. This method assumes that there also exist training samples for the transition model; thus, in the following experiments, we additionally drew training samples for the transition model. Fukumizu et al. (2011, 2013) showed that this method outperforms extended and unscented Kalman filters when a state-space model has strong nonlinearity (in that experiment, these Kalman filters were given full knowledge of the state-space model). We use this method as a baseline.

$^8$http://www.gaussianprocess.org/gpml/code/matlab/doc/

We used the state-space models defined in Table 2, where "SSM" stands for "state-space model." In Table 2, $u_t$ denotes a control input at time $t$; $v_t$ and $w_t$ denote independent gaussian noise, $v_t, w_t \sim N(0, 1)$; and $W_t$ denotes 10-dimensional gaussian noise, $W_t \sim N(0, I_{10})$. We generated each control $u_t$ randomly from the gaussian distribution $N(0, 1)$.

The state and observation spaces for SSMs 1a, 1b, 2a, 2b, 4a, 4b are defined as $\mathcal{X} = \mathcal{Y} = \mathbb{R}$; for SSMs 3a, 3b, $\mathcal{X} = \mathbb{R}$, $\mathcal{Y} = \mathbb{R}^{10}$. The models in SSMs 1a, 2a, 3a, 4a and SSMs 1b, 2b, 3b, 4b with the same number (e.g., 1a and 1b) are almost the same; the difference is whether $u_t$ appears in the transition model. The prior distributions for the initial state $x_1$ for SSMs 1a, 1b, 2a, 2b, 3a, 3b are defined as $p_{\rm init} = N(0, 1/(1 - 0.9^2))$, and those for 4a, 4b are defined as a uniform distribution on $[-3, 3]$.

SSMs 1a and 1b are linear gaussian models. SSMs 2a and 2b are the so-called stochastic volatility models. Their transition models are the same as those of SSMs 1a and 1b. The observation model has strong nonlinearity, and the noise $w_t$ is multiplicative. SSMs 3a and 3b are almost the same as SSMs 2a and 2b; the difference is that the observation $y_t$ is 10-dimensional, as $W_t$ is 10-dimensional gaussian noise. SSMs 4a and 4b are more complex than the other models: both the transition and observation models have strong nonlinearities, and states and observations located around the edges of the interval $[-3, 3]$ may abruptly jump to distant places.

Table 2: State-Space Models for Synthetic Experiments.

SSM   Transition Model                                           Observation Model
1a    $x_t = 0.9 x_{t-1} + v_t$                                  $y_t = x_t + w_t$
1b    $x_t = 0.9 x_{t-1} + \frac{1}{\sqrt{2}}(u_t + v_t)$        $y_t = x_t + w_t$
2a    $x_t = 0.9 x_{t-1} + v_t$                                  $y_t = 0.5 \exp(x_t/2)\, w_t$
2b    $x_t = 0.9 x_{t-1} + \frac{1}{\sqrt{2}}(u_t + v_t)$        $y_t = 0.5 \exp(x_t/2)\, w_t$
3a    $x_t = 0.9 x_{t-1} + v_t$                                  $y_t = 0.5 \exp(x_t/2)\, W_t$
3b    $x_t = 0.9 x_{t-1} + \frac{1}{\sqrt{2}}(u_t + v_t)$        $y_t = 0.5 \exp(x_t/2)\, W_t$
4a    $a_t = x_{t-1} + \sqrt{2}\, v_t$;                          $b_t = x_t + w_t$;
      $x_t = a_t$ if $|a_t| \le 3$, $x_t = -3$ otherwise         $y_t = b_t$ if $|b_t| \le 3$, $y_t = b_t - 6 b_t/|b_t|$ otherwise
4b    $a_t = x_{t-1} + u_t + v_t$;                               $b_t = x_t + w_t$;
      $x_t = a_t$ if $|a_t| \le 3$, $x_t = -3$ otherwise         $y_t = b_t$ if $|b_t| \le 3$, $y_t = b_t - 6 b_t/|b_t|$ otherwise

For each model, we generated the training samples $\{(X_i, Y_i)\}_{i=1}^n$ by simulating the model. Test data $(x_t, y_t)_{t=1}^T$ were also generated by independent simulation (recall that $x_t$ is hidden for each method). The length of the test sequence was set as $T = 100$. We fixed the number of particles in kNN-PF and GP-PF to 5000; in preliminary experiments, we did not observe any improvement even when more particles were used. For the same reason, we fixed the size of the transition examples for the KBR filter to 1000. Each method estimated the ground-truth states $x_1, \dots, x_T$ by estimating the posterior means $\int x_t\, p(x_t|y_{1:t})\, dx_t$ $(t = 1, \dots, T)$. The performance was evaluated with the RMSE (root mean squared error) of the point estimates, defined as ${\rm RMSE} = \sqrt{\frac{1}{T}\sum_{t=1}^T (\hat{x}_t - x_t)^2}$, where $\hat{x}_t$ is the point estimate.

For KMCF and the KBR filter, we used gaussian kernels for each of $\mathcal{X}$ and $\mathcal{Y}$ (and also for the controls in the KBR filter). We determined the hyperparameters of each method by two-fold cross-validation, dividing the training data into two sequences. The hyperparameters in the GP regressor for GP-PF were optimized by maximizing the marginal likelihood of the training data. To reduce the costs of the resampling step of KMCF, we used the method discussed in section 5.2 with $\ell = 50$. We also used the low-rank approximation method (see algorithm 5) and the subsampling method (see algorithm 6) in appendix C to reduce the computational costs of KMCF. Specifically, we used $r = 10, 20$ (the rank of the low-rank matrices) for algorithm 5 (described as KMCF-low10 and KMCF-low20 in the results below) and $r = 50, 100$ (the number of subsamples) for algorithm 6 (described as KMCF-sub50 and KMCF-sub100). We repeated the experiments 20 times for each training sample size $n$.

Figure 6 shows the results in RMSE for SSMs 1a, 2a, 3a, 4a, and Figure 7 shows those for SSMs 1b, 2b, 3b, 4b. Figure 8 describes the results in computation time for SSMs 1a and 1b; the results for the other models are similar, so we omit them. We do not show the results of KMCF-low10 in Figures 6 and 7, since they were numerically unstable and gave very large RMSEs.

Figure 6: RMSE of the synthetic experiments in section 6.2. The state-space models of these figures have no control in their transition models.

Figure 7: RMSE of the synthetic experiments in section 6.2. The state-space models of these figures include the control $u_t$ in their transition models.

Figure 8: Computation time of the synthetic experiments in section 6.2. (Left) SSM 1a. (Right) SSM 1b.

GP-PF performed the best for SSMs 1a and 1b. This may be because these models fit the assumption of GP regression, as their noise is additive gaussian. For the other models, however, GP-PF performed poorly: the observation models of these models have strong nonlinearities, and the noise is not additive gaussian. For these models, KMCF performed the best or competitively with the other methods. This indicates that KMCF successfully exploits the state-observation examples $\{(X_i, Y_i)\}_{i=1}^n$ in dealing with the complicated observation models. Recall that our focus has been on situations where the relations between states and observations are so complicated that the observation model is not known; the results indicate that KMCF is promising for such situations. On the other hand, the KBR filter performed worse than KMCF for most of the models. The KBR filter also uses kernel Bayes' rule, as does KMCF. The difference is that KMCF makes use of the transition models directly by sampling, while the KBR filter must learn the transition models from training data for state transitions. This indicates that the incorporation of the knowledge expressed in the transition model is very important for the filtering performance. This can also be seen by comparing Figures 6 and 7. The performance of the methods other than the KBR filter improved for SSMs 1b, 2b, 3b, 4b compared to the performance for the corresponding models in SSMs 1a, 2a, 3a, 4a. Recall that SSMs 1b, 2b, 3b, 4b include the control $u_t$ in their transition models. The information of the control input is helpful for filtering in general. Thus the improvements suggest that KMCF, kNN-PF, and GP-PF successfully incorporate the information of the controls; they achieve this by sampling with $p(x_t|x_{t-1}, u_t)$. On the other hand, the KBR filter must learn the transition model $p(x_t|x_{t-1}, u_t)$; this can be harder than learning the transition model $p(x_t|x_{t-1})$, which has no control input.

We next compare computation time (see Figure 8). KMCF was competitive with, or even slower than, the KBR filter. This is due to the resampling step in KMCF. The speed-up methods (KMCF-low10, KMCF-low20, KMCF-sub50, and KMCF-sub100) successfully reduced the costs of KMCF. KMCF-low10 and KMCF-low20 scaled linearly with the sample size $n$; this matches the fact that algorithm 5 reduces the costs of kernel Bayes' rule to $O(nr^2)$. The costs of KMCF-sub50 and KMCF-sub100 remained almost the same over the different sample sizes. This is because they reduce the sample size itself from $n$ to $r$, so the costs are reduced to $O(r^3)$ (see algorithm 6). KMCF-sub50 and KMCF-sub100 are competitive with kNN-PF, which is fast as it only needs kNN searches to deal with the training sample $\{(X_i, Y_i)\}_{i=1}^n$. In Figures 6 and 7, KMCF-low20 and KMCF-sub100 produced results competitive with KMCF for SSMs 1a, 2a, 4a, 1b, 2b, 4b. Thus, for these models, such methods reduce the computational costs of KMCF without losing much accuracy. KMCF-sub50 was slightly worse than KMCF-sub100. This indicates that the number of subsamples cannot be reduced to this extent if we wish to maintain accuracy. For SSMs 3a and 3b, the performance of KMCF-low20 and KMCF-sub100 was worse than KMCF, in contrast to the performance for the other models. The difference of SSMs 3a and 3b from the other models is that the observation space is 10-dimensional: $\mathcal{Y} = \mathbb{R}^{10}$. This suggests that if the dimension is high, $r$ needs to be large to maintain accuracy (recall that $r$ is the rank of the low-rank matrices in algorithm 5 and the number of subsamples in algorithm 6). This is also implied by the experiments in section 6.3.

6.3 Vision-Based Mobile Robot Localization. We applied KMCF to the problem of vision-based mobile robot localization (Vlassis et al., 2002; Wolf et al., 2005; Quigley et al., 2010). We consider a robot moving in a building. The robot takes images with its vision camera as it moves; thus the vision images form a sequence of observations $y_1, \dots, y_T$ in time series, where each $y_t$ is an image. The robot does not know its positions in the building; we define the state $x_t$ as the robot's position at time $t$. The robot wishes to estimate its position $x_t$ from the sequence of its vision images $y_1, \dots, y_t$. This can be done by filtering, that is, by estimating the posteriors $p(x_t|y_1, \dots, y_t)$ $(t = 1, \dots, T)$. This is the robot localization problem. It is fundamental in robotics as a basis for more involved applications such as navigation and reinforcement learning (Thrun, Burgard, & Fox, 2005).

The state-space model is defined as follows. The observation model $p(y_t|x_t)$ is the conditional distribution of images given position, which is very complicated and considered unknown. We need to assume position-image examples $\{(X_i, Y_i)\}_{i=1}^n$; these samples are given in the data set described below. The transition model $p(x_t|x_{t-1}) = p(x_t|x_{t-1}, u_t)$ is the conditional distribution of the current position given the previous one. This involves a control input $u_t$ that specifies the movement of the robot. In the data set we use, the control is given as odometry measurements. Thus we define $p(x_t|x_{t-1}, u_t)$ as the odometry motion model, which is fairly standard in robotics (Thrun et al., 2005). Specifically, we used the algorithm described in table 5.6 of Thrun et al. (2005), with all of its parameters fixed to 0.1. The prior $p_{\rm init}$ of the initial position $x_1$ is defined as a uniform distribution over the samples $X_1, \dots, X_n$ in $\{(X_i, Y_i)\}_{i=1}^n$.

As the kernel $k_Y$ for observations (images), we used the spatial pyramid matching kernel of Lazebnik et al. (2006). This is a positive-definite kernel developed in the computer vision community and is also fairly standard. Specifically, we set the parameters of this kernel as suggested in Lazebnik et al. (2006). This gives a 4200-dimensional histogram for each image. We defined the kernel $k_X$ for states (positions) as gaussian. Here the state space is the four-dimensional space $\mathcal{X} = \mathbb{R}^4$: two dimensions for location and the rest for the orientation of the robot.$^9$

The data set we used is the COLD database (Pronobis & Caputo, 2009), which is publicly available. Specifically, we used the data set Freiburg, Part A, Path 1, cloudy. This data set consists of three similar trajectories of a robot moving in a building, each of which provides position-image pairs $(x_t, y_t)_{t=1}^T$. We used two trajectories for training and validation and the rest for testing.

$^9$We projected the robot's orientation in $[0, 2\pi]$ onto the unit circle in $\mathbb{R}^2$.

We made the state-observation examples $\{(X_i, Y_i)\}_{i=1}^n$ by randomly subsampling the pairs in the trajectory for training. Note that the difficulty of localization may depend on the time interval (i.e., the interval between $t$ and $t-1$ in seconds). Therefore, we made three test sets (and training samples for the state transitions in the KBR filter) with different time intervals: 2.27 sec ($T = 168$), 4.54 sec ($T = 84$), and 6.81 sec ($T = 56$).

In these experiments, we compared KMCF with three methods: kNN-PF, the KBR filter, and the naive method (NAI) defined below. For the KBR filter, we also defined a gaussian kernel on the control $u_t$, that is, on the difference of the odometry measurements at times $t-1$ and $t$. The naive method (NAI) estimates the state $x_t$ as the point $X_j$ in the training set $\{(X_i, Y_i)\}$ such that the corresponding observation $Y_j$ is closest to the observation $y_t$. We performed this as a baseline. We also used the spatial pyramid matching kernel for these methods (for kNN-PF and NAI, as the similarity measure for the nearest neighbor search). We did not compare with GP-PF, since it assumes that observations are real vectors and thus cannot be applied to this problem straightforwardly. We determined the hyperparameters in each method by cross-validation. To reduce the cost of the resampling step in KMCF, we used the method discussed in section 5.2 with $\ell = 100$. The low-rank approximation method (see algorithm 5) and the subsampling method (see algorithm 6) were also applied to reduce the computational costs of KMCF. Specifically, we set $r = 50, 100$ for algorithm 5 (described as KMCF-low50 and KMCF-low100 in the results below) and $r = 150, 300$ for algorithm 6 (KMCF-sub150 and KMCF-sub300).

Note that in this problem the posteriors $p(x_t|y_{1:t})$ can be highly multimodal. This is because similar images appear at distant locations. Therefore, the posterior mean $\int x_t\, p(x_t|y_{1:t})\, dx_t$ is not appropriate for point estimation of the ground-truth position $x_t$. Thus, for KMCF and the KBR filter, we employed the heuristic for mode estimation explained in section 4.4. For kNN-PF, we used the particle with maximum weight for the point estimation. We evaluated the performance of each method by the RMSE of the location estimates. We ran each experiment 20 times for each training set of different size.

6.3.1 Results. First, we demonstrate the behaviors of KMCF on this localization problem. Figures 9 and 10 show iterations of KMCF with $n = 400$ applied to the test data with time interval 6.81 sec. Figure 9 illustrates iterations that produced accurate estimates, while Figure 10 describes situations where location estimation is difficult.

Figure 9: Demonstration results. Each column corresponds to one iteration of KMCF. (Top) Prediction step: histogram of samples for the prior. (Middle) Correction step: weighted samples for the posterior. The blue and red stems indicate positive and negative weights, respectively. The yellow ball represents the ground-truth location $x_t$ and the green diamond the estimated one $\hat{x}_t$. (Bottom) Resampling step: histogram of samples given by the resampling step.

Figure 10: Demonstration results (see also the caption of Figure 9). Here we show time points where observed images are similar to those in distant places. Such a situation often occurs at corners and makes location estimation difficult. (a) The prior estimate is reasonable, but the resulting posterior has modes in distant places. This makes the location estimate (green diamond) far from the true location (yellow ball). (b) While the location estimate is very accurate, modes also appear at distant locations.

Figures 11 and 12 show the results in RMSE and computation time, respectively. In all the results, KMCF and its variants with the computational-reduction methods (KMCF-low50, KMCF-low100, KMCF-sub150, and KMCF-sub300) performed better than the KBR filter. These results show the benefit of directly manipulating the transition models with sampling. KMCF was competitive with kNN-PF for the interval 2.27 sec; note that kNN-PF was originally proposed for the robot localization problem. For the results with the longer time intervals (4.54 sec and 6.81 sec), KMCF outperformed kNN-PF.

We next investigate the effect on KMCF of the methods to reduce computational cost. The performance of KMCF-low100 and KMCF-sub300 is competitive with KMCF; those of KMCF-low50 and KMCF-sub150 degrade as the sample size increases. Note that $r = 50, 100$ for algorithm 5 are larger than those in section 6.2, though the values of the sample size $n$ are also larger than those in section 6.2. Also note that the performance of KMCF-sub150 is much worse than that of KMCF-sub300. These results indicate that we may need large values of $r$ to maintain accuracy for this localization problem. Recall that the spatial pyramid matching kernel gives essentially a high-dimensional feature vector (histogram) for each observation. Thus the observation space $\mathcal{Y}$ may be considered high-dimensional. This supports the hypothesis in section 6.2 that if the dimension is high, the computational cost-reduction methods may require larger $r$ to maintain accuracy.

Finally, let us look at the results in computation time (see Figure 12). The results are similar to those in section 6.2. Although the values of $r$ are relatively large, algorithms 5 and 6 successfully reduced the computational costs of KMCF.

Figure 11: RMSE of the robot localization experiments in section 6.3. Panels a, b, and c show the cases for time intervals 2.27 sec, 4.54 sec, and 6.81 sec, respectively.

Figure 12: Computation time of the localization experiments in section 6.3. Panels a, b, and c show the cases for time intervals 2.27 sec, 4.54 sec, and 6.81 sec, respectively. Note that the results show the runtime of each method.

7 Conclusion and Future Work

This letter has proposed the kernel Monte Carlo filter, a novel filtering method for state-space models. We have considered the situation where the observation model is not known explicitly or even parametrically and where examples of the state-observation relation are given instead of the observation model. Our approach is based on the framework of kernel mean embeddings, which enables us to deal with the observation model in a data-driven manner. The resulting filtering method consists of the prediction, correction, and resampling steps, all realized in terms of kernel mean embeddings. Methodological novelties lie in the prediction and resampling steps. Thus, we analyzed their behaviors by deriving error bounds for the estimator of the prediction step. The analysis revealed that the effective sample size of a weighted sample plays an important role, as in particle methods. This analysis also explained how our resampling algorithm works. We applied the proposed method to synthetic and real problems, confirming the effectiveness of our approach.

426 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Figure 12 Computation time of the localization experiments in section 63Panels a b and c show the cases for time interval 227 sec 454 sec and 681 secrespectively Note that the results show the runtime of each method

One interesting topic for future research would be parameter estimationfor the transition model We did not discuss this and assumed that param-eters if they exist are given and fixed If the state observation examples(XiYi)n

i=1 are given as a sequence from the state-space model then wecan use the state samples X1 Xn for estimating those parameters Oth-erwise we need to estimate the parameters based on test data This mightbe possible by exploiting approaches for parameter estimation in particlemethods (eg section IV in Cappe et al (2007))

Another important topic is the situation where the observation modelin the test and training phases is different As discussed in section 43 this

Filtering with State-Observation Examples 427

might be addressed by exploiting the framework of transfer learning (Panamp Yang 2010) This would require extension of kernel mean embeddingsto the setting of transfer learning since there has been no work in thisdirection We consider that such extension is interesting in its own right

Appendix A Proof of Theorem 1

Before going to the proof we review some basic facts that are neededLet mP = int

kX (middot x)dP(x) and mP = sumni=1 wikX (middot Xi) By the reproducing

property of the kernel kX the following hold for any f isin HX

〈mP f 〉HX=

langintkX (middot x)dP(x) f

rangHX

=int

〈kX (middot x) f 〉HXdP(x)

=int

f (x)dP(x) = EXsimP[ f (X)] (A1)

〈mP f 〉HX=

langnsum

i=1

wikX (middot Xi) f

rangHX

=nsum

i=1

wi f (Xi) (A2)

For any f g isin HX we denote by f otimes g isin HX otimes HX the tensor productof f and g defined as

f otimes g(x1 x2) = f (x1)g(x2) forallx1 x2 isin X (A3)

The inner product of the tensor RKHS HX otimes HX satisfies

〈 f1 otimesg1 f2 otimes g2〉HXotimesHX= 〈 f1 f2〉HX

〈g1 g2〉HXforall f1 f2 g1 g2 isinHX

(A4)

Let φiIs=1 sub HX be complete orthonormal bases of HX where I isin N cup infin

Assume θ isin HX otimes HX (recall that this is an assumption of theorem 1) Thenθ is expressed as

θ =Isum

st=1

αstφs otimes φt (A5)

withsum

st |αst |2 lt infin (see Aronszajn 1950)

Proof of Theorem 1 Recall that mQ = sumni=1 wikX (middot X prime

i ) where X primei sim

p(middot|Xi) (i = 1 n) Then

EX prime1X

primen[mQ minus mQ2

HX]

428 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

= EX prime1X

primen[〈mQ mQ〉HX

minus 2〈mQ mQ〉HX+ 〈mQ mQ〉HX

]

=nsum

i j=1

wiw jEX primei X

primej[kX (X prime

i X primej)]

minus 2nsum

i=1

wiEX primesimQX primei[kX (X prime X prime

i )] + EX primeX primesimQ[kX (X prime X prime)]

=sumi = j

wiw jEX primei X

primej[kX (X prime

i X primej)] +

nsumi=1

w2i EX prime

i[kX (X prime

i X primei )]

minus 2nsum

i=1

wiEX primesimQX primei[kX (X prime X prime

i )] + EX primeX primesimQ[kX (X prime X prime)] (A6)

where X prime denotes an independent copy of X primeRecall that Q= int

p(middot|x)dP(x) and θ (x x) = int intkX (xprime xprime)dp(xprime|x)dp(xprime|x)

We can then rewrite terms in equation A6 as

EX primesimQX primei[kX (X prime X prime

i )]

=int (intint

kX (xprime xprimei)dp(xprime|x)dp(xprime

i|Xi)

)dP(x)

=int

θ (x Xi)dP(x) = EXsimP[θ (X Xi)]

EX primeX primesimQ[kX (X prime X prime)]

=intint (intint

kX (xprime xprime)dp(xprime|x)p(xprime|x)

)dP(x)dP(x)

=intint

θ (x x)dP(x)dP(x) = EXXsimP[θ (X X)]

Thus equation A6 is equal to

nsumi=1

w2i (EX prime

i[kX (X prime

i X primei )] minus EX prime

i Xprimei[kX (X prime

i X primei )]) +

nsumi j=1

wiw jθ (Xi Xj)

minus 2nsum

i=1

wiEXsimP[θ (X Xi)] + EXXsimP[θ (X X)] (A7)

Filtering with State-Observation Examples 429

We can rewrite terms in equation A7 as follows using the facts A1 A2A3 A4 and A5

sumi j

wiw jθ (Xi Xj) =sumi j

wiw j

sumst

αstφs(Xi)φt (Xj)

=sumst

αst

sumi

wiφs(Xi)sum

j

w jφt(Xj) =sumst

αst〈mP φs〉HX〈mP φt〉HX

=sumst

αst〈mP otimes mP φs otimes φt〉HX otimesHX= 〈mP otimes mP θ〉HX otimesHX

sumi

wiEXsimP[θ (X Xi)] =sum

i

wiEXsimP

[sumst

αstφs(X)φt (Xi)

]

=sumst

αstEXsimP[φs(X)]sum

i

wiφt (Xi) =sumst

αst〈mP φs〉HX〈mP φt〉HX

=sumst

αst〈mP otimes mP φs otimes φt〉HX otimesHX= 〈mP otimes mP θ〉HX otimesHX

EXXsimP[θ (X X)] = EXXsimP

[sumst

αstφs(X)φt (X)

]

=sumst

αst〈mP φs〉HX〈mP φt〉HX

=sumst

αst〈mP otimes mP φs otimes φt〉HX otimesHX

= 〈mP otimes mP θ〉HX otimesHX

Thus equation A7 is equal to

nsumi=1

w2i (EX prime

i[kX (X prime

i X primei )] minus EX prime

i Xprimei[kX (X prime

i X primei )]) + 〈mP otimes mP θ〉HX otimesHX

minus 2〈mP otimes mP θ〉HX otimesHX+ 〈mP otimes mP θ〉HX otimesHX

=nsum

i=1

w2i (EX prime

i[kX (X prime

i X primei )] minus EX prime

i Xprimei[kX (X prime

i X primei )])

+〈(mP minus mP) otimes (mP minus mP) θ〉HX otimesHX

Finally the Cauchy-Schwartz inequality gives

〈(mP minus mP) otimes (mP minus mP) θ〉HX otimesHXle mP minus mP2

HXθHX otimesHX

This completes the proof

430 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Appendix B Proof of Theorem 2

Theorem 2 provides convergence rates for the resampling algorithmAlgorithm 4 This theorem assumes that the candidate samples Z1 ZNfor resampling are iid with a density q Here we prove theorem 2 byshowing that the same statement holds under weaker assumptions (seetheorem 3 below)

We first describe assumptions Let P be the distribution of the kernelmean mP and L2(P) be the Hilbert space of square-integrable functions onX with respect to P For any f isin L2(P) we write its norm by fL2(P) =int

f 2(x)dP(x)

Assumption 1 The candidate samples Z1 ZN are independent Thereare probability distributions Q1 QN on X such that for any boundedmeasurable function g X rarr R we have

E

⎡⎣ 1

N minus 1

sumj =i

g(Zj)

⎤⎦ = EXsimQi

[g(X)] (i = 1 N) (B1)

Assumption 2 The distributions Q1 QN have density functions q1

qN respectively Define Q = 1N

sumNi=1 Qi and q = 1

N

sumNi=1 qi There is a con-

stant A gt 0 that does not depend on N such that

∥∥∥∥qi

qminus 1

∥∥∥∥2

L2(P)

le AradicN

(i = 1 N) (B2)

Assumption 3 The distribution P has a density function p such thatsupxisinX

p(x)

q(x)lt infin There is a constant σ gt 0 such that

radicN

(1N

Nsumi=1

p(Zi)

q(Zi)minus 1

)DrarrN (0 σ 2) (B3)

whereDrarr denotes convergence in distribution and N (0 σ 2) the normal

distribution with mean 0 and variance σ 2

These assumptions are weaker than those in theorem 2 which requirethat Z1 ZN be iid For example assumption 1 is clearly satisfied for theiid case since in this case we have Q = Q1= middot middot middot = QN The inequalityequation B2 in assumption 2 requires that the distributions Q1 QN getsimilar as the sample size increases This is also satisfied under the iid

Filtering with State-Observation Examples 431

assumption Likewise the convergence equation B3 in assumption 3 issatisfied from the central limit theorem if Z1 ZN are iid

We will need the following lemma

Lemma 1 Let Z1 ZN be samples satisfying assumption 1 Then the followingholds for any bounded measurable function g X rarr R

E

[1N

Nsumi=1

g(Zi )

]=

intg(x)d Q(x)

Proof

E

[1N

Nsumi=1

g(Zi)

]= E

⎡⎣ 1

N(N minus 1)

Nsumi=1

sumj =i

g(Zj)

⎤⎦

= 1N

Nsumi=1

E

⎡⎣ 1

N minus 1

sumj =i

g(Zj)

⎤⎦ = 1

N

Nsumi=1

intg(x)Qi(x) =

intg(x)dQ(x)

The following theorem shows the convergence rates of our resampling al-gorithm Note that it does not assume that the candidate samples Z1 ZNare identical to those expressing the estimator mP

Theorem 3 Let k be a bounded positive definite kernel and H be the asso-ciated RKHS Let Z1 ZN be candidate samples satisfying assumptions 12 and 3 Let P be a probability distribution satisfying assumption 3 and letmP =

intk(middot x)d P(x) be the kernel mean Let mP isin H be any element in H Sup-

pose we apply algorithm 4 to mP isin H with candidate samples Z1 ZN and letX1 X isin Z1 ZN be the resulting samples Then the following holds

∥∥∥∥∥mP minus 1

sumi=1

k(middot Xi )

∥∥∥∥∥2

H

= (mP minus mPH + Op(Nminus12))2 + O(

ln

)

Proof Our proof is based on the fact (Bach et al 2012) that kernel herdingcan be seen as the Frank-Wolfe optimization method with step size 1( + 1)

for the th iteration For details of the Frank-Wolfe method we refer to Jaggi(2013) and Freund and Grigas (2014)

Fix the samples Z1 ZN Let MN be the convex hull of the set k(middot Z1)

k(middot ZN ) sub H Define a loss function J H rarr R by

J(g) = 12g minus mP2

H g isin H (B4)

432 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Then algorithm 4 can be seen as the Frank-Wolfe method that iterativelyminimizes this loss function over the convex hull MN

infgisinMN

J(g)

More precisely the Frank-Wolfe method solves this problem by the follow-ing iterations

s = arg mingisinMN

〈gnablaJ(gminus1)〉H

g = (1 minus γ )gminus1 + γ s ( ge 1)

where γ is a step size defined as γ = 1 and nablaJ(gminus1) is the gradient of J atgminus1 nablaJ(gminus1) = gminus1 minus mP Here the initial point is defined as g0 = 0 It canbe easily shown that g = 1

sumi=1 k(middot Xi) where X1 X are the samples

given by algorithm 4 (For details see Bach et al 2012)Let LJMN

gt 0 be the Lipschitz constant of the gradient nablaJ over MN andDiam MN gt 0 be the diameter of MN

LJMN= sup

g1g2isinMN

nablaJ(g1) minus nablaJ(g2)Hg1 minus g2H

= supg1g2isinMN

g1 minus g2Hg1 minus g2H

= 1 (B5)

Diam MN = supg1g2isinMN

g1 minus g2H

le supg1g2isinMN

g1H + g2H le 2C (B6)

where C = supxisinX k(middot x)H = supxisinXradic

k(x x) lt infinFrom bound 32 and equation 8 of Freund and Grigas (2014) we then

have

J(g) minus infgisinMN

J(g) leLJMN

(Diam MN)2(1 + ln )

2(B7)

le 2C2(1 + ln )

(B8)

where the last inequality follows from equations B5 and B6

Filtering with State-Observation Examples 433

Note that the upper bound of equation B8 does not depend on thecandidate samples Z1 ZN Hence combined with equation B4 thefollowing holds for any choice of Z1 ZN

∥∥∥∥∥mP minus 1

sumi=1

k(middot Xi)

∥∥∥∥∥2

H

le infgisinMN

mP minus g2H + 4C2(1 + ln )

(B9)

Below we will focus on bounding the first term of equation B9 Recallhere that Z1 ZN are random samples Define a random variable SN =sumN

i=1p(Zi )

q(Zi ) Since MN is the convex hull of the k(middot Z1) k(middot ZN ) we

have

infgisinMN

mP minus gH

= infαisinRN αge0

sumi αile1

∥∥∥∥∥mP minussum

i

αik(middot Zi)

∥∥∥∥∥H

le∥∥∥∥∥mP minus 1

SN

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

le mP minus mPH +∥∥∥∥∥mP minus 1

N

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

+∥∥∥∥∥ 1

N

sumi

p(Zi)

q(Zi)k(middot Zi) minus 1

SN

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

Therefore we have

mP minus 1

sumi=1

k(middot Xi)2H

le(

mP minus mPH +∥∥∥∥∥mP minus 1

N

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

+∥∥∥∥∥ 1

N

sumi

p(Zi)

q(Zi)k(middot Zi) minus 1

SN

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

)2

+ O(

ln

)

(B10)

Below we derive rates of convergence for the second and third terms

434 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

For the second term we derive a rate of convergence in expectationwhich implies a rate of convergence in probability To this end we use thefollowing fact Let f isin H be any function in the RKHS By the assumptionsupxisinX

p(x)

q(x)lt infin and the boundedness of k functions x rarr p(x)

q(x)f (x) and

x rarr (p(x)

q(x))2 f (x) are bounded

E

⎡⎣

∥∥∥∥∥mP minus 1N

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥2

H

⎤⎦

= mP2H minus 2E

[1N

sumi

p(Zi)

q(Zi)mP(Zi)

]

+ E

⎡⎣ 1

N2

sumi

sumj

p(Zi)

q(Zi)

p(Zj)

q(Zj)k(Zi Zj)

⎤⎦

= mP2H minus 2

intp(x)

q(x)mP(x)q(x)dx

+ E

⎡⎣ 1

N2

sumi

sumj =i

p(Zi)

q(Zi)

p(Zj)

q(Zj)k(Zi Zj)

⎤⎦

+E

[1

N2

sumi

(p(Zi)

q(Zi)

)2

k(Zi Zi)

]

= mP2H minus 2mP2

H + E

[N minus 1

N2

sumi

p(Zi)

q(Zi)

intp(x)

q(x)k(Zi x)qi(x)dx

]

+ 1N

int (p(x)

q(x)

)2

k(x x)q(x)dx

= minusmP2H + E

[N minus 1

N2

sumi

p(Zi)

q(Zi)

intp(x)

q(x)k(Zi x)qi(x)dx

]

+ 1N

intp(x)

q(x)k(x x)dP(x)

We further rewrite the second term of the last equality as follows

E

[N minus 1

N2

sumi

p(Zi)

q(Zi)

intp(x)

q(x)k(Zi x)qi(x)dx

]

Filtering with State-Observation Examples 435

= E

[N minus 1

N2

sumi

p(Zi)

q(Zi)

intp(x)

q(x)k(Zi x)(qi(x) minus q(x))dx

]

+ E

[N minus 1

N2

sumi

p(Zi)

q(Zi)

intp(x)

q(x)k(Zi x)q(x)dx

]

= E

[N minus 1

N2

sumi

p(Zi)

q(Zi)

int radicp(x)k(Zi x)

radicp(x)

(qi(x)

q(x)minus 1

)dx

]

+ N minus 1N

mP2H

le E

[N minus 1

N2

sumi

p(Zi)

q(Zi)k(Zi middot)L2(P)

∥∥∥∥qi(x)

q(x)minus 1

∥∥∥∥L2(P)

]+ N minus 1

NmP2

H

le E

[N minus 1

N3

sumi

p(Zi)

q(Zi)C2A

]+ N minus 1

NmP2

H

= C2A(N minus 1)

N2 + N minus 1N

mP2H

where the first inequality follows from Cauchy-Schwartz Using this weobtain

E[

∥∥∥∥∥mP minus 1N

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥2

H

le 1N

(intp(x)

q(x)k(x x)dP(x) minus mP2

H

)+ C2(N minus 1)A

N2

= O(Nminus1)

Therefore we have

∥∥∥∥∥mP minus 1N

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

= Op(Nminus12) (N rarr infin) (B11)

We can bound the third term as follows∥∥∥∥∥ 1N

sumi

p(Zi)

q(Zi)k(middot Zi) minus 1

SN

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

=∥∥∥∥∥ 1

N

sumi

p(Zi)

q(Zi)k(middot Zi)

(1 minus N

SN

)∥∥∥∥∥H

436 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

=∣∣∣∣1 minus N

SN

∣∣∣∣∥∥∥∥∥ 1

N

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

le∣∣∣∣1 minus N

SN

∣∣∣∣C pqinfin

=∣∣∣∣∣1 minus 1

1N

sumNi=1 p(Zi)q(Zi)

∣∣∣∣∣C pqinfin

where pqinfin = supxisinXp(x)

q(x)lt infin Therefore the following holds by as-

sumption 3 and the delta method

∥∥∥∥∥ 1N

sumi

p(Zi)

q(Zi)k(middot Zi) minus 1

SN

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

= Op(Nminus12) (B12)

The assertion of the theorem follows from equations B10 to B12

Appendix C Reduction of Computational Cost

We have seen in section 43 that the time complexity of KMCF in one timestep is O(n3) where n is the number of the state-observation examples(XiYi)n

i=1 This can be costly if one wishes to use KMCF in real-timeapplications with a large number of samples Here we show two methodsfor reducing the costs one based on low-rank approximation of kernelmatrices and one based on kernel herding Note that kernel herding is alsoused in the resampling step The purpose here is different however wemake use of kernel herding for finding a reduced representation of the data(XiYi)n

i=1

C1 Low-rank Approximation of Kernel Matrices Our goal is to re-duce the costs of algorithm 1 of kernel Bayesrsquo rule Algorithm 1 involvestwo matrix inversions (GX + nεIn)minus1 in line 3 and ((GY )2 + δIn)minus1 in line4 Note that (GX + nεIn)minus1 does not involve the test data so it can be com-puted before the test phase On the other hand ((GY )2 + δIn)minus1 dependson matrix This matrix involves the vector mπ which essentially repre-sents the prior of the current state (see line 13 of algorithm 3) Therefore((GY )2 + δIn)minus1 needs to be computed for each iteration in the test phaseThis has the complexity of O(n3) Note that even if (GX + nεIn)minus1 can becomputed in the training phase the multiplication (GX + nεIn)minus1mπ in line3 requires O(n2) Thus it can also be costly Here we consider methods toreduce both costs in lines 3 and 4

Suppose that there exist low-rank matrices UV isin Rntimesr where r lt n that

approximate the kernel matrices GX asymp UUT GY asymp VVT Such low-rank

Filtering with State-Observation Examples 437

matrices can be obtained by for example incomplete Cholesky decomposi-tion with time complexity O(nr2) (Fine amp Scheinberg 2001 Bach amp Jordan2002) Note that the computation of these matrices is required only oncebefore the test phase Therefore their time complexities are not the problemhere

C11 Derivation First we approximate (GX + nεIn)minus1mπ in line 3 usingGX asymp UUT By the Woodbury identity we have

(GX + nεIn)minus1mπ asymp (UUT + nεIn)minus1mπ

= 1nε

(In minus U(nεIr + UTU)minus1UT )mπ

where Ir isin Rrtimesr denotes the identity Note that (nεIr + UTU)minus1 does not

involve the test data so can be computed in the training phase Thus theabove approximation of μ can be computed with complexity O(nr2)

Next we approximate w = GY ((GY )2 + δI)minus1kY in line 4 using GY asympVVT Define B = V isin R

ntimesr C = VTV isin Rrtimesr and D = VT isin R

rtimesn Then(GY )2 asymp (VVT )2 = BCD By the Woodbury identity we obtain

(δIn + (GY )2)minus1 asymp (δIn + BCD)minus1

= 1δ(In minus B(δCminus1 + DB)minus1D)

Thus w can be approximated as

w =GY ((GY )2 + δI)minus1kY

asymp 1δVVT (In minus B(δCminus1 + DB)minus1D)kY

The computation of this approximation requires O(nr2 + r3) = O(nr2) Thusin total the complexity of algorithm 1 can be reduced to O(nr2) We sum-marize the above approximations in algorithm 5

438 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

C12 How to Use Algorithm 5 can be used with algorithm 3 of KMCFby modifying Algorithm 3 in the following manner Compute the low-rank matrices UV right after lines 4 and 5 This can be done by using forexample incomplete Cholesky decomposition (Fine amp Scheinberg 2001Bach amp Jordan 2002) Then replace algorithm 1 in line 15 by algorithm 5

C13 How to Select the Rank As discussed in section 43 one way ofselecting the rank r is to use-cross validation by regarding r as a hyper-parameter of KMCF Another way is to measure the approximation errorsGX minus UUT and GY minus VVT with some matrix norm such as the Frobe-nius norm Indeed we can compute the smallest rank r such that theseerrors are below a prespecified threshold and this can be done efficientlywith time complexity O(nr2) (Bach amp Jordan 2002)

C2 Data Reduction with Kernel Herding Here we describe an ap-proach to reduce the size of the representation of the state-observationexamples (XiYi)n

i=1 in an efficient way By ldquoefficientrdquo we mean that theinformation contained in (XiYi)n

i=1 will be preserved even after the re-duction Recall that (XiYi)n

i=1 contains the information of the observationmodel p(yt |xt ) (recall also that p(yt |xt ) is assumed time-homogeneous seesection 41) This information is used in only algorithm 1 of kernel Bayesrsquorule (line 15 algorithm 3) Therefore it suffices to consider how kernel Bayesrsquorule accesses the information contained in the joint sample (XiYi)n

i=1

C21 Representation of the Joint Sample To this end we need to show howthe joint sample (XiYi)n

i=1 can be represented with a kernel mean embed-ding Recall that (kX HX ) and (kY HY ) are kernels and the associatedRKHSs on the state-space X and the observation-space Y respectively LetX times Y be the product space ofX andY Then we can define a kernel kXtimesY onX times Y as the product of kX and kY kXtimesY ((x y) (xprime yprime)) = kX (x xprime)kY (y yprime)for all (x y) (xprime yprime) isin X times Y This product kernel kXtimesY defines an RKHS ofX times Y Let HXtimesY denote this RKHS As in section 3 we can use kXtimesY andHXtimesY for a kernel mean embedding In particular the empirical distribution1n

sumni=1 δ(XiYi )

of the joint sample (XiYi)ni=1 sub X times Y can be represented as

an empirical kernel mean in HXtimesY

mXY = 1n

nsumi=1

kXtimesY ((middot middot) (XiYi)) isin HXtimesY (C1)

This is the representation of the joint sample (XiYi)ni=1

The information of (XiYi)ni=1 is provided for kernel Bayesrsquo rule essen-

tially through the form of equation C1 (Fukumizu et al 2011 2013) Recallthat equation C1 is a point in the RKHS HXtimesY Any point close to thisequation in HXtimesY would also contain information close to that contained in

Filtering with State-Observation Examples 439

equation C1 Therefore we propose to find a subset (X1 Y1) (Xr Xr) sub(XiYi)n

i=1 where r lt n such that its representation in HXtimesY

mXY = 1r

rsumi=1

kXtimesY ((middot middot) (Xi Yi)) isin HXtimesY (C2)

is close to equation C1 Namely we wish to find subsamples such thatmXY minus mXYHXtimesY

is small If the error mXY minus mXYHXtimesYis small enough

equation C2 would provide information close to that given by equation C1for kernel Bayesrsquo rule Thus kernel Bayesrsquo rule based on such subsamples(Xi Yi)r

i=1 would not perform much worse than the one based on the entireset of samples (XiYi)n

i=1

C22 Subsampling Method To find such subsamples we make use ofkernel herding in section 35 Namely we apply the update equations 36and 37 to approximate equation C1 with kernel kXtimesY and RKHS HXtimesY We greedily find subsamples Dr = (X1 Y1) (Xr Yr) as

(Xr Yr)

= arg max(xy)isinDDrminus1

1n

nsumi=1

kXtimesY ((x y) (XiYi))minus1r

rminus1sumj=1

kXtimesY ((x r) (Xi Yi))

= arg max(xy)isinDDrminus1

1n

nsumi=1

kX (x Xi)kY (yYi)minus1r

rminus1sumj=1

kX (x Xj)kY (y Yj)

The resulting algorithm is shown in algorithm 6 The time complexity isO(n2r) for selecting r subsamples

C23 How to Use By using algorithm 6 we can reduce the the time com-plexity of KMCF (see algorithm 3) in each iteration from O(n3) to O(r3) Thiscan be done by obtaining subsamples (Xi Yi)r

i=1 by applying algorithm 6to (XiYi)n

i=1 and then replacing (XiYi)ni=1 in requirement of algorithm 3

by (Xi Yi)ri=1 and using the number r instead of n

C24 How to Select the Number of Subsamples The number r of subsam-ples determines the trade-off between the accuracy and computational timeof KMCF It may be selected by cross-validation or by measuring the ap-proximation error mXY minus mXYHXtimesY

as for the case of selecting the rankof low-rank approximation in section C1

C25 Discussion Recall that kernel herding generates samples such thatthey approximate a given kernel mean (see section 35) Under certain

440 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

assumptions the error of this approximation is of O(rminus1) with r sampleswhich is faster than that of iid samples O(rminus12) This indicates that sub-samples (Xi Yi)r

i=1 selected with kernel herding may approximate equa-tion C1 well Here however we find the solutions of the optimizationproblems 36 and 37 from the finite set (XiYi)n

i=1 rather than the entirejoint space X times Y The convergence guarantee is provided only for the caseof the entire joint space X times Y Thus for our case the convergence guar-antee is no longer provided Moreover the fast rate O(rminus1) is guaranteedonly for finite-dimensional RKHSs Gaussian kernels which we often use inpractice define infinite-dimensional RKHSs Therefore the fast rate is notguaranteed if we use gaussian kernels Nevertheless we can use algorithm6 as a heuristic for data reduction

Acknowledgments

We express our gratitude to the associate editor and the anonymous re-viewer for their time and helpful suggestions We also thank MasashiShimbo Momoko Hayamizu Yoshimasa Uematsu and Katsuhiro Omaefor their helpful comments This work has been supported in part by MEXTGrant-in-Aid for Scientific Research on Innovative Areas 25120012 MKhas been supported by JSPS Grant-in-Aid for JSPS Fellows 15J04406

References

Anderson B amp Moore J (1979) Optimal filtering Englewood Cliffs NJ PrenticeHall

Aronszajn N (1950) Theory of reproducing kernels Transactions of the AmericanMathematical Society 68(3) 337ndash404

Filtering with State-Observation Examples 441

Bach F amp Jordan M I (2002) Kernel independent component analysis Journal ofMachine Learning Research 3 1ndash48

Bach F Lacoste-Julien S amp Obozinski G (2012) On the equivalence betweenherding and conditional gradient algorithms In Proceedings of the 29th Interna-tional Conference on Machine Learning (ICML2012) (pp 1359ndash1366) Madison WIOmnipress

Berlinet A amp Thomas-Agnan C (2004) Reproducing kernel Hilbert spaces in probabilityand statistics Dordrecht Kluwer Academic

Calvet L E amp Czellar V (2015) Accurate methods for approximate Bayesiancomputation filtering Journal of Financial Econometrics 13 798ndash838 doi101093jjfinecnbu019

Cappe O Godsill S J amp Moulines E (2007) An overview of existing methods andrecent advances in sequential Monte Carlo IEEE Proceedings 95(5) 899ndash924

Chen Y Welling M amp Smola A (2010) Supersamples from kernel-herding InProceedings of the 26th Conference on Uncertainty in Artificial Intelligence (pp 109ndash116) Cambridge MA MIT Press

Deisenroth M Huber M amp Hanebeck U (2009) Analytic moment-based gaussianprocess filtering In Proceedings of the 26th International Conference on MachineLearning (pp 225ndash232) Madison WI Omnipress

Doucet A Freitas N D amp Gordon N J (Eds) (2001) Sequential Monte Carlomethods in practice New York Springer

Doucet A amp Johansen A M (2011) A tutorial on particle filtering and smoothingFifteen years later In D Crisan amp B Rozovskii (Eds) The Oxford handbook ofnonlinear filtering (pp 656ndash704) New York Oxford University Press

Durbin J amp Koopman S J (2012) Time series analysis by state space methods (2nd ed)New York Oxford University Press

Eberts M amp Steinwart I (2013) Optimal regression rates for SVMs using gaussiankernels Electronic Journal of Statistics 7 1ndash42

Ferris B Hahnel D amp Fox D (2006) Gaussian processes for signal strength-basedlocation estimation In Proceedings of Robotics Science and Systems CambridgeMA MIT Press

Fine S amp Scheinberg K (2001) Efficient SVM training using low-rank kernel rep-resentations Journal of Machine Learning Research 2 243ndash264

Freund R M amp Grigas P (2014) New analysis and results for the FrankndashWolfemethod Mathematical Programming doi 101007s10107-014-0841-6

Fukumizu K Bach F amp Jordan M I (2004) Dimensionality reduction for super-vised learning with reproducing kernel Hilbert spaces Journal of Machine LearningResearch 5 73ndash99

Fukumizu K Gretton A Sun X amp Scholkopf B (2008) Kernel measures ofconditional dependence In J C Platt D Koller Y Singer amp S Roweis (Eds)Advances in neural information processing systems 20 (pp 489ndash496) CambridgeMA MIT Press

Fukumizu K Song L amp Gretton A (2011) Kernel Bayesrsquo rule In J Shawe-TaylorR S Zemel P L Bartlett F C N Pereira amp K Q Weinberger (Eds) Advances inneural information processing systems 24 (pp 1737ndash1745) Red Hook NY Curran

Fukumizu K Song L amp Gretton A (2013) Kernel Bayesrsquo rule Bayesian inferencewith positive definite kernels Journal of Machine Learning Research 14 3753ndash3783

442 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Fukumizu K Sriperumbudur B Gretton A amp Scholkopf B (2009) Characteristickernels on groups and semigroups In D Koller D Schuurmans Y Bengio amp LBottou (Eds) Advances in neural information processing systems 21 (pp 473ndash480)Cambridge MA MIT Press

Gordon N J Salmond D J amp Smith AFM (1993) Novel approach tononlinearnon-gaussian Bayesian state estimation IEE-Proceedings-F 140 107ndash113

Hofmann T Scholkopf B amp Smola A J (2008) Kernel methods in machine learn-ing Annals of Statistics 36(3) 1171ndash1220

Jaggi M (2013) Revisiting Frank-Wolfe Projection-free sparse convex optimizationIn Proceedings of the 30th International Conference on Machine Learning (pp 427ndash435)httplinkspringercomarticle101007Fg10107-014-0841-6

Jasra A Singh S S Martin J S amp McCoy E (2012) Filtering via approximateBayesian computation Statistics and Computing 22 1223ndash1237

Julier S J amp Uhlmann J K (1997) A new extension of the Kalman filter tononlinear systems In Proceedings of AeroSense The 11th International SymposiumAerospaceDefence Sensing Simulation and Controls Bellingham WA SPIE

Julier S J amp Uhlmann J K (2004) Unscented filtering and nonlinear estimationIEEE Review 92 401ndash422

Kalman R E (1960) A new approach to linear filtering and prediction problemsTransactions of the ASME Journal of Basic Engineering 82 35ndash45

Kanagawa M amp Fukumizu K (2014) Recovering distributions from gaussianRKHS embeddings In Proceedings of the 17th International Conference on Artifi-cial Intelligence and Statistics (pp 457ndash465) JMLR

Kanagawa M Nishiyama Y Gretton A amp Fukumizu K (2014) Monte Carlofiltering using kernel embedding of distributions In Proceedings of the 28thAAAI Conference on Artificial Intelligence (pp 1897ndash1903) Cambridge MA MITPress

Ko J amp Fox D (2009) GP-BayesFilters Bayesian filtering using gaussian processprediction and observation models Autonomous Robots 72(1) 75ndash90

Lazebnik S Schmid C amp Ponce J (2006) Beyond bags of features Spatial pyramidmatching for recognizing natural scene categories In Proceedings of 2006 IEEEComputer Society Conference on Computer Vision and Pattern Recognition (vol 2 pp2169ndash2178) Washington DC IEEE Computer Society

Liu J S (2001) Monte Carlo strategies in scientific computing New York Springer-Verlag

McCalman L OrsquoCallaghan S amp Ramos F (2013) Multi-modal estimation withkernel embeddings for learning motion models In Proceedings of 2013 IEEE In-ternational Conference on Robotics and Automation (pp 2845ndash2852) Piscataway NJIEEE

Pan S J amp Yang Q (2010) A survey on transfer learning IEEE Transactions onKnowledge and Data Engineering 22(10) 1345ndash1359

Pistohl T Ball T Schulze-Bonhage A Aertsen A amp Mehring C (2008) Predic-tion of arm movement trajectories from ECoG-recordings in humans Journal ofNeuroscience Methods 167(1) 105ndash114

Pronobis A amp Caputo B (2009) COLD COsy localization database InternationalJournal of Robotics Research 28(5) 588ndash594

Filtering with State-Observation Examples 443

Quigley M Stavens D Coates A amp Thrun S (2010) Sub-meter indoor localiza-tion in unmodified environments with inexpensive sensors In Proceedings of theIEEERSJ International Conference on Intelligent Robots and Systems 2010 (vol 1 pp2039ndash2046) Piscataway NJ IEEE

Ristic B Arulampalam S amp Gordon N (2004) Beyond the Kalman filter Particlefilters for tracking applications Norwood MA Artech House

Schaid D J (2010a) Genomic similarity and kernel methods I Advancements bybuilding on mathematical and statistical foundations Human Heredity 70(2) 109ndash131

Schaid D J (2010b) Genomic similarity and kernel methods II Methods for genomicinformation Human Heredity 70(2) 132ndash140

Schalk G Kubanek J Miller K J Anderson N R Leuthardt E C Ojemann J G Wolpaw J R (2007) Decoding two dimensional movement trajectories usingelectrocorticographic signals in humans Journal of Neural Engineering 4(264)264ndash275

Scholkopf B amp Smola A J (2002) Learning with kernels Cambridge MA MIT PressScholkopf B Tsuda K amp Vert J P (2004) Kernel methods in computational biology

Cambridge MA MIT PressSilverman B W (1986) Density estimation for statistics and data analysis London

Chapman and HallSmola A Gretton A Song L amp Scholkopf B (2007) A Hilbert space embed-

ding for distributions In Proceedings of the International Conference on AlgorithmicLearning Theory (pp 13ndash31) New York Springer

Song L Fukumizu K amp Gretton A (2013) Kernel embeddings of conditional dis-tributions A unified kernel framework for nonparametric inference in graphicalmodels IEEE Signal Processing Magazine 30(4) 98ndash111

Song L Huang J Smola A amp Fukumizu K (2009) Hilbert space embeddings ofconditional distributions with applications to dynamical systems In Proceedingsof the 26th International Conference on Machine Learning (pp 961ndash968) MadisonWI Omnipress

Sriperumbudur B K Gretton A Fukumizu K Scholkopf B amp Lanckriet G R(2010) Hilbert space embeddings and metrics on probability measures Journal ofMachine Learning Research 11 1517ndash1561

Steinwart I amp Christmann A (2008) Support vector machines New York SpringerStone C J (1977) Consistent nonparametric regression Annals of Statistics 5(4)

595ndash620Thrun S Burgard W amp Fox D (2005) Probabilistic robotics Cambridge MA MIT

PressVlassis N Terwijn B amp Krose B (2002) Auxiliary particle filter robot local-

ization from high-dimensional sensor observations In Proceedings of the In-ternational Conference on Robotics and Automation (pp 7ndash12) Piscataway NJIEEE

Wang Z Ji Q Miller K J amp Schalk G (2011) Prior knowledge improves decodingof finger flexion from electrocorticographic signals Frontiers in Neuroscience 5127

Widom H (1963) Asymptotic behavior of the eigenvalues of certain integral equa-tions Transactions of the American Mathematical Society 109 278ndash295

444 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Widom H (1964) Asymptotic behavior of the eigenvalues of certain integral equa-tions II Archive for Rational Mechanics and Analysis 17 215ndash229

Wolf J Burgard W amp Burkhardt H (2005) Robust vision-based localization bycombining an image retrieval system with Monte Carlo localization IEEE Trans-actions on Robotics 21(2) 208ndash216

Received May 18 2015 accepted October 14 2015

Page 11: Filtering with State-Observation Examples via Kernel Monte ...

392 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

How can we decode the information of P from mP The empirical ker-nel mean equation 34 has the following property which is due to thereproducing property of the kernel

〈mP f 〉H =nsum

i=1

wi f (Xi) forall f isin H (35)

Namely the weighted average of any function in the RKHS is equal to theinner product between the empirical kernel mean and that function Thisis analogous to the property 32 of the population kernel mean mP Let fbe any function in H From these properties equations 32 and 35 we have

∣∣∣∣∣EXsimP[ f (X)] minusnsum

i=1

wi f (Xi)

∣∣∣∣∣ = |〈mP minus mP f 〉H| le mP minus mPH fH

where we used the Cauchy-Schwartz inequality Therefore the left-handside will be close to 0 if the error mP minus mPH is small This shows that theexpectation of f can be estimated by the weighted average

sumni=1 wi f (Xi)

Note that here f is a function in the RKHS but the same can also be shownfor functions outside the RKHS under certain assumptions (Kanagawa ampFukumizu 2014) In this way the estimator of the form 34 provides es-timators of moments probability masses on sets and the density func-tion (if this exists) We explain this in the context of state-space models insection 44

35 Kernel Herding Here we explain kernel herding (Chen et al 2010)another building block of the proposed filter Suppose the kernel mean mPis known We wish to generate samples x1 x2 x isin X such that theempirical mean mP = 1

sumi=1 k(middot xi) is close to mP that is mP minus mPH is

small This should be done only using mP Kernel herding achieves this bygreedy optimization using the following update equations

x1 = arg maxxisinX

mP(x) (36)

x = arg maxxisinX

mP(x) minus 1

minus1sumi=1

k(x xi) ( ge 2) (37)

where mP(x) denotes the evaluation of mP at x (recall that mP is a functionin H)

An intuitive interpretation of this procedure can be given if there is aconstant R gt 0 such that k(x x) = R for all x isin X (eg R = 1 if k is gaussian)Suppose that x1 xminus1 are already calculated In this case it can be shown

Filtering with State-Observation Examples 393

that x in equation 37 is the minimizer of

E =∥∥∥∥∥mP minus 1

sumi=1

k(middot xi)

∥∥∥∥∥H

(38)

Thus kernel herding performs greedy minimization of the distance betweenmP and the empirical kernel mean mP = 1

sumi=1 k(middot xi)

It can be shown that the error E of equation 38 decreases at a rate at leastO(minus12) under the assumption that k is bounded (Bach Lacoste-Julien ampObozinski 2012) In other words the herding samples x1 x provide aconvergent approximation of mP In this sense kernel herding can be seen asa (pseudo) sampling method Note that mP itself can be an empirical kernelmean of the form 34 These properties are important for our resamplingalgorithm developed in section 42

It should be noted that E decreases at a faster rate O(minus1) under a certainassumption (Chen et al 2010) this is much faster than the rate of iidsamples O(minus12) Unfortunately this assumption holds only when H isfinite dimensional (Bach et al 2012) and therefore the fast rate of O(minus1)

has not been guaranteed for infinite-dimensional cases Nevertheless thisfast rate motivates the use of kernel herding in the data reduction methodin section C2 in appendix C (we will use kernel herding for two differentpurposes)

4 Kernel Monte Carlo Filter

In this section we present our kernel Monte Carlo filter (KMCF) Firstwe define notation and review the problem setting in section 41 We thendescribe the algorithm of KMCF in section 42 We discuss implementationissues such as hyperparameter selection and computational cost in section43 We explain how to decode the information on the posteriors from theestimated kernel means in section 44

41 Notation and Problem Setup Here we formally define the setupexplained in section 1 The notation is summarized in Table 1

We consider a state-space model (see Figure 1) Let X and Y be mea-surable spaces which serve as a state space and an observation space re-spectively Let x1 xt xT isin X be a sequence of hidden states whichfollow a Markov process Let p(xt |xtminus1) denote a transition model that de-fines this Markov process Let y1 yt yT isin Y be a sequence of obser-vations Each observation yt is assumed to be generated from an observationmodel p(yt |xt ) conditioned on the corresponding state xt We use the abbre-viation y1t = y1 yt

We consider a filtering problem of estimating the posterior distributionp(xt |y1t ) for each time t = 1 T The estimation is to be done online

394 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Table 1 Notation

X State spaceY Observation spacext isin X State at time tyt isin Y Observation at time tp(yt |xt ) Observation modelp(xt |xtminus1) Transition model(XiYi)n

i=1 State-observation exampleskX Positive-definite kernel on XkY Positive-definite kernel on YHX RKHS associated with kXHY RKHS associated with kY

as each yt is given Specifically we consider the following setting (see alsosection 1)

1 The observation model p(yt |xt ) is not known explicitly or even para-metrically Instead we are given examples of state-observation pairs(XiYi)n

i=1 sub X times Y prior to the test phase The observation modelis also assumed time homogeneous

2 Sampling from the transition model p(xt |xtminus1) is possible Its prob-abilistic model can be an arbitrary nonlinear nongaussian distribu-tion as for standard particle filters It can further depend on timeFor example control input can be included in the transition model asp(xt |xtminus1) = p(xt |xtminus1 ut ) where ut denotes control input providedby a user at time t

Let kX X times X rarr R and kY Y times Y rarr R be positive-definite kernels onX and Y respectively Denote by HX and HY their respective RKHSs Weaddress the above filtering problem by estimating the kernel means of theposteriors

mxt |y1t=

intkX (middot xt )p(xt |y1t )dxt isin HX (t = 1 T ) (41)

These preserve all the information of the corresponding posteriors if thekernels are characteristic (see section 32) Therefore the resulting estimatesof these kernel means provide us the information of the posteriors as ex-plained in section 44

42 Algorithm KMCF iterates three steps of prediction correction andresampling for each time t Suppose that we have just finished the iterationat time t minus 1 Then as shown later the resampling step yields the followingestimator of equation 41 at time t minus 1

Filtering with State-Observation Examples 395

Figure 2 One iteration of KMCF Here X1 X8 and Y1 Y8 denote statesand observations respectively in the state-observation examples (XiYi)n

i=1(suppose n = 8) 1 Prediction step The kernel mean of the prior equation 45 isestimated by sampling with the transition model p(xt |xtminus1) 2 Correction stepThe kernel mean of the posterior equation 41 is estimated by applying kernelBayesrsquo rule (see algorithm 1) The estimation makes use of the informationof the prior (expressed as m

π= (mxt |y1tminus1

(Xi)) isin R8) as well as that of a new

observation yt (expressed as kY = (kY (ytYi)) isin R8) The resulting estimate

equation 46 is expressed as a weighted sample (wti Xi)ni=1 Note that the

weights may be negative 3 Resampling step Samples associated with smallweights are eliminated and those with large weights are replicated by applyingkernel herding (see algorithm 2) The resulting samples provide an empiricalkernel mean equation 47 which will be used in the next iteration

mxtminus1|y1tminus1= 1

n

nsumi=1

kX (middot Xtminus1i) (42)

where Xtminus11 Xtminus1n isin X We show one iteration of KMCF that estimatesthe kernel mean (41) at time t (see also Figure 2)

421 Prediction Step The prediction step is as follows We generate asample from the transition model for each Xtminus1i in equation 42

Xti sim p(xt |xtminus1 = Xtminus1i) (i = 1 n) (43)

396 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

We then specify a new empirical kernel mean

mxt |y1tminus1= 1

n

nsumi=1

kX (middot Xti) (44)

This is an estimator of the following kernel mean of the prior

mxt |y1tminus1=

intkX (middot xt )p(xt |y1tminus1)dxt isin HX (45)

where

p(xt |y1tminus1) =int

p(xt |xtminus1)p(xtminus1|y1tminus1)dxtminus1

is the prior distribution of the current state xt Thus equation 44 serves asa prior for the subsequent posterior estimation

In section 5 we theoretically analyze this sampling procedure in detailand provide justification of equation 44 as an estimator of the kernel meanequation 45 We emphasize here that such an analysis is necessary eventhough the sampling procedure is similar to that of a particle filter thetheory of particle methods does not provide a theoretical justification ofequation 44 as a kernel mean estimator since it deals with probabilities asempirical distributions

422 Correction Step This step estimates the kernel mean equation 41of the posterior by using kernel Bayesrsquo rule (see algorithm 1) in section 33This makes use of the new observation yt the state-observation examples(XiYi)n

i=1 and the estimate equation 44 of the priorThe input of algorithm 1 consists of (1) vectors

kY = (kY (ytY1) kY (ytYn))T isin Rn

mπ = (mxt |y1tminus1(X1) mxt |y1tminus1

(Xn))T

=(

1n

nsumi=1

kX (Xq Xti)

)n

q=1

isin Rn

which are interpreted as expressions of yt and mxt |y1tminus1using the sample

(XiYi)ni=1 (2) kernel matrices GX = (kX (Xi Xj)) GY = (kY (YiYj)) isin

Rntimesn and (3) regularization constants ε δ gt 0 These constants ε δ as well

as kernels kX kY are hyperparameters of KMCF (we discuss how to choosethese parameters later)

Filtering with State-Observation Examples 397

Algorithm 1 outputs a weight vector w = (w1 wn) isin Rn Normaliz-

ing these weights wt = wsumn

i=1 wi we obtain an estimator of equation 413

as

mxt |y1t=

nsumi=1

wtikX (middot Xi) (46)

The apparent difference from a particle filter is that the posterior (ker-nel mean) estimator equation 46 is expressed in terms of the samplesX1 Xn in the training sample (XiYi)n

i=1 not with the samples fromthe prior equation 44 This requires that the training samples X1 Xncover the support of posterior p(xt |y1t ) sufficiently well If this does nothold we cannot expect good performance for the posterior estimate Notethat this is also true for any methods that deal with the setting of this letterpoverty of training samples in a certain region means that we do not haveany information about the observation model p(yt |xt ) in that region

423 Resampling Step This step applies the update equations 36 and37 of kernel herding in section 35 to the estimate equation 46 This is toobtain samples Xt1 Xtn such that

mxt |y1t= 1

n

nsumi=1

kX (middot Xti) (47)

is close to equation 46 in the RKHS Our theoretical analysis in section 5shows that such a procedure can reduce the error of the prediction step attime t + 1

The procedure is summarized in algorithm 2 Specifically we gener-ate each Xti by searching the solution of the optimization problem in

3For this normalization procedure see the discussion in section 43

398 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

equations 36 and 37 from a finite set of samples X1 Xn in equation46 We allow repetitions in Xt1 Xtn We can expect that the resultingequation 47 is close to equation 46 in the RKHS if the samples X1 Xncover the support of the posterior p(xt |y1t ) sufficiently This is verified bythe theoretical analysis of section 53

Here searching for the solutions from a finite set reduces the computa-tional costs of kernel herding It is possible to search from the entire space Xif we have sufficient time or if the sample size n is small enough it dependson applications and available computational resources We also note thatthe size of the resampling samples is not necessarily n this depends on howaccurately these samples approximate equation 46 Thus a smaller numberof samples may be sufficient In this case we can reduce the computationalcosts of resampling as discussed in section 52

The aim of our resampling step is similar to that of the resamplingstep of a particle filter (see Doucet amp Johansen 2011) Intuitively the aimis to eliminate samples with very small weights and replicate those withlarge weights (see Figures 2 and 3) In particle methods this is realized bygenerating samples from the empirical distribution defined by a weightedsample (therefore this procedure is called resampling) Our resamplingstep is a realization of such a procedure in terms of the kernel mean embed-ding we generate samples Xt1 Xtn from the empirical kernel meanequation 46

Note that the resampling algorithm of particle methods is not appropri-ate for use with kernel mean embeddings This is because it assumes thatweights are positive but our weights in equation 46 can be negative asthis equation is a kernel mean estimator One may apply the resamplingalgorithm of particle methods by first truncating the samples with nega-tive weights However there is no guarantee that samples obtained by thisheuristic produce a good approximation of equation 46 as a kernel meanas shown by experiments in section 61 In this sense the use of kernel herd-ing is more natural since it generates samples that approximate a kernelmean

424 Overall Algorithm We summarize the overall procedure of KMCFin algorithm 3 where pinit denotes a prior distribution for the initial state x1For each time t KMCF takes as input an observation yt and outputs a weightvector wt = (wt1 wtn)T isin R

n Combined with the samples X1 Xnin the state-observation examples (XiYi)n

i=1 these weights provide anestimator equation 46 of the kernel mean of posterior equation 41

We first compute kernel matrices GX GY (lines 4ndash5) which are usedin algorithm 1 of kernel Bayesrsquo rule (line 15) For t = 1 we generate aniid sample X11 X1n from the initial distribution pinit (line 8) whichprovides an estimator of the prior corresponding to equation 44 Line 10 isthe resampling step at time t minus 1 and line 11 is the prediction step at timet Lines 13 to 16 correspond to the correction step

Filtering with State-Observation Examples 399

43 Discussion The estimation accuracy of KMCF can depend on sev-eral factors in practice

431 Training Samples We first note that training samples (XiYi)ni=1

should provide the information concerning the observation model p(yt |xt )For example (XiYi)n

i=1 may be an iid sample from a joint distributionp(xy) on X times Y which decomposes as p(x y) = p(y|x)p(x) Here p(y|x) isthe observation model and p(x) is some distribution on X The support ofp(x) should cover the region where states x1 xT may pass in the testphase as discussed in section 42 For example this is satisfied when thestate-space X is compact and the support of p(x) is the entire X

Note that training samples (XiYi)ni=1 can also be non-iid in prac-

tice For example we may deterministically select X1 Xn so that theycover the region of interest In location estimation problems in roboticsfor instance we may collect location-sensor examples (XiYi)n

i=1 so thatlocations X1 Xn cover the region where location estimation is to beconducted (Quigley et al 2010)

432 Hyperparameters As in other kernel methods in general the perfor-mance of KMCF depends on the choice of its hyperparameters which arethe kernels kX and kY (or parameters in the kernelsmdasheg the bandwidthof the gaussian kernel) and the regularization constants δ ε gt 0 We needto define these hyperparameters based on the joint sample (XiYi)n

i=1 be-fore running the algorithm on the test data y1 yT This can be done bycross-validation Suppose that (XiYi)n

i=1 is given as a sequence from the

400 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

state-space model We can then apply two-fold cross-validation by dividingthe sequence into two subsequences If (XiYi)n

i=1 is not a sequence we canrely on the cross-validation procedure for kernel Bayesrsquo rule (see section 42of Fukumizu et al 2013)

433 Normalization of Weights We found in our preliminary experimentsthat normalization of the weights (see line 16 algorithm 3) is beneficial tothe filtering performance This may be justified by the following discus-sion about a kernel mean estimator in general Let us consider a consis-tent kernel mean estimator mP = sumn

i=1 wik(middot Xi) such that limnrarrinfin mP minusmPH = 0 Then we can show that the sum of the weights converges to 1limnrarrinfin

sumni=1 wi = 1 under certain assumptions (Kanagawa amp Fukumizu

2014) This could be explained as follows Recall that the weighted averagesumni=1 wi f (Xi) of a function f is an estimator of the expectation

intf (x)dP(x)

Let f be a function that takes the value 1 for any input f (x) = 1 forallx isin X Then we have

sumni=1 wi f (Xi) = sumn

i=1 wi andint

f (x)dP(x) = 1 Thereforesumni=1 wi is an estimator of 1 In other words if the error mP minus mPH is

small then the sum of the weightssumn

i=1 wi should be close to 1 Converselyif the sum of the weights is far from 1 it suggests that the estimate mP isnot accurate Based on this theoretical observation we suppose that nor-malization of the weights (this makes the sum equal to 1) results in a betterestimate

434 Time Complexity For each time t the naive implementation of algo-rithm 3 requires a time complexity of O(n3) for the size n of the joint sample(XiYi)n

i=1 This comes from algorithm 1 in line 15 (kernel Bayesrsquo rule) andalgorithm 2 in line 10 (resampling) The complexity O(n3) of algorithm 1 isdue to the matrix inversions Note that one of the inversions (GX + nεIn)minus1can be computed before the test phase as it does not involve the test dataAlgorithm 2 also has complexity O(n3) In section 52 we will explain howthis cost can be reduced to O(n2) by generating only lt n samples byresampling

435 Speeding Up Methods In appendix C we describe two methods forreducing the computational costs of KMCF both of which only need tobe applied prior to the test phase The first is a low-rank approximationof kernel matrices GX GY which reduces the complexity to O(nr2) wherer is the rank of low-rank matrices Low-rank approximation works wellin practice since eigenvalues of a kernel matrix often decay very rapidlyIndeed this has been theoretically shown for some cases (see Widom 19631964 and discussions in Bach amp Jordan 2002) Second is a data-reductionmethod based on kernel herding which efficiently selects joint subsamplesfrom the training set (XiYi)n

i=1 Algorithm 3 is then applied based onlyon those subsamples The resulting complexity is thus O(r3) where r is the

Filtering with State-Observation Examples 401

number of subsamples This method is motivated by the fast convergencerate of kernel herding (Chen et al 2010)

Both methods require the number r to be chosen which is either the rankfor low-rank approximation or the number of subsamples in data reductionThis determines the trade-off between accuracy and computational timeIn practice there are two ways of selecting the number r By regardingr as a hyperparameter of KMCF we can select it by cross-validation orwe can choose r by comparing the resulting approximation error which ismeasured in a matrix norm for low-rank approximation and in an RKHSnorm for the subsampling method (For details see appendix C)

436 Transfer Learning Setting We assumed that the observation modelin the test phase is the same as for the training samples However thismight not hold in some situations For example in the vision-based local-ization problem the illumination conditions for the test and training phasesmight be different (eg the test is done at night while the training samplesare collected in the morning) Without taking into account such a signifi-cant change in the observation model KMCF would not perform well inpractice

This problem could be addressed by exploiting the framework of transferlearning (Pan amp Yang 2010) This framework aims at situations where theprobability distribution that generates test data is different from that oftraining samples The main assumption is that there exist a small number ofexamples from the test distribution Transfer learning then provides a wayof combining such test examples and abundant training samples therebyimproving the test performance The application of transfer learning in oursetting remains a topic for future research

44 Estimation of Posterior Statistics By algorithm 3 we obtain theestimates of the kernel means of posteriors equation 41 as

mxt |y1t=

nsumi=1

wtikX (middot Xi) (t = 1 T ) (48)

These contain the information on the posteriors p(xt |y1t ) (see sections 32and 34) We now show how to estimate statistics of the posteriors usingthese estimates For ease of presentation we consider the case X = R

dTheoretical arguments to justify these operations are provided by Kana-gawa and Fukumizu (2014)

441 Mean and Covariance Consider the posterior meanint

xt p(xt |y1t )dxtisin R

d and the posterior (uncentered) covarianceint

xtxTt p(xt |y1t )dxt isin R

dtimesd

402 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

These quantities can be estimated as

nsumi=1

wtiXi (mean)

nsumi=1

wtiXiXTi (covariance)

442 Probability Mass Let A sub X be a measurable set with smoothboundary Define the indicator function IA(x) by IA(x) = 1 for x isin A andIA(x) = 0 otherwise Consider the probability mass

intIA(x)p(xt |y1t )dxt This

can be estimated assumn

i=1 wtiIA(Xi)

443 Density Suppose p(xt |y1t ) has a density function Let J(x) be asmoothing kernel satisfying

intJ(x)dx = 1 and J(x) ge 0 Let h gt 0 and define

Jh(x) = 1hd J

( xh

) Then the density of p(xt |y1t ) can be estimated as

p(xt |y1t ) =nsum

i=1

wtiJh(xt minus Xi) (49)

with an appropriate choice of h

444 Mode The mode may be obtained by finding a point that maxi-mizes equation 49 However this requires a careful choice of h Instead wemay use Ximax

with imax = arg maxi wti as a mode estimate This is the pointin X1 Xn that is associated with the maximum weight in wt1 wtnThis point can be interpreted as the point that maximizes equation 49 inthe limit of h rarr 0

445 Other Methods Other ways of using equation 48 include the preim-age computation and fitting of gaussian mixtures (See eg Song et al 2009Fukumizu et al 2013 McCalman OrsquoCallaghan amp Ramos 2013)

5 Theoretical Analysis

In this section we analyze the sampling procedure of the prediction stepin section 42 Specifically we derive an upper bound on the error of theestimator 44 We also discuss in detail how the resampling step in section 42works as a preprocessing step of the prediction step

To make our analysis clear we slightly generalize the setting of theprediction step and discuss the sampling and resampling procedures inthis setting

51 Error Bound for the Prediction Step Let X be a measurable spaceand P be a probability distribution on X Let p(middot|x) be a conditional

Filtering with State-Observation Examples 403

distribution on X conditioned on x isin X Let Q be a marginal distributionon X defined by Q(B) = int

p(B|x)dP(x) for all measurable B sub X In the fil-tering setting of section 4 the space X corresponds to the state space andthe distributions P p(middot|x) and Q correspond to the posterior p(xtminus1|y1tminus1)

at time t minus 1 the transition model p(xt |xtminus1) and the prior p(xt |y1tminus1) attime t respectively

Let $k_{\mathcal{X}}$ be a positive-definite kernel on $\mathcal{X}$ and $\mathcal{H}_{\mathcal{X}}$ be the RKHS associated with $k_{\mathcal{X}}$. Let $m_P = \int k_{\mathcal{X}}(\cdot, x)\, dP(x)$ and $m_Q = \int k_{\mathcal{X}}(\cdot, x)\, dQ(x)$ be the kernel means of $P$ and $Q$, respectively. Suppose that we are given an empirical estimate of $m_P$ as

$$\hat{m}_P = \sum_{i=1}^{n} w_i k_{\mathcal{X}}(\cdot, X_i), \tag{5.1}$$

where $w_1, \dots, w_n \in \mathbb{R}$ and $X_1, \dots, X_n \in \mathcal{X}$. Considering this weighted-sample form enables us to explain the mechanism of the resampling step.

The prediction step can then be cast as the following procedure: for each sample $X_i$, we generate a new sample $X'_i$ with the conditional distribution, $X'_i \sim p(\cdot|X_i)$. Then we estimate $m_Q$ by

$$\hat{m}_Q = \sum_{i=1}^{n} w_i k_{\mathcal{X}}(\cdot, X'_i), \tag{5.2}$$

which corresponds to the estimate, equation 4.4, of the prior kernel mean at time $t$.

The following theorem provides an upper bound on the error of equation 5.2 and reveals properties of equation 5.1 that affect the error of the estimator, equation 5.2. The proof is given in appendix A.
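In code, the prediction step analyzed here amounts to propagating each sample through the transition model while keeping its weight. The sketch below assumes a user-supplied sampler `sample_transition` for $p(\cdot|x)$ (the random-walk example at the bottom is purely illustrative); it returns the pairs $(w_i, X'_i)$ that define $\hat{m}_Q$ in equation 5.2.

```python
import numpy as np

def prediction_step(X, w, sample_transition, rng):
    """Given (w_i, X_i) representing m_hat_P (eq. 5.1), draw X'_i ~ p(.|X_i)
    and return (X'_i, w_i) representing m_hat_Q (eq. 5.2); weights are unchanged."""
    X_new = np.stack([sample_transition(x, rng) for x in X])
    return X_new, w.copy()

# illustrative transition model: a random walk X' = X + 0.1 * noise
def sample_transition(x, rng):
    return x + 0.1 * rng.standard_normal(size=np.shape(x))

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 1))
w = np.full(100, 1.0 / 100)
X_prime, w_prime = prediction_step(X, w, sample_transition, rng)
```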

Theorem 1. Let $\hat{m}_P$ be a fixed estimate of $m_P$, given by equation 5.1. Define a function $\theta$ on $\mathcal{X} \times \mathcal{X}$ by $\theta(x_1, x_2) = \int\!\!\int k_{\mathcal{X}}(x'_1, x'_2)\, dp(x'_1|x_1)\, dp(x'_2|x_2)$, $\forall x_1, x_2 \in \mathcal{X}$, and assume that $\theta$ is included in the tensor RKHS $\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}$ (see note 4). The estimator $\hat{m}_Q$, equation 5.2, then satisfies

$$E_{X'_1, \dots, X'_n}\!\left[\|\hat{m}_Q - m_Q\|^2_{\mathcal{H}_{\mathcal{X}}}\right] \le \sum_{i=1}^{n} w_i^2 \left( E_{X'_i}[k_{\mathcal{X}}(X'_i, X'_i)] - E_{X'_i, \tilde{X}'_i}[k_{\mathcal{X}}(X'_i, \tilde{X}'_i)] \right) \tag{5.3}$$
$$\quad + \|\hat{m}_P - m_P\|^2_{\mathcal{H}_{\mathcal{X}}} \|\theta\|_{\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}}, \tag{5.4}$$

where $X'_i \sim p(\cdot|X_i)$ and $\tilde{X}'_i$ is an independent copy of $X'_i$.

Note 4: The tensor RKHS $\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}$ is the RKHS of a product kernel $k_{\mathcal{X} \times \mathcal{X}}$ on $\mathcal{X} \times \mathcal{X}$, defined as $k_{\mathcal{X} \times \mathcal{X}}((x_a, x_b), (x_c, x_d)) = k_{\mathcal{X}}(x_a, x_c) k_{\mathcal{X}}(x_b, x_d)$, $\forall (x_a, x_b), (x_c, x_d) \in \mathcal{X} \times \mathcal{X}$. This space $\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}$ consists of smooth functions on $\mathcal{X} \times \mathcal{X}$ if the kernel $k_{\mathcal{X}}$ is smooth (e.g., if $k_{\mathcal{X}}$ is gaussian; see section 4 of Steinwart & Christmann, 2008). In this case, we can interpret this assumption as requiring that $\theta$ be smooth as a function on $\mathcal{X} \times \mathcal{X}$. The function $\theta$ can be written as the inner product between the kernel means of the conditional distributions: $\theta(x_1, x_2) = \langle m_{p(\cdot|x_1)}, m_{p(\cdot|x_2)} \rangle_{\mathcal{H}_{\mathcal{X}}}$, where $m_{p(\cdot|x)} := \int k_{\mathcal{X}}(\cdot, x')\, dp(x'|x)$. Therefore, the assumption may be further seen as requiring that the map $x \mapsto m_{p(\cdot|x)}$ be smooth. Note that while similar assumptions are common in the literature on kernel mean embeddings (e.g., theorem 5 of Fukumizu et al., 2013), we may relax this assumption by using approximate arguments in learning theory (e.g., theorems 2.2 and 2.3 of Eberts & Steinwart, 2013). This analysis remains a topic for future research.

From theorem 1, we can make the following observations. First, the second term, equation 5.4, of the upper bound shows that the error of the estimator, equation 5.2, is likely to be large if the given estimate, equation 5.1, has large error $\|\hat{m}_P - m_P\|^2_{\mathcal{H}_{\mathcal{X}}}$, which is reasonable to expect.

Second, the first term, equation 5.3, shows that the error of equation 5.2 can be large if the distribution of $X'_i$ (i.e., $p(\cdot|X_i)$) has large variance. For example, suppose $X'_i = f(X_i) + \varepsilon_i$, where $f: \mathcal{X} \to \mathcal{X}$ is some mapping and $\varepsilon_i$ is a random variable with mean 0. Let $k_{\mathcal{X}}$ be the gaussian kernel, $k_{\mathcal{X}}(x, x') = \exp(-\|x - x'\|^2 / 2\alpha)$, for some $\alpha > 0$. Then $E_{X'_i}[k_{\mathcal{X}}(X'_i, X'_i)] - E_{X'_i, \tilde{X}'_i}[k_{\mathcal{X}}(X'_i, \tilde{X}'_i)]$ increases from 0 to 1 as the variance of $\varepsilon_i$ (i.e., the variance of $X'_i$) increases from 0 to infinity. Therefore, in this case, equation 5.3 is upper-bounded at worst by $\sum_{i=1}^{n} w_i^2$. Note that $E_{X'_i}[k_{\mathcal{X}}(X'_i, X'_i)] - E_{X'_i, \tilde{X}'_i}[k_{\mathcal{X}}(X'_i, \tilde{X}'_i)]$ is always nonnegative (see note 5).

5.1.1 Effective Sample Size. Now let us assume that the kernel $k_{\mathcal{X}}$ is bounded: there is a constant $C > 0$ such that $\sup_{x \in \mathcal{X}} k_{\mathcal{X}}(x, x) < C$. Then the inequality of theorem 1 can be further bounded as

$$E_{X'_1, \dots, X'_n}\!\left[\|\hat{m}_Q - m_Q\|^2_{\mathcal{H}_{\mathcal{X}}}\right] \le 2C \sum_{i=1}^{n} w_i^2 + \|\hat{m}_P - m_P\|^2_{\mathcal{H}_{\mathcal{X}}} \|\theta\|_{\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}}. \tag{5.5}$$

This bound shows that two quantities are important in the estimate, equation 5.1: (1) the sum of squared weights $\sum_{i=1}^{n} w_i^2$ and (2) the error $\|\hat{m}_P - m_P\|^2_{\mathcal{H}_{\mathcal{X}}}$. In other words, the error of equation 5.2 can be large if the quantity $\sum_{i=1}^{n} w_i^2$ is large, regardless of the accuracy of equation 5.1 as an estimator of $m_P$. In fact, the estimator of the form 5.1 can have large $\sum_{i=1}^{n} w_i^2$ even when $\|\hat{m}_P - m_P\|^2_{\mathcal{H}_{\mathcal{X}}}$ is small, as shown in section 6.1.

Note 5: To show this, it is sufficient to prove that $\int\!\!\int k_{\mathcal{X}}(x, \tilde{x})\, dP(x)\, dP(\tilde{x}) \le \int k_{\mathcal{X}}(x, x)\, dP(x)$ for any probability $P$. This can be shown as follows: $\int\!\!\int k_{\mathcal{X}}(x, \tilde{x})\, dP(x)\, dP(\tilde{x}) = \int\!\!\int \langle k_{\mathcal{X}}(\cdot, x), k_{\mathcal{X}}(\cdot, \tilde{x}) \rangle_{\mathcal{H}_{\mathcal{X}}}\, dP(x)\, dP(\tilde{x}) \le \int\!\!\int \sqrt{k_{\mathcal{X}}(x, x)} \sqrt{k_{\mathcal{X}}(\tilde{x}, \tilde{x})}\, dP(x)\, dP(\tilde{x}) \le \int k_{\mathcal{X}}(x, x)\, dP(x)$. Here we used the reproducing property, the Cauchy-Schwartz inequality, and Jensen's inequality.


Figure 3: An illustration of the sampling procedure with (right) and without (left) the resampling algorithm. The left panel corresponds to the kernel mean estimators, equations 5.1 and 5.2, in section 5.1, and the right panel corresponds to equations 5.6 and 5.7 in section 5.2.

The inverse of the sum of the squared weights, $1/\sum_{i=1}^{n} w_i^2$, can be interpreted as the effective sample size (ESS) of the empirical kernel mean, equation 5.1. To explain this, suppose that the weights are normalized: $\sum_{i=1}^{n} w_i = 1$. Then the ESS takes its maximum $n$ when the weights are uniform, $w_1 = \cdots = w_n = 1/n$. It becomes small when only a few samples have large weights (see the left side of Figure 3). Therefore, the bound, equation 5.5, can be interpreted as follows: to make equation 5.2 a good estimator of $m_Q$, we need to have equation 5.1 such that the ESS is large and the error $\|\hat{m}_P - m_P\|_{\mathcal{H}_{\mathcal{X}}}$ is small. Here we borrowed the notion of ESS from the literature on particle methods, in which ESS has also played an important role (see section 2.5.3 of Liu, 2001, and section 3.5 of Doucet & Johansen, 2011).
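The ESS is cheap to monitor at every filtering step. The following sketch computes it from a weight vector; `w` is assumed to be normalized so that its entries sum to one, as in the discussion above.

```python
import numpy as np

def effective_sample_size(w):
    """ESS = 1 / sum_i w_i^2 for a normalized weight vector w."""
    return 1.0 / np.sum(w**2)

# uniform weights give ESS = n; a single dominant weight gives ESS close to 1
print(effective_sample_size(np.full(100, 0.01)))         # 100.0
print(effective_sample_size(np.array([0.97] + [0.001] * 30)))  # close to 1
```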

5.2 Role of Resampling. Based on these arguments, we explain how the resampling step in section 4.2 works as a preprocessing step for the sampling procedure. Consider $\hat{m}_P$ in equation 5.1 as an estimate, equation 4.6, given by the correction step at time $t-1$. Then we can think of $\hat{m}_Q$, equation 5.2, as an estimator of the kernel mean, equation 4.5, of the prior without the resampling step.

The resampling step is the application of kernel herding to $\hat{m}_P$ to obtain samples $\bar{X}_1, \dots, \bar{X}_n$, which provide a new estimate of $m_P$ with uniform weights:

$$\bar{m}_P = \frac{1}{n} \sum_{i=1}^{n} k_{\mathcal{X}}(\cdot, \bar{X}_i). \tag{5.6}$$

The subsequent prediction step is to generate a sample $X'_i \sim p(\cdot|\bar{X}_i)$ for each $\bar{X}_i$ $(i = 1, \dots, n)$ and estimate $m_Q$ as

$$\bar{m}_Q = \frac{1}{n} \sum_{i=1}^{n} k_{\mathcal{X}}(\cdot, X'_i). \tag{5.7}$$

Theorem 1 gives the following bound for this estimator, corresponding to equation 5.5:

$$E_{X'_1, \dots, X'_n}\!\left[\|\bar{m}_Q - m_Q\|^2_{\mathcal{H}_{\mathcal{X}}}\right] \le \frac{2C}{n} + \|\bar{m}_P - m_P\|^2_{\mathcal{H}_{\mathcal{X}}} \|\theta\|_{\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}}. \tag{5.8}$$

A comparison of the upper bounds of equations 5.5 and 5.8 implies that the resampling step is beneficial when $\sum_{i=1}^{n} w_i^2$ is large (i.e., the ESS is small) and $\|\bar{m}_P - \hat{m}_P\|_{\mathcal{H}_{\mathcal{X}}}$ is small. The condition on $\|\bar{m}_P - \hat{m}_P\|_{\mathcal{H}_{\mathcal{X}}}$ means that the loss by kernel herding (in terms of the RKHS distance) is small. This implies $\|\bar{m}_P - m_P\|_{\mathcal{H}_{\mathcal{X}}} \approx \|\hat{m}_P - m_P\|_{\mathcal{H}_{\mathcal{X}}}$, so the second term of equation 5.8 is close to that of equation 5.5. On the other hand, the first term of equation 5.8 will be much smaller than that of equation 5.5 if $\sum_{i=1}^{n} w_i^2 \gg 1/n$. In other words, the resampling step improves the accuracy of the sampling procedure by increasing the ESS of the kernel mean estimate $\hat{m}_P$. This is illustrated in Figure 3.

The above observations lead to the following procedures.

5.2.1 When to Apply Resampling. If $\sum_{i=1}^{n} w_i^2$ is not large, the gain by the resampling step will be small. Therefore, the resampling algorithm should be applied when $\sum_{i=1}^{n} w_i^2$ is above a certain threshold, say $2/n$. The same strategy has been commonly used in particle methods (see Doucet & Johansen, 2011).

Also, the bound, equation 5.3, of theorem 1 shows that resampling is not beneficial if the variance of the conditional distribution $p(\cdot|x)$ is very small (i.e., if the state transition is nearly deterministic). In this case, the error of the sampling procedure may increase due to the loss $\|\bar{m}_P - \hat{m}_P\|_{\mathcal{H}_{\mathcal{X}}}$ caused by kernel herding.
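A sketch of the adaptive rule just described: resample only when $\sum_i w_i^2$ exceeds a threshold (equivalently, when the ESS drops below $n/2$). The threshold $2/n$ follows the text; `kernel_herding_resample` is a placeholder for algorithm 2, not an implementation of it.

```python
import numpy as np

def maybe_resample(X, w, kernel_herding_resample, threshold_factor=2.0):
    """Apply the resampling step only if sum_i w_i^2 > threshold_factor / n."""
    n = len(w)
    if np.sum(w**2) > threshold_factor / n:
        X_bar = kernel_herding_resample(X, w, n)   # algorithm 2 (placeholder)
        w_bar = np.full(n, 1.0 / n)                # uniform weights after resampling
        return X_bar, w_bar
    return X, w
```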

5.2.2 Reduction of Computational Cost. Algorithm 2 generates $n$ samples $\bar{X}_1, \dots, \bar{X}_n$ with time complexity $O(n^3)$. Suppose that the first $\ell$ samples $\bar{X}_1, \dots, \bar{X}_\ell$, where $\ell < n$, already approximate $\hat{m}_P$ well: $\|\frac{1}{\ell} \sum_{i=1}^{\ell} k_{\mathcal{X}}(\cdot, \bar{X}_i) - \hat{m}_P\|_{\mathcal{H}_{\mathcal{X}}}$ is small. We do not then need to generate the rest of the samples $\bar{X}_{\ell+1}, \dots, \bar{X}_n$; we can make $n$ samples by copying the $\ell$ samples $n/\ell$ times (suppose $n$ can be divided by $\ell$ for simplicity, say $n = 2\ell$). Let $\check{X}_1, \dots, \check{X}_n$ denote these $n$ samples. Then $\frac{1}{\ell} \sum_{i=1}^{\ell} k_{\mathcal{X}}(\cdot, \bar{X}_i) = \frac{1}{n} \sum_{i=1}^{n} k_{\mathcal{X}}(\cdot, \check{X}_i)$ by definition, so $\|\frac{1}{n} \sum_{i=1}^{n} k_{\mathcal{X}}(\cdot, \check{X}_i) - \hat{m}_P\|_{\mathcal{H}_{\mathcal{X}}}$ is also small. This reduces the time complexity of algorithm 2 to $O(\ell n^2)$.

One might think that it is unnecessary to copy the $\ell$ samples $n/\ell$ times to make $n$ samples. This is not true, however. Suppose that we just use the first $\ell$ samples to define $\bar{m}_P = \frac{1}{\ell} \sum_{i=1}^{\ell} k_{\mathcal{X}}(\cdot, \bar{X}_i)$. Then the first term of equation 5.8 becomes $2C/\ell$, which is larger than the $2C/n$ of $n$ samples. This difference involves sampling with the conditional distribution $X'_i \sim p(\cdot|\bar{X}_i)$: if we use just the $\ell$ samples, sampling is done $\ell$ times; if we use the copied $n$ samples, sampling is done $n$ times. Thus the benefit of making $n$ samples comes from sampling with the conditional distribution many times. This matches the bound of theorem 1, where the first term involves the variance of the conditional distribution.

5.3 Convergence Rates for Resampling. Our resampling algorithm (see algorithm 2) is an approximate version of kernel herding in section 3.5: algorithm 2 searches for the solutions of the update equations, 3.6 and 3.7, from a finite set $\{X_1, \dots, X_n\} \subset \mathcal{X}$, not from the entire space $\mathcal{X}$. Therefore, existing theoretical guarantees for kernel herding (Chen et al., 2010; Bach et al., 2012) do not apply to algorithm 2. Here we provide a theoretical justification.

5.3.1 Generalized Version. We consider a slightly generalized version, shown in algorithm 4. It takes as input (1) a kernel mean estimator $\hat{m}_P$ of a kernel mean $m_P$, (2) candidate samples $Z_1, \dots, Z_N$, and (3) the number of resampling $\ell$. It then outputs resampling samples $\bar{X}_1, \dots, \bar{X}_\ell \in \{Z_1, \dots, Z_N\}$, which form a new estimator $\bar{m}_P = \frac{1}{\ell} \sum_{i=1}^{\ell} k_{\mathcal{X}}(\cdot, \bar{X}_i)$. Here $N$ is the number of the candidate samples.

Algorithm 4 searches for solutions of the update equations, 3.6 and 3.7, from the candidate set $\{Z_1, \dots, Z_N\}$. Note that here these samples $Z_1, \dots, Z_N$ can be different from those expressing the estimator $\hat{m}_P$. If they are the same (i.e., the estimator is expressed as $\hat{m}_P = \sum_{i=1}^{n} w_i k(\cdot, X_i)$ with $n = N$ and $X_i = Z_i$, $i = 1, \dots, n$), then algorithm 4 reduces to algorithm 2. In fact, theorem 2 allows $\hat{m}_P$ to be any element in the RKHS.
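The following is a minimal sketch of this generalized resampling step: greedy kernel herding in which the argmax is searched over a finite candidate set, as algorithm 4 does. The gaussian kernel, the naive $O(\ell N n + N^2)$ implementation, and the function names are our illustrative choices; the update (maximize $\hat{m}_P(x) - \frac{1}{t+1}\sum_{j \le t} k(x, \bar{X}_j)$) follows the standard kernel herding recursion and may differ in small conventions from equations 3.6 and 3.7 of the letter.

```python
import numpy as np

def gauss_kernel(A, B, gamma):
    """Gaussian kernel matrix k(a, b) = exp(-||a - b||^2 / (2 gamma^2))."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * gamma**2))

def herding_resample(X, w, Z, ell, gamma):
    """Greedy herding over a finite candidate set Z (sketch of algorithm 4).

    X : (n, d), w : (n,)  -- weighted expansion of m_hat = sum_i w_i k(., X_i)
    Z : (N, d)            -- candidate samples
    ell : number of points to select
    Returns indices into Z of the selected points Xbar_1, ..., Xbar_ell.
    """
    m_hat = gauss_kernel(Z, X, gamma) @ w    # m_hat evaluated at all candidates
    K_ZZ = gauss_kernel(Z, Z, gamma)         # for the correction term
    selected = []
    running_sum = np.zeros(len(Z))           # sum_j k(Z, Xbar_j) over selected points
    for t in range(ell):
        scores = m_hat - running_sum / (t + 1)
        idx = int(np.argmax(scores))
        selected.append(idx)
        running_sum += K_ZZ[:, idx]
    return np.array(selected)
```

When the trick of section 5.2.2 is used, one would call this with a small `ell` and then tile the selected points $n/\ell$ times before the prediction step.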

5.3.2 Convergence Rates in Terms of N and $\ell$. Algorithm 4 gives the new estimator $\bar{m}_P$ of the kernel mean $m_P$. The error of this new estimator, $\|\bar{m}_P - m_P\|_{\mathcal{H}_{\mathcal{X}}}$, should be close to that of the given estimator, $\|\hat{m}_P - m_P\|_{\mathcal{H}_{\mathcal{X}}}$. Theorem 2 guarantees this. In particular, it provides convergence rates of $\|\bar{m}_P - m_P\|_{\mathcal{H}_{\mathcal{X}}}$ approaching $\|\hat{m}_P - m_P\|_{\mathcal{H}_{\mathcal{X}}}$ as $N$ and $\ell$ go to infinity. This theorem follows from theorem 3 in appendix B, which holds under weaker assumptions.

Theorem 2. Let $m_P$ be the kernel mean of a distribution $P$ and $\hat{m}_P$ be any element in the RKHS $\mathcal{H}_{\mathcal{X}}$. Let $Z_1, \dots, Z_N$ be an i.i.d. sample from a distribution with density $q$. Assume that $P$ has a density function $p$ such that $\sup_{x \in \mathcal{X}} p(x)/q(x) < \infty$. Let $\bar{X}_1, \dots, \bar{X}_\ell$ be samples given by algorithm 4 applied to $\hat{m}_P$ with candidate samples $Z_1, \dots, Z_N$. Then for $\bar{m}_P = \frac{1}{\ell} \sum_{i=1}^{\ell} k(\cdot, \bar{X}_i)$, we have

$$\|\bar{m}_P - m_P\|^2_{\mathcal{H}_{\mathcal{X}}} = \left( \|\hat{m}_P - m_P\|_{\mathcal{H}_{\mathcal{X}}} + O_p(N^{-1/2}) \right)^2 + O\!\left( \frac{\ln \ell}{\ell} \right) \quad (N, \ell \to \infty). \tag{5.9}$$

Our proof in appendix B relies on the fact that kernel herding can be seen as the Frank-Wolfe optimization method (Bach et al., 2012). Indeed, the error $O(\ln \ell / \ell)$ in equation 5.9 comes from the optimization error of the Frank-Wolfe method after $\ell$ iterations (Freund & Grigas, 2014, bound 3.2). The error $O_p(N^{-1/2})$ is due to the approximation of the solution space by a finite set $Z_1, \dots, Z_N$. These errors will be small if $N$ and $\ell$ are large enough and the error of the given estimator, $\|\hat{m}_P - m_P\|_{\mathcal{H}_{\mathcal{X}}}$, is relatively large. This is formally stated in corollary 1 below.

Theorem 2 assumes that the candidate samples are i.i.d. with a density $q$. The assumption $\sup_{x \in \mathcal{X}} p(x)/q(x) < \infty$ requires that the support of $q$ contain that of $p$. This is a formal characterization of the explanation in section 4.2 that the samples $X_1, \dots, X_N$ should cover the support of $P$ sufficiently. Note that the statement of theorem 2 also holds for non-i.i.d. candidate samples, as shown in theorem 3 of appendix B.

5.3.3 Convergence Rates as $\hat{m}_P$ Goes to $m_P$. Theorem 2 provides convergence rates when the estimator $\hat{m}_P$ is fixed. In corollary 1 below, we let $\hat{m}_P$ approach $m_P$ and provide convergence rates for $\bar{m}_P$ of algorithm 4 approaching $m_P$. This corollary directly follows from theorem 2, since the constant terms in $O_p(N^{-1/2})$ and $O(\ln \ell / \ell)$ in equation 5.9 do not depend on $\hat{m}_P$, which can be seen from the proof in appendix B.

Corollary 1. Assume that $P$ and $Z_1, \dots, Z_N$ satisfy the conditions in theorem 2 for all $N$. Let $\hat{m}_P^{(n)}$ be an estimator of $m_P$ such that $\|\hat{m}_P^{(n)} - m_P\|_{\mathcal{H}_{\mathcal{X}}} = O_p(n^{-b})$ as $n \to \infty$, for some constant $b > 0$ (see note 6). Let $N = \ell = n^{2b}$. Let $\bar{X}^{(n)}_1, \dots, \bar{X}^{(n)}_\ell$ be samples given by algorithm 4 applied to $\hat{m}_P^{(n)}$ with candidate samples $Z_1, \dots, Z_N$. Then for $\bar{m}_P^{(n)} = \frac{1}{\ell} \sum_{i=1}^{\ell} k_{\mathcal{X}}(\cdot, \bar{X}^{(n)}_i)$, we have

$$\|\bar{m}_P^{(n)} - m_P\|_{\mathcal{H}_{\mathcal{X}}} = O_p(n^{-b}) \quad (n \to \infty). \tag{5.10}$$

Note 6: Here the estimator $\hat{m}_P^{(n)}$ and the candidate samples $Z_1, \dots, Z_N$ can be dependent.

Corollary 1 assumes that the estimator $\hat{m}_P^{(n)}$ converges to $m_P$ at a rate $O_p(n^{-b})$ for some constant $b > 0$. Then the resulting estimator $\bar{m}_P^{(n)}$ by algorithm 4 also converges to $m_P$ at the same rate $O_p(n^{-b})$, if we set $N = \ell = n^{2b}$. This implies that if we use sufficiently large $N$ and $\ell$, the errors $O_p(N^{-1/2})$ and $O(\ln \ell / \ell)$ in equation 5.9 can be negligible, as stated earlier. Note that $N = \ell = n^{2b}$ implies that $N$ and $\ell$ can be smaller than $n$, since typically we have $b \le 1/2$ ($b = 1/2$ corresponds to the convergence rates of parametric models). This provides support for the discussion in section 5.2 (reduction of computational cost).

5.3.4 Convergence Rates of Sampling after Resampling. We can derive convergence rates of the estimator $\bar{m}_Q$, equation 5.7, in section 5.2. Here we consider the following construction of $\bar{m}_Q$, as discussed in section 5.2 (reduction of computational cost). First, apply algorithm 4 to $\hat{m}_P^{(n)}$ and obtain resampling samples $\bar{X}^{(n)}_1, \dots, \bar{X}^{(n)}_\ell \in \{Z_1, \dots, Z_N\}$. Then copy these samples $n/\ell$ times, and let $\check{X}^{(n)}_1, \dots, \check{X}^{(n)}_n$ be the resulting $\ell \times (n/\ell) = n$ samples. Finally, sample with the conditional distribution, $X'^{(n)}_i \sim p(\cdot|\check{X}^{(n)}_i)$ $(i = 1, \dots, n)$, and define

$$\bar{m}_Q^{(n)} = \frac{1}{n} \sum_{i=1}^{n} k_{\mathcal{X}}(\cdot, X'^{(n)}_i). \tag{5.11}$$

The following corollary is a consequence of corollary 1, theorem 1, and the bound, equation 5.8. Note that theorem 1 obtains convergence in expectation, which implies convergence in probability.

Corollary 2. Let $\theta$ be the function defined in theorem 1 and assume $\theta \in \mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}$. Assume that $P$ and $Z_1, \dots, Z_N$ satisfy the conditions in theorem 2 for all $N$. Let $\hat{m}_P^{(n)}$ be an estimator of $m_P$ such that $\|\hat{m}_P^{(n)} - m_P\|_{\mathcal{H}_{\mathcal{X}}} = O_p(n^{-b})$ as $n \to \infty$, for some constant $b > 0$. Let $N = \ell = n^{2b}$. Then for the estimator $\bar{m}_Q^{(n)}$ defined as equation 5.11, we have

$$\|\bar{m}_Q^{(n)} - m_Q\|_{\mathcal{H}_{\mathcal{X}}} = O_p(n^{-\min(b, 1/2)}) \quad (n \to \infty).$$

Suppose $b \le 1/2$, which holds with basically any nonparametric estimator. Then corollary 2 shows that the estimator $\bar{m}_Q^{(n)}$ achieves the same convergence rate as the input estimator $\hat{m}_P^{(n)}$. Note that without resampling, the rate becomes $O_p\!\left(\sqrt{\sum_{i=1}^{n} (w^{(n)}_i)^2} + n^{-b}\right)$, where the weights are given by the input estimator $\hat{m}_P^{(n)} = \sum_{i=1}^{n} w^{(n)}_i k_{\mathcal{X}}(\cdot, X^{(n)}_i)$ (see the bound, equation 5.5). Thanks to resampling, (the square root of) the sum of the squared weights in the case of corollary 2 becomes $1/\sqrt{n}$, which is usually smaller than $\sqrt{\sum_{i=1}^{n} (w^{(n)}_i)^2}$ and is faster than or equal to $O_p(n^{-b})$. This shows the merit of resampling in terms of convergence rates (see also the discussions in section 5.2).

5.4 Consistency of the Overall Procedure. Here we show the consistency of the overall procedure in KMCF. This is based on corollary 2, which shows the consistency of the resampling step followed by the prediction step, and on theorem 5 of Fukumizu et al. (2013), which guarantees the consistency of kernel Bayes' rule in the correction step. Thus we consider three steps in the following order: resampling, prediction, and correction. More specifically, we show the consistency of the estimator, equation 4.6, of the posterior kernel mean at time $t$, given that the one at time $t-1$ is consistent.

To state our assumptions, we will need the following functions, $\theta_{\mathrm{pos}}: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, $\theta_{\mathrm{obs}}: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, and $\theta_{\mathrm{tra}}: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$:

$$\theta_{\mathrm{pos}}(y, \tilde{y}) = \int\!\!\int k_{\mathcal{X}}(x_t, \tilde{x}_t)\, dp(x_t|y_{1:t-1}, y_t = y)\, dp(\tilde{x}_t|y_{1:t-1}, y_t = \tilde{y}), \tag{5.12}$$
$$\theta_{\mathrm{obs}}(x, \tilde{x}) = \int\!\!\int k_{\mathcal{Y}}(y_t, \tilde{y}_t)\, dp(y_t|x_t = x)\, dp(\tilde{y}_t|x_t = \tilde{x}), \tag{5.13}$$
$$\theta_{\mathrm{tra}}(x, \tilde{x}) = \int\!\!\int k_{\mathcal{X}}(x_t, \tilde{x}_t)\, dp(x_t|x_{t-1} = x)\, dp(\tilde{x}_t|x_{t-1} = \tilde{x}). \tag{5.14}$$

These functions contain the information concerning the distributions involved. In equation 5.12, the distribution $p(x_t|y_{1:t-1}, y_t = y)$ denotes the posterior of the state at time $t$, given that the observation at time $t$ is $y_t = y$; similarly, $p(\tilde{x}_t|y_{1:t-1}, y_t = \tilde{y})$ is the posterior at time $t$, given that the observation is $y_t = \tilde{y}$. In equation 5.13, the distributions $p(y_t|x_t = x)$ and $p(\tilde{y}_t|x_t = \tilde{x})$ denote the observation model when the state is $x_t = x$ or $x_t = \tilde{x}$, respectively. In equation 5.14, the distributions $p(x_t|x_{t-1} = x)$ and $p(\tilde{x}_t|x_{t-1} = \tilde{x})$ denote the transition model with the previous state given by $x_{t-1} = x$ or $x_{t-1} = \tilde{x}$, respectively.

For simplicity of presentation, we consider here $N = \ell = n$ for the resampling step. Below we denote by $\mathcal{F} \otimes \mathcal{G}$ the tensor product space of two RKHSs $\mathcal{F}$ and $\mathcal{G}$.

Corollary 3. Let $(X_1, Y_1), \dots, (X_n, Y_n)$ be an i.i.d. sample with a joint density $p(x, y) = p(y|x) q(x)$, where $p(y|x)$ is the observation model. Assume that the posterior $p(x_t|y_{1:t})$ has a density $p$ and that $\sup_{x \in \mathcal{X}} p(x)/q(x) < \infty$. Assume that the functions defined by equations 5.12 to 5.14 satisfy $\theta_{\mathrm{pos}} \in \mathcal{H}_{\mathcal{Y}} \otimes \mathcal{H}_{\mathcal{Y}}$, $\theta_{\mathrm{obs}} \in \mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}$, and $\theta_{\mathrm{tra}} \in \mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}$, respectively. Suppose that $\|\hat{m}_{x_{t-1}|y_{1:t-1}} - m_{x_{t-1}|y_{1:t-1}}\|_{\mathcal{H}_{\mathcal{X}}} \to 0$ as $n \to \infty$ in probability. Then, for any sufficiently slow decay of regularization constants $\varepsilon_n$ and $\delta_n$ of algorithm 1, we have

$$\|\hat{m}_{x_t|y_{1:t}} - m_{x_t|y_{1:t}}\|_{\mathcal{H}_{\mathcal{X}}} \to 0 \quad (n \to \infty),$$

in probability.

Corollary 3 follows from theorem 5 of Fukumizu et al. (2013) and corollary 2. The assumptions $\theta_{\mathrm{pos}} \in \mathcal{H}_{\mathcal{Y}} \otimes \mathcal{H}_{\mathcal{Y}}$ and $\theta_{\mathrm{obs}} \in \mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}$ are due to theorem 5 of Fukumizu et al. (2013) for the correction step, while the assumption $\theta_{\mathrm{tra}} \in \mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}$ is due to theorem 1 for the prediction step, from which corollary 2 follows. As we discussed in note 4 of section 5.1, these essentially assume that the functions $\theta_{\mathrm{pos}}$, $\theta_{\mathrm{obs}}$, and $\theta_{\mathrm{tra}}$ are smooth. Theorem 5 of Fukumizu et al. (2013) also requires that the regularization constants $\varepsilon_n, \delta_n$ of kernel Bayes' rule decay sufficiently slowly as the sample size goes to infinity ($\varepsilon_n, \delta_n \to 0$ as $n \to \infty$). (For details, see sections 5.2 and 6.2 in Fukumizu et al., 2013.)

It would be more interesting to investigate the convergence rates of the overall procedure. However, this requires a refined theoretical analysis of kernel Bayes' rule, which is beyond the scope of this letter. This is because currently there is no theoretical result on convergence rates of kernel Bayes' rule as an estimator of a posterior kernel mean (existing convergence results are for the expectation of function values; see theorems 6 and 7 in Fukumizu et al., 2013). This remains a topic for future research.

6 Experiments

This section is devoted to experiments. In section 6.1, we conduct basic experiments on the prediction and resampling steps before going on to the filtering problem; here we consider the problem described in section 5. In section 6.2, the proposed KMCF (see algorithm 3) is applied to synthetic state-space models. Comparisons are made with existing methods applicable to the setting of this letter (see also section 2). In section 6.3, we apply KMCF to the real problem of vision-based robot localization.

In the following, $\mathcal{N}(\mu, \sigma^2)$ denotes the gaussian distribution with mean $\mu \in \mathbb{R}$ and variance $\sigma^2 > 0$.

6.1 Sampling and Resampling Procedures. The purpose here is to see how the prediction and resampling steps work empirically. To this end, we consider the problem described in section 5 with $\mathcal{X} = \mathbb{R}$ (see section 5.1 for details). Specifications of the problem are described below.


We will need to evaluate the errors $\|\hat{m}_P - m_P\|_{\mathcal{H}_{\mathcal{X}}}$ and $\|\hat{m}_Q - m_Q\|_{\mathcal{H}_{\mathcal{X}}}$, so we need to know the true kernel means $m_P$ and $m_Q$. To this end, we define the distributions and the kernel to be gaussian; this allows us to obtain analytic expressions for $m_P$ and $m_Q$.

6.1.1 Distributions and Kernel. More specifically, we define the marginal $P$ and the conditional distribution $p(\cdot|x)$ to be gaussian: $P = \mathcal{N}(0, \sigma_P^2)$ and $p(\cdot|x) = \mathcal{N}(x, \sigma_{\mathrm{cond}}^2)$. Then the resulting $Q = \int p(\cdot|x)\, dP(x)$ also becomes gaussian: $Q = \mathcal{N}(0, \sigma_P^2 + \sigma_{\mathrm{cond}}^2)$. We define $k_{\mathcal{X}}$ to be the gaussian kernel, $k_{\mathcal{X}}(x, x') = \exp(-(x - x')^2 / 2\gamma^2)$. We set $\sigma_P = \sigma_{\mathrm{cond}} = \gamma = 0.1$.

6.1.2 Kernel Means. Due to the convolution theorem of gaussian functions, the kernel means $m_P = \int k_{\mathcal{X}}(\cdot, x)\, dP(x)$ and $m_Q = \int k_{\mathcal{X}}(\cdot, x)\, dQ(x)$ can be analytically computed:

$$m_P(x) = \sqrt{\frac{\gamma^2}{\sigma_P^2 + \gamma^2}} \exp\!\left( -\frac{x^2}{2(\gamma^2 + \sigma_P^2)} \right), \qquad m_Q(x) = \sqrt{\frac{\gamma^2}{\sigma_P^2 + \sigma_{\mathrm{cond}}^2 + \gamma^2}} \exp\!\left( -\frac{x^2}{2(\sigma_P^2 + \sigma_{\mathrm{cond}}^2 + \gamma^2)} \right).$$
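These closed forms are easy to sanity-check numerically. The sketch below compares $m_P(x)$ with a Monte Carlo average of $k_{\mathcal{X}}(x, X)$, $X \sim P$, under the parameter values used here; the sample size and the query point are arbitrary choices of ours.

```python
import numpy as np

sigma_P = sigma_cond = gamma = 0.1
rng = np.random.default_rng(0)

def m_P(x):
    """Analytic kernel mean of P = N(0, sigma_P^2) under the gaussian kernel."""
    return np.sqrt(gamma**2 / (sigma_P**2 + gamma**2)) * \
        np.exp(-x**2 / (2 * (gamma**2 + sigma_P**2)))

# Monte Carlo approximation of m_P(x) = E_{X ~ P}[ k(x, X) ]
X = sigma_P * rng.standard_normal(100_000)
x = 0.05
mc = np.mean(np.exp(-(x - X)**2 / (2 * gamma**2)))
print(m_P(x), mc)   # the two values should agree to a few decimal places
```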

6.1.3 Empirical Estimates. We artificially defined an estimate $\hat{m}_P = \sum_{i=1}^{n} w_i k_{\mathcal{X}}(\cdot, X_i)$ as follows. First, we generated $n = 100$ samples $X_1, \dots, X_{100}$ from a uniform distribution on $[-A, A]$ with some $A > 0$ (specified below). We computed the weights $w_1, \dots, w_n$ by solving an optimization problem,

$$\min_{w \in \mathbb{R}^n} \left\| \sum_{i=1}^{n} w_i k_{\mathcal{X}}(\cdot, X_i) - m_P \right\|^2_{\mathcal{H}_{\mathcal{X}}} + \lambda \|w\|^2,$$

and then applied normalization so that $\sum_{i=1}^{n} w_i = 1$. Here $\lambda > 0$ is a regularization constant, which allows us to control the trade-off between the error $\|\hat{m}_P - m_P\|^2_{\mathcal{H}_{\mathcal{X}}}$ and the quantity $\sum_{i=1}^{n} w_i^2 = \|w\|^2$. If $\lambda$ is very small, the resulting $\hat{m}_P$ becomes accurate ($\|\hat{m}_P - m_P\|^2_{\mathcal{H}_{\mathcal{X}}}$ is small) but has large $\sum_{i=1}^{n} w_i^2$. If $\lambda$ is large, the error $\|\hat{m}_P - m_P\|^2_{\mathcal{H}_{\mathcal{X}}}$ may not be very small, but $\sum_{i=1}^{n} w_i^2$ becomes small. This enables us to see how the error $\|\hat{m}_Q - m_Q\|^2_{\mathcal{H}_{\mathcal{X}}}$ changes as we vary these quantities.
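Since $\|\sum_i w_i k(\cdot, X_i) - m_P\|^2_{\mathcal{H}_{\mathcal{X}}} = w^{\mathsf{T}} K w - 2 w^{\mathsf{T}} z + \|m_P\|^2$ with $K_{ij} = k(X_i, X_j)$ and $z_i = m_P(X_i)$, the minimizer of the regularized objective is a linear solve, $(K + \lambda I) w = z$. A sketch under this observation (the constant $\|m_P\|^2$ term does not affect the minimizer and is dropped); function names are ours.

```python
import numpy as np

def fit_weights(X, m_P_at, kernel, lam):
    """Solve min_w || sum_i w_i k(., X_i) - m_P ||_H^2 + lam ||w||^2,
    then normalize the weights to sum to one.

    X      : (n,) sample locations
    m_P_at : function returning m_P(x) (e.g., the analytic kernel mean above)
    kernel : function k(x, x')
    lam    : regularization constant lambda > 0
    """
    n = len(X)
    K = kernel(X[:, None], X[None, :])            # Gram matrix K_ij = k(X_i, X_j)
    z = m_P_at(X)                                 # z_i = m_P(X_i) = <k(., X_i), m_P>
    w = np.linalg.solve(K + lam * np.eye(n), z)   # minimizer of the regularized objective
    return w / np.sum(w)
```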

6.1.4 Comparison. Given $\hat{m}_P = \sum_{i=1}^{n} w_i k_{\mathcal{X}}(\cdot, X_i)$, we wish to estimate the kernel mean $m_Q$. We compare three estimators:

• woRes: Estimate $m_Q$ without resampling. Generate samples $X'_i \sim p(\cdot|X_i)$ to produce the estimate $\hat{m}_Q = \sum_{i=1}^{n} w_i k_{\mathcal{X}}(\cdot, X'_i)$. This corresponds to the estimator discussed in section 5.1.

• Res-KH: First apply the resampling algorithm of algorithm 2 to $\hat{m}_P$, yielding $\bar{X}_1, \dots, \bar{X}_n$. Then generate $X'_i \sim p(\cdot|\bar{X}_i)$ for each $\bar{X}_i$, giving the estimate $\bar{m}_Q = \frac{1}{n} \sum_{i=1}^{n} k(\cdot, X'_i)$. This is the estimator discussed in section 5.2.

• Res-Trunc: Instead of algorithm 2, first truncate negative weights in $w_1, \dots, w_n$ to be 0, and apply normalization to make the sum of the weights be 1. Then apply the multinomial resampling algorithm of particle methods, and estimate $m_Q$ as in Res-KH.

Figure 4: Results of the experiments from section 6.1. (Top left and right) Sample-weight pairs of $\hat{m}_P = \sum_{i=1}^{n} w_i k_{\mathcal{X}}(\cdot, X_i)$ and $\hat{m}_Q = \sum_{i=1}^{n} w_i k(\cdot, X'_i)$. (Middle left and right) Histogram of samples $\bar{X}_1, \dots, \bar{X}_n$ generated by algorithm 2 and that of samples $X'_1, \dots, X'_n$ from the conditional distribution. (Bottom left and right) Histogram of samples generated with multinomial resampling after truncating negative weights and that of samples from the conditional distribution.

6.1.5 Demonstration. Before starting quantitative comparisons, we demonstrate how the above estimators work. Figure 4 shows demonstration results with $A = 1$. First, note that for $\hat{m}_P = \sum_{i=1}^{n} w_i k(\cdot, X_i)$, samples associated with large weights are located around the mean of $P$, as the standard deviation of $P$ is relatively small ($\sigma_P = 0.1$). Note also that some of the weights are negative. In this example, the error of $\hat{m}_P$ is very small, $\|\hat{m}_P - m_P\|^2_{\mathcal{H}_{\mathcal{X}}} = 8.49 \times 10^{-10}$, while that of the estimate $\hat{m}_Q$ given by woRes is $\|\hat{m}_Q - m_Q\|^2_{\mathcal{H}_{\mathcal{X}}} = 0.125$. This shows that even if $\|\hat{m}_P - m_P\|^2_{\mathcal{H}_{\mathcal{X}}}$ is very small, the resulting $\|\hat{m}_Q - m_Q\|^2_{\mathcal{H}_{\mathcal{X}}}$ may not be small, as implied by theorem 1 and the bound, equation 5.5.

We can observe the following. First, algorithm 2 successfully discarded samples associated with very small weights: almost all the generated samples $\bar{X}_1, \dots, \bar{X}_n$ are located in $[-2\sigma_P, 2\sigma_P]$, where $\sigma_P$ is the standard deviation of $P$. The error is $\|\bar{m}_P - m_P\|^2_{\mathcal{H}_{\mathcal{X}}} = 4.74 \times 10^{-5}$, which is greater than $\|\hat{m}_P - m_P\|^2_{\mathcal{H}_{\mathcal{X}}}$. This is due to the additional error caused by the resampling algorithm. Note that the resulting estimate $\bar{m}_Q$ has error $\|\bar{m}_Q - m_Q\|^2_{\mathcal{H}_{\mathcal{X}}} = 0.00827$. This is much smaller than the estimate $\hat{m}_Q$ by woRes, showing the merit of the resampling algorithm.

Res-Trunc first truncated the negative weights in $w_1, \dots, w_n$. Let us see the region where the density of $P$ is very small: the region outside $[-2\sigma_P, 2\sigma_P]$. We can observe that the absolute values of the weights are very small in this region. Note that there exist positive and negative weights. These weights maintain balance such that the amounts of positive and negative values are almost the same. Therefore, the truncation of the negative weights breaks this balance. As a result, the amount of the positive weights surpasses the amount needed to represent the density of $P$. This can be seen from the histogram for Res-Trunc: some of the samples generated by Res-Trunc are located in the region where the density of $P$ is very small. Thus the resulting error $\|\bar{m}_P - m_P\|^2_{\mathcal{H}_{\mathcal{X}}} = 0.0538$ is much larger than that of Res-KH. This demonstrates why the resampling algorithm of particle methods is not appropriate for kernel mean embeddings, as discussed in section 4.2.

6.1.6 Effects of the Sum of Squared Weights. The purpose here is to see how the error $\|\hat{m}_Q - m_Q\|^2_{\mathcal{H}_{\mathcal{X}}}$ changes as we vary the quantity $\sum_{i=1}^{n} w_i^2$ (recall that the bound, equation 5.5, indicates that $\|\hat{m}_Q - m_Q\|^2_{\mathcal{H}_{\mathcal{X}}}$ increases as $\sum_{i=1}^{n} w_i^2$ increases). To this end, we made $\hat{m}_P = \sum_{i=1}^{n} w_i k_{\mathcal{X}}(\cdot, X_i)$ for several values of the regularization constant $\lambda$, as described above. For each $\lambda$, we constructed $\hat{m}_P$ and estimated $m_Q$ using each of the three estimators above. We repeated this 20 times for each $\lambda$ and averaged the values of $\|\hat{m}_P - m_P\|^2_{\mathcal{H}_{\mathcal{X}}}$, $\sum_{i=1}^{n} w_i^2$, and the errors $\|\hat{m}_Q - m_Q\|^2_{\mathcal{H}_{\mathcal{X}}}$ by the three estimators. Figure 5 shows these results, where both axes are in the log scale. Here we used $A = 5$ for the support of the uniform distribution (see note 7). The results are summarized as follows.

Note 7: This enables us to maintain the values for $\|\hat{m}_P - m_P\|^2_{\mathcal{H}_{\mathcal{X}}}$ in almost the same amount while changing the values for $\sum_{i=1}^{n} w_i^2$.


Figure 5: Results of synthetic experiments for the sampling and resampling procedure in section 6.1. Vertical axis: errors in the squared RKHS norm. Horizontal axis: values of $\sum_{i=1}^{n} w_i^2$ for different $\hat{m}_P$. Black: the error of $\hat{m}_P$ ($\|\hat{m}_P - m_P\|^2_{\mathcal{H}_{\mathcal{X}}}$). Blue, green, and red: the errors on $m_Q$ by woRes, Res-KH, and Res-Trunc, respectively.

• The error of woRes (blue) increases proportionally to the amount of $\sum_{i=1}^{n} w_i^2$. This matches the bound, equation 5.5.

• The error of Res-KH is not affected by $\sum_{i=1}^{n} w_i^2$. Rather, it changes in parallel with the error of $\hat{m}_P$. This is explained by the discussions in section 5.2 on how our resampling algorithm improves the accuracy of the sampling procedure.

• Res-Trunc is worse than Res-KH, especially for large $\sum_{i=1}^{n} w_i^2$. This is also explained by the bound, equation 5.8. Here $\bar{m}_P$ is the one given by Res-Trunc, so the error $\|\bar{m}_P - m_P\|_{\mathcal{H}_{\mathcal{X}}}$ can be large due to the truncation of negative weights, as shown in the demonstration results. This makes the resulting error $\|\bar{m}_Q - m_Q\|_{\mathcal{H}_{\mathcal{X}}}$ large.

Note that $m_P$ and $m_Q$ are different kernel means, so it can happen that the errors $\|\bar{m}_Q - m_Q\|_{\mathcal{H}_{\mathcal{X}}}$ by Res-KH are less than $\|\hat{m}_P - m_P\|_{\mathcal{H}_{\mathcal{X}}}$, as in Figure 5.


6.2 Filtering with Synthetic State-Space Models. Here we apply KMCF to synthetic state-space models. Comparisons were made with the following methods.

kNN-PF (Vlassis et al., 2002). This method uses k-NN-based conditional density estimation (Stone, 1977) for learning the observation model. First, it estimates the conditional density of the inverse direction, $p(x|y)$, from the training sample $\{(X_i, Y_i)\}$. The learned conditional density is then used as an alternative for the likelihood $p(y_t|x_t)$; this is a heuristic to deal with high-dimensional $y_t$. Then it applies particle filter (PF) based on the approximated observation model and the given transition model $p(x_t|x_{t-1})$.

GP-PF (Ferris et al., 2006). This method learns $p(y_t|x_t)$ from $\{(X_i, Y_i)\}$ with gaussian process (GP) regression. Then particle filter is applied based on the learned observation model and the transition model. We used the open source code for GP regression in this experiment (see note 8), so comparison in computational time is omitted for this method.

KBR filter (Fukumizu et al., 2011, 2013). This method is also based on kernel mean embeddings, as is KMCF. It applies kernel Bayes' rule (KBR) in posterior estimation, using the joint sample $\{(X_i, Y_i)\}$. This method assumes that there also exist training samples for the transition model. Thus, in the following experiments, we additionally drew training samples for the transition model. Fukumizu et al. (2011, 2013) showed that this method outperforms extended and unscented Kalman filters when a state-space model has strong nonlinearity (in that experiment, these Kalman filters were given full knowledge of the state-space model). We use this method as a baseline.

We used the state-space models defined in Table 2, where "SSM" stands for "state-space model." In Table 2, $u_t$ denotes a control input at time $t$; $v_t$ and $w_t$ denote independent gaussian noise, $v_t, w_t \sim \mathcal{N}(0, 1)$; and $W_t$ denotes 10-dimensional gaussian noise, $W_t \sim \mathcal{N}(0, I_{10})$. We generated each control $u_t$ randomly from the gaussian distribution $\mathcal{N}(0, 1)$.

The state and observation spaces for SSMs 1a, 1b, 2a, 2b, 4a, 4b are defined as $\mathcal{X} = \mathcal{Y} = \mathbb{R}$; for SSMs 3a, 3b, $\mathcal{X} = \mathbb{R}$ and $\mathcal{Y} = \mathbb{R}^{10}$. The models in SSMs 1a, 2a, 3a, 4a and SSMs 1b, 2b, 3b, 4b with the same number (e.g., 1a and 1b) are almost the same; the difference is whether $u_t$ exists in the transition model. Prior distributions for the initial state $x_1$ for SSMs 1a, 1b, 2a, 2b, 3a, 3b are defined as $p_{\mathrm{init}} = \mathcal{N}(0, 1/(1 - 0.9^2))$, and those for 4a, 4b are defined as a uniform distribution on $[-3, 3]$.

SSMs 1a and 1b are linear gaussian models. SSMs 2a and 2b are the so-called stochastic volatility models. Their transition models are the same as those of SSMs 1a and 1b. The observation model has strong nonlinearity, and the noise $w_t$ is multiplicative.

Note 8: http://www.gaussianprocess.org/gpml/code/matlab/doc/


Table 2: State-Space Models for Synthetic Experiments.

SSM 1a: transition $x_t = 0.9 x_{t-1} + v_t$; observation $y_t = x_t + w_t$.
SSM 1b: transition $x_t = 0.9 x_{t-1} + \frac{1}{\sqrt{2}}(u_t + v_t)$; observation $y_t = x_t + w_t$.
SSM 2a: transition $x_t = 0.9 x_{t-1} + v_t$; observation $y_t = 0.5 \exp(x_t/2)\, w_t$.
SSM 2b: transition $x_t = 0.9 x_{t-1} + \frac{1}{\sqrt{2}}(u_t + v_t)$; observation $y_t = 0.5 \exp(x_t/2)\, w_t$.
SSM 3a: transition $x_t = 0.9 x_{t-1} + v_t$; observation $y_t = 0.5 \exp(x_t/2)\, W_t$.
SSM 3b: transition $x_t = 0.9 x_{t-1} + \frac{1}{\sqrt{2}}(u_t + v_t)$; observation $y_t = 0.5 \exp(x_t/2)\, W_t$.
SSM 4a: transition $a_t = x_{t-1} + \sqrt{2}\, v_t$, with $x_t = a_t$ if $|a_t| \le 3$ and $x_t = -3$ otherwise; observation $b_t = x_t + w_t$, with $y_t = b_t$ if $|b_t| \le 3$ and $y_t = b_t - 6 b_t/|b_t|$ otherwise.
SSM 4b: transition $a_t = x_{t-1} + u_t + v_t$, with $x_t = a_t$ if $|a_t| \le 3$ and $x_t = -3$ otherwise; observation $b_t = x_t + w_t$, with $y_t = b_t$ if $|b_t| \le 3$ and $y_t = b_t - 6 b_t/|b_t|$ otherwise.

SSMs 3a and 3b are almost the same as SSMs 2a and 2b; the difference is that the observation $y_t$ is 10-dimensional, as $W_t$ is 10-dimensional gaussian noise. SSMs 4a and 4b are more complex than the other models. Both the transition and observation models have strong nonlinearities: states and observations located around the edges of the interval $[-3, 3]$ may abruptly jump to distant places.
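For concreteness, the following sketch simulates SSM 4b from Table 2 and generates state-observation pairs of the kind used as training data; the function name and the seed are ours, and the jump rules follow the table's definition.

```python
import numpy as np

def simulate_ssm4b(T, rng=np.random.default_rng(0)):
    """Simulate SSM 4b: nonlinear transition/observation with jumps at the edges of [-3, 3]."""
    x = rng.uniform(-3, 3)                           # prior: uniform on [-3, 3]
    xs, ys, us = [], [], []
    for _ in range(T):
        u = rng.standard_normal()                    # control u_t ~ N(0, 1)
        a = x + u + rng.standard_normal()            # a_t = x_{t-1} + u_t + v_t
        x = a if abs(a) <= 3 else -3.0               # transition
        b = x + rng.standard_normal()                # b_t = x_t + w_t
        y = b if abs(b) <= 3 else b - 6 * np.sign(b) # observation
        xs.append(x); ys.append(y); us.append(u)
    return np.array(xs), np.array(ys), np.array(us)

states, obs, controls = simulate_ssm4b(T=100)
```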

For each model, we generated the training samples $\{(X_i, Y_i)\}_{i=1}^n$ by simulating the model. Test data $\{(x_t, y_t)\}_{t=1}^T$ were also generated by independent simulation (recall that $x_t$ is hidden for each method). The length of the test sequence was set as $T = 100$. We fixed the number of particles in kNN-PF and GP-PF to 5000; in primary experiments, we did not observe any improvements even when more particles were used. For the same reason, we fixed the size of transition examples for the KBR filter to 1000. Each method estimated the ground-truth states $x_1, \dots, x_T$ by estimating the posterior means $\int x_t\, p(x_t|y_{1:t})\, dx_t$ $(t = 1, \dots, T)$. The performance was evaluated with the RMSE (root mean squared error) of the point estimates, defined as

$$\mathrm{RMSE} = \sqrt{ \frac{1}{T} \sum_{t=1}^{T} (\hat{x}_t - x_t)^2 },$$

where $\hat{x}_t$ is the point estimate.

For KMCF and the KBR filter, we used gaussian kernels for each of $\mathcal{X}$ and $\mathcal{Y}$ (and also for controls in the KBR filter). We determined the hyperparameters of each method by two-fold cross-validation, dividing the training data into two sequences. The hyperparameters in the GP regressor for GP-PF were optimized by maximizing the marginal likelihood of the training data. To reduce the costs of the resampling step of KMCF, we used the method discussed in section 5.2 with $\ell = 50$. We also used the low-rank approximation method (see algorithm 5) and the subsampling method (see algorithm 6) in appendix C to reduce the computational costs of KMCF. Specifically, we used $r = 10, 20$ (rank of low-rank matrices) for algorithm 5 (described as KMCF-low10 and KMCF-low20 in the results below) and $r = 50, 100$ (number of subsamples) for algorithm 6 (described as KMCF-sub50 and KMCF-sub100). We repeated experiments 20 times for each different training sample size $n$.

Figure 6: RMSE of the synthetic experiments in section 6.2. The state-space models of these figures have no control in their transition models.

Figure 6 shows the results in RMSE for SSMs 1a, 2a, 3a, 4a, and Figure 7 shows those for SSMs 1b, 2b, 3b, 4b. Figure 8 describes the results in computation time for SSMs 1a and 1b; the results for the other models are similar, so we omit them. We do not show the results of KMCF-low10 in Figures 6 and 7, since they were numerically unstable and gave very large RMSEs.

GP-PF performed the best for SSMs 1a and 1b. This may be because these models fit the assumptions of GP regression, as their noise is additive gaussian. For the other models, however, GP-PF performed poorly: the observation models of these models have strong nonlinearities, and the noise is not additive gaussian. For these models, KMCF performed the best or competitively with the other methods. This indicates that KMCF successfully exploits the state-observation examples $\{(X_i, Y_i)\}_{i=1}^n$ in dealing with the complicated observation models. Recall that our focus has been on situations where the relations between states and observations are so complicated that the observation model is not known; the results indicate that KMCF is promising for such situations. On the other hand, the KBR filter performed worse than KMCF for most of the models. The KBR filter also uses kernel Bayes' rule, as does KMCF; the difference is that KMCF makes use of the transition models directly by sampling, while the KBR filter must learn the transition models from training data for state transitions. This indicates that the incorporation of the knowledge expressed in the transition model is very important for the filtering performance. This can also be seen by comparing Figures 6 and 7: the performance of the methods other than the KBR filter improved for SSMs 1b, 2b, 3b, 4b, compared to the performance for the corresponding models in SSMs 1a, 2a, 3a, 4a. Recall that SSMs 1b, 2b, 3b, 4b include control $u_t$ in their transition models. The information of control input is helpful for filtering in general. Thus the improvements suggest that KMCF, kNN-PF, and GP-PF successfully incorporate the information of controls; they achieve this by sampling with $p(x_t|x_{t-1}, u_t)$. On the other hand, the KBR filter must learn the transition model $p(x_t|x_{t-1}, u_t)$; this can be harder than learning the transition model $p(x_t|x_{t-1})$, which has no control input.

Figure 7: RMSE of synthetic experiments in section 6.2. The state-space models of these figures include control $u_t$ in their transition models.

Figure 8: Computation time of synthetic experiments in section 6.2. (Left) SSM 1a. (Right) SSM 1b.

We next compare computation time (see Figure 8). KMCF was competitive with, or even slower than, the KBR filter. This is due to the resampling step in KMCF. The speed-up methods (KMCF-low10, KMCF-low20, KMCF-sub50, and KMCF-sub100) successfully reduced the costs of KMCF. KMCF-low10 and KMCF-low20 scaled linearly with the sample size $n$; this matches the fact that algorithm 5 reduces the costs of kernel Bayes' rule to $O(nr^2)$. The costs of KMCF-sub50 and KMCF-sub100 remained almost the same over the different sample sizes. This is because they reduce the sample size itself from $n$ to $r$, so the costs are reduced to $O(r^3)$ (see algorithm 6). KMCF-sub50 and KMCF-sub100 are competitive with kNN-PF, which is fast as it only needs kNN searches to deal with the training sample $\{(X_i, Y_i)\}_{i=1}^n$. In Figures 6 and 7, KMCF-low20 and KMCF-sub100 produced results competitive with KMCF for SSMs 1a, 2a, 4a, 1b, 2b, 4b. Thus, for these models, such methods reduce the computational costs of KMCF without losing much accuracy. KMCF-sub50 was slightly worse than KMCF-sub100. This indicates that the number of subsamples cannot be reduced to this extent if we wish to maintain accuracy. For SSMs 3a and 3b, the performance of KMCF-low20 and KMCF-sub100 was worse than KMCF, in contrast to the performance for the other models. The difference of SSMs 3a and 3b from the other models is that the observation space is 10-dimensional, $\mathcal{Y} = \mathbb{R}^{10}$. This suggests that if the dimension is high, $r$ needs to be large to maintain accuracy (recall that $r$ is the rank of the low-rank matrices in algorithm 5 and the number of subsamples in algorithm 6). This is also implied by the experiments in section 6.3.

6.3 Vision-Based Mobile Robot Localization. We applied KMCF to the problem of vision-based mobile robot localization (Vlassis et al., 2002; Wolf et al., 2005; Quigley et al., 2010). We consider a robot moving in a building. The robot takes images with its vision camera as it moves; thus the vision images form a sequence of observations $y_1, \dots, y_T$ in time series, each $y_t$ being an image. The robot does not know its positions in the building; we define the state $x_t$ as the robot's position at time $t$. The robot wishes to estimate its position $x_t$ from the sequence of its vision images $y_1, \dots, y_t$. This can be done by filtering, that is, by estimating the posteriors $p(x_t|y_1, \dots, y_t)$ $(t = 1, \dots, T)$. This is the robot localization problem. It is fundamental in robotics as a basis for more involved applications such as navigation and reinforcement learning (Thrun, Burgard, & Fox, 2005).

The state-space model is defined as follows. The observation model $p(y_t|x_t)$ is the conditional distribution of images given position, which is very complicated and considered unknown. We need to assume position-image examples $\{(X_i, Y_i)\}_{i=1}^n$; these samples are given in the data set described below. The transition model $p(x_t|x_{t-1}) = p(x_t|x_{t-1}, u_t)$ is the conditional distribution of the current position given the previous one. This involves a control input $u_t$ that specifies the movement of the robot. In the data set we use, the control is given as odometry measurements. Thus we define $p(x_t|x_{t-1}, u_t)$ as the odometry motion model, which is fairly standard in robotics (Thrun et al., 2005). Specifically, we used the algorithm described in table 5.6 of Thrun et al. (2005), with all of its parameters fixed to 0.1. The prior $p_{\mathrm{init}}$ of the initial position $x_1$ is defined as a uniform distribution over the samples $X_1, \dots, X_n$ in $\{(X_i, Y_i)\}_{i=1}^n$.

As a kernel $k_{\mathcal{Y}}$ for observations (images), we used the spatial pyramid matching kernel of Lazebnik et al. (2006). This is a positive-definite kernel developed in the computer vision community and is also fairly standard. Specifically, we set the parameters of this kernel as suggested in Lazebnik et al. (2006); this gives a 4200-dimensional histogram for each image. We defined the kernel $k_{\mathcal{X}}$ for states (positions) as gaussian. Here the state space is the four-dimensional space $\mathcal{X} = \mathbb{R}^4$: two dimensions for location and the rest for the orientation of the robot (see note 9).

The data set we used is the COLD database (Pronobis & Caputo, 2009), which is publicly available. Specifically, we used the data set Freiburg, Part A, Path 1, cloudy. This data set consists of three similar trajectories of a robot moving in a building, each of which provides position-image pairs $\{(x_t, y_t)\}_{t=1}^T$. We used two trajectories for training and validation, and the rest for test.

Note 9: We projected the robot's orientation in $[0, 2\pi]$ onto the unit circle in $\mathbb{R}^2$.
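A small sketch of the state representation used here: location plus orientation, with the angle mapped onto the unit circle (note 9) so that the gaussian kernel on $\mathbb{R}^4$ respects the circular topology of the heading. The function names are ours.

```python
import numpy as np

def encode_state(location_xy, theta):
    """Map a robot pose (x, y, theta), theta in [0, 2*pi], to a point in R^4."""
    return np.array([location_xy[0], location_xy[1], np.cos(theta), np.sin(theta)])

def k_state(s1, s2, gamma):
    """Gaussian kernel on the 4-dimensional encoded states."""
    return np.exp(-np.sum((s1 - s2)**2) / (2 * gamma**2))
```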


We made state-observation examples $\{(X_i, Y_i)\}_{i=1}^n$ by randomly subsampling the pairs in the trajectory for training. Note that the difficulty of localization may depend on the time interval (i.e., the interval between $t$ and $t-1$ in seconds). Therefore, we made three test sets (and training samples for state transitions in the KBR filter) with different time intervals: 2.27 sec ($T = 168$), 4.54 sec ($T = 84$), and 6.81 sec ($T = 56$).

In these experiments, we compared KMCF with three methods: kNN-PF, the KBR filter, and the naive method (NAI) defined below. For the KBR filter, we also defined the gaussian kernel on the control $u_t$, that is, on the difference of odometry measurements at times $t-1$ and $t$. The naive method (NAI) estimates the state $x_t$ as the point $X_j$ in the training set $\{(X_i, Y_i)\}$ such that the corresponding observation $Y_j$ is closest to the observation $y_t$. We performed this as a baseline. We also used the spatial pyramid matching kernel for these methods (for kNN-PF and NAI, as a similarity measure for the nearest neighbors search). We did not compare with GP-PF, since it assumes that observations are real vectors and thus cannot be applied to this problem straightforwardly. We determined the hyperparameters in each method by cross-validation. To reduce the cost of the resampling step in KMCF, we used the method discussed in section 5.2 with $\ell = 100$. The low-rank approximation method (see algorithm 5) and the subsampling method (see algorithm 6) were also applied to reduce the computational costs of KMCF. Specifically, we set $r = 50, 100$ for algorithm 5 (described as KMCF-low50 and KMCF-low100 in the results below) and $r = 150, 300$ for algorithm 6 (KMCF-sub150 and KMCF-sub300).

Note that in this problem, the posteriors $p(x_t|y_{1:t})$ can be highly multimodal. This is because similar images appear in distant locations. Therefore, the posterior mean $\int x_t\, p(x_t|y_{1:t})\, dx_t$ is not appropriate for point estimation of the ground-truth position $x_t$. Thus, for KMCF and the KBR filter, we employed the heuristic for mode estimation explained in section 4.4. For kNN-PF, we used the particle with maximum weight for the point estimation. We evaluated the performance of each method by the RMSE of location estimates. We ran each experiment 20 times for each training set of different size.

6.3.1 Results. First, we demonstrate the behaviors of KMCF with this localization problem. Figures 9 and 10 show iterations of KMCF with $n = 400$ applied to the test data with time interval 6.81 sec. Figure 9 illustrates iterations that produced accurate estimates, while Figure 10 describes situations where location estimation is difficult.

Figure 9: Demonstration results. Each column corresponds to one iteration of KMCF. (Top) Prediction step: histogram of samples for the prior. (Middle) Correction step: weighted samples for the posterior. The blue and red stems indicate positive and negative weights, respectively. The yellow ball represents the ground-truth location $x_t$, and the green diamond the estimated one $\hat{x}_t$. (Bottom) Resampling step: histogram of samples given by the resampling step.

Figures 11 and 12 show the results in RMSE and computation time, respectively. For all the results, KMCF and its variants with the computational reduction methods (KMCF-low50, KMCF-low100, KMCF-sub150, and KMCF-sub300) performed better than the KBR filter. These results show the benefit of directly manipulating the transition models with sampling. KMCF was competitive with kNN-PF for the interval 2.27 sec; note that kNN-PF was originally proposed for the robot localization problem. For the results with the longer time intervals (4.54 sec and 6.81 sec), KMCF outperformed kNN-PF.

We next investigate the effect on KMCF of the methods to reduce computational cost. The performance of KMCF-low100 and KMCF-sub300 is competitive with KMCF; those of KMCF-low50 and KMCF-sub150 degrade as the sample size increases. Note that $r = 50, 100$ for algorithm 5 are larger than those in section 6.2, though the values of the sample size $n$ are also larger than those in section 6.2. Also note that the performance of KMCF-sub150 is much worse than that of KMCF-sub300. These results indicate that we may need large values of $r$ to maintain accuracy for this localization problem. Recall that the spatial pyramid matching kernel gives essentially a high-dimensional feature vector (histogram) for each observation. Thus the observation space $\mathcal{Y}$ may be considered high-dimensional. This supports the hypothesis in section 6.2 that if the dimension is high, the computational cost-reduction methods may require larger $r$ to maintain accuracy.

Figure 10: Demonstration results (see also the caption of Figure 9). Here we show time points where observed images are similar to those in distant places. Such a situation often occurs at corners and makes location estimation difficult. (a) The prior estimate is reasonable, but the resulting posterior has modes in distant places. This makes the location estimate (green diamond) far from the true location (yellow ball). (b) While the location estimate is very accurate, modes also appear at distant locations.

Finally, let us look at the results in computation time (see Figure 12). The results are similar to those in section 6.2. Although the values for $r$ are relatively large, algorithms 5 and 6 successfully reduced the computational costs of KMCF.

7 Conclusion and Future Work

Figure 11: RMSE of the robot localization experiments in section 6.3. Panels a, b, and c show the cases for time intervals 2.27 sec, 4.54 sec, and 6.81 sec, respectively.

This letter has proposed the kernel Monte Carlo filter, a novel filtering method for state-space models. We have considered the situation where the observation model is not known explicitly or even parametrically and where examples of the state-observation relation are given instead of the observation model. Our approach is based on the framework of kernel mean embeddings, which enables us to deal with the observation model in a data-driven manner. The resulting filtering method consists of the prediction, correction, and resampling steps, all realized in terms of kernel mean embeddings. Methodological novelties lie in the prediction and resampling steps. Thus we analyzed their behaviors by deriving error bounds for the estimator of the prediction step. The analysis revealed that the effective sample size of a weighted sample plays an important role, as in particle methods. This analysis also explained how our resampling algorithm works. We applied the proposed method to synthetic and real problems, confirming the effectiveness of our approach.

Figure 12: Computation time of the localization experiments in section 6.3. Panels a, b, and c show the cases for time intervals 2.27 sec, 4.54 sec, and 6.81 sec, respectively. Note that the results show the runtime of each method.

One interesting topic for future research would be parameter estimation for the transition model. We did not discuss this and assumed that parameters, if they exist, are given and fixed. If the state-observation examples $\{(X_i, Y_i)\}_{i=1}^n$ are given as a sequence from the state-space model, then we can use the state samples $X_1, \dots, X_n$ for estimating those parameters. Otherwise, we need to estimate the parameters based on test data. This might be possible by exploiting approaches for parameter estimation in particle methods (e.g., section IV in Cappé et al., 2007).

Another important topic is the situation where the observation model in the test and training phases is different. As discussed in section 4.3, this might be addressed by exploiting the framework of transfer learning (Pan & Yang, 2010). This would require extension of kernel mean embeddings to the setting of transfer learning, since there has been no work in this direction. We consider that such an extension is interesting in its own right.

Appendix A Proof of Theorem 1

Before going to the proof, we review some basic facts that are needed. Let $m_P = \int k_{\mathcal{X}}(\cdot, x)\, dP(x)$ and $\hat{m}_P = \sum_{i=1}^{n} w_i k_{\mathcal{X}}(\cdot, X_i)$. By the reproducing property of the kernel $k_{\mathcal{X}}$, the following hold for any $f \in \mathcal{H}_{\mathcal{X}}$:

$$\langle m_P, f \rangle_{\mathcal{H}_{\mathcal{X}}} = \left\langle \int k_{\mathcal{X}}(\cdot, x)\, dP(x), f \right\rangle_{\mathcal{H}_{\mathcal{X}}} = \int \langle k_{\mathcal{X}}(\cdot, x), f \rangle_{\mathcal{H}_{\mathcal{X}}}\, dP(x) = \int f(x)\, dP(x) = E_{X \sim P}[f(X)], \tag{A.1}$$

$$\langle \hat{m}_P, f \rangle_{\mathcal{H}_{\mathcal{X}}} = \left\langle \sum_{i=1}^{n} w_i k_{\mathcal{X}}(\cdot, X_i), f \right\rangle_{\mathcal{H}_{\mathcal{X}}} = \sum_{i=1}^{n} w_i f(X_i). \tag{A.2}$$

For any $f, g \in \mathcal{H}_{\mathcal{X}}$, we denote by $f \otimes g \in \mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}$ the tensor product of $f$ and $g$, defined as

$$f \otimes g\,(x_1, x_2) = f(x_1) g(x_2), \quad \forall x_1, x_2 \in \mathcal{X}. \tag{A.3}$$

The inner product of the tensor RKHS $\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}$ satisfies

$$\langle f_1 \otimes g_1, f_2 \otimes g_2 \rangle_{\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}} = \langle f_1, f_2 \rangle_{\mathcal{H}_{\mathcal{X}}} \langle g_1, g_2 \rangle_{\mathcal{H}_{\mathcal{X}}}, \quad \forall f_1, f_2, g_1, g_2 \in \mathcal{H}_{\mathcal{X}}. \tag{A.4}$$

Let $\{\phi_s\}_{s=1}^{I} \subset \mathcal{H}_{\mathcal{X}}$ be complete orthonormal bases of $\mathcal{H}_{\mathcal{X}}$, where $I \in \mathbb{N} \cup \{\infty\}$. Assume $\theta \in \mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}$ (recall that this is an assumption of theorem 1). Then $\theta$ is expressed as

$$\theta = \sum_{s,t=1}^{I} \alpha_{st}\, \phi_s \otimes \phi_t, \tag{A.5}$$

with $\sum_{s,t} |\alpha_{st}|^2 < \infty$ (see Aronszajn, 1950).

Proof of Theorem 1. Recall that $\hat{m}_Q = \sum_{i=1}^{n} w_i k_{\mathcal{X}}(\cdot, X'_i)$, where $X'_i \sim p(\cdot|X_i)$ $(i = 1, \dots, n)$. Then

$$E_{X'_1, \dots, X'_n}[\|\hat{m}_Q - m_Q\|^2_{\mathcal{H}_{\mathcal{X}}}] = E_{X'_1, \dots, X'_n}[\langle \hat{m}_Q, \hat{m}_Q \rangle_{\mathcal{H}_{\mathcal{X}}} - 2 \langle \hat{m}_Q, m_Q \rangle_{\mathcal{H}_{\mathcal{X}}} + \langle m_Q, m_Q \rangle_{\mathcal{H}_{\mathcal{X}}}]$$
$$= \sum_{i,j=1}^{n} w_i w_j E_{X'_i, X'_j}[k_{\mathcal{X}}(X'_i, X'_j)] - 2 \sum_{i=1}^{n} w_i E_{\tilde{X}' \sim Q,\, X'_i}[k_{\mathcal{X}}(\tilde{X}', X'_i)] + E_{\tilde{X}', \tilde{X}'' \sim Q}[k_{\mathcal{X}}(\tilde{X}', \tilde{X}'')]$$
$$= \sum_{i \ne j} w_i w_j E_{X'_i, X'_j}[k_{\mathcal{X}}(X'_i, X'_j)] + \sum_{i=1}^{n} w_i^2 E_{X'_i}[k_{\mathcal{X}}(X'_i, X'_i)] - 2 \sum_{i=1}^{n} w_i E_{\tilde{X}' \sim Q,\, X'_i}[k_{\mathcal{X}}(\tilde{X}', X'_i)] + E_{\tilde{X}', \tilde{X}'' \sim Q}[k_{\mathcal{X}}(\tilde{X}', \tilde{X}'')], \tag{A.6}$$

where $\tilde{X}'$ and $\tilde{X}''$ denote independent random variables with distribution $Q$.

Recall that $Q = \int p(\cdot|x)\, dP(x)$ and $\theta(x, \tilde{x}) = \int\!\!\int k_{\mathcal{X}}(x', \tilde{x}')\, dp(x'|x)\, dp(\tilde{x}'|\tilde{x})$. We can then rewrite terms in equation A.6 as

$$E_{\tilde{X}' \sim Q,\, X'_i}[k_{\mathcal{X}}(\tilde{X}', X'_i)] = \int \left( \int\!\!\int k_{\mathcal{X}}(x', x'_i)\, dp(x'|x)\, dp(x'_i|X_i) \right) dP(x) = \int \theta(x, X_i)\, dP(x) = E_{X \sim P}[\theta(X, X_i)],$$
$$E_{\tilde{X}', \tilde{X}'' \sim Q}[k_{\mathcal{X}}(\tilde{X}', \tilde{X}'')] = \int\!\!\int \left( \int\!\!\int k_{\mathcal{X}}(x', \tilde{x}')\, dp(x'|x)\, dp(\tilde{x}'|\tilde{x}) \right) dP(x)\, dP(\tilde{x}) = \int\!\!\int \theta(x, \tilde{x})\, dP(x)\, dP(\tilde{x}) = E_{X, \tilde{X} \sim P}[\theta(X, \tilde{X})].$$

Similarly, $E_{X'_i, X'_j}[k_{\mathcal{X}}(X'_i, X'_j)] = \theta(X_i, X_j)$ for $i \ne j$, and $E_{X'_i, \tilde{X}'_i}[k_{\mathcal{X}}(X'_i, \tilde{X}'_i)] = \theta(X_i, X_i)$, where $\tilde{X}'_i$ is an independent copy of $X'_i$. Thus equation A.6 is equal to

$$\sum_{i=1}^{n} w_i^2 \left( E_{X'_i}[k_{\mathcal{X}}(X'_i, X'_i)] - E_{X'_i, \tilde{X}'_i}[k_{\mathcal{X}}(X'_i, \tilde{X}'_i)] \right) + \sum_{i,j=1}^{n} w_i w_j \theta(X_i, X_j) - 2 \sum_{i=1}^{n} w_i E_{X \sim P}[\theta(X, X_i)] + E_{X, \tilde{X} \sim P}[\theta(X, \tilde{X})]. \tag{A.7}$$

We can rewrite terms in equation A.7 as follows, using the facts A.1, A.2, A.3, A.4, and A.5:

$$\sum_{i,j} w_i w_j \theta(X_i, X_j) = \sum_{i,j} w_i w_j \sum_{s,t} \alpha_{st} \phi_s(X_i) \phi_t(X_j) = \sum_{s,t} \alpha_{st} \sum_i w_i \phi_s(X_i) \sum_j w_j \phi_t(X_j)$$
$$= \sum_{s,t} \alpha_{st} \langle \hat{m}_P, \phi_s \rangle_{\mathcal{H}_{\mathcal{X}}} \langle \hat{m}_P, \phi_t \rangle_{\mathcal{H}_{\mathcal{X}}} = \sum_{s,t} \alpha_{st} \langle \hat{m}_P \otimes \hat{m}_P, \phi_s \otimes \phi_t \rangle_{\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}} = \langle \hat{m}_P \otimes \hat{m}_P, \theta \rangle_{\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}},$$

$$\sum_i w_i E_{X \sim P}[\theta(X, X_i)] = \sum_i w_i E_{X \sim P}\!\left[ \sum_{s,t} \alpha_{st} \phi_s(X) \phi_t(X_i) \right] = \sum_{s,t} \alpha_{st} E_{X \sim P}[\phi_s(X)] \sum_i w_i \phi_t(X_i)$$
$$= \sum_{s,t} \alpha_{st} \langle m_P, \phi_s \rangle_{\mathcal{H}_{\mathcal{X}}} \langle \hat{m}_P, \phi_t \rangle_{\mathcal{H}_{\mathcal{X}}} = \sum_{s,t} \alpha_{st} \langle m_P \otimes \hat{m}_P, \phi_s \otimes \phi_t \rangle_{\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}} = \langle m_P \otimes \hat{m}_P, \theta \rangle_{\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}},$$

$$E_{X, \tilde{X} \sim P}[\theta(X, \tilde{X})] = E_{X, \tilde{X} \sim P}\!\left[ \sum_{s,t} \alpha_{st} \phi_s(X) \phi_t(\tilde{X}) \right] = \sum_{s,t} \alpha_{st} \langle m_P, \phi_s \rangle_{\mathcal{H}_{\mathcal{X}}} \langle m_P, \phi_t \rangle_{\mathcal{H}_{\mathcal{X}}} = \sum_{s,t} \alpha_{st} \langle m_P \otimes m_P, \phi_s \otimes \phi_t \rangle_{\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}} = \langle m_P \otimes m_P, \theta \rangle_{\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}}.$$

Thus equation A.7 is equal to

$$\sum_{i=1}^{n} w_i^2 \left( E_{X'_i}[k_{\mathcal{X}}(X'_i, X'_i)] - E_{X'_i, \tilde{X}'_i}[k_{\mathcal{X}}(X'_i, \tilde{X}'_i)] \right) + \langle \hat{m}_P \otimes \hat{m}_P, \theta \rangle_{\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}} - 2 \langle m_P \otimes \hat{m}_P, \theta \rangle_{\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}} + \langle m_P \otimes m_P, \theta \rangle_{\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}}$$
$$= \sum_{i=1}^{n} w_i^2 \left( E_{X'_i}[k_{\mathcal{X}}(X'_i, X'_i)] - E_{X'_i, \tilde{X}'_i}[k_{\mathcal{X}}(X'_i, \tilde{X}'_i)] \right) + \langle (\hat{m}_P - m_P) \otimes (\hat{m}_P - m_P), \theta \rangle_{\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}}.$$

Finally, the Cauchy-Schwartz inequality gives

$$\langle (\hat{m}_P - m_P) \otimes (\hat{m}_P - m_P), \theta \rangle_{\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}} \le \|\hat{m}_P - m_P\|^2_{\mathcal{H}_{\mathcal{X}}} \|\theta\|_{\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}}.$$

This completes the proof.


Appendix B Proof of Theorem 2

Theorem 2 provides convergence rates for the resampling algorithm, algorithm 4. This theorem assumes that the candidate samples $Z_1, \dots, Z_N$ for resampling are i.i.d. with a density $q$. Here we prove theorem 2 by showing that the same statement holds under weaker assumptions (see theorem 3 below).

We first describe the assumptions. Let $P$ be the distribution of the kernel mean $m_P$, and let $L_2(P)$ be the Hilbert space of square-integrable functions on $\mathcal{X}$ with respect to $P$. For any $f \in L_2(P)$, we write its norm by $\|f\|^2_{L_2(P)} = \int f^2(x)\, dP(x)$.

Assumption 1. The candidate samples $Z_1, \dots, Z_N$ are independent. There are probability distributions $Q_1, \dots, Q_N$ on $\mathcal{X}$ such that for any bounded measurable function $g: \mathcal{X} \to \mathbb{R}$, we have

$$E\!\left[ \frac{1}{N-1} \sum_{j \ne i} g(Z_j) \right] = E_{X \sim Q_i}[g(X)] \quad (i = 1, \dots, N). \tag{B.1}$$

Assumption 2. The distributions $Q_1, \dots, Q_N$ have density functions $q_1, \dots, q_N$, respectively. Define $Q = \frac{1}{N} \sum_{i=1}^{N} Q_i$ and $q = \frac{1}{N} \sum_{i=1}^{N} q_i$. There is a constant $A > 0$ that does not depend on $N$, such that

$$\left\| \frac{q_i}{q} - 1 \right\|^2_{L_2(P)} \le \frac{A}{\sqrt{N}} \quad (i = 1, \dots, N). \tag{B.2}$$

Assumption 3. The distribution $P$ has a density function $p$ such that $\sup_{x \in \mathcal{X}} \frac{p(x)}{q(x)} < \infty$. There is a constant $\sigma > 0$ such that

$$\sqrt{N} \left( \frac{1}{N} \sum_{i=1}^{N} \frac{p(Z_i)}{q(Z_i)} - 1 \right) \xrightarrow{D} \mathcal{N}(0, \sigma^2), \tag{B.3}$$

where $\xrightarrow{D}$ denotes convergence in distribution and $\mathcal{N}(0, \sigma^2)$ the normal distribution with mean 0 and variance $\sigma^2$.

These assumptions are weaker than those in theorem 2, which require that $Z_1, \dots, Z_N$ be i.i.d. For example, assumption 1 is clearly satisfied for the i.i.d. case, since in this case we have $Q = Q_1 = \cdots = Q_N$. The inequality, equation B.2, in assumption 2 requires that the distributions $Q_1, \dots, Q_N$ get similar as the sample size increases. This is also satisfied under the i.i.d. assumption. Likewise, the convergence, equation B.3, in assumption 3 is satisfied from the central limit theorem if $Z_1, \dots, Z_N$ are i.i.d.

We will need the following lemma.

Lemma 1. Let $Z_1, \dots, Z_N$ be samples satisfying assumption 1. Then the following holds for any bounded measurable function $g: \mathcal{X} \to \mathbb{R}$:

$$E\!\left[ \frac{1}{N} \sum_{i=1}^{N} g(Z_i) \right] = \int g(x)\, dQ(x).$$

Proof.

$$E\!\left[ \frac{1}{N} \sum_{i=1}^{N} g(Z_i) \right] = E\!\left[ \frac{1}{N(N-1)} \sum_{i=1}^{N} \sum_{j \ne i} g(Z_j) \right] = \frac{1}{N} \sum_{i=1}^{N} E\!\left[ \frac{1}{N-1} \sum_{j \ne i} g(Z_j) \right] = \frac{1}{N} \sum_{i=1}^{N} \int g(x)\, dQ_i(x) = \int g(x)\, dQ(x).$$

The following theorem shows the convergence rates of our resampling algorithm. Note that it does not assume that the candidate samples $Z_1, \dots, Z_N$ are identical to those expressing the estimator $\hat{m}_P$.

Theorem 3. Let $k$ be a bounded positive-definite kernel and $\mathcal{H}$ be the associated RKHS. Let $Z_1, \dots, Z_N$ be candidate samples satisfying assumptions 1, 2, and 3. Let $P$ be a probability distribution satisfying assumption 3, and let $m_P = \int k(\cdot, x)\, dP(x)$ be the kernel mean. Let $\hat{m}_P \in \mathcal{H}$ be any element in $\mathcal{H}$. Suppose we apply algorithm 4 to $\hat{m}_P$ with candidate samples $Z_1, \dots, Z_N$, and let $\bar{X}_1, \dots, \bar{X}_\ell \in \{Z_1, \dots, Z_N\}$ be the resulting samples. Then the following holds:

$$\left\| m_P - \frac{1}{\ell} \sum_{i=1}^{\ell} k(\cdot, \bar{X}_i) \right\|^2_{\mathcal{H}} = \left( \|m_P - \hat{m}_P\|_{\mathcal{H}} + O_p(N^{-1/2}) \right)^2 + O\!\left( \frac{\ln \ell}{\ell} \right).$$

Proof. Our proof is based on the fact (Bach et al., 2012) that kernel herding can be seen as the Frank-Wolfe optimization method with step size $1/(\ell+1)$ for the $\ell$-th iteration. For details of the Frank-Wolfe method, we refer to Jaggi (2013) and Freund and Grigas (2014).

Fix the samples $Z_1, \dots, Z_N$. Let $\mathcal{M}_N$ be the convex hull of the set $\{k(\cdot, Z_1), \dots, k(\cdot, Z_N)\} \subset \mathcal{H}$. Define a loss function $J: \mathcal{H} \to \mathbb{R}$ by

$$J(g) = \frac{1}{2} \|g - \hat{m}_P\|^2_{\mathcal{H}}, \quad g \in \mathcal{H}. \tag{B.4}$$

Then algorithm 4 can be seen as the Frank-Wolfe method that iteratively minimizes this loss function over the convex hull $\mathcal{M}_N$:

$$\inf_{g \in \mathcal{M}_N} J(g).$$

More precisely, the Frank-Wolfe method solves this problem by the following iterations:

$$s_\ell = \arg\min_{g \in \mathcal{M}_N} \langle g, \nabla J(g_{\ell-1}) \rangle_{\mathcal{H}}, \qquad g_\ell = (1 - \gamma_\ell)\, g_{\ell-1} + \gamma_\ell\, s_\ell \quad (\ell \ge 1),$$

where $\gamma_\ell$ is a step size defined as $\gamma_\ell = 1/\ell$, and $\nabla J(g_{\ell-1})$ is the gradient of $J$ at $g_{\ell-1}$: $\nabla J(g_{\ell-1}) = g_{\ell-1} - \hat{m}_P$. Here the initial point is defined as $g_0 = 0$. It can be easily shown that $g_\ell = \frac{1}{\ell} \sum_{i=1}^{\ell} k(\cdot, \bar{X}_i)$, where $\bar{X}_1, \dots, \bar{X}_\ell$ are the samples given by algorithm 4. (For details, see Bach et al., 2012.)

Let $L_{J, \mathcal{M}_N} > 0$ be the Lipschitz constant of the gradient $\nabla J$ over $\mathcal{M}_N$ and $\mathrm{Diam}\,\mathcal{M}_N > 0$ be the diameter of $\mathcal{M}_N$:

$$L_{J, \mathcal{M}_N} = \sup_{g_1, g_2 \in \mathcal{M}_N} \frac{\|\nabla J(g_1) - \nabla J(g_2)\|_{\mathcal{H}}}{\|g_1 - g_2\|_{\mathcal{H}}} = \sup_{g_1, g_2 \in \mathcal{M}_N} \frac{\|g_1 - g_2\|_{\mathcal{H}}}{\|g_1 - g_2\|_{\mathcal{H}}} = 1, \tag{B.5}$$

$$\mathrm{Diam}\,\mathcal{M}_N = \sup_{g_1, g_2 \in \mathcal{M}_N} \|g_1 - g_2\|_{\mathcal{H}} \le \sup_{g_1, g_2 \in \mathcal{M}_N} \|g_1\|_{\mathcal{H}} + \|g_2\|_{\mathcal{H}} \le 2C, \tag{B.6}$$

where $C := \sup_{x \in \mathcal{X}} \|k(\cdot, x)\|_{\mathcal{H}} = \sup_{x \in \mathcal{X}} \sqrt{k(x, x)} < \infty$.

From bound 3.2 and equation 8 of Freund and Grigas (2014), we then have

$$J(g_\ell) - \inf_{g \in \mathcal{M}_N} J(g) \le \frac{L_{J, \mathcal{M}_N} (\mathrm{Diam}\,\mathcal{M}_N)^2 (1 + \ln \ell)}{2\ell} \tag{B.7}$$
$$\le \frac{2 C^2 (1 + \ln \ell)}{\ell}, \tag{B.8}$$

where the last inequality follows from equations B.5 and B.6.

Note that the upper bound of equation B.8 does not depend on the candidate samples $Z_1, \dots, Z_N$. Hence, combined with equation B.4, the following holds for any choice of $Z_1, \dots, Z_N$:

$$\left\| \hat{m}_P - \frac{1}{\ell} \sum_{i=1}^{\ell} k(\cdot, \bar{X}_i) \right\|^2_{\mathcal{H}} \le \inf_{g \in \mathcal{M}_N} \|\hat{m}_P - g\|^2_{\mathcal{H}} + \frac{4 C^2 (1 + \ln \ell)}{\ell}. \tag{B.9}$$

Below we will focus on bounding the first term of equation B.9. Recall here that $Z_1, \dots, Z_N$ are random samples. Define a random variable $S_N = \sum_{i=1}^{N} \frac{p(Z_i)}{q(Z_i)}$. Since $\mathcal{M}_N$ is the convex hull of $\{k(\cdot, Z_1), \dots, k(\cdot, Z_N)\}$, we have

$$\inf_{g \in \mathcal{M}_N} \|\hat{m}_P - g\|_{\mathcal{H}} = \inf_{\alpha \in \mathbb{R}^N:\, \alpha \ge 0,\, \sum_i \alpha_i \le 1} \left\| \hat{m}_P - \sum_i \alpha_i k(\cdot, Z_i) \right\|_{\mathcal{H}} \le \left\| \hat{m}_P - \frac{1}{S_N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \right\|_{\mathcal{H}}$$
$$\le \|\hat{m}_P - m_P\|_{\mathcal{H}} + \left\| m_P - \frac{1}{N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \right\|_{\mathcal{H}} + \left\| \frac{1}{N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) - \frac{1}{S_N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \right\|_{\mathcal{H}}.$$

Therefore we have

$$\left\| \hat{m}_P - \frac{1}{\ell} \sum_{i=1}^{\ell} k(\cdot, \bar{X}_i) \right\|^2_{\mathcal{H}} \le \left( \|\hat{m}_P - m_P\|_{\mathcal{H}} + \left\| m_P - \frac{1}{N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \right\|_{\mathcal{H}} + \left\| \frac{1}{N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) - \frac{1}{S_N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \right\|_{\mathcal{H}} \right)^2 + O\!\left( \frac{\ln \ell}{\ell} \right). \tag{B.10}$$

Below we derive rates of convergence for the second and third terms.

Below we derive rates of convergence for the second and third terms

434 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

For the second term we derive a rate of convergence in expectationwhich implies a rate of convergence in probability To this end we use thefollowing fact Let f isin H be any function in the RKHS By the assumptionsupxisinX

p(x)

q(x)lt infin and the boundedness of k functions x rarr p(x)

q(x)f (x) and

x rarr (p(x)

q(x))2 f (x) are bounded

E

⎡⎣

∥∥∥∥∥mP minus 1N

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥2

H

⎤⎦

= mP2H minus 2E

[1N

sumi

p(Zi)

q(Zi)mP(Zi)

]

+ E

⎡⎣ 1

N2

sumi

sumj

p(Zi)

q(Zi)

p(Zj)

q(Zj)k(Zi Zj)

⎤⎦

= mP2H minus 2

intp(x)

q(x)mP(x)q(x)dx

+ E

⎡⎣ 1

N2

sumi

sumj =i

p(Zi)

q(Zi)

p(Zj)

q(Zj)k(Zi Zj)

⎤⎦

+E

[1

N2

sumi

(p(Zi)

q(Zi)

)2

k(Zi Zi)

]

= mP2H minus 2mP2

H + E

[N minus 1

N2

sumi

p(Zi)

q(Zi)

intp(x)

q(x)k(Zi x)qi(x)dx

]

+ 1N

int (p(x)

q(x)

)2

k(x x)q(x)dx

= minusmP2H + E

[N minus 1

N2

sumi

p(Zi)

q(Zi)

intp(x)

q(x)k(Zi x)qi(x)dx

]

+ 1N

intp(x)

q(x)k(x x)dP(x)

We further rewrite the second term of the last equality as follows

E

[N minus 1

N2

sumi

p(Zi)

q(Zi)

intp(x)

q(x)k(Zi x)qi(x)dx

]

Filtering with State-Observation Examples 435

= E

[N minus 1

N2

sumi

p(Zi)

q(Zi)

intp(x)

q(x)k(Zi x)(qi(x) minus q(x))dx

]

+ E

[N minus 1

N2

sumi

p(Zi)

q(Zi)

intp(x)

q(x)k(Zi x)q(x)dx

]

= E

[N minus 1

N2

sumi

p(Zi)

q(Zi)

int radicp(x)k(Zi x)

radicp(x)

(qi(x)

q(x)minus 1

)dx

]

+ N minus 1N

mP2H

le E

[N minus 1

N2

sumi

p(Zi)

q(Zi)k(Zi middot)L2(P)

∥∥∥∥qi(x)

q(x)minus 1

∥∥∥∥L2(P)

]+ N minus 1

NmP2

H

le E

[N minus 1

N3

sumi

p(Zi)

q(Zi)C2A

]+ N minus 1

NmP2

H

= C2A(N minus 1)

N2 + N minus 1N

mP2H

where the first inequality follows from Cauchy-Schwartz Using this weobtain

E[

∥∥∥∥∥mP minus 1N

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥2

H

le 1N

(intp(x)

q(x)k(x x)dP(x) minus mP2

H

)+ C2(N minus 1)A

N2

= O(Nminus1)

Therefore we have

∥∥∥∥∥mP minus 1N

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

= Op(Nminus12) (N rarr infin) (B11)

We can bound the third term as follows∥∥∥∥∥ 1N

sumi

p(Zi)

q(Zi)k(middot Zi) minus 1

SN

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

=∥∥∥∥∥ 1

N

sumi

p(Zi)

q(Zi)k(middot Zi)

(1 minus N

SN

)∥∥∥∥∥H

436 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

=∣∣∣∣1 minus N

SN

∣∣∣∣∥∥∥∥∥ 1

N

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

le∣∣∣∣1 minus N

SN

∣∣∣∣C pqinfin

=∣∣∣∣∣1 minus 1

1N

sumNi=1 p(Zi)q(Zi)

∣∣∣∣∣C pqinfin

where pqinfin = supxisinXp(x)

q(x)lt infin Therefore the following holds by as-

sumption 3 and the delta method

∥∥∥∥∥ 1N

sumi

p(Zi)

q(Zi)k(middot Zi) minus 1

SN

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

= Op(Nminus12) (B12)

The assertion of the theorem follows from equations B10 to B12

Appendix C Reduction of Computational Cost

We have seen in section 43 that the time complexity of KMCF in one timestep is O(n3) where n is the number of the state-observation examples(XiYi)n

i=1 This can be costly if one wishes to use KMCF in real-timeapplications with a large number of samples Here we show two methodsfor reducing the costs one based on low-rank approximation of kernelmatrices and one based on kernel herding Note that kernel herding is alsoused in the resampling step The purpose here is different however wemake use of kernel herding for finding a reduced representation of the data(XiYi)n

i=1

C1 Low-rank Approximation of Kernel Matrices Our goal is to re-duce the costs of algorithm 1 of kernel Bayesrsquo rule Algorithm 1 involvestwo matrix inversions (GX + nεIn)minus1 in line 3 and ((GY )2 + δIn)minus1 in line4 Note that (GX + nεIn)minus1 does not involve the test data so it can be com-puted before the test phase On the other hand ((GY )2 + δIn)minus1 dependson matrix This matrix involves the vector mπ which essentially repre-sents the prior of the current state (see line 13 of algorithm 3) Therefore((GY )2 + δIn)minus1 needs to be computed for each iteration in the test phaseThis has the complexity of O(n3) Note that even if (GX + nεIn)minus1 can becomputed in the training phase the multiplication (GX + nεIn)minus1mπ in line3 requires O(n2) Thus it can also be costly Here we consider methods toreduce both costs in lines 3 and 4

Suppose that there exist low-rank matrices UV isin Rntimesr where r lt n that

approximate the kernel matrices GX asymp UUT GY asymp VVT Such low-rank

Filtering with State-Observation Examples 437

matrices can be obtained by for example incomplete Cholesky decomposi-tion with time complexity O(nr2) (Fine amp Scheinberg 2001 Bach amp Jordan2002) Note that the computation of these matrices is required only oncebefore the test phase Therefore their time complexities are not the problemhere

C11 Derivation First we approximate (GX + nεIn)minus1mπ in line 3 usingGX asymp UUT By the Woodbury identity we have

(GX + nεIn)minus1mπ asymp (UUT + nεIn)minus1mπ

= 1nε

(In minus U(nεIr + UTU)minus1UT )mπ

where Ir isin Rrtimesr denotes the identity Note that (nεIr + UTU)minus1 does not

involve the test data so can be computed in the training phase Thus theabove approximation of μ can be computed with complexity O(nr2)

Next we approximate w = GY ((GY )2 + δI)minus1kY in line 4 using GY asympVVT Define B = V isin R

ntimesr C = VTV isin Rrtimesr and D = VT isin R

rtimesn Then(GY )2 asymp (VVT )2 = BCD By the Woodbury identity we obtain

(δIn + (GY )2)minus1 asymp (δIn + BCD)minus1

= 1δ(In minus B(δCminus1 + DB)minus1D)

Thus w can be approximated as

w =GY ((GY )2 + δI)minus1kY

asymp 1δVVT (In minus B(δCminus1 + DB)minus1D)kY

The computation of this approximation requires O(nr2 + r3) = O(nr2) Thusin total the complexity of algorithm 1 can be reduced to O(nr2) We sum-marize the above approximations in algorithm 5

438 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

C12 How to Use Algorithm 5 can be used with algorithm 3 of KMCFby modifying Algorithm 3 in the following manner Compute the low-rank matrices UV right after lines 4 and 5 This can be done by using forexample incomplete Cholesky decomposition (Fine amp Scheinberg 2001Bach amp Jordan 2002) Then replace algorithm 1 in line 15 by algorithm 5

C13 How to Select the Rank As discussed in section 43 one way ofselecting the rank r is to use-cross validation by regarding r as a hyper-parameter of KMCF Another way is to measure the approximation errorsGX minus UUT and GY minus VVT with some matrix norm such as the Frobe-nius norm Indeed we can compute the smallest rank r such that theseerrors are below a prespecified threshold and this can be done efficientlywith time complexity O(nr2) (Bach amp Jordan 2002)

C2 Data Reduction with Kernel Herding Here we describe an ap-proach to reduce the size of the representation of the state-observationexamples (XiYi)n

i=1 in an efficient way By ldquoefficientrdquo we mean that theinformation contained in (XiYi)n

i=1 will be preserved even after the re-duction Recall that (XiYi)n

i=1 contains the information of the observationmodel p(yt |xt ) (recall also that p(yt |xt ) is assumed time-homogeneous seesection 41) This information is used in only algorithm 1 of kernel Bayesrsquorule (line 15 algorithm 3) Therefore it suffices to consider how kernel Bayesrsquorule accesses the information contained in the joint sample (XiYi)n

i=1

C21 Representation of the Joint Sample To this end we need to show howthe joint sample (XiYi)n

i=1 can be represented with a kernel mean embed-ding Recall that (kX HX ) and (kY HY ) are kernels and the associatedRKHSs on the state-space X and the observation-space Y respectively LetX times Y be the product space ofX andY Then we can define a kernel kXtimesY onX times Y as the product of kX and kY kXtimesY ((x y) (xprime yprime)) = kX (x xprime)kY (y yprime)for all (x y) (xprime yprime) isin X times Y This product kernel kXtimesY defines an RKHS ofX times Y Let HXtimesY denote this RKHS As in section 3 we can use kXtimesY andHXtimesY for a kernel mean embedding In particular the empirical distribution1n

sumni=1 δ(XiYi )

of the joint sample (XiYi)ni=1 sub X times Y can be represented as

an empirical kernel mean in HXtimesY

mXY = 1n

nsumi=1

kXtimesY ((middot middot) (XiYi)) isin HXtimesY (C1)

This is the representation of the joint sample (XiYi)ni=1

The information of (XiYi)ni=1 is provided for kernel Bayesrsquo rule essen-

tially through the form of equation C1 (Fukumizu et al 2011 2013) Recallthat equation C1 is a point in the RKHS HXtimesY Any point close to thisequation in HXtimesY would also contain information close to that contained in

Filtering with State-Observation Examples 439

equation C1 Therefore we propose to find a subset (X1 Y1) (Xr Xr) sub(XiYi)n

i=1 where r lt n such that its representation in HXtimesY

mXY = 1r

rsumi=1

kXtimesY ((middot middot) (Xi Yi)) isin HXtimesY (C2)

is close to equation C1 Namely we wish to find subsamples such thatmXY minus mXYHXtimesY

is small If the error mXY minus mXYHXtimesYis small enough

equation C2 would provide information close to that given by equation C1for kernel Bayesrsquo rule Thus kernel Bayesrsquo rule based on such subsamples(Xi Yi)r

i=1 would not perform much worse than the one based on the entireset of samples (XiYi)n

i=1

C22 Subsampling Method To find such subsamples we make use ofkernel herding in section 35 Namely we apply the update equations 36and 37 to approximate equation C1 with kernel kXtimesY and RKHS HXtimesY We greedily find subsamples Dr = (X1 Y1) (Xr Yr) as

(Xr Yr)

= arg max(xy)isinDDrminus1

1n

nsumi=1

kXtimesY ((x y) (XiYi))minus1r

rminus1sumj=1

kXtimesY ((x r) (Xi Yi))

= arg max(xy)isinDDrminus1

1n

nsumi=1

kX (x Xi)kY (yYi)minus1r

rminus1sumj=1

kX (x Xj)kY (y Yj)

The resulting algorithm is shown in algorithm 6 The time complexity isO(n2r) for selecting r subsamples

C23 How to Use By using algorithm 6 we can reduce the the time com-plexity of KMCF (see algorithm 3) in each iteration from O(n3) to O(r3) Thiscan be done by obtaining subsamples (Xi Yi)r

i=1 by applying algorithm 6to (XiYi)n

i=1 and then replacing (XiYi)ni=1 in requirement of algorithm 3

by (Xi Yi)ri=1 and using the number r instead of n

C24 How to Select the Number of Subsamples The number r of subsam-ples determines the trade-off between the accuracy and computational timeof KMCF It may be selected by cross-validation or by measuring the ap-proximation error mXY minus mXYHXtimesY

as for the case of selecting the rankof low-rank approximation in section C1

C25 Discussion Recall that kernel herding generates samples such thatthey approximate a given kernel mean (see section 35) Under certain

440 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

assumptions the error of this approximation is of O(rminus1) with r sampleswhich is faster than that of iid samples O(rminus12) This indicates that sub-samples (Xi Yi)r

i=1 selected with kernel herding may approximate equa-tion C1 well Here however we find the solutions of the optimizationproblems 36 and 37 from the finite set (XiYi)n

i=1 rather than the entirejoint space X times Y The convergence guarantee is provided only for the caseof the entire joint space X times Y Thus for our case the convergence guar-antee is no longer provided Moreover the fast rate O(rminus1) is guaranteedonly for finite-dimensional RKHSs Gaussian kernels which we often use inpractice define infinite-dimensional RKHSs Therefore the fast rate is notguaranteed if we use gaussian kernels Nevertheless we can use algorithm6 as a heuristic for data reduction

Acknowledgments

We express our gratitude to the associate editor and the anonymous re-viewer for their time and helpful suggestions We also thank MasashiShimbo Momoko Hayamizu Yoshimasa Uematsu and Katsuhiro Omaefor their helpful comments This work has been supported in part by MEXTGrant-in-Aid for Scientific Research on Innovative Areas 25120012 MKhas been supported by JSPS Grant-in-Aid for JSPS Fellows 15J04406

References

Anderson B amp Moore J (1979) Optimal filtering Englewood Cliffs NJ PrenticeHall

Aronszajn N (1950) Theory of reproducing kernels Transactions of the AmericanMathematical Society 68(3) 337ndash404

Filtering with State-Observation Examples 441

Bach F amp Jordan M I (2002) Kernel independent component analysis Journal ofMachine Learning Research 3 1ndash48

Bach F Lacoste-Julien S amp Obozinski G (2012) On the equivalence betweenherding and conditional gradient algorithms In Proceedings of the 29th Interna-tional Conference on Machine Learning (ICML2012) (pp 1359ndash1366) Madison WIOmnipress

Berlinet A amp Thomas-Agnan C (2004) Reproducing kernel Hilbert spaces in probabilityand statistics Dordrecht Kluwer Academic

Calvet L E amp Czellar V (2015) Accurate methods for approximate Bayesiancomputation filtering Journal of Financial Econometrics 13 798ndash838 doi101093jjfinecnbu019

Cappe O Godsill S J amp Moulines E (2007) An overview of existing methods andrecent advances in sequential Monte Carlo IEEE Proceedings 95(5) 899ndash924

Chen Y Welling M amp Smola A (2010) Supersamples from kernel-herding InProceedings of the 26th Conference on Uncertainty in Artificial Intelligence (pp 109ndash116) Cambridge MA MIT Press

Deisenroth M Huber M amp Hanebeck U (2009) Analytic moment-based gaussianprocess filtering In Proceedings of the 26th International Conference on MachineLearning (pp 225ndash232) Madison WI Omnipress

Doucet A Freitas N D amp Gordon N J (Eds) (2001) Sequential Monte Carlomethods in practice New York Springer

Doucet A amp Johansen A M (2011) A tutorial on particle filtering and smoothingFifteen years later In D Crisan amp B Rozovskii (Eds) The Oxford handbook ofnonlinear filtering (pp 656ndash704) New York Oxford University Press

Durbin J amp Koopman S J (2012) Time series analysis by state space methods (2nd ed)New York Oxford University Press

Eberts M amp Steinwart I (2013) Optimal regression rates for SVMs using gaussiankernels Electronic Journal of Statistics 7 1ndash42

Ferris B Hahnel D amp Fox D (2006) Gaussian processes for signal strength-basedlocation estimation In Proceedings of Robotics Science and Systems CambridgeMA MIT Press

Fine S amp Scheinberg K (2001) Efficient SVM training using low-rank kernel rep-resentations Journal of Machine Learning Research 2 243ndash264

Freund R M amp Grigas P (2014) New analysis and results for the FrankndashWolfemethod Mathematical Programming doi 101007s10107-014-0841-6

Fukumizu K Bach F amp Jordan M I (2004) Dimensionality reduction for super-vised learning with reproducing kernel Hilbert spaces Journal of Machine LearningResearch 5 73ndash99

Fukumizu K Gretton A Sun X amp Scholkopf B (2008) Kernel measures ofconditional dependence In J C Platt D Koller Y Singer amp S Roweis (Eds)Advances in neural information processing systems 20 (pp 489ndash496) CambridgeMA MIT Press

Fukumizu K Song L amp Gretton A (2011) Kernel Bayesrsquo rule In J Shawe-TaylorR S Zemel P L Bartlett F C N Pereira amp K Q Weinberger (Eds) Advances inneural information processing systems 24 (pp 1737ndash1745) Red Hook NY Curran

Fukumizu K Song L amp Gretton A (2013) Kernel Bayesrsquo rule Bayesian inferencewith positive definite kernels Journal of Machine Learning Research 14 3753ndash3783

442 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Fukumizu K Sriperumbudur B Gretton A amp Scholkopf B (2009) Characteristickernels on groups and semigroups In D Koller D Schuurmans Y Bengio amp LBottou (Eds) Advances in neural information processing systems 21 (pp 473ndash480)Cambridge MA MIT Press

Gordon N J Salmond D J amp Smith AFM (1993) Novel approach tononlinearnon-gaussian Bayesian state estimation IEE-Proceedings-F 140 107ndash113

Hofmann T Scholkopf B amp Smola A J (2008) Kernel methods in machine learn-ing Annals of Statistics 36(3) 1171ndash1220

Jaggi M (2013) Revisiting Frank-Wolfe Projection-free sparse convex optimizationIn Proceedings of the 30th International Conference on Machine Learning (pp 427ndash435)httplinkspringercomarticle101007Fg10107-014-0841-6

Jasra A Singh S S Martin J S amp McCoy E (2012) Filtering via approximateBayesian computation Statistics and Computing 22 1223ndash1237

Julier S J amp Uhlmann J K (1997) A new extension of the Kalman filter tononlinear systems In Proceedings of AeroSense The 11th International SymposiumAerospaceDefence Sensing Simulation and Controls Bellingham WA SPIE

Julier S J amp Uhlmann J K (2004) Unscented filtering and nonlinear estimationIEEE Review 92 401ndash422

Kalman R E (1960) A new approach to linear filtering and prediction problemsTransactions of the ASME Journal of Basic Engineering 82 35ndash45

Kanagawa M amp Fukumizu K (2014) Recovering distributions from gaussianRKHS embeddings In Proceedings of the 17th International Conference on Artifi-cial Intelligence and Statistics (pp 457ndash465) JMLR

Kanagawa M Nishiyama Y Gretton A amp Fukumizu K (2014) Monte Carlofiltering using kernel embedding of distributions In Proceedings of the 28thAAAI Conference on Artificial Intelligence (pp 1897ndash1903) Cambridge MA MITPress

Ko J amp Fox D (2009) GP-BayesFilters Bayesian filtering using gaussian processprediction and observation models Autonomous Robots 72(1) 75ndash90

Lazebnik S Schmid C amp Ponce J (2006) Beyond bags of features Spatial pyramidmatching for recognizing natural scene categories In Proceedings of 2006 IEEEComputer Society Conference on Computer Vision and Pattern Recognition (vol 2 pp2169ndash2178) Washington DC IEEE Computer Society

Liu J S (2001) Monte Carlo strategies in scientific computing New York Springer-Verlag

McCalman L OrsquoCallaghan S amp Ramos F (2013) Multi-modal estimation withkernel embeddings for learning motion models In Proceedings of 2013 IEEE In-ternational Conference on Robotics and Automation (pp 2845ndash2852) Piscataway NJIEEE

Pan S J amp Yang Q (2010) A survey on transfer learning IEEE Transactions onKnowledge and Data Engineering 22(10) 1345ndash1359

Pistohl T Ball T Schulze-Bonhage A Aertsen A amp Mehring C (2008) Predic-tion of arm movement trajectories from ECoG-recordings in humans Journal ofNeuroscience Methods 167(1) 105ndash114

Pronobis A amp Caputo B (2009) COLD COsy localization database InternationalJournal of Robotics Research 28(5) 588ndash594

Filtering with State-Observation Examples 443

Quigley M Stavens D Coates A amp Thrun S (2010) Sub-meter indoor localiza-tion in unmodified environments with inexpensive sensors In Proceedings of theIEEERSJ International Conference on Intelligent Robots and Systems 2010 (vol 1 pp2039ndash2046) Piscataway NJ IEEE

Ristic B Arulampalam S amp Gordon N (2004) Beyond the Kalman filter Particlefilters for tracking applications Norwood MA Artech House

Schaid D J (2010a) Genomic similarity and kernel methods I Advancements bybuilding on mathematical and statistical foundations Human Heredity 70(2) 109ndash131

Schaid D J (2010b) Genomic similarity and kernel methods II Methods for genomicinformation Human Heredity 70(2) 132ndash140

Schalk G Kubanek J Miller K J Anderson N R Leuthardt E C Ojemann J G Wolpaw J R (2007) Decoding two dimensional movement trajectories usingelectrocorticographic signals in humans Journal of Neural Engineering 4(264)264ndash275

Scholkopf B amp Smola A J (2002) Learning with kernels Cambridge MA MIT PressScholkopf B Tsuda K amp Vert J P (2004) Kernel methods in computational biology

Cambridge MA MIT PressSilverman B W (1986) Density estimation for statistics and data analysis London

Chapman and HallSmola A Gretton A Song L amp Scholkopf B (2007) A Hilbert space embed-

ding for distributions In Proceedings of the International Conference on AlgorithmicLearning Theory (pp 13ndash31) New York Springer

Song L Fukumizu K amp Gretton A (2013) Kernel embeddings of conditional dis-tributions A unified kernel framework for nonparametric inference in graphicalmodels IEEE Signal Processing Magazine 30(4) 98ndash111

Song L Huang J Smola A amp Fukumizu K (2009) Hilbert space embeddings ofconditional distributions with applications to dynamical systems In Proceedingsof the 26th International Conference on Machine Learning (pp 961ndash968) MadisonWI Omnipress

Sriperumbudur B K Gretton A Fukumizu K Scholkopf B amp Lanckriet G R(2010) Hilbert space embeddings and metrics on probability measures Journal ofMachine Learning Research 11 1517ndash1561

Steinwart I amp Christmann A (2008) Support vector machines New York SpringerStone C J (1977) Consistent nonparametric regression Annals of Statistics 5(4)

595ndash620Thrun S Burgard W amp Fox D (2005) Probabilistic robotics Cambridge MA MIT

PressVlassis N Terwijn B amp Krose B (2002) Auxiliary particle filter robot local-

ization from high-dimensional sensor observations In Proceedings of the In-ternational Conference on Robotics and Automation (pp 7ndash12) Piscataway NJIEEE

Wang Z Ji Q Miller K J amp Schalk G (2011) Prior knowledge improves decodingof finger flexion from electrocorticographic signals Frontiers in Neuroscience 5127

Widom H (1963) Asymptotic behavior of the eigenvalues of certain integral equa-tions Transactions of the American Mathematical Society 109 278ndash295

444 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Widom H (1964) Asymptotic behavior of the eigenvalues of certain integral equa-tions II Archive for Rational Mechanics and Analysis 17 215ndash229

Wolf J Burgard W amp Burkhardt H (2005) Robust vision-based localization bycombining an image retrieval system with Monte Carlo localization IEEE Trans-actions on Robotics 21(2) 208ndash216

Received May 18 2015 accepted October 14 2015

Page 12: Filtering with State-Observation Examples via Kernel Monte ...

Filtering with State-Observation Examples 393

that x in equation 37 is the minimizer of

E =∥∥∥∥∥mP minus 1

sumi=1

k(middot xi)

∥∥∥∥∥H

(38)

Thus kernel herding performs greedy minimization of the distance betweenmP and the empirical kernel mean mP = 1

sumi=1 k(middot xi)

It can be shown that the error E of equation 38 decreases at a rate at leastO(minus12) under the assumption that k is bounded (Bach Lacoste-Julien ampObozinski 2012) In other words the herding samples x1 x provide aconvergent approximation of mP In this sense kernel herding can be seen asa (pseudo) sampling method Note that mP itself can be an empirical kernelmean of the form 34 These properties are important for our resamplingalgorithm developed in section 42

It should be noted that E decreases at a faster rate O(minus1) under a certainassumption (Chen et al 2010) this is much faster than the rate of iidsamples O(minus12) Unfortunately this assumption holds only when H isfinite dimensional (Bach et al 2012) and therefore the fast rate of O(minus1)

has not been guaranteed for infinite-dimensional cases Nevertheless thisfast rate motivates the use of kernel herding in the data reduction methodin section C2 in appendix C (we will use kernel herding for two differentpurposes)

4 Kernel Monte Carlo Filter

In this section we present our kernel Monte Carlo filter (KMCF) Firstwe define notation and review the problem setting in section 41 We thendescribe the algorithm of KMCF in section 42 We discuss implementationissues such as hyperparameter selection and computational cost in section43 We explain how to decode the information on the posteriors from theestimated kernel means in section 44

41 Notation and Problem Setup Here we formally define the setupexplained in section 1 The notation is summarized in Table 1

We consider a state-space model (see Figure 1) Let X and Y be mea-surable spaces which serve as a state space and an observation space re-spectively Let x1 xt xT isin X be a sequence of hidden states whichfollow a Markov process Let p(xt |xtminus1) denote a transition model that de-fines this Markov process Let y1 yt yT isin Y be a sequence of obser-vations Each observation yt is assumed to be generated from an observationmodel p(yt |xt ) conditioned on the corresponding state xt We use the abbre-viation y1t = y1 yt

We consider a filtering problem of estimating the posterior distributionp(xt |y1t ) for each time t = 1 T The estimation is to be done online

394 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Table 1 Notation

X State spaceY Observation spacext isin X State at time tyt isin Y Observation at time tp(yt |xt ) Observation modelp(xt |xtminus1) Transition model(XiYi)n

i=1 State-observation exampleskX Positive-definite kernel on XkY Positive-definite kernel on YHX RKHS associated with kXHY RKHS associated with kY

as each yt is given Specifically we consider the following setting (see alsosection 1)

1 The observation model p(yt |xt ) is not known explicitly or even para-metrically Instead we are given examples of state-observation pairs(XiYi)n

i=1 sub X times Y prior to the test phase The observation modelis also assumed time homogeneous

2 Sampling from the transition model p(xt |xtminus1) is possible Its prob-abilistic model can be an arbitrary nonlinear nongaussian distribu-tion as for standard particle filters It can further depend on timeFor example control input can be included in the transition model asp(xt |xtminus1) = p(xt |xtminus1 ut ) where ut denotes control input providedby a user at time t

Let kX X times X rarr R and kY Y times Y rarr R be positive-definite kernels onX and Y respectively Denote by HX and HY their respective RKHSs Weaddress the above filtering problem by estimating the kernel means of theposteriors

mxt |y1t=

intkX (middot xt )p(xt |y1t )dxt isin HX (t = 1 T ) (41)

These preserve all the information of the corresponding posteriors if thekernels are characteristic (see section 32) Therefore the resulting estimatesof these kernel means provide us the information of the posteriors as ex-plained in section 44

42 Algorithm KMCF iterates three steps of prediction correction andresampling for each time t Suppose that we have just finished the iterationat time t minus 1 Then as shown later the resampling step yields the followingestimator of equation 41 at time t minus 1

Filtering with State-Observation Examples 395

Figure 2 One iteration of KMCF Here X1 X8 and Y1 Y8 denote statesand observations respectively in the state-observation examples (XiYi)n

i=1(suppose n = 8) 1 Prediction step The kernel mean of the prior equation 45 isestimated by sampling with the transition model p(xt |xtminus1) 2 Correction stepThe kernel mean of the posterior equation 41 is estimated by applying kernelBayesrsquo rule (see algorithm 1) The estimation makes use of the informationof the prior (expressed as m

π= (mxt |y1tminus1

(Xi)) isin R8) as well as that of a new

observation yt (expressed as kY = (kY (ytYi)) isin R8) The resulting estimate

equation 46 is expressed as a weighted sample (wti Xi)ni=1 Note that the

weights may be negative 3 Resampling step Samples associated with smallweights are eliminated and those with large weights are replicated by applyingkernel herding (see algorithm 2) The resulting samples provide an empiricalkernel mean equation 47 which will be used in the next iteration

mxtminus1|y1tminus1= 1

n

nsumi=1

kX (middot Xtminus1i) (42)

where Xtminus11 Xtminus1n isin X We show one iteration of KMCF that estimatesthe kernel mean (41) at time t (see also Figure 2)

421 Prediction Step The prediction step is as follows We generate asample from the transition model for each Xtminus1i in equation 42

Xti sim p(xt |xtminus1 = Xtminus1i) (i = 1 n) (43)

396 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

We then specify a new empirical kernel mean

mxt |y1tminus1= 1

n

nsumi=1

kX (middot Xti) (44)

This is an estimator of the following kernel mean of the prior

mxt |y1tminus1=

intkX (middot xt )p(xt |y1tminus1)dxt isin HX (45)

where

p(xt |y1tminus1) =int

p(xt |xtminus1)p(xtminus1|y1tminus1)dxtminus1

is the prior distribution of the current state xt Thus equation 44 serves asa prior for the subsequent posterior estimation

In section 5 we theoretically analyze this sampling procedure in detailand provide justification of equation 44 as an estimator of the kernel meanequation 45 We emphasize here that such an analysis is necessary eventhough the sampling procedure is similar to that of a particle filter thetheory of particle methods does not provide a theoretical justification ofequation 44 as a kernel mean estimator since it deals with probabilities asempirical distributions

422 Correction Step This step estimates the kernel mean equation 41of the posterior by using kernel Bayesrsquo rule (see algorithm 1) in section 33This makes use of the new observation yt the state-observation examples(XiYi)n

i=1 and the estimate equation 44 of the priorThe input of algorithm 1 consists of (1) vectors

kY = (kY (ytY1) kY (ytYn))T isin Rn

mπ = (mxt |y1tminus1(X1) mxt |y1tminus1

(Xn))T

=(

1n

nsumi=1

kX (Xq Xti)

)n

q=1

isin Rn

which are interpreted as expressions of yt and mxt |y1tminus1using the sample

(XiYi)ni=1 (2) kernel matrices GX = (kX (Xi Xj)) GY = (kY (YiYj)) isin

Rntimesn and (3) regularization constants ε δ gt 0 These constants ε δ as well

as kernels kX kY are hyperparameters of KMCF (we discuss how to choosethese parameters later)

Filtering with State-Observation Examples 397

Algorithm 1 outputs a weight vector w = (w1 wn) isin Rn Normaliz-

ing these weights wt = wsumn

i=1 wi we obtain an estimator of equation 413

as

mxt |y1t=

nsumi=1

wtikX (middot Xi) (46)

The apparent difference from a particle filter is that the posterior (ker-nel mean) estimator equation 46 is expressed in terms of the samplesX1 Xn in the training sample (XiYi)n

i=1 not with the samples fromthe prior equation 44 This requires that the training samples X1 Xncover the support of posterior p(xt |y1t ) sufficiently well If this does nothold we cannot expect good performance for the posterior estimate Notethat this is also true for any methods that deal with the setting of this letterpoverty of training samples in a certain region means that we do not haveany information about the observation model p(yt |xt ) in that region

423 Resampling Step This step applies the update equations 36 and37 of kernel herding in section 35 to the estimate equation 46 This is toobtain samples Xt1 Xtn such that

mxt |y1t= 1

n

nsumi=1

kX (middot Xti) (47)

is close to equation 46 in the RKHS Our theoretical analysis in section 5shows that such a procedure can reduce the error of the prediction step attime t + 1

The procedure is summarized in algorithm 2 Specifically we gener-ate each Xti by searching the solution of the optimization problem in

3For this normalization procedure see the discussion in section 43

398 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

equations 36 and 37 from a finite set of samples X1 Xn in equation46 We allow repetitions in Xt1 Xtn We can expect that the resultingequation 47 is close to equation 46 in the RKHS if the samples X1 Xncover the support of the posterior p(xt |y1t ) sufficiently This is verified bythe theoretical analysis of section 53

Here searching for the solutions from a finite set reduces the computa-tional costs of kernel herding It is possible to search from the entire space Xif we have sufficient time or if the sample size n is small enough it dependson applications and available computational resources We also note thatthe size of the resampling samples is not necessarily n this depends on howaccurately these samples approximate equation 46 Thus a smaller numberof samples may be sufficient In this case we can reduce the computationalcosts of resampling as discussed in section 52

The aim of our resampling step is similar to that of the resamplingstep of a particle filter (see Doucet amp Johansen 2011) Intuitively the aimis to eliminate samples with very small weights and replicate those withlarge weights (see Figures 2 and 3) In particle methods this is realized bygenerating samples from the empirical distribution defined by a weightedsample (therefore this procedure is called resampling) Our resamplingstep is a realization of such a procedure in terms of the kernel mean embed-ding we generate samples Xt1 Xtn from the empirical kernel meanequation 46

Note that the resampling algorithm of particle methods is not appropri-ate for use with kernel mean embeddings This is because it assumes thatweights are positive but our weights in equation 46 can be negative asthis equation is a kernel mean estimator One may apply the resamplingalgorithm of particle methods by first truncating the samples with nega-tive weights However there is no guarantee that samples obtained by thisheuristic produce a good approximation of equation 46 as a kernel meanas shown by experiments in section 61 In this sense the use of kernel herd-ing is more natural since it generates samples that approximate a kernelmean

424 Overall Algorithm We summarize the overall procedure of KMCFin algorithm 3 where pinit denotes a prior distribution for the initial state x1For each time t KMCF takes as input an observation yt and outputs a weightvector wt = (wt1 wtn)T isin R

n Combined with the samples X1 Xnin the state-observation examples (XiYi)n

i=1 these weights provide anestimator equation 46 of the kernel mean of posterior equation 41

We first compute kernel matrices GX GY (lines 4ndash5) which are usedin algorithm 1 of kernel Bayesrsquo rule (line 15) For t = 1 we generate aniid sample X11 X1n from the initial distribution pinit (line 8) whichprovides an estimator of the prior corresponding to equation 44 Line 10 isthe resampling step at time t minus 1 and line 11 is the prediction step at timet Lines 13 to 16 correspond to the correction step

Filtering with State-Observation Examples 399

43 Discussion The estimation accuracy of KMCF can depend on sev-eral factors in practice

431 Training Samples We first note that training samples (XiYi)ni=1

should provide the information concerning the observation model p(yt |xt )For example (XiYi)n

i=1 may be an iid sample from a joint distributionp(xy) on X times Y which decomposes as p(x y) = p(y|x)p(x) Here p(y|x) isthe observation model and p(x) is some distribution on X The support ofp(x) should cover the region where states x1 xT may pass in the testphase as discussed in section 42 For example this is satisfied when thestate-space X is compact and the support of p(x) is the entire X

Note that training samples (XiYi)ni=1 can also be non-iid in prac-

tice For example we may deterministically select X1 Xn so that theycover the region of interest In location estimation problems in roboticsfor instance we may collect location-sensor examples (XiYi)n

i=1 so thatlocations X1 Xn cover the region where location estimation is to beconducted (Quigley et al 2010)

432 Hyperparameters As in other kernel methods in general the perfor-mance of KMCF depends on the choice of its hyperparameters which arethe kernels kX and kY (or parameters in the kernelsmdasheg the bandwidthof the gaussian kernel) and the regularization constants δ ε gt 0 We needto define these hyperparameters based on the joint sample (XiYi)n

i=1 be-fore running the algorithm on the test data y1 yT This can be done bycross-validation Suppose that (XiYi)n

i=1 is given as a sequence from the

400 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

state-space model We can then apply two-fold cross-validation by dividingthe sequence into two subsequences If (XiYi)n

i=1 is not a sequence we canrely on the cross-validation procedure for kernel Bayesrsquo rule (see section 42of Fukumizu et al 2013)

433 Normalization of Weights We found in our preliminary experimentsthat normalization of the weights (see line 16 algorithm 3) is beneficial tothe filtering performance This may be justified by the following discus-sion about a kernel mean estimator in general Let us consider a consis-tent kernel mean estimator mP = sumn

i=1 wik(middot Xi) such that limnrarrinfin mP minusmPH = 0 Then we can show that the sum of the weights converges to 1limnrarrinfin

sumni=1 wi = 1 under certain assumptions (Kanagawa amp Fukumizu

2014) This could be explained as follows Recall that the weighted averagesumni=1 wi f (Xi) of a function f is an estimator of the expectation

intf (x)dP(x)

Let f be a function that takes the value 1 for any input f (x) = 1 forallx isin X Then we have

sumni=1 wi f (Xi) = sumn

i=1 wi andint

f (x)dP(x) = 1 Thereforesumni=1 wi is an estimator of 1 In other words if the error mP minus mPH is

small then the sum of the weightssumn

i=1 wi should be close to 1 Converselyif the sum of the weights is far from 1 it suggests that the estimate mP isnot accurate Based on this theoretical observation we suppose that nor-malization of the weights (this makes the sum equal to 1) results in a betterestimate

434 Time Complexity For each time t the naive implementation of algo-rithm 3 requires a time complexity of O(n3) for the size n of the joint sample(XiYi)n

i=1 This comes from algorithm 1 in line 15 (kernel Bayesrsquo rule) andalgorithm 2 in line 10 (resampling) The complexity O(n3) of algorithm 1 isdue to the matrix inversions Note that one of the inversions (GX + nεIn)minus1can be computed before the test phase as it does not involve the test dataAlgorithm 2 also has complexity O(n3) In section 52 we will explain howthis cost can be reduced to O(n2) by generating only lt n samples byresampling

435 Speeding Up Methods In appendix C we describe two methods forreducing the computational costs of KMCF both of which only need tobe applied prior to the test phase The first is a low-rank approximationof kernel matrices GX GY which reduces the complexity to O(nr2) wherer is the rank of low-rank matrices Low-rank approximation works wellin practice since eigenvalues of a kernel matrix often decay very rapidlyIndeed this has been theoretically shown for some cases (see Widom 19631964 and discussions in Bach amp Jordan 2002) Second is a data-reductionmethod based on kernel herding which efficiently selects joint subsamplesfrom the training set (XiYi)n

i=1 Algorithm 3 is then applied based onlyon those subsamples The resulting complexity is thus O(r3) where r is the

Filtering with State-Observation Examples 401

number of subsamples This method is motivated by the fast convergencerate of kernel herding (Chen et al 2010)

Both methods require the number r to be chosen which is either the rankfor low-rank approximation or the number of subsamples in data reductionThis determines the trade-off between accuracy and computational timeIn practice there are two ways of selecting the number r By regardingr as a hyperparameter of KMCF we can select it by cross-validation orwe can choose r by comparing the resulting approximation error which ismeasured in a matrix norm for low-rank approximation and in an RKHSnorm for the subsampling method (For details see appendix C)

436 Transfer Learning Setting We assumed that the observation modelin the test phase is the same as for the training samples However thismight not hold in some situations For example in the vision-based local-ization problem the illumination conditions for the test and training phasesmight be different (eg the test is done at night while the training samplesare collected in the morning) Without taking into account such a signifi-cant change in the observation model KMCF would not perform well inpractice

This problem could be addressed by exploiting the framework of transferlearning (Pan amp Yang 2010) This framework aims at situations where theprobability distribution that generates test data is different from that oftraining samples The main assumption is that there exist a small number ofexamples from the test distribution Transfer learning then provides a wayof combining such test examples and abundant training samples therebyimproving the test performance The application of transfer learning in oursetting remains a topic for future research

44 Estimation of Posterior Statistics By algorithm 3 we obtain theestimates of the kernel means of posteriors equation 41 as

mxt |y1t=

nsumi=1

wtikX (middot Xi) (t = 1 T ) (48)

These contain the information on the posteriors p(xt |y1t ) (see sections 32and 34) We now show how to estimate statistics of the posteriors usingthese estimates For ease of presentation we consider the case X = R

dTheoretical arguments to justify these operations are provided by Kana-gawa and Fukumizu (2014)

441 Mean and Covariance Consider the posterior meanint

xt p(xt |y1t )dxtisin R

d and the posterior (uncentered) covarianceint

xtxTt p(xt |y1t )dxt isin R

dtimesd

402 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

These quantities can be estimated as

nsumi=1

wtiXi (mean)

nsumi=1

wtiXiXTi (covariance)

442 Probability Mass Let A sub X be a measurable set with smoothboundary Define the indicator function IA(x) by IA(x) = 1 for x isin A andIA(x) = 0 otherwise Consider the probability mass

intIA(x)p(xt |y1t )dxt This

can be estimated assumn

i=1 wtiIA(Xi)

443 Density Suppose p(xt |y1t ) has a density function Let J(x) be asmoothing kernel satisfying

intJ(x)dx = 1 and J(x) ge 0 Let h gt 0 and define

Jh(x) = 1hd J

( xh

) Then the density of p(xt |y1t ) can be estimated as

p(xt |y1t ) =nsum

i=1

wtiJh(xt minus Xi) (49)

with an appropriate choice of h

444 Mode The mode may be obtained by finding a point that maxi-mizes equation 49 However this requires a careful choice of h Instead wemay use Ximax

with imax = arg maxi wti as a mode estimate This is the pointin X1 Xn that is associated with the maximum weight in wt1 wtnThis point can be interpreted as the point that maximizes equation 49 inthe limit of h rarr 0

445 Other Methods Other ways of using equation 48 include the preim-age computation and fitting of gaussian mixtures (See eg Song et al 2009Fukumizu et al 2013 McCalman OrsquoCallaghan amp Ramos 2013)

5 Theoretical Analysis

In this section we analyze the sampling procedure of the prediction stepin section 42 Specifically we derive an upper bound on the error of theestimator 44 We also discuss in detail how the resampling step in section 42works as a preprocessing step of the prediction step

To make our analysis clear we slightly generalize the setting of theprediction step and discuss the sampling and resampling procedures inthis setting

51 Error Bound for the Prediction Step Let X be a measurable spaceand P be a probability distribution on X Let p(middot|x) be a conditional

Filtering with State-Observation Examples 403

distribution on X conditioned on x isin X Let Q be a marginal distributionon X defined by Q(B) = int

p(B|x)dP(x) for all measurable B sub X In the fil-tering setting of section 4 the space X corresponds to the state space andthe distributions P p(middot|x) and Q correspond to the posterior p(xtminus1|y1tminus1)

at time t minus 1 the transition model p(xt |xtminus1) and the prior p(xt |y1tminus1) attime t respectively

Let kX be a positive-definite kernel onX andHX be the RKHS associatedwith kX Let mP = int

kX (middot x)dP(x) and mQ = intkX (middot x)dQ(x) be the kernel

means of P and Q respectively Suppose that we are given an empiricalestimate of mP as

mP =nsum

i=1

wikX (middot Xi) (51)

where w1 wn isin R and X1 Xn isin X Considering this weighted sam-ple form enables us to explain the mechanism of the resampling step

The prediction step can then be cast as the following procedure for eachsample Xi we generate a new sample X prime

i with the conditional distributionX prime

i sim p(middot|Xi) Then we estimate mQ by

mQ =nsum

i=1

wikX (middot X primei ) (52)

which corresponds to the estimate 44 of the prior kernel mean at time tThe following theorem provides an upper bound on the error of equa-

tion 52 and reveals properties of equation 51 that affect the error of theestimator equation 52 The proof is given in appendix A

Theorem 1 Let mP be a fixed estimate of mP given by equation 51 Define afunction θ on X times X by θ (x1 x2) =

int intkX (xprime

1 xprime2)dp(xprime

1|x1)dp(xprime2|x2)forallx1 x2 isin

X times X and assume that θ is included in the tensor RKHS HX otimes HX 4 The

4 The tensor RKHS HX otimes HX is the RKHS of a product kernel kXtimesX on X times X de-fined as kXtimesX ((xa xb) (xc xd )) = kX (xa xc)kX (xb xd ) forall(xa xb) (xc xd ) isin X times X Thisspace HX otimes HX consists of smooth functions on X times X if the kernel kX is smooth (egif kX is gaussian see section 4 of Steinwart amp Christmann 2008) In this case we caninterpret this assumption as requiring that θ be smooth as a function on X times X

The function θ can be written as the inner product between the kernel means ofthe conditional distributions θ (x1 x2) = 〈mp(middot|x1 )

mp(middot|x2 )〉HX

where mp(middot|x)= int

kX (middot xprime)dp(xprime|x) Therefore the assumption may be further seen as requiring that the mapx rarr mp(middot|x)

be smooth Note that while similar assumptions are common in the litera-ture on kernel mean embeddings (eg theorem 5 of Fukumizu et al 2013) we may relaxthis assumption by using approximate arguments in learning theory (eg theorems 22and 23 of Eberts amp Steinwart 2013) This analysis remains a topic for future research

404 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

estimator mQ equation 52 then satisfies

EXprime1Xprime

n[mQ minus mQ2

HX]

lensum

i=1

w2i (EXprime

i[kX (Xprime

i Xprimei )] minus EXprime

i Xprimei[kX (Xprime

i Xprimei )]) (53)

+ mP minus mP2HX

θHX otimesHX (54)

where Xprimei sim p(middot|Xi ) and Xprime

i is an independent copy of Xprimei

From theorem 1 we can make the following observations First thesecond term equation 54 of the upper bound shows that the error of theestimator equation 52 is likely to be large if the given estimate equation 51has large error mP minus mP2

HX which is reasonable to expect

Second the first term equation 53 shows that the error of equation52 can be large if the distribution of X prime

i (ie p(middot|Xi)) has large varianceFor example suppose X prime

i = f (Xi) + εi where f X rarr X is some mappingand εi is a random variable with mean 0 Let kX be the gaussian ker-nel kX (x xprime) = exp(minusx minus xprime22α) for some α gt 0 Then EX prime

i[kX (X prime

i X primei )] minus

EX primei X

primei[kX (X prime

i X primei )] increases from 0 to 1 as the variance of εi (ie the vari-

ance of X primei ) increases from 0 to infinity Therefore in this case equation

53 is upper-bounded at worst bysumn

i=1 w2i Note that EX prime

i[kX (X prime

i X primei )] minus

EX primei X

primei[kX (X prime

i X primei )] is always nonnegative5

511 Effective Sample Size Now let us assume that the kernel kX isbounded there is a constant C gt 0 such that supxisinX kX (x x) lt C Thenthe inequality of theorem 1 can be further bounded as

EX prime1X

primen[mQ minusmQ2

HX] le 2C

nsumi=1

w2i +mP minusmP2

HXθHX otimesHX

(55)

This bound shows that two quantities are important in the estimateequation 51 (1) the sum of squared weights

sumni=1 w2

i and (2) the errormP minus mP2

HX In other words the error of equation 52 can be large if the

quantitysumn

i=1 w2i is large regardless of the accuracy of equation 51 as an

estimator of mP In fact the estimator of the form 51 can have largesumn

i=1 w2i

even when mP minus mP2HX

is small as shown in section 61

5To show this it is sufficient to prove thatintint

kX (x x)dP(x)dP(x) le intkX (x x)dP(x)

for any probability P This can be shown as followsintint

kX (x x)dP(x)dP(x) = intint 〈kX (middot x)

kX (middot x)〉HXdP(x)dP(x) le intint radic

kX (x x)radic

kX (x x)dP(x)dP(x) le intkX (x x)dP(x) Here we

used the reproducing property the Cauchy-Schwartz inequality and Jensenrsquos inequality

Filtering with State-Observation Examples 405

Figure 3 An illustration of the sampling procedure with (right) and without(left) the resampling algorithm The left panel corresponds to the kernel meanestimators equations 51 and 52 in section 51 and the right panel correspondsto equations 56 and 57 in section 52

The inverse of the sum of the squared weights 1sumn

i=1 w2i can be in-

terpreted as the effective sample size (ESS) of the empirical kernel meanequation 51 To explain this suppose that the weights are normalizedsumn

i=1 wi = 1 Then ESS takes its maximum n when the weights are uniformw1 = middot middot middot wn = 1n It becomes small when only a few samples have largeweights (see the left side in Figure 3) Therefore the bound equation 55can be interpreted as follows To make equation 52 a good estimator ofmQ we need to have equation 51 such that the ESS is large and the errormP minus mPH is small Here we borrowed the notion of ESS from the litera-ture on particle methods in which ESS has also been played an importantrole (see section 253 of Liu 2001 and section 35 of Doucet amp Johansen2011)

52 Role of Resampling Based on these arguments we explain how theresampling step in section 42 works as a preprocessing step for the samplingprocedure Consider mP in equation 51 as an estimate equation 46 givenby the correction step at time t minus 1 Then we can think of mQ equation 52as an estimator of the kernel mean equation 45 of the prior without theresampling step

The resampling step is application of kernel herding to mP to obtainsamples X1 Xn which provide a new estimate of mP with uniformweights

mP = 1n

nsumi=1

kX (middot Xi) (56)

The subsequent prediction step is to generate a sample X primei sim p(middot|Xi) for each

Xi (i = 1 n) and estimate mQ as

406 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

mQ = 1n

nsumi=1

kX (middot X primei ) (57)

Theorem 1 gives the following bound for this estimator that corresponds toequation 55

EX prime1X

primen[mQ minus mQ2

HX] le 2C

n+ mP minus mP2

HθHX otimesHX (58)

A comparison of the upper bounds of equations 55 and 58 implies thatthe resampling step is beneficial when

sumni=1 w2

i is large (ie the ESS is small)and mP minus mPHX

is small The condition on mP minus mPHXmeans that the

loss by kernel herding (in terms of the RKHS distance) is small This impliesmP minus mPHX

asymp mP minus mPHX so the second term of equation 58 is close

to that of equation 55 On the other hand the first term of equation 58 willbe much smaller than that of equation 55 if

sumni=1 w2

i 1n In other wordsthe resampling step improves the accuracy of the sampling procedure byincreasing the ESS of the kernel mean estimate mP This is illustrated inFigure 3

The above observations lead to the following procedures

521 When to Apply Resampling Ifsumn

i=1 w2i is not large the gain by

the resampling step will be small Therefore the resampling algorithmshould be applied when

sumni=1 w2

i is above a certain threshold say 2n Thesame strategy has been commonly used in particle methods (see Doucet ampJohansen 2011)

Also the bound equation 53 of theorem 1 shows that resampling is notbeneficial if the variance of the conditional distribution p(middot|x) is very small(ie if state transition is nearly deterministic) In this case the error of thesampling procedure may increase due to the loss mP minus mPHX

caused bykernel herding

522 Reduction of Computational Cost Algorithm 2 generates n samplesX1 Xn with time complexity O(n3) Suppose that the first samplesX1 X where lt n already approximate mP well 1

sumi=1 kX (middot Xi) minus

mPHXis small We do not then need to generate the rest of samples

X+1 Xn we can make n samples by copying the samples n times(suppose n can be divided by for simplicity say n = 2) Let X1 Xn

denote these n samples Then 1

sumi=1 kX (middot Xi) = 1

n

sumni=1 kX (middot Xi) by defi-

nition so 1n

sumni=1 kX (middot Xi) minus mPHX

is also small This reduces the time

complexity of algorithm 2 to O(n2)One might think that it is unnecessary to copy n times to make n

samples This is not true however Suppose that we just use the first

Filtering with State-Observation Examples 407

samples to define mP = 1

sumi=1 kX (middot Xi) Then the first term of equation

58 becomes 2C which is larger than 2Cn of n samples This differenceinvolves sampling with the conditional distribution X prime

i sim p(middot|Xi) If we usejust the samples sampling is done times If we use the copied n samplessampling is done n times Thus the benefit of making n samples comesfrom sampling with the conditional distribution many times This matchesthe bound of theorem 1 where the first term involves the variance of theconditional distribution

53 Convergence Rates for Resampling Our resampling algorithm (seealgorithm 2) is an approximate version of kernel herding in section 35algorithm 2 searches for the solutions of the update equations 36 and 37from a finite set X1 Xn sub X not from the entire space X Thereforeexisting theoretical guarantees for kernel herding (Chen et al 2010 Bachet al 2012) do not apply to algorithm 2 Here we provide a theoreticaljustification

531 Generalized Version We consider a slightly generalized versionshown in algorithm 4 It takes as input (1) a kernel mean estimator mP of akernel mean mP (2) candidate samples Z1 ZN and (3) the number ofresampling It then outputs resampling samples X1 X isin Z1 ZNwhich form a new estimator mP = 1

sumi=1 kX (middot Xi) Here N is the number

of the candidate samplesAlgorithm 4 searches for solutions of the update equations 36 and 37

from the candidate set Z1 ZN Note that here these samples Z1 ZNcan be different from those expressing the estimator mP If they are thesamemdashthe estimator is expressed as mP = sumn

i=1 wtik(middot Xi) with n = N andXi = Zi (i = 1 n)mdashthen algorithm 4 reduces to algorithm 2 In facttheorem 2 allows mP to be any element in the RKHS

532 Convergence Rates in Terms of N and Algorithm 4 gives the newestimator mP of the kernel mean mP The error of this new estimatormP minus mPHX

should be close to that of the given estimator mP minus mPHX

408 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Theorem 2 guarantees this In particular it provides convergence rates ofmP minus mPHX

approaching mP minus mPHX as N and go to infinity This

theorem follows from theorem 3 in appendix B which holds under weakerassumptions

Theorem 2 Let mP be the kernel mean of a distribution P and mP be any element inthe RKHS HX Let Z1 ZN be an iid sample from a distribution with densityq Assume that P has a density function p such that supxisinX p(x)q (x) lt infin LetX1 X be samples given by algorithm 4 applied to mP with candidate samplesZ1 ZN Then for mP = 1

sumi=1 k(middot Xi ) we have

mP minus mP2HX

= (mP minus mPHX+ Op(Nminus12))2 + O

(ln

) (N rarr infin) (59)

Our proof in appendix B relies on the fact that kernel herding can beseen as the Frank-Wolfe optimization method (Bach et al 2012) Indeedthe error O(ln ) in equation 59 comes from the optimization error of theFrank-Wolfe method after iterations (Freund amp Grigas 2014 bound 32)The error Op(N

minus12) is due to the approximation of the solution space by afinite set Z1 ZN These errors will be small if N and are large enoughand the error of the given estimator mP minus mPHX

is relatively large Thisis formally stated in corollary 1 below

Theorem 2 assumes that the candidate samples are iid with a density qThe assumption supxisinX p(x)q(x) lt infin requires that the support of q con-tains that of p This is a formal characterization of the explanation in section42 that the samples X1 XN should cover the support of P sufficientlyNote that the statement of theorem 2 also holds for non-iid candidatesamples as shown in theorem 3 of appendix B

533 Convergence Rates as mP Goes to mP Theorem 2 provides conver-gence rates when the estimator mP is fixed In corollary 1 below we let mPapproach mP and provide convergence rates for mP of algorithm 4 approach-ing mP This corollary directly follows from theorem 2 since the constantterms in Op(N

minus12) and O(ln ) in equation 59 do not depend on mPwhich can be seen from the proof in section B

Corollary 1 Assume that P and Z1 ZN satisfy the conditions in theorem 2for all N Let m(n)

P be an estimator of mP such that m(n)P minus mPHX

= Op(nminusb) as

n rarr infin for some constant b gt 06 Let N = = n2b Let X(n)1 X(n)

be samples

6Here the estimator m(n)P and the candidate samples Z1 ZN can be dependent

Filtering with State-Observation Examples 409

given by algorithm 4 applied to m(n)P with candidate samples Z1 ZN Then

for m(n)P = 1

sumi=1 kX (middot X(n)

i ) we have

m(n)P minus mPHX

= Op(nminusb) (n rarr infin) (510)

Corollary 1 assumes that the estimator m(n)

P converges to mP at a rateOp(n

minusb) for some constant b gt 0 Then the resulting estimator m(n)

P by algo-rithm 4 also converges to mP at the same rate O(nminusb) if we set N = = n2bThis implies that if we use sufficiently large N and the errors Op(N

minus12)

and O(ln ) in equation 59 can be negligible as stated earlier Note thatN = = n2b implies that N and can be smaller than n since typically wehave b le 12 (b = 12 corresponds to the convergence rates of parametricmodels) This provides a support for the discussion in section 52 (reductionof computational cost)

534 Convergence Rates of Sampling after Resampling We can derive con-vergence rates of the estimator mQ equation 57 in section 52 Here we con-sider the following construction of mQ as discussed in section 52 (reductionof computational cost) First apply algorithm 4 to m(n)

P and obtain resam-pling samples X (n)

1 X (n) isin Z1 ZN Then copy these samples n

times and let X (n)

1 X (n)n be the resulting times n samples Finally

sample with the conditional distribution Xprime(n)i sim p(middot|Xi) (i = 1 n)

and define

m(n)

Q = 1n

nsumi=1

kX (middot Xprime(n)i ) (511)

The following corollary is a consequence of corollary 1 theorem 1 andthe bound equation 58 Note that theorem 1 obtains convergence in expec-tation which implies convergence in probability

Corollary 2 Let θ be the function defined in theorem 1 and assume θ isin HX otimes HX Assume that P and Z1 ZN satisfy the conditions in theorem 2 for all N Letm(n)

P be an estimator of mP such that m(n)P minus mPHX

= Op(nminusb) as n rarr infin forsome constant b gt 0 Let N = = n2b Then for the estimator m(n)

Q defined asequation 511 we have

m(n)Q minus mQHX

= Op(nminusmin(b12)) (n rarr infin)

Suppose b le 12 which holds with basically any nonparametric esti-mators Then corollary 2 shows that the estimator m(n)

Q achieves the sameconvergence rate as the input estimator m(n)

P Note that without resampling

410 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

the rate becomes Op(

radicsumni=1(w

(n)i )2 + nminusb) where the weights are given by

the input estimator m(n)

P = sumni=1 w

(n)i kX (middot X (n)

i ) (see the bound equation55) Thanks to resampling (the square root of) the sum of the squaredweights in the case of corollary 2 becomes 1

radicn le 1

radicn which is

usually smaller thanradicsumn

i=1(w(n)i )2 and is faster than or equal to Op(n

minusb)This shows the merit of resampling in terms of convergence rates (see alsothe discussions in section 52)

54 Consistency of the Overall Procedure Here we show the consis-tency of the overall procedure in KMCF This is based on corollary 2 whichshows the consistency of the resampling step followed by the predictionstep and on theorem 5 of Fukumizu et al (2013) which guarantees theconsistency of kernel Bayesrsquo rule in the correction step Thus we considerthree steps in the following order resampling prediction and correctionMore specifically we show the consistency of the estimator equation 46of the posterior kernel mean at time t given that the one at time t minus 1 isconsistent

To state our assumptions we will need the following functions θpos Y times Y rarr R θobs X times X rarr R and θtra X times X rarr R

θpos(y y) =intint

kX (xt xt )dp(xt |y1tminus1 yt = y)dp(xt |y1tminus1 yt = y)

(512)

θobs(x x) =intint

kY (yt yt )dp(yt |xt = x)dp(yt |xt = x) (513)

θtra(x x) =intint

kX (xt xt )dp(xt |xtminus1 = x)dp(xt |xtminus1 = x) (514)

These functions contain the information concerning the distributions in-volved In equation 512 the distribution p(xt |y1tminus1 yt = y) denotes theposterior of the state at time t given that the observation at time t is yt = ySimilarly p(xt |y1tminus1 yt = y) is the posterior at time t given that the observa-tion is yt = y In equation 513 the distributions p(yt |xt = x) and p(yt |xt = x)

denote the observation model when the state is xt = x or xt = x respectivelyIn equation 514 the distributions p(xt |xtminus1 = x) and p(xt |xtminus1 = x) denotethe transition model with the previous state given by xtminus1 = x or xtminus1 = xrespectively

For simplicity of presentation we consider here N = = n for the resam-pling step Below denote by F otimes G the tensor product space of two RKHSsF and G

Corollary 3 Let (X1 Y1) (Xn Yn) be an iid sample with a joint densityp(x y) = p(y|x)q (x) where p(y|x) is the observation model Assume that the

Filtering with State-Observation Examples 411

posterior p(xt|y1t) has a density p and that supxisinX p(x)q (x) lt infin Assumethat the functions defined by equations 512 to 514 satisfy θpos isin HY otimes HY θobs isin HX otimes HX and θtra isin HX otimes HX respectively Suppose that mxtminus1|y1tminus1

minusmxtminus1|y1tminus1

HXrarr 0 as n rarr infin in probability Then for any sufficiently slow decay

of regularization constants εn and δn of algorithm 1 we have

mxt |y1tminus mxt |y1t

HXrarr 0 (n rarr infin)

in probability

Corollary 3 follows from theorem 5 of Fukumizu et al (2013) and corol-lary 2 The assumptions θpos isin HY otimes HY and θobs isin HX otimes HX are due totheorem 5 of Fukumizu et al (2013) for the correction step while the as-sumption θtra isin HX otimes HX is due to theorem 1 for the prediction step fromwhich corollary 2 follows As we discussed in note 4 of section 51 these es-sentially assume that the functions θpos θobs and θtra are smooth Theorem 5of Fukumizu et al (2013) also requires that the regularization constantsεn δn of kernel Bayesrsquo rule should decay sufficiently slowly as the samplesize goes to infinity (εn δn rarr 0 as n rarr infin) (For details see sections 52 and62 in Fukumizu et al 2013)

It would be more interesting to investigate the convergence rates of theoverall procedure However this requires a refined theoretical analysis ofkernel Bayesrsquo rule which is beyond the scope of this letter This is becausecurrently there is no theoretical result on convergence rates of kernel Bayesrsquorule as an estimator of a posterior kernel mean (existing convergence resultsare for the expectation of function values see theorems 6 and 7 in Fukumizuet al 2013) This remains a topic for future research

6 Experiments

This section is devoted to experiments In section 61 we conduct basicexperiments on the prediction and resampling steps before going on to thefiltering problem Here we consider the problem described in section 5 Insection 62 the proposed KMCF (see algorithm 3) is applied to syntheticstate-space models Comparisons are made with existing methods applica-ble to the setting of the letter (see also section 2) In section 63 we applyKMCF to the real problem of vision-based robot localization

In the following N(μ σ 2) denotes the gaussian distribution with meanμ isin R and variance σ 2 gt 0

61 Sampling and Resampling Procedures The purpose here is to seehow the prediction and resampling steps work empirically To this end weconsider the problem described in section 5 with X = R (see section 51 fordetails) Specifications of the problem are described below

412 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

We will need to evaluate the errors mP minus mPHXand mQ minus mQHX

sowe need to know the true kernel means mP and mQ To this end we definethe distributions and the kernel to be gaussian this allows us to obtainanalytic expressions for mP and mQ

611 Distributions and Kernel More specifically we define the marginalP and the conditional distribution p(middot|x) to be gaussian P = N(0 σ 2

P ) andp(middot|x) = N(x σ 2

cond) Then the resulting Q = intp(middot|x)dP(x) also becomes

gaussian Q = N(0 σ 2P + σ 2

cond) We define kX to be the gaussian kernelkX (x xprime) = exp(minus(x minus xprime)22γ 2) We set σP = σcond = γ = 01

612 Kernel Means Due to the convolution theorem of gaussian func-tions the kernel means mP = int

kX (middot x)dP(x) and mQ = intkX (middot x)dQ(x)

can be analytically computed mP(x) =radic

γ 2

σ 2+γ 2 exp(minus x2

2(γ 2+σ 2P )

) mQ(x) =radicγ 2

(σ 2+σ 2cond+γ 2 )

exp(minus x2

2(σ 2P+σ 2

cond+γ 2 ))

613 Empirical Estimates We artificially defined an estimate mP =sumni=1 wikX (middot Xi) as follows First we generated n = 100 samples

X1 X100 from a uniform distribution on [minusA A] with some A gt 0 (spec-ified below) We computed the weights w1 wn by solving an optimiza-tion problem

minwisinRn

nsum

i=1

wikX (middot Xi) minus mP2H + λw2

and then applied normalization so thatsumn

i=1 wi = 1 Here λ gt 0 is a regular-ization constant which allows us to control the trade-off between the errormP minus mP2

HXand the quantity

sumni=1 w2

i = w2 If λ is very small the re-sulting mP becomes accurate mP minus mP2

HXis small but has large

sumni=1 w2

i If λ is large the error mP minus mP2

HXmay not be very small but

sumni=1 w2

i

becomes small This enables us to see how the error mQ minus mQ2HX

changesas we vary these quantities

614 Comparison Given mP = sumni=1 wikX (middot Xi) we wish to estimate the

kernel mean mQ We compare three estimators

bull woRes Estimate mQ without resampling Generate samples X primei sim

p(middot|Xi) to produce the estimate mQ = sumni=1 wikX (middot X prime

i ) This corre-sponds to the estimator discussed in section 51

bull Res-KH First apply the resampling algorithm of algorithm 2 to mPyielding X1 Xn Then generate X prime

i sim p(middot|Xi) for each Xi giving

Filtering with State-Observation Examples 413

Figure 4 Results of the experiments from section 61 (Top left and right)Sample-weight pairs of mP = sumn

i=1 wikX (middot Xi) and mQ = sumni=1 wik(middot X prime

i ) (Mid-dle left and right) Histogram of samples X1 Xn generated by algorithm2 and that of samples X prime

1 X primen from the conditional distribution (Bottom

left and right) Histogram of samples generated with multinomial resamplingafter truncating negative weights and that of samples from the conditionaldistribution

the estimate mQ = 1n

sumni=1 k(middot X prime

i ) This is the estimator discussed insection 52

bull Res-Trunc Instead of algorithm 2 first truncate negative weights inw1 wn to be 0 and apply normalization to make the sum of theweights to be 1 Then apply the multinomial resampling algorithmof particle methods and estimate mQ as Res-KH

615 Demonstration Before starting quantitative comparisons wedemonstrate how the above estimators work Figure 4 shows demonstra-tion results with A = 1 First note that for mP = sumn

i=1 wik(middot Xi) samplesassociated with large weights are located around the mean of P as thestandard deviation of P is relatively small σP = 01 Note also that some

414 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

of the weights are negative In this example the error of mP is very smallmP minus mP2

HX= 849e minus 10 while that of the estimate mQ given by woRes is

mQ minus mQ2HX

= 0125 This shows that even if mP minus mP2HX

is very small

the resulting mQ minus mQ2HX

may not be small as implied by theorem 1 andthe bound equation 55

We can observe the following First algorithm 2 successfully discardedsamples associated with very small weights Almost all the generatedsamples X1 Xn are located in [minus2σP 2σP] where σP is the standarddeviation of P The error is mP minus mP2

HX= 474e minus 5 which is greater

than mP minus mP2HX

This is due to the additional error caused by the re-sampling algorithm Note that the resulting estimate mQ is of the errormQ minus mQ2

HX= 000827 This is much smaller than the estimate mQ by

woRes showing the merit of the resampling algorithmRes-Trunc first truncated the negative weights in w1 wn Let us

see the region where the density of P is very smallmdashthe region outside[minus2σP 2σP] We can observe that the absolute values of weights are verysmall in this region Note that there exist positive and negative weightsThese weights maintain balance such that the amounts of positive and neg-ative values are almost the same Therefore the truncation of the negativeweights breaks this balance As a result the amount of the positive weightssurpasses the amount needed to represent the density of P This can be seenfrom the histogram for Res-Trunc some of the samples X1 Xn gener-ated by Res-Trunc are located in the region where the density of P is verysmall Thus the resulting error mP minus mP2

HX= 00538 is much larger than

that of Res-KH This demonstrates why the resampling algorithm of parti-cle methods is not appropriate for kernel mean embeddings as discussedin section 42

616 Effects of the Sum of Squared Weights The purpose here is to see howthe error mQ minus mQ2

HXchanges as we vary the quantity

sumni=1 w2

i (recall that

the bound equation 55 indicates that mQ minus mQ2HX

increases assumn

i=1 w2i

increases) To this end we made mP = sumni=1 wikX (middot Xi) for several values of

the regularization constant λ as described above For each λ we constructedmP and estimated mQ using each of the three estimators above We repeatedthis 20 times for each λ and averaged the values of mP minus mP2

HXsumn

i=1 w2i

and the errors mQ minus mQ2HX

by the three estimators Figure 5 shows theseresults where the both axes are in the log scale Here we used A = 5 forthe support of the uniform distribution7 The results are summarized asfollows

7This enables us to maintain the values for mP minus mP2HX

in almost the same amountwhile changing the values for

sumni=1 w2

i

Filtering with State-Observation Examples 415

Figure 5 Results of synthetic experiments for the sampling and resampling pro-cedure in section 61 Vertical axis errors in the squared RKHS norm Horizontalaxis values of

sumni=1 w2

i for different mP Black the error of mP (mP minus mP2HX

)Blue green and red the errors on mQ by woRes Res-KH and Res-Truncrespectively

bull The error of woRes (blue) increases proportionally to the amount ofsumni=1 w2

i This matches the bound equation 55bull The error of Res-KH is not affected by

sumni=1 w2

i Rather it changes inparallel with the error of mP This is explained by the discussions insection 52 on how our resampling algorithm improves the accuracyof the sampling procedure

bull Res-Trunc is worse than Res-KH especially for largesumn

i=1 w2i This

is also explained with the bound equation 58 Here mP is the onegiven by Res-Trunc so the error mP minus mPHX

can be large due tothe truncation of negative weights as shown in the demonstrationresults This makes the resulting error mQ minus mQHX

large

Note that mP and mQ are different kernel means so it can happen thatthe errors mQ minus mQHX

by Res-KH are less than mp minus mPHX as in

Figure 5

416 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

62 Filtering with Synthetic State-Space Models Here we applyKMCF to synthetic state-space models Comparisons were made with thefollowing methods

kNN-PF (Vlassis et al 2002) This method uses k-NN-based condi-tional density estimation (Stone 1977) for learning the observation modelFirst it estimates the conditional density of the inverse direction p(x|y)

from the training sample (XiYi) The learned conditional density is thenused as an alternative for the likelihood p(yt |xt ) this is a heuristic todeal with high-dimensional yt Then it applies particle filter (PF) basedon the approximated observation model and the given transition modelp(xt |xtminus1)

GP-PF (Ferris et al 2006) This method learns p(yt |xt ) from (XiYi)with gaussian process (GP) regression Then particle filter is applied basedon the learned observation model and the transition model We used theopen source code for GP-regression in this experiment so comparison incomputational time is omitted for this method8

KBR filter (Fukumizu et al 2011 2013) This method is also based onkernel mean embeddings as is KMCF It applies kernel Bayesrsquo rule (KBR) inposterior estimation using the joint sample (XiYi) This method assumesthat there also exist training samples for the transition model Thus inthe following experiments we additionally drew training samples for thetransition model Fukumizu et al (2011 2013) showed that this methodoutperforms extended and unscented Kalman filters when a state-spacemodel has strong nonlinearity (in that experiment these Kalman filterswere given the full knowledge of a state-space model) We use this methodas a baseline

We used state-space models defined in Table 2 where ldquoSSMrdquo stands forldquostate-space modelrdquo In Table 2 ut denotes a control input at time t vtand wt denotes independent gaussian noise vt wt sim N(0 1) Wt denotes10-dimensional gaussian noise Wt sim N(0 I10) We generated each controlut randomly from the gaussian distribution N(0 1)

The state and observation spaces for SSMs 1a 1b 2a 2b 4a 4b aredefined as X = Y = R for SSMs 3a 3b X = RY = R

10 The models inSSMs 1a 2a 3a 4a and SSMs 1b 2b 3b 4b with the same number (eg1a and 1b) are almost the same the difference is whether ut exists in thetransition model Prior distributions for the initial state x1 for SSMs 1a 1b2a 2b 3a 3b are defined as pinit = N(0 1(1 minus 092)) and those for 4a4b are defined as a uniform distribution on [minus3 3]

SSMs 1a and 1b are linear gaussian models SSMs 2a and 2b are the so-called stochastic volatility models Their transition models are the same asthose of SSMs 1a and 1b The observation model has strong nonlinearityand the noise wt is multiplicative SSMs 3a and 3b are almost the same as

8httpwwwgaussianprocessorggpmlcodematlabdoc

Filtering with State-Observation Examples 417

Table 2 State-Space Models for Synthetic Experiments

SSM Transition Model Observation Model

1a xt = 09xtminus1 + vt yt = xt + wt

1b xt = 09xtminus1 + 1radic2(ut + vt ) yt = xt + wt

2a xt = 09xtminus1 + vt yt = 05 exp(xt2)wt

2b xt = 09xtminus1 + 1radic2(ut + vt ) yt = 05 exp(xt2)wt

3a xt = 09xtminus1 + vt yt = 05 exp(xt2)Wt

3b xt = 09xtminus1 + 1radic2(ut + vt ) yt = 05 exp(xt2)Wt

4a at = xtminus1 + radic2vt bt = xt + wt

xt =

at (if |at | le 3)

minus3 (otherwise)yt =

bt (if |bt | le 3)

bt minus 6bt|bt | (otherwise)

4b at = xtminus1 + ut + vt bt = xt + wt

xt =

at (if |at | le 3)

minus3 (otherwise)yt =

bt (if |bt | le 3)

bt minus 6bt|bt | (otherwise)

SSMs 2a and 2b The difference is that the observation yt is 10-dimensionalas Wt is 10-dimensional gaussian noise SSMs 4a and 4b are more complexthan the other models Both the transition and observation models havestrong nonlinearities states and observations located around the edges ofthe interval [minus3 3] may abruptly jump to distant places

For each model we generated the training samples (XiYi)ni=1 by sim-

ulating the model Test data (xt yt )Tt=1 were also generated by indepen-

dent simulation (recall that xt is hidden for each method) The length ofthe test sequence was set as T = 100 We fixed the number of particles inkNN-PF and GP-PF to 5000 in primary experiments we did not observeany improvements even when more particles were used For the same rea-son we fixed the size of transition examples for KBR filter to 1000 Eachmethod estimated the ground-truth states x1 xT by estimating the pos-terior means

intxt p(xt |y1t )dxt (t = 1 T ) The performance was evaluated

with RMSE (root mean squared errors) of the point estimates defined as

RMSE =radic

1T

sumTt=1(xt minus xt )

2 where xt is the point estimateFor KMCF and KBR filter we used gaussian kernels for each of X and Y

(and also for controls in KBR filter) We determined the hyperparameters ofeach method by two-fold cross-validation by dividing the training data intotwo sequences The hyperparameters in the GP-regressor for PF-GP wereoptimized by maximizing the marginal likelihood of the training data Toreduce the costs of the resampling step of KMCF we used the method dis-cussed in section 52 with = 50 We also used the low-rank approximationmethod (see algorithm 5) and the subsampling method (see algorithm 6)in appendix C to reduce the computational costs of KMCF Specifically we

418 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Figure 6 RMSE of the synthetic experiments in section 62 The state-spacemodels of these figures have no control in their transition models

used r = 10 20 (rank of low-rank matrices) for algorithm 5 (described asKMCF-low10 and KMCF-low20 in the results below) r = 50 100 (numberof subsamples) for algorithm 6 (described as KMCF-sub50 and KMCF-sub100) We repeated experiments 20 times for each of different trainingsample size n

Figure 6 shows the results in RMSE for SSMs 1a 2a 3a 4a and Figure 7shows those for SSMs 1b 2b 3b 4b Figure 8 describes the results incomputational time for SSMs 1a and 1b the results for the other models aresimilar so we omit them We do not show the results of KMCF-low10 inFigures 6 and 7 since they were numerically unstable and gave very largeRMSEs

GP-PF performed the best for SSMs 1a and 1b This may be becausethese models fit the assumption of GP-regression as their noise is addi-tive gaussian For the other models however GP-PF performed poorly

Filtering with State-Observation Examples 419

Figure 7 RMSE of synthetic experiments in section 62 The state-space modelsof these figures include control ut in their transition models

Figure 8 Computation time of synthetic experiments in section 62 (Left) SSM1a (Right) SSM 1b

420 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

the observation models of these models have strong nonlinearities and thenoise is not additive gaussian For these models KMCF performed thebest or competitively with the other methods This indicates that KMCFsuccessfully exploits the state-observation examples (XiYi)n

i=1 in dealingwith the complicated observation models Recall that our focus has beenon situations where the relations between states and observations are socomplicated that the observation model is not known the results indicatethat KMCF is promising for such situations On the other hand the KBRfilter performed worse than KMCF for most of the models KBF filter alsouses kernel Bayesrsquo rule as KMCF The difference is that KMCF makes use ofthe transition models directly by sampling while the KBR filter must learnthe transition models from training data for state transitions This indicatesthat the incorporation of the knowledge expressed in the transition model isvery important for the filtering performance This can also be seen by com-paring Figures 6 and 7 The performance of the methods other than KBRfilter improved for SSMs 1b 2b 3b 4b compared to the performance forthe corresponding models in SSMs 1a 2a 3a 4a Recall that SSMs 1b2b 3b 4b include control ut in their transition models The informationof control input is helpful for filtering in general Thus the improvementssuggest that KMCF kNN-PF and GP-PF successfully incorporate the in-formation of controls they achieve this by sampling with p(xt |xtminus1 ut ) Onthe other hand KBF filter must learn the transition model p(xt |xtminus1 ut ) thiscan be harder than learning the transition model p(xt |xtminus1) which has nocontrol input

We next compare computation time (see Figure 8) KMCF was competi-tive or even slower than the KBR filter This is due to the resampling step inKMCF The speed-up methods (KMCF-low10 KMCF-low20 KMCF-sub50and KMCF-sub100) successfully reduced the costs of KMCF KMCF-low10and KMCF-low20 scaled linearly to the sample size n this matches the factthat algorithm 5 reduces the costs of kernel Bayesrsquo Rule to O(nr2) The costsof KMCF-sub50 and KMCF-sub100 remained almost the same over the dif-ferent sample sizes This is because they reduce the sample size itself fromn to r so the costs are reduced to O(r3) (see algorithm 6) KMCF-sub50 andKMCF-sub100 are competitive to kNN-PF which is fast as it only needskNN searches to deal with the training sample (XiYi)n

i=1 In Figures 6and 7 KMCF-low20 and KMCF-sub100 produced the results competitiveto KMCF for SSMs 1a 2a 4a 1b 2b 4b Thus for these models suchmethods reduce the computational costs of KMCF without losing muchaccuracy KMCF-sub50 was slightly worse than KMCF-100 This indicatesthat the number of subsamples cannot be reduced to this extent if we wishto maintain accuracy For SSMs 3a and 3b the performance of KMCF-low20and KMCF-sub100 was worse than KMCF in contrast to the performancefor the other models The difference of SSMs 3a and 3b from the other mod-els is that the observation space is 10-dimensional Y = R

10 This suggeststhat if the dimension is high r needs to be large to maintain accuracy (recall

Filtering with State-Observation Examples 421

that r is the rank of low-rank matrices in algorithm 5 and the number ofsubsamples in algorithm 6) This is also implied by the experiments in thesection 63

63 Vision-Based Mobile Robot Localization We applied KMCF to theproblem of vision-based mobile robot localization (Vlassis et al 2002 Wolfet al 2005 Quigley et al 2010) We consider a robot moving in a buildingThe robot takes images with its vision camera as it moves Thus the visionimages form a sequence of observations y1 yT in time series each ytis an image The robot does not know its positions in the building wedefine state xt as the robotrsquos position at time t The robot wishes to estimateits position xt from the sequence of its vision images y1 yt This canbe done by filtering that is by estimating the posteriors p(xt |y1 yt )

(t = 1 T ) This is the robot localization problem It is fundamental inrobotics as a basis for more involved applications such as navigation andreinforcement learning (Thrun Burgard amp Fox 2005)

The state-space model is defined as follows The observation modelp(yt |xt ) is the conditional distribution of images given position which isvery complicated and considered unknown We need to assume position-image examples (XiYi)n

i=1 these samples are given in the data setdescribed below The transition model p(xt |xtminus1) = p(xt |xtminus1 ut ) is the con-ditional distribution of the current position given the previous one Thisinvolves a control input ut that specifies the movement of the robot In thedata set we use the control is given as odometry measurements Thus wedefine p(xt |xtminus1 ut ) as the odometry motion model which is fairly standardin robotics (Thrun et al 2005) Specifically we used the algorithm describedin table 56 of Thrun et al (2005) with all of its parameters fixed to 01 Theprior pinit of the initial position x1 is defined as a uniform distribution overthe samples X1 Xn in (XiYi)n

i=1As a kernel kY for observations (images) we used the spatial pyramid

matching kernel of Lazebnik et al (2006) This is a positive-definite kerneldeveloped in the computer vision community and is also fairly standardSpecifically we set the parameters of this kernel as suggested in Lazebniket al (2006) This gives a 4200-dimensional histogram for each image Wedefined the kernel kX for states (positions) as gaussian Here the state spaceis the four-dimensional space X = R

4 two dimensions for location and therest for the orientation of the robot9

The data set we used is the COLD database (Pronobis amp Caputo 2009)which is publicly available Specifically we used the data set Freiburg PartA Path 1 cloudy This data set consists of three similar trajectories of arobot moving in a building each of which provides position-image pairs(xt yt )T

t=1 We used two trajectories for training and validation and the

9We projected the robotrsquos orientation in [0 2π ] onto the unit circle in R2

422 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

rest for test We made state-observation examples (XiYi)ni=1 by randomly

subsampling the pairs in the trajectory for training Note that the difficultyof localization may depend on the time interval (ie the interval between tand t minus 1 in sec) Therefore we made three test sets (and training samplesfor state transitions in KBR filter) with different time intervals 227 sec(T = 168) 454 sec (T = 84) and 681 sec (T = 56)

In these experiments we compared KMCF with three methods kNN-PFKBR filter and the naive method (NAI) defined below For KBR filter wealso defined the gaussian kernel on the control ut that is on the differenceof odometry measurements at time t minus 1 and t The naive method (NAI)estimates the state xt as a point Xj in the training set (XiYi) such that thecorresponding observation Yj is closest to the observation yt We performedthis as a baseline We also used the spatial pyramid matching kernel forthese methods (for kNN-PF and NAI as a similarity measure of the nearestneighbors search) We did not compare with GP-PF since it assumes thatobservations are real vectors and thus cannot be applied to this problemstraightforwardly We determined the hyperparameters in each methodby cross-validation To reduce the cost of the resampling step in KMCFwe used the method discussed in section 52 with = 100 The low-rankapproximation method (see algorithm 5) and the subsampling method (seealgorithm 6) were also applied to reduce the computational costs of KMCFSpecifically we set r = 50 100 for algorithm 5 (described as KMCF-low50and KMCF-low100 in the results below) and r = 150 300 for algorithm 6(KMCF-sub150 and KMCF-sub300)

Note that in this problem the posteriors p(xt |y1t ) can be highly multi-modal This is because similar images appear in distant locations Thereforethe posterior mean

intxt p(xt |y1t )dxt is not appropriate for point estimation

of the ground-truth position xt Thus for KMCF and KBR filter we em-ployed the heuristic for mode estimation explained in section 44 For kNN-PF we used a particle with maximum weight for the point estimationWe evaluated the performance of each method by RMSE of location esti-mates We ran each experiment 20 times for each training set of differentsize

631 Results First we demonstrate the behaviors of KMCF with this lo-calization problem Figures 9 and 10 show iterations of KMCF with n = 400applied to the test data with time interval 681 sec Figure 9 illustrates itera-tions that produced accurate estimates while Figure 10 describes situationswhere location estimation is difficult

Figures 11 and 12 show the results in RMSE and computational timerespectively For all the results KMCF and that with the computational re-duction methods (KMCF-low50 KMCF-low100 KMCF-sub150 and KMCF-sub300) performed better than KBR filter These results show the bene-fit of directly manipulating the transition models with sampling KMCF

Filtering with State-Observation Examples 423

Figure 9 Demonstration results Each column corresponds to one iteration ofKMCF (Top) Prediction step histogram of samples for prior (Middle) Cor-rection step weighted samples for posterior The blue and red stems indi-cate positive and negative weights respectively The yellow ball representsthe ground-truth location xt and the green diamond the estimated one xt (Bottom) Resampling step histogram of samples given by the resampling step

was competitive with kNN-PF for the interval 227 sec note that kNN-PFwas originally proposed for the robot localization problem For the resultswith the longer time intervals (454 sec and 681 sec) KMCF outperformedkNN-PF

We next investigate the effect on KMCF of the methods to reduce com-putational cost The performance of KMCF-low100 and KMCF-sub300 iscompetitive with KMCF those of KMCF-low50 and KMCF-sub150 de-grade as the sample size increases Note that r = 50 100 for algorithm 5are larger than those in section 62 though the values of the sample size nare larger than those in section 62 Also note that the performance of KMCF-sub150 is much worse than KMCF-sub300 These results indicate that wemay need large values for r to maintain the accuracy for this localizationproblem Recall that the spatial pyramid matching kernel gives essentially ahigh-dimensional feature vector (histogram) for each observation Thus the

424 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Figure 10 Demonstration results (see also the caption of Figure 9) Here weshow time points where observed images are similar to those in distant placesSuch a situation often occurs at corners and makes location estimation difficult(a) The prior estimate is reasonable but the resulting posterior has modes indistant places This makes the location estimate (green diamond) far from thetrue location (yellow ball) (b) While the location estimate is very accuratemodes also appear at distant locations

observation space Y may be considered high-dimensional This supportsthe hypothesis in section 62 that if the dimension is high the computationalcost-reduction methods may require larger r to maintain accuracy

Finally let us look at the results in computation time (see Figure 12)The results are similar to those in section 62 Although the values for r arerelatively large algorithms 5 and 6 successfully reduced the computationalcosts of KMCF

7 Conclusion and Future Work

This letter has proposed kernel Monte Carlo filter a novel filtering methodfor state-space models We have considered the situation where the obser-vation model is not known explicitly or even parametrically and where

Filtering with State-Observation Examples 425

Figure 11 RMSE of the robot localization experiments in section 63 Panels a band c show the cases for time interval 227 sec 454 sec and 681 sec respectively

examples of the state-observation relation are given instead of the obser-vation model Our approach is based on the framework of kernel meanembeddings which enables us to deal with the observation model in adata-driven manner The resulting filtering method consists of the predic-tion correction and resampling steps all realized in terms of kernel meanembeddings Methodological novelties lie in the prediction and resamplingsteps Thus we analyzed their behaviors by deriving error bounds for theestimator of the prediction step The analysis revealed that the effectivesample size of a weighted sample plays an important role as in parti-cle methods This analysis also explained how our resampling algorithmworks We applied the proposed method to synthetic and real problemsconfirming the effectiveness of our approach

426 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Figure 12 Computation time of the localization experiments in section 63Panels a b and c show the cases for time interval 227 sec 454 sec and 681 secrespectively Note that the results show the runtime of each method

One interesting topic for future research would be parameter estimationfor the transition model We did not discuss this and assumed that param-eters if they exist are given and fixed If the state observation examples(XiYi)n

i=1 are given as a sequence from the state-space model then wecan use the state samples X1 Xn for estimating those parameters Oth-erwise we need to estimate the parameters based on test data This mightbe possible by exploiting approaches for parameter estimation in particlemethods (eg section IV in Cappe et al (2007))

Another important topic is the situation where the observation modelin the test and training phases is different As discussed in section 43 this

Filtering with State-Observation Examples 427

might be addressed by exploiting the framework of transfer learning (Panamp Yang 2010) This would require extension of kernel mean embeddingsto the setting of transfer learning since there has been no work in thisdirection We consider that such extension is interesting in its own right

Appendix A Proof of Theorem 1

Before going to the proof we review some basic facts that are neededLet mP = int

kX (middot x)dP(x) and mP = sumni=1 wikX (middot Xi) By the reproducing

property of the kernel kX the following hold for any f isin HX

〈mP f 〉HX=

langintkX (middot x)dP(x) f

rangHX

=int

〈kX (middot x) f 〉HXdP(x)

=int

f (x)dP(x) = EXsimP[ f (X)] (A1)

〈mP f 〉HX=

langnsum

i=1

wikX (middot Xi) f

rangHX

=nsum

i=1

wi f (Xi) (A2)

For any f g isin HX we denote by f otimes g isin HX otimes HX the tensor productof f and g defined as

f otimes g(x1 x2) = f (x1)g(x2) forallx1 x2 isin X (A3)

The inner product of the tensor RKHS HX otimes HX satisfies

〈 f1 otimesg1 f2 otimes g2〉HXotimesHX= 〈 f1 f2〉HX

〈g1 g2〉HXforall f1 f2 g1 g2 isinHX

(A4)

Let φiIs=1 sub HX be complete orthonormal bases of HX where I isin N cup infin

Assume θ isin HX otimes HX (recall that this is an assumption of theorem 1) Thenθ is expressed as

θ =Isum

st=1

αstφs otimes φt (A5)

withsum

st |αst |2 lt infin (see Aronszajn 1950)

Proof of Theorem 1 Recall that mQ = sumni=1 wikX (middot X prime

i ) where X primei sim

p(middot|Xi) (i = 1 n) Then

EX prime1X

primen[mQ minus mQ2

HX]

428 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

= EX prime1X

primen[〈mQ mQ〉HX

minus 2〈mQ mQ〉HX+ 〈mQ mQ〉HX

]

=nsum

i j=1

wiw jEX primei X

primej[kX (X prime

i X primej)]

minus 2nsum

i=1

wiEX primesimQX primei[kX (X prime X prime

i )] + EX primeX primesimQ[kX (X prime X prime)]

=sumi = j

wiw jEX primei X

primej[kX (X prime

i X primej)] +

nsumi=1

w2i EX prime

i[kX (X prime

i X primei )]

minus 2nsum

i=1

wiEX primesimQX primei[kX (X prime X prime

i )] + EX primeX primesimQ[kX (X prime X prime)] (A6)

where X prime denotes an independent copy of X primeRecall that Q= int

p(middot|x)dP(x) and θ (x x) = int intkX (xprime xprime)dp(xprime|x)dp(xprime|x)

We can then rewrite terms in equation A6 as

EX primesimQX primei[kX (X prime X prime

i )]

=int (intint

kX (xprime xprimei)dp(xprime|x)dp(xprime

i|Xi)

)dP(x)

=int

θ (x Xi)dP(x) = EXsimP[θ (X Xi)]

EX primeX primesimQ[kX (X prime X prime)]

=intint (intint

kX (xprime xprime)dp(xprime|x)p(xprime|x)

)dP(x)dP(x)

=intint

θ (x x)dP(x)dP(x) = EXXsimP[θ (X X)]

Thus equation A6 is equal to

nsumi=1

w2i (EX prime

i[kX (X prime

i X primei )] minus EX prime

i Xprimei[kX (X prime

i X primei )]) +

nsumi j=1

wiw jθ (Xi Xj)

minus 2nsum

i=1

wiEXsimP[θ (X Xi)] + EXXsimP[θ (X X)] (A7)

Filtering with State-Observation Examples 429

We can rewrite terms in equation A7 as follows using the facts A1 A2A3 A4 and A5

sumi j

wiw jθ (Xi Xj) =sumi j

wiw j

sumst

αstφs(Xi)φt (Xj)

=sumst

αst

sumi

wiφs(Xi)sum

j

w jφt(Xj) =sumst

αst〈mP φs〉HX〈mP φt〉HX

=sumst

αst〈mP otimes mP φs otimes φt〉HX otimesHX= 〈mP otimes mP θ〉HX otimesHX

sumi

wiEXsimP[θ (X Xi)] =sum

i

wiEXsimP

[sumst

αstφs(X)φt (Xi)

]

=sumst

αstEXsimP[φs(X)]sum

i

wiφt (Xi) =sumst

αst〈mP φs〉HX〈mP φt〉HX

=sumst

αst〈mP otimes mP φs otimes φt〉HX otimesHX= 〈mP otimes mP θ〉HX otimesHX

EXXsimP[θ (X X)] = EXXsimP

[sumst

αstφs(X)φt (X)

]

=sumst

αst〈mP φs〉HX〈mP φt〉HX

=sumst

αst〈mP otimes mP φs otimes φt〉HX otimesHX

= 〈mP otimes mP θ〉HX otimesHX

Thus equation A7 is equal to

nsumi=1

w2i (EX prime

i[kX (X prime

i X primei )] minus EX prime

i Xprimei[kX (X prime

i X primei )]) + 〈mP otimes mP θ〉HX otimesHX

minus 2〈mP otimes mP θ〉HX otimesHX+ 〈mP otimes mP θ〉HX otimesHX

=nsum

i=1

w2i (EX prime

i[kX (X prime

i X primei )] minus EX prime

i Xprimei[kX (X prime

i X primei )])

+〈(mP minus mP) otimes (mP minus mP) θ〉HX otimesHX

Finally the Cauchy-Schwartz inequality gives

〈(mP minus mP) otimes (mP minus mP) θ〉HX otimesHXle mP minus mP2

HXθHX otimesHX

This completes the proof

430 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Appendix B Proof of Theorem 2

Theorem 2 provides convergence rates for the resampling algorithmAlgorithm 4 This theorem assumes that the candidate samples Z1 ZNfor resampling are iid with a density q Here we prove theorem 2 byshowing that the same statement holds under weaker assumptions (seetheorem 3 below)

We first describe assumptions Let P be the distribution of the kernelmean mP and L2(P) be the Hilbert space of square-integrable functions onX with respect to P For any f isin L2(P) we write its norm by fL2(P) =int

f 2(x)dP(x)

Assumption 1 The candidate samples Z1 ZN are independent Thereare probability distributions Q1 QN on X such that for any boundedmeasurable function g X rarr R we have

E

⎡⎣ 1

N minus 1

sumj =i

g(Zj)

⎤⎦ = EXsimQi

[g(X)] (i = 1 N) (B1)

Assumption 2 The distributions Q1 QN have density functions q1

qN respectively Define Q = 1N

sumNi=1 Qi and q = 1

N

sumNi=1 qi There is a con-

stant A gt 0 that does not depend on N such that

∥∥∥∥qi

qminus 1

∥∥∥∥2

L2(P)

le AradicN

(i = 1 N) (B2)

Assumption 3 The distribution P has a density function p such thatsupxisinX

p(x)

q(x)lt infin There is a constant σ gt 0 such that

radicN

(1N

Nsumi=1

p(Zi)

q(Zi)minus 1

)DrarrN (0 σ 2) (B3)

whereDrarr denotes convergence in distribution and N (0 σ 2) the normal

distribution with mean 0 and variance σ 2

These assumptions are weaker than those in theorem 2 which requirethat Z1 ZN be iid For example assumption 1 is clearly satisfied for theiid case since in this case we have Q = Q1= middot middot middot = QN The inequalityequation B2 in assumption 2 requires that the distributions Q1 QN getsimilar as the sample size increases This is also satisfied under the iid

Filtering with State-Observation Examples 431

assumption Likewise the convergence equation B3 in assumption 3 issatisfied from the central limit theorem if Z1 ZN are iid

We will need the following lemma

Lemma 1 Let Z1 ZN be samples satisfying assumption 1 Then the followingholds for any bounded measurable function g X rarr R

E

[1N

Nsumi=1

g(Zi )

]=

intg(x)d Q(x)

Proof

E

[1N

Nsumi=1

g(Zi)

]= E

⎡⎣ 1

N(N minus 1)

Nsumi=1

sumj =i

g(Zj)

⎤⎦

= 1N

Nsumi=1

E

⎡⎣ 1

N minus 1

sumj =i

g(Zj)

⎤⎦ = 1

N

Nsumi=1

intg(x)Qi(x) =

intg(x)dQ(x)

The following theorem shows the convergence rates of our resampling al-gorithm Note that it does not assume that the candidate samples Z1 ZNare identical to those expressing the estimator mP

Theorem 3 Let k be a bounded positive definite kernel and H be the asso-ciated RKHS Let Z1 ZN be candidate samples satisfying assumptions 12 and 3 Let P be a probability distribution satisfying assumption 3 and letmP =

intk(middot x)d P(x) be the kernel mean Let mP isin H be any element in H Sup-

pose we apply algorithm 4 to mP isin H with candidate samples Z1 ZN and letX1 X isin Z1 ZN be the resulting samples Then the following holds

∥∥∥∥∥mP minus 1

sumi=1

k(middot Xi )

∥∥∥∥∥2

H

= (mP minus mPH + Op(Nminus12))2 + O(

ln

)

Proof Our proof is based on the fact (Bach et al 2012) that kernel herdingcan be seen as the Frank-Wolfe optimization method with step size 1( + 1)

for the th iteration For details of the Frank-Wolfe method we refer to Jaggi(2013) and Freund and Grigas (2014)

Fix the samples Z1 ZN Let MN be the convex hull of the set k(middot Z1)

k(middot ZN ) sub H Define a loss function J H rarr R by

J(g) = 12g minus mP2

H g isin H (B4)

432 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Then algorithm 4 can be seen as the Frank-Wolfe method that iterativelyminimizes this loss function over the convex hull MN

infgisinMN

J(g)

More precisely the Frank-Wolfe method solves this problem by the follow-ing iterations

s = arg mingisinMN

〈gnablaJ(gminus1)〉H

g = (1 minus γ )gminus1 + γ s ( ge 1)

where γ is a step size defined as γ = 1 and nablaJ(gminus1) is the gradient of J atgminus1 nablaJ(gminus1) = gminus1 minus mP Here the initial point is defined as g0 = 0 It canbe easily shown that g = 1

sumi=1 k(middot Xi) where X1 X are the samples

given by algorithm 4 (For details see Bach et al 2012)Let LJMN

gt 0 be the Lipschitz constant of the gradient nablaJ over MN andDiam MN gt 0 be the diameter of MN

LJMN= sup

g1g2isinMN

nablaJ(g1) minus nablaJ(g2)Hg1 minus g2H

= supg1g2isinMN

g1 minus g2Hg1 minus g2H

= 1 (B5)

Diam MN = supg1g2isinMN

g1 minus g2H

le supg1g2isinMN

g1H + g2H le 2C (B6)

where C = supxisinX k(middot x)H = supxisinXradic

k(x x) lt infinFrom bound 32 and equation 8 of Freund and Grigas (2014) we then

have

J(g) minus infgisinMN

J(g) leLJMN

(Diam MN)2(1 + ln )

2(B7)

le 2C2(1 + ln )

(B8)

where the last inequality follows from equations B5 and B6

Filtering with State-Observation Examples 433

Note that the upper bound of equation B8 does not depend on thecandidate samples Z1 ZN Hence combined with equation B4 thefollowing holds for any choice of Z1 ZN

∥∥∥∥∥mP minus 1

sumi=1

k(middot Xi)

∥∥∥∥∥2

H

le infgisinMN

mP minus g2H + 4C2(1 + ln )

(B9)

Below we will focus on bounding the first term of equation B9 Recallhere that Z1 ZN are random samples Define a random variable SN =sumN

i=1p(Zi )

q(Zi ) Since MN is the convex hull of the k(middot Z1) k(middot ZN ) we

have

infgisinMN

mP minus gH

= infαisinRN αge0

sumi αile1

∥∥∥∥∥mP minussum

i

αik(middot Zi)

∥∥∥∥∥H

le∥∥∥∥∥mP minus 1

SN

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

le mP minus mPH +∥∥∥∥∥mP minus 1

N

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

+∥∥∥∥∥ 1

N

sumi

p(Zi)

q(Zi)k(middot Zi) minus 1

SN

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

Therefore we have

mP minus 1

sumi=1

k(middot Xi)2H

le(

mP minus mPH +∥∥∥∥∥mP minus 1

N

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

+∥∥∥∥∥ 1

N

sumi

p(Zi)

q(Zi)k(middot Zi) minus 1

SN

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

)2

+ O(

ln

)

(B10)

Below we derive rates of convergence for the second and third terms

434 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

For the second term we derive a rate of convergence in expectationwhich implies a rate of convergence in probability To this end we use thefollowing fact Let f isin H be any function in the RKHS By the assumptionsupxisinX

p(x)

q(x)lt infin and the boundedness of k functions x rarr p(x)

q(x)f (x) and

x rarr (p(x)

q(x))2 f (x) are bounded

E

⎡⎣

∥∥∥∥∥mP minus 1N

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥2

H

⎤⎦

= mP2H minus 2E

[1N

sumi

p(Zi)

q(Zi)mP(Zi)

]

+ E

⎡⎣ 1

N2

sumi

sumj

p(Zi)

q(Zi)

p(Zj)

q(Zj)k(Zi Zj)

⎤⎦

= mP2H minus 2

intp(x)

q(x)mP(x)q(x)dx

+ E

⎡⎣ 1

N2

sumi

sumj =i

p(Zi)

q(Zi)

p(Zj)

q(Zj)k(Zi Zj)

⎤⎦

+E

[1

N2

sumi

(p(Zi)

q(Zi)

)2

k(Zi Zi)

]

= mP2H minus 2mP2

H + E

[N minus 1

N2

sumi

p(Zi)

q(Zi)

intp(x)

q(x)k(Zi x)qi(x)dx

]

+ 1N

int (p(x)

q(x)

)2

k(x x)q(x)dx

= minusmP2H + E

[N minus 1

N2

sumi

p(Zi)

q(Zi)

intp(x)

q(x)k(Zi x)qi(x)dx

]

+ 1N

intp(x)

q(x)k(x x)dP(x)

We further rewrite the second term of the last equality as follows

E

[N minus 1

N2

sumi

p(Zi)

q(Zi)

intp(x)

q(x)k(Zi x)qi(x)dx

]

Filtering with State-Observation Examples 435

= E

[N minus 1

N2

sumi

p(Zi)

q(Zi)

intp(x)

q(x)k(Zi x)(qi(x) minus q(x))dx

]

+ E

[N minus 1

N2

sumi

p(Zi)

q(Zi)

intp(x)

q(x)k(Zi x)q(x)dx

]

= E

[N minus 1

N2

sumi

p(Zi)

q(Zi)

int radicp(x)k(Zi x)

radicp(x)

(qi(x)

q(x)minus 1

)dx

]

+ N minus 1N

mP2H

le E

[N minus 1

N2

sumi

p(Zi)

q(Zi)k(Zi middot)L2(P)

∥∥∥∥qi(x)

q(x)minus 1

∥∥∥∥L2(P)

]+ N minus 1

NmP2

H

le E

[N minus 1

N3

sumi

p(Zi)

q(Zi)C2A

]+ N minus 1

NmP2

H

= C2A(N minus 1)

N2 + N minus 1N

mP2H

where the first inequality follows from Cauchy-Schwartz Using this weobtain

E[

∥∥∥∥∥mP minus 1N

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥2

H

le 1N

(intp(x)

q(x)k(x x)dP(x) minus mP2

H

)+ C2(N minus 1)A

N2

= O(Nminus1)

Therefore we have

∥∥∥∥∥mP minus 1N

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

= Op(Nminus12) (N rarr infin) (B11)

We can bound the third term as follows∥∥∥∥∥ 1N

sumi

p(Zi)

q(Zi)k(middot Zi) minus 1

SN

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

=∥∥∥∥∥ 1

N

sumi

p(Zi)

q(Zi)k(middot Zi)

(1 minus N

SN

)∥∥∥∥∥H

436 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

=∣∣∣∣1 minus N

SN

∣∣∣∣∥∥∥∥∥ 1

N

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

le∣∣∣∣1 minus N

SN

∣∣∣∣C pqinfin

=∣∣∣∣∣1 minus 1

1N

sumNi=1 p(Zi)q(Zi)

∣∣∣∣∣C pqinfin

where pqinfin = supxisinXp(x)

q(x)lt infin Therefore the following holds by as-

sumption 3 and the delta method

∥∥∥∥∥ 1N

sumi

p(Zi)

q(Zi)k(middot Zi) minus 1

SN

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

= Op(Nminus12) (B12)

The assertion of the theorem follows from equations B10 to B12

Appendix C Reduction of Computational Cost

We have seen in section 43 that the time complexity of KMCF in one timestep is O(n3) where n is the number of the state-observation examples(XiYi)n

i=1 This can be costly if one wishes to use KMCF in real-timeapplications with a large number of samples Here we show two methodsfor reducing the costs one based on low-rank approximation of kernelmatrices and one based on kernel herding Note that kernel herding is alsoused in the resampling step The purpose here is different however wemake use of kernel herding for finding a reduced representation of the data(XiYi)n

i=1

C1 Low-rank Approximation of Kernel Matrices Our goal is to re-duce the costs of algorithm 1 of kernel Bayesrsquo rule Algorithm 1 involvestwo matrix inversions (GX + nεIn)minus1 in line 3 and ((GY )2 + δIn)minus1 in line4 Note that (GX + nεIn)minus1 does not involve the test data so it can be com-puted before the test phase On the other hand ((GY )2 + δIn)minus1 dependson matrix This matrix involves the vector mπ which essentially repre-sents the prior of the current state (see line 13 of algorithm 3) Therefore((GY )2 + δIn)minus1 needs to be computed for each iteration in the test phaseThis has the complexity of O(n3) Note that even if (GX + nεIn)minus1 can becomputed in the training phase the multiplication (GX + nεIn)minus1mπ in line3 requires O(n2) Thus it can also be costly Here we consider methods toreduce both costs in lines 3 and 4

Suppose that there exist low-rank matrices UV isin Rntimesr where r lt n that

approximate the kernel matrices GX asymp UUT GY asymp VVT Such low-rank

Filtering with State-Observation Examples 437

matrices can be obtained by for example incomplete Cholesky decomposi-tion with time complexity O(nr2) (Fine amp Scheinberg 2001 Bach amp Jordan2002) Note that the computation of these matrices is required only oncebefore the test phase Therefore their time complexities are not the problemhere

C11 Derivation First we approximate (GX + nεIn)minus1mπ in line 3 usingGX asymp UUT By the Woodbury identity we have

(GX + nεIn)minus1mπ asymp (UUT + nεIn)minus1mπ

= 1nε

(In minus U(nεIr + UTU)minus1UT )mπ

where Ir isin Rrtimesr denotes the identity Note that (nεIr + UTU)minus1 does not

involve the test data so can be computed in the training phase Thus theabove approximation of μ can be computed with complexity O(nr2)

Next we approximate w = GY ((GY )2 + δI)minus1kY in line 4 using GY asympVVT Define B = V isin R

ntimesr C = VTV isin Rrtimesr and D = VT isin R

rtimesn Then(GY )2 asymp (VVT )2 = BCD By the Woodbury identity we obtain

(δIn + (GY )2)minus1 asymp (δIn + BCD)minus1

= 1δ(In minus B(δCminus1 + DB)minus1D)

Thus w can be approximated as

w =GY ((GY )2 + δI)minus1kY

asymp 1δVVT (In minus B(δCminus1 + DB)minus1D)kY

The computation of this approximation requires O(nr2 + r3) = O(nr2) Thusin total the complexity of algorithm 1 can be reduced to O(nr2) We sum-marize the above approximations in algorithm 5

438 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

C12 How to Use Algorithm 5 can be used with algorithm 3 of KMCFby modifying Algorithm 3 in the following manner Compute the low-rank matrices UV right after lines 4 and 5 This can be done by using forexample incomplete Cholesky decomposition (Fine amp Scheinberg 2001Bach amp Jordan 2002) Then replace algorithm 1 in line 15 by algorithm 5

C13 How to Select the Rank As discussed in section 43 one way ofselecting the rank r is to use-cross validation by regarding r as a hyper-parameter of KMCF Another way is to measure the approximation errorsGX minus UUT and GY minus VVT with some matrix norm such as the Frobe-nius norm Indeed we can compute the smallest rank r such that theseerrors are below a prespecified threshold and this can be done efficientlywith time complexity O(nr2) (Bach amp Jordan 2002)

C2 Data Reduction with Kernel Herding Here we describe an ap-proach to reduce the size of the representation of the state-observationexamples (XiYi)n

i=1 in an efficient way By ldquoefficientrdquo we mean that theinformation contained in (XiYi)n

i=1 will be preserved even after the re-duction Recall that (XiYi)n

i=1 contains the information of the observationmodel p(yt |xt ) (recall also that p(yt |xt ) is assumed time-homogeneous seesection 41) This information is used in only algorithm 1 of kernel Bayesrsquorule (line 15 algorithm 3) Therefore it suffices to consider how kernel Bayesrsquorule accesses the information contained in the joint sample (XiYi)n

i=1

C21 Representation of the Joint Sample To this end we need to show howthe joint sample (XiYi)n

i=1 can be represented with a kernel mean embed-ding Recall that (kX HX ) and (kY HY ) are kernels and the associatedRKHSs on the state-space X and the observation-space Y respectively LetX times Y be the product space ofX andY Then we can define a kernel kXtimesY onX times Y as the product of kX and kY kXtimesY ((x y) (xprime yprime)) = kX (x xprime)kY (y yprime)for all (x y) (xprime yprime) isin X times Y This product kernel kXtimesY defines an RKHS ofX times Y Let HXtimesY denote this RKHS As in section 3 we can use kXtimesY andHXtimesY for a kernel mean embedding In particular the empirical distribution1n

sumni=1 δ(XiYi )

of the joint sample (XiYi)ni=1 sub X times Y can be represented as

an empirical kernel mean in HXtimesY

mXY = 1n

nsumi=1

kXtimesY ((middot middot) (XiYi)) isin HXtimesY (C1)

This is the representation of the joint sample (XiYi)ni=1

The information of (XiYi)ni=1 is provided for kernel Bayesrsquo rule essen-

tially through the form of equation C1 (Fukumizu et al 2011 2013) Recallthat equation C1 is a point in the RKHS HXtimesY Any point close to thisequation in HXtimesY would also contain information close to that contained in

Filtering with State-Observation Examples 439

equation C1 Therefore we propose to find a subset (X1 Y1) (Xr Xr) sub(XiYi)n

i=1 where r lt n such that its representation in HXtimesY

mXY = 1r

rsumi=1

kXtimesY ((middot middot) (Xi Yi)) isin HXtimesY (C2)

is close to equation C1 Namely we wish to find subsamples such thatmXY minus mXYHXtimesY

is small If the error mXY minus mXYHXtimesYis small enough

equation C2 would provide information close to that given by equation C1for kernel Bayesrsquo rule Thus kernel Bayesrsquo rule based on such subsamples(Xi Yi)r

i=1 would not perform much worse than the one based on the entireset of samples (XiYi)n

i=1

C22 Subsampling Method To find such subsamples we make use ofkernel herding in section 35 Namely we apply the update equations 36and 37 to approximate equation C1 with kernel kXtimesY and RKHS HXtimesY We greedily find subsamples Dr = (X1 Y1) (Xr Yr) as

(Xr Yr)

= arg max(xy)isinDDrminus1

1n

nsumi=1

kXtimesY ((x y) (XiYi))minus1r

rminus1sumj=1

kXtimesY ((x r) (Xi Yi))

= arg max(xy)isinDDrminus1

1n

nsumi=1

kX (x Xi)kY (yYi)minus1r

rminus1sumj=1

kX (x Xj)kY (y Yj)

The resulting algorithm is shown in algorithm 6 The time complexity isO(n2r) for selecting r subsamples

C23 How to Use By using algorithm 6 we can reduce the the time com-plexity of KMCF (see algorithm 3) in each iteration from O(n3) to O(r3) Thiscan be done by obtaining subsamples (Xi Yi)r

i=1 by applying algorithm 6to (XiYi)n

i=1 and then replacing (XiYi)ni=1 in requirement of algorithm 3

by (Xi Yi)ri=1 and using the number r instead of n

C24 How to Select the Number of Subsamples The number r of subsam-ples determines the trade-off between the accuracy and computational timeof KMCF It may be selected by cross-validation or by measuring the ap-proximation error mXY minus mXYHXtimesY

as for the case of selecting the rankof low-rank approximation in section C1

C25 Discussion Recall that kernel herding generates samples such thatthey approximate a given kernel mean (see section 35) Under certain

440 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

assumptions the error of this approximation is of O(rminus1) with r sampleswhich is faster than that of iid samples O(rminus12) This indicates that sub-samples (Xi Yi)r

i=1 selected with kernel herding may approximate equa-tion C1 well Here however we find the solutions of the optimizationproblems 36 and 37 from the finite set (XiYi)n

i=1 rather than the entirejoint space X times Y The convergence guarantee is provided only for the caseof the entire joint space X times Y Thus for our case the convergence guar-antee is no longer provided Moreover the fast rate O(rminus1) is guaranteedonly for finite-dimensional RKHSs Gaussian kernels which we often use inpractice define infinite-dimensional RKHSs Therefore the fast rate is notguaranteed if we use gaussian kernels Nevertheless we can use algorithm6 as a heuristic for data reduction

Acknowledgments

We express our gratitude to the associate editor and the anonymous re-viewer for their time and helpful suggestions We also thank MasashiShimbo Momoko Hayamizu Yoshimasa Uematsu and Katsuhiro Omaefor their helpful comments This work has been supported in part by MEXTGrant-in-Aid for Scientific Research on Innovative Areas 25120012 MKhas been supported by JSPS Grant-in-Aid for JSPS Fellows 15J04406

References

Anderson B amp Moore J (1979) Optimal filtering Englewood Cliffs NJ PrenticeHall

Aronszajn N (1950) Theory of reproducing kernels Transactions of the AmericanMathematical Society 68(3) 337ndash404

Filtering with State-Observation Examples 441

Bach F amp Jordan M I (2002) Kernel independent component analysis Journal ofMachine Learning Research 3 1ndash48

Bach F Lacoste-Julien S amp Obozinski G (2012) On the equivalence betweenherding and conditional gradient algorithms In Proceedings of the 29th Interna-tional Conference on Machine Learning (ICML2012) (pp 1359ndash1366) Madison WIOmnipress

Berlinet A amp Thomas-Agnan C (2004) Reproducing kernel Hilbert spaces in probabilityand statistics Dordrecht Kluwer Academic

Calvet L E amp Czellar V (2015) Accurate methods for approximate Bayesiancomputation filtering Journal of Financial Econometrics 13 798ndash838 doi101093jjfinecnbu019

Cappe O Godsill S J amp Moulines E (2007) An overview of existing methods andrecent advances in sequential Monte Carlo IEEE Proceedings 95(5) 899ndash924

Chen Y Welling M amp Smola A (2010) Supersamples from kernel-herding InProceedings of the 26th Conference on Uncertainty in Artificial Intelligence (pp 109ndash116) Cambridge MA MIT Press

Deisenroth M Huber M amp Hanebeck U (2009) Analytic moment-based gaussianprocess filtering In Proceedings of the 26th International Conference on MachineLearning (pp 225ndash232) Madison WI Omnipress

Doucet A Freitas N D amp Gordon N J (Eds) (2001) Sequential Monte Carlomethods in practice New York Springer

Doucet A amp Johansen A M (2011) A tutorial on particle filtering and smoothingFifteen years later In D Crisan amp B Rozovskii (Eds) The Oxford handbook ofnonlinear filtering (pp 656ndash704) New York Oxford University Press

Durbin J amp Koopman S J (2012) Time series analysis by state space methods (2nd ed)New York Oxford University Press

Eberts M amp Steinwart I (2013) Optimal regression rates for SVMs using gaussiankernels Electronic Journal of Statistics 7 1ndash42

Ferris B Hahnel D amp Fox D (2006) Gaussian processes for signal strength-basedlocation estimation In Proceedings of Robotics Science and Systems CambridgeMA MIT Press

Fine S amp Scheinberg K (2001) Efficient SVM training using low-rank kernel rep-resentations Journal of Machine Learning Research 2 243ndash264

Freund R M amp Grigas P (2014) New analysis and results for the FrankndashWolfemethod Mathematical Programming doi 101007s10107-014-0841-6

Fukumizu K Bach F amp Jordan M I (2004) Dimensionality reduction for super-vised learning with reproducing kernel Hilbert spaces Journal of Machine LearningResearch 5 73ndash99

Fukumizu K Gretton A Sun X amp Scholkopf B (2008) Kernel measures ofconditional dependence In J C Platt D Koller Y Singer amp S Roweis (Eds)Advances in neural information processing systems 20 (pp 489ndash496) CambridgeMA MIT Press

Fukumizu K Song L amp Gretton A (2011) Kernel Bayesrsquo rule In J Shawe-TaylorR S Zemel P L Bartlett F C N Pereira amp K Q Weinberger (Eds) Advances inneural information processing systems 24 (pp 1737ndash1745) Red Hook NY Curran

Fukumizu K Song L amp Gretton A (2013) Kernel Bayesrsquo rule Bayesian inferencewith positive definite kernels Journal of Machine Learning Research 14 3753ndash3783

442 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Fukumizu K Sriperumbudur B Gretton A amp Scholkopf B (2009) Characteristickernels on groups and semigroups In D Koller D Schuurmans Y Bengio amp LBottou (Eds) Advances in neural information processing systems 21 (pp 473ndash480)Cambridge MA MIT Press

Gordon N J Salmond D J amp Smith AFM (1993) Novel approach tononlinearnon-gaussian Bayesian state estimation IEE-Proceedings-F 140 107ndash113

Hofmann T Scholkopf B amp Smola A J (2008) Kernel methods in machine learn-ing Annals of Statistics 36(3) 1171ndash1220

Jaggi M (2013) Revisiting Frank-Wolfe Projection-free sparse convex optimizationIn Proceedings of the 30th International Conference on Machine Learning (pp 427ndash435)httplinkspringercomarticle101007Fg10107-014-0841-6

Jasra A Singh S S Martin J S amp McCoy E (2012) Filtering via approximateBayesian computation Statistics and Computing 22 1223ndash1237

Julier S J amp Uhlmann J K (1997) A new extension of the Kalman filter tononlinear systems In Proceedings of AeroSense The 11th International SymposiumAerospaceDefence Sensing Simulation and Controls Bellingham WA SPIE

Julier S J amp Uhlmann J K (2004) Unscented filtering and nonlinear estimationIEEE Review 92 401ndash422

Kalman R E (1960) A new approach to linear filtering and prediction problemsTransactions of the ASME Journal of Basic Engineering 82 35ndash45

Kanagawa M amp Fukumizu K (2014) Recovering distributions from gaussianRKHS embeddings In Proceedings of the 17th International Conference on Artifi-cial Intelligence and Statistics (pp 457ndash465) JMLR

Kanagawa M Nishiyama Y Gretton A amp Fukumizu K (2014) Monte Carlofiltering using kernel embedding of distributions In Proceedings of the 28thAAAI Conference on Artificial Intelligence (pp 1897ndash1903) Cambridge MA MITPress

Ko J amp Fox D (2009) GP-BayesFilters Bayesian filtering using gaussian processprediction and observation models Autonomous Robots 72(1) 75ndash90

Lazebnik S Schmid C amp Ponce J (2006) Beyond bags of features Spatial pyramidmatching for recognizing natural scene categories In Proceedings of 2006 IEEEComputer Society Conference on Computer Vision and Pattern Recognition (vol 2 pp2169ndash2178) Washington DC IEEE Computer Society

Liu J S (2001) Monte Carlo strategies in scientific computing New York Springer-Verlag

McCalman L OrsquoCallaghan S amp Ramos F (2013) Multi-modal estimation withkernel embeddings for learning motion models In Proceedings of 2013 IEEE In-ternational Conference on Robotics and Automation (pp 2845ndash2852) Piscataway NJIEEE

Pan S J amp Yang Q (2010) A survey on transfer learning IEEE Transactions onKnowledge and Data Engineering 22(10) 1345ndash1359

Pistohl T Ball T Schulze-Bonhage A Aertsen A amp Mehring C (2008) Predic-tion of arm movement trajectories from ECoG-recordings in humans Journal ofNeuroscience Methods 167(1) 105ndash114

Pronobis A amp Caputo B (2009) COLD COsy localization database InternationalJournal of Robotics Research 28(5) 588ndash594

Filtering with State-Observation Examples 443

Quigley M Stavens D Coates A amp Thrun S (2010) Sub-meter indoor localiza-tion in unmodified environments with inexpensive sensors In Proceedings of theIEEERSJ International Conference on Intelligent Robots and Systems 2010 (vol 1 pp2039ndash2046) Piscataway NJ IEEE

Ristic B Arulampalam S amp Gordon N (2004) Beyond the Kalman filter Particlefilters for tracking applications Norwood MA Artech House

Schaid D J (2010a) Genomic similarity and kernel methods I Advancements bybuilding on mathematical and statistical foundations Human Heredity 70(2) 109ndash131

Schaid D J (2010b) Genomic similarity and kernel methods II Methods for genomicinformation Human Heredity 70(2) 132ndash140

Schalk G Kubanek J Miller K J Anderson N R Leuthardt E C Ojemann J G Wolpaw J R (2007) Decoding two dimensional movement trajectories usingelectrocorticographic signals in humans Journal of Neural Engineering 4(264)264ndash275

Scholkopf B amp Smola A J (2002) Learning with kernels Cambridge MA MIT PressScholkopf B Tsuda K amp Vert J P (2004) Kernel methods in computational biology

Cambridge MA MIT PressSilverman B W (1986) Density estimation for statistics and data analysis London

Chapman and HallSmola A Gretton A Song L amp Scholkopf B (2007) A Hilbert space embed-

ding for distributions In Proceedings of the International Conference on AlgorithmicLearning Theory (pp 13ndash31) New York Springer

Song L Fukumizu K amp Gretton A (2013) Kernel embeddings of conditional dis-tributions A unified kernel framework for nonparametric inference in graphicalmodels IEEE Signal Processing Magazine 30(4) 98ndash111

Song L Huang J Smola A amp Fukumizu K (2009) Hilbert space embeddings ofconditional distributions with applications to dynamical systems In Proceedingsof the 26th International Conference on Machine Learning (pp 961ndash968) MadisonWI Omnipress

Sriperumbudur B K Gretton A Fukumizu K Scholkopf B amp Lanckriet G R(2010) Hilbert space embeddings and metrics on probability measures Journal ofMachine Learning Research 11 1517ndash1561

Steinwart I amp Christmann A (2008) Support vector machines New York SpringerStone C J (1977) Consistent nonparametric regression Annals of Statistics 5(4)

595ndash620Thrun S Burgard W amp Fox D (2005) Probabilistic robotics Cambridge MA MIT

PressVlassis N Terwijn B amp Krose B (2002) Auxiliary particle filter robot local-

ization from high-dimensional sensor observations In Proceedings of the In-ternational Conference on Robotics and Automation (pp 7ndash12) Piscataway NJIEEE

Wang Z Ji Q Miller K J amp Schalk G (2011) Prior knowledge improves decodingof finger flexion from electrocorticographic signals Frontiers in Neuroscience 5127

Widom H (1963) Asymptotic behavior of the eigenvalues of certain integral equa-tions Transactions of the American Mathematical Society 109 278ndash295

444 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Widom H (1964) Asymptotic behavior of the eigenvalues of certain integral equa-tions II Archive for Rational Mechanics and Analysis 17 215ndash229

Wolf J Burgard W amp Burkhardt H (2005) Robust vision-based localization bycombining an image retrieval system with Monte Carlo localization IEEE Trans-actions on Robotics 21(2) 208ndash216

Received May 18 2015 accepted October 14 2015

Page 13: Filtering with State-Observation Examples via Kernel Monte ...

394 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Table 1 Notation

X State spaceY Observation spacext isin X State at time tyt isin Y Observation at time tp(yt |xt ) Observation modelp(xt |xtminus1) Transition model(XiYi)n

i=1 State-observation exampleskX Positive-definite kernel on XkY Positive-definite kernel on YHX RKHS associated with kXHY RKHS associated with kY

as each yt is given Specifically we consider the following setting (see alsosection 1)

1 The observation model p(yt |xt ) is not known explicitly or even para-metrically Instead we are given examples of state-observation pairs(XiYi)n

i=1 sub X times Y prior to the test phase The observation modelis also assumed time homogeneous

2 Sampling from the transition model p(xt |xtminus1) is possible Its prob-abilistic model can be an arbitrary nonlinear nongaussian distribu-tion as for standard particle filters It can further depend on timeFor example control input can be included in the transition model asp(xt |xtminus1) = p(xt |xtminus1 ut ) where ut denotes control input providedby a user at time t

Let kX X times X rarr R and kY Y times Y rarr R be positive-definite kernels onX and Y respectively Denote by HX and HY their respective RKHSs Weaddress the above filtering problem by estimating the kernel means of theposteriors

mxt |y1t=

intkX (middot xt )p(xt |y1t )dxt isin HX (t = 1 T ) (41)

These preserve all the information of the corresponding posteriors if thekernels are characteristic (see section 32) Therefore the resulting estimatesof these kernel means provide us the information of the posteriors as ex-plained in section 44

42 Algorithm KMCF iterates three steps of prediction correction andresampling for each time t Suppose that we have just finished the iterationat time t minus 1 Then as shown later the resampling step yields the followingestimator of equation 41 at time t minus 1

Filtering with State-Observation Examples 395

Figure 2 One iteration of KMCF Here X1 X8 and Y1 Y8 denote statesand observations respectively in the state-observation examples (XiYi)n

i=1(suppose n = 8) 1 Prediction step The kernel mean of the prior equation 45 isestimated by sampling with the transition model p(xt |xtminus1) 2 Correction stepThe kernel mean of the posterior equation 41 is estimated by applying kernelBayesrsquo rule (see algorithm 1) The estimation makes use of the informationof the prior (expressed as m

π= (mxt |y1tminus1

(Xi)) isin R8) as well as that of a new

observation yt (expressed as kY = (kY (ytYi)) isin R8) The resulting estimate

equation 46 is expressed as a weighted sample (wti Xi)ni=1 Note that the

weights may be negative 3 Resampling step Samples associated with smallweights are eliminated and those with large weights are replicated by applyingkernel herding (see algorithm 2) The resulting samples provide an empiricalkernel mean equation 47 which will be used in the next iteration

mxtminus1|y1tminus1= 1

n

nsumi=1

kX (middot Xtminus1i) (42)

where Xtminus11 Xtminus1n isin X We show one iteration of KMCF that estimatesthe kernel mean (41) at time t (see also Figure 2)

421 Prediction Step The prediction step is as follows We generate asample from the transition model for each Xtminus1i in equation 42

Xti sim p(xt |xtminus1 = Xtminus1i) (i = 1 n) (43)

396 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

We then specify a new empirical kernel mean

mxt |y1tminus1= 1

n

nsumi=1

kX (middot Xti) (44)

This is an estimator of the following kernel mean of the prior

mxt |y1tminus1=

intkX (middot xt )p(xt |y1tminus1)dxt isin HX (45)

where

p(xt |y1tminus1) =int

p(xt |xtminus1)p(xtminus1|y1tminus1)dxtminus1

is the prior distribution of the current state xt Thus equation 44 serves asa prior for the subsequent posterior estimation

In section 5 we theoretically analyze this sampling procedure in detailand provide justification of equation 44 as an estimator of the kernel meanequation 45 We emphasize here that such an analysis is necessary eventhough the sampling procedure is similar to that of a particle filter thetheory of particle methods does not provide a theoretical justification ofequation 44 as a kernel mean estimator since it deals with probabilities asempirical distributions

422 Correction Step This step estimates the kernel mean equation 41of the posterior by using kernel Bayesrsquo rule (see algorithm 1) in section 33This makes use of the new observation yt the state-observation examples(XiYi)n

i=1 and the estimate equation 44 of the priorThe input of algorithm 1 consists of (1) vectors

kY = (kY (ytY1) kY (ytYn))T isin Rn

mπ = (mxt |y1tminus1(X1) mxt |y1tminus1

(Xn))T

=(

1n

nsumi=1

kX (Xq Xti)

)n

q=1

isin Rn

which are interpreted as expressions of yt and mxt |y1tminus1using the sample

(XiYi)ni=1 (2) kernel matrices GX = (kX (Xi Xj)) GY = (kY (YiYj)) isin

Rntimesn and (3) regularization constants ε δ gt 0 These constants ε δ as well

as kernels kX kY are hyperparameters of KMCF (we discuss how to choosethese parameters later)

Filtering with State-Observation Examples 397

Algorithm 1 outputs a weight vector w = (w1 wn) isin Rn Normaliz-

ing these weights wt = wsumn

i=1 wi we obtain an estimator of equation 413

as

mxt |y1t=

nsumi=1

wtikX (middot Xi) (46)

The apparent difference from a particle filter is that the posterior (ker-nel mean) estimator equation 46 is expressed in terms of the samplesX1 Xn in the training sample (XiYi)n

i=1 not with the samples fromthe prior equation 44 This requires that the training samples X1 Xncover the support of posterior p(xt |y1t ) sufficiently well If this does nothold we cannot expect good performance for the posterior estimate Notethat this is also true for any methods that deal with the setting of this letterpoverty of training samples in a certain region means that we do not haveany information about the observation model p(yt |xt ) in that region

423 Resampling Step This step applies the update equations 36 and37 of kernel herding in section 35 to the estimate equation 46 This is toobtain samples Xt1 Xtn such that

mxt |y1t= 1

n

nsumi=1

kX (middot Xti) (47)

is close to equation 46 in the RKHS Our theoretical analysis in section 5shows that such a procedure can reduce the error of the prediction step attime t + 1

The procedure is summarized in algorithm 2 Specifically we gener-ate each Xti by searching the solution of the optimization problem in

3For this normalization procedure see the discussion in section 43

398 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

equations 36 and 37 from a finite set of samples X1 Xn in equation46 We allow repetitions in Xt1 Xtn We can expect that the resultingequation 47 is close to equation 46 in the RKHS if the samples X1 Xncover the support of the posterior p(xt |y1t ) sufficiently This is verified bythe theoretical analysis of section 53

Here searching for the solutions from a finite set reduces the computa-tional costs of kernel herding It is possible to search from the entire space Xif we have sufficient time or if the sample size n is small enough it dependson applications and available computational resources We also note thatthe size of the resampling samples is not necessarily n this depends on howaccurately these samples approximate equation 46 Thus a smaller numberof samples may be sufficient In this case we can reduce the computationalcosts of resampling as discussed in section 52

The aim of our resampling step is similar to that of the resamplingstep of a particle filter (see Doucet amp Johansen 2011) Intuitively the aimis to eliminate samples with very small weights and replicate those withlarge weights (see Figures 2 and 3) In particle methods this is realized bygenerating samples from the empirical distribution defined by a weightedsample (therefore this procedure is called resampling) Our resamplingstep is a realization of such a procedure in terms of the kernel mean embed-ding we generate samples Xt1 Xtn from the empirical kernel meanequation 46

Note that the resampling algorithm of particle methods is not appropri-ate for use with kernel mean embeddings This is because it assumes thatweights are positive but our weights in equation 46 can be negative asthis equation is a kernel mean estimator One may apply the resamplingalgorithm of particle methods by first truncating the samples with nega-tive weights However there is no guarantee that samples obtained by thisheuristic produce a good approximation of equation 46 as a kernel meanas shown by experiments in section 61 In this sense the use of kernel herd-ing is more natural since it generates samples that approximate a kernelmean

424 Overall Algorithm We summarize the overall procedure of KMCFin algorithm 3 where pinit denotes a prior distribution for the initial state x1For each time t KMCF takes as input an observation yt and outputs a weightvector wt = (wt1 wtn)T isin R

n Combined with the samples X1 Xnin the state-observation examples (XiYi)n

i=1 these weights provide anestimator equation 46 of the kernel mean of posterior equation 41

We first compute kernel matrices GX GY (lines 4ndash5) which are usedin algorithm 1 of kernel Bayesrsquo rule (line 15) For t = 1 we generate aniid sample X11 X1n from the initial distribution pinit (line 8) whichprovides an estimator of the prior corresponding to equation 44 Line 10 isthe resampling step at time t minus 1 and line 11 is the prediction step at timet Lines 13 to 16 correspond to the correction step

Filtering with State-Observation Examples 399

43 Discussion The estimation accuracy of KMCF can depend on sev-eral factors in practice

431 Training Samples We first note that training samples (XiYi)ni=1

should provide the information concerning the observation model p(yt |xt )For example (XiYi)n

i=1 may be an iid sample from a joint distributionp(xy) on X times Y which decomposes as p(x y) = p(y|x)p(x) Here p(y|x) isthe observation model and p(x) is some distribution on X The support ofp(x) should cover the region where states x1 xT may pass in the testphase as discussed in section 42 For example this is satisfied when thestate-space X is compact and the support of p(x) is the entire X

Note that training samples (XiYi)ni=1 can also be non-iid in prac-

tice For example we may deterministically select X1 Xn so that theycover the region of interest In location estimation problems in roboticsfor instance we may collect location-sensor examples (XiYi)n

i=1 so thatlocations X1 Xn cover the region where location estimation is to beconducted (Quigley et al 2010)

432 Hyperparameters As in other kernel methods in general the perfor-mance of KMCF depends on the choice of its hyperparameters which arethe kernels kX and kY (or parameters in the kernelsmdasheg the bandwidthof the gaussian kernel) and the regularization constants δ ε gt 0 We needto define these hyperparameters based on the joint sample (XiYi)n

i=1 be-fore running the algorithm on the test data y1 yT This can be done bycross-validation Suppose that (XiYi)n

i=1 is given as a sequence from the

400 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

state-space model We can then apply two-fold cross-validation by dividingthe sequence into two subsequences If (XiYi)n

i=1 is not a sequence we canrely on the cross-validation procedure for kernel Bayesrsquo rule (see section 42of Fukumizu et al 2013)

433 Normalization of Weights We found in our preliminary experimentsthat normalization of the weights (see line 16 algorithm 3) is beneficial tothe filtering performance This may be justified by the following discus-sion about a kernel mean estimator in general Let us consider a consis-tent kernel mean estimator mP = sumn

i=1 wik(middot Xi) such that limnrarrinfin mP minusmPH = 0 Then we can show that the sum of the weights converges to 1limnrarrinfin

sumni=1 wi = 1 under certain assumptions (Kanagawa amp Fukumizu

2014) This could be explained as follows Recall that the weighted averagesumni=1 wi f (Xi) of a function f is an estimator of the expectation

intf (x)dP(x)

Let f be a function that takes the value 1 for any input f (x) = 1 forallx isin X Then we have

sumni=1 wi f (Xi) = sumn

i=1 wi andint

f (x)dP(x) = 1 Thereforesumni=1 wi is an estimator of 1 In other words if the error mP minus mPH is

small then the sum of the weightssumn

i=1 wi should be close to 1 Converselyif the sum of the weights is far from 1 it suggests that the estimate mP isnot accurate Based on this theoretical observation we suppose that nor-malization of the weights (this makes the sum equal to 1) results in a betterestimate

434 Time Complexity For each time t the naive implementation of algo-rithm 3 requires a time complexity of O(n3) for the size n of the joint sample(XiYi)n

i=1 This comes from algorithm 1 in line 15 (kernel Bayesrsquo rule) andalgorithm 2 in line 10 (resampling) The complexity O(n3) of algorithm 1 isdue to the matrix inversions Note that one of the inversions (GX + nεIn)minus1can be computed before the test phase as it does not involve the test dataAlgorithm 2 also has complexity O(n3) In section 52 we will explain howthis cost can be reduced to O(n2) by generating only lt n samples byresampling

435 Speeding Up Methods In appendix C we describe two methods forreducing the computational costs of KMCF both of which only need tobe applied prior to the test phase The first is a low-rank approximationof kernel matrices GX GY which reduces the complexity to O(nr2) wherer is the rank of low-rank matrices Low-rank approximation works wellin practice since eigenvalues of a kernel matrix often decay very rapidlyIndeed this has been theoretically shown for some cases (see Widom 19631964 and discussions in Bach amp Jordan 2002) Second is a data-reductionmethod based on kernel herding which efficiently selects joint subsamplesfrom the training set (XiYi)n

i=1 Algorithm 3 is then applied based onlyon those subsamples The resulting complexity is thus O(r3) where r is the

Filtering with State-Observation Examples 401

number of subsamples This method is motivated by the fast convergencerate of kernel herding (Chen et al 2010)

Both methods require the number r to be chosen which is either the rankfor low-rank approximation or the number of subsamples in data reductionThis determines the trade-off between accuracy and computational timeIn practice there are two ways of selecting the number r By regardingr as a hyperparameter of KMCF we can select it by cross-validation orwe can choose r by comparing the resulting approximation error which ismeasured in a matrix norm for low-rank approximation and in an RKHSnorm for the subsampling method (For details see appendix C)

436 Transfer Learning Setting We assumed that the observation modelin the test phase is the same as for the training samples However thismight not hold in some situations For example in the vision-based local-ization problem the illumination conditions for the test and training phasesmight be different (eg the test is done at night while the training samplesare collected in the morning) Without taking into account such a signifi-cant change in the observation model KMCF would not perform well inpractice

This problem could be addressed by exploiting the framework of transferlearning (Pan amp Yang 2010) This framework aims at situations where theprobability distribution that generates test data is different from that oftraining samples The main assumption is that there exist a small number ofexamples from the test distribution Transfer learning then provides a wayof combining such test examples and abundant training samples therebyimproving the test performance The application of transfer learning in oursetting remains a topic for future research

44 Estimation of Posterior Statistics By algorithm 3 we obtain theestimates of the kernel means of posteriors equation 41 as

mxt |y1t=

nsumi=1

wtikX (middot Xi) (t = 1 T ) (48)

These contain the information on the posteriors p(xt |y1t ) (see sections 32and 34) We now show how to estimate statistics of the posteriors usingthese estimates For ease of presentation we consider the case X = R

dTheoretical arguments to justify these operations are provided by Kana-gawa and Fukumizu (2014)

441 Mean and Covariance Consider the posterior meanint

xt p(xt |y1t )dxtisin R

d and the posterior (uncentered) covarianceint

xtxTt p(xt |y1t )dxt isin R

dtimesd

402 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

These quantities can be estimated as

nsumi=1

wtiXi (mean)

nsumi=1

wtiXiXTi (covariance)

442 Probability Mass Let A sub X be a measurable set with smoothboundary Define the indicator function IA(x) by IA(x) = 1 for x isin A andIA(x) = 0 otherwise Consider the probability mass

intIA(x)p(xt |y1t )dxt This

can be estimated assumn

i=1 wtiIA(Xi)

443 Density Suppose p(xt |y1t ) has a density function Let J(x) be asmoothing kernel satisfying

intJ(x)dx = 1 and J(x) ge 0 Let h gt 0 and define

Jh(x) = 1hd J

( xh

) Then the density of p(xt |y1t ) can be estimated as

p(xt |y1t ) =nsum

i=1

wtiJh(xt minus Xi) (49)

with an appropriate choice of h

444 Mode The mode may be obtained by finding a point that maxi-mizes equation 49 However this requires a careful choice of h Instead wemay use Ximax

with imax = arg maxi wti as a mode estimate This is the pointin X1 Xn that is associated with the maximum weight in wt1 wtnThis point can be interpreted as the point that maximizes equation 49 inthe limit of h rarr 0

445 Other Methods Other ways of using equation 48 include the preim-age computation and fitting of gaussian mixtures (See eg Song et al 2009Fukumizu et al 2013 McCalman OrsquoCallaghan amp Ramos 2013)

5 Theoretical Analysis

In this section we analyze the sampling procedure of the prediction stepin section 42 Specifically we derive an upper bound on the error of theestimator 44 We also discuss in detail how the resampling step in section 42works as a preprocessing step of the prediction step

To make our analysis clear we slightly generalize the setting of theprediction step and discuss the sampling and resampling procedures inthis setting

51 Error Bound for the Prediction Step Let X be a measurable spaceand P be a probability distribution on X Let p(middot|x) be a conditional

Filtering with State-Observation Examples 403

distribution on X conditioned on x isin X Let Q be a marginal distributionon X defined by Q(B) = int

p(B|x)dP(x) for all measurable B sub X In the fil-tering setting of section 4 the space X corresponds to the state space andthe distributions P p(middot|x) and Q correspond to the posterior p(xtminus1|y1tminus1)

at time t minus 1 the transition model p(xt |xtminus1) and the prior p(xt |y1tminus1) attime t respectively

Let kX be a positive-definite kernel onX andHX be the RKHS associatedwith kX Let mP = int

kX (middot x)dP(x) and mQ = intkX (middot x)dQ(x) be the kernel

means of P and Q respectively Suppose that we are given an empiricalestimate of mP as

mP =nsum

i=1

wikX (middot Xi) (51)

where w1 wn isin R and X1 Xn isin X Considering this weighted sam-ple form enables us to explain the mechanism of the resampling step

The prediction step can then be cast as the following procedure for eachsample Xi we generate a new sample X prime

i with the conditional distributionX prime

i sim p(middot|Xi) Then we estimate mQ by

mQ =nsum

i=1

wikX (middot X primei ) (52)

which corresponds to the estimate 44 of the prior kernel mean at time tThe following theorem provides an upper bound on the error of equa-

tion 52 and reveals properties of equation 51 that affect the error of theestimator equation 52 The proof is given in appendix A

Theorem 1 Let mP be a fixed estimate of mP given by equation 51 Define afunction θ on X times X by θ (x1 x2) =

int intkX (xprime

1 xprime2)dp(xprime

1|x1)dp(xprime2|x2)forallx1 x2 isin

X times X and assume that θ is included in the tensor RKHS HX otimes HX 4 The

4 The tensor RKHS HX otimes HX is the RKHS of a product kernel kXtimesX on X times X de-fined as kXtimesX ((xa xb) (xc xd )) = kX (xa xc)kX (xb xd ) forall(xa xb) (xc xd ) isin X times X Thisspace HX otimes HX consists of smooth functions on X times X if the kernel kX is smooth (egif kX is gaussian see section 4 of Steinwart amp Christmann 2008) In this case we caninterpret this assumption as requiring that θ be smooth as a function on X times X

The function θ can be written as the inner product between the kernel means ofthe conditional distributions θ (x1 x2) = 〈mp(middot|x1 )

mp(middot|x2 )〉HX

where mp(middot|x)= int

kX (middot xprime)dp(xprime|x) Therefore the assumption may be further seen as requiring that the mapx rarr mp(middot|x)

be smooth Note that while similar assumptions are common in the litera-ture on kernel mean embeddings (eg theorem 5 of Fukumizu et al 2013) we may relaxthis assumption by using approximate arguments in learning theory (eg theorems 22and 23 of Eberts amp Steinwart 2013) This analysis remains a topic for future research

404 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

estimator mQ equation 52 then satisfies

EXprime1Xprime

n[mQ minus mQ2

HX]

lensum

i=1

w2i (EXprime

i[kX (Xprime

i Xprimei )] minus EXprime

i Xprimei[kX (Xprime

i Xprimei )]) (53)

+ mP minus mP2HX

θHX otimesHX (54)

where Xprimei sim p(middot|Xi ) and Xprime

i is an independent copy of Xprimei

From theorem 1 we can make the following observations First thesecond term equation 54 of the upper bound shows that the error of theestimator equation 52 is likely to be large if the given estimate equation 51has large error mP minus mP2

HX which is reasonable to expect

Second the first term equation 53 shows that the error of equation52 can be large if the distribution of X prime

i (ie p(middot|Xi)) has large varianceFor example suppose X prime

i = f (Xi) + εi where f X rarr X is some mappingand εi is a random variable with mean 0 Let kX be the gaussian ker-nel kX (x xprime) = exp(minusx minus xprime22α) for some α gt 0 Then EX prime

i[kX (X prime

i X primei )] minus

EX primei X

primei[kX (X prime

i X primei )] increases from 0 to 1 as the variance of εi (ie the vari-

ance of X primei ) increases from 0 to infinity Therefore in this case equation

53 is upper-bounded at worst bysumn

i=1 w2i Note that EX prime

i[kX (X prime

i X primei )] minus

EX primei X

primei[kX (X prime

i X primei )] is always nonnegative5

511 Effective Sample Size Now let us assume that the kernel kX isbounded there is a constant C gt 0 such that supxisinX kX (x x) lt C Thenthe inequality of theorem 1 can be further bounded as

EX prime1X

primen[mQ minusmQ2

HX] le 2C

nsumi=1

w2i +mP minusmP2

HXθHX otimesHX

(55)

This bound shows that two quantities are important in the estimateequation 51 (1) the sum of squared weights

sumni=1 w2

i and (2) the errormP minus mP2

HX In other words the error of equation 52 can be large if the

quantitysumn

i=1 w2i is large regardless of the accuracy of equation 51 as an

estimator of mP In fact the estimator of the form 51 can have largesumn

i=1 w2i

even when mP minus mP2HX

is small as shown in section 61

5To show this it is sufficient to prove thatintint

kX (x x)dP(x)dP(x) le intkX (x x)dP(x)

for any probability P This can be shown as followsintint

kX (x x)dP(x)dP(x) = intint 〈kX (middot x)

kX (middot x)〉HXdP(x)dP(x) le intint radic

kX (x x)radic

kX (x x)dP(x)dP(x) le intkX (x x)dP(x) Here we

used the reproducing property the Cauchy-Schwartz inequality and Jensenrsquos inequality

Filtering with State-Observation Examples 405

Figure 3 An illustration of the sampling procedure with (right) and without(left) the resampling algorithm The left panel corresponds to the kernel meanestimators equations 51 and 52 in section 51 and the right panel correspondsto equations 56 and 57 in section 52

The inverse of the sum of the squared weights 1sumn

i=1 w2i can be in-

terpreted as the effective sample size (ESS) of the empirical kernel meanequation 51 To explain this suppose that the weights are normalizedsumn

i=1 wi = 1 Then ESS takes its maximum n when the weights are uniformw1 = middot middot middot wn = 1n It becomes small when only a few samples have largeweights (see the left side in Figure 3) Therefore the bound equation 55can be interpreted as follows To make equation 52 a good estimator ofmQ we need to have equation 51 such that the ESS is large and the errormP minus mPH is small Here we borrowed the notion of ESS from the litera-ture on particle methods in which ESS has also been played an importantrole (see section 253 of Liu 2001 and section 35 of Doucet amp Johansen2011)

52 Role of Resampling Based on these arguments we explain how theresampling step in section 42 works as a preprocessing step for the samplingprocedure Consider mP in equation 51 as an estimate equation 46 givenby the correction step at time t minus 1 Then we can think of mQ equation 52as an estimator of the kernel mean equation 45 of the prior without theresampling step

The resampling step is application of kernel herding to mP to obtainsamples X1 Xn which provide a new estimate of mP with uniformweights

mP = 1n

nsumi=1

kX (middot Xi) (56)

The subsequent prediction step is to generate a sample X primei sim p(middot|Xi) for each

Xi (i = 1 n) and estimate mQ as

406 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

mQ = 1n

nsumi=1

kX (middot X primei ) (57)

Theorem 1 gives the following bound for this estimator that corresponds toequation 55

EX prime1X

primen[mQ minus mQ2

HX] le 2C

n+ mP minus mP2

HθHX otimesHX (58)

A comparison of the upper bounds of equations 55 and 58 implies thatthe resampling step is beneficial when

sumni=1 w2

i is large (ie the ESS is small)and mP minus mPHX

is small The condition on mP minus mPHXmeans that the

loss by kernel herding (in terms of the RKHS distance) is small This impliesmP minus mPHX

asymp mP minus mPHX so the second term of equation 58 is close

to that of equation 55 On the other hand the first term of equation 58 willbe much smaller than that of equation 55 if

sumni=1 w2

i 1n In other wordsthe resampling step improves the accuracy of the sampling procedure byincreasing the ESS of the kernel mean estimate mP This is illustrated inFigure 3

The above observations lead to the following procedures

521 When to Apply Resampling Ifsumn

i=1 w2i is not large the gain by

the resampling step will be small Therefore the resampling algorithmshould be applied when

sumni=1 w2

i is above a certain threshold say 2n Thesame strategy has been commonly used in particle methods (see Doucet ampJohansen 2011)

Also the bound equation 53 of theorem 1 shows that resampling is notbeneficial if the variance of the conditional distribution p(middot|x) is very small(ie if state transition is nearly deterministic) In this case the error of thesampling procedure may increase due to the loss mP minus mPHX

caused bykernel herding

522 Reduction of Computational Cost Algorithm 2 generates n samplesX1 Xn with time complexity O(n3) Suppose that the first samplesX1 X where lt n already approximate mP well 1

sumi=1 kX (middot Xi) minus

mPHXis small We do not then need to generate the rest of samples

X+1 Xn we can make n samples by copying the samples n times(suppose n can be divided by for simplicity say n = 2) Let X1 Xn

denote these n samples Then 1

sumi=1 kX (middot Xi) = 1

n

sumni=1 kX (middot Xi) by defi-

nition so 1n

sumni=1 kX (middot Xi) minus mPHX

is also small This reduces the time

complexity of algorithm 2 to O(n2)One might think that it is unnecessary to copy n times to make n

samples This is not true however Suppose that we just use the first

Filtering with State-Observation Examples 407

samples to define mP = 1

sumi=1 kX (middot Xi) Then the first term of equation

58 becomes 2C which is larger than 2Cn of n samples This differenceinvolves sampling with the conditional distribution X prime

i sim p(middot|Xi) If we usejust the samples sampling is done times If we use the copied n samplessampling is done n times Thus the benefit of making n samples comesfrom sampling with the conditional distribution many times This matchesthe bound of theorem 1 where the first term involves the variance of theconditional distribution

53 Convergence Rates for Resampling Our resampling algorithm (seealgorithm 2) is an approximate version of kernel herding in section 35algorithm 2 searches for the solutions of the update equations 36 and 37from a finite set X1 Xn sub X not from the entire space X Thereforeexisting theoretical guarantees for kernel herding (Chen et al 2010 Bachet al 2012) do not apply to algorithm 2 Here we provide a theoreticaljustification

531 Generalized Version We consider a slightly generalized versionshown in algorithm 4 It takes as input (1) a kernel mean estimator mP of akernel mean mP (2) candidate samples Z1 ZN and (3) the number ofresampling It then outputs resampling samples X1 X isin Z1 ZNwhich form a new estimator mP = 1

sumi=1 kX (middot Xi) Here N is the number

of the candidate samplesAlgorithm 4 searches for solutions of the update equations 36 and 37

from the candidate set Z1 ZN Note that here these samples Z1 ZNcan be different from those expressing the estimator mP If they are thesamemdashthe estimator is expressed as mP = sumn

i=1 wtik(middot Xi) with n = N andXi = Zi (i = 1 n)mdashthen algorithm 4 reduces to algorithm 2 In facttheorem 2 allows mP to be any element in the RKHS

532 Convergence Rates in Terms of N and Algorithm 4 gives the newestimator mP of the kernel mean mP The error of this new estimatormP minus mPHX

should be close to that of the given estimator mP minus mPHX

408 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Theorem 2 guarantees this In particular it provides convergence rates ofmP minus mPHX

approaching mP minus mPHX as N and go to infinity This

theorem follows from theorem 3 in appendix B which holds under weakerassumptions

Theorem 2 Let mP be the kernel mean of a distribution P and mP be any element inthe RKHS HX Let Z1 ZN be an iid sample from a distribution with densityq Assume that P has a density function p such that supxisinX p(x)q (x) lt infin LetX1 X be samples given by algorithm 4 applied to mP with candidate samplesZ1 ZN Then for mP = 1

sumi=1 k(middot Xi ) we have

mP minus mP2HX

= (mP minus mPHX+ Op(Nminus12))2 + O

(ln

) (N rarr infin) (59)

Our proof in appendix B relies on the fact that kernel herding can beseen as the Frank-Wolfe optimization method (Bach et al 2012) Indeedthe error O(ln ) in equation 59 comes from the optimization error of theFrank-Wolfe method after iterations (Freund amp Grigas 2014 bound 32)The error Op(N

minus12) is due to the approximation of the solution space by afinite set Z1 ZN These errors will be small if N and are large enoughand the error of the given estimator mP minus mPHX

is relatively large Thisis formally stated in corollary 1 below

Theorem 2 assumes that the candidate samples are iid with a density qThe assumption supxisinX p(x)q(x) lt infin requires that the support of q con-tains that of p This is a formal characterization of the explanation in section42 that the samples X1 XN should cover the support of P sufficientlyNote that the statement of theorem 2 also holds for non-iid candidatesamples as shown in theorem 3 of appendix B

533 Convergence Rates as mP Goes to mP Theorem 2 provides conver-gence rates when the estimator mP is fixed In corollary 1 below we let mPapproach mP and provide convergence rates for mP of algorithm 4 approach-ing mP This corollary directly follows from theorem 2 since the constantterms in Op(N

minus12) and O(ln ) in equation 59 do not depend on mPwhich can be seen from the proof in section B

Corollary 1 Assume that P and Z1 ZN satisfy the conditions in theorem 2for all N Let m(n)

P be an estimator of mP such that m(n)P minus mPHX

= Op(nminusb) as

n rarr infin for some constant b gt 06 Let N = = n2b Let X(n)1 X(n)

be samples

6Here the estimator m(n)P and the candidate samples Z1 ZN can be dependent

Filtering with State-Observation Examples 409

given by algorithm 4 applied to m(n)P with candidate samples Z1 ZN Then

for m(n)P = 1

sumi=1 kX (middot X(n)

i ) we have

m(n)P minus mPHX

= Op(nminusb) (n rarr infin) (510)

Corollary 1 assumes that the estimator m(n)

P converges to mP at a rateOp(n

minusb) for some constant b gt 0 Then the resulting estimator m(n)

P by algo-rithm 4 also converges to mP at the same rate O(nminusb) if we set N = = n2bThis implies that if we use sufficiently large N and the errors Op(N

minus12)

and O(ln ) in equation 59 can be negligible as stated earlier Note thatN = = n2b implies that N and can be smaller than n since typically wehave b le 12 (b = 12 corresponds to the convergence rates of parametricmodels) This provides a support for the discussion in section 52 (reductionof computational cost)

534 Convergence Rates of Sampling after Resampling We can derive con-vergence rates of the estimator mQ equation 57 in section 52 Here we con-sider the following construction of mQ as discussed in section 52 (reductionof computational cost) First apply algorithm 4 to m(n)

P and obtain resam-pling samples X (n)

1 X (n) isin Z1 ZN Then copy these samples n

times and let X (n)

1 X (n)n be the resulting times n samples Finally

sample with the conditional distribution Xprime(n)i sim p(middot|Xi) (i = 1 n)

and define

m(n)

Q = 1n

nsumi=1

kX (middot Xprime(n)i ) (511)

The following corollary is a consequence of corollary 1 theorem 1 andthe bound equation 58 Note that theorem 1 obtains convergence in expec-tation which implies convergence in probability

Corollary 2 Let θ be the function defined in theorem 1 and assume θ isin HX otimes HX Assume that P and Z1 ZN satisfy the conditions in theorem 2 for all N Letm(n)

P be an estimator of mP such that m(n)P minus mPHX

= Op(nminusb) as n rarr infin forsome constant b gt 0 Let N = = n2b Then for the estimator m(n)

Q defined asequation 511 we have

m(n)Q minus mQHX

= Op(nminusmin(b12)) (n rarr infin)

Suppose b le 12 which holds with basically any nonparametric esti-mators Then corollary 2 shows that the estimator m(n)

Q achieves the sameconvergence rate as the input estimator m(n)

P Note that without resampling

410 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

the rate becomes Op(

radicsumni=1(w

(n)i )2 + nminusb) where the weights are given by

the input estimator m(n)

P = sumni=1 w

(n)i kX (middot X (n)

i ) (see the bound equation55) Thanks to resampling (the square root of) the sum of the squaredweights in the case of corollary 2 becomes 1

radicn le 1

radicn which is

usually smaller thanradicsumn

i=1(w(n)i )2 and is faster than or equal to Op(n

minusb)This shows the merit of resampling in terms of convergence rates (see alsothe discussions in section 52)

54 Consistency of the Overall Procedure Here we show the consis-tency of the overall procedure in KMCF This is based on corollary 2 whichshows the consistency of the resampling step followed by the predictionstep and on theorem 5 of Fukumizu et al (2013) which guarantees theconsistency of kernel Bayesrsquo rule in the correction step Thus we considerthree steps in the following order resampling prediction and correctionMore specifically we show the consistency of the estimator equation 46of the posterior kernel mean at time t given that the one at time t minus 1 isconsistent

To state our assumptions we will need the following functions θpos Y times Y rarr R θobs X times X rarr R and θtra X times X rarr R

θpos(y y) =intint

kX (xt xt )dp(xt |y1tminus1 yt = y)dp(xt |y1tminus1 yt = y)

(512)

θobs(x x) =intint

kY (yt yt )dp(yt |xt = x)dp(yt |xt = x) (513)

θtra(x x) =intint

kX (xt xt )dp(xt |xtminus1 = x)dp(xt |xtminus1 = x) (514)

These functions contain the information concerning the distributions in-volved In equation 512 the distribution p(xt |y1tminus1 yt = y) denotes theposterior of the state at time t given that the observation at time t is yt = ySimilarly p(xt |y1tminus1 yt = y) is the posterior at time t given that the observa-tion is yt = y In equation 513 the distributions p(yt |xt = x) and p(yt |xt = x)

denote the observation model when the state is xt = x or xt = x respectivelyIn equation 514 the distributions p(xt |xtminus1 = x) and p(xt |xtminus1 = x) denotethe transition model with the previous state given by xtminus1 = x or xtminus1 = xrespectively

For simplicity of presentation we consider here N = = n for the resam-pling step Below denote by F otimes G the tensor product space of two RKHSsF and G

Corollary 3 Let (X1 Y1) (Xn Yn) be an iid sample with a joint densityp(x y) = p(y|x)q (x) where p(y|x) is the observation model Assume that the

Filtering with State-Observation Examples 411

posterior p(xt|y1t) has a density p and that supxisinX p(x)q (x) lt infin Assumethat the functions defined by equations 512 to 514 satisfy θpos isin HY otimes HY θobs isin HX otimes HX and θtra isin HX otimes HX respectively Suppose that mxtminus1|y1tminus1

minusmxtminus1|y1tminus1

HXrarr 0 as n rarr infin in probability Then for any sufficiently slow decay

of regularization constants εn and δn of algorithm 1 we have

mxt |y1tminus mxt |y1t

HXrarr 0 (n rarr infin)

in probability

Corollary 3 follows from theorem 5 of Fukumizu et al (2013) and corol-lary 2 The assumptions θpos isin HY otimes HY and θobs isin HX otimes HX are due totheorem 5 of Fukumizu et al (2013) for the correction step while the as-sumption θtra isin HX otimes HX is due to theorem 1 for the prediction step fromwhich corollary 2 follows As we discussed in note 4 of section 51 these es-sentially assume that the functions θpos θobs and θtra are smooth Theorem 5of Fukumizu et al (2013) also requires that the regularization constantsεn δn of kernel Bayesrsquo rule should decay sufficiently slowly as the samplesize goes to infinity (εn δn rarr 0 as n rarr infin) (For details see sections 52 and62 in Fukumizu et al 2013)

It would be more interesting to investigate the convergence rates of theoverall procedure However this requires a refined theoretical analysis ofkernel Bayesrsquo rule which is beyond the scope of this letter This is becausecurrently there is no theoretical result on convergence rates of kernel Bayesrsquorule as an estimator of a posterior kernel mean (existing convergence resultsare for the expectation of function values see theorems 6 and 7 in Fukumizuet al 2013) This remains a topic for future research

6 Experiments

This section is devoted to experiments In section 61 we conduct basicexperiments on the prediction and resampling steps before going on to thefiltering problem Here we consider the problem described in section 5 Insection 62 the proposed KMCF (see algorithm 3) is applied to syntheticstate-space models Comparisons are made with existing methods applica-ble to the setting of the letter (see also section 2) In section 63 we applyKMCF to the real problem of vision-based robot localization

In the following N(μ σ 2) denotes the gaussian distribution with meanμ isin R and variance σ 2 gt 0

61 Sampling and Resampling Procedures The purpose here is to seehow the prediction and resampling steps work empirically To this end weconsider the problem described in section 5 with X = R (see section 51 fordetails) Specifications of the problem are described below

412 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

We will need to evaluate the errors mP minus mPHXand mQ minus mQHX

sowe need to know the true kernel means mP and mQ To this end we definethe distributions and the kernel to be gaussian this allows us to obtainanalytic expressions for mP and mQ

611 Distributions and Kernel More specifically we define the marginalP and the conditional distribution p(middot|x) to be gaussian P = N(0 σ 2

P ) andp(middot|x) = N(x σ 2

cond) Then the resulting Q = intp(middot|x)dP(x) also becomes

gaussian Q = N(0 σ 2P + σ 2

cond) We define kX to be the gaussian kernelkX (x xprime) = exp(minus(x minus xprime)22γ 2) We set σP = σcond = γ = 01

612 Kernel Means Due to the convolution theorem of gaussian func-tions the kernel means mP = int

kX (middot x)dP(x) and mQ = intkX (middot x)dQ(x)

can be analytically computed mP(x) =radic

γ 2

σ 2+γ 2 exp(minus x2

2(γ 2+σ 2P )

) mQ(x) =radicγ 2

(σ 2+σ 2cond+γ 2 )

exp(minus x2

2(σ 2P+σ 2

cond+γ 2 ))

613 Empirical Estimates We artificially defined an estimate mP =sumni=1 wikX (middot Xi) as follows First we generated n = 100 samples

X1 X100 from a uniform distribution on [minusA A] with some A gt 0 (spec-ified below) We computed the weights w1 wn by solving an optimiza-tion problem

minwisinRn

nsum

i=1

wikX (middot Xi) minus mP2H + λw2

and then applied normalization so thatsumn

i=1 wi = 1 Here λ gt 0 is a regular-ization constant which allows us to control the trade-off between the errormP minus mP2

HXand the quantity

sumni=1 w2

i = w2 If λ is very small the re-sulting mP becomes accurate mP minus mP2

HXis small but has large

sumni=1 w2

i If λ is large the error mP minus mP2

HXmay not be very small but

sumni=1 w2

i

becomes small This enables us to see how the error mQ minus mQ2HX

changesas we vary these quantities

614 Comparison Given mP = sumni=1 wikX (middot Xi) we wish to estimate the

kernel mean mQ We compare three estimators

bull woRes Estimate mQ without resampling Generate samples X primei sim

p(middot|Xi) to produce the estimate mQ = sumni=1 wikX (middot X prime

i ) This corre-sponds to the estimator discussed in section 51

bull Res-KH First apply the resampling algorithm of algorithm 2 to mPyielding X1 Xn Then generate X prime

i sim p(middot|Xi) for each Xi giving

Filtering with State-Observation Examples 413

Figure 4 Results of the experiments from section 61 (Top left and right)Sample-weight pairs of mP = sumn

i=1 wikX (middot Xi) and mQ = sumni=1 wik(middot X prime

i ) (Mid-dle left and right) Histogram of samples X1 Xn generated by algorithm2 and that of samples X prime

1 X primen from the conditional distribution (Bottom

left and right) Histogram of samples generated with multinomial resamplingafter truncating negative weights and that of samples from the conditionaldistribution

the estimate mQ = 1n

sumni=1 k(middot X prime

i ) This is the estimator discussed insection 52

bull Res-Trunc Instead of algorithm 2 first truncate negative weights inw1 wn to be 0 and apply normalization to make the sum of theweights to be 1 Then apply the multinomial resampling algorithmof particle methods and estimate mQ as Res-KH

615 Demonstration Before starting quantitative comparisons wedemonstrate how the above estimators work Figure 4 shows demonstra-tion results with A = 1 First note that for mP = sumn

i=1 wik(middot Xi) samplesassociated with large weights are located around the mean of P as thestandard deviation of P is relatively small σP = 01 Note also that some

414 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

of the weights are negative In this example the error of mP is very smallmP minus mP2

HX= 849e minus 10 while that of the estimate mQ given by woRes is

mQ minus mQ2HX

= 0125 This shows that even if mP minus mP2HX

is very small

the resulting mQ minus mQ2HX

may not be small as implied by theorem 1 andthe bound equation 55

We can observe the following First algorithm 2 successfully discardedsamples associated with very small weights Almost all the generatedsamples X1 Xn are located in [minus2σP 2σP] where σP is the standarddeviation of P The error is mP minus mP2

HX= 474e minus 5 which is greater

than mP minus mP2HX

This is due to the additional error caused by the re-sampling algorithm Note that the resulting estimate mQ is of the errormQ minus mQ2

HX= 000827 This is much smaller than the estimate mQ by

woRes showing the merit of the resampling algorithmRes-Trunc first truncated the negative weights in w1 wn Let us

see the region where the density of P is very smallmdashthe region outside[minus2σP 2σP] We can observe that the absolute values of weights are verysmall in this region Note that there exist positive and negative weightsThese weights maintain balance such that the amounts of positive and neg-ative values are almost the same Therefore the truncation of the negativeweights breaks this balance As a result the amount of the positive weightssurpasses the amount needed to represent the density of P This can be seenfrom the histogram for Res-Trunc some of the samples X1 Xn gener-ated by Res-Trunc are located in the region where the density of P is verysmall Thus the resulting error mP minus mP2

HX= 00538 is much larger than

that of Res-KH This demonstrates why the resampling algorithm of parti-cle methods is not appropriate for kernel mean embeddings as discussedin section 42

616 Effects of the Sum of Squared Weights The purpose here is to see howthe error mQ minus mQ2

HXchanges as we vary the quantity

sumni=1 w2

i (recall that

the bound equation 55 indicates that mQ minus mQ2HX

increases assumn

i=1 w2i

increases) To this end we made mP = sumni=1 wikX (middot Xi) for several values of

the regularization constant λ as described above For each λ we constructedmP and estimated mQ using each of the three estimators above We repeatedthis 20 times for each λ and averaged the values of mP minus mP2

HXsumn

i=1 w2i

and the errors mQ minus mQ2HX

by the three estimators Figure 5 shows theseresults where the both axes are in the log scale Here we used A = 5 forthe support of the uniform distribution7 The results are summarized asfollows

7This enables us to maintain the values for mP minus mP2HX

in almost the same amountwhile changing the values for

sumni=1 w2

i

Filtering with State-Observation Examples 415

Figure 5 Results of synthetic experiments for the sampling and resampling pro-cedure in section 61 Vertical axis errors in the squared RKHS norm Horizontalaxis values of

sumni=1 w2

i for different mP Black the error of mP (mP minus mP2HX

)Blue green and red the errors on mQ by woRes Res-KH and Res-Truncrespectively

bull The error of woRes (blue) increases proportionally to the amount ofsumni=1 w2

i This matches the bound equation 55bull The error of Res-KH is not affected by

sumni=1 w2

i Rather it changes inparallel with the error of mP This is explained by the discussions insection 52 on how our resampling algorithm improves the accuracyof the sampling procedure

bull Res-Trunc is worse than Res-KH especially for largesumn

i=1 w2i This

is also explained with the bound equation 58 Here mP is the onegiven by Res-Trunc so the error mP minus mPHX

can be large due tothe truncation of negative weights as shown in the demonstrationresults This makes the resulting error mQ minus mQHX

large

Note that mP and mQ are different kernel means so it can happen thatthe errors mQ minus mQHX

by Res-KH are less than mp minus mPHX as in

Figure 5

416 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

62 Filtering with Synthetic State-Space Models Here we applyKMCF to synthetic state-space models Comparisons were made with thefollowing methods

kNN-PF (Vlassis et al 2002) This method uses k-NN-based condi-tional density estimation (Stone 1977) for learning the observation modelFirst it estimates the conditional density of the inverse direction p(x|y)

from the training sample (XiYi) The learned conditional density is thenused as an alternative for the likelihood p(yt |xt ) this is a heuristic todeal with high-dimensional yt Then it applies particle filter (PF) basedon the approximated observation model and the given transition modelp(xt |xtminus1)

GP-PF (Ferris et al 2006) This method learns p(yt |xt ) from (XiYi)with gaussian process (GP) regression Then particle filter is applied basedon the learned observation model and the transition model We used theopen source code for GP-regression in this experiment so comparison incomputational time is omitted for this method8

KBR filter (Fukumizu et al 2011 2013) This method is also based onkernel mean embeddings as is KMCF It applies kernel Bayesrsquo rule (KBR) inposterior estimation using the joint sample (XiYi) This method assumesthat there also exist training samples for the transition model Thus inthe following experiments we additionally drew training samples for thetransition model Fukumizu et al (2011 2013) showed that this methodoutperforms extended and unscented Kalman filters when a state-spacemodel has strong nonlinearity (in that experiment these Kalman filterswere given the full knowledge of a state-space model) We use this methodas a baseline

We used state-space models defined in Table 2 where ldquoSSMrdquo stands forldquostate-space modelrdquo In Table 2 ut denotes a control input at time t vtand wt denotes independent gaussian noise vt wt sim N(0 1) Wt denotes10-dimensional gaussian noise Wt sim N(0 I10) We generated each controlut randomly from the gaussian distribution N(0 1)

The state and observation spaces for SSMs 1a 1b 2a 2b 4a 4b aredefined as X = Y = R for SSMs 3a 3b X = RY = R

10 The models inSSMs 1a 2a 3a 4a and SSMs 1b 2b 3b 4b with the same number (eg1a and 1b) are almost the same the difference is whether ut exists in thetransition model Prior distributions for the initial state x1 for SSMs 1a 1b2a 2b 3a 3b are defined as pinit = N(0 1(1 minus 092)) and those for 4a4b are defined as a uniform distribution on [minus3 3]

SSMs 1a and 1b are linear gaussian models SSMs 2a and 2b are the so-called stochastic volatility models Their transition models are the same asthose of SSMs 1a and 1b The observation model has strong nonlinearityand the noise wt is multiplicative SSMs 3a and 3b are almost the same as

8httpwwwgaussianprocessorggpmlcodematlabdoc

Filtering with State-Observation Examples 417

Table 2 State-Space Models for Synthetic Experiments

SSM Transition Model Observation Model

1a xt = 09xtminus1 + vt yt = xt + wt

1b xt = 09xtminus1 + 1radic2(ut + vt ) yt = xt + wt

2a xt = 09xtminus1 + vt yt = 05 exp(xt2)wt

2b xt = 09xtminus1 + 1radic2(ut + vt ) yt = 05 exp(xt2)wt

3a xt = 09xtminus1 + vt yt = 05 exp(xt2)Wt

3b xt = 09xtminus1 + 1radic2(ut + vt ) yt = 05 exp(xt2)Wt

4a at = xtminus1 + radic2vt bt = xt + wt

xt =

at (if |at | le 3)

minus3 (otherwise)yt =

bt (if |bt | le 3)

bt minus 6bt|bt | (otherwise)

4b at = xtminus1 + ut + vt bt = xt + wt

xt =

at (if |at | le 3)

minus3 (otherwise)yt =

bt (if |bt | le 3)

bt minus 6bt|bt | (otherwise)

SSMs 2a and 2b The difference is that the observation yt is 10-dimensionalas Wt is 10-dimensional gaussian noise SSMs 4a and 4b are more complexthan the other models Both the transition and observation models havestrong nonlinearities states and observations located around the edges ofthe interval [minus3 3] may abruptly jump to distant places

For each model we generated the training samples (XiYi)ni=1 by sim-

ulating the model Test data (xt yt )Tt=1 were also generated by indepen-

dent simulation (recall that xt is hidden for each method) The length ofthe test sequence was set as T = 100 We fixed the number of particles inkNN-PF and GP-PF to 5000 in primary experiments we did not observeany improvements even when more particles were used For the same rea-son we fixed the size of transition examples for KBR filter to 1000 Eachmethod estimated the ground-truth states x1 xT by estimating the pos-terior means

intxt p(xt |y1t )dxt (t = 1 T ) The performance was evaluated

with RMSE (root mean squared errors) of the point estimates defined as

RMSE =radic

1T

sumTt=1(xt minus xt )

2 where xt is the point estimateFor KMCF and KBR filter we used gaussian kernels for each of X and Y

(and also for controls in KBR filter) We determined the hyperparameters ofeach method by two-fold cross-validation by dividing the training data intotwo sequences The hyperparameters in the GP-regressor for PF-GP wereoptimized by maximizing the marginal likelihood of the training data Toreduce the costs of the resampling step of KMCF we used the method dis-cussed in section 52 with = 50 We also used the low-rank approximationmethod (see algorithm 5) and the subsampling method (see algorithm 6)in appendix C to reduce the computational costs of KMCF Specifically we

418 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Figure 6 RMSE of the synthetic experiments in section 62 The state-spacemodels of these figures have no control in their transition models

used r = 10 20 (rank of low-rank matrices) for algorithm 5 (described asKMCF-low10 and KMCF-low20 in the results below) r = 50 100 (numberof subsamples) for algorithm 6 (described as KMCF-sub50 and KMCF-sub100) We repeated experiments 20 times for each of different trainingsample size n

Figure 6 shows the results in RMSE for SSMs 1a 2a 3a 4a and Figure 7shows those for SSMs 1b 2b 3b 4b Figure 8 describes the results incomputational time for SSMs 1a and 1b the results for the other models aresimilar so we omit them We do not show the results of KMCF-low10 inFigures 6 and 7 since they were numerically unstable and gave very largeRMSEs

GP-PF performed the best for SSMs 1a and 1b This may be becausethese models fit the assumption of GP-regression as their noise is addi-tive gaussian For the other models however GP-PF performed poorly

Filtering with State-Observation Examples 419

Figure 7 RMSE of synthetic experiments in section 62 The state-space modelsof these figures include control ut in their transition models

Figure 8 Computation time of synthetic experiments in section 62 (Left) SSM1a (Right) SSM 1b

420 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

the observation models of these models have strong nonlinearities and thenoise is not additive gaussian For these models KMCF performed thebest or competitively with the other methods This indicates that KMCFsuccessfully exploits the state-observation examples (XiYi)n

i=1 in dealingwith the complicated observation models Recall that our focus has beenon situations where the relations between states and observations are socomplicated that the observation model is not known the results indicatethat KMCF is promising for such situations On the other hand the KBRfilter performed worse than KMCF for most of the models KBF filter alsouses kernel Bayesrsquo rule as KMCF The difference is that KMCF makes use ofthe transition models directly by sampling while the KBR filter must learnthe transition models from training data for state transitions This indicatesthat the incorporation of the knowledge expressed in the transition model isvery important for the filtering performance This can also be seen by com-paring Figures 6 and 7 The performance of the methods other than KBRfilter improved for SSMs 1b 2b 3b 4b compared to the performance forthe corresponding models in SSMs 1a 2a 3a 4a Recall that SSMs 1b2b 3b 4b include control ut in their transition models The informationof control input is helpful for filtering in general Thus the improvementssuggest that KMCF kNN-PF and GP-PF successfully incorporate the in-formation of controls they achieve this by sampling with p(xt |xtminus1 ut ) Onthe other hand KBF filter must learn the transition model p(xt |xtminus1 ut ) thiscan be harder than learning the transition model p(xt |xtminus1) which has nocontrol input

We next compare computation time (see Figure 8) KMCF was competi-tive or even slower than the KBR filter This is due to the resampling step inKMCF The speed-up methods (KMCF-low10 KMCF-low20 KMCF-sub50and KMCF-sub100) successfully reduced the costs of KMCF KMCF-low10and KMCF-low20 scaled linearly to the sample size n this matches the factthat algorithm 5 reduces the costs of kernel Bayesrsquo Rule to O(nr2) The costsof KMCF-sub50 and KMCF-sub100 remained almost the same over the dif-ferent sample sizes This is because they reduce the sample size itself fromn to r so the costs are reduced to O(r3) (see algorithm 6) KMCF-sub50 andKMCF-sub100 are competitive to kNN-PF which is fast as it only needskNN searches to deal with the training sample (XiYi)n

i=1 In Figures 6and 7 KMCF-low20 and KMCF-sub100 produced the results competitiveto KMCF for SSMs 1a 2a 4a 1b 2b 4b Thus for these models suchmethods reduce the computational costs of KMCF without losing muchaccuracy KMCF-sub50 was slightly worse than KMCF-100 This indicatesthat the number of subsamples cannot be reduced to this extent if we wishto maintain accuracy For SSMs 3a and 3b the performance of KMCF-low20and KMCF-sub100 was worse than KMCF in contrast to the performancefor the other models The difference of SSMs 3a and 3b from the other mod-els is that the observation space is 10-dimensional Y = R

10 This suggeststhat if the dimension is high r needs to be large to maintain accuracy (recall

Filtering with State-Observation Examples 421

that r is the rank of low-rank matrices in algorithm 5 and the number ofsubsamples in algorithm 6) This is also implied by the experiments in thesection 63

63 Vision-Based Mobile Robot Localization We applied KMCF to theproblem of vision-based mobile robot localization (Vlassis et al 2002 Wolfet al 2005 Quigley et al 2010) We consider a robot moving in a buildingThe robot takes images with its vision camera as it moves Thus the visionimages form a sequence of observations y1 yT in time series each ytis an image The robot does not know its positions in the building wedefine state xt as the robotrsquos position at time t The robot wishes to estimateits position xt from the sequence of its vision images y1 yt This canbe done by filtering that is by estimating the posteriors p(xt |y1 yt )

(t = 1 T ) This is the robot localization problem It is fundamental inrobotics as a basis for more involved applications such as navigation andreinforcement learning (Thrun Burgard amp Fox 2005)

The state-space model is defined as follows The observation modelp(yt |xt ) is the conditional distribution of images given position which isvery complicated and considered unknown We need to assume position-image examples (XiYi)n

i=1 these samples are given in the data setdescribed below The transition model p(xt |xtminus1) = p(xt |xtminus1 ut ) is the con-ditional distribution of the current position given the previous one Thisinvolves a control input ut that specifies the movement of the robot In thedata set we use the control is given as odometry measurements Thus wedefine p(xt |xtminus1 ut ) as the odometry motion model which is fairly standardin robotics (Thrun et al 2005) Specifically we used the algorithm describedin table 56 of Thrun et al (2005) with all of its parameters fixed to 01 Theprior pinit of the initial position x1 is defined as a uniform distribution overthe samples X1 Xn in (XiYi)n

i=1As a kernel kY for observations (images) we used the spatial pyramid

matching kernel of Lazebnik et al (2006) This is a positive-definite kerneldeveloped in the computer vision community and is also fairly standardSpecifically we set the parameters of this kernel as suggested in Lazebniket al (2006) This gives a 4200-dimensional histogram for each image Wedefined the kernel kX for states (positions) as gaussian Here the state spaceis the four-dimensional space X = R

4 two dimensions for location and therest for the orientation of the robot9

The data set we used is the COLD database (Pronobis amp Caputo 2009)which is publicly available Specifically we used the data set Freiburg PartA Path 1 cloudy This data set consists of three similar trajectories of arobot moving in a building each of which provides position-image pairs(xt yt )T

t=1 We used two trajectories for training and validation and the

9We projected the robotrsquos orientation in [0 2π ] onto the unit circle in R2

422 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

rest for test We made state-observation examples (XiYi)ni=1 by randomly

subsampling the pairs in the trajectory for training Note that the difficultyof localization may depend on the time interval (ie the interval between tand t minus 1 in sec) Therefore we made three test sets (and training samplesfor state transitions in KBR filter) with different time intervals 227 sec(T = 168) 454 sec (T = 84) and 681 sec (T = 56)

In these experiments we compared KMCF with three methods kNN-PFKBR filter and the naive method (NAI) defined below For KBR filter wealso defined the gaussian kernel on the control ut that is on the differenceof odometry measurements at time t minus 1 and t The naive method (NAI)estimates the state xt as a point Xj in the training set (XiYi) such that thecorresponding observation Yj is closest to the observation yt We performedthis as a baseline We also used the spatial pyramid matching kernel forthese methods (for kNN-PF and NAI as a similarity measure of the nearestneighbors search) We did not compare with GP-PF since it assumes thatobservations are real vectors and thus cannot be applied to this problemstraightforwardly We determined the hyperparameters in each methodby cross-validation To reduce the cost of the resampling step in KMCFwe used the method discussed in section 52 with = 100 The low-rankapproximation method (see algorithm 5) and the subsampling method (seealgorithm 6) were also applied to reduce the computational costs of KMCFSpecifically we set r = 50 100 for algorithm 5 (described as KMCF-low50and KMCF-low100 in the results below) and r = 150 300 for algorithm 6(KMCF-sub150 and KMCF-sub300)

Note that in this problem the posteriors p(xt |y1t ) can be highly multi-modal This is because similar images appear in distant locations Thereforethe posterior mean

intxt p(xt |y1t )dxt is not appropriate for point estimation

of the ground-truth position xt Thus for KMCF and KBR filter we em-ployed the heuristic for mode estimation explained in section 44 For kNN-PF we used a particle with maximum weight for the point estimationWe evaluated the performance of each method by RMSE of location esti-mates We ran each experiment 20 times for each training set of differentsize

631 Results First we demonstrate the behaviors of KMCF with this lo-calization problem Figures 9 and 10 show iterations of KMCF with n = 400applied to the test data with time interval 681 sec Figure 9 illustrates itera-tions that produced accurate estimates while Figure 10 describes situationswhere location estimation is difficult

Figures 11 and 12 show the results in RMSE and computational timerespectively For all the results KMCF and that with the computational re-duction methods (KMCF-low50 KMCF-low100 KMCF-sub150 and KMCF-sub300) performed better than KBR filter These results show the bene-fit of directly manipulating the transition models with sampling KMCF

Filtering with State-Observation Examples 423

Figure 9 Demonstration results Each column corresponds to one iteration ofKMCF (Top) Prediction step histogram of samples for prior (Middle) Cor-rection step weighted samples for posterior The blue and red stems indi-cate positive and negative weights respectively The yellow ball representsthe ground-truth location xt and the green diamond the estimated one xt (Bottom) Resampling step histogram of samples given by the resampling step

was competitive with kNN-PF for the interval 227 sec note that kNN-PFwas originally proposed for the robot localization problem For the resultswith the longer time intervals (454 sec and 681 sec) KMCF outperformedkNN-PF

We next investigate the effect on KMCF of the methods to reduce com-putational cost The performance of KMCF-low100 and KMCF-sub300 iscompetitive with KMCF those of KMCF-low50 and KMCF-sub150 de-grade as the sample size increases Note that r = 50 100 for algorithm 5are larger than those in section 62 though the values of the sample size nare larger than those in section 62 Also note that the performance of KMCF-sub150 is much worse than KMCF-sub300 These results indicate that wemay need large values for r to maintain the accuracy for this localizationproblem Recall that the spatial pyramid matching kernel gives essentially ahigh-dimensional feature vector (histogram) for each observation Thus the

424 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Figure 10 Demonstration results (see also the caption of Figure 9) Here weshow time points where observed images are similar to those in distant placesSuch a situation often occurs at corners and makes location estimation difficult(a) The prior estimate is reasonable but the resulting posterior has modes indistant places This makes the location estimate (green diamond) far from thetrue location (yellow ball) (b) While the location estimate is very accuratemodes also appear at distant locations

observation space Y may be considered high-dimensional This supportsthe hypothesis in section 62 that if the dimension is high the computationalcost-reduction methods may require larger r to maintain accuracy

Finally let us look at the results in computation time (see Figure 12)The results are similar to those in section 62 Although the values for r arerelatively large algorithms 5 and 6 successfully reduced the computationalcosts of KMCF

7 Conclusion and Future Work

This letter has proposed kernel Monte Carlo filter a novel filtering methodfor state-space models We have considered the situation where the obser-vation model is not known explicitly or even parametrically and where

Filtering with State-Observation Examples 425

Figure 11 RMSE of the robot localization experiments in section 63 Panels a band c show the cases for time interval 227 sec 454 sec and 681 sec respectively

examples of the state-observation relation are given instead of the obser-vation model Our approach is based on the framework of kernel meanembeddings which enables us to deal with the observation model in adata-driven manner The resulting filtering method consists of the predic-tion correction and resampling steps all realized in terms of kernel meanembeddings Methodological novelties lie in the prediction and resamplingsteps Thus we analyzed their behaviors by deriving error bounds for theestimator of the prediction step The analysis revealed that the effectivesample size of a weighted sample plays an important role as in parti-cle methods This analysis also explained how our resampling algorithmworks We applied the proposed method to synthetic and real problemsconfirming the effectiveness of our approach

426 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Figure 12 Computation time of the localization experiments in section 63Panels a b and c show the cases for time interval 227 sec 454 sec and 681 secrespectively Note that the results show the runtime of each method

One interesting topic for future research would be parameter estimationfor the transition model We did not discuss this and assumed that param-eters if they exist are given and fixed If the state observation examples(XiYi)n

i=1 are given as a sequence from the state-space model then wecan use the state samples X1 Xn for estimating those parameters Oth-erwise we need to estimate the parameters based on test data This mightbe possible by exploiting approaches for parameter estimation in particlemethods (eg section IV in Cappe et al (2007))

Another important topic is the situation where the observation modelin the test and training phases is different As discussed in section 43 this

Filtering with State-Observation Examples 427

might be addressed by exploiting the framework of transfer learning (Panamp Yang 2010) This would require extension of kernel mean embeddingsto the setting of transfer learning since there has been no work in thisdirection We consider that such extension is interesting in its own right

Appendix A Proof of Theorem 1

Before going to the proof we review some basic facts that are neededLet mP = int

kX (middot x)dP(x) and mP = sumni=1 wikX (middot Xi) By the reproducing

property of the kernel kX the following hold for any f isin HX

〈mP f 〉HX=

langintkX (middot x)dP(x) f

rangHX

=int

〈kX (middot x) f 〉HXdP(x)

=int

f (x)dP(x) = EXsimP[ f (X)] (A1)

〈mP f 〉HX=

langnsum

i=1

wikX (middot Xi) f

rangHX

=nsum

i=1

wi f (Xi) (A2)

For any f g isin HX we denote by f otimes g isin HX otimes HX the tensor productof f and g defined as

f otimes g(x1 x2) = f (x1)g(x2) forallx1 x2 isin X (A3)

The inner product of the tensor RKHS HX otimes HX satisfies

〈 f1 otimesg1 f2 otimes g2〉HXotimesHX= 〈 f1 f2〉HX

〈g1 g2〉HXforall f1 f2 g1 g2 isinHX

(A4)

Let φiIs=1 sub HX be complete orthonormal bases of HX where I isin N cup infin

Assume θ isin HX otimes HX (recall that this is an assumption of theorem 1) Thenθ is expressed as

θ =Isum

st=1

αstφs otimes φt (A5)

withsum

st |αst |2 lt infin (see Aronszajn 1950)

Proof of Theorem 1 Recall that mQ = sumni=1 wikX (middot X prime

i ) where X primei sim

p(middot|Xi) (i = 1 n) Then

EX prime1X

primen[mQ minus mQ2

HX]

428 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

= EX prime1X

primen[〈mQ mQ〉HX

minus 2〈mQ mQ〉HX+ 〈mQ mQ〉HX

]

=nsum

i j=1

wiw jEX primei X

primej[kX (X prime

i X primej)]

minus 2nsum

i=1

wiEX primesimQX primei[kX (X prime X prime

i )] + EX primeX primesimQ[kX (X prime X prime)]

=sumi = j

wiw jEX primei X

primej[kX (X prime

i X primej)] +

nsumi=1

w2i EX prime

i[kX (X prime

i X primei )]

minus 2nsum

i=1

wiEX primesimQX primei[kX (X prime X prime

i )] + EX primeX primesimQ[kX (X prime X prime)] (A6)

where X prime denotes an independent copy of X primeRecall that Q= int

p(middot|x)dP(x) and θ (x x) = int intkX (xprime xprime)dp(xprime|x)dp(xprime|x)

We can then rewrite terms in equation A6 as

EX primesimQX primei[kX (X prime X prime

i )]

=int (intint

kX (xprime xprimei)dp(xprime|x)dp(xprime

i|Xi)

)dP(x)

=int

θ (x Xi)dP(x) = EXsimP[θ (X Xi)]

EX primeX primesimQ[kX (X prime X prime)]

=intint (intint

kX (xprime xprime)dp(xprime|x)p(xprime|x)

)dP(x)dP(x)

=intint

θ (x x)dP(x)dP(x) = EXXsimP[θ (X X)]

Thus equation A6 is equal to

nsumi=1

w2i (EX prime

i[kX (X prime

i X primei )] minus EX prime

i Xprimei[kX (X prime

i X primei )]) +

nsumi j=1

wiw jθ (Xi Xj)

minus 2nsum

i=1

wiEXsimP[θ (X Xi)] + EXXsimP[θ (X X)] (A7)

Filtering with State-Observation Examples 429

We can rewrite terms in equation A7 as follows using the facts A1 A2A3 A4 and A5

sumi j

wiw jθ (Xi Xj) =sumi j

wiw j

sumst

αstφs(Xi)φt (Xj)

=sumst

αst

sumi

wiφs(Xi)sum

j

w jφt(Xj) =sumst

αst〈mP φs〉HX〈mP φt〉HX

=sumst

αst〈mP otimes mP φs otimes φt〉HX otimesHX= 〈mP otimes mP θ〉HX otimesHX

sumi

wiEXsimP[θ (X Xi)] =sum

i

wiEXsimP

[sumst

αstφs(X)φt (Xi)

]

=sumst

αstEXsimP[φs(X)]sum

i

wiφt (Xi) =sumst

αst〈mP φs〉HX〈mP φt〉HX

=sumst

αst〈mP otimes mP φs otimes φt〉HX otimesHX= 〈mP otimes mP θ〉HX otimesHX

EXXsimP[θ (X X)] = EXXsimP

[sumst

αstφs(X)φt (X)

]

=sumst

αst〈mP φs〉HX〈mP φt〉HX

=sumst

αst〈mP otimes mP φs otimes φt〉HX otimesHX

= 〈mP otimes mP θ〉HX otimesHX

Thus equation A7 is equal to

nsumi=1

w2i (EX prime

i[kX (X prime

i X primei )] minus EX prime

i Xprimei[kX (X prime

i X primei )]) + 〈mP otimes mP θ〉HX otimesHX

minus 2〈mP otimes mP θ〉HX otimesHX+ 〈mP otimes mP θ〉HX otimesHX

=nsum

i=1

w2i (EX prime

i[kX (X prime

i X primei )] minus EX prime

i Xprimei[kX (X prime

i X primei )])

+〈(mP minus mP) otimes (mP minus mP) θ〉HX otimesHX

Finally the Cauchy-Schwartz inequality gives

〈(mP minus mP) otimes (mP minus mP) θ〉HX otimesHXle mP minus mP2

HXθHX otimesHX

This completes the proof

430 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Appendix B Proof of Theorem 2

Theorem 2 provides convergence rates for the resampling algorithmAlgorithm 4 This theorem assumes that the candidate samples Z1 ZNfor resampling are iid with a density q Here we prove theorem 2 byshowing that the same statement holds under weaker assumptions (seetheorem 3 below)

We first describe assumptions Let P be the distribution of the kernelmean mP and L2(P) be the Hilbert space of square-integrable functions onX with respect to P For any f isin L2(P) we write its norm by fL2(P) =int

f 2(x)dP(x)

Assumption 1 The candidate samples Z1 ZN are independent Thereare probability distributions Q1 QN on X such that for any boundedmeasurable function g X rarr R we have

E

⎡⎣ 1

N minus 1

sumj =i

g(Zj)

⎤⎦ = EXsimQi

[g(X)] (i = 1 N) (B1)

Assumption 2 The distributions Q1 QN have density functions q1

qN respectively Define Q = 1N

sumNi=1 Qi and q = 1

N

sumNi=1 qi There is a con-

stant A gt 0 that does not depend on N such that

∥∥∥∥qi

qminus 1

∥∥∥∥2

L2(P)

le AradicN

(i = 1 N) (B2)

Assumption 3 The distribution P has a density function p such thatsupxisinX

p(x)

q(x)lt infin There is a constant σ gt 0 such that

radicN

(1N

Nsumi=1

p(Zi)

q(Zi)minus 1

)DrarrN (0 σ 2) (B3)

whereDrarr denotes convergence in distribution and N (0 σ 2) the normal

distribution with mean 0 and variance σ 2

These assumptions are weaker than those in theorem 2 which requirethat Z1 ZN be iid For example assumption 1 is clearly satisfied for theiid case since in this case we have Q = Q1= middot middot middot = QN The inequalityequation B2 in assumption 2 requires that the distributions Q1 QN getsimilar as the sample size increases This is also satisfied under the iid

Filtering with State-Observation Examples 431

assumption Likewise the convergence equation B3 in assumption 3 issatisfied from the central limit theorem if Z1 ZN are iid

We will need the following lemma

Lemma 1 Let Z1 ZN be samples satisfying assumption 1 Then the followingholds for any bounded measurable function g X rarr R

E

[1N

Nsumi=1

g(Zi )

]=

intg(x)d Q(x)

Proof

E

[1N

Nsumi=1

g(Zi)

]= E

⎡⎣ 1

N(N minus 1)

Nsumi=1

sumj =i

g(Zj)

⎤⎦

= 1N

Nsumi=1

E

⎡⎣ 1

N minus 1

sumj =i

g(Zj)

⎤⎦ = 1

N

Nsumi=1

intg(x)Qi(x) =

intg(x)dQ(x)

The following theorem shows the convergence rates of our resampling al-gorithm Note that it does not assume that the candidate samples Z1 ZNare identical to those expressing the estimator mP

Theorem 3 Let k be a bounded positive definite kernel and H be the asso-ciated RKHS Let Z1 ZN be candidate samples satisfying assumptions 12 and 3 Let P be a probability distribution satisfying assumption 3 and letmP =

intk(middot x)d P(x) be the kernel mean Let mP isin H be any element in H Sup-

pose we apply algorithm 4 to mP isin H with candidate samples Z1 ZN and letX1 X isin Z1 ZN be the resulting samples Then the following holds

∥∥∥∥∥mP minus 1

sumi=1

k(middot Xi )

∥∥∥∥∥2

H

= (mP minus mPH + Op(Nminus12))2 + O(

ln

)

Proof Our proof is based on the fact (Bach et al 2012) that kernel herdingcan be seen as the Frank-Wolfe optimization method with step size 1( + 1)

for the th iteration For details of the Frank-Wolfe method we refer to Jaggi(2013) and Freund and Grigas (2014)

Fix the samples Z1 ZN Let MN be the convex hull of the set k(middot Z1)

k(middot ZN ) sub H Define a loss function J H rarr R by

J(g) = 12g minus mP2

H g isin H (B4)

432 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Then algorithm 4 can be seen as the Frank-Wolfe method that iterativelyminimizes this loss function over the convex hull MN

infgisinMN

J(g)

More precisely the Frank-Wolfe method solves this problem by the follow-ing iterations

s = arg mingisinMN

〈gnablaJ(gminus1)〉H

g = (1 minus γ )gminus1 + γ s ( ge 1)

where γ is a step size defined as γ = 1 and nablaJ(gminus1) is the gradient of J atgminus1 nablaJ(gminus1) = gminus1 minus mP Here the initial point is defined as g0 = 0 It canbe easily shown that g = 1

sumi=1 k(middot Xi) where X1 X are the samples

given by algorithm 4 (For details see Bach et al 2012)Let LJMN

gt 0 be the Lipschitz constant of the gradient nablaJ over MN andDiam MN gt 0 be the diameter of MN

LJMN= sup

g1g2isinMN

nablaJ(g1) minus nablaJ(g2)Hg1 minus g2H

= supg1g2isinMN

g1 minus g2Hg1 minus g2H

= 1 (B5)

Diam MN = supg1g2isinMN

g1 minus g2H

le supg1g2isinMN

g1H + g2H le 2C (B6)

where C = supxisinX k(middot x)H = supxisinXradic

k(x x) lt infinFrom bound 32 and equation 8 of Freund and Grigas (2014) we then

have

J(g) minus infgisinMN

J(g) leLJMN

(Diam MN)2(1 + ln )

2(B7)

le 2C2(1 + ln )

(B8)

where the last inequality follows from equations B5 and B6

Filtering with State-Observation Examples 433

Note that the upper bound of equation B8 does not depend on thecandidate samples Z1 ZN Hence combined with equation B4 thefollowing holds for any choice of Z1 ZN

∥∥∥∥∥mP minus 1

sumi=1

k(middot Xi)

∥∥∥∥∥2

H

le infgisinMN

mP minus g2H + 4C2(1 + ln )

(B9)

Below we will focus on bounding the first term of equation B9 Recallhere that Z1 ZN are random samples Define a random variable SN =sumN

i=1p(Zi )

q(Zi ) Since MN is the convex hull of the k(middot Z1) k(middot ZN ) we

have

infgisinMN

mP minus gH

= infαisinRN αge0

sumi αile1

∥∥∥∥∥mP minussum

i

αik(middot Zi)

∥∥∥∥∥H

le∥∥∥∥∥mP minus 1

SN

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

le mP minus mPH +∥∥∥∥∥mP minus 1

N

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

+∥∥∥∥∥ 1

N

sumi

p(Zi)

q(Zi)k(middot Zi) minus 1

SN

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

Therefore we have

mP minus 1

sumi=1

k(middot Xi)2H

le(

mP minus mPH +∥∥∥∥∥mP minus 1

N

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

+∥∥∥∥∥ 1

N

sumi

p(Zi)

q(Zi)k(middot Zi) minus 1

SN

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

)2

+ O(

ln

)

(B10)

Below we derive rates of convergence for the second and third terms

434 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

For the second term we derive a rate of convergence in expectationwhich implies a rate of convergence in probability To this end we use thefollowing fact Let f isin H be any function in the RKHS By the assumptionsupxisinX

p(x)

q(x)lt infin and the boundedness of k functions x rarr p(x)

q(x)f (x) and

x rarr (p(x)

q(x))2 f (x) are bounded

E

⎡⎣

∥∥∥∥∥mP minus 1N

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥2

H

⎤⎦

= mP2H minus 2E

[1N

sumi

p(Zi)

q(Zi)mP(Zi)

]

+ E

⎡⎣ 1

N2

sumi

sumj

p(Zi)

q(Zi)

p(Zj)

q(Zj)k(Zi Zj)

⎤⎦

= mP2H minus 2

intp(x)

q(x)mP(x)q(x)dx

+ E

⎡⎣ 1

N2

sumi

sumj =i

p(Zi)

q(Zi)

p(Zj)

q(Zj)k(Zi Zj)

⎤⎦

+E

[1

N2

sumi

(p(Zi)

q(Zi)

)2

k(Zi Zi)

]

= mP2H minus 2mP2

H + E

[N minus 1

N2

sumi

p(Zi)

q(Zi)

intp(x)

q(x)k(Zi x)qi(x)dx

]

+ 1N

int (p(x)

q(x)

)2

k(x x)q(x)dx

= minusmP2H + E

[N minus 1

N2

sumi

p(Zi)

q(Zi)

intp(x)

q(x)k(Zi x)qi(x)dx

]

+ 1N

intp(x)

q(x)k(x x)dP(x)

We further rewrite the second term of the last equality as follows

E

[N minus 1

N2

sumi

p(Zi)

q(Zi)

intp(x)

q(x)k(Zi x)qi(x)dx

]

Filtering with State-Observation Examples 435

= E

[N minus 1

N2

sumi

p(Zi)

q(Zi)

intp(x)

q(x)k(Zi x)(qi(x) minus q(x))dx

]

+ E

[N minus 1

N2

sumi

p(Zi)

q(Zi)

intp(x)

q(x)k(Zi x)q(x)dx

]

= E

[N minus 1

N2

sumi

p(Zi)

q(Zi)

int radicp(x)k(Zi x)

radicp(x)

(qi(x)

q(x)minus 1

)dx

]

+ N minus 1N

mP2H

le E

[N minus 1

N2

sumi

p(Zi)

q(Zi)k(Zi middot)L2(P)

∥∥∥∥qi(x)

q(x)minus 1

∥∥∥∥L2(P)

]+ N minus 1

NmP2

H

le E

[N minus 1

N3

sumi

p(Zi)

q(Zi)C2A

]+ N minus 1

NmP2

H

= C2A(N minus 1)

N2 + N minus 1N

mP2H

where the first inequality follows from Cauchy-Schwartz Using this weobtain

E[

∥∥∥∥∥mP minus 1N

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥2

H

le 1N

(intp(x)

q(x)k(x x)dP(x) minus mP2

H

)+ C2(N minus 1)A

N2

= O(Nminus1)

Therefore we have

∥∥∥∥∥mP minus 1N

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

= Op(Nminus12) (N rarr infin) (B11)

We can bound the third term as follows∥∥∥∥∥ 1N

sumi

p(Zi)

q(Zi)k(middot Zi) minus 1

SN

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

=∥∥∥∥∥ 1

N

sumi

p(Zi)

q(Zi)k(middot Zi)

(1 minus N

SN

)∥∥∥∥∥H

436 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

=∣∣∣∣1 minus N

SN

∣∣∣∣∥∥∥∥∥ 1

N

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

le∣∣∣∣1 minus N

SN

∣∣∣∣C pqinfin

=∣∣∣∣∣1 minus 1

1N

sumNi=1 p(Zi)q(Zi)

∣∣∣∣∣C pqinfin

where pqinfin = supxisinXp(x)

q(x)lt infin Therefore the following holds by as-

sumption 3 and the delta method

∥∥∥∥∥ 1N

sumi

p(Zi)

q(Zi)k(middot Zi) minus 1

SN

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

= Op(Nminus12) (B12)

The assertion of the theorem follows from equations B10 to B12

Appendix C Reduction of Computational Cost

We have seen in section 43 that the time complexity of KMCF in one timestep is O(n3) where n is the number of the state-observation examples(XiYi)n

i=1 This can be costly if one wishes to use KMCF in real-timeapplications with a large number of samples Here we show two methodsfor reducing the costs one based on low-rank approximation of kernelmatrices and one based on kernel herding Note that kernel herding is alsoused in the resampling step The purpose here is different however wemake use of kernel herding for finding a reduced representation of the data(XiYi)n

i=1

C1 Low-rank Approximation of Kernel Matrices Our goal is to re-duce the costs of algorithm 1 of kernel Bayesrsquo rule Algorithm 1 involvestwo matrix inversions (GX + nεIn)minus1 in line 3 and ((GY )2 + δIn)minus1 in line4 Note that (GX + nεIn)minus1 does not involve the test data so it can be com-puted before the test phase On the other hand ((GY )2 + δIn)minus1 dependson matrix This matrix involves the vector mπ which essentially repre-sents the prior of the current state (see line 13 of algorithm 3) Therefore((GY )2 + δIn)minus1 needs to be computed for each iteration in the test phaseThis has the complexity of O(n3) Note that even if (GX + nεIn)minus1 can becomputed in the training phase the multiplication (GX + nεIn)minus1mπ in line3 requires O(n2) Thus it can also be costly Here we consider methods toreduce both costs in lines 3 and 4

Suppose that there exist low-rank matrices UV isin Rntimesr where r lt n that

approximate the kernel matrices GX asymp UUT GY asymp VVT Such low-rank

Filtering with State-Observation Examples 437

matrices can be obtained by for example incomplete Cholesky decomposi-tion with time complexity O(nr2) (Fine amp Scheinberg 2001 Bach amp Jordan2002) Note that the computation of these matrices is required only oncebefore the test phase Therefore their time complexities are not the problemhere

C11 Derivation First we approximate (GX + nεIn)minus1mπ in line 3 usingGX asymp UUT By the Woodbury identity we have

(GX + nεIn)minus1mπ asymp (UUT + nεIn)minus1mπ

= 1nε

(In minus U(nεIr + UTU)minus1UT )mπ

where Ir isin Rrtimesr denotes the identity Note that (nεIr + UTU)minus1 does not

involve the test data so can be computed in the training phase Thus theabove approximation of μ can be computed with complexity O(nr2)

Next we approximate w = GY ((GY )2 + δI)minus1kY in line 4 using GY asympVVT Define B = V isin R

ntimesr C = VTV isin Rrtimesr and D = VT isin R

rtimesn Then(GY )2 asymp (VVT )2 = BCD By the Woodbury identity we obtain

(δIn + (GY )2)minus1 asymp (δIn + BCD)minus1

= 1δ(In minus B(δCminus1 + DB)minus1D)

Thus w can be approximated as

w =GY ((GY )2 + δI)minus1kY

asymp 1δVVT (In minus B(δCminus1 + DB)minus1D)kY

The computation of this approximation requires O(nr2 + r3) = O(nr2) Thusin total the complexity of algorithm 1 can be reduced to O(nr2) We sum-marize the above approximations in algorithm 5

438 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

C12 How to Use Algorithm 5 can be used with algorithm 3 of KMCFby modifying Algorithm 3 in the following manner Compute the low-rank matrices UV right after lines 4 and 5 This can be done by using forexample incomplete Cholesky decomposition (Fine amp Scheinberg 2001Bach amp Jordan 2002) Then replace algorithm 1 in line 15 by algorithm 5

C13 How to Select the Rank As discussed in section 43 one way ofselecting the rank r is to use-cross validation by regarding r as a hyper-parameter of KMCF Another way is to measure the approximation errorsGX minus UUT and GY minus VVT with some matrix norm such as the Frobe-nius norm Indeed we can compute the smallest rank r such that theseerrors are below a prespecified threshold and this can be done efficientlywith time complexity O(nr2) (Bach amp Jordan 2002)

C2 Data Reduction with Kernel Herding Here we describe an ap-proach to reduce the size of the representation of the state-observationexamples (XiYi)n

i=1 in an efficient way By ldquoefficientrdquo we mean that theinformation contained in (XiYi)n

i=1 will be preserved even after the re-duction Recall that (XiYi)n

i=1 contains the information of the observationmodel p(yt |xt ) (recall also that p(yt |xt ) is assumed time-homogeneous seesection 41) This information is used in only algorithm 1 of kernel Bayesrsquorule (line 15 algorithm 3) Therefore it suffices to consider how kernel Bayesrsquorule accesses the information contained in the joint sample (XiYi)n

i=1

C21 Representation of the Joint Sample To this end we need to show howthe joint sample (XiYi)n

i=1 can be represented with a kernel mean embed-ding Recall that (kX HX ) and (kY HY ) are kernels and the associatedRKHSs on the state-space X and the observation-space Y respectively LetX times Y be the product space ofX andY Then we can define a kernel kXtimesY onX times Y as the product of kX and kY kXtimesY ((x y) (xprime yprime)) = kX (x xprime)kY (y yprime)for all (x y) (xprime yprime) isin X times Y This product kernel kXtimesY defines an RKHS ofX times Y Let HXtimesY denote this RKHS As in section 3 we can use kXtimesY andHXtimesY for a kernel mean embedding In particular the empirical distribution1n

sumni=1 δ(XiYi )

of the joint sample (XiYi)ni=1 sub X times Y can be represented as

an empirical kernel mean in HXtimesY

mXY = 1n

nsumi=1

kXtimesY ((middot middot) (XiYi)) isin HXtimesY (C1)

This is the representation of the joint sample (XiYi)ni=1

The information of (XiYi)ni=1 is provided for kernel Bayesrsquo rule essen-

tially through the form of equation C1 (Fukumizu et al 2011 2013) Recallthat equation C1 is a point in the RKHS HXtimesY Any point close to thisequation in HXtimesY would also contain information close to that contained in

Filtering with State-Observation Examples 439

equation C1 Therefore we propose to find a subset (X1 Y1) (Xr Xr) sub(XiYi)n

i=1 where r lt n such that its representation in HXtimesY

mXY = 1r

rsumi=1

kXtimesY ((middot middot) (Xi Yi)) isin HXtimesY (C2)

is close to equation C1 Namely we wish to find subsamples such thatmXY minus mXYHXtimesY

is small If the error mXY minus mXYHXtimesYis small enough

equation C2 would provide information close to that given by equation C1for kernel Bayesrsquo rule Thus kernel Bayesrsquo rule based on such subsamples(Xi Yi)r

i=1 would not perform much worse than the one based on the entireset of samples (XiYi)n

i=1

C22 Subsampling Method To find such subsamples we make use ofkernel herding in section 35 Namely we apply the update equations 36and 37 to approximate equation C1 with kernel kXtimesY and RKHS HXtimesY We greedily find subsamples Dr = (X1 Y1) (Xr Yr) as

(Xr Yr)

= arg max(xy)isinDDrminus1

1n

nsumi=1

kXtimesY ((x y) (XiYi))minus1r

rminus1sumj=1

kXtimesY ((x r) (Xi Yi))

= arg max(xy)isinDDrminus1

1n

nsumi=1

kX (x Xi)kY (yYi)minus1r

rminus1sumj=1

kX (x Xj)kY (y Yj)

The resulting algorithm is shown in algorithm 6 The time complexity isO(n2r) for selecting r subsamples

C23 How to Use By using algorithm 6 we can reduce the the time com-plexity of KMCF (see algorithm 3) in each iteration from O(n3) to O(r3) Thiscan be done by obtaining subsamples (Xi Yi)r

i=1 by applying algorithm 6to (XiYi)n

i=1 and then replacing (XiYi)ni=1 in requirement of algorithm 3

by (Xi Yi)ri=1 and using the number r instead of n

C24 How to Select the Number of Subsamples The number r of subsam-ples determines the trade-off between the accuracy and computational timeof KMCF It may be selected by cross-validation or by measuring the ap-proximation error mXY minus mXYHXtimesY

as for the case of selecting the rankof low-rank approximation in section C1

C25 Discussion Recall that kernel herding generates samples such thatthey approximate a given kernel mean (see section 35) Under certain

440 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

assumptions the error of this approximation is of O(rminus1) with r sampleswhich is faster than that of iid samples O(rminus12) This indicates that sub-samples (Xi Yi)r

i=1 selected with kernel herding may approximate equa-tion C1 well Here however we find the solutions of the optimizationproblems 36 and 37 from the finite set (XiYi)n

i=1 rather than the entirejoint space X times Y The convergence guarantee is provided only for the caseof the entire joint space X times Y Thus for our case the convergence guar-antee is no longer provided Moreover the fast rate O(rminus1) is guaranteedonly for finite-dimensional RKHSs Gaussian kernels which we often use inpractice define infinite-dimensional RKHSs Therefore the fast rate is notguaranteed if we use gaussian kernels Nevertheless we can use algorithm6 as a heuristic for data reduction

Acknowledgments

We express our gratitude to the associate editor and the anonymous re-viewer for their time and helpful suggestions We also thank MasashiShimbo Momoko Hayamizu Yoshimasa Uematsu and Katsuhiro Omaefor their helpful comments This work has been supported in part by MEXTGrant-in-Aid for Scientific Research on Innovative Areas 25120012 MKhas been supported by JSPS Grant-in-Aid for JSPS Fellows 15J04406

References

Anderson B amp Moore J (1979) Optimal filtering Englewood Cliffs NJ PrenticeHall

Aronszajn N (1950) Theory of reproducing kernels Transactions of the AmericanMathematical Society 68(3) 337ndash404

Filtering with State-Observation Examples 441

Bach F amp Jordan M I (2002) Kernel independent component analysis Journal ofMachine Learning Research 3 1ndash48

Bach F Lacoste-Julien S amp Obozinski G (2012) On the equivalence betweenherding and conditional gradient algorithms In Proceedings of the 29th Interna-tional Conference on Machine Learning (ICML2012) (pp 1359ndash1366) Madison WIOmnipress

Berlinet A amp Thomas-Agnan C (2004) Reproducing kernel Hilbert spaces in probabilityand statistics Dordrecht Kluwer Academic

Calvet L E amp Czellar V (2015) Accurate methods for approximate Bayesiancomputation filtering Journal of Financial Econometrics 13 798ndash838 doi101093jjfinecnbu019

Cappe O Godsill S J amp Moulines E (2007) An overview of existing methods andrecent advances in sequential Monte Carlo IEEE Proceedings 95(5) 899ndash924

Chen Y Welling M amp Smola A (2010) Supersamples from kernel-herding InProceedings of the 26th Conference on Uncertainty in Artificial Intelligence (pp 109ndash116) Cambridge MA MIT Press

Deisenroth M Huber M amp Hanebeck U (2009) Analytic moment-based gaussianprocess filtering In Proceedings of the 26th International Conference on MachineLearning (pp 225ndash232) Madison WI Omnipress

Doucet A Freitas N D amp Gordon N J (Eds) (2001) Sequential Monte Carlomethods in practice New York Springer

Doucet A amp Johansen A M (2011) A tutorial on particle filtering and smoothingFifteen years later In D Crisan amp B Rozovskii (Eds) The Oxford handbook ofnonlinear filtering (pp 656ndash704) New York Oxford University Press

Durbin J amp Koopman S J (2012) Time series analysis by state space methods (2nd ed)New York Oxford University Press

Eberts M amp Steinwart I (2013) Optimal regression rates for SVMs using gaussiankernels Electronic Journal of Statistics 7 1ndash42

Ferris B Hahnel D amp Fox D (2006) Gaussian processes for signal strength-basedlocation estimation In Proceedings of Robotics Science and Systems CambridgeMA MIT Press

Fine S amp Scheinberg K (2001) Efficient SVM training using low-rank kernel rep-resentations Journal of Machine Learning Research 2 243ndash264

Freund R M amp Grigas P (2014) New analysis and results for the FrankndashWolfemethod Mathematical Programming doi 101007s10107-014-0841-6

Fukumizu K Bach F amp Jordan M I (2004) Dimensionality reduction for super-vised learning with reproducing kernel Hilbert spaces Journal of Machine LearningResearch 5 73ndash99

Fukumizu K Gretton A Sun X amp Scholkopf B (2008) Kernel measures ofconditional dependence In J C Platt D Koller Y Singer amp S Roweis (Eds)Advances in neural information processing systems 20 (pp 489ndash496) CambridgeMA MIT Press

Fukumizu K Song L amp Gretton A (2011) Kernel Bayesrsquo rule In J Shawe-TaylorR S Zemel P L Bartlett F C N Pereira amp K Q Weinberger (Eds) Advances inneural information processing systems 24 (pp 1737ndash1745) Red Hook NY Curran

Fukumizu K Song L amp Gretton A (2013) Kernel Bayesrsquo rule Bayesian inferencewith positive definite kernels Journal of Machine Learning Research 14 3753ndash3783

442 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Fukumizu K Sriperumbudur B Gretton A amp Scholkopf B (2009) Characteristickernels on groups and semigroups In D Koller D Schuurmans Y Bengio amp LBottou (Eds) Advances in neural information processing systems 21 (pp 473ndash480)Cambridge MA MIT Press

Gordon N J Salmond D J amp Smith AFM (1993) Novel approach tononlinearnon-gaussian Bayesian state estimation IEE-Proceedings-F 140 107ndash113

Hofmann T Scholkopf B amp Smola A J (2008) Kernel methods in machine learn-ing Annals of Statistics 36(3) 1171ndash1220

Jaggi M (2013) Revisiting Frank-Wolfe Projection-free sparse convex optimizationIn Proceedings of the 30th International Conference on Machine Learning (pp 427ndash435)httplinkspringercomarticle101007Fg10107-014-0841-6

Jasra A Singh S S Martin J S amp McCoy E (2012) Filtering via approximateBayesian computation Statistics and Computing 22 1223ndash1237

Julier S J amp Uhlmann J K (1997) A new extension of the Kalman filter tononlinear systems In Proceedings of AeroSense The 11th International SymposiumAerospaceDefence Sensing Simulation and Controls Bellingham WA SPIE

Julier S J amp Uhlmann J K (2004) Unscented filtering and nonlinear estimationIEEE Review 92 401ndash422

Kalman R E (1960) A new approach to linear filtering and prediction problemsTransactions of the ASME Journal of Basic Engineering 82 35ndash45

Kanagawa M amp Fukumizu K (2014) Recovering distributions from gaussianRKHS embeddings In Proceedings of the 17th International Conference on Artifi-cial Intelligence and Statistics (pp 457ndash465) JMLR

Kanagawa M Nishiyama Y Gretton A amp Fukumizu K (2014) Monte Carlofiltering using kernel embedding of distributions In Proceedings of the 28thAAAI Conference on Artificial Intelligence (pp 1897ndash1903) Cambridge MA MITPress

Ko J amp Fox D (2009) GP-BayesFilters Bayesian filtering using gaussian processprediction and observation models Autonomous Robots 72(1) 75ndash90

Lazebnik S Schmid C amp Ponce J (2006) Beyond bags of features Spatial pyramidmatching for recognizing natural scene categories In Proceedings of 2006 IEEEComputer Society Conference on Computer Vision and Pattern Recognition (vol 2 pp2169ndash2178) Washington DC IEEE Computer Society

Liu J S (2001) Monte Carlo strategies in scientific computing New York Springer-Verlag

McCalman L OrsquoCallaghan S amp Ramos F (2013) Multi-modal estimation withkernel embeddings for learning motion models In Proceedings of 2013 IEEE In-ternational Conference on Robotics and Automation (pp 2845ndash2852) Piscataway NJIEEE

Pan S J amp Yang Q (2010) A survey on transfer learning IEEE Transactions onKnowledge and Data Engineering 22(10) 1345ndash1359

Pistohl T Ball T Schulze-Bonhage A Aertsen A amp Mehring C (2008) Predic-tion of arm movement trajectories from ECoG-recordings in humans Journal ofNeuroscience Methods 167(1) 105ndash114

Pronobis A amp Caputo B (2009) COLD COsy localization database InternationalJournal of Robotics Research 28(5) 588ndash594

Filtering with State-Observation Examples 443

Quigley M Stavens D Coates A amp Thrun S (2010) Sub-meter indoor localiza-tion in unmodified environments with inexpensive sensors In Proceedings of theIEEERSJ International Conference on Intelligent Robots and Systems 2010 (vol 1 pp2039ndash2046) Piscataway NJ IEEE

Ristic B Arulampalam S amp Gordon N (2004) Beyond the Kalman filter Particlefilters for tracking applications Norwood MA Artech House

Schaid D J (2010a) Genomic similarity and kernel methods I Advancements bybuilding on mathematical and statistical foundations Human Heredity 70(2) 109ndash131

Schaid D J (2010b) Genomic similarity and kernel methods II Methods for genomicinformation Human Heredity 70(2) 132ndash140

Schalk G Kubanek J Miller K J Anderson N R Leuthardt E C Ojemann J G Wolpaw J R (2007) Decoding two dimensional movement trajectories usingelectrocorticographic signals in humans Journal of Neural Engineering 4(264)264ndash275

Scholkopf B amp Smola A J (2002) Learning with kernels Cambridge MA MIT PressScholkopf B Tsuda K amp Vert J P (2004) Kernel methods in computational biology

Cambridge MA MIT PressSilverman B W (1986) Density estimation for statistics and data analysis London

Chapman and HallSmola A Gretton A Song L amp Scholkopf B (2007) A Hilbert space embed-

ding for distributions In Proceedings of the International Conference on AlgorithmicLearning Theory (pp 13ndash31) New York Springer

Song L Fukumizu K amp Gretton A (2013) Kernel embeddings of conditional dis-tributions A unified kernel framework for nonparametric inference in graphicalmodels IEEE Signal Processing Magazine 30(4) 98ndash111

Song L Huang J Smola A amp Fukumizu K (2009) Hilbert space embeddings ofconditional distributions with applications to dynamical systems In Proceedingsof the 26th International Conference on Machine Learning (pp 961ndash968) MadisonWI Omnipress

Sriperumbudur B K Gretton A Fukumizu K Scholkopf B amp Lanckriet G R(2010) Hilbert space embeddings and metrics on probability measures Journal ofMachine Learning Research 11 1517ndash1561

Steinwart I amp Christmann A (2008) Support vector machines New York SpringerStone C J (1977) Consistent nonparametric regression Annals of Statistics 5(4)

595ndash620Thrun S Burgard W amp Fox D (2005) Probabilistic robotics Cambridge MA MIT

PressVlassis N Terwijn B amp Krose B (2002) Auxiliary particle filter robot local-

ization from high-dimensional sensor observations In Proceedings of the In-ternational Conference on Robotics and Automation (pp 7ndash12) Piscataway NJIEEE

Wang Z Ji Q Miller K J amp Schalk G (2011) Prior knowledge improves decodingof finger flexion from electrocorticographic signals Frontiers in Neuroscience 5127

Widom H (1963) Asymptotic behavior of the eigenvalues of certain integral equa-tions Transactions of the American Mathematical Society 109 278ndash295

444 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Widom H (1964) Asymptotic behavior of the eigenvalues of certain integral equa-tions II Archive for Rational Mechanics and Analysis 17 215ndash229

Wolf J Burgard W amp Burkhardt H (2005) Robust vision-based localization bycombining an image retrieval system with Monte Carlo localization IEEE Trans-actions on Robotics 21(2) 208ndash216

Received May 18 2015 accepted October 14 2015

Page 14: Filtering with State-Observation Examples via Kernel Monte ...

Filtering with State-Observation Examples 395

Figure 2 One iteration of KMCF Here X1 X8 and Y1 Y8 denote statesand observations respectively in the state-observation examples (XiYi)n

i=1(suppose n = 8) 1 Prediction step The kernel mean of the prior equation 45 isestimated by sampling with the transition model p(xt |xtminus1) 2 Correction stepThe kernel mean of the posterior equation 41 is estimated by applying kernelBayesrsquo rule (see algorithm 1) The estimation makes use of the informationof the prior (expressed as m

π= (mxt |y1tminus1

(Xi)) isin R8) as well as that of a new

observation yt (expressed as kY = (kY (ytYi)) isin R8) The resulting estimate

equation 46 is expressed as a weighted sample (wti Xi)ni=1 Note that the

weights may be negative 3 Resampling step Samples associated with smallweights are eliminated and those with large weights are replicated by applyingkernel herding (see algorithm 2) The resulting samples provide an empiricalkernel mean equation 47 which will be used in the next iteration

mxtminus1|y1tminus1= 1

n

nsumi=1

kX (middot Xtminus1i) (42)

where Xtminus11 Xtminus1n isin X We show one iteration of KMCF that estimatesthe kernel mean (41) at time t (see also Figure 2)

421 Prediction Step The prediction step is as follows We generate asample from the transition model for each Xtminus1i in equation 42

Xti sim p(xt |xtminus1 = Xtminus1i) (i = 1 n) (43)

396 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

We then specify a new empirical kernel mean

mxt |y1tminus1= 1

n

nsumi=1

kX (middot Xti) (44)

This is an estimator of the following kernel mean of the prior

mxt |y1tminus1=

intkX (middot xt )p(xt |y1tminus1)dxt isin HX (45)

where

p(xt |y1tminus1) =int

p(xt |xtminus1)p(xtminus1|y1tminus1)dxtminus1

is the prior distribution of the current state xt Thus equation 44 serves asa prior for the subsequent posterior estimation

In section 5 we theoretically analyze this sampling procedure in detailand provide justification of equation 44 as an estimator of the kernel meanequation 45 We emphasize here that such an analysis is necessary eventhough the sampling procedure is similar to that of a particle filter thetheory of particle methods does not provide a theoretical justification ofequation 44 as a kernel mean estimator since it deals with probabilities asempirical distributions

422 Correction Step This step estimates the kernel mean equation 41of the posterior by using kernel Bayesrsquo rule (see algorithm 1) in section 33This makes use of the new observation yt the state-observation examples(XiYi)n

i=1 and the estimate equation 44 of the priorThe input of algorithm 1 consists of (1) vectors

kY = (kY (ytY1) kY (ytYn))T isin Rn

mπ = (mxt |y1tminus1(X1) mxt |y1tminus1

(Xn))T

=(

1n

nsumi=1

kX (Xq Xti)

)n

q=1

isin Rn

which are interpreted as expressions of yt and mxt |y1tminus1using the sample

(XiYi)ni=1 (2) kernel matrices GX = (kX (Xi Xj)) GY = (kY (YiYj)) isin

Rntimesn and (3) regularization constants ε δ gt 0 These constants ε δ as well

as kernels kX kY are hyperparameters of KMCF (we discuss how to choosethese parameters later)

Filtering with State-Observation Examples 397

Algorithm 1 outputs a weight vector w = (w1 wn) isin Rn Normaliz-

ing these weights wt = wsumn

i=1 wi we obtain an estimator of equation 413

as

mxt |y1t=

nsumi=1

wtikX (middot Xi) (46)

The apparent difference from a particle filter is that the posterior (ker-nel mean) estimator equation 46 is expressed in terms of the samplesX1 Xn in the training sample (XiYi)n

i=1 not with the samples fromthe prior equation 44 This requires that the training samples X1 Xncover the support of posterior p(xt |y1t ) sufficiently well If this does nothold we cannot expect good performance for the posterior estimate Notethat this is also true for any methods that deal with the setting of this letterpoverty of training samples in a certain region means that we do not haveany information about the observation model p(yt |xt ) in that region

4.2.3 Resampling Step. This step applies the update equations 3.6 and 3.7 of kernel herding in section 3.5 to the estimate, equation 4.6. This is to obtain samples $\bar{X}_{t,1}, \ldots, \bar{X}_{t,n}$ such that

$$\bar{m}_{x_t|y_{1:t}} = \frac{1}{n}\sum_{i=1}^n k_X(\cdot, \bar{X}_{t,i}) \qquad (4.7)$$

is close to equation 4.6 in the RKHS. Our theoretical analysis in section 5 shows that such a procedure can reduce the error of the prediction step at time $t + 1$.

The procedure is summarized in algorithm 2. Specifically, we generate each $\bar{X}_{t,i}$ by searching for the solution of the optimization problem in equations 3.6 and 3.7 from the finite set of samples $\{X_1, \ldots, X_n\}$ in equation 4.6. We allow repetitions in $\bar{X}_{t,1}, \ldots, \bar{X}_{t,n}$. We can expect that the resulting equation 4.7 is close to equation 4.6 in the RKHS if the samples $X_1, \ldots, X_n$ cover the support of the posterior $p(x_t|y_{1:t})$ sufficiently. This is verified by the theoretical analysis of section 5.3.

Here, searching for the solutions from a finite set reduces the computational costs of kernel herding. It is possible to search from the entire space $\mathcal{X}$ if we have sufficient time or if the sample size $n$ is small enough; it depends on applications and available computational resources. We also note that the size of the resampling samples is not necessarily $n$; this depends on how accurately these samples approximate equation 4.6. Thus, a smaller number of samples may be sufficient. In this case, we can reduce the computational costs of resampling, as discussed in section 5.2.

The aim of our resampling step is similar to that of the resampling step of a particle filter (see Doucet & Johansen, 2011). Intuitively, the aim is to eliminate samples with very small weights and replicate those with large weights (see Figures 2 and 3). In particle methods, this is realized by generating samples from the empirical distribution defined by a weighted sample (therefore this procedure is called resampling). Our resampling step is a realization of such a procedure in terms of the kernel mean embedding: we generate samples $\bar{X}_{t,1}, \ldots, \bar{X}_{t,n}$ from the empirical kernel mean, equation 4.6.

Note that the resampling algorithm of particle methods is not appropriate for use with kernel mean embeddings. This is because it assumes that weights are positive, but our weights in equation 4.6 can be negative, as this equation is a kernel mean estimator. One may apply the resampling algorithm of particle methods by first truncating the samples with negative weights. However, there is no guarantee that samples obtained by this heuristic produce a good approximation of equation 4.6 as a kernel mean, as shown by experiments in section 6.1. In this sense, the use of kernel herding is more natural, since it generates samples that approximate a kernel mean.
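Below is a minimal sketch of the resampling step, under the assumption that equations 3.6 and 3.7 are the standard kernel herding updates (greedily maximizing the estimated embedding minus the average kernel to the already chosen points), restricted to the finite candidate set $\{X_1, \ldots, X_n\}$. Names and the exact normalizing factor are illustrative.

```python
import numpy as np

def herding_resample(w, X, kx, num_samples):
    """Kernel-herding resampling restricted to a finite candidate set
    (a sketch of algorithm 2 under the assumption above).

    w : (n,) weights of the estimate sum_i w_i k_X(., X_i)  (may be negative)
    X : (n, d) candidate points X_1, ..., X_n
    kx : kernel function returning the Gram matrix kx(A, B)
    Returns indices of the selected points (repetitions allowed).
    """
    G = kx(X, X)                      # (n, n) Gram matrix
    m = G @ w                         # m_hat(X_q) = sum_i w_i k_X(X_q, X_i)
    selected = []
    running_sum = np.zeros(len(w))    # sum over chosen x_j of k_X(X_q, x_j)
    for p in range(num_samples):
        # greedy herding objective; the 1/(p+1) factor follows the common
        # Frank-Wolfe formulation and may differ from eqs. 3.6/3.7 by a constant
        obj = m - running_sum / (p + 1)
        idx = int(np.argmax(obj))
        selected.append(idx)
        running_sum += G[:, idx]
    return np.asarray(selected)
```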

4.2.4 Overall Algorithm. We summarize the overall procedure of KMCF in algorithm 3, where $p_{\mathrm{init}}$ denotes a prior distribution for the initial state $x_1$. For each time $t$, KMCF takes as input an observation $y_t$ and outputs a weight vector $w_t = (w_{t,1}, \ldots, w_{t,n})^T \in \mathbb{R}^n$. Combined with the samples $X_1, \ldots, X_n$ in the state-observation examples $\{(X_i, Y_i)\}_{i=1}^n$, these weights provide an estimator, equation 4.6, of the kernel mean of the posterior, equation 4.1.

We first compute kernel matrices $G_X, G_Y$ (lines 4-5), which are used in algorithm 1 of kernel Bayes' rule (line 15). For $t = 1$, we generate an i.i.d. sample $X_{1,1}, \ldots, X_{1,n}$ from the initial distribution $p_{\mathrm{init}}$ (line 8), which provides an estimator of the prior corresponding to equation 4.4. Line 10 is the resampling step at time $t - 1$, and line 11 is the prediction step at time $t$. Lines 13 to 16 correspond to the correction step.
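Putting the three sketches above together, a KMCF loop in the spirit of algorithm 3 might look as follows, reusing the hypothetical helpers `prediction_step`, `correction_step`, and `herding_resample` sketched earlier; `p_init_sample` stands in for sampling from $p_{\mathrm{init}}$.

```python
import numpy as np

def kmcf_filter(Y_obs, X_train, Y_train, kx, ky, transition_sample,
                p_init_sample, eps, delta, rng):
    """Sketch of the overall KMCF loop (cf. algorithm 3), assuming the
    prediction_step / correction_step / herding_resample sketches above.
    Y_obs : (T, dy) test observations y_1, ..., y_T.
    Returns the (T, n) array of normalized weight vectors w_t."""
    n = len(X_train)
    weights = []
    for t, y_t in enumerate(Y_obs):
        if t == 0:
            # initial prior: i.i.d. sample from p_init (line 8 of algorithm 3)
            X_pred = p_init_sample(n, rng)
        else:
            # resampling at time t-1 (line 10), then prediction (line 11)
            idx = herding_resample(weights[-1], X_train, kx, n)
            X_pred = prediction_step(X_train[idx], transition_sample, rng)
        # correction by kernel Bayes' rule (lines 13-16)
        weights.append(correction_step(y_t, X_pred, X_train, Y_train,
                                       kx, ky, eps, delta))
    return np.asarray(weights)
```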


4.3 Discussion. The estimation accuracy of KMCF can depend on several factors in practice.

4.3.1 Training Samples. We first note that the training samples $\{(X_i, Y_i)\}_{i=1}^n$ should provide the information concerning the observation model $p(y_t|x_t)$. For example, $\{(X_i, Y_i)\}_{i=1}^n$ may be an i.i.d. sample from a joint distribution $p(x, y)$ on $\mathcal{X} \times \mathcal{Y}$, which decomposes as $p(x, y) = p(y|x) p(x)$. Here, $p(y|x)$ is the observation model and $p(x)$ is some distribution on $\mathcal{X}$. The support of $p(x)$ should cover the region where states $x_1, \ldots, x_T$ may pass in the test phase, as discussed in section 4.2. For example, this is satisfied when the state space $\mathcal{X}$ is compact and the support of $p(x)$ is the entire $\mathcal{X}$.

Note that the training samples $\{(X_i, Y_i)\}_{i=1}^n$ can also be non-i.i.d. in practice. For example, we may deterministically select $X_1, \ldots, X_n$ so that they cover the region of interest. In location estimation problems in robotics, for instance, we may collect location-sensor examples $\{(X_i, Y_i)\}_{i=1}^n$ so that the locations $X_1, \ldots, X_n$ cover the region where location estimation is to be conducted (Quigley et al., 2010).

4.3.2 Hyperparameters. As in other kernel methods in general, the performance of KMCF depends on the choice of its hyperparameters, which are the kernels $k_X$ and $k_Y$ (or parameters in the kernels, e.g., the bandwidth of the gaussian kernel) and the regularization constants $\delta, \varepsilon > 0$. We need to define these hyperparameters based on the joint sample $\{(X_i, Y_i)\}_{i=1}^n$ before running the algorithm on the test data $y_1, \ldots, y_T$. This can be done by cross-validation. Suppose that $\{(X_i, Y_i)\}_{i=1}^n$ is given as a sequence from the state-space model. We can then apply two-fold cross-validation by dividing the sequence into two subsequences. If $\{(X_i, Y_i)\}_{i=1}^n$ is not a sequence, we can rely on the cross-validation procedure for kernel Bayes' rule (see section 4.2 of Fukumizu et al., 2013).

4.3.3 Normalization of Weights. We found in our preliminary experiments that normalization of the weights (see line 16, algorithm 3) is beneficial to the filtering performance. This may be justified by the following discussion about a kernel mean estimator in general. Let us consider a consistent kernel mean estimator $\hat{m}_P = \sum_{i=1}^n w_i k(\cdot, X_i)$ such that $\lim_{n\to\infty} \|\hat{m}_P - m_P\|_{\mathcal{H}} = 0$. Then we can show that the sum of the weights converges to 1, $\lim_{n\to\infty} \sum_{i=1}^n w_i = 1$, under certain assumptions (Kanagawa & Fukumizu, 2014). This could be explained as follows. Recall that the weighted average $\sum_{i=1}^n w_i f(X_i)$ of a function $f$ is an estimator of the expectation $\int f(x)\, dP(x)$. Let $f$ be a function that takes the value 1 for any input: $f(x) = 1,\ \forall x \in \mathcal{X}$. Then we have $\sum_{i=1}^n w_i f(X_i) = \sum_{i=1}^n w_i$ and $\int f(x)\, dP(x) = 1$. Therefore, $\sum_{i=1}^n w_i$ is an estimator of 1. In other words, if the error $\|\hat{m}_P - m_P\|_{\mathcal{H}}$ is small, then the sum of the weights $\sum_{i=1}^n w_i$ should be close to 1. Conversely, if the sum of the weights is far from 1, it suggests that the estimate $\hat{m}_P$ is not accurate. Based on this theoretical observation, we suppose that normalization of the weights (this makes the sum equal to 1) results in a better estimate.

4.3.4 Time Complexity. For each time $t$, the naive implementation of algorithm 3 requires a time complexity of $O(n^3)$ for the size $n$ of the joint sample $\{(X_i, Y_i)\}_{i=1}^n$. This comes from algorithm 1 in line 15 (kernel Bayes' rule) and algorithm 2 in line 10 (resampling). The complexity $O(n^3)$ of algorithm 1 is due to the matrix inversions. Note that one of the inversions, $(G_X + n\varepsilon I_n)^{-1}$, can be computed before the test phase, as it does not involve the test data. Algorithm 2 also has complexity $O(n^3)$. In section 5.2, we will explain how this cost can be reduced to $O(n^2)$ by generating only $\ell < n$ samples by resampling.

4.3.5 Speeding-Up Methods. In appendix C, we describe two methods for reducing the computational costs of KMCF, both of which only need to be applied prior to the test phase. The first is a low-rank approximation of the kernel matrices $G_X, G_Y$, which reduces the complexity to $O(nr^2)$, where $r$ is the rank of the low-rank matrices. Low-rank approximation works well in practice, since the eigenvalues of a kernel matrix often decay very rapidly. Indeed, this has been theoretically shown for some cases (see Widom, 1963, 1964, and discussions in Bach & Jordan, 2002). The second is a data-reduction method based on kernel herding, which efficiently selects joint subsamples from the training set $\{(X_i, Y_i)\}_{i=1}^n$. Algorithm 3 is then applied based only on those subsamples. The resulting complexity is thus $O(r^3)$, where $r$ is the number of subsamples. This method is motivated by the fast convergence rate of kernel herding (Chen et al., 2010).

Both methods require the number $r$ to be chosen, which is either the rank for low-rank approximation or the number of subsamples in data reduction. This determines the trade-off between accuracy and computational time. In practice, there are two ways of selecting the number $r$. By regarding $r$ as a hyperparameter of KMCF, we can select it by cross-validation, or we can choose $r$ by comparing the resulting approximation error, which is measured in a matrix norm for low-rank approximation and in an RKHS norm for the subsampling method (for details, see appendix C).

4.3.6 Transfer Learning Setting. We assumed that the observation model in the test phase is the same as for the training samples. However, this might not hold in some situations. For example, in the vision-based localization problem, the illumination conditions for the test and training phases might be different (e.g., the test is done at night, while the training samples are collected in the morning). Without taking into account such a significant change in the observation model, KMCF would not perform well in practice.

This problem could be addressed by exploiting the framework of transfer learning (Pan & Yang, 2010). This framework aims at situations where the probability distribution that generates test data is different from that of training samples. The main assumption is that there exist a small number of examples from the test distribution. Transfer learning then provides a way of combining such test examples and abundant training samples, thereby improving the test performance. The application of transfer learning in our setting remains a topic for future research.

4.4 Estimation of Posterior Statistics. By algorithm 3, we obtain the estimates of the kernel means of the posteriors, equation 4.1, as

$$\hat{m}_{x_t|y_{1:t}} = \sum_{i=1}^n w_{t,i}\, k_X(\cdot, X_i) \quad (t = 1, \ldots, T). \qquad (4.8)$$

These contain the information on the posteriors $p(x_t|y_{1:t})$ (see sections 3.2 and 3.4). We now show how to estimate statistics of the posteriors using these estimates. For ease of presentation, we consider the case $\mathcal{X} = \mathbb{R}^d$. Theoretical arguments to justify these operations are provided by Kanagawa and Fukumizu (2014).

4.4.1 Mean and Covariance. Consider the posterior mean $\int x_t\, p(x_t|y_{1:t})\, dx_t \in \mathbb{R}^d$ and the posterior (uncentered) covariance $\int x_t x_t^T\, p(x_t|y_{1:t})\, dx_t \in \mathbb{R}^{d\times d}$. These quantities can be estimated as

$$\sum_{i=1}^n w_{t,i} X_i \ \ \text{(mean)}, \qquad \sum_{i=1}^n w_{t,i} X_i X_i^T \ \ \text{(covariance)}.$$

4.4.2 Probability Mass. Let $A \subset \mathcal{X}$ be a measurable set with smooth boundary. Define the indicator function $I_A(x)$ by $I_A(x) = 1$ for $x \in A$ and $I_A(x) = 0$ otherwise. Consider the probability mass $\int I_A(x_t)\, p(x_t|y_{1:t})\, dx_t$. This can be estimated as $\sum_{i=1}^n w_{t,i} I_A(X_i)$.

4.4.3 Density. Suppose $p(x_t|y_{1:t})$ has a density function. Let $J(x)$ be a smoothing kernel satisfying $\int J(x)\, dx = 1$ and $J(x) \ge 0$. Let $h > 0$ and define $J_h(x) = \frac{1}{h^d} J\big(\frac{x}{h}\big)$. Then the density of $p(x_t|y_{1:t})$ can be estimated as

$$\hat{p}(x_t|y_{1:t}) = \sum_{i=1}^n w_{t,i}\, J_h(x_t - X_i), \qquad (4.9)$$

with an appropriate choice of $h$.

4.4.4 Mode. The mode may be obtained by finding a point that maximizes equation 4.9. However, this requires a careful choice of $h$. Instead, we may use $X_{i_{\max}}$ with $i_{\max} = \arg\max_i w_{t,i}$ as a mode estimate. This is the point in $X_1, \ldots, X_n$ that is associated with the maximum weight in $w_{t,1}, \ldots, w_{t,n}$. This point can be interpreted as the point that maximizes equation 4.9 in the limit of $h \to 0$.
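The following sketch computes these posterior statistics from the normalized weights of equation 4.8; function and argument names are illustrative.

```python
import numpy as np

def posterior_statistics(w, X, A_indicator=None):
    """Estimate posterior statistics from the weighted sample (w_{t,i}, X_i)
    of equation 4.8 (a sketch).

    w : (n,) normalized weights (may contain negative entries)
    X : (n, d) training states X_1, ..., X_n
    A_indicator : optional function x -> 1 if x lies in a set A, else 0."""
    mean = w @ X                                  # sum_i w_i X_i
    cov = (X * w[:, None]).T @ X                  # sum_i w_i X_i X_i^T (uncentered)
    mode = X[int(np.argmax(w))]                   # X_{i_max}, i_max = argmax_i w_i
    stats = {"mean": mean, "covariance": cov, "mode": mode}
    if A_indicator is not None:
        stats["prob_mass_A"] = float(sum(wi * A_indicator(xi) for wi, xi in zip(w, X)))
    return stats
```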

4.4.5 Other Methods. Other ways of using equation 4.8 include the preimage computation and fitting of gaussian mixtures (see, e.g., Song et al., 2009; Fukumizu et al., 2013; McCalman, O'Callaghan, & Ramos, 2013).

5 Theoretical Analysis

In this section, we analyze the sampling procedure of the prediction step in section 4.2. Specifically, we derive an upper bound on the error of the estimator, equation 4.4. We also discuss in detail how the resampling step in section 4.2 works as a preprocessing step of the prediction step.

To make our analysis clear, we slightly generalize the setting of the prediction step and discuss the sampling and resampling procedures in this setting.

5.1 Error Bound for the Prediction Step. Let $\mathcal{X}$ be a measurable space and $P$ be a probability distribution on $\mathcal{X}$. Let $p(\cdot|x)$ be a conditional distribution on $\mathcal{X}$ conditioned on $x \in \mathcal{X}$. Let $Q$ be a marginal distribution on $\mathcal{X}$ defined by $Q(B) = \int p(B|x)\, dP(x)$ for all measurable $B \subset \mathcal{X}$. In the filtering setting of section 4, the space $\mathcal{X}$ corresponds to the state space, and the distributions $P$, $p(\cdot|x)$, and $Q$ correspond to the posterior $p(x_{t-1}|y_{1:t-1})$ at time $t-1$, the transition model $p(x_t|x_{t-1})$, and the prior $p(x_t|y_{1:t-1})$ at time $t$, respectively.

Let $k_X$ be a positive-definite kernel on $\mathcal{X}$ and $\mathcal{H}_X$ be the RKHS associated with $k_X$. Let $m_P = \int k_X(\cdot, x)\, dP(x)$ and $m_Q = \int k_X(\cdot, x)\, dQ(x)$ be the kernel means of $P$ and $Q$, respectively. Suppose that we are given an empirical estimate of $m_P$ as

$$\hat{m}_P = \sum_{i=1}^n w_i\, k_X(\cdot, X_i), \qquad (5.1)$$

where $w_1, \ldots, w_n \in \mathbb{R}$ and $X_1, \ldots, X_n \in \mathcal{X}$. Considering this weighted-sample form enables us to explain the mechanism of the resampling step.

The prediction step can then be cast as the following procedure: for each sample $X_i$, we generate a new sample $X'_i$ with the conditional distribution, $X'_i \sim p(\cdot|X_i)$. Then we estimate $m_Q$ by

$$\hat{m}_Q = \sum_{i=1}^n w_i\, k_X(\cdot, X'_i), \qquad (5.2)$$

which corresponds to the estimate 4.4 of the prior kernel mean at time $t$. The following theorem provides an upper bound on the error of equation 5.2 and reveals properties of equation 5.1 that affect the error of the estimator, equation 5.2. The proof is given in appendix A.

Theorem 1. Let $\hat{m}_P$ be a fixed estimate of $m_P$ given by equation 5.1. Define a function $\theta$ on $\mathcal{X} \times \mathcal{X}$ by $\theta(x_1, x_2) = \int\!\!\int k_X(x'_1, x'_2)\, dp(x'_1|x_1)\, dp(x'_2|x_2)$ for all $(x_1, x_2) \in \mathcal{X} \times \mathcal{X}$, and assume that $\theta$ is included in the tensor RKHS $\mathcal{H}_X \otimes \mathcal{H}_X$.⁴

⁴The tensor RKHS $\mathcal{H}_X \otimes \mathcal{H}_X$ is the RKHS of a product kernel $k_{X\times X}$ on $\mathcal{X} \times \mathcal{X}$ defined as $k_{X\times X}((x_a, x_b), (x_c, x_d)) = k_X(x_a, x_c)\, k_X(x_b, x_d)$ for all $(x_a, x_b), (x_c, x_d) \in \mathcal{X} \times \mathcal{X}$. This space $\mathcal{H}_X \otimes \mathcal{H}_X$ consists of smooth functions on $\mathcal{X} \times \mathcal{X}$ if the kernel $k_X$ is smooth (e.g., if $k_X$ is gaussian; see section 4 of Steinwart & Christmann, 2008). In this case, we can interpret this assumption as requiring that $\theta$ be smooth as a function on $\mathcal{X} \times \mathcal{X}$. The function $\theta$ can be written as the inner product between the kernel means of the conditional distributions, $\theta(x_1, x_2) = \langle m_{p(\cdot|x_1)}, m_{p(\cdot|x_2)}\rangle_{\mathcal{H}_X}$, where $m_{p(\cdot|x)} = \int k_X(\cdot, x')\, dp(x'|x)$. Therefore, the assumption may be further seen as requiring that the map $x \mapsto m_{p(\cdot|x)}$ be smooth. Note that while similar assumptions are common in the literature on kernel mean embeddings (e.g., theorem 5 of Fukumizu et al., 2013), we may relax this assumption by using approximation arguments in learning theory (e.g., theorems 2.2 and 2.3 of Eberts & Steinwart, 2013). This analysis remains a topic for future research.

The estimator $\hat{m}_Q$, equation 5.2, then satisfies

$$\mathbb{E}_{X'_1, \ldots, X'_n}\big[\|\hat{m}_Q - m_Q\|^2_{\mathcal{H}_X}\big] \le \sum_{i=1}^n w_i^2 \big( \mathbb{E}_{X'_i}[k_X(X'_i, X'_i)] - \mathbb{E}_{X'_i, \tilde{X}'_i}[k_X(X'_i, \tilde{X}'_i)] \big) \qquad (5.3)$$

$$\qquad\qquad + \|\hat{m}_P - m_P\|^2_{\mathcal{H}_X}\, \|\theta\|_{\mathcal{H}_X \otimes \mathcal{H}_X}, \qquad (5.4)$$

where $X'_i \sim p(\cdot|X_i)$ and $\tilde{X}'_i$ is an independent copy of $X'_i$.

From theorem 1, we can make the following observations. First, the second term, equation 5.4, of the upper bound shows that the error of the estimator, equation 5.2, is likely to be large if the given estimate, equation 5.1, has large error $\|\hat{m}_P - m_P\|^2_{\mathcal{H}_X}$, which is reasonable to expect.

Second, the first term, equation 5.3, shows that the error of equation 5.2 can be large if the distribution of $X'_i$ (i.e., $p(\cdot|X_i)$) has large variance. For example, suppose $X'_i = f(X_i) + \varepsilon_i$, where $f: \mathcal{X} \to \mathcal{X}$ is some mapping and $\varepsilon_i$ is a random variable with mean 0. Let $k_X$ be the gaussian kernel, $k_X(x, x') = \exp(-\|x - x'\|^2/2\alpha)$ for some $\alpha > 0$. Then $\mathbb{E}_{X'_i}[k_X(X'_i, X'_i)] - \mathbb{E}_{X'_i, \tilde{X}'_i}[k_X(X'_i, \tilde{X}'_i)]$ increases from 0 to 1 as the variance of $\varepsilon_i$ (i.e., the variance of $X'_i$) increases from 0 to infinity. Therefore, in this case, equation 5.3 is upper-bounded at worst by $\sum_{i=1}^n w_i^2$. Note that $\mathbb{E}_{X'_i}[k_X(X'_i, X'_i)] - \mathbb{E}_{X'_i, \tilde{X}'_i}[k_X(X'_i, \tilde{X}'_i)]$ is always nonnegative.⁵

5.1.1 Effective Sample Size. Now let us assume that the kernel $k_X$ is bounded: there is a constant $C > 0$ such that $\sup_{x\in\mathcal{X}} k_X(x, x) < C$. Then the inequality of theorem 1 can be further bounded as

$$\mathbb{E}_{X'_1, \ldots, X'_n}\big[\|\hat{m}_Q - m_Q\|^2_{\mathcal{H}_X}\big] \le 2C \sum_{i=1}^n w_i^2 + \|\hat{m}_P - m_P\|^2_{\mathcal{H}_X}\, \|\theta\|_{\mathcal{H}_X \otimes \mathcal{H}_X}. \qquad (5.5)$$

This bound shows that two quantities are important in the estimate, equation 5.1: (1) the sum of squared weights, $\sum_{i=1}^n w_i^2$, and (2) the error $\|\hat{m}_P - m_P\|^2_{\mathcal{H}_X}$. In other words, the error of equation 5.2 can be large if the quantity $\sum_{i=1}^n w_i^2$ is large, regardless of the accuracy of equation 5.1 as an estimator of $m_P$. In fact, the estimator of the form 5.1 can have large $\sum_{i=1}^n w_i^2$ even when $\|\hat{m}_P - m_P\|^2_{\mathcal{H}_X}$ is small, as shown in section 6.1.

⁵To show this, it is sufficient to prove that $\int\!\!\int k_X(x, \tilde{x})\, dP(x)\, dP(\tilde{x}) \le \int k_X(x, x)\, dP(x)$ for any probability $P$. This can be shown as follows: $\int\!\!\int k_X(x, \tilde{x})\, dP(x)\, dP(\tilde{x}) = \int\!\!\int \langle k_X(\cdot, x), k_X(\cdot, \tilde{x})\rangle_{\mathcal{H}_X}\, dP(x)\, dP(\tilde{x}) \le \int\!\!\int \sqrt{k_X(x, x)}\sqrt{k_X(\tilde{x}, \tilde{x})}\, dP(x)\, dP(\tilde{x}) \le \int k_X(x, x)\, dP(x)$. Here we used the reproducing property, the Cauchy-Schwartz inequality, and Jensen's inequality.


Figure 3: An illustration of the sampling procedure with (right) and without (left) the resampling algorithm. The left panel corresponds to the kernel mean estimators, equations 5.1 and 5.2, in section 5.1, and the right panel corresponds to equations 5.6 and 5.7 in section 5.2.

The inverse of the sum of the squared weights, $1/\sum_{i=1}^n w_i^2$, can be interpreted as the effective sample size (ESS) of the empirical kernel mean, equation 5.1. To explain this, suppose that the weights are normalized: $\sum_{i=1}^n w_i = 1$. Then the ESS takes its maximum $n$ when the weights are uniform, $w_1 = \cdots = w_n = 1/n$. It becomes small when only a few samples have large weights (see the left side in Figure 3). Therefore, the bound, equation 5.5, can be interpreted as follows: to make equation 5.2 a good estimator of $m_Q$, we need to have equation 5.1 such that the ESS is large and the error $\|\hat{m}_P - m_P\|_{\mathcal{H}}$ is small. Here we borrowed the notion of ESS from the literature on particle methods, in which ESS has also played an important role (see section 2.5.3 of Liu, 2001, and section 3.5 of Doucet & Johansen, 2011).

5.2 Role of Resampling. Based on these arguments, we explain how the resampling step in section 4.2 works as a preprocessing step for the sampling procedure. Consider $\hat{m}_P$ in equation 5.1 as an estimate, equation 4.6, given by the correction step at time $t-1$. Then we can think of $\hat{m}_Q$, equation 5.2, as an estimator of the kernel mean, equation 4.5, of the prior without the resampling step.

The resampling step is the application of kernel herding to $\hat{m}_P$ to obtain samples $\bar{X}_1, \ldots, \bar{X}_n$, which provide a new estimate of $m_P$ with uniform weights,

$$\bar{m}_P = \frac{1}{n}\sum_{i=1}^n k_X(\cdot, \bar{X}_i). \qquad (5.6)$$

The subsequent prediction step is to generate a sample $\bar{X}'_i \sim p(\cdot|\bar{X}_i)$ for each $\bar{X}_i$ ($i = 1, \ldots, n$) and estimate $m_Q$ as

$$\bar{m}_Q = \frac{1}{n}\sum_{i=1}^n k_X(\cdot, \bar{X}'_i). \qquad (5.7)$$

Theorem 1 gives the following bound for this estimator, corresponding to equation 5.5:

$$\mathbb{E}_{\bar{X}'_1, \ldots, \bar{X}'_n}\big[\|\bar{m}_Q - m_Q\|^2_{\mathcal{H}_X}\big] \le \frac{2C}{n} + \|\bar{m}_P - m_P\|^2_{\mathcal{H}_X}\, \|\theta\|_{\mathcal{H}_X \otimes \mathcal{H}_X}. \qquad (5.8)$$

A comparison of the upper bounds of equations 5.5 and 5.8 implies that the resampling step is beneficial when $\sum_{i=1}^n w_i^2$ is large (i.e., the ESS is small) and $\|\bar{m}_P - \hat{m}_P\|_{\mathcal{H}_X}$ is small. The condition on $\|\bar{m}_P - \hat{m}_P\|_{\mathcal{H}_X}$ means that the loss by kernel herding (in terms of the RKHS distance) is small. This implies $\|\bar{m}_P - m_P\|_{\mathcal{H}_X} \approx \|\hat{m}_P - m_P\|_{\mathcal{H}_X}$, so the second term of equation 5.8 is close to that of equation 5.5. On the other hand, the first term of equation 5.8 will be much smaller than that of equation 5.5 if $\sum_{i=1}^n w_i^2 \gg 1/n$. In other words, the resampling step improves the accuracy of the sampling procedure by increasing the ESS of the kernel mean estimate $\hat{m}_P$. This is illustrated in Figure 3.

The above observations lead to the following procedures.

5.2.1 When to Apply Resampling. If $\sum_{i=1}^n w_i^2$ is not large, the gain by the resampling step will be small. Therefore, the resampling algorithm should be applied when $\sum_{i=1}^n w_i^2$ is above a certain threshold, say $2/n$. The same strategy has been commonly used in particle methods (see Doucet & Johansen, 2011).
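As a concrete illustration, the ESS and the threshold rule above can be computed directly from the weight vector (the factor 2 reflects the $2/n$ example given in the text; in practice it is a tuning choice).

```python
import numpy as np

def effective_sample_size(w):
    """ESS of a weighted kernel mean estimate: 1 / sum_i w_i^2."""
    return 1.0 / np.sum(np.square(w))

def should_resample(w, factor=2.0):
    """Apply resampling when sum_i w_i^2 exceeds factor / n (factor = 2
    corresponds to the '2/n' example in the text)."""
    return np.sum(np.square(w)) > factor / len(w)

# example: a nearly degenerate weight vector triggers resampling
w = np.array([0.97, 0.01, 0.01, 0.01])
print(effective_sample_size(w), should_resample(w))  # ~1.06, True
```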

Also, the bound, equation 5.3, of theorem 1 shows that resampling is not beneficial if the variance of the conditional distribution $p(\cdot|x)$ is very small (i.e., if the state transition is nearly deterministic). In this case, the error of the sampling procedure may increase due to the loss $\|\bar{m}_P - \hat{m}_P\|_{\mathcal{H}_X}$ caused by kernel herding.

5.2.2 Reduction of Computational Cost. Algorithm 2 generates $n$ samples $\bar{X}_1, \ldots, \bar{X}_n$ with time complexity $O(n^3)$. Suppose that the first $\ell$ samples $\bar{X}_1, \ldots, \bar{X}_\ell$, where $\ell < n$, already approximate $\hat{m}_P$ well: $\|\frac{1}{\ell}\sum_{i=1}^\ell k_X(\cdot, \bar{X}_i) - \hat{m}_P\|_{\mathcal{H}_X}$ is small. We do not then need to generate the rest of the samples $\bar{X}_{\ell+1}, \ldots, \bar{X}_n$; we can make $n$ samples by copying the $\ell$ samples $n/\ell$ times (suppose $n$ can be divided by $\ell$ for simplicity, say $n = 2\ell$). Let $\tilde{X}_1, \ldots, \tilde{X}_n$ denote these $n$ copied samples. Then $\frac{1}{\ell}\sum_{i=1}^\ell k_X(\cdot, \bar{X}_i) = \frac{1}{n}\sum_{i=1}^n k_X(\cdot, \tilde{X}_i)$ by definition, so $\|\frac{1}{n}\sum_{i=1}^n k_X(\cdot, \tilde{X}_i) - \hat{m}_P\|_{\mathcal{H}_X}$ is also small. This reduces the time complexity of algorithm 2 to $O(n^2)$.

One might think that it is unnecessary to copy $n/\ell$ times to make $n$ samples. This is not true, however. Suppose that we just use the first $\ell$ samples to define $\bar{m}_P = \frac{1}{\ell}\sum_{i=1}^\ell k_X(\cdot, \bar{X}_i)$. Then the first term of equation 5.8 becomes $2C/\ell$, which is larger than the $2C/n$ of $n$ samples. This difference involves sampling with the conditional distribution $\bar{X}'_i \sim p(\cdot|\bar{X}_i)$: if we use just the $\ell$ samples, sampling is done $\ell$ times; if we use the copied $n$ samples, sampling is done $n$ times. Thus, the benefit of making $n$ samples comes from sampling with the conditional distribution many times. This matches the bound of theorem 1, where the first term involves the variance of the conditional distribution.
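A minimal sketch of this copy trick, reusing the hypothetical `herding_resample` from the earlier sketch: only $\ell$ herding samples are generated and then tiled to size $n$; each copy is subsequently propagated independently through the transition model.

```python
import numpy as np

def resample_with_copies(w, X, kx, ell, n):
    """Generate ell herding samples and copy them n/ell times (assumes
    ell divides n); reuses the herding_resample sketch above."""
    idx = herding_resample(w, X, kx, ell)   # only ell herding iterations
    return np.tile(idx, n // ell)           # indices of the n copied samples
```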

5.3 Convergence Rates for Resampling. Our resampling algorithm (see algorithm 2) is an approximate version of kernel herding in section 3.5: algorithm 2 searches for the solutions of the update equations 3.6 and 3.7 from a finite set $\{X_1, \ldots, X_n\} \subset \mathcal{X}$, not from the entire space $\mathcal{X}$. Therefore, existing theoretical guarantees for kernel herding (Chen et al., 2010; Bach et al., 2012) do not apply to algorithm 2. Here we provide a theoretical justification.

5.3.1 Generalized Version. We consider a slightly generalized version, shown in algorithm 4. It takes as input (1) a kernel mean estimator $\hat{m}_P$ of a kernel mean $m_P$, (2) candidate samples $Z_1, \ldots, Z_N$, and (3) the number of resampling $\ell$. It then outputs resampling samples $\bar{X}_1, \ldots, \bar{X}_\ell \in \{Z_1, \ldots, Z_N\}$, which form a new estimator $\bar{m}_P = \frac{1}{\ell}\sum_{i=1}^\ell k_X(\cdot, \bar{X}_i)$. Here $N$ is the number of the candidate samples.

Algorithm 4 searches for solutions of the update equations 3.6 and 3.7 from the candidate set $\{Z_1, \ldots, Z_N\}$. Note that here these samples $Z_1, \ldots, Z_N$ can be different from those expressing the estimator $\hat{m}_P$. If they are the same (the estimator is expressed as $\hat{m}_P = \sum_{i=1}^n w_i k_X(\cdot, X_i)$ with $n = N$ and $X_i = Z_i$, $i = 1, \ldots, n$), then algorithm 4 reduces to algorithm 2. In fact, theorem 2 allows $\hat{m}_P$ to be any element in the RKHS.

5.3.2 Convergence Rates in Terms of N and ℓ. Algorithm 4 gives the new estimator $\bar{m}_P$ of the kernel mean $m_P$. The error of this new estimator, $\|\bar{m}_P - m_P\|_{\mathcal{H}_X}$, should be close to that of the given estimator, $\|\hat{m}_P - m_P\|_{\mathcal{H}_X}$. Theorem 2 guarantees this. In particular, it provides convergence rates of $\|\bar{m}_P - m_P\|_{\mathcal{H}_X}$ approaching $\|\hat{m}_P - m_P\|_{\mathcal{H}_X}$ as $N$ and $\ell$ go to infinity. This theorem follows from theorem 3 in appendix B, which holds under weaker assumptions.

Theorem 2. Let $m_P$ be the kernel mean of a distribution $P$, and let $\hat{m}_P$ be any element in the RKHS $\mathcal{H}_X$. Let $Z_1, \ldots, Z_N$ be an i.i.d. sample from a distribution with density $q$. Assume that $P$ has a density function $p$ such that $\sup_{x\in\mathcal{X}} p(x)/q(x) < \infty$. Let $\bar{X}_1, \ldots, \bar{X}_\ell$ be samples given by algorithm 4 applied to $\hat{m}_P$ with candidate samples $Z_1, \ldots, Z_N$. Then for $\bar{m}_P = \frac{1}{\ell}\sum_{i=1}^\ell k_X(\cdot, \bar{X}_i)$, we have

$$\|\bar{m}_P - m_P\|^2_{\mathcal{H}_X} = \big(\|\hat{m}_P - m_P\|_{\mathcal{H}_X} + O_p(N^{-1/2})\big)^2 + O\Big(\frac{\ln \ell}{\ell}\Big) \quad (N, \ell \to \infty). \qquad (5.9)$$

Our proof in appendix B relies on the fact that kernel herding can be seen as the Frank-Wolfe optimization method (Bach et al., 2012). Indeed, the error $O(\ln \ell/\ell)$ in equation 5.9 comes from the optimization error of the Frank-Wolfe method after $\ell$ iterations (Freund & Grigas, 2014, bound 3.2). The error $O_p(N^{-1/2})$ is due to the approximation of the solution space by a finite set $Z_1, \ldots, Z_N$. These errors will be small if $N$ and $\ell$ are large enough and the error of the given estimator, $\|\hat{m}_P - m_P\|_{\mathcal{H}_X}$, is relatively large. This is formally stated in corollary 1 below.

Theorem 2 assumes that the candidate samples are i.i.d. with a density $q$. The assumption $\sup_{x\in\mathcal{X}} p(x)/q(x) < \infty$ requires that the support of $q$ contain that of $p$. This is a formal characterization of the explanation in section 4.2 that the samples $X_1, \ldots, X_N$ should cover the support of $P$ sufficiently. Note that the statement of theorem 2 also holds for non-i.i.d. candidate samples, as shown in theorem 3 of appendix B.

5.3.3 Convergence Rates as $\hat{m}_P$ Goes to $m_P$. Theorem 2 provides convergence rates when the estimator $\hat{m}_P$ is fixed. In corollary 1 below, we let $\hat{m}_P$ approach $m_P$ and provide convergence rates for $\bar{m}_P$ of algorithm 4 approaching $m_P$. This corollary directly follows from theorem 2, since the constant terms in $O_p(N^{-1/2})$ and $O(\ln \ell/\ell)$ in equation 5.9 do not depend on $\hat{m}_P$, which can be seen from the proof in section B.

Corollary 1. Assume that $P$ and $Z_1, \ldots, Z_N$ satisfy the conditions in theorem 2 for all $N$. Let $\hat{m}_P^{(n)}$ be an estimator of $m_P$ such that $\|\hat{m}_P^{(n)} - m_P\|_{\mathcal{H}_X} = O_p(n^{-b})$ as $n \to \infty$ for some constant $b > 0$.⁶ Let $N = \ell = n^{2b}$. Let $\bar{X}^{(n)}_1, \ldots, \bar{X}^{(n)}_\ell$ be samples given by algorithm 4 applied to $\hat{m}_P^{(n)}$ with candidate samples $Z_1, \ldots, Z_N$. Then for $\bar{m}_P^{(n)} = \frac{1}{\ell}\sum_{i=1}^\ell k_X(\cdot, \bar{X}^{(n)}_i)$, we have

$$\|\bar{m}_P^{(n)} - m_P\|_{\mathcal{H}_X} = O_p(n^{-b}) \quad (n \to \infty). \qquad (5.10)$$

⁶Here, the estimator $\hat{m}_P^{(n)}$ and the candidate samples $Z_1, \ldots, Z_N$ can be dependent.

Corollary 1 assumes that the estimator $\hat{m}_P^{(n)}$ converges to $m_P$ at a rate $O_p(n^{-b})$ for some constant $b > 0$. Then the resulting estimator $\bar{m}_P^{(n)}$ by algorithm 4 also converges to $m_P$ at the same rate $O_p(n^{-b})$, if we set $N = \ell = n^{2b}$. This implies that if we use sufficiently large $N$ and $\ell$, the errors $O_p(N^{-1/2})$ and $O(\ln \ell/\ell)$ in equation 5.9 can be negligible, as stated earlier. Note that $N = \ell = n^{2b}$ implies that $N$ and $\ell$ can be smaller than $n$, since typically we have $b \le 1/2$ ($b = 1/2$ corresponds to the convergence rates of parametric models). This provides support for the discussion in section 5.2 (reduction of computational cost).

5.3.4 Convergence Rates of Sampling after Resampling. We can derive convergence rates of the estimator $\bar{m}_Q$, equation 5.7, in section 5.2. Here we consider the following construction of $\bar{m}_Q$, as discussed in section 5.2 (reduction of computational cost). First, apply algorithm 4 to $\hat{m}_P^{(n)}$ and obtain resampling samples $\bar{X}^{(n)}_1, \ldots, \bar{X}^{(n)}_\ell \in \{Z_1, \ldots, Z_N\}$. Then copy these samples $n/\ell$ times, and let $\bar{X}^{(n)}_1, \ldots, \bar{X}^{(n)}_n$ be the resulting $n$ samples. Finally, sample with the conditional distribution, $X'^{(n)}_i \sim p(\cdot|\bar{X}^{(n)}_i)$ ($i = 1, \ldots, n$), and define

$$\bar{m}_Q^{(n)} = \frac{1}{n}\sum_{i=1}^n k_X(\cdot, X'^{(n)}_i). \qquad (5.11)$$

The following corollary is a consequence of corollary 1, theorem 1, and the bound, equation 5.8. Note that theorem 1 obtains convergence in expectation, which implies convergence in probability.

Corollary 2. Let $\theta$ be the function defined in theorem 1, and assume $\theta \in \mathcal{H}_X \otimes \mathcal{H}_X$. Assume that $P$ and $Z_1, \ldots, Z_N$ satisfy the conditions in theorem 2 for all $N$. Let $\hat{m}_P^{(n)}$ be an estimator of $m_P$ such that $\|\hat{m}_P^{(n)} - m_P\|_{\mathcal{H}_X} = O_p(n^{-b})$ as $n \to \infty$ for some constant $b > 0$. Let $N = \ell = n^{2b}$. Then for the estimator $\bar{m}_Q^{(n)}$ defined as equation 5.11, we have

$$\|\bar{m}_Q^{(n)} - m_Q\|_{\mathcal{H}_X} = O_p(n^{-\min(b, 1/2)}) \quad (n \to \infty).$$

Suppose $b \le 1/2$, which holds with basically any nonparametric estimators. Then corollary 2 shows that the estimator $\bar{m}_Q^{(n)}$ achieves the same convergence rate as the input estimator $\hat{m}_P^{(n)}$. Note that without resampling, the rate becomes $O_p\big(\sqrt{\sum_{i=1}^n (w^{(n)}_i)^2} + n^{-b}\big)$, where the weights are given by the input estimator $\hat{m}_P^{(n)} = \sum_{i=1}^n w^{(n)}_i k_X(\cdot, X^{(n)}_i)$ (see the bound, equation 5.5). Thanks to resampling, (the square root of) the sum of the squared weights in the case of corollary 2 becomes $1/\sqrt{n} \le 1/\sqrt{\ell}$, which is usually smaller than $\sqrt{\sum_{i=1}^n (w^{(n)}_i)^2}$ and is faster than or equal to $O_p(n^{-b})$. This shows the merit of resampling in terms of convergence rates (see also the discussions in section 5.2).

5.4 Consistency of the Overall Procedure. Here we show the consistency of the overall procedure in KMCF. This is based on corollary 2, which shows the consistency of the resampling step followed by the prediction step, and on theorem 5 of Fukumizu et al. (2013), which guarantees the consistency of kernel Bayes' rule in the correction step. Thus, we consider three steps in the following order: resampling, prediction, and correction. More specifically, we show the consistency of the estimator, equation 4.6, of the posterior kernel mean at time $t$, given that the one at time $t-1$ is consistent.

To state our assumptions, we will need the following functions, $\theta_{\mathrm{pos}}: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, $\theta_{\mathrm{obs}}: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, and $\theta_{\mathrm{tra}}: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$:

$$\theta_{\mathrm{pos}}(y, \tilde{y}) = \int\!\!\int k_X(x_t, \tilde{x}_t)\, dp(x_t|y_{1:t-1}, y_t = y)\, dp(\tilde{x}_t|y_{1:t-1}, y_t = \tilde{y}), \qquad (5.12)$$

$$\theta_{\mathrm{obs}}(x, \tilde{x}) = \int\!\!\int k_Y(y_t, \tilde{y}_t)\, dp(y_t|x_t = x)\, dp(\tilde{y}_t|\tilde{x}_t = \tilde{x}), \qquad (5.13)$$

$$\theta_{\mathrm{tra}}(x, \tilde{x}) = \int\!\!\int k_X(x_t, \tilde{x}_t)\, dp(x_t|x_{t-1} = x)\, dp(\tilde{x}_t|x_{t-1} = \tilde{x}). \qquad (5.14)$$

These functions contain the information concerning the distributions involved. In equation 5.12, the distribution $p(x_t|y_{1:t-1}, y_t = y)$ denotes the posterior of the state at time $t$, given that the observation at time $t$ is $y_t = y$. Similarly, $p(\tilde{x}_t|y_{1:t-1}, y_t = \tilde{y})$ is the posterior at time $t$, given that the observation is $y_t = \tilde{y}$. In equation 5.13, the distributions $p(y_t|x_t = x)$ and $p(\tilde{y}_t|\tilde{x}_t = \tilde{x})$ denote the observation model when the state is $x_t = x$ or $\tilde{x}_t = \tilde{x}$, respectively. In equation 5.14, the distributions $p(x_t|x_{t-1} = x)$ and $p(\tilde{x}_t|x_{t-1} = \tilde{x})$ denote the transition model with the previous state given by $x_{t-1} = x$ or $x_{t-1} = \tilde{x}$, respectively.

For simplicity of presentation, we consider here $N = \ell = n$ for the resampling step. Below, denote by $\mathcal{F} \otimes \mathcal{G}$ the tensor product space of two RKHSs $\mathcal{F}$ and $\mathcal{G}$.

Corollary 3. Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be an i.i.d. sample with a joint density $p(x, y) = p(y|x)\, q(x)$, where $p(y|x)$ is the observation model. Assume that the posterior $p(x_t|y_{1:t})$ has a density $p$ and that $\sup_{x\in\mathcal{X}} p(x)/q(x) < \infty$. Assume that the functions defined by equations 5.12 to 5.14 satisfy $\theta_{\mathrm{pos}} \in \mathcal{H}_Y \otimes \mathcal{H}_Y$, $\theta_{\mathrm{obs}} \in \mathcal{H}_X \otimes \mathcal{H}_X$, and $\theta_{\mathrm{tra}} \in \mathcal{H}_X \otimes \mathcal{H}_X$, respectively. Suppose that $\|\hat{m}_{x_{t-1}|y_{1:t-1}} - m_{x_{t-1}|y_{1:t-1}}\|_{\mathcal{H}_X} \to 0$ as $n \to \infty$ in probability. Then, for any sufficiently slow decay of the regularization constants $\varepsilon_n$ and $\delta_n$ of algorithm 1, we have

$$\|\hat{m}_{x_t|y_{1:t}} - m_{x_t|y_{1:t}}\|_{\mathcal{H}_X} \to 0 \quad (n \to \infty)$$

in probability.

Corollary 3 follows from theorem 5 of Fukumizu et al. (2013) and corollary 2. The assumptions $\theta_{\mathrm{pos}} \in \mathcal{H}_Y \otimes \mathcal{H}_Y$ and $\theta_{\mathrm{obs}} \in \mathcal{H}_X \otimes \mathcal{H}_X$ are due to theorem 5 of Fukumizu et al. (2013) for the correction step, while the assumption $\theta_{\mathrm{tra}} \in \mathcal{H}_X \otimes \mathcal{H}_X$ is due to theorem 1 for the prediction step, from which corollary 2 follows. As we discussed in note 4 of section 5.1, these essentially assume that the functions $\theta_{\mathrm{pos}}$, $\theta_{\mathrm{obs}}$, and $\theta_{\mathrm{tra}}$ are smooth. Theorem 5 of Fukumizu et al. (2013) also requires that the regularization constants $\varepsilon_n, \delta_n$ of kernel Bayes' rule decay sufficiently slowly as the sample size goes to infinity ($\varepsilon_n, \delta_n \to 0$ as $n \to \infty$). (For details, see sections 5.2 and 6.2 in Fukumizu et al., 2013.)

It would be more interesting to investigate the convergence rates of the overall procedure. However, this requires a refined theoretical analysis of kernel Bayes' rule, which is beyond the scope of this letter. This is because currently there is no theoretical result on convergence rates of kernel Bayes' rule as an estimator of a posterior kernel mean (existing convergence results are for the expectation of function values; see theorems 6 and 7 in Fukumizu et al., 2013). This remains a topic for future research.

6 Experiments

This section is devoted to experiments. In section 6.1, we conduct basic experiments on the prediction and resampling steps before going on to the filtering problem. Here we consider the problem described in section 5. In section 6.2, the proposed KMCF (see algorithm 3) is applied to synthetic state-space models. Comparisons are made with existing methods applicable to the setting of the letter (see also section 2). In section 6.3, we apply KMCF to the real problem of vision-based robot localization.

In the following, $N(\mu, \sigma^2)$ denotes the gaussian distribution with mean $\mu \in \mathbb{R}$ and variance $\sigma^2 > 0$.

6.1 Sampling and Resampling Procedures. The purpose here is to see how the prediction and resampling steps work empirically. To this end, we consider the problem described in section 5 with $\mathcal{X} = \mathbb{R}$ (see section 5.1 for details). Specifications of the problem are described below.


We will need to evaluate the errors $\|\hat{m}_P - m_P\|_{\mathcal{H}_X}$ and $\|\hat{m}_Q - m_Q\|_{\mathcal{H}_X}$, so we need to know the true kernel means $m_P$ and $m_Q$. To this end, we define the distributions and the kernel to be gaussian; this allows us to obtain analytic expressions for $m_P$ and $m_Q$.

6.1.1 Distributions and Kernel. More specifically, we define the marginal $P$ and the conditional distribution $p(\cdot|x)$ to be gaussian: $P = N(0, \sigma_P^2)$ and $p(\cdot|x) = N(x, \sigma_{\mathrm{cond}}^2)$. Then the resulting $Q = \int p(\cdot|x)\, dP(x)$ also becomes gaussian: $Q = N(0, \sigma_P^2 + \sigma_{\mathrm{cond}}^2)$. We define $k_X$ to be the gaussian kernel, $k_X(x, x') = \exp(-(x - x')^2/2\gamma^2)$. We set $\sigma_P = \sigma_{\mathrm{cond}} = \gamma = 0.1$.

6.1.2 Kernel Means. Due to the convolution theorem of gaussian functions, the kernel means $m_P = \int k_X(\cdot, x)\, dP(x)$ and $m_Q = \int k_X(\cdot, x)\, dQ(x)$ can be analytically computed:

$$m_P(x) = \sqrt{\frac{\gamma^2}{\sigma_P^2 + \gamma^2}} \exp\Big(-\frac{x^2}{2(\gamma^2 + \sigma_P^2)}\Big), \qquad m_Q(x) = \sqrt{\frac{\gamma^2}{\sigma_P^2 + \sigma_{\mathrm{cond}}^2 + \gamma^2}} \exp\Big(-\frac{x^2}{2(\sigma_P^2 + \sigma_{\mathrm{cond}}^2 + \gamma^2)}\Big).$$
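As a quick numerical sanity check of these formulas (not part of the original experiments), one can compare the analytic $m_P(x)$ with a Monte Carlo average of $k_X(x, X_i)$, $X_i \sim P$:

```python
import numpy as np

sigma_P, gamma = 0.1, 0.1
rng = np.random.default_rng(0)

def m_P_analytic(x):
    # m_P(x) = sqrt(gamma^2 / (sigma_P^2 + gamma^2)) * exp(-x^2 / (2 (gamma^2 + sigma_P^2)))
    return np.sqrt(gamma**2 / (sigma_P**2 + gamma**2)) \
        * np.exp(-x**2 / (2 * (gamma**2 + sigma_P**2)))

X = rng.normal(0.0, sigma_P, size=200000)                 # X_i ~ P = N(0, sigma_P^2)
x = 0.05
m_P_mc = np.mean(np.exp(-(x - X)**2 / (2 * gamma**2)))    # (1/n) sum_i k_X(x, X_i)
print(m_P_analytic(x), m_P_mc)                            # the two values nearly agree
```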

6.1.3 Empirical Estimates. We artificially defined an estimate $\hat{m}_P = \sum_{i=1}^n w_i k_X(\cdot, X_i)$ as follows. First, we generated $n = 100$ samples $X_1, \ldots, X_{100}$ from a uniform distribution on $[-A, A]$ with some $A > 0$ (specified below). We computed the weights $w_1, \ldots, w_n$ by solving an optimization problem,

$$\min_{w \in \mathbb{R}^n} \Big\| \sum_{i=1}^n w_i k_X(\cdot, X_i) - m_P \Big\|^2_{\mathcal{H}} + \lambda \|w\|^2,$$

and then applied normalization so that $\sum_{i=1}^n w_i = 1$. Here $\lambda > 0$ is a regularization constant, which allows us to control the trade-off between the error $\|\hat{m}_P - m_P\|^2_{\mathcal{H}_X}$ and the quantity $\sum_{i=1}^n w_i^2 = \|w\|^2$. If $\lambda$ is very small, the resulting $\hat{m}_P$ becomes accurate ($\|\hat{m}_P - m_P\|^2_{\mathcal{H}_X}$ is small) but has large $\sum_{i=1}^n w_i^2$. If $\lambda$ is large, the error $\|\hat{m}_P - m_P\|^2_{\mathcal{H}_X}$ may not be very small, but $\sum_{i=1}^n w_i^2$ becomes small. This enables us to see how the error $\|\hat{m}_Q - m_Q\|^2_{\mathcal{H}_X}$ changes as we vary these quantities.
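The objective above has the standard kernel ridge form: expanding the RKHS norm gives $w^\top G w - 2\, w^\top (m_P(X_1), \ldots, m_P(X_n))^\top + \mathrm{const} + \lambda w^\top w$, whose minimizer is $w = (G + \lambda I)^{-1} (m_P(X_i))_i$. A minimal sketch follows, assuming the analytic `m_P_analytic` from the previous sketch; names are illustrative.

```python
import numpy as np

def ridge_weights(X, m_P_values, kx, lam):
    """Minimizer of ||sum_i w_i k(., X_i) - m_P||_H^2 + lam ||w||^2, that is,
    w = (G + lam I)^{-1} (m_P(X_1), ..., m_P(X_n)); sum-to-one normalization
    is applied afterward, as described in the text."""
    G = kx(X, X)
    w = np.linalg.solve(G + lam * np.eye(len(X)), m_P_values)
    return w / w.sum()

# usage with the gaussian kernel and the analytic m_P of section 6.1.2
gamma = 0.1
kx = lambda A, B: np.exp(-(A[:, None] - B[None, :])**2 / (2 * gamma**2))
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=100)            # A = 1
w = ridge_weights(X, m_P_analytic(X), kx, lam=1e-6)
```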

6.1.4 Comparison. Given $\hat{m}_P = \sum_{i=1}^n w_i k_X(\cdot, X_i)$, we wish to estimate the kernel mean $m_Q$. We compare three estimators:

• woRes: Estimate $m_Q$ without resampling. Generate samples $X'_i \sim p(\cdot|X_i)$ to produce the estimate $\hat{m}_Q = \sum_{i=1}^n w_i k_X(\cdot, X'_i)$. This corresponds to the estimator discussed in section 5.1.

• Res-KH: First apply the resampling algorithm of algorithm 2 to $\hat{m}_P$, yielding $\bar{X}_1, \ldots, \bar{X}_n$. Then generate $\bar{X}'_i \sim p(\cdot|\bar{X}_i)$ for each $\bar{X}_i$, giving the estimate $\bar{m}_Q = \frac{1}{n}\sum_{i=1}^n k(\cdot, \bar{X}'_i)$. This is the estimator discussed in section 5.2.

• Res-Trunc: Instead of algorithm 2, first truncate the negative weights in $w_1, \ldots, w_n$ to be 0, and apply normalization to make the sum of the weights be 1. Then apply the multinomial resampling algorithm of particle methods, and estimate $m_Q$ as in Res-KH.

Figure 4: Results of the experiments from section 6.1. (Top left and right) Sample-weight pairs of $\hat{m}_P = \sum_{i=1}^n w_i k_X(\cdot, X_i)$ and $\hat{m}_Q = \sum_{i=1}^n w_i k(\cdot, X'_i)$. (Middle left and right) Histogram of samples $\bar{X}_1, \ldots, \bar{X}_n$ generated by algorithm 2 and that of samples $\bar{X}'_1, \ldots, \bar{X}'_n$ from the conditional distribution. (Bottom left and right) Histogram of samples generated with multinomial resampling after truncating negative weights and that of samples from the conditional distribution.

6.1.5 Demonstration. Before starting quantitative comparisons, we demonstrate how the above estimators work. Figure 4 shows demonstration results with $A = 1$. First, note that for $\hat{m}_P = \sum_{i=1}^n w_i k(\cdot, X_i)$, samples associated with large weights are located around the mean of $P$, as the standard deviation of $P$ is relatively small ($\sigma_P = 0.1$). Note also that some of the weights are negative. In this example, the error of $\hat{m}_P$ is very small, $\|\hat{m}_P - m_P\|^2_{\mathcal{H}_X} = 8.49\mathrm{e}{-10}$, while that of the estimate $\hat{m}_Q$ given by woRes is $\|\hat{m}_Q - m_Q\|^2_{\mathcal{H}_X} = 0.125$. This shows that even if $\|\hat{m}_P - m_P\|^2_{\mathcal{H}_X}$ is very small, the resulting $\|\hat{m}_Q - m_Q\|^2_{\mathcal{H}_X}$ may not be small, as implied by theorem 1 and the bound, equation 5.5.

We can observe the following. First, algorithm 2 successfully discarded samples associated with very small weights. Almost all the generated samples $\bar{X}_1, \ldots, \bar{X}_n$ are located in $[-2\sigma_P, 2\sigma_P]$, where $\sigma_P$ is the standard deviation of $P$. The error is $\|\bar{m}_P - m_P\|^2_{\mathcal{H}_X} = 4.74\mathrm{e}{-5}$, which is greater than $\|\hat{m}_P - m_P\|^2_{\mathcal{H}_X}$. This is due to the additional error caused by the resampling algorithm. Note that the resulting estimate $\bar{m}_Q$ is of the error $\|\bar{m}_Q - m_Q\|^2_{\mathcal{H}_X} = 0.00827$. This is much smaller than the estimate $\hat{m}_Q$ by woRes, showing the merit of the resampling algorithm.

Res-Trunc first truncated the negative weights in $w_1, \ldots, w_n$. Let us see the region where the density of $P$ is very small: the region outside $[-2\sigma_P, 2\sigma_P]$. We can observe that the absolute values of the weights are very small in this region. Note that there exist positive and negative weights. These weights maintain balance such that the amounts of positive and negative values are almost the same. Therefore, the truncation of the negative weights breaks this balance. As a result, the amount of the positive weights surpasses the amount needed to represent the density of $P$. This can be seen from the histogram for Res-Trunc: some of the samples $\bar{X}_1, \ldots, \bar{X}_n$ generated by Res-Trunc are located in the region where the density of $P$ is very small. Thus, the resulting error $\|\bar{m}_P - m_P\|^2_{\mathcal{H}_X} = 0.0538$ is much larger than that of Res-KH. This demonstrates why the resampling algorithm of particle methods is not appropriate for kernel mean embeddings, as discussed in section 4.2.

6.1.6 Effects of the Sum of Squared Weights. The purpose here is to see how the error $\|\hat{m}_Q - m_Q\|^2_{\mathcal{H}_X}$ changes as we vary the quantity $\sum_{i=1}^n w_i^2$ (recall that the bound, equation 5.5, indicates that $\|\hat{m}_Q - m_Q\|^2_{\mathcal{H}_X}$ increases as $\sum_{i=1}^n w_i^2$ increases). To this end, we made $\hat{m}_P = \sum_{i=1}^n w_i k_X(\cdot, X_i)$ for several values of the regularization constant $\lambda$, as described above. For each $\lambda$, we constructed $\hat{m}_P$ and estimated $m_Q$ using each of the three estimators above. We repeated this 20 times for each $\lambda$ and averaged the values of $\|\hat{m}_P - m_P\|^2_{\mathcal{H}_X}$, $\sum_{i=1}^n w_i^2$, and the errors $\|\hat{m}_Q - m_Q\|^2_{\mathcal{H}_X}$ by the three estimators. Figure 5 shows these results, where both axes are in the log scale. Here we used $A = 5$ for the support of the uniform distribution.⁷ The results are summarized as follows.

⁷This enables us to maintain the values of $\|\hat{m}_P - m_P\|^2_{\mathcal{H}_X}$ in almost the same amount while changing the values of $\sum_{i=1}^n w_i^2$.

Figure 5: Results of synthetic experiments for the sampling and resampling procedure in section 6.1. Vertical axis: errors in the squared RKHS norm. Horizontal axis: values of $\sum_{i=1}^n w_i^2$ for different $\hat{m}_P$. Black: the error of $\hat{m}_P$ ($\|\hat{m}_P - m_P\|^2_{\mathcal{H}_X}$). Blue, green, and red: the errors on $m_Q$ by woRes, Res-KH, and Res-Trunc, respectively.

• The error of woRes (blue) increases proportionally to the amount of $\sum_{i=1}^n w_i^2$. This matches the bound, equation 5.5.

• The error of Res-KH is not affected by $\sum_{i=1}^n w_i^2$. Rather, it changes in parallel with the error of $\hat{m}_P$. This is explained by the discussions in section 5.2 on how our resampling algorithm improves the accuracy of the sampling procedure.

• Res-Trunc is worse than Res-KH, especially for large $\sum_{i=1}^n w_i^2$. This is also explained with the bound, equation 5.8. Here $\bar{m}_P$ is the one given by Res-Trunc, so the error $\|\bar{m}_P - m_P\|_{\mathcal{H}_X}$ can be large due to the truncation of negative weights, as shown in the demonstration results. This makes the resulting error $\|\bar{m}_Q - m_Q\|_{\mathcal{H}_X}$ large.

Note that $m_P$ and $m_Q$ are different kernel means, so it can happen that the errors $\|\bar{m}_Q - m_Q\|_{\mathcal{H}_X}$ by Res-KH are less than $\|\hat{m}_P - m_P\|_{\mathcal{H}_X}$, as in Figure 5.


6.2 Filtering with Synthetic State-Space Models. Here we apply KMCF to synthetic state-space models. Comparisons were made with the following methods.

kNN-PF (Vlassis et al., 2002). This method uses k-NN-based conditional density estimation (Stone, 1977) for learning the observation model. First, it estimates the conditional density of the inverse direction, $p(x|y)$, from the training sample $\{(X_i, Y_i)\}$. The learned conditional density is then used as an alternative for the likelihood $p(y_t|x_t)$; this is a heuristic to deal with high-dimensional $y_t$. Then it applies a particle filter (PF) based on the approximated observation model and the given transition model $p(x_t|x_{t-1})$.

GP-PF (Ferris et al., 2006). This method learns $p(y_t|x_t)$ from $\{(X_i, Y_i)\}$ with gaussian process (GP) regression. Then a particle filter is applied based on the learned observation model and the transition model. We used open source code for GP regression in this experiment, so comparison in computational time is omitted for this method.⁸

KBR filter (Fukumizu et al., 2011, 2013). This method is also based on kernel mean embeddings, as is KMCF. It applies kernel Bayes' rule (KBR) in posterior estimation, using the joint sample $\{(X_i, Y_i)\}$. This method assumes that there also exist training samples for the transition model. Thus, in the following experiments, we additionally drew training samples for the transition model. Fukumizu et al. (2011, 2013) showed that this method outperforms extended and unscented Kalman filters when a state-space model has strong nonlinearity (in that experiment, these Kalman filters were given the full knowledge of the state-space model). We use this method as a baseline.

We used the state-space models defined in Table 2, where "SSM" stands for "state-space model." In Table 2, $u_t$ denotes a control input at time $t$; $v_t$ and $w_t$ denote independent gaussian noise, $v_t, w_t \sim N(0, 1)$; and $W_t$ denotes 10-dimensional gaussian noise, $W_t \sim N(0, I_{10})$. We generated each control $u_t$ randomly from the gaussian distribution $N(0, 1)$.

The state and observation spaces for SSMs 1a, 1b, 2a, 2b, 4a, 4b are defined as $\mathcal{X} = \mathcal{Y} = \mathbb{R}$; for SSMs 3a, 3b, $\mathcal{X} = \mathbb{R}$ and $\mathcal{Y} = \mathbb{R}^{10}$. The models in SSMs 1a, 2a, 3a, 4a and SSMs 1b, 2b, 3b, 4b with the same number (e.g., 1a and 1b) are almost the same; the difference is whether $u_t$ exists in the transition model. Prior distributions for the initial state $x_1$ for SSMs 1a, 1b, 2a, 2b, 3a, 3b are defined as $p_{\mathrm{init}} = N(0, 1/(1 - 0.9^2))$, and those for 4a, 4b are defined as a uniform distribution on $[-3, 3]$.

SSMs 1a and 1b are linear gaussian models. SSMs 2a and 2b are the so-called stochastic volatility models. Their transition models are the same as those of SSMs 1a and 1b. The observation model has strong nonlinearity, and the noise $w_t$ is multiplicative. SSMs 3a and 3b are almost the same as SSMs 2a and 2b. The difference is that the observation $y_t$ is 10-dimensional, as $W_t$ is 10-dimensional gaussian noise. SSMs 4a and 4b are more complex than the other models. Both the transition and observation models have strong nonlinearities: states and observations located around the edges of the interval $[-3, 3]$ may abruptly jump to distant places.

⁸http://www.gaussianprocess.org/gpml/code/matlab/doc/

Table 2: State-Space Models for Synthetic Experiments.

SSM   Transition Model                              Observation Model
1a    x_t = 0.9 x_{t-1} + v_t                       y_t = x_t + w_t
1b    x_t = 0.9 x_{t-1} + (u_t + v_t)/sqrt(2)       y_t = x_t + w_t
2a    x_t = 0.9 x_{t-1} + v_t                       y_t = 0.5 exp(x_t / 2) w_t
2b    x_t = 0.9 x_{t-1} + (u_t + v_t)/sqrt(2)       y_t = 0.5 exp(x_t / 2) w_t
3a    x_t = 0.9 x_{t-1} + v_t                       y_t = 0.5 exp(x_t / 2) W_t
3b    x_t = 0.9 x_{t-1} + (u_t + v_t)/sqrt(2)       y_t = 0.5 exp(x_t / 2) W_t
4a    a_t = x_{t-1} + sqrt(2) v_t;                  b_t = x_t + w_t;
      x_t = a_t if |a_t| <= 3, -3 otherwise         y_t = b_t if |b_t| <= 3, b_t - 6 b_t/|b_t| otherwise
4b    a_t = x_{t-1} + u_t + v_t;                    b_t = x_t + w_t;
      x_t = a_t if |a_t| <= 3, -3 otherwise         y_t = b_t if |b_t| <= 3, b_t - 6 b_t/|b_t| otherwise
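As an illustration of how such data can be generated (a sketch, using SSM 2a from Table 2 and the stationary prior $p_{\mathrm{init}} = N(0, 1/(1-0.9^2))$; names are illustrative):

```python
import numpy as np

def simulate_ssm_2a(T, rng):
    """Simulate SSM 2a (stochastic volatility model from Table 2):
    x_t = 0.9 x_{t-1} + v_t,  y_t = 0.5 exp(x_t / 2) w_t,  v_t, w_t ~ N(0, 1),
    with the initial state drawn from N(0, 1 / (1 - 0.9^2))."""
    xs, ys = np.empty(T), np.empty(T)
    x = rng.normal(0.0, np.sqrt(1.0 / (1.0 - 0.9**2)))
    for t in range(T):
        x = 0.9 * x + rng.normal()
        xs[t] = x
        ys[t] = 0.5 * np.exp(x / 2.0) * rng.normal()
    return xs, ys

rng = np.random.default_rng(0)
X_train, Y_train = simulate_ssm_2a(1000, rng)   # state-observation examples
x_test, y_test = simulate_ssm_2a(100, rng)      # independent test sequence (T = 100)
```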

For each model, we generated the training samples $\{(X_i, Y_i)\}_{i=1}^n$ by simulating the model. Test data $\{(x_t, y_t)\}_{t=1}^T$ were also generated by independent simulation (recall that $x_t$ is hidden for each method). The length of the test sequence was set as $T = 100$. We fixed the number of particles in kNN-PF and GP-PF to 5000; in primary experiments, we did not observe any improvements even when more particles were used. For the same reason, we fixed the size of transition examples for the KBR filter to 1000. Each method estimated the ground-truth states $x_1, \ldots, x_T$ by estimating the posterior means $\int x_t\, p(x_t|y_{1:t})\, dx_t$ ($t = 1, \ldots, T$). The performance was evaluated with the RMSE (root mean squared error) of the point estimates, defined as $\mathrm{RMSE} = \sqrt{\frac{1}{T}\sum_{t=1}^T (x_t - \hat{x}_t)^2}$, where $\hat{x}_t$ is the point estimate.

For KMCF and the KBR filter, we used gaussian kernels for each of $\mathcal{X}$ and $\mathcal{Y}$ (and also for controls in the KBR filter). We determined the hyperparameters of each method by two-fold cross-validation, dividing the training data into two sequences. The hyperparameters in the GP-regressor for GP-PF were optimized by maximizing the marginal likelihood of the training data. To reduce the costs of the resampling step of KMCF, we used the method discussed in section 5.2 with $\ell = 50$. We also used the low-rank approximation method (see algorithm 5) and the subsampling method (see algorithm 6) in appendix C to reduce the computational costs of KMCF. Specifically, we used $r = 10, 20$ (rank of the low-rank matrices) for algorithm 5 (described as KMCF-low10 and KMCF-low20 in the results below) and $r = 50, 100$ (number of subsamples) for algorithm 6 (described as KMCF-sub50 and KMCF-sub100). We repeated the experiments 20 times for each different training sample size $n$.

Figure 6: RMSE of the synthetic experiments in section 6.2. The state-space models of these figures have no control in their transition models.

Figure 6 shows the results in RMSE for SSMs 1a, 2a, 3a, 4a, and Figure 7 shows those for SSMs 1b, 2b, 3b, 4b. Figure 8 describes the results in computational time for SSMs 1a and 1b; the results for the other models are similar, so we omit them. We do not show the results of KMCF-low10 in Figures 6 and 7, since they were numerically unstable and gave very large RMSEs.

GP-PF performed the best for SSMs 1a and 1b. This may be because these models fit the assumption of GP regression, as their noise is additive gaussian. For the other models, however, GP-PF performed poorly: the observation models of these models have strong nonlinearities, and the noise is not additive gaussian. For these models, KMCF performed the best or competitively with the other methods. This indicates that KMCF successfully exploits the state-observation examples $\{(X_i, Y_i)\}_{i=1}^n$ in dealing with the complicated observation models. Recall that our focus has been on situations where the relations between states and observations are so complicated that the observation model is not known; the results indicate that KMCF is promising for such situations. On the other hand, the KBR filter performed worse than KMCF for most of the models. The KBR filter also uses kernel Bayes' rule, as does KMCF. The difference is that KMCF makes use of the transition models directly by sampling, while the KBR filter must learn the transition models from training data for state transitions. This indicates that the incorporation of the knowledge expressed in the transition model is very important for the filtering performance. This can also be seen by comparing Figures 6 and 7. The performance of the methods other than the KBR filter improved for SSMs 1b, 2b, 3b, 4b compared to the performance for the corresponding models in SSMs 1a, 2a, 3a, 4a. Recall that SSMs 1b, 2b, 3b, 4b include control $u_t$ in their transition models. The information of the control input is helpful for filtering in general. Thus, the improvements suggest that KMCF, kNN-PF, and GP-PF successfully incorporate the information of controls; they achieve this by sampling with $p(x_t|x_{t-1}, u_t)$. On the other hand, the KBR filter must learn the transition model $p(x_t|x_{t-1}, u_t)$; this can be harder than learning the transition model $p(x_t|x_{t-1})$, which has no control input.

Figure 7: RMSE of synthetic experiments in section 6.2. The state-space models of these figures include control $u_t$ in their transition models.

Figure 8: Computation time of synthetic experiments in section 6.2. (Left) SSM 1a. (Right) SSM 1b.

We next compare computation time (see Figure 8). KMCF was competitive with or even slower than the KBR filter. This is due to the resampling step in KMCF. The speed-up methods (KMCF-low10, KMCF-low20, KMCF-sub50, and KMCF-sub100) successfully reduced the costs of KMCF. KMCF-low10 and KMCF-low20 scaled linearly with the sample size $n$; this matches the fact that algorithm 5 reduces the costs of kernel Bayes' rule to $O(nr^2)$. The costs of KMCF-sub50 and KMCF-sub100 remained almost the same over the different sample sizes. This is because they reduce the sample size itself from $n$ to $r$, so the costs are reduced to $O(r^3)$ (see algorithm 6). KMCF-sub50 and KMCF-sub100 are competitive with kNN-PF, which is fast as it only needs kNN searches to deal with the training sample $\{(X_i, Y_i)\}_{i=1}^n$. In Figures 6 and 7, KMCF-low20 and KMCF-sub100 produced results competitive with KMCF for SSMs 1a, 2a, 4a, 1b, 2b, 4b. Thus, for these models, such methods reduce the computational costs of KMCF without losing much accuracy. KMCF-sub50 was slightly worse than KMCF-sub100. This indicates that the number of subsamples cannot be reduced to this extent if we wish to maintain accuracy. For SSMs 3a and 3b, the performance of KMCF-low20 and KMCF-sub100 was worse than KMCF, in contrast to the performance for the other models. The difference of SSMs 3a and 3b from the other models is that the observation space is 10-dimensional, $\mathcal{Y} = \mathbb{R}^{10}$. This suggests that if the dimension is high, $r$ needs to be large to maintain accuracy (recall that $r$ is the rank of the low-rank matrices in algorithm 5 and the number of subsamples in algorithm 6). This is also implied by the experiments in section 6.3.

6.3 Vision-Based Mobile Robot Localization. We applied KMCF to the problem of vision-based mobile robot localization (Vlassis et al., 2002; Wolf et al., 2005; Quigley et al., 2010). We consider a robot moving in a building. The robot takes images with its vision camera as it moves. Thus, the vision images form a sequence of observations $y_1, \ldots, y_T$ in time series; each $y_t$ is an image. The robot does not know its positions in the building; we define the state $x_t$ as the robot's position at time $t$. The robot wishes to estimate its position $x_t$ from the sequence of its vision images $y_1, \ldots, y_t$. This can be done by filtering, that is, by estimating the posteriors $p(x_t|y_1, \ldots, y_t)$ ($t = 1, \ldots, T$). This is the robot localization problem. It is fundamental in robotics as a basis for more involved applications such as navigation and reinforcement learning (Thrun, Burgard, & Fox, 2005).

The state-space model is defined as follows. The observation model $p(y_t|x_t)$ is the conditional distribution of images given position, which is very complicated and considered unknown. We need to assume position-image examples $\{(X_i, Y_i)\}_{i=1}^n$; these samples are given in the data set described below. The transition model $p(x_t|x_{t-1}) = p(x_t|x_{t-1}, u_t)$ is the conditional distribution of the current position given the previous one. This involves a control input $u_t$ that specifies the movement of the robot. In the data set we use, the control is given as odometry measurements. Thus, we define $p(x_t|x_{t-1}, u_t)$ as the odometry motion model, which is fairly standard in robotics (Thrun et al., 2005). Specifically, we used the algorithm described in table 5.6 of Thrun et al. (2005) with all of its parameters fixed to 0.1. The prior $p_{\mathrm{init}}$ of the initial position $x_1$ is defined as a uniform distribution over the samples $X_1, \ldots, X_n$ in $\{(X_i, Y_i)\}_{i=1}^n$.

As a kernel $k_Y$ for observations (images), we used the spatial pyramid matching kernel of Lazebnik et al. (2006). This is a positive-definite kernel developed in the computer vision community and is also fairly standard. Specifically, we set the parameters of this kernel as suggested in Lazebnik et al. (2006). This gives a 4200-dimensional histogram for each image. We defined the kernel $k_X$ for states (positions) as gaussian. Here the state space is the four-dimensional space $\mathcal{X} = \mathbb{R}^4$: two dimensions for location and the rest for the orientation of the robot.⁹

The data set we used is the COLD database (Pronobis & Caputo, 2009), which is publicly available. Specifically, we used the data set Freiburg, Part A, Path 1, cloudy. This data set consists of three similar trajectories of a robot moving in a building, each of which provides position-image pairs $\{(x_t, y_t)\}_{t=1}^T$. We used two trajectories for training and validation and the rest for test. We made state-observation examples $\{(X_i, Y_i)\}_{i=1}^n$ by randomly subsampling the pairs in the trajectory for training. Note that the difficulty of localization may depend on the time interval (i.e., the interval between $t$ and $t-1$ in sec). Therefore, we made three test sets (and training samples for state transitions in the KBR filter) with different time intervals: 2.27 sec ($T = 168$), 4.54 sec ($T = 84$), and 6.81 sec ($T = 56$).

⁹We projected the robot's orientation in $[0, 2\pi]$ onto the unit circle in $\mathbb{R}^2$.

In these experiments, we compared KMCF with three methods: kNN-PF, the KBR filter, and the naive method (NAI) defined below. For the KBR filter, we also defined the gaussian kernel on the control $u_t$, that is, on the difference of odometry measurements at times $t-1$ and $t$. The naive method (NAI) estimates the state $x_t$ as the point $X_j$ in the training set $\{(X_i, Y_i)\}$ such that the corresponding observation $Y_j$ is closest to the observation $y_t$. We performed this as a baseline. We also used the spatial pyramid matching kernel for these methods (for kNN-PF and NAI, as a similarity measure for the nearest neighbor search). We did not compare with GP-PF, since it assumes that observations are real vectors and thus cannot be applied to this problem straightforwardly. We determined the hyperparameters in each method by cross-validation. To reduce the cost of the resampling step in KMCF, we used the method discussed in section 5.2 with $\ell = 100$. The low-rank approximation method (see algorithm 5) and the subsampling method (see algorithm 6) were also applied to reduce the computational costs of KMCF. Specifically, we set $r = 50, 100$ for algorithm 5 (described as KMCF-low50 and KMCF-low100 in the results below) and $r = 150, 300$ for algorithm 6 (KMCF-sub150 and KMCF-sub300).

Note that in this problem, the posteriors $p(x_t|y_{1:t})$ can be highly multimodal. This is because similar images appear in distant locations. Therefore, the posterior mean $\int x_t\, p(x_t|y_{1:t})\, dx_t$ is not appropriate for point estimation of the ground-truth position $x_t$. Thus, for KMCF and the KBR filter, we employed the heuristic for mode estimation explained in section 4.4. For kNN-PF, we used the particle with maximum weight for the point estimation. We evaluated the performance of each method by the RMSE of location estimates. We ran each experiment 20 times for each training set of different size.

6.3.1 Results. First, we demonstrate the behaviors of KMCF with this localization problem. Figures 9 and 10 show iterations of KMCF with $n = 400$ applied to the test data with time interval 6.81 sec. Figure 9 illustrates iterations that produced accurate estimates, while Figure 10 describes situations where location estimation is difficult.

Figures 11 and 12 show the results in RMSE and computational time, respectively. For all the results, KMCF and that with the computational reduction methods (KMCF-low50, KMCF-low100, KMCF-sub150, and KMCF-sub300) performed better than the KBR filter. These results show the benefit of directly manipulating the transition models with sampling. KMCF was competitive with kNN-PF for the interval 2.27 sec; note that kNN-PF was originally proposed for the robot localization problem. For the results with the longer time intervals (4.54 sec and 6.81 sec), KMCF outperformed kNN-PF.

Figure 9: Demonstration results. Each column corresponds to one iteration of KMCF. (Top) Prediction step: histogram of samples for the prior. (Middle) Correction step: weighted samples for the posterior. The blue and red stems indicate positive and negative weights, respectively. The yellow ball represents the ground-truth location $x_t$, and the green diamond the estimated one $\hat{x}_t$. (Bottom) Resampling step: histogram of samples given by the resampling step.

We next investigate the effect on KMCF of the methods to reduce computational cost. The performance of KMCF-low100 and KMCF-sub300 is competitive with KMCF; those of KMCF-low50 and KMCF-sub150 degrade as the sample size increases. Note that r = 50, 100 for algorithm 5 are larger than those in section 6.2, though the values of the sample size n are larger than those in section 6.2. Also note that the performance of KMCF-sub150 is much worse than that of KMCF-sub300. These results indicate that we may need large values of r to maintain accuracy for this localization problem. Recall that the spatial pyramid matching kernel gives an essentially high-dimensional feature vector (histogram) for each observation. Thus the observation space Y may be considered high-dimensional. This supports the hypothesis in section 6.2 that if the dimension is high, the computational cost-reduction methods may require larger r to maintain accuracy.

Figure 10: Demonstration results (see also the caption of Figure 9). Here we show time points where observed images are similar to those in distant places. Such a situation often occurs at corners and makes location estimation difficult. (a) The prior estimate is reasonable, but the resulting posterior has modes in distant places. This makes the location estimate (green diamond) far from the true location (yellow ball). (b) While the location estimate is very accurate, modes also appear at distant locations.

Finally, let us look at the results in computation time (see Figure 12). The results are similar to those in section 6.2: although the values of r are relatively large, algorithms 5 and 6 successfully reduced the computational costs of KMCF.

7 Conclusion and Future Work

This letter has proposed the kernel Monte Carlo filter, a novel filtering method for state-space models. We have considered the situation where the observation model is not known explicitly or even parametrically, and where examples of the state-observation relation are given instead of the observation model. Our approach is based on the framework of kernel mean embeddings, which enables us to deal with the observation model in a data-driven manner. The resulting filtering method consists of the prediction, correction, and resampling steps, all realized in terms of kernel mean embeddings. Methodological novelties lie in the prediction and resampling steps; thus we analyzed their behaviors by deriving error bounds for the estimator of the prediction step. The analysis revealed that the effective sample size of a weighted sample plays an important role, as in particle methods. This analysis also explained how our resampling algorithm works. We applied the proposed method to synthetic and real problems, confirming the effectiveness of our approach.

Figure 11: RMSE of the robot localization experiments in section 6.3. Panels a, b, and c show the cases for time intervals 2.27 sec, 4.54 sec, and 6.81 sec, respectively.


Figure 12: Computation time of the localization experiments in section 6.3. Panels a, b, and c show the cases for time intervals 2.27 sec, 4.54 sec, and 6.81 sec, respectively. Note that the results show the runtime of each method.

One interesting topic for future research would be parameter estimation for the transition model. We did not discuss this and assumed that parameters, if they exist, are given and fixed. If the state-observation examples (X_i, Y_i)_{i=1}^n are given as a sequence from the state-space model, then we can use the state samples X_1, ..., X_n for estimating those parameters. Otherwise, we need to estimate the parameters based on test data. This might be possible by exploiting approaches for parameter estimation in particle methods (e.g., section IV in Cappé et al., 2007).

Another important topic is the situation where the observation model in the test and training phases is different. As discussed in section 4.3, this might be addressed by exploiting the framework of transfer learning (Pan & Yang, 2010). This would require an extension of kernel mean embeddings to the setting of transfer learning, since there has been no work in this direction. We consider that such an extension is interesting in its own right.

Appendix A: Proof of Theorem 1

Before going to the proof, we review some basic facts that are needed. Let $m_P = \int k_{\mathcal{X}}(\cdot, x)dP(x)$ and $\hat{m}_P = \sum_{i=1}^n w_i k_{\mathcal{X}}(\cdot, X_i)$. By the reproducing property of the kernel $k_{\mathcal{X}}$, the following hold for any $f \in \mathcal{H}_{\mathcal{X}}$:

$$\langle m_P, f \rangle_{\mathcal{H}_{\mathcal{X}}} = \Big\langle \int k_{\mathcal{X}}(\cdot, x)dP(x),\, f \Big\rangle_{\mathcal{H}_{\mathcal{X}}} = \int \langle k_{\mathcal{X}}(\cdot, x), f \rangle_{\mathcal{H}_{\mathcal{X}}}\, dP(x) = \int f(x)dP(x) = E_{X \sim P}[f(X)], \quad (A.1)$$

$$\langle \hat{m}_P, f \rangle_{\mathcal{H}_{\mathcal{X}}} = \Big\langle \sum_{i=1}^n w_i k_{\mathcal{X}}(\cdot, X_i),\, f \Big\rangle_{\mathcal{H}_{\mathcal{X}}} = \sum_{i=1}^n w_i f(X_i). \quad (A.2)$$

For any $f, g \in \mathcal{H}_{\mathcal{X}}$, we denote by $f \otimes g \in \mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}$ the tensor product of $f$ and $g$, defined as

$$f \otimes g(x_1, x_2) = f(x_1)g(x_2), \quad \forall x_1, x_2 \in \mathcal{X}. \quad (A.3)$$

The inner product of the tensor RKHS $\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}$ satisfies

$$\langle f_1 \otimes g_1, f_2 \otimes g_2 \rangle_{\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}} = \langle f_1, f_2 \rangle_{\mathcal{H}_{\mathcal{X}}} \langle g_1, g_2 \rangle_{\mathcal{H}_{\mathcal{X}}}, \quad \forall f_1, f_2, g_1, g_2 \in \mathcal{H}_{\mathcal{X}}. \quad (A.4)$$

Let $\{\phi_s\}_{s=1}^{I} \subset \mathcal{H}_{\mathcal{X}}$ be complete orthonormal bases of $\mathcal{H}_{\mathcal{X}}$, where $I \in \mathbb{N} \cup \{\infty\}$. Assume $\theta \in \mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}$ (recall that this is an assumption of theorem 1). Then $\theta$ is expressed as

$$\theta = \sum_{s,t=1}^{I} \alpha_{st}\, \phi_s \otimes \phi_t \quad (A.5)$$

with $\sum_{s,t} |\alpha_{st}|^2 < \infty$ (see Aronszajn, 1950).

Proof of Theorem 1. Recall that $\hat{m}_Q = \sum_{i=1}^n w_i k_{\mathcal{X}}(\cdot, X'_i)$, where $X'_i \sim p(\cdot|X_i)$ $(i = 1, \dots, n)$. Then

$$E_{X'_1, \dots, X'_n}\big[\|\hat{m}_Q - m_Q\|^2_{\mathcal{H}_{\mathcal{X}}}\big] = E_{X'_1, \dots, X'_n}\big[\langle \hat{m}_Q, \hat{m}_Q \rangle_{\mathcal{H}_{\mathcal{X}}} - 2\langle \hat{m}_Q, m_Q \rangle_{\mathcal{H}_{\mathcal{X}}} + \langle m_Q, m_Q \rangle_{\mathcal{H}_{\mathcal{X}}}\big]$$
$$= \sum_{i,j=1}^n w_i w_j E_{X'_i, X'_j}[k_{\mathcal{X}}(X'_i, X'_j)] - 2\sum_{i=1}^n w_i E_{X' \sim Q,\, X'_i}[k_{\mathcal{X}}(X', X'_i)] + E_{X', \tilde{X}' \sim Q}[k_{\mathcal{X}}(X', \tilde{X}')]$$
$$= \sum_{i \neq j} w_i w_j E_{X'_i, X'_j}[k_{\mathcal{X}}(X'_i, X'_j)] + \sum_{i=1}^n w_i^2 E_{X'_i}[k_{\mathcal{X}}(X'_i, X'_i)] - 2\sum_{i=1}^n w_i E_{X' \sim Q,\, X'_i}[k_{\mathcal{X}}(X', X'_i)] + E_{X', \tilde{X}' \sim Q}[k_{\mathcal{X}}(X', \tilde{X}')], \quad (A.6)$$

where $\tilde{X}'$ denotes an independent copy of $X'$.

Recall that $Q = \int p(\cdot|x)dP(x)$ and $\theta(x, \tilde{x}) = \int\!\!\int k_{\mathcal{X}}(x', \tilde{x}')\, dp(x'|x)\, dp(\tilde{x}'|\tilde{x})$. We can then rewrite terms in equation A.6 as

$$E_{X' \sim Q,\, X'_i}[k_{\mathcal{X}}(X', X'_i)] = \int \Big( \int\!\!\int k_{\mathcal{X}}(x', x'_i)\, dp(x'|x)\, dp(x'_i|X_i) \Big) dP(x) = \int \theta(x, X_i)\, dP(x) = E_{X \sim P}[\theta(X, X_i)],$$

$$E_{X', \tilde{X}' \sim Q}[k_{\mathcal{X}}(X', \tilde{X}')] = \int\!\!\int \Big( \int\!\!\int k_{\mathcal{X}}(x', \tilde{x}')\, dp(x'|x)\, dp(\tilde{x}'|\tilde{x}) \Big) dP(x)\, dP(\tilde{x}) = \int\!\!\int \theta(x, \tilde{x})\, dP(x)\, dP(\tilde{x}) = E_{X, \tilde{X} \sim P}[\theta(X, \tilde{X})].$$

Thus equation A.6 is equal to

$$\sum_{i=1}^n w_i^2 \big( E_{X'_i}[k_{\mathcal{X}}(X'_i, X'_i)] - E_{X'_i, \tilde{X}'_i}[k_{\mathcal{X}}(X'_i, \tilde{X}'_i)] \big) + \sum_{i,j=1}^n w_i w_j \theta(X_i, X_j) - 2\sum_{i=1}^n w_i E_{X \sim P}[\theta(X, X_i)] + E_{X, \tilde{X} \sim P}[\theta(X, \tilde{X})]. \quad (A.7)$$

We can rewrite terms in equation A.7 as follows, using the facts A.1, A.2, A.3, A.4, and A.5:

$$\sum_{i,j} w_i w_j \theta(X_i, X_j) = \sum_{i,j} w_i w_j \sum_{s,t} \alpha_{st} \phi_s(X_i)\phi_t(X_j) = \sum_{s,t} \alpha_{st} \sum_i w_i \phi_s(X_i) \sum_j w_j \phi_t(X_j) = \sum_{s,t} \alpha_{st} \langle \hat{m}_P, \phi_s \rangle_{\mathcal{H}_{\mathcal{X}}} \langle \hat{m}_P, \phi_t \rangle_{\mathcal{H}_{\mathcal{X}}}$$
$$= \sum_{s,t} \alpha_{st} \langle \hat{m}_P \otimes \hat{m}_P, \phi_s \otimes \phi_t \rangle_{\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}} = \langle \hat{m}_P \otimes \hat{m}_P, \theta \rangle_{\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}},$$

$$\sum_i w_i E_{X \sim P}[\theta(X, X_i)] = \sum_i w_i E_{X \sim P}\Big[ \sum_{s,t} \alpha_{st} \phi_s(X)\phi_t(X_i) \Big] = \sum_{s,t} \alpha_{st} E_{X \sim P}[\phi_s(X)] \sum_i w_i \phi_t(X_i) = \sum_{s,t} \alpha_{st} \langle m_P, \phi_s \rangle_{\mathcal{H}_{\mathcal{X}}} \langle \hat{m}_P, \phi_t \rangle_{\mathcal{H}_{\mathcal{X}}}$$
$$= \sum_{s,t} \alpha_{st} \langle m_P \otimes \hat{m}_P, \phi_s \otimes \phi_t \rangle_{\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}} = \langle m_P \otimes \hat{m}_P, \theta \rangle_{\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}},$$

$$E_{X, \tilde{X} \sim P}[\theta(X, \tilde{X})] = E_{X, \tilde{X} \sim P}\Big[ \sum_{s,t} \alpha_{st} \phi_s(X)\phi_t(\tilde{X}) \Big] = \sum_{s,t} \alpha_{st} \langle m_P, \phi_s \rangle_{\mathcal{H}_{\mathcal{X}}} \langle m_P, \phi_t \rangle_{\mathcal{H}_{\mathcal{X}}} = \sum_{s,t} \alpha_{st} \langle m_P \otimes m_P, \phi_s \otimes \phi_t \rangle_{\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}} = \langle m_P \otimes m_P, \theta \rangle_{\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}}.$$

Thus equation A.7 is equal to

$$\sum_{i=1}^n w_i^2 \big( E_{X'_i}[k_{\mathcal{X}}(X'_i, X'_i)] - E_{X'_i, \tilde{X}'_i}[k_{\mathcal{X}}(X'_i, \tilde{X}'_i)] \big) + \langle \hat{m}_P \otimes \hat{m}_P, \theta \rangle_{\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}} - 2\langle m_P \otimes \hat{m}_P, \theta \rangle_{\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}} + \langle m_P \otimes m_P, \theta \rangle_{\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}}$$
$$= \sum_{i=1}^n w_i^2 \big( E_{X'_i}[k_{\mathcal{X}}(X'_i, X'_i)] - E_{X'_i, \tilde{X}'_i}[k_{\mathcal{X}}(X'_i, \tilde{X}'_i)] \big) + \langle (\hat{m}_P - m_P) \otimes (\hat{m}_P - m_P), \theta \rangle_{\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}}.$$

Finally, the Cauchy-Schwartz inequality gives

$$\langle (\hat{m}_P - m_P) \otimes (\hat{m}_P - m_P), \theta \rangle_{\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}} \le \|\hat{m}_P - m_P\|^2_{\mathcal{H}_{\mathcal{X}}} \|\theta\|_{\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}}.$$

This completes the proof.


Appendix B: Proof of Theorem 2

Theorem 2 provides convergence rates for the resampling algorithm, algorithm 4. This theorem assumes that the candidate samples $Z_1, \dots, Z_N$ for resampling are i.i.d. with a density $q$. Here we prove theorem 2 by showing that the same statement holds under weaker assumptions (see theorem 3 below).

We first describe the assumptions. Let $P$ be the distribution of the kernel mean $m_P$, and let $L_2(P)$ be the Hilbert space of square-integrable functions on $\mathcal{X}$ with respect to $P$. For any $f \in L_2(P)$, we write its norm as $\|f\|_{L_2(P)} = \big( \int f^2(x)\, dP(x) \big)^{1/2}$.

Assumption 1. The candidate samples $Z_1, \dots, Z_N$ are independent. There are probability distributions $Q_1, \dots, Q_N$ on $\mathcal{X}$ such that for any bounded measurable function $g: \mathcal{X} \to \mathbb{R}$, we have

$$E\Big[ \frac{1}{N-1} \sum_{j \neq i} g(Z_j) \Big] = E_{X \sim Q_i}[g(X)] \quad (i = 1, \dots, N). \quad (B.1)$$

Assumption 2. The distributions $Q_1, \dots, Q_N$ have density functions $q_1, \dots, q_N$, respectively. Define $Q = \frac{1}{N}\sum_{i=1}^N Q_i$ and $q = \frac{1}{N}\sum_{i=1}^N q_i$. There is a constant $A > 0$ that does not depend on $N$ such that

$$\Big\| \frac{q_i}{q} - 1 \Big\|^2_{L_2(P)} \le \frac{A}{\sqrt{N}} \quad (i = 1, \dots, N). \quad (B.2)$$

Assumption 3. The distribution $P$ has a density function $p$ such that $\sup_{x \in \mathcal{X}} \frac{p(x)}{q(x)} < \infty$. There is a constant $\sigma > 0$ such that

$$\sqrt{N}\Big( \frac{1}{N}\sum_{i=1}^N \frac{p(Z_i)}{q(Z_i)} - 1 \Big) \xrightarrow{D} \mathcal{N}(0, \sigma^2), \quad (B.3)$$

where $\xrightarrow{D}$ denotes convergence in distribution and $\mathcal{N}(0, \sigma^2)$ is the normal distribution with mean 0 and variance $\sigma^2$.

These assumptions are weaker than those in theorem 2, which require that $Z_1, \dots, Z_N$ be i.i.d. For example, assumption 1 is clearly satisfied in the i.i.d. case, since in this case we have $Q = Q_1 = \dots = Q_N$. The inequality B.2 in assumption 2 requires that the distributions $Q_1, \dots, Q_N$ become similar as the sample size increases; this is also satisfied under the i.i.d. assumption. Likewise, the convergence B.3 in assumption 3 is satisfied, by the central limit theorem, if $Z_1, \dots, Z_N$ are i.i.d.

We will need the following lemma.

Lemma 1. Let $Z_1, \dots, Z_N$ be samples satisfying assumption 1. Then the following holds for any bounded measurable function $g: \mathcal{X} \to \mathbb{R}$:

$$E\Big[ \frac{1}{N}\sum_{i=1}^N g(Z_i) \Big] = \int g(x)\, dQ(x).$$

Proof.

$$E\Big[ \frac{1}{N}\sum_{i=1}^N g(Z_i) \Big] = E\Big[ \frac{1}{N(N-1)} \sum_{i=1}^N \sum_{j \neq i} g(Z_j) \Big] = \frac{1}{N}\sum_{i=1}^N E\Big[ \frac{1}{N-1}\sum_{j \neq i} g(Z_j) \Big] = \frac{1}{N}\sum_{i=1}^N \int g(x)\, dQ_i(x) = \int g(x)\, dQ(x).$$

The following theorem shows the convergence rates of our resampling algorithm. Note that it does not assume that the candidate samples $Z_1, \dots, Z_N$ are identical to those expressing the estimator $\hat{m}_P$.

Theorem 3. Let $k$ be a bounded positive-definite kernel, and let $\mathcal{H}$ be the associated RKHS. Let $Z_1, \dots, Z_N$ be candidate samples satisfying assumptions 1, 2, and 3. Let $P$ be a probability distribution satisfying assumption 3, and let $m_P = \int k(\cdot, x)\, dP(x)$ be its kernel mean. Let $\hat{m}_P \in \mathcal{H}$ be any element in $\mathcal{H}$. Suppose we apply algorithm 4 to $\hat{m}_P$ with candidate samples $Z_1, \dots, Z_N$, and let $X_1, \dots, X_\ell \in \{Z_1, \dots, Z_N\}$ be the resulting samples. Then the following holds:

$$\Big\| \hat{m}_P - \frac{1}{\ell}\sum_{i=1}^{\ell} k(\cdot, X_i) \Big\|^2_{\mathcal{H}} = \big( \|m_P - \hat{m}_P\|_{\mathcal{H}} + O_p(N^{-1/2}) \big)^2 + O\Big( \frac{\ln \ell}{\ell} \Big).$$

Proof. Our proof is based on the fact (Bach et al., 2012) that kernel herding can be seen as the Frank-Wolfe optimization method with step size $1/(\ell + 1)$ for the $\ell$-th iteration. For details of the Frank-Wolfe method, we refer to Jaggi (2013) and Freund and Grigas (2014).

Fix the samples $Z_1, \dots, Z_N$. Let $\mathcal{M}_N$ be the convex hull of the set $\{k(\cdot, Z_1), \dots, k(\cdot, Z_N)\} \subset \mathcal{H}$. Define a loss function $J: \mathcal{H} \to \mathbb{R}$ by

$$J(g) = \frac{1}{2}\|g - \hat{m}_P\|^2_{\mathcal{H}}, \quad g \in \mathcal{H}. \quad (B.4)$$

Then algorithm 4 can be seen as the Frank-Wolfe method that iteratively minimizes this loss function over the convex hull $\mathcal{M}_N$:

$$\inf_{g \in \mathcal{M}_N} J(g).$$

More precisely, the Frank-Wolfe method solves this problem by the following iterations:

$$s_\ell = \arg\min_{g \in \mathcal{M}_N} \langle g, \nabla J(g_{\ell-1}) \rangle_{\mathcal{H}},$$
$$g_\ell = (1 - \gamma_\ell)\, g_{\ell-1} + \gamma_\ell\, s_\ell \quad (\ell \ge 1),$$

where $\gamma_\ell$ is a step size defined as $\gamma_\ell = 1/\ell$, and $\nabla J(g_{\ell-1})$ is the gradient of $J$ at $g_{\ell-1}$: $\nabla J(g_{\ell-1}) = g_{\ell-1} - \hat{m}_P$. Here the initial point is defined as $g_0 = 0$. It can be easily shown that $g_\ell = \frac{1}{\ell}\sum_{i=1}^{\ell} k(\cdot, X_i)$, where $X_1, \dots, X_\ell$ are the samples given by algorithm 4. (For details, see Bach et al., 2012.)

Let $L_{J, \mathcal{M}_N} > 0$ be the Lipschitz constant of the gradient $\nabla J$ over $\mathcal{M}_N$, and let $\mathrm{Diam}\,\mathcal{M}_N > 0$ be the diameter of $\mathcal{M}_N$:

$$L_{J, \mathcal{M}_N} = \sup_{g_1, g_2 \in \mathcal{M}_N} \frac{\|\nabla J(g_1) - \nabla J(g_2)\|_{\mathcal{H}}}{\|g_1 - g_2\|_{\mathcal{H}}} = \sup_{g_1, g_2 \in \mathcal{M}_N} \frac{\|g_1 - g_2\|_{\mathcal{H}}}{\|g_1 - g_2\|_{\mathcal{H}}} = 1, \quad (B.5)$$

$$\mathrm{Diam}\,\mathcal{M}_N = \sup_{g_1, g_2 \in \mathcal{M}_N} \|g_1 - g_2\|_{\mathcal{H}} \le \sup_{g_1, g_2 \in \mathcal{M}_N} \big( \|g_1\|_{\mathcal{H}} + \|g_2\|_{\mathcal{H}} \big) \le 2C, \quad (B.6)$$

where $C = \sup_{x \in \mathcal{X}} \|k(\cdot, x)\|_{\mathcal{H}} = \sup_{x \in \mathcal{X}} \sqrt{k(x, x)} < \infty$.

From bound 3.2 and equation 8 of Freund and Grigas (2014), we then have

$$J(g_\ell) - \inf_{g \in \mathcal{M}_N} J(g) \le \frac{L_{J, \mathcal{M}_N} (\mathrm{Diam}\,\mathcal{M}_N)^2 (1 + \ln \ell)}{2\ell} \quad (B.7)$$
$$\le \frac{2C^2 (1 + \ln \ell)}{\ell}, \quad (B.8)$$

where the last inequality follows from equations B.5 and B.6.

Note that the upper bound of equation B.8 does not depend on the candidate samples $Z_1, \dots, Z_N$. Hence, combined with equation B.4, the following holds for any choice of $Z_1, \dots, Z_N$:

$$\Big\| \hat{m}_P - \frac{1}{\ell}\sum_{i=1}^{\ell} k(\cdot, X_i) \Big\|^2_{\mathcal{H}} \le \inf_{g \in \mathcal{M}_N} \|\hat{m}_P - g\|^2_{\mathcal{H}} + \frac{4C^2(1 + \ln \ell)}{\ell}. \quad (B.9)$$

Below we will focus on bounding the first term of equation B.9. Recall here that $Z_1, \dots, Z_N$ are random samples. Define a random variable $S_N = \sum_{i=1}^N \frac{p(Z_i)}{q(Z_i)}$. Since $\mathcal{M}_N$ is the convex hull of $\{k(\cdot, Z_1), \dots, k(\cdot, Z_N)\}$, we have

$$\inf_{g \in \mathcal{M}_N} \|\hat{m}_P - g\|_{\mathcal{H}} = \inf_{\alpha \in \mathbb{R}^N,\, \alpha \ge 0,\, \sum_i \alpha_i \le 1} \Big\| \hat{m}_P - \sum_i \alpha_i k(\cdot, Z_i) \Big\|_{\mathcal{H}} \le \Big\| \hat{m}_P - \frac{1}{S_N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \Big\|_{\mathcal{H}}$$
$$\le \|\hat{m}_P - m_P\|_{\mathcal{H}} + \Big\| m_P - \frac{1}{N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \Big\|_{\mathcal{H}} + \Big\| \frac{1}{N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) - \frac{1}{S_N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \Big\|_{\mathcal{H}}.$$

Therefore we have

$$\Big\| \hat{m}_P - \frac{1}{\ell}\sum_{i=1}^{\ell} k(\cdot, X_i) \Big\|^2_{\mathcal{H}} \le \bigg( \|\hat{m}_P - m_P\|_{\mathcal{H}} + \Big\| m_P - \frac{1}{N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \Big\|_{\mathcal{H}} + \Big\| \frac{1}{N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) - \frac{1}{S_N} \sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \Big\|_{\mathcal{H}} \bigg)^2 + O\Big( \frac{\ln \ell}{\ell} \Big). \quad (B.10)$$

Below we derive rates of convergence for the second and third terms.

For the second term, we derive a rate of convergence in expectation, which implies a rate of convergence in probability. To this end, we use the following fact. Let $f \in \mathcal{H}$ be any function in the RKHS. By the assumption $\sup_{x \in \mathcal{X}} \frac{p(x)}{q(x)} < \infty$ and the boundedness of $k$, the functions $x \mapsto \frac{p(x)}{q(x)} f(x)$ and $x \mapsto \big( \frac{p(x)}{q(x)} \big)^2 f(x)$ are bounded. Then

$$E\bigg[ \Big\| m_P - \frac{1}{N}\sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \Big\|^2_{\mathcal{H}} \bigg]$$
$$= \|m_P\|^2_{\mathcal{H}} - 2E\Big[ \frac{1}{N}\sum_i \frac{p(Z_i)}{q(Z_i)} m_P(Z_i) \Big] + E\Big[ \frac{1}{N^2}\sum_i \sum_j \frac{p(Z_i)}{q(Z_i)} \frac{p(Z_j)}{q(Z_j)} k(Z_i, Z_j) \Big]$$
$$= \|m_P\|^2_{\mathcal{H}} - 2\int \frac{p(x)}{q(x)} m_P(x)\, q(x)\, dx + E\Big[ \frac{1}{N^2}\sum_i \sum_{j \neq i} \frac{p(Z_i)}{q(Z_i)} \frac{p(Z_j)}{q(Z_j)} k(Z_i, Z_j) \Big] + E\Big[ \frac{1}{N^2}\sum_i \Big( \frac{p(Z_i)}{q(Z_i)} \Big)^2 k(Z_i, Z_i) \Big]$$
$$= \|m_P\|^2_{\mathcal{H}} - 2\|m_P\|^2_{\mathcal{H}} + E\Big[ \frac{N-1}{N^2}\sum_i \frac{p(Z_i)}{q(Z_i)} \int \frac{p(x)}{q(x)} k(Z_i, x)\, q_i(x)\, dx \Big] + \frac{1}{N}\int \Big( \frac{p(x)}{q(x)} \Big)^2 k(x, x)\, q(x)\, dx$$
$$= -\|m_P\|^2_{\mathcal{H}} + E\Big[ \frac{N-1}{N^2}\sum_i \frac{p(Z_i)}{q(Z_i)} \int \frac{p(x)}{q(x)} k(Z_i, x)\, q_i(x)\, dx \Big] + \frac{1}{N}\int \frac{p(x)}{q(x)} k(x, x)\, dP(x).$$

We further rewrite the second term of the last equality as follows:

$$E\Big[ \frac{N-1}{N^2}\sum_i \frac{p(Z_i)}{q(Z_i)} \int \frac{p(x)}{q(x)} k(Z_i, x)\, q_i(x)\, dx \Big]$$
$$= E\Big[ \frac{N-1}{N^2}\sum_i \frac{p(Z_i)}{q(Z_i)} \int \frac{p(x)}{q(x)} k(Z_i, x)\, (q_i(x) - q(x))\, dx \Big] + E\Big[ \frac{N-1}{N^2}\sum_i \frac{p(Z_i)}{q(Z_i)} \int \frac{p(x)}{q(x)} k(Z_i, x)\, q(x)\, dx \Big]$$
$$= E\Big[ \frac{N-1}{N^2}\sum_i \frac{p(Z_i)}{q(Z_i)} \int \sqrt{p(x)}\, k(Z_i, x)\, \sqrt{p(x)} \Big( \frac{q_i(x)}{q(x)} - 1 \Big) dx \Big] + \frac{N-1}{N}\|m_P\|^2_{\mathcal{H}}$$
$$\le E\Big[ \frac{N-1}{N^2}\sum_i \frac{p(Z_i)}{q(Z_i)}\, \|k(Z_i, \cdot)\|_{L_2(P)} \Big\| \frac{q_i}{q} - 1 \Big\|_{L_2(P)} \Big] + \frac{N-1}{N}\|m_P\|^2_{\mathcal{H}}$$
$$\le E\Big[ \frac{N-1}{N^3}\sum_i \frac{p(Z_i)}{q(Z_i)}\, C^2 A \Big] + \frac{N-1}{N}\|m_P\|^2_{\mathcal{H}} = \frac{C^2 A (N-1)}{N^2} + \frac{N-1}{N}\|m_P\|^2_{\mathcal{H}},$$

where the first inequality follows from Cauchy-Schwartz. Using this, we obtain

$$E\bigg[ \Big\| m_P - \frac{1}{N}\sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \Big\|^2_{\mathcal{H}} \bigg] \le \frac{1}{N}\Big( \int \frac{p(x)}{q(x)} k(x, x)\, dP(x) - \|m_P\|^2_{\mathcal{H}} \Big) + \frac{C^2 (N-1) A}{N^2} = O(N^{-1}).$$

Therefore we have

$$\Big\| m_P - \frac{1}{N}\sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \Big\|_{\mathcal{H}} = O_p(N^{-1/2}) \quad (N \to \infty). \quad (B.11)$$

We can bound the third term as follows:

$$\Big\| \frac{1}{N}\sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) - \frac{1}{S_N}\sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \Big\|_{\mathcal{H}} = \Big\| \frac{1}{N}\sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \Big( 1 - \frac{N}{S_N} \Big) \Big\|_{\mathcal{H}}$$
$$= \Big| 1 - \frac{N}{S_N} \Big| \cdot \Big\| \frac{1}{N}\sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \Big\|_{\mathcal{H}} \le \Big| 1 - \frac{N}{S_N} \Big|\, C\, \Big\| \frac{p}{q} \Big\|_{\infty} = \bigg| 1 - \frac{1}{\frac{1}{N}\sum_{i=1}^N p(Z_i)/q(Z_i)} \bigg|\, C\, \Big\| \frac{p}{q} \Big\|_{\infty},$$

where $\|p/q\|_{\infty} = \sup_{x \in \mathcal{X}} \frac{p(x)}{q(x)} < \infty$. Therefore the following holds by assumption 3 and the delta method:

$$\Big\| \frac{1}{N}\sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) - \frac{1}{S_N}\sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \Big\|_{\mathcal{H}} = O_p(N^{-1/2}). \quad (B.12)$$

The assertion of the theorem follows from equations B.10 to B.12.

Appendix C: Reduction of Computational Cost

We have seen in section 4.3 that the time complexity of KMCF in one time step is O(n^3), where n is the number of the state-observation examples (X_i, Y_i)_{i=1}^n. This can be costly if one wishes to use KMCF in real-time applications with a large number of samples. Here we show two methods for reducing the costs: one based on low-rank approximation of kernel matrices and one based on kernel herding. Note that kernel herding is also used in the resampling step; the purpose here is different, however, as we make use of kernel herding for finding a reduced representation of the data (X_i, Y_i)_{i=1}^n.

C.1 Low-Rank Approximation of Kernel Matrices. Our goal is to reduce the costs of algorithm 1 of kernel Bayes' rule. Algorithm 1 involves two matrix inversions: (G_X + nεI_n)^{-1} in line 3 and ((G_Y)^2 + δI_n)^{-1} in line 4. Note that (G_X + nεI_n)^{-1} does not involve the test data, so it can be computed before the test phase. On the other hand, ((G_Y)^2 + δI_n)^{-1} depends on a matrix that involves the vector m_π, which essentially represents the prior of the current state (see line 13 of algorithm 3). Therefore ((G_Y)^2 + δI_n)^{-1} needs to be computed at each iteration in the test phase, which has a complexity of O(n^3). Note that even if (G_X + nεI_n)^{-1} can be computed in the training phase, the multiplication (G_X + nεI_n)^{-1} m_π in line 3 requires O(n^2), so it can also be costly. Here we consider methods to reduce both costs in lines 3 and 4.

Suppose that there exist low-rank matrices U, V ∈ R^{n×r}, where r < n, that approximate the kernel matrices: G_X ≈ UU^T, G_Y ≈ VV^T. Such low-rank matrices can be obtained by, for example, incomplete Cholesky decomposition with time complexity O(nr^2) (Fine & Scheinberg, 2001; Bach & Jordan, 2002). Note that the computation of these matrices is required only once, before the test phase; therefore their time complexities are not the problem here.

C.1.1 Derivation. First, we approximate (G_X + nεI_n)^{-1} m_π in line 3 using G_X ≈ UU^T. By the Woodbury identity, we have

$$(G_X + n\varepsilon I_n)^{-1} m_\pi \approx (UU^T + n\varepsilon I_n)^{-1} m_\pi = \frac{1}{n\varepsilon}\big( I_n - U(n\varepsilon I_r + U^T U)^{-1} U^T \big) m_\pi,$$

where I_r ∈ R^{r×r} denotes the identity. Note that (nεI_r + U^T U)^{-1} does not involve the test data, so it can be computed in the training phase. Thus the above approximation of μ can be computed with complexity O(nr^2).

Next, we approximate w = G_Y((G_Y)^2 + δI)^{-1} k_Y in line 4 using G_Y ≈ VV^T. Define B = V ∈ R^{n×r}, C = V^T V ∈ R^{r×r}, and D = V^T ∈ R^{r×n}. Then (G_Y)^2 ≈ (VV^T)^2 = BCD. By the Woodbury identity, we obtain

$$(\delta I_n + (G_Y)^2)^{-1} \approx (\delta I_n + BCD)^{-1} = \frac{1}{\delta}\big( I_n - B(\delta C^{-1} + DB)^{-1} D \big).$$

Thus w can be approximated as

$$w = G_Y((G_Y)^2 + \delta I)^{-1} k_Y \approx \frac{1}{\delta}\, VV^T \big( I_n - B(\delta C^{-1} + DB)^{-1} D \big) k_Y.$$

The computation of this approximation requires O(nr^2 + r^3) = O(nr^2). Thus, in total, the complexity of algorithm 1 can be reduced to O(nr^2). We summarize the above approximations in algorithm 5; a minimal code sketch of these two approximations is also given below.
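The following is a minimal sketch of the two Woodbury-based approximations above, assuming the low-rank factors U and V (e.g., from incomplete Cholesky decomposition) have already been computed; the function and variable names are illustrative and do not correspond to the original implementation.

```python
import numpy as np

def approx_mu(U, m_pi, n_eps):
    """Approximate (G_X + n*eps*I)^{-1} m_pi via Woodbury, with G_X ~= U U^T."""
    r = U.shape[1]
    # (n*eps*I_r + U^T U)^{-1} does not involve test data and can be precomputed
    inner = np.linalg.inv(n_eps * np.eye(r) + U.T @ U)
    return (m_pi - U @ (inner @ (U.T @ m_pi))) / n_eps

def approx_w(V, k_Y, delta):
    """Approximate G_Y((G_Y)^2 + delta*I)^{-1} k_Y via Woodbury, with G_Y ~= V V^T."""
    B, C, D = V, V.T @ V, V.T                      # (G_Y)^2 ~= B C D
    inner = np.linalg.inv(delta * np.linalg.inv(C) + D @ B)
    tmp = (k_Y - B @ (inner @ (D @ k_Y))) / delta  # ((G_Y)^2 + delta*I)^{-1} k_Y (approx.)
    return V @ (V.T @ tmp)                         # multiply by G_Y ~= V V^T
```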


C.1.2 How to Use. Algorithm 5 can be used with algorithm 3 of KMCF by modifying algorithm 3 in the following manner. Compute the low-rank matrices U, V right after lines 4 and 5; this can be done by using, for example, incomplete Cholesky decomposition (Fine & Scheinberg, 2001; Bach & Jordan, 2002). Then replace algorithm 1 in line 15 by algorithm 5.

C.1.3 How to Select the Rank. As discussed in section 4.3, one way of selecting the rank r is to use cross-validation, by regarding r as a hyperparameter of KMCF. Another way is to measure the approximation errors ‖G_X − UU^T‖ and ‖G_Y − VV^T‖ with some matrix norm, such as the Frobenius norm. Indeed, we can compute the smallest rank r such that these errors are below a prespecified threshold, and this can be done efficiently with time complexity O(nr^2) (Bach & Jordan, 2002); a sketch of this procedure follows below.
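As an illustration of this second option, the sketch below runs a pivoted (incomplete) Cholesky factorization and stops at the smallest rank whose residual trace falls below a threshold. The residual trace is used here as a convenient surrogate for the approximation error, and the function name and stopping rule are illustrative assumptions rather than the exact procedure of Bach and Jordan (2002).

```python
import numpy as np

def incomplete_cholesky(G, tol):
    """Pivoted Cholesky of a PSD kernel matrix G, stopped once the
    residual trace tr(G - L L^T) drops below tol. Returns L (n x r)."""
    n = G.shape[0]
    d = np.diag(G).astype(float).copy()   # residual diagonal
    cols = []
    while len(cols) < n:
        if cols and d.sum() <= tol:
            break
        j = int(np.argmax(d))             # pivot: largest residual diagonal
        col = G[:, j].astype(float).copy()
        for c in cols:                    # subtract contribution of previous columns
            col -= c * c[j]
        c_new = col / np.sqrt(d[j])
        cols.append(c_new)
        d -= c_new ** 2
        d[j] = 0.0                        # guard against round-off
    return np.column_stack(cols)

# Usage sketch: U = incomplete_cholesky(G_X, tol); r = U.shape[1]
```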

C.2 Data Reduction with Kernel Herding. Here we describe an approach to reducing the size of the representation of the state-observation examples (X_i, Y_i)_{i=1}^n in an efficient way. By "efficient" we mean that the information contained in (X_i, Y_i)_{i=1}^n will be preserved even after the reduction. Recall that (X_i, Y_i)_{i=1}^n contains the information of the observation model p(y_t|x_t) (recall also that p(y_t|x_t) is assumed time-homogeneous; see section 4.1). This information is used only in algorithm 1 of kernel Bayes' rule (line 15, algorithm 3). Therefore it suffices to consider how kernel Bayes' rule accesses the information contained in the joint sample (X_i, Y_i)_{i=1}^n.

C.2.1 Representation of the Joint Sample. To this end, we need to show how the joint sample (X_i, Y_i)_{i=1}^n can be represented with a kernel mean embedding. Recall that (k_X, H_X) and (k_Y, H_Y) are the kernels and the associated RKHSs on the state space X and the observation space Y, respectively. Let X × Y be the product space of X and Y. Then we can define a kernel k_{X×Y} on X × Y as the product of k_X and k_Y: k_{X×Y}((x, y), (x', y')) = k_X(x, x') k_Y(y, y') for all (x, y), (x', y') ∈ X × Y. This product kernel k_{X×Y} defines an RKHS on X × Y; let H_{X×Y} denote this RKHS. As in section 3, we can use k_{X×Y} and H_{X×Y} for a kernel mean embedding. In particular, the empirical distribution (1/n) Σ_{i=1}^n δ_{(X_i, Y_i)} of the joint sample (X_i, Y_i)_{i=1}^n ⊂ X × Y can be represented as an empirical kernel mean in H_{X×Y}:

$$\hat{m}_{XY} = \frac{1}{n}\sum_{i=1}^n k_{\mathcal{X}\times\mathcal{Y}}((\cdot,\cdot), (X_i, Y_i)) \in \mathcal{H}_{\mathcal{X}\times\mathcal{Y}}. \quad (C.1)$$

This is the representation of the joint sample (X_i, Y_i)_{i=1}^n.

The information of (X_i, Y_i)_{i=1}^n is provided to kernel Bayes' rule essentially through the form of equation C.1 (Fukumizu et al., 2011, 2013). Recall that equation C.1 is a point in the RKHS H_{X×Y}. Any point close to equation C.1 in H_{X×Y} would also contain information close to that contained in equation C.1. Therefore we propose to find a subset (X̄_1, Ȳ_1), ..., (X̄_r, Ȳ_r) ⊂ (X_i, Y_i)_{i=1}^n, where r < n, such that its representation in H_{X×Y},

$$\bar{m}_{XY} = \frac{1}{r}\sum_{i=1}^r k_{\mathcal{X}\times\mathcal{Y}}((\cdot,\cdot), (\bar{X}_i, \bar{Y}_i)) \in \mathcal{H}_{\mathcal{X}\times\mathcal{Y}}, \quad (C.2)$$

is close to equation C.1. Namely, we wish to find subsamples such that ‖m̂_{XY} − m̄_{XY}‖_{H_{X×Y}} is small. If the error ‖m̂_{XY} − m̄_{XY}‖_{H_{X×Y}} is small enough, equation C.2 would provide information close to that given by equation C.1 for kernel Bayes' rule. Thus kernel Bayes' rule based on such subsamples (X̄_i, Ȳ_i)_{i=1}^r would not perform much worse than the one based on the entire set of samples (X_i, Y_i)_{i=1}^n.

C.2.2 Subsampling Method. To find such subsamples, we make use of kernel herding in section 3.5. Namely, we apply the update equations 3.6 and 3.7 to approximate equation C.1, with kernel k_{X×Y} and RKHS H_{X×Y}. We greedily find subsamples D_r = {(X̄_1, Ȳ_1), ..., (X̄_r, Ȳ_r)} from the joint sample D = {(X_i, Y_i)}_{i=1}^n as

$$(\bar{X}_r, \bar{Y}_r) = \arg\max_{(x,y)\in\mathcal{D}\setminus\mathcal{D}_{r-1}}\ \frac{1}{n}\sum_{i=1}^n k_{\mathcal{X}\times\mathcal{Y}}((x,y),(X_i,Y_i)) - \frac{1}{r}\sum_{j=1}^{r-1} k_{\mathcal{X}\times\mathcal{Y}}((x,y),(\bar{X}_j,\bar{Y}_j))$$
$$= \arg\max_{(x,y)\in\mathcal{D}\setminus\mathcal{D}_{r-1}}\ \frac{1}{n}\sum_{i=1}^n k_{\mathcal{X}}(x,X_i)\, k_{\mathcal{Y}}(y,Y_i) - \frac{1}{r}\sum_{j=1}^{r-1} k_{\mathcal{X}}(x,\bar{X}_j)\, k_{\mathcal{Y}}(y,\bar{Y}_j).$$

The resulting algorithm is shown in algorithm 6. The time complexity is O(n^2 r) for selecting r subsamples; a minimal sketch of this greedy selection is given below.
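The following is a minimal sketch of this greedy selection, assuming precomputed kernel matrices KX (with entries k_X(X_i, X_j)) and KY (with entries k_Y(Y_i, Y_j)) over the n training pairs; the function and variable names are illustrative.

```python
import numpy as np

def herding_subsample(KX, KY, r):
    """Greedily select r indices whose joint-kernel mean approximates
    the empirical joint kernel mean (1/n) sum_i k((.,.),(X_i,Y_i))."""
    K = KX * KY                      # joint (product) kernel matrix, n x n
    n = K.shape[0]
    target = K.mean(axis=1)          # (1/n) sum_i k((x_j,y_j),(X_i,Y_i)) for each candidate j
    selected = []
    acc = np.zeros(n)                # running sum_j k((.,.),(Xbar_j,Ybar_j)) at each candidate
    for m in range(1, r + 1):
        scores = target - acc / m    # herding objective for the m-th pick
        scores[selected] = -np.inf   # restrict to D \ D_{m-1}
        j = int(np.argmax(scores))
        selected.append(j)
        acc += K[:, j]
    return selected

# Usage sketch: idx = herding_subsample(KX, KY, r); X_sub, Y_sub = X[idx], Y[idx]
```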

C.2.3 How to Use. By using algorithm 6, we can reduce the time complexity of KMCF (see algorithm 3) in each iteration from O(n^3) to O(r^3). This can be done by obtaining subsamples (X̄_i, Ȳ_i)_{i=1}^r by applying algorithm 6 to (X_i, Y_i)_{i=1}^n, then replacing (X_i, Y_i)_{i=1}^n in the requirements of algorithm 3 by (X̄_i, Ȳ_i)_{i=1}^r and using the number r instead of n.

C.2.4 How to Select the Number of Subsamples. The number r of subsamples determines the trade-off between the accuracy and the computational time of KMCF. It may be selected by cross-validation or by measuring the approximation error ‖m̂_{XY} − m̄_{XY}‖_{H_{X×Y}}, as in the case of selecting the rank of the low-rank approximation in section C.1.

C.2.5 Discussion. Recall that kernel herding generates samples such that they approximate a given kernel mean (see section 3.5). Under certain assumptions, the error of this approximation is of O(r^{-1}) with r samples, which is faster than that of i.i.d. samples, O(r^{-1/2}). This indicates that subsamples (X̄_i, Ȳ_i)_{i=1}^r selected with kernel herding may approximate equation C.1 well. Here, however, we find the solutions of the optimization problems 3.6 and 3.7 from the finite set (X_i, Y_i)_{i=1}^n rather than the entire joint space X × Y, whereas the convergence guarantee is provided only for the case of the entire joint space X × Y; thus, for our case, the convergence guarantee is no longer provided. Moreover, the fast rate O(r^{-1}) is guaranteed only for finite-dimensional RKHSs, and gaussian kernels, which we often use in practice, define infinite-dimensional RKHSs. Therefore the fast rate is not guaranteed if we use gaussian kernels. Nevertheless, we can use algorithm 6 as a heuristic for data reduction.

Acknowledgments

We express our gratitude to the associate editor and the anonymous reviewer for their time and helpful suggestions. We also thank Masashi Shimbo, Momoko Hayamizu, Yoshimasa Uematsu, and Katsuhiro Omae for their helpful comments. This work has been supported in part by MEXT Grant-in-Aid for Scientific Research on Innovative Areas 25120012. M.K. has been supported by JSPS Grant-in-Aid for JSPS Fellows 15J04406.

References

Anderson, B., & Moore, J. (1979). Optimal filtering. Englewood Cliffs, NJ: Prentice Hall.

Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3), 337-404.

Bach, F., & Jordan, M. I. (2002). Kernel independent component analysis. Journal of Machine Learning Research, 3, 1-48.

Bach, F., Lacoste-Julien, S., & Obozinski, G. (2012). On the equivalence between herding and conditional gradient algorithms. In Proceedings of the 29th International Conference on Machine Learning (ICML 2012) (pp. 1359-1366). Madison, WI: Omnipress.

Berlinet, A., & Thomas-Agnan, C. (2004). Reproducing kernel Hilbert spaces in probability and statistics. Dordrecht: Kluwer Academic.

Calvet, L. E., & Czellar, V. (2015). Accurate methods for approximate Bayesian computation filtering. Journal of Financial Econometrics, 13, 798-838. doi:10.1093/jjfinec/nbu019

Cappé, O., Godsill, S. J., & Moulines, E. (2007). An overview of existing methods and recent advances in sequential Monte Carlo. IEEE Proceedings, 95(5), 899-924.

Chen, Y., Welling, M., & Smola, A. (2010). Supersamples from kernel-herding. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence (pp. 109-116). Cambridge, MA: MIT Press.

Deisenroth, M., Huber, M., & Hanebeck, U. (2009). Analytic moment-based gaussian process filtering. In Proceedings of the 26th International Conference on Machine Learning (pp. 225-232). Madison, WI: Omnipress.

Doucet, A., Freitas, N. D., & Gordon, N. J. (Eds.). (2001). Sequential Monte Carlo methods in practice. New York: Springer.

Doucet, A., & Johansen, A. M. (2011). A tutorial on particle filtering and smoothing: Fifteen years later. In D. Crisan & B. Rozovskii (Eds.), The Oxford handbook of nonlinear filtering (pp. 656-704). New York: Oxford University Press.

Durbin, J., & Koopman, S. J. (2012). Time series analysis by state space methods (2nd ed.). New York: Oxford University Press.

Eberts, M., & Steinwart, I. (2013). Optimal regression rates for SVMs using gaussian kernels. Electronic Journal of Statistics, 7, 1-42.

Ferris, B., Hähnel, D., & Fox, D. (2006). Gaussian processes for signal strength-based location estimation. In Proceedings of Robotics: Science and Systems. Cambridge, MA: MIT Press.

Fine, S., & Scheinberg, K. (2001). Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2, 243-264.

Freund, R. M., & Grigas, P. (2014). New analysis and results for the Frank-Wolfe method. Mathematical Programming. doi:10.1007/s10107-014-0841-6

Fukumizu, K., Bach, F., & Jordan, M. I. (2004). Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5, 73-99.

Fukumizu, K., Gretton, A., Sun, X., & Schölkopf, B. (2008). Kernel measures of conditional dependence. In J. C. Platt, D. Koller, Y. Singer, & S. Roweis (Eds.), Advances in neural information processing systems, 20 (pp. 489-496). Cambridge, MA: MIT Press.

Fukumizu, K., Song, L., & Gretton, A. (2011). Kernel Bayes' rule. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. C. N. Pereira, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 24 (pp. 1737-1745). Red Hook, NY: Curran.

Fukumizu, K., Song, L., & Gretton, A. (2013). Kernel Bayes' rule: Bayesian inference with positive definite kernels. Journal of Machine Learning Research, 14, 3753-3783.

Fukumizu, K., Sriperumbudur, B., Gretton, A., & Schölkopf, B. (2009). Characteristic kernels on groups and semigroups. In D. Koller, D. Schuurmans, Y. Bengio, & L. Bottou (Eds.), Advances in neural information processing systems, 21 (pp. 473-480). Cambridge, MA: MIT Press.

Gordon, N. J., Salmond, D. J., & Smith, A. F. M. (1993). Novel approach to nonlinear/non-gaussian Bayesian state estimation. IEE Proceedings F, 140, 107-113.

Hofmann, T., Schölkopf, B., & Smola, A. J. (2008). Kernel methods in machine learning. Annals of Statistics, 36(3), 1171-1220.

Jaggi, M. (2013). Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning (pp. 427-435).

Jasra, A., Singh, S. S., Martin, J. S., & McCoy, E. (2012). Filtering via approximate Bayesian computation. Statistics and Computing, 22, 1223-1237.

Julier, S. J., & Uhlmann, J. K. (1997). A new extension of the Kalman filter to nonlinear systems. In Proceedings of AeroSense: The 11th International Symposium on Aerospace/Defence Sensing, Simulation and Controls. Bellingham, WA: SPIE.

Julier, S. J., & Uhlmann, J. K. (2004). Unscented filtering and nonlinear estimation. IEEE Review, 92, 401-422.

Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Transactions of the ASME, Journal of Basic Engineering, 82, 35-45.

Kanagawa, M., & Fukumizu, K. (2014). Recovering distributions from gaussian RKHS embeddings. In Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (pp. 457-465). JMLR.

Kanagawa, M., Nishiyama, Y., Gretton, A., & Fukumizu, K. (2014). Monte Carlo filtering using kernel embedding of distributions. In Proceedings of the 28th AAAI Conference on Artificial Intelligence (pp. 1897-1903). Cambridge, MA: MIT Press.

Ko, J., & Fox, D. (2009). GP-BayesFilters: Bayesian filtering using gaussian process prediction and observation models. Autonomous Robots, 72(1), 75-90.

Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Vol. 2, pp. 2169-2178). Washington, DC: IEEE Computer Society.

Liu, J. S. (2001). Monte Carlo strategies in scientific computing. New York: Springer-Verlag.

McCalman, L., O'Callaghan, S., & Ramos, F. (2013). Multi-modal estimation with kernel embeddings for learning motion models. In Proceedings of the 2013 IEEE International Conference on Robotics and Automation (pp. 2845-2852). Piscataway, NJ: IEEE.

Pan, S. J., & Yang, Q. (2010). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10), 1345-1359.

Pistohl, T., Ball, T., Schulze-Bonhage, A., Aertsen, A., & Mehring, C. (2008). Prediction of arm movement trajectories from ECoG-recordings in humans. Journal of Neuroscience Methods, 167(1), 105-114.

Pronobis, A., & Caputo, B. (2009). COLD: COsy localization database. International Journal of Robotics Research, 28(5), 588-594.

Quigley, M., Stavens, D., Coates, A., & Thrun, S. (2010). Sub-meter indoor localization in unmodified environments with inexpensive sensors. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 2010 (Vol. 1, pp. 2039-2046). Piscataway, NJ: IEEE.

Ristic, B., Arulampalam, S., & Gordon, N. (2004). Beyond the Kalman filter: Particle filters for tracking applications. Norwood, MA: Artech House.

Schaid, D. J. (2010a). Genomic similarity and kernel methods I: Advancements by building on mathematical and statistical foundations. Human Heredity, 70(2), 109-131.

Schaid, D. J. (2010b). Genomic similarity and kernel methods II: Methods for genomic information. Human Heredity, 70(2), 132-140.

Schalk, G., Kubanek, J., Miller, K. J., Anderson, N. R., Leuthardt, E. C., Ojemann, J. G., ... Wolpaw, J. R. (2007). Decoding two-dimensional movement trajectories using electrocorticographic signals in humans. Journal of Neural Engineering, 4, 264-275.

Schölkopf, B., & Smola, A. J. (2002). Learning with kernels. Cambridge, MA: MIT Press.

Schölkopf, B., Tsuda, K., & Vert, J. P. (2004). Kernel methods in computational biology. Cambridge, MA: MIT Press.

Silverman, B. W. (1986). Density estimation for statistics and data analysis. London: Chapman and Hall.

Smola, A., Gretton, A., Song, L., & Schölkopf, B. (2007). A Hilbert space embedding for distributions. In Proceedings of the International Conference on Algorithmic Learning Theory (pp. 13-31). New York: Springer.

Song, L., Fukumizu, K., & Gretton, A. (2013). Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine, 30(4), 98-111.

Song, L., Huang, J., Smola, A., & Fukumizu, K. (2009). Hilbert space embeddings of conditional distributions with applications to dynamical systems. In Proceedings of the 26th International Conference on Machine Learning (pp. 961-968). Madison, WI: Omnipress.

Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Schölkopf, B., & Lanckriet, G. R. (2010). Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11, 1517-1561.

Steinwart, I., & Christmann, A. (2008). Support vector machines. New York: Springer.

Stone, C. J. (1977). Consistent nonparametric regression. Annals of Statistics, 5(4), 595-620.

Thrun, S., Burgard, W., & Fox, D. (2005). Probabilistic robotics. Cambridge, MA: MIT Press.

Vlassis, N., Terwijn, B., & Kröse, B. (2002). Auxiliary particle filter robot localization from high-dimensional sensor observations. In Proceedings of the International Conference on Robotics and Automation (pp. 7-12). Piscataway, NJ: IEEE.

Wang, Z., Ji, Q., Miller, K. J., & Schalk, G. (2011). Prior knowledge improves decoding of finger flexion from electrocorticographic signals. Frontiers in Neuroscience, 5, 127.

Widom, H. (1963). Asymptotic behavior of the eigenvalues of certain integral equations. Transactions of the American Mathematical Society, 109, 278-295.

Widom, H. (1964). Asymptotic behavior of the eigenvalues of certain integral equations II. Archive for Rational Mechanics and Analysis, 17, 215-229.

Wolf, J., Burgard, W., & Burkhardt, H. (2005). Robust vision-based localization by combining an image retrieval system with Monte Carlo localization. IEEE Transactions on Robotics, 21(2), 208-216.

Received May 18, 2015; accepted October 14, 2015.


n

sumni=1 kX (middot Xi) by defi-

nition so 1n

sumni=1 kX (middot Xi) minus mPHX

is also small This reduces the time

complexity of algorithm 2 to O(n2)One might think that it is unnecessary to copy n times to make n

samples This is not true however Suppose that we just use the first

Filtering with State-Observation Examples 407

samples to define mP = 1

sumi=1 kX (middot Xi) Then the first term of equation

58 becomes 2C which is larger than 2Cn of n samples This differenceinvolves sampling with the conditional distribution X prime

i sim p(middot|Xi) If we usejust the samples sampling is done times If we use the copied n samplessampling is done n times Thus the benefit of making n samples comesfrom sampling with the conditional distribution many times This matchesthe bound of theorem 1 where the first term involves the variance of theconditional distribution

53 Convergence Rates for Resampling Our resampling algorithm (seealgorithm 2) is an approximate version of kernel herding in section 35algorithm 2 searches for the solutions of the update equations 36 and 37from a finite set X1 Xn sub X not from the entire space X Thereforeexisting theoretical guarantees for kernel herding (Chen et al 2010 Bachet al 2012) do not apply to algorithm 2 Here we provide a theoreticaljustification

531 Generalized Version We consider a slightly generalized versionshown in algorithm 4 It takes as input (1) a kernel mean estimator mP of akernel mean mP (2) candidate samples Z1 ZN and (3) the number ofresampling It then outputs resampling samples X1 X isin Z1 ZNwhich form a new estimator mP = 1

sumi=1 kX (middot Xi) Here N is the number

of the candidate samplesAlgorithm 4 searches for solutions of the update equations 36 and 37

from the candidate set Z1 ZN Note that here these samples Z1 ZNcan be different from those expressing the estimator mP If they are thesamemdashthe estimator is expressed as mP = sumn

i=1 wtik(middot Xi) with n = N andXi = Zi (i = 1 n)mdashthen algorithm 4 reduces to algorithm 2 In facttheorem 2 allows mP to be any element in the RKHS

532 Convergence Rates in Terms of N and Algorithm 4 gives the newestimator mP of the kernel mean mP The error of this new estimatormP minus mPHX

should be close to that of the given estimator mP minus mPHX

408 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Theorem 2 guarantees this In particular it provides convergence rates ofmP minus mPHX

approaching mP minus mPHX as N and go to infinity This

theorem follows from theorem 3 in appendix B which holds under weakerassumptions

Theorem 2 Let mP be the kernel mean of a distribution P and mP be any element inthe RKHS HX Let Z1 ZN be an iid sample from a distribution with densityq Assume that P has a density function p such that supxisinX p(x)q (x) lt infin LetX1 X be samples given by algorithm 4 applied to mP with candidate samplesZ1 ZN Then for mP = 1

sumi=1 k(middot Xi ) we have

mP minus mP2HX

= (mP minus mPHX+ Op(Nminus12))2 + O

(ln

) (N rarr infin) (59)

Our proof in appendix B relies on the fact that kernel herding can beseen as the Frank-Wolfe optimization method (Bach et al 2012) Indeedthe error O(ln ) in equation 59 comes from the optimization error of theFrank-Wolfe method after iterations (Freund amp Grigas 2014 bound 32)The error Op(N

minus12) is due to the approximation of the solution space by afinite set Z1 ZN These errors will be small if N and are large enoughand the error of the given estimator mP minus mPHX

is relatively large Thisis formally stated in corollary 1 below

Theorem 2 assumes that the candidate samples are iid with a density qThe assumption supxisinX p(x)q(x) lt infin requires that the support of q con-tains that of p This is a formal characterization of the explanation in section42 that the samples X1 XN should cover the support of P sufficientlyNote that the statement of theorem 2 also holds for non-iid candidatesamples as shown in theorem 3 of appendix B

533 Convergence Rates as mP Goes to mP Theorem 2 provides conver-gence rates when the estimator mP is fixed In corollary 1 below we let mPapproach mP and provide convergence rates for mP of algorithm 4 approach-ing mP This corollary directly follows from theorem 2 since the constantterms in Op(N

minus12) and O(ln ) in equation 59 do not depend on mPwhich can be seen from the proof in section B

Corollary 1 Assume that P and Z1 ZN satisfy the conditions in theorem 2for all N Let m(n)

P be an estimator of mP such that m(n)P minus mPHX

= Op(nminusb) as

n rarr infin for some constant b gt 06 Let N = = n2b Let X(n)1 X(n)

be samples

6Here the estimator m(n)P and the candidate samples Z1 ZN can be dependent

Filtering with State-Observation Examples 409

given by algorithm 4 applied to m(n)P with candidate samples Z1 ZN Then

for m(n)P = 1

sumi=1 kX (middot X(n)

i ) we have

m(n)P minus mPHX

= Op(nminusb) (n rarr infin) (510)

Corollary 1 assumes that the estimator m(n)

P converges to mP at a rateOp(n

minusb) for some constant b gt 0 Then the resulting estimator m(n)

P by algo-rithm 4 also converges to mP at the same rate O(nminusb) if we set N = = n2bThis implies that if we use sufficiently large N and the errors Op(N

minus12)

and O(ln ) in equation 59 can be negligible as stated earlier Note thatN = = n2b implies that N and can be smaller than n since typically wehave b le 12 (b = 12 corresponds to the convergence rates of parametricmodels) This provides a support for the discussion in section 52 (reductionof computational cost)

534 Convergence Rates of Sampling after Resampling We can derive con-vergence rates of the estimator mQ equation 57 in section 52 Here we con-sider the following construction of mQ as discussed in section 52 (reductionof computational cost) First apply algorithm 4 to m(n)

P and obtain resam-pling samples X (n)

1 X (n) isin Z1 ZN Then copy these samples n

times and let X (n)

1 X (n)n be the resulting times n samples Finally

sample with the conditional distribution Xprime(n)i sim p(middot|Xi) (i = 1 n)

and define

m(n)

Q = 1n

nsumi=1

kX (middot Xprime(n)i ) (511)

The following corollary is a consequence of corollary 1 theorem 1 andthe bound equation 58 Note that theorem 1 obtains convergence in expec-tation which implies convergence in probability

Corollary 2 Let θ be the function defined in theorem 1 and assume θ isin HX otimes HX Assume that P and Z1 ZN satisfy the conditions in theorem 2 for all N Letm(n)

P be an estimator of mP such that m(n)P minus mPHX

= Op(nminusb) as n rarr infin forsome constant b gt 0 Let N = = n2b Then for the estimator m(n)

Q defined asequation 511 we have

m(n)Q minus mQHX

= Op(nminusmin(b12)) (n rarr infin)

Suppose b le 12 which holds with basically any nonparametric esti-mators Then corollary 2 shows that the estimator m(n)

Q achieves the sameconvergence rate as the input estimator m(n)

P Note that without resampling

410 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

the rate becomes Op(

radicsumni=1(w

(n)i )2 + nminusb) where the weights are given by

the input estimator m(n)

P = sumni=1 w

(n)i kX (middot X (n)

i ) (see the bound equation55) Thanks to resampling (the square root of) the sum of the squaredweights in the case of corollary 2 becomes 1

radicn le 1

radicn which is

usually smaller thanradicsumn

i=1(w(n)i )2 and is faster than or equal to Op(n

minusb)This shows the merit of resampling in terms of convergence rates (see alsothe discussions in section 52)

54 Consistency of the Overall Procedure Here we show the consis-tency of the overall procedure in KMCF This is based on corollary 2 whichshows the consistency of the resampling step followed by the predictionstep and on theorem 5 of Fukumizu et al (2013) which guarantees theconsistency of kernel Bayesrsquo rule in the correction step Thus we considerthree steps in the following order resampling prediction and correctionMore specifically we show the consistency of the estimator equation 46of the posterior kernel mean at time t given that the one at time t minus 1 isconsistent

To state our assumptions we will need the following functions θpos Y times Y rarr R θobs X times X rarr R and θtra X times X rarr R

θpos(y y) =intint

kX (xt xt )dp(xt |y1tminus1 yt = y)dp(xt |y1tminus1 yt = y)

(512)

θobs(x x) =intint

kY (yt yt )dp(yt |xt = x)dp(yt |xt = x) (513)

θtra(x x) =intint

kX (xt xt )dp(xt |xtminus1 = x)dp(xt |xtminus1 = x) (514)

These functions contain the information concerning the distributions in-volved In equation 512 the distribution p(xt |y1tminus1 yt = y) denotes theposterior of the state at time t given that the observation at time t is yt = ySimilarly p(xt |y1tminus1 yt = y) is the posterior at time t given that the observa-tion is yt = y In equation 513 the distributions p(yt |xt = x) and p(yt |xt = x)

denote the observation model when the state is xt = x or xt = x respectivelyIn equation 514 the distributions p(xt |xtminus1 = x) and p(xt |xtminus1 = x) denotethe transition model with the previous state given by xtminus1 = x or xtminus1 = xrespectively

For simplicity of presentation we consider here N = = n for the resam-pling step Below denote by F otimes G the tensor product space of two RKHSsF and G

Corollary 3 Let (X1 Y1) (Xn Yn) be an iid sample with a joint densityp(x y) = p(y|x)q (x) where p(y|x) is the observation model Assume that the

Filtering with State-Observation Examples 411

posterior p(xt|y1t) has a density p and that supxisinX p(x)q (x) lt infin Assumethat the functions defined by equations 512 to 514 satisfy θpos isin HY otimes HY θobs isin HX otimes HX and θtra isin HX otimes HX respectively Suppose that mxtminus1|y1tminus1

minusmxtminus1|y1tminus1

HXrarr 0 as n rarr infin in probability Then for any sufficiently slow decay

of regularization constants εn and δn of algorithm 1 we have

mxt |y1tminus mxt |y1t

HXrarr 0 (n rarr infin)

in probability

Corollary 3 follows from theorem 5 of Fukumizu et al (2013) and corol-lary 2 The assumptions θpos isin HY otimes HY and θobs isin HX otimes HX are due totheorem 5 of Fukumizu et al (2013) for the correction step while the as-sumption θtra isin HX otimes HX is due to theorem 1 for the prediction step fromwhich corollary 2 follows As we discussed in note 4 of section 51 these es-sentially assume that the functions θpos θobs and θtra are smooth Theorem 5of Fukumizu et al (2013) also requires that the regularization constantsεn δn of kernel Bayesrsquo rule should decay sufficiently slowly as the samplesize goes to infinity (εn δn rarr 0 as n rarr infin) (For details see sections 52 and62 in Fukumizu et al 2013)

It would be more interesting to investigate the convergence rates of theoverall procedure However this requires a refined theoretical analysis ofkernel Bayesrsquo rule which is beyond the scope of this letter This is becausecurrently there is no theoretical result on convergence rates of kernel Bayesrsquorule as an estimator of a posterior kernel mean (existing convergence resultsare for the expectation of function values see theorems 6 and 7 in Fukumizuet al 2013) This remains a topic for future research

6 Experiments

This section is devoted to experiments In section 61 we conduct basicexperiments on the prediction and resampling steps before going on to thefiltering problem Here we consider the problem described in section 5 Insection 62 the proposed KMCF (see algorithm 3) is applied to syntheticstate-space models Comparisons are made with existing methods applica-ble to the setting of the letter (see also section 2) In section 63 we applyKMCF to the real problem of vision-based robot localization

In the following N(μ σ 2) denotes the gaussian distribution with meanμ isin R and variance σ 2 gt 0

61 Sampling and Resampling Procedures The purpose here is to seehow the prediction and resampling steps work empirically To this end weconsider the problem described in section 5 with X = R (see section 51 fordetails) Specifications of the problem are described below

412 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

We will need to evaluate the errors mP minus mPHXand mQ minus mQHX

sowe need to know the true kernel means mP and mQ To this end we definethe distributions and the kernel to be gaussian this allows us to obtainanalytic expressions for mP and mQ

611 Distributions and Kernel More specifically we define the marginalP and the conditional distribution p(middot|x) to be gaussian P = N(0 σ 2

P ) andp(middot|x) = N(x σ 2

cond) Then the resulting Q = intp(middot|x)dP(x) also becomes

gaussian Q = N(0 σ 2P + σ 2

cond) We define kX to be the gaussian kernelkX (x xprime) = exp(minus(x minus xprime)22γ 2) We set σP = σcond = γ = 01

612 Kernel Means Due to the convolution theorem of gaussian func-tions the kernel means mP = int

kX (middot x)dP(x) and mQ = intkX (middot x)dQ(x)

can be analytically computed mP(x) =radic

γ 2

σ 2+γ 2 exp(minus x2

2(γ 2+σ 2P )

) mQ(x) =radicγ 2

(σ 2+σ 2cond+γ 2 )

exp(minus x2

2(σ 2P+σ 2

cond+γ 2 ))

613 Empirical Estimates We artificially defined an estimate mP =sumni=1 wikX (middot Xi) as follows First we generated n = 100 samples

X1 X100 from a uniform distribution on [minusA A] with some A gt 0 (spec-ified below) We computed the weights w1 wn by solving an optimiza-tion problem

minwisinRn

nsum

i=1

wikX (middot Xi) minus mP2H + λw2

and then applied normalization so thatsumn

i=1 wi = 1 Here λ gt 0 is a regular-ization constant which allows us to control the trade-off between the errormP minus mP2

HXand the quantity

sumni=1 w2

i = w2 If λ is very small the re-sulting mP becomes accurate mP minus mP2

HXis small but has large

sumni=1 w2

i If λ is large the error mP minus mP2

HXmay not be very small but

sumni=1 w2

i

becomes small This enables us to see how the error mQ minus mQ2HX

changesas we vary these quantities

614 Comparison Given mP = sumni=1 wikX (middot Xi) we wish to estimate the

kernel mean mQ We compare three estimators

bull woRes Estimate mQ without resampling Generate samples X primei sim

p(middot|Xi) to produce the estimate mQ = sumni=1 wikX (middot X prime

i ) This corre-sponds to the estimator discussed in section 51

bull Res-KH First apply the resampling algorithm of algorithm 2 to mPyielding X1 Xn Then generate X prime

i sim p(middot|Xi) for each Xi giving

Filtering with State-Observation Examples 413

Figure 4 Results of the experiments from section 61 (Top left and right)Sample-weight pairs of mP = sumn

i=1 wikX (middot Xi) and mQ = sumni=1 wik(middot X prime

i ) (Mid-dle left and right) Histogram of samples X1 Xn generated by algorithm2 and that of samples X prime

1 X primen from the conditional distribution (Bottom

left and right) Histogram of samples generated with multinomial resamplingafter truncating negative weights and that of samples from the conditionaldistribution

the estimate mQ = 1n

sumni=1 k(middot X prime

i ) This is the estimator discussed insection 52

bull Res-Trunc Instead of algorithm 2 first truncate negative weights inw1 wn to be 0 and apply normalization to make the sum of theweights to be 1 Then apply the multinomial resampling algorithmof particle methods and estimate mQ as Res-KH

615 Demonstration Before starting quantitative comparisons wedemonstrate how the above estimators work Figure 4 shows demonstra-tion results with A = 1 First note that for mP = sumn

i=1 wik(middot Xi) samplesassociated with large weights are located around the mean of P as thestandard deviation of P is relatively small σP = 01 Note also that some

414 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

of the weights are negative In this example the error of mP is very smallmP minus mP2

HX= 849e minus 10 while that of the estimate mQ given by woRes is

mQ minus mQ2HX

= 0125 This shows that even if mP minus mP2HX

is very small

the resulting mQ minus mQ2HX

may not be small as implied by theorem 1 andthe bound equation 55

We can observe the following First algorithm 2 successfully discardedsamples associated with very small weights Almost all the generatedsamples X1 Xn are located in [minus2σP 2σP] where σP is the standarddeviation of P The error is mP minus mP2

HX= 474e minus 5 which is greater

than mP minus mP2HX

This is due to the additional error caused by the re-sampling algorithm Note that the resulting estimate mQ is of the errormQ minus mQ2

HX= 000827 This is much smaller than the estimate mQ by

woRes showing the merit of the resampling algorithmRes-Trunc first truncated the negative weights in w1 wn Let us

see the region where the density of P is very smallmdashthe region outside[minus2σP 2σP] We can observe that the absolute values of weights are verysmall in this region Note that there exist positive and negative weightsThese weights maintain balance such that the amounts of positive and neg-ative values are almost the same Therefore the truncation of the negativeweights breaks this balance As a result the amount of the positive weightssurpasses the amount needed to represent the density of P This can be seenfrom the histogram for Res-Trunc some of the samples X1 Xn gener-ated by Res-Trunc are located in the region where the density of P is verysmall Thus the resulting error mP minus mP2

HX= 00538 is much larger than

that of Res-KH This demonstrates why the resampling algorithm of parti-cle methods is not appropriate for kernel mean embeddings as discussedin section 42

616 Effects of the Sum of Squared Weights The purpose here is to see howthe error mQ minus mQ2

HXchanges as we vary the quantity

sumni=1 w2

i (recall that

the bound equation 55 indicates that mQ minus mQ2HX

increases assumn

i=1 w2i

increases) To this end we made mP = sumni=1 wikX (middot Xi) for several values of

the regularization constant λ as described above For each λ we constructedmP and estimated mQ using each of the three estimators above We repeatedthis 20 times for each λ and averaged the values of mP minus mP2

HXsumn

i=1 w2i

and the errors mQ minus mQ2HX

by the three estimators Figure 5 shows theseresults where the both axes are in the log scale Here we used A = 5 forthe support of the uniform distribution7 The results are summarized asfollows

7This enables us to maintain the values for mP minus mP2HX

in almost the same amountwhile changing the values for

sumni=1 w2

i

Filtering with State-Observation Examples 415

Figure 5 Results of synthetic experiments for the sampling and resampling pro-cedure in section 61 Vertical axis errors in the squared RKHS norm Horizontalaxis values of

sumni=1 w2

i for different mP Black the error of mP (mP minus mP2HX

)Blue green and red the errors on mQ by woRes Res-KH and Res-Truncrespectively

bull The error of woRes (blue) increases proportionally to the amount ofsumni=1 w2

i This matches the bound equation 55bull The error of Res-KH is not affected by

sumni=1 w2

i Rather it changes inparallel with the error of mP This is explained by the discussions insection 52 on how our resampling algorithm improves the accuracyof the sampling procedure

bull Res-Trunc is worse than Res-KH especially for largesumn

i=1 w2i This

is also explained with the bound equation 58 Here mP is the onegiven by Res-Trunc so the error mP minus mPHX

can be large due tothe truncation of negative weights as shown in the demonstrationresults This makes the resulting error mQ minus mQHX

large

Note that mP and mQ are different kernel means so it can happen thatthe errors mQ minus mQHX

by Res-KH are less than mp minus mPHX as in

Figure 5

416 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

62 Filtering with Synthetic State-Space Models Here we applyKMCF to synthetic state-space models Comparisons were made with thefollowing methods

kNN-PF (Vlassis et al 2002) This method uses k-NN-based condi-tional density estimation (Stone 1977) for learning the observation modelFirst it estimates the conditional density of the inverse direction p(x|y)

from the training sample (XiYi) The learned conditional density is thenused as an alternative for the likelihood p(yt |xt ) this is a heuristic todeal with high-dimensional yt Then it applies particle filter (PF) basedon the approximated observation model and the given transition modelp(xt |xtminus1)

GP-PF (Ferris et al 2006) This method learns p(yt |xt ) from (XiYi)with gaussian process (GP) regression Then particle filter is applied basedon the learned observation model and the transition model We used theopen source code for GP-regression in this experiment so comparison incomputational time is omitted for this method8

KBR filter (Fukumizu et al 2011 2013) This method is also based onkernel mean embeddings as is KMCF It applies kernel Bayesrsquo rule (KBR) inposterior estimation using the joint sample (XiYi) This method assumesthat there also exist training samples for the transition model Thus inthe following experiments we additionally drew training samples for thetransition model Fukumizu et al (2011 2013) showed that this methodoutperforms extended and unscented Kalman filters when a state-spacemodel has strong nonlinearity (in that experiment these Kalman filterswere given the full knowledge of a state-space model) We use this methodas a baseline

We used state-space models defined in Table 2 where ldquoSSMrdquo stands forldquostate-space modelrdquo In Table 2 ut denotes a control input at time t vtand wt denotes independent gaussian noise vt wt sim N(0 1) Wt denotes10-dimensional gaussian noise Wt sim N(0 I10) We generated each controlut randomly from the gaussian distribution N(0 1)

The state and observation spaces for SSMs 1a 1b 2a 2b 4a 4b aredefined as X = Y = R for SSMs 3a 3b X = RY = R

10 The models inSSMs 1a 2a 3a 4a and SSMs 1b 2b 3b 4b with the same number (eg1a and 1b) are almost the same the difference is whether ut exists in thetransition model Prior distributions for the initial state x1 for SSMs 1a 1b2a 2b 3a 3b are defined as pinit = N(0 1(1 minus 092)) and those for 4a4b are defined as a uniform distribution on [minus3 3]

SSMs 1a and 1b are linear gaussian models SSMs 2a and 2b are the so-called stochastic volatility models Their transition models are the same asthose of SSMs 1a and 1b The observation model has strong nonlinearityand the noise wt is multiplicative SSMs 3a and 3b are almost the same as

8httpwwwgaussianprocessorggpmlcodematlabdoc

Filtering with State-Observation Examples 417

Table 2 State-Space Models for Synthetic Experiments

SSM Transition Model Observation Model

1a xt = 09xtminus1 + vt yt = xt + wt

1b xt = 09xtminus1 + 1radic2(ut + vt ) yt = xt + wt

2a xt = 09xtminus1 + vt yt = 05 exp(xt2)wt

2b xt = 09xtminus1 + 1radic2(ut + vt ) yt = 05 exp(xt2)wt

3a xt = 09xtminus1 + vt yt = 05 exp(xt2)Wt

3b xt = 09xtminus1 + 1radic2(ut + vt ) yt = 05 exp(xt2)Wt

4a at = xtminus1 + radic2vt bt = xt + wt

xt =

at (if |at | le 3)

minus3 (otherwise)yt =

bt (if |bt | le 3)

bt minus 6bt|bt | (otherwise)

4b at = xtminus1 + ut + vt bt = xt + wt

xt =

at (if |at | le 3)

minus3 (otherwise)yt =

bt (if |bt | le 3)

bt minus 6bt|bt | (otherwise)

SSMs 2a and 2b The difference is that the observation yt is 10-dimensionalas Wt is 10-dimensional gaussian noise SSMs 4a and 4b are more complexthan the other models Both the transition and observation models havestrong nonlinearities states and observations located around the edges ofthe interval [minus3 3] may abruptly jump to distant places

For each model we generated the training samples (XiYi)ni=1 by sim-

ulating the model Test data (xt yt )Tt=1 were also generated by indepen-

dent simulation (recall that xt is hidden for each method) The length ofthe test sequence was set as T = 100 We fixed the number of particles inkNN-PF and GP-PF to 5000 in primary experiments we did not observeany improvements even when more particles were used For the same rea-son we fixed the size of transition examples for KBR filter to 1000 Eachmethod estimated the ground-truth states x1 xT by estimating the pos-terior means

intxt p(xt |y1t )dxt (t = 1 T ) The performance was evaluated

with RMSE (root mean squared errors) of the point estimates defined as

RMSE =radic

1T

sumTt=1(xt minus xt )

2 where xt is the point estimateFor KMCF and KBR filter we used gaussian kernels for each of X and Y

(and also for controls in KBR filter) We determined the hyperparameters ofeach method by two-fold cross-validation by dividing the training data intotwo sequences The hyperparameters in the GP-regressor for PF-GP wereoptimized by maximizing the marginal likelihood of the training data Toreduce the costs of the resampling step of KMCF we used the method dis-cussed in section 52 with = 50 We also used the low-rank approximationmethod (see algorithm 5) and the subsampling method (see algorithm 6)in appendix C to reduce the computational costs of KMCF Specifically we

418 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Figure 6 RMSE of the synthetic experiments in section 62 The state-spacemodels of these figures have no control in their transition models

used r = 10 20 (rank of low-rank matrices) for algorithm 5 (described asKMCF-low10 and KMCF-low20 in the results below) r = 50 100 (numberof subsamples) for algorithm 6 (described as KMCF-sub50 and KMCF-sub100) We repeated experiments 20 times for each of different trainingsample size n

Figure 6 shows the results in RMSE for SSMs 1a 2a 3a 4a and Figure 7shows those for SSMs 1b 2b 3b 4b Figure 8 describes the results incomputational time for SSMs 1a and 1b the results for the other models aresimilar so we omit them We do not show the results of KMCF-low10 inFigures 6 and 7 since they were numerically unstable and gave very largeRMSEs

GP-PF performed the best for SSMs 1a and 1b This may be becausethese models fit the assumption of GP-regression as their noise is addi-tive gaussian For the other models however GP-PF performed poorly

Filtering with State-Observation Examples 419

Figure 7 RMSE of synthetic experiments in section 62 The state-space modelsof these figures include control ut in their transition models

Figure 8 Computation time of synthetic experiments in section 62 (Left) SSM1a (Right) SSM 1b

420 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

the observation models of these models have strong nonlinearities and thenoise is not additive gaussian For these models KMCF performed thebest or competitively with the other methods This indicates that KMCFsuccessfully exploits the state-observation examples (XiYi)n

i=1 in dealingwith the complicated observation models Recall that our focus has beenon situations where the relations between states and observations are socomplicated that the observation model is not known the results indicatethat KMCF is promising for such situations On the other hand the KBRfilter performed worse than KMCF for most of the models KBF filter alsouses kernel Bayesrsquo rule as KMCF The difference is that KMCF makes use ofthe transition models directly by sampling while the KBR filter must learnthe transition models from training data for state transitions This indicatesthat the incorporation of the knowledge expressed in the transition model isvery important for the filtering performance This can also be seen by com-paring Figures 6 and 7 The performance of the methods other than KBRfilter improved for SSMs 1b 2b 3b 4b compared to the performance forthe corresponding models in SSMs 1a 2a 3a 4a Recall that SSMs 1b2b 3b 4b include control ut in their transition models The informationof control input is helpful for filtering in general Thus the improvementssuggest that KMCF kNN-PF and GP-PF successfully incorporate the in-formation of controls they achieve this by sampling with p(xt |xtminus1 ut ) Onthe other hand KBF filter must learn the transition model p(xt |xtminus1 ut ) thiscan be harder than learning the transition model p(xt |xtminus1) which has nocontrol input

We next compare computation time (see Figure 8) KMCF was competi-tive or even slower than the KBR filter This is due to the resampling step inKMCF The speed-up methods (KMCF-low10 KMCF-low20 KMCF-sub50and KMCF-sub100) successfully reduced the costs of KMCF KMCF-low10and KMCF-low20 scaled linearly to the sample size n this matches the factthat algorithm 5 reduces the costs of kernel Bayesrsquo Rule to O(nr2) The costsof KMCF-sub50 and KMCF-sub100 remained almost the same over the dif-ferent sample sizes This is because they reduce the sample size itself fromn to r so the costs are reduced to O(r3) (see algorithm 6) KMCF-sub50 andKMCF-sub100 are competitive to kNN-PF which is fast as it only needskNN searches to deal with the training sample (XiYi)n

i=1 In Figures 6and 7 KMCF-low20 and KMCF-sub100 produced the results competitiveto KMCF for SSMs 1a 2a 4a 1b 2b 4b Thus for these models suchmethods reduce the computational costs of KMCF without losing muchaccuracy KMCF-sub50 was slightly worse than KMCF-100 This indicatesthat the number of subsamples cannot be reduced to this extent if we wishto maintain accuracy For SSMs 3a and 3b the performance of KMCF-low20and KMCF-sub100 was worse than KMCF in contrast to the performancefor the other models The difference of SSMs 3a and 3b from the other mod-els is that the observation space is 10-dimensional Y = R

10 This suggeststhat if the dimension is high r needs to be large to maintain accuracy (recall

Filtering with State-Observation Examples 421

that r is the rank of low-rank matrices in algorithm 5 and the number ofsubsamples in algorithm 6) This is also implied by the experiments in thesection 63

63 Vision-Based Mobile Robot Localization We applied KMCF to theproblem of vision-based mobile robot localization (Vlassis et al 2002 Wolfet al 2005 Quigley et al 2010) We consider a robot moving in a buildingThe robot takes images with its vision camera as it moves Thus the visionimages form a sequence of observations y1 yT in time series each ytis an image The robot does not know its positions in the building wedefine state xt as the robotrsquos position at time t The robot wishes to estimateits position xt from the sequence of its vision images y1 yt This canbe done by filtering that is by estimating the posteriors p(xt |y1 yt )

(t = 1 T ) This is the robot localization problem It is fundamental inrobotics as a basis for more involved applications such as navigation andreinforcement learning (Thrun Burgard amp Fox 2005)

The state-space model is defined as follows The observation modelp(yt |xt ) is the conditional distribution of images given position which isvery complicated and considered unknown We need to assume position-image examples (XiYi)n

i=1 these samples are given in the data setdescribed below The transition model p(xt |xtminus1) = p(xt |xtminus1 ut ) is the con-ditional distribution of the current position given the previous one Thisinvolves a control input ut that specifies the movement of the robot In thedata set we use the control is given as odometry measurements Thus wedefine p(xt |xtminus1 ut ) as the odometry motion model which is fairly standardin robotics (Thrun et al 2005) Specifically we used the algorithm describedin table 56 of Thrun et al (2005) with all of its parameters fixed to 01 Theprior pinit of the initial position x1 is defined as a uniform distribution overthe samples X1 Xn in (XiYi)n

i=1As a kernel kY for observations (images) we used the spatial pyramid

matching kernel of Lazebnik et al (2006) This is a positive-definite kerneldeveloped in the computer vision community and is also fairly standardSpecifically we set the parameters of this kernel as suggested in Lazebniket al (2006) This gives a 4200-dimensional histogram for each image Wedefined the kernel kX for states (positions) as gaussian Here the state spaceis the four-dimensional space X = R

4 two dimensions for location and therest for the orientation of the robot9

The data set we used is the COLD database (Pronobis amp Caputo 2009)which is publicly available Specifically we used the data set Freiburg PartA Path 1 cloudy This data set consists of three similar trajectories of arobot moving in a building each of which provides position-image pairs(xt yt )T

t=1 We used two trajectories for training and validation and the

9We projected the robotrsquos orientation in [0 2π ] onto the unit circle in R2

422 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

rest for test We made state-observation examples (XiYi)ni=1 by randomly

subsampling the pairs in the trajectory for training Note that the difficultyof localization may depend on the time interval (ie the interval between tand t minus 1 in sec) Therefore we made three test sets (and training samplesfor state transitions in KBR filter) with different time intervals 227 sec(T = 168) 454 sec (T = 84) and 681 sec (T = 56)

In these experiments we compared KMCF with three methods kNN-PFKBR filter and the naive method (NAI) defined below For KBR filter wealso defined the gaussian kernel on the control ut that is on the differenceof odometry measurements at time t minus 1 and t The naive method (NAI)estimates the state xt as a point Xj in the training set (XiYi) such that thecorresponding observation Yj is closest to the observation yt We performedthis as a baseline We also used the spatial pyramid matching kernel forthese methods (for kNN-PF and NAI as a similarity measure of the nearestneighbors search) We did not compare with GP-PF since it assumes thatobservations are real vectors and thus cannot be applied to this problemstraightforwardly We determined the hyperparameters in each methodby cross-validation To reduce the cost of the resampling step in KMCFwe used the method discussed in section 52 with = 100 The low-rankapproximation method (see algorithm 5) and the subsampling method (seealgorithm 6) were also applied to reduce the computational costs of KMCFSpecifically we set r = 50 100 for algorithm 5 (described as KMCF-low50and KMCF-low100 in the results below) and r = 150 300 for algorithm 6(KMCF-sub150 and KMCF-sub300)

Note that in this problem the posteriors p(xt |y1t ) can be highly multi-modal This is because similar images appear in distant locations Thereforethe posterior mean

intxt p(xt |y1t )dxt is not appropriate for point estimation

of the ground-truth position xt Thus for KMCF and KBR filter we em-ployed the heuristic for mode estimation explained in section 44 For kNN-PF we used a particle with maximum weight for the point estimationWe evaluated the performance of each method by RMSE of location esti-mates We ran each experiment 20 times for each training set of differentsize

631 Results First we demonstrate the behaviors of KMCF with this lo-calization problem Figures 9 and 10 show iterations of KMCF with n = 400applied to the test data with time interval 681 sec Figure 9 illustrates itera-tions that produced accurate estimates while Figure 10 describes situationswhere location estimation is difficult

Figures 11 and 12 show the results in RMSE and computational timerespectively For all the results KMCF and that with the computational re-duction methods (KMCF-low50 KMCF-low100 KMCF-sub150 and KMCF-sub300) performed better than KBR filter These results show the bene-fit of directly manipulating the transition models with sampling KMCF

Filtering with State-Observation Examples 423

Figure 9 Demonstration results Each column corresponds to one iteration ofKMCF (Top) Prediction step histogram of samples for prior (Middle) Cor-rection step weighted samples for posterior The blue and red stems indi-cate positive and negative weights respectively The yellow ball representsthe ground-truth location xt and the green diamond the estimated one xt (Bottom) Resampling step histogram of samples given by the resampling step

was competitive with kNN-PF for the interval 227 sec note that kNN-PFwas originally proposed for the robot localization problem For the resultswith the longer time intervals (454 sec and 681 sec) KMCF outperformedkNN-PF

We next investigate the effect on KMCF of the methods to reduce com-putational cost The performance of KMCF-low100 and KMCF-sub300 iscompetitive with KMCF those of KMCF-low50 and KMCF-sub150 de-grade as the sample size increases Note that r = 50 100 for algorithm 5are larger than those in section 62 though the values of the sample size nare larger than those in section 62 Also note that the performance of KMCF-sub150 is much worse than KMCF-sub300 These results indicate that wemay need large values for r to maintain the accuracy for this localizationproblem Recall that the spatial pyramid matching kernel gives essentially ahigh-dimensional feature vector (histogram) for each observation Thus the

424 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Figure 10 Demonstration results (see also the caption of Figure 9) Here weshow time points where observed images are similar to those in distant placesSuch a situation often occurs at corners and makes location estimation difficult(a) The prior estimate is reasonable but the resulting posterior has modes indistant places This makes the location estimate (green diamond) far from thetrue location (yellow ball) (b) While the location estimate is very accuratemodes also appear at distant locations

observation space Y may be considered high-dimensional This supportsthe hypothesis in section 62 that if the dimension is high the computationalcost-reduction methods may require larger r to maintain accuracy

Finally let us look at the results in computation time (see Figure 12)The results are similar to those in section 62 Although the values for r arerelatively large algorithms 5 and 6 successfully reduced the computationalcosts of KMCF

7 Conclusion and Future Work

This letter has proposed kernel Monte Carlo filter a novel filtering methodfor state-space models We have considered the situation where the obser-vation model is not known explicitly or even parametrically and where

Filtering with State-Observation Examples 425

Figure 11 RMSE of the robot localization experiments in section 63 Panels a band c show the cases for time interval 227 sec 454 sec and 681 sec respectively

examples of the state-observation relation are given instead of the obser-vation model Our approach is based on the framework of kernel meanembeddings which enables us to deal with the observation model in adata-driven manner The resulting filtering method consists of the predic-tion correction and resampling steps all realized in terms of kernel meanembeddings Methodological novelties lie in the prediction and resamplingsteps Thus we analyzed their behaviors by deriving error bounds for theestimator of the prediction step The analysis revealed that the effectivesample size of a weighted sample plays an important role as in parti-cle methods This analysis also explained how our resampling algorithmworks We applied the proposed method to synthetic and real problemsconfirming the effectiveness of our approach

426 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Figure 12 Computation time of the localization experiments in section 63Panels a b and c show the cases for time interval 227 sec 454 sec and 681 secrespectively Note that the results show the runtime of each method

One interesting topic for future research would be parameter estimationfor the transition model We did not discuss this and assumed that param-eters if they exist are given and fixed If the state observation examples(XiYi)n

i=1 are given as a sequence from the state-space model then wecan use the state samples X1 Xn for estimating those parameters Oth-erwise we need to estimate the parameters based on test data This mightbe possible by exploiting approaches for parameter estimation in particlemethods (eg section IV in Cappe et al (2007))

Another important topic is the situation where the observation modelin the test and training phases is different As discussed in section 43 this

Filtering with State-Observation Examples 427

might be addressed by exploiting the framework of transfer learning (Panamp Yang 2010) This would require extension of kernel mean embeddingsto the setting of transfer learning since there has been no work in thisdirection We consider that such extension is interesting in its own right

Appendix A Proof of Theorem 1

Before going to the proof we review some basic facts that are neededLet mP = int

kX (middot x)dP(x) and mP = sumni=1 wikX (middot Xi) By the reproducing

property of the kernel kX the following hold for any f isin HX

〈mP f 〉HX=

langintkX (middot x)dP(x) f

rangHX

=int

〈kX (middot x) f 〉HXdP(x)

=int

f (x)dP(x) = EXsimP[ f (X)] (A1)

〈mP f 〉HX=

langnsum

i=1

wikX (middot Xi) f

rangHX

=nsum

i=1

wi f (Xi) (A2)

For any f g isin HX we denote by f otimes g isin HX otimes HX the tensor productof f and g defined as

f otimes g(x1 x2) = f (x1)g(x2) forallx1 x2 isin X (A3)

The inner product of the tensor RKHS HX otimes HX satisfies

〈 f1 otimesg1 f2 otimes g2〉HXotimesHX= 〈 f1 f2〉HX

〈g1 g2〉HXforall f1 f2 g1 g2 isinHX

(A4)

Let φiIs=1 sub HX be complete orthonormal bases of HX where I isin N cup infin

Assume θ isin HX otimes HX (recall that this is an assumption of theorem 1) Thenθ is expressed as

θ =Isum

st=1

αstφs otimes φt (A5)

withsum

st |αst |2 lt infin (see Aronszajn 1950)

Proof of Theorem 1 Recall that mQ = sumni=1 wikX (middot X prime

i ) where X primei sim

p(middot|Xi) (i = 1 n) Then

EX prime1X

primen[mQ minus mQ2

HX]

428 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

= EX prime1X

primen[〈mQ mQ〉HX

minus 2〈mQ mQ〉HX+ 〈mQ mQ〉HX

]

=nsum

i j=1

wiw jEX primei X

primej[kX (X prime

i X primej)]

minus 2nsum

i=1

wiEX primesimQX primei[kX (X prime X prime

i )] + EX primeX primesimQ[kX (X prime X prime)]

=sumi = j

wiw jEX primei X

primej[kX (X prime

i X primej)] +

nsumi=1

w2i EX prime

i[kX (X prime

i X primei )]

minus 2nsum

i=1

wiEX primesimQX primei[kX (X prime X prime

i )] + EX primeX primesimQ[kX (X prime X prime)] (A6)

where X prime denotes an independent copy of X primeRecall that Q= int

p(middot|x)dP(x) and θ (x x) = int intkX (xprime xprime)dp(xprime|x)dp(xprime|x)

We can then rewrite terms in equation A6 as

EX primesimQX primei[kX (X prime X prime

i )]

=int (intint

kX (xprime xprimei)dp(xprime|x)dp(xprime

i|Xi)

)dP(x)

=int

θ (x Xi)dP(x) = EXsimP[θ (X Xi)]

EX primeX primesimQ[kX (X prime X prime)]

=intint (intint

kX (xprime xprime)dp(xprime|x)p(xprime|x)

)dP(x)dP(x)

=intint

θ (x x)dP(x)dP(x) = EXXsimP[θ (X X)]

Thus equation A6 is equal to

nsumi=1

w2i (EX prime

i[kX (X prime

i X primei )] minus EX prime

i Xprimei[kX (X prime

i X primei )]) +

nsumi j=1

wiw jθ (Xi Xj)

minus 2nsum

i=1

wiEXsimP[θ (X Xi)] + EXXsimP[θ (X X)] (A7)

Filtering with State-Observation Examples 429

We can rewrite terms in equation A7 as follows using the facts A1 A2A3 A4 and A5

sumi j

wiw jθ (Xi Xj) =sumi j

wiw j

sumst

αstφs(Xi)φt (Xj)

=sumst

αst

sumi

wiφs(Xi)sum

j

w jφt(Xj) =sumst

αst〈mP φs〉HX〈mP φt〉HX

=sumst

αst〈mP otimes mP φs otimes φt〉HX otimesHX= 〈mP otimes mP θ〉HX otimesHX

sumi

wiEXsimP[θ (X Xi)] =sum

i

wiEXsimP

[sumst

αstφs(X)φt (Xi)

]

=sumst

αstEXsimP[φs(X)]sum

i

wiφt (Xi) =sumst

αst〈mP φs〉HX〈mP φt〉HX

=sumst

αst〈mP otimes mP φs otimes φt〉HX otimesHX= 〈mP otimes mP θ〉HX otimesHX

EXXsimP[θ (X X)] = EXXsimP

[sumst

αstφs(X)φt (X)

]

=sumst

αst〈mP φs〉HX〈mP φt〉HX

=sumst

αst〈mP otimes mP φs otimes φt〉HX otimesHX

= 〈mP otimes mP θ〉HX otimesHX

Thus equation A7 is equal to

nsumi=1

w2i (EX prime

i[kX (X prime

i X primei )] minus EX prime

i Xprimei[kX (X prime

i X primei )]) + 〈mP otimes mP θ〉HX otimesHX

minus 2〈mP otimes mP θ〉HX otimesHX+ 〈mP otimes mP θ〉HX otimesHX

=nsum

i=1

w2i (EX prime

i[kX (X prime

i X primei )] minus EX prime

i Xprimei[kX (X prime

i X primei )])

+〈(mP minus mP) otimes (mP minus mP) θ〉HX otimesHX

Finally the Cauchy-Schwartz inequality gives

〈(mP minus mP) otimes (mP minus mP) θ〉HX otimesHXle mP minus mP2

HXθHX otimesHX

This completes the proof

430 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Appendix B Proof of Theorem 2

Theorem 2 provides convergence rates for the resampling algorithmAlgorithm 4 This theorem assumes that the candidate samples Z1 ZNfor resampling are iid with a density q Here we prove theorem 2 byshowing that the same statement holds under weaker assumptions (seetheorem 3 below)

We first describe assumptions Let P be the distribution of the kernelmean mP and L2(P) be the Hilbert space of square-integrable functions onX with respect to P For any f isin L2(P) we write its norm by fL2(P) =int

f 2(x)dP(x)

Assumption 1 The candidate samples Z1 ZN are independent Thereare probability distributions Q1 QN on X such that for any boundedmeasurable function g X rarr R we have

E

⎡⎣ 1

N minus 1

sumj =i

g(Zj)

⎤⎦ = EXsimQi

[g(X)] (i = 1 N) (B1)

Assumption 2 The distributions Q1 QN have density functions q1

qN respectively Define Q = 1N

sumNi=1 Qi and q = 1

N

sumNi=1 qi There is a con-

stant A gt 0 that does not depend on N such that

∥∥∥∥qi

qminus 1

∥∥∥∥2

L2(P)

le AradicN

(i = 1 N) (B2)

Assumption 3 The distribution P has a density function p such thatsupxisinX

p(x)

q(x)lt infin There is a constant σ gt 0 such that

radicN

(1N

Nsumi=1

p(Zi)

q(Zi)minus 1

)DrarrN (0 σ 2) (B3)

whereDrarr denotes convergence in distribution and N (0 σ 2) the normal

distribution with mean 0 and variance σ 2

These assumptions are weaker than those in theorem 2 which requirethat Z1 ZN be iid For example assumption 1 is clearly satisfied for theiid case since in this case we have Q = Q1= middot middot middot = QN The inequalityequation B2 in assumption 2 requires that the distributions Q1 QN getsimilar as the sample size increases This is also satisfied under the iid

Filtering with State-Observation Examples 431

assumption Likewise the convergence equation B3 in assumption 3 issatisfied from the central limit theorem if Z1 ZN are iid

We will need the following lemma

Lemma 1 Let Z1 ZN be samples satisfying assumption 1 Then the followingholds for any bounded measurable function g X rarr R

E

[1N

Nsumi=1

g(Zi )

]=

intg(x)d Q(x)

Proof

E

[1N

Nsumi=1

g(Zi)

]= E

⎡⎣ 1

N(N minus 1)

Nsumi=1

sumj =i

g(Zj)

⎤⎦

= 1N

Nsumi=1

E

⎡⎣ 1

N minus 1

sumj =i

g(Zj)

⎤⎦ = 1

N

Nsumi=1

intg(x)Qi(x) =

intg(x)dQ(x)

The following theorem shows the convergence rates of our resampling al-gorithm Note that it does not assume that the candidate samples Z1 ZNare identical to those expressing the estimator mP

Theorem 3 Let k be a bounded positive definite kernel and H be the asso-ciated RKHS Let Z1 ZN be candidate samples satisfying assumptions 12 and 3 Let P be a probability distribution satisfying assumption 3 and letmP =

intk(middot x)d P(x) be the kernel mean Let mP isin H be any element in H Sup-

pose we apply algorithm 4 to mP isin H with candidate samples Z1 ZN and letX1 X isin Z1 ZN be the resulting samples Then the following holds

∥∥∥∥∥mP minus 1

sumi=1

k(middot Xi )

∥∥∥∥∥2

H

= (mP minus mPH + Op(Nminus12))2 + O(

ln

)

Proof Our proof is based on the fact (Bach et al 2012) that kernel herdingcan be seen as the Frank-Wolfe optimization method with step size 1( + 1)

for the th iteration For details of the Frank-Wolfe method we refer to Jaggi(2013) and Freund and Grigas (2014)

Fix the samples Z1 ZN Let MN be the convex hull of the set k(middot Z1)

k(middot ZN ) sub H Define a loss function J H rarr R by

J(g) = 12g minus mP2

H g isin H (B4)

432 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Then algorithm 4 can be seen as the Frank-Wolfe method that iterativelyminimizes this loss function over the convex hull MN

infgisinMN

J(g)

More precisely the Frank-Wolfe method solves this problem by the follow-ing iterations

s = arg mingisinMN

〈gnablaJ(gminus1)〉H

g = (1 minus γ )gminus1 + γ s ( ge 1)

where γ is a step size defined as γ = 1 and nablaJ(gminus1) is the gradient of J atgminus1 nablaJ(gminus1) = gminus1 minus mP Here the initial point is defined as g0 = 0 It canbe easily shown that g = 1

sumi=1 k(middot Xi) where X1 X are the samples

given by algorithm 4 (For details see Bach et al 2012)Let LJMN

gt 0 be the Lipschitz constant of the gradient nablaJ over MN andDiam MN gt 0 be the diameter of MN

LJMN= sup

g1g2isinMN

nablaJ(g1) minus nablaJ(g2)Hg1 minus g2H

= supg1g2isinMN

g1 minus g2Hg1 minus g2H

= 1 (B5)

Diam MN = supg1g2isinMN

g1 minus g2H

le supg1g2isinMN

g1H + g2H le 2C (B6)

where C = supxisinX k(middot x)H = supxisinXradic

k(x x) lt infinFrom bound 32 and equation 8 of Freund and Grigas (2014) we then

have

J(g) minus infgisinMN

J(g) leLJMN

(Diam MN)2(1 + ln )

2(B7)

le 2C2(1 + ln )

(B8)

where the last inequality follows from equations B5 and B6

Filtering with State-Observation Examples 433

Note that the upper bound of equation B8 does not depend on thecandidate samples Z1 ZN Hence combined with equation B4 thefollowing holds for any choice of Z1 ZN

∥∥∥∥∥mP minus 1

sumi=1

k(middot Xi)

∥∥∥∥∥2

H

le infgisinMN

mP minus g2H + 4C2(1 + ln )

(B9)

Below we will focus on bounding the first term of equation B9 Recallhere that Z1 ZN are random samples Define a random variable SN =sumN

i=1p(Zi )

q(Zi ) Since MN is the convex hull of the k(middot Z1) k(middot ZN ) we

have

infgisinMN

mP minus gH

= infαisinRN αge0

sumi αile1

∥∥∥∥∥mP minussum

i

αik(middot Zi)

∥∥∥∥∥H

le∥∥∥∥∥mP minus 1

SN

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

le mP minus mPH +∥∥∥∥∥mP minus 1

N

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

+∥∥∥∥∥ 1

N

sumi

p(Zi)

q(Zi)k(middot Zi) minus 1

SN

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

Therefore we have

mP minus 1

sumi=1

k(middot Xi)2H

le(

mP minus mPH +∥∥∥∥∥mP minus 1

N

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

+∥∥∥∥∥ 1

N

sumi

p(Zi)

q(Zi)k(middot Zi) minus 1

SN

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

)2

+ O(

ln

)

(B10)

Below we derive rates of convergence for the second and third terms

434 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

For the second term we derive a rate of convergence in expectationwhich implies a rate of convergence in probability To this end we use thefollowing fact Let f isin H be any function in the RKHS By the assumptionsupxisinX

p(x)

q(x)lt infin and the boundedness of k functions x rarr p(x)

q(x)f (x) and

x rarr (p(x)

q(x))2 f (x) are bounded

E

⎡⎣

∥∥∥∥∥mP minus 1N

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥2

H

⎤⎦

= mP2H minus 2E

[1N

sumi

p(Zi)

q(Zi)mP(Zi)

]

+ E

⎡⎣ 1

N2

sumi

sumj

p(Zi)

q(Zi)

p(Zj)

q(Zj)k(Zi Zj)

⎤⎦

= mP2H minus 2

intp(x)

q(x)mP(x)q(x)dx

+ E

⎡⎣ 1

N2

sumi

sumj =i

p(Zi)

q(Zi)

p(Zj)

q(Zj)k(Zi Zj)

⎤⎦

+E

[1

N2

sumi

(p(Zi)

q(Zi)

)2

k(Zi Zi)

]

= mP2H minus 2mP2

H + E

[N minus 1

N2

sumi

p(Zi)

q(Zi)

intp(x)

q(x)k(Zi x)qi(x)dx

]

+ 1N

int (p(x)

q(x)

)2

k(x x)q(x)dx

= minusmP2H + E

[N minus 1

N2

sumi

p(Zi)

q(Zi)

intp(x)

q(x)k(Zi x)qi(x)dx

]

+ 1N

intp(x)

q(x)k(x x)dP(x)

We further rewrite the second term of the last equality as follows

E

[N minus 1

N2

sumi

p(Zi)

q(Zi)

intp(x)

q(x)k(Zi x)qi(x)dx

]

Filtering with State-Observation Examples 435

= E

[N minus 1

N2

sumi

p(Zi)

q(Zi)

intp(x)

q(x)k(Zi x)(qi(x) minus q(x))dx

]

+ E

[N minus 1

N2

sumi

p(Zi)

q(Zi)

intp(x)

q(x)k(Zi x)q(x)dx

]

= E

[N minus 1

N2

sumi

p(Zi)

q(Zi)

int radicp(x)k(Zi x)

radicp(x)

(qi(x)

q(x)minus 1

)dx

]

+ N minus 1N

mP2H

le E

[N minus 1

N2

sumi

p(Zi)

q(Zi)k(Zi middot)L2(P)

∥∥∥∥qi(x)

q(x)minus 1

∥∥∥∥L2(P)

]+ N minus 1

NmP2

H

le E

[N minus 1

N3

sumi

p(Zi)

q(Zi)C2A

]+ N minus 1

NmP2

H

= C2A(N minus 1)

N2 + N minus 1N

mP2H

where the first inequality follows from Cauchy-Schwartz Using this weobtain

E[

∥∥∥∥∥mP minus 1N

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥2

H

le 1N

(intp(x)

q(x)k(x x)dP(x) minus mP2

H

)+ C2(N minus 1)A

N2

= O(Nminus1)

Therefore we have

∥∥∥∥∥mP minus 1N

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

= Op(Nminus12) (N rarr infin) (B11)

We can bound the third term as follows∥∥∥∥∥ 1N

sumi

p(Zi)

q(Zi)k(middot Zi) minus 1

SN

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

=∥∥∥∥∥ 1

N

sumi

p(Zi)

q(Zi)k(middot Zi)

(1 minus N

SN

)∥∥∥∥∥H

436 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

=∣∣∣∣1 minus N

SN

∣∣∣∣∥∥∥∥∥ 1

N

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

le∣∣∣∣1 minus N

SN

∣∣∣∣C pqinfin

=∣∣∣∣∣1 minus 1

1N

sumNi=1 p(Zi)q(Zi)

∣∣∣∣∣C pqinfin

where pqinfin = supxisinXp(x)

q(x)lt infin Therefore the following holds by as-

sumption 3 and the delta method

∥∥∥∥∥ 1N

sumi

p(Zi)

q(Zi)k(middot Zi) minus 1

SN

sumi

p(Zi)

q(Zi)k(middot Zi)

∥∥∥∥∥H

= Op(Nminus12) (B12)

The assertion of the theorem follows from equations B10 to B12

Appendix C Reduction of Computational Cost

We have seen in section 4.3 that the time complexity of KMCF in one time step is O(n³), where n is the number of the state-observation examples {(X_i, Y_i)}_{i=1}^n. This can be costly if one wishes to use KMCF in real-time applications with a large number of samples. Here we show two methods for reducing the costs: one based on low-rank approximation of kernel matrices and one based on kernel herding. Note that kernel herding is also used in the resampling step. The purpose here is different, however; we make use of kernel herding for finding a reduced representation of the data {(X_i, Y_i)}_{i=1}^n.

C.1 Low-Rank Approximation of Kernel Matrices. Our goal is to reduce the costs of algorithm 1 of kernel Bayes' rule. Algorithm 1 involves two matrix inversions: (G_X + nεI_n)^{-1} in line 3 and ((G_Y)² + δI_n)^{-1} in line 4. Note that (G_X + nεI_n)^{-1} does not involve the test data, so it can be computed before the test phase. On the other hand, ((G_Y)² + δI_n)^{-1} depends on a matrix that involves the vector m_π, which essentially represents the prior of the current state (see line 13 of algorithm 3). Therefore ((G_Y)² + δI_n)^{-1} needs to be computed for each iteration in the test phase. This has the complexity of O(n³). Note that even if (G_X + nεI_n)^{-1} can be computed in the training phase, the multiplication (G_X + nεI_n)^{-1} m_π in line 3 requires O(n²), so it can also be costly. Here we consider methods to reduce both costs in lines 3 and 4.

Suppose that there exist low-rank matrices U, V ∈ R^{n×r}, where r < n, that approximate the kernel matrices: G_X ≈ UU^T, G_Y ≈ VV^T. Such low-rank matrices can be obtained by, for example, incomplete Cholesky decomposition with time complexity O(nr²) (Fine & Scheinberg, 2001; Bach & Jordan, 2002). Note that the computation of these matrices is required only once, before the test phase. Therefore their time complexities are not the problem here.

C.1.1 Derivation. First, we approximate (G_X + nεI_n)^{-1} m_π in line 3 using G_X ≈ UU^T. By the Woodbury identity, we have
$$
(G_X + n\varepsilon I_n)^{-1} m_\pi \approx (UU^T + n\varepsilon I_n)^{-1} m_\pi = \frac{1}{n\varepsilon}\left( I_n - U (n\varepsilon I_r + U^T U)^{-1} U^T \right) m_\pi,
$$
where I_r ∈ R^{r×r} denotes the identity. Note that (nεI_r + U^TU)^{-1} does not involve the test data, so it can be computed in the training phase. Thus the above approximation can be computed with complexity O(nr²).

Next, we approximate w = G_Y((G_Y)² + δI)^{-1} k_Y in line 4 using G_Y ≈ VV^T. Define B = V ∈ R^{n×r}, C = V^TV ∈ R^{r×r}, and D = V^T ∈ R^{r×n}. Then (G_Y)² ≈ (VV^T)² = BCD. By the Woodbury identity, we obtain
$$
(\delta I_n + (G_Y)^2)^{-1} \approx (\delta I_n + BCD)^{-1} = \frac{1}{\delta}\left( I_n - B(\delta C^{-1} + DB)^{-1} D \right).
$$
Thus w can be approximated as
$$
w = G_Y\big((G_Y)^2 + \delta I\big)^{-1} k_Y \approx \frac{1}{\delta}\, V V^T \left( I_n - B(\delta C^{-1} + DB)^{-1} D \right) k_Y.
$$
The computation of this approximation requires O(nr² + r³) = O(nr²). Thus, in total, the complexity of algorithm 1 can be reduced to O(nr²). We summarize the above approximations in algorithm 5.
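To make the two Woodbury-based approximations concrete, here is a minimal numerical sketch (this is not the full algorithm 5; the function and variable names are ours, and we assume the low-rank factors U, V and the vectors m_π and k_Y have already been computed):

    import numpy as np

    def approx_inverse_times(U, eps, m_pi):
        """(G_X + n*eps*I_n)^{-1} m_pi with G_X ~= U U^T, via the Woodbury identity."""
        n, r = U.shape
        A = n * eps * np.eye(r) + U.T @ U          # (r, r); does not involve the test data
        return (m_pi - U @ np.linalg.solve(A, U.T @ m_pi)) / (n * eps)

    def approx_weights(V, delta, k_Y):
        """w = G_Y ((G_Y)^2 + delta*I_n)^{-1} k_Y with G_Y ~= V V^T, via the Woodbury identity."""
        B, C, D = V, V.T @ V, V.T                   # (G_Y)^2 ~= B C D
        inner = delta * np.linalg.inv(C) + D @ B    # (r, r); assumes C is invertible
        tmp = k_Y - B @ np.linalg.solve(inner, D @ k_Y)   # delta * (delta*I + BCD)^{-1} k_Y
        return (V @ (V.T @ tmp)) / delta            # total cost O(n r^2 + r^3)

Both functions touch only n-by-r and r-by-r matrices, which is where the reduction from O(n³) to O(nr²) comes from.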


C.1.2 How to Use. Algorithm 5 can be used with algorithm 3 of KMCF by modifying algorithm 3 in the following manner. Compute the low-rank matrices U, V right after lines 4 and 5. This can be done by using, for example, incomplete Cholesky decomposition (Fine & Scheinberg, 2001; Bach & Jordan, 2002). Then replace algorithm 1 in line 15 by algorithm 5.

C.1.3 How to Select the Rank. As discussed in section 4.3, one way of selecting the rank r is to use cross-validation, regarding r as a hyperparameter of KMCF. Another way is to measure the approximation errors ∥G_X − UU^T∥ and ∥G_Y − VV^T∥ with some matrix norm, such as the Frobenius norm. Indeed, we can compute the smallest rank r such that these errors are below a prespecified threshold, and this can be done efficiently with time complexity O(nr²) (Bach & Jordan, 2002).
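As an illustration of the second option, the following sketch computes a low-rank factor by a simple pivoted (incomplete) Cholesky decomposition that stops once the trace of the residual falls below a threshold; since the residual is positive semidefinite, its trace upper-bounds its Frobenius norm. This is only one possible implementation, and the function names are ours:

    import numpy as np

    def pivoted_chol(kernel_diag, kernel_col, n, tol):
        """Incomplete (pivoted) Cholesky of an n x n kernel matrix G.

        kernel_diag(i) returns G[i, i]; kernel_col(i) returns the column G[:, i].
        Stops when the trace of the residual G - U U^T drops below tol.
        Returns U of shape (n, r) with G ~= U U^T.
        """
        d = np.array([kernel_diag(i) for i in range(n)], dtype=float)
        cols = []
        while d.sum() > tol and len(cols) < n:
            j = int(np.argmax(d))                   # pivot: largest residual diagonal
            u = kernel_col(j).astype(float)
            for c in cols:                          # subtract contribution of previous columns
                u -= c[j] * c
            u /= np.sqrt(d[j])
            d -= u ** 2
            d[d < 0] = 0.0                          # guard against round-off
            cols.append(u)
        return np.stack(cols, axis=1) if cols else np.zeros((n, 0))

The number of columns returned is the smallest rank r whose residual trace is below the threshold, and the cost is O(nr²).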

C.2 Data Reduction with Kernel Herding. Here we describe an approach to reduce the size of the representation of the state-observation examples {(X_i, Y_i)}_{i=1}^n in an efficient way. By "efficient," we mean that the information contained in {(X_i, Y_i)}_{i=1}^n will be preserved even after the reduction. Recall that {(X_i, Y_i)}_{i=1}^n contains the information of the observation model p(y_t|x_t) (recall also that p(y_t|x_t) is assumed time-homogeneous; see section 4.1). This information is used only in algorithm 1 of kernel Bayes' rule (line 15, algorithm 3). Therefore it suffices to consider how kernel Bayes' rule accesses the information contained in the joint sample {(X_i, Y_i)}_{i=1}^n.

C.2.1 Representation of the Joint Sample. To this end, we need to show how the joint sample {(X_i, Y_i)}_{i=1}^n can be represented with a kernel mean embedding. Recall that (k_X, H_X) and (k_Y, H_Y) are kernels and the associated RKHSs on the state space X and the observation space Y, respectively. Let X × Y be the product space of X and Y. Then we can define a kernel k_{X×Y} on X × Y as the product of k_X and k_Y: k_{X×Y}((x, y), (x', y')) = k_X(x, x') k_Y(y, y') for all (x, y), (x', y') ∈ X × Y. This product kernel k_{X×Y} defines an RKHS on X × Y; let H_{X×Y} denote this RKHS. As in section 3, we can use k_{X×Y} and H_{X×Y} for a kernel mean embedding. In particular, the empirical distribution (1/n) Σ_{i=1}^n δ_{(X_i, Y_i)} of the joint sample {(X_i, Y_i)}_{i=1}^n ⊂ X × Y can be represented as an empirical kernel mean in H_{X×Y}:
$$
\hat{m}_{XY} = \frac{1}{n} \sum_{i=1}^n k_{\mathcal{X}\times\mathcal{Y}}\big((\cdot, \cdot), (X_i, Y_i)\big) \in \mathcal{H}_{\mathcal{X}\times\mathcal{Y}}. \tag{C.1}
$$
This is the representation of the joint sample {(X_i, Y_i)}_{i=1}^n.


equation C1 Therefore we propose to find a subset (X1 Y1) (Xr Xr) sub(XiYi)n

i=1 where r lt n such that its representation in HXtimesY

mXY = 1r

rsumi=1

kXtimesY ((middot middot) (Xi Yi)) isin HXtimesY (C2)

is close to equation C1 Namely we wish to find subsamples such thatmXY minus mXYHXtimesY

is small If the error mXY minus mXYHXtimesYis small enough

equation C2 would provide information close to that given by equation C1for kernel Bayesrsquo rule Thus kernel Bayesrsquo rule based on such subsamples(Xi Yi)r

i=1 would not perform much worse than the one based on the entireset of samples (XiYi)n

i=1

C.2.2 Subsampling Method. To find such subsamples, we make use of kernel herding in section 3.5. Namely, we apply the update equations 3.6 and 3.7 to approximate equation C.1, with kernel k_{X×Y} and RKHS H_{X×Y}. We greedily find subsamples D_r = {(X̃_1, Ỹ_1), ..., (X̃_r, Ỹ_r)} as
$$
\begin{aligned}
(\tilde{X}_r, \tilde{Y}_r)
&= \arg\max_{(x,y) \in \mathcal{D} \setminus \mathcal{D}_{r-1}} \; \frac{1}{n} \sum_{i=1}^n k_{\mathcal{X}\times\mathcal{Y}}\big((x, y), (X_i, Y_i)\big) - \frac{1}{r} \sum_{j=1}^{r-1} k_{\mathcal{X}\times\mathcal{Y}}\big((x, y), (\tilde{X}_j, \tilde{Y}_j)\big) \\
&= \arg\max_{(x,y) \in \mathcal{D} \setminus \mathcal{D}_{r-1}} \; \frac{1}{n} \sum_{i=1}^n k_{\mathcal{X}}(x, X_i)\, k_{\mathcal{Y}}(y, Y_i) - \frac{1}{r} \sum_{j=1}^{r-1} k_{\mathcal{X}}(x, \tilde{X}_j)\, k_{\mathcal{Y}}(y, \tilde{Y}_j).
\end{aligned}
$$
The resulting algorithm is shown in algorithm 6. The time complexity is O(n²r) for selecting r subsamples.
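A minimal sketch of this greedy selection, assuming the kernel matrices on the training set have been precomputed (the function name and the running-sum trick are ours; with precomputed matrices the selection itself costs only O(nr) on top of the O(n²) matrices):

    import numpy as np

    def herding_subsample(KX, KY, r):
        """Select r of the n joint samples by kernel herding on the product kernel.

        KX, KY: (n, n) kernel matrices k_X(X_i, X_j) and k_Y(Y_i, Y_j).
        Returns the indices of the selected subsamples (length r, no repetitions).
        """
        K = KX * KY                      # product-kernel matrix on the joint samples
        target = K.mean(axis=1)          # (1/n) sum_i k((x,y),(X_i,Y_i)) at each candidate
        running = np.zeros(K.shape[0])   # sum of k((x,y),(Xtilde_j,Ytilde_j)) over selected j
        selected = []
        for t in range(r):
            obj = target - running / (t + 1)
            obj[selected] = -np.inf      # candidates are drawn from D \ D_{r-1}
            j = int(np.argmax(obj))
            selected.append(j)
            running += K[:, j]
        return selected

The selected indices can then be used to restrict the training set before running algorithm 3, as described in section C.2.3.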

C.2.3 How to Use. By using algorithm 6, we can reduce the time complexity of KMCF (see algorithm 3) in each iteration from O(n³) to O(r³). This can be done by obtaining subsamples {(X̃_i, Ỹ_i)}_{i=1}^r by applying algorithm 6 to {(X_i, Y_i)}_{i=1}^n, and then replacing {(X_i, Y_i)}_{i=1}^n in the requirement of algorithm 3 by {(X̃_i, Ỹ_i)}_{i=1}^r and using the number r instead of n.

C.2.4 How to Select the Number of Subsamples. The number r of subsamples determines the trade-off between the accuracy and the computational time of KMCF. It may be selected by cross-validation or by measuring the approximation error ∥m̂_XY − m̃_XY∥_{H_{X×Y}}, as for the case of selecting the rank of the low-rank approximation in section C.1.

C.2.5 Discussion. Recall that kernel herding generates samples such that they approximate a given kernel mean (see section 3.5). Under certain assumptions, the error of this approximation is of O(r^{-1}) with r samples, which is faster than that of i.i.d. samples, O(r^{-1/2}). This indicates that subsamples {(X̃_i, Ỹ_i)}_{i=1}^r selected with kernel herding may approximate equation C.1 well. Here, however, we find the solutions of the optimization problems 3.6 and 3.7 from the finite set {(X_i, Y_i)}_{i=1}^n, rather than the entire joint space X × Y. The convergence guarantee is provided only for the case of the entire joint space X × Y; thus, for our case, the convergence guarantee is no longer provided. Moreover, the fast rate O(r^{-1}) is guaranteed only for finite-dimensional RKHSs. Gaussian kernels, which we often use in practice, define infinite-dimensional RKHSs. Therefore the fast rate is not guaranteed if we use gaussian kernels. Nevertheless, we can use algorithm 6 as a heuristic for data reduction.

Acknowledgments

We express our gratitude to the associate editor and the anonymous reviewer for their time and helpful suggestions. We also thank Masashi Shimbo, Momoko Hayamizu, Yoshimasa Uematsu, and Katsuhiro Omae for their helpful comments. This work has been supported in part by MEXT Grant-in-Aid for Scientific Research on Innovative Areas 25120012. M.K. has been supported by JSPS Grant-in-Aid for JSPS Fellows 15J04406.

References

Anderson B amp Moore J (1979) Optimal filtering Englewood Cliffs NJ PrenticeHall

Aronszajn N (1950) Theory of reproducing kernels Transactions of the AmericanMathematical Society 68(3) 337ndash404


Bach F amp Jordan M I (2002) Kernel independent component analysis Journal ofMachine Learning Research 3 1ndash48

Bach F Lacoste-Julien S amp Obozinski G (2012) On the equivalence betweenherding and conditional gradient algorithms In Proceedings of the 29th Interna-tional Conference on Machine Learning (ICML2012) (pp 1359ndash1366) Madison WIOmnipress

Berlinet A amp Thomas-Agnan C (2004) Reproducing kernel Hilbert spaces in probabilityand statistics Dordrecht Kluwer Academic

Calvet L E amp Czellar V (2015) Accurate methods for approximate Bayesiancomputation filtering Journal of Financial Econometrics 13 798ndash838 doi101093jjfinecnbu019

Cappe O Godsill S J amp Moulines E (2007) An overview of existing methods andrecent advances in sequential Monte Carlo IEEE Proceedings 95(5) 899ndash924

Chen Y Welling M amp Smola A (2010) Supersamples from kernel-herding InProceedings of the 26th Conference on Uncertainty in Artificial Intelligence (pp 109ndash116) Cambridge MA MIT Press

Deisenroth M Huber M amp Hanebeck U (2009) Analytic moment-based gaussianprocess filtering In Proceedings of the 26th International Conference on MachineLearning (pp 225ndash232) Madison WI Omnipress

Doucet A Freitas N D amp Gordon N J (Eds) (2001) Sequential Monte Carlomethods in practice New York Springer

Doucet A amp Johansen A M (2011) A tutorial on particle filtering and smoothingFifteen years later In D Crisan amp B Rozovskii (Eds) The Oxford handbook ofnonlinear filtering (pp 656ndash704) New York Oxford University Press

Durbin J amp Koopman S J (2012) Time series analysis by state space methods (2nd ed)New York Oxford University Press

Eberts M amp Steinwart I (2013) Optimal regression rates for SVMs using gaussiankernels Electronic Journal of Statistics 7 1ndash42

Ferris B Hahnel D amp Fox D (2006) Gaussian processes for signal strength-basedlocation estimation In Proceedings of Robotics Science and Systems CambridgeMA MIT Press

Fine S amp Scheinberg K (2001) Efficient SVM training using low-rank kernel rep-resentations Journal of Machine Learning Research 2 243ndash264

Freund R M amp Grigas P (2014) New analysis and results for the FrankndashWolfemethod Mathematical Programming doi 101007s10107-014-0841-6

Fukumizu K Bach F amp Jordan M I (2004) Dimensionality reduction for super-vised learning with reproducing kernel Hilbert spaces Journal of Machine LearningResearch 5 73ndash99

Fukumizu K Gretton A Sun X amp Scholkopf B (2008) Kernel measures ofconditional dependence In J C Platt D Koller Y Singer amp S Roweis (Eds)Advances in neural information processing systems 20 (pp 489ndash496) CambridgeMA MIT Press

Fukumizu K Song L amp Gretton A (2011) Kernel Bayesrsquo rule In J Shawe-TaylorR S Zemel P L Bartlett F C N Pereira amp K Q Weinberger (Eds) Advances inneural information processing systems 24 (pp 1737ndash1745) Red Hook NY Curran

Fukumizu K Song L amp Gretton A (2013) Kernel Bayesrsquo rule Bayesian inferencewith positive definite kernels Journal of Machine Learning Research 14 3753ndash3783


Fukumizu K Sriperumbudur B Gretton A amp Scholkopf B (2009) Characteristickernels on groups and semigroups In D Koller D Schuurmans Y Bengio amp LBottou (Eds) Advances in neural information processing systems 21 (pp 473ndash480)Cambridge MA MIT Press

Gordon, N. J., Salmond, D. J., & Smith, A. F. M. (1993). Novel approach to nonlinear/non-gaussian Bayesian state estimation. IEE Proceedings F, 140, 107–113.

Hofmann T Scholkopf B amp Smola A J (2008) Kernel methods in machine learn-ing Annals of Statistics 36(3) 1171ndash1220

Jaggi, M. (2013). Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning (pp. 427–435).

Jasra A Singh S S Martin J S amp McCoy E (2012) Filtering via approximateBayesian computation Statistics and Computing 22 1223ndash1237

Julier S J amp Uhlmann J K (1997) A new extension of the Kalman filter tononlinear systems In Proceedings of AeroSense The 11th International SymposiumAerospaceDefence Sensing Simulation and Controls Bellingham WA SPIE

Julier S J amp Uhlmann J K (2004) Unscented filtering and nonlinear estimationIEEE Review 92 401ndash422

Kalman R E (1960) A new approach to linear filtering and prediction problemsTransactions of the ASME Journal of Basic Engineering 82 35ndash45

Kanagawa M amp Fukumizu K (2014) Recovering distributions from gaussianRKHS embeddings In Proceedings of the 17th International Conference on Artifi-cial Intelligence and Statistics (pp 457ndash465) JMLR

Kanagawa M Nishiyama Y Gretton A amp Fukumizu K (2014) Monte Carlofiltering using kernel embedding of distributions In Proceedings of the 28thAAAI Conference on Artificial Intelligence (pp 1897ndash1903) Cambridge MA MITPress

Ko J amp Fox D (2009) GP-BayesFilters Bayesian filtering using gaussian processprediction and observation models Autonomous Robots 72(1) 75ndash90

Lazebnik S Schmid C amp Ponce J (2006) Beyond bags of features Spatial pyramidmatching for recognizing natural scene categories In Proceedings of 2006 IEEEComputer Society Conference on Computer Vision and Pattern Recognition (vol 2 pp2169ndash2178) Washington DC IEEE Computer Society

Liu J S (2001) Monte Carlo strategies in scientific computing New York Springer-Verlag

McCalman L OrsquoCallaghan S amp Ramos F (2013) Multi-modal estimation withkernel embeddings for learning motion models In Proceedings of 2013 IEEE In-ternational Conference on Robotics and Automation (pp 2845ndash2852) Piscataway NJIEEE

Pan S J amp Yang Q (2010) A survey on transfer learning IEEE Transactions onKnowledge and Data Engineering 22(10) 1345ndash1359

Pistohl T Ball T Schulze-Bonhage A Aertsen A amp Mehring C (2008) Predic-tion of arm movement trajectories from ECoG-recordings in humans Journal ofNeuroscience Methods 167(1) 105ndash114

Pronobis A amp Caputo B (2009) COLD COsy localization database InternationalJournal of Robotics Research 28(5) 588ndash594


Quigley M Stavens D Coates A amp Thrun S (2010) Sub-meter indoor localiza-tion in unmodified environments with inexpensive sensors In Proceedings of theIEEERSJ International Conference on Intelligent Robots and Systems 2010 (vol 1 pp2039ndash2046) Piscataway NJ IEEE

Ristic B Arulampalam S amp Gordon N (2004) Beyond the Kalman filter Particlefilters for tracking applications Norwood MA Artech House

Schaid D J (2010a) Genomic similarity and kernel methods I Advancements bybuilding on mathematical and statistical foundations Human Heredity 70(2) 109ndash131

Schaid D J (2010b) Genomic similarity and kernel methods II Methods for genomicinformation Human Heredity 70(2) 132ndash140

Schalk, G., Kubanek, J., Miller, K. J., Anderson, N. R., Leuthardt, E. C., Ojemann, J. G., ... Wolpaw, J. R. (2007). Decoding two-dimensional movement trajectories using electrocorticographic signals in humans. Journal of Neural Engineering, 4, 264–275.

Schölkopf, B., & Smola, A. J. (2002). Learning with kernels. Cambridge, MA: MIT Press.
Schölkopf, B., Tsuda, K., & Vert, J. P. (2004). Kernel methods in computational biology. Cambridge, MA: MIT Press.
Silverman, B. W. (1986). Density estimation for statistics and data analysis. London: Chapman and Hall.
Smola, A., Gretton, A., Song, L., & Schölkopf, B. (2007). A Hilbert space embedding for distributions. In Proceedings of the International Conference on Algorithmic Learning Theory (pp. 13–31). New York: Springer.

Song L Fukumizu K amp Gretton A (2013) Kernel embeddings of conditional dis-tributions A unified kernel framework for nonparametric inference in graphicalmodels IEEE Signal Processing Magazine 30(4) 98ndash111

Song L Huang J Smola A amp Fukumizu K (2009) Hilbert space embeddings ofconditional distributions with applications to dynamical systems In Proceedingsof the 26th International Conference on Machine Learning (pp 961ndash968) MadisonWI Omnipress

Sriperumbudur B K Gretton A Fukumizu K Scholkopf B amp Lanckriet G R(2010) Hilbert space embeddings and metrics on probability measures Journal ofMachine Learning Research 11 1517ndash1561

Steinwart, I., & Christmann, A. (2008). Support vector machines. New York: Springer.
Stone, C. J. (1977). Consistent nonparametric regression. Annals of Statistics, 5(4), 595–620.
Thrun, S., Burgard, W., & Fox, D. (2005). Probabilistic robotics. Cambridge, MA: MIT Press.
Vlassis, N., Terwijn, B., & Kröse, B. (2002). Auxiliary particle filter robot localization from high-dimensional sensor observations. In Proceedings of the International Conference on Robotics and Automation (pp. 7–12). Piscataway, NJ: IEEE.

Wang Z Ji Q Miller K J amp Schalk G (2011) Prior knowledge improves decodingof finger flexion from electrocorticographic signals Frontiers in Neuroscience 5127

Widom H (1963) Asymptotic behavior of the eigenvalues of certain integral equa-tions Transactions of the American Mathematical Society 109 278ndash295


Widom H (1964) Asymptotic behavior of the eigenvalues of certain integral equa-tions II Archive for Rational Mechanics and Analysis 17 215ndash229

Wolf J Burgard W amp Burkhardt H (2005) Robust vision-based localization bycombining an image retrieval system with Monte Carlo localization IEEE Trans-actions on Robotics 21(2) 208ndash216

Received May 18 2015 accepted October 14 2015



Algorithm 1 outputs a weight vector w = (w_1, ..., w_n)^T ∈ R^n. Normalizing these weights, w_t := w / Σ_{i=1}^n w_i, we obtain an estimator of equation 4.1³ as
$$
\hat{m}_{x_t|y_{1:t}} = \sum_{i=1}^n w_{t,i}\, k_{\mathcal{X}}(\cdot, X_i). \tag{4.6}
$$

The apparent difference from a particle filter is that the posterior (kernel mean) estimator, equation 4.6, is expressed in terms of the samples X_1, ..., X_n in the training sample {(X_i, Y_i)}_{i=1}^n, not with the samples from the prior, equation 4.4. This requires that the training samples X_1, ..., X_n cover the support of the posterior p(x_t|y_{1:t}) sufficiently well. If this does not hold, we cannot expect good performance for the posterior estimate. Note that this is also true for any method that deals with the setting of this letter: poverty of training samples in a certain region means that we do not have any information about the observation model p(y_t|x_t) in that region.

4.2.3 Resampling Step. This step applies the update equations 3.6 and 3.7 of kernel herding in section 3.5 to the estimate, equation 4.6. This is to obtain samples X̄_{t,1}, ..., X̄_{t,n} such that
$$
\bar{m}_{x_t|y_{1:t}} = \frac{1}{n} \sum_{i=1}^n k_{\mathcal{X}}(\cdot, \bar{X}_{t,i}) \tag{4.7}
$$
is close to equation 4.6 in the RKHS. Our theoretical analysis in section 5 shows that such a procedure can reduce the error of the prediction step at time t + 1.

The procedure is summarized in algorithm 2. Specifically, we generate each X̄_{t,i} by searching for the solution of the optimization problem in equations 3.6 and 3.7 from the finite set of samples X_1, ..., X_n in equation 4.6. We allow repetitions in X̄_{t,1}, ..., X̄_{t,n}. We can expect that the resulting equation 4.7 is close to equation 4.6 in the RKHS if the samples X_1, ..., X_n cover the support of the posterior p(x_t|y_{1:t}) sufficiently. This is verified by the theoretical analysis of section 5.3.

³For this normalization procedure, see the discussion in section 4.3.

Here, searching for the solutions from a finite set reduces the computational costs of kernel herding. It is possible to search over the entire space X if we have sufficient time or if the sample size n is small enough; this depends on the application and the available computational resources. We also note that the size of the resampling sample is not necessarily n; it depends on how accurately these samples approximate equation 4.6. Thus a smaller number of samples may be sufficient. In this case, we can reduce the computational costs of resampling, as discussed in section 5.2. A sketch of the finite-set search is given below.
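As a concrete illustration, the following is a minimal sketch of this finite-set search, assuming the kernel matrix on X_1, ..., X_n and the current weights are available; the function name and the incremental running sum are ours, and the full algorithm 2 may differ in details:

    import numpy as np

    def herding_resample(K, w, m):
        """Kernel herding resampling restricted to the finite set X_1, ..., X_n.

        K: (n, n) kernel matrix k_X(X_i, X_j); w: (n,) weights of the current
        estimate sum_i w_i k_X(., X_i); m: number of resampled points.
        Returns indices (repetitions allowed) of the resampled points.
        """
        target = K @ w                    # sum_i w_i k_X(X_j, X_i) at each candidate j
        running = np.zeros(K.shape[0])    # sum of k_X(X_j, Xbar_s) over already chosen s
        indices = []
        for t in range(m):
            j = int(np.argmax(target - running / (t + 1)))
            indices.append(j)             # repetitions are allowed
            running += K[:, j]
        return indices

The resampled points X_{indices[0]}, ..., X_{indices[m-1]} then define the uniform-weight estimate of equation 4.7.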

The aim of our resampling step is similar to that of the resampling step of a particle filter (see Doucet & Johansen, 2011). Intuitively, the aim is to eliminate samples with very small weights and replicate those with large weights (see Figures 2 and 3). In particle methods, this is realized by generating samples from the empirical distribution defined by a weighted sample (therefore this procedure is called resampling). Our resampling step is a realization of such a procedure in terms of the kernel mean embedding: we generate samples X̄_{t,1}, ..., X̄_{t,n} from the empirical kernel mean, equation 4.6.

Note that the resampling algorithm of particle methods is not appropriate for use with kernel mean embeddings. This is because it assumes that the weights are positive, but our weights in equation 4.6 can be negative, as this equation is a kernel mean estimator. One may apply the resampling algorithm of particle methods by first truncating the samples with negative weights. However, there is no guarantee that samples obtained by this heuristic produce a good approximation of equation 4.6 as a kernel mean, as shown by the experiments in section 6.1. In this sense, the use of kernel herding is more natural, since it generates samples that approximate a kernel mean.

4.2.4 Overall Algorithm. We summarize the overall procedure of KMCF in algorithm 3, where p_init denotes a prior distribution for the initial state x_1. For each time t, KMCF takes as input an observation y_t and outputs a weight vector w_t = (w_{t,1}, ..., w_{t,n})^T ∈ R^n. Combined with the samples X_1, ..., X_n in the state-observation examples {(X_i, Y_i)}_{i=1}^n, these weights provide an estimator, equation 4.6, of the kernel mean of the posterior, equation 4.1.

We first compute the kernel matrices G_X, G_Y (lines 4–5), which are used in algorithm 1 of kernel Bayes' rule (line 15). For t = 1, we generate an i.i.d. sample X_{1,1}, ..., X_{1,n} from the initial distribution p_init (line 8), which provides an estimator of the prior corresponding to equation 4.4. Line 10 is the resampling step at time t − 1, and line 11 is the prediction step at time t. Lines 13 to 16 correspond to the correction step.


4.3 Discussion. The estimation accuracy of KMCF can depend on several factors in practice.

4.3.1 Training Samples. We first note that the training samples {(X_i, Y_i)}_{i=1}^n should provide information concerning the observation model p(y_t|x_t). For example, {(X_i, Y_i)}_{i=1}^n may be an i.i.d. sample from a joint distribution p(x, y) on X × Y that decomposes as p(x, y) = p(y|x)p(x). Here p(y|x) is the observation model, and p(x) is some distribution on X. The support of p(x) should cover the region where the states x_1, ..., x_T may pass in the test phase, as discussed in section 4.2. For example, this is satisfied when the state space X is compact and the support of p(x) is the entire X.

Note that the training samples {(X_i, Y_i)}_{i=1}^n can also be non-i.i.d. in practice. For example, we may deterministically select X_1, ..., X_n so that they cover the region of interest. In location estimation problems in robotics, for instance, we may collect location-sensor examples {(X_i, Y_i)}_{i=1}^n so that the locations X_1, ..., X_n cover the region where location estimation is to be conducted (Quigley et al., 2010).

4.3.2 Hyperparameters. As in other kernel methods in general, the performance of KMCF depends on the choice of its hyperparameters, which are the kernels k_X and k_Y (or parameters in the kernels, e.g., the bandwidth of the gaussian kernel) and the regularization constants δ, ε > 0. We need to define these hyperparameters based on the joint sample {(X_i, Y_i)}_{i=1}^n before running the algorithm on the test data y_1, ..., y_T. This can be done by cross-validation. Suppose that {(X_i, Y_i)}_{i=1}^n is given as a sequence from the state-space model. We can then apply two-fold cross-validation by dividing the sequence into two subsequences. If {(X_i, Y_i)}_{i=1}^n is not a sequence, we can rely on the cross-validation procedure for kernel Bayes' rule (see section 4.2 of Fukumizu et al., 2013).

4.3.3 Normalization of Weights. We found in our preliminary experiments that normalization of the weights (see line 16, algorithm 3) is beneficial to the filtering performance. This may be justified by the following discussion about kernel mean estimators in general. Let us consider a consistent kernel mean estimator m̂_P = Σ_{i=1}^n w_i k(·, X_i) such that lim_{n→∞} ∥m̂_P − m_P∥_H = 0. Then we can show that the sum of the weights converges to 1, lim_{n→∞} Σ_{i=1}^n w_i = 1, under certain assumptions (Kanagawa & Fukumizu, 2014). This could be explained as follows. Recall that the weighted average Σ_{i=1}^n w_i f(X_i) of a function f is an estimator of the expectation ∫ f(x) dP(x). Let f be the function that takes the value 1 for any input: f(x) = 1, ∀x ∈ X. Then we have Σ_{i=1}^n w_i f(X_i) = Σ_{i=1}^n w_i and ∫ f(x) dP(x) = 1. Therefore Σ_{i=1}^n w_i is an estimator of 1. In other words, if the error ∥m̂_P − m_P∥_H is small, then the sum of the weights Σ_{i=1}^n w_i should be close to 1. Conversely, if the sum of the weights is far from 1, it suggests that the estimate m̂_P is not accurate. Based on this theoretical observation, we suppose that normalization of the weights (which makes the sum equal to 1) results in a better estimate.

4.3.4 Time Complexity. For each time t, the naive implementation of algorithm 3 requires a time complexity of O(n³) for the size n of the joint sample {(X_i, Y_i)}_{i=1}^n. This comes from algorithm 1 in line 15 (kernel Bayes' rule) and algorithm 2 in line 10 (resampling). The complexity O(n³) of algorithm 1 is due to the matrix inversions. Note that one of the inversions, (G_X + nεI_n)^{-1}, can be computed before the test phase, as it does not involve the test data. Algorithm 2 also has complexity O(n³). In section 5.2, we will explain how this cost can be reduced to O(n²) by generating only ℓ < n samples by resampling.

4.3.5 Speeding-Up Methods. In appendix C, we describe two methods for reducing the computational costs of KMCF, both of which need only be applied prior to the test phase. The first is a low-rank approximation of the kernel matrices G_X, G_Y, which reduces the complexity to O(nr²), where r is the rank of the low-rank matrices. Low-rank approximation works well in practice, since the eigenvalues of a kernel matrix often decay very rapidly. Indeed, this has been theoretically shown for some cases (see Widom, 1963, 1964, and discussions in Bach & Jordan, 2002). The second is a data-reduction method based on kernel herding, which efficiently selects joint subsamples from the training set {(X_i, Y_i)}_{i=1}^n. Algorithm 3 is then applied based only on those subsamples. The resulting complexity is thus O(r³), where r is the number of subsamples. This method is motivated by the fast convergence rate of kernel herding (Chen et al., 2010).

Both methods require the number r to be chosen, which is either the rank for low-rank approximation or the number of subsamples in data reduction. This determines the trade-off between accuracy and computational time. In practice, there are two ways of selecting the number r. By regarding r as a hyperparameter of KMCF, we can select it by cross-validation; alternatively, we can choose r by comparing the resulting approximation error, which is measured in a matrix norm for low-rank approximation and in an RKHS norm for the subsampling method. (For details, see appendix C.)

4.3.6 Transfer Learning Setting. We assumed that the observation model in the test phase is the same as for the training samples. However, this might not hold in some situations. For example, in the vision-based localization problem, the illumination conditions for the test and training phases might be different (e.g., the test is done at night, while the training samples are collected in the morning). Without taking into account such a significant change in the observation model, KMCF would not perform well in practice.

This problem could be addressed by exploiting the framework of transfer learning (Pan & Yang, 2010). This framework aims at situations where the probability distribution that generates the test data is different from that of the training samples. The main assumption is that there exist a small number of examples from the test distribution. Transfer learning then provides a way of combining such test examples and abundant training samples, thereby improving the test performance. The application of transfer learning in our setting remains a topic for future research.

4.4 Estimation of Posterior Statistics. By algorithm 3, we obtain the estimates of the kernel means of the posteriors, equation 4.1, as
$$
\hat{m}_{x_t|y_{1:t}} = \sum_{i=1}^n w_{t,i}\, k_{\mathcal{X}}(\cdot, X_i) \quad (t = 1, \dots, T). \tag{4.8}
$$
These contain the information on the posteriors p(x_t|y_{1:t}) (see sections 3.2 and 3.4). We now show how to estimate statistics of the posteriors using these estimates. For ease of presentation, we consider the case X = R^d. Theoretical arguments to justify these operations are provided by Kanagawa and Fukumizu (2014).

4.4.1 Mean and Covariance. Consider the posterior mean ∫ x_t p(x_t|y_{1:t}) dx_t ∈ R^d and the posterior (uncentered) covariance ∫ x_t x_t^T p(x_t|y_{1:t}) dx_t ∈ R^{d×d}. These quantities can be estimated as
$$
\sum_{i=1}^n w_{t,i}\, X_i \quad \text{(mean)}, \qquad \sum_{i=1}^n w_{t,i}\, X_i X_i^T \quad \text{(covariance)}.
$$

4.4.2 Probability Mass. Let A ⊂ X be a measurable set with smooth boundary. Define the indicator function I_A(x) by I_A(x) = 1 for x ∈ A and I_A(x) = 0 otherwise. Consider the probability mass ∫ I_A(x_t) p(x_t|y_{1:t}) dx_t. This can be estimated as Σ_{i=1}^n w_{t,i} I_A(X_i).

4.4.3 Density. Suppose p(x_t|y_{1:t}) has a density function. Let J(x) be a smoothing kernel satisfying ∫ J(x) dx = 1 and J(x) ≥ 0. Let h > 0 and define J_h(x) := h^{-d} J(x/h). Then the density of p(x_t|y_{1:t}) can be estimated as
$$
\hat{p}(x_t|y_{1:t}) = \sum_{i=1}^n w_{t,i}\, J_h(x_t - X_i), \tag{4.9}
$$
with an appropriate choice of h.

4.4.4 Mode. The mode may be obtained by finding a point that maximizes equation 4.9. However, this requires a careful choice of h. Instead, we may use X_{i_max} with i_max := argmax_i w_{t,i} as a mode estimate. This is the point in {X_1, ..., X_n} that is associated with the maximum weight among w_{t,1}, ..., w_{t,n}. This point can be interpreted as the point that maximizes equation 4.9 in the limit h → 0.
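The estimates of sections 4.4.1 to 4.4.4 are simple weighted sums. A minimal sketch follows (the function names are ours, and the indicator function for the probability mass is supplied by the user):

    import numpy as np

    def posterior_statistics(X, w):
        """Statistics from the estimate sum_i w_i k_X(., X_i) (section 4.4).

        X: (n, d) array of states X_1, ..., X_n; w: (n,) normalized weights w_t.
        """
        mean = w @ X                                  # posterior mean
        cov = (X * w[:, None]).T @ X                  # uncentered covariance sum_i w_i X_i X_i^T
        mode = X[int(np.argmax(w))]                   # mode estimate X_{i_max}
        return mean, cov, mode

    def probability_mass(X, w, indicator):
        """Estimate of int I_A(x_t) p(x_t | y_{1:t}) dx_t for a set A given by `indicator`."""
        return float(w @ np.array([indicator(x) for x in X]))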

4.4.5 Other Methods. Other ways of using equation 4.8 include preimage computation and the fitting of gaussian mixtures (see, e.g., Song et al., 2009; Fukumizu et al., 2013; McCalman, O'Callaghan, & Ramos, 2013).

5 Theoretical Analysis

In this section, we analyze the sampling procedure of the prediction step in section 4.2. Specifically, we derive an upper bound on the error of the estimator, equation 4.4. We also discuss in detail how the resampling step in section 4.2 works as a preprocessing step for the prediction step.

To make our analysis clear, we slightly generalize the setting of the prediction step and discuss the sampling and resampling procedures in this setting.

5.1 Error Bound for the Prediction Step. Let X be a measurable space and P be a probability distribution on X. Let p(·|x) be a conditional distribution on X conditioned on x ∈ X. Let Q be the marginal distribution on X defined by Q(B) = ∫ p(B|x) dP(x) for all measurable B ⊂ X. In the filtering setting of section 4, the space X corresponds to the state space, and the distributions P, p(·|x), and Q correspond to the posterior p(x_{t−1}|y_{1:t−1}) at time t − 1, the transition model p(x_t|x_{t−1}), and the prior p(x_t|y_{1:t−1}) at time t, respectively.

Let k_X be a positive-definite kernel on X and H_X be the RKHS associated with k_X. Let m_P = ∫ k_X(·, x) dP(x) and m_Q = ∫ k_X(·, x) dQ(x) be the kernel means of P and Q, respectively. Suppose that we are given an empirical estimate of m_P as
$$
\hat{m}_P = \sum_{i=1}^n w_i\, k_{\mathcal{X}}(\cdot, X_i), \tag{5.1}
$$
where w_1, ..., w_n ∈ R and X_1, ..., X_n ∈ X. Considering this weighted sample form enables us to explain the mechanism of the resampling step.

where w1 wn isin R and X1 Xn isin X Considering this weighted sam-ple form enables us to explain the mechanism of the resampling step

The prediction step can then be cast as the following procedure for eachsample Xi we generate a new sample X prime

i with the conditional distributionX prime

i sim p(middot|Xi) Then we estimate mQ by

mQ =nsum

i=1

wikX (middot X primei ) (52)

which corresponds to the estimate 44 of the prior kernel mean at time tThe following theorem provides an upper bound on the error of equa-

tion 52 and reveals properties of equation 51 that affect the error of theestimator equation 52 The proof is given in appendix A

Theorem 1. Let m̂_P be a fixed estimate of m_P given by equation 5.1. Define a function θ on X × X by θ(x_1, x_2) = ∫∫ k_X(x'_1, x'_2) dp(x'_1|x_1) dp(x'_2|x_2), ∀(x_1, x_2) ∈ X × X, and assume that θ is included in the tensor RKHS H_X ⊗ H_X.⁴

⁴The tensor RKHS H_X ⊗ H_X is the RKHS of a product kernel k_{X×X} on X × X defined as k_{X×X}((x_a, x_b), (x_c, x_d)) = k_X(x_a, x_c) k_X(x_b, x_d), ∀(x_a, x_b), (x_c, x_d) ∈ X × X. This space H_X ⊗ H_X consists of smooth functions on X × X if the kernel k_X is smooth (e.g., if k_X is gaussian; see section 4 of Steinwart & Christmann, 2008). In this case, we can interpret this assumption as requiring that θ be smooth as a function on X × X.

The function θ can be written as the inner product between the kernel means of the conditional distributions: θ(x_1, x_2) = ⟨m_{p(·|x_1)}, m_{p(·|x_2)}⟩_{H_X}, where m_{p(·|x)} = ∫ k_X(·, x') dp(x'|x). Therefore the assumption may be further seen as requiring that the map x ↦ m_{p(·|x)} be smooth. Note that while similar assumptions are common in the literature on kernel mean embeddings (e.g., theorem 5 of Fukumizu et al., 2013), we may relax this assumption by using approximate arguments in learning theory (e.g., theorems 2.2 and 2.3 of Eberts & Steinwart, 2013). This analysis remains a topic for future research.


The estimator m̂_Q, equation 5.2, then satisfies
$$
\begin{aligned}
E_{X'_1, \dots, X'_n}\!\left[ \| \hat{m}_Q - m_Q \|_{\mathcal{H}_\mathcal{X}}^2 \right]
&\leq \sum_{i=1}^n w_i^2 \left( E_{X'_i}[k_{\mathcal{X}}(X'_i, X'_i)] - E_{X'_i, \tilde{X}'_i}[k_{\mathcal{X}}(X'_i, \tilde{X}'_i)] \right) \qquad &(5.3) \\
&\quad + \| \hat{m}_P - m_P \|_{\mathcal{H}_\mathcal{X}}^2 \, \| \theta \|_{\mathcal{H}_\mathcal{X} \otimes \mathcal{H}_\mathcal{X}}, \qquad &(5.4)
\end{aligned}
$$
where X'_i ∼ p(·|X_i) and X̃'_i is an independent copy of X'_i.

From theorem 1, we can make the following observations. First, the second term, equation 5.4, of the upper bound shows that the error of the estimator, equation 5.2, is likely to be large if the given estimate, equation 5.1, has large error ∥m̂_P − m_P∥²_{H_X}, which is reasonable to expect.

Second, the first term, equation 5.3, shows that the error of equation 5.2 can be large if the distribution of X'_i (i.e., p(·|X_i)) has large variance. For example, suppose X'_i = f(X_i) + ε_i, where f: X → X is some mapping and ε_i is a random variable with mean 0. Let k_X be the gaussian kernel k_X(x, x') = exp(−∥x − x'∥²/2α) for some α > 0. Then E_{X'_i}[k_X(X'_i, X'_i)] − E_{X'_i, X̃'_i}[k_X(X'_i, X̃'_i)] increases from 0 to 1 as the variance of ε_i (i.e., the variance of X'_i) increases from 0 to infinity. Therefore, in this case, equation 5.3 is upper-bounded at worst by Σ_{i=1}^n w_i². Note that E_{X'_i}[k_X(X'_i, X'_i)] − E_{X'_i, X̃'_i}[k_X(X'_i, X̃'_i)] is always nonnegative.⁵

5.1.1 Effective Sample Size. Now let us assume that the kernel k_X is bounded: there is a constant C > 0 such that sup_{x∈X} k_X(x, x) < C. Then the inequality of theorem 1 can be further bounded as
$$
E_{X'_1, \dots, X'_n}\!\left[ \| \hat{m}_Q - m_Q \|_{\mathcal{H}_\mathcal{X}}^2 \right] \leq 2C \sum_{i=1}^n w_i^2 + \| \hat{m}_P - m_P \|_{\mathcal{H}_\mathcal{X}}^2 \, \| \theta \|_{\mathcal{H}_\mathcal{X} \otimes \mathcal{H}_\mathcal{X}}. \tag{5.5}
$$

This bound shows that two quantities are important in the estimate, equation 5.1: (1) the sum of squared weights Σ_{i=1}^n w_i² and (2) the error ∥m̂_P − m_P∥²_{H_X}. In other words, the error of equation 5.2 can be large if the quantity Σ_{i=1}^n w_i² is large, regardless of the accuracy of equation 5.1 as an estimator of m_P. In fact, an estimator of the form 5.1 can have large Σ_{i=1}^n w_i² even when ∥m̂_P − m_P∥²_{H_X} is small, as shown in section 6.1.

⁵To show this, it is sufficient to prove that ∫∫ k_X(x, x̃) dP(x) dP(x̃) ≤ ∫ k_X(x, x) dP(x) for any probability P. This can be shown as follows: ∫∫ k_X(x, x̃) dP(x) dP(x̃) = ∫∫ ⟨k_X(·, x), k_X(·, x̃)⟩_{H_X} dP(x) dP(x̃) ≤ ∫∫ √(k_X(x, x)) √(k_X(x̃, x̃)) dP(x) dP(x̃) ≤ ∫ k_X(x, x) dP(x). Here we used the reproducing property, the Cauchy-Schwarz inequality, and Jensen's inequality.


Figure 3: An illustration of the sampling procedure with (right) and without (left) the resampling algorithm. The left panel corresponds to the kernel mean estimators, equations 5.1 and 5.2, in section 5.1, and the right panel corresponds to equations 5.6 and 5.7 in section 5.2.

The inverse of the sum of the squared weights, 1/Σ_{i=1}^n w_i², can be interpreted as the effective sample size (ESS) of the empirical kernel mean, equation 5.1. To explain this, suppose that the weights are normalized: Σ_{i=1}^n w_i = 1. Then the ESS takes its maximum n when the weights are uniform, w_1 = ··· = w_n = 1/n. It becomes small when only a few samples have large weights (see the left side of Figure 3). Therefore the bound, equation 5.5, can be interpreted as follows: to make equation 5.2 a good estimator of m_Q, we need to have equation 5.1 such that the ESS is large and the error ∥m̂_P − m_P∥_H is small. Here we borrowed the notion of ESS from the literature on particle methods, in which ESS has also played an important role (see section 2.5.3 of Liu, 2001, and section 3.5 of Doucet & Johansen, 2011).

5.2 Role of Resampling. Based on these arguments, we explain how the resampling step in section 4.2 works as a preprocessing step for the sampling procedure. Consider m̂_P in equation 5.1 as an estimate, equation 4.6, given by the correction step at time t − 1. Then we can think of m̂_Q, equation 5.2, as an estimator of the kernel mean, equation 4.5, of the prior without the resampling step.

mP = 1n

nsumi=1

kX (middot Xi) (56)

The subsequent prediction step is to generate a sample X primei sim p(middot|Xi) for each

Xi (i = 1 n) and estimate mQ as

406 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

mQ = 1n

nsumi=1

kX (middot X primei ) (57)

Theorem 1 gives the following bound for this estimator that corresponds toequation 55

EX prime1X

primen[mQ minus mQ2

HX] le 2C

n+ mP minus mP2

HθHX otimesHX (58)

A comparison of the upper bounds of equations 5.5 and 5.8 implies that the resampling step is beneficial when Σ_{i=1}^n w_i² is large (i.e., the ESS is small) and ∥m̄_P − m̂_P∥_{H_X} is small. The condition on ∥m̄_P − m̂_P∥_{H_X} means that the loss caused by kernel herding (in terms of the RKHS distance) is small. This implies ∥m̄_P − m_P∥_{H_X} ≈ ∥m̂_P − m_P∥_{H_X}, so the second term of equation 5.8 is close to that of equation 5.5. On the other hand, the first term of equation 5.8 will be much smaller than that of equation 5.5 if Σ_{i=1}^n w_i² ≫ 1/n. In other words, the resampling step improves the accuracy of the sampling procedure by increasing the ESS of the kernel mean estimate m̂_P. This is illustrated in Figure 3.

The above observations lead to the following procedures

5.2.1 When to Apply Resampling. If Σ_{i=1}^n w_i² is not large, the gain from the resampling step will be small. Therefore the resampling algorithm should be applied when Σ_{i=1}^n w_i² is above a certain threshold, say 2/n. The same strategy has been commonly used in particle methods (see Doucet & Johansen, 2011). A sketch of this rule is given below.
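A minimal sketch of this rule (the function name and the default threshold are ours; the threshold 2/n on the sum of squared weights corresponds to an ESS below n/2):

    import numpy as np

    def should_resample(w, threshold_ratio=0.5):
        """Decide whether to apply the resampling step (section 5.2.1).

        Resample when sum_i w_i^2 exceeds a threshold such as 2/n, that is,
        when the effective sample size 1 / sum_i w_i^2 falls below n/2.
        """
        n = len(w)
        ess = 1.0 / np.sum(np.asarray(w) ** 2)
        return ess < threshold_ratio * n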

Also, the bound, equation 5.3, of theorem 1 shows that resampling is not beneficial if the variance of the conditional distribution p(·|x) is very small (i.e., if the state transition is nearly deterministic). In this case, the error of the sampling procedure may increase due to the loss ∥m̄_P − m̂_P∥_{H_X} caused by kernel herding.

5.2.2 Reduction of Computational Cost. Algorithm 2 generates n samples X̄_1, ..., X̄_n with time complexity O(n³). Suppose that the first ℓ samples X̄_1, ..., X̄_ℓ, where ℓ < n, already approximate m̂_P well: ∥(1/ℓ) Σ_{i=1}^ℓ k_X(·, X̄_i) − m̂_P∥_{H_X} is small. We then do not need to generate the rest of the samples X̄_{ℓ+1}, ..., X̄_n; we can make n samples by copying the ℓ samples n/ℓ times (suppose n can be divided by ℓ for simplicity, say n = 2ℓ). Let X̃_1, ..., X̃_n denote these n samples. Then (1/ℓ) Σ_{i=1}^ℓ k_X(·, X̄_i) = (1/n) Σ_{i=1}^n k_X(·, X̃_i) by definition, so ∥(1/n) Σ_{i=1}^n k_X(·, X̃_i) − m̂_P∥_{H_X} is also small. This reduces the time complexity of algorithm 2 to O(n²).

One might think that it is unnecessary to copy the ℓ samples n/ℓ times to make n samples. This is not true, however. Suppose that we just use the first ℓ samples to define m̄_P = (1/ℓ) Σ_{i=1}^ℓ k_X(·, X̄_i). Then the first term of equation 5.8 becomes 2C/ℓ, which is larger than the 2C/n obtained with n samples. This difference involves sampling with the conditional distribution, X'_i ∼ p(·|X̄_i): if we use just the ℓ samples, sampling is done ℓ times; if we use the copied n samples, sampling is done n times. Thus the benefit of making n samples comes from sampling with the conditional distribution many times. This matches the bound of theorem 1, where the first term involves the variance of the conditional distribution. A sketch of this procedure is given below.

5.3 Convergence Rates for Resampling. Our resampling algorithm (see algorithm 2) is an approximate version of kernel herding in section 3.5: algorithm 2 searches for the solutions of the update equations 3.6 and 3.7 from a finite set {X_1, ..., X_n} ⊂ X, not from the entire space X. Therefore existing theoretical guarantees for kernel herding (Chen et al., 2010; Bach et al., 2012) do not apply to algorithm 2. Here we provide a theoretical justification.

5.3.1 Generalized Version. We consider a slightly generalized version, shown in algorithm 4. It takes as input (1) a kernel mean estimator m̂_P of a kernel mean m_P, (2) candidate samples Z_1, ..., Z_N, and (3) the number ℓ of resampled points. It then outputs resampled points X̄_1, ..., X̄_ℓ ∈ {Z_1, ..., Z_N}, which form a new estimator m̄_P = (1/ℓ) Σ_{i=1}^ℓ k_X(·, X̄_i). Here N is the number of candidate samples.

Algorithm 4 searches for solutions of the update equations 3.6 and 3.7 from the candidate set {Z_1, ..., Z_N}. Note that here these samples Z_1, ..., Z_N can be different from those expressing the estimator m̂_P. If they are the same—that is, if the estimator is expressed as m̂_P = Σ_{i=1}^n w_{t,i} k(·, X_i) with n = N and X_i = Z_i (i = 1, ..., n)—then algorithm 4 reduces to algorithm 2. In fact, theorem 2 allows m̂_P to be any element in the RKHS.

5.3.2 Convergence Rates in Terms of N and ℓ. Algorithm 4 gives the new estimator m̄_P of the kernel mean m_P. The error of this new estimator, ∥m̄_P − m_P∥_{H_X}, should be close to that of the given estimator, ∥m̂_P − m_P∥_{H_X}. Theorem 2 guarantees this. In particular, it provides convergence rates of ∥m̄_P − m_P∥_{H_X} approaching ∥m̂_P − m_P∥_{H_X} as N and ℓ go to infinity. This theorem follows from theorem 3 in appendix B, which holds under weaker assumptions.

Theorem 2. Let m_P be the kernel mean of a distribution P and m̂_P be any element in the RKHS H_X. Let Z_1, ..., Z_N be an i.i.d. sample from a distribution with density q. Assume that P has a density function p such that sup_{x∈X} p(x)/q(x) < ∞. Let X̄_1, ..., X̄_ℓ be the samples given by algorithm 4 applied to m̂_P with candidate samples Z_1, ..., Z_N. Then for m̄_P = (1/ℓ) Σ_{i=1}^ℓ k(·, X̄_i), we have
$$
\| \bar{m}_P - m_P \|_{\mathcal{H}_\mathcal{X}}^2 = \left( \| \hat{m}_P - m_P \|_{\mathcal{H}_\mathcal{X}} + O_p(N^{-1/2}) \right)^2 + O\!\left( \frac{\ln \ell}{\ell} \right) \quad (N, \ell \to \infty). \tag{5.9}
$$

Our proof in appendix B relies on the fact that kernel herding can be seen as the Frank-Wolfe optimization method (Bach et al., 2012). Indeed, the error O(ln ℓ/ℓ) in equation 5.9 comes from the optimization error of the Frank-Wolfe method after ℓ iterations (Freund & Grigas, 2014, bound 3.2). The error O_p(N^{-1/2}) is due to the approximation of the solution space by a finite set {Z_1, ..., Z_N}. These errors will be small if N and ℓ are large enough and the error of the given estimator, ∥m̂_P − m_P∥_{H_X}, is relatively large. This is formally stated in corollary 1 below.

Theorem 2 assumes that the candidate samples are i.i.d. with a density q. The assumption sup_{x∈X} p(x)/q(x) < ∞ requires that the support of q contain that of p. This is a formal characterization of the explanation in section 4.2 that the samples X_1, ..., X_N should cover the support of P sufficiently. Note that the statement of theorem 2 also holds for non-i.i.d. candidate samples, as shown in theorem 3 of appendix B.

5.3.3 Convergence Rates as m̂_P Goes to m_P. Theorem 2 provides convergence rates when the estimator m̂_P is fixed. In corollary 1 below, we let m̂_P approach m_P and provide convergence rates at which the m̄_P of algorithm 4 approaches m_P. This corollary directly follows from theorem 2, since the constant terms in O_p(N^{-1/2}) and O(ln ℓ/ℓ) in equation 5.9 do not depend on m̂_P, as can be seen from the proof in appendix B.

Corollary 1. Assume that P and Z_1, ..., Z_N satisfy the conditions in theorem 2 for all N. Let m̂_P^{(n)} be an estimator of m_P such that ∥m̂_P^{(n)} − m_P∥_{H_X} = O_p(n^{-b}) as n → ∞ for some constant b > 0.⁶ Let N = ℓ = n^{2b}. Let X̄_1^{(n)}, ..., X̄_ℓ^{(n)} be the samples given by algorithm 4 applied to m̂_P^{(n)} with candidate samples Z_1, ..., Z_N. Then for m̄_P^{(n)} = (1/ℓ) Σ_{i=1}^ℓ k_X(·, X̄_i^{(n)}), we have
$$
\| \bar{m}_P^{(n)} - m_P \|_{\mathcal{H}_\mathcal{X}} = O_p(n^{-b}) \quad (n \to \infty). \tag{5.10}
$$

⁶Here the estimator m̂_P^{(n)} and the candidate samples Z_1, ..., Z_N can be dependent.

Corollary 1 assumes that the estimator m̂_P^{(n)} converges to m_P at a rate O_p(n^{-b}) for some constant b > 0. Then the resulting estimator m̄_P^{(n)} of algorithm 4 also converges to m_P at the same rate O_p(n^{-b}) if we set N = ℓ = n^{2b}. This implies that if we use sufficiently large N and ℓ, the errors O_p(N^{-1/2}) and O(ln ℓ/ℓ) in equation 5.9 can be negligible, as stated earlier. Note that N = ℓ = n^{2b} implies that N and ℓ can be smaller than n, since typically we have b ≤ 1/2 (b = 1/2 corresponds to the convergence rates of parametric models). This provides support for the discussion in section 5.2 (reduction of computational cost).

5.3.4 Convergence Rates of Sampling after Resampling. We can derive convergence rates of the estimator m̄_Q, equation 5.7, in section 5.2. Here we consider the following construction of m̄_Q, as discussed in section 5.2 (reduction of computational cost). First, apply algorithm 4 to m̂_P^{(n)} and obtain resampled points X̄_1^{(n)}, ..., X̄_ℓ^{(n)} ∈ {Z_1, ..., Z_N}. Then copy these samples n/ℓ times, and let X̃_1^{(n)}, ..., X̃_n^{(n)} be the resulting n samples. Finally, sample with the conditional distribution, X'^{(n)}_i ∼ p(·|X̃_i^{(n)}) (i = 1, ..., n), and define
$$
\bar{m}_Q^{(n)} = \frac{1}{n} \sum_{i=1}^n k_{\mathcal{X}}(\cdot, X'^{(n)}_i). \tag{5.11}
$$

The following corollary is a consequence of corollary 1, theorem 1, and the bound, equation 5.8. Note that theorem 1 obtains convergence in expectation, which implies convergence in probability.

Corollary 2. Let θ be the function defined in theorem 1, and assume θ ∈ H_X ⊗ H_X. Assume that P and Z_1, ..., Z_N satisfy the conditions in theorem 2 for all N. Let m̂_P^{(n)} be an estimator of m_P such that ∥m̂_P^{(n)} − m_P∥_{H_X} = O_p(n^{-b}) as n → ∞ for some constant b > 0. Let N = ℓ = n^{2b}. Then for the estimator m̄_Q^{(n)} defined in equation 5.11, we have
$$
\| \bar{m}_Q^{(n)} - m_Q \|_{\mathcal{H}_\mathcal{X}} = O_p(n^{-\min(b, 1/2)}) \quad (n \to \infty).
$$

Suppose b ≤ 1/2, which holds for basically any nonparametric estimator. Then corollary 2 shows that the estimator m̄_Q^{(n)} achieves the same convergence rate as the input estimator m̂_P^{(n)}. Note that without resampling, the rate becomes O_p((Σ_{i=1}^n (w_i^{(n)})²)^{1/2} + n^{-b}), where the weights are given by the input estimator m̂_P^{(n)} = Σ_{i=1}^n w_i^{(n)} k_X(·, X_i^{(n)}) (see the bound, equation 5.5). Thanks to resampling, (the square root of) the sum of the squared weights in the case of corollary 2 becomes 1/√n, which is usually smaller than √(Σ_{i=1}^n (w_i^{(n)})²) and is faster than or equal to O_p(n^{-b}). This shows the merit of resampling in terms of convergence rates (see also the discussions in section 5.2).

5.4 Consistency of the Overall Procedure. Here we show the consistency of the overall procedure in KMCF. This is based on corollary 2, which shows the consistency of the resampling step followed by the prediction step, and on theorem 5 of Fukumizu et al. (2013), which guarantees the consistency of kernel Bayes' rule in the correction step. Thus we consider three steps in the following order: resampling, prediction, and correction. More specifically, we show the consistency of the estimator, equation 4.6, of the posterior kernel mean at time t, given that the one at time t − 1 is consistent.

To state our assumptions, we will need the following functions θ_pos: Y × Y → R, θ_obs: X × X → R, and θ_tra: X × X → R:
$$
\begin{aligned}
\theta_{\mathrm{pos}}(y, \tilde{y}) &= \int\!\!\int k_{\mathcal{X}}(x_t, \tilde{x}_t)\, dp(x_t|y_{1:t-1}, y_t = y)\, dp(\tilde{x}_t|y_{1:t-1}, y_t = \tilde{y}), \qquad &(5.12) \\
\theta_{\mathrm{obs}}(x, \tilde{x}) &= \int\!\!\int k_{\mathcal{Y}}(y_t, \tilde{y}_t)\, dp(y_t|x_t = x)\, dp(\tilde{y}_t|x_t = \tilde{x}), \qquad &(5.13) \\
\theta_{\mathrm{tra}}(x, \tilde{x}) &= \int\!\!\int k_{\mathcal{X}}(x_t, \tilde{x}_t)\, dp(x_t|x_{t-1} = x)\, dp(\tilde{x}_t|x_{t-1} = \tilde{x}). \qquad &(5.14)
\end{aligned}
$$
These functions contain the information concerning the distributions involved. In equation 5.12, the distribution p(x_t|y_{1:t-1}, y_t = y) denotes the posterior of the state at time t, given that the observation at time t is y_t = y; similarly, p(x̃_t|y_{1:t-1}, y_t = ỹ) is the posterior at time t given that the observation is y_t = ỹ. In equation 5.13, the distributions p(y_t|x_t = x) and p(ỹ_t|x_t = x̃) denote the observation model when the state is x_t = x or x_t = x̃, respectively. In equation 5.14, the distributions p(x_t|x_{t-1} = x) and p(x̃_t|x_{t-1} = x̃) denote the transition model with the previous state given by x_{t-1} = x or x_{t-1} = x̃, respectively.

For simplicity of presentation, we consider here N = ℓ = n for the resampling step. Below, denote by F ⊗ G the tensor product space of two RKHSs F and G.

Corollary 3. Let (X_1, Y_1), ..., (X_n, Y_n) be an i.i.d. sample with a joint density p(x, y) = p(y|x)q(x), where p(y|x) is the observation model. Assume that the posterior p(x_t|y_{1:t}) has a density p and that sup_{x∈X} p(x)/q(x) < ∞. Assume that the functions defined by equations 5.12 to 5.14 satisfy θ_pos ∈ H_Y ⊗ H_Y, θ_obs ∈ H_X ⊗ H_X, and θ_tra ∈ H_X ⊗ H_X, respectively. Suppose that ∥m̂_{x_{t−1}|y_{1:t−1}} − m_{x_{t−1}|y_{1:t−1}}∥_{H_X} → 0 as n → ∞ in probability. Then, for any sufficiently slow decay of the regularization constants ε_n and δ_n of algorithm 1, we have
$$
\| \hat{m}_{x_t|y_{1:t}} - m_{x_t|y_{1:t}} \|_{\mathcal{H}_\mathcal{X}} \to 0 \quad (n \to \infty)
$$
in probability.

It would be more interesting to investigate the convergence rates of theoverall procedure However this requires a refined theoretical analysis ofkernel Bayesrsquo rule which is beyond the scope of this letter This is becausecurrently there is no theoretical result on convergence rates of kernel Bayesrsquorule as an estimator of a posterior kernel mean (existing convergence resultsare for the expectation of function values see theorems 6 and 7 in Fukumizuet al 2013) This remains a topic for future research

6 Experiments

This section is devoted to experiments. In section 6.1, we conduct basic experiments on the prediction and resampling steps before going on to the filtering problem; here we consider the problem described in section 5. In section 6.2, the proposed KMCF (see algorithm 3) is applied to synthetic state-space models. Comparisons are made with existing methods applicable to the setting of this letter (see also section 2). In section 6.3, we apply KMCF to the real problem of vision-based robot localization.

In the following, N(μ, σ²) denotes the gaussian distribution with mean μ ∈ R and variance σ² > 0.

6.1 Sampling and Resampling Procedures. The purpose here is to see how the prediction and resampling steps work empirically. To this end, we consider the problem described in section 5 with X = R (see section 5.1 for details). Specifications of the problem are described below.

We will need to evaluate the errors ∥m̂_P − m_P∥_{H_X} and ∥m̂_Q − m_Q∥_{H_X}, so we need to know the true kernel means m_P and m_Q. To this end, we define the distributions and the kernel to be gaussian; this allows us to obtain analytic expressions for m_P and m_Q.

6.1.1 Distributions and Kernel. More specifically, we define the marginal P and the conditional distribution p(·|x) to be gaussian: P = N(0, σ_P²) and p(·|x) = N(x, σ_cond²). Then the resulting Q = ∫ p(·|x) dP(x) also becomes gaussian: Q = N(0, σ_P² + σ_cond²). We define k_X to be the gaussian kernel k_X(x, x') = exp(−(x − x')²/2γ²). We set σ_P = σ_cond = γ = 0.1.

6.1.2 Kernel Means. Due to the convolution theorem of gaussian functions, the kernel means m_P = ∫ k_X(·, x) dP(x) and m_Q = ∫ k_X(·, x) dQ(x) can be analytically computed:
$$
m_P(x) = \sqrt{\frac{\gamma^2}{\sigma_P^2 + \gamma^2}} \exp\!\left( -\frac{x^2}{2(\gamma^2 + \sigma_P^2)} \right), \qquad
m_Q(x) = \sqrt{\frac{\gamma^2}{\sigma_P^2 + \sigma_{\mathrm{cond}}^2 + \gamma^2}} \exp\!\left( -\frac{x^2}{2(\sigma_P^2 + \sigma_{\mathrm{cond}}^2 + \gamma^2)} \right).
$$

6.1.3 Empirical Estimates. We artificially defined an estimate m̂_P = Σ_{i=1}^n w_i k_X(·, X_i) as follows. First, we generated n = 100 samples X_1, ..., X_100 from a uniform distribution on [−A, A] with some A > 0 (specified below). We computed the weights w_1, ..., w_n by solving an optimization problem,
$$
\min_{w \in \mathbb{R}^n} \left\| \sum_{i=1}^n w_i k_{\mathcal{X}}(\cdot, X_i) - m_P \right\|_{\mathcal{H}}^2 + \lambda \|w\|^2,
$$
and then applied normalization so that Σ_{i=1}^n w_i = 1. Here λ > 0 is a regularization constant, which allows us to control the trade-off between the error ∥m̂_P − m_P∥²_{H_X} and the quantity Σ_{i=1}^n w_i² = ∥w∥². If λ is very small, the resulting m̂_P becomes accurate, that is, ∥m̂_P − m_P∥²_{H_X} is small, but it has large Σ_{i=1}^n w_i². If λ is large, the error ∥m̂_P − m_P∥²_{H_X} may not be very small, but Σ_{i=1}^n w_i² becomes small. This enables us to see how the error ∥m̂_Q − m_Q∥²_{H_X} changes as we vary these quantities. A sketch of this construction is given below.
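A minimal sketch of this construction, assuming the closed-form m_P of section 6.1.2; the minimizer of the regularized objective is w = (K + λI)^{-1} b with b_i = m_P(X_i), and the function name and defaults are ours:

    import numpy as np

    def make_weighted_estimate(n=100, A=1.0, lam=1e-6, sigma_P=0.1, gamma=0.1, seed=0):
        """Construct the weights of mhat_P = sum_i w_i k_X(., X_i) as in section 6.1.3."""
        rng = np.random.default_rng(seed)
        X = rng.uniform(-A, A, size=n)

        # gaussian kernel matrix and the analytic kernel mean m_P evaluated at X
        K = np.exp(-(X[:, None] - X[None, :]) ** 2 / (2 * gamma ** 2))
        b = np.sqrt(gamma ** 2 / (sigma_P ** 2 + gamma ** 2)) * np.exp(
            -X ** 2 / (2 * (gamma ** 2 + sigma_P ** 2)))

        w = np.linalg.solve(K + lam * np.eye(n), b)   # regularized least squares in the RKHS
        return X, w / w.sum()                          # normalize so the weights sum to one

Varying lam then traces out the trade-off between the RKHS error of m̂_P and the sum of squared weights used in the comparisons below.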

6.1.4 Comparison. Given m̂_P = Σ_{i=1}^n w_i k_X(·, X_i), we wish to estimate the kernel mean m_Q. We compare three estimators:

• woRes: Estimate m_Q without resampling. Generate samples X'_i ∼ p(·|X_i) to produce the estimate m̂_Q = Σ_{i=1}^n w_i k_X(·, X'_i). This corresponds to the estimator discussed in section 5.1.

• Res-KH: First apply the resampling algorithm of algorithm 2 to m̂_P, yielding X̄_1, ..., X̄_n. Then generate X'_i ∼ p(·|X̄_i) for each X̄_i, giving the estimate m̄_Q = (1/n) Σ_{i=1}^n k(·, X'_i). This is the estimator discussed in section 5.2.

• Res-Trunc: Instead of algorithm 2, first truncate the negative weights in w_1, ..., w_n to be 0, and normalize so that the sum of the weights is 1. Then apply the multinomial resampling algorithm of particle methods, and estimate m_Q as in Res-KH.

6.1.5 Demonstration. Before starting quantitative comparisons, we demonstrate how the above estimators work. Figure 4 shows demonstration results with A = 1. First, note that for m̂_P = Σ_{i=1}^n w_i k(·, X_i), samples associated with large weights are located around the mean of P, as the standard deviation of P is relatively small (σ_P = 0.1). Note also that some of the weights are negative. In this example, the error of m̂_P is very small, ∥m̂_P − m_P∥²_{H_X} = 8.49 × 10⁻¹⁰, while that of the estimate m̂_Q given by woRes is ∥m̂_Q − m_Q∥²_{H_X} = 0.125. This shows that even if ∥m̂_P − m_P∥²_{H_X} is very small, the resulting ∥m̂_Q − m_Q∥²_{H_X} may not be small, as implied by theorem 1 and the bound, equation 5.5.

We can observe the following. First, algorithm 2 successfully discarded samples associated with very small weights. Almost all the generated samples X̄_1, ..., X̄_n are located in [−2σ_P, 2σ_P], where σ_P is the standard deviation of P. The error is ∥m̄_P − m_P∥²_{H_X} = 4.74 × 10⁻⁵, which is greater than ∥m̂_P − m_P∥²_{H_X}. This is due to the additional error caused by the resampling algorithm. Note that the resulting estimate m̄_Q has error ∥m̄_Q − m_Q∥²_{H_X} = 0.00827. This is much smaller than the estimate m̂_Q by woRes, showing the merit of the resampling algorithm.

Res-Trunc first truncated the negative weights in w_1, ..., w_n. Let us see the region where the density of P is very small, that is, the region outside [−2σ_P, 2σ_P]. We can observe that the absolute values of the weights are very small in this region. Note that there exist positive and negative weights. These weights maintain balance such that the amounts of positive and negative values are almost the same. Therefore the truncation of the negative weights breaks this balance. As a result, the amount of the positive weights surpasses the amount needed to represent the density of P. This can be seen from the histogram for Res-Trunc: some of the samples X̄_1, ..., X̄_n generated by Res-Trunc are located in the region where the density of P is very small. Thus the resulting error ∥m̄_P − m_P∥²_{H_X} = 0.0538 is much larger than that of Res-KH. This demonstrates why the resampling algorithm of particle methods is not appropriate for kernel mean embeddings, as discussed in section 4.2.

6.1.6 Effects of the Sum of Squared Weights. The purpose here is to see how the error $\|\hat{m}_Q - m_Q\|^2_{\mathcal{H}_X}$ changes as we vary the quantity $\sum_{i=1}^n w_i^2$ (recall that the bound, equation 5.5, indicates that $\|\hat{m}_Q - m_Q\|^2_{\mathcal{H}_X}$ increases as $\sum_{i=1}^n w_i^2$ increases). To this end, we made $\hat{m}_P = \sum_{i=1}^n w_i k_X(\cdot, X_i)$ for several values of the regularization constant $\lambda$, as described above. For each $\lambda$, we constructed $\hat{m}_P$ and estimated $m_Q$ using each of the three estimators above. We repeated this 20 times for each $\lambda$ and averaged the values of $\|\hat{m}_P - m_P\|^2_{\mathcal{H}_X}$, $\sum_{i=1}^n w_i^2$, and the errors $\|\hat{m}_Q - m_Q\|^2_{\mathcal{H}_X}$ by the three estimators. Figure 5 shows these results, where both axes are in the log scale. Here we used $A = 5$ for the support of the uniform distribution.7 The results are summarized as follows.

7 This enables us to maintain the values for $\|\hat{m}_P - m_P\|^2_{\mathcal{H}_X}$ in almost the same amount, while changing the values for $\sum_{i=1}^n w_i^2$.

Figure 5: Results of synthetic experiments for the sampling and resampling procedure in section 6.1. Vertical axis: errors in the squared RKHS norm. Horizontal axis: values of $\sum_{i=1}^n w_i^2$ for different $\hat{m}_P$. Black: the error of $\hat{m}_P$ ($\|\hat{m}_P - m_P\|^2_{\mathcal{H}_X}$). Blue, green, and red: the errors on $m_Q$ by woRes, Res-KH, and Res-Trunc, respectively.

• The error of woRes (blue) increases proportionally to the amount of $\sum_{i=1}^n w_i^2$. This matches the bound, equation 5.5.

• The error of Res-KH is not affected by $\sum_{i=1}^n w_i^2$. Rather, it changes in parallel with the error of $\hat{m}_P$. This is explained by the discussions in section 5.2 on how our resampling algorithm improves the accuracy of the sampling procedure.

• Res-Trunc is worse than Res-KH, especially for large $\sum_{i=1}^n w_i^2$. This is also explained with the bound, equation 5.8. Here $\bar{m}_P$ is the one given by Res-Trunc, so the error $\|\bar{m}_P - m_P\|_{\mathcal{H}_X}$ can be large due to the truncation of negative weights, as shown in the demonstration results. This makes the resulting error $\|\hat{m}_Q - m_Q\|_{\mathcal{H}_X}$ large.

Note that $m_P$ and $m_Q$ are different kernel means, so it can happen that the errors $\|\hat{m}_Q - m_Q\|_{\mathcal{H}_X}$ by Res-KH are less than $\|\hat{m}_P - m_P\|_{\mathcal{H}_X}$, as in Figure 5.

6.2 Filtering with Synthetic State-Space Models

Here we apply KMCF to synthetic state-space models. Comparisons were made with the following methods.

kNN-PF (Vlassis et al., 2002). This method uses k-NN-based conditional density estimation (Stone, 1977) for learning the observation model. First, it estimates the conditional density of the inverse direction, $p(x|y)$, from the training sample $\{(X_i, Y_i)\}$. The learned conditional density is then used as an alternative for the likelihood $p(y_t|x_t)$; this is a heuristic to deal with high-dimensional $y_t$. Then it applies particle filter (PF) based on the approximated observation model and the given transition model $p(x_t|x_{t-1})$.

GP-PF (Ferris et al., 2006). This method learns $p(y_t|x_t)$ from $\{(X_i, Y_i)\}$ with gaussian process (GP) regression. Then particle filter is applied based on the learned observation model and the transition model. We used the open source code for GP-regression in this experiment, so comparison in computational time is omitted for this method.8

KBR filter (Fukumizu et al., 2011, 2013). This method is also based on kernel mean embeddings, as is KMCF. It applies kernel Bayes' rule (KBR) in posterior estimation, using the joint sample $\{(X_i, Y_i)\}$. This method assumes that there also exist training samples for the transition model. Thus, in the following experiments, we additionally drew training samples for the transition model. Fukumizu et al. (2011, 2013) showed that this method outperforms extended and unscented Kalman filters when a state-space model has strong nonlinearity (in that experiment, these Kalman filters were given the full knowledge of a state-space model). We use this method as a baseline.

We used the state-space models defined in Table 2, where "SSM" stands for "state-space model." In Table 2, $u_t$ denotes a control input at time $t$; $v_t$ and $w_t$ denote independent gaussian noise, $v_t, w_t \sim N(0, 1)$; and $W_t$ denotes 10-dimensional gaussian noise, $W_t \sim N(0, I_{10})$. We generated each control $u_t$ randomly from the gaussian distribution $N(0, 1)$.

The state and observation spaces for SSMs 1a, 1b, 2a, 2b, 4a, 4b are defined as $\mathcal{X} = \mathcal{Y} = \mathbb{R}$; for SSMs 3a, 3b, $\mathcal{X} = \mathbb{R}$ and $\mathcal{Y} = \mathbb{R}^{10}$. The models in SSMs 1a, 2a, 3a, 4a and SSMs 1b, 2b, 3b, 4b with the same number (e.g., 1a and 1b) are almost the same; the difference is whether $u_t$ exists in the transition model. Prior distributions for the initial state $x_1$ for SSMs 1a, 1b, 2a, 2b, 3a, 3b are defined as $p_{\mathrm{init}} = N(0, 1/(1 - 0.9^2))$, and those for 4a, 4b are defined as a uniform distribution on $[-3, 3]$.

SSMs 1a and 1b are linear gaussian models. SSMs 2a and 2b are the so-called stochastic volatility models. Their transition models are the same as those of SSMs 1a and 1b. The observation model has strong nonlinearity, and the noise $w_t$ is multiplicative.

8 http://www.gaussianprocess.org/gpml/code/matlab/doc/

Table 2: State-Space Models for Synthetic Experiments.

SSM 1a. Transition: $x_t = 0.9 x_{t-1} + v_t$. Observation: $y_t = x_t + w_t$.
SSM 1b. Transition: $x_t = 0.9 x_{t-1} + \frac{1}{\sqrt{2}}(u_t + v_t)$. Observation: $y_t = x_t + w_t$.
SSM 2a. Transition: $x_t = 0.9 x_{t-1} + v_t$. Observation: $y_t = 0.5\exp(x_t/2)\,w_t$.
SSM 2b. Transition: $x_t = 0.9 x_{t-1} + \frac{1}{\sqrt{2}}(u_t + v_t)$. Observation: $y_t = 0.5\exp(x_t/2)\,w_t$.
SSM 3a. Transition: $x_t = 0.9 x_{t-1} + v_t$. Observation: $y_t = 0.5\exp(x_t/2)\,W_t$.
SSM 3b. Transition: $x_t = 0.9 x_{t-1} + \frac{1}{\sqrt{2}}(u_t + v_t)$. Observation: $y_t = 0.5\exp(x_t/2)\,W_t$.
SSM 4a. Transition: $a_t = x_{t-1} + \sqrt{2}\,v_t$; $x_t = a_t$ if $|a_t| \le 3$, and $x_t = -3$ otherwise. Observation: $b_t = x_t + w_t$; $y_t = b_t$ if $|b_t| \le 3$, and $y_t = b_t - 6 b_t/|b_t|$ otherwise.
SSM 4b. Transition: $a_t = x_{t-1} + u_t + v_t$; $x_t = a_t$ if $|a_t| \le 3$, and $x_t = -3$ otherwise. Observation: $b_t = x_t + w_t$; $y_t = b_t$ if $|b_t| \le 3$, and $y_t = b_t - 6 b_t/|b_t|$ otherwise.

SSMs 3a and 3b are almost the same as SSMs 2a and 2b. The difference is that the observation $y_t$ is 10-dimensional, as $W_t$ is 10-dimensional gaussian noise. SSMs 4a and 4b are more complex than the other models. Both the transition and observation models have strong nonlinearities: states and observations located around the edges of the interval $[-3, 3]$ may abruptly jump to distant places.
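As a concrete reference for how such training and test data can be produced, here is a minimal simulation sketch for SSM 2a (our illustration, not the authors' experimental code):

    import numpy as np

    def simulate_ssm2a(T, rng):
        # SSM 2a from Table 2: x_t = 0.9 x_{t-1} + v_t,  y_t = 0.5 exp(x_t / 2) w_t,
        # with v_t, w_t ~ N(0, 1) and prior x_1 ~ N(0, 1 / (1 - 0.9^2)).
        x = np.empty(T)
        y = np.empty(T)
        x[0] = rng.normal(0.0, np.sqrt(1.0 / (1.0 - 0.81)))
        for t in range(T):
            if t > 0:
                x[t] = 0.9 * x[t - 1] + rng.normal()
            y[t] = 0.5 * np.exp(x[t] / 2.0) * rng.normal()
        return x, y

    rng = np.random.default_rng(0)
    X_train, Y_train = simulate_ssm2a(1000, rng)   # state-observation examples {(X_i, Y_i)}
    x_test, y_test = simulate_ssm2a(100, rng)      # hidden test states and observed sequence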

For each model, we generated the training samples $\{(X_i, Y_i)\}_{i=1}^n$ by simulating the model. Test data $\{(x_t, y_t)\}_{t=1}^T$ were also generated by independent simulation (recall that $x_t$ is hidden for each method). The length of the test sequence was set as $T = 100$. We fixed the number of particles in kNN-PF and GP-PF to 5000; in primary experiments, we did not observe any improvements even when more particles were used. For the same reason, we fixed the size of transition examples for the KBR filter to 1000. Each method estimated the ground-truth states $x_1, \dots, x_T$ by estimating the posterior means $\int x_t\, p(x_t|y_{1:t})\,dx_t$ $(t = 1, \dots, T)$. The performance was evaluated with the RMSE (root mean squared error) of the point estimates, defined as

$$\mathrm{RMSE} = \sqrt{\frac{1}{T}\sum_{t=1}^T (x_t - \hat{x}_t)^2},$$

where $\hat{x}_t$ is the point estimate.

For KMCF and the KBR filter, we used gaussian kernels for each of $\mathcal{X}$ and $\mathcal{Y}$ (and also for controls in the KBR filter). We determined the hyperparameters of each method by two-fold cross-validation, by dividing the training data into two sequences. The hyperparameters in the GP-regressor for GP-PF were optimized by maximizing the marginal likelihood of the training data. To reduce the costs of the resampling step of KMCF, we used the method discussed in section 5.2 with $\ell = 50$. We also used the low-rank approximation method (see algorithm 5) and the subsampling method (see algorithm 6) in appendix C to reduce the computational costs of KMCF. Specifically, we used $r = 10, 20$ (rank of low-rank matrices) for algorithm 5 (described as KMCF-low10 and KMCF-low20 in the results below) and $r = 50, 100$ (number of subsamples) for algorithm 6 (described as KMCF-sub50 and KMCF-sub100). We repeated experiments 20 times for each of different training sample sizes $n$.

Figure 6: RMSE of the synthetic experiments in section 6.2. The state-space models of these figures have no control in their transition models.

Figure 6 shows the results in RMSE for SSMs 1a, 2a, 3a, 4a, and Figure 7 shows those for SSMs 1b, 2b, 3b, 4b. Figure 8 describes the results in computational time for SSMs 1a and 1b; the results for the other models are similar, so we omit them. We do not show the results of KMCF-low10 in Figures 6 and 7, since they were numerically unstable and gave very large RMSEs.

Figure 7: RMSE of synthetic experiments in section 6.2. The state-space models of these figures include control $u_t$ in their transition models.

Figure 8: Computation time of synthetic experiments in section 6.2. (Left) SSM 1a. (Right) SSM 1b.

GP-PF performed the best for SSMs 1a and 1b. This may be because these models fit the assumption of GP-regression, as their noise is additive gaussian. For the other models, however, GP-PF performed poorly: the observation models of these models have strong nonlinearities, and the noise is not additive gaussian. For these models, KMCF performed the best or competitively with the other methods. This indicates that KMCF successfully exploits the state-observation examples $\{(X_i, Y_i)\}_{i=1}^n$ in dealing with the complicated observation models. Recall that our focus has been on situations where the relations between states and observations are so complicated that the observation model is not known; the results indicate that KMCF is promising for such situations. On the other hand, the KBR filter performed worse than KMCF for most of the models. The KBR filter also uses kernel Bayes' rule, as does KMCF. The difference is that KMCF makes use of the transition models directly by sampling, while the KBR filter must learn the transition models from training data for state transitions. This indicates that the incorporation of the knowledge expressed in the transition model is very important for the filtering performance. This can also be seen by comparing Figures 6 and 7. The performance of the methods other than the KBR filter improved for SSMs 1b, 2b, 3b, 4b, compared to the performance for the corresponding models in SSMs 1a, 2a, 3a, 4a. Recall that SSMs 1b, 2b, 3b, 4b include control $u_t$ in their transition models. The information of control input is helpful for filtering in general. Thus the improvements suggest that KMCF, kNN-PF, and GP-PF successfully incorporate the information of controls; they achieve this by sampling with $p(x_t|x_{t-1}, u_t)$. On the other hand, the KBR filter must learn the transition model $p(x_t|x_{t-1}, u_t)$; this can be harder than learning the transition model $p(x_t|x_{t-1})$, which has no control input.

We next compare computation time (see Figure 8). KMCF was competitive with, or even slower than, the KBR filter. This is due to the resampling step in KMCF. The speed-up methods (KMCF-low10, KMCF-low20, KMCF-sub50, and KMCF-sub100) successfully reduced the costs of KMCF. KMCF-low10 and KMCF-low20 scaled linearly to the sample size $n$; this matches the fact that algorithm 5 reduces the costs of kernel Bayes' rule to $O(nr^2)$. The costs of KMCF-sub50 and KMCF-sub100 remained almost the same over the different sample sizes. This is because they reduce the sample size itself from $n$ to $r$, so the costs are reduced to $O(r^3)$ (see algorithm 6). KMCF-sub50 and KMCF-sub100 are competitive to kNN-PF, which is fast as it only needs kNN searches to deal with the training sample $\{(X_i, Y_i)\}_{i=1}^n$. In Figures 6 and 7, KMCF-low20 and KMCF-sub100 produced results competitive to KMCF for SSMs 1a, 2a, 4a, 1b, 2b, 4b. Thus, for these models, such methods reduce the computational costs of KMCF without losing much accuracy. KMCF-sub50 was slightly worse than KMCF-sub100. This indicates that the number of subsamples cannot be reduced to this extent if we wish to maintain accuracy. For SSMs 3a and 3b, the performance of KMCF-low20 and KMCF-sub100 was worse than KMCF, in contrast to the performance for the other models. The difference of SSMs 3a and 3b from the other models is that the observation space is 10-dimensional, $\mathcal{Y} = \mathbb{R}^{10}$. This suggests that if the dimension is high, $r$ needs to be large to maintain accuracy (recall that $r$ is the rank of low-rank matrices in algorithm 5 and the number of subsamples in algorithm 6). This is also implied by the experiments in section 6.3.

6.3 Vision-Based Mobile Robot Localization

We applied KMCF to the problem of vision-based mobile robot localization (Vlassis et al., 2002; Wolf et al., 2005; Quigley et al., 2010). We consider a robot moving in a building. The robot takes images with its vision camera as it moves; thus the vision images form a sequence of observations $y_1, \dots, y_T$ in time series, each $y_t$ being an image. The robot does not know its positions in the building; we define state $x_t$ as the robot's position at time $t$. The robot wishes to estimate its position $x_t$ from the sequence of its vision images $y_1, \dots, y_t$. This can be done by filtering, that is, by estimating the posteriors $p(x_t|y_1, \dots, y_t)$ $(t = 1, \dots, T)$. This is the robot localization problem. It is fundamental in robotics as a basis for more involved applications such as navigation and reinforcement learning (Thrun, Burgard, & Fox, 2005).

The state-space model is defined as follows. The observation model $p(y_t|x_t)$ is the conditional distribution of images given position, which is very complicated and considered unknown. We need to assume position-image examples $\{(X_i, Y_i)\}_{i=1}^n$; these samples are given in the data set described below. The transition model $p(x_t|x_{t-1}) = p(x_t|x_{t-1}, u_t)$ is the conditional distribution of the current position given the previous one. This involves a control input $u_t$ that specifies the movement of the robot. In the data set we use, the control is given as odometry measurements. Thus we define $p(x_t|x_{t-1}, u_t)$ as the odometry motion model, which is fairly standard in robotics (Thrun et al., 2005). Specifically, we used the algorithm described in table 5.6 of Thrun et al. (2005) with all of its parameters fixed to 0.1. The prior $p_{\mathrm{init}}$ of the initial position $x_1$ is defined as a uniform distribution over the samples $X_1, \dots, X_n$ in $\{(X_i, Y_i)\}_{i=1}^n$.

As a kernel $k_Y$ for observations (images), we used the spatial pyramid matching kernel of Lazebnik et al. (2006). This is a positive-definite kernel developed in the computer vision community and is also fairly standard. Specifically, we set the parameters of this kernel as suggested in Lazebnik et al. (2006). This gives a 4200-dimensional histogram for each image. We defined the kernel $k_X$ for states (positions) as gaussian. Here the state space is the four-dimensional space $\mathcal{X} = \mathbb{R}^4$: two dimensions for location and the rest for the orientation of the robot.9

9 We projected the robot's orientation in $[0, 2\pi]$ onto the unit circle in $\mathbb{R}^2$.

The data set we used is the COLD database (Pronobis & Caputo, 2009), which is publicly available. Specifically, we used the data set Freiburg, Part A, Path 1, cloudy. This data set consists of three similar trajectories of a robot moving in a building, each of which provides position-image pairs $\{(x_t, y_t)\}_{t=1}^T$. We used two trajectories for training and validation and the rest for test. We made state-observation examples $\{(X_i, Y_i)\}_{i=1}^n$ by randomly subsampling the pairs in the trajectory for training. Note that the difficulty of localization may depend on the time interval (i.e., the interval between $t$ and $t-1$ in sec). Therefore, we made three test sets (and training samples for state transitions in the KBR filter) with different time intervals: 2.27 sec ($T = 168$), 4.54 sec ($T = 84$), and 6.81 sec ($T = 56$).

In these experiments, we compared KMCF with three methods: kNN-PF, the KBR filter, and the naive method (NAI) defined below. For the KBR filter, we also defined the gaussian kernel on the control $u_t$, that is, on the difference of odometry measurements at time $t-1$ and $t$. The naive method (NAI) estimates the state $x_t$ as a point $X_j$ in the training set $\{(X_i, Y_i)\}$ such that the corresponding observation $Y_j$ is closest to the observation $y_t$. We performed this as a baseline. We also used the spatial pyramid matching kernel for these methods (for kNN-PF and NAI, as a similarity measure of the nearest neighbors search). We did not compare with GP-PF, since it assumes that observations are real vectors and thus cannot be applied to this problem straightforwardly. We determined the hyperparameters in each method by cross-validation. To reduce the cost of the resampling step in KMCF, we used the method discussed in section 5.2 with $\ell = 100$. The low-rank approximation method (see algorithm 5) and the subsampling method (see algorithm 6) were also applied to reduce the computational costs of KMCF. Specifically, we set $r = 50, 100$ for algorithm 5 (described as KMCF-low50 and KMCF-low100 in the results below) and $r = 150, 300$ for algorithm 6 (KMCF-sub150 and KMCF-sub300).

Note that in this problem, the posteriors $p(x_t|y_{1:t})$ can be highly multimodal. This is because similar images appear in distant locations. Therefore, the posterior mean $\int x_t\,p(x_t|y_{1:t})\,dx_t$ is not appropriate for point estimation of the ground-truth position $x_t$. Thus, for KMCF and the KBR filter, we employed the heuristic for mode estimation explained in section 4.4. For kNN-PF, we used a particle with maximum weight for the point estimation. We evaluated the performance of each method by RMSE of location estimates. We ran each experiment 20 times for each training set of different size.
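To illustrate what footnote 9 amounts to in practice, here is a small sketch of a gaussian state kernel on the encoded four-dimensional states (our illustration only; the bandwidth and the relative scaling of location versus orientation coordinates are assumptions, since in the experiments the kernel parameters are chosen by cross-validation):

    import numpy as np

    def encode_state(xy, theta):
        # Four-dimensional state of footnote 9: 2-D location plus the orientation
        # theta in [0, 2*pi], projected onto the unit circle in R^2.
        return np.array([xy[0], xy[1], np.cos(theta), np.sin(theta)])

    def k_state(s1, s2, sigma):
        # Gaussian kernel k_X on the encoded four-dimensional states.
        d2 = np.sum((encode_state(*s1) - encode_state(*s2))**2)
        return np.exp(-d2 / (2 * sigma**2))

    # Orientations 0.1 rad and 6.2 rad are close on the circle, and the encoding respects this:
    print(k_state(((1.0, 2.0), 0.1), ((1.2, 2.1), 6.2), sigma=0.5))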

6.3.1 Results. First, we demonstrate the behaviors of KMCF with this localization problem. Figures 9 and 10 show iterations of KMCF with $n = 400$ applied to the test data with time interval 6.81 sec. Figure 9 illustrates iterations that produced accurate estimates, while Figure 10 describes situations where location estimation is difficult.

Figure 9: Demonstration results. Each column corresponds to one iteration of KMCF. (Top) Prediction step: histogram of samples for the prior. (Middle) Correction step: weighted samples for the posterior. The blue and red stems indicate positive and negative weights, respectively. The yellow ball represents the ground-truth location $x_t$, and the green diamond the estimated one $\hat{x}_t$. (Bottom) Resampling step: histogram of samples given by the resampling step.

Figure 10: Demonstration results (see also the caption of Figure 9). Here we show time points where observed images are similar to those in distant places. Such a situation often occurs at corners and makes location estimation difficult. (a) The prior estimate is reasonable, but the resulting posterior has modes in distant places. This makes the location estimate (green diamond) far from the true location (yellow ball). (b) While the location estimate is very accurate, modes also appear at distant locations.

Figures 11 and 12 show the results in RMSE and computational time, respectively. For all the results, KMCF and its variants with the computational reduction methods (KMCF-low50, KMCF-low100, KMCF-sub150, and KMCF-sub300) performed better than the KBR filter. These results show the benefit of directly manipulating the transition models with sampling. KMCF was competitive with kNN-PF for the interval 2.27 sec; note that kNN-PF was originally proposed for the robot localization problem. For the results with the longer time intervals (4.54 sec and 6.81 sec), KMCF outperformed kNN-PF.

We next investigate the effect on KMCF of the methods to reduce computational cost. The performance of KMCF-low100 and KMCF-sub300 is competitive with KMCF; those of KMCF-low50 and KMCF-sub150 degrade as the sample size increases. Note that $r = 50, 100$ for algorithm 5 are larger than those in section 6.2, though the values of the sample size $n$ are larger than those in section 6.2. Also note that the performance of KMCF-sub150 is much worse than KMCF-sub300. These results indicate that we may need large values for $r$ to maintain the accuracy for this localization problem. Recall that the spatial pyramid matching kernel gives essentially a high-dimensional feature vector (histogram) for each observation. Thus the observation space $\mathcal{Y}$ may be considered high-dimensional. This supports the hypothesis in section 6.2 that if the dimension is high, the computational cost-reduction methods may require larger $r$ to maintain accuracy.

Finally, let us look at the results in computation time (see Figure 12). The results are similar to those in section 6.2. Although the values for $r$ are relatively large, algorithms 5 and 6 successfully reduced the computational costs of KMCF.

Figure 11: RMSE of the robot localization experiments in section 6.3. Panels a, b, and c show the cases for time interval 2.27 sec, 4.54 sec, and 6.81 sec, respectively.

Figure 12: Computation time of the localization experiments in section 6.3. Panels a, b, and c show the cases for time interval 2.27 sec, 4.54 sec, and 6.81 sec, respectively. Note that the results show the runtime of each method.

7 Conclusion and Future Work

This letter has proposed the kernel Monte Carlo filter, a novel filtering method for state-space models. We have considered the situation where the observation model is not known explicitly or even parametrically and where examples of the state-observation relation are given instead of the observation model. Our approach is based on the framework of kernel mean embeddings, which enables us to deal with the observation model in a data-driven manner. The resulting filtering method consists of the prediction, correction, and resampling steps, all realized in terms of kernel mean embeddings. Methodological novelties lie in the prediction and resampling steps. Thus we analyzed their behaviors by deriving error bounds for the estimator of the prediction step. The analysis revealed that the effective sample size of a weighted sample plays an important role, as in particle methods. This analysis also explained how our resampling algorithm works. We applied the proposed method to synthetic and real problems, confirming the effectiveness of our approach.

One interesting topic for future research would be parameter estimation for the transition model. We did not discuss this and assumed that parameters, if they exist, are given and fixed. If the state-observation examples $\{(X_i, Y_i)\}_{i=1}^n$ are given as a sequence from the state-space model, then we can use the state samples $X_1, \dots, X_n$ for estimating those parameters. Otherwise, we need to estimate the parameters based on test data. This might be possible by exploiting approaches for parameter estimation in particle methods (e.g., section IV in Cappe et al., 2007).

Another important topic is the situation where the observation model in the test and training phases is different. As discussed in section 4.3, this might be addressed by exploiting the framework of transfer learning (Pan & Yang, 2010). This would require extension of kernel mean embeddings to the setting of transfer learning, since there has been no work in this direction. We consider that such extension is interesting in its own right.

Appendix A: Proof of Theorem 1

Before going to the proof, we review some basic facts that are needed. Let $m_P = \int k_X(\cdot, x)\,dP(x)$ and $\hat{m}_P = \sum_{i=1}^n w_i k_X(\cdot, X_i)$. By the reproducing property of the kernel $k_X$, the following hold for any $f \in \mathcal{H}_X$:

$$\langle m_P, f\rangle_{\mathcal{H}_X} = \left\langle \int k_X(\cdot, x)\,dP(x),\ f\right\rangle_{\mathcal{H}_X} = \int \langle k_X(\cdot, x), f\rangle_{\mathcal{H}_X}\,dP(x) = \int f(x)\,dP(x) = E_{X\sim P}[f(X)], \quad (\mathrm{A.1})$$

$$\langle \hat{m}_P, f\rangle_{\mathcal{H}_X} = \left\langle \sum_{i=1}^n w_i k_X(\cdot, X_i),\ f\right\rangle_{\mathcal{H}_X} = \sum_{i=1}^n w_i f(X_i). \quad (\mathrm{A.2})$$

For any $f, g \in \mathcal{H}_X$, we denote by $f \otimes g \in \mathcal{H}_X \otimes \mathcal{H}_X$ the tensor product of $f$ and $g$, defined as

$$f \otimes g\,(x_1, x_2) = f(x_1)g(x_2), \quad \forall x_1, x_2 \in \mathcal{X}. \quad (\mathrm{A.3})$$

The inner product of the tensor RKHS $\mathcal{H}_X \otimes \mathcal{H}_X$ satisfies

$$\langle f_1 \otimes g_1, f_2 \otimes g_2\rangle_{\mathcal{H}_X\otimes\mathcal{H}_X} = \langle f_1, f_2\rangle_{\mathcal{H}_X}\langle g_1, g_2\rangle_{\mathcal{H}_X}, \quad \forall f_1, f_2, g_1, g_2 \in \mathcal{H}_X. \quad (\mathrm{A.4})$$

Let $\{\phi_s\}_{s=1}^I \subset \mathcal{H}_X$ be complete orthonormal bases of $\mathcal{H}_X$, where $I \in \mathbb{N}\cup\{\infty\}$. Assume $\theta \in \mathcal{H}_X \otimes \mathcal{H}_X$ (recall that this is an assumption of theorem 1). Then $\theta$ is expressed as

$$\theta = \sum_{s,t=1}^I \alpha_{st}\,\phi_s \otimes \phi_t \quad (\mathrm{A.5})$$

with $\sum_{s,t}|\alpha_{st}|^2 < \infty$ (see Aronszajn, 1950).

Proof of Theorem 1. Recall that $\hat{m}_Q = \sum_{i=1}^n w_i k_X(\cdot, X'_i)$, where $X'_i \sim p(\cdot|X_i)$ $(i = 1, \dots, n)$. Then

$$E_{X'_1,\dots,X'_n}[\|\hat{m}_Q - m_Q\|^2_{\mathcal{H}_X}] = E_{X'_1,\dots,X'_n}\big[\langle \hat{m}_Q, \hat{m}_Q\rangle_{\mathcal{H}_X} - 2\langle \hat{m}_Q, m_Q\rangle_{\mathcal{H}_X} + \langle m_Q, m_Q\rangle_{\mathcal{H}_X}\big]$$
$$= \sum_{i,j=1}^n w_i w_j E_{X'_i, X'_j}[k_X(X'_i, X'_j)] - 2\sum_{i=1}^n w_i E_{X'\sim Q,\,X'_i}[k_X(X', X'_i)] + E_{X',\tilde{X}'\sim Q}[k_X(X', \tilde{X}')]$$
$$= \sum_{i\neq j} w_i w_j E_{X'_i, X'_j}[k_X(X'_i, X'_j)] + \sum_{i=1}^n w_i^2 E_{X'_i}[k_X(X'_i, X'_i)] - 2\sum_{i=1}^n w_i E_{X'\sim Q,\,X'_i}[k_X(X', X'_i)] + E_{X',\tilde{X}'\sim Q}[k_X(X', \tilde{X}')], \quad (\mathrm{A.6})$$

where $\tilde{X}'$ denotes an independent copy of $X'$.

Recall that $Q = \int p(\cdot|x)\,dP(x)$ and $\theta(x, \tilde{x}) = \int\!\!\int k_X(x', \tilde{x}')\,dp(x'|x)\,dp(\tilde{x}'|\tilde{x})$. We can then rewrite terms in equation A.6 as

$$E_{X'\sim Q,\,X'_i}[k_X(X', X'_i)] = \int\left(\int\!\!\int k_X(x', x'_i)\,dp(x'|x)\,dp(x'_i|X_i)\right)dP(x) = \int \theta(x, X_i)\,dP(x) = E_{X\sim P}[\theta(X, X_i)],$$

$$E_{X',\tilde{X}'\sim Q}[k_X(X', \tilde{X}')] = \int\!\!\int\left(\int\!\!\int k_X(x', \tilde{x}')\,dp(x'|x)\,dp(\tilde{x}'|\tilde{x})\right)dP(x)\,dP(\tilde{x}) = \int\!\!\int \theta(x, \tilde{x})\,dP(x)\,dP(\tilde{x}) = E_{X,\tilde{X}\sim P}[\theta(X, \tilde{X})].$$

Thus, equation A.6 is equal to

$$\sum_{i=1}^n w_i^2\left(E_{X'_i}[k_X(X'_i, X'_i)] - E_{X'_i,\tilde{X}'_i}[k_X(X'_i, \tilde{X}'_i)]\right) + \sum_{i,j=1}^n w_i w_j\,\theta(X_i, X_j) - 2\sum_{i=1}^n w_i E_{X\sim P}[\theta(X, X_i)] + E_{X,\tilde{X}\sim P}[\theta(X, \tilde{X})]. \quad (\mathrm{A.7})$$

We can rewrite terms in equation A.7 as follows, using the facts A.1, A.2, A.3, A.4, and A.5:

$$\sum_{i,j} w_i w_j\,\theta(X_i, X_j) = \sum_{i,j} w_i w_j \sum_{s,t}\alpha_{st}\phi_s(X_i)\phi_t(X_j) = \sum_{s,t}\alpha_{st}\sum_i w_i\phi_s(X_i)\sum_j w_j\phi_t(X_j) = \sum_{s,t}\alpha_{st}\langle \hat{m}_P, \phi_s\rangle_{\mathcal{H}_X}\langle \hat{m}_P, \phi_t\rangle_{\mathcal{H}_X}$$
$$= \sum_{s,t}\alpha_{st}\langle \hat{m}_P\otimes\hat{m}_P, \phi_s\otimes\phi_t\rangle_{\mathcal{H}_X\otimes\mathcal{H}_X} = \langle \hat{m}_P\otimes\hat{m}_P, \theta\rangle_{\mathcal{H}_X\otimes\mathcal{H}_X},$$

$$\sum_i w_i E_{X\sim P}[\theta(X, X_i)] = \sum_i w_i E_{X\sim P}\left[\sum_{s,t}\alpha_{st}\phi_s(X)\phi_t(X_i)\right] = \sum_{s,t}\alpha_{st}E_{X\sim P}[\phi_s(X)]\sum_i w_i\phi_t(X_i) = \sum_{s,t}\alpha_{st}\langle m_P, \phi_s\rangle_{\mathcal{H}_X}\langle \hat{m}_P, \phi_t\rangle_{\mathcal{H}_X}$$
$$= \sum_{s,t}\alpha_{st}\langle m_P\otimes\hat{m}_P, \phi_s\otimes\phi_t\rangle_{\mathcal{H}_X\otimes\mathcal{H}_X} = \langle m_P\otimes\hat{m}_P, \theta\rangle_{\mathcal{H}_X\otimes\mathcal{H}_X},$$

$$E_{X,\tilde{X}\sim P}[\theta(X, \tilde{X})] = E_{X,\tilde{X}\sim P}\left[\sum_{s,t}\alpha_{st}\phi_s(X)\phi_t(\tilde{X})\right] = \sum_{s,t}\alpha_{st}\langle m_P, \phi_s\rangle_{\mathcal{H}_X}\langle m_P, \phi_t\rangle_{\mathcal{H}_X} = \sum_{s,t}\alpha_{st}\langle m_P\otimes m_P, \phi_s\otimes\phi_t\rangle_{\mathcal{H}_X\otimes\mathcal{H}_X} = \langle m_P\otimes m_P, \theta\rangle_{\mathcal{H}_X\otimes\mathcal{H}_X}.$$

Thus, equation A.7 is equal to

$$\sum_{i=1}^n w_i^2\left(E_{X'_i}[k_X(X'_i, X'_i)] - E_{X'_i,\tilde{X}'_i}[k_X(X'_i, \tilde{X}'_i)]\right) + \langle \hat{m}_P\otimes\hat{m}_P, \theta\rangle_{\mathcal{H}_X\otimes\mathcal{H}_X} - 2\langle m_P\otimes\hat{m}_P, \theta\rangle_{\mathcal{H}_X\otimes\mathcal{H}_X} + \langle m_P\otimes m_P, \theta\rangle_{\mathcal{H}_X\otimes\mathcal{H}_X}$$
$$= \sum_{i=1}^n w_i^2\left(E_{X'_i}[k_X(X'_i, X'_i)] - E_{X'_i,\tilde{X}'_i}[k_X(X'_i, \tilde{X}'_i)]\right) + \langle(\hat{m}_P - m_P)\otimes(\hat{m}_P - m_P), \theta\rangle_{\mathcal{H}_X\otimes\mathcal{H}_X}.$$

Finally, the Cauchy-Schwartz inequality gives

$$\langle(\hat{m}_P - m_P)\otimes(\hat{m}_P - m_P), \theta\rangle_{\mathcal{H}_X\otimes\mathcal{H}_X} \le \|\hat{m}_P - m_P\|^2_{\mathcal{H}_X}\,\|\theta\|_{\mathcal{H}_X\otimes\mathcal{H}_X}.$$

This completes the proof.

Appendix B: Proof of Theorem 2

Theorem 2 provides convergence rates for the resampling algorithm, algorithm 4. This theorem assumes that the candidate samples $Z_1, \dots, Z_N$ for resampling are i.i.d. with a density $q$. Here we prove theorem 2 by showing that the same statement holds under weaker assumptions (see theorem 3 below).

We first describe assumptions. Let $P$ be the distribution of the kernel mean $m_P$, and $L_2(P)$ be the Hilbert space of square-integrable functions on $\mathcal{X}$ with respect to $P$. For any $f \in L_2(P)$, we write its norm by $\|f\|^2_{L_2(P)} = \int f^2(x)\,dP(x)$.

Assumption 1. The candidate samples $Z_1, \dots, Z_N$ are independent. There are probability distributions $Q_1, \dots, Q_N$ on $\mathcal{X}$ such that, for any bounded measurable function $g: \mathcal{X}\to\mathbb{R}$, we have

$$E\left[\frac{1}{N-1}\sum_{j\neq i} g(Z_j)\right] = E_{X\sim Q_i}[g(X)] \quad (i = 1, \dots, N). \quad (\mathrm{B.1})$$

Assumption 2. The distributions $Q_1, \dots, Q_N$ have density functions $q_1, \dots, q_N$, respectively. Define $\bar{Q} = \frac{1}{N}\sum_{i=1}^N Q_i$ and $\bar{q} = \frac{1}{N}\sum_{i=1}^N q_i$. There is a constant $A > 0$ that does not depend on $N$, such that

$$\left\|\frac{q_i}{\bar{q}} - 1\right\|^2_{L_2(P)} \le \frac{A}{\sqrt{N}} \quad (i = 1, \dots, N). \quad (\mathrm{B.2})$$

Assumption 3. The distribution $P$ has a density function $p$ such that $\sup_{x\in\mathcal{X}} \frac{p(x)}{\bar{q}(x)} < \infty$. There is a constant $\sigma > 0$ such that

$$\sqrt{N}\left(\frac{1}{N}\sum_{i=1}^N \frac{p(Z_i)}{\bar{q}(Z_i)} - 1\right) \xrightarrow{D} \mathcal{N}(0, \sigma^2), \quad (\mathrm{B.3})$$

where $\xrightarrow{D}$ denotes convergence in distribution and $\mathcal{N}(0, \sigma^2)$ the normal distribution with mean 0 and variance $\sigma^2$.

These assumptions are weaker than those in theorem 2, which require that $Z_1, \dots, Z_N$ be i.i.d. For example, assumption 1 is clearly satisfied for the i.i.d. case, since in this case we have $\bar{Q} = Q_1 = \cdots = Q_N$. The inequality, equation B.2, in assumption 2 requires that the distributions $Q_1, \dots, Q_N$ get similar as the sample size increases. This is also satisfied under the i.i.d. assumption. Likewise, the convergence, equation B.3, in assumption 3 is satisfied from the central limit theorem if $Z_1, \dots, Z_N$ are i.i.d.

We will need the following lemma.

Lemma 1. Let $Z_1, \dots, Z_N$ be samples satisfying assumption 1. Then the following holds for any bounded measurable function $g: \mathcal{X}\to\mathbb{R}$:

$$E\left[\frac{1}{N}\sum_{i=1}^N g(Z_i)\right] = \int g(x)\,d\bar{Q}(x).$$

Proof.

$$E\left[\frac{1}{N}\sum_{i=1}^N g(Z_i)\right] = E\left[\frac{1}{N(N-1)}\sum_{i=1}^N\sum_{j\neq i} g(Z_j)\right] = \frac{1}{N}\sum_{i=1}^N E\left[\frac{1}{N-1}\sum_{j\neq i} g(Z_j)\right] = \frac{1}{N}\sum_{i=1}^N \int g(x)\,dQ_i(x) = \int g(x)\,d\bar{Q}(x).$$

The following theorem shows the convergence rates of our resampling algorithm. Note that it does not assume that the candidate samples $Z_1, \dots, Z_N$ are identical to those expressing the estimator $\hat{m}_P$.

Theorem 3. Let $k$ be a bounded positive definite kernel, and $\mathcal{H}$ be the associated RKHS. Let $Z_1, \dots, Z_N$ be candidate samples satisfying assumptions 1, 2, and 3. Let $P$ be a probability distribution satisfying assumption 3, and let $m_P = \int k(\cdot, x)\,dP(x)$ be the kernel mean. Let $\hat{m}_P \in \mathcal{H}$ be any element in $\mathcal{H}$. Suppose we apply algorithm 4 to $\hat{m}_P \in \mathcal{H}$ with candidate samples $Z_1, \dots, Z_N$, and let $\bar{X}_1, \dots, \bar{X}_\ell \in \{Z_1, \dots, Z_N\}$ be the resulting samples. Then the following holds:

$$\left\|\hat{m}_P - \frac{1}{\ell}\sum_{i=1}^{\ell} k(\cdot, \bar{X}_i)\right\|^2_{\mathcal{H}} = \left(\|m_P - \hat{m}_P\|_{\mathcal{H}} + O_p(N^{-1/2})\right)^2 + O\left(\frac{\ln\ell}{\ell}\right).$$

Proof. Our proof is based on the fact (Bach et al., 2012) that kernel herding can be seen as the Frank-Wolfe optimization method with step size $1/(\ell + 1)$ for the $\ell$-th iteration. For details of the Frank-Wolfe method, we refer to Jaggi (2013) and Freund and Grigas (2014).

Fix the samples $Z_1, \dots, Z_N$. Let $M_N$ be the convex hull of the set $\{k(\cdot, Z_1), \dots, k(\cdot, Z_N)\} \subset \mathcal{H}$. Define a loss function $J: \mathcal{H}\to\mathbb{R}$ by

$$J(g) = \frac{1}{2}\|g - \hat{m}_P\|^2_{\mathcal{H}}, \quad g\in\mathcal{H}. \quad (\mathrm{B.4})$$

Then algorithm 4 can be seen as the Frank-Wolfe method that iteratively minimizes this loss function over the convex hull $M_N$:

$$\inf_{g\in M_N} J(g).$$

More precisely, the Frank-Wolfe method solves this problem by the following iterations:

$$s_\ell = \arg\min_{g\in M_N}\langle g, \nabla J(g_{\ell-1})\rangle_{\mathcal{H}}, \qquad g_\ell = (1 - \gamma_\ell)g_{\ell-1} + \gamma_\ell s_\ell \quad (\ell \ge 1),$$

where $\gamma_\ell$ is a step size defined as $\gamma_\ell = 1/\ell$, and $\nabla J(g_{\ell-1})$ is the gradient of $J$ at $g_{\ell-1}$: $\nabla J(g_{\ell-1}) = g_{\ell-1} - \hat{m}_P$. Here the initial point is defined as $g_0 = 0$. It can be easily shown that $g_\ell = \frac{1}{\ell}\sum_{i=1}^{\ell} k(\cdot, \bar{X}_i)$, where $\bar{X}_1, \dots, \bar{X}_\ell$ are the samples given by algorithm 4. (For details, see Bach et al., 2012.)

Let $L_{J,M_N} > 0$ be the Lipschitz constant of the gradient $\nabla J$ over $M_N$, and $\mathrm{Diam}\,M_N > 0$ be the diameter of $M_N$:

$$L_{J,M_N} = \sup_{g_1,g_2\in M_N}\frac{\|\nabla J(g_1) - \nabla J(g_2)\|_{\mathcal{H}}}{\|g_1 - g_2\|_{\mathcal{H}}} = \sup_{g_1,g_2\in M_N}\frac{\|g_1 - g_2\|_{\mathcal{H}}}{\|g_1 - g_2\|_{\mathcal{H}}} = 1, \quad (\mathrm{B.5})$$

$$\mathrm{Diam}\,M_N = \sup_{g_1,g_2\in M_N}\|g_1 - g_2\|_{\mathcal{H}} \le \sup_{g_1,g_2\in M_N}\left(\|g_1\|_{\mathcal{H}} + \|g_2\|_{\mathcal{H}}\right) \le 2C, \quad (\mathrm{B.6})$$

where $C = \sup_{x\in\mathcal{X}}\|k(\cdot, x)\|_{\mathcal{H}} = \sup_{x\in\mathcal{X}}\sqrt{k(x, x)} < \infty$.

From bound 3.2 and equation 8 of Freund and Grigas (2014), we then have

$$J(g_\ell) - \inf_{g\in M_N} J(g) \le \frac{L_{J,M_N}(\mathrm{Diam}\,M_N)^2(1 + \ln\ell)}{2\ell} \quad (\mathrm{B.7})$$
$$\le \frac{2C^2(1 + \ln\ell)}{\ell}, \quad (\mathrm{B.8})$$

where the last inequality follows from equations B.5 and B.6.

Note that the upper bound of equation B.8 does not depend on the candidate samples $Z_1, \dots, Z_N$. Hence, combined with equation B.4, the following holds for any choice of $Z_1, \dots, Z_N$:

$$\left\|\hat{m}_P - \frac{1}{\ell}\sum_{i=1}^{\ell} k(\cdot, \bar{X}_i)\right\|^2_{\mathcal{H}} \le \inf_{g\in M_N}\|\hat{m}_P - g\|^2_{\mathcal{H}} + \frac{4C^2(1 + \ln\ell)}{\ell}. \quad (\mathrm{B.9})$$

Below we will focus on bounding the first term of equation B.9. Recall here that $Z_1, \dots, Z_N$ are random samples. Define a random variable $S_N = \sum_{i=1}^N \frac{p(Z_i)}{\bar{q}(Z_i)}$. Since $M_N$ is the convex hull of $\{k(\cdot, Z_1), \dots, k(\cdot, Z_N)\}$, we have

$$\inf_{g\in M_N}\|\hat{m}_P - g\|_{\mathcal{H}} = \inf_{\alpha\in\mathbb{R}^N:\ \alpha\ge 0,\ \sum_i\alpha_i\le 1}\left\|\hat{m}_P - \sum_i\alpha_i k(\cdot, Z_i)\right\|_{\mathcal{H}} \le \left\|\hat{m}_P - \frac{1}{S_N}\sum_i\frac{p(Z_i)}{\bar{q}(Z_i)}k(\cdot, Z_i)\right\|_{\mathcal{H}}$$
$$\le \|\hat{m}_P - m_P\|_{\mathcal{H}} + \left\|m_P - \frac{1}{N}\sum_i\frac{p(Z_i)}{\bar{q}(Z_i)}k(\cdot, Z_i)\right\|_{\mathcal{H}} + \left\|\frac{1}{N}\sum_i\frac{p(Z_i)}{\bar{q}(Z_i)}k(\cdot, Z_i) - \frac{1}{S_N}\sum_i\frac{p(Z_i)}{\bar{q}(Z_i)}k(\cdot, Z_i)\right\|_{\mathcal{H}}.$$

Therefore we have

$$\left\|\hat{m}_P - \frac{1}{\ell}\sum_{i=1}^{\ell} k(\cdot, \bar{X}_i)\right\|^2_{\mathcal{H}} \le \left(\|\hat{m}_P - m_P\|_{\mathcal{H}} + \left\|m_P - \frac{1}{N}\sum_i\frac{p(Z_i)}{\bar{q}(Z_i)}k(\cdot, Z_i)\right\|_{\mathcal{H}} + \left\|\frac{1}{N}\sum_i\frac{p(Z_i)}{\bar{q}(Z_i)}k(\cdot, Z_i) - \frac{1}{S_N}\sum_i\frac{p(Z_i)}{\bar{q}(Z_i)}k(\cdot, Z_i)\right\|_{\mathcal{H}}\right)^2 + O\left(\frac{\ln\ell}{\ell}\right). \quad (\mathrm{B.10})$$

Below we derive rates of convergence for the second and third terms.

For the second term, we derive a rate of convergence in expectation, which implies a rate of convergence in probability. To this end, we use the following fact. Let $f \in \mathcal{H}$ be any function in the RKHS. By the assumption $\sup_{x\in\mathcal{X}}\frac{p(x)}{\bar{q}(x)} < \infty$ and the boundedness of $k$, the functions $x \mapsto \frac{p(x)}{\bar{q}(x)}f(x)$ and $x \mapsto \left(\frac{p(x)}{\bar{q}(x)}\right)^2 f(x)$ are bounded. Then

$$E\left[\left\|m_P - \frac{1}{N}\sum_i\frac{p(Z_i)}{\bar{q}(Z_i)}k(\cdot, Z_i)\right\|^2_{\mathcal{H}}\right]$$
$$= \|m_P\|^2_{\mathcal{H}} - 2E\left[\frac{1}{N}\sum_i\frac{p(Z_i)}{\bar{q}(Z_i)}m_P(Z_i)\right] + E\left[\frac{1}{N^2}\sum_i\sum_j\frac{p(Z_i)}{\bar{q}(Z_i)}\frac{p(Z_j)}{\bar{q}(Z_j)}k(Z_i, Z_j)\right]$$
$$= \|m_P\|^2_{\mathcal{H}} - 2\int\frac{p(x)}{\bar{q}(x)}m_P(x)\bar{q}(x)\,dx + E\left[\frac{1}{N^2}\sum_i\sum_{j\neq i}\frac{p(Z_i)}{\bar{q}(Z_i)}\frac{p(Z_j)}{\bar{q}(Z_j)}k(Z_i, Z_j)\right] + E\left[\frac{1}{N^2}\sum_i\left(\frac{p(Z_i)}{\bar{q}(Z_i)}\right)^2 k(Z_i, Z_i)\right]$$
$$= \|m_P\|^2_{\mathcal{H}} - 2\|m_P\|^2_{\mathcal{H}} + E\left[\frac{N-1}{N^2}\sum_i\frac{p(Z_i)}{\bar{q}(Z_i)}\int\frac{p(x)}{\bar{q}(x)}k(Z_i, x)q_i(x)\,dx\right] + \frac{1}{N}\int\left(\frac{p(x)}{\bar{q}(x)}\right)^2 k(x, x)\bar{q}(x)\,dx$$
$$= -\|m_P\|^2_{\mathcal{H}} + E\left[\frac{N-1}{N^2}\sum_i\frac{p(Z_i)}{\bar{q}(Z_i)}\int\frac{p(x)}{\bar{q}(x)}k(Z_i, x)q_i(x)\,dx\right] + \frac{1}{N}\int\frac{p(x)}{\bar{q}(x)}k(x, x)\,dP(x).$$

We further rewrite the second term of the last equality as follows:

$$E\left[\frac{N-1}{N^2}\sum_i\frac{p(Z_i)}{\bar{q}(Z_i)}\int\frac{p(x)}{\bar{q}(x)}k(Z_i, x)q_i(x)\,dx\right]$$
$$= E\left[\frac{N-1}{N^2}\sum_i\frac{p(Z_i)}{\bar{q}(Z_i)}\int\frac{p(x)}{\bar{q}(x)}k(Z_i, x)\big(q_i(x) - \bar{q}(x)\big)\,dx\right] + E\left[\frac{N-1}{N^2}\sum_i\frac{p(Z_i)}{\bar{q}(Z_i)}\int\frac{p(x)}{\bar{q}(x)}k(Z_i, x)\bar{q}(x)\,dx\right]$$
$$= E\left[\frac{N-1}{N^2}\sum_i\frac{p(Z_i)}{\bar{q}(Z_i)}\int\sqrt{p(x)}\,k(Z_i, x)\,\sqrt{p(x)}\left(\frac{q_i(x)}{\bar{q}(x)} - 1\right)dx\right] + \frac{N-1}{N}\|m_P\|^2_{\mathcal{H}}$$
$$\le E\left[\frac{N-1}{N^2}\sum_i\frac{p(Z_i)}{\bar{q}(Z_i)}\,\|k(Z_i, \cdot)\|_{L_2(P)}\left\|\frac{q_i}{\bar{q}} - 1\right\|_{L_2(P)}\right] + \frac{N-1}{N}\|m_P\|^2_{\mathcal{H}}$$
$$\le E\left[\frac{N-1}{N^3}\sum_i\frac{p(Z_i)}{\bar{q}(Z_i)}\,C^2 A\right] + \frac{N-1}{N}\|m_P\|^2_{\mathcal{H}} = \frac{C^2 A(N-1)}{N^2} + \frac{N-1}{N}\|m_P\|^2_{\mathcal{H}},$$

where the first inequality follows from Cauchy-Schwartz. Using this, we obtain

$$E\left[\left\|m_P - \frac{1}{N}\sum_i\frac{p(Z_i)}{\bar{q}(Z_i)}k(\cdot, Z_i)\right\|^2_{\mathcal{H}}\right] \le \frac{1}{N}\left(\int\frac{p(x)}{\bar{q}(x)}k(x, x)\,dP(x) - \|m_P\|^2_{\mathcal{H}}\right) + \frac{C^2(N-1)A}{N^2} = O(N^{-1}).$$

Therefore we have

$$\left\|m_P - \frac{1}{N}\sum_i\frac{p(Z_i)}{\bar{q}(Z_i)}k(\cdot, Z_i)\right\|_{\mathcal{H}} = O_p(N^{-1/2}) \quad (N\to\infty). \quad (\mathrm{B.11})$$

We can bound the third term as follows:

$$\left\|\frac{1}{N}\sum_i\frac{p(Z_i)}{\bar{q}(Z_i)}k(\cdot, Z_i) - \frac{1}{S_N}\sum_i\frac{p(Z_i)}{\bar{q}(Z_i)}k(\cdot, Z_i)\right\|_{\mathcal{H}} = \left\|\frac{1}{N}\sum_i\frac{p(Z_i)}{\bar{q}(Z_i)}k(\cdot, Z_i)\left(1 - \frac{N}{S_N}\right)\right\|_{\mathcal{H}}$$
$$= \left|1 - \frac{N}{S_N}\right|\,\left\|\frac{1}{N}\sum_i\frac{p(Z_i)}{\bar{q}(Z_i)}k(\cdot, Z_i)\right\|_{\mathcal{H}} \le \left|1 - \frac{N}{S_N}\right|\,C\,\|p/\bar{q}\|_\infty = \left|1 - \frac{1}{\frac{1}{N}\sum_{i=1}^N p(Z_i)/\bar{q}(Z_i)}\right|\,C\,\|p/\bar{q}\|_\infty,$$

where $\|p/\bar{q}\|_\infty = \sup_{x\in\mathcal{X}}\frac{p(x)}{\bar{q}(x)} < \infty$. Therefore, the following holds by assumption 3 and the delta method:

$$\left\|\frac{1}{N}\sum_i\frac{p(Z_i)}{\bar{q}(Z_i)}k(\cdot, Z_i) - \frac{1}{S_N}\sum_i\frac{p(Z_i)}{\bar{q}(Z_i)}k(\cdot, Z_i)\right\|_{\mathcal{H}} = O_p(N^{-1/2}). \quad (\mathrm{B.12})$$

The assertion of the theorem follows from equations B.10 to B.12.

Appendix C: Reduction of Computational Cost

We have seen in section 4.3 that the time complexity of KMCF in one time step is $O(n^3)$, where $n$ is the number of the state-observation examples $\{(X_i, Y_i)\}_{i=1}^n$. This can be costly if one wishes to use KMCF in real-time applications with a large number of samples. Here we show two methods for reducing the costs: one based on low-rank approximation of kernel matrices and one based on kernel herding. Note that kernel herding is also used in the resampling step. The purpose here is different, however: we make use of kernel herding for finding a reduced representation of the data $\{(X_i, Y_i)\}_{i=1}^n$.

C.1 Low-Rank Approximation of Kernel Matrices. Our goal is to reduce the costs of algorithm 1 of kernel Bayes' rule. Algorithm 1 involves two matrix inversions: $(G_X + n\varepsilon I_n)^{-1}$ in line 3 and $((G_Y\Lambda)^2 + \delta I_n)^{-1}$ in line 4. Note that $(G_X + n\varepsilon I_n)^{-1}$ does not involve the test data, so it can be computed before the test phase. On the other hand, $((G_Y\Lambda)^2 + \delta I_n)^{-1}$ depends on the diagonal matrix $\Lambda$ defined in algorithm 1. This matrix involves the vector $m_\pi$, which essentially represents the prior of the current state (see line 13 of algorithm 3). Therefore, $((G_Y\Lambda)^2 + \delta I_n)^{-1}$ needs to be computed for each iteration in the test phase. This has the complexity of $O(n^3)$. Note that even if $(G_X + n\varepsilon I_n)^{-1}$ can be computed in the training phase, the multiplication $(G_X + n\varepsilon I_n)^{-1}m_\pi$ in line 3 requires $O(n^2)$. Thus it can also be costly. Here we consider methods to reduce both costs in lines 3 and 4.

Suppose that there exist low-rank matrices $U, V \in \mathbb{R}^{n\times r}$, where $r < n$, that approximate the kernel matrices: $G_X \approx UU^T$, $G_Y \approx VV^T$. Such low-rank matrices can be obtained by, for example, incomplete Cholesky decomposition with time complexity $O(nr^2)$ (Fine & Scheinberg, 2001; Bach & Jordan, 2002). Note that the computation of these matrices is required only once, before the test phase. Therefore, their time complexities are not the problem here.

C.1.1 Derivation. First, we approximate $(G_X + n\varepsilon I_n)^{-1}m_\pi$ in line 3, using $G_X \approx UU^T$. By the Woodbury identity, we have

$$(G_X + n\varepsilon I_n)^{-1}m_\pi \approx (UU^T + n\varepsilon I_n)^{-1}m_\pi = \frac{1}{n\varepsilon}\left(I_n - U(n\varepsilon I_r + U^T U)^{-1}U^T\right)m_\pi,$$

where $I_r \in \mathbb{R}^{r\times r}$ denotes the identity. Note that $(n\varepsilon I_r + U^T U)^{-1}$ does not involve the test data, so it can be computed in the training phase. Thus the above approximation of $\mu$ can be computed with complexity $O(nr^2)$.

Next, we approximate $w = \Lambda G_Y((G_Y\Lambda)^2 + \delta I)^{-1}\Lambda k_Y$ in line 4, using $G_Y \approx VV^T$. Define $B = V \in \mathbb{R}^{n\times r}$, $C = V^T\Lambda V \in \mathbb{R}^{r\times r}$, and $D = V^T\Lambda \in \mathbb{R}^{r\times n}$. Then $(G_Y\Lambda)^2 \approx (VV^T\Lambda)^2 = BCD$. By the Woodbury identity, we obtain

$$(\delta I_n + (G_Y\Lambda)^2)^{-1} \approx (\delta I_n + BCD)^{-1} = \frac{1}{\delta}\left(I_n - B(\delta C^{-1} + DB)^{-1}D\right).$$

Thus, $w$ can be approximated as

$$w = \Lambda G_Y((G_Y\Lambda)^2 + \delta I)^{-1}\Lambda k_Y \approx \frac{1}{\delta}\,\Lambda VV^T\left(I_n - B(\delta C^{-1} + DB)^{-1}D\right)\Lambda k_Y.$$

The computation of this approximation requires $O(nr^2 + r^3) = O(nr^2)$. Thus, in total, the complexity of algorithm 1 can be reduced to $O(nr^2)$. We summarize the above approximations in algorithm 5.

C.1.2 How to Use. Algorithm 5 can be used with algorithm 3 of KMCF by modifying algorithm 3 in the following manner. Compute the low-rank matrices $U, V$ right after lines 4 and 5. This can be done by using, for example, incomplete Cholesky decomposition (Fine & Scheinberg, 2001; Bach & Jordan, 2002). Then replace algorithm 1 in line 15 by algorithm 5.

C.1.3 How to Select the Rank. As discussed in section 4.3, one way of selecting the rank $r$ is to use cross-validation, by regarding $r$ as a hyperparameter of KMCF. Another way is to measure the approximation errors $\|G_X - UU^T\|$ and $\|G_Y - VV^T\|$ with some matrix norm, such as the Frobenius norm. Indeed, we can compute the smallest rank $r$ such that these errors are below a prespecified threshold, and this can be done efficiently with time complexity $O(nr^2)$ (Bach & Jordan, 2002).
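The core trick above is the Woodbury identity applied to a low-rank factor. The following sketch (our illustration with NumPy, not the letter's algorithm 5 itself; in practice the factor $U$ would come from incomplete Cholesky decomposition) shows the line-3 approximation:

    import numpy as np

    def woodbury_solve(U, reg, b):
        # Approximate (U U^T + reg * I_n)^{-1} b via the Woodbury identity:
        # (U U^T + reg I)^{-1} = (1/reg) * (I - U (reg I_r + U^T U)^{-1} U^T).
        n, r = U.shape
        inner = reg * np.eye(r) + U.T @ U      # r x r; can be prefactorized offline
        return (b - U @ np.linalg.solve(inner, U.T @ b)) / reg

    rng = np.random.default_rng(0)
    n, r = 500, 20
    U = rng.normal(size=(n, r))                # stands in for a low-rank factor of G_X
    m_pi = rng.normal(size=n)
    mu = woodbury_solve(U, reg=n * 1e-3, b=m_pi)   # approximates (G_X + n*eps*I)^{-1} m_pi

The cost is $O(nr^2)$ per call, rather than the $O(n^2)$ (or $O(n^3)$ for a fresh inversion) of the exact computation.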

C.2 Data Reduction with Kernel Herding. Here we describe an approach to reduce the size of the representation of the state-observation examples $\{(X_i, Y_i)\}_{i=1}^n$ in an efficient way. By "efficient," we mean that the information contained in $\{(X_i, Y_i)\}_{i=1}^n$ will be preserved even after the reduction. Recall that $\{(X_i, Y_i)\}_{i=1}^n$ contains the information of the observation model $p(y_t|x_t)$ (recall also that $p(y_t|x_t)$ is assumed time-homogeneous; see section 4.1). This information is used only in algorithm 1 of kernel Bayes' rule (line 15, algorithm 3). Therefore, it suffices to consider how kernel Bayes' rule accesses the information contained in the joint sample $\{(X_i, Y_i)\}_{i=1}^n$.

C.2.1 Representation of the Joint Sample. To this end, we need to show how the joint sample $\{(X_i, Y_i)\}_{i=1}^n$ can be represented with a kernel mean embedding. Recall that $(k_X, \mathcal{H}_X)$ and $(k_Y, \mathcal{H}_Y)$ are kernels and the associated RKHSs on the state space $\mathcal{X}$ and the observation space $\mathcal{Y}$, respectively. Let $\mathcal{X}\times\mathcal{Y}$ be the product space of $\mathcal{X}$ and $\mathcal{Y}$. Then we can define a kernel $k_{\mathcal{X}\times\mathcal{Y}}$ on $\mathcal{X}\times\mathcal{Y}$ as the product of $k_X$ and $k_Y$: $k_{\mathcal{X}\times\mathcal{Y}}((x, y), (x', y')) = k_X(x, x')k_Y(y, y')$ for all $(x, y), (x', y') \in \mathcal{X}\times\mathcal{Y}$. This product kernel $k_{\mathcal{X}\times\mathcal{Y}}$ defines an RKHS of $\mathcal{X}\times\mathcal{Y}$. Let $\mathcal{H}_{\mathcal{X}\times\mathcal{Y}}$ denote this RKHS. As in section 3, we can use $k_{\mathcal{X}\times\mathcal{Y}}$ and $\mathcal{H}_{\mathcal{X}\times\mathcal{Y}}$ for a kernel mean embedding. In particular, the empirical distribution $\frac{1}{n}\sum_{i=1}^n\delta_{(X_i, Y_i)}$ of the joint sample $\{(X_i, Y_i)\}_{i=1}^n \subset \mathcal{X}\times\mathcal{Y}$ can be represented as an empirical kernel mean in $\mathcal{H}_{\mathcal{X}\times\mathcal{Y}}$:

$$\hat{m}_{XY} = \frac{1}{n}\sum_{i=1}^n k_{\mathcal{X}\times\mathcal{Y}}((\cdot,\cdot), (X_i, Y_i)) \in \mathcal{H}_{\mathcal{X}\times\mathcal{Y}}. \quad (\mathrm{C.1})$$

This is the representation of the joint sample $\{(X_i, Y_i)\}_{i=1}^n$.

The information of $\{(X_i, Y_i)\}_{i=1}^n$ is provided for kernel Bayes' rule essentially through the form of equation C.1 (Fukumizu et al., 2011, 2013). Recall that equation C.1 is a point in the RKHS $\mathcal{H}_{\mathcal{X}\times\mathcal{Y}}$. Any point close to this equation in $\mathcal{H}_{\mathcal{X}\times\mathcal{Y}}$ would also contain information close to that contained in equation C.1. Therefore, we propose to find a subset $\{(\tilde{X}_1, \tilde{Y}_1), \dots, (\tilde{X}_r, \tilde{Y}_r)\} \subset \{(X_i, Y_i)\}_{i=1}^n$, where $r < n$, such that its representation in $\mathcal{H}_{\mathcal{X}\times\mathcal{Y}}$,

$$\tilde{m}_{XY} = \frac{1}{r}\sum_{i=1}^r k_{\mathcal{X}\times\mathcal{Y}}((\cdot,\cdot), (\tilde{X}_i, \tilde{Y}_i)) \in \mathcal{H}_{\mathcal{X}\times\mathcal{Y}}, \quad (\mathrm{C.2})$$

is close to equation C.1. Namely, we wish to find subsamples such that $\|\hat{m}_{XY} - \tilde{m}_{XY}\|_{\mathcal{H}_{\mathcal{X}\times\mathcal{Y}}}$ is small. If the error $\|\hat{m}_{XY} - \tilde{m}_{XY}\|_{\mathcal{H}_{\mathcal{X}\times\mathcal{Y}}}$ is small enough, equation C.2 would provide information close to that given by equation C.1 for kernel Bayes' rule. Thus, kernel Bayes' rule based on such subsamples $\{(\tilde{X}_i, \tilde{Y}_i)\}_{i=1}^r$ would not perform much worse than the one based on the entire set of samples $\{(X_i, Y_i)\}_{i=1}^n$.

C.2.2 Subsampling Method. To find such subsamples, we make use of kernel herding in section 3.5. Namely, we apply the update equations 3.6 and 3.7 to approximate equation C.1, with kernel $k_{\mathcal{X}\times\mathcal{Y}}$ and RKHS $\mathcal{H}_{\mathcal{X}\times\mathcal{Y}}$. We greedily find subsamples $D_r = \{(\tilde{X}_1, \tilde{Y}_1), \dots, (\tilde{X}_r, \tilde{Y}_r)\}$, where $D = \{(X_i, Y_i)\}_{i=1}^n$ denotes the full set, as

$$(\tilde{X}_r, \tilde{Y}_r) = \arg\max_{(x,y)\in D\setminus D_{r-1}}\ \frac{1}{n}\sum_{i=1}^n k_{\mathcal{X}\times\mathcal{Y}}((x, y), (X_i, Y_i)) - \frac{1}{r}\sum_{j=1}^{r-1} k_{\mathcal{X}\times\mathcal{Y}}((x, y), (\tilde{X}_j, \tilde{Y}_j))$$
$$= \arg\max_{(x,y)\in D\setminus D_{r-1}}\ \frac{1}{n}\sum_{i=1}^n k_X(x, X_i)k_Y(y, Y_i) - \frac{1}{r}\sum_{j=1}^{r-1} k_X(x, \tilde{X}_j)k_Y(y, \tilde{Y}_j).$$

The resulting algorithm is shown in algorithm 6. The time complexity is $O(n^2 r)$ for selecting $r$ subsamples; a sketch of the selection rule follows this subsection.

C.2.3 How to Use. By using algorithm 6, we can reduce the time complexity of KMCF (see algorithm 3) in each iteration from $O(n^3)$ to $O(r^3)$. This can be done by obtaining subsamples $\{(\tilde{X}_i, \tilde{Y}_i)\}_{i=1}^r$ by applying algorithm 6 to $\{(X_i, Y_i)\}_{i=1}^n$, and then replacing $\{(X_i, Y_i)\}_{i=1}^n$ in the requirement of algorithm 3 by $\{(\tilde{X}_i, \tilde{Y}_i)\}_{i=1}^r$ and using the number $r$ instead of $n$.

C.2.4 How to Select the Number of Subsamples. The number $r$ of subsamples determines the trade-off between the accuracy and computational time of KMCF. It may be selected by cross-validation or by measuring the approximation error $\|\hat{m}_{XY} - \tilde{m}_{XY}\|_{\mathcal{H}_{\mathcal{X}\times\mathcal{Y}}}$, as for the case of selecting the rank of low-rank approximation in section C.1.

C.2.5 Discussion. Recall that kernel herding generates samples such that they approximate a given kernel mean (see section 3.5). Under certain assumptions, the error of this approximation is of $O(r^{-1})$ with $r$ samples, which is faster than that of i.i.d. samples, $O(r^{-1/2})$. This indicates that subsamples $\{(\tilde{X}_i, \tilde{Y}_i)\}_{i=1}^r$ selected with kernel herding may approximate equation C.1 well. Here, however, we find the solutions of the optimization problems 3.6 and 3.7 from the finite set $\{(X_i, Y_i)\}_{i=1}^n$, rather than the entire joint space $\mathcal{X}\times\mathcal{Y}$. The convergence guarantee is provided only for the case of the entire joint space $\mathcal{X}\times\mathcal{Y}$. Thus, for our case, the convergence guarantee is no longer provided. Moreover, the fast rate $O(r^{-1})$ is guaranteed only for finite-dimensional RKHSs. Gaussian kernels, which we often use in practice, define infinite-dimensional RKHSs. Therefore, the fast rate is not guaranteed if we use gaussian kernels. Nevertheless, we can use algorithm 6 as a heuristic for data reduction.

Acknowledgments

We express our gratitude to the associate editor and the anonymous reviewer for their time and helpful suggestions. We also thank Masashi Shimbo, Momoko Hayamizu, Yoshimasa Uematsu, and Katsuhiro Omae for their helpful comments. This work has been supported in part by MEXT Grant-in-Aid for Scientific Research on Innovative Areas 25120012. M.K. has been supported by JSPS Grant-in-Aid for JSPS Fellows 15J04406.

References

Anderson, B., & Moore, J. (1979). Optimal filtering. Englewood Cliffs, NJ: Prentice Hall.
Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3), 337-404.
Bach, F., & Jordan, M. I. (2002). Kernel independent component analysis. Journal of Machine Learning Research, 3, 1-48.
Bach, F., Lacoste-Julien, S., & Obozinski, G. (2012). On the equivalence between herding and conditional gradient algorithms. In Proceedings of the 29th International Conference on Machine Learning (ICML 2012) (pp. 1359-1366). Madison, WI: Omnipress.
Berlinet, A., & Thomas-Agnan, C. (2004). Reproducing kernel Hilbert spaces in probability and statistics. Dordrecht: Kluwer Academic.
Calvet, L. E., & Czellar, V. (2015). Accurate methods for approximate Bayesian computation filtering. Journal of Financial Econometrics, 13, 798-838. doi:10.1093/jjfinec/nbu019
Cappe, O., Godsill, S. J., & Moulines, E. (2007). An overview of existing methods and recent advances in sequential Monte Carlo. IEEE Proceedings, 95(5), 899-924.
Chen, Y., Welling, M., & Smola, A. (2010). Super-samples from kernel herding. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence (pp. 109-116). Cambridge, MA: MIT Press.
Deisenroth, M., Huber, M., & Hanebeck, U. (2009). Analytic moment-based gaussian process filtering. In Proceedings of the 26th International Conference on Machine Learning (pp. 225-232). Madison, WI: Omnipress.
Doucet, A., Freitas, N. D., & Gordon, N. J. (Eds.). (2001). Sequential Monte Carlo methods in practice. New York: Springer.
Doucet, A., & Johansen, A. M. (2011). A tutorial on particle filtering and smoothing: Fifteen years later. In D. Crisan & B. Rozovskii (Eds.), The Oxford handbook of nonlinear filtering (pp. 656-704). New York: Oxford University Press.
Durbin, J., & Koopman, S. J. (2012). Time series analysis by state space methods (2nd ed.). New York: Oxford University Press.
Eberts, M., & Steinwart, I. (2013). Optimal regression rates for SVMs using gaussian kernels. Electronic Journal of Statistics, 7, 1-42.
Ferris, B., Hahnel, D., & Fox, D. (2006). Gaussian processes for signal strength-based location estimation. In Proceedings of Robotics: Science and Systems. Cambridge, MA: MIT Press.
Fine, S., & Scheinberg, K. (2001). Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2, 243-264.
Freund, R. M., & Grigas, P. (2014). New analysis and results for the Frank-Wolfe method. Mathematical Programming. doi:10.1007/s10107-014-0841-6
Fukumizu, K., Bach, F., & Jordan, M. I. (2004). Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5, 73-99.
Fukumizu, K., Gretton, A., Sun, X., & Scholkopf, B. (2008). Kernel measures of conditional dependence. In J. C. Platt, D. Koller, Y. Singer, & S. Roweis (Eds.), Advances in neural information processing systems, 20 (pp. 489-496). Cambridge, MA: MIT Press.
Fukumizu, K., Song, L., & Gretton, A. (2011). Kernel Bayes' rule. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. C. N. Pereira, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 24 (pp. 1737-1745). Red Hook, NY: Curran.
Fukumizu, K., Song, L., & Gretton, A. (2013). Kernel Bayes' rule: Bayesian inference with positive definite kernels. Journal of Machine Learning Research, 14, 3753-3783.
Fukumizu, K., Sriperumbudur, B., Gretton, A., & Scholkopf, B. (2009). Characteristic kernels on groups and semigroups. In D. Koller, D. Schuurmans, Y. Bengio, & L. Bottou (Eds.), Advances in neural information processing systems, 21 (pp. 473-480). Cambridge, MA: MIT Press.
Gordon, N. J., Salmond, D. J., & Smith, A. F. M. (1993). Novel approach to nonlinear/non-gaussian Bayesian state estimation. IEE Proceedings-F, 140, 107-113.
Hofmann, T., Scholkopf, B., & Smola, A. J. (2008). Kernel methods in machine learning. Annals of Statistics, 36(3), 1171-1220.
Jaggi, M. (2013). Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning (pp. 427-435).
Jasra, A., Singh, S. S., Martin, J. S., & McCoy, E. (2012). Filtering via approximate Bayesian computation. Statistics and Computing, 22, 1223-1237.
Julier, S. J., & Uhlmann, J. K. (1997). A new extension of the Kalman filter to nonlinear systems. In Proceedings of AeroSense: The 11th International Symposium on Aerospace/Defence Sensing, Simulation and Controls. Bellingham, WA: SPIE.
Julier, S. J., & Uhlmann, J. K. (2004). Unscented filtering and nonlinear estimation. IEEE Review, 92, 401-422.
Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Transactions of the ASME, Journal of Basic Engineering, 82, 35-45.
Kanagawa, M., & Fukumizu, K. (2014). Recovering distributions from gaussian RKHS embeddings. In Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (pp. 457-465). JMLR.
Kanagawa, M., Nishiyama, Y., Gretton, A., & Fukumizu, K. (2014). Monte Carlo filtering using kernel embedding of distributions. In Proceedings of the 28th AAAI Conference on Artificial Intelligence (pp. 1897-1903). Cambridge, MA: MIT Press.
Ko, J., & Fox, D. (2009). GP-BayesFilters: Bayesian filtering using gaussian process prediction and observation models. Autonomous Robots, 72(1), 75-90.
Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proceedings of 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (vol. 2, pp. 2169-2178). Washington, DC: IEEE Computer Society.
Liu, J. S. (2001). Monte Carlo strategies in scientific computing. New York: Springer-Verlag.
McCalman, L., O'Callaghan, S., & Ramos, F. (2013). Multi-modal estimation with kernel embeddings for learning motion models. In Proceedings of 2013 IEEE International Conference on Robotics and Automation (pp. 2845-2852). Piscataway, NJ: IEEE.
Pan, S. J., & Yang, Q. (2010). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10), 1345-1359.
Pistohl, T., Ball, T., Schulze-Bonhage, A., Aertsen, A., & Mehring, C. (2008). Prediction of arm movement trajectories from ECoG-recordings in humans. Journal of Neuroscience Methods, 167(1), 105-114.
Pronobis, A., & Caputo, B. (2009). COLD: COsy localization database. International Journal of Robotics Research, 28(5), 588-594.
Quigley, M., Stavens, D., Coates, A., & Thrun, S. (2010). Sub-meter indoor localization in unmodified environments with inexpensive sensors. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems 2010 (vol. 1, pp. 2039-2046). Piscataway, NJ: IEEE.
Ristic, B., Arulampalam, S., & Gordon, N. (2004). Beyond the Kalman filter: Particle filters for tracking applications. Norwood, MA: Artech House.
Schaid, D. J. (2010a). Genomic similarity and kernel methods I: Advancements by building on mathematical and statistical foundations. Human Heredity, 70(2), 109-131.
Schaid, D. J. (2010b). Genomic similarity and kernel methods II: Methods for genomic information. Human Heredity, 70(2), 132-140.
Schalk, G., Kubanek, J., Miller, K. J., Anderson, N. R., Leuthardt, E. C., Ojemann, J. G., ... Wolpaw, J. R. (2007). Decoding two-dimensional movement trajectories using electrocorticographic signals in humans. Journal of Neural Engineering, 4(264), 264-275.
Scholkopf, B., & Smola, A. J. (2002). Learning with kernels. Cambridge, MA: MIT Press.
Scholkopf, B., Tsuda, K., & Vert, J. P. (2004). Kernel methods in computational biology. Cambridge, MA: MIT Press.
Silverman, B. W. (1986). Density estimation for statistics and data analysis. London: Chapman and Hall.
Smola, A., Gretton, A., Song, L., & Scholkopf, B. (2007). A Hilbert space embedding for distributions. In Proceedings of the International Conference on Algorithmic Learning Theory (pp. 13-31). New York: Springer.
Song, L., Fukumizu, K., & Gretton, A. (2013). Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine, 30(4), 98-111.
Song, L., Huang, J., Smola, A., & Fukumizu, K. (2009). Hilbert space embeddings of conditional distributions with applications to dynamical systems. In Proceedings of the 26th International Conference on Machine Learning (pp. 961-968). Madison, WI: Omnipress.
Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Scholkopf, B., & Lanckriet, G. R. (2010). Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11, 1517-1561.
Steinwart, I., & Christmann, A. (2008). Support vector machines. New York: Springer.
Stone, C. J. (1977). Consistent nonparametric regression. Annals of Statistics, 5(4), 595-620.
Thrun, S., Burgard, W., & Fox, D. (2005). Probabilistic robotics. Cambridge, MA: MIT Press.
Vlassis, N., Terwijn, B., & Krose, B. (2002). Auxiliary particle filter robot localization from high-dimensional sensor observations. In Proceedings of the International Conference on Robotics and Automation (pp. 7-12). Piscataway, NJ: IEEE.
Wang, Z., Ji, Q., Miller, K. J., & Schalk, G. (2011). Prior knowledge improves decoding of finger flexion from electrocorticographic signals. Frontiers in Neuroscience, 5, 127.
Widom, H. (1963). Asymptotic behavior of the eigenvalues of certain integral equations. Transactions of the American Mathematical Society, 109, 278-295.
Widom, H. (1964). Asymptotic behavior of the eigenvalues of certain integral equations II. Archive for Rational Mechanics and Analysis, 17, 215-229.
Wolf, J., Burgard, W., & Burkhardt, H. (2005). Robust vision-based localization by combining an image retrieval system with Monte Carlo localization. IEEE Transactions on Robotics, 21(2), 208-216.

Received May 18, 2015; accepted October 14, 2015.

Page 17: Filtering with State-Observation Examples via Kernel Monte ...

398 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

equations 36 and 37 from a finite set of samples X1 Xn in equation46 We allow repetitions in Xt1 Xtn We can expect that the resultingequation 47 is close to equation 46 in the RKHS if the samples X1 Xncover the support of the posterior p(xt |y1t ) sufficiently This is verified bythe theoretical analysis of section 53

Here searching for the solutions from a finite set reduces the computa-tional costs of kernel herding It is possible to search from the entire space Xif we have sufficient time or if the sample size n is small enough it dependson applications and available computational resources We also note thatthe size of the resampling samples is not necessarily n this depends on howaccurately these samples approximate equation 46 Thus a smaller numberof samples may be sufficient In this case we can reduce the computationalcosts of resampling as discussed in section 52

The aim of our resampling step is similar to that of the resamplingstep of a particle filter (see Doucet amp Johansen 2011) Intuitively the aimis to eliminate samples with very small weights and replicate those withlarge weights (see Figures 2 and 3) In particle methods this is realized bygenerating samples from the empirical distribution defined by a weightedsample (therefore this procedure is called resampling) Our resamplingstep is a realization of such a procedure in terms of the kernel mean embed-ding we generate samples Xt1 Xtn from the empirical kernel meanequation 46

Note that the resampling algorithm of particle methods is not appropriate for use with kernel mean embeddings. This is because it assumes that weights are positive, but our weights in equation 4.6 can be negative, as this equation is a kernel mean estimator. One may apply the resampling algorithm of particle methods by first truncating the samples with negative weights. However, there is no guarantee that samples obtained by this heuristic produce a good approximation of equation 4.6 as a kernel mean, as shown by experiments in section 6.1. In this sense the use of kernel herding is more natural, since it generates samples that approximate a kernel mean.
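For concreteness, the following is a minimal Python sketch of herding-based resampling restricted to a finite candidate set, assuming a gaussian kernel k_X; the function names and the bandwidth parameter are illustrative choices of ours, not a transcription of algorithm 2, but the greedy update follows the herding rule described above.

```python
import numpy as np

def gauss_kernel(A, B, gamma):
    """Gaussian kernel matrix k(a, b) = exp(-||a - b||^2 / (2 gamma^2))."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * gamma**2))

def herding_resample(X, w, gamma, num_samples):
    """Greedy kernel herding restricted to the candidate set X (rows of X).

    Approximates the weighted kernel mean sum_i w_i k(., X_i) by num_samples
    uniformly weighted points chosen from X (repetitions allowed).
    """
    K = gauss_kernel(X, X, gamma)      # K[i, j] = k(X_i, X_j)
    mu = K @ w                         # weighted kernel mean evaluated at each candidate
    herd = np.zeros(len(X))            # running sum_j k(., selected point j) at candidates
    chosen = []
    for l in range(num_samples):
        scores = mu - herd / (l + 1)   # herding objective over the finite candidate set
        j = int(np.argmax(scores))
        chosen.append(j)
        herd += K[:, j]
    return X[chosen]
```

In the filtering context, X would be the training states X_1, ..., X_n and w the weight vector returned by the correction step.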

4.2.4 Overall Algorithm. We summarize the overall procedure of KMCF in algorithm 3, where p_init denotes a prior distribution for the initial state x_1. For each time t, KMCF takes as input an observation y_t and outputs a weight vector w_t = (w_{t,1}, ..., w_{t,n})^T ∈ R^n. Combined with the samples X_1, ..., X_n in the state-observation examples {(X_i, Y_i)}_{i=1}^n, these weights provide an estimator, equation 4.6, of the kernel mean of the posterior, equation 4.1.

We first compute the kernel matrices G_X, G_Y (lines 4–5), which are used in algorithm 1 of kernel Bayes' rule (line 15). For t = 1, we generate an i.i.d. sample X_{1,1}, ..., X_{1,n} from the initial distribution p_init (line 8), which provides an estimator of the prior corresponding to equation 4.4. Line 10 is the resampling step at time t − 1, and line 11 is the prediction step at time t. Lines 13 to 16 correspond to the correction step.
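The sketch below illustrates one time step of this loop in Python, with the correction written in one standard Gram-matrix form of kernel Bayes' rule. The function names, hyperparameter values, and the exact regularization convention are assumptions of this sketch rather than a faithful transcription of algorithms 1 and 3; prev_states is assumed to be the uniformly weighted sample produced by the resampling step at time t − 1 (e.g., by the herding sketch above).

```python
import numpy as np

def gauss_K(A, B, s):
    """Gaussian kernel matrix with bandwidth s."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * s**2))

def kmcf_step(Xtr, Ytr, prev_states, y_t, transition_sample,
              sx=0.1, sy=0.1, eps=0.01, delta=0.01, rng=np.random):
    """One KMCF time step: prediction by sampling, correction by kernel Bayes' rule.

    Xtr, Ytr          : state-observation training examples, shapes (n, dx), (n, dy)
    prev_states       : states representing the posterior at time t-1 (uniform weights)
    transition_sample : function mapping an array of states to sampled next states
    Returns a weight vector over Xtr representing the posterior kernel mean at time t.
    """
    n = len(Xtr)
    GX = gauss_K(Xtr, Xtr, sx)
    GY = gauss_K(Ytr, Ytr, sy)

    # Prediction step: propagate each state through the transition model.
    prop = transition_sample(prev_states, rng)

    # Prior kernel mean evaluated at the training states X_i.
    m_prior_at_X = gauss_K(Xtr, prop, sx).mean(axis=1)

    # Correction step: kernel Bayes' rule in Gram-matrix form (the scaling of the
    # second regularizer is one common convention and may differ from algorithm 1).
    mu = np.linalg.solve(GX + n * eps * np.eye(n), m_prior_at_X)
    Lam = np.diag(mu)
    D = Lam @ GY
    ky = gauss_K(Ytr, y_t[None, :], sy).ravel()
    w = D @ np.linalg.solve(D @ D + delta * np.eye(n), Lam @ ky)

    # Normalization of the weights (see section 4.3.3).
    return w / w.sum()
```

For instance, for a transition model x_t = 0.9 x_{t−1} + v_t one could pass transition_sample = lambda X, rng: 0.9 * X + rng.normal(size=X.shape).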


4.3 Discussion. The estimation accuracy of KMCF can depend on several factors in practice.

4.3.1 Training Samples. We first note that the training samples {(X_i, Y_i)}_{i=1}^n should provide the information concerning the observation model p(y_t | x_t). For example, {(X_i, Y_i)}_{i=1}^n may be an i.i.d. sample from a joint distribution p(x, y) on X × Y, which decomposes as p(x, y) = p(y|x)p(x). Here p(y|x) is the observation model and p(x) is some distribution on X. The support of p(x) should cover the region where states x_1, ..., x_T may pass in the test phase, as discussed in section 4.2. For example, this is satisfied when the state space X is compact and the support of p(x) is the entire X.

Note that the training samples {(X_i, Y_i)}_{i=1}^n can also be non-i.i.d. in practice. For example, we may deterministically select X_1, ..., X_n so that they cover the region of interest. In location estimation problems in robotics, for instance, we may collect location-sensor examples {(X_i, Y_i)}_{i=1}^n so that the locations X_1, ..., X_n cover the region where location estimation is to be conducted (Quigley et al., 2010).

4.3.2 Hyperparameters. As in other kernel methods in general, the performance of KMCF depends on the choice of its hyperparameters, which are the kernels k_X and k_Y (or parameters in the kernels, e.g., the bandwidth of the gaussian kernel) and the regularization constants δ, ε > 0. We need to define these hyperparameters based on the joint sample {(X_i, Y_i)}_{i=1}^n before running the algorithm on the test data y_1, ..., y_T. This can be done by cross-validation. Suppose that {(X_i, Y_i)}_{i=1}^n is given as a sequence from the state-space model. We can then apply two-fold cross-validation by dividing the sequence into two subsequences. If {(X_i, Y_i)}_{i=1}^n is not a sequence, we can rely on the cross-validation procedure for kernel Bayes' rule (see section 4.2 of Fukumizu et al., 2013).

4.3.3 Normalization of Weights. We found in our preliminary experiments that normalization of the weights (see line 16, algorithm 3) is beneficial to the filtering performance. This may be justified by the following discussion about a kernel mean estimator in general. Let us consider a consistent kernel mean estimator m̂_P = Σ_{i=1}^n w_i k(·, X_i) such that lim_{n→∞} ‖m̂_P − m_P‖_H = 0. Then we can show that the sum of the weights converges to 1, lim_{n→∞} Σ_{i=1}^n w_i = 1, under certain assumptions (Kanagawa & Fukumizu, 2014). This could be explained as follows. Recall that the weighted average Σ_{i=1}^n w_i f(X_i) of a function f is an estimator of the expectation ∫ f(x)dP(x). Let f be a function that takes the value 1 for any input: f(x) = 1, ∀x ∈ X. Then we have Σ_{i=1}^n w_i f(X_i) = Σ_{i=1}^n w_i and ∫ f(x)dP(x) = 1. Therefore Σ_{i=1}^n w_i is an estimator of 1. In other words, if the error ‖m̂_P − m_P‖_H is small, then the sum of the weights Σ_{i=1}^n w_i should be close to 1. Conversely, if the sum of the weights is far from 1, it suggests that the estimate m̂_P is not accurate. Based on this theoretical observation, we suppose that normalization of the weights (this makes the sum equal to 1) results in a better estimate.
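A small helper illustrating this diagnostic is given below; the tolerance is an arbitrary choice of ours and not part of algorithm 3.

```python
import numpy as np

def normalize_weights(w, tol=0.5):
    """Normalize KBR weights so they sum to 1; warn if the raw sum is far from 1,
    which, by the discussion above, suggests the kernel mean estimate is inaccurate."""
    s = float(np.sum(w))
    if abs(s - 1.0) > tol:
        print(f"warning: sum of weights = {s:.3f}; the estimate may be inaccurate")
    return w / s
```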

4.3.4 Time Complexity. For each time t, the naive implementation of algorithm 3 requires a time complexity of O(n³) for the size n of the joint sample {(X_i, Y_i)}_{i=1}^n. This comes from algorithm 1 in line 15 (kernel Bayes' rule) and algorithm 2 in line 10 (resampling). The complexity O(n³) of algorithm 1 is due to the matrix inversions. Note that one of the inversions, (G_X + nεI_n)^{−1}, can be computed before the test phase, as it does not involve the test data. Algorithm 2 also has complexity O(n³). In section 5.2 we will explain how this cost can be reduced to O(n²ℓ) by generating only ℓ < n samples by resampling.
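A sketch of this precomputation is shown below, assuming SciPy is available; the helper names are ours. Since G_X + nεI_n is symmetric positive definite, its Cholesky factorization can be cached once and reused at every time step.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def precompute_prior_projector(GX, eps):
    """Factorize (G_X + n*eps*I) once, before the test phase."""
    n = GX.shape[0]
    return cho_factor(GX + n * eps * np.eye(n))

def project_prior(chol, m_prior_at_X):
    """Solve (G_X + n*eps*I) mu = m_prior_at_X using the cached factorization."""
    return cho_solve(chol, m_prior_at_X)
```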

4.3.5 Speeding Up Methods. In appendix C we describe two methods for reducing the computational costs of KMCF, both of which only need to be applied prior to the test phase. The first is a low-rank approximation of the kernel matrices G_X, G_Y, which reduces the complexity to O(nr²), where r is the rank of the low-rank matrices. Low-rank approximation works well in practice, since the eigenvalues of a kernel matrix often decay very rapidly; indeed, this has been theoretically shown for some cases (see Widom, 1963, 1964, and discussions in Bach & Jordan, 2002). The second is a data-reduction method based on kernel herding, which efficiently selects joint subsamples from the training set {(X_i, Y_i)}_{i=1}^n. Algorithm 3 is then applied based only on those subsamples. The resulting complexity is thus O(r³), where r is the number of subsamples. This method is motivated by the fast convergence rate of kernel herding (Chen et al., 2010).

Both methods require the number r to be chosen, which is either the rank for low-rank approximation or the number of subsamples in data reduction. This determines the trade-off between accuracy and computational time. In practice, there are two ways of selecting the number r. By regarding r as a hyperparameter of KMCF, we can select it by cross-validation; or we can choose r by comparing the resulting approximation error, which is measured in a matrix norm for low-rank approximation and in an RKHS norm for the subsampling method (for details, see appendix C).
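The rank/accuracy trade-off can be inspected as in the sketch below. This is only an illustration of how fast the spectrum of a Gram matrix decays, using a truncated eigendecomposition; it is not the low-rank algorithm of appendix C, and the function name is ours.

```python
import numpy as np

def rank_for_tolerance(G, tol=1e-3):
    """Smallest rank r such that the best rank-r approximation of the kernel
    matrix G has relative Frobenius error below tol."""
    lam = np.sort(np.linalg.eigvalsh(G))[::-1]        # eigenvalues, descending
    tail = np.sqrt(np.cumsum(lam[::-1] ** 2)[::-1])   # tail[r] = ||G - G_r||_F
    rel = tail / tail[0]                              # tail[0] = ||G||_F
    ok = np.where(rel <= tol)[0]
    return int(ok[0]) if len(ok) else len(lam)
```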

4.3.6 Transfer Learning Setting. We assumed that the observation model in the test phase is the same as for the training samples. However, this might not hold in some situations. For example, in the vision-based localization problem, the illumination conditions for the test and training phases might be different (e.g., the test is done at night, while the training samples are collected in the morning). Without taking into account such a significant change in the observation model, KMCF would not perform well in practice.

This problem could be addressed by exploiting the framework of transfer learning (Pan & Yang, 2010). This framework aims at situations where the probability distribution that generates test data is different from that of training samples. The main assumption is that there exist a small number of examples from the test distribution. Transfer learning then provides a way of combining such test examples and abundant training samples, thereby improving the test performance. The application of transfer learning in our setting remains a topic for future research.

4.4 Estimation of Posterior Statistics. By algorithm 3 we obtain the estimates of the kernel means of the posteriors, equation 4.1, as

m̂_{x_t|y_{1:t}} = Σ_{i=1}^n w_{t,i} k_X(·, X_i)   (t = 1, ..., T).   (4.8)

These contain the information on the posteriors p(x_t | y_{1:t}) (see sections 3.2 and 3.4). We now show how to estimate statistics of the posteriors using these estimates. For ease of presentation, we consider the case X = R^d. Theoretical arguments to justify these operations are provided by Kanagawa and Fukumizu (2014).

4.4.1 Mean and Covariance. Consider the posterior mean ∫ x_t p(x_t | y_{1:t}) dx_t ∈ R^d and the posterior (uncentered) covariance ∫ x_t x_t^T p(x_t | y_{1:t}) dx_t ∈ R^{d×d}. These quantities can be estimated as

Σ_{i=1}^n w_{t,i} X_i   (mean),    Σ_{i=1}^n w_{t,i} X_i X_i^T   (covariance).

4.4.2 Probability Mass. Let A ⊂ X be a measurable set with smooth boundary. Define the indicator function I_A(x) by I_A(x) = 1 for x ∈ A and I_A(x) = 0 otherwise. Consider the probability mass ∫ I_A(x_t) p(x_t | y_{1:t}) dx_t. This can be estimated as Σ_{i=1}^n w_{t,i} I_A(X_i).

4.4.3 Density. Suppose p(x_t | y_{1:t}) has a density function. Let J(x) be a smoothing kernel satisfying ∫ J(x)dx = 1 and J(x) ≥ 0. Let h > 0 and define J_h(x) = h^{−d} J(x/h). Then the density of p(x_t | y_{1:t}) can be estimated as

p̂(x_t | y_{1:t}) = Σ_{i=1}^n w_{t,i} J_h(x_t − X_i),   (4.9)

with an appropriate choice of h.

4.4.4 Mode. The mode may be obtained by finding a point that maximizes equation 4.9. However, this requires a careful choice of h. Instead, we may use X_{i_max} with i_max = arg max_i w_{t,i} as a mode estimate. This is the point in X_1, ..., X_n that is associated with the maximum weight in w_{t,1}, ..., w_{t,n}. This point can be interpreted as the point that maximizes equation 4.9 in the limit of h → 0.
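The point estimates of sections 4.4.1 to 4.4.4 can all be read off the weighted sample, as in the following sketch; the gaussian smoothing kernel and the bandwidth are illustrative choices of ours.

```python
import numpy as np

def posterior_statistics(X, w, A_indicator=None, h=0.1, x_grid=None):
    """Point estimates from the weighted-sample representation sum_i w_i k(., X_i).

    X : (n, d) state samples; w : (n,) weights from kernel Bayes' rule (may be negative).
    Returns the posterior mean, uncentered covariance, optional probability mass of a
    set A, optional density estimate on x_grid, and the maximum-weight mode estimate.
    """
    mean = w @ X                               # sum_i w_i X_i
    cov = (X * w[:, None]).T @ X               # sum_i w_i X_i X_i^T
    mass = None if A_indicator is None else float(w @ A_indicator(X))
    density = None
    if x_grid is not None:
        d = X.shape[1]
        d2 = ((x_grid[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        J = np.exp(-d2 / (2 * h**2)) / ((2 * np.pi * h**2) ** (d / 2))
        density = J @ w                        # sum_i w_i J_h(x - X_i)
    mode = X[int(np.argmax(w))]
    return mean, cov, mass, density, mode
```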

4.4.5 Other Methods. Other ways of using equation 4.8 include the preimage computation and fitting of gaussian mixtures (see, e.g., Song et al., 2009; Fukumizu et al., 2013; McCalman, O'Callaghan, & Ramos, 2013).

5 Theoretical Analysis

In this section we analyze the sampling procedure of the prediction step in section 4.2. Specifically, we derive an upper bound on the error of the estimator, equation 4.4. We also discuss in detail how the resampling step in section 4.2 works as a preprocessing step of the prediction step.

To make our analysis clear, we slightly generalize the setting of the prediction step and discuss the sampling and resampling procedures in this setting.

5.1 Error Bound for the Prediction Step. Let X be a measurable space and P be a probability distribution on X. Let p(·|x) be a conditional distribution on X conditioned on x ∈ X. Let Q be a marginal distribution on X defined by Q(B) = ∫ p(B|x)dP(x) for all measurable B ⊂ X. In the filtering setting of section 4, the space X corresponds to the state space, and the distributions P, p(·|x), and Q correspond to the posterior p(x_{t−1} | y_{1:t−1}) at time t − 1, the transition model p(x_t | x_{t−1}), and the prior p(x_t | y_{1:t−1}) at time t, respectively.

Let k_X be a positive-definite kernel on X and H_X be the RKHS associated with k_X. Let m_P = ∫ k_X(·, x)dP(x) and m_Q = ∫ k_X(·, x)dQ(x) be the kernel means of P and Q, respectively. Suppose that we are given an empirical estimate of m_P as

m̂_P = Σ_{i=1}^n w_i k_X(·, X_i),   (5.1)

where w_1, ..., w_n ∈ R and X_1, ..., X_n ∈ X. Considering this weighted sample form enables us to explain the mechanism of the resampling step.

The prediction step can then be cast as the following procedure: for each sample X_i, we generate a new sample X′_i with the conditional distribution X′_i ∼ p(·|X_i). Then we estimate m_Q by

m̂_Q = Σ_{i=1}^n w_i k_X(·, X′_i),   (5.2)

which corresponds to the estimate, equation 4.4, of the prior kernel mean at time t.

The following theorem provides an upper bound on the error of equation 5.2 and reveals properties of equation 5.1 that affect the error of the estimator, equation 5.2. The proof is given in appendix A.

Theorem 1. Let m̂_P be a fixed estimate of m_P given by equation 5.1. Define a function θ on X × X by θ(x_1, x_2) = ∫∫ k_X(x′_1, x′_2) dp(x′_1|x_1) dp(x′_2|x_2), ∀(x_1, x_2) ∈ X × X, and assume that θ is included in the tensor RKHS H_X ⊗ H_X.⁴ The estimator m̂_Q, equation 5.2, then satisfies

E_{X′_1,...,X′_n}[ ‖m̂_Q − m_Q‖²_{H_X} ] ≤ Σ_{i=1}^n w_i² ( E_{X′_i}[k_X(X′_i, X′_i)] − E_{X′_i, X̃′_i}[k_X(X′_i, X̃′_i)] )   (5.3)
  + ‖m̂_P − m_P‖²_{H_X} ‖θ‖_{H_X ⊗ H_X},   (5.4)

where X′_i ∼ p(·|X_i) and X̃′_i is an independent copy of X′_i.

⁴ The tensor RKHS H_X ⊗ H_X is the RKHS of a product kernel k_{X×X} on X × X defined as k_{X×X}((x_a, x_b), (x_c, x_d)) = k_X(x_a, x_c) k_X(x_b, x_d), ∀(x_a, x_b), (x_c, x_d) ∈ X × X. This space H_X ⊗ H_X consists of smooth functions on X × X if the kernel k_X is smooth (e.g., if k_X is gaussian; see section 4 of Steinwart & Christmann, 2008). In this case we can interpret this assumption as requiring that θ be smooth as a function on X × X. The function θ can be written as the inner product between the kernel means of the conditional distributions: θ(x_1, x_2) = ⟨m_{p(·|x_1)}, m_{p(·|x_2)}⟩_{H_X}, where m_{p(·|x)} = ∫ k_X(·, x′) dp(x′|x). Therefore the assumption may be further seen as requiring that the map x → m_{p(·|x)} be smooth. Note that while similar assumptions are common in the literature on kernel mean embeddings (e.g., theorem 5 of Fukumizu et al., 2013), we may relax this assumption by using approximate arguments in learning theory (e.g., theorems 2.2 and 2.3 of Eberts & Steinwart, 2013). This analysis remains a topic for future research.

From theorem 1 we can make the following observations. First, the second term, equation 5.4, of the upper bound shows that the error of the estimator, equation 5.2, is likely to be large if the given estimate, equation 5.1, has large error ‖m̂_P − m_P‖²_{H_X}, which is reasonable to expect.

Second, the first term, equation 5.3, shows that the error of equation 5.2 can be large if the distribution of X′_i (i.e., p(·|X_i)) has large variance. For example, suppose X′_i = f(X_i) + ε_i, where f: X → X is some mapping and ε_i is a random variable with mean 0. Let k_X be the gaussian kernel k_X(x, x′) = exp(−‖x − x′‖²/2α) for some α > 0. Then E_{X′_i}[k_X(X′_i, X′_i)] − E_{X′_i, X̃′_i}[k_X(X′_i, X̃′_i)] increases from 0 to 1 as the variance of ε_i (i.e., the variance of X′_i) increases from 0 to infinity. Therefore, in this case, equation 5.3 is upper-bounded at worst by Σ_{i=1}^n w_i². Note that E_{X′_i}[k_X(X′_i, X′_i)] − E_{X′_i, X̃′_i}[k_X(X′_i, X̃′_i)] is always nonnegative.⁵

5.1.1 Effective Sample Size. Now let us assume that the kernel k_X is bounded: there is a constant C > 0 such that sup_{x∈X} k_X(x, x) < C. Then the inequality of theorem 1 can be further bounded as

E_{X′_1,...,X′_n}[ ‖m̂_Q − m_Q‖²_{H_X} ] ≤ 2C Σ_{i=1}^n w_i² + ‖m̂_P − m_P‖²_{H_X} ‖θ‖_{H_X ⊗ H_X}.   (5.5)

This bound shows that two quantities are important in the estimate, equation 5.1: (1) the sum of squared weights Σ_{i=1}^n w_i² and (2) the error ‖m̂_P − m_P‖²_{H_X}. In other words, the error of equation 5.2 can be large if the quantity Σ_{i=1}^n w_i² is large, regardless of the accuracy of equation 5.1 as an estimator of m_P. In fact, the estimator of the form 5.1 can have large Σ_{i=1}^n w_i² even when ‖m̂_P − m_P‖²_{H_X} is small, as shown in section 6.1.

⁵ To show this, it is sufficient to prove that ∫∫ k_X(x, x̃) dP(x) dP(x̃) ≤ ∫ k_X(x, x) dP(x) for any probability P. This can be shown as follows: ∫∫ k_X(x, x̃) dP(x) dP(x̃) = ∫∫ ⟨k_X(·, x), k_X(·, x̃)⟩_{H_X} dP(x) dP(x̃) ≤ ∫∫ √(k_X(x, x)) √(k_X(x̃, x̃)) dP(x) dP(x̃) ≤ ∫ k_X(x, x) dP(x). Here we used the reproducing property, the Cauchy-Schwarz inequality, and Jensen's inequality.


Figure 3: An illustration of the sampling procedure with (right) and without (left) the resampling algorithm. The left panel corresponds to the kernel mean estimators, equations 5.1 and 5.2, in section 5.1, and the right panel corresponds to equations 5.6 and 5.7 in section 5.2.

The inverse of the sum of the squared weights, 1/Σ_{i=1}^n w_i², can be interpreted as the effective sample size (ESS) of the empirical kernel mean, equation 5.1. To explain this, suppose that the weights are normalized: Σ_{i=1}^n w_i = 1. Then the ESS takes its maximum n when the weights are uniform, w_1 = ⋯ = w_n = 1/n. It becomes small when only a few samples have large weights (see the left side of Figure 3). Therefore the bound, equation 5.5, can be interpreted as follows: to make equation 5.2 a good estimator of m_Q, we need to have equation 5.1 such that the ESS is large and the error ‖m̂_P − m_P‖_H is small. Here we borrowed the notion of ESS from the literature on particle methods, in which ESS has also played an important role (see section 2.5.3 of Liu, 2001, and section 3.5 of Doucet & Johansen, 2011).

5.2 Role of Resampling. Based on these arguments, we explain how the resampling step in section 4.2 works as a preprocessing step for the sampling procedure. Consider m̂_P in equation 5.1 as an estimate, equation 4.6, given by the correction step at time t − 1. Then we can think of m̂_Q, equation 5.2, as an estimator of the kernel mean, equation 4.5, of the prior without the resampling step.

The resampling step is the application of kernel herding to m̂_P to obtain samples X̄_1, ..., X̄_n, which provide a new estimate of m_P with uniform weights:

m̄_P = (1/n) Σ_{i=1}^n k_X(·, X̄_i).   (5.6)

The subsequent prediction step is to generate a sample X̄′_i ∼ p(·|X̄_i) for each X̄_i (i = 1, ..., n) and estimate m_Q as

m̄_Q = (1/n) Σ_{i=1}^n k_X(·, X̄′_i).   (5.7)

Theorem 1 gives the following bound for this estimator, corresponding to equation 5.5:

E_{X̄′_1,...,X̄′_n}[ ‖m̄_Q − m_Q‖²_{H_X} ] ≤ 2C/n + ‖m̄_P − m_P‖²_{H_X} ‖θ‖_{H_X ⊗ H_X}.   (5.8)

A comparison of the upper bounds of equations 5.5 and 5.8 implies that the resampling step is beneficial when Σ_{i=1}^n w_i² is large (i.e., the ESS is small) and ‖m̄_P − m̂_P‖_{H_X} is small. The condition on ‖m̄_P − m̂_P‖_{H_X} means that the loss by kernel herding (in terms of the RKHS distance) is small. This implies ‖m̄_P − m_P‖_{H_X} ≈ ‖m̂_P − m_P‖_{H_X}, so the second term of equation 5.8 is close to that of equation 5.5. On the other hand, the first term of equation 5.8 will be much smaller than that of equation 5.5 if Σ_{i=1}^n w_i² ≫ 1/n. In other words, the resampling step improves the accuracy of the sampling procedure by increasing the ESS of the kernel mean estimate m̂_P. This is illustrated in Figure 3.

The above observations lead to the following procedures.

5.2.1 When to Apply Resampling. If Σ_{i=1}^n w_i² is not large, the gain by the resampling step will be small. Therefore the resampling algorithm should be applied when Σ_{i=1}^n w_i² is above a certain threshold, say 2/n. The same strategy has been commonly used in particle methods (see Doucet & Johansen, 2011).

Also, the bound, equation 5.3, of theorem 1 shows that resampling is not beneficial if the variance of the conditional distribution p(·|x) is very small (i.e., if the state transition is nearly deterministic). In this case the error of the sampling procedure may increase due to the loss ‖m̄_P − m̂_P‖_{H_X} caused by kernel herding.
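The trigger just described amounts to the following check; the factor 2 is the threshold suggested above, exposed here as a parameter.

```python
import numpy as np

def effective_sample_size(w):
    """ESS = 1 / sum_i w_i^2 for a (normalized) weight vector."""
    return 1.0 / float(np.sum(w**2))

def should_resample(w, factor=2.0):
    """Resample only when sum_i w_i^2 exceeds factor/n, i.e., the ESS drops below n/factor."""
    n = len(w)
    return float(np.sum(w**2)) > factor / n
```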

5.2.2 Reduction of Computational Cost. Algorithm 2 generates n samples X̄_1, ..., X̄_n with time complexity O(n³). Suppose that the first ℓ samples X̄_1, ..., X̄_ℓ, where ℓ < n, already approximate m̂_P well: ‖(1/ℓ) Σ_{i=1}^ℓ k_X(·, X̄_i) − m̂_P‖_{H_X} is small. We do not then need to generate the rest of the samples X̄_{ℓ+1}, ..., X̄_n; we can make n samples by copying the ℓ samples n/ℓ times (suppose n can be divided by ℓ for simplicity, say n = 2ℓ). Let X̌_1, ..., X̌_n denote these n samples. Then (1/ℓ) Σ_{i=1}^ℓ k_X(·, X̄_i) = (1/n) Σ_{i=1}^n k_X(·, X̌_i) by definition, so ‖(1/n) Σ_{i=1}^n k_X(·, X̌_i) − m̂_P‖_{H_X} is also small. This reduces the time complexity of algorithm 2 to O(n²ℓ).

One might think that it is unnecessary to copy the ℓ samples to make n samples. This is not true, however. Suppose that we just use the first ℓ samples to define m̄_P = (1/ℓ) Σ_{i=1}^ℓ k_X(·, X̄_i). Then the first term of equation 5.8 becomes 2C/ℓ, which is larger than the 2C/n of n samples. This difference involves sampling with the conditional distribution X̄′_i ∼ p(·|X̄_i). If we use just the ℓ samples, sampling is done ℓ times. If we use the copied n samples, sampling is done n times. Thus the benefit of making n samples comes from sampling with the conditional distribution many times. This matches the bound of theorem 1, where the first term involves the variance of the conditional distribution.
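The copying step itself is a one-liner, shown below under the stated divisibility assumption; the helper name is ours.

```python
import numpy as np

def copy_resampled(X_ell, n):
    """Tile ell herding samples into n samples (assumes ell divides n), so that the
    subsequent sampling with the transition model is performed n times rather than ell."""
    ell = len(X_ell)
    assert n % ell == 0
    return np.repeat(X_ell, n // ell, axis=0)
```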

5.3 Convergence Rates for Resampling. Our resampling algorithm (see algorithm 2) is an approximate version of kernel herding in section 3.5: algorithm 2 searches for the solutions of the update equations 3.6 and 3.7 from a finite set {X_1, ..., X_n} ⊂ X, not from the entire space X. Therefore existing theoretical guarantees for kernel herding (Chen et al., 2010; Bach et al., 2012) do not apply to algorithm 2. Here we provide a theoretical justification.

5.3.1 Generalized Version. We consider a slightly generalized version, shown in algorithm 4. It takes as input (1) a kernel mean estimator m̂_P of a kernel mean m_P, (2) candidate samples Z_1, ..., Z_N, and (3) the number ℓ of resampling samples. It then outputs resampling samples X̄_1, ..., X̄_ℓ ∈ {Z_1, ..., Z_N}, which form a new estimator m̄_P = (1/ℓ) Σ_{i=1}^ℓ k_X(·, X̄_i). Here N is the number of the candidate samples.

Algorithm 4 searches for solutions of the update equations 3.6 and 3.7 from the candidate set {Z_1, ..., Z_N}. Note that here these samples Z_1, ..., Z_N can be different from those expressing the estimator m̂_P. If they are the same, that is, if the estimator is expressed as m̂_P = Σ_{i=1}^n w_i k(·, X_i) with n = N and X_i = Z_i (i = 1, ..., n), then algorithm 4 reduces to algorithm 2. In fact, theorem 2 allows m̂_P to be any element in the RKHS.

5.3.2 Convergence Rates in Terms of N and ℓ. Algorithm 4 gives the new estimator m̄_P of the kernel mean m_P. The error of this new estimator, ‖m̄_P − m_P‖_{H_X}, should be close to that of the given estimator, ‖m̂_P − m_P‖_{H_X}. Theorem 2 guarantees this. In particular, it provides convergence rates of ‖m̄_P − m_P‖_{H_X} approaching ‖m̂_P − m_P‖_{H_X} as N and ℓ go to infinity. This theorem follows from theorem 3 in appendix B, which holds under weaker assumptions.

Theorem 2. Let m_P be the kernel mean of a distribution P, and m̂_P be any element in the RKHS H_X. Let Z_1, ..., Z_N be an i.i.d. sample from a distribution with density q. Assume that P has a density function p such that sup_{x∈X} p(x)/q(x) < ∞. Let X̄_1, ..., X̄_ℓ be samples given by algorithm 4 applied to m̂_P with candidate samples Z_1, ..., Z_N. Then for m̄_P = (1/ℓ) Σ_{i=1}^ℓ k(·, X̄_i), we have

‖m̄_P − m_P‖²_{H_X} = ( ‖m̂_P − m_P‖_{H_X} + O_p(N^{−1/2}) )² + O( (ln ℓ)/ℓ )   (N, ℓ → ∞).   (5.9)

Our proof in appendix B relies on the fact that kernel herding can be seen as the Frank-Wolfe optimization method (Bach et al., 2012). Indeed, the error O((ln ℓ)/ℓ) in equation 5.9 comes from the optimization error of the Frank-Wolfe method after ℓ iterations (Freund & Grigas, 2014, bound 3.2). The error O_p(N^{−1/2}) is due to the approximation of the solution space by a finite set Z_1, ..., Z_N. These errors will be small if N and ℓ are large enough and the error of the given estimator, ‖m̂_P − m_P‖_{H_X}, is relatively large. This is formally stated in corollary 1 below.

Theorem 2 assumes that the candidate samples are i.i.d. with a density q. The assumption sup_{x∈X} p(x)/q(x) < ∞ requires that the support of q contain that of p. This is a formal characterization of the explanation in section 4.2 that the samples X_1, ..., X_N should cover the support of P sufficiently. Note that the statement of theorem 2 also holds for non-i.i.d. candidate samples, as shown in theorem 3 of appendix B.

5.3.3 Convergence Rates as m̂_P Goes to m_P. Theorem 2 provides convergence rates when the estimator m̂_P is fixed. In corollary 1 below, we let m̂_P approach m_P and provide convergence rates for m̄_P of algorithm 4 approaching m_P. This corollary directly follows from theorem 2, since the constant terms in O_p(N^{−1/2}) and O((ln ℓ)/ℓ) in equation 5.9 do not depend on m̂_P, which can be seen from the proof in section B.

Corollary 1. Assume that P and Z_1, ..., Z_N satisfy the conditions in theorem 2 for all N. Let m̂_P^{(n)} be an estimator of m_P such that ‖m̂_P^{(n)} − m_P‖_{H_X} = O_p(n^{−b}) as n → ∞ for some constant b > 0.⁶ Let N = ℓ = n^{2b}. Let X̄_1^{(n)}, ..., X̄_ℓ^{(n)} be samples given by algorithm 4 applied to m̂_P^{(n)} with candidate samples Z_1, ..., Z_N. Then for m̄_P^{(n)} = (1/ℓ) Σ_{i=1}^ℓ k_X(·, X̄_i^{(n)}), we have

‖m̄_P^{(n)} − m_P‖_{H_X} = O_p(n^{−b})   (n → ∞).   (5.10)

⁶ Here the estimator m̂_P^{(n)} and the candidate samples Z_1, ..., Z_N can be dependent.

Corollary 1 assumes that the estimator m̂_P^{(n)} converges to m_P at a rate O_p(n^{−b}) for some constant b > 0. Then the resulting estimator m̄_P^{(n)} by algorithm 4 also converges to m_P at the same rate O_p(n^{−b}) if we set N = ℓ = n^{2b}. This implies that if we use sufficiently large N and ℓ, the errors O_p(N^{−1/2}) and O((ln ℓ)/ℓ) in equation 5.9 can be negligible, as stated earlier. Note that N = ℓ = n^{2b} implies that N and ℓ can be smaller than n, since typically we have b ≤ 1/2 (b = 1/2 corresponds to the convergence rates of parametric models). This supports the discussion in section 5.2 (reduction of computational cost).

5.3.4 Convergence Rates of Sampling after Resampling. We can derive convergence rates of the estimator m̄_Q, equation 5.7, in section 5.2. Here we consider the following construction of m̄_Q, as discussed in section 5.2 (reduction of computational cost). First, apply algorithm 4 to m̂_P^{(n)} and obtain resampling samples X̄_1^{(n)}, ..., X̄_ℓ^{(n)} ∈ {Z_1, ..., Z_N}. Then copy these samples n/ℓ times, and let X̌_1^{(n)}, ..., X̌_n^{(n)} be the resulting n samples. Finally, sample with the conditional distribution X′_i^{(n)} ∼ p(·|X̌_i^{(n)}) (i = 1, ..., n) and define

m̄_Q^{(n)} = (1/n) Σ_{i=1}^n k_X(·, X′_i^{(n)}).   (5.11)

The following corollary is a consequence of corollary 1, theorem 1, and the bound, equation 5.8. Note that theorem 1 obtains convergence in expectation, which implies convergence in probability.

Corollary 2. Let θ be the function defined in theorem 1, and assume θ ∈ H_X ⊗ H_X. Assume that P and Z_1, ..., Z_N satisfy the conditions in theorem 2 for all N. Let m̂_P^{(n)} be an estimator of m_P such that ‖m̂_P^{(n)} − m_P‖_{H_X} = O_p(n^{−b}) as n → ∞ for some constant b > 0. Let N = ℓ = n^{2b}. Then for the estimator m̄_Q^{(n)} defined as equation 5.11, we have

‖m̄_Q^{(n)} − m_Q‖_{H_X} = O_p(n^{−min(b,1/2)})   (n → ∞).

Suppose b ≤ 1/2, which holds with basically any nonparametric estimator. Then corollary 2 shows that the estimator m̄_Q^{(n)} achieves the same convergence rate as the input estimator m̂_P^{(n)}. Note that without resampling, the rate becomes O_p( √(Σ_{i=1}^n (w_i^{(n)})²) + n^{−b} ), where the weights are given by the input estimator m̂_P^{(n)} = Σ_{i=1}^n w_i^{(n)} k_X(·, X_i^{(n)}) (see the bound, equation 5.5). Thanks to resampling, (the square root of) the sum of the squared weights in the case of corollary 2 becomes 1/√n ≤ 1/√ℓ, which is usually smaller than √(Σ_{i=1}^n (w_i^{(n)})²) and is faster than or equal to O_p(n^{−b}). This shows the merit of resampling in terms of convergence rates (see also the discussions in section 5.2).

5.4 Consistency of the Overall Procedure. Here we show the consistency of the overall procedure in KMCF. This is based on corollary 2, which shows the consistency of the resampling step followed by the prediction step, and on theorem 5 of Fukumizu et al. (2013), which guarantees the consistency of kernel Bayes' rule in the correction step. Thus we consider three steps in the following order: resampling, prediction, and correction. More specifically, we show the consistency of the estimator, equation 4.6, of the posterior kernel mean at time t, given that the one at time t − 1 is consistent.

To state our assumptions, we will need the following functions, θ_pos: Y × Y → R, θ_obs: X × X → R, and θ_tra: X × X → R:

θ_pos(y, ỹ) = ∫∫ k_X(x_t, x̃_t) dp(x_t | y_{1:t−1}, y_t = y) dp(x̃_t | y_{1:t−1}, y_t = ỹ),   (5.12)
θ_obs(x, x̃) = ∫∫ k_Y(y_t, ỹ_t) dp(y_t | x_t = x) dp(ỹ_t | x_t = x̃),   (5.13)
θ_tra(x, x̃) = ∫∫ k_X(x_t, x̃_t) dp(x_t | x_{t−1} = x) dp(x̃_t | x_{t−1} = x̃).   (5.14)

These functions contain the information concerning the distributions involved. In equation 5.12, the distribution p(x_t | y_{1:t−1}, y_t = y) denotes the posterior of the state at time t, given that the observation at time t is y_t = y. Similarly, p(x̃_t | y_{1:t−1}, y_t = ỹ) is the posterior at time t, given that the observation is y_t = ỹ. In equation 5.13, the distributions p(y_t | x_t = x) and p(ỹ_t | x_t = x̃) denote the observation model when the state is x_t = x or x_t = x̃, respectively. In equation 5.14, the distributions p(x_t | x_{t−1} = x) and p(x̃_t | x_{t−1} = x̃) denote the transition model with the previous state given by x_{t−1} = x or x_{t−1} = x̃, respectively.

For simplicity of presentation, we consider here N = ℓ = n for the resampling step. Below we denote by F ⊗ G the tensor product space of two RKHSs F and G.

Corollary 3. Let (X_1, Y_1), ..., (X_n, Y_n) be an i.i.d. sample with a joint density p(x, y) = p(y|x)q(x), where p(y|x) is the observation model. Assume that the posterior p(x_t | y_{1:t}) has a density p and that sup_{x∈X} p(x)/q(x) < ∞. Assume that the functions defined by equations 5.12 to 5.14 satisfy θ_pos ∈ H_Y ⊗ H_Y, θ_obs ∈ H_X ⊗ H_X, and θ_tra ∈ H_X ⊗ H_X, respectively. Suppose that ‖m̂_{x_{t−1}|y_{1:t−1}} − m_{x_{t−1}|y_{1:t−1}}‖_{H_X} → 0 as n → ∞ in probability. Then, for any sufficiently slow decay of the regularization constants ε_n and δ_n of algorithm 1, we have

‖m̂_{x_t|y_{1:t}} − m_{x_t|y_{1:t}}‖_{H_X} → 0   (n → ∞)

in probability.

Corollary 3 follows from theorem 5 of Fukumizu et al. (2013) and corollary 2. The assumptions θ_pos ∈ H_Y ⊗ H_Y and θ_obs ∈ H_X ⊗ H_X are due to theorem 5 of Fukumizu et al. (2013) for the correction step, while the assumption θ_tra ∈ H_X ⊗ H_X is due to theorem 1 for the prediction step, from which corollary 2 follows. As we discussed in note 4 of section 5.1, these essentially assume that the functions θ_pos, θ_obs, and θ_tra are smooth. Theorem 5 of Fukumizu et al. (2013) also requires that the regularization constants ε_n, δ_n of kernel Bayes' rule decay sufficiently slowly as the sample size goes to infinity (ε_n, δ_n → 0 as n → ∞). (For details, see sections 5.2 and 6.2 in Fukumizu et al., 2013.)

It would be more interesting to investigate the convergence rates of the overall procedure. However, this requires a refined theoretical analysis of kernel Bayes' rule, which is beyond the scope of this letter. This is because there is currently no theoretical result on convergence rates of kernel Bayes' rule as an estimator of a posterior kernel mean (existing convergence results are for the expectation of function values; see theorems 6 and 7 in Fukumizu et al., 2013). This remains a topic for future research.

6 Experiments

This section is devoted to experiments. In section 6.1 we conduct basic experiments on the prediction and resampling steps before going on to the filtering problem; here we consider the problem described in section 5. In section 6.2 the proposed KMCF (see algorithm 3) is applied to synthetic state-space models. Comparisons are made with existing methods applicable to the setting of the letter (see also section 2). In section 6.3 we apply KMCF to the real problem of vision-based robot localization.

In the following, N(μ, σ²) denotes the gaussian distribution with mean μ ∈ R and variance σ² > 0.

6.1 Sampling and Resampling Procedures. The purpose here is to see how the prediction and resampling steps work empirically. To this end, we consider the problem described in section 5 with X = R (see section 5.1 for details). Specifications of the problem are described below.


We will need to evaluate the errors ‖m̂_P − m_P‖_{H_X} and ‖m̂_Q − m_Q‖_{H_X}, so we need to know the true kernel means m_P and m_Q. To this end, we define the distributions and the kernel to be gaussian; this allows us to obtain analytic expressions for m_P and m_Q.

6.1.1 Distributions and Kernel. More specifically, we define the marginal P and the conditional distribution p(·|x) to be gaussian: P = N(0, σ_P²) and p(·|x) = N(x, σ_cond²). Then the resulting Q = ∫ p(·|x)dP(x) also becomes gaussian: Q = N(0, σ_P² + σ_cond²). We define k_X to be the gaussian kernel k_X(x, x′) = exp(−(x − x′)²/2γ²). We set σ_P = σ_cond = γ = 0.1.

6.1.2 Kernel Means. Due to the convolution theorem of gaussian functions, the kernel means m_P = ∫ k_X(·, x)dP(x) and m_Q = ∫ k_X(·, x)dQ(x) can be analytically computed:

m_P(x) = √( γ²/(σ_P² + γ²) ) exp( −x² / (2(γ² + σ_P²)) ),
m_Q(x) = √( γ²/(σ_P² + σ_cond² + γ²) ) exp( −x² / (2(σ_P² + σ_cond² + γ²)) ).

6.1.3 Empirical Estimates. We artificially defined an estimate m̂_P = Σ_{i=1}^n w_i k_X(·, X_i) as follows. First, we generated n = 100 samples X_1, ..., X_100 from a uniform distribution on [−A, A] with some A > 0 (specified below). We computed the weights w_1, ..., w_n by solving the optimization problem

min_{w∈R^n} ‖ Σ_{i=1}^n w_i k_X(·, X_i) − m_P ‖²_H + λ‖w‖²,

and then applied normalization so that Σ_{i=1}^n w_i = 1. Here λ > 0 is a regularization constant, which allows us to control the trade-off between the error ‖m̂_P − m_P‖²_{H_X} and the quantity Σ_{i=1}^n w_i² = ‖w‖². If λ is very small, the resulting m̂_P becomes accurate (‖m̂_P − m_P‖²_{H_X} is small) but has large Σ_{i=1}^n w_i². If λ is large, the error ‖m̂_P − m_P‖²_{H_X} may not be very small, but Σ_{i=1}^n w_i² becomes small. This enables us to see how the error ‖m̂_Q − m_Q‖²_{H_X} changes as we vary these quantities.
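The construction above can be reproduced as in the sketch below, which assumes the gaussian setup of sections 6.1.1 and 6.1.2. The closed-form ridge solution w = (K + λI)⁻¹ z with z_i = m_P(X_i), the value of λ, and the random seed are our choices; the letter does not specify how the optimization problem was solved.

```python
import numpy as np

rng = np.random.default_rng(0)
sig_p = sig_cond = gamma = 0.1
n, A, lam = 100, 1.0, 1e-6

def k(a, b):                      # gaussian kernel matrix for 1-D inputs
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * gamma**2))

def m_P(x):                       # analytic kernel mean of P = N(0, sig_p^2)
    return np.sqrt(gamma**2 / (sig_p**2 + gamma**2)) * \
        np.exp(-x**2 / (2 * (sig_p**2 + gamma**2)))

X = rng.uniform(-A, A, n)
K = k(X, X)
# Minimizer of ||sum_i w_i k(., X_i) - m_P||_H^2 + lam ||w||^2.
w = np.linalg.solve(K + lam * np.eye(n), m_P(X))
w = w / w.sum()                   # normalization used in the experiment
print("sum of squared weights:", float(np.sum(w**2)))
```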

6.1.4 Comparison. Given m̂_P = Σ_{i=1}^n w_i k_X(·, X_i), we wish to estimate the kernel mean m_Q. We compare three estimators:

• woRes: Estimate m_Q without resampling. Generate samples X′_i ∼ p(·|X_i) to produce the estimate m̂_Q = Σ_{i=1}^n w_i k_X(·, X′_i). This corresponds to the estimator discussed in section 5.1.

• Res-KH: First apply the resampling algorithm of algorithm 2 to m̂_P, yielding X̄_1, ..., X̄_n. Then generate X̄′_i ∼ p(·|X̄_i) for each X̄_i, giving


Figure 4: Results of the experiments from section 6.1. (Top left and right) Sample-weight pairs of m̂_P = Σ_{i=1}^n w_i k_X(·, X_i) and m̂_Q = Σ_{i=1}^n w_i k(·, X′_i). (Middle left and right) Histogram of samples X̄_1, ..., X̄_n generated by algorithm 2 and that of samples X̄′_1, ..., X̄′_n from the conditional distribution. (Bottom left and right) Histogram of samples generated with multinomial resampling after truncating negative weights and that of samples from the conditional distribution.

the estimate m̄_Q = (1/n) Σ_{i=1}^n k(·, X̄′_i). This is the estimator discussed in section 5.2.

• Res-Trunc: Instead of algorithm 2, first truncate the negative weights in w_1, ..., w_n to be 0, and apply normalization to make the sum of the weights 1. Then apply the multinomial resampling algorithm of particle methods, and estimate m_Q as in Res-KH.

6.1.5 Demonstration. Before starting quantitative comparisons, we demonstrate how the above estimators work. Figure 4 shows demonstration results with A = 1. First, note that for m̂_P = Σ_{i=1}^n w_i k(·, X_i), samples associated with large weights are located around the mean of P, as the standard deviation of P is relatively small (σ_P = 0.1). Note also that some of the weights are negative. In this example, the error of m̂_P is very small (‖m̂_P − m_P‖²_{H_X} = 8.49e−10), while that of the estimate m̂_Q given by woRes is ‖m̂_Q − m_Q‖²_{H_X} = 0.125. This shows that even if ‖m̂_P − m_P‖²_{H_X} is very small, the resulting ‖m̂_Q − m_Q‖²_{H_X} may not be small, as implied by theorem 1 and the bound, equation 5.5.

We can observe the following. First, algorithm 2 successfully discarded samples associated with very small weights. Almost all the generated samples X̄_1, ..., X̄_n are located in [−2σ_P, 2σ_P], where σ_P is the standard deviation of P. The error is ‖m̄_P − m_P‖²_{H_X} = 4.74e−5, which is greater than ‖m̂_P − m_P‖²_{H_X}. This is due to the additional error caused by the resampling algorithm. Note that the resulting estimate m̄_Q has error ‖m̄_Q − m_Q‖²_{H_X} = 0.00827. This is much smaller than the estimate by woRes, showing the merit of the resampling algorithm.

Res-Trunc first truncated the negative weights in w_1, ..., w_n. Let us see the region where the density of P is very small: the region outside [−2σ_P, 2σ_P]. We can observe that the absolute values of the weights are very small in this region. Note that there exist positive and negative weights. These weights maintain balance such that the amounts of positive and negative values are almost the same. Therefore, the truncation of the negative weights breaks this balance. As a result, the amount of the positive weights surpasses the amount needed to represent the density of P. This can be seen from the histogram for Res-Trunc: some of the samples generated by Res-Trunc are located in the region where the density of P is very small. Thus the resulting error ‖m̄_P − m_P‖²_{H_X} = 0.0538 is much larger than that of Res-KH. This demonstrates why the resampling algorithm of particle methods is not appropriate for kernel mean embeddings, as discussed in section 4.2.

6.1.6 Effects of the Sum of Squared Weights. The purpose here is to see how the error ‖m̂_Q − m_Q‖²_{H_X} changes as we vary the quantity Σ_{i=1}^n w_i² (recall that the bound, equation 5.5, indicates that ‖m̂_Q − m_Q‖²_{H_X} increases as Σ_{i=1}^n w_i² increases). To this end, we made m̂_P = Σ_{i=1}^n w_i k_X(·, X_i) for several values of the regularization constant λ, as described above. For each λ, we constructed m̂_P and estimated m_Q using each of the three estimators above. We repeated this 20 times for each λ and averaged the values of ‖m̂_P − m_P‖²_{H_X}, Σ_{i=1}^n w_i², and the errors ‖m̂_Q − m_Q‖²_{H_X} by the three estimators. Figure 5 shows these results, where both axes are in the log scale. Here we used A = 5 for the support of the uniform distribution.⁷ The results are summarized as follows.

⁷ This enables us to maintain the values for ‖m̂_P − m_P‖²_{H_X} in almost the same amount while changing the values for Σ_{i=1}^n w_i².


Figure 5: Results of synthetic experiments for the sampling and resampling procedure in section 6.1. Vertical axis: errors in the squared RKHS norm. Horizontal axis: values of Σ_{i=1}^n w_i² for different m̂_P. Black: the error of m̂_P (‖m̂_P − m_P‖²_{H_X}). Blue, green, and red: the errors on m_Q by woRes, Res-KH, and Res-Trunc, respectively.

• The error of woRes (blue) increases proportionally to the amount of Σ_{i=1}^n w_i². This matches the bound, equation 5.5.

• The error of Res-KH is not affected by Σ_{i=1}^n w_i². Rather, it changes in parallel with the error of m̂_P. This is explained by the discussions in section 5.2 on how our resampling algorithm improves the accuracy of the sampling procedure.

• Res-Trunc is worse than Res-KH, especially for large Σ_{i=1}^n w_i². This is also explained with the bound, equation 5.8. Here m̄_P is the one given by Res-Trunc, so the error ‖m̄_P − m̂_P‖_{H_X} can be large due to the truncation of negative weights, as shown in the demonstration results. This makes the resulting error ‖m̄_Q − m_Q‖_{H_X} large.

Note that m_P and m_Q are different kernel means, so it can happen that the errors ‖m̄_Q − m_Q‖_{H_X} by Res-KH are less than ‖m̂_P − m_P‖_{H_X}, as in Figure 5.


6.2 Filtering with Synthetic State-Space Models. Here we apply KMCF to synthetic state-space models. Comparisons were made with the following methods.

kNN-PF (Vlassis et al., 2002). This method uses kNN-based conditional density estimation (Stone, 1977) for learning the observation model. First, it estimates the conditional density of the inverse direction, p(x|y), from the training sample {(X_i, Y_i)}. The learned conditional density is then used as an alternative for the likelihood p(y_t | x_t); this is a heuristic to deal with high-dimensional y_t. Then it applies a particle filter (PF) based on the approximated observation model and the given transition model p(x_t | x_{t−1}).

GP-PF (Ferris et al., 2006). This method learns p(y_t | x_t) from {(X_i, Y_i)} with gaussian process (GP) regression. Then a particle filter is applied based on the learned observation model and the transition model. We used open source code for GP regression in this experiment, so comparison in computational time is omitted for this method.⁸

KBR filter (Fukumizu et al., 2011, 2013). This method is also based on kernel mean embeddings, as is KMCF. It applies kernel Bayes' rule (KBR) in posterior estimation, using the joint sample {(X_i, Y_i)}. This method assumes that there also exist training samples for the transition model. Thus, in the following experiments, we additionally drew training samples for the transition model. Fukumizu et al. (2011, 2013) showed that this method outperforms extended and unscented Kalman filters when a state-space model has strong nonlinearity (in that experiment, these Kalman filters were given the full knowledge of the state-space model). We use this method as a baseline.

We used the state-space models defined in Table 2, where SSM stands for state-space model. In Table 2, u_t denotes a control input at time t; v_t and w_t denote independent gaussian noise, v_t, w_t ∼ N(0, 1); and W_t denotes 10-dimensional gaussian noise, W_t ∼ N(0, I_10). We generated each control u_t randomly from the gaussian distribution N(0, 1).

The state and observation spaces for SSMs 1a, 1b, 2a, 2b, 4a, 4b are defined as X = Y = R; for SSMs 3a, 3b, X = R and Y = R^10. The models in SSMs 1a, 2a, 3a, 4a and SSMs 1b, 2b, 3b, 4b with the same number (e.g., 1a and 1b) are almost the same; the difference is whether u_t exists in the transition model. Prior distributions for the initial state x_1 for SSMs 1a, 1b, 2a, 2b, 3a, 3b are defined as p_init = N(0, 1/(1 − 0.9²)), and those for 4a, 4b are defined as a uniform distribution on [−3, 3].

SSMs 1a and 1b are linear gaussian models. SSMs 2a and 2b are the so-called stochastic volatility models. Their transition models are the same as those of SSMs 1a and 1b. The observation model has strong nonlinearity, and the noise w_t is multiplicative. SSMs 3a and 3b are almost the same as

⁸ http://www.gaussianprocess.org/gpml/code/matlab/doc


Table 2: State-Space Models for Synthetic Experiments.

SSM   Transition Model                                Observation Model
1a    x_t = 0.9 x_{t−1} + v_t                         y_t = x_t + w_t
1b    x_t = 0.9 x_{t−1} + (1/√2)(u_t + v_t)           y_t = x_t + w_t
2a    x_t = 0.9 x_{t−1} + v_t                         y_t = 0.5 exp(x_t/2) w_t
2b    x_t = 0.9 x_{t−1} + (1/√2)(u_t + v_t)           y_t = 0.5 exp(x_t/2) w_t
3a    x_t = 0.9 x_{t−1} + v_t                         y_t = 0.5 exp(x_t/2) W_t
3b    x_t = 0.9 x_{t−1} + (1/√2)(u_t + v_t)           y_t = 0.5 exp(x_t/2) W_t
4a    a_t = x_{t−1} + √2 v_t;                         b_t = x_t + w_t;
      x_t = a_t (if |a_t| ≤ 3), −3 (otherwise)        y_t = b_t (if |b_t| ≤ 3), b_t − 6 b_t/|b_t| (otherwise)
4b    a_t = x_{t−1} + u_t + v_t;                      b_t = x_t + w_t;
      x_t = a_t (if |a_t| ≤ 3), −3 (otherwise)        y_t = b_t (if |b_t| ≤ 3), b_t − 6 b_t/|b_t| (otherwise)

SSMs 2a and 2b. The difference is that the observation y_t is 10-dimensional, as W_t is 10-dimensional gaussian noise. SSMs 4a and 4b are more complex than the other models. Both the transition and observation models have strong nonlinearities: states and observations located around the edges of the interval [−3, 3] may abruptly jump to distant places.
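As an illustration, the sketch below simulates one of these models (SSM 2a from Table 2) under the stated noise and prior assumptions; the sample sizes are placeholders.

```python
import numpy as np

def simulate_ssm2a(T, rng):
    """SSM 2a: x_t = 0.9 x_{t-1} + v_t,  y_t = 0.5 exp(x_t / 2) w_t,
    with v_t, w_t ~ N(0, 1) and x_1 ~ N(0, 1 / (1 - 0.9^2))."""
    x = np.empty(T)
    y = np.empty(T)
    x[0] = rng.normal(0.0, np.sqrt(1.0 / (1.0 - 0.9**2)))
    y[0] = 0.5 * np.exp(x[0] / 2) * rng.normal()
    for t in range(1, T):
        x[t] = 0.9 * x[t - 1] + rng.normal()
        y[t] = 0.5 * np.exp(x[t] / 2) * rng.normal()
    return x, y

# Training pairs (X_i, Y_i) and an independent test sequence, as in the experiments.
X_train, Y_train = simulate_ssm2a(1000, np.random.default_rng(0))
x_test, y_test = simulate_ssm2a(100, np.random.default_rng(1))
```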

For each model, we generated the training samples {(X_i, Y_i)}_{i=1}^n by simulating the model. Test data {(x_t, y_t)}_{t=1}^T were also generated by independent simulation (recall that x_t is hidden for each method). The length of the test sequence was set as T = 100. We fixed the number of particles in kNN-PF and GP-PF to 5000; in primary experiments, we did not observe any improvements even when more particles were used. For the same reason, we fixed the size of transition examples for the KBR filter to 1000. Each method estimated the ground-truth states x_1, ..., x_T by estimating the posterior means ∫ x_t p(x_t | y_{1:t}) dx_t (t = 1, ..., T). The performance was evaluated with the RMSE (root mean squared error) of the point estimates, defined as RMSE = √( (1/T) Σ_{t=1}^T (x̂_t − x_t)² ), where x̂_t is the point estimate.

(and also for controls in KBR filter) We determined the hyperparameters ofeach method by two-fold cross-validation by dividing the training data intotwo sequences The hyperparameters in the GP-regressor for PF-GP wereoptimized by maximizing the marginal likelihood of the training data Toreduce the costs of the resampling step of KMCF we used the method dis-cussed in section 52 with = 50 We also used the low-rank approximationmethod (see algorithm 5) and the subsampling method (see algorithm 6)in appendix C to reduce the computational costs of KMCF Specifically we

418 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

Figure 6 RMSE of the synthetic experiments in section 62 The state-spacemodels of these figures have no control in their transition models

used r = 10 20 (rank of low-rank matrices) for algorithm 5 (described asKMCF-low10 and KMCF-low20 in the results below) r = 50 100 (numberof subsamples) for algorithm 6 (described as KMCF-sub50 and KMCF-sub100) We repeated experiments 20 times for each of different trainingsample size n

Figure 6 shows the results in RMSE for SSMs 1a 2a 3a 4a and Figure 7shows those for SSMs 1b 2b 3b 4b Figure 8 describes the results incomputational time for SSMs 1a and 1b the results for the other models aresimilar so we omit them We do not show the results of KMCF-low10 inFigures 6 and 7 since they were numerically unstable and gave very largeRMSEs

GP-PF performed the best for SSMs 1a and 1b This may be becausethese models fit the assumption of GP-regression as their noise is addi-tive gaussian For the other models however GP-PF performed poorly

Filtering with State-Observation Examples 419

Figure 7 RMSE of synthetic experiments in section 62 The state-space modelsof these figures include control ut in their transition models

Figure 8 Computation time of synthetic experiments in section 62 (Left) SSM1a (Right) SSM 1b

420 M Kanagawa Y Nishiyama A Gretton and K Fukumizu

the observation models of these models have strong nonlinearities and thenoise is not additive gaussian For these models KMCF performed thebest or competitively with the other methods This indicates that KMCFsuccessfully exploits the state-observation examples (XiYi)n

i=1 in dealingwith the complicated observation models Recall that our focus has beenon situations where the relations between states and observations are socomplicated that the observation model is not known the results indicatethat KMCF is promising for such situations On the other hand the KBRfilter performed worse than KMCF for most of the models KBF filter alsouses kernel Bayesrsquo rule as KMCF The difference is that KMCF makes use ofthe transition models directly by sampling while the KBR filter must learnthe transition models from training data for state transitions This indicatesthat the incorporation of the knowledge expressed in the transition model isvery important for the filtering performance This can also be seen by com-paring Figures 6 and 7 The performance of the methods other than KBRfilter improved for SSMs 1b 2b 3b 4b compared to the performance forthe corresponding models in SSMs 1a 2a 3a 4a Recall that SSMs 1b2b 3b 4b include control ut in their transition models The informationof control input is helpful for filtering in general Thus the improvementssuggest that KMCF kNN-PF and GP-PF successfully incorporate the in-formation of controls they achieve this by sampling with p(xt |xtminus1 ut ) Onthe other hand KBF filter must learn the transition model p(xt |xtminus1 ut ) thiscan be harder than learning the transition model p(xt |xtminus1) which has nocontrol input

We next compare computation time (see Figure 8). KMCF was competitive with or even slower than the KBR filter. This is due to the resampling step in KMCF. The speed-up methods (KMCF-low10, KMCF-low20, KMCF-sub50, and KMCF-sub100) successfully reduced the costs of KMCF. KMCF-low10 and KMCF-low20 scaled linearly with the sample size n; this matches the fact that algorithm 5 reduces the costs of kernel Bayes' rule to O(nr²). The costs of KMCF-sub50 and KMCF-sub100 remained almost the same over the different sample sizes. This is because they reduce the sample size itself from n to r, so the costs are reduced to O(r³) (see algorithm 6). KMCF-sub50 and KMCF-sub100 are competitive with kNN-PF, which is fast as it only needs kNN searches to deal with the training sample {(X_i, Y_i)}_{i=1}^n. In Figures 6 and 7, KMCF-low20 and KMCF-sub100 produced results competitive with KMCF for SSMs 1a, 2a, 4a, 1b, 2b, 4b. Thus, for these models, such methods reduce the computational costs of KMCF without losing much accuracy. KMCF-sub50 was slightly worse than KMCF-sub100. This indicates that the number of subsamples cannot be reduced to this extent if we wish to maintain accuracy. For SSMs 3a and 3b, the performance of KMCF-low20 and KMCF-sub100 was worse than KMCF, in contrast to the performance for the other models. The difference of SSMs 3a and 3b from the other models is that the observation space is 10-dimensional: Y = R^10. This suggests that if the dimension is high, r needs to be large to maintain accuracy (recall that r is the rank of the low-rank matrices in algorithm 5 and the number of subsamples in algorithm 6). This is also implied by the experiments in section 6.3.

6.3 Vision-Based Mobile Robot Localization. We applied KMCF to the problem of vision-based mobile robot localization (Vlassis et al., 2002; Wolf et al., 2005; Quigley et al., 2010). We consider a robot moving in a building. The robot takes images with its vision camera as it moves. Thus the vision images form a sequence of observations y_1, ..., y_T in time series; each y_t is an image. The robot does not know its positions in the building; we define the state x_t as the robot's position at time t. The robot wishes to estimate its position x_t from the sequence of its vision images y_1, ..., y_t. This can be done by filtering, that is, by estimating the posteriors p(x_t | y_1, ..., y_t) (t = 1, ..., T). This is the robot localization problem. It is fundamental in robotics as a basis for more involved applications such as navigation and reinforcement learning (Thrun, Burgard, & Fox, 2005).

The state-space model is defined as follows. The observation model p(y_t | x_t) is the conditional distribution of images given position, which is very complicated and considered unknown. We need to assume position-image examples {(X_i, Y_i)}_{i=1}^n; these samples are given in the data set described below. The transition model p(x_t | x_{t−1}) = p(x_t | x_{t−1}, u_t) is the conditional distribution of the current position given the previous one. This involves a control input u_t that specifies the movement of the robot. In the data set we use, the control is given as odometry measurements. Thus we define p(x_t | x_{t−1}, u_t) as the odometry motion model, which is fairly standard in robotics (Thrun et al., 2005). Specifically, we used the algorithm described in table 5.6 of Thrun et al. (2005), with all of its parameters fixed to 0.1. The prior p_init of the initial position x_1 is defined as a uniform distribution over the samples X_1, ..., X_n in {(X_i, Y_i)}_{i=1}^n.

As a kernel k_Y for observations (images), we used the spatial pyramid matching kernel of Lazebnik et al. (2006). This is a positive-definite kernel developed in the computer vision community and is also fairly standard. Specifically, we set the parameters of this kernel as suggested in Lazebnik et al. (2006). This gives a 4200-dimensional histogram for each image. We defined the kernel k_X for states (positions) as gaussian. Here the state space is the four-dimensional space X = R^4: two dimensions for location and the rest for the orientation of the robot.⁹

The data set we used is the COLD database (Pronobis & Caputo, 2009), which is publicly available. Specifically, we used the data set Freiburg, Part A, Path 1, cloudy. This data set consists of three similar trajectories of a robot moving in a building, each of which provides position-image pairs {(x_t, y_t)}_{t=1}^T. We used two trajectories for training and validation and the

⁹ We projected the robot's orientation in [0, 2π] onto the unit circle in R².


rest for test. We made state-observation examples {(X_i, Y_i)}_{i=1}^n by randomly subsampling the pairs in the trajectory for training. Note that the difficulty of localization may depend on the time interval (i.e., the interval between t and t − 1 in sec). Therefore we made three test sets (and training samples for state transitions in the KBR filter) with different time intervals: 2.27 sec (T = 168), 4.54 sec (T = 84), and 6.81 sec (T = 56).

In these experiments, we compared KMCF with three methods: kNN-PF, the KBR filter, and the naive method (NAI) defined below. For the KBR filter, we also defined the gaussian kernel on the control u_t, that is, on the difference of the odometry measurements at times t − 1 and t. The naive method (NAI) estimates the state x_t as the point X_j in the training set {(X_i, Y_i)} such that the corresponding observation Y_j is closest to the observation y_t. We performed this as a baseline. We also used the spatial pyramid matching kernel for these methods (for kNN-PF and NAI, as a similarity measure for the nearest neighbor search). We did not compare with GP-PF, since it assumes that observations are real vectors and thus cannot be applied to this problem straightforwardly. We determined the hyperparameters in each method by cross-validation. To reduce the cost of the resampling step in KMCF, we used the method discussed in section 5.2 with ℓ = 100. The low-rank approximation method (see algorithm 5) and the subsampling method (see algorithm 6) were also applied to reduce the computational costs of KMCF. Specifically, we set r = 50, 100 for algorithm 5 (described as KMCF-low50 and KMCF-low100 in the results below) and r = 150, 300 for algorithm 6 (KMCF-sub150 and KMCF-sub300).

Note that in this problem, the posteriors p(x_t | y_{1:t}) can be highly multimodal. This is because similar images appear in distant locations. Therefore, the posterior mean ∫ x_t p(x_t | y_{1:t}) dx_t is not appropriate for point estimation of the ground-truth position x_t. Thus, for KMCF and the KBR filter, we employed the heuristic for mode estimation explained in section 4.4. For kNN-PF, we used the particle with maximum weight for the point estimation. We evaluated the performance of each method by the RMSE of the location estimates. We ran each experiment 20 times for each training set of different size.

6.3.1 Results. First we demonstrate the behavior of KMCF on this localization problem. Figures 9 and 10 show iterations of KMCF with n = 400 applied to the test data with time interval 6.81 sec. Figure 9 illustrates iterations that produced accurate estimates, while Figure 10 describes situations where location estimation is difficult.

Figures 11 and 12 show the results in RMSE and computational time, respectively. For all the results, KMCF and its variants with the computational reduction methods (KMCF-low50, KMCF-low100, KMCF-sub150, and KMCF-sub300) performed better than the KBR filter. These results show the benefit of directly manipulating the transition models with sampling. KMCF


Figure 9: Demonstration results. Each column corresponds to one iteration of KMCF. (Top) Prediction step: histogram of samples for the prior. (Middle) Correction step: weighted samples for the posterior. The blue and red stems indicate positive and negative weights, respectively. The yellow ball represents the ground-truth location x_t and the green diamond the estimated one x̂_t. (Bottom) Resampling step: histogram of samples given by the resampling step.

was competitive with kNN-PF for the interval 2.27 sec; note that kNN-PF was originally proposed for the robot localization problem. For the results with the longer time intervals (4.54 sec and 6.81 sec), KMCF outperformed kNN-PF.

We next investigate the effect on KMCF of the methods to reduce computational cost. The performance of KMCF-low100 and KMCF-sub300 is competitive with KMCF; those of KMCF-low50 and KMCF-sub150 degrade as the sample size increases. Note that r = 50, 100 for algorithm 5 are larger than those in section 6.2, though the values of the sample size n are larger than those in section 6.2. Also note that the performance of KMCF-sub150 is much worse than that of KMCF-sub300. These results indicate that we may need large values of r to maintain accuracy for this localization problem. Recall that the spatial pyramid matching kernel gives essentially a high-dimensional feature vector (histogram) for each observation. Thus the


Figure 10: Demonstration results (see also the caption of Figure 9). Here we show time points where observed images are similar to those in distant places. Such a situation often occurs at corners and makes location estimation difficult. (a) The prior estimate is reasonable, but the resulting posterior has modes in distant places. This makes the location estimate (green diamond) far from the true location (yellow ball). (b) While the location estimate is very accurate, modes also appear at distant locations.

observation space Y may be considered high-dimensional This supportsthe hypothesis in section 62 that if the dimension is high the computationalcost-reduction methods may require larger r to maintain accuracy

Finally, let us look at the results in computation time (see Figure 12). The results are similar to those in section 6.2: although the values for r are relatively large, algorithms 5 and 6 successfully reduced the computational costs of KMCF.

Figure 11: RMSE of the robot localization experiments in section 6.3. Panels a, b, and c show the cases for time intervals 2.27 sec, 4.54 sec, and 6.81 sec, respectively.

Figure 12: Computation time of the localization experiments in section 6.3. Panels a, b, and c show the cases for time intervals 2.27 sec, 4.54 sec, and 6.81 sec, respectively. Note that the results show the runtime of each method.

7 Conclusion and Future Work

This letter has proposed the kernel Monte Carlo filter, a novel filtering method for state-space models. We have considered the situation where the observation model is not known explicitly or even parametrically and where examples of the state-observation relation are given instead of the observation model. Our approach is based on the framework of kernel mean embeddings, which enables us to deal with the observation model in a data-driven manner. The resulting filtering method consists of the prediction, correction, and resampling steps, all realized in terms of kernel mean embeddings. Methodological novelties lie in the prediction and resampling steps, so we analyzed their behaviors by deriving error bounds for the estimator of the prediction step. The analysis revealed that the effective sample size of a weighted sample plays an important role, as in particle methods. This analysis also explained how our resampling algorithm works. We applied the proposed method to synthetic and real problems, confirming the effectiveness of our approach.
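For reference, the effective sample size mentioned above is typically computed as in the following minimal sketch, which assumes the standard particle-method definition ESS = 1/Σ_i w_i² for normalized weights (the letter's precise definition appears in its earlier sections, not in this excerpt):

```python
import numpy as np

def effective_sample_size(w):
    # ESS = 1 / sum_i w_i^2 for a normalized weight vector w (standard particle-filter diagnostic).
    w = np.asarray(w, dtype=float)
    return 1.0 / np.sum(w ** 2)

print(effective_sample_size(np.full(100, 0.01)))        # 100.0: uniform weights
print(effective_sample_size([0.97] + [0.001] * 30))     # ~1.06: weight concentrated on one sample
```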

One interesting topic for future research would be parameter estimation for the transition model. We did not discuss this and assumed that parameters, if they exist, are given and fixed. If the state-observation examples {(X_i, Y_i)}_{i=1}^n are given as a sequence from the state-space model, then we can use the state samples X_1, ..., X_n for estimating those parameters. Otherwise, we need to estimate the parameters based on test data. This might be possible by exploiting approaches for parameter estimation in particle methods (e.g., section IV of Cappé et al., 2007).

Another important topic is the situation where the observation model in the test and training phases is different. As discussed in section 4.3, this might be addressed by exploiting the framework of transfer learning (Pan & Yang, 2010). This would require an extension of kernel mean embeddings to the setting of transfer learning, since there has been no work in this direction. We consider that such an extension is interesting in its own right.

Appendix A: Proof of Theorem 1

Before going to the proof, we review some basic facts that are needed. Let $m_P = \int k_{\mathcal{X}}(\cdot, x)\,dP(x)$ and $\hat{m}_P = \sum_{i=1}^n w_i k_{\mathcal{X}}(\cdot, X_i)$. By the reproducing property of the kernel $k_{\mathcal{X}}$, the following hold for any $f \in \mathcal{H}_{\mathcal{X}}$:

$$\langle m_P, f \rangle_{\mathcal{H}_{\mathcal{X}}} = \left\langle \int k_{\mathcal{X}}(\cdot, x)\,dP(x), f \right\rangle_{\mathcal{H}_{\mathcal{X}}} = \int \langle k_{\mathcal{X}}(\cdot, x), f \rangle_{\mathcal{H}_{\mathcal{X}}}\,dP(x) = \int f(x)\,dP(x) = E_{X \sim P}[f(X)], \tag{A.1}$$

$$\langle \hat{m}_P, f \rangle_{\mathcal{H}_{\mathcal{X}}} = \left\langle \sum_{i=1}^n w_i k_{\mathcal{X}}(\cdot, X_i), f \right\rangle_{\mathcal{H}_{\mathcal{X}}} = \sum_{i=1}^n w_i f(X_i). \tag{A.2}$$

For any $f, g \in \mathcal{H}_{\mathcal{X}}$, we denote by $f \otimes g \in \mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}$ the tensor product of $f$ and $g$, defined as

$$f \otimes g\,(x_1, x_2) = f(x_1)g(x_2), \quad \forall x_1, x_2 \in \mathcal{X}. \tag{A.3}$$

The inner product of the tensor RKHS $\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}$ satisfies

$$\langle f_1 \otimes g_1, f_2 \otimes g_2 \rangle_{\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}} = \langle f_1, f_2 \rangle_{\mathcal{H}_{\mathcal{X}}} \langle g_1, g_2 \rangle_{\mathcal{H}_{\mathcal{X}}}, \quad \forall f_1, f_2, g_1, g_2 \in \mathcal{H}_{\mathcal{X}}. \tag{A.4}$$

Let $\{\phi_s\}_{s=1}^{I} \subset \mathcal{H}_{\mathcal{X}}$ be a complete orthonormal basis of $\mathcal{H}_{\mathcal{X}}$, where $I \in \mathbb{N} \cup \{\infty\}$. Assume $\theta \in \mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}$ (recall that this is an assumption of theorem 1). Then $\theta$ is expressed as

$$\theta = \sum_{s,t=1}^{I} \alpha_{st}\, \phi_s \otimes \phi_t \tag{A.5}$$

with $\sum_{s,t} |\alpha_{st}|^2 < \infty$ (see Aronszajn, 1950).

Proof of Theorem 1. Recall that $\hat{m}_Q = \sum_{i=1}^n w_i k_{\mathcal{X}}(\cdot, X'_i)$, where $X'_i \sim p(\cdot|X_i)$ $(i = 1, \ldots, n)$. Then

$$E_{X'_1, \ldots, X'_n}\!\left[\|\hat{m}_Q - m_Q\|^2_{\mathcal{H}_{\mathcal{X}}}\right] = E_{X'_1, \ldots, X'_n}\!\left[\langle \hat{m}_Q, \hat{m}_Q \rangle_{\mathcal{H}_{\mathcal{X}}} - 2\langle \hat{m}_Q, m_Q \rangle_{\mathcal{H}_{\mathcal{X}}} + \langle m_Q, m_Q \rangle_{\mathcal{H}_{\mathcal{X}}}\right]$$
$$= \sum_{i,j=1}^n w_i w_j E_{X'_i, X'_j}[k_{\mathcal{X}}(X'_i, X'_j)] - 2\sum_{i=1}^n w_i E_{\tilde{X}' \sim Q,\, X'_i}[k_{\mathcal{X}}(\tilde{X}', X'_i)] + E_{\tilde{X}', \tilde{X}'' \sim Q}[k_{\mathcal{X}}(\tilde{X}', \tilde{X}'')]$$
$$= \sum_{i \neq j} w_i w_j E_{X'_i, X'_j}[k_{\mathcal{X}}(X'_i, X'_j)] + \sum_{i=1}^n w_i^2 E_{X'_i}[k_{\mathcal{X}}(X'_i, X'_i)] - 2\sum_{i=1}^n w_i E_{\tilde{X}' \sim Q,\, X'_i}[k_{\mathcal{X}}(\tilde{X}', X'_i)] + E_{\tilde{X}', \tilde{X}'' \sim Q}[k_{\mathcal{X}}(\tilde{X}', \tilde{X}'')], \tag{A.6}$$

where $\tilde{X}''$ denotes an independent copy of $\tilde{X}'$.

Recall that $Q = \int p(\cdot|x)\,dP(x)$ and $\theta(x, \tilde{x}) = \int\!\!\int k_{\mathcal{X}}(x', \tilde{x}')\,dp(x'|x)\,dp(\tilde{x}'|\tilde{x})$. We can then rewrite terms in equation A.6 as

$$E_{\tilde{X}' \sim Q,\, X'_i}[k_{\mathcal{X}}(\tilde{X}', X'_i)] = \int \left( \int\!\!\int k_{\mathcal{X}}(x', x'_i)\,dp(x'|x)\,dp(x'_i|X_i) \right) dP(x) = \int \theta(x, X_i)\,dP(x) = E_{X \sim P}[\theta(X, X_i)],$$

$$E_{\tilde{X}', \tilde{X}'' \sim Q}[k_{\mathcal{X}}(\tilde{X}', \tilde{X}'')] = \int\!\!\int \left( \int\!\!\int k_{\mathcal{X}}(x', \tilde{x}')\,dp(x'|x)\,dp(\tilde{x}'|\tilde{x}) \right) dP(x)\,dP(\tilde{x}) = \int\!\!\int \theta(x, \tilde{x})\,dP(x)\,dP(\tilde{x}) = E_{X, \tilde{X} \sim P}[\theta(X, \tilde{X})].$$

Thus equation A.6 is equal to

$$\sum_{i=1}^n w_i^2\left(E_{X'_i}[k_{\mathcal{X}}(X'_i, X'_i)] - E_{X'_i, \tilde{X}'_i}[k_{\mathcal{X}}(X'_i, \tilde{X}'_i)]\right) + \sum_{i,j=1}^n w_i w_j \theta(X_i, X_j) - 2\sum_{i=1}^n w_i E_{X \sim P}[\theta(X, X_i)] + E_{X, \tilde{X} \sim P}[\theta(X, \tilde{X})]. \tag{A.7}$$

(Here we used that $X'_i$ and $X'_j$ are independent for $i \neq j$, so that $E_{X'_i, X'_j}[k_{\mathcal{X}}(X'_i, X'_j)] = \theta(X_i, X_j)$, and that $\theta(X_i, X_i) = E_{X'_i, \tilde{X}'_i}[k_{\mathcal{X}}(X'_i, \tilde{X}'_i)]$, where $\tilde{X}'_i$ is an independent copy of $X'_i$.)

We can rewrite terms in equation A.7 as follows, using facts A.1, A.2, A.3, A.4, and A.5:

$$\sum_{i,j} w_i w_j \theta(X_i, X_j) = \sum_{i,j} w_i w_j \sum_{s,t} \alpha_{st} \phi_s(X_i)\phi_t(X_j) = \sum_{s,t} \alpha_{st} \sum_i w_i \phi_s(X_i) \sum_j w_j \phi_t(X_j)$$
$$= \sum_{s,t} \alpha_{st} \langle \hat{m}_P, \phi_s \rangle_{\mathcal{H}_{\mathcal{X}}} \langle \hat{m}_P, \phi_t \rangle_{\mathcal{H}_{\mathcal{X}}} = \sum_{s,t} \alpha_{st} \langle \hat{m}_P \otimes \hat{m}_P, \phi_s \otimes \phi_t \rangle_{\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}} = \langle \hat{m}_P \otimes \hat{m}_P, \theta \rangle_{\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}},$$

$$\sum_i w_i E_{X \sim P}[\theta(X, X_i)] = \sum_i w_i E_{X \sim P}\!\left[\sum_{s,t} \alpha_{st}\phi_s(X)\phi_t(X_i)\right] = \sum_{s,t} \alpha_{st} E_{X \sim P}[\phi_s(X)] \sum_i w_i \phi_t(X_i)$$
$$= \sum_{s,t} \alpha_{st} \langle m_P, \phi_s \rangle_{\mathcal{H}_{\mathcal{X}}} \langle \hat{m}_P, \phi_t \rangle_{\mathcal{H}_{\mathcal{X}}} = \sum_{s,t} \alpha_{st} \langle m_P \otimes \hat{m}_P, \phi_s \otimes \phi_t \rangle_{\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}} = \langle m_P \otimes \hat{m}_P, \theta \rangle_{\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}},$$

$$E_{X, \tilde{X} \sim P}[\theta(X, \tilde{X})] = E_{X, \tilde{X} \sim P}\!\left[\sum_{s,t} \alpha_{st}\phi_s(X)\phi_t(\tilde{X})\right] = \sum_{s,t} \alpha_{st} \langle m_P, \phi_s \rangle_{\mathcal{H}_{\mathcal{X}}} \langle m_P, \phi_t \rangle_{\mathcal{H}_{\mathcal{X}}}$$
$$= \sum_{s,t} \alpha_{st} \langle m_P \otimes m_P, \phi_s \otimes \phi_t \rangle_{\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}} = \langle m_P \otimes m_P, \theta \rangle_{\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}}.$$

Thus equation A.7 is equal to

$$\sum_{i=1}^n w_i^2\left(E_{X'_i}[k_{\mathcal{X}}(X'_i, X'_i)] - E_{X'_i, \tilde{X}'_i}[k_{\mathcal{X}}(X'_i, \tilde{X}'_i)]\right) + \langle \hat{m}_P \otimes \hat{m}_P, \theta \rangle_{\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}} - 2\langle m_P \otimes \hat{m}_P, \theta \rangle_{\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}} + \langle m_P \otimes m_P, \theta \rangle_{\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}}$$
$$= \sum_{i=1}^n w_i^2\left(E_{X'_i}[k_{\mathcal{X}}(X'_i, X'_i)] - E_{X'_i, \tilde{X}'_i}[k_{\mathcal{X}}(X'_i, \tilde{X}'_i)]\right) + \langle (\hat{m}_P - m_P) \otimes (\hat{m}_P - m_P), \theta \rangle_{\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}}$$

(using the symmetry of $\theta$). Finally, the Cauchy-Schwartz inequality gives

$$\langle (\hat{m}_P - m_P) \otimes (\hat{m}_P - m_P), \theta \rangle_{\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}} \leq \|\hat{m}_P - m_P\|^2_{\mathcal{H}_{\mathcal{X}}}\, \|\theta\|_{\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}}.$$

This completes the proof.
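Since the proof leans repeatedly on facts A.1, A.2, and A.4, a small numerical check may be useful. The following sketch is ours, for illustration only (Gaussian kernel on the real line, arbitrary variable names): it verifies fact A.2 by computing the inner product both from the kernel expansions and as the weighted sum of function values.

```python
import numpy as np

def k(x, y, s=1.0):
    """Gaussian kernel k(x, y) = exp(-(x - y)^2 / (2 s^2)) on the real line."""
    return np.exp(-(x - y) ** 2 / (2 * s ** 2))

rng = np.random.default_rng(0)

# A weighted sample (X_i, w_i) defining the empirical embedding m_hat = sum_i w_i k(., X_i).
X = rng.normal(size=5)
w = rng.dirichlet(np.ones(5))

# An RKHS function f = sum_j beta_j k(., Z_j), represented by its expansion.
Z = rng.normal(size=4)
beta = rng.normal(size=4)
f = lambda x: sum(b * k(x, z) for b, z in zip(beta, Z))

# <m_hat, f>_H computed from the expansions:
# <sum_i w_i k(., X_i), sum_j beta_j k(., Z_j)>_H = sum_{i,j} w_i beta_j k(X_i, Z_j).
inner = sum(w[i] * beta[j] * k(X[i], Z[j]) for i in range(len(X)) for j in range(len(Z)))

# Fact A.2: the same inner product equals the weighted sum of function values.
weighted_sum = sum(w[i] * f(X[i]) for i in range(len(X)))

print(np.isclose(inner, weighted_sum))  # True
```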

Appendix B: Proof of Theorem 2

Theorem 2 provides convergence rates for the resampling algorithm, algorithm 4. This theorem assumes that the candidate samples $Z_1, \ldots, Z_N$ for resampling are i.i.d. with a density $q$. Here we prove theorem 2 by showing that the same statement holds under weaker assumptions (see theorem 3 below).

We first describe the assumptions. Let $P$ be the distribution of the kernel mean $m_P$, and let $L_2(P)$ be the Hilbert space of square-integrable functions on $\mathcal{X}$ with respect to $P$. For any $f \in L_2(P)$, we write its norm as $\|f\|_{L_2(P)} = \left(\int f^2(x)\,dP(x)\right)^{1/2}$.

Assumption 1. The candidate samples $Z_1, \ldots, Z_N$ are independent. There are probability distributions $Q_1, \ldots, Q_N$ on $\mathcal{X}$ such that for any bounded measurable function $g: \mathcal{X} \to \mathbb{R}$, we have

$$E\left[\frac{1}{N-1}\sum_{j \neq i} g(Z_j)\right] = E_{X \sim Q_i}[g(X)] \quad (i = 1, \ldots, N). \tag{B.1}$$

Assumption 2. The distributions $Q_1, \ldots, Q_N$ have density functions $q_1, \ldots, q_N$, respectively. Define $Q = \frac{1}{N}\sum_{i=1}^N Q_i$ and $q = \frac{1}{N}\sum_{i=1}^N q_i$. There is a constant $A > 0$ that does not depend on $N$ such that

$$\left\| \frac{q_i}{q} - 1 \right\|_{L_2(P)} \leq \frac{A}{N} \quad (i = 1, \ldots, N). \tag{B.2}$$

Assumption 3. The distribution $P$ has a density function $p$ such that $\sup_{x \in \mathcal{X}} p(x)/q(x) < \infty$. There is a constant $\sigma > 0$ such that

$$\sqrt{N}\left(\frac{1}{N}\sum_{i=1}^N \frac{p(Z_i)}{q(Z_i)} - 1\right) \xrightarrow{D} \mathcal{N}(0, \sigma^2), \tag{B.3}$$

where $\xrightarrow{D}$ denotes convergence in distribution and $\mathcal{N}(0, \sigma^2)$ the normal distribution with mean 0 and variance $\sigma^2$.

These assumptions are weaker than those of theorem 2, which require that $Z_1, \ldots, Z_N$ be i.i.d. For example, assumption 1 is clearly satisfied in the i.i.d. case, since we then have $Q = Q_1 = \cdots = Q_N$. The inequality B.2 in assumption 2 requires that the distributions $Q_1, \ldots, Q_N$ become similar as the sample size increases; this is also satisfied under the i.i.d. assumption. Likewise, the convergence B.3 in assumption 3 follows from the central limit theorem if $Z_1, \ldots, Z_N$ are i.i.d.

We will need the following lemma.

Lemma 1. Let $Z_1, \ldots, Z_N$ be samples satisfying assumption 1. Then the following holds for any bounded measurable function $g: \mathcal{X} \to \mathbb{R}$:

$$E\left[\frac{1}{N}\sum_{i=1}^N g(Z_i)\right] = \int g(x)\,dQ(x).$$

Proof.

$$E\left[\frac{1}{N}\sum_{i=1}^N g(Z_i)\right] = E\left[\frac{1}{N(N-1)}\sum_{i=1}^N \sum_{j \neq i} g(Z_j)\right] = \frac{1}{N}\sum_{i=1}^N E\left[\frac{1}{N-1}\sum_{j \neq i} g(Z_j)\right] = \frac{1}{N}\sum_{i=1}^N \int g(x)\,dQ_i(x) = \int g(x)\,dQ(x).$$

The following theorem shows the convergence rates of our resampling algorithm. Note that it does not assume that the candidate samples $Z_1, \ldots, Z_N$ are identical to those expressing the estimator $\hat{m}_P$.

Theorem 3. Let $k$ be a bounded positive definite kernel and $\mathcal{H}$ be the associated RKHS. Let $Z_1, \ldots, Z_N$ be candidate samples satisfying assumptions 1, 2, and 3, and let $P$ be a probability distribution satisfying assumption 3, with kernel mean $m_P = \int k(\cdot, x)\,dP(x)$. Let $\hat{m}_P \in \mathcal{H}$ be any element in $\mathcal{H}$. Suppose we apply algorithm 4 to $\hat{m}_P$ with candidate samples $Z_1, \ldots, Z_N$, and let $\bar{X}_1, \ldots, \bar{X}_\ell \in \{Z_1, \ldots, Z_N\}$ be the resulting samples. Then the following holds:

$$\left\| \hat{m}_P - \frac{1}{\ell}\sum_{i=1}^{\ell} k(\cdot, \bar{X}_i) \right\|_{\mathcal{H}}^2 = \left( \|m_P - \hat{m}_P\|_{\mathcal{H}} + O_p(N^{-1/2}) \right)^2 + O\!\left(\frac{\ln \ell}{\ell}\right).$$

Proof. Our proof is based on the fact (Bach et al., 2012) that kernel herding can be seen as the Frank-Wolfe optimization method with step size $1/(\ell+1)$ for the $\ell$th iteration. For details of the Frank-Wolfe method, we refer to Jaggi (2013) and Freund and Grigas (2014).

Fix the samples $Z_1, \ldots, Z_N$. Let $\mathcal{M}_N$ be the convex hull of the set $\{k(\cdot, Z_1), \ldots, k(\cdot, Z_N)\} \subset \mathcal{H}$. Define a loss function $J: \mathcal{H} \to \mathbb{R}$ by

$$J(g) = \frac{1}{2}\|g - \hat{m}_P\|_{\mathcal{H}}^2, \quad g \in \mathcal{H}. \tag{B.4}$$

Then algorithm 4 can be seen as the Frank-Wolfe method that iteratively minimizes this loss function over the convex hull $\mathcal{M}_N$:

$$\inf_{g \in \mathcal{M}_N} J(g).$$

More precisely, the Frank-Wolfe method solves this problem by the following iterations:

$$s_\ell = \mathop{\arg\min}_{g \in \mathcal{M}_N} \langle g, \nabla J(g_{\ell-1}) \rangle_{\mathcal{H}}, \qquad g_\ell = (1 - \gamma_\ell) g_{\ell-1} + \gamma_\ell s_\ell \quad (\ell \geq 1),$$

where $\gamma_\ell$ is a step size defined as $\gamma_\ell = 1/\ell$, and $\nabla J(g_{\ell-1})$ is the gradient of $J$ at $g_{\ell-1}$: $\nabla J(g_{\ell-1}) = g_{\ell-1} - \hat{m}_P$. Here the initial point is defined as $g_0 = 0$. It can be easily shown that $g_\ell = \frac{1}{\ell}\sum_{i=1}^{\ell} k(\cdot, \bar{X}_i)$, where $\bar{X}_1, \ldots, \bar{X}_\ell$ are the samples given by algorithm 4 (for details, see Bach et al., 2012).

Let $L_{J,\mathcal{M}_N} > 0$ be the Lipschitz constant of the gradient $\nabla J$ over $\mathcal{M}_N$, and let $\mathrm{Diam}\,\mathcal{M}_N > 0$ be the diameter of $\mathcal{M}_N$:

$$L_{J,\mathcal{M}_N} = \sup_{g_1, g_2 \in \mathcal{M}_N} \frac{\|\nabla J(g_1) - \nabla J(g_2)\|_{\mathcal{H}}}{\|g_1 - g_2\|_{\mathcal{H}}} = \sup_{g_1, g_2 \in \mathcal{M}_N} \frac{\|g_1 - g_2\|_{\mathcal{H}}}{\|g_1 - g_2\|_{\mathcal{H}}} = 1, \tag{B.5}$$

$$\mathrm{Diam}\,\mathcal{M}_N = \sup_{g_1, g_2 \in \mathcal{M}_N} \|g_1 - g_2\|_{\mathcal{H}} \leq \sup_{g_1, g_2 \in \mathcal{M}_N} \left(\|g_1\|_{\mathcal{H}} + \|g_2\|_{\mathcal{H}}\right) \leq 2C, \tag{B.6}$$

where $C := \sup_{x \in \mathcal{X}} \|k(\cdot, x)\|_{\mathcal{H}} = \sup_{x \in \mathcal{X}} \sqrt{k(x, x)} < \infty$.

From bound 3.2 and equation 8 of Freund and Grigas (2014), we then have

$$J(g_\ell) - \inf_{g \in \mathcal{M}_N} J(g) \leq \frac{L_{J,\mathcal{M}_N}\,(\mathrm{Diam}\,\mathcal{M}_N)^2\,(1 + \ln \ell)}{2\ell} \tag{B.7}$$
$$\leq \frac{2C^2(1 + \ln \ell)}{\ell}, \tag{B.8}$$

where the last inequality follows from equations B.5 and B.6.

Note that the upper bound in equation B.8 does not depend on the candidate samples $Z_1, \ldots, Z_N$. Hence, combined with equation B.4, the following holds for any choice of $Z_1, \ldots, Z_N$:

$$\left\| \hat{m}_P - \frac{1}{\ell}\sum_{i=1}^{\ell} k(\cdot, \bar{X}_i) \right\|_{\mathcal{H}}^2 \leq \inf_{g \in \mathcal{M}_N} \|\hat{m}_P - g\|_{\mathcal{H}}^2 + \frac{4C^2(1 + \ln \ell)}{\ell}. \tag{B.9}$$

Below we focus on bounding the first term of equation B.9. Recall here that $Z_1, \ldots, Z_N$ are random samples. Define a random variable $S_N = \sum_{i=1}^N \frac{p(Z_i)}{q(Z_i)}$. Since $\mathcal{M}_N$ is the convex hull of $\{k(\cdot, Z_1), \ldots, k(\cdot, Z_N)\}$, we have

$$\inf_{g \in \mathcal{M}_N} \|\hat{m}_P - g\|_{\mathcal{H}} = \inf_{\alpha \in \mathbb{R}^N,\ \alpha \geq 0,\ \sum_i \alpha_i \leq 1} \left\| \hat{m}_P - \sum_i \alpha_i k(\cdot, Z_i) \right\|_{\mathcal{H}} \leq \left\| \hat{m}_P - \frac{1}{S_N}\sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \right\|_{\mathcal{H}}$$
$$\leq \|\hat{m}_P - m_P\|_{\mathcal{H}} + \left\| m_P - \frac{1}{N}\sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \right\|_{\mathcal{H}} + \left\| \frac{1}{N}\sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) - \frac{1}{S_N}\sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \right\|_{\mathcal{H}}.$$

Therefore we have

$$\left\| \hat{m}_P - \frac{1}{\ell}\sum_{i=1}^{\ell} k(\cdot, \bar{X}_i) \right\|_{\mathcal{H}}^2 \leq \left( \|\hat{m}_P - m_P\|_{\mathcal{H}} + \left\| m_P - \frac{1}{N}\sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \right\|_{\mathcal{H}} + \left\| \frac{1}{N}\sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) - \frac{1}{S_N}\sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \right\|_{\mathcal{H}} \right)^2 + O\!\left(\frac{\ln \ell}{\ell}\right). \tag{B.10}$$

Below we derive rates of convergence for the second and third terms.

For the second term, we derive a rate of convergence in expectation, which implies a rate of convergence in probability. To this end, we use the following fact: by the assumption $\sup_{x \in \mathcal{X}} p(x)/q(x) < \infty$ and the boundedness of $k$, the functions $x \mapsto \frac{p(x)}{q(x)} f(x)$ and $x \mapsto \left(\frac{p(x)}{q(x)}\right)^2 f(x)$ are bounded for any $f \in \mathcal{H}$. We have

$$E\left[\left\| m_P - \frac{1}{N}\sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \right\|_{\mathcal{H}}^2\right]$$
$$= \|m_P\|_{\mathcal{H}}^2 - 2E\left[\frac{1}{N}\sum_i \frac{p(Z_i)}{q(Z_i)} m_P(Z_i)\right] + E\left[\frac{1}{N^2}\sum_i \sum_j \frac{p(Z_i)}{q(Z_i)}\frac{p(Z_j)}{q(Z_j)} k(Z_i, Z_j)\right]$$
$$= \|m_P\|_{\mathcal{H}}^2 - 2\int \frac{p(x)}{q(x)} m_P(x)\, q(x)\,dx + E\left[\frac{1}{N^2}\sum_i \sum_{j \neq i} \frac{p(Z_i)}{q(Z_i)}\frac{p(Z_j)}{q(Z_j)} k(Z_i, Z_j)\right] + E\left[\frac{1}{N^2}\sum_i \left(\frac{p(Z_i)}{q(Z_i)}\right)^2 k(Z_i, Z_i)\right]$$
$$= \|m_P\|_{\mathcal{H}}^2 - 2\|m_P\|_{\mathcal{H}}^2 + E\left[\frac{N-1}{N^2}\sum_i \frac{p(Z_i)}{q(Z_i)} \int \frac{p(x)}{q(x)} k(Z_i, x)\, q_i(x)\,dx\right] + \frac{1}{N}\int \left(\frac{p(x)}{q(x)}\right)^2 k(x, x)\, q(x)\,dx$$
$$= -\|m_P\|_{\mathcal{H}}^2 + E\left[\frac{N-1}{N^2}\sum_i \frac{p(Z_i)}{q(Z_i)} \int \frac{p(x)}{q(x)} k(Z_i, x)\, q_i(x)\,dx\right] + \frac{1}{N}\int \frac{p(x)}{q(x)} k(x, x)\,dP(x),$$

where we used lemma 1. We further rewrite the second term of the last equality as follows:

$$E\left[\frac{N-1}{N^2}\sum_i \frac{p(Z_i)}{q(Z_i)} \int \frac{p(x)}{q(x)} k(Z_i, x)\, q_i(x)\,dx\right]$$
$$= E\left[\frac{N-1}{N^2}\sum_i \frac{p(Z_i)}{q(Z_i)} \int \frac{p(x)}{q(x)} k(Z_i, x)\,(q_i(x) - q(x))\,dx\right] + E\left[\frac{N-1}{N^2}\sum_i \frac{p(Z_i)}{q(Z_i)} \int \frac{p(x)}{q(x)} k(Z_i, x)\, q(x)\,dx\right]$$
$$= E\left[\frac{N-1}{N^2}\sum_i \frac{p(Z_i)}{q(Z_i)} \int \sqrt{p(x)}\, k(Z_i, x)\, \sqrt{p(x)}\left(\frac{q_i(x)}{q(x)} - 1\right) dx\right] + \frac{N-1}{N}\|m_P\|_{\mathcal{H}}^2$$
$$\leq E\left[\frac{N-1}{N^2}\sum_i \frac{p(Z_i)}{q(Z_i)}\, \|k(Z_i, \cdot)\|_{L_2(P)} \left\|\frac{q_i}{q} - 1\right\|_{L_2(P)}\right] + \frac{N-1}{N}\|m_P\|_{\mathcal{H}}^2$$
$$\leq E\left[\frac{N-1}{N^3}\sum_i \frac{p(Z_i)}{q(Z_i)}\, C^2 A\right] + \frac{N-1}{N}\|m_P\|_{\mathcal{H}}^2 = \frac{C^2 A (N-1)}{N^2} + \frac{N-1}{N}\|m_P\|_{\mathcal{H}}^2,$$

where the first inequality follows from the Cauchy-Schwartz inequality and the second from $\|k(Z_i, \cdot)\|_{L_2(P)} \leq C^2$ and assumption 2. Using this, we obtain

$$E\left[\left\| m_P - \frac{1}{N}\sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \right\|_{\mathcal{H}}^2\right] \leq \frac{1}{N}\left(\int \frac{p(x)}{q(x)} k(x, x)\,dP(x) - \|m_P\|_{\mathcal{H}}^2\right) + \frac{C^2 (N-1) A}{N^2} = O(N^{-1}).$$

Therefore we have

$$\left\| m_P - \frac{1}{N}\sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \right\|_{\mathcal{H}} = O_p(N^{-1/2}) \quad (N \to \infty). \tag{B.11}$$

We can bound the third term as follows:

$$\left\| \frac{1}{N}\sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) - \frac{1}{S_N}\sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \right\|_{\mathcal{H}} = \left\| \frac{1}{N}\sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i)\left(1 - \frac{N}{S_N}\right) \right\|_{\mathcal{H}}$$
$$= \left|1 - \frac{N}{S_N}\right| \left\| \frac{1}{N}\sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \right\|_{\mathcal{H}} \leq \left|1 - \frac{N}{S_N}\right| C \left\|\frac{p}{q}\right\|_{\infty} = \left|1 - \frac{1}{\frac{1}{N}\sum_{i=1}^N p(Z_i)/q(Z_i)}\right| C \left\|\frac{p}{q}\right\|_{\infty},$$

where $\|p/q\|_{\infty} = \sup_{x \in \mathcal{X}} p(x)/q(x) < \infty$. Therefore the following holds by assumption 3 and the delta method:

$$\left\| \frac{1}{N}\sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) - \frac{1}{S_N}\sum_i \frac{p(Z_i)}{q(Z_i)} k(\cdot, Z_i) \right\|_{\mathcal{H}} = O_p(N^{-1/2}). \tag{B.12}$$

The assertion of the theorem follows from equations B.10 to B.12.
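To make the Frank-Wolfe view of kernel herding concrete, the following small numerical sketch (ours, for illustration, with a Gaussian kernel on one-dimensional data and arbitrary names) greedily selects, at step t, the candidate maximizing the embedding value minus the running kernel sum divided by t, a standard herding scoring matching the form used later in appendix C.2.2, and monitors the squared RKHS distance to the weighted estimate in closed form via kernel evaluations.

```python
import numpy as np

def gauss_K(A, B, s=1.0):
    """Gaussian kernel matrix k(a, b) = exp(-(a - b)^2 / (2 s^2)) for 1-d point sets."""
    return np.exp(-(A[:, None] - B[None, :]) ** 2 / (2 * s ** 2))

rng = np.random.default_rng(1)

# A weighted RKHS element m_hat = sum_i w_i k(., X_i), playing the role of the estimator.
X = rng.normal(size=200)
w = rng.dirichlet(np.ones(200))

# Candidate samples Z_1, ..., Z_N, drawn here from a broader proposal density q.
Z = rng.normal(scale=1.5, size=1000)

K_XX = gauss_K(X, X)
K_ZX = gauss_K(Z, X)
K_ZZ = gauss_K(Z, Z)
m_hat_at_Z = K_ZX @ w          # m_hat evaluated at every candidate

selected = []
herd_sum = np.zeros(len(Z))    # running sum_j k(Z, Xbar_j) over selected points
for t in range(1, 101):
    # Greedy herding step: maximize m_hat(z) - (1/t) * sum_{j < t} k(z, Xbar_j).
    j = int(np.argmax(m_hat_at_Z - herd_sum / t))
    selected.append(j)
    herd_sum += K_ZZ[:, j]
    if t in (10, 50, 100):
        idx = np.array(selected)
        # Squared RKHS distance || m_hat - (1/t) sum_j k(., Xbar_j) ||_H^2 in closed form.
        err2 = (w @ K_XX @ w
                - 2.0 * m_hat_at_Z[idx].mean()
                + K_ZZ[np.ix_(idx, idx)].sum() / t ** 2)
        print(t, err2)
```

The printed errors shrink as more points are selected, which is the behavior the logarithmic term in theorem 3 describes for the herding part of the bound.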

Appendix C: Reduction of Computational Cost

We have seen in section 4.3 that the time complexity of KMCF in one time step is O(n³), where n is the number of state-observation examples {(X_i, Y_i)}_{i=1}^n. This can be costly if one wishes to use KMCF in real-time applications with a large number of samples. Here we show two methods for reducing the costs: one based on low-rank approximation of kernel matrices and one based on kernel herding. Note that kernel herding is also used in the resampling step; the purpose here is different, however: we make use of kernel herding for finding a reduced representation of the data {(X_i, Y_i)}_{i=1}^n.

C.1 Low-Rank Approximation of Kernel Matrices. Our goal is to reduce the costs of algorithm 1 of kernel Bayes' rule. Algorithm 1 involves two matrix inversions: (G_X + nεI_n)⁻¹ in line 3 and ((ΛG_Y)² + δI_n)⁻¹ in line 4. Note that (G_X + nεI_n)⁻¹ does not involve the test data, so it can be computed before the test phase. On the other hand, ((ΛG_Y)² + δI_n)⁻¹ depends on the matrix Λ. This matrix involves the vector m_π, which essentially represents the prior of the current state (see line 13 of algorithm 3). Therefore ((ΛG_Y)² + δI_n)⁻¹ needs to be computed for each iteration in the test phase, which has a complexity of O(n³). Note that even if (G_X + nεI_n)⁻¹ can be computed in the training phase, the multiplication (G_X + nεI_n)⁻¹ m_π in line 3 requires O(n²), so it can also be costly. Here we consider methods to reduce both costs in lines 3 and 4.

Suppose that there exist low-rank matrices U, V ∈ R^{n×r}, where r < n, that approximate the kernel matrices: G_X ≈ UUᵀ, G_Y ≈ VVᵀ. Such low-rank matrices can be obtained by, for example, incomplete Cholesky decomposition with time complexity O(nr²) (Fine & Scheinberg, 2001; Bach & Jordan, 2002). Note that the computation of these matrices is required only once, before the test phase; therefore their time complexities are not the problem here.

C.1.1 Derivation. First, we approximate (G_X + nεI_n)⁻¹ m_π in line 3 using G_X ≈ UUᵀ. By the Woodbury identity, we have

$$(G_X + n\varepsilon I_n)^{-1} m_\pi \approx (UU^{\top} + n\varepsilon I_n)^{-1} m_\pi = \frac{1}{n\varepsilon}\left(I_n - U\,(n\varepsilon I_r + U^{\top}U)^{-1} U^{\top}\right) m_\pi,$$

where I_r ∈ R^{r×r} denotes the identity. Note that (nεI_r + UᵀU)⁻¹ does not involve the test data, so it can be computed in the training phase. Thus the above approximation of μ (the vector defining Λ = diag(μ)) can be computed with complexity O(nr²).

Next, we approximate w = ΛG_Y((ΛG_Y)² + δI_n)⁻¹Λk_Y in line 4 using G_Y ≈ VVᵀ. Define B = ΛV ∈ R^{n×r}, C = VᵀΛV ∈ R^{r×r}, and D = Vᵀ ∈ R^{r×n}. Then (ΛG_Y)² ≈ (ΛVVᵀ)² = BCD. By the Woodbury identity, we obtain

$$(\delta I_n + (\Lambda G_Y)^2)^{-1} \approx (\delta I_n + BCD)^{-1} = \frac{1}{\delta}\left(I_n - B\,(\delta C^{-1} + DB)^{-1} D\right).$$

Thus w can be approximated as

$$w = \Lambda G_Y\left((\Lambda G_Y)^2 + \delta I_n\right)^{-1}\Lambda k_Y \approx \frac{1}{\delta}\,\Lambda V V^{\top}\left(I_n - B\,(\delta C^{-1} + DB)^{-1} D\right)\Lambda k_Y.$$

The computation of this approximation requires O(nr² + r³) = O(nr²). Thus, in total, the complexity of algorithm 1 can be reduced to O(nr²). We summarize the above approximations in algorithm 5.
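To make the two Woodbury-based approximations concrete, the following minimal NumPy sketch computes the approximate weight vector from the low-rank factors. It is an illustration under the assumptions stated here, not a drop-in implementation of algorithm 5: the function and argument names are ours, Λ = diag(μ) follows the kernel Bayes' rule weight form written above, and C = VᵀΛV is assumed invertible.

```python
import numpy as np

def low_rank_kbr_weights(U, V, m_pi, k_y, eps, delta):
    """Approximate KBR weights using G_X ~ U U^T and G_Y ~ V V^T (Woodbury), as in C.1.1.

    U, V: (n, r) low-rank factors; m_pi: prior vector; k_y: kernel vector evaluated at y_t;
    eps, delta: regularization constants. C = V^T Lam V is assumed invertible.
    """
    n, r = U.shape
    I_r = np.eye(r)

    # Line 3: mu ~ (G_X + n*eps*I_n)^{-1} m_pi
    #        = (1/(n*eps)) (I_n - U (n*eps*I_r + U^T U)^{-1} U^T) m_pi
    mu = (m_pi - U @ np.linalg.solve(n * eps * I_r + U.T @ U, U.T @ m_pi)) / (n * eps)

    # Line 4: w = Lam G_Y ((Lam G_Y)^2 + delta I_n)^{-1} Lam k_y, with Lam = diag(mu).
    Lam_ky = mu * k_y                  # Lam k_y
    B = mu[:, None] * V                # Lam V          (n, r)
    C = V.T @ B                        # V^T Lam V      (r, r)
    D = V.T                            # V^T            (r, n)
    # (delta I_n + B C D)^{-1} x = (1/delta) (x - B (delta C^{-1} + D B)^{-1} D x)
    inv_x = (Lam_ky - B @ np.linalg.solve(delta * np.linalg.inv(C) + D @ B, D @ Lam_ky)) / delta
    # Lam G_Y ~ Lam V V^T
    return mu * (V @ (V.T @ inv_x))
```

All operations are products of n × r and r × r matrices with vectors, so the cost is O(nr²), matching the count above.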

C.1.2 How to Use. Algorithm 5 can be used with algorithm 3 of KMCF by modifying algorithm 3 in the following manner: compute the low-rank matrices U, V right after lines 4 and 5. This can be done by using, for example, incomplete Cholesky decomposition (Fine & Scheinberg, 2001; Bach & Jordan, 2002). Then replace algorithm 1 in line 15 by algorithm 5.

C.1.3 How to Select the Rank. As discussed in section 4.3, one way of selecting the rank r is to use cross-validation, regarding r as a hyperparameter of KMCF. Another way is to measure the approximation errors ‖G_X − UUᵀ‖ and ‖G_Y − VVᵀ‖ with some matrix norm, such as the Frobenius norm. Indeed, we can compute the smallest rank r such that these errors are below a prespecified threshold, and this can be done efficiently with time complexity O(nr²) (Bach & Jordan, 2002).

C.2 Data Reduction with Kernel Herding. Here we describe an approach to reduce the size of the representation of the state-observation examples {(X_i, Y_i)}_{i=1}^n in an efficient way. By "efficient" we mean that the information contained in {(X_i, Y_i)}_{i=1}^n will be preserved even after the reduction. Recall that {(X_i, Y_i)}_{i=1}^n contains the information of the observation model p(y_t | x_t) (recall also that p(y_t | x_t) is assumed time-homogeneous; see section 4.1). This information is used only in algorithm 1 of kernel Bayes' rule (line 15, algorithm 3). Therefore it suffices to consider how kernel Bayes' rule accesses the information contained in the joint sample {(X_i, Y_i)}_{i=1}^n.

C.2.1 Representation of the Joint Sample. To this end, we need to show how the joint sample {(X_i, Y_i)}_{i=1}^n can be represented with a kernel mean embedding. Recall that (k_X, H_X) and (k_Y, H_Y) are the kernels and the associated RKHSs on the state space X and the observation space Y, respectively. Let X × Y be the product space of X and Y. Then we can define a kernel k_{X×Y} on X × Y as the product of k_X and k_Y: k_{X×Y}((x, y), (x′, y′)) = k_X(x, x′) k_Y(y, y′) for all (x, y), (x′, y′) ∈ X × Y. This product kernel k_{X×Y} defines an RKHS of functions on X × Y; let H_{X×Y} denote this RKHS. As in section 3, we can use k_{X×Y} and H_{X×Y} for a kernel mean embedding. In particular, the empirical distribution (1/n) Σ_{i=1}^n δ_{(X_i, Y_i)} of the joint sample {(X_i, Y_i)}_{i=1}^n ⊂ X × Y can be represented as an empirical kernel mean in H_{X×Y}:

$$\hat{m}_{XY} = \frac{1}{n}\sum_{i=1}^n k_{\mathcal{X}\times\mathcal{Y}}\big((\cdot,\cdot), (X_i, Y_i)\big) \in \mathcal{H}_{\mathcal{X}\times\mathcal{Y}}. \tag{C.1}$$

This is the representation of the joint sample {(X_i, Y_i)}_{i=1}^n.

The information of {(X_i, Y_i)}_{i=1}^n is provided for kernel Bayes' rule essentially through the form of equation C.1 (Fukumizu et al., 2011, 2013). Recall that equation C.1 is a point in the RKHS H_{X×Y}. Any point close to equation C.1 in H_{X×Y} would also contain information close to that contained in equation C.1. Therefore we propose to find a subset {(X̄_1, Ȳ_1), ..., (X̄_r, Ȳ_r)} ⊂ {(X_i, Y_i)}_{i=1}^n, where r < n, such that its representation in H_{X×Y},

$$\bar{m}_{XY} = \frac{1}{r}\sum_{i=1}^r k_{\mathcal{X}\times\mathcal{Y}}\big((\cdot,\cdot), (\bar{X}_i, \bar{Y}_i)\big) \in \mathcal{H}_{\mathcal{X}\times\mathcal{Y}}, \tag{C.2}$$

is close to equation C.1. Namely, we wish to find subsamples such that ‖m̂_{XY} − m̄_{XY}‖_{H_{X×Y}} is small. If the error ‖m̂_{XY} − m̄_{XY}‖_{H_{X×Y}} is small enough, equation C.2 would provide information close to that given by equation C.1 for kernel Bayes' rule. Thus, kernel Bayes' rule based on such subsamples {(X̄_i, Ȳ_i)}_{i=1}^r would not perform much worse than the one based on the entire set of samples {(X_i, Y_i)}_{i=1}^n.

C.2.2 Subsampling Method. To find such subsamples, we make use of kernel herding (see section 3.5). Namely, we apply the update equations 3.6 and 3.7 to approximate equation C.1, with kernel k_{X×Y} and RKHS H_{X×Y}. We greedily find subsamples D_r = {(X̄_1, Ȳ_1), ..., (X̄_r, Ȳ_r)} as

$$(\bar{X}_r, \bar{Y}_r) = \mathop{\arg\max}_{(x,y)\, \in\, \mathcal{D} \setminus \mathcal{D}_{r-1}} \; \frac{1}{n}\sum_{i=1}^n k_{\mathcal{X}\times\mathcal{Y}}\big((x,y),(X_i,Y_i)\big) - \frac{1}{r}\sum_{j=1}^{r-1} k_{\mathcal{X}\times\mathcal{Y}}\big((x,y),(\bar{X}_j,\bar{Y}_j)\big)$$
$$= \mathop{\arg\max}_{(x,y)\, \in\, \mathcal{D} \setminus \mathcal{D}_{r-1}} \; \frac{1}{n}\sum_{i=1}^n k_{\mathcal{X}}(x,X_i)\,k_{\mathcal{Y}}(y,Y_i) - \frac{1}{r}\sum_{j=1}^{r-1} k_{\mathcal{X}}(x,\bar{X}_j)\,k_{\mathcal{Y}}(y,\bar{Y}_j),$$

where D = {(X_i, Y_i)}_{i=1}^n denotes the entire set of state-observation examples. The resulting algorithm is shown in algorithm 6. The time complexity is O(n²r) for selecting r subsamples.
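As an illustration of this selection rule, here is a minimal sketch (ours, not the letter's algorithm 6 verbatim) that operates on precomputed kernel matrices K_X and K_Y; the function name and the matrix-based interface are assumptions made for clarity.

```python
import numpy as np

def herding_subsample(K_X, K_Y, r):
    """Greedy subsampling of (X_i, Y_i) pairs following the rule in C.2.2.

    K_X, K_Y: (n, n) kernel matrices on states and observations; r: number of pairs to keep.
    Returns the indices of the selected pairs.
    """
    n = K_X.shape[0]
    K_joint = K_X * K_Y                 # product kernel k_X(x, x') k_Y(y, y') on the pairs
    target = K_joint.mean(axis=1)       # (1/n) sum_i k_joint((x, y), (X_i, Y_i))
    herd_sum = np.zeros(n)              # sum over already selected pairs
    available = np.ones(n, dtype=bool)  # candidates are data points, taken without replacement
    selected = []
    for t in range(1, r + 1):
        scores = target - herd_sum / t
        scores[~available] = -np.inf
        j = int(np.argmax(scores))
        selected.append(j)
        available[j] = False
        herd_sum += K_joint[:, j]
    return selected
```

With the matrices precomputed, the loop itself costs O(nr); the O(n²r) figure above presumably counts the kernel evaluations of the first term when they are performed on the fly rather than stored.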

C.2.3 How to Use. By using algorithm 6, we can reduce the time complexity of KMCF (see algorithm 3) in each iteration from O(n³) to O(r³). This can be done by obtaining subsamples {(X̄_i, Ȳ_i)}_{i=1}^r with algorithm 6 applied to {(X_i, Y_i)}_{i=1}^n, then replacing {(X_i, Y_i)}_{i=1}^n in the requirements of algorithm 3 by {(X̄_i, Ȳ_i)}_{i=1}^r and using the number r instead of n.

C.2.4 How to Select the Number of Subsamples. The number r of subsamples determines the trade-off between the accuracy and the computational time of KMCF. It may be selected by cross-validation or by measuring the approximation error ‖m̂_{XY} − m̄_{XY}‖_{H_{X×Y}}, as for the case of selecting the rank of the low-rank approximation in section C.1.

C.2.5 Discussion. Recall that kernel herding generates samples such that they approximate a given kernel mean (see section 3.5). Under certain assumptions, the error of this approximation is of O(r⁻¹) with r samples, which is faster than that of i.i.d. samples, O(r⁻¹ᐟ²). This indicates that subsamples {(X̄_i, Ȳ_i)}_{i=1}^r selected with kernel herding may approximate equation C.1 well. Here, however, we find the solutions of the optimization problems 3.6 and 3.7 within the finite set {(X_i, Y_i)}_{i=1}^n rather than the entire joint space X × Y. The convergence guarantee is provided only for the case of the entire joint space X × Y; thus, for our case, the convergence guarantee is no longer provided. Moreover, the fast rate O(r⁻¹) is guaranteed only for finite-dimensional RKHSs. Gaussian kernels, which we often use in practice, define infinite-dimensional RKHSs, so the fast rate is not guaranteed if we use gaussian kernels. Nevertheless, we can use algorithm 6 as a heuristic for data reduction.

Acknowledgments

We express our gratitude to the associate editor and the anonymous reviewer for their time and helpful suggestions. We also thank Masashi Shimbo, Momoko Hayamizu, Yoshimasa Uematsu, and Katsuhiro Omae for their helpful comments. This work has been supported in part by MEXT Grant-in-Aid for Scientific Research on Innovative Areas 25120012. M.K. has been supported by JSPS Grant-in-Aid for JSPS Fellows 15J04406.

References

Anderson, B., & Moore, J. (1979). Optimal filtering. Englewood Cliffs, NJ: Prentice Hall.
Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3), 337–404.
Bach, F., & Jordan, M. I. (2002). Kernel independent component analysis. Journal of Machine Learning Research, 3, 1–48.
Bach, F., Lacoste-Julien, S., & Obozinski, G. (2012). On the equivalence between herding and conditional gradient algorithms. In Proceedings of the 29th International Conference on Machine Learning (ICML 2012) (pp. 1359–1366). Madison, WI: Omnipress.
Berlinet, A., & Thomas-Agnan, C. (2004). Reproducing kernel Hilbert spaces in probability and statistics. Dordrecht: Kluwer Academic.
Calvet, L. E., & Czellar, V. (2015). Accurate methods for approximate Bayesian computation filtering. Journal of Financial Econometrics, 13, 798–838. doi:10.1093/jjfinec/nbu019
Cappé, O., Godsill, S. J., & Moulines, E. (2007). An overview of existing methods and recent advances in sequential Monte Carlo. IEEE Proceedings, 95(5), 899–924.
Chen, Y., Welling, M., & Smola, A. (2010). Super-samples from kernel herding. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence (pp. 109–116). Cambridge, MA: MIT Press.
Deisenroth, M., Huber, M., & Hanebeck, U. (2009). Analytic moment-based gaussian process filtering. In Proceedings of the 26th International Conference on Machine Learning (pp. 225–232). Madison, WI: Omnipress.
Doucet, A., Freitas, N. D., & Gordon, N. J. (Eds.). (2001). Sequential Monte Carlo methods in practice. New York: Springer.
Doucet, A., & Johansen, A. M. (2011). A tutorial on particle filtering and smoothing: Fifteen years later. In D. Crisan & B. Rozovskii (Eds.), The Oxford handbook of nonlinear filtering (pp. 656–704). New York: Oxford University Press.
Durbin, J., & Koopman, S. J. (2012). Time series analysis by state space methods (2nd ed.). New York: Oxford University Press.
Eberts, M., & Steinwart, I. (2013). Optimal regression rates for SVMs using gaussian kernels. Electronic Journal of Statistics, 7, 1–42.
Ferris, B., Hähnel, D., & Fox, D. (2006). Gaussian processes for signal strength-based location estimation. In Proceedings of Robotics: Science and Systems. Cambridge, MA: MIT Press.
Fine, S., & Scheinberg, K. (2001). Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2, 243–264.
Freund, R. M., & Grigas, P. (2014). New analysis and results for the Frank–Wolfe method. Mathematical Programming. doi:10.1007/s10107-014-0841-6
Fukumizu, K., Bach, F., & Jordan, M. I. (2004). Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5, 73–99.
Fukumizu, K., Gretton, A., Sun, X., & Schölkopf, B. (2008). Kernel measures of conditional dependence. In J. C. Platt, D. Koller, Y. Singer, & S. Roweis (Eds.), Advances in neural information processing systems, 20 (pp. 489–496). Cambridge, MA: MIT Press.
Fukumizu, K., Song, L., & Gretton, A. (2011). Kernel Bayes' rule. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. C. N. Pereira, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 24 (pp. 1737–1745). Red Hook, NY: Curran.
Fukumizu, K., Song, L., & Gretton, A. (2013). Kernel Bayes' rule: Bayesian inference with positive definite kernels. Journal of Machine Learning Research, 14, 3753–3783.
Fukumizu, K., Sriperumbudur, B., Gretton, A., & Schölkopf, B. (2009). Characteristic kernels on groups and semigroups. In D. Koller, D. Schuurmans, Y. Bengio, & L. Bottou (Eds.), Advances in neural information processing systems, 21 (pp. 473–480). Cambridge, MA: MIT Press.
Gordon, N. J., Salmond, D. J., & Smith, A. F. M. (1993). Novel approach to nonlinear/non-gaussian Bayesian state estimation. IEE Proceedings F, 140, 107–113.
Hofmann, T., Schölkopf, B., & Smola, A. J. (2008). Kernel methods in machine learning. Annals of Statistics, 36(3), 1171–1220.
Jaggi, M. (2013). Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning (pp. 427–435).
Jasra, A., Singh, S. S., Martin, J. S., & McCoy, E. (2012). Filtering via approximate Bayesian computation. Statistics and Computing, 22, 1223–1237.
Julier, S. J., & Uhlmann, J. K. (1997). A new extension of the Kalman filter to nonlinear systems. In Proceedings of AeroSense: The 11th International Symposium on Aerospace/Defence Sensing, Simulation and Controls. Bellingham, WA: SPIE.
Julier, S. J., & Uhlmann, J. K. (2004). Unscented filtering and nonlinear estimation. IEEE Review, 92, 401–422.
Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Transactions of the ASME, Journal of Basic Engineering, 82, 35–45.
Kanagawa, M., & Fukumizu, K. (2014). Recovering distributions from gaussian RKHS embeddings. In Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (pp. 457–465). JMLR.
Kanagawa, M., Nishiyama, Y., Gretton, A., & Fukumizu, K. (2014). Monte Carlo filtering using kernel embedding of distributions. In Proceedings of the 28th AAAI Conference on Artificial Intelligence (pp. 1897–1903). Cambridge, MA: MIT Press.
Ko, J., & Fox, D. (2009). GP-BayesFilters: Bayesian filtering using gaussian process prediction and observation models. Autonomous Robots, 72(1), 75–90.
Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Vol. 2, pp. 2169–2178). Washington, DC: IEEE Computer Society.
Liu, J. S. (2001). Monte Carlo strategies in scientific computing. New York: Springer-Verlag.
McCalman, L., O'Callaghan, S., & Ramos, F. (2013). Multi-modal estimation with kernel embeddings for learning motion models. In Proceedings of the 2013 IEEE International Conference on Robotics and Automation (pp. 2845–2852). Piscataway, NJ: IEEE.
Pan, S. J., & Yang, Q. (2010). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10), 1345–1359.
Pistohl, T., Ball, T., Schulze-Bonhage, A., Aertsen, A., & Mehring, C. (2008). Prediction of arm movement trajectories from ECoG-recordings in humans. Journal of Neuroscience Methods, 167(1), 105–114.
Pronobis, A., & Caputo, B. (2009). COLD: COsy Localization Database. International Journal of Robotics Research, 28(5), 588–594.
Quigley, M., Stavens, D., Coates, A., & Thrun, S. (2010). Sub-meter indoor localization in unmodified environments with inexpensive sensors. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 2010 (Vol. 1, pp. 2039–2046). Piscataway, NJ: IEEE.
Ristic, B., Arulampalam, S., & Gordon, N. (2004). Beyond the Kalman filter: Particle filters for tracking applications. Norwood, MA: Artech House.
Schaid, D. J. (2010a). Genomic similarity and kernel methods I: Advancements by building on mathematical and statistical foundations. Human Heredity, 70(2), 109–131.
Schaid, D. J. (2010b). Genomic similarity and kernel methods II: Methods for genomic information. Human Heredity, 70(2), 132–140.
Schalk, G., Kubanek, J., Miller, K. J., Anderson, N. R., Leuthardt, E. C., Ojemann, J. G., ... Wolpaw, J. R. (2007). Decoding two-dimensional movement trajectories using electrocorticographic signals in humans. Journal of Neural Engineering, 4(264), 264–275.
Schölkopf, B., & Smola, A. J. (2002). Learning with kernels. Cambridge, MA: MIT Press.
Schölkopf, B., Tsuda, K., & Vert, J. P. (2004). Kernel methods in computational biology. Cambridge, MA: MIT Press.
Silverman, B. W. (1986). Density estimation for statistics and data analysis. London: Chapman and Hall.
Smola, A., Gretton, A., Song, L., & Schölkopf, B. (2007). A Hilbert space embedding for distributions. In Proceedings of the International Conference on Algorithmic Learning Theory (pp. 13–31). New York: Springer.
Song, L., Fukumizu, K., & Gretton, A. (2013). Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine, 30(4), 98–111.
Song, L., Huang, J., Smola, A., & Fukumizu, K. (2009). Hilbert space embeddings of conditional distributions with applications to dynamical systems. In Proceedings of the 26th International Conference on Machine Learning (pp. 961–968). Madison, WI: Omnipress.
Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Schölkopf, B., & Lanckriet, G. R. (2010). Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11, 1517–1561.
Steinwart, I., & Christmann, A. (2008). Support vector machines. New York: Springer.
Stone, C. J. (1977). Consistent nonparametric regression. Annals of Statistics, 5(4), 595–620.
Thrun, S., Burgard, W., & Fox, D. (2005). Probabilistic robotics. Cambridge, MA: MIT Press.
Vlassis, N., Terwijn, B., & Kröse, B. (2002). Auxiliary particle filter robot localization from high-dimensional sensor observations. In Proceedings of the International Conference on Robotics and Automation (pp. 7–12). Piscataway, NJ: IEEE.
Wang, Z., Ji, Q., Miller, K. J., & Schalk, G. (2011). Prior knowledge improves decoding of finger flexion from electrocorticographic signals. Frontiers in Neuroscience, 5, 127.
Widom, H. (1963). Asymptotic behavior of the eigenvalues of certain integral equations. Transactions of the American Mathematical Society, 109, 278–295.
Widom, H. (1964). Asymptotic behavior of the eigenvalues of certain integral equations II. Archive for Rational Mechanics and Analysis, 17, 215–229.
Wolf, J., Burgard, W., & Burkhardt, H. (2005). Robust vision-based localization by combining an image retrieval system with Monte Carlo localization. IEEE Transactions on Robotics, 21(2), 208–216.

Received May 18, 2015; accepted October 14, 2015.
